Welcome to the Sklearn workbook. Let's start by importing the libraries. We are not importing sklearn yet; do you know why?

In [1]:
# Import the standard numerical, plotting, and data-analysis libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
%matplotlib notebook

Now, let's go ahead and read in the Citi Bike data set. I modified this data set for the class exercise.

In [3]:
citibike_df = pd.read_csv('./citibike_simplified_for_phys472.csv')

In [6]:
citibike_df.head()

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,usertype,birth year
0,3117,301,40.722174,-73.983688,301,40.722174,-73.983688,18070,Subscriber,1986.0
1,690,301,40.722174,-73.983688,349,40.718502,-73.983299,19699,Subscriber,1985.0
2,727,301,40.722174,-73.983688,2010,40.721655,-74.002347,20953,Subscriber,1982.0
3,698,301,40.722174,-73.983688,527,40.744023,-73.976056,23566,Subscriber,1976.0
4,351,301,40.722174,-73.983688,250,40.724561,-73.995653,17545,Subscriber,1959.0
5,597,301,40.722174,-73.983688,497,40.73705,-73.990093,17435,Subscriber,1979.0
6,1248,301,40.722174,-73.983688,505,40.749013,-73.988484,18236,Subscriber,1987.0
7,417,301,40.722174,-73.983688,268,40.719105,-73.999733,22454,Subscriber,1991.0
8,454,301,40.722174,-73.983688,128,40.727103,-74.002971,17910,Subscriber,1990.0
9,409,301,40.722174,-73.983688,439,40.726281,-73.98978,17127,Subscriber,1981.0


In [5]:
citibike_df.describe()

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,birth year
count,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,3682.0
mean,843.398148,301.0,40.722174,-73.983688,425.393637,40.728414,-73.989654,20017.078822,1980.166214
std,2201.216509,0.0,0.0,0.0,386.641786,0.012544,0.011052,2882.760288,10.721339
min,60.0,301.0,40.722174,-73.983688,79.0,40.680342,-74.017134,14538.0,1940.0
25%,379.0,301.0,40.722174,-73.983688,285.0,40.721533,-73.996826,17478.0,1974.0
50%,627.0,301.0,40.722174,-73.983688,361.0,40.727434,-73.990214,20529.0,1983.0
75%,1028.0,301.0,40.722174,-73.983688,445.0,40.734927,-73.982154,22616.0,1988.0
max,126180.0,301.0,40.722174,-73.983688,3223.0,40.770513,-73.941,24353.0,1999.0


Let's look a little closer. What might be a problem?

In [8]:
citibike_df.head(20)

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,usertype,birth year
0,3117,301,40.722174,-73.983688,301,40.722174,-73.983688,18070,Subscriber,1986.0
1,690,301,40.722174,-73.983688,349,40.718502,-73.983299,19699,Subscriber,1985.0
2,727,301,40.722174,-73.983688,2010,40.721655,-74.002347,20953,Subscriber,1982.0
3,698,301,40.722174,-73.983688,527,40.744023,-73.976056,23566,Subscriber,1976.0
4,351,301,40.722174,-73.983688,250,40.724561,-73.995653,17545,Subscriber,1959.0
5,597,301,40.722174,-73.983688,497,40.73705,-73.990093,17435,Subscriber,1979.0
6,1248,301,40.722174,-73.983688,505,40.749013,-73.988484,18236,Subscriber,1987.0
7,417,301,40.722174,-73.983688,268,40.719105,-73.999733,22454,Subscriber,1991.0
8,454,301,40.722174,-73.983688,128,40.727103,-74.002971,17910,Subscriber,1990.0
9,409,301,40.722174,-73.983688,439,40.726281,-73.98978,17127,Subscriber,1981.0


Aha! #nevereasy #alwaysmessy #seldomsuccessful.

## 1. Sklearn for Data Cleaning
We have a NaN situation. As a rule of thumb, when we find NaN values we should ask whether they are significant. <br> Let's explore how we can best use sklearn for data cleaning, and recall what we learned in the Pandas lecture. How many of these NaNs are there?

In [13]:
for key in citibike_df.keys():
    print('Key: ', key, 'Amount of NaNs: ', citibike_df[key].isnull().sum())

Key:  tripduration Amount of NaNs:  0
Key:  start station id Amount of NaNs:  0
Key:  start station latitude Amount of NaNs:  0
Key:  start station longitude Amount of NaNs:  0
Key:  end station id Amount of NaNs:  0
Key:  end station latitude Amount of NaNs:  0
Key:  end station longitude Amount of NaNs:  0
Key:  bikeid Amount of NaNs:  0
Key:  usertype Amount of NaNs:  0
Key:  birth year Amount of NaNs:  530


In [19]:
no_of_nans = citibike_df.isnull().sum().max()
print(no_of_nans)

530


In [22]:
# Reuse the count computed above instead of hard-coding 530
percentage_of_nans = no_of_nans * 100. / citibike_df.shape[0]
print('{0:.1f}% of the birth year values are NaN.'.format(percentage_of_nans))

12.6% of the birth year values are NaN.
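
The per-column loop and the percentage calculation above can also be written as two vectorised pandas calls. Here is a minimal sketch on a tiny made-up frame (`toy_df` and its values are hypothetical stand-ins, not the class data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for citibike_df
toy_df = pd.DataFrame({
    "tripduration": [3117, 690, 727, 698],
    "birth year": [1986.0, np.nan, 1982.0, np.nan],
})

# Number of NaNs in every column, in one call
nan_counts = toy_df.isnull().sum()

# Percentage of NaNs per column: the mean of a boolean mask
# is the fraction of True values
nan_percentages = toy_df.isnull().mean() * 100

print(nan_counts)
print(nan_percentages)
```

Both results are pandas Series indexed by column name, so a single column can be looked up with `nan_counts["birth year"]`.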


Ideally, we would drop the birth year data because it is difficult to guess the missing values. But for argument's sake, let's use SimpleImputer to replace the NaN values with a default birth year of 1962.

In [41]:
from sklearn.impute import SimpleImputer

In [42]:
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=1962.)

In [43]:
imputer.fit(citibike_df["birth year"].values.reshape(-1,1))

SimpleImputer(fill_value=1962.0, strategy='constant')

In [44]:
imputer.statistics_

array([1962.])

In [45]:
citibike_df["birth year"] = imputer.transform(citibike_df["birth year"].values.reshape(-1,1))

In [46]:
for key in citibike_df.keys():
    print('Key: ', key, 'Amount of NaNs: ', citibike_df[key].isnull().sum())

Key:  tripduration Amount of NaNs:  0
Key:  start station id Amount of NaNs:  0
Key:  start station latitude Amount of NaNs:  0
Key:  start station longitude Amount of NaNs:  0
Key:  end station id Amount of NaNs:  0
Key:  end station latitude Amount of NaNs:  0
Key:  end station longitude Amount of NaNs:  0
Key:  bikeid Amount of NaNs:  0
Key:  usertype Amount of NaNs:  0
Key:  birth year Amount of NaNs:  0
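
For comparison, here are two other ways the missing birth years could be handled: dropping the affected rows, or filling them with the median of the observed values. This is only a sketch on a made-up frame (`toy_df` is hypothetical), not a recommendation for the class data set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with two missing birth years
toy_df = pd.DataFrame({
    "tripduration": [351, 597, 1248, 417, 454],
    "birth year": [1959.0, np.nan, 1987.0, 1991.0, np.nan],
})

# Option 1: drop every row whose birth year is missing
dropped_df = toy_df.dropna(subset=["birth year"])

# Option 2: replace missing values with the median of the observed ones.
# Selecting with double brackets keeps the column 2-D, so no reshape is needed.
median_imputer = SimpleImputer(strategy="median")
filled = median_imputer.fit_transform(toy_df[["birth year"]])

print(dropped_df.shape)
print(median_imputer.statistics_)
print(filled.ravel())
```

Dropping loses rows but invents nothing. Any imputation strategy adds values that were never measured, so the choice depends on how the column will be used.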


## 2. Sklearn for Data Transforming

### 2.1 Handling Text and Categories
Now we need to deal with the categorical entries, namely 'usertype'. Let's start by finding out how many user types there are.

In [47]:
print( citibike_df.usertype.unique() )

['Subscriber' 'Customer']


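There are two. So let's use the LabelEncoder class to turn these values into integers. It sorts the labels alphabetically, so 'Customer' becomes 0 and 'Subscriber' becomes 1. (The scikit-learn documentation describes LabelEncoder as meant for target labels; it works here because we encode a single column.)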

In [48]:
from sklearn.preprocessing import LabelEncoder

In [49]:
encoder = LabelEncoder()

In [51]:
encoded_usertypes = encoder.fit_transform(citibike_df.usertype)

In [52]:
citibike_df["usertype"] = encoded_usertypes

In [53]:
citibike_df.head(20)

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,usertype,birth year
0,3117,301,40.722174,-73.983688,301,40.722174,-73.983688,18070,1,1986.0
1,690,301,40.722174,-73.983688,349,40.718502,-73.983299,19699,1,1985.0
2,727,301,40.722174,-73.983688,2010,40.721655,-74.002347,20953,1,1982.0
3,698,301,40.722174,-73.983688,527,40.744023,-73.976056,23566,1,1976.0
4,351,301,40.722174,-73.983688,250,40.724561,-73.995653,17545,1,1959.0
5,597,301,40.722174,-73.983688,497,40.73705,-73.990093,17435,1,1979.0
6,1248,301,40.722174,-73.983688,505,40.749013,-73.988484,18236,1,1987.0
7,417,301,40.722174,-73.983688,268,40.719105,-73.999733,22454,1,1991.0
8,454,301,40.722174,-73.983688,128,40.727103,-74.002971,17910,1,1990.0
9,409,301,40.722174,-73.983688,439,40.726281,-73.98978,17127,1,1981.0
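
For feature columns, scikit-learn also provides OrdinalEncoder and OneHotEncoder, which accept 2-D input. The sketch below applies both to a made-up column (`toy_df` is hypothetical) to show how their outputs differ:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column standing in for citibike_df["usertype"]
toy_df = pd.DataFrame(
    {"usertype": ["Subscriber", "Customer", "Subscriber", "Subscriber"]}
)

# OrdinalEncoder maps each category to an integer code (sorted alphabetically)
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(toy_df[["usertype"]])

# OneHotEncoder creates one 0/1 column per category; the default output
# is a sparse matrix, so convert it to a dense array for printing
onehot = OneHotEncoder()
onehot_matrix = onehot.fit_transform(toy_df[["usertype"]]).toarray()

print(ordinal.categories_)
print(codes.ravel())
print(onehot_matrix)
```

With only two categories the integer code is harmless. With three or more, one-hot encoding avoids implying an order between categories that has no meaning.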


### 2.2 Data Scaling
As we have learned, many ML algorithms perform better when the features have similar ranges. For example, bikeid and start station id are on very different scales. Let's use StandardScaler to scale our input data.

In [54]:
from sklearn.preprocessing import StandardScaler

In [56]:
scaler = StandardScaler()

In [58]:
print(scaler.fit(citibike_df))

StandardScaler()


In [59]:
print(scaler.mean_)

[ 8.43398148e+02  3.01000000e+02  4.07221744e+01 -7.39836878e+01
  4.25393637e+02  4.07284142e+01 -7.39896537e+01  2.00170788e+04
  8.74169041e-01  1.97788034e+03]


In [61]:
X = scaler.transform(citibike_df)

In [63]:
print(X[:20])

[[ 1.03300688  0.          0.          0.         -0.32176658 -0.49751108
   0.53985333 -0.67550188  0.37939888  0.69432694]
 [-0.06969617  0.          0.          0.         -0.19760592 -0.79031253
   0.57507167 -0.11035138  0.37939888  0.60881509]
 [-0.05288529  0.          0.          0.          4.09887022 -0.53894211
  -1.14863476  0.32470001  0.37939888  0.35227956]
 [-0.06606138  0.          0.          0.          0.26282319  1.24451375
   1.23044693  1.23123055  0.37939888 -0.1607915 ]
 [-0.22372021  0.          0.          0.         -0.45368728 -0.30723515
  -0.5428611  -0.85764062  0.37939888 -1.61449284]
 [-0.11195055  0.          0.          0.          0.18522278  0.68853121
  -0.03974457 -0.89580303  0.37939888  0.09574403]
 [ 0.18383012  0.          0.          0.          0.20591622  1.64235227
   0.10585326 -0.61791135  0.37939888  0.77983878]
 [-0.19373323  0.          0.          0.         -0.40712704 -0.74221353
  -0.91209633  0.84544335  0.37939888  1.12188615]


Hey, why are we using NumPy notation all of a sudden?
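
One way to see the answer: sklearn transformers take a DataFrame as input but, by default, hand back a plain NumPy array, so the column labels are gone. The sketch below uses a made-up frame (`toy_df` is hypothetical) and shows how the labels can be put back by hand:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the second column is constant, like start station id
toy_df = pd.DataFrame({
    "tripduration": [351.0, 597.0, 1248.0, 417.0],
    "start station id": [301.0, 301.0, 301.0, 301.0],
})

scaler = StandardScaler()
scaled_array = scaler.fit_transform(toy_df)

# The result is a NumPy array, not a DataFrame
print(isinstance(scaled_array, np.ndarray))

# Re-attach the original column names and index
scaled_df = pd.DataFrame(scaled_array, columns=toy_df.columns, index=toy_df.index)
print(scaled_df)
```

The constant column comes out as all zeros, which matches the zero columns in the scaled output above.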

## 3. Sklearn for Data Splitting
There are a few different ways we can split our cleaned, encoded, and transformed data (in other words: ✨MACHINE LEARNING READY✨ data). The most commonly used one is train_test_split. But for that, let's rewind to one step before the scaling happened. Let's say we are trying to predict birth year, again for argument's sake. If this were a real-life problem, we would never have used an imputer to fill NaN values in the target.

In [64]:
y = citibike_df[["birth year"]]

In [65]:
y.head()

Unnamed: 0,birth year
0,1986.0
1,1985.0
2,1982.0
3,1976.0
4,1959.0


Now we need to drop the birth year column from the data before we scale the data set. The ML algorithm should NEVER, EVER, NEVER, EVER see the data it is trying to predict.

In [68]:
X = citibike_df.drop(["birth year"], axis=1)

In [69]:
X.describe()

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,usertype
count,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0,4212.0
mean,843.398148,301.0,40.722174,-73.983688,425.393637,40.728414,-73.989654,20017.078822,0.874169
std,2201.216509,0.0,0.0,0.0,386.641786,0.012544,0.011052,2882.760288,0.331698
min,60.0,301.0,40.722174,-73.983688,79.0,40.680342,-74.017134,14538.0,0.0
25%,379.0,301.0,40.722174,-73.983688,285.0,40.721533,-73.996826,17478.0,1.0
50%,627.0,301.0,40.722174,-73.983688,361.0,40.727434,-73.990214,20529.0,1.0
75%,1028.0,301.0,40.722174,-73.983688,445.0,40.734927,-73.982154,22616.0,1.0
max,126180.0,301.0,40.722174,-73.983688,3223.0,40.770513,-73.941,24353.0,1.0


Now that we have separated our features (X) and target (y), we can scale the features.

In [72]:
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

In [73]:
X_scaled

array([[ 1.03300688,  0.        ,  0.        , ...,  0.53985333,
        -0.67550188,  0.37939888],
       [-0.06969617,  0.        ,  0.        , ...,  0.57507167,
        -0.11035138,  0.37939888],
       [-0.05288529,  0.        ,  0.        , ..., -1.14863476,
         0.32470001,  0.37939888],
       ...,
       [ 0.03207782,  0.        ,  0.        , ..., -0.46569111,
         0.0700527 , -2.63574843],
       [ 0.33512806,  0.        ,  0.        , ...,  0.27811097,
         1.09592749,  0.37939888],
       [-0.27097242,  0.        ,  0.        , ...,  1.07948491,
         1.20278221,  0.37939888]])

Now we are ready to split our data set into training and test sets.

In [74]:
from sklearn.model_selection import train_test_split

In [75]:
# Split the scaled features (X_scaled), not the unscaled X
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

In [76]:
print('The shapes of train and test sets for features are: ', np.shape(X_train), np.shape(X_test))

The shapes of train and test sets for features are:  (3369, 9) (843, 9)


In [77]:
print('The shapes of train and test sets for target are: ', np.shape(y_train), np.shape(y_test))

The shapes of train and test sets for target are:  (3369, 1) (843, 1)
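
The cells above fit the scaler on the full data set. A commonly recommended alternative is to split first and fit the scaler on the training portion only, so that no statistics from the test set leak into preprocessing. Here is a minimal sketch on synthetic data (`X_toy`, `y_toy`, and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.normal(size=100)

# 1. Split first; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# 2. Learn the mean and standard deviation from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the same training statistics to the test set
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The scaled test set will not have exactly zero mean and unit variance. That is expected, because it is scaled with the training statistics.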


## 4. Sklearn for Validation
We are *this 🤏* close to training machine learning models. But first let's equip ourselves with some validation techniques. Let's remember our NumPy exercises.

In [80]:
target_values = np.linspace(0,1,11)
print(target_values)

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]


In [83]:
np.random.seed(123)

In [84]:
predicted_values1 = target_values + np.random.normal(0, 0.1, 11)
print(predicted_values1)

[-0.10856306  0.19973454  0.22829785  0.14937053  0.34213997  0.66514365
  0.35733208  0.65710874  0.92659363  0.81332596  0.93211138]


In [85]:
predicted_values2 = target_values + np.random.uniform(0, 0.1, 11)
print(predicted_values2)

[0.03980443 0.17379954 0.21824917 0.31754518 0.45315514 0.55318276
 0.6634401  0.78494318 0.87244553 0.96110235 1.07224434]


In [92]:
fig, ax = plt.subplots(figsize=(5,6))
ax.plot(np.arange(11), target_values, color='k')
ax.scatter(np.arange(11), target_values, color='k', label='True Values')
ax.scatter(np.arange(11), predicted_values1, color='b', label='Predictions-1')
ax.scatter(np.arange(11), predicted_values2, color='g', label='Predictions-2')
ax.grid()
ax.legend()
ax.set_xlabel('Indices')
ax.set_ylabel('Values')
plt.show()

[Figure: Values versus Indices, showing the true values (black line and dots), Predictions-1 (blue dots), and Predictions-2 (green dots).]

We need a metric to evaluate how close the blue and green dots are to the black line. Let's try all the metrics we talked about in class.

In [93]:
from sklearn.metrics import mean_absolute_error as mae

In [94]:
mae1 = mae(target_values, predicted_values1)
mae2 = mae(target_values, predicted_values2)

print('Mae-1: ', mae1, 'Mae-2: ', mae2)

Mae-1:  0.1069949157796533 Mae-2:  0.05544651890308832


In [96]:
from sklearn.metrics import mean_squared_error as mse

In [98]:
# squared=False returns the square root of the MSE, i.e. the RMSE
rmse1 = mse(target_values, predicted_values1, squared=False)
rmse2 = mse(target_values, predicted_values2, squared=False)

print('Rmse-1: ', rmse1, 'Rmse-2: ', rmse2)

Rmse-1:  0.12236967600532829 Rmse-2:  0.059365272081740306


In [100]:
# squared=True (the default) returns the MSE itself
mse1 = mse(target_values, predicted_values1, squared=True)
mse2 = mse(target_values, predicted_values2, squared=True)

print('Mse-1: ', mse1, 'Mse-2: ', mse2)

Mse-1:  0.01497433760564902 Mse-2:  0.003524235529339055
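
To see what these metrics actually compute, the sketch below writes MAE, MSE, and RMSE out by hand with NumPy and compares them with sklearn. The arrays `y_true` and `y_pred` are made-up values for illustration. The RMSE is obtained by taking the square root explicitly rather than through the `squared` keyword, which newer scikit-learn versions may no longer accept:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions
y_true = np.array([0.0, 0.5, 1.0])
y_pred = np.array([0.1, 0.3, 1.2])

errors = y_pred - y_true

# MAE: average of the absolute errors
mae_manual = np.mean(np.abs(errors))

# MSE: average of the squared errors
mse_manual = np.mean(errors ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

print('MAE: ', mae_manual, mean_absolute_error(y_true, y_pred))
print('MSE: ', mse_manual, mean_squared_error(y_true, y_pred))
print('RMSE:', rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```

Because the errors are squared before averaging, MSE and RMSE penalise a few large misses more heavily than MAE does.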


👏 Congratulations, you have completed the Sklearn Workbook!