# Introduction to Machine Learning <font color='blue'> (35 min) </font>

# Google doc with code corrections is accessible at:
### https://docs.google.com/document/d/1phmpGjNJbHwxP7448taFqREw6Vw3qVSUEDhu1KcLxog/edit?usp=sharing

# 0) Importing the right tools <font color='blue'> (5 min) </font>

### <font color='red'>0.1) Import the necessary packages with their usual aliases: </font>

- pandas as pd
- numpy as np
- seaborn as sns
- matplotlib.pyplot as plt

In [1]:
from __future__ import division

#### IMPORT THE USUAL PACKAGES WITH THEIR ALIASES ####

%pylab inline

### <font color='red'>0.2) Import the dataset from <i>'../data/data_after_feature_engineering.csv'</i></font>

In [2]:
raw_data = #### IMPORT AND READ THE CSV DATA USING pd.read_csv() #### 

### <font color='red'>0.3) Copy the raw_data and print samples</font>

In [None]:
data = #### COPY THE RAW DATA WITH THE .copy() FUNCTION ####

In [None]:
#### PRINT SAMPLES USING .sample() AND CHECK THE FEATURES ####

# Predictive modeling

## 1) Variable encoding <font color='blue'> (10 min) </font>

Categorical variables need to be converted to numbers before machine learning algorithms can be trained on them. There are several kinds of variable encoding, such as dummy-encoding: this method builds $n$ binary columns when a variable can take $n$ values. When regression is used, features must not be perfectly collinear (the $n$ binary columns always sum to 1), hence only $n-1$ binary columns are created.
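As a minimal sketch of the two encodings described above, assuming a made-up `Conditions` column with three values (the `toy` DataFrame and its contents are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column with n = 3 distinct values
toy = pd.DataFrame({'Conditions': ['Clear', 'Rain', 'Snow', 'Rain']})

# Full dummy-encoding: one binary column per value (n columns)
full = pd.get_dummies(toy.Conditions)

# Encoding suited to regression: drop the first level so that the
# remaining columns are no longer perfectly collinear (n - 1 columns)
reduced = pd.get_dummies(toy.Conditions, drop_first=True)

print(list(full.columns))     # ['Clear', 'Rain', 'Snow']
print(list(reduced.columns))  # ['Rain', 'Snow']
```

In the full encoding every row has exactly one active column, which is why one of the columns carries no extra information.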

### <font color='red'>1.1) Understand how the <i>pd.get_dummies(data.column_name_here)</i> function creates dummy variables</font>

In [None]:
#### TRY HERE TO DUMMY-ENCODE WEATHER CONDITIONS OF THE DATASET, USING pd.get_dummies(data.Conditions) ####

### <font color='red'>1.2) Print the names of the columns of the dataset using <i>data.columns</i>. Which ones should be dummified?</font>

In [None]:
#### PRINT THE NAMES OF THE COLUMNS ####

### <font color='red'>1.3) Fill in the following loop so that it appends the newly created dummy columns to the existing DataFrame</font>

In [6]:
for variable_name in ['Conditions','start_day','is_weekend',
                      'start_moment','is_rainy','is_circle_trip']:
    print('Dummifying the {} variable ...'.format(variable_name))
    
    dummies = #### CREATE DUMMIES FROM THE COLUMN data[variable_name], as you did above with pd.get_dummies() ####
    
    dummies.columns = ['{}_{}'.format(variable_name,x) for x in dummies.columns]  # this will rename the column
                                                                                  # in an appropriate way
        
    data = pd.concat([data,dummies],axis=1)  # This will append the dummy column to the existing dataframe

### <font color='red'>1.4) Once you are sure that the dummy columns have been created (check by printing samples), delete the old columns</font>

In [None]:
#### CHECK THAT DUMMIFICATION IS SUCCESSFUL BY PRINTING SAMPLES AND COLUMN NAMES ####

In [None]:
for variable_name in ['Conditions','start_day','is_weekend',
                      'start_moment','is_rainy','is_circle_trip']:
    print('Deleting the {} variable ...'.format(variable_name))
    #### DELETE THE OLD COLUMN USING del data[column_name_here] HERE ####

## 2) Correlation matrix <font color='blue'> (10 min) </font>

### <font color='red'>2.1) Using the <i>.corr()</i> method on <i>data</i>, print samples of the Pearson correlations between features within the dataset</font>

In [None]:
corr = #### COMPUTE THE PEARSON CORRELATION BETWEEN FEATURES HERE ####

In [None]:
#### PRINT SAMPLES OF THE CORRELATIONS HERE ####

### <font color='red'>2.2) Using the <i>sns.heatmap()</i> function of seaborn, plot the Pearson correlations between features. You can refer to https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.heatmap.html to add a mask and make the plot look better</font>
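The masking idea mentioned above can be sketched as follows, assuming random toy data in place of the real features (the `toy` DataFrame, its column names and the helper `plot_corr_heatmap` are hypothetical). Because a correlation matrix is symmetric, hiding the upper triangle removes redundant cells:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real features
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(50, 4), columns=['a', 'b', 'c', 'd'])

toy_corr = toy.corr()  # Pearson correlation is the default method

# Boolean mask: True marks the cells to hide (upper triangle and diagonal)
mask = np.triu(np.ones(toy_corr.shape, dtype=bool))

def plot_corr_heatmap(corr, mask):
    """Draw the masked heatmap (plotting libraries are imported only when called)."""
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
                cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Pearson correlations between features')
    plt.show()
```

Calling `plot_corr_heatmap(toy_corr, mask)` in a notebook cell draws the lower triangle only.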

## 3) Cross-validation : trying to predict customer vs. subscriber <font color='blue'> (30 min) </font>

### <font color='red'>3.1) Use the <i>seaborn</i> function <i>sns.countplot(data.column_name_here)</i> to plot the number of trips per user type (i.e. the <i>usertype</i> column)</font>

In [None]:
#### PLOT THE NUMBER OF TRIPS PER USER TYPE WITH sns.countplot() AND ADD A CLEAR TITLE ####

### <font color='red'> Run the following block, it will delete a few columns for predictive modeling purposes </font>

In [16]:
del data['starttime'], data['stoptime'], data['start station name'], data['end station name']
del data['gender'], data['birth year']
del data['bikeid']
del data['start station id'], data['end station id']

### <font color='red'>3.2) Print the different columns using <i>data.columns</i> </font>

In [None]:
#### PRINT THE COLUMNS OF THE DATASET ####

### <font color='red'>3.3) Run the following block. It builds arrays for the features, as well as the labels. Features will be used to predict the labels. Study their structures.</font>

In [19]:
labels = np.array(data.usertype)
del data['usertype']
features = np.array(data)

In [None]:
#### Study the structure of features and labels ####

### <font color='red'>3.4) How many observations and features do we have to build our models? You can use the <i>.shape</i> attribute of features and labels</font>

### <font color='red'>3.5) Import the scikit-learn package (called <i>sklearn</i>), that will be used for running machine learning algorithms</font>

In [None]:
#### IMPORT THE SCIKIT-LEARN PACKAGE ####

### <font color='red'>3.6) Binarize the labels of the dataset. You can use the following webpage:</font>
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.label_binarize.html

In [None]:
from sklearn.preprocessing import label_binarize
binarized_labels = #### BINARIZE THE LABELS OF THE DATASET, AND RAVEL THE RESULT USING .ravel() ####
                   #### Subscriber will be label 1, Customer label 0 ####
                   #### This will be a binary classification problem ####
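As a rough illustration of how `label_binarize` behaves on a two-class problem, here is a sketch on four made-up labels (the `toy_labels` array is hypothetical). The order of the `classes` argument decides which class becomes 0 and which becomes 1:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical labels, not taken from the dataset
toy_labels = np.array(['Subscriber', 'Customer', 'Subscriber', 'Customer'])

# With two classes, the first one listed maps to 0 and the second to 1
toy_column = label_binarize(toy_labels, classes=['Customer', 'Subscriber'])
print(toy_column.shape)      # (4, 1): a column vector, one row per label

# .ravel() flattens the column vector into a 1-D array
toy_binarized = toy_column.ravel()
print(toy_binarized)         # [1 0 1 0]
```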

### <font color='red'>3.7) Show the binarized labels</font>

### Split training and testing sets

### <font color='red'>3.8) Split your dataset into a training set and a testing set (the testing set holding 30% of the data). You can use the following webpage:</font>
- http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

In [85]:
from sklearn.model_selection import train_test_split  # module available in scikit-learn >= 0.18
X_train, X_test, y_train, y_test = #### SPLIT BETWEEN TRAIN AND TEST ####
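The effect of a 30% test split can be sketched on a small made-up array (the names `toy_X`, `toy_y`, `Xa`, `Xb`, `ya`, `yb` and the `random_state` value are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)   # 10 observations, 2 features
toy_y = np.array([0, 1] * 5)           # 10 binary labels

# test_size=0.3 keeps 30% of the rows for testing;
# random_state makes the shuffle reproducible
Xa, Xb, ya, yb = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

print(Xa.shape, Xb.shape)   # (7, 2) (3, 2)
print(ya.shape, yb.shape)   # (7,) (3,)
```

Features and labels are shuffled together, so each row of `Xa` still lines up with the matching entry of `ya`.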

### <font color='red'>3.9) Show the results of your split</font>

In [4]:
#### Show samples of X_train, X_test, y_train, y_test ####

### Random Forest classifier

### <font color='red'>3.10) Go to the following webpages to understand how to compute cross-validation scores of a Random Forest classifier in Python, on the training set:</font>
- http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html#sklearn.cross_validation.cross_val_score

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = #### DEFINE A RANDOM FOREST CLASSIFIER ####

### <font color='red'>Run the following block. It will compute a 3-fold cross-validation score using the </font><b>AUC scoring metric</b>,  <font color='red'>as explained in the slides</font>

In [None]:
from sklearn.model_selection import cross_val_score
# cv=3 requests the 3 folds explicitly
cross_val_score(rf, X_train, y_train, cv=3, scoring='roc_auc')
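To see what `cross_val_score` returns, here is a minimal sketch on synthetic data generated with `make_classification` (the data, the names `toy_X`, `toy_y`, `toy_rf` and the hyperparameter values are assumptions, not the workshop dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for X_train, y_train
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0)

# One AUC value is returned per fold; cv=3 asks for 3 folds
toy_scores = cross_val_score(toy_rf, toy_X, toy_y, cv=3, scoring='roc_auc')

print(toy_scores)
print('Mean AUC: {:.2f}'.format(toy_scores.mean()))
```

Passing `cv` explicitly is a safe habit: recent scikit-learn releases use 5 folds when it is omitted.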

### <font color='red'>3.11) Take some time to run and understand the following block. 

<b>It does exactly the same thing as the <i>cross_val_score</i> function from the block above, but it is written so that you can see what happens at each iteration and plot the ROC curve for every cross-validation split</b>. You can use the following webpages:</font>
- http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=3)
classifier = RandomForestClassifier()

mean_tpr = 0.
mean_fpr = np.linspace(0, 1, 100)  # common FPR grid used to average the folds

plt.figure(figsize=(15,10))
for i, (train, test) in enumerate(cv.split(X_train, y_train), 1):
    print('Fold {}'.format(i))
    probas_ = classifier.fit(X_train[train], y_train[train]).predict_proba(X_train[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_train[test], probas_[:, 1], pos_label=1)
    # Interpolate this fold's TPR on the common FPR grid before accumulating
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

mean_tpr /= cv.get_n_splits()
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)

plt.plot(mean_fpr, mean_tpr, 'k--',
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for Random Forest Classifier',fontsize=17)
plt.legend(loc="lower right")
plt.show()
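The two building blocks of the loop, `roc_curve` and `auc`, can be checked by hand on four made-up predictions (the arrays `toy_true` and `toy_proba` are hypothetical values chosen for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

toy_true = np.array([0, 0, 1, 1])
toy_proba = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probability of class 1

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(toy_true, toy_proba, pos_label=1)
toy_auc = auc(fpr, tpr)                       # trapezoidal area under the curve
print(fpr, tpr)
print('AUC = {:.2f}'.format(toy_auc))         # AUC = 0.75

# Resample the TPR on a fixed FPR grid, which is what allows
# curves from different folds to be averaged point by point
grid = np.linspace(0, 1, 5)
tpr_on_grid = np.interp(grid, fpr, tpr)
```

An AUC of 0.75 here means that 3 of the 4 possible (positive, negative) pairs are ranked correctly by the scores.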

## 4) Final fit and plots <font color='blue'> (15 min) </font>

### <font color='red'>4.1) Define a random forest classifier, and fit it on the training set. You can refer back to:</font>
- http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
rf = #### DEFINE A RANDOM FOREST CLASSIFIER ####

#### FIT THE RANDOM FOREST CLASSIFIER ON THE TRAINING SET ####

### <font color='red'>4.2) Show (and if you can, plot!) the feature importances using the <i>.feature_importances_</i> attribute of your classifier. You can get hints on the following page:</font>
- http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html

In [None]:
#### PLOT THE FEATURES IMPORTANCES WITH RESPECT TO THE RANDOM FOREST ####
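A possible way to read feature importances, sketched on synthetic data (the data, the feature names in `toy_names` and the hyperparameter values are assumptions made for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 6 features, 3 of which are informative
toy_X, toy_y = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
toy_names = ['f{}'.format(i) for i in range(toy_X.shape[1])]  # hypothetical names

toy_rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(toy_X, toy_y)

importances = toy_rf.feature_importances_   # note the trailing underscore
order = np.argsort(importances)[::-1]       # indices, most important first

for idx in order:
    print('{:<4} {:.3f}'.format(toy_names[idx], importances[idx]))
```

The importances sum to 1, so each value can be read as a share of the total. The sorted values can then be passed to a bar plot such as `plt.barh`.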

### <font color='red'>4.3) Run this block to get the final performance on test set, using <i>roc_auc_score</i></font>

In [40]:
from sklearn.metrics import roc_auc_score

y_predict_test = rf.predict_proba(X_test)[:,1]
print('Final AUC score on test set : {:.2f}'.format(roc_auc_score(y_test, y_predict_test)))

Final AUC score on test set : 0.85


### <font color='red'>4.4) Using <i>seaborn.distplot</i>, plot the distribution of trip durations with respect to user type (whether "Subscriber" or "Customer")</font>

In [None]:
#### PLOT THE DISTRIBUTION OF TRIP DURATIONS WITH RESPECT TO USER TYPES BY USING APPROPRIATE SLICES ON RAW_DATA ####

#### ADD X,Y LABELS, A TITLE, AND A LEGEND TO THIS PLOT ####

### <font color='red'>4.5) Using <i>seaborn.distplot</i>, plot the distribution of average speed with respect to user type (whether "Subscriber" or "Customer")</font>

In [None]:
#### PLOT THE DISTRIBUTION OF AVERAGE SPEEDS WITH RESPECT TO USER TYPES BY USING APPROPRIATE SLICES ON RAW_DATA ####

### <font color='red'>4.6) Run the following block to understand how fine-tuning parameters can help improve the performance of your models. Warning : this will take some time to run !</font>

In [None]:
rf = RandomForestClassifier(max_depth=20,max_features=10,n_estimators=50)
cross_val_score(rf, X_train, y_train, cv=3, scoring='roc_auc')  # same 3 folds as in 3.10, for a fair comparison

# Free exploration/modeling of the dataset, for instance: <font color='blue'> (45 min) </font>
- Try to use other algorithms
- Try to enrich with other data (taxi trips, points of interest in neighborhoods)
- Try to predict other phenomena
- ....

# Please give us your feedback on the hands-on sessions at:
## <center>https://docs.google.com/forms/d/e/1FAIpQLScw_fPB1m6x_sMm59v_VHNBVcvfsMPhoqXwSjSiJQtzlpOJJA/viewform?usp=sf_link</center>