## 2.75 Machine Learning for Ranking using scikit-learn

We are going to be using [scikit-learn](http://scikit-learn.org/stable/index.html).

<img src='files/resources/scikit-learn-logo-small.png' align='left'><h2>Machine Learning in Python</h2>

<br>
* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license


## Data

In [2]:
import os 
import pandas as pd

In [3]:
data = pd.read_csv("data/fullDataset.tsv", sep="\t",header=0)


In [4]:
data.head()

Unnamed: 0,key,query,Title,LeafCats,ItemID,X_unit_id,SCORE,label_relevanceGrade,label_relevanceBinary,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,215248,all star ticket 2015,6N/7D Disney World All Star Music~ All Inclusi...,29579,67884,793072069,-1:1:-1,3,0,2676.618,0,-9.020574,-5.221152,32.879559,0.0,241,-1000000,139,-100.0
1,42799,mga top bows,"MGTC/TD/TF, MGA, Morris Minor Top Wing Bolts s...",34206,21331,687419458,3::2,5,0,2676.618,0,-7.435843,-5.986507,24.565182,0.036317,0,0,208,-100.0
2,73041,cross document markers,New Cross Vice Gel Ink Pen Gift Set & Document...,165631,36048,687423583,3:3:3,6,1,2676.618,0,-8.030511,-4.509118,48.652847,0.0,0,0,226,-100.0
3,138766,wall cabinet white,White PAPER TOWEL ROLL HOLDER Cabinet Wall Mou...,20643,49003,780091360,-1:-2:-2,1,0,1557.416626,1,-8.218757,-6.132952,45.209755,0.0,1258,-1000000,158,1.670978
4,222660,2.5 jato cooling head,ENGINE COOLING HEAD TRAXXAS JATO TRX 2.5 NITRO...,34061,71436,793073501,3:3:3,6,1,2676.618,0,-8.599997,-5.31775,79.376099,0.805351,0,0,150,2.538091


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78500 entries, 0 to 78499
Data columns (total 19 columns):
key                      78500 non-null int64
query                    78500 non-null object
Title                    78494 non-null object
LeafCats                 78500 non-null object
ItemID                   78500 non-null int64
X_unit_id                78500 non-null object
SCORE                    78500 non-null object
label_relevanceGrade     78500 non-null int64
label_relevanceBinary    78500 non-null float64
feature_1                78500 non-null float64
feature_2                78500 non-null float64
feature_3                78500 non-null float64
feature_4                78500 non-null float64
feature_5                78500 non-null float64
feature_6                78500 non-null float64
feature_7                78500 non-null float64
feature_8                78500 non-null float64
feature_9                78500 non-null int64
feature_10               78499 non-null 

In [6]:
data.describe()

Unnamed: 0,key,ItemID,label_relevanceGrade,label_relevanceBinary,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
count,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78500.0,78499.0
mean,98346.614064,48698.03,5.254777,0.713653,2821.519304,0.562437,-7.75453,-4.280808,47.526718,0.439879,50746.22321,-240684.450178,171.255248,-62.985598
std,70253.651231,2890059.0,1.334406,8.44696,834.067146,0.49741,1.162798,0.772099,24.222248,18.439044,228258.768038,527649.016958,41.412263,48.710378
min,1.0,1.0,0.0,0.0,-1890.0,-9.658937,-13.563713,-10.324562,0.0,0.0,-1000000.0,-1000000.0,-100.0,-100.0
25%,39209.75,19084.75,5.0,0.0,2676.618,0.0,-8.285197,-4.736623,30.806008,0.101174,0.0,-1000000.0,144.0,-100.0
50%,78577.5,38313.5,6.0,1.0,2676.618,1.0,-7.651813,-4.327543,44.18965,0.365236,191.0,0.0,172.0,-100.0
75%,157546.25,57659.25,6.0,1.0,3237.216614,1.0,-7.390667,-3.766841,60.076828,0.613038,4707.25,98130.0,203.0,0.620428
max,234520.0,809746800.0,6.0,2363.75,5286.0,1.0,-2.72615,42.59576,273.239624,5166.0,3339990.0,1000000.0,240.0,10.940677


## Data Exploration

We start by exploring the data, but first we remove every record that contains missing values.  Most machine learning algorithms require complete records.  Some types of classification and regression trees are designed to handle missing data, but in general we have to deal with it ourselves during preprocessing.

In [4]:
data = data.dropna()
data.describe()

Unnamed: 0,key,ItemID,label_relevanceGrade,label_relevanceBinary,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
count,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0,78493.0
mean,98353.149287,38385.73339,5.254774,0.683526,2821.567172,0.562598,-7.754471,-4.281164,47.528829,0.374066,50762.626642,-240647.46213,171.256023,-62.984023
std,70253.061135,22259.156287,1.334363,0.465103,834.042432,0.496069,1.162636,0.753205,24.222072,0.280593,228237.868119,527631.90141,41.400772,48.710849
min,1.0,1.0,0.0,0.0,-1890.0,0.0,-13.563713,-10.324562,0.0,0.0,0.0,-1000000.0,0.0,-100.0
25%,39212.0,19088.0,5.0,0.0,2676.618,0.0,-8.285108,-4.736623,30.808287,0.101141,0.0,-1000000.0,144.0,-100.0
50%,78584.0,38316.0,6.0,1.0,2676.618,1.0,-7.65179,-4.32751,44.190533,0.365175,191.0,0.0,172.0,-100.0
75%,157548.0,57660.0,6.0,1.0,3237.285645,1.0,-7.390667,-3.766839,60.078297,0.613038,4706.0,98130.0,203.0,0.620932
max,234520.0,77080.0,6.0,1.0,5286.0,1.0,-2.72615,-1.106641,273.239624,1.0,3339990.0,1000000.0,240.0,10.940677
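
Before dropping rows it can help to see where the missing values sit. In the `data.info()` output above, `Title` has 78494 non-null entries and `feature_10` has 78499, which accounts for the 7 rows removed by `dropna()`. The following is a minimal sketch on a made-up toy frame (the `toy` values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["a", None, "c", "d"],
    "feature_1": [1.0, 2.0, 3.0, 4.0],
    "feature_10": [0.5, 1.5, np.nan, 2.5],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop any row with at least one missing value, as data.dropna() does above
complete = toy.dropna()
print(len(toy), "->", len(complete))
```

Running the same `isnull().sum()` call on `data` before the `dropna()` step would show which columns are responsible for the dropped rows.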


In [12]:
from bokeh.charts import defaults, Bar, output_notebook, show
defaults.plot_width = 500
defaults.plot_height = 300
output_notebook(hide_banner=True)

We can see that low-relevance results are underrepresented:

In [13]:
p = Bar(data, 'label_relevanceGrade',  values='label_relevanceGrade', agg='count', title="Distribution of Relevance Grade")
show(p)

<bokeh.io._CommsHandle at 0x7fbde1f83f50>

A lazy learning algorithm could score well simply by predicting grade 6 for everything, because correctly predicting the rarer grades such as 0 or 1 has less payoff.  To encourage a more discriminating classifier we can turn the 6-level relevance score into a binary target.  Now it does not pay to classify everything as relevant.
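
To make the "lazy algorithm" concrete: the accuracy of always predicting the most frequent label equals that label's share of the data. The sketch below uses made-up label counts (not the real distribution) and a hypothetical helper, `majority_baseline_accuracy`:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Made-up grades, skewed towards 6 (illustrative only)
grades = [6] * 60 + [5] * 25 + [3] * 10 + [1] * 5
label, acc = majority_baseline_accuracy(grades)
print(label, acc)   # 6 0.6

# With two equally frequent classes the lazy strategy drops to 50%
balanced = [1] * 50 + [0] * 50
print(majority_baseline_accuracy(balanced)[1])   # 0.5
```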

In [14]:
p = Bar(data, 'label_relevanceBinary',  values='label_relevanceBinary', agg='count', title="Distribution of Binary Relevance")
show(p)

<bokeh.io._CommsHandle at 0x7fbde01f7610>

Let's look at the relationship between this binary relevance target and the features.

In [15]:
from bokeh.io import gridplot
from bokeh.charts import BoxPlot, output_file, show

defaults.plot_width = 400
defaults.plot_height = 300

def bplot(feature):
    return BoxPlot(data, values=feature, label="label_relevanceBinary", outliers=False)

p = gridplot ( [ 
        [ bplot("feature_1"), bplot("feature_2") ],
        [ bplot("feature_3"), bplot("feature_4") ],
        [ bplot("feature_5"), bplot("feature_6") ],
        [ bplot("feature_7"), bplot("feature_8") ],
        [ bplot("feature_9"), bplot("feature_10") ] ] ) 
show(p)

<bokeh.io._CommsHandle at 0x7fbdcab31350>

## Training / Test data

We also want to scale our features to have zero mean and unit standard deviation.  This prevents features with large numeric ranges from dominating the machine learning objective function, something that can lead to poorly performing models.

To reduce the data size for our virtual machine I take a 20k sample of the data.  I also rebalance the data so that relevant and not-relevant labels are equally represented.  This simplifies our model accuracy comparison during class.

In [7]:
from sklearn import preprocessing
import joblib  # older scikit-learn releases exposed this as sklearn.externals.joblib

# Fit the scaler on the ten feature columns
scaler = preprocessing.StandardScaler().fit(data.loc[:,"feature_1":"feature_10"])

# Serialize for later when we deploy the model
joblib.dump(scaler,'models/scaler.pkl')

# Split by label, then draw 10k rows (with replacement) from each class
data_0 = data.loc[data.label_relevanceBinary == 0]
data_1 = data.loc[data.label_relevanceBinary == 1]

data_0 = data_0.sample(10000, replace=True)
data_1 = data_1.sample(10000, replace=True)

data_balanced = pd.concat([data_0, data_1])

y = data_balanced["label_relevanceBinary"].values

X = scaler.transform(data_balanced.loc[:,"feature_1":"feature_10"])

features = data_balanced.loc[:,"feature_1":"feature_10"].columns

print(X.mean(axis=0))
print(X.std(axis=0))

#data_balanced.to_csv('data/data_balanced.csv', index=False)

[-0.10354013 -0.0686359   0.00178727 -0.06263787 -0.04274909 -0.0990162
 -0.01290366  0.00490385 -0.00372273 -0.02053828]
[ 0.99086711  1.00628588  0.96885435  1.01401985  0.98621543  0.99620395
  0.96927898  0.98129891  0.99881457  0.99349875]
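
The scaling step can be reproduced by hand, which also shows why the printed means and standard deviations above are close to, but not exactly, 0 and 1: the scaler was fitted on the full dataset and then applied to a resampled subset. This is a minimal NumPy sketch on synthetic columns (the distributions and sizes are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature matrix with very different column scales (illustrative only)
X_raw = np.column_stack([
    rng.normal(2800.0, 800.0, size=1000),
    rng.normal(-7.5, 1.2, size=1000),
])

# Column-wise standardization: subtract the mean, divide by the standard deviation
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X_scaled = (X_raw - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column

# A resampled subset reuses the same mean/std, so its statistics drift slightly
subset = X_scaled[rng.integers(0, 1000, size=200)]
print(subset.mean(axis=0))
```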


In machine learning we train on one data set and measure performance on another.  This ensures that our performance metrics are representative of how the model performs on data it has not seen, and it lets us detect overfitting.

In [8]:
# formerly sklearn.cross_validation in older scikit-learn releases
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=73)

print("X training : " + str(X_train.shape))
print("y training : " + str(y_train.shape))
print("X test     : " + str(X_test.shape))
print("y test     : " + str(y_test.shape))

X training : (15000, 10)
y training : (15000,)
X test     : (5000, 10)
y test     : (5000,)
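
Why a held-out set matters can be seen by comparing training and test accuracy for a model that is free to memorise its training data. The following is a sketch on synthetic data from `make_classification` (the dataset, its noise level, and the `_demo` names are assumptions for illustration, not the class data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for the real features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.2, random_state=73
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=73
)

# An unconstrained tree keeps splitting until it has memorised the training set
deep_tree = DecisionTreeClassifier(random_state=73).fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)
test_acc = deep_tree.score(X_te, y_te)
print("train accuracy:", train_acc)
print("test accuracy :", test_acc)
```

A large gap between the two numbers is the usual sign of overfitting; only the test figure says anything about unseen data.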


## Baseline Model

Model 01 - just predict everything to be relevant and check performance - this is our baseline.

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>It is a good idea to label your models.  
Keeping track of what worked and did not work is much easier this way.

There is no model to fit for our baseline case - we can create the predictions directly:

In [None]:
y_pred_01 = [1] * y_test.shape[0]

sklearn includes the `metrics` module, which has many functions for computing performance metrics for a model.  Here we use the `confusion_matrix()`, `classification_report()` and `accuracy_score()` functions:

In [27]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print('Confusion Matrix:\n'+str(confusion_matrix(y_test, y_pred_01)))
print('\nClassification Report:\n'+str(classification_report(y_test, y_pred_01)))
print('Accuracy = {0:6.2f}'.format(accuracy_score(y_test, y_pred_01)))

Confusion Matrix:
[[   0 2483]
 [   0 2517]]

Classification Report:
             precision    recall  f1-score   support

        0.0       0.00      0.00      0.00      2483
        1.0       0.50      1.00      0.67      2517

avg / total       0.25      0.50      0.34      5000

Accuracy =   0.50


We need to define the metric we are going to use to evaluate and compare models.  There are many classification metrics - based on the cells in the confusion matrix:
 
|                  | *Predicted* − | *Predicted* + | Total |
|------------------|---------------|---------------|-------|
| *True class* −   |   TN          |    FP         |  N    |
| *True class* +   |   FN          |    TP         |  P    |
| Total            |   N\*         |    P\*        |       |


Now we can define metrics based on the contingency table:

| Name                      |   Definition  |   Synonyms                                  |
|---------------------------|---------------|---------------------------------------------|
| False Positive Rate       |   FP / N      | Type-I error, 1-Specificity                 |
| True Positive Rate        |   TP / P      | 1-Type II error, power, sensitivity, recall |
| Positive Predictive Value |   TP / P*     | Precision, 1-false discovery proportion     |
| Negative Predictive Value |   TN / N*     |                                             |

A good diagram (shown below) and more discussion are available on [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall).
<img src='files/resources/Precisionrecall.png' align='center'>

Sometimes we talk about the accuracy of a model as:

$
accuracy = \frac{(TP + TN)}{( N + P )}
$

Generally we are interested in precision and recall.  

* Precision
    * The fraction of cases predicted to be positive that are actually positive
    * High precision is achieved by ensuring that the predicted positive cases are actually positive
    * Quality
* Recall
    * The fraction of the actual positive cases that are predicted to be positive
    * High recall is achieved when the model finds as many of the positive cases as possible
    * Quantity
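
The definitions above can be checked against the baseline confusion matrix printed earlier (TN = 0, FP = 2483, FN = 0, TP = 2517). This is a minimal sketch; the helper name `rates` and the choice to return `None` for a zero denominator are assumptions made for illustration:

```python
def rates(tn, fp, fn, tp):
    """Metrics from the contingency table; None where the denominator is zero."""
    def ratio(num, den):
        return num / den if den else None

    n, p = tn + fp, fn + tp            # actual negatives / actual positives
    n_star, p_star = tn + fn, fp + tp  # predicted negatives / predicted positives
    return {
        "false_positive_rate": ratio(fp, n),
        "true_positive_rate": ratio(tp, p),            # recall
        "positive_predictive_value": ratio(tp, p_star),  # precision
        "negative_predictive_value": ratio(tn, n_star),
        "accuracy": ratio(tp + tn, n + p),
    }

# Baseline model: everything predicted as relevant
baseline = rates(tn=0, fp=2483, fn=0, tp=2517)
for name, value in baseline.items():
    print(name, value)
```

Because the baseline never predicts the negative class, N\* is zero and the negative predictive value is undefined, which the sketch reports as `None`.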

We can also define classification metrics that combine precision and recall.  The F1-score is one such metric: it computes the harmonic mean of precision and recall, placing equal emphasis on each:

$
F_1 = 2 \times \frac{precision \cdot recall}{precision + recall}
$

But we may choose to place higher emphasis on recall:

$
F_2 = 5 \times \frac{precision \cdot recall}{(4 \cdot precision) + recall}
$

Or we may choose to place higher emphasis on precision:

$
F_{0.5} = 1.25 \times \frac{precision \cdot recall}{(0.25 \cdot precision) + recall}
$
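
The three formulas are instances of the general F-beta score, $F_\beta = (1 + \beta^2) \times \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$. A small sketch, using a hypothetical helper `f_beta` and the baseline's positive-class precision (0.50) and recall (1.00) from the classification report above:

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

# Baseline: precision 0.50, recall 1.00 for the positive class
for beta in (1.0, 2.0, 0.5):
    print(beta, round(f_beta(0.50, 1.00, beta), 2))
# 1.0 0.67
# 2.0 0.83
# 0.5 0.56
```

scikit-learn provides `sklearn.metrics.fbeta_score` for computing this directly from label arrays.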

For this exercise we are going to focus on the accuracy score.

We can see that recall for the positive class is 100%: every positive case was predicted correctly.  Precision is low, however: of the cases predicted to be positive only 50% were actually positive.  This yields an F1 score of 67% and an accuracy of 50%.  This is the baseline on which we want to improve.

## Class Exercise

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Your goal is to improve on this model.  The class is going to compete in order to find the model with the best test accuracy.   
Here I provide a long list of the models to explore - you will need to search through hyperparameters and features.

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>Your results must be reproducible by others!!  
So when you beat the class best, shout out and we will try to reproduce the results!

Models to experiment with:

```
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=?)

from sklearn import linear_model
clf = linear_model.LogisticRegression(penalty=?)

from sklearn import svm
clf = svm.SVC(kernel='linear', C=?)

from sklearn import svm
clf = svm.SVC(kernel='rbf', C=?)

from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=?, max_depth =?)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=?)

from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=150, learning_rate=1.0, max_depth=10)

```
See documentation [here](http://scikit-learn.org/stable/index.html)
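
One way to search hyperparameters reproducibly is a cross-validated grid search with fixed random seeds. The sketch below is only an illustration on synthetic data: the grid values, the seed, and the `_demo` names are assumptions, and `GridSearchCV` is one of several search tools in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X and y (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=73)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=73
)

# Small, made-up grid of candidate hyperparameters
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=73),  # fixed seed so others can reproduce the run
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
test_accuracy = search.score(X_te, y_te)  # accuracy of the refitted best model
print(test_accuracy)
```

The hyperparameters are chosen by cross-validation on the training data only, so the test set is touched once, for the final accuracy figure.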

## Support Vector Machine



In [15]:
from sklearn import svm

svc = svm.SVC(kernel='linear', C=1.0).fit(X_train, y_train)
#svc = svm.SVC(kernel='rbf',gamma=0.7, C=1.0).fit(X_train, y_train)
#svc = svm.SVC(kernel='poly', degree = 3, C=1.0).fit(X_train, y_train)
y_pred_02 = svc.predict(X_test)

In [16]:
print(confusion_matrix(y_test, y_pred_02))
print(classification_report(y_test, y_pred_02))
print(accuracy_score(y_test, y_pred_02))

[[1915  568]
 [1135 1382]]
             precision    recall  f1-score   support

        0.0       0.63      0.77      0.69      2483
        1.0       0.71      0.55      0.62      2517

avg / total       0.67      0.66      0.66      5000

0.6594


## Logistic Regression





In [2]:
from sklearn import linear_model
# The liblinear solver supports the L1 penalty
lgr = linear_model.LogisticRegression(penalty='l1', solver='liblinear').fit(X_train, y_train)
y_pred_03 = lgr.predict(X_test)

In [18]:
print(confusion_matrix(y_test, y_pred_03))
print(classification_report(y_test, y_pred_03))
print(accuracy_score(y_test, y_pred_03))

[[1775  708]
 [ 949 1568]]
             precision    recall  f1-score   support

        0.0       0.65      0.71      0.68      2483
        1.0       0.69      0.62      0.65      2517

avg / total       0.67      0.67      0.67      5000

0.6686


In [1]:
print(lgr.coef_)

## Decision Tree

In [19]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(min_samples_leaf=5, max_depth = 4).fit(X_train, y_train)
y_pred_04 = clf.predict(X_test)

In [20]:
print(confusion_matrix(y_test, y_pred_04))
print(classification_report(y_test, y_pred_04))
print(accuracy_score(y_test, y_pred_04))

[[1740  743]
 [ 940 1577]]
             precision    recall  f1-score   support

        0.0       0.65      0.70      0.67      2483
        1.0       0.68      0.63      0.65      2517

avg / total       0.66      0.66      0.66      5000

0.6634


In [21]:
# Export the fitted tree in Graphviz DOT format
with open("tree.dot", "w") as f:
    tree.export_graphviz(clf, filled=True, out_file=f)

In [22]:
! dot -Tpdf tree.dot -o tree.pdf

## Random Forests

In [58]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=400).fit(X_train, y_train)
y_pred_05 = clf.predict(X_test)

In [61]:
print(confusion_matrix(y_test, y_pred_05))
print(classification_report(y_test, y_pred_05))
print(accuracy_score(y_test, y_pred_05))

[[1876  607]
 [ 664 1853]]
             precision    recall  f1-score   support

        0.0       0.74      0.76      0.75      2483
        1.0       0.75      0.74      0.74      2517

avg / total       0.75      0.75      0.75      5000

0.7458


In [62]:
# # Save classifier for deployment in our search engine
# from sklearn.externals import joblib
# joblib.dump(clf,'models/mlr.pkl')

## K-nearest neighbour

In [25]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
y_pred_06 = clf.predict(X_test)

In [26]:
print(confusion_matrix(y_test, y_pred_06))
print(classification_report(y_test, y_pred_06))
print(accuracy_score(y_test, y_pred_06))

[[1783  700]
 [1010 1507]]
             precision    recall  f1-score   support

        0.0       0.64      0.72      0.68      2483
        1.0       0.68      0.60      0.64      2517

avg / total       0.66      0.66      0.66      5000

0.658


## Naive Bayes

In [27]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB().fit(X_train, y_train)
y_pred_07 = clf.predict(X_test)

In [28]:
print(confusion_matrix(y_test, y_pred_07))
print(classification_report(y_test, y_pred_07))
print(accuracy_score(y_test, y_pred_07))

[[1873  610]
 [1073 1444]]
             precision    recall  f1-score   support

        0.0       0.64      0.75      0.69      2483
        1.0       0.70      0.57      0.63      2517

avg / total       0.67      0.66      0.66      5000

0.6634


In [29]:
clf.theta_


array([[-0.25714973, -0.22427008,  0.02437619, -0.19526252, -0.08727644,
        -0.27808288, -0.05308899,  0.04019813, -0.00878475, -0.07612776],
       [ 0.25778788,  0.23304922, -0.0161784 ,  0.18498124,  0.08399293,
         0.28029969,  0.0492468 , -0.0394789 ,  0.0183676 ,  0.07924237]])

## Gradient Tree Boosting 

In [75]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=150, learning_rate=1.0, max_depth=10).fit(X_train, y_train)
y_pred_08 = clf.predict(X_test)

In [76]:
print(confusion_matrix(y_test, y_pred_08))
print(classification_report(y_test, y_pred_08))
print(accuracy_score(y_test, y_pred_08))

[[1800  683]
 [ 715 1802]]
             precision    recall  f1-score   support

        0.0       0.72      0.72      0.72      2483
        1.0       0.73      0.72      0.72      2517

avg / total       0.72      0.72      0.72      5000

0.7204


In [77]:
# Test-set loss after each boosting stage
test_score = []
for raw_pred in clf.staged_decision_function(X_test):
    # loss_ expects raw decision values rather than predicted class labels
    test_score.append(clf.loss_(y_test, raw_pred))


df = pd.DataFrame({ 'n' : range(len(test_score)), 'train' : clf.train_score_, 'test' : test_score})

from bokeh.charts import Line

line = Line(data=df, x='n', y=['test'])
show(line)

<bokeh.io._CommsHandle at 0x7efd541fe0d0>

In [78]:
from bokeh.charts import Bar
from bokeh.charts.attributes import CatAttr

df = pd.DataFrame({ 'importance' : clf.feature_importances_, 'feature' : features})
df.sort_values(by='importance', inplace=True, ascending=False)

bar = Bar(df, label=CatAttr(columns=['feature'], sort=False), values='importance'  )
show(bar)

<bokeh.io._CommsHandle at 0x7efd4f7265d0>

In [79]:
# Save classifier for deployment in our search engine
import joblib  # older scikit-learn releases exposed this as sklearn.externals.joblib
joblib.dump(clf,'models/mlr.pkl')

['models/mlr.pkl',
 'models/mlr.pkl_01.npy',
 'models/mlr.pkl_02.npy',
 'models/mlr.pkl_03.npy']
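
At deployment time the saved scaler and classifier are loaded and applied in the same order as during training: scale first, then predict. The following is a sketch of that round trip, assuming the standalone `joblib` package; the stand-in model, the temporary paths, and the random `new_rows` are made up for illustration and are not the class models.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(73)

# Small stand-in training set with ten features (illustrative only)
X_demo = rng.normal(size=(200, 10))
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    model_path = os.path.join(tmp, "mlr.pkl")

    # Persist both objects, then load them back as a deployed service would
    joblib.dump(scaler_demo, scaler_path)
    joblib.dump(model_demo, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# Score unseen rows: apply the saved scaling before asking for probabilities
new_rows = rng.normal(size=(3, 10))
scores = loaded_model.predict_proba(loaded_scaler.transform(new_rows))[:, 1]
print(scores)
```

The probability of the relevant class can then serve as a ranking score for the candidate results of a query.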