# Supervised Machine Learning - Model Evaluation

## How can I predict if seeing a campaign will influence clicks/sales?

Here we will use a simulated dataset that you can download directly from Dropbox (see GitHub - General Repository - DA5; the Dropbox link is at the end of the page). In class we'll explore actual data from an airline company, which is structured in a similar way.



In [100]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Data Understanding

In [101]:
webdata = pd.read_excel('web_campaign_simulated.xlsx')

In [102]:
webdata.columns

Index(['id', 'age', 'female', 'used_search', 'referral', 'time_spent',
       'campaign_1', 'campaign_2', 'click', 'sell'],
      dtype='object')

In [103]:
webdata.dtypes

id              int64
age             int64
female          int64
used_search     int64
referral       object
time_spent      int64
campaign_1      int64
campaign_2      int64
click           int64
sell            int64
dtype: object

In [104]:
webdata.head()

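| | id | age | female | used_search | referral | time_spent | campaign_1 | campaign_2 | click | sell |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 40 | 1 | 1 | tumblr | 204 | 1 | 0 | 0 | 1 |
| 1 | 2 | 49 | 0 | 0 | | 239 | 1 | 0 | 0 | 0 |
| 2 | 3 | 20 | 1 | 0 | google | 238 | 0 | 0 | 0 | 1 |
| 3 | 4 | 19 | 1 | 0 | google | 111 | 1 | 1 | 0 | 1 |
| 4 | 5 | 46 | 1 | 1 | twitter | 159 | 0 | 0 | 1 | 1 |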


In [105]:
webdata.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | 9010.0 | 2141.001887 | 670.144769 | 1.0 | 2253.25 | 2483.0 | 2483.0 | 2483.0 |
| age | 9010.0 | 28.344728 | 11.573714 | 18.0 | 23.0 | 23.0 | 23.0 | 67.0 |
| female | 9010.0 | 0.140622 | 0.34765 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| used_search | 9010.0 | 0.862486 | 0.344408 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| time_spent | 9010.0 | 188.782464 | 108.709242 | 1.0 | 96.0 | 187.0 | 282.0 | 380.0 |
| campaign_1 | 9010.0 | 0.863929 | 0.342883 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| campaign_2 | 9010.0 | 0.863263 | 0.343589 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| click | 9010.0 | 0.386903 | 0.487068 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sell | 9010.0 | 0.700222 | 0.458186 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


### What do the outcome variables look like?

We want to predict/classify sell and click. Are they continuous or binary variables?

In [106]:
webdata['click'].value_counts()

0    5524
1    3486
Name: click, dtype: int64

In [107]:
webdata['sell'].value_counts()

1    6309
0    2701
Name: sell, dtype: int64

## 2. Data Preparation

The dataset seems reasonably well prepared, but the referral column (object dtype, containing strings) most likely won't work for modeling as it is. Let's convert it to a series of dummy variables instead.

In [108]:
webdata['referral'].value_counts()

google          1879
                1830
facebook         923
newsletter B     912
tumblr           879
newsletter A     869
twitter          859
nyt              859
Name: referral, dtype: int64

In [109]:
def check_referral(referral, site):
    if referral == site:
        return 1
    return 0

In [110]:
webdata['google'] = webdata['referral'].apply(check_referral, args=('google',))
webdata['facebook'] = webdata['referral'].apply(check_referral, args=('facebook',))
webdata['news_a'] = webdata['referral'].apply(check_referral, args=('newsletter A',))
webdata['news_b'] = webdata['referral'].apply(check_referral, args=('newsletter B',))
webdata['nyt'] = webdata['referral'].apply(check_referral, args=('nyt',))
webdata['tumblr'] = webdata['referral'].apply(check_referral, args=('tumblr',))
webdata['twitter'] = webdata['referral'].apply(check_referral, args=('twitter',))
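
As an alternative to the helper function above, pandas can build the same kind of 0/1 columns in one call with `pd.get_dummies`. This is a minimal sketch on a made-up toy dataframe (the `toy` name and its values are assumptions for illustration); note that the resulting columns are named after the raw referral values (for example `newsletter A`), not `news_a`.

```python
import pandas as pd

# Toy stand-in for webdata['referral']; the values are made up for illustration
toy = pd.DataFrame({'referral': ['google', 'facebook', 'nyt', 'google', 'newsletter A']})

# One 0/1 column per distinct referral value
dummies = pd.get_dummies(toy['referral'], dtype=int)

# Attach the new columns to the original frame
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```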


In [111]:
webdata.columns

Index(['id', 'age', 'female', 'used_search', 'referral', 'time_spent',
       'campaign_1', 'campaign_2', 'click', 'sell', 'google', 'facebook',
       'news_a', 'news_b', 'nyt', 'tumblr', 'twitter'],
      dtype='object')

In [112]:
webdata.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | 9010.0 | 2141.001887 | 670.144769 | 1.0 | 2253.25 | 2483.0 | 2483.0 | 2483.0 |
| age | 9010.0 | 28.344728 | 11.573714 | 18.0 | 23.0 | 23.0 | 23.0 | 67.0 |
| female | 9010.0 | 0.140622 | 0.34765 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| used_search | 9010.0 | 0.862486 | 0.344408 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| time_spent | 9010.0 | 188.782464 | 108.709242 | 1.0 | 96.0 | 187.0 | 282.0 | 380.0 |
| campaign_1 | 9010.0 | 0.863929 | 0.342883 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| campaign_2 | 9010.0 | 0.863263 | 0.343589 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| click | 9010.0 | 0.386903 | 0.487068 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sell | 9010.0 | 0.700222 | 0.458186 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| google | 9010.0 | 0.208546 | 0.406292 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## 3. Modeling

After the initial data preparation, we can give modeling a try. We'll use logistic regression for this classification task, as we want to predict clicks and sales. Why logistic regression? Because we are not trying to predict a quantity (usually a continuous variable) but a choice (0 or 1).

You can read more about how scikit-learn implemented Logistic Regression [here](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), and how to run the code [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 

* **Note:** The documentation is quite technical, so don't worry if you don't follow all the steps. What matters most for our class is that you have a high-level understanding of what a classification algorithm does.

### Step 1. Import the required packages from scikit-learn

In this tutorial we'll compare the performance of two different algorithms:
* Logistic Regression, which we used before
* K-Nearest Neighbors, which is a new classifier



In [113]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


### Step 2. Split into training and test data

We'll split our dataset into a training and a testing set. 

* The training set is the data that will be used to train the model
* The testing set is the data that we will use to *evaluate* the model

Why are we splitting it like this?
* The idea is that we create a classifier that can use features (independent variables) to predict a category (dependent variable). We use a part (most) of the data to train this model / create this classifier.
* After creating the classifier, we then need to test it on data that it has not seen before. That's where we use the testing set.

Let's import the function that will help us split our data into a training set and a test set.

In [114]:
from sklearn.model_selection import train_test_split


Note: if you get an error message with the command above (especially if you could not update scikit-learn to version 0.18 or later), try instead:

*from sklearn.cross_validation import train_test_split*

In [115]:
train, test = train_test_split(webdata, test_size=0.2, random_state=0)

What did the command above do?

1. I asked for a train and a test set to be created
2. I indicated that the data to be used was webdata
3. I indicated that the size of the test set should be 20% of the total dataset
4. The random_state is optional, but useful. Because the split randomizes the order of the rows, you would otherwise get a different train/test split every time you run the command. Setting random_state makes the split reproducible.

Let's see what the train and test sets look like.
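
The effect of `random_state` can be checked with a small sketch. The `toy` dataframe below is a made-up 10-row stand-in for `webdata`; the point is only that two splits with the same seed select the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up 10-row frame standing in for webdata
toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

# Two independent splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))           # 80% / 20% of the rows
print(test_a.index.equals(test_b.index))   # same rows end up in the test set
```

Dropping `random_state` (or changing its value) would generally produce a different selection of rows on each run.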

In [116]:
train.head()

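| | id | age | female | used_search | referral | time_spent | campaign_1 | campaign_2 | click | sell | google | facebook | news_a | news_b | nyt | tumblr | twitter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4734 | 2483 | 23 | 0 | 1 | newsletter A | 146 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4501 | 2483 | 23 | 0 | 1 | twitter | 217 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4962 | 2483 | 23 | 0 | 1 | nyt | 48 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2167 | 2168 | 65 | 1 | 0 | twitter | 117 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3907 | 2483 | 23 | 0 | 1 | twitter | 72 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |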


In [117]:
test.head()

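| | id | age | female | used_search | referral | time_spent | campaign_1 | campaign_2 | click | sell | google | facebook | news_a | news_b | nyt | tumblr | twitter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8009 | 2483 | 23 | 0 | 1 | newsletter A | 212 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3812 | 2483 | 23 | 0 | 1 | newsletter B | 103 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8562 | 2483 | 23 | 0 | 1 | tumblr | 18 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6670 | 2483 | 23 | 0 | 1 | twitter | 201 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2339 | 2340 | 26 | 1 | 1 | facebook | 69 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |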


Let's see if the split worked out OK...

In [118]:
print(len(webdata), len(train), len(test))

9010 7208 1802


Indeed, 7208 (the training set) is 80% of the total size of the dataset, and 1802 is 20% of the dataset (testing set).

### A. Running the logistic regression

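Let's remember that logistic regression tries to fit a linear boundary (a line, when there are two features) that separates the two categories, and turns the distance from that boundary into a probability between 0 and 1.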
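
To make this idea concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the web campaign dataset). It rebuilds the predicted probabilities by hand from the fitted coefficients and intercept, assuming the standard logistic (sigmoid) function, and checks that they match what scikit-learn reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score: weighted sum of the features plus the intercept
linear_score = X @ clf.coef_[0] + clf.intercept_[0]

# The logistic (sigmoid) function maps the score to a probability of class 1
manual_proba = 1 / (1 + np.exp(-linear_score))
sklearn_proba = clf.predict_proba(X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

A positive linear score corresponds to a probability above 0.5, which is when `.predict` returns class 1.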



### Step A.1 Instantiate the classifier

We'll use this classifier to classify the data. When creating the classifier, we can provide a few options. For logistic regression, two are perhaps the most important: the maximum number of iterations the solver may run (`max_iter`; a higher limit gives the solver more room to converge), and whether you want an intercept (constant) in the regression (`fit_intercept`). More options can be seen [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [119]:
logit_clf = LogisticRegression(max_iter=1000, fit_intercept = True)

### Step A.2 Train the classifier

To make my life a bit easier, I will create a list with the columns that I am interested in. If I ever need to add a new column to the analysis, I just change the list and rerun the code.

In [120]:
features = ['age', 'female', 'google', 'facebook', 'sell', 'time_spent', 'campaign_1']

**Important:** Now I am not using the *webdata*  dataframe to train the model. I am using the *train* dataframe, which is our training set. 

In [122]:
logit_clf.fit(train[features], train['click'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Step A.3 Check the coefficients

We'll run this only for logistic regressions (KNN has a different structure), but let's have a look anyway...

In [123]:
pd.DataFrame(np.transpose(logit_clf.coef_), features)

| | 0 |
| --- | --- |
| age | 0.00211 |
| female | 0.004875 |
| google | -0.051241 |
| facebook | -0.279206 |
| sell | 1.16984 |
| time_spent | 0.001068 |
| campaign_1 | 0.126238 |


Significance testing can still be done with statsmodels (see previous tutorial), but it won't be relevant for KNN. So let's jump into model evaluation.
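
Before moving on, a note on reading the coefficients from Step A.3: logistic regression coefficients are expressed in log-odds, so exponentiating them gives odds ratios, which are often easier to interpret. The sketch below copies the values from the table above into a Series (the `coefs` name is just for illustration). Keep in mind that the features are on different scales and that scikit-learn applies regularization by default, so the magnitudes should be compared with care.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output of Step A.3
coefs = pd.Series({
    'age': 0.00211, 'female': 0.004875, 'google': -0.051241, 'facebook': -0.279206,
    'sell': 1.16984, 'time_spent': 0.001068, 'campaign_1': 0.126238,
})

# exp(coefficient) = multiplicative change in the odds of a click
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

An odds ratio above 1 means the feature is associated with higher odds of a click, and a value below 1 with lower odds.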

### Step A.4 Model Evaluation

Now it's time to evaluate our model. Remember that we trained it using the training set, and now we can check how it would work out for the test set. Here we're actually predicting cases (as we did in the previous tutorial).


In [124]:
test['predicted_clicks_logit'] = logit_clf.predict(test[features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


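*Note: As discussed, the warning message above can be ignored for now. It appears because `test` was derived from `webdata`, and pandas is flagging that the new column is added only to `test`, not to the original dataframe.*

In the previous tutorial, we predicted some cases (with the .predict method), and now we are doing the same with the whole test set.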
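
If you would rather avoid the warning than ignore it, one common approach is to make an explicit copy of the test set before adding columns to it. This is a minimal sketch on a made-up dataframe (`toy`, `train_toy` and `test_toy` are illustrative names), not a change to the notebook's own cells.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for webdata
toy = pd.DataFrame({'x': range(10), 'click': [0, 1] * 5})
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=0)

# An explicit copy tells pandas that test_toy is an independent dataframe
test_toy = test_toy.copy()

# Adding a column now affects only test_toy (placeholder values, not real predictions)
test_toy['predicted'] = 0
print(test_toy)
```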

The key difference is that **we know** what the actual value of each case was. So we can check how accurate our classifier was.

In [125]:
test[['predicted_clicks_logit', 'click']].head()

| | predicted_clicks_logit | click |
| --- | --- | --- |
| 8009 | 0 | 0 |
| 3812 | 0 | 0 |
| 8562 | 0 | 1 |
| 6670 | 0 | 0 |
| 2339 | 0 | 0 |


Let's see how accurate the classifier was...

If there was an actual click, how good were the predictions?

In [126]:
test[test['click']==1]['predicted_clicks_logit'].value_counts()

0    596
1    115
Name: predicted_clicks_logit, dtype: int64

What if there were no clicks?

In [127]:
test[test['click']==0]['predicted_clicks_logit'].value_counts()

0    992
1     99
Name: predicted_clicks_logit, dtype: int64

Luckily, scikit-learn has more efficient ways to help us evaluate. We can use a confusion matrix and a classification report.

In [128]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

### Confusion Matrix

First we can run a confusion matrix, i.e., check how many cases were correctly categorized.

In [129]:
print(confusion_matrix(test['click'], test['predicted_clicks_logit']))

[[992  99]
 [596 115]]


The following link has the documentation for the confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In scikit-learn, a 2 x 2 confusion matrix for a binary classification has the actual classes in the rows and the predicted classes in the columns:

| Actual/Predicted        | 0           | 1  |
| ------------- |:-------------:| -----:|
| 0      | True Negative | False Positive |
| 1      | False Negative    |   True Positive |
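
The four cells can also be pulled out by name. The sketch below rebuilds label arrays that reproduce the counts printed above (it does not use the actual `test` dataframe), unpacks the matrix with `.ravel()`, and computes the overall accuracy.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Label arrays constructed to reproduce the counts in the matrix above
y_true = np.array([0] * 992 + [0] * 99 + [1] * 596 + [1] * 115)
y_pred = np.array([0] * 992 + [1] * 99 + [0] * 596 + [1] * 115)

# Row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy: share of all cases that were classified correctly
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(tn, fp, fn, tp)
print(round(accuracy, 3))
```

With these counts the accuracy comes out at about 0.614, that is, 1107 correct predictions out of 1802 test cases.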




### Precision & Recall

I can also use precision & recall metrics.

In [130]:
print(classification_report(test['click'], test['predicted_clicks_logit']))

             precision    recall  f1-score   support

          0       0.62      0.91      0.74      1091
          1       0.54      0.16      0.25       711

avg / total       0.59      0.61      0.55      1802



What does it all mean?
* Precision: True Positives / (True Positives + False Positives)
* Recall: True Positives / (True Positives + False Negatives)
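
As a check, the figures for class 1 (click) in the report above can be reproduced by hand from the confusion matrix counts. This is a small worked sketch; the F1 formula used here (the harmonic mean of precision and recall) is the standard definition rather than something stated in the text above.

```python
# Counts taken from the logistic regression confusion matrix
tn, fp, fn, tp = 992, 99, 596, 115

# Of all predicted clicks, how many were real clicks?
precision = tp / (tp + fp)

# Of all real clicks, how many did the model find?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.54 0.16 0.25
```

These match the row for class 1 in the classification report: the model is right a bit more than half the time when it predicts a click, but it finds only about 16% of the actual clicks.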


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/440px-Precisionrecall.svg.png">  Source: Wikipedia


Now that I know how well the logistic regression algorithm worked, I can use the same features and train a KNeighborsClassifier.

### B. KNeighborsClassifier
### Step B.1 Instantiate the classifier

We will compare logistic regression with K-Nearest Neighbors. The main item to configure is the number of neighbors that the algorithm should take into account when assessing whether an observation belongs to a class or not. The lower the number, the more sensitive the model is to local variations. The higher the number, the more context it gets. More options can be seen [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier).

Just as an illustration, some examples from https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/:
<img src="https://kevinzakka.github.io/assets/1nearestneigh.png">
<img src="https://kevinzakka.github.io/assets/20nearestneigh.png">

We'll start with 5 neighbors, which is the default for KNeighborsClassifier.


In [131]:
n_clf = KNeighborsClassifier(n_neighbors=5)
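
To get a feel for how `n_neighbors` changes the results, here is a minimal sketch on synthetic data (generated with `make_classification`, so the numbers say nothing about the web campaign dataset). It fits the classifier with a few values of k and stores the test accuracy for each; the values 1, 5 and 20 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one classifier per value of k and record its accuracy on the test set
scores = {}
for k in (1, 5, 20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

print(scores)
```

The same loop could be adapted to `train[features]` and `train['click']` when working on the challenge at the end of this tutorial.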

### Step B.2 Train the classifier

While I could change the features that I am using for this classifier, my interest is in comparing the performance of logistic regression with K-Nearest Neighbors, so I'll still use the same feature list.

In [132]:
features

['age', 'female', 'google', 'facebook', 'sell', 'time_spent', 'campaign_1']

In [133]:
n_clf.fit(train[features], train['click'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### Step B.3 Check the coefficients

K-Nearest neighbors is **not** a regression-based model, so the idea of coefficients is not really that valid here - and scikit-learn does not report it. So we can skip this step. 

### Step B.4 Model Evaluation

Now it's time to evaluate our model. We'll do exactly the same as for logistic, but I'll use a different column to store the results.

In [134]:
test['predicted_clicks_nn'] = n_clf.predict(test[features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Let's check the confusion matrix

In [135]:
print('K-Nearest Neighbors')
print(confusion_matrix(test['click'], test['predicted_clicks_nn']))

K-Nearest Neighbors
[[764 327]
 [452 259]]


Let's compare with logistic regression

In [136]:
print('Logistic Regression')
print(confusion_matrix(test['click'], test['predicted_clicks_logit']))

Logistic Regression
[[992  99]
 [596 115]]


Let's check Precision & Recall

In [137]:
print('K-Nearest Neighbors')
print(classification_report(test['click'], test['predicted_clicks_nn']))

K-Nearest Neighbors
             precision    recall  f1-score   support

          0       0.63      0.70      0.66      1091
          1       0.44      0.36      0.40       711

avg / total       0.55      0.57      0.56      1802



In [138]:
print('Logistic Regression')
print(classification_report(test['click'], test['predicted_clicks_logit']))

Logistic Regression
             precision    recall  f1-score   support

          0       0.62      0.91      0.74      1091
          1       0.54      0.16      0.25       711

avg / total       0.59      0.61      0.55      1802



### Question: Which model performed better? And why?

### We can still predict cases

We can still predict what specific cases would look like, and compare predictions between the two models.



In [139]:
features

['age', 'female', 'google', 'facebook', 'sell', 'time_spent', 'campaign_1']

In [140]:
people = [[25, 0,0,1,1,200,1], [65,1,1,0,0,400,1]]

I can predict whether they will click or not:

In [141]:
logit_clf.predict(people)

array([0, 0])

In [142]:
n_clf.predict(people)

array([1, 0])

This does not seem too informative. Can we estimate the probabilities instead? Yes.

In [143]:
logit_clf.predict_proba(people)

array([[ 0.59446938,  0.40553062],
       [ 0.73526365,  0.26473635]])

In [144]:
n_clf.predict_proba(people)

array([[ 0.4,  0.6],
       [ 0.8,  0.2]])
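
Why are the K-Nearest Neighbors probabilities such round numbers? With the default uniform weights, the predicted probability of a class is simply the share of the k nearest neighbors that belong to it, so with 5 neighbors it can only be a multiple of 0.2. The sketch below shows this on a made-up one-dimensional dataset (all names and values are for illustration).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-feature dataset
X = np.array([[0], [1], [2], [3], [4], [10], [11], [12]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
query = np.array([[2.0]])

# Indices of the 5 training points closest to the query
_, idx = knn.kneighbors(query)

# Share of those neighbors that belong to class 1
vote_share = y[idx[0]].mean()
proba = knn.predict_proba(query)[0]

print(vote_share, proba)
```

Here three of the five nearest neighbors belong to class 1, so the predicted probability of class 1 is 0.6.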

# Challenges

* Create a model for predicting sell, and compare how logistic regression and k-nearest neighbors would perform. Which one performs best? Why?
* Make some changes to the KNeighborsClassifier (number of neighbors). What happens with precision & recall? 