# Model evaluation

AI Black Belt - Yellow (June 2019).

---

In this notebook, we will reuse the Census data from earlier. We will train a Scikit-Learn model and evaluate its performance.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Read data
df = pd.read_csv("data/adult.csv", index_col=0)
X = df.drop(["income"], axis=1)
y = df["income"]
y = (y == " >50K").values

# Treat missing values
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent")

X["workclass"] = imp.fit_transform(X["workclass"].values.reshape(-1, 1)).flatten()
X["occupation"] = imp.fit_transform(X["occupation"].values.reshape(-1, 1)).flatten()
X["native-country"] = imp.fit_transform(X["native-country"].values.reshape(-1, 1)).flatten()

# Convert categorical variables + Rescale
categories = X.dtypes == object

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Note: sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
tf = make_column_transformer((OneHotEncoder(sparse_output=False), categories),
                             (StandardScaler(), ~categories),
                             remainder="passthrough")
X_new = tf.fit_transform(X)

# Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0)

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

## Metrics

### Default

All Scikit-Learn estimators come with a default evaluation metric, available through their <code>score</code> method.

In [None]:
print(clf.score(X_train, y_train))

### Accuracy

This metric corresponds to the proportion of correct predictions. It is the default implementation of the <code>score</code> method for classifiers.

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train, clf.predict(X_train)))
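The claim that <code>score</code> defaults to accuracy for classifiers can be checked directly. The cell below is a minimal sketch on synthetic data (not the Census data; all variable names are illustrative). It also shows that the default differs for regressors, which report the R² coefficient instead.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic data, for illustration only
Xc, yc = make_classification(n_samples=200, random_state=0)
Xr, yr = make_regression(n_samples=200, random_state=0)

# max_depth is limited to keep the toy models small
toy_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xc, yc)
toy_reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(Xr, yr)

clf_default = toy_clf.score(Xc, yc)
clf_accuracy = accuracy_score(yc, toy_clf.predict(Xc))
reg_default = toy_reg.score(Xr, yr)
reg_r2 = r2_score(yr, toy_reg.predict(Xr))

# Classifier default: mean accuracy
print(clf_default, clf_accuracy)
# Regressor default: coefficient of determination (R^2)
print(reg_default, reg_r2)
```

Each printed pair should hold two identical values.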

<div class="alert alert-success">

<b>EXERCISE</b>:

Compare the accuracy of a decision tree against a simple baseline, such as <code>DummyClassifier</code>.
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

Can you rewrite <code>accuracy_score</code> using Numpy operations?
</div>

### Precision, recall and F-score

In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
# Predict once and reuse the predictions for all three metrics
y_pred_train = clf.predict(X_train)
print("Precision =", precision_score(y_train, y_pred_train))
print("Recall =", recall_score(y_train, y_pred_train))
print("F =", fbeta_score(y_train, y_pred_train, beta=1))
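To see what these three numbers measure, it can help to look at a case small enough to count by hand. The cell below is a sketch on hand-made labels (an assumption for illustration, not the Census data). It also compares <code>fbeta_score</code> with the usual weighted harmonic mean formula, F = (1 + beta²) · P · R / (beta² · P + R).

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hand-made labels, for illustration only:
# 2 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 / (2 + 1)
r = recall_score(y_true, y_pred)     # 2 / (2 + 2)
print("Precision =", p)
print("Recall =", r)

# beta < 1 gives more weight to precision, beta > 1 gives more weight to recall
results = {}
for beta in (0.5, 1, 2):
    f = fbeta_score(y_true, y_pred, beta=beta)
    manual = (1 + beta**2) * p * r / (beta**2 * p + r)
    results[beta] = (f, manual)
    print("beta =", beta, "F =", f, "manual =", manual)
```

With <code>beta=1</code>, precision and recall are weighted equally, which gives the usual F1 score.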

<div class="alert alert-success">

<b>EXERCISE</b>: (optional)

Can you maximize Precision or Recall without necessarily having a good model?
</div>

### ROC AUC

In [None]:
from sklearn.metrics import roc_auc_score
print("ROC AUC =", roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1]))

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, clf.predict_proba(X_train)[:, 1])
plt.plot(fpr, tpr)
plt.show()
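Each point of the ROC curve corresponds to one decision threshold on the predicted scores, and the ROC AUC is the area under that curve. The cell below is a small sketch on hand-made labels and scores (illustrative values, not the Census data). It lists the points of the curve and checks that integrating them with <code>auc</code> matches <code>roc_auc_score</code>.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hand-made labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true, scores)
for f, t, th in zip(fpr_toy, tpr_toy, thresholds_toy):
    print("threshold =", th, "FPR =", f, "TPR =", t)

# Area under the curve, computed two ways
area = auc(fpr_toy, tpr_toy)
direct = roc_auc_score(y_true, scores)
print("Area from curve =", area)
print("roc_auc_score =", direct)
```

Here 3 of the 4 (positive, negative) pairs are ranked correctly, so both values are 0.75.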

## Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_train, clf.predict(X_train))
print(cm)
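In Scikit-Learn, rows of the confusion matrix are the true classes and columns are the predicted classes. The cell below is a short sketch on hand-made binary labels (illustrative values only) showing how to name the four entries.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels, for illustration only
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

cm_toy = confusion_matrix(y_true, y_pred)
print(cm_toy)

# For binary labels, ravel() flattens the matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()
print("TN =", tn, "FP =", fp, "FN =", fn, "TP =", tp)
```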

<div class="alert alert-success">

<b>EXERCISE</b>:

Can you recompute Precision and Recall from the entries of the confusion matrix?
</div>

## Classification report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, clf.predict(X_train)))
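The printed report is convenient to read, but its values are harder to reuse in code. As a sketch, again on hand-made labels (illustrative values only), the <code>output_dict=True</code> option returns the same numbers as a nested dictionary.

```python
from sklearn.metrics import classification_report

# Hand-made labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# Keys are the class labels as strings, plus aggregate entries such as "accuracy"
print("Precision (class 1) =", report["1"]["precision"])
print("Recall (class 1) =", report["1"]["recall"])
print("Accuracy =", report["accuracy"])
```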

## Bias of the training error

Notice how the accuracy of the model on the training data is quite high.

In [None]:
print(accuracy_score(y_train, clf.predict(X_train)))

Would the same model perform as well on new data?

In [None]:
print(accuracy_score(y_test, clf.predict(X_test)))
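The gap between the two numbers comes from the fact that an unrestricted decision tree can memorize its training data. The cell below is a minimal sketch of this effect on synthetic data with noisy labels (an assumption for illustration; the variable names are not part of the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y=0.2), for illustration only
X_toy, y_toy = make_classification(n_samples=1000, flip_y=0.2, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, random_state=0)

# A fully grown tree keeps splitting until every training sample is fit
deep_tree = DecisionTreeClassifier(random_state=0).fit(Xa, ya)

train_acc = accuracy_score(ya, deep_tree.predict(Xa))
test_acc = accuracy_score(yb, deep_tree.predict(Xb))
print("Train accuracy =", train_acc)
print("Test accuracy =", test_acc)
```

The tree fits even the noisy training labels perfectly, so the training accuracy says little about how it will behave on unseen samples.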

<div class="alert alert-success">

<b>EXERCISE</b>:

Re-evaluate the accuracy, precision, recall, F-measure, ROC AUC, confusion matrix and classification report on test data.
</div>

<div class="alert alert-success">

<b>EXERCISE</b>: (optional)

Are these results stable when you change <code>random_state</code> in <code>train_test_split</code>?
</div>

<div class="alert alert-success">

<b>EXERCISE</b>: (optional)

Does the size of the test data have any effect on the measured performance?
</div>

<div class="alert alert-success">

<b>EXERCISE</b>: (optional)

Evaluate the test performance of a <code>LogisticRegression</code> model.
</div>