# Classification plots

_Author: Christoph Rahmede_

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1">Load the data</a></span></li><li><span><a href="#Determine-the-baseline" data-toc-modified-id="Determine-the-baseline-2">Determine the baseline</a></span></li><li><span><a href="#Standardize-and-create-a-train-test-split" data-toc-modified-id="Standardize-and-create-a-train-test-split-3">Standardize and create a train-test split</a></span></li><li><span><a href="#Grid-search-a-logistic-regression-model" data-toc-modified-id="Grid-search-a-logistic-regression-model-4">Grid search a logistic regression model</a></span></li><li><span><a href="#Illustrate-score-dependence-on-tuning-parameters" data-toc-modified-id="Illustrate-score-dependence-on-tuning-parameters-5">Illustrate score dependence on tuning parameters</a></span></li><li><span><a href="#Learning-curve" data-toc-modified-id="Learning-curve-6">Learning curve</a></span></li><li><span><a href="#Distribution-of-predicted-probabilities" data-toc-modified-id="Distribution-of-predicted-probabilities-7">Distribution of predicted probabilities</a></span></li><li><span><a href="#Confusion-matrix" data-toc-modified-id="Confusion-matrix-8">Confusion matrix</a></span></li><li><span><a href="#Class-prediction-error" data-toc-modified-id="Class-prediction-error-9">Class prediction error</a></span></li><li><span><a href="#Classification-report" data-toc-modified-id="Classification-report-10">Classification report</a></span></li><li><span><a href="#ROC-curve" data-toc-modified-id="ROC-curve-11">ROC curve</a></span></li><li><span><a href="#Precision-recall-curve" data-toc-modified-id="Precision-recall-curve-12">Precision-recall curve</a></span></li><li><span><a href="#Cumulative-gain-and-lift-curves" data-toc-modified-id="Cumulative-gain-and-lift-curves-13">Cumulative gain and lift curves</a></span></li></ul></div>

In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

warnings.simplefilter('ignore')

In [2]:
from sklearn import datasets, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

## Load the data

In this notebook, the only change is the choice of dataset: here we have three class labels.
As a result, not every piece of the following code will run unchanged. Check which cells can be carried out as they are. For those that cannot, is it still possible to generalize the approach?

In [3]:
data = datasets.load_iris()

In [4]:
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [5]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [6]:
X = pd.DataFrame(data.data, columns=data.feature_names)
X.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


## Determine the baseline

In [None]:
y = pd.Series(data.target)
y.value_counts(normalize=True)

## Standardize and create a train-test split

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=1)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

## Grid search a logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
estimator = LogisticRegression(solver='liblinear')

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV

params = {'C': np.logspace(-4, 4, 30), 
          'penalty': ['l1', 'l2'], 
          'fit_intercept': [True, False]}

model = GridSearchCV(estimator=estimator, param_grid=params,
                  cv=5, return_train_score=True)

model.fit(X_train, y_train)

print('Best Parameters:')
print(model.best_params_)
print('Best estimator C:')
print(model.best_estimator_.C)
print('Best estimator mean cross-validated score:')
print(model.best_score_)
print('Best estimator score on the full training set:')
print(model.score(X_train, y_train))
print('Best estimator score on the test set:')
print(model.score(X_test, y_test))
print('Best estimator coefficients:')
print(model.best_estimator_.coef_)

## Illustrate score dependence on tuning parameters

In [None]:
results = pd.DataFrame(model.cv_results_)
# cv_results_ may store parameters as object dtype; cast C so the log axis works
results['param_C'] = results['param_C'].astype(float)
grouped = results.groupby(['param_penalty', 'param_fit_intercept'])

fig, ax = plt.subplots(figsize=(12, 6))
# one line per (penalty, fit_intercept) combination
for (penalty, intercept), group_df in grouped:
    group_df.plot(
        x='param_C', y='mean_test_score', logx=True, lw=3,
        label='Penalty: {}, Intercept: {}'.format(penalty, intercept), ax=ax)

ax.set_xlabel('C')
ax.set_ylabel('mean cross-validated score')
ax.legend(loc=[1.1, 0])
plt.show()

> **Note:** Here you might have to modify some code.

In [None]:
df_coef = pd.DataFrame({'coef': model.best_estimator_.coef_.ravel(), 
                        'coef_abs': np.abs(model.best_estimator_.coef_.ravel())}, index=X.columns
                       ).sort_values(by='coef_abs')
df_coef.plot(y='coef_abs', kind='barh', color=(df_coef.coef.map(
    lambda x: 'darkorange' if x > 0 else 'b')).values, figsize=(12, 12))
plt.xlabel('abs(coef)')
plt.show()
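
With three classes, `coef_` holds one row of coefficients per class, so flattening it against the four feature names cannot work. One possible generalization is sketched below: arrange the coefficients as a table with one column per class. To keep the sketch self-contained it refits a plain `LogisticRegression` with default settings on the standardized iris data (an assumption made here, not the grid-searched model); in the notebook you would use `model.best_estimator_.coef_` instead.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a freshly fitted default model stands in for model.best_estimator_
iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
clf = LogisticRegression(max_iter=1000).fit(X_std, iris.target)

# coef_ has shape (n_classes, n_features); transpose to features x classes
df_coef = pd.DataFrame(clf.coef_.T,
                       index=iris.feature_names,
                       columns=iris.target_names)
print(df_coef)

# One group of bars per feature, one bar per class:
# df_coef.plot(kind='barh', figsize=(12, 6))
```

Keeping the sign in the table (rather than plotting absolute values) shows directly which features push an observation towards or away from each class.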

## Learning curve

The learning curve is constructed in the same way as for regression, except that we now track the chosen classification score (by default the estimator's own score, which is accuracy for classifiers).

In [None]:
import scikitplot as skplt

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.grid()
skplt.estimators.plot_learning_curve(model.best_estimator_, X_train, y_train, shuffle=True, cv=5, ax=ax,
                                     title_fontsize=20, text_fontsize=20)
plt.show()
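
The same numbers can be obtained without scikit-plot. The following is a minimal sketch using `sklearn.model_selection.learning_curve`; the default-parameter model, the shuffled `StratifiedKFold` splitter and the chosen training sizes are assumptions made for illustration, not the settings used above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

iris = datasets.load_iris()
clf = LogisticRegression(max_iter=1000)

# Shuffle so that small training subsets still contain all three classes
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
train_sizes, train_scores, valid_scores = learning_curve(
    clf, iris.data, iris.target, cv=cv,
    train_sizes=[24, 48, 72, 96, 120], shuffle=True, random_state=1)

# Average over the folds: one training and one validation score per size
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, valid_mean):
    print(n, round(tr, 3), round(va, 3))
```

Plotting `train_mean` and `valid_mean` against `train_sizes` reproduces the two curves; a persistent gap between them would point to overfitting.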

## Distribution of predicted probabilities

It can be informative to look at the distribution of predicted probabilities. Ideally, each observation receives a probability close to 1 for one class and close to 0 for the others, so that the classes can be clearly distinguished.

In [None]:
# one histogram per class, labelled with the class it belongs to
plt.hist(model.predict_proba(X_train), density=True, label=list(model.classes_))
plt.yscale('log')
plt.legend(loc=[1.1,0])
plt.show()

In [None]:
# one histogram per class, labelled with the class it belongs to
plt.hist(model.predict_proba(X_test), density=True, label=list(model.classes_))
plt.yscale('log')
plt.legend(loc=[1.1,0])
plt.show()

## Confusion matrix

The confusion matrix shows, for each actual class (rows), how many observations were predicted to belong to each of the classes (columns).

In [None]:
print(metrics.confusion_matrix(y_train, model.predict(X_train)))
print(metrics.confusion_matrix(y_test, model.predict(X_test)))

A nicer visualization is obtained with scikit-plot.

In [None]:
from matplotlib.colors import ListedColormap
cmap = ListedColormap(sns.color_palette("husl", 3))

In [None]:
skplt.metrics.plot_confusion_matrix(y_train, model.predict(
    X_train), labels=model.classes_, text_fontsize=16, 
    title_fontsize=20, figsize=(6, 6))
plt.show()
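
To make the layout concrete, here is a small sketch with hypothetical labels for three classes (invented for illustration, not taken from the iris model). It shows that rows correspond to actual classes, columns to predicted classes, and that the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for three classes, chosen only for illustration
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]

# The diagonal counts the correct predictions, so its share is the accuracy
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 5 correct out of 7
```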

## Class prediction error

The results from the confusion matrix can also be summarized as a stacked bar chart, with one bar per actual class split by predicted class. This is particularly useful for more than two classes.

In [None]:
pd.DataFrame(metrics.confusion_matrix(y_train, model.predict(X_train)), 
             index=model.classes_, columns=model.classes_
             ).plot(kind='bar', stacked=True, rot=0, title='Class prediction error')
plt.xlabel('actual class')
plt.ylabel('number of observations')
plt.legend(title='predicted class')
plt.show()

## Classification report

The classification report summarizes a variety of classification scores.
The score for each class has to be read from the perspective of taking that class as the positive class. Support indicates the number of observations in each class.

The scores are defined in the following way:


$$
\begin{eqnarray*}
{\rm precision} &=&\frac{\rm TP}{\rm TP+FP}\\
{\rm recall} &=&\frac{\rm TP}{\rm TP+FN}\\
{\rm F}_1 &=& \frac{2}{\frac{1}{\rm precision}+\frac{1}{\rm recall}}
\end{eqnarray*}
$$


The micro average here is identical to accuracy. 
The macro average is the unweighted average across all classes, while the weighted average
weights each class score by the fraction of observations in that class.
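
The three averages can be checked on a small example. The sketch below uses hypothetical labels (invented for illustration) and `metrics.precision_score` with the different `average` options.

```python
from sklearn import metrics

# Hypothetical labels for three classes, chosen only for illustration
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# average=None returns one precision value per class
per_class = metrics.precision_score(y_true, y_pred, average=None)
micro = metrics.precision_score(y_true, y_pred, average='micro')
macro = metrics.precision_score(y_true, y_pred, average='macro')
weighted = metrics.precision_score(y_true, y_pred, average='weighted')
accuracy = metrics.accuracy_score(y_true, y_pred)

print(per_class)        # precision for classes 0, 1, 2
print(micro, accuracy)  # these two coincide
print(macro)            # plain mean of per_class
print(weighted)         # mean of per_class weighted by class support
```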

We will also need the false positive rate (FPR), which is

$$
{\rm FPR} =\frac{\rm FP}{\rm FP+TN}
$$
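
For more than two classes, TP, FP, FN and TN are obtained per class by treating that class as positive and all others as negative. A minimal sketch with a hypothetical 3x3 confusion matrix (rows are actual classes, columns are predicted classes; the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted
cm = np.array([[1, 1, 0],
               [0, 2, 0],
               [1, 0, 2]])

tp = np.diag(cm)               # correctly predicted, per class
fp = cm.sum(axis=0) - tp       # predicted as the class but actually another
fn = cm.sum(axis=1) - tp       # actually the class but predicted as another
tn = cm.sum() - tp - fp - fn   # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / precision + 1 / recall)
fpr = fp / (fp + tn)

print(precision, recall, f1, fpr)
```

Note that the division assumes every class is predicted at least once and that precision and recall are non-zero; otherwise some of these ratios are undefined.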

In [None]:
print(metrics.classification_report(y_train, model.predict(X_train)))

In [None]:
print(metrics.classification_report(y_test, model.predict(X_test)))

## ROC curve

We can plot the true positive rate (which is synonymous with recall) against the 
false positive rate as the classification threshold varies. With more than two classes, each class is taken in turn as the positive class.

In [None]:
skplt.metrics.plot_roc(y_train, model.predict_proba(X_train), plot_micro=False, plot_macro=False,
                       title_fontsize=20, text_fontsize=16, figsize=(8, 6), cmap=cmap)
plt.show()
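
The per-class curves can also be computed directly with scikit-learn. The sketch below is one way to do it, under assumptions made for illustration: a default-parameter `LogisticRegression` is fitted on unscaled iris data, and each class is compared against the rest by binarizing the labels.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, stratify=iris.target,
    test_size=0.2, random_state=1)

# Assumption: a default model stands in for the grid-searched one
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One 0/1 indicator column per class (one-vs-rest)
y_bin = label_binarize(y_te, classes=clf.classes_)

roc_auc = {}
for k, cls in enumerate(clf.classes_):
    # Column k of proba is the predicted probability of class cls
    fpr, tpr, thresholds = metrics.roc_curve(y_bin[:, k], proba[:, k])
    roc_auc[int(cls)] = metrics.auc(fpr, tpr)
    # plt.plot(fpr, tpr, label='class {}'.format(cls)) would draw the curve

print(roc_auc)
```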

## Precision-recall curve

It is also informative to see how precision and recall change with the threshold. Usually there will be a trade-off between the two.

Different combinations of precision and recall can lead to the same ${\rm F}_1$-score, which defines level lines in this plot.
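
The level lines follow from solving the ${\rm F}_1$ definition for precision at a fixed value of ${\rm F}_1$:

$$
\begin{eqnarray*}
\frac{1}{\rm precision} &=& \frac{2}{{\rm F}_1}-\frac{1}{\rm recall}\\
{\rm precision} &=& \frac{1}{\frac{2}{{\rm F}_1}-\frac{1}{\rm recall}}
= \frac{{\rm F}_1\cdot{\rm recall}}{2\,{\rm recall}-{\rm F}_1}
\end{eqnarray*}
$$

The result is positive only for ${\rm recall} > {\rm F}_1/2$, which is why the function `plot_f1_lines` keeps only the points with `y_vals > 0`.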

In [None]:
def plot_f1_lines(figsize=(8, 6), fontsize=16):
    '''Create f1-score level lines to be added to the precision-recall plot'''

    fig, ax = plt.subplots(figsize=figsize)

    # add lines of constant F1 scores

    for const in np.linspace(0.2, 0.9, 8):
        x_vals = np.linspace(0.001, 0.999, 100)
        y_vals = 1./(2./const-1./x_vals)
        ax.plot(x_vals[y_vals > 0], y_vals[y_vals > 0],
                color='lightblue', ls='--', alpha=0.9)
        ax.set_ylim([0, 1])
        ax.annotate('f1={0:0.1f}'.format(const),
                    xy=(x_vals[-10], y_vals[-2]+0.0), fontsize=fontsize)

    return fig, ax

In [None]:
fig, ax = plot_f1_lines()
skplt.metrics.plot_precision_recall(
    y_train,
    model.predict_proba(X_train),
    plot_micro=False,
    title_fontsize=20,
    text_fontsize=16,
    cmap=cmap, ax=ax)
ax.legend(loc=[1.1, 0])
plt.show()

## Cumulative gain and lift curves

To obtain the cumulative gain, we sort the observations by decreasing predicted probability for one of the classes. For each fraction of observations seen so far, we determine which fraction of all observations truly belonging to that class it contains, and plot this against the fraction of observations seen.

For the lift, we divide the cumulative gain by the fraction of observations seen so far.

If the labels are in random order, we obtain the baseline curves: the diagonal for the cumulative gain and a constant value of 1 for the lift.

In [None]:
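def cumulative_gain(probabilities, labels):
    '''Cumulative gain and lift for a vector of 0/1 labels.'''
    df = pd.DataFrame({'prob': probabilities, 'label': labels})
    # highest predicted probabilities first
    df = df.sort_values(by='prob', ascending=False)
    # cumulative gain: share of all positives found so far (overwrites 'label')
    df['label'] = df.label.cumsum()/df.label.sum()
    # fraction of observations seen so far
    df['fraction'] = np.arange(1, len(df)+1)/len(df)
    df['lift'] = df.label/df.fraction
    return df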

In [None]:
y.value_counts(normalize=True)

> **Note:** Here you might have to modify some code.

In [None]:
fig, ax = plt.subplots()
skplt.metrics.plot_cumulative_gain(y_train, model.predict_proba(X_train), ax=ax, title_fontsize=20, text_fontsize=16)
cumulative_gain(model.predict_proba(X_train)[:, 1], y_train).plot(
    x='fraction', y='label', lw=2, linestyle='--', c='k', ax=ax, label='Class 1')
cumulative_gain(model.predict_proba(X_train)[:, 0], (y_train == 0)*1).plot(
    x='fraction', y='label', lw=2, linestyle=':', c='k', ax=ax, label='Class 0')
ax.vlines(y.value_counts(normalize=True)[0], 0, 1, lw=2, linestyle='-.', label= 'Proportion Class 0')
ax.vlines(y.value_counts(normalize=True)[1], 0, 1, lw=2, linestyle='-.', label='Proportion Class 1')
ax.legend(loc=[1.1,0])
plt.show()

In [None]:
fig, ax = plt.subplots()
skplt.metrics.plot_lift_curve(y_train, model.predict_proba(
    X_train), ax=ax, title_fontsize=20, text_fontsize=16)
cumulative_gain(model.predict_proba(X_train)[:, 1], y_train).plot(
    x='fraction', y='lift', lw=2, linestyle='--', c='k', ax=ax, label='Class 1')
cumulative_gain(model.predict_proba(X_train)[:, 0], (y_train == 0)*1).plot(
    x='fraction', y='lift', lw=2, linestyle=':', c='k', ax=ax, label='Class 0')
ax.legend(loc=[1.1,0])
plt.show()
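
The manual `cumulative_gain` function assumes 0/1 labels, so with three classes each class has to be compared against the rest. The sketch below shows one possible generalization; the helper name `cumulative_gain_ovr` and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def cumulative_gain_ovr(probabilities, labels, positive_class):
    '''Gain and lift for one class against the rest (hypothetical helper).'''
    # 1 where the observation belongs to the chosen class, 0 otherwise
    hit = (np.asarray(labels) == positive_class).astype(int)
    df = pd.DataFrame({'prob': np.asarray(probabilities), 'hit': hit})
    df = df.sort_values(by='prob', ascending=False)
    df['gain'] = df.hit.cumsum() / df.hit.sum()
    df['fraction'] = np.arange(1, len(df) + 1) / len(df)
    df['lift'] = df.gain / df.fraction
    return df


# Toy data: six observations and their predicted probabilities for class 2
labels = np.array([0, 1, 2, 2, 1, 0])
proba_class2 = np.array([0.1, 0.2, 0.9, 0.6, 0.7, 0.05])

gain2 = cumulative_gain_ovr(proba_class2, labels, positive_class=2)
print(gain2[['fraction', 'gain', 'lift']])
```

In the notebook this could be called once per class, for example with `model.predict_proba(X_train)[:, k]`, `y_train` and `model.classes_[k]`, and each result plotted with `x='fraction'` as above.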

In [None]:
1/y.value_counts(normalize=True)