
# Intro to Scikit-Learn and Model Evaluation
## Class Notes 2020-09-22
## Data Science (masters)
## Math 5364 & 5366, Fall 20 & Spring 21
## Tarleton State University
## Dr. Scott Cook

In [None]:
! pip install --upgrade numpy
! pip install --upgrade pandas

Requirement already up-to-date: numpy in /usr/local/lib/python3.6/dist-packages (1.19.2)
Requirement already up-to-date: pandas in /usr/local/lib/python3.6/dist-packages (1.1.2)


We've written cross-validation and kNN code by hand to strengthen our Python skills.  But, let's face it: this is a lot of work, and these are some of the simpler algorithms.

Scikit-learn offers a TON of pre-built data science algorithms that are both easier to use and much more powerful.  Let's start taking advantage of them.

First, meet "StratifiedKFold" and "StratifiedShuffleSplit".

- def: A cross-validation algorithm is *stratified* if it tries to ensure that all classes of the target variable are proportionately represented in each split.
    - Ex: Suppose we're working with the [Wisconsin breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) of biopsy results for 212 (37%) malignant and 357 (63%) benign breast tumors.  We want each split to have about the same proportions of malignant and benign observations.
    - The cv techniques from prior notes do not intentionally do this.  But these "stratified" versions do.

- *StratifiedKFold* is a stratified version of $k$-fold
- *StratifiedShuffleSplit* is a stratified version of "ShuffleSplit".
    - sklearn's "ShuffleSplit" is almost identical to delete-$d$, except it also shuffles the rows.  That's great - we can use it for the initial holdout/modeling split and it automatically takes care of shuffling the data for us!
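
To make the definition of *stratified* concrete, here is a minimal sketch that compares the malignant fraction in each validation fold for plain `KFold` and for `StratifiedKFold` on the breast cancer data from the example. The names `X_bc`, `y_bc`, and `malignant_fractions` are made up for this illustration, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X_bc, y_bc = load_breast_cancer(return_X_y=True)
overall_frac = np.mean(y_bc == 0)  # label 0 is "malignant" in this dataset

def malignant_fractions(splitter):
    """Fraction of malignant observations in each validation fold (hypothetical helper)."""
    return [np.mean(y_bc[valid] == 0) for _, valid in splitter.split(X_bc, y_bc)]

plain_fracs = malignant_fractions(KFold(n_splits=5))            # ignores y when splitting
strat_fracs = malignant_fractions(StratifiedKFold(n_splits=5))  # balances y across folds

print(f"overall malignant fraction: {overall_frac:.3f}")
print("KFold          :", np.round(plain_fracs, 3))
print("StratifiedKFold:", np.round(strat_fracs, 3))
```

The stratified fractions should all sit close to the overall 37%, while the plain `KFold` fractions are free to drift because that splitter never looks at the target.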

In [None]:
import numpy as np
import pandas as pd
from copy import deepcopy  # makes a copy of complex, nested data structures
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

data = load_wine()
n, q = data.data.shape
holdout_frac = 0.1

X, y = data.data, data.target
holdout_splitter = StratifiedShuffleSplit(n_splits=1, test_size=holdout_frac, random_state=42)
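try:
    model_idx, holdout_idx = next(holdout_splitter.split(X, y))
except ValueError:
    # If holdout_frac is too small or too big, one of the splits would have 0 observations,
    # and sklearn raises a ValueError.  In that case, all observations go to the modeling set
    # and the holdout set is empty: X[()] returns all of X, while X[[]] selects no rows.
    print('holdout split failed; using all observations for modeling')
    model_idx, holdout_idx = (), []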
X_m, y_m = X[model_idx]  , y[model_idx]
X_h, y_h = X[holdout_idx], y[holdout_idx]
X_m.shape, X_h.shape

((160, 13), (18, 13))
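
The fallback in the `except` branch above leans on a NumPy indexing detail that is easy to miss: indexing with an empty tuple returns the whole array, while indexing with an empty list returns no rows. A small sketch on a toy array (`X_demo` is a made-up name, not part of the notes):

```python
import numpy as np

X_demo = np.arange(12).reshape(4, 3)  # toy array: 4 observations, 3 features

X_all = X_demo[()]    # empty tuple: no indexing applied, so every row comes back
X_none = X_demo[[]]   # empty list: selects zero rows but keeps the 3 columns

print(X_all.shape, X_none.shape)
```

So with `model_idx = ()` and `holdout_idx = []`, the modeling set is the full data and the holdout set is empty, which is what the comment in that branch describes.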

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def run_model(X, y, pipe, show_confusion=True, n_splits=10):
    """Stratified k-fold cross-validation of pipe; returns per-fold accuracies and confusion matrices."""
    acc = []
    cfs = []
    train_splitter = StratifiedKFold(n_splits)
    for f, (train, valid) in enumerate(train_splitter.split(X, y)):
        y_true = y[valid]
        # Refit the whole pipeline on the training fold, then predict the validation fold
        y_pred = pipe.fit(X[train], y[train]).predict(X[valid])
        cf = confusion_matrix(y_true, y_pred)
        a = cf.trace() / cf.sum()  # accuracy = correct predictions (diagonal) / all predictions
        acc.append(a)
        cfs.append(cf)
        if show_confusion:
            k = pipe.named_steps['classify'].n_neighbors
            print(f"Confusion matrix for fold {f} for {k} neighbors")
            display_confusion_matrix(cf)
    return {'acc':acc, 'cfs':cfs, 'acc_mean':np.mean(acc)*100}

def display_confusion_matrix(cf):
    sns.heatmap(cf, annot=True)
    plt.xlabel("predicted")
    plt.ylabel("actual")
    plt.show()

def display_results(df):
    df = df.astype('float')
    with pd.option_context("display.max_rows", 1000):
        display(df.style.set_precision(2)
            .highlight_max(axis=0)
            .set_properties(**{'text-align':'center', 'border-width':'thin','border-style':'dotted'})
            .set_table_attributes('style="border-collapse:collapse"')
        )
    return df
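
`run_model` computes accuracy as `cf.trace() / cf.sum()`. As a quick sanity check, the sketch below compares that formula with sklearn's `accuracy_score` on a small set of made-up labels (the arrays are invented for illustration only):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for 7 observations in 3 classes; 5 of the 7 predictions are correct
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cf = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
acc_from_cf = cf.trace() / cf.sum()     # diagonal holds the correct predictions
acc_direct = accuracy_score(y_true, y_pred)

print(cf)
print(acc_from_cf, acc_direct)
```

Both numbers agree because the diagonal of the confusion matrix counts exactly the observations whose predicted class equals the actual class.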
    

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

show_confusion = True

pipe = Pipeline([('classify', KNeighborsClassifier(n_neighbors=5))])
res = run_model(X_m, y_m, pipe, show_confusion)
print(f"Mean accuracy over all cross-validations is {res['acc_mean']}%")
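
For comparison, sklearn also ships a helper, `cross_val_score`, that covers the accuracy part of what `run_model` does. The sketch below is an alternative, not what these notes use. It assumes the full wine data rather than the modeling subset `X_m, y_m`, so its number need not match the cell above, and `X_w`, `y_w`, and `pipe_demo` are names made up for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X_w, y_w = load_wine(return_X_y=True)  # full wine data, not the modeling subset
pipe_demo = Pipeline([('classify', KNeighborsClassifier(n_neighbors=5))])

# One accuracy score per validation fold, using the same kind of splitter as run_model
scores = cross_val_score(pipe_demo, X_w, y_w,
                         cv=StratifiedKFold(n_splits=10), scoring='accuracy')
print(f"Mean accuracy: {scores.mean() * 100:.2f}%")
```

Unlike `run_model`, this helper returns only the scores; it does not keep the per-fold confusion matrices.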

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

show_confusion = False
N_neighbors    = np.arange(1, 10+1)

df_results = pd.DataFrame(columns=['n_neighbors', 'acc_mean']).set_index('n_neighbors')
for k in N_neighbors:
    pipe = Pipeline([('classify', KNeighborsClassifier(n_neighbors=k))])
    res = run_model(X_m, y_m, pipe, show_confusion)
    df_results.loc[k, 'acc_mean'] = res['acc_mean']
display_results(df_results);

| n_neighbors | acc_mean |
|---|---|
| 1 | 70.0 |
| 2 | 61.88 |
| 3 | 67.5 |
| 4 | 68.12 |
| 5 | 71.25 |
| 6 | 71.25 |
| 7 | 66.88 |
| 8 | 66.25 |
| 9 | 70.0 |
| 10 | 66.88 |


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

show_confusion = False
N_neighbors    = np.arange(1, 10+1)

df_results = pd.DataFrame(columns=['n_neighbors', 'acc_mean']).set_index('n_neighbors')
for k in N_neighbors:
    pipe = Pipeline([('scale'   , StandardScaler()),
                     ('classify', KNeighborsClassifier(n_neighbors=k))])
    res = run_model(X_m, y_m, pipe, show_confusion)
    df_results.loc[k, 'acc_mean'] = res['acc_mean']
display_results(df_results);

| n_neighbors | acc_mean |
|---|---|
| 1 | 95.0 |
| 2 | 94.38 |
| 3 | 95.62 |
| 4 | 94.38 |
| 5 | 96.25 |
| 6 | 94.38 |
| 7 | 98.12 |
| 8 | 95.62 |
| 9 | 95.62 |
| 10 | 95.62 |
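
A likely reason the `StandardScaler` step changes the accuracies so much is that kNN relies on distances, and the wine features live on very different scales. The sketch below is an illustration added here, not part of the original notes. It compares per-feature standard deviations before and after scaling on the full wine data; the pipeline above instead fits the scaler on each training fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
raw_std = wine.data.std(axis=0)                                      # spread of each raw feature
scaled_std = StandardScaler().fit_transform(wine.data).std(axis=0)   # spread after scaling

largest = np.argmax(raw_std)
smallest = np.argmin(raw_std)
print(f"largest raw std : {wine.feature_names[largest]} ({raw_std[largest]:.2f})")
print(f"smallest raw std: {wine.feature_names[smallest]} ({raw_std[smallest]:.2f})")
print("std after scaling:", np.round(scaled_std, 2))
```

Without scaling, the feature with the largest spread dominates the Euclidean distance between observations; after scaling, every feature has standard deviation 1 and contributes on an equal footing.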


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

show_confusion = False
N_neighbors    = np.arange(1, 10+1)
N_components   = np.arange(1, q+1)

df_results = pd.DataFrame(columns=['n_components', 'n_neighbors', 'acc_mean']).set_index(['n_components', 'n_neighbors'])
for c in N_components:
    for k in N_neighbors:
        pipe = Pipeline([('scale'     , StandardScaler()),
                         ('dim_reduce', PCA(n_components=c)),
                         ('classify'  , KNeighborsClassifier(n_neighbors=k))])
        res = run_model(X_m, y_m, pipe, show_confusion)
        df_results.loc[(c, k), 'acc_mean'] = res['acc_mean']
display_results(df_results.unstack(0));

acc_mean (%) by n_neighbors (rows) and n_components (columns)

| n_neighbors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 79.38 | 97.5 | 96.88 | 95.0 | 95.62 | 96.25 | 95.0 | 94.38 | 94.38 | 95.0 | 94.38 | 95.62 | 95.0 |
| 2 | 83.12 | 96.25 | 95.0 | 95.0 | 95.0 | 95.62 | 93.75 | 93.12 | 94.38 | 95.62 | 94.38 | 95.0 | 94.38 |
| 3 | 82.5 | 96.25 | 96.25 | 95.0 | 96.25 | 95.62 | 94.38 | 96.25 | 95.62 | 95.0 | 95.0 | 95.62 | 95.62 |
| 4 | 83.12 | 96.88 | 96.25 | 96.25 | 96.88 | 95.0 | 94.38 | 95.62 | 95.0 | 95.0 | 93.75 | 94.38 | 94.38 |
| 5 | 84.38 | 96.88 | 96.88 | 95.62 | 96.88 | 97.5 | 96.25 | 95.0 | 96.25 | 96.25 | 96.25 | 96.25 | 96.25 |
| 6 | 84.38 | 96.25 | 96.88 | 96.25 | 96.25 | 96.25 | 95.0 | 95.62 | 95.62 | 95.62 | 95.62 | 95.0 | 94.38 |
| 7 | 84.38 | 97.5 | 96.88 | 96.25 | 96.25 | 96.88 | 96.88 | 97.5 | 97.5 | 97.5 | 97.5 | 96.88 | 98.12 |
| 8 | 84.38 | 97.5 | 96.25 | 94.38 | 95.62 | 96.88 | 96.25 | 95.62 | 96.25 | 96.25 | 95.62 | 95.0 | 95.62 |
| 9 | 82.5 | 97.5 | 95.0 | 95.62 | 96.25 | 97.5 | 96.88 | 97.5 | 96.88 | 96.88 | 95.62 | 96.25 | 95.62 |
| 10 | 85.62 | 97.5 | 96.88 | 95.62 | 94.38 | 96.25 | 96.88 | 97.5 | 96.88 | 96.25 | 94.38 | 96.25 | 95.62 |
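
The nested loops over `n_components` and `n_neighbors` can also be handed to sklearn's `GridSearchCV`, which runs the same kind of search and records the best combination. This is a sketch of that alternative, not the method these notes use. It assumes the full wine data and a deliberately small grid to keep it fast, and `X_w`, `y_w`, `pipe_demo`, and `search` are made-up names.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)  # full wine data, for brevity

pipe_demo = Pipeline([('scale'     , StandardScaler()),
                      ('dim_reduce', PCA()),
                      ('classify'  , KNeighborsClassifier())])

# Parameter names follow the pattern <step name>__<parameter name>.
# The grid values are an arbitrary subset chosen to keep the sketch quick.
param_grid = {'dim_reduce__n_components': [2, 6, 13],
              'classify__n_neighbors': [1, 5, 7]}

search = GridSearchCV(pipe_demo, param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring='accuracy')
search.fit(X_w, y_w)

print(search.best_params_)
print(f"best mean accuracy: {search.best_score_ * 100:.2f}%")
```

The full table of scores for every combination is available in `search.cv_results_`, which plays the role of `df_results` in the loop version.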
