# Tuning Example

This notebook demonstrates hyperparameter tuning.

It uses the CHI Papers data downloaded by the scripts in this project, and trains various classifiers to predict whether a CHI paper is "recent" (published after 2005).

All of the searches here optimize for **accuracy**, the metric that a classifier's `score` method returns by default.
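
As a reminder of what that metric means, accuracy is just the fraction of predictions that match the true labels. Here is a minimal pure-Python sketch; the `accuracy` helper and the toy labels are made up for illustration and are not part of this project:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels invented for this example: True means "recent".
toy_truth = [True, True, False, True, False]
toy_preds = [True, False, False, True, True]

# Three of the five predictions match the truth.
toy_acc = accuracy(toy_truth, toy_preds)
print(toy_acc)
```

Accuracy treats both kinds of error the same, so it is most informative when read next to the class balance.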

## Setup

Import our general PyData packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

And some scikit-learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import MultinomialNB

And finally the Bayesian optimizer:

In [None]:
from skopt import BayesSearchCV

We want predictable randomness:

In [None]:
rng = np.random.RandomState(20201130)
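
One detail worth knowing: `rng` is a single stateful generator, so every call that uses it advances the same stream. Re-running the whole notebook from the top should reproduce the results, but re-running one cell out of order may not. A small sketch of that behavior, where `rng_a` and `rng_b` are names used only in this example:

```python
import numpy as np

# Two generators built from the same seed start in the same state.
rng_a = np.random.RandomState(20201130)
rng_b = np.random.RandomState(20201130)

first = rng_a.randint(0, 1000, size=5)
# A second draw from rng_a continues the stream instead of restarting it.
second = rng_a.randint(0, 1000, size=5)
# The first draw from rng_b replays the first draw from rng_a.
replay = rng_b.randint(0, 1000, size=5)

print(first, second, replay)
```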

In this notebook, I have scikit-learn run some tasks in parallel.  Let's configure the maximum number of parallel tasks in one place, so you can easily adjust it to your computer's capacity:

In [None]:
NJOBS = 8

## Load Data

We're going to load the CHI Papers data from the CSV file, output by the other notebook:

In [None]:
papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()

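Let's treat missing abstracts and titles as empty strings:

In [None]:
# Assign the result back; calling fillna(inplace=True) on a selected column
# is chained assignment and may not update `papers` in recent pandas versions.
papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')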

For our purposes we want all text - the title and the abstract.  We will join them with a space, so we don't fuse the last word of the title to the first word of the abstract:

In [None]:
papers['all_text'] = papers['title'] + ' ' + papers['abstract']

We're going to classify papers as *recent* if they're newer than 2005:

In [None]:
papers['IsRecent'] = papers['year'] > 2005

And make training and test data:

In [None]:
train, test = train_test_split(papers, test_size=0.2, random_state=rng)

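Let's make a function that prints a classification report (precision, recall, F1, and accuracy) for a model on the test data:

In [None]:
def measure(model, text='all_text'):
    """Print a classification report for `model` on the test set."""
    preds = model.predict(test[text])
    print(classification_report(test['IsRecent'], preds))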

And look at the class distribution:

In [None]:
sns.countplot(x=train['IsRecent'])
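
Since we are optimizing accuracy, the class balance gives us a useful reference point: a classifier that always predicts the majority class already reaches an accuracy equal to that class's share of the data. Here is a small sketch of that baseline, using made-up labels rather than the real CHI data; `majority_baseline` is a hypothetical helper:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Toy labels invented for this example: 7 recent papers, 3 older ones.
toy_labels = [True] * 7 + [False] * 3
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

On the real data you could pass `train['IsRecent']` to the same helper and compare the result with the accuracies reported later.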

## Classifying Recent Papers

Let's classify recent papers with k-NN on TF-IDF vectors:

In [None]:
base_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', KNeighborsClassifier(5))
])
base_knn.fit(train['all_text'], train['IsRecent'])

And measure it:

In [None]:
measure(base_knn)
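
To see what this pipeline is doing, here is the same idea on a tiny corpus: each document becomes a TF-IDF vector, and a new document gets the label of its nearest training neighbor. The documents and labels below are invented for illustration, and the sketch uses one neighbor and no stop-word list to keep the toy case easy to follow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy corpus: True stands in for "recent", False for "older".
docs = [
    'touch screen gestures on mobile phones',
    'mobile phone touch input',
    'command line text editor',
    'text editor keyboard commands',
]
labels = [True, True, False, False]

toy_knn = Pipeline([
    ('vectorize', TfidfVectorizer()),
    ('class', KNeighborsClassifier(1)),
])
toy_knn.fit(docs, labels)

# The query shares words only with the first two documents,
# so its nearest neighbor carries the label True.
pred = toy_knn.predict(['touch gestures for phones'])
print(pred)
```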

## Tune the Neighborhood

Let's tune the neighborhood size with a grid search, which scores each candidate `n_neighbors` by cross-validation on the training data:

In [None]:
tune_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', GridSearchCV(KNeighborsClassifier(), param_grid={
        'n_neighbors': [1, 2, 3, 5, 7, 10]
    }, n_jobs=NJOBS))
])
tune_knn.fit(train['all_text'], train['IsRecent'])

What did it pick?

In [None]:
tune_knn.named_steps['class'].best_params_
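
The best parameters are only part of the story; the search object also keeps the cross-validated score for every candidate in `cv_results_`. A minimal sketch on synthetic data (the dataset from `make_classification` and the three-value grid are assumptions made for this example, not the notebook's data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, so the example runs without the CHI papers.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5]},
    cv=5,
)
search.fit(X, y)

# One row per candidate, with its mean cross-validated accuracy and rank.
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'rank_test_score']
]
print(results)
print(search.best_params_, search.best_score_)
```

The same table is available for the notebook's search through `tune_knn.named_steps['class'].cv_results_`, which shows how close the runner-up settings were.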

And measure it:

In [None]:
measure(tune_knn)

## SVD Neighborhood

Let's set up an SVD-based neighborhood, and use random search to tune both the latent feature count and the neighborhood size:

In [None]:
svd_knn_inner = Pipeline([
    ('latent', TruncatedSVD(random_state=rng)),
    ('class', KNeighborsClassifier())
])
svd_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('class', RandomizedSearchCV(svd_knn_inner, param_distributions={
        'latent__n_components': stats.randint(1, 50),
        'class__n_neighbors': stats.randint(1, 25)
    }, n_iter=60, n_jobs=NJOBS, random_state=rng))
])
svd_knn.fit(train['all_text'], train['IsRecent'])

What parameters did we pick?

In [None]:
svd_knn['class'].best_params_
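
When reading these values, keep in mind that `stats.randint(low, high)` excludes `high`, so the search above can pick at most 49 components and 24 neighbors. A quick check of the sampled range (the sample size and seed here are arbitrary choices for this example):

```python
from scipy import stats

# Same distribution as the one used for n_components above.
dist = stats.randint(1, 50)

# Draw a batch of samples and look at the smallest and largest values.
samples = dist.rvs(size=1000, random_state=20201130)
print(samples.min(), samples.max())
```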

And measure it on the test data:

In [None]:
measure(svd_knn)

## SVD with scikit-optimize

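Now let's search the same two hyperparameters with Bayesian optimization from scikit-optimize.  `BayesSearchCV` treats an integer tuple such as `(1, 50)` as a range that includes both endpoints: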

In [None]:
svd_knn_inner = Pipeline([
    ('latent', TruncatedSVD(random_state=rng)),
    ('class', KNeighborsClassifier())
])
svd_bayes_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('class', BayesSearchCV(svd_knn_inner, {
        'latent__n_components': (1, 50),
        'class__n_neighbors': (1, 25)
    }, n_jobs=NJOBS, random_state=rng))
])
svd_bayes_knn.fit(train['all_text'], train['IsRecent'])

What parameters did we pick?

In [None]:
svd_bayes_knn['class'].best_params_

And measure it:

In [None]:
measure(svd_bayes_knn)

## Naive Bayes

Let's give the Naive Bayes classifier a whirl:

In [None]:
nb = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', MultinomialNB())
])
nb.fit(train['all_text'], train['IsRecent'])

And measure it:

In [None]:
measure(nb)

## Summary Accuracy

What does our test accuracy look like for our various classifiers?

In [None]:
models = {
    'kNN': base_knn,
    'kNN-CV': tune_knn,
    'kNN-SVD-Rand': svd_knn,
    'kNN-SVD-Bayes': svd_bayes_knn,
    'NB': nb
}

In [None]:
all_preds = pd.DataFrame()
for name, model in models.items():
    all_preds[name] = model.predict(test['all_text'])

In [None]:
acc = all_preds.apply(lambda ds: accuracy_score(test['IsRecent'], ds))
acc

In [None]:
acc.plot.bar()
plt.ylabel('Accuracy')
plt.show()