# Machine Learning and Statistics: Hands On 2

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

Silence some expected warnings below.

In [None]:
import warnings, matplotlib
warnings.filterwarnings('ignore', category=FutureWarning)
# matplotlib.cbook.mplDeprecation was removed in recent matplotlib releases.
warnings.filterwarnings('ignore', category=matplotlib.MatplotlibDeprecationWarning)

## Exercise 1: Overfitting

For an extreme example of overfitting (that is, a failure to generalize beyond the training data), train a [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) using `n_neighbors=1`. First, load the data:

In [None]:
df_img_test = pd.read_hdf('data/sources_img_test.hf5')
df_img_test_normed = pd.read_hdf('data/sources_img_test_normed.hf5')
df_img_train_normed = pd.read_hdf('data/sources_img_train_normed.hf5')
nsrc_true_test = pd.read_hdf('data/nsrc_true_test.hf5')
nsrc_true_train = pd.read_hdf('data/nsrc_true_train.hf5')

Here is the driver function we used earlier:

In [None]:
from mls import plot_classification, scan_misclassified

In [None]:
def test_sk_classification(
    method, train_data=df_img_train_normed, test_data=df_img_test_normed):
    # Fit normed training images.
    fit = method.fit(train_data, nsrc_true_train)
    # Get training predictions.
    nsrc_train = fit.predict(train_data)
    plot_classification(nsrc_train, nsrc_true_train, label='train:')
    plt.show()
    # Get test predictions.
    nsrc_test = fit.predict(test_data)
    plot_classification(nsrc_test, nsrc_true_test, label='test:')
    plt.show()
    # Scan some test failures.
    scan_misclassified(nsrc_test, nsrc_true_test, df_img_test)
    plt.show()
    # Return the test predictions.
    return nsrc_test

In [None]:
from sklearn import neighbors

In [None]:
# Add your code here...

Do these results make sense based on how the KNeighborsClassifier [works](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)?
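As a point of comparison, here is a minimal self-contained sketch of the same effect on synthetic data. It does not use the course images: the two-feature dataset, the noisy labels and the 200/200 split are all assumptions made up for illustration. With `n_neighbors=1` every training point is its own nearest neighbor, so the training accuracy is perfect even though the labels are noisy.

```python
import numpy as np
from sklearn import neighbors

# Synthetic stand-in data (an assumption, not the course images):
# two features, with a binary label that depends noisily on the first one.
gen = np.random.RandomState(seed=123)
X = gen.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * gen.normal(size=400) > 0).astype(int)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# With a single neighbor, each training sample is matched to itself.
knn = neighbors.KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(f'train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}')
```

The gap between the two numbers is the signature to look for in your own results with the image data.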

## Exercise 2: Regression Score

The default score used by all sklearn regression algorithms is the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), which you can calculate with sklearn using [metrics.r2_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html):

In [None]:
from sklearn import metrics

In [None]:
# Generate some random points.
gen = np.random.RandomState(seed=123)
true = gen.uniform(size=100)
pred = gen.uniform(size=100)
plt.scatter(true, pred)
R2 = metrics.r2_score(true, pred)
print(f'R2 = {R2:.3f}')
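To see what this score measures, it can help to compute it directly from its definition, $R^2 = 1 - \sum_i (y_i - f_i)^2 / \sum_i (y_i - \bar{y})^2$, where $y_i$ are the true values, $f_i$ the predictions and $\bar{y}$ the mean of the true values. The sketch below regenerates the same random points and checks the result against `metrics.r2_score`; the names `ss_res`, `ss_tot` and `R2_manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn import metrics

# Regenerate the same random points as above.
gen = np.random.RandomState(seed=123)
true = gen.uniform(size=100)
pred = gen.uniform(size=100)

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((true - pred) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((true - np.mean(true)) ** 2)
R2_manual = 1 - ss_res / ss_tot

print(f'manual R2 = {R2_manual:.3f}, sklearn R2 = {metrics.r2_score(true, pred):.3f}')
```

Note that nothing in this definition prevents the score from being negative when the predictions are worse than simply guessing the mean.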

Write a similar example which has R2 = 1.

In [None]:
# Add your solution here...

Write a similar example which has R2 = 0.

In [None]:
# Add your solution here...

## Exercise 3: Interpretation

Machine learning involves running a lot of tests with careful bookkeeping, then interpreting the results. For a taste of this, make a table of training and test classifier performance from the examples in the lectures, showing:
 - the best of the sklearn classifiers
 - the DNN classifier
 - the CNN classifier
 
Now decide which is the "best" method based on these numbers. How are you defining "best"?

## Exercise 4: Network Architecture

Train a dense network classifier with 64 nodes in each of the two hidden layers, instead of 128. How does this affect performance? (Hint: you will need to cut & paste liberally from the [NeuralNet notebook](NeuralNet.ipynb).)

In [None]:
import tensorflow as tf

In [None]:
from mls import plot_classification, scan_misclassified

In [None]:
mkdir -p tfs/dnnc64

In [None]:
# The rest is up to you...

To put your results in perspective, calculate how many weights the 128-node and 64-node networks each have (both have 256 input nodes and 4 output nodes).  What do you learn from this?
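One way to organize this count is sketched below, assuming fully connected layers in which each node has one weight per input plus (optionally) one bias. The helper `count_dense_parameters` is a hypothetical name introduced here, not part of `mls` or TensorFlow, and whether you include the biases in the total is a choice you should state.

```python
def count_dense_parameters(layer_sizes, include_bias=True):
    """Count trainable parameters in a fully connected network.

    layer_sizes lists the number of nodes per layer, from input to output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight for every (input node, output node) pair.
        total += n_in * n_out
        if include_bias:
            # One bias per output node.
            total += n_out
    return total

for n_hidden in (128, 64):
    sizes = [256, n_hidden, n_hidden, 4]
    n_all = count_dense_parameters(sizes)
    n_weights = count_dense_parameters(sizes, include_bias=False)
    print(f'{n_hidden}-node network: {n_weights} weights, {n_all} with biases')
```

Compare the ratio of the two totals with the change in performance you measured.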

## Exercise 5: Mission Impossible?

We found that the [LinearSVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) regression had $R^2\simeq 0$ with its default hyperparameters, so it did not learn anything useful from the training data.  Can you save its reputation and find a set of hyperparameters where it gives respectable performance?
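If you prefer a systematic scan to trial and error, the following is a minimal sketch of a cross-validated grid search. Everything specific in it is an assumption: the synthetic linear data stands in for the lecture data, the grid over `C` and `epsilon` is only a starting point, and standardizing the inputs first is a guess at what might help, not a known fix.

```python
import numpy as np
from sklearn import model_selection, pipeline, preprocessing, svm

# Synthetic stand-in data (an assumption): a noisy linear relationship.
gen = np.random.RandomState(seed=123)
X = gen.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * gen.normal(size=200)

# Standardize the inputs, then fit a LinearSVR.
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    svm.LinearSVR(max_iter=10000, random_state=0))

# Hyperparameters to scan; make_pipeline names the step 'linearsvr'.
param_grid = {
    'linearsvr__C': [0.1, 1.0, 10.0],
    'linearsvr__epsilon': [0.0, 0.1, 0.5],
}

# The default score for a regressor is R2, averaged over the CV folds.
search = model_selection.GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f'R2 = {search.best_score_:.3f}')
```

To apply this to the exercise, replace `X` and `y` with the training data from the lecture and check the final score on the held-out test data, not on the cross-validation folds.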