# Project 3
## Problem Statement
As with Project 1, apply the ideas of ch. 1 - 3 as appropriate.
Develop and demonstrate your capabilities with:
* Decision Trees (ch. 6)
* Ensemble Learning and Random Forests (ch. 7)
* Dimensionality Reduction (ch. 8)

## Daniel's Task
Apply ensembling techniques to the MNIST dataset to improve algorithm performance.

# MNIST Dataset Ensembling

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as pyplot
import numpy

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
X = mnist["data"]
y = mnist["target"]

from sklearn.model_selection import train_test_split
X_train, X_test,  y_train, y_test  = train_test_split(X,       y,       test_size=0.2, random_state=31415)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("Training set:   ", X_train.shape, " ", y_train.shape)
print("Validation set: ", X_valid.shape, " ", y_valid.shape)
print("Testing set:    ", X_test.shape,  " ", y_test.shape)

Training set:    (44800, 784)   (44800,)
Validation set:  (11200, 784)   (11200,)
Testing set:     (14000, 784)   (14000,)
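
As a quick check on those shapes, here is a minimal sketch of the arithmetic behind two successive 80/20 splits. It assumes the full MNIST set has 70,000 samples (which matches the three shapes printed above) and uses plain rounding rather than scikit-learn's exact rounding rule, which makes no difference here because the sizes divide evenly.

```python
# Sizes implied by two successive 80/20 splits of the full dataset.
n_total = 70_000        # assumed MNIST sample count (sum of the printed shapes)
test_fraction = 0.2     # first split: hold out the test set
valid_fraction = 0.2    # second split: hold out validation from the remainder

n_test = round(n_total * test_fraction)
n_trainval = n_total - n_test
n_valid = round(n_trainval * valid_fraction)
n_train = n_trainval - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name:<6}{n:>6}  {n / n_total:.0%}")
```

So the effective split is roughly 64% training, 16% validation, and 20% test.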


First I'm going to do a sanity check. If my results are similar to what I got in Project 1 then I'll know that I haven't clobbered anything too badly so far.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

neighborsClsfr = KNeighborsClassifier(n_jobs=-1)
neighborsClsfr.fit(X_train, y_train)
print(neighborsClsfr.score(X_valid, y_valid))

0.9689285714285715


Looks good! Now I'm going to try some new algorithms on this dataset.

## Random Forests

In [3]:
from sklearn.ensemble import RandomForestClassifier

randForest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
randForest.fit(X_train, y_train)
print(randForest.score(X_valid, y_valid))

0.968125


Simple enough (thanks, scikit-learn!) and the results look stellar. Good enough, in fact, that I'm suspicious it's overfitting pretty badly. I'll run some cross-validation on the training set to see whether that seems to be the case.

In [4]:
rf_crossVal = cross_val_score(randForest, X_train, y_train, cv=5, n_jobs=-1, scoring="accuracy")
print(rf_crossVal)
print(f"Avg: {rf_crossVal.mean():0.4f} +/- {rf_crossVal.std():0.4f}")

[0.96430563 0.96172302 0.96819196 0.96505526 0.96393882]
Avg: 0.9646 +/- 0.0021
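
One rough way to put a number on "consistent" is to compare the single validation score against the spread of the fold scores. The sketch below only reuses the numbers printed above; treating the gap in units of the fold standard deviation is an illustrative choice, not a formal test, since the validation score comes from a model trained on the whole training set.

```python
from statistics import mean, pstdev

# Fold scores and validation score copied from the printed output above.
fold_scores = [0.96430563, 0.96172302, 0.96819196, 0.96505526, 0.96393882]
valid_score = 0.968125

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std, the same convention as numpy's default .std()
gap_in_stds = (valid_score - cv_mean) / cv_std

print(f"CV mean {cv_mean:.4f}, std {cv_std:.4f}")
print(f"Validation score sits {gap_in_stds:.1f} fold std devs above the CV mean")
```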


Yeah, those results are pretty consistent with each other. Still, I won't really know until I compare against the test set. I'm going to save that for the very end, though.

## Boosting
Boosting sounds pretty cool and I want to try it out, so I'm going to pull out the standard scikit-learn booster.

In [5]:
from sklearn.ensemble import AdaBoostClassifier

booster = AdaBoostClassifier()
booster.fit(X_train, y_train)
print(booster.score(X_valid, y_valid))

0.7350892857142857


Well, that was less than dazzling so I'm going to try another booster that I've heard good things about. Still, I might come back to this at the end if my test dataset shows everything else to be horribly overfitted.

## XGBoost
I listened to a podcast about XGBoost a while back, and the guy on it seemed quite impressed by the performance of that library, so I installed it via `pip3 install xgboost`.
Unfortunately, it doesn't quite seem to like Windows and didn't work out of the box. After a bit of poking around based on the error message, I found that I could just move a couple of files around in the Python packages folder, and it started working. Hopefully anyone else who tries this won't have that problem.

In [6]:
from xgboost import XGBClassifier

xbooster = XGBClassifier(n_jobs=-1)
xbooster.fit(X_train, y_train)
print(xbooster.score(X_valid, y_valid))

0.9328571428571428


Hey, not bad! Wonder if it will compare well with the other algorithms if I do some tuning.

In [7]:
# param_grid = {
#     "learning_rate": [0.05, 0.1, 0.2, 0.4],
#     "subsample": [0.6, 0.75, 0.9],
#     "max_depth": [2, 3, 5],
#     "gamma": [0, 0.1, 0.3],
# }
#
# search = GridSearchCV(xbooster, param_grid, n_jobs=-1, cv=4)
# search.fit(X_train, y_train)
# print(search.best_params_)
# print(search.best_score_)
# print(search.refit_time_)

# Results from the original 11 hour 6 minute run:
search = {
    "best_params": {'gamma': 0, 'learning_rate': 0.4, 'max_depth': 5, 'subsample': 0.9},
    "best_score": 0.9725223214285714,
    "refit_time": 169.8297770023346
}

# Apply the tuned parameters. xbooster itself is not refit here, so
# xbooster.score() still reflects the default-parameter fit from In [6].
# VotingClassifier clones and refits its estimators, so it picks these up.
xbooster.set_params(**search["best_params"])

print(search["best_score"])

0.9725223214285714
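
To see why that search took so long, here is a small sketch that counts the model fits implied by the grid in the commented-out cell. It assumes `GridSearchCV` was left at its default of refitting the best candidate once on the full training set at the end.

```python
from itertools import product

# Same grid as in the commented-out search above.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2, 0.4],
    "subsample": [0.6, 0.75, 0.9],
    "max_depth": [2, 3, 5],
    "gamma": [0, 0.1, 0.3],
}
cv_folds = 4

# Every combination of parameter values is one candidate model.
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds + 1  # +1 for the final refit (assumed default)

print(f"{len(candidates)} candidates x {cv_folds} folds + 1 refit = {n_fits} fits")
```

Each extra value in any one list multiplies the total, so trimming the grid or switching to a randomized search would be the first thing to try for a shorter run.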


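Yup. That may be the best one so far, though keep in mind that this is a 4-fold cross-validation score on the training set rather than a validation-set score.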

## Voting Classifier

In [8]:
from sklearn.ensemble import VotingClassifier

estimators=[
    ('xb', xbooster),
    ('knn', neighborsClsfr),
    ('randf', randForest)
]
voter = VotingClassifier(estimators=estimators)
voter.fit(X_train, y_train)
print(voter.score(X_valid, y_valid))

0.9766071428571429
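
`VotingClassifier` defaults to hard voting, where each fitted model casts one vote per sample and the majority label wins. The sketch below shows that idea in plain Python. The per-model predictions are made up for illustration (they are not real MNIST output), and the tie-break rule (the label that sorts first wins) is my reading of the scikit-learn documentation rather than a copy of its implementation.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label; ties go to the label that sorts first."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Hypothetical predictions from the three models for three samples.
# Labels are strings because the MNIST targets loaded above are strings.
model_predictions = {
    "xb":    ["3", "5", "7"],
    "knn":   ["3", "6", "1"],
    "randf": ["8", "6", "9"],
}

# Regroup the predictions by sample, then vote on each sample.
votes_per_sample = zip(*model_predictions.values())
ensemble = [hard_vote(votes) for votes in votes_per_sample]
print(ensemble)
```

With only three voters, a three-way disagreement is always a tie, which is one reason soft voting (averaging predicted probabilities) can be worth trying when all models support it.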


That's a smidge better than XGBoost, and the ensemble itself hasn't even been tuned. Tuning this beast would probably be quite a chore, though. For the sake of time, I'm going to dodge that mess and move on to bagging.

## Bagging

In [9]:
from sklearn.ensemble import BaggingClassifier

bagger = BaggingClassifier(KNeighborsClassifier(), max_samples=0.6, n_jobs=-1, random_state=42)
bagger.fit(X_train, y_train)
print(bagger.score(X_valid, y_valid))

0.9657142857142857
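
For a sense of what `max_samples=0.6` means, here is a minimal sketch of the bootstrap sampling behind bagging. It assumes the `BaggingClassifier` defaults of 10 estimators and sampling with replacement, and it uses Python's own random generator, so the drawn indices will not match scikit-learn's.

```python
import random

n_train = 44_800      # training set size printed earlier
n_estimators = 10     # assumed: BaggingClassifier default
n_draw = n_train * 6 // 10  # max_samples=0.6 of the training set

rng = random.Random(42)
unique_fractions = []
for _ in range(n_estimators):
    # Each base estimator gets its own sample, drawn with replacement.
    indices = rng.choices(range(n_train), k=n_draw)
    unique_fractions.append(len(set(indices)) / n_train)

print(f"Rows drawn per estimator: {n_draw}")
print(f"Distinct training rows seen per estimator: "
      f"{min(unique_fractions):.1%} to {max(unique_fractions):.1%}")
```

Because of the repeats, each nearest-neighbors model ends up seeing a bit under half of the distinct training rows, which is where the diversity between the bagged models comes from.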


Awesome, this one also looks really good.

# Comparing against the test set
Now I'm going to see how all these hold up against my test set, which I have not used thus far. That will give me an idea of how badly I've overfitted everything.

In [11]:
compare_estimators = [
    neighborsClsfr,
    randForest,
    booster,
    xbooster,
    voter,
    bagger
]

for clf in compare_estimators:
    score = clf.score(X_test, y_test)
    print(f"{type(clf).__name__: <15} - {score:0.4f}")

KNeighborsClassifier - 0.9674
RandomForestClassifier - 0.9648
AdaBoostClassifier - 0.7346
XGBClassifier   - 0.9308
VotingClassifier - 0.9739
BaggingClassifier - 0.9646


Here are the validation and test accuracy results in a table:


| Algorithm | Validation set | Test set | Difference |
|-----------|----------------|----------|------------|
| K-Neighbors | 0.9689 | 0.9674 | 0.0015 |
| Random Forest | 0.9681 | 0.9648 | 0.0033 |
| AdaBoost | 0.7351 | 0.7346 | 0.0005 |
| XGBoost | 0.9329 | 0.9308 | 0.0021 |
| Voting | 0.9766 | 0.9739 | 0.0027 |
| Bagging | 0.9657 | 0.9646 | 0.0011 |


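Without more test cases, perhaps via cross-validation, these differences can't be taken too literally, but they are all small (at most about 0.003), so the chance that any of these models is terribly overfit is low. Note that the XGBoost row is the default-parameter model fit in In [6]: `xbooster` was not refit after the tuned parameters were assigned, so the 0.9725 grid-search score is not reflected here.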
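
The gaps can be recomputed directly from the printed scores. This is a small sketch that only reuses numbers from the output cells above; the test scores were printed to four decimals, so the gaps are only good to about that precision.

```python
# Scores copied from the printed output above: (validation, test).
scores = {
    "KNeighborsClassifier":   (0.9689285714285715, 0.9674),
    "RandomForestClassifier": (0.968125,           0.9648),
    "AdaBoostClassifier":     (0.7350892857142857, 0.7346),
    "XGBClassifier":          (0.9328571428571428, 0.9308),
    "VotingClassifier":       (0.9766071428571429, 0.9739),
    "BaggingClassifier":      (0.9657142857142857, 0.9646),
}

# Validation accuracy minus test accuracy for each model.
gaps = {name: valid - test for name, (valid, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name:<24}{gap:.4f}")

worst = max(gaps, key=gaps.get)
print("Largest gap:", worst)
```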

As a side note, the bagger took a very long time to score the test set, and it still did not score as well as the voting classifier. If I were going to use bagging on this dataset, I would first try a simpler base estimator or fewer estimators.
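
A back-of-the-envelope cost model suggests why scoring was slow. The sketch assumes brute-force nearest-neighbor search, where the work scales with (stored points) x (query points), and the `BaggingClassifier` default of 10 estimators. Real timings depend on the search algorithm scikit-learn picks and on parallelism, so treat the ratio as a rough guide only.

```python
n_train = 44_800
n_test = 14_000
n_estimators = 10             # assumed: BaggingClassifier default
n_per_estimator = n_train * 6 // 10  # max_samples=0.6

# Distance computations needed to score the test set (brute-force assumption).
single_knn = n_train * n_test
bagged_knn = n_estimators * n_per_estimator * n_test

ratio = bagged_knn / single_knn
print(f"Bagged KNN does about {ratio:.1f}x the distance work of a single KNN")
```

Under these assumptions, halving the number of estimators would halve the scoring cost, which is why fewer estimators is a sensible first thing to try.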

# Daniel Ashby Indirect Activity Report

| Date | Duration | Duration in Minutes | Collaborator(s) | Specific Task/Activity |
|:-------|:----|:---|:-|:-----------------------------|
| 1/19/19 | 3:00 | 180 | - | Reading Textbook Chapters 1-3 |
| 1/26/19 | 3:00 | 180 | - | Working on Project 1 |
| 2/2/19 | 7:00 | 420 | - | Working on Project 1 |
| 2/4/19 | 1:30 | 90 | - | Working on Project 1 |
| 2/6/19 | 1:00 | 60 | - | Working on Project 1 |
| 2/9/19 | 8:00 | 480 | - | Working on Project 1 |
| 2/11/19 | 4:00 | 240 | - | Working on Project 1 |
| - | - | - | - | - |
| 2/16/19 | 3:00 | 180 | - | Reading Textbook |
| 2/16/19 | 3:00 | 180 | - | Attempting to set up JupyterHub for collaboration |
| 2/23/19 | 4:00 | 240 | - | Reading Textbook |
| 3/2/19 | 3:00 | 180 | - | Attempting to set up JupyterHub for collaboration |
| 3/9/19 | 6:00 | 360 | Derek Byrne | Improving Titanic scores |
| 3/16/19 | 8:00 | 480 | Derek Byrne | Linear Regression |
| 3/18/19 | 8:00 | 480 | Derek Byrne | Linear Regression, and Support Vector Classification |
| - | - | - | - | - |
| 3/30/19 | 6:00 | 360 | - | Reading Textbook |
| 4/6/19 | 3:00 | 180 | - | Reading Textbook |
| 4/20/19 | 4:00 | 240 | Derek Byrne | Reading Textbook & Project 3 |
| 4/27/19 | 6:00 | 360 | Derek Byrne | Project 3 |
| 4/29/19 | 1:30 | 90 | Derek Byrne | Project 3 |
| - | - | - | - | - |
| Sum for current report | 20:30 | 1,230 | - | - |
| Cumulative sum for this course | 83:00 | 4,980 | - | - |