# 4.3.3 [Supervised Neural Nets](https://courses.thinkful.com/data-201v1/project/4.3.3)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

%matplotlib inline

### Model Parameters:

1. The number and complexity of _layers_ to include is determined by
    * Computational resources
    * Cross-validation, searching for convergence
    
2. _Alpha_ - neural networks use a regularization parameter that penalizes large coefficients (the penalty scales with alpha)
3. _Activation Functions_ - determine how each perceptron turns its weighted input into an output.
    * **relu** outputs zero for negative inputs and passes positive inputs through unchanged (max(0, x)). It is the default in `MLPClassifier` and the most often used.
    * **sigmoid** (called `logistic` in scikit-learn) squashes the output to a continuous value between _0 and 1_, allowing for a more nuanced model (it does increase computational complexity)
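
To make these three parameters concrete, here is a minimal sketch of how they map onto `MLPClassifier` arguments. The synthetic data from `make_classification` and the specific values (two layers of 20, `alpha=1e-3`) are assumptions chosen only to keep the example fast; they are not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Small synthetic 3-class problem (an assumption; it stands in for the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

scores = {}
for activation in ("relu", "logistic"):  # scikit-learn calls sigmoid "logistic"
    mlp_demo = MLPClassifier(
        hidden_layer_sizes=(20, 20),  # two hidden layers of 20 perceptrons each
        alpha=1e-3,                   # L2 penalty strength; larger means stronger regularization
        activation=activation,
        max_iter=500,
        random_state=0)
    scores[activation] = cross_val_score(mlp_demo, X_demo, y_demo, cv=3).mean()

for activation, score in scores.items():
    print("%s: %.3f" % (activation, score))
```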

In [2]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

# Select Columns.
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']]

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

# Transform DateAcquired into a datetime object.
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year


# Remove multiple nationalities, genders, and artists.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(multiple_persons)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(multiple_nationalities)'
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'


# Convert dates to start date, cutting down number of distinct examples.
artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

# Sample 50,000 rows, then drop the target and the columns that are dummy-encoded separately.
sample = artworks.sample(50000, random_state=42)
X = sample.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

# Create dummies separately.
artists = pd.get_dummies(sample.Artist)
nationalities = pd.get_dummies(sample.Nationality)
dates = pd.get_dummies(sample.Date)

# Concat with other variables, but artists slows this wayyyyy down so we'll keep it out for now
X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = sample.Department
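
The date regex keeps only the first four-digit run in each `Date` string, which is what cuts down the number of distinct values before dummy encoding. A small sketch on made-up date strings (the example values are hypothetical, not taken from the MoMA file):

```python
import pandas as pd

# Hypothetical examples of the free-text formats a date column can contain.
raw_dates = pd.Series(["1896", "1976-77", "c. 1917", "Unknown"])

# Extract the first run of four digits; rows without one become NaN.
start_years = raw_dates.str.extract(r"([0-9]{4})", expand=False)
print(start_years.tolist())

# One-hot encode the cleaned years; NaN rows get all-zero dummy columns.
year_dummies = pd.get_dummies(start_years)
print(year_dummies.columns.tolist())
```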

# Build the Model

Classify what department a piece of art should be in (5-way classification question)

In [28]:
Y.value_counts()/len(Y)

Prints & Illustrated Books    0.5236
Photography                   0.2309
Architecture & Design         0.1112
Drawings                      0.1010
Painting & Sculpture          0.0333
Name: Department, dtype: float64

### Establish what a single 1000 perceptron layer looks like
Not the best: the cross-validation average (0.407) is below the majority-class share (0.52), and the scores vary widely between folds (standard deviation 0.123).

In [4]:
# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

Cross Val Avg: 0.407
Cross Val Std: 0.123


# Try different hidden layer sizes
#### It's highly likely that the model could default to always guessing the majority class _Prints & Illustrated Books_   
#### To do better than always guessing _Prints & Illustrated Books_ accuracy should be above 0.52
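
A sketch of how that 0.52 floor can be measured rather than assumed: scikit-learn's `DummyClassifier` with `strategy="most_frequent"` always predicts the majority class, so its cross-validated accuracy is the number to beat. The labels below are synthetic, built to match the class shares printed above, and the placeholder feature matrix is an assumption (the dummy model ignores it).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labels reproducing the department shares (counts per 10,000 rows).
class_counts = {
    "Prints & Illustrated Books": 5236,
    "Photography": 2309,
    "Architecture & Design": 1112,
    "Drawings": 1010,
    "Painting & Sculpture": 333,
}
y_demo = np.repeat(list(class_counts.keys()), list(class_counts.values()))
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the dummy model ignores them

baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)
print("Baseline Avg: %.3f" % baseline_scores.mean())
print("Baseline Std: %.3f" % np.std(baseline_scores))
```

With the notebook's own data, the equivalent call would be `cross_val_score(DummyClassifier(strategy="most_frequent"), X, Y, cv=5)`.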

#### Many Small Layers:
Better: the model's average accuracy improved by 19 percentage points over the single 1000-perceptron layer (0.598 vs. 0.407), and the standard deviation between folds dropped from 0.123 to 0.044.

In [3]:
# Many small layers (ten layers of 10)
mlp = MLPClassifier(hidden_layer_sizes=(10,10,10,10,10,10,10,10,10,10))
%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

CPU times: user 36.5 s, sys: 4.36 s, total: 40.8 s
Wall time: 21.9 s
Cross Val Avg: 0.598
Cross Val Std: 0.044
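
Note that `%time mlp.fit(X, Y)` times a single fit on the full sample, and `cross_val_score` then refits a fresh copy of the model for each fold. A minimal sketch, assuming synthetic data from `make_classification` in place of the artworks sample, of using `cross_validate` to get per-fold fit times and scores from one call:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption; swap in X and Y from the notebook).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

mlp_demo = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=300, random_state=0)

# cross_validate returns a dict with fit_time, score_time and test_score arrays.
cv_results = cross_validate(mlp_demo, X_demo, y_demo, cv=5)

print("Cross Val Avg: %.3f" % cv_results["test_score"].mean())
print("Cross Val Std: %.3f" % np.std(cv_results["test_score"]))
print("Mean fit time per fold: %.2f s" % cv_results["fit_time"].mean())
```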


#### Few Large Layers
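Already with 3 layers of 1000 the model was taking > 10 minutes to fit 1 fold. I killed that job to try with fewer layers.
Accuracy went up quite a bit compared with the single 1000-perceptron layer (0.559 with three layers, 0.614 with two), but the two-layer model's spread between folds (0.089) is about twice that of the many-small-layers model (0.044), which may point to overfitting.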
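
Much of the slowdown comes from the number of weights each architecture has to learn. A rough sketch of that count for a fully connected network follows. The function name `count_mlp_parameters` and the input width of 300 features are assumptions for illustration (the real width is `X.shape[1]`).

```python
def count_mlp_parameters(n_features, hidden_layer_sizes, n_classes):
    """Count weights and biases in a fully connected network."""
    layer_widths = [n_features, *hidden_layer_sizes, n_classes]
    total = 0
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output unit
    return total


n_features = 300  # hypothetical; use X.shape[1] for the real data
n_classes = 5     # the five departments

for sizes in [(10,) * 10, (1000,), (1000, 1000), (1000, 1000, 1000)]:
    print(sizes, count_mlp_parameters(n_features, sizes, n_classes))
```

Under these assumptions, three layers of 1000 carry several hundred times as many parameters as ten layers of 10, which is consistent with the gap in fit times.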

In [5]:
# Few large layers (three layers of 1000)
mlp = MLPClassifier(hidden_layer_sizes=(1000, 1000, 1000))
%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))



CPU times: user 8min 18s, sys: 1min 40s, total: 9min 58s
Wall time: 5min 8s
Cross Val Avg: 0.559
Cross Val Std: 0.041


In [6]:
# Few large layers (two layers of 1000)
mlp = MLPClassifier(hidden_layer_sizes=(1000, 1000))
#%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

Cross Val Avg: 0.614
Cross Val Std: 0.089


### Will starting small and following up with a larger layer help?


In [7]:
# Small then large
mlp = MLPClassifier(hidden_layer_sizes=(10, 1000), random_state=33)
%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

CPU times: user 1min 28s, sys: 6.99 s, total: 1min 35s
Wall time: 49.7 s
Cross Val Avg: 0.641
Cross Val Std: 0.028


### What does it look like with the opposite? Start with a large layer and follow up with a smaller layer
This is just guessing the majority class: no bueno.

In [8]:
# Large then small. This is guessing the majority class.
mlp = MLPClassifier(hidden_layer_sizes=(1000, 10), random_state=33)
%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

CPU times: user 2min 32s, sys: 26.6 s, total: 2min 58s
Wall time: 1min 33s
Cross Val Avg: 0.517
Cross Val Std: 0.000


### I'm curious about a handful of very small layers
This is just guessing the majority class

In [10]:
# A handful of very small layers
mlp = MLPClassifier(hidden_layer_sizes=(4, 3, 4, 2, 3), random_state=33)
%time mlp.fit(X, Y)

cv_score = cross_val_score(mlp, X, Y, cv=5)
print("Cross Val Avg: %.3f"%cv_score.mean())
print("Cross Val Std: %.3f"%np.std(cv_score))

CPU times: user 3.53 s, sys: 577 ms, total: 4.1 s
Wall time: 4.22 s
Cross Val Avg: 0.517
Cross Val Std: 0.000
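
An accuracy stuck at one value with zero spread across folds is a strong hint that every prediction is the same class. A sketch of checking this directly with `cross_val_predict` follows. Here a `DummyClassifier` and synthetic labels stand in for the collapsed network and the real data (both are assumptions for the demo).

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced labels; the feature matrix is a placeholder.
y_demo = np.array(["Prints"] * 520 + ["Photography"] * 230 + ["Drawings"] * 250)
X_demo = np.zeros((len(y_demo), 1))

# Stand-in for a collapsed model: always predicts the most frequent class.
collapsed_model = DummyClassifier(strategy="most_frequent")

# Out-of-fold predictions, one per sample.
y_pred = cross_val_predict(collapsed_model, X_demo, y_demo, cv=5)

# Share of each predicted class; a single entry at 1.0 means the model collapsed.
predicted_shares = pd.Series(y_pred).value_counts(normalize=True)
print(predicted_shares)
print("Distinct predicted classes:", len(predicted_shares))
```

With the notebook's objects the same check would be `cross_val_predict(mlp, X, Y, cv=5)` followed by the `value_counts` call.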


## Rules of thumb for layer size vs. depth
* With more, narrower layers the model can find more complex patterns, but it can also overfit
* Deeper networks are harder to train and give the model more opportunities to get stuck at a local minimum rather than the global minimum; this issue is challenging to diagnose
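
The experiments in this notebook all repeat the same fit-and-score pattern, so the width-versus-depth comparison can be run as one loop. A minimal sketch, assuming synthetic data and much smaller layer sizes than the ones tried here so that it finishes quickly; `compare_architectures` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier


def compare_architectures(architectures, X, y, cv=5, random_state=33):
    """Return {hidden_layer_sizes: (mean accuracy, std across folds)}."""
    results = {}
    for sizes in architectures:
        mlp = MLPClassifier(hidden_layer_sizes=sizes, max_iter=300,
                            random_state=random_state)
        fold_scores = cross_val_score(mlp, X, y, cv=cv)
        results[sizes] = (fold_scores.mean(), np.std(fold_scores))
    return results


# Synthetic stand-in data (an assumption, not the artworks sample).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# Scaled-down shapes: one wide layer, several narrow layers,
# small-then-large, and large-then-small.
architectures = [(50,), (5, 5, 5, 5), (5, 50), (50, 5)]
results = compare_architectures(architectures, X_demo, y_demo)
for sizes, (avg, std) in results.items():
    print("%-14s Cross Val Avg: %.3f  Std: %.3f" % (str(sizes), avg, std))
```

Fixing `random_state` for every architecture keeps the weight initialization from being a hidden source of difference between runs.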