In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

We're ready to build our first neural network. We will feed multiple features into our model, each of which will pass through a set of perceptrons to produce a response, and the network will be trained so that this response matches our output.

Like many models we've covered, this can be used as either a regression or a classification model.
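
To make the idea of features flowing through perceptrons concrete, here is a minimal sketch of a single forward pass through one hidden layer, written in plain NumPy. The sizes, the random weights, and names such as `X_demo`, `W1`, and `W2` are assumptions made up for illustration; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 4 samples, 3 features, 5 hidden perceptrons, 2 classes.
n_samples, n_features, n_hidden, n_classes = 4, 3, 5, 2
X_demo = rng.normal(size=(n_samples, n_features))

# Randomly initialised weights and zero biases (a trained network would learn these).
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

# Hidden layer: weighted sum of the features followed by a ReLU activation.
hidden = np.maximum(0.0, X_demo @ W1 + b1)

# Output layer: weighted sum of hidden outputs, turned into class probabilities
# with a softmax (subtracting the row maximum for numerical stability).
logits = hidden @ W2 + b2
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

print(probs)
```

Each row of `probs` sums to one, so it can be read as the network's predicted class probabilities for that sample.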

First, we need to load our dataset. For this example we'll use The Museum of Modern Art in New York's [public dataset](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv) on their collection.

In [2]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [6]:
artworks.columns

Index(['Artist', 'Nationality', 'Gender', 'Date', 'Department', 'DateAcquired',
       'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)'],
      dtype='object')

We'll also do a bit of data processing and cleaning, selecting columns of interest and converting URLs to booleans indicating whether they are present.

In [7]:
# Select Columns.
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']]

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

In [8]:
artworks.head()

| | Artist | Nationality | Gender | Date | Department | DateAcquired | URL | ThumbnailURL | Height (cm) | Width (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Otto Wagner | (Austrian) | (Male) | 1896 | Architecture & Design | 1996-04-09 | True | True | 48.6 | 168.9 |
| 1 | Christian de Portzamparc | (French) | (Male) | 1987 | Architecture & Design | 1995-01-17 | True | True | 40.6401 | 29.8451 |
| 2 | Emil Hoppe | (Austrian) | (Male) | 1903 | Architecture & Design | 1997-01-15 | True | True | 34.3 | 31.8 |
| 3 | Bernard Tschumi | () | (Male) | 1980 | Architecture & Design | 1995-01-17 | True | True | 50.8 | 50.8 |
| 4 | Emil Hoppe | (Austrian) | (Male) | 1903 | Architecture & Design | 1997-01-15 | True | True | 38.4 | 19.1 |


## Building a Model

Now, let's see if we can use multi-layer perceptron modeling (or "MLP") to classify the department a piece should go into using everything but the department name.

Before we import MLP from SKLearn and establish the model, we first have to ensure correct typing for our data and do some other cleaning.

In [9]:
# Get data types.
artworks.dtypes

Artist           object
Nationality      object
Gender           object
Date             object
Department       object
DateAcquired     object
URL                bool
ThumbnailURL       bool
Height (cm)     float64
Width (cm)      float64
dtype: object

The `DateAcquired` column is an object. Let's transform that to a datetime object and add a feature for just the year the artwork was acquired.

In [10]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

dtype('int64')

Great. Let's do some more miscellaneous cleaning.

In [11]:
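# Remove multiple nationalities, genders, and artists.
# Raw strings keep the regex escapes intact. The replacement labels keep their
# literal backslashes, which is why they appear in the dummy column names.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = r'\(multiple_persons\)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = r'\(multiple_nationalities\)'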
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

# Convert dates to start date, cutting down number of distinct examples.
artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

# Final column drops: the target plus the columns we encode separately.
X = artworks.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

# Create dummies separately.
artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

# Concat with the other variables. Including artists slows this way down, so we'll keep it out for now.
X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department

In [12]:
X.head(5)

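| | URL | ThumbnailURL | Height (cm) | Width (cm) | YearAcquired | Gender_() | Gender_(Female) | Gender_(Male) | Gender_(male) | `Gender_\(multiple_persons\)` | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | 48.6 | 168.9 | 1996 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | True | True | 40.6401 | 29.8451 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | True | True | 34.3 | 31.8 | 1997 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | True | True | 50.8 | 50.8 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | True | True | 38.4 | 19.1 | 1997 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |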


In [18]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single hidden layer of 10 perceptrons.
mlp = MLPClassifier(hidden_layer_sizes=(10,), verbose = True)
mlp.fit(X, Y)

Iteration 1, loss = 3.79300244
Iteration 2, loss = 0.89926426
Iteration 3, loss = 0.86375684
Iteration 4, loss = 0.84490584
Iteration 5, loss = 0.83250118
Iteration 6, loss = 0.82395657
Iteration 7, loss = 0.81416967
Iteration 8, loss = 0.80368374
Iteration 9, loss = 0.80150579
Iteration 10, loss = 0.79213936
Iteration 11, loss = 0.79388023
Iteration 12, loss = 0.78387082
Iteration 13, loss = 0.77698515
Iteration 14, loss = 0.77996325
Iteration 15, loss = 0.77197599
Iteration 16, loss = 0.76974177
Iteration 17, loss = 0.77160093
Iteration 18, loss = 0.77264468
Iteration 19, loss = 0.76762589
Iteration 20, loss = 0.76093326
Iteration 21, loss = 0.76784445
Iteration 22, loss = 0.76091920
Iteration 23, loss = 0.75709344
Iteration 24, loss = 0.75405602
Iteration 25, loss = 0.75457845
Iteration 26, loss = 0.75681126
Iteration 27, loss = 0.75259239
Iteration 28, loss = 0.75277600
Iteration 29, loss = 0.75341489
Iteration 30, loss = 0.74610018
Iteration 31, loss = 0.75219134
Iteration 32, los

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=True, warm_start=False)

In [19]:
mlp.score(X, Y)

0.7124017853849275

In [11]:
Y.value_counts()/len(Y)

Prints & Illustrated Books    0.521662
Photography                   0.229354
Architecture & Design         0.111225
Drawings                      0.103381
Painting & Sculpture          0.034377
Name: Department, dtype: float64

In [16]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)



array([0.6826963 , 0.68890631, 0.59774991, 0.03352393, 0.53991238])

We got a lot of information from all of this. First, we can see that the model seems to overfit: it scores about 0.71 on the data it was trained on, while the cross-validation folds score between roughly 0.03 and 0.69. There is still some remaining performance when validated with cross validation. This is a feature of neural networks that aren't given enough data for the number of features present. _Neural networks, in general, like_ a lot _of data_. You may also have noticed something else about neural networks: _they can take a_ long _time to run_. Try increasing the layer size by adding a zero. Feel free to interrupt the kernel if you don't have time...

Also note that we created dummies for the artist names but left them out. Both of the points above are the reason for that: including them would take much longer to run and would make the model much more prone to overfitting.
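
The cross-validation scores above vary a lot from fold to fold, with one fold near 0.03. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling, so one possible explanation (an assumption, not something verified on this dataset) is that the row order makes some folds unrepresentative. The sketch below shows how shuffled stratified folds could be requested. It runs on synthetic data from `make_classification`, and `X_demo`, `y_demo`, and the fixed `random_state` values are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Shuffle rows before splitting so each fold sees a mix of the whole dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print(scores)
print(scores.mean(), scores.std())
```

On the artworks data, the same `cv` object could be passed in place of `cv=5`. Whether that evens out the fold scores would need to be checked by running it.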

## Model parameters

Now, before we move on and let you loose with some tasks to work on the model, let's go over the parameters.

We set one parameter: `hidden_layer_sizes`. Remember in the previous lesson, when we talked about layers in a neural network? This parameter tells the model how many hidden layers to make and how big to make them. Pass in a tuple that specifies each layer's size. Our network has one hidden layer that is 10 neurons wide. `(100, 4)` would create a network with two hidden layers, one 100 wide and the other 4.
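
One way to see what the tuple does is to fit a small model and inspect the shapes of its weight matrices, which scikit-learn stores in the `coefs_` attribute. This is a sketch on synthetic data. The 8 features and 3 classes are arbitrary choices for illustration, and the short `max_iter` may trigger a convergence warning that does not affect the shapes.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data: 8 input features, 3 classes (illustrative values only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Two hidden layers: the first 100 perceptrons wide, the second 4 wide.
mlp_demo = MLPClassifier(hidden_layer_sizes=(100, 4), max_iter=50, random_state=0)
mlp_demo.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers.
layer_shapes = [w.shape for w in mlp_demo.coefs_]
print(layer_shapes)  # [(8, 100), (100, 4), (4, 3)]
```

The first matrix maps the 8 inputs to the 100-wide layer, the second maps that layer to the 4-wide layer, and the last maps to the 3 output classes.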

How many layers to include, and how wide to make them, is determined by two things: computational resources and cross validation searching for convergence. As a rule of thumb, a layer is generally smaller than the number of input variables you have.

You can also set an alpha. Neural networks like this use a regularization parameter that penalizes large coefficients just like we discussed in the advanced regression section. Alpha scales that penalty.
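
A rough way to see the effect of alpha is to fit the same network with several values and compare the total squared size of the learned weights. The sketch below does this on synthetic data. The alpha values and the dataset are assumptions picked for illustration, and the exact numbers will depend on the data and the random seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

weight_norms = {}
for alpha in (1e-4, 1.0, 100.0):
    model = MLPClassifier(
        hidden_layer_sizes=(10,), alpha=alpha, max_iter=500, random_state=0
    )
    model.fit(X_demo, y_demo)
    # Sum of squared weights across all layers: the quantity the penalty acts on.
    weight_norms[alpha] = float(sum(np.sum(w ** 2) for w in model.coefs_))

print(weight_norms)
```

One would generally expect the larger alpha values to produce smaller weights, at the risk of underfitting if the penalty is too strong.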

Lastly, we'll discuss the activation function. The activation function determines how an individual perceptron turns its weighted input into an output, including whether that output is binary or continuous. By default this is 'relu', the rectified linear unit function, which outputs zero for negative inputs and the input itself otherwise. In the exercise we went through earlier we used a binary function, but we discussed the _sigmoid_ as a reasonable alternative. The _sigmoid_ (called 'logistic' by SKLearn because it's a 'logistic sigmoid function') allows for continuous values between 0 and 1, which allows for a more nuanced model. It does come at the cost of increased computational complexity.
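
For reference, the three functions mentioned here can be written in a few lines of NumPy. These are standalone illustrations, not the code scikit-learn runs internally, and the function names are chosen for this sketch.

```python
import numpy as np

def step(z):
    """Binary step: 1 where the input is non-negative, otherwise 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Logistic sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(step(z))     # [0. 1. 1.]
print(sigmoid(z))
print(relu(z))     # [0. 0. 2.]
```

In `MLPClassifier`, the choice is made through the `activation` argument, for example `activation='logistic'` for the sigmoid.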

If you want to learn more about these, study [activation functions](https://en.wikipedia.org/wiki/Activation_function) and [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron). The [Deep Learning](http://www.deeplearningbook.org/) book referenced earlier goes into great detail on the linear algebra involved.

You could also just test the models with cross validation. Unless neural networks are your specialty, cross validation should be sufficient.
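
As a sketch of what testing with cross validation could look like, scikit-learn's `GridSearchCV` can try several parameter combinations and report the one with the best cross-validated score. The grid values and the synthetic data below are assumptions for illustration. On the real dataset each fit would take far longer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Candidate layer structures and regularization strengths (illustrative only).
param_grid = {
    "hidden_layer_sizes": [(10,), (10, 10), (20, 20)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0), param_grid, cv=3
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which makes it a fairer basis for comparison than accuracy on the training data.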

For the other parameters and their defaults, check out the [MLPClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier).

## Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well. It may also be beneficial to do some real feature selection work...
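
If you do want to work on a subset, one option is a stratified sample, which keeps the class proportions of the full data. The sketch below uses `train_test_split` purely as a sampling tool on a made-up DataFrame. `X_demo`, `y_demo`, the class names, and the 10% sample size are all placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up stand-ins for the feature matrix X and the label series Y.
X_demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
y_demo = pd.Series(
    rng.choice(["Prints", "Photography", "Drawings"], size=1000, p=[0.6, 0.3, 0.1])
)

# Keep 10% of the rows, preserving the class proportions of y_demo.
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.1, stratify=y_demo, random_state=0
)

print(X_small.shape)
print(y_small.value_counts(normalize=True))
```

The same call with `X` and `Y` in place of the stand-ins would give a smaller dataset to experiment on before committing to a full run.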

### 2 Layers

In [22]:
# Your code here. Experiment with hidden layers to build your own model.
mlp = MLPClassifier(hidden_layer_sizes=(10,10,), verbose = True)
mlp.fit(X, Y)

Iteration 1, loss = 3.16053403
Iteration 2, loss = 0.88329866
Iteration 3, loss = 0.85354554
Iteration 4, loss = 0.83338709
Iteration 5, loss = 0.81728485
Iteration 6, loss = 0.80561677
Iteration 7, loss = 0.79544845
Iteration 8, loss = 0.78619413
Iteration 9, loss = 0.78072637
Iteration 10, loss = 0.77366700
Iteration 11, loss = 0.76033490
Iteration 12, loss = 0.75466903
Iteration 13, loss = 0.74820363
Iteration 14, loss = 0.74569828
Iteration 15, loss = 0.74092709
Iteration 16, loss = 0.74360584
Iteration 17, loss = 0.73957750
Iteration 18, loss = 0.72902773
Iteration 19, loss = 0.72851655
Iteration 20, loss = 0.72462846
Iteration 21, loss = 0.72058093
Iteration 22, loss = 0.72019950
Iteration 23, loss = 0.72010914
Iteration 24, loss = 0.71468530
Iteration 25, loss = 0.71146170
Iteration 26, loss = 0.71894545
Iteration 27, loss = 0.70906292
Iteration 28, loss = 0.71202027
Iteration 29, loss = 0.70966973
Iteration 30, loss = 0.70443825
Iteration 31, loss = 0.70552475
Iteration 32, los

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10, 10), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=True, warm_start=False)

In [23]:
mlp.score(X, Y)

0.7525729462540116

### 3 Layers

In [24]:
mlp = MLPClassifier(hidden_layer_sizes=(10,10,10,), verbose = True)
mlp.fit(X, Y)
mlp.score(X, Y)

Iteration 1, loss = 7.13069675
Iteration 2, loss = 0.97238058
Iteration 3, loss = 0.94614270
Iteration 4, loss = 0.91978684
Iteration 5, loss = 0.89434030
Iteration 6, loss = 0.86964946
Iteration 7, loss = 0.85740046
Iteration 8, loss = 0.84216617
Iteration 9, loss = 0.82580058
Iteration 10, loss = 0.81467766
Iteration 11, loss = 0.80140098
Iteration 12, loss = 0.78913047
Iteration 13, loss = 0.77738621
Iteration 14, loss = 0.77104105
Iteration 15, loss = 0.76020577
Iteration 16, loss = 0.74720431
Iteration 17, loss = 0.74234616
Iteration 18, loss = 0.73523346
Iteration 19, loss = 0.73193546
Iteration 20, loss = 0.73075560
Iteration 21, loss = 0.72442978
Iteration 22, loss = 0.72038693
Iteration 23, loss = 0.72492105
Iteration 24, loss = 0.71794780
Iteration 25, loss = 0.71948043
Iteration 26, loss = 0.71357810
Iteration 27, loss = 0.71370776
Iteration 28, loss = 0.71179753
Iteration 29, loss = 0.71263844
Iteration 30, loss = 0.70814405
Iteration 31, loss = 0.71199869
Iteration 32, los

0.7556438821055738

### 2 Layers, More Perceptrons

In [25]:
mlp = MLPClassifier(hidden_layer_sizes=(20,20,), verbose = True)
mlp.fit(X, Y)
mlp.score(X, Y)

Iteration 1, loss = inf
Iteration 2, loss = 0.85621906
Iteration 3, loss = 0.82228908
Iteration 4, loss = 0.80131270
Iteration 5, loss = 0.79005083
Iteration 6, loss = 0.78021699
Iteration 7, loss = 0.78740904
Iteration 8, loss = 0.77334643
Iteration 9, loss = 0.76537197
Iteration 10, loss = 0.76120634
Iteration 11, loss = 0.76238778
Iteration 12, loss = 0.75702284
Iteration 13, loss = 0.76321319
Iteration 14, loss = 0.75194733
Iteration 15, loss = 0.74627728
Iteration 16, loss = 0.74085316
Iteration 17, loss = 0.73213802
Iteration 18, loss = 0.73294486
Iteration 19, loss = 0.72259657
Iteration 20, loss = 0.71906043
Iteration 21, loss = 0.71592592
Iteration 22, loss = 0.71680231
Iteration 23, loss = 0.70778870
Iteration 24, loss = 0.70593538
Iteration 25, loss = 0.70360429
Iteration 26, loss = 0.70207057
Iteration 27, loss = 0.69479772
Iteration 28, loss = 0.69431287
Iteration 29, loss = 0.68837433
Iteration 30, loss = 0.68122720
Iteration 31, loss = 0.67662043
Iteration 32, loss = 0.6

0.7805894721310266

### 3 Layers, More Perceptrons

In [26]:
mlp = MLPClassifier(hidden_layer_sizes=(20,20,20,), verbose = True)
mlp.fit(X, Y)
mlp.score(X, Y)

Iteration 1, loss = 1.28426919
Iteration 2, loss = 0.88728695
Iteration 3, loss = 0.83414410
Iteration 4, loss = 0.80071046
Iteration 5, loss = 0.78597315
Iteration 6, loss = 0.76914400
Iteration 7, loss = 0.76570145
Iteration 8, loss = 0.75592142
Iteration 9, loss = 0.74892839
Iteration 10, loss = 0.74265261
Iteration 11, loss = 0.73573834
Iteration 12, loss = 0.73525688
Iteration 13, loss = 0.72574235
Iteration 14, loss = 0.72000486
Iteration 15, loss = 0.71950432
Iteration 16, loss = 0.71330884
Iteration 17, loss = 0.71347114
Iteration 18, loss = 0.70928905
Iteration 19, loss = 0.71026392
Iteration 20, loss = 0.70676800
Iteration 21, loss = 0.70295303
Iteration 22, loss = 0.69927120
Iteration 23, loss = 0.69373629
Iteration 24, loss = 0.69250684
Iteration 25, loss = 0.69454726
Iteration 26, loss = 0.69115561
Iteration 27, loss = 0.68990613
Iteration 28, loss = 0.68930042
Iteration 29, loss = 0.68385839
Iteration 30, loss = 0.68370048
Iteration 31, loss = 0.68608566
Iteration 32, los



0.7792522778413074

In [27]:
X.head(10)

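| | URL | ThumbnailURL | Height (cm) | Width (cm) | YearAcquired | Gender_() | Gender_(Female) | Gender_(Male) | Gender_(male) | `Gender_\(multiple_persons\)` | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | 48.6 | 168.9 | 1996 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | True | True | 40.6401 | 29.8451 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | True | True | 34.3 | 31.8 | 1997 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | True | True | 50.8 | 50.8 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | True | True | 38.4 | 19.1 | 1997 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | True | True | 35.6 | 45.7 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | True | True | 35.6 | 45.7 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | True | True | 35.6 | 45.7 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | True | True | 35.6 | 45.7 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | True | True | 35.6 | 45.7 | 1995 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |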


### Using the artist information

In [28]:
X1 = pd.concat([X,artists], axis = 1)

In [30]:
from sklearn.decomposition import PCA
pca = PCA(50)
X_pca = pca.fit_transform(X1)
pca.explained_variance_ratio_

array([6.91741240e-01, 2.40930149e-01, 6.69276966e-02, 3.87476861e-05,
       3.29094373e-05, 1.81582096e-05, 1.37076820e-05, 1.08981090e-05,
       7.27309060e-06, 5.17886375e-06, 3.85871716e-06, 3.03505990e-06,
       2.92709217e-06, 2.66834956e-06, 2.57851523e-06, 2.52162887e-06,
       2.49663586e-06, 2.46027275e-06, 2.42758822e-06, 2.39638382e-06,
       2.23480898e-06, 2.19615640e-06, 2.13995427e-06, 2.02945316e-06,
       1.99470901e-06, 1.98757104e-06, 1.95397898e-06, 1.89564922e-06,
       1.84593888e-06, 1.77905066e-06, 1.72529371e-06, 1.67000334e-06,
       1.64257365e-06, 1.63667442e-06, 1.61790896e-06, 1.60809049e-06,
       1.56544767e-06, 1.51322709e-06, 1.50361238e-06, 1.43144885e-06,
       1.38842485e-06, 1.36984986e-06, 1.36471800e-06, 1.35572256e-06,
       1.35267998e-06, 1.32727545e-06, 1.31043910e-06, 1.29971192e-06,
       1.26921028e-06, 1.25904878e-06])
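
The ratios above are easier to read as a running total. The sketch below takes the first five values printed above and computes how many components are needed to reach a chosen share of the variance. The 0.99 threshold is an arbitrary choice for illustration.

```python
import numpy as np

# First five explained-variance ratios copied from the output above.
ratios = np.array([
    6.91741240e-01, 2.40930149e-01, 6.69276966e-02,
    3.87476861e-05, 3.29094373e-05,
])

# Running total of variance explained as components are added.
cumulative = np.cumsum(ratios)

# Number of components needed to explain at least 99% of the variance.
n_components_99 = int(np.searchsorted(cumulative, 0.99) + 1)

print(cumulative)
print(n_components_99)  # 3
```

The first three components already account for more than 99.9% of the variance, so the remaining components each contribute very little.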

In [31]:
mlp = MLPClassifier(hidden_layer_sizes=(10,), verbose = True)
mlp.fit(X_pca, Y)
mlp.score(X_pca, Y)

Iteration 1, loss = 1.50051399
Iteration 2, loss = 0.86276146
Iteration 3, loss = 0.80094561
Iteration 4, loss = 0.77063425
Iteration 5, loss = 0.75193907
Iteration 6, loss = 0.73978572
Iteration 7, loss = 0.73101689
Iteration 8, loss = 0.72381304
Iteration 9, loss = 0.71946109
Iteration 10, loss = 0.71379571
Iteration 11, loss = 0.70870024
Iteration 12, loss = 0.70603491
Iteration 13, loss = 0.70293419
Iteration 14, loss = 0.69938529
Iteration 15, loss = 0.69849421
Iteration 16, loss = 0.69535451
Iteration 17, loss = 0.69325660
Iteration 18, loss = 0.69169465
Iteration 19, loss = 0.68851283
Iteration 20, loss = 0.68694737
Iteration 21, loss = 0.68503536
Iteration 22, loss = 0.68271993
Iteration 23, loss = 0.68150787
Iteration 24, loss = 0.67854041
Iteration 25, loss = 0.67678390
Iteration 26, loss = 0.67511541
Iteration 27, loss = 0.67411533
Iteration 28, loss = 0.67212419
Iteration 29, loss = 0.67102785
Iteration 30, loss = 0.66976895
Iteration 31, loss = 0.66689885
Iteration 32, los



0.7545095724667085

### 2 Layers, More Perceptrons, Including Artist Data

In [32]:
mlp = MLPClassifier(hidden_layer_sizes=(20,20,), verbose = True)
mlp.fit(X_pca, Y)
mlp.score(X_pca, Y)

Iteration 1, loss = 1.15142792
Iteration 2, loss = 0.81316474
Iteration 3, loss = 0.73655320
Iteration 4, loss = 0.70101260
Iteration 5, loss = 0.68042556
Iteration 6, loss = 0.66727576
Iteration 7, loss = 0.65887031
Iteration 8, loss = 0.64478767
Iteration 9, loss = 0.64256025
Iteration 10, loss = 0.63611607
Iteration 11, loss = 0.63074550
Iteration 12, loss = 0.62872219
Iteration 13, loss = 0.62631003
Iteration 14, loss = 0.62141974
Iteration 15, loss = 0.62206599
Iteration 16, loss = 0.61989875
Iteration 17, loss = 0.61572853
Iteration 18, loss = 0.61121484
Iteration 19, loss = 0.61105599
Iteration 20, loss = 0.61161246
Iteration 21, loss = 0.60699019
Iteration 22, loss = 0.60635500
Iteration 23, loss = 0.60564889
Iteration 24, loss = 0.60626345
Iteration 25, loss = 0.60271948
Iteration 26, loss = 0.60104375
Iteration 27, loss = 0.60061143
Iteration 28, loss = 0.59950138
Iteration 29, loss = 0.59742964
Iteration 30, loss = 0.59623902
Iteration 31, loss = 0.59949427
Iteration 32, los

0.7981758825482312

### 2 Layers, *MORE* Perceptrons, Including Artist Data

In [33]:
mlp = MLPClassifier(hidden_layer_sizes=(30,30,), verbose = True)
mlp.fit(X_pca, Y)
mlp.score(X_pca, Y)

Iteration 1, loss = 1.07699386
Iteration 2, loss = 0.75939570
Iteration 3, loss = 0.70583556
Iteration 4, loss = 0.67826211
Iteration 5, loss = 0.66524288
Iteration 6, loss = 0.64996889
Iteration 7, loss = 0.64156684
Iteration 8, loss = 0.63621414
Iteration 9, loss = 0.62778051
Iteration 10, loss = 0.61897602
Iteration 11, loss = 0.61761207
Iteration 12, loss = 0.60828298
Iteration 13, loss = 0.60806025
Iteration 14, loss = 0.60433983
Iteration 15, loss = 0.59994553
Iteration 16, loss = 0.59685591
Iteration 17, loss = 0.59355675
Iteration 18, loss = 0.58918191
Iteration 19, loss = 0.58549489
Iteration 20, loss = 0.58834710
Iteration 21, loss = 0.58412537
Iteration 22, loss = 0.57849519
Iteration 23, loss = 0.58094539
Iteration 24, loss = 0.57672195
Iteration 25, loss = 0.57501021
Iteration 26, loss = 0.56867973
Iteration 27, loss = 0.56746688
Iteration 28, loss = 0.56822108
Iteration 29, loss = 0.56501816
Iteration 30, loss = 0.56337541
Iteration 31, loss = 0.55945153
Iteration 32, los



0.8246246633959202

### 2 Layers, 100 Perceptrons Each, Less PCA Data (5 Principal Components)

In [34]:
mlp = MLPClassifier(hidden_layer_sizes=(100,100,), verbose = True)
mlp.fit(X_pca[:,:5], Y)
mlp.score(X_pca[:,:5], Y)

Iteration 1, loss = 0.96547785
Iteration 2, loss = 0.87747630
Iteration 3, loss = inf
Iteration 4, loss = 0.79732416
Iteration 5, loss = 0.77103687
Iteration 6, loss = 0.76181924
Iteration 7, loss = 0.75028375
Iteration 8, loss = 0.73690792
Iteration 9, loss = 0.72838240
Iteration 10, loss = 0.72086926
Iteration 11, loss = 0.71974732
Iteration 12, loss = 0.70599767
Iteration 13, loss = 0.70169025
Iteration 14, loss = 0.69878738
Iteration 15, loss = 0.69311944
Iteration 16, loss = 0.68762907
Iteration 17, loss = 0.67916909
Iteration 18, loss = 0.67311681
Iteration 19, loss = 0.67449041
Iteration 20, loss = 0.66967429
Iteration 21, loss = 0.66570991
Iteration 22, loss = 0.65922300
Iteration 23, loss = 0.65800984
Iteration 24, loss = 0.65392023
Iteration 25, loss = 0.65072840
Iteration 26, loss = 0.64871512
Iteration 27, loss = 0.64738533
Iteration 28, loss = 0.64390346
Iteration 29, loss = 0.63973187
Iteration 30, loss = 0.63859779
Iteration 31, loss = 0.63275697
Iteration 32, loss = 0.6

0.8035523257958611

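It seems that past a certain threshold of perceptrons, the model does not improve its predictive power much more. The sweet spot for this data set seemed to be 2 layers with 30 perceptrons each. Adding more layers past 2 also did not seem to improve the model's performance. Note that every score in this drill is accuracy on the training data, so these comparisons do not show how well each model generalizes.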
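
To check whether a larger network actually generalizes better, the comparison could be repeated on data the model has not seen. The sketch below holds out a test set and records both training and test accuracy for a few layer structures. It runs on synthetic data, and the layer sizes and split fraction are assumptions for illustration rather than results for the artworks dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Hold out 25% of the rows for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

results = {}
for layers in [(10,), (30, 30), (100, 100)]:
    model = MLPClassifier(hidden_layer_sizes=layers, max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for each structure.
    results[layers] = (model.score(X_train, y_train), model.score(X_test, y_test))

for layers, (train_acc, test_acc) in results.items():
    print(layers, round(train_acc, 3), round(test_acc, 3))
```

A widening gap between the two numbers as the network grows would be a sign of overfitting rather than a real improvement.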