In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
%matplotlib inline

We're ready to build our first neural network. We will feed multiple features into our model, each of which passes through a set of perceptrons to produce a response that is trained against our output.

Like many models we've covered, this can be used as either a regression or a classification model.
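
As a minimal sketch of that last point, the cell below fits both flavors on synthetic data. The data, the variable names (`X_demo`, `y_class`, `y_reg`) and the layer size are assumptions made up for illustration; they are not part of the artworks example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Synthetic data (an assumption for illustration): 200 rows, 3 features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_class = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # categorical target
y_reg = 2.0 * X_demo[:, 0] + X_demo[:, 2]                # continuous target

# Same kind of network, two different kinds of output.
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
clf.fit(X_demo, y_class)

reg = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
reg.fit(X_demo, y_reg)

print(clf.predict(X_demo[:5]))  # class labels
print(reg.predict(X_demo[:5]))  # real-valued predictions
```

The classifier returns one of the labels it saw during training, while the regressor returns a real number for each row.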

First, we need to load our dataset. For this example we'll use The Museum of Modern Art in New York's [public dataset](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv) on their collection.

In [25]:
artworks = pd.read_csv('art_data.csv')
artworks.Medium.unique()

array(['Ink and cut-and-pasted painted pages on paper',
       'Paint and colored pencil on print',
       'Graphite, pen, color pencil, ink, and gouache on tracing paper',
       ...,
       '16 notes: Ink, pencil, and colored pencil on paper (one with pin)',
       '35mm slides', 'Floppy disk'], dtype=object)

In [3]:
artworks.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

In [4]:
artworks.iloc[:, :12].head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,Dimensions,CreditLine
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,"19 1/8 x 66 1/2"" (48.6 x 168.9 cm)",Fractional and promised gift of Jo Carole and ...
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,"16 x 11 3/4"" (40.6 x 29.8 cm)",Gift of the architect in honor of Lily Auchinc...


In [5]:
artworks.iloc[:, 12:].head(4)

Unnamed: 0,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,885.1996,Architecture,Architecture & Design,1996-04-09,Y,2.0,http://www.moma.org/collection/works/2,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,1.1995,Architecture,Architecture & Design,1995-01-17,Y,3.0,http://www.moma.org/collection/works/3,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,1.1997,Architecture,Architecture & Design,1997-01-15,Y,4.0,http://www.moma.org/collection/works/4,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,2.1995,Architecture,Architecture & Design,1995-01-17,Y,5.0,http://www.moma.org/collection/works/5,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,


In [6]:
artworks.Department.unique()

array(['Architecture & Design', 'Prints & Illustrated Books', 'Drawings',
       'Painting & Sculpture', 'Photography', 'Media and Performance Art',
       'Film', nan, 'Architecture & Design - Image Archive',
       'Fluxus Collection'], dtype=object)

We'll also do a bit of data processing and cleaning, selecting columns of interest and converting URLs to booleans indicating whether they are present.

In [7]:
# Select columns. Copy so the assignments below act on a new frame rather than a view.
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']].copy()

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

In [8]:
artworks.head()

Unnamed: 0,Artist,Nationality,Gender,Date,Department,DateAcquired,URL,ThumbnailURL,Height (cm),Width (cm)
0,Otto Wagner,(Austrian),(Male),1896,Architecture & Design,1996-04-09,True,True,48.6,168.9
1,Christian de Portzamparc,(French),(Male),1987,Architecture & Design,1995-01-17,True,True,40.6401,29.8451
2,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,34.3,31.8
3,Bernard Tschumi,(),(Male),1980,Architecture & Design,1995-01-17,True,True,50.8,50.8
4,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,38.4,19.1


## Building a Model

Now, let's see if we can use a multi-layer perceptron (or "MLP") to classify the department a piece should go into, using everything but the department name.

Before we import MLP from SKLearn and establish the model, we first have to ensure correct typing for our data and do some other cleaning.

In [9]:
# Get data types.
artworks.dtypes

Artist           object
Nationality      object
Gender           object
Date             object
Department       object
DateAcquired     object
URL                bool
ThumbnailURL       bool
Height (cm)     float64
Width (cm)      float64
dtype: object

The `DateAcquired` column is an object. Let's transform that to a datetime object and add a feature for just the year the artwork was acquired.

In [10]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

dtype('int64')

Great. Let's do some more miscellaneous cleaning.

In [11]:
# Remove multiple nationalities, genders, and artists.
# Raw strings keep the regex escapes intact; the replacement labels are plain text.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(multiple_persons)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(multiple_nationalities)'
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

# Convert dates to start date, cutting down number of distinct examples.
artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

# Final column drops.
X = artworks.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

# Create dummies separately.
artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

# Concatenate the nationality and date dummies with the other variables (artist dummies are left out).
X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department

In [41]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single hidden layer of 1000 perceptrons.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [42]:
mlp.score(X, Y)

0.44894780126747619

In [None]:
Y.value_counts()/len(Y)

Prints & Illustrated Books    0.523371
Photography                   0.225514
Architecture & Design         0.112225
Drawings                      0.104233
Painting & Sculpture          0.034657
Name: Department, dtype: float64

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)



We got a lot of information from all of this. First, the model seems to overfit, though some performance remains when it is validated with cross validation. This is typical of neural networks that aren't given enough data for the number of features present. _Neural networks, in general, like_ a lot _of data_. You may also have noticed something else about neural networks: _they can take a_ long _time to run_. Try increasing the layer size by adding a zero. Feel free to interrupt the kernel if you don't have time...

Also note that we created dummy variables for the artists' names but left them out of the model. Both of the points above are the reason: including them would take much longer to run and would be much more prone to overfitting.
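
One common way to check for overfitting is to compare the score on the rows the model was trained on with the score on rows it has never seen. The sketch below does this on synthetic data whose labels are pure noise, so any training accuracy above chance is memorization. The data and the names `X_demo`, `y_demo` and `mlp_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data (an assumption): many features, few rows, random labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 50))
y_demo = rng.randint(0, 2, size=300)

# Hold out 30% of the rows so the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

mlp_demo = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp_demo.fit(X_train, y_train)

train_score = mlp_demo.score(X_train, y_train)  # accuracy on seen rows
test_score = mlp_demo.score(X_test, y_test)     # accuracy on unseen rows
print(train_score, test_score)
```

A large gap between the two numbers suggests the network is memorizing its training rows rather than learning something that generalizes.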

## Model parameters

Now, before we move on and let you loose with some tasks to work on the model, let's go over the parameters.

We included one parameter: hidden layer size. Recall from the previous lesson that a neural network is organized into layers. This parameter sets how many hidden layers there are and how wide each one is. Pass in a tuple that specifies each layer's size. Our network has one hidden layer that is 1000 neurons wide. `(100, 4)` would create a network with two hidden layers, one 100 wide and the other 4.
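
To see what the tuple does, you can inspect the weight matrices of a fitted model. This is a small sketch on synthetic data (8 made-up features and 3 made-up classes, both assumptions for illustration); `coefs_` and `n_layers_` are attributes of a fitted `MLPClassifier`.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data (an assumption): 8 input features, 3 classes.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.randint(0, 3, size=100)

# Two hidden layers: 100 neurons, then 4 neurons.
# max_iter is kept low because only the shapes matter here.
net = MLPClassifier(hidden_layer_sizes=(100, 4), max_iter=50, random_state=0)
net.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers.
shapes = [w.shape for w in net.coefs_]
print(shapes)         # [(8, 100), (100, 4), (4, 3)]
print(net.n_layers_)  # 4: input, two hidden layers, output
```

Each matrix maps one layer to the next, so the first dimension of the first matrix matches the number of input features and the last dimension of the last matrix matches the number of classes.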

How many layers to include is determined by two things: computational resources and cross validation searching for convergence. The number of layers is generally smaller than the number of input variables you have.

You can also set an alpha. Neural networks like this use a regularization parameter that penalizes large coefficients just like we discussed in the advanced regression section. Alpha scales that penalty.
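
As a rough illustration of the effect, the sketch below fits the same small network twice on synthetic data, once with the default `alpha` and once with a deliberately large one, and compares the total squared weight. The data, the helper name `total_squared_weight` and the chosen `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data (an assumption): 5 features, a simple linear boundary.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

def total_squared_weight(alpha):
    """Fit a small network and return the sum of its squared weights."""
    net = MLPClassifier(hidden_layer_sizes=(20,), alpha=alpha,
                        max_iter=500, random_state=0)
    net.fit(X_demo, y_demo)
    return sum(float((w ** 2).sum()) for w in net.coefs_)

weak_penalty = total_squared_weight(alpha=1e-4)   # scikit-learn's default
strong_penalty = total_squared_weight(alpha=50.0) # deliberately heavy penalty
print(weak_penalty, strong_penalty)
```

The heavily penalized network ends up with much smaller weights, which is the same shrinkage behavior you would expect from a regularized regression.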

Lastly, we'll discuss the activation function. The activation function determines how the weighted input to an individual perceptron is turned into its output, for example whether that output is binary or continuous. The default is 'relu' (the rectified linear unit function), which outputs zero for negative inputs and the input itself for positive inputs, so it is continuous rather than binary. In the exercise we went through earlier we used a binary function, but we discussed the _sigmoid_ as a reasonable alternative. The _sigmoid_ (called 'logistic' by SKLearn because it's a 'logistic sigmoid function') produces continuous values between 0 and 1, which allows for a more nuanced model. It does come at the cost of increased computational complexity.
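
To make the difference between these functions concrete, here is a NumPy sketch of a binary step function, the rectified linear unit and the logistic sigmoid applied to the same inputs. The function names and the sample values in `z` are chosen for illustration; scikit-learn applies its own internal implementations.

```python
import numpy as np

def step(z):
    """Binary step: 1 where the input is positive, otherwise 0."""
    return (z > 0).astype(float)

def relu(z):
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return np.maximum(0.0, z)

def logistic(z):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
print(relu(z).tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(logistic(z))       # values strictly between 0 and 1; 0.5 at z = 0
```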

If you want to learn more about these, study [activation functions](https://en.wikipedia.org/wiki/Activation_function) and [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron). The [Deep Learning](http://www.deeplearningbook.org/) book referenced earlier goes into great detail on the linear algebra involved.

You could also just test the models with cross validation. Unless neural networks are your specialty, cross validation should be sufficient.
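
One way to run that kind of test is a small grid search, which cross-validates every combination of the parameters you list. The sketch below uses synthetic data from `make_classification`, and the grid itself is an arbitrary assumption chosen to run quickly, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0)

# Candidate settings to compare; every combination is cross-validated.
param_grid = {
    'hidden_layer_sizes': [(10,), (50,), (20, 10)],
    'activation': ['relu', 'logistic'],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best setting
```

On real data each fit is far slower than it is here, so it is worth keeping the grid small or searching on a subset first.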

For the other parameters and their defaults, check out the [MLPClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier).

## Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well. It may also be beneficial to do some real feature selection work...
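
If you do work on a subset, one option is to draw it with stratification so that the class proportions stay close to those of the full data. The sketch below uses `train_test_split` for this on a made-up frame; the column names, class labels and the 10% fraction are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for a large, imbalanced dataset (an assumption).
rng = np.random.RandomState(0)
X_full = pd.DataFrame(rng.normal(size=(10000, 4)), columns=list('abcd'))
y_full = pd.Series(rng.choice(['Prints', 'Photography', 'Drawings'],
                              size=10000, p=[0.6, 0.3, 0.1]))

# Keep 10% of the rows, preserving the class proportions of y_full.
X_small, _, y_small, _ = train_test_split(
    X_full, y_full, train_size=0.1, stratify=y_full, random_state=0)

print(len(X_small))  # 1000
print(y_small.value_counts(normalize=True).round(2))
```

Because the rare classes keep their share of the rows, scores on the subset are easier to compare with scores on the full data.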

## For Results Comparison
Refer to the [unaltered copy of the original notebook](https://github.com/AlliedToasters/neural_nets/blob/master/original_notebook.ipynb).

In [12]:
#I read that scaling helps these models...
scl = StandardScaler()
X0 = scl.fit_transform(X)

In [16]:
mlp = MLPClassifier(
    hidden_layer_sizes=(1000,),
    verbose=True,
    early_stopping=True,
    tol=0,
)
mlp.fit(X0, Y)

Iteration 1, loss = 0.83011816
Validation score: 0.728715
Iteration 2, loss = 0.69326821
Validation score: 0.752128
Iteration 3, loss = 0.64082192
Validation score: 0.766351
Iteration 4, loss = 0.60608394
Validation score: 0.767802
Iteration 5, loss = 0.58286666
Validation score: 0.764996
Iteration 6, loss = 0.56315216
Validation score: 0.776316
Iteration 7, loss = 0.54932118
Validation score: 0.784636
Iteration 8, loss = 0.53341563
Validation score: 0.783475
Iteration 9, loss = 0.51995967
Validation score: 0.789377
Iteration 10, loss = 0.51408449
Validation score: 0.792086
Iteration 11, loss = 0.50266070
Validation score: 0.793247
Iteration 12, loss = 0.49475792
Validation score: 0.795472
Iteration 13, loss = 0.48261984
Validation score: 0.796923
Iteration 14, loss = 0.47917479
Validation score: 0.796923
Iteration 15, loss = 0.47295058
Validation score: 0.797891
Iteration 16, loss = 0.46498792
Validation score: 0.791989
Iteration 17, loss = 0.45742647
Validation score: 0.799342
Iterat

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(1000,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0, validation_fraction=0.1,
       verbose=True, warm_start=False)

In [17]:
#Just scaling the data helped a LOT.
mlp.score(X0, Y)

0.84898650283005173

In [18]:
#Now let's play with the layers...
mlp2 = MLPClassifier(
    hidden_layer_sizes=(100, 10), #2 layers deep, less wide for computation time,
    verbose=True,
    early_stopping=True,
    tol=0,
)
mlp2.fit(X0, Y)

Iteration 1, loss = 0.96080836
Validation score: 0.681889
Iteration 2, loss = 0.75073401
Validation score: 0.716622
Iteration 3, loss = 0.69546133
Validation score: 0.737229
Iteration 4, loss = 0.66093641
Validation score: 0.742260
Iteration 5, loss = 0.63639176
Validation score: 0.749903
Iteration 6, loss = 0.61744391
Validation score: 0.757933
Iteration 7, loss = 0.60064089
Validation score: 0.764222
Iteration 8, loss = 0.58705073
Validation score: 0.764803
Iteration 9, loss = 0.57539812
Validation score: 0.767899
Iteration 10, loss = 0.56473560
Validation score: 0.766351
Iteration 11, loss = 0.55688613
Validation score: 0.775348
Iteration 12, loss = 0.54916823
Validation score: 0.779992
Iteration 13, loss = 0.54055924
Validation score: 0.780089
Iteration 14, loss = 0.53313415
Validation score: 0.781734
Iteration 15, loss = 0.52717381
Validation score: 0.783862
Iteration 16, loss = 0.52275895
Validation score: 0.777090
Iteration 17, loss = 0.51764218
Validation score: 0.785797
Iterat

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0, validation_fraction=0.1,
       verbose=True, warm_start=False)

In [19]:
#This arrangement did not improve performance, but didn't hurt much either.
#Here's another:
mlp3 = MLPClassifier(
    hidden_layer_sizes=(10, 100,),
    verbose=True,
    early_stopping=True,
    tol=0,
)
mlp3.fit(X0, Y)

Iteration 1, loss = 1.08202123
Validation score: 0.657217
Iteration 2, loss = 0.85859361
Validation score: 0.689048
Iteration 3, loss = 0.80682748
Validation score: 0.699884
Iteration 4, loss = 0.78158436
Validation score: 0.703173
Iteration 5, loss = 0.76585782
Validation score: 0.717299
Iteration 6, loss = 0.75269529
Validation score: 0.717589
Iteration 7, loss = 0.73999283
Validation score: 0.721943
Iteration 8, loss = 0.73074138
Validation score: 0.725522
Iteration 9, loss = 0.72250554
Validation score: 0.729876
Iteration 10, loss = 0.71649833
Validation score: 0.729973
Iteration 11, loss = 0.71143018
Validation score: 0.733359
Iteration 12, loss = 0.70559037
Validation score: 0.736745
Iteration 13, loss = 0.70145340
Validation score: 0.736745
Iteration 14, loss = 0.69720173
Validation score: 0.737713
Iteration 15, loss = 0.69373687
Validation score: 0.738100
Iteration 16, loss = 0.69054842
Validation score: 0.740132
Iteration 17, loss = 0.68710172
Validation score: 0.745066
Iterat

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(10, 100), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0, validation_fraction=0.1,
       verbose=True, warm_start=False)

In [20]:
#Oof, even worse. Finally:
mlp4 = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100, 100, 100), #Going deep!
    verbose=True,
    early_stopping=True,
    tol=0,
)
mlp4.fit(X0, Y)

Iteration 1, loss = 0.86917642
Validation score: 0.715654
Iteration 2, loss = 0.67586654
Validation score: 0.747098
Iteration 3, loss = 0.61413401
Validation score: 0.766544
Iteration 4, loss = 0.57302312
Validation score: 0.779218
Iteration 5, loss = 0.54044572
Validation score: 0.784926
Iteration 6, loss = 0.51911872
Validation score: 0.798858
Iteration 7, loss = 0.49800599
Validation score: 0.797407
Iteration 8, loss = 0.47987704
Validation score: 0.804276
Iteration 9, loss = 0.46611884
Validation score: 0.808050
Iteration 10, loss = 0.45172131
Validation score: 0.811533
Iteration 11, loss = 0.44209085
Validation score: 0.809114
Iteration 12, loss = 0.42979313
Validation score: 0.812887
Iteration 13, loss = 0.42170982
Validation score: 0.809888
Iteration 14, loss = 0.40963363
Validation score: 0.819756
Iteration 15, loss = 0.40233513
Validation score: 0.818982
Iteration 16, loss = 0.39200376
Validation score: 0.823046
Iteration 17, loss = 0.38480053
Validation score: 0.821401
Iterat

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100, 100, 100, 100, 100),
       learning_rate='constant', learning_rate_init=0.001, max_iter=200,
       momentum=0.9, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0,
       validation_fraction=0.1, verbose=True, warm_start=False)

In [21]:
#This iteration has done the best (Note this score includes training data)
mlp4.score(X0, Y)

0.87463596342702332

In [35]:
X['Width (cm)'].loc[0]

168.90000000000001

In [39]:
dims = list(artworks.loc[X.index].columns[-9:])
dims.remove('Height (cm)')
dims.remove('Width (cm)')
for dim in dims:
    X[dim] = np.where(artworks[dim].loc[X.index].notnull(), artworks[dim].loc[X.index], 0)

In [40]:
scl5 = StandardScaler()
X5 = scl5.fit_transform(X)

In [49]:
mlp5 = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100, 100, 100), #Same everything
    verbose=True,
    early_stopping=True,
    tol=0,
    random_state=123,
    learning_rate_init=.0005,
    alpha=.001
)
mlp5.fit(X5, Y)
mlp5.score(X5, Y)

Iteration 1, loss = 0.93129795
Validation score: 0.720491
Iteration 2, loss = 0.69039451
Validation score: 0.752612
Iteration 3, loss = 0.62663965
Validation score: 0.761997
Iteration 4, loss = 0.58841403
Validation score: 0.788796
Iteration 5, loss = 0.55665391
Validation score: 0.790925
Iteration 6, loss = 0.53339074
Validation score: 0.795762
Iteration 7, loss = 0.51262899
Validation score: 0.805824
Iteration 8, loss = 0.49630070
Validation score: 0.805534
Iteration 9, loss = 0.48130418
Validation score: 0.809211
Iteration 10, loss = 0.46812858
Validation score: 0.814338
Iteration 11, loss = 0.45500033
Validation score: 0.816080
Iteration 12, loss = 0.44581472
Validation score: 0.817337
Iteration 13, loss = 0.43663355
Validation score: 0.825271
Iteration 14, loss = 0.42586110
Validation score: 0.814435
Iteration 15, loss = 0.41795237
Validation score: 0.824207
Iteration 16, loss = 0.41001509
Validation score: 0.820143
Validation score did not improve more than tol=0.000000 for two c

0.84981858642542696

In [46]:
mlp5.score(X5, Y)

0.8814861400029026

## Results
Without adding or removing any features, I was able to massively improve performance simply by scaling the inputs. I got some further improvement by increasing the depth to 5 layers, each 100 neurons wide.
Due to limited computation time and power, I was unable to run a full k-fold validation on each model. However, SKLearn's built-in holdout validation group indicates ~82% accuracy on the validation set.
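
For anyone with the time to run it, a k-fold check could look like the sketch below. Wrapping the scaler and the network in a pipeline means the scaler is refit on the training portion of each fold, so no information from the held-out rows leaks into the scaling. The synthetic data and the small network here are assumptions for illustration; this was not run on the artworks data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption), with one feature on a much larger scale.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=3, random_state=0)
X_demo[:, 0] *= 1000.0

# Scaling happens inside the pipeline, separately for every fold.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=3)
print(scores)
print(scores.mean())
```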
## Other Attempts
I also tried changing the features, but I got little traction with this approach. That work is shown below:

In [9]:
art = pd.read_csv('art_data.csv')
art = art[art.Department.notnull()]
# Convert URL's to booleans.
art['URL'] = art['URL'].notnull()
art['ThumbnailURL'] = art['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
art = art[art['Department']!='Film']
art = art[art['Department']!='Media and Performance Art']
art = art[art['Department']!='Fluxus Collection']


#Look at unique departments.
art.Department.unique()

scl1 = StandardScaler()

In [10]:
#Which columns matter most?
art.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

In [11]:
#I say, "Look at the art." The physical dimensions give the most information.
physical = list(art.columns[-9:])
physical

['Circumference (cm)',
 'Depth (cm)',
 'Diameter (cm)',
 'Height (cm)',
 'Length (cm)',
 'Weight (kg)',
 'Width (cm)',
 'Seat Height (cm)',
 'Duration (sec.)']

In [12]:
#Let's try just using these features.
Y = art.Department
X = pd.DataFrame(index=art.index, columns=physical)
for phys in physical:
    # Fill missing measurements with 0 (a number, not the boolean False).
    X[phys] = np.where(art[phys].notnull(), art[phys], 0)

print(X.head())
X1 = scl1.fit_transform(X)

   Circumference (cm)  Depth (cm)  Diameter (cm)  Height (cm)  Length (cm)  \
0                 0.0         0.0            0.0      48.6000          0.0   
1                 0.0         0.0            0.0      40.6401          0.0   
2                 0.0         0.0            0.0      34.3000          0.0   
3                 0.0         0.0            0.0      50.8000          0.0   
4                 0.0         0.0            0.0      38.4000          0.0   

   Weight (kg)  Width (cm)  Seat Height (cm)  Duration (sec.)  
0          0.0    168.9000               0.0              0.0  
1          0.0     29.8451               0.0              0.0  
2          0.0     31.8000               0.0              0.0  
3          0.0     50.8000               0.0              0.0  
4          0.0     19.1000               0.0              0.0  


In [8]:
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 10,), 
    verbose=True,
    early_stopping=True,
    validation_fraction=.1,
    learning_rate_init=.01,
    tol=0
)
mlp.fit(X1, Y)
mlp.score(X1, Y) #Not great. Adding more features.

Iteration 1, loss = 1.16031055
Validation score: 0.542477
Iteration 2, loss = 1.12087840
Validation score: 0.561517
Iteration 3, loss = 1.11065476
Validation score: 0.555547
Iteration 4, loss = 1.10459042
Validation score: 0.565309
Iteration 5, loss = 1.09814529
Validation score: 0.553288
Iteration 6, loss = 1.09408662
Validation score: 0.567406
Iteration 7, loss = 1.08956612
Validation score: 0.566438
Iteration 8, loss = 1.09074215
Validation score: 0.565147
Iteration 9, loss = 1.08529991
Validation score: 0.566519
Validation score did not improve more than tol=0.000000 for two consecutive epochs. Stopping.


0.57101594268379263

In [40]:
mlp.score(X, Y) #Not great. Add more features.

0.58446556509391334

In [14]:
#Add year
X['Date'] = pd.Series(art.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

X.Date = np.where(X.Date.isnull(), 0, X.Date)
scl2 = StandardScaler()
X2 = scl2.fit_transform(X)

In [26]:
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 10, 100), 
    verbose=True,
    early_stopping=True,
    validation_fraction=.1,
    learning_rate_init=.001,
    tol=0,
    #solver = 'sgd',
    #activation = 'logistic',
    #learning_rate='adaptive',
    random_state=123
)
mlp.fit(X2, Y)
mlp.score(X2, Y)

Iteration 1, loss = 1.19715280
Validation score: 0.537878
Iteration 2, loss = 1.07707870
Validation score: 0.588463
Iteration 3, loss = 1.03347589
Validation score: 0.608713
Iteration 4, loss = 1.00941385
Validation score: 0.624203
Iteration 5, loss = 0.99472871
Validation score: 0.619201
Iteration 6, loss = 0.98631170
Validation score: 0.625978
Iteration 7, loss = 0.97610923
Validation score: 0.628883
Iteration 8, loss = 0.97021285
Validation score: 0.626543
Iteration 9, loss = 0.96496611
Validation score: 0.629044
Iteration 10, loss = 0.96061902
Validation score: 0.627350
Iteration 11, loss = 0.95571112
Validation score: 0.633643
Iteration 12, loss = 0.95217397
Validation score: 0.633885
Iteration 13, loss = 0.94881659
Validation score: 0.634853
Iteration 14, loss = 0.94750888
Validation score: 0.638322
Iteration 15, loss = 0.94347941
Validation score: 0.640258
Iteration 16, loss = 0.94024984
Validation score: 0.637676
Iteration 17, loss = 0.93729695
Validation score: 0.641468
Iterat

0.65622377848060409

In [29]:
cross_val_score(mlp, X2, Y, cv=5)

Iteration 1, loss = 1.25158789
Validation score: 0.542759
Iteration 2, loss = 1.11810827
Validation score: 0.567971
Iteration 3, loss = 1.07884865
Validation score: 0.588544
Iteration 4, loss = 1.05404848
Validation score: 0.595704
Iteration 5, loss = 1.03570319
Validation score: 0.598628
Iteration 6, loss = 1.02289195
Validation score: 0.595099
Iteration 7, loss = 1.01330919
Validation score: 0.609419
Iteration 8, loss = 1.00579035
Validation score: 0.614966
Iteration 9, loss = 0.99766252
Validation score: 0.609520
Iteration 10, loss = 0.99147373
Validation score: 0.606495
Iteration 11, loss = 0.98786230
Validation score: 0.619907
Iteration 12, loss = 0.98413329
Validation score: 0.622630
Iteration 13, loss = 0.98106727
Validation score: 0.620311
Iteration 14, loss = 0.97750994
Validation score: 0.617386
Iteration 15, loss = 0.97576583
Validation score: 0.625958
Iteration 16, loss = 0.97299362
Validation score: 0.622227
Iteration 17, loss = 0.96976023
Validation score: 0.623941
Iterat

array([ 0.56937722,  0.5763211 ,  0.56224786,  0.57092141,  0.58125555])

In [45]:
art.loc[art['Gender'].str.contains('\) \('), 'Gender'] = '\(multiple_persons\)'
art.loc[art['Nationality'].str.contains('\) \('), 'Nationality'] = '\(multiple_nationalities\)'
art.loc[art['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

ValueError: cannot index with vector containing NA / NaN values

In [48]:
art[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']].isnull().any()

TypeError: bad operand type for unary ~: 'list'

In [22]:
art.Medium.unique()

NameError: name 'art' is not defined