# Deep mushroom: model comparison

Author: Luis Bronchal<br>Date: April 27, 2017


## Summary

Several "classic" models fit this dataset well and achieve high accuracy. We compare some of them (*Logistic Regression, KNN, Trees, Naive Bayes, SVM and Random Forest*).

We also implement a neural network with Keras and extract the values of its hidden layer for each input. We use t-SNE to project these values onto a two-dimensional plot to evaluate how well the network transforms raw input data into useful features.

## Analysis

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import math

In [None]:
data = pd.read_csv("../input/mushrooms.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data['stalk-root'].value_counts()

More than 30% of the values of **stalk-root** are missing. That's too much missing data to impute (especially if you are playing with poisonous mushrooms). We are going to remove the feature, losing both some signal and some noise. Let's see if we can achieve good accuracy despite this.

In [None]:
100*len(data.loc[data['stalk-root']=='?']) / sum(data['stalk-root'].value_counts())
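
The same percentage can be read off a boolean mask, since the mean of a mask is the fraction of `True` values. A minimal sketch on a toy column (the values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy column (hypothetical values) that uses '?' as the missing-value marker,
# mirroring how stalk-root encodes missing data.
toy = pd.Series(['b', '?', 'c', '?', 'e', 'b', 'b', '?', 'c', 'b'], name='stalk-root')

# The mean of a boolean mask is the fraction of True values.
missing_pct = 100 * (toy == '?').mean()
print("%.1f%% missing" % missing_pct)
```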

In [None]:
data = data.drop('stalk-root', axis=1)

The variable to predict (class) is well balanced:

In [None]:
data['class'].value_counts()

We prepare the data to be used in the neural network model:

In [None]:
Y = pd.get_dummies(data.iloc[:,0],  drop_first=False)
X = pd.DataFrame()
for each in data.iloc[:,1:].columns:
    dummies = pd.get_dummies(data[each], prefix=each, drop_first=False)
    X = pd.concat([X, dummies], axis=1)
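
As a side note, `pd.get_dummies` can also encode a whole DataFrame in one call, prefixing each new column with the name of its source column. The sketch below compares both approaches on a toy frame (the frame and its values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical columns.
toy = pd.DataFrame({
    'cap-shape': ['x', 'b', 'x', 's'],
    'odor': ['n', 'a', 'n', 'p'],
})

# Column-by-column encoding, in the spirit of the loop above.
looped = pd.concat(
    [pd.get_dummies(toy[col], prefix=col) for col in toy.columns], axis=1
)

# Single call: every categorical column is encoded and prefixed with its name.
direct = pd.get_dummies(toy)

print(list(direct.columns))
print((looped.values == direct.values).all())
```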

## Modeling

### Classic models

We are going to compare the performance of some "classic" machine learning models: *Logistic Regression, KNN, Trees, Naive Bayes, Support Vector Machines and Random Forest*.

We are going to use cross validation and the AUC metric.
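
To make the metric concrete: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal NumPy sketch of that definition follows (`pairwise_auc` is an illustrative helper, not a scikit-learn function, and the labels and scores are toy values):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Matrix of score differences for every (positive, negative) pair.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
auc_example = pairwise_auc(y, s)
print(auc_example)  # 3 of the 4 pairs are ordered correctly -> 0.75
```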

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(probability=True)))
models.append(('RF', RandomForestClassifier()))

In [None]:
from sklearn.model_selection import cross_val_score, KFold

seed = 321

# evaluate each model in turn
results = []
names = []
for name, model in models:
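    # shuffle=True is needed for random_state to have any effect
    # (recent scikit-learn versions raise an error otherwise)
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train.iloc[:,1], cv=kfold, scoring='roc_auc')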
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

The cross-validation shows good performance. Let's see it in a plot:

In [None]:
# Compare Algorithms
fig = plt.figure(figsize=(16, 8))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results, vert=False)
ax.set_yticklabels(names)
plt.show()

Let's see the performance of each model on the test data. They work really well:

In [None]:
from collections import defaultdict
from sklearn.metrics import roc_auc_score

model_predictions = defaultdict()
model_score = defaultdict(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
for name, model in models:
    model.fit(X_train, y_train.iloc[:,1])
    my_pred = model.predict(X_test)
    model_predictions[name] = my_pred
    model_score[name] = roc_auc_score(y_test.iloc[:,1], my_pred)

    msg = "%s: %f" % (name, model_score[name])
    print(msg) 
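
Note that the cell above scores each model with the 0/1 labels returned by `predict`. With hard labels the ROC curve has a single operating point, so the resulting AUC can differ from the one computed on predicted probabilities. A minimal sketch of the difference on synthetic data (`make_classification` and all variable names here are illustrative and not part of the original analysis):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem, used only for illustration.
Xs, ys = make_classification(n_samples=400, n_features=10, random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(Xs, ys, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xs_tr, ys_tr)

# AUC from hard 0/1 predictions (a single threshold).
auc_labels = roc_auc_score(ys_te, clf.predict(Xs_te))
# AUC from the predicted probability of the positive class (the full ranking).
auc_scores = roc_auc_score(ys_te, clf.predict_proba(Xs_te)[:, 1])

print("AUC from labels: %.3f / AUC from probabilities: %.3f" % (auc_labels, auc_scores))
```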

Let's see how correlated the predictions of the different models are. They are highly correlated, although NB is slightly less so.

In [None]:
model_predictions_df = pd.DataFrame(model_predictions)

In [None]:
corrmat = model_predictions_df.corr()
corrmat

In [None]:
sns.heatmap(corrmat)
plt.show()
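
For 0/1 predictions, the Pearson correlation computed by `DataFrame.corr()` is the phi coefficient, which is related to, but not the same as, the share of samples on which two models agree. A toy sketch (the three prediction vectors are made up for illustration):

```python
import pandas as pd

# Hypothetical predictions from three models on eight samples.
toy_preds = pd.DataFrame({
    'A': [1, 0, 1, 1, 0, 0, 1, 0],
    'B': [1, 0, 1, 1, 0, 0, 1, 0],  # identical to A
    'C': [1, 0, 0, 1, 0, 1, 1, 0],  # differs from A on two samples
})

toy_corr = toy_preds.corr()
agreement_ac = (toy_preds['A'] == toy_preds['C']).mean()

print(toy_corr)
print("A and C agree on %.0f%% of the samples" % (100 * agreement_ac))
```

Here A and C agree on 75% of the samples while their correlation is only 0.5, so a correlation below 1 does not translate directly into an error rate.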

### Neural Network model

We are going to build a neural network model with Keras. We'll check its accuracy, but our main objective here is to inspect the values of
the hidden layer and project them onto a two-dimensional plot to see how the neural network separates the different groups by itself.

In [None]:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.model_selection import cross_val_score
from keras import backend as K

seed = 123456 

def create_model():
    model = Sequential()
    model.add(Dense(20, input_dim=X.shape[1], kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(2, activation='softmax'))
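    sgd = SGD(lr=0.01, momentum=0.7, decay=0, nesterov=False)
    # Pass the configured optimizer object; the string 'sgd' would use default settings and ignore the momentum
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])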
    return model

We train the model and get the associated training graphs:

In [None]:
model = create_model()
history = model.fit(X.values, Y.values, validation_split=0.20, epochs=300, batch_size=100, verbose=0)


# list all data in history
print(history.history.keys())
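# summarize history for accuracy
# older Keras versions log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])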
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
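# older Keras versions log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
print("Training accuracy: %.2f%% / Validation accuracy: %.2f%%" %
      (100*history.history[acc_key][-1], 100*history.history['val_' + acc_key][-1]))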

We now obtain the output values of the hidden layer that precedes the output layer:

In [None]:
from keras import backend as K
import numpy as np

layer_of_interest=0
intermediate_tensor_function = K.function([model.layers[0].input],[model.layers[layer_of_interest].output])
intermediate_tensor = intermediate_tensor_function([X.iloc[0,:].values.reshape(1,-1)])[0]

In [None]:
intermediates = []
color_intermediates = []
for i in range(len(X)):
    output_class = np.argmax(Y.iloc[i,:].values)
    intermediate_tensor = intermediate_tensor_function([X.iloc[i,:].values.reshape(1,-1)])[0]
    intermediates.append(intermediate_tensor[0])
    # Column 0 of Y is class 'e' (edible, blue); column 1 is 'p' (poisonous, red)
    if output_class == 0:
        color_intermediates.append("#0000ff")
    else:
        color_intermediates.append("#ff0000")
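
To make explicit what these hidden-layer values are: for a dense ReLU layer, each row of activations is `relu(x @ W + b)`, and the whole input matrix can be processed in one matrix product. The NumPy sketch below uses random stand-ins for the weights (in Keras the trained kernel and bias would come from `model.layers[0].get_weights()`); all sizes and values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 samples, 8 one-hot input columns, 20 hidden units.
n_samples, n_inputs, n_hidden = 5, 8, 20
X_toy = rng.integers(0, 2, size=(n_samples, n_inputs)).astype(float)

# Random stand-ins for the trained kernel and bias of the hidden layer.
W = rng.uniform(-0.05, 0.05, size=(n_inputs, n_hidden))
b = np.zeros(n_hidden)

# Dense layer with ReLU activation, applied to all samples at once.
hidden = np.maximum(0.0, X_toy @ W + b)
print(hidden.shape)
```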

The hidden layer has 20 neurons. We are going to build a t-SNE projection of its activations:

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
intermediates_tsne = tsne.fit_transform(intermediates)

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(x = intermediates_tsne[:,0], y=intermediates_tsne[:,1], color=color_intermediates)
plt.show()
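
As a sanity check of the projection step, the same t-SNE call can be tried on synthetic 20-dimensional data with two well-separated groups. This is only a sketch: the Gaussian clusters and the `perplexity` value are assumptions chosen for a small toy set, not settings from the analysis above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two synthetic clusters in 20 dimensions, standing in for hidden-layer activations.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 20))
cluster_b = rng.normal(loc=6.0, scale=1.0, size=(30, 20))
features = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples; 10 suits this small toy set.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
print(embedding.shape)
```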

## Conclusion

We have obtained a clear picture in which the two classes (poisonous and edible mushrooms) are easy to tell apart.