# Accurate classification with only 9 variables (NN)

We first trained an exploratory **Neural Network** on all 22 variables. We then ran pairwise **Chi-squared** tests to identify the most relevant variables and how they relate to each other.

Finally, we built a second NN that classifies the mushrooms from a subset of 9 variables.

In [None]:
import numpy as np 
import pandas as pd
import random

from keras import layers, optimizers, regularizers
from keras.layers import Dense, Dropout, BatchNormalization, Activation
from keras.models import Sequential

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import preprocessing, model_selection 

In [None]:
data = pd.read_csv("../input/mushrooms.csv")
data.tail(5)

In [None]:
np.unique(data["class"].values, return_counts=True)

The distribution between poisonous and edible mushrooms is roughly balanced. We are good to go.
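
The same check can be expressed as class shares rather than raw counts. A minimal sketch on a made-up `toy` frame (its values are illustrative only; on the real data the call would apply to `data["class"]`):

```python
import pandas as pd

# Toy stand-in for the "class" column; values are illustrative only.
toy = pd.DataFrame({"class": ["e", "p", "e", "p", "e"]})

# normalize=True returns class shares instead of raw counts.
shares = toy["class"].value_counts(normalize=True)
print(shares)
```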


# A simple 2-layer NN on incomplete data

It might be interesting to see whether a standard NN can infer the class of mushrooms from incomplete data: if it can, that suggests some variables are highly correlated with each other.

We first need a function to inject **NaN** values into the dataset:

In [None]:
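def injectNAN(df, ratio, columns):
    # Blank randomly drawn feature cells; column 0 ("class") is never touched.
    # Draws are with replacement, so the share actually blanked is below `ratio`.
    for _ in range(int(len(df) * len(df.columns) * ratio)):
        row, col = random.randint(0, len(df) - 1), random.randint(1, columns)
        df.iat[row, col] = np.nan  # .iat writes in place; chained df.iloc[r][c] may hit a copy
    return df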

In [None]:
data = injectNAN(data, 0.5,22)
data.tail()
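
The loop above draws cells with replacement, so some cells are picked more than once and the fraction actually blanked ends up below `ratio`. One possible alternative, sketched here on a toy frame, blanks each feature cell independently with probability `ratio` using a boolean mask. The function name `inject_nan_mask` and the `seed` parameter are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def inject_nan_mask(df, ratio, seed=0):
    """Return a copy of df where each cell outside column 0 is NaN with probability `ratio`."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    features = out.columns[1:]                 # column 0 holds the label and is left intact
    mask = rng.random((len(out), len(features))) < ratio
    out[features] = out[features].mask(mask)   # mask() puts NaN where the mask is True
    return out

# Toy data, made up for the example.
toy = pd.DataFrame({
    "class": ["e", "p"] * 500,
    "odor": ["n", "f"] * 500,
    "bruises": ["t", "f"] * 500,
})
noisy = inject_nan_mask(toy, ratio=0.5)
print(noisy.iloc[:, 1:].isna().mean().mean())  # close to 0.5
```

Unlike the in-place loop, this version returns a copy, so the original frame stays available for comparison.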

We then need to convert our categorical variables to numerical ones.

In [None]:
colList = list(data.columns.values)[1:]
data = pd.get_dummies(data, columns=colList)

def toNumeric(s): 
    if s == "e": 
        return(0)
    elif s == "p": 
        return(1)

data["class"] = data["class"].apply(toNumeric)
data.tail(5)
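
`pd.get_dummies` does not create a column for missing values by default (`dummy_na=False`), so a blanked cell becomes a row of zeros across that variable's indicator columns. A small illustration on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"class": ["e", "p", "e"], "odor": ["n", np.nan, "f"]})

encoded = pd.get_dummies(toy, columns=["odor"])
print(encoded)

# The row whose odor is missing has no active odor_* indicator.
odor_cols = [c for c in encoded.columns if c.startswith("odor_")]
print(encoded.loc[1, odor_cols].sum())
```

This is why the network can be trained directly on the NaN-injected frame: a missing value is simply encoded as "none of the categories".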

In [None]:
X = data.iloc[:,1:].values # features: every column except the first
Y = data.iloc[:,0:1].values # target: the first column ("class")

X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X,Y,test_size=0.03)

print(X_train.shape,Y_train.shape,X_test.shape,Y_test.shape)
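
With `test_size=0.03` only about 3% of the rows are held out, so validation scores can vary noticeably from one random split to the next. A sketch of a reproducible, class-stratified split on toy arrays; `random_state=42` and the larger `test_size` are assumptions for illustration, not the notebook's settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features, balanced binary labels.
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,     # keep the class ratio identical in both splits
    random_state=42,    # fixed seed so the split is repeatable
)
print(X_tr.shape, X_te.shape, y_te.mean())
```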

In [None]:
shroomModel2 = Sequential()
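# layer 1 (input width taken from the data: one input per one-hot column, 117 when every category is present)
shroomModel2.add(Dense(30, input_dim=X_train.shape[1], activation='relu', name='fc0',kernel_regularizer=regularizers.l2(0.01)))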

#layer 2
shroomModel2.add(Dense(1, name='fc2',bias_initializer='zeros'))
shroomModel2.add(Activation('sigmoid'))

shroomModel2.summary()

In [None]:
shroomModel2.compile(optimizer = "adam", loss = "logcosh", metrics = ["binary_accuracy"])

In [None]:
shroomModel2.fit(x = X_train, y = Y_train, epochs = 30,verbose=1, batch_size = 64,validation_data=(X_test, Y_test))

Even with a large share of the feature values missing, the model remains pretty accurate. Let's try to understand why: are some variables highly correlated?

## Let's open the black box!

Since all our variables are categorical, the <a href="http://people.stat.sc.edu/hendrixl/stat205/Lecture%20Notes/Chi-square%20for%20Contingency%20Tables.pdf">Chi-squared test of independence</a> is the only statistical test we will perform.

We will perform it on every pair of factors.

In [None]:
from scipy.stats import chi2_contingency

In [None]:
df = pd.read_csv("../input/mushrooms.csv")

factors_paired = [(i,j) for i in df.columns.values for j in df.columns.values] 

chi2, p_values =[], []

for f in factors_paired:
    if f[0] != f[1]:
        chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]])) # Chi2 test for every contingency table possible
        chi2.append(chitest[0])
        p_values.append(chitest[1])
    else:      # for same factor pair
        chi2.append(0)
        p_values.append(0)
    
chi2 = np.array(chi2).reshape((23,23)) # shape it as a matrix
chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values) # then a df for convenience

p_values = np.array(p_values).reshape((23,23)) # shape it as a matrix
p_values = pd.DataFrame(p_values, index=df.columns.values, columns=df.columns.values) # then a df for convenience
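
To make the statistic concrete: the test compares the observed counts in a contingency table with the counts expected under independence (row total × column total / grand total) and sums the squared differences divided by the expected counts. The sketch below checks this by hand against `chi2_contingency` on a made-up table; the toy columns and counts are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 30 edible and 30 poisonous samples with three odor values.
toy = pd.DataFrame({
    "class": ["e"] * 30 + ["p"] * 30,
    "odor":  ["n"] * 25 + ["f"] * 5 + ["n"] * 5 + ["f"] * 15 + ["a"] * 10,
})
observed = pd.crosstab(toy["class"], toy["odor"]).values

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
stat_by_hand = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' correction (it only applies to 2x2 tables anyway).
stat, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(stat_by_hand, stat, p_value, dof)
```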

In [None]:
chi2.head()

Do we have any pairs whose association is not statistically significant?

In [None]:
p_values[(p_values >= 0.05)]

Apart from the "veil-type" anomaly (that column takes a single value in this dataset, so the test has nothing to compare), everything looks OK. We are good to go.
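
Boolean indexing on a DataFrame keeps its full shape and fills non-matching cells with NaN, which makes the few pairs of interest hard to spot. A sketch of one way to list only the pairs at or above the 0.05 threshold, using a made-up `p_toy` matrix (the values are illustrative, not results from the dataset):

```python
import pandas as pd

cols = ["class", "odor", "veil-type"]
p_toy = pd.DataFrame(
    [[0.0, 1e-9, 1.0],
     [1e-9, 0.0, 1.0],
     [1.0, 1.0, 0.0]],
    index=cols, columns=cols,
)

# Keep cells >= 0.05, flatten to (row, column) pairs, and drop the NaN placeholders.
weak_pairs = p_toy.where(p_toy >= 0.05).stack().dropna()
print(weak_pairs)
```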

In [None]:
sns.heatmap(chi2,vmax=4000, center=1,square=True,robust=False,xticklabels=True , yticklabels=True, cmap="YlGnBu", linewidths=.5)
plt.show()

The **Class** seems highly correlated with i) the **odor**, ii) the **spore print color**, iii) the **gill-color**. 

Moreover, the colors below and above the ring are highly correlated with many other variables. No wonder our NN remained accurate with so much of our data missing.

Information concerning i) the **cap**, ii) the **veil** and iii) the **gill spacing, attachment and size** doesn't seem to play a great role in determining the mushroom class.


With this in mind, we could build a NN able to perform the classification on a small subset of variables.
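
The raw chi-squared statistic grows with the sample size and with the number of categories, so values for different pairs are not directly comparable. Cramér's V, which is not used in the original analysis, rescales it to the range [0, 1]. A minimal sketch, assuming no Yates correction; the helper name `cramers_v` and the toy data are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y).values
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * k)))

# Made-up data for the two extreme cases.
toy = pd.DataFrame({
    "class":   ["e", "e", "p", "p"] * 25,
    "odor":    ["n", "n", "f", "f"] * 25,   # fully determined by class
    "bruises": ["t", "f", "t", "f"] * 25,   # unrelated to class
})
v_odor = cramers_v(toy["class"], toy["odor"])
v_bruises = cramers_v(toy["class"], toy["bruises"])
print(v_odor, v_bruises)
```

A column with a single value would make `k` equal to 0 and needs to be skipped or handled separately.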

## A more minimal NN

Let's first get rid of the irrelevant variables.

In [None]:
dropped_variables= ["cap-shape", "cap-surface","cap-color", "gill-attachment", "gill-spacing", "stalk-shape", "veil-type","veil-color", "ring-number","habitat","population","stalk-surface-below-ring","stalk-color-above-ring"]
data = pd.read_csv("../input/mushrooms.csv").drop(dropped_variables,axis=1)
data.head()
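
The width of the one-hot matrix, and therefore the `input_dim` of the first layer, equals the total number of distinct values across the kept columns. A sketch of how this could be checked rather than hard-coded, on a toy frame (the frame and its values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "class":   ["e", "p", "e", "p"],
    "odor":    ["n", "f", "a", "n"],
    "bruises": ["t", "f", "t", "t"],
    "habitat": ["d", "g", "d", "u"],
})
kept = toy.drop(["habitat"], axis=1)

feature_cols = list(kept.columns)[1:]                     # everything except the label
expected_width = int(kept[feature_cols].nunique().sum())  # distinct values per column, summed
encoded = pd.get_dummies(kept, columns=feature_cols)

print(feature_cols, expected_width, encoded.shape[1] - 1)
```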

We convert the data to numerical values.

In [None]:
colList = list(data.columns.values)[1:]
data = pd.get_dummies(data, columns=colList)

data["class"] = data["class"].apply(toNumeric)

Then, we build our train and test sets.

In [None]:
X = data.iloc[:,1:].values # features: every column except the first
Y = data.iloc[:,0:1].values # target: the first column ("class")

X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X,Y,test_size=0.03)

print(X_train.shape,Y_train.shape,X_test.shape,Y_test.shape)

We then reuse the same 2-layer NN architecture.

In [None]:
shroomModel3 = Sequential()
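# layer 1 (input width taken from the data: one input per one-hot column, 57 for the kept variables)
shroomModel3.add(Dense(30, input_dim=X_train.shape[1], activation='relu', name='fc0',kernel_regularizer=regularizers.l2(0.01)))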

#layer 2
shroomModel3.add(Dense(1, name='fc2',bias_initializer='zeros'))
shroomModel3.add(Activation('sigmoid'))

shroomModel3.summary()

In [None]:
shroomModel3.compile(optimizer = "adam", loss = "logcosh", metrics = ["binary_accuracy"])

In [None]:
shroomModel3.fit(x = X_train, y = Y_train, epochs = 40,verbose=1, batch_size = 128,validation_data=(X_test, Y_test))

In [None]:
preds = shroomModel3.evaluate(x = X_test, y = Y_test)
print()
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
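
Accuracy alone hides which kind of error the model makes; for this task a poisonous mushroom predicted as edible is the costly case. A sketch of a confusion count from sigmoid outputs, using made-up probabilities in place of `shroomModel3.predict(X_test)` and an assumed 0.5 threshold:

```python
import numpy as np

# Made-up stand-ins: true labels (0 = edible, 1 = poisonous) and sigmoid outputs.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.7, 0.9, 0.4, 0.8, 0.2])

y_pred = (y_prob >= 0.5).astype(int)   # assumed decision threshold

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # poisonous, flagged
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # poisonous, missed
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # edible, flagged
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # edible, passed
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```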

## Conclusion

It is possible to accurately determine the class of mushrooms based on a subset of 9 variables, the columns kept by the code above:
<ul>
    <li> bruises
    <li> odor
    <li> gill-size
    <li> gill-color
    <li> stalk-root
    <li> stalk-surface-above-ring
    <li> stalk-color-below-ring
    <li> ring-type
    <li> spore-print-color
</ul>

									