# Methods of Mushrooms Classification

### Uri Katz

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('../input/mushrooms.csv')
df.head()

**Mission: find different methods for classifying mushrooms as poisonous or edible.**

## Data Exploration

In [None]:
df.info()

In [None]:
df.shape

All 22 features (and the class) are categorical (string type), with 8124 samples.

In [None]:
print('Class in %')
df['class'].value_counts()/df['class'].count()*100

About 52% of the mushrooms are edible and 48% are poisonous, so the classes are well balanced.

**Encoding Data:**
Since all the features are purely categorical, with no ordinal relation between their values, we cannot treat them as ordinal variables. Therefore I will use one-hot encoding (0/1 indicators):

In [4]:
df_onehot = pd.get_dummies(df);
df_onehot = df_onehot.drop(['class_e'],axis=1) # Now , class_p is an indicator of 1=poisonous , 0 = edible
df_onehot.head()

Now we have a new matrix of size 8124 x 118 in which every column is a one-hot indicator. class_p is our target, with class_p=1 for poisonous and 0 for edible. Next I will check the correlation between the class and the features using the Pearson correlation coefficient.
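
To make the encoding step concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical stand-ins for the mushroom data; only `pd.get_dummies` and the dropped `class_e` column mirror the cell above.

```python
import pandas as pd

# Toy frame with the same layout as the mushroom data: a class column
# plus categorical features stored as single-letter strings (made-up values).
toy = pd.DataFrame({
    'class':   ['p', 'e', 'e', 'p'],
    'odor':    ['f', 'n', 'a', 'f'],
    'bruises': ['f', 't', 't', 'f'],
})

# One indicator column per (column, value) pair.
toy_onehot = pd.get_dummies(toy)

# Keep a single target indicator: class_p marks the poisonous samples.
toy_onehot = toy_onehot.drop(columns=['class_e'])

print(toy_onehot.columns.tolist())
print(toy_onehot.shape)
```

The number of columns equals the number of distinct (column, value) pairs minus the dropped `class_e` indicator, which is how 23 original columns grow to 118.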

**Correlation**
Here we look for features that are correlated with the class, that is, with whether a mushroom is poisonous.

In [None]:
corr = df_onehot.corr().loc[:,'class_p']
top_10_corr =corr.abs().sort_values(ascending=False).head(n=11).iloc[1:]
top_10_corr

**Top 10 features with high correlation (in absolute value)**


For each of the correlated features, we can see what percentage of all poisonous mushrooms belongs to each group of the indicator (0 or 1):

In [None]:
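highcorr = pd.DataFrame()
n_poisonous = df_onehot['class_p'].sum()  # total number of poisonous mushrooms (3916)
for var in top_10_corr.index:
    # Percentage of all poisonous mushrooms that fall in each group (0/1) of the indicator
    highcorr[var] = 100 * df_onehot.groupby(var)['class_p'].sum() / n_poisonous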

highcorr

* We see that 96% of the poisonous mushrooms do not have odor type n
* 55% of the poisonous mushrooms have odor type f
* 56.8% of the poisonous mushrooms have gill size n, while all the others (43.2%) have gill size b
* 84% of the poisonous mushrooms do not have bruises (f)
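
The two computations above (ranking by absolute correlation, then the share of poisonous mushrooms per group) can be sketched on a small toy frame. The indicator values below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy indicators (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    'class_p': [1, 1, 1, 1, 0, 0, 0, 0],
    'odor_n':  [0, 0, 0, 1, 1, 1, 1, 0],
    'odor_f':  [1, 1, 0, 0, 0, 0, 0, 0],
})

# Pearson correlation of every indicator with the target, ranked by magnitude.
corr = toy.corr()['class_p'].drop('class_p')
ranked = corr.abs().sort_values(ascending=False)

# Share of poisonous samples (in %) that fall in each group of one indicator.
n_poisonous = toy['class_p'].sum()
share = 100 * toy.groupby('odor_n')['class_p'].sum() / n_poisonous

print(corr)
print(ranked)
print(share)
```

Taking the absolute value hides the direction of the relation: in this toy frame `odor_n` is negatively correlated with `class_p`, so it is worth looking at the signed values as well.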

## Models

### Splitting the dataset

Here I will split the dataset into two groups: train (15%) and test (85%), with the random seed set to 42.

In [5]:
from sklearn.model_selection import train_test_split
X = df_onehot.drop(['class_p'],axis=1)
y = df_onehot.class_p
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.85,random_state = 42)
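
As a quick check of what this split produces, the sketch below runs `train_test_split` on a synthetic array with the same number of rows. The arrays `X_demo` and `y_demo` are hypothetical, and the `stratify` argument is an addition that the cell above does not use; it keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the mushroom data.
n_samples = 8124
X_demo = np.arange(n_samples).reshape(-1, 1)
y_demo = np.arange(n_samples) % 2  # hypothetical, perfectly balanced labels

# Same test_size and seed as above; stratify is an optional extra (assumption).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.85, random_state=42, stratify=y_demo)

print(len(X_tr), len(X_te))      # sizes of the train and test parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

With only about 1200 training rows, stratifying is a cheap way to make sure the small training set reflects the class balance of the full data.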


### 1. Decision Tree Classifier

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier , export_graphviz
import graphviz
treeclf = DecisionTreeClassifier(random_state=42)
treeclf.fit(X_train,y_train)

10-fold cross-validation on the training set:

In [18]:
mean_cv_score =cross_val_score(treeclf, X_train, y_train, cv=10,scoring='accuracy').mean()
print('Decision Tree Classifier mean 10 cv Accuracy score:{0:.3}'.format(mean_cv_score))

We have very good accuracy. Let's check how the model behaves on our large test set.

In [19]:
from sklearn.metrics import confusion_matrix
y_pred_tree = treeclf.predict(X_test)
conf = confusion_matrix(y_test, y_pred_tree)
plt.figure(figsize=(5,5))
sns.heatmap(conf, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Greens')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix', size = 15)
plt.show()

Our test set is quite large, and we still get very good accuracy (100%): the model labels all the samples correctly.
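
For reference, this is how accuracy and related numbers can be read off a 2x2 confusion matrix. The matrix below is hypothetical (it is not the result above); the row and column layout follows scikit-learn's `confusion_matrix`, where rows are actual labels and columns are predictions.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are actual labels (0 = edible, 1 = poisonous), columns are predictions.
conf_demo = np.array([[50, 2],
                      [1, 47]])

tn, fp, fn, tp = conf_demo.ravel()

accuracy = (tp + tn) / conf_demo.sum()  # share of correct predictions
recall_poisonous = tp / (tp + fn)       # share of poisonous mushrooms that were caught
missed_poisonous = fn                   # poisonous mushrooms predicted as edible

print(accuracy, recall_poisonous, missed_poisonous)
```

For this problem the lower-left cell matters most, since it counts poisonous mushrooms that the model would call edible.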

We can inspect the tree :

In [None]:
dot_data = export_graphviz(treeclf, out_file=None,feature_names=X.columns,filled=True, rounded=True,special_characters=True)  
graphviz.Source(dot_data)  


We can look at the importances of the features used:

In [None]:
# Label the importances with the feature names and keep only the features the tree actually used
tree_features = pd.Series(treeclf.feature_importances_, index=X.columns)
tree_features = tree_features[tree_features > 0].sort_values(ascending=False)
plt.figure(figsize=(8,5))
tree_features.plot.bar(color='g',align='center')
plt.title('Feature importances')
plt.ylabel('Importance')
plt.show()

We can see that only 11 of the 117 features have non-zero importance, and the indicator for odor type n is the most influential. We already saw that odor type n was the feature most correlated with poisonous mushrooms. It is interesting to ask why some of the other correlated variables did not make it into the importance list.
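
One plausible explanation is redundancy: a tree only credits the features it actually splits on, so an indicator that carries the same information as an earlier split gets zero importance even if it is highly correlated with the class. The sketch below shows this on synthetic data (all arrays are hypothetical, not the mushroom features).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
signal = rng.randint(0, 2, size=200)

# Two identical copies of an informative indicator plus one noise column.
X_demo = np.column_stack([signal, signal, rng.randint(0, 2, size=200)])
y_demo = signal  # target fully determined by the informative indicator

tree_demo = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
importances = tree_demo.feature_importances_

# Both copies are perfectly correlated with the target ...
corr_with_target = [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(3)]

# ... but a single split is enough, so only one copy receives any importance.
print(corr_with_target)
print(importances)
```

This is only an illustration of the mechanism; whether it explains the missing features here would need a check of which indicators overlap in the real data.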

To summarize, the Decision Tree Classifier gave us a really good method for binary classification of poisonous vs. edible mushrooms. It also emphasized the importance of odor (mainly type n), stalk root (type c), and stalk surface below the ring. All the features (as groups, not as individual indicators) that we saw in the correlation test are part of the importance list!

### 2. Regularized Logistic Regression

Here I will try another classifier: logistic regression with a regularization penalty.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
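# liblinear supports both the l1 and l2 penalties searched below (the current default solver does not support l1)
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train,y_train);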

Here I will run a grid search with cross-validation to choose the regularization parameters for the logistic regression:

In [13]:
c_space = np.logspace(-5, 8, 50) #log space for the C parameter
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}
logreg_cv = GridSearchCV(logreg,param_grid,cv=10)
logreg_cv.fit(X_train,y_train);
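
To get a feel for the size of this search, the sketch below rebuilds the same grid with NumPy and counts the model fits. In scikit-learn, `C` is the inverse of the regularization strength, so small values mean strong regularization. The variable names `n_candidates` and `n_fits` are only illustrative.

```python
import numpy as np

c_space = np.logspace(-5, 8, 50)  # same grid as above: 50 values, evenly spaced in log10
penalties = ['l1', 'l2']
n_folds = 10

n_candidates = len(c_space) * len(penalties)  # every (C, penalty) combination
n_fits = n_candidates * n_folds               # plus one final refit on the full training set

print(c_space[0], c_space[-1])
print(n_candidates, n_fits)
```

The grid spans 13 orders of magnitude, which is wide; a coarser grid would cut the run time considerably if the search turns out to be slow.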

In [14]:
print("Tuned Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Accuracy: {}".format(logreg_cv.best_score_))

In [23]:
y_pred_logreg = logreg_cv.predict(X_test)
conf = confusion_matrix(y_test, y_pred_logreg)
plt.figure(figsize=(5,5))
sns.heatmap(conf, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Greens')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix for Logistic Regression', size = 15)
plt.show()

Again, the accuracy on the test set is very high.

### 3. Multivariate Bernoulli Naive Bayes

In [28]:
from sklearn.naive_bayes import BernoulliNB
bnb_clf = BernoulliNB()
bnb_clf.fit(X_train, y_train)
y_pred_bnb = bnb_clf.predict(X_test)
conf = confusion_matrix(y_test, y_pred_bnb)
plt.figure(figsize=(5,5))
sns.heatmap(conf, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Greens')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix for Naive Bayes', size = 15)
plt.show()
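
Bernoulli naive Bayes is a natural fit for one-hot data because every feature is a 0/1 indicator. The following is a minimal NumPy sketch of the decision rule, assuming additive smoothing with `alpha=1.0` as in the scikit-learn default; the function names and the toy arrays are hypothetical and this is not the library's implementation.

```python
import numpy as np

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(feature = 1 | class)."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    theta = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, log_prior, theta

def bernoulli_nb_predict(X, classes, log_prior, theta):
    """Pick the class with the highest log prior plus log likelihood."""
    # Present features contribute log(theta), absent ones log(1 - theta).
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Toy indicators (made-up values): feature 0 marks class 1, feature 1 marks class 0.
X_nb = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_nb = np.array([1, 1, 0, 0])

classes, log_prior, theta = bernoulli_nb_fit(X_nb, y_nb)
predictions = bernoulli_nb_predict(X_nb, classes, log_prior, theta)
print(theta)
print(predictions)
```

Unlike a multinomial model, the Bernoulli variant also uses the absence of a feature as evidence, which is the `(1 - X)` term in the prediction.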