<font size= 5pt>Building a binary classifier for the mushroom dataset</font><br><br>


In [1]:
# import necessary packages
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('bmh')

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
# load and prep the data
df= pd.read_csv('data/cleaned_mushroom.csv')

df0 = df.copy()
df0.drop('Unnamed: 0', axis=1, inplace=True)  # drop the unnamed index column left over from the CSV
df0.head()

In [4]:
col_names= ['cap_shape', 'cap_surface', 'cap_color', 'bruise?', 'odor',
       'gill attachment', 'gill spacing', 'gill size', 'gill color',
       'stalk shape', 'stalk root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       ' stalk-color-below-ring', 'veil-type', 'veil color', 'ring number',
       'ring type', 'spore-print-color', 'population', 'habitat']

In [5]:
# perform transformation and label encoding
label_encoder = LabelEncoder()

for i in col_names:
    df0[i]= label_encoder.fit_transform(df0[i])

# poisonous = 1, edible = 0
df0['class']= np.where(df0['class'] == 'poisonous', 1, 0)

In [6]:
# create dummy (one-hot) variables for each categorical feature
df_dum= pd.get_dummies(df0.iloc[:, 1:], columns= col_names)
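
The two encodings used here behave differently. Label encoding maps each category to an integer, which a linear model may read as an ordered magnitude. Dummy (one-hot) variables give each category its own indicator column. A minimal sketch on a toy column (the `toy` frame and its values are illustrative assumptions, not rows from the mushroom data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; the values are illustrative, not taken from the mushroom data.
toy = pd.DataFrame({'odor': ['almond', 'foul', 'none', 'foul']})

# Label encoding: one integer per category, assigned in sorted order.
codes = LabelEncoder().fit_transform(toy['odor'])
print(codes)  # [0 1 2 1]

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy, columns=['odor'])
print(list(one_hot.columns))  # ['odor_almond', 'odor_foul', 'odor_none']
```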

In [7]:
X = df0.iloc[:, 1:]
y = df0.iloc[:, 0]

X_dum = df_dum.values
y_dum = df0.iloc[:, 0].values

In [8]:
X_train, X_test, y_train, y_test= train_test_split(X, y)
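
The split above uses the defaults (25% test set, no fixed seed), so the scores below will vary slightly from run to run. One possible way to make the split reproducible and keep the class balance equal in both parts is sketched here on synthetic stand-in data. The names `X_demo` and `y_demo` and the values `test_size=0.25` and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 integer-coded features, 30% positives.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify keeps the class ratio similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```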

In [34]:
# fit various models on the label-encoded features (no dummy variables)
logit = LogisticRegression(C= 0.01)
tree = DecisionTreeClassifier(max_depth=8, max_features=5)
forest  = RandomForestClassifier(max_features=5, max_depth=5, n_estimators=10)
mNB = MultinomialNB(alpha= 10)          # higher alpha means more smoothing and a less complex model.

names = ['logit', 'decision tree', 'random forest', 'naive bayes']
models = [logit, tree, forest, mNB]

for name, model in zip(names, models):
    model.fit(X_train, y_train)
    test_scores = {
        name+' test score': np.round(model.score(X_test, y_test),4)
    }
    train_scores = {
        name+' train score': np.round(model.score(X_train, y_train),4)
    }
    print('\n',test_scores)
    print(train_scores)


 {'logit test score': 0.9266}
{'logit train score': 0.9184}

 {'decision tree test score': 1.0}
{'decision tree train score': 1.0}

 {'random forest test score': 0.9902}
{'random forest train score': 0.9915}

 {'naive bayes test score': 0.836}
{'naive bayes train score': 0.8285}


<font color= 'steelblue'> Note</font><br>
We want a model that scores well on the training data and at least as well on the test data, that is, one that generalizes rather than overfits.
Here **Naive Bayes** has the lowest accuracy of the four models, but its train and test scores are close, so it might generalize well.

**Logistic Regression** and **Random Forest** look like good candidate models for prediction.
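
A single train/test split can give a noisy picture of how well a model generalizes. One common check is k-fold cross-validation, which reports a score per fold. The sketch below runs on synthetic data from `make_classification`, not on the mushroom data. The names `X_demo`, `y_demo`, `candidates`, and `cv_results` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    'logit': LogisticRegression(C=0.01),
    'decision tree': DecisionTreeClassifier(max_depth=8, random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validation: five accuracy scores per model.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_results[name] = scores
    print(f'{name}: mean={scores.mean():.4f}, std={scores.std():.4f}')
```

A small standard deviation across folds suggests the score does not depend much on which rows land in the test set.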

In [31]:
# fitting the models on the dummy (one-hot encoded) data
X_train1, X_test1, y_train1, y_test1= train_test_split(X_dum, y_dum)

In [33]:
logit = LogisticRegression(C= 0.01) # strong regularization; higher C means less regularization
tree = DecisionTreeClassifier(max_depth=8, max_features=5)# pre-pruned tree.
forest  = RandomForestClassifier(max_features=5, max_depth=5, n_estimators=10)# pre-pruned trees.
mNB = MultinomialNB(alpha= 10)                    # higher alpha means more smoothing and a less complex model.

names = ['logit', 'decision tree', 'random forest', 'naive bayes']
models = [logit, tree, forest, mNB]

for name, model in zip(names, models):
    model.fit(X_train1, y_train1)
    test_scores = {
        name+' test score': np.round(model.score(X_test1, y_test1),4)
    }
    train_scores = {
        name+' train score': np.round(model.score(X_train1, y_train1), 4)
    }
    print('\n',test_scores)
    print(train_scores)


 {'logit test score': 0.9838}
{'logit train score': 0.9856}

 {'decision tree test score': 0.9724}
{'decision tree train score': 0.9744}

 {'random forest test score': 0.9828}
{'random forest train score': 0.9828}

 {'naive bayes test score': 0.9355}
{'naive bayes train score': 0.9329}


<font color='steelblue'><b>Note</b></font><br>
Naive Bayes does noticeably better after the data is transformed into dummy variables: its test accuracy rises from 0.836 to 0.9355. I think it is a reliable go-to model for prediction.
So, finally, we choose Naive Bayes to predict new instances. Its train and test scores are close, which suggests little overfitting, and about **93%** accuracy on the test set is pretty good, although Logistic Regression and Random Forest score higher (about 98%) on this split.
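
Accuracy alone does not show which kind of mistake a model makes, and for this task labeling a poisonous mushroom as edible is the costlier error. The `confusion_matrix` function imported at the top can break the errors down. A minimal sketch with made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not results from the models above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = poisonous, 0 = edible).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

# Recall for the poisonous class: share of poisonous mushrooms that were caught.
recall_poisonous = tp / (tp + fn)
print(recall_poisonous)  # 0.75
```

Here `fn` counts poisonous mushrooms predicted as edible, which is the number to keep as low as possible.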