# Assignment 3

This homework is a classification task to identify whether a mushroom is edible or poisonous. 
You have to submit the following items to MyCourseVille:

xxx.ipynb: the notebook containing your source code

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; there is no rule like "leaflets three, let it be" for Poison Oak and Ivy.

Data Set Information: https://archive.ics.uci.edu/ml/datasets/mushroom

In [136]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Classification

The goal is to differentiate between edible and non-edible mushrooms.

1. Load the "ModifiedEdibleMushroom.csv" data (hosted as mushroom2020_dataset.csv) from the link below. Note that this data set has already been preliminarily prepared.
   https://github.com/pvateekul/2110446_DSDE_2023s2/blob/main/code/Week03_ML/mushroom2020_dataset.csv

In [137]:
# !wsl wget https://raw.githubusercontent.com/pvateekul/2110446_DSDE_2023s2/main/code/Week03_ML/mushroom2020_dataset.csv

In [138]:
df = pd.read_csv('mushroom2020_dataset.csv')
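
The link in step 1 points at a GitHub page, while `pd.read_csv` needs the raw file, which is the form used in the commented-out `wget` line. A minimal sketch of that conversion, assuming the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout; `to_raw_url` is a hypothetical helper and not part of the assignment:

```python
def to_raw_url(blob_url: str) -> str:
    """Turn a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    # Split "<user>/<repo>" from "<branch>/<path>" at the first "/blob/" marker.
    repo, rest = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{repo}/{rest}"


blob_url = ("https://github.com/pvateekul/2110446_DSDE_2023s2/blob/main/"
            "code/Week03_ML/mushroom2020_dataset.csv")
raw_url = to_raw_url(blob_url)
print(raw_url)

# pandas can read directly from the raw URL (requires network access):
# df = pd.read_csv(raw_url)
```

Reading from the URL avoids the download step, at the cost of needing a connection every time the cell runs.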

In [139]:
df.head()

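|   | id | label | cap-shape | cap-surface | bruises | odor | gill-attachment | gill-spacing | gill-size | stalk-shape | ... | ring-number | ring-type | spore-print-color | population | habitat | cap-color-rate | gill-color-rate | veil-color-rate | stalk-color-above-ring-rate | stalk-color-below-ring-rate |
|---|----|-------|-----------|-------------|---------|------|-----------------|--------------|-----------|-------------|-----|-------------|-----------|-------------------|------------|---------|----------------|-----------------|-----------------|-----------------------------|-----------------------------|
| 0 | 1 | p | x | s | t | p | f | c | n | e | ... | o | p | k | s | u | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 |
| 1 | 2 | e | x | s | t | a | f | c | b | e | ... | o | p | n | n | g | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 |
| 2 | 3 | e | b | s | t | l | f | c | b | e | ... | o | p | n | n | m | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 4 | p | x | y | t | p | f | c | n | e | ... | o | p | k | s | u | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 5 | e | x | s | f | n | f | w | b | t | ... | o | e | n | a | g | 4.0 | 3.0 | 1.0 | 1.0 | 1.0 |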


In [140]:
df.shape

(5824, 24)

In [141]:
df.isnull().sum()

id                               0
label                           60
cap-shape                        0
cap-surface                     27
bruises                         99
odor                            99
gill-attachment                 99
gill-spacing                   130
gill-size                      121
stalk-shape                    121
stalk-root                      31
stalk-surface-above-ring        31
stalk-surface-below-ring        31
veil-type                       62
ring-number                     62
ring-type                       62
spore-print-color               56
population                      56
habitat                         31
cap-color-rate                  27
gill-color-rate                121
veil-color-rate                 62
stalk-color-above-ring-rate     31
stalk-color-below-ring-rate     62
dtype: int64

2. Drop rows where the target (label) variable is missing.

In [142]:
df_dropmissing = df.dropna(subset=['label'],axis=0)

In [143]:
df_dropmissing.shape

(5764, 24)
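
The drop removes exactly the 60 rows with a missing label (5824 - 60 = 5764) and leaves gaps in other columns for the later imputation step. A small sketch on a made-up frame (`toy` and its values are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "label": ["e", np.nan, "p", "e"],
    "odor":  ["a", "n", np.nan, "l"],
})

# subset=["label"] restricts the missing-value check to the target column,
# so the row with a missing odor but a known label is kept.
toy_labeled = toy.dropna(subset=["label"])

print(toy_labeled.shape)
print(toy_labeled.isnull().sum())
```

Without `subset`, `dropna` would discard every row containing any missing value, which would throw away far more data than the task asks for.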

3. Drop the following variables:  
  'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate', 'veil-color-rate','veil-type'

In [144]:
dropped_cols = ['id','gill-attachment', 'gill-spacing', 'gill-size', 
                'gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 
                'stalk-surface-below-ring', 'stalk-color-above-ring-rate',
                'stalk-color-below-ring-rate','veil-color-rate','veil-type']
df_drop = df_dropmissing.drop(labels=dropped_cols, axis=1)

4. Examine the number of rows, the number of columns, and whether any values are missing.

In [145]:
df_drop.shape

(5764, 12)

In [146]:
df_drop.isnull().sum()

label                  0
cap-shape              0
cap-surface           27
bruises               99
odor                  99
stalk-shape          121
ring-number           62
ring-type             62
spore-print-color     56
population            56
habitat               31
cap-color-rate        27
dtype: int64

5. Impute missing values with the mean for numeric variables and the mode for nominal variables.

In [147]:
df_drop_num = df_drop[df_drop.select_dtypes(include=np.number).columns]
# print(df_drop_num)

In [148]:
from sklearn.impute import SimpleImputer

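# np.NaN was removed in NumPy 2.0; np.nan works on every version.
num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df_imputed = df_drop.copy()
# cap-color-rate is the only numeric column, so flatten the (n, 1) result into it.
df_imputed['cap-color-rate'] = num_imp.fit_transform(df_drop_num).ravel()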
df_imputed.isnull().sum()

label                  0
cap-shape              0
cap-surface           27
bruises               99
odor                  99
stalk-shape          121
ring-number           62
ring-type             62
spore-print-color     56
population            56
habitat               31
cap-color-rate         0
dtype: int64

In [149]:
cat_cols = df_drop.select_dtypes(exclude=np.number).columns
df_drop_cat = df_drop[cat_cols]
# np.nan instead of np.NaN (removed in NumPy 2.0); 'most_frequent' imputes the mode.
cat_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_imputed[cat_cols] = pd.DataFrame(cat_imp.fit_transform(df_drop_cat), index=df_drop.index, columns=cat_cols)
df_imputed.isnull().sum()

label                0
cap-shape            0
cap-surface          0
bruises              0
odor                 0
stalk-shape          0
ring-number          0
ring-type            0
spore-print-color    0
population           0
habitat              0
cap-color-rate       0
dtype: int64
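
For comparison, the same mean/mode rule can be written with plain pandas instead of `SimpleImputer`. This is a sketch on a made-up frame (`toy` and its values are illustrative), not the method used in the cells above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "cap-color-rate": [1.0, np.nan, 3.0, 2.0],
    "odor": ["n", "n", np.nan, "a"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: replace gaps with the column mean (NaN is ignored by mean()).
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Nominal column: replace gaps with the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

`mode()` can return several values when there is a tie; taking `.iloc[0]` picks the first in sorted order.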

6. Convert the label variable, mapping e (edible) to 1 and p (poisonous) to 0, then check the class counts (class 0 : class 1).

In [150]:
df_codelabeled = df_imputed.copy()
df_codelabeled.loc[df_codelabeled['label'] == 'e', 'label'] = 1
df_codelabeled.loc[df_codelabeled['label'] == 'p', 'label'] = 0
df_codelabeled['label'] = df_codelabeled['label'].astype(int)

In [151]:
tuple(df_codelabeled['label'].value_counts())

(3660, 2104)
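
`value_counts()` orders its result by frequency, so the tuple above reads as class 0 : class 1 only because class 0 happens to be the majority. A small sketch, on made-up labels, that fixes the order explicitly with `sort_index()`:

```python
import pandas as pd

labels = pd.Series(["e", "p", "p", "e", "p"], name="label")

# Map the letter codes to integers in one step.
encoded = labels.map({"e": 1, "p": 0}).astype(int)

# sort_index() orders the counts by class value: (class 0, class 1).
counts = encoded.value_counts().sort_index()
print(tuple(counts))
```

With this ordering the result stays class 0 : class 1 even if the class balance changes.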

7. Convert the nominal variables to numeric using dummy coding with drop_first = True.

In [152]:
# print(df_codelabeled['cap-shape'].unique())
# print(df_codelabeled['cap-surface'].unique())
# print(df_codelabeled['bruises'].unique())
# print(df_codelabeled['odor'].unique())
# print(df_codelabeled['stalk-shape'].unique())
# print(df_codelabeled['ring-number'].unique())
# print(df_codelabeled['ring-type'].unique())
# print(df_codelabeled['spore-print-color'].unique())
# print(df_codelabeled['population'].unique())
# print(df_codelabeled['habitat'].unique())
df_codelabeled.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5764 entries, 0 to 5823
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   label              5764 non-null   int32  
 1   cap-shape          5764 non-null   object 
 2   cap-surface        5764 non-null   object 
 3   bruises            5764 non-null   object 
 4   odor               5764 non-null   object 
 5   stalk-shape        5764 non-null   object 
 6   ring-number        5764 non-null   object 
 7   ring-type          5764 non-null   object 
 8   spore-print-color  5764 non-null   object 
 9   population         5764 non-null   object 
 10  habitat            5764 non-null   object 
 11  cap-color-rate     5764 non-null   float64
dtypes: float64(1), int32(1), object(10)
memory usage: 562.9+ KB


In [153]:
df_prepped = df_codelabeled.copy()
nominal_columns = list(df_prepped.columns)
nominal_columns.remove('label')
nominal_columns.remove('cap-color-rate')
# nominal_columns
dummy_df = pd.get_dummies(df_prepped[nominal_columns], drop_first=True)
df_prepped = pd.concat([df_prepped, dummy_df], axis=1)
df_prepped = df_prepped.drop(nominal_columns, axis=1)
print(df_prepped.shape)

(5764, 43)
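
With `drop_first=True`, a variable with k categories becomes k - 1 indicator columns, and the dropped first category is represented by all of its indicators being 0. In the result above, the 43 columns are the label, `cap-color-rate`, and 41 indicators. A sketch on a made-up frame (values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "habitat": ["g", "u", "m", "g"],   # 3 categories -> 2 indicator columns
    "bruises": ["t", "f", "t", "t"],   # 2 categories -> 1 indicator column
})

dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
print(dummies.shape)
```

Dropping one level per variable avoids perfectly collinear columns. Tree ensembles such as a random forest do not require this, but it keeps the feature matrix smaller.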


8. Split the data into training and test sets with a 20% test size, stratified on the label, using seed = 2020.

In [154]:
from sklearn.model_selection import train_test_split
y = df_prepped.pop('label')
X = df_prepped

X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,test_size=0.20, random_state=2020)

In [155]:
X_train.shape, X_test.shape

((4611, 42), (1153, 42))
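
Stratifying keeps the class ratio the same in both parts of the split. A minimal sketch with synthetic data (100 samples in a 60:40 ratio, chosen only for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=2020
)

# bincount gives the number of samples per class in each part.
print(np.bincount(y_tr), np.bincount(y_te))
```

Both parts keep the 60:40 ratio. In the real split, the 732 : 421 test support seen in the final report mirrors the 3660 : 2104 ratio of the full data in the same way.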

9. Create a Random Forest with GridSearchCV on the training data, using 5-fold cross-validation and the following grid:

    'criterion': ['gini','entropy']  
    'max_depth': [2,3,6]  
    'min_samples_leaf': [2,5,10]  
    'n_estimators': [100,200]  
    'random_state': 2020

In [156]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [172]:
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=dict(
        criterion= ['gini','entropy'],
        max_depth =  [2,3,6],
        min_samples_leaf = [2,5,10],
        n_estimators = [100,200],
        random_state =  [2020]
    ),
    cv=5,
    n_jobs=-1 # Parallel
)
grid_search.fit(X_train, y_train)
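
To get a feel for the cost of this search, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. A short sketch using the same grid as above (the variable names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 6],
    "min_samples_leaf": [2, 5, 10],
    "n_estimators": [100, 200],
    "random_state": [2020],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 3 * 2 * 1 combinations
n_fits = n_candidates * 5                      # one fit per candidate per CV fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the whole training set, which is the model exposed as `best_estimator_`.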

In [173]:
# grid_search_result = grid_search.cv_results_
# pd.DataFrame.from_dict(grid_search_result)
model = grid_search.best_estimator_
y_pred = model.predict(X_test)
# y_pred[:10]

In [176]:
grid_search.best_params_

{'criterion': 'gini',
 'max_depth': 6,
 'min_samples_leaf': 2,
 'n_estimators': 100,
 'random_state': 2020}

10. Predict on the test set and evaluate the result with confusion_matrix and classification_report.

In [175]:
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred,digits=4))
print(confusion_matrix(y_test,y_pred,labels=[0,1]))

              precision    recall  f1-score   support

           0     0.9932    0.9986    0.9959       732
           1     0.9976    0.9881    0.9928       421

    accuracy                         0.9948      1153
   macro avg     0.9954    0.9934    0.9944      1153
weighted avg     0.9948    0.9948    0.9948      1153

[[731   1]
 [  5 416]]
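
The figures in the report follow directly from the confusion matrix, where rows are the true classes and columns the predicted classes, in the order `[0, 1]`. A short sketch recomputing them with NumPy:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[731, 1],
               [5, 416]])

correct = np.diag(cm)                  # correctly classified samples per class
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 4), np.round(recall, 4), round(float(accuracy), 4))
```

Read this way, the matrix shows that one poisonous mushroom (class 0) was predicted as edible, and five edible mushrooms (class 1) were predicted as poisonous.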
