-- Jaimy Van Audenhove --

# Mushroom Classification

<b>The deadline for the notebook is 02/06/2023</b>.


<b>The deadline for the video is 09/06/2023</b>.

## The dataset

You are asked to predict whether a mushroom is poisonous or edible, based on its physical characteristics. The dataset is provided in the accompanying file 'mushroom.csv'. A full description of the dataset can be found in the file 'metadata.txt'.

The dataset can be loaded using the following commands (make sure to put the dataset in your iPython notebook directory):

In [45]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data (the shuffling happens in train_test_split below)
mushroom = pd.read_csv('mushroom.csv', sep=';')

feature_cols = ['cap-diameter','cap-shape','cap-surface','cap-color','does-bruise-or-bleed','gill-attachment','gill-spacing','gill-color','stem-height','stem-width','stem-root','stem-surface','stem-color','veil-type','veil-color','has-ring','ring-type','spore-print-color','habitat','season']

#Create the feature and target vectors
X = mushroom[feature_cols]
y = mushroom['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

## Minimum Requirements

You will need to train at least 3 different models on the data set. Make sure to include the reason for your choice (e.g., for dealing with categorical features).

* Define the problem, analyze the data, prepare the data for your model.
* Train at least 3 models (e.g. decision trees, nearest neighbour, ...) to predict whether a mushroom is poisonous or edible. You are allowed to use any machine learning model from scikit-learn or other methods, as long as you motivate your choice.
* For each model, optimize the parameter settings (tree depth, hidden nodes/decay, number of neighbours, ...). Show which parameter setting gives the best model.
* Compare the best parameter settings for all models and estimate their errors on unseen data. Investigate the learning process critically (overfitting/underfitting). Can you show that one of the models performs better?

All results, plots and code should be handed in as an interactive <a href='http://ipython.org/notebook.html'>iPython notebook</a>. Simply providing code and plots does not suffice; you are expected to accompany each technical section with explanations and discussion of your choices, results, observations, etc. in the notebook and in a video (made by recording your screen and voice).


## Optional Extensions

You are encouraged to try and see if you can further improve on the models you obtained above. This is not necessary to obtain a good grade on the assignment, but any extensions on the minimum requirements will count for extra credit. Some suggested possibilities to extend your approach are:

* Build and host an API for your best performing model. You can create an API using Python frameworks such as FastAPI, Flask, ... You can host an API for free on Heroku, using your student credit on Azure, ...
* Try to combine multiple models. Ensemble and boosting methods try to combine the predictions of many simple models. This typically works best with models that make different errors. Scikit-learn has some support for this, <a href="http://scikit-learn.org/stable/modules/ensemble.html">see here</a>. You can also try to combine the predictions of multiple models manually, i.e. train multiple models and average their predictions.
* You can always investigate whether all features are necessary to produce a good model. Feel free to look up additional resources and papers to find more information, see e.g. <a href='https://scikit-learn.org/stable/modules/feature_selection.html'>here</a> for the feature selection module provided by the scikit-learn library.
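
The model-combination idea from the list above could look roughly like the sketch below. It uses scikit-learn's `VotingClassifier` with soft voting, which averages the predicted class probabilities of the individual models. The data comes from `make_classification` and is only a stand-in for the mushroom features; the variable names and hyperparameter values are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Soft voting averages the class probabilities predicted by each model
ensemble = VotingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(max_depth=5, random_state=42)),
        ('forest', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ],
    voting='soft',
)
ensemble.fit(X_tr, y_tr)

proba = ensemble.predict_proba(X_te)      # averaged probabilities, one row per test sample
ensemble_acc = ensemble.score(X_te, y_te)
print(f'Ensemble accuracy on the held-out split: {ensemble_acc:.4f}')
```

Whether such an ensemble helps depends on how different the errors of the individual models are, so it should be compared against each single model on the same split.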

## Additional Remarks

* Depending on the model used, you may want to <a href='http://scikit-learn.org/stable/modules/preprocessing.html'>scale</a> or <a href='https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features'>encode</a> your (categorical) features X and/or outputs y.
* Refer to the <a href='http://scipy.org/docs.html'>SciPy</a> and <a href='http://scikit-learn.org/stable/documentation.html'>scikit-learn</a> documentation for more information on classifiers and data handling.
* You are allowed to use additional libraries, but provide references for these.
* The assignment is **individual**. All results should be your own. Plagiarism will not be tolerated.

In [46]:
mushroom.head()

Unnamed: 0,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
0,p,15.26,x,g,o,f,e,,w,16.95,...,s,y,w,u,w,t,g,,d,w
1,p,16.6,x,g,o,f,e,,w,17.99,...,s,y,w,u,w,t,g,,d,u
2,p,14.07,x,g,o,f,e,,w,17.8,...,s,y,w,u,w,t,g,,d,w
3,p,14.17,f,h,e,f,e,,w,15.77,...,s,y,w,u,w,t,p,,d,w
4,p,14.64,x,h,o,f,e,,w,16.53,...,s,y,w,u,w,t,p,,d,w


In [3]:
mushroom.tail()

Unnamed: 0,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
61064,p,1.18,s,s,y,f,f,f,f,3.93,...,,,y,,,f,f,,d,a
61065,p,1.27,f,s,y,f,f,f,f,3.18,...,,,y,,,f,f,,d,a
61066,p,1.27,s,s,y,f,f,f,f,3.86,...,,,y,,,f,f,,d,u
61067,p,1.24,f,s,y,f,f,f,f,3.56,...,,,y,,,f,f,,d,u
61068,p,1.17,s,s,y,f,f,f,f,3.25,...,,,y,,,f,f,,d,u


In [4]:
mushroom.describe()

Unnamed: 0,cap-diameter,stem-height,stem-width
count,61069.0,61069.0,61069.0
mean,6.733854,6.581538,12.14941
std,5.264845,3.370017,10.035955
min,0.38,0.0,0.0
25%,3.48,4.64,5.21
50%,5.86,5.95,10.19
75%,8.54,7.74,16.57
max,62.34,33.92,103.91


In [5]:
mushroom.describe(include=['O'])

Unnamed: 0,class,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
count,61069,61069,46949,61069,61069,51185,36006,61069,9531,22945,61069,3177,7413,61069,58598,6354,61069,61069
unique,2,7,11,12,2,7,3,12,5,8,13,1,6,2,8,7,8,4
top,p,x,t,n,f,a,c,w,s,s,w,u,w,f,f,k,d,a
freq,33888,26934,8196,24218,50479,12698,24710,18521,3177,6025,22926,3177,5474,45890,48361,2118,44209,30177


In [6]:
# Check the % of missing values
missing = pd.concat([mushroom.isnull().sum(), 100 * mushroom.isnull().mean()], axis = 1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
class,0,0.0
has-ring,0,0.0
stem-color,0,0.0
habitat,0,0.0
stem-height,0,0.0
gill-color,0,0.0
stem-width,0,0.0
does-bruise-or-bleed,0,0.0
cap-color,0,0.0
cap-shape,0,0.0


In [7]:
# Drop veil-type since there is only ~5% of data for it
mushroom.drop('veil-type', axis=1,inplace=True)

In [8]:
# Fill in empty values with 'unknown'
mushroom = mushroom.fillna(value='unknown')

In [9]:
mushroom.isnull().sum()

class                   0
cap-diameter            0
cap-shape               0
cap-surface             0
cap-color               0
does-bruise-or-bleed    0
gill-attachment         0
gill-spacing            0
gill-color              0
stem-height             0
stem-width              0
stem-root               0
stem-surface            0
stem-color              0
veil-color              0
has-ring                0
ring-type               0
spore-print-color       0
habitat                 0
season                  0
dtype: int64

In [10]:
mushroom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           61069 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       61069 non-null  object 
 7   gill-spacing          61069 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-root             61069 non-null  object 
 12  stem-surface          61069 non-null  object 
 13  stem-color            61069 non-null  object 
 14  veil-color            61069 non-null  object 
 15  has-ring           

In [36]:
mushroom['cap-diameter'] = pd.qcut(mushroom['cap-diameter'],4,['1stQ','2ndQ','3rdQ','4thQ'])

In [37]:
mushroom['stem-width'] = pd.qcut(mushroom['stem-width'],4,['1stQ','2ndQ','3rdQ','4thQ'])

In [38]:
mushroom['stem-height'] = pd.qcut(mushroom['stem-height'],4,['1stQ','2ndQ','3rdQ','4thQ'])
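
For reference, `pd.qcut` splits a numeric column into quantile-based bins that each hold roughly the same number of rows. A minimal sketch on made-up values (these are not taken from the mushroom data):

```python
import pandas as pd

# Hypothetical toy measurements, only to illustrate quantile binning
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = ['1stQ', '2ndQ', '3rdQ', '4thQ']

# q=4 cuts at the 25%, 50% and 75% quantiles
binned = pd.qcut(values, q=4, labels=labels)

# Count the rows per bin, keeping the bins in quartile order
counts = binned.value_counts(sort=False)
print(counts)
```

Binning discards the exact measurements, so it is worth checking whether the models do better on the binned or on the original numeric columns.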

In [39]:
#rename target column 'class' to 'poisonous'
mushroom.rename(columns={'class':'poisonous'},inplace=True)

In [40]:
mushroom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   poisonous             61069 non-null  int64   
 1   cap-diameter          61069 non-null  category
 2   cap-shape             61069 non-null  int64   
 3   cap-surface           61069 non-null  int64   
 4   cap-color             61069 non-null  int64   
 5   does-bruise-or-bleed  61069 non-null  int64   
 6   gill-attachment       61069 non-null  int64   
 7   gill-spacing          61069 non-null  int64   
 8   gill-color            61069 non-null  int64   
 9   stem-height           61069 non-null  category
 10  stem-width            61069 non-null  category
 11  stem-root             61069 non-null  int64   
 12  stem-surface          61069 non-null  int64   
 13  stem-color            61069 non-null  int64   
 14  veil-color            61069 non-null  int64   
 15  ha

In [13]:
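# Pearson correlation between the numeric columns only
# (recent pandas versions raise an error on string columns without numeric_only=True)
corr = mushroom.corr(numeric_only=True)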

In [14]:
corr["cap-diameter"].sort_values(ascending=False)

cap-diameter    1.00000
stem-width      0.69533
stem-height     0.42256
Name: cap-diameter, dtype: float64

In [15]:
from sklearn.preprocessing import LabelEncoder
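# Encode every column as integer codes. The encoder is refit for each column,
# so the mapping of earlier columns is not kept. Numeric columns are also
# replaced by the rank of their values.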
label_encoder = LabelEncoder()
for column in mushroom.columns:
    mushroom[column] = label_encoder.fit_transform(mushroom[column])
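
Integer codes impose an arbitrary order on nominal features such as `cap-shape`. One alternative, shown here only as a sketch, is to one-hot encode the categorical columns and pass the numeric ones through unchanged with a `ColumnTransformer`. The toy frame below borrows three column names from the dataset, but its values are made up and the choice of encoder is an assumption, not what this notebook actually uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows reusing three of the dataset's column names
toy = pd.DataFrame({
    'cap-diameter': [15.26, 1.18, 6.5, 3.2],
    'cap-shape': ['x', 's', 'f', 'x'],
    'habitat': ['d', 'd', 'g', 'm'],
})
categorical_cols = ['cap-shape', 'habitat']

# One-hot encode the categorical columns, keep the numeric column as it is
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ],
    remainder='passthrough',
)
encoded = encoder.fit_transform(toy)

# 3 cap-shape categories + 3 habitat categories + 1 numeric column
print(encoded.shape)
```

Because the fitted transformer is kept in one object, the same encoding can be applied later to new rows with `encoder.transform(...)`.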

In [16]:
#object values to numerical
mushroom.head(10)

Unnamed: 0,poisonous,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,stem-width,stem-root,stem-surface,stem-color,veil-color,has-ring,ring-type,spore-print-color,habitat,season
0,1,1481,6,2,6,0,2,3,10,1577,1656,4,8,11,5,1,2,6,0,3
1,1,1614,6,2,6,0,2,3,10,1681,1766,4,8,11,5,1,2,6,0,2
2,1,1362,6,2,6,0,2,3,10,1662,1721,4,8,11,5,1,2,6,0,3
3,1,1372,2,3,1,0,2,3,10,1463,1545,4,8,11,5,1,5,6,0,3
4,1,1419,6,3,6,0,2,3,10,1537,1667,4,8,11,5,1,5,6,0,3
5,1,1489,6,2,6,0,2,3,10,1666,1826,4,8,11,5,1,5,6,0,2
6,1,1440,2,3,6,0,2,3,10,1653,1636,4,8,11,5,1,2,6,0,3
7,1,1441,6,3,1,0,2,3,10,1585,1691,4,8,11,5,1,5,6,0,2
8,1,1240,2,2,6,0,2,3,10,1609,1816,4,8,11,5,1,5,6,0,0
9,1,1310,2,2,1,0,2,3,10,1490,1635,4,8,11,5,1,5,6,0,3


In [17]:
# Split the data into train and test sets
X = mushroom.drop('poisonous', axis=1)
y = mushroom['poisonous']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Model 1: Decision tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

# X_train, X_test, y_train and y_test come from the split in the previous cell

# Train the Decision Tree model
modelDT = DecisionTreeClassifier()
modelDT.fit(X_train, y_train)

# Evaluate the model
decision_tree_accuracy = modelDT.score(X_test, y_test)
print(f'Decision Tree Accuracy: {decision_tree_accuracy}')

Decision Tree Accuracy: 0.9977075487145898
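
The tree above uses the default settings. A possible way to tune the tree depth is a grid search with cross-validation on the training split, sketched below. The data is synthetic (`make_classification`) and the grid values are illustrative assumptions; on the mushroom data the same pattern would be applied to `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Candidate settings to compare (illustrative values)
param_grid = {
    'max_depth': [2, 4, 6, 8, None],
    'min_samples_leaf': [1, 5, 10],
}

# Shuffled, stratified folds, used only on the training split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=cv, scoring='accuracy'
)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)
# The test split is touched once, after the settings have been chosen
print('Held-out accuracy:', search.score(X_te, y_te))
```

`search.cv_results_` holds the score of every setting, which can be plotted against `max_depth` to discuss underfitting and overfitting.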


# Model 2: Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
modelRF = RandomForestClassifier()
modelRF.fit(X_train, y_train)

# Evaluate the model
random_forest_accuracy = modelRF.score(X_test, y_test)
print(f'Random Forest Accuracy: {random_forest_accuracy}')


['cap-diameter', 'cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed', 'gill-attachment', 'gill-spacing', 'gill-color', 'stem-height', 'stem-width', 'stem-root', 'stem-surface', 'stem-color', 'veil-color', 'has-ring', 'ring-type', 'spore-print-color', 'habitat', 'season']
Random Forest Accuracy: 1.0


# Model 3: K-Nearest Neighbors

In [32]:
import warnings
from sklearn.neighbors import KNeighborsClassifier

# Train the K-Nearest Neighbors model.
# Using the default of 5 neighbors since this gave the best results
modelKNN = KNeighborsClassifier()
modelKNN.fit(X_train, y_train)

# Suppress the FutureWarning messages raised while scoring
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    knn_accuracy = modelKNN.score(X_test, y_test)
print(f'K-Nearest Neighbors Accuracy: {knn_accuracy}')

K-Nearest Neighbors Accuracy: 0.8166857704273784
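
The lower KNN score may partly be a matter of feature scale: KNN relies on distances, and after label encoding some columns take values in the thousands while others stay below 13. The sketch below illustrates the effect on synthetic data with one informative small-scale feature and one large-scale noise feature. The data and the pipeline are assumptions for illustration, not a result on the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Feature 0 decides the class, feature 1 is pure noise on a much larger scale
informative = rng.normal(size=600)
noise = rng.normal(scale=1000.0, size=600)
X_demo = np.column_stack([informative, noise])
y_demo = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Without scaling, distances are dominated by the noise feature
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# With scaling, both features contribute on a comparable scale
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_tr, y_tr)

raw_acc = knn_raw.score(X_te, y_te)
scaled_acc = knn_scaled.score(X_te, y_te)
print(f'KNN without scaling: {raw_acc:.4f}')
print(f'KNN with scaling:    {scaled_acc:.4f}')
```

Putting the scaler inside the pipeline means it is fitted on the training data only, also when the pipeline is passed to `cross_val_score` or `GridSearchCV`.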


# Cross Validation

### Decision tree

In [50]:
from sklearn.model_selection import cross_val_score

scoreDT = cross_val_score(modelDT, X, y, cv=5)

print("Cross-Validation Accuracy Scores:", scoreDT)
print("Mean Accuracy:", scoreDT.mean())

Cross-Validation Accuracy Scores: [0.55534632 0.51490093 0.46856067 0.59857541 0.58896258]
Mean Accuracy: 0.5452691822921378


### Random Forest

In [51]:
scoreRF = cross_val_score(modelRF, X, y, cv=5)

print("Cross-Validation Accuracy Scores:", scoreRF)
print("Mean Accuracy:", scoreRF.mean())

Cross-Validation Accuracy Scores: [0.51801212 0.4880465  0.57303095 0.52415261 0.65413903]
Mean Accuracy: 0.5514762426564955


### K-Nearest Neighbours

In [53]:
# Suppress the FutureWarning messages raised during cross-validation
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    scoreKNN = cross_val_score(modelKNN, X, y, cv=5)

print("Cross-Validation Accuracy Scores:", scoreKNN)
print("Mean Accuracy:", scoreKNN.mean())

Cross-Validation Accuracy Scores: [0.44653676 0.45693467 0.62321926 0.51023416 0.5929747 ]
Mean Accuracy: 0.525979907887655


# Conclusion
### Compare the best parameter settings

In [62]:
# Compare the accuracy results
Aresults = {
    'Decision Tree': decision_tree_accuracy,
    'Random Forest': random_forest_accuracy,
    'K-Nearest Neighbors': knn_accuracy
}

best_Amodel = max(Aresults, key=Aresults.get)
best_Aaccuracy = Aresults[best_Amodel]

print('Accuracy conclusion:')
print(f'The model with the highest accuracy is {best_Amodel} with an accuracy of {best_Aaccuracy:.4f}.\n')

# Compare the cross-validation results
Cresults = {
    'Decision Tree': scoreDT.mean(),
    'Random Forest': scoreRF.mean(),
    'K-Nearest Neighbors': scoreKNN.mean()
}

best_Cmodel = max(Cresults, key=Cresults.get)
best_Caccuracy = Cresults[best_Cmodel]

print('Cross-validation conclusion:')
print(f'The model with the highest accuracy is {best_Cmodel} with a mean accuracy of {best_Caccuracy:.4f}.')

Accuracy conclusion:
The model with the highest accuracy is Random Forest with an accuracy of 1.0000.

Cross-validation conclusion:
The model with the highest accuracy is Random Forest with a mean accuracy of 0.5515.


# Extra: API for Random Forest Model

In [35]:
# Save the trained model to disk as a pickle file
import joblib

joblib.dump(modelRF, '/Users/jaimyva/Developer/AI/RandomForestModel.pkl')

['/Users/jaimyva/Developer/AI/RandomForestModel.pkl']
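
An API would load the saved file once at start-up and reuse the restored model for every request. The sketch below shows only that save-and-load round trip with `joblib`, using a small model trained on synthetic data and a temporary directory. The file location and the data are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the mushroom model)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; a real service would read its own configured path
    model_path = os.path.join(tmp_dir, 'RandomForestModel.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print('Predictions identical after reload:', same_predictions)
```

The incoming request data has to go through the same preprocessing (encoding, binning) as the training data before it is passed to `predict`, so those fitted steps need to be saved as well.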