## Mushroom Classifier
Dennis Mott
<br>
April 2024

## Overview

This notebook uses the mushroom dataset drawn from The Audubon Society Field Guide to North American Mushrooms. The goal is to show how a KNN classifier can impute missing values and to show the importance of dimensionality reduction using PCA. Two simple models (Random Forest Classifier, Logistic Regression) were trained to predict whether a mushroom is edible or poisonous based on the attributes below. The models were trained with and without PCA dimensionality reduction, and four performance measures (accuracy, precision, recall, and training time) were used to compare the effect of PCA on the models.

## Attributes

1. `cap_shape`: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. `cap_surface`: fibrous=f, grooves=g, scaly=y, smooth=s
3. `cap_color`: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. `bruises`: bruises=t, no=f
5. `odor`: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. `gill_attachment`: attached=a, descending=d, free=f, notched=n
7. `gill_spacing`: close=c, crowded=w, distant=d
8. `gill_size`: broad=b, narrow=n
9. `gill_color`: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. `stalk_shape`: enlarging=e, tapering=t
11. `stalk_root`: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12. `stalk_surface_above_ring`: fibrous=f, scaly=y, silky=k, smooth=s
13. `stalk_surface_below_ring`: fibrous=f, scaly=y, silky=k, smooth=s
14. `stalk_color_above_ring`: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15. `stalk_color_below_ring`: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. `veil_type`: partial=p, universal=u
17. `veil_color`: brown=n, orange=o, white=w, yellow=y
18. `ring_number`: none=n, one=o, two=t
19. `ring_type`: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. `spore_print_color`: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21. `population`: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. `habitat`: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

### Target Variable
`edible_poisonous`: edible=e, poisonous=p

## Explore Data and Prep for ML Models

In [1]:
# standard imports
import numpy as np
import pandas as pd

In [2]:
# define column names from 'agaricus-lepiota.names' file
col_names = ['edible_poisonous',
            'cap_shape',
            'cap_surface',
            'cap_color',
            'bruises',
            'odor',
            'gill_attachment',
            'gill_spacing',
            'gill_size',
            'gill_color',
            'stalk_shape',
            'stalk_root',
            'stalk_surface_above_ring',
            'stalk_surface_below_ring',
            'stalk_color_above_ring',
            'stalk_color_below_ring',
            'veil_type',
            'veil_color',
            'ring_number',
            'ring_type',
            'spore_print_color',
            'population',
            'habitat']

In [3]:
# import data
mushrooms =  pd.read_csv('data/agaricus-lepiota.data', sep=',', names=col_names, index_col=False)

# view first 5 rows of mushrooms data set
mushrooms.head(5)

Unnamed: 0,edible_poisonous,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


The data is entirely categorical, and `stalk_root` is the only column with missing values. The missing values are labeled as '?'. These values will be replaced with NaN to make them easier to identify and replace.

In [4]:
# view missing data in 'stalk_root' column
mushrooms.iloc[:,11].value_counts(dropna=False)

b    3776
?    2480
e    1120
c     556
r     192
Name: stalk_root, dtype: int64

In [5]:
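# replace '?' with NaN and verify
# (assign the result back; inplace=True on a column selection may not update the dataframe)
mushrooms['stalk_root'] = mushrooms['stalk_root'].replace(to_replace='?', value=np.nan)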
mushrooms.iloc[:,11].value_counts(dropna=False)

b      3776
NaN    2480
e      1120
c       556
r       192
Name: stalk_root, dtype: int64
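
As an alternative sketch, pandas can mark the '?' placeholder as missing while the file is read, through the `na_values` argument of `pd.read_csv`. The three-row inline sample and the shortened column list `demo_cols` below are made up for illustration; they are not rows from the dataset.

```python
import io
import pandas as pd

# hypothetical three-row sample with only three columns, for illustration
sample = io.StringIO("p,x,e\ne,b,?\ne,x,c\n")
demo_cols = ['edible_poisonous', 'cap_shape', 'stalk_root']

# na_values='?' converts the placeholder to NaN while reading
demo = pd.read_csv(sample, names=demo_cols, na_values='?')
n_missing = int(demo['stalk_root'].isna().sum())
print(demo['stalk_root'].value_counts(dropna=False))
```

With this approach the separate replace step is not needed, at the cost of never seeing the raw '?' counts.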

## Replace Missing Values with KNN Classifier

`stalk_root` is the only column with missing data. A KNN classifier will be used to predict these missing values. The default settings are used, which means the number of nearest neighbors is `n_neighbors=5` and the distance metric is 'minkowski' with p=2, which is the same as the Euclidean distance. The predicted values will then be used to replace the missing values in the mushrooms data set. Before applying KNN, all columns will be one-hot encoded, with the exception of the `stalk_root` column, which will be label encoded.
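
To make the distance claim concrete, here is a small sketch with made-up one-hot vectors. It checks that the Minkowski distance with p=2 matches the Euclidean norm, and it illustrates that for one-hot encoded rows the squared distance equals twice the number of attributes on which two mushrooms differ. The helper `minkowski` and the vectors `a` and `b` are hypothetical.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

# two hypothetical mushrooms, each with 3 attributes of 3 categories, one-hot encoded
a = np.array([1, 0, 0,  0, 1, 0,  0, 0, 1], dtype=float)
b = np.array([1, 0, 0,  0, 0, 1,  1, 0, 0], dtype=float)

d_minkowski = minkowski(a, b, p=2)
d_euclidean = float(np.linalg.norm(a - b))

# the rows differ on 2 of the 3 attributes; each mismatch changes two positions,
# so it contributes 2 to the squared distance
n_mismatched = 2
print(d_minkowski, d_euclidean, np.sqrt(2 * n_mismatched))
```

In other words, on one-hot features the nearest neighbors are simply the rows that share the most attribute values.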

In [6]:
# create copy of dataframe
mushrooms_copy = mushrooms.copy()

# get indices of NaN values
indices_missing = np.where(mushrooms_copy['stalk_root'].isnull())[0]
indices_not_missing = np.where(mushrooms_copy['stalk_root'].notnull())[0]

In [7]:
# create training and test data sets
X_train = mushrooms_copy.iloc[indices_not_missing].drop(['stalk_root'], axis=1)
X_test = mushrooms_copy.iloc[indices_missing].drop(['stalk_root'], axis=1)

y_train = mushrooms_copy['stalk_root'].iloc[indices_not_missing]
y_test = mushrooms_copy['stalk_root'].iloc[indices_missing]

As seen below, the features are one-hot encoded and the response variable is label encoded before using the KNN classifier. This converts the categorical attributes to numeric values, which is required by the KNN classifier and by the ML models the data will be fed into later on. The response data is label encoded so that it remains a single multi-class variable. If the response variable were one-hot encoded, it would turn into a multi-label response variable, where multiple KNN classifiers could be created and combined to make a prediction. That could produce ties in which multiple `stalk_root` classes are predicted for a single set of attributes, so in the end it is better to label encode and keep a single-column response variable.
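
The difference between the two encodings can be seen in a small sketch. The arrays `cap_shape` and `stalk_root` below are hypothetical four-row samples, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# hypothetical feature column (2D, as OneHotEncoder expects) and response column (1D)
cap_shape = np.array([['x'], ['b'], ['x'], ['f']])
stalk_root = np.array(['b', 'e', 'c', 'b'])

# one-hot: one binary column per category (categories sorted: b, f, x)
onehot = OneHotEncoder().fit_transform(cap_shape).toarray()

# label encoding: a single integer column (classes sorted: b=0, c=1, e=2)
labels = LabelEncoder().fit_transform(stalk_root)

print(onehot)
print(labels)
```

The feature column becomes three binary columns, while the response stays a single column of class indices.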

In [8]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# one-hot encode feature variables
cat_encoder = OneHotEncoder()
cat_encoder.fit(mushrooms_copy.drop(['stalk_root'], axis=1))
X_train_encode = cat_encoder.transform(X_train)
X_test_encode = cat_encoder.transform(X_test)

# label encode response variable
lab_encoder = LabelEncoder()
lab_encoder.fit(y_train)
y_train_encode = lab_encoder.transform(y_train)

In [9]:
from sklearn.neighbors import KNeighborsClassifier

# instantiate KNN classifier with default settings
knn = KNeighborsClassifier(n_neighbors = 5,
                           p = 2, # euclidean distance
                           metric = 'minkowski'
                          )

knn.fit(X_train_encode, y_train_encode)
y_predict = knn.predict(X_test_encode)
y_predict


array([0, 0, 2, ..., 2, 0, 2])

In [10]:
# missing values predicted by KNN classifier
missing_values = lab_encoder.inverse_transform(y_predict)
unique, counts = np.unique(missing_values, return_counts=True)
print(np.asarray((unique, counts)).T)

[['b' 1891]
 ['c' 65]
 ['e' 524]]


In [11]:
# bring predicted values back into dataframe
mushrooms['stalk_root'] = mushrooms['stalk_root'].fillna(pd.Series(missing_values, index=indices_missing))

# check to ensure 'stalk_root' contains no missing values
mushrooms.iloc[:,11].value_counts(dropna=False)

b    5667
e    1644
c     621
r     192
Name: stalk_root, dtype: int64
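
The imputation steps above can be condensed into one helper. This is a minimal sketch under a few assumptions: the function name `impute_with_knn` and the toy dataframe are hypothetical, `pd.get_dummies` stands in for `OneHotEncoder`, and the string class labels are passed to the classifier directly instead of going through `LabelEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def impute_with_knn(df, target, n_neighbors=5):
    """Fill NaNs in a categorical `target` column using KNN on the other columns."""
    out = df.copy()
    missing = out[target].isna()
    # dense one-hot encoding of every column except the target
    features = pd.get_dummies(out.drop(columns=[target])).astype(float)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    # train on rows where the target is known
    knn.fit(features[~missing], out.loc[~missing, target].to_numpy())
    # predict the target for rows where it is missing
    out.loc[missing, target] = knn.predict(features[missing])
    return out

# hypothetical toy data: the last two rows are missing 'stalk_root'
toy = pd.DataFrame({
    'odor':       ['a', 'a', 'a', 'f', 'f', 'f', 'a', 'f'],
    'gill_size':  ['b', 'b', 'b', 'n', 'n', 'n', 'b', 'n'],
    'stalk_root': ['c', 'c', 'c', 'b', 'b', 'b', np.nan, np.nan],
})
filled = impute_with_knn(toy, 'stalk_root', n_neighbors=3)
print(filled['stalk_root'].tolist())
```

Each missing row takes the majority `stalk_root` class of its three nearest neighbors, which here are the rows with identical attributes.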

## Build ML Models

A RandomForestClassifier and a LogisticRegression model are created below to predict whether a mushroom is edible or poisonous. The features will be one-hot encoded, while the target variable, `edible_poisonous`, will be label encoded.

In [12]:
from sklearn.model_selection import train_test_split

# create training and test data sets
X = mushrooms.drop(['edible_poisonous'], axis=1)
y = mushrooms['edible_poisonous']

# one-hot encode feature variables
cat_encoder = OneHotEncoder()
cat_encoder.fit(X)
X_encode = cat_encoder.transform(X)

# label encode response variable
lab_encoder = LabelEncoder()
lab_encoder.fit(y)
y_encode = lab_encoder.transform(y)

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X_encode, y_encode, test_size=0.2, random_state=42)

In [13]:
from sklearn.ensemble import RandomForestClassifier

# instantiate randomforest classifier using default parameters
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

rf_clf_predict = rf_clf.predict(X_test)

rf_clf_time = %timeit -n1 -r1 -o rf_clf.fit(X_train, y_train)

386 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [14]:
from sklearn.linear_model import LogisticRegression

# instantiate logistic regression using default parameters
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

log_reg_predict = log_reg.predict(X_test)

log_reg_time = %timeit -n1 -r1 -o log_reg.fit(X_train, y_train)

44.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
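
The `%timeit` magic used in the two cells above only works inside IPython or Jupyter, and `-n1 -r1` records a single run. As a rough sketch of a plain-Python alternative, the hypothetical helper `time_fit` below repeats the call several times with `time.perf_counter`. The lambda is a stand-in workload so the block runs on its own; in the notebook it would be something like `lambda: log_reg.fit(X_train, y_train)`.

```python
import time

def time_fit(fit_fn, repeats=5):
    """Return the mean and minimum wall-clock time of calling fit_fn, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations), min(durations)

# stand-in workload (assumption): replace with the model's fit call
mean_s, best_s = time_fit(lambda: sum(i * i for i in range(100_000)))
print(f'mean {mean_s:.4f} s, best {best_s:.4f} s')
```

Averaging over several repeats makes small differences between models easier to trust than a single timed run.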


If the response variable were one-hot encoded with the settings drop='first' and sparse=False (the latter is named `sparse_output` in scikit-learn 1.2 and later), the encoded values would be the same as the label-encoded ones. This is because the response variable is binary: one-hot encoding creates two columns, and dropping the first leaves a single column of 0s and 1s. This is exactly what label encoding outputs.
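
This equivalence can be checked on a small made-up response vector. The array `y_demo` is hypothetical, and the sketch calls `.toarray()` on the sparse result rather than passing a dense-output argument, so it does not depend on that argument's name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# hypothetical binary response values
y_demo = np.array(['p', 'e', 'e', 'p', 'e'])

label_encoded = LabelEncoder().fit_transform(y_demo)

# drop='first' removes the column of the first category ('e'), leaving one column
onehot_encoded = (
    OneHotEncoder(drop='first')
    .fit_transform(y_demo.reshape(-1, 1))
    .toarray()
    .ravel()
    .astype(int)
)

print(label_encoded)
print(onehot_encoded)
```

Both encoders sort the categories, so 'e' maps to 0 and 'p' maps to 1 in either case.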

In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# create function to calculate accuracy, precision, recall, and model training time
def scores(y_true, y_pred, model_train_time):
    acc =  accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average='macro')
    rec = recall_score(y_true, y_pred, average='macro')
    time = model_train_time.average
    return pd.DataFrame([('Accuracy',acc),('Precision',prec),('Recall',rec),('Time',time)], columns=['Item', 'Score'])

rf_clf_scores = scores(y_test, rf_clf_predict, rf_clf_time) 
log_reg_scores = scores(y_test, log_reg_predict, log_reg_time)

print('---random forest classifier---')
print(rf_clf_scores)
print('')
print('---logistic regression---')
print(log_reg_scores)

---random forest classifier---
        Item     Score
0   Accuracy  1.000000
1  Precision  1.000000
2     Recall  1.000000
3       Time  0.386029

---logistic regression---
        Item     Score
0   Accuracy  1.000000
1  Precision  1.000000
2     Recall  1.000000
3       Time  0.044658


The models are summarized by accuracy, precision, recall, and training time as seen above. Both models show 100% accuracy, precision, and recall on the test set. This is not entirely surprising, since it was provided with the dataset that rules based on `habitat`=leaves and `cap_color`=white, or `population`=clustered and `cap_color`=white, would yield 100% accuracy. Based on the single timed run of each, the logistic regression model (about 45 ms) trained roughly 8.6x faster than the random forest model (about 386 ms).
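
The `scores` function above passes `average='macro'` to the precision and recall metrics, which means each metric is computed per class and then averaged without weighting. A minimal sketch of that calculation, using a hypothetical helper `macro_precision_recall` and made-up labels:

```python
import numpy as np

def macro_precision_recall(y_true, y_pred):
    """Compute precision and recall per class, then take the unweighted mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls = [], []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))  # true positives
        fp = np.sum((y_pred == cls) & (y_true != cls))  # false positives
        fn = np.sum((y_pred != cls) & (y_true == cls))  # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))

# hypothetical labels: 0 = edible, 1 = poisonous
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1]
prec, rec = macro_precision_recall(y_true_demo, y_pred_demo)
print(prec, rec)  # 0.875 and about 0.8333
```

Here one edible mushroom is predicted as poisonous, so class 0 has recall 2/3 and class 1 has precision 3/4, giving macro averages of 0.875 and 5/6.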

## Dimensionality Reduction Using PCA

When a dataset has many categorical attributes, a one-hot encoder is typically used to convert the data from categorical to numeric for ML models. Whenever an attribute has more than 2 classes, one-hot encoding (even with drop='first') increases the number of dimensions in the dataset. This can lead to overfitting, because the data becomes sparse and spread out in the larger space. In comes PCA to save the day. PCA finds the principal components, the directions that capture the most variance within the data. sklearn's PCA also centers the data automatically.

For this assignment, PCA with a target explained variance of 95% was used to find the minimum number of principal components that explain this much variance. The number of dimensions was reduced by nearly 65%, from 116 to 41, as seen below.
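
As a rough illustration of what a 95% variance target means, the sketch below picks the smallest number of components whose cumulative explained variance ratio reaches the target, using NumPy's SVD on centered data. The helper `components_for_variance` and the random data `X_demo` are hypothetical, and this is not a copy of sklearn's internal implementation. The last lines reproduce the reduction percentage from the 116 and 41 dimensions reported in this notebook.

```python
import numpy as np

def components_for_variance(X, target=0.95):
    """Smallest number of principal components reaching the target explained variance."""
    X_centered = X - X.mean(axis=0)  # PCA works on centered data
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2  # proportional to the variance per component
    cumulative = np.cumsum(explained / explained.sum())
    # index of the first cumulative ratio >= target, converted to a count
    return int(np.searchsorted(cumulative, target) + 1), cumulative

# hypothetical data: two high-variance directions plus eight low-variance noise columns
rng = np.random.default_rng(0)
X_demo = np.hstack([
    rng.normal(scale=10.0, size=(200, 2)),
    rng.normal(scale=0.1, size=(200, 8)),
])
k, cumulative = components_for_variance(X_demo)
print(k)

# the reduction reported in the notebook: 116 -> 41 dimensions
reduction_pct = round(100 * (1 - 41 / 116), 2)
print(reduction_pct)
```

In the toy data two components carry almost all of the variance, so eight of the ten dimensions can be dropped.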

In [16]:
from sklearn.decomposition import PCA

# instantiate PCA class
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train.toarray())
X_test_reduced = pca.transform(X_test.toarray())

# percentage of reduced dimensions
print('percentage of reduced dimensions:', round(100 * (1 -(pca.n_components_ / X_train.shape[1])), 2), '%')

# number of dimensions
print('number of dimensions after PCA:', pca.n_components_)

percentage of reduced dimensions: 64.66 %
number of dimensions after PCA: 41


In [17]:
# check number of dimensions before and after pca

print('X_train:', X_train.shape[1])
print('X_train_reduced:', X_train_reduced.shape[1])

X_train: 116
X_train_reduced: 41


Retrain the models using the reduced dimensions after using PCA.

In [18]:
# instantiate randomforest classifier using default parameters
rf_clf_pca = RandomForestClassifier(random_state=42)
rf_clf_pca.fit(X_train_reduced, y_train)

rf_clf_pca_predict = rf_clf_pca.predict(X_test_reduced)

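# time the PCA model on the reduced features
# (the timing recorded below came from a run that refit rf_clf on X_train)
rf_clf_pca_time = %timeit -n1 -r1 -o rf_clf_pca.fit(X_train_reduced, y_train)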

317 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [19]:
# instantiate logistic regression using default parameters
log_reg_pca = LogisticRegression()
log_reg_pca.fit(X_train_reduced, y_train)

log_reg_pca_predict = log_reg_pca.predict(X_test_reduced)

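# time the PCA model on the reduced features
# (the timing recorded below came from a run that refit log_reg on X_train)
log_reg_pca_time = %timeit -n1 -r1 -o log_reg_pca.fit(X_train_reduced, y_train)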

41.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [20]:
# check performance scores on test data
rf_clf_pca_scores = scores(y_test, rf_clf_pca_predict, rf_clf_pca_time)
log_reg_pca_scores = scores(y_test, log_reg_pca_predict, log_reg_pca_time)

print('---random forest classifier---')
print(rf_clf_pca_scores)
print('')
print('---logistic regression---')
print(log_reg_pca_scores)

---random forest classifier---
        Item     Score
0   Accuracy  1.000000
1  Precision  1.000000
2     Recall  1.000000
3       Time  0.317053

---logistic regression---
        Item     Score
0   Accuracy  0.996308
1  Precision  0.996260
2     Recall  0.996349
3       Time  0.041054


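Again, the performance of the models is summarized by accuracy, precision, recall, and training time. With the reduced dimensions, the random forest scores are unchanged, while accuracy, precision, and recall for the logistic regression model each dropped by less than 0.4 percentage points compared to the full data model. The training times come from single runs, so small differences between them should not be over-interpreted.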

## Model Analysis

In [21]:
# create performance table summary
models = ['Random Forest', 'Random Forest', 'Random Forest', 'Random Forest', 'Logistic Regression', 'Logistic Regression', 'Logistic Regression','Logistic Regression']
items = ['Accuracy', 'Precision', 'Recall', 'Time', 'Accuracy', 'Precision', 'Recall', 'Time']
full_data = pd.concat([rf_clf_scores['Score'], log_reg_scores['Score']])
pca_reduced = pd.concat([rf_clf_pca_scores['Score'], log_reg_pca_scores['Score']])

summary = pd.DataFrame(list(zip(models, items, full_data, pca_reduced)), columns=['Models','Item','Full Data','PCA Reduced'])
print(summary)

                Models       Item  Full Data  PCA Reduced
0        Random Forest   Accuracy   1.000000     1.000000
1        Random Forest  Precision   1.000000     1.000000
2        Random Forest     Recall   1.000000     1.000000
3        Random Forest       Time   0.386029     0.317053
4  Logistic Regression   Accuracy   1.000000     0.996308
5  Logistic Regression  Precision   1.000000     0.996260
6  Logistic Regression     Recall   1.000000     0.996349
7  Logistic Regression       Time   0.044658     0.041054


## Conclusion

In this notebook, a KNN classifier was successfully used to impute missing values for the `stalk_root` attribute. The features were one-hot encoded and the response variable, `edible_poisonous`, was label encoded prior to training the RandomForest and LogisticRegression ML models. PCA was used to reduce the number of dimensions in the training set by nearly 65%, from 116 to 41. It would appear there is some overfitting, since there are perfect scores for both models. Note that the default hyperparameters of the models, such as `max_depth=None` for RandomForest, are probably the leading cause of such scores. The PCA-reduced scores are only marginally lower than the full-data scores, and only for logistic regression, even though many of the dimensions were removed. With any data set, it is important to reduce the dimensionality as much as possible, especially with sparse data, to reduce the risk of overfitting.
