# Mushroom Stew

## Develop a mushroom stew that is visually appealing, pleasant-smelling, and preferably non-toxic.

* Explore the fields: which ones could affect the taste or smell, and which can be ignored?
* Which fields may affect whether the stew is visually appealing?
* Use graphics to support your choices

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
from statsmodels.tools.eval_measures import rmse
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

In [None]:
# Set Plotting Scale Params
plt.rcParams['figure.figsize'] = (16, 12)
sns.set_style('darkgrid')
%matplotlib inline

## Defining Classes/Functions

In [None]:
### Defining Classes/Functions

def classification(method, x_dat, y_dat, **params): 
    
    #fit model
    # sparse_output replaces the older sparse argument in scikit-learn 1.2 and later
    mod = Pipeline([('encode', OneHotEncoder(sparse_output=False)), ('classify', method(**params))])
    mod.fit(x_dat, y_dat)
    y_pred = mod.predict(x_dat)
    
    #print results
    print("Results for {}:".format(method.__name__))
    print(classification_report(y_dat, y_pred))
    # the score is computed on the same data the model was fit on, so this is training accuracy
    print("Training Accuracy: {}%".format(round(mod.score(x_dat, y_dat)*100,2)))
    
    #plot confusion matrix (rows are true labels, columns are predicted labels)
    cm = confusion_matrix(y_dat, y_pred)
    f, ax = plt.subplots(figsize =(5,5))
    sns.heatmap(cm,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.title('Confusion Matrix')
    plt.show()
    
    #Calculate permutation feature importance
    # permutation feature importance - the decrease in a model score when a single feature value is randomly shuffled
    # thus the drop in the model score is indicative of how much the model depends on the feature
    # (n_jobs=-1 means using all processors)
    imp = permutation_importance(mod, x_dat, y_dat, n_jobs=-1)
    
    #Generate feature importance plot
    plt.figure(figsize=(12,8))
    importance_data = pd.DataFrame({'feature':x_dat.columns, 'importance':imp.importances_mean})
    sns.barplot(x='importance', y='feature', data=importance_data)
    plt.title('Permutation Feature Importance')
    # permutation_importance uses the estimator's default scorer, which is accuracy for classifiers
    plt.xlabel('Mean Decrease in Accuracy')
    plt.ylabel('')
    plt.show()
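
Note that `classification` scores each model on the same rows it was fit on. A minimal sketch of a held-out evaluation is shown below. The toy frame (`toy_x`, `toy_y`), the 25% test split, and the name `held_out_model` are illustrative assumptions and are not part of the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each column on its own determines the class
toy_x = pd.DataFrame({
    'odor': ['ALMOND', 'FOUL', 'NONE', 'FISHY'] * 10,
    'gill-size': ['BROAD', 'NARROW', 'BROAD', 'NARROW'] * 10,
})
toy_y = pd.Series([0, 1, 0, 1] * 10)  # 0 = edible, 1 = poisonous

# Hold out 25% of the rows, keeping the class balance in both parts
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=42)

# handle_unknown='ignore' avoids errors if a category appears only in the test rows
held_out_model = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('classify', DecisionTreeClassifier(random_state=42)),
])
held_out_model.fit(x_train, y_train)

# Score only on rows the model has not seen during fitting
test_accuracy = held_out_model.score(x_test, y_test)
print("Held-out accuracy: {}%".format(round(test_accuracy * 100, 2)))
```

The same split could be applied to the mushroom data before calling the pipeline, so that the reported accuracy reflects unseen mushrooms.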

## Load Dataset, Explore and Display Features

In [None]:
col_names=['class','cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor','gill-attachment',\
           'gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring',\
           'stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type',\
           'veil-color','ring-number','ring-type','spore-print-color','population','habitat'] 

mushroom_df = pd.read_csv('expanded.csv', names=col_names, header=None)

In [None]:
pd.set_option("display.max_columns", None)
mushroom_df.head(5)

In [None]:
mushroom_df.dtypes

In [None]:
mushroom_df.describe()

In [None]:
profile = ProfileReport(mushroom_df)
profile

## Feature Engineering


### The field `veil-type` doesn't contribute any information (all are the same value) and can be dropped from the dataset

In [None]:
mushroom_df_adj = mushroom_df.drop('veil-type', axis=1)
mushroom_df_adj.shape
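
Constant columns such as `veil-type` can also be found programmatically rather than by reading the profile report. The sketch below uses a tiny made-up frame (`toy_df` is a hypothetical stand-in for `mushroom_df`):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy_df = pd.DataFrame({
    'class': ['EDIBLE', 'POISONOUS', 'EDIBLE'],
    'veil-type': ['PARTIAL', 'PARTIAL', 'PARTIAL'],
    'odor': ['ALMOND', 'FOUL', 'NONE'],
})

# A column with a single distinct value carries no information
unique_counts = toy_df.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(constant_cols)

# Drop every constant column in one step
toy_df_adj = toy_df.drop(columns=constant_cols)
print(toy_df_adj.shape)
```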

### Label encode the "class" column

In [None]:
# label_encoder object
label_encoder =LabelEncoder()
# Encode labels in column. 
mushroom_df_adj['class']= label_encoder.fit_transform(mushroom_df_adj['class']) # 0 is Edible, 1 is Poisonous
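
`LabelEncoder` assigns codes in sorted order of the labels, which is why Edible becomes 0 and Poisonous becomes 1. A small sketch to confirm the mapping, using hypothetical toy labels (`toy_labels` and `toy_encoder` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same spelling the notebook filters on
toy_labels = ['POISONOUS', 'EDIBLE', 'EDIBLE', 'POISONOUS']

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

# classes_ holds the original labels; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)
print(list(encoded))
```

Printing `label_encoder.classes_` after the cell above gives the same check on the real data.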

### Split into feature and target data

In [None]:
# Note these x and y values will be used in the first set of classification models
# The second set of classification models uses this y value but a modified x value
x_mushroom = mushroom_df_adj.drop(["class"], axis=1)
y_mushroom = mushroom_df_adj["class"]

## Feature Exploration

### Violin Plot

In [None]:
# Creating Violin Plot for encoded features

labelencoder=LabelEncoder() # Must encode for this plot type

mushroom_df_encoded = mushroom_df.copy()
for column in mushroom_df_encoded.columns:
    mushroom_df_encoded[column] = labelencoder.fit_transform(mushroom_df_encoded[column])

df_div = pd.melt(mushroom_df_encoded.drop("veil-type",axis=1),
                 'class', var_name='Characteristics')
fig, ax = plt.subplots(figsize=(22,10))

p = sns.violinplot(ax = ax, x='Characteristics', y='value',
                   hue='class', split = True, data=df_div,
                   inner = 'quartile', palette = 'Set1')

df_no_class = mushroom_df_encoded.drop(['class','veil-type'],axis = 1)
p.set_xticklabels(rotation = 90, labels = list(df_no_class.columns))
ax.set_title('Violin Plot of Mushroom Edibility by Feature')
sns.set(font_scale=1.8)
plt.show()

As the violin plot above shows, odor, gill color, spore print color, and habitat appear to be strong indicators of edibility. Let's check that intuition by looking at the edibility counts for each variable.

### Edibility Stacked Barcharts

In [None]:
# Creating stacked barcharts for each feature in the dataset
sns.set_style("whitegrid")

fig, axes = plt.subplots(11,2, figsize=(24,88))
axes = axes.flatten()

for column, ax in zip(mushroom_df.drop(["class"], axis=1).columns, axes):
    uniq_vals = mushroom_df[column].unique()
    count_edible=[]
    count_poison=[]

    for j in uniq_vals:
        # Count for the edible bar
        count_edible.append(len(mushroom_df[(mushroom_df[column]==j)
                                            & (mushroom_df['class']=='EDIBLE')]))

        # Count for the poisonous bar
        count_poison.append(len(mushroom_df[(mushroom_df[column]==j)
                                            & (mushroom_df['class']=='POISONOUS')]))
    ax.bar(uniq_vals, count_edible, label='EDIBLE',color='b')
    ax.bar(uniq_vals, count_poison, label='POISONOUS', bottom=count_edible,color='r')

    ax.set_ylabel('Count')
    ax.set_title('Mushroom Edibility by Feature: '+column.upper())
    ax.legend()

fig.tight_layout()
plt.show()

From the stacked bar charts above, we can conclude that `bruises?` might also be a good indicator of edibility.
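
The per-value counts behind each chart can also be computed in a single call with `pd.crosstab` instead of nested loops. The sketch below uses a made-up frame (`toy_df` is a hypothetical stand-in for `mushroom_df`):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy_df = pd.DataFrame({
    'class': ['EDIBLE', 'EDIBLE', 'POISONOUS', 'POISONOUS', 'EDIBLE'],
    'bruises?': ['BRUISES', 'BRUISES', 'NO', 'NO', 'NO'],
})

# Rows are feature values, columns are classes, cells are counts
counts = pd.crosstab(toy_df['bruises?'], toy_df['class'])
print(counts)

# counts.plot(kind='bar', stacked=True) would draw the equivalent stacked chart
```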

### Pleasant Smell

In [None]:
good_odor = ['ALMOND','ANISE','SPICY'] # Up to interpretation I suppose
sns.set()

uniq_vals = mushroom_df['odor'].unique()
count_edible=[]
count_poison=[]
for j in uniq_vals:
    count_edible.append(len(mushroom_df[(mushroom_df['odor']==j)
                                            & (mushroom_df['class']=='EDIBLE')]))
    count_poison.append(len(mushroom_df[(mushroom_df['odor']==j)
                                            & (mushroom_df['class']=='POISONOUS')]))
fig, ax = plt.subplots(figsize=(9,6))

ax.bar(uniq_vals, count_edible, label='EDIBLE',color='b')
ax.bar(uniq_vals, count_poison, label='POISONOUS', bottom=count_edible,color='r')

ax.set_ylabel('Count')
ax.set_title('Mushroom Edibility by Feature: ODOR')
ax.legend()
plt.show()

We can see from the above plot that nearly all mushrooms whose odor is not Almond, Anise, or None are poisonous. Making a pleasant-smelling stew therefore largely means making an edible one, which will be quite handy for foragers.
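
The visual impression can be backed by a number: the share of poisonous mushrooms inside and outside the Almond/Anise/None group. The sketch below runs on made-up rows, so the printed shares are not results from the real dataset (`toy_df` and `pleasant_or_none` are illustrative names):

```python
import pandas as pd

# Hypothetical rows; the real counts come from mushroom_df
toy_df = pd.DataFrame({
    'odor':  ['ALMOND', 'ANISE', 'NONE', 'NONE', 'FOUL', 'FOUL', 'FISHY', 'NONE'],
    'class': ['EDIBLE', 'EDIBLE', 'EDIBLE', 'POISONOUS',
              'POISONOUS', 'POISONOUS', 'POISONOUS', 'EDIBLE'],
})

pleasant_or_none = ['ALMOND', 'ANISE', 'NONE']
is_poisonous = toy_df['class'].eq('POISONOUS')
is_other_odor = ~toy_df['odor'].isin(pleasant_or_none)

# The mean of a boolean Series is the fraction of True values
share_other = is_poisonous[is_other_odor].mean()
share_pleasant = is_poisonous[~is_other_odor].mean()
print("Poisonous share, other odors:", share_other)
print("Poisonous share, Almond/Anise/None:", share_pleasant)
```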

### Correlation Heatmaps

Let's look at the data another way, and observe what kind of effect different combinations of the variables have.

In [None]:
# Create cross-tabulations of different groupings of the variables
cap_xtab = pd.crosstab(mushroom_df['class'],\
                   columns=[mushroom_df['cap-shape'], mushroom_df['cap-surface'], \
                            mushroom_df['cap-color'], mushroom_df['bruises?']]) 

gill_xtab = pd.crosstab(mushroom_df['class'],\
                    columns=[mushroom_df['odor'],mushroom_df['gill-attachment'],\
                            mushroom_df['gill-spacing'], mushroom_df['gill-size'], \
                            mushroom_df['gill-color']])

stalk_xtab = pd.crosstab(mushroom_df['class'],\
                     columns=[mushroom_df['stalk-shape'],mushroom_df['stalk-root'],\
                            mushroom_df['stalk-surface-above-ring'],\
                            mushroom_df['stalk-surface-below-ring'],\
                            mushroom_df['stalk-color-above-ring'],\
                            mushroom_df['stalk-color-below-ring']])

other_xtab = pd.crosstab(mushroom_df['class'],\
                    columns=[mushroom_df['veil-type'],mushroom_df['veil-color'],\
                            mushroom_df['ring-number'],mushroom_df['spore-print-color'],
                            mushroom_df['population'], mushroom_df['habitat']])


In [None]:
# An example of what one of the cross-tabulations looks like
gill_xtab

In [None]:
fig, ax = plt.subplots(figsize=(12,4.5))
heatmap=sns.heatmap(cap_xtab)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,4.5))
heatmap=sns.heatmap(gill_xtab)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,4.5))
heatmap=sns.heatmap(stalk_xtab)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,4.5))
heatmap=sns.heatmap(other_xtab)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.show()

## Sample Decision Tree

In [None]:
dummies = pd.get_dummies(x_mushroom)
tree = DecisionTreeClassifier()
tree.fit(dummies, y_mushroom)

fig = plt.figure(figsize=(120,50))
out = plot_tree(tree,filled=True, feature_names = dummies.columns, rounded=True, proportion=True,\
                class_names = ['Edible', 'Poisonous'])
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(2)

In [None]:
featurelist=[]
for i in dummies.columns:
    featurelist.append(i)

In [None]:
print(export_text(tree, feature_names = featurelist))

## Classification Models
#### Note: where applicable, random_state=42 in our models sets a seed so that results will be reproducible

### Decision Tree Classifier

In [None]:
classification(DecisionTreeClassifier, x_mushroom, y_mushroom, random_state=42)

### C-Support Vector Classifier

In [None]:
classification(SVC, x_mushroom, y_mushroom, random_state=42)

### K Neighbors Classifier

In [None]:
classification(KNeighborsClassifier, x_mushroom, y_mushroom, n_neighbors=5)

### Logistic Regression Classifier

In [None]:
classification(LogisticRegression, x_mushroom, y_mushroom, random_state=42) 

### Gaussian Naive Bayes Classifier

In [None]:
classification(GaussianNB, x_mushroom, y_mushroom)

### Random Forest Classifier

In [None]:
# Note: random_state=42 sets a seed so the results are reproducible
classification(RandomForestClassifier, x_mushroom, y_mushroom, n_estimators=100, random_state=42)  

### Linear Discriminant Classifier

In [None]:
classification(LinearDiscriminantAnalysis, x_mushroom, y_mushroom)

### Neural Network Multi-layer Perceptron Classifier

In [None]:
classification(MLPClassifier, x_mushroom, y_mushroom, random_state=42)

## COVID version: what if we lose our sense of smell?

Odor is obviously the most powerful predictive attribute. What happens to our models if we drop that variable and we can only identify mushrooms visually?

In [None]:
# create the dataframes and appropriate variables 
x_visual = x_mushroom.drop(columns='odor')
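
The claim that one feature dominates can be checked with permutation importance before dropping it. The sketch below uses synthetic data in which `odor` fully determines the class and `cap-color` is pure noise. The data, the names `toy_x`, `toy_y`, and `importance`, and the choice of `n_repeats=10` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: odor decides the class, cap-color is unrelated noise
rng = np.random.default_rng(42)
n_rows = 200
toy_x = pd.DataFrame({
    'odor': rng.choice(['ALMOND', 'FOUL'], size=n_rows),
    'cap-color': rng.choice(['BROWN', 'WHITE', 'RED'], size=n_rows),
})
toy_y = (toy_x['odor'] == 'FOUL').astype(int)  # 1 = poisonous

model = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('classify', DecisionTreeClassifier(random_state=42)),
])
model.fit(toy_x, toy_y)

# Shuffle one column at a time and record the mean drop in accuracy
result = permutation_importance(model, toy_x, toy_y, n_repeats=10, random_state=42)
importance = pd.Series(result.importances_mean, index=toy_x.columns)
print(importance.sort_values(ascending=False))
```

A feature the model never relies on shows an importance of zero, while shuffling the dominant feature costs a large share of the accuracy.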

### Decision Tree Classifier

In [None]:
classification(DecisionTreeClassifier, x_visual, y_mushroom, random_state=42)

### C-Support Vector Classifier

In [None]:
classification(SVC, x_visual, y_mushroom, random_state=42) 

### K-Neighbors Classifier

In [None]:
classification(KNeighborsClassifier, x_visual, y_mushroom, n_neighbors=5)

### Logistic Regression Classifier

In [None]:
classification(LogisticRegression, x_visual, y_mushroom, max_iter=10000, random_state=42) 

### Gaussian Naive Bayes Classifier

In [None]:
classification(GaussianNB, x_visual, y_mushroom)

### Random Forest Classifier

In [None]:
classification(RandomForestClassifier, x_visual, y_mushroom, n_estimators=100, random_state=42)

### Linear Discriminant Classifier

In [None]:
classification(LinearDiscriminantAnalysis, x_visual, y_mushroom)

### Neural Network Multi-layer Perceptron Classifier

In [None]:
classification(MLPClassifier, x_visual, y_mushroom, random_state=42)

## Conclusions

After examining the features of this dataset, exploring them, and fitting several classification models, we can conclude that making a pleasant-smelling mushroom stew also means making one that will not result in a trip to the hospital. That is, an **'odor'** of None, Anise, or Almond is the strongest indicator of an edible mushroom. After **'odor'**, a broad **'gill-size'** and **'spore-print-color'** (not chocolate, green, or white) are the next best indicators of an edible fungus.

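Overall, every classifier performed excellently on this dataset; all were able to classify edible mushrooms with above 96% accuracy. Keep in mind that the `classification` function scores each model on the same data it was fit on, so these are training accuracies. The Decision Tree classifier could be considered the most useful due to its flowchart display, which a forager could print out to accompany them in the woods.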


If we imagine a scenario where the sense of smell is not available to us, then it is important to consider a wider range of factors. The best indicators we could look for would be a broad gill size (**'gill-size'**), a rooted stalk (**'stalk-root'**), and crowded **'gill-spacing'**.

In this case, the only classifier we would advise against is Naive Bayes, as it predicted with only 83% accuracy, not a chance many would like to take.