# Mushroom Trip

João Pedro Evangelista

September 03, 2017

## Introduction

Hello! This is an analysis aimed at developing a model on the mushroom dataset and its features. We will:
- Clean the data
- Explore the features and their relationships
- Encode the categories
- Select the best features, to reduce the amount of data needed from future input.

### Getting Started

Let's start by importing the dataset from the file and looking at its contents

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('../input/mushrooms.csv')

In [None]:
df.describe()

Looking at the describe table, the features look reasonably well distributed, but *veil-type* has the same value for count and freq, meaning it has only one unique value. A constant feature carries nothing for our model to learn from, so let's drop it.

In [None]:
df.drop('veil-type', axis=1, inplace=True)
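
Instead of reading the describe table by eye, the same check can be automated. The following is only a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not the real dataset): a column whose `nunique()` is 1 is constant and can be dropped.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'veil-type': ['p', 'p', 'p', 'p'],  # a single unique value
    'odor': ['f', 'n', 'a', 'f'],
})

# A column with exactly one distinct value is constant.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print('Constant columns:', constant_cols)

# Drop them without mutating in place, so the step is easy to re-run.
toy = toy.drop(columns=constant_cols)
```

On the real data, the same two lines would be applied to `df` right after loading it.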

In [None]:
# find missing values
# real missing values are stored as NaN (not as the string 'nan'), so use isnull()
for col in df.columns:
    if df[col].isnull().any():
        print('Missing values at', col)
    else:
        print(col, 'ok')

In [None]:
# inspect deeply the uniqueness
for col in df.columns:
    print(col, "->", ", ".join(df[col].unique()))

In [None]:
# wait, there are only letters, and a ? on stalk-root: a missing value we missed?
df[df['stalk-root'] == '?']['stalk-root'].count()

### 2480 missing values, but what is 'stalk-root' by the way ?

From [wikipedia](https://en.wikipedia.org/wiki/Stipe_%28mycology%29):

>  In mycology, a stipe (/ˈstaɪp/) is the stem or stalk-like feature supporting the cap of a mushroom.

>  The evolutionary benefit of a stipe is generally considered to be in mediating spore dispersal. An elevated mushroom will more easily release its spores into wind currents or onto passing animals. Nevertheless, many mushrooms do not have stipes

So, assuming the `?` marks mushrooms without a stalk root, this is not really a missing value but a missing category. Let's create a new category for `?` instead of treating it as a missing value

In [None]:
df['stalk-root'] = df['stalk-root'].apply(lambda x: 'no-presence' if x == '?' else x)
u = df['stalk-root'].unique()
print('stalk-root ->', ', '.join(u))

In [None]:
import missingno as mgo
mgo.bar(df)

Seems we are good to proceed. There are no missing values, nor any visible skewness in our dataset

## Analysing Features

### Remember the problem: classify whether the mushroom is safe to eat or not.

🤔 One way to think about it is to bring in a human expert. Since we want our model to predict as well as a human expert, how would the expert solve the classification, and what features would be useful for them?
Thinking that way, we will search for the most important features and how they influence the resulting class.

### Preparing the field

Since all of our data is categorical, we need to turn it into numbers somehow, because most of the visualization tools we will use, and the model itself, do not know how to treat strings.
We will use the `sklearn.preprocessing` module to do the job on a copy of the original dataset.

In [None]:
from sklearn.preprocessing import LabelEncoder

def encode_features(df, encoder=LabelEncoder):
    """Encodes the given df features in place using an encoder.
    Returns a list of (column name, fitted encoder) tuples, to be used later
    for the inverse transformation, and the transformed dataframe.
    """
    acc = []
    for name in df.columns:
        # fit one encoder per column so each one can be inverted on its own
        fitted = encoder().fit(df[name].values)
        df[name] = fitted.transform(df[name].values)
        acc.append((name, fitted))
    return acc, df

def get_encoder(qname, encoders):
    "Search for the encoder of given column name, returning it when found, otherwise None."
    for name, encoder in encoders:
        if qname == name:
            return encoder
    return None

In [None]:
encoders, edf = encode_features(df.copy())

In [None]:
# check the encoding
class_encoder = get_encoder('class', encoders)
encoded_values_of_class = edf['class'].values
original_values_of_class =  class_encoder.inverse_transform(encoded_values_of_class)

pd.DataFrame({'Encoded': encoded_values_of_class, 'Original': original_values_of_class}).head()

🍄 Seems our encoder has given *p* (assumed to mean *poisonous*) the positive label, while our problem was to give the positive class to the ones that are safe to eat, i.e. the ones named *e*.

Refitting the encoder would make it harder to keep things simple when it is used in a pipeline, so instead we will change our target classification.

Before, our problem was defined as:

$$y = \begin{cases}1 & \text{edible}\\0 & \text{poisonous}\end{cases}$$

But since the encoder is a bit harsh on us, we will invert the question to be: *Is this mushroom poisonous?*

$$y = \begin{cases}1 & \text{poisonous}\\0 & \text{edible}\end{cases}$$
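
The reason *p* ends up as 1 is that `LabelEncoder` sorts the labels before numbering them, so *e* comes before *p*. A minimal sketch on a made-up list of labels (the `labels` values are an assumption for illustration) shows the mapping, plus one possible way to get the original "edible = 1" target without touching the encoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(['p', 'e', 'e', 'p'])  # toy labels, not the real column

# classes_ is sorted, and the position of each class is its code.
enc = LabelEncoder().fit(labels)
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)

# Alternative sketch: build the target by hand if edible must be the positive class.
is_edible = (labels == 'e').astype(int)
print(is_edible.tolist())
```

We keep the encoder's mapping in this notebook, so from here on 1 means poisonous.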



### It is time to find the features that influence the most the resulting classification

Let's start with the correlation approach, then decide if we need to move on to a more ML-ish approach

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set_palette('Set2')

In [None]:
corr = edf.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24, use the builtin bool
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(20, 15))
ax = sns.heatmap(corr, annot=True, linecolor='w', linewidths=0.2, fmt='.2f', mask=mask)
ax.tick_params(axis='both', which='major', labelsize=14)
plt.show()

Looking at the correlation matrix, we see a strong correlation ($|r| \geq 0.5$) between class and the following features:

 - bruises
 - gill-color
 - gill-size
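
The same list can be pulled out of the correlation matrix without squinting at the heatmap. This is a minimal sketch on a made-up encoded frame (the `toy` values are assumptions): take the `class` column of the matrix, drop the self-correlation, and rank by absolute value. Keep in mind that label-encoded codes are arbitrary integers, so these correlations depend on the order the encoder assigned and should be read as a rough hint only.

```python
import pandas as pd

# Toy encoded frame (made-up values): 'bruises' mirrors the class, 'cap-shape' does not.
toy = pd.DataFrame({
    'class':     [1, 0, 0, 1, 1, 0],
    'bruises':   [0, 1, 1, 0, 0, 1],
    'cap-shape': [2, 0, 1, 0, 1, 2],
})

# Correlation of every feature with the target, ranked by strength.
ranked = toy.corr()['class'].drop('class').abs().sort_values(ascending=False)
strong = ranked[ranked >= 0.5].index.tolist()
print(ranked)
print('Strongly correlated with class:', strong)
```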
 
Now we will train a classifier to see which features it gives the most importance to

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score

### Why Random Forest ?

A Random Forest seems to be a good choice, basically because we are dealing with decisions over categories, which suits tree-based splits. It also has advantages over a single Decision Tree, such as the **usage of a random subset of features at each split**, reduced variance, and it is fast to train!

In [None]:
rfclf = RandomForestClassifier()
X = edf.drop('class', axis=1)
y = edf['class'].values

X_train, X_test, y_train, y_test = train_test_split(X, y)

rfclf.fit(X_train, y_train)
pred = rfclf.predict(X_test)
print('Prediction Accuracy:',accuracy_score(y_test, pred) * 100, '%')
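
A single random split can be lucky, so one way to double check a score like this is cross-validation with `cross_validate`. The sketch below runs on synthetic data from `make_classification` (the `X_toy`/`y_toy` names, sizes and seeds are assumptions for illustration), not on the mushroom data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; on the notebook's data this would be X and y.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# 5-fold cross-validation: every sample is used for testing exactly once.
cv = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_toy, y_toy, cv=5, scoring='accuracy',
)
scores = cv['test_score']
print('Accuracy per fold:', scores)
print('Mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

A low standard deviation across folds is a hint that the score does not depend much on which rows landed in the test set.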

### Yeaaah!! No one dies by mushroom!
(I hope)

In [None]:
importance = rfclf.feature_importances_
feats = edf.drop('class', axis=1).columns
importance_df = pd.DataFrame({'Features': feats, 'Importance': importance})

In [None]:
plt.figure(figsize=(20,10))
plt.title('Feature Importance with RandomForestClassifier', fontsize=16)
ax = sns.barplot(data=importance_df, x='Features', y='Importance')
ax.tick_params(axis='both', which='major', labelsize=14)
plt.xticks(rotation=90)
plt.xlabel('Features', fontsize=15)
plt.ylabel('mean(Importance)', fontsize=15)
plt.show()

Even with the randomness of the RandomForestClassifier, the features that showed up most often across all the runs I did were:
- odor
- gill-size
- gill-color
- bruises
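
Rather than re-running the cell by hand and remembering which features keep showing up, the importances can be averaged over several seeds. This is a minimal sketch on synthetic data (the `f0`..`f5` feature names, sizes and the number of seeds are all assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = ['f{}'.format(i) for i in range(X_toy.shape[1])]

# Fit one forest per seed and collect its importances.
runs = []
for seed in range(10):
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_toy, y_toy)
    runs.append(clf.feature_importances_)

# Average across seeds so one lucky run does not decide the ranking.
mean_importance = pd.Series(np.mean(runs, axis=0), index=names)
mean_importance = mean_importance.sort_values(ascending=False)
print(mean_importance)
```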

So let's explore more on how they are distributed

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(20, 15))
fig.suptitle('Recurrent features with most importance', fontsize=20)

# same layout as before: gill-size, gill-color on top, odor, bruises below
features = ['gill-size', 'gill-color', 'odor', 'bruises']
for feat, feat_ax in zip(features, axs.ravel()):
    ax = sns.countplot(x=feat, data=edf, ax=feat_ax, hue='class')
    # hue levels follow the encoded class: 0 = edible, 1 = poisonous
    ax.legend(['edible', 'poisonous'], loc='best')
    # bars are ordered by encoded value, which is the order of encoder.classes_
    # (unique() returns order of appearance and would mislabel the bars)
    ax.set_xticklabels(get_encoder(feat, encoders).classes_)
plt.show()

As we can see, the class counts within each category are easy to tell apart, so each of these features carries real weight for the output prediction.

Now let's make sklearn select the features, for demonstration and to check that our hypothesis is on the right track.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

selection = SelectKBest(f_classif, k=4)
kbest_X = selection.fit_transform(X, y)
# source: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
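mask = selection.get_support()
# read the names from X (the feature matrix): edf still holds 'class' as its
# first column, so zipping the mask against edf.columns shifts every name by one
kbest_X_acc = X.columns[mask].tolist()


pd.DataFrame(kbest_X, columns=kbest_X_acc)

Damn, looks like sklearn does not fully agree with our hypothesis: `f_classif` scores each feature on its own, and its top four are not the same four we picked. Let's compare both selections with the RF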
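
To see why the selector picks what it picks, the per-feature scores can be read by name from the fitted `SelectKBest`. A minimal sketch on a made-up frame (the columns `a`, `b`, `c` and the target are assumptions for illustration); note that the names come from the same frame that was passed to `fit`, so the mask and the columns line up:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy features and target (made-up values).
X_toy = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 0],
    'b': [1, 0, 1, 0, 1, 0],
    'c': [2, 1, 0, 2, 1, 0],
})
y_toy = [0, 0, 1, 1, 0, 1]

selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)

# get_support() is a boolean mask aligned with the columns given to fit().
selected = X_toy.columns[selector.get_support()].tolist()

# scores_ holds the ANOVA F-value of each feature, in the same column order.
scores = pd.Series(selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))
print('Selected:', selected)
```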

In [None]:
# with KBest
kb_clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(kbest_X, y)
kb_clf.fit(X_train, y_train)
pred = kb_clf.predict(X_test)
print('Accuracy with KBest Features on RandomForest: {:.2f}%'.format(accuracy_score(y_test, pred) *100))

In [None]:
# with our selected features
hclf = RandomForestClassifier()
hX = edf[['gill-size', 'gill-color', 'odor', 'bruises']].values
X_train, X_test, y_train, y_test = train_test_split(hX, y)
hclf.fit(X_train, y_train)
pred = hclf.predict(X_test)
print('Accuracy with Our Hypothesis Features on RandomForest: {:.2f}%'.format(accuracy_score(y_test, pred) *100))

🤘 AHA!
Looks like our classifier fitted only with the features we selected performs better than the one fitted with the features sklearn selected. Even if this seems biased, because we used a RandomForest to select the features in the first place, our features also perform better on other algorithms.

In [None]:
from sklearn.linear_model import LogisticRegression
# with KBest
kb_clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(kbest_X, y)
kb_clf.fit(X_train, y_train)
pred = kb_clf.predict(X_test)
print('Accuracy with KBest Features on Logistic Regression: {:.2f}%'.format(accuracy_score(y_test, pred) *100))

# with our selected features
hclf = LogisticRegression()
hX = edf[['gill-size', 'gill-color', 'odor', 'bruises']].values
X_train, X_test, y_train, y_test = train_test_split(hX, y)
hclf.fit(X_train, y_train)
pred = hclf.predict(X_test)
print('Accuracy with Our Hypothesis Features on Logistic Regression: {:.2f}%'.format(accuracy_score(y_test, pred) *100))
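
One caveat on comparisons like the ones above: each call to `train_test_split` draws a new random split, so the two feature sets are scored on different test rows. A fairer setup reuses one fixed split for both. The sketch below shows the idea on synthetic data (the data, the column subsets `subset_a` and `subset_b`, and the seeds are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and two hypothetical feature subsets to compare.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
subsets = {'subset_a': [0, 1, 2, 3], 'subset_b': [4, 5, 6, 7]}

# Split the row indices once, so both subsets share the same train/test rows.
idx = np.arange(len(y_toy))
train_idx, test_idx = train_test_split(
    idx, test_size=0.25, random_state=42, stratify=y_toy
)

results = {}
for name, cols in subsets.items():
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_toy[train_idx][:, cols], y_toy[train_idx])
    pred = clf.predict(X_toy[test_idx][:, cols])
    results[name] = accuracy_score(y_toy[test_idx], pred)

for name, acc in results.items():
    print('{}: {:.2f}%'.format(name, acc * 100))
```

With the split and the model seed fixed, any difference in accuracy comes from the features and not from the luck of the draw.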

## Conclusion

Here are some points we can learn from this:

- Not all missing data is useless; sometimes we need to understand why it is missing, and that can improve our hypothesis.
- Selecting the best features can also be done by hand, as long as the data size allows it and you have an expert to help you, even though I did not count on one to select the features here.
- Do not forget to explore the data and the correlations; they can give you more insight into your problem.

-------------------------------------
Thanks!