## Mushroom Classification
> Safe to eat or deadly poison?

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

    LEARNING MODEL: Random Forest

#### Cloning the repository to download the dataset

In [1]:
!git clone https://github.com/AbhayVAshokan/ML-Assignment-2.git

fatal: destination path 'ML-Assignment-2' already exists and is not an empty directory.


#### Importing libraries

In [2]:
import pandas as pd
import numpy as np

#### Reading train and test data

In [3]:
train = pd.read_csv('ML-Assignment-2/train/train.csv')
test = pd.read_csv('ML-Assignment-2/test/test.csv')

#### Displaying first 5 rows of the training set

In [4]:
train.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,e,f,f,n,t,n,f,c,b,p,t,b,s,s,g,p,p,w,o,p,k,y,d
1,p,k,s,e,f,f,f,c,n,b,t,?,s,k,w,p,p,w,o,e,w,v,d
2,e,x,y,u,f,n,f,c,n,h,e,?,s,f,w,w,p,w,o,f,h,v,d
3,e,x,y,g,t,n,f,c,b,w,t,b,s,s,p,p,p,w,o,p,n,y,d
4,p,f,y,n,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,g


In [5]:
train.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499,6499
unique,2,6,4,10,2,9,2,2,2,12,2,5,4,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,t,b,s,s,w,w,p,w,o,p,w,v,d
freq,3386,2906,2604,1838,3767,2830,6325,5463,4508,1361,3662,3032,4149,3961,3554,3521,6499,6333,5984,3201,1890,3228,2535
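
The summary above shows `veil-type` with a single unique value across all 6499 rows, so that column cannot help separate the classes. A minimal sketch of how such constant columns could be detected; the frame `toy` is a hypothetical stand-in, not the real data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],  # constant, like veil-type in the summary above
    "odor": ["n", "f", "n", "p"],
})

# Columns with exactly one distinct value carry no information for a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

On the real data, the same check could be run on `train` instead of `toy`.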


#### Displaying first 5 rows of test set

In [6]:
test.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,e,x,y,w,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,k,s,g
1,p,x,y,g,f,f,f,c,b,p,e,b,k,k,p,p,p,w,o,l,h,y,g
2,e,x,f,e,t,n,f,c,b,w,t,b,s,s,g,g,p,w,o,p,n,y,d
3,e,b,s,w,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,k,s,g
4,e,k,s,g,f,n,f,w,b,p,e,?,s,s,w,w,p,w,t,p,w,s,g


#### Preprocessing steps

<ol>
    <li>Label Encoder: converts each categorical value into an integer code</li>
    <li>Standard Scaler: standardizes each feature to zero mean and unit variance</li>
</ol>
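
The two steps listed above can be illustrated on a handful of values. This is a small sketch on made-up labels, not the notebook's data:

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# LabelEncoder assigns integer codes in sorted order of the distinct values.
encoder = LabelEncoder()
codes = encoder.fit_transform(["e", "p", "e", "p", "e"])
print(encoder.classes_)  # distinct labels, sorted
print(codes)             # integer code for each input label

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(codes.reshape(-1, 1).astype(float))
print(scaled.mean(), scaled.std())
```

Because the codes follow sorted order, `e` maps to 0 and `p` maps to 1.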

In [7]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

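def preprocessing(data):
    # Encode every column (label and features) as integer codes.
    # Note: this modifies `data` in place, and the encoder is refit for each column.
    labelencoder = LabelEncoder()
    for col in data.columns:
        data[col] = labelencoder.fit_transform(data[col])

    X = data.iloc[:, 1:23]  # all rows, the 22 feature columns (no label)
    y = data.iloc[:, 0]     # all rows, label column only

    # Standardize each feature to zero mean and unit variance.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    return X, y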

In [8]:
X_train, y_train = preprocessing(train)
X_test, y_test = preprocessing(test) 
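
The `preprocessing` function above calls `fit_transform` on whichever frame it receives, so the encoder and scaler are fitted separately on the training and test data. If a column in one file lacks a category that the other contains, the same letter can receive different codes. One common alternative is to fit on the training data only and reuse the fitted objects on the test data. A hedged sketch, assuming scikit-learn's `OrdinalEncoder` as a substitute for the per-column `LabelEncoder`; the frames `toy_train` and `toy_test` are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy frames; the real notebook would use the feature columns of train and test.
toy_train = pd.DataFrame({"odor": ["n", "f", "p", "n"], "bruises": ["t", "f", "t", "f"]})
toy_test = pd.DataFrame({"odor": ["p", "n"], "bruises": ["f", "t"]})

# Fit on the training frame only; categories unseen at fit time map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler = StandardScaler()

X_toy_train = scaler.fit_transform(encoder.fit_transform(toy_train))
# Reuse the fitted encoder and scaler so each letter keeps the same code.
X_toy_test = scaler.transform(encoder.transform(toy_test))
print(X_toy_test)
```

Here the second test row has the same raw values as the first training row, and it ends up with the same encoded, scaled values.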

#### Random Forest Classifier

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
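
The output above shows `random_state=None`, which means each run can build a different forest. A short sketch of fixing the seed for reproducibility, on synthetic data from `make_classification` (an assumption for illustration, not the mushroom dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests built with the same seed make identical predictions.
forest_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)
same = bool((forest_a.predict(X_demo) == forest_b.predict(X_demo)).all())
print(same)
```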

#### Displaying mean accuracy of 50-fold cross-validation on the test set

In [10]:
from sklearn.model_selection import cross_val_score

cross_val_score(model, X_test, y_test, cv = 50).mean()

1.0
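
The cell above cross-validates on the test split, which refits clones of the model on folds of the test data. A common alternative is to cross-validate on the training split and keep the test split for one final evaluation. A sketch of that workflow on synthetic data; `make_classification`, the 80/20 split, and 5 folds are all assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validate on the training split only; each fold fits a fresh clone of clf.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)
print(cv_scores.mean())

# Fit once on all training data, then score once on the held-out split.
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```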

#### Predicting the outputs of the test set

In [11]:
y_pred = model.predict(X_test)

#### Displaying the confusion matrix of the predicted values

In [12]:
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
cm

array([[822,   0],
       [  0, 803]])

#### Displaying the precision score of the predicted values 

In [13]:
from sklearn.metrics import precision_score

precision_score(y_test, y_pred)

1.0
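
The precision reported above can be recovered by hand from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels. Since `LabelEncoder` assigns codes in sorted order, `e` becomes 0 and `p` becomes 1, so class 1 (poisonous) is the positive class that `precision_score` uses by default. A short sketch using the matrix values from the output above:

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predicted labels.
cm = np.array([[822, 0],
               [0, 803]])

# For labels 0 and 1, ravel() yields true negatives, false positives,
# false negatives, true positives, in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of samples predicted as class 1, fraction truly class 1
recall = tp / (tp + fn)     # of samples truly class 1, fraction predicted as class 1
print(precision, recall)    # 1.0 1.0
```

Recall is included because a false negative here would mean a poisonous mushroom predicted as edible.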