# Mushroom Classification
This notebook contains my analysis of the mushroom classification dataset posted on Kaggle: https://www.kaggle.com/uciml/mushroom-classification

From the website:

## Context

Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

## Content

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Time period: Donated to UCI ML 27 April 1987

## Inspiration

- What types of machine learning models perform best on this dataset?
- Which features are most indicative of a poisonous mushroom?

## Acknowledgements

This dataset was originally donated to the UCI Machine Learning repository. You can learn more about past research using the data here.

In [64]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# First let's start by reading in the data
mushroom_data = pd.read_csv('mushrooms.csv')

In [48]:
# Take a look at the data
# (a notebook cell displays only its last expression, so just describe() is shown below)
mushroom_data.head()
mushroom_data.describe()

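| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |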


In [90]:
# Encode the categorical columns as numeric codes
# First store the target data
y = pd.Categorical(mushroom_data['class'])
y = y.codes
print(y.shape)

(8124,)
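
The integer codes produced by `pd.Categorical` follow the sorted order of the category labels, so it is worth checking which code stands for which class before interpreting predictions. A minimal sketch, using a small made-up label series (`labels` is a hypothetical stand-in for `mushroom_data['class']`, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for mushroom_data['class'] (e = edible, p = poisonous)
labels = pd.Series(['p', 'e', 'e', 'p', 'e'])

cat = pd.Categorical(labels)

# Categories are sorted, so each code is a position in cat.categories
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
print(cat.codes)
```

Under this encoding, the label that sorts first receives code 0, so with labels `e` and `p` the poisonous class would be code 1.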


In [91]:
# Now for the X data
data = mushroom_data.to_numpy()
# Drop the first column ('class'), which is the target
data = data[:, 1:]
X = []
# Run through each column and convert the categories to numeric codes
for i in range(data.shape[1]):
    X.append(pd.Categorical(data[:, i]).codes)

# Convert X to a NumPy array of shape (n_samples, n_features)
X = np.array(X)
X = np.transpose(X)
print(X)

[[5 2 4 ..., 2 3 5]
 [5 2 9 ..., 3 2 1]
 [0 2 8 ..., 3 2 3]
 ..., 
 [2 2 4 ..., 0 1 2]
 [3 3 4 ..., 7 4 2]
 [5 2 4 ..., 4 1 2]]
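
The same column-by-column encoding can also be done directly on the DataFrame, which keeps the column names attached until the final conversion. A minimal sketch on a tiny made-up table (`toy` and its values are hypothetical, chosen only to show the mechanics):

```python
import pandas as pd

# Hypothetical miniature of the mushroom table, for illustration only
toy = pd.DataFrame({
    'class': ['p', 'e', 'e'],
    'cap-shape': ['x', 'b', 'x'],
    'odor': ['p', 'a', 'l'],
})

# Separate the features from the target column
features = toy.drop(columns='class')

# Encode each column independently; codes follow sorted label order
encoded = features.apply(lambda col: col.astype('category').cat.codes)
X_toy = encoded.to_numpy()
print(X_toy)
```

These codes are arbitrary integers, not meaningful magnitudes. Tree-based models usually cope with that, but for models that are sensitive to ordering, one-hot encoding (for example with `pd.get_dummies`) is a common alternative.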


In [92]:
# Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
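
The split above is random, so results can change from run to run. A minimal sketch of a reproducible, class-balanced variant, assuming synthetic stand-in data (`X_demo`, `y_demo`, and the seed values are illustrative choices, not part of the original analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 integer-coded features
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 5, size=(100, 4))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps class proportions
# roughly equal in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_te.mean())
```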

In [94]:
# Try a random forest classifier with the default settings
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
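
A fitted random forest exposes `feature_importances_`, which is one way to approach the question of which features are most indicative of a poisonous mushroom. A minimal sketch on synthetic data, where the label is constructed to depend on a single column (the data and the feature names in `names` are hypothetical and say nothing about the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three integer-coded features, label depends only on column 1
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 3, size=(300, 3))
y_demo = (X_demo[:, 1] > 0).astype(int)
names = ['cap-shape', 'odor', 'habitat']  # hypothetical column names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each importance with its name and sort from most to least important
ranked = sorted(zip(forest.feature_importances_, names), reverse=True)
for importance, name in ranked:
    print(f'{name}: {importance:.3f}')
```

For the notebook's own model, the same loop would use `clf.feature_importances_` together with the feature column names from `mushroom_data.columns[1:]`.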

In [98]:
# Evaluate the accuracy on the held-out test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, clf.predict(X_test)))

1.0
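
A perfect score on a single random split is worth double-checking against several splits. A minimal sketch using k-fold cross-validation on synthetic data (`X_demo`, `y_demo`, and the estimator settings are assumptions for illustration; the scores it prints are not results for the mushroom dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: label is a deterministic function of column 0
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(300, 4))
y_demo = (X_demo[:, 0] >= 2).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# Fit and score the model on 5 different train/test partitions
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

Applied to the notebook's `X` and `y`, a high mean with a small standard deviation would suggest the result does not depend on one lucky split.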


(2681,)