# Data in a day

## Import libraries

In [None]:
# numpy is for maths (http://www.numpy.org/)
import numpy as np

# 🐼 is to work with tables of data (http://pandas.pydata.org/)
import pandas as pd

# sklearn is for machine learning (http://scikit-learn.org)
from sklearn import tree

from sklearn.model_selection import train_test_split

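# matplotlib makes plots; pandas and seaborn use it under the hood.
# Display plots in this page rather than opening another window.
%matplotlib inline

# seaborn is for statistical plots (https://seaborn.pydata.org/)
import seaborn as sns

# preprocessing provides encoders that turn labels into numbers
from sklearn import preprocessing

# graphviz renders the decision tree diagrams
import graphviz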

## Source the data

In [None]:
df = pd.read_csv('../data-sets/mushrooms.csv')

## Explore and transform the data

In [None]:
df.head()

In [None]:
df.describe()

Let's try and visualise this data with the help of https://www.kaggle.com/surajit346/ml-models-and-visualizations-for-beginners

In [None]:
sns.countplot(x='odor',hue='class',data=df)

### Dealing with categories

Text data needs to be transformed further before we can use machine learning, i.e. we need to turn it into numbers.

Let's start with the `class` column (the one we want to predict).
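
Later cells use two names, `le` and `classes`, that no cell shown here defines. One plausible way to create them, assuming they come from `LabelEncoder` in the `preprocessing` module imported above, is sketched below. The short `labels` series is a hypothetical stand-in so the example runs on its own; in the notebook you would pass `df['class']` instead.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for df['class'] so this sketch runs on its own.
labels = pd.Series(['p', 'e', 'e', 'p', 'e'], name='class')

# LabelEncoder maps each distinct text label to an integer code.
le = preprocessing.LabelEncoder()
classes = le.fit_transform(labels)

print(le.classes_)  # the distinct text labels, in sorted order
print(classes)      # one integer code per row
```

The encoder keeps the original labels in `le.classes_`, which is why that attribute can be passed as `class_names` when drawing the tree.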

For the features, e.g. odor, we need to go a bit further and treat each type (e.g. odor=n, odor=a) as its own feature.

This is called **one-hot** encoding.

In [None]:
one_hot = pd.get_dummies(df.drop(['class'],axis=1))
one_hot.head()
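
To see what `pd.get_dummies` does on a small scale, here is a minimal sketch on a hypothetical three-row table (the `toy` frame and its values are made up for illustration and are not taken from the mushroom data).

```python
import pandas as pd

# Hypothetical three-row table with two categorical columns.
toy = pd.DataFrame({'odor': ['n', 'a', 'n'],
                    'cap-color': ['w', 'w', 'y']})

toy_one_hot = pd.get_dummies(toy)

# One new column per (column, value) pair, e.g. odor_a and odor_n.
print(list(toy_one_hot.columns))
print(toy_one_hot.astype(int))
```

Each row ends up with exactly one "hot" entry per original column, which is where the name comes from.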

## Building a model

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
model = tree.DecisionTreeClassifier()
# Pass the target as a 1-D Series rather than a one-column DataFrame
model.fit(one_hot, df['class'])

## Evaluate your model

In [None]:
dot_data = tree.export_graphviz(model, 
                                out_file=None, 
                                feature_names=list(one_hot.columns),
                                filled=True, 
                                rounded=True,  
                                class_names=["e","p"],
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph 

In [None]:
dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=list(one_hot.columns),
                                class_names=list(le.classes_),
                                filled=True, 
                                rounded=True, 
                                special_characters=True
                               ) 
graph = graphviz.Source(dot_data)
graph

It is unfortunate that we can't smell a picture! Can we predict without the smell?

In [None]:
one_hot = pd.get_dummies(df.drop(['class','odor'],axis=1))
model = tree.DecisionTreeClassifier(max_leaf_nodes=3)
model.fit(one_hot,classes)
dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=list(one_hot.columns),
                                class_names=list(le.classes_),
                                filled=True,
                                rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph

Quantitatively, how good are the predictions? Let's look at the proportion of predictions we got right.


In [None]:
model.score(one_hot,classes)
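
For a classifier, `score` reports accuracy: the fraction of rows where the prediction matches the true label. The sketch below computes that fraction by hand next to `score`, using a tiny made-up dataset (`X` and `y` here are hypothetical stand-ins, not the mushroom data).

```python
import numpy as np
from sklearn import tree

# Tiny stand-in data (hypothetical): two 0/1 features and a binary label.
X = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Accuracy by hand: compare predictions with the truth and average.
manual_accuracy = (model.predict(X) == y).mean()

print(manual_accuracy, model.score(X, y))
```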

### Avoiding overfitting

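Scoring the model on the same rows it was trained on overstates how well it will do on new mushrooms, so we hold some data back. Split the data into a test set (30%) and a training set (70%).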

In [None]:
X_train, X_test, y_train, y_test = train_test_split(one_hot, classes, test_size=0.3, random_state=0)
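
The split only helps if the model is fitted on the training rows and scored on the held-out rows. Here is a minimal sketch of that pattern, using synthetic stand-in data from `make_classification` (an assumption, so the example runs without the mushroom CSV). In the notebook the same calls would take `X_train`, `X_test`, `y_train` and `y_test` from the cell above.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the mushroom set): 200 rows, numeric features.
X, y = make_classification(n_samples=200, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit on the training rows only.
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score, test_score)
```

Limiting the tree's size, for example with the `max_leaf_nodes` argument used earlier, is one way to narrow that gap.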

In [None]:
df.loc[df['ring-number'] == 'n', 'stalk-surface-above-ring']