# Data in a day

## Import libraries

In [None]:
# 🐼 is to work with tables of data (http://pandas.pydata.org/)
import pandas as pd

# sklearn is for machine learning (http://scikit-learn.org)
from sklearn import tree

from sklearn.model_selection import train_test_split

# matplotlib makes the plots; pandas uses it under the hood
# Display plots in this page rather than open another page
%matplotlib inline

import seaborn as sns

import graphviz 

from sklearn.model_selection import cross_val_score, GridSearchCV

## Source the data

In [None]:
df = pd.read_csv('../data-sets/mushrooms.csv')

## Explore and transform the data

In [None]:
df.head()

In [None]:
df.describe()

Let's try to visualise this data, with help from https://www.kaggle.com/surajit346/ml-models-and-visualizations-for-beginners

In [None]:
sns.countplot(x='odor',hue='class',data=df)
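The counts behind a plot like this can also be read as a table. Here is a minimal sketch using `pd.crosstab` on a hypothetical toy frame (the rows are invented for illustration and are not taken from `mushrooms.csv`):

```python
import pandas as pd

# Hypothetical toy rows, not taken from mushrooms.csv
toy = pd.DataFrame({
    'odor':  ['n', 'n', 'a', 'f', 'f', 'n'],
    'class': ['e', 'e', 'e', 'p', 'p', 'p'],
})

# One row per odor, one column per class, each cell is a count of rows
counts = pd.crosstab(toy['odor'], toy['class'])
print(counts)
```

On the real data the equivalent call would presumably be `pd.crosstab(df['odor'], df['class'])`.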

### Dealing with categories

Text data needs to be transformed further before we can use it for machine learning, i.e. we need to turn it into numbers.

Take odor, for example: we treat each value (e.g. odor=n, odor=a) as its own yes/no feature, which avoids creating an ordering in the data where none exists.

This is called **one-hot** encoding.

In [None]:
one_hot = pd.get_dummies(df)
one_hot.head()
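To make the idea concrete, here is a small sketch on a hypothetical single-column frame (invented values, not the mushroom data). Each distinct value becomes its own column, and every row has exactly one column switched on. A single numeric code such as a=0, f=1, n=2 would instead suggest that a < f < n, which is the false ordering we want to avoid.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({'odor': ['n', 'a', 'n', 'f']})

# One new column per distinct value: odor_a, odor_f, odor_n
encoded = pd.get_dummies(toy)
print(encoded)

# Each row is "hot" in exactly one of the new columns
print(encoded.sum(axis=1))
```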

## Building a model

We don't want to include the thing we want to predict in the input data, so let's drop it. Let's also put the classes into their own variable for convenience.

In [None]:
one_hot = one_hot.drop(['class_e','class_p'],axis=1)
classes = df['class']

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
model = tree.DecisionTreeClassifier()
model.fit(one_hot,classes)

## Evaluate your model

Decision trees are great because you can visualise them

In [None]:
# http://scikit-learn.org/stable/modules/tree.html#classification
dot_data = tree.export_graphviz(model, 
                                out_file=None, 
                                feature_names=one_hot.columns,
                                filled=True, 
                                rounded=True,  
                                class_names=["e","p"],
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph 

It is unfortunate that we can't smell a picture! Can we predict without the smell?

In [None]:
one_hot = pd.get_dummies(df.drop(['class','odor'],axis=1))
model = tree.DecisionTreeClassifier()
model.fit(one_hot,classes)
dot_data = tree.export_graphviz(model, 
                                out_file=None, 
                                feature_names=one_hot.columns,
                                filled=True, 
                                rounded=True,  
                                class_names=["e","p"],
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph 

Quantitatively, how good are the predictions? Let's look at the proportion of predictions we got right.


In [None]:
model.score(one_hot,classes)

We got everything correct, amazing! Well, actually not really. By default, scikit-learn's decision trees keep growing leaves until they fit the training data perfectly, and we scored the model on the same data we trained it on. This can lead to overfitting.
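A rough way to see why a perfect score on the training data proves little is to fit a tree to pure noise. The sketch below uses made-up random features and random labels (an assumption for illustration; it says nothing about the mushroom data itself) and holds back a quarter of the rows with `train_test_split`:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic noise: the labels have no relationship to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

# Hold back 25% of the rows that the tree never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = tree.DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# The unlimited tree memorises the training rows...
train_accuracy = model.score(X_train, y_train)
# ...but that memory does not carry over to unseen rows
test_accuracy = model.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

The training accuracy is perfect even though there is nothing to learn, so only the held-out accuracy tells us how the model might do on new data.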

### Avoiding overfitting

To avoid overfitting we need to prune our tree at some depth. But what depth to choose?
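As a sketch of what a depth limit does, the example below fits two trees to the same synthetic noise (random data, assumed purely for illustration): one unlimited and one with `max_depth=3`. The value 3 is an arbitrary choice here, not a recommendation.

```python
import numpy as np
from sklearn import tree

# Synthetic noise, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)

# No limit: the tree grows until every leaf is pure
deep = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
# Limited: at most 3 levels of splits, so at most 2**3 = 8 leaves
shallow = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(deep.get_depth(), deep.get_n_leaves())
print(shallow.get_depth(), shallow.get_n_leaves())
```

The limited tree has far fewer leaves, so it cannot memorise individual rows.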

Split the data into 10 pieces: 9 are used to train our model and 1 is used to test it, with each piece taking a turn as the test set. Let's see how well the 10 models do when we don't limit the tree depth.

In [None]:
# http://scikit-learn.org/stable/modules/cross_validation.html
train_test = cross_val_score(model, one_hot, classes, cv=10)
print(train_test)
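To show what `cross_val_score` is doing, here is a sketch that writes the same loop by hand on synthetic data from `make_classification` (an assumption for illustration, not the mushroom data). For a classifier with an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn import tree
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = tree.DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    # Fresh, unfitted copy of the model for every fold
    fold_model = clone(model)
    # Train on 9 pieces...
    fold_model.fit(X[train_idx], y[train_idx])
    # ...and score on the 1 piece that was held out
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The one-liner should give the same 10 numbers
auto_scores = cross_val_score(model, X, y, cv=10)
print(np.round(manual_scores, 2))
print(np.round(auto_scores, 2))
```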

Averaging to get an overall score when we don't limit the depth of the tree

In [None]:
train_test.mean()

We can easily check lots of tree depths automatically

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

parameters = {'max_depth':range(3,20)}
modelSearch = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4, cv=3)
modelSearch.fit(one_hot, classes)
tree_model = modelSearch.best_estimator_
print(modelSearch.best_score_, modelSearch.best_params_)

one_hot = pd.get_dummies(df.drop(['class','odor'],axis=1))
model2 = tree.DecisionTreeClassifier(max_depth=modelSearch.best_params_["max_depth"])
model2.fit(one_hot,classes)
dot_data = tree.export_graphviz(model2, 
                                out_file=None, 
                                feature_names=one_hot.columns,
                                filled=True, 
                                rounded=True,  
                                class_names=["e","p"],
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph 
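The search only prints its best result, but the score for every depth it tried is kept in `cv_results_`. The sketch below, again on synthetic data from `make_classification` (an assumption for illustration), puts those results into a table so the depths can be compared side by side:

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      {'max_depth': range(3, 20)}, cv=3)
search.fit(X, y)

# One row per depth tried, with the mean and spread of the 3 fold scores
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score']]
print(results)

# The reported best score is the highest mean in that table
print(search.best_params_, search.best_score_)
```

If several depths score almost the same, the shallower tree is usually the easier one to read and explain.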