# Data in a day

## Import libraries

In [None]:
# numpy is for maths (http://www.numpy.org/)
import numpy as np

# 🐼 is to work with tables of data (http://pandas.pydata.org/)
import pandas as pd

# sklearn is for machine learning (http://scikit-learn.org)
from sklearn import tree

# matplotlib makes the plots; pandas and seaborn use it under the hood,
# so there is no need to import matplotlib.pyplot directly
# Display plots in this page rather than opening another page
%matplotlib inline

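# seaborn is for statistical plots, built on matplotlib
import seaborn as sns

# preprocessing turns text labels into numbers
from sklearn import preprocessing

# graphviz draws the decision tree
import graphviz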

## Source the data

In [None]:
df = pd.read_csv('../data-sets/mushrooms.csv')

## Explore and transform the data

In [None]:
df.head()

In [None]:
df.describe()

Let's try and visualise this data with the help of https://www.kaggle.com/surajit346/ml-models-and-visualizations-for-beginners

In [None]:
# Count the mushrooms of each gill colour, split by class
sns.countplot(x='gill-color', hue='class', data=df)

Text data needs to be transformed further before we can use machine learning, i.e. we need to turn it into numbers.

Let's start with the class column (the one we want to predict).

In [None]:
# https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
# that link is for R but it is an interesting discussion anyway
le = preprocessing.LabelEncoder()
classes = le.fit_transform(df['class'])
print("class labels", le.classes_)
print("classes as numbers", classes)
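
To see what the label encoder does on a small scale, here is a minimal sketch. The list `toy_labels` is made up for illustration and only stands in for the class column; it is not taken from the mushroom data.

```python
from sklearn import preprocessing

# Hypothetical toy labels, standing in for the class column
toy_labels = ['p', 'e', 'e', 'p', 'e']

toy_le = preprocessing.LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# The encoder sorts the distinct labels and numbers them from 0
print("labels", toy_le.classes_)
print("labels as numbers", toy_codes)

# inverse_transform turns the numbers back into the original labels
print("back to labels", toy_le.inverse_transform(toy_codes))
```

Keeping the encoder around is useful because `inverse_transform` lets us read predictions as labels again later.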

For the features, e.g. odor, we need to go a bit further and treat each type (e.g. odor=n, odor=a) as its own feature.

This is called **one-hot** encoding.

In [None]:
one_hot = pd.get_dummies(df.drop(['class'],axis=1))
one_hot.head()
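
One-hot encoding is easier to follow on a tiny table. The sketch below uses a made-up three-row table (`toy` is hypothetical, not the mushroom data) with two text columns.

```python
import pandas as pd

# Hypothetical toy table with two text features
toy = pd.DataFrame({
    'odor': ['n', 'a', 'n'],
    'cap-shape': ['x', 'b', 'x'],
})

# Each distinct value becomes its own column, named <feature>_<value>
toy_one_hot = pd.get_dummies(toy)
print(toy_one_hot.columns.tolist())

# Every row has exactly one "hot" column per original feature
print(toy_one_hot.astype(int))
```

Each original column is replaced by one column per value, so two features with two values each become four columns.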

## Building a model

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
model = tree.DecisionTreeClassifier()
model.fit(one_hot,classes)

## Evaluate your model

In [None]:
dot_data = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=one_hot.columns,
    class_names=le.classes_,
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = graphviz.Source(dot_data)
graph

It is unfortunate that we can't smell a picture! Can we predict without the smell?

In [None]:
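# Drop the odor column as well as the class, then one-hot encode the rest
one_hot = pd.get_dummies(df.drop(['class', 'odor'], axis=1))

# Train a fresh tree on the remaining features
model = tree.DecisionTreeClassifier()
model.fit(one_hot, classes)

# Draw the new tree
dot_data = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=one_hot.columns,
    class_names=le.classes_,
    filled=True,
    rounded=True,
    special_characters=True,
)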
graph = graphviz.Source(dot_data)
graph

Quantitatively, how good are the predictions? Let's look at the proportion of predictions we got right.


In [None]:
model.score(one_hot,classes)
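
The score above is measured on the same rows the model was trained on, so it can look better than the model really is. A common check is to hold some rows back and score on those instead. The sketch below shows the idea on made-up data: the `toy` table, the rule that only odor f is poisonous, and the 25% test size are all assumptions for illustration, not properties of the mushroom data.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows with two text features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'odor': rng.choice(['n', 'a', 'f'], size=200),
    'cap-shape': rng.choice(['x', 'b'], size=200),
})

# Made-up rule for illustration: only odor 'f' is poisonous (1)
toy_classes = (toy['odor'] == 'f').astype(int)

toy_one_hot = pd.get_dummies(toy)

# Hold back 25% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    toy_one_hot, toy_classes, test_size=0.25, random_state=0
)

toy_model = tree.DecisionTreeClassifier(random_state=0)
toy_model.fit(X_train, y_train)

print("accuracy on training rows", toy_model.score(X_train, y_train))
print("accuracy on held-back rows", toy_model.score(X_test, y_test))
```

If the two numbers are far apart, the model has probably memorised the training rows rather than learned a general rule.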

## Avoiding overfitting