# Classifying brains with Machine Learning
In this notebook you will learn how to use machine learning to predict whether or not a brain belongs to a modern bird or a non-avian dinosaur. 

First import pandas, numpy, and matplotlib.pyplot, along with the scikit-learn helpers and the graphviz packages used later in the notebook:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
import pydotplus
import graphviz

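Note: if this cell stops with "ModuleNotFoundError: No module named 'graphviz'", the graphviz package is not installed in your environment. Install it (and pydotplus) and rerun the cell. These two packages are only needed for the export_graphviz cell in the bonus section.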

We will also need the tree module of the sklearn library:

In [None]:
from sklearn import tree

Read the bird_dino_data.csv file into a dataframe and add two new columns:
- Brain vs body mass (use total endocranium / (body mass * 1000))
- Cerebrum vs total brain (use cerebrum / endocranium)

In [None]:
df = pd.read_csv("bird_dino_data.csv")
df.head()

In [None]:
df["Brain vs. Bodymass"] = df["Total Endocranium (cm3)"] / (df["Body Mass (kg)"] * 1000)
df["Cerebrum vs. Total Brain"] = df["Cerebrum (cm3)"] / df["Total Endocranium (cm3)"]

Display the head of your dataframe to check that the two new columns are correct:

In [None]:
df.head()
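
To check the two formulas by hand, here is a minimal sketch on a single made-up specimen. The numbers are invented for illustration and are not taken from bird_dino_data.csv; only the column names are assumed to match the ones used above.

```python
import pandas as pd

# One invented specimen: 5 cm3 endocranium, 3 cm3 cerebrum, 0.5 kg body mass
toy = pd.DataFrame({
    "Total Endocranium (cm3)": [5.0],
    "Cerebrum (cm3)": [3.0],
    "Body Mass (kg)": [0.5],
})

# Body mass is multiplied by 1000 before dividing: 5 / (0.5 * 1000) = 0.01
toy["Brain vs. Bodymass"] = toy["Total Endocranium (cm3)"] / (toy["Body Mass (kg)"] * 1000)

# Share of the whole brain taken up by the cerebrum: 3 / 5 = 0.6
toy["Cerebrum vs. Total Brain"] = toy["Cerebrum (cm3)"] / toy["Total Endocranium (cm3)"]

print(toy[["Brain vs. Bodymass", "Cerebrum vs. Total Brain"]])
```

Without the parentheses around body mass * 1000, Python would divide first and multiply afterwards, which gives a very different number.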

Our machine learning library requires our "classes" to be integers instead of strings.

Change the values of the "Bird or Dino" column from "Bird" to 0 and from "Dino" to 1:

Hints: 
- use .loc indexes
- you can reassign the value in a dataframe column using =

*Select the rows and the column in a single .loc call. Chaining df["Bird or Dino"].loc[...] triggers a pandas warning and may leave the original dataframe unchanged in newer pandas versions.*

In [None]:
df.loc[df["Bird or Dino"]=="Bird", "Bird or Dino"] = 0
df.loc[df["Bird or Dino"]=="Dino", "Bird or Dino"] = 1

Display the first 27 rows of your dataframe to check that both labels were replaced:

In [None]:
df.head(27)
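
As an alternative to .loc, the same relabelling can be sketched with the Series.map method, which looks every value up in a dictionary. The three labels below are invented for illustration; the variable names toy and mapped are not part of the notebook.

```python
import pandas as pd

# Invented labels standing in for the "Bird or Dino" column
toy = pd.DataFrame({"Bird or Dino": ["Bird", "Dino", "Bird"]})

# Replace every label in one step; the result holds integers, not strings
mapped = toy["Bird or Dino"].map({"Bird": 0, "Dino": 1})

print(mapped.tolist())
print(mapped.dtype)
```

Any label missing from the dictionary would become NaN, so this approach should only be used when every class is listed.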

Our machine learning algorithm requires a numpy matrix instead of a dataframe.

PAUSE: When you get to this point, let your Helen Fellow know and we will review numpy matrices before we continue with machine learning

We can convert the dataframe to a numpy matrix using the .to_numpy() method. Assign your matrix to a variable:

In [None]:
numdf = df.to_numpy()

Print out the data type of the matrix and the first value in the matrix (note: this is a two-dimensional matrix, so the first value needs a row index and a column index):

In [None]:
print(type(numdf))
print(numdf[0, 0])
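
Here is a small sketch of how indexing works on a two-dimensional matrix. The table is invented for illustration (the species and body masses are placeholders, not rows from bird_dino_data.csv).

```python
import numpy as np
import pandas as pd

# Invented mini-table with one text column and two numeric columns
toy = pd.DataFrame({
    "Species": ["Woodpecker", "Tyrannosaurus"],
    "Bird or Dino": [0, 1],
    "Body Mass (kg)": [0.07, 7000.0],
})
arr = toy.to_numpy()

print(type(arr))   # the array class
print(arr.shape)   # (number of rows, number of columns)
print(arr[0, 0])   # row 0, column 0
print(arr[1, 2])   # row 1, column 2
print(arr[0])      # the whole first row
```

The first index always picks the row and the second picks the column, and both start counting at 0.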

Now we will create our classifier. Just as it is common to call a dataframe "df" it is common to call a classifier "clf":

In [None]:
clf = tree.DecisionTreeClassifier()

Next, we will select the x and y data for our algorithm. x should be the two columns we will use to train the algorithm (brain to body ratio and cerebrum to whole brain ratio). y should be the column that contains our "class labels" (the "Bird or Dino" column, index 1 in the code below).

Hint: You can use slicing to select a particular value from every row of a numpy array. For example, using the index [:,1] will select the second column.

In [None]:
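# The matrix mixes text and numbers, so convert each slice to a numeric type
x = numdf[:, 9:].astype(float)
y = numdf[:, 1].astype(int)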
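
The slicing hint above can be sketched on a tiny invented matrix. The column layout here (name, label, two ratios) is an assumption made for the example and is simpler than the real data file.

```python
import numpy as np

# Invented rows: name, class label, then two feature ratios
arr = np.array([
    ["specimen A", 0, 0.020, 0.70],
    ["specimen B", 1, 0.001, 0.40],
    ["specimen C", 0, 0.030, 0.75],
], dtype=object)

labels = arr[:, 1].astype(int)        # every row, column 1 only
features = arr[:, 2:].astype(float)   # every row, column 2 through the end

print(labels)
print(features.shape)
```

The colon before the comma means "all rows", and 2: means "from column 2 to the last column".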

Next we will use the .fit() method to fit our data to the classifier:

In [None]:
dtree = clf.fit(x,y)

We can visualize the path of the decision tree's decision making using the tree.plot_tree function and matplotlib.pyplot's plt.show function:

In [None]:
tree.plot_tree(dtree,class_names=["Bird","Dino"])
plt.show()
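
If the plot is hard to read, scikit-learn can also print the same rules as text with export_text. This is a minimal sketch on four invented specimens, not the notebook's data; the names X_toy, y_toy, and toy_clf are made up for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [brain to body ratio, cerebrum ratio]
X_toy = [[0.020, 0.70], [0.030, 0.75], [0.001, 0.40], [0.002, 0.45]]
y_toy = [0, 0, 1, 1]  # 0 = bird, 1 = dino

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# export_text returns the fitted tree's rules as a plain string
rules = export_text(toy_clf, feature_names=["Brain Body Ratio", "Cerebrum Ratio"])
print(rules)
```

Each indented line is one branch of the tree, and the lines that start with "class" are the leaves where a decision is made.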

Now let's test out our decision tree with some data from one of the brains we studied! We can use the .predict_proba method. 

A result of array([[1., 0.]]) means the algorithm is certain it's a bird and a result of array([[0., 1.]]) means the algorithm is certain it's a dinosaur.

For example:

In [None]:
# This is the brain to body mass ratio and cerebrum to whole brain ratio for the woodpecker:
clf.predict_proba([[0.22,0.71]])

In [None]:
clf.predict_proba([[0.0001, 432]])
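
To see how .predict_proba relates to a single predicted label, here is a sketch on four invented specimens (not the notebook's data). The order of the probabilities follows the classifier's classes_ attribute, so the first number belongs to class 0 (bird) and the second to class 1 (dino).

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [brain to body ratio, cerebrum ratio]
X_toy = [[0.020, 0.70], [0.030, 0.75], [0.001, 0.40], [0.002, 0.45]]
y_toy = [0, 0, 1, 1]  # 0 = bird, 1 = dino

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

specimen = [[0.025, 0.72]]  # invented values, close to the two bird rows

print(toy_clf.classes_)                  # column order used by predict_proba
print(toy_clf.predict_proba(specimen))   # one probability per class
print(toy_clf.predict(specimen))         # the single most likely class
```

Use .predict when you only want the answer and .predict_proba when you also want to know how sure the tree is.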

Try it with the data from your brain specimen!

In [None]:
df.head()

## Bonus Challenge: 
Try to train another classifier that's based on the size of each brain region and test it out!

In [None]:
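Clf_2 = tree.DecisionTreeClassifier()
# Columns 4 to 8 hold the brain region sizes; column 1 holds the class labels
a = numdf[:, 4:9].astype(float)
b = numdf[:, 1].astype(int)

In [None]:
# Fit the new classifier (not clf), so the first classifier is left unchanged
Clf_2 = Clf_2.fit(a, b)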

In [None]:
tree.plot_tree(Clf_2,class_names=["Bird","Dino"])
plt.show()

In [None]:
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=['Brain Body Ratio', 'Cerebrum Ratio'],
                                class_names=['Bird', 'Dino'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph


## Extra Bonus from Gabrielle

In [None]:
# Hold out half of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Train a fresh tree on the training half only, then predict the held-out half
clrTree = tree.DecisionTreeClassifier()
clrTree = clrTree.fit(x_train, y_train)
outTree = clrTree.predict(x_test)

print(f"Accuracy for Decision Tree Classifier: {accuracy_score(y_test, outTree) * 100}%")
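
Because train_test_split shuffles the rows at random, the accuracy above can change every time the cell is run. The sketch below shows one way to make the split repeatable and balanced, using the random_state and stratify parameters. The data is randomly generated for illustration and does not come from bird_dino_data.csv.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Invented data: 10 "birds" (label 0) and 10 "dinos" (label 1), two features each
birds = rng.uniform([0.01, 0.6], [0.05, 0.8], size=(10, 2))
dinos = rng.uniform([0.0001, 0.3], [0.001, 0.5], size=(10, 2))
X_toy = np.vstack([birds, dinos])
y_toy = np.array([0] * 10 + [1] * 10)

# stratify keeps the bird/dino balance the same in both halves;
# random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
toy_acc = accuracy_score(y_test, toy_tree.predict(X_test))
print(f"Held-out accuracy: {toy_acc * 100:.1f}%")
```

With a small dataset like ours, a balanced split matters: without stratify, the test half could end up with almost no birds or almost no dinosaurs.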