Before we get to the actual task, you will have to run the following code cells. These set up the Colaboratory backend properly.

In [0]:
!pip install pydot_ng
!apt-get install graphviz
!pip install graphviz

In [0]:
# basic libraries you need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# these libraries are needed for the tree visualization (which is provided)
from sklearn.tree import export_graphviz
from io import StringIO
from graphviz import Source
import pydot_ng as pydot

# These are the functions from sklearn we recommend using for this assignment
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

# The following is code for uploading a file to the colab.research.google 
# environment.

# library for uploading files
from google.colab import files 

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

# Decision Tree Task

For this task you will be building a model to predict if a breast mass is malignant or benign. The dataset we provided contains 30 features and 1 binary class label. The label is whether the mass is malignant or benign. The features describe some characteristics of the cell nuclei in the mass. More information about the dataset we will be using can be found here:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

We want two things out of this task:

1) An evaluation of a decision tree algorithm using cross validation

2) A visualization of the resulting white box model

This notebook will lead you step by step through what you should do to accomplish points 1) and 2). It assumes you have done the Friday evening portions of the module. Feel free to borrow code that you wrote on Friday evening.

### Load Data

Load the data using the `upload_files` function. If you are working locally, set the path in `read_csv` to be a path on your local machine.

In [0]:
# upload the breast-cancer-data.csv file
upload_files()

In [0]:
# load the following data frame
cancer_df = pd.read_csv("breast-cancer-data.csv")

Do some basic EDA on the DataFrame. If you did not take module 2 (where we learnt more about EDA), just execute the .head() and .info() methods.
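
As a rough illustration of what these calls return, here is a minimal sketch on a tiny stand-in DataFrame. The frame `toy_df` and its values are made up for this example (they are not taken from the provided file), and the `value_counts()` call is an extra suggestion for checking class balance rather than something the task requires.

```python
import pandas as pd

# Hypothetical stand-in for cancer_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "diagnosis": ["M", "B", "B", "M", "B", "B"],
    "radius_mean": [17.99, 13.54, 13.08, 20.57, 9.50, 12.45],
})

# .head() returns the first five rows as a new DataFrame
first_rows = toy_df.head()
print(first_rows)

# .info() prints column names, dtypes and non-null counts
toy_df.info()

# Optional extra: how many rows fall into each class
class_counts = toy_df["diagnosis"].value_counts()
print(class_counts)
```

On the real data you would call the same methods on `cancer_df`.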



In [0]:
# EDA

### Initialize the display_tree function

We have provided a function for displaying the contents of a tree. Note that if your tree is too deep or wide it will be difficult to read. Run the following code cell so you can use the function later:

In [0]:
def display_tree(tree):
    """
    args:
        tree -  a DecisionTreeClassifier object that has been trained.
    """
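    dot_data = StringIO()
    # export_graphviz writes the DOT description into dot_data and returns None
    # note: this relies on the global `features` DataFrame for the feature names
    export_graphviz(tree,
                    out_file=dot_data,
                    filled=True,
                    rounded=True,
                    special_characters=True,
                    class_names=["Benign", "Malignant"],
                    feature_names=list(features.columns))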
    graph = pydot.graph_from_dot_data(dot_data.getvalue())

    graph.write_png("tmp_image.png")
    tmp_image = plt.imread("tmp_image.png")

    plt.figure(figsize=(20, 10))
    plt.imshow(tmp_image)
    plt.grid(False)

### Preparing the data

Use the following code to split the dataset into features and labels. This will convert the labels into 1 or 0 (for malignant or benign respectively). After running this, `features` will contain the features and `labels` will contain the labels.

In [0]:
# sets the random seed. This makes it so your experiments are reproducible
np.random.seed(1337)

# the label column is called diagnosis. Grab just that column and
# convert to 1 or 0
label_column_name = "diagnosis"
labels = cancer_df.loc[:, label_column_name]
labels = labels.apply(lambda x: 1 if x=="M" else 0)

# everything from column 2 up to (but not including) the last column is a feature
feat_column_names = list(cancer_df.columns[2:-1])
features = cancer_df.loc[:, feat_column_names]

### The Cross Validation

This is where you take over. Perform 5-fold Stratified Cross Validation using the features and labels prepared for you. You should use the DecisionTreeClassifier model. Information about that model can be found here:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Note that you should be able to get away with default parameter values, except for max_depth. I recommend something small (e.g. 2, 3, or 4) to make the tree easier to visualize. Feel free to play around with the other values if you want, **but** you should leave that until you have finished the task (so that you have enough time).

Evaluate the results of your cross validation using the accuracy, precision, and recall scores. sklearn functions for computing those scores have been imported for you. The API docs for them are here:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

Scores should be computed on all the predictions at once, i.e. there will be one score for each metric. Collect the true labels and predictions for each fold and compute the metrics at the end. Hint: Lists in Python have a method called .extend() which lets you put the contents of one list or iterable into the list you are calling extend on.
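
The pooling idea described above can be sketched as follows. This is only one possible shape for the loop, shown on synthetic data from `make_classification` (an assumption made so the example runs on its own); the `random_state` values, `shuffle=True`, and `max_depth=3` are illustrative choices, not requirements of the task.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 10 features, binary labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Pooled true labels and predictions across all folds
all_true, all_pred = [], []

for train_idx, test_idx in skf.split(X, y):
    # A fresh model per fold, trained only on that fold's training rows
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])

    # .extend() appends each element of the iterable to the list
    all_true.extend(y[test_idx])
    all_pred.extend(model.predict(X[test_idx]))

# One score per metric, computed on the pooled predictions
accuracy = accuracy_score(all_true, all_pred)
precision = precision_score(all_true, all_pred)
recall = recall_score(all_true, all_pred)
print(accuracy, precision, recall)
```

With pandas objects such as `features` and `labels`, the row selection would use `.iloc[train_idx]` and `.iloc[test_idx]` instead of plain NumPy indexing.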


In [0]:
## Put your cross validation code here!

### Visualize the tree

To see what a decision tree on this data might look like (white box model) we are going to have you train another tree using the **whole** dataset and display the resulting tree. Train a new tree using `features` and `labels`. Afterwards, call the display_tree function with the resulting tree object as an argument.
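
If graphviz is not available, a fitted tree can also be inspected as plain text. The sketch below is a stand-in for that idea, not a replacement for `display_tree`: it fits a shallow tree on all rows of a synthetic dataset (an assumption for illustration) and prints the learned rules with `sklearn.tree.export_text`. The feature names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 200 rows, 5 features, binary labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical feature names, used only to label the printed rules
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Fit on the whole dataset; a small max_depth keeps the tree readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text returns the decision rules as an indented string
rules = export_text(tree, feature_names=feature_names)
print(rules)
```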

In [0]:
# code goes here
