# Mini-Project - Decision Tree

##### Student Tags

Author: Anderson Hitoshi Uyekita    
Mini-Project: Support Vector Machine  
Course: Data Science - Foundations II  
COD: ND111  
Date: 16/01/2019    

***

## Table of Contents
- [Introduction](#intro)
- [Given code](#code)
    - [Function](#function)
- [Part 1](#part_1)
- [Part 2](#part_2)

### General Information

This Jupyter Notebook (in Python 2) aims to record all exercise coded to the Support Vector Machine Mini Project.

## Introduction <a id='intro'></a>

In this project, we will again try to identify the authors in a body of emails, this time using a decision tree. The starter code is in decision_tree/dt_author_id.py.

### Instructions

#### Decision Tree Mini Project

In this project, we will again try to classify emails, this time using a decision tree. The starter code is in `decision_tree/dt_author_id.py`.

#### Part 1: Get the Decision Tree Running
Get the decision tree up and running as a classifier, setting `min_samples_split=40`.  It will probably take a while to train.  What’s the accuracy?

#### Part 2: Speed It Up
You found in the SVM mini-project that the parameter tune can significantly speed up the training time of a machine learning algorithm.  A general rule is that the parameters can tune the complexity of the algorithm, with more complex algorithms generally running more slowly.  

Another way to control the complexity of an algorithm is via the number of features that you use in training/testing.  The more features the algorithm has available, the more potential there is for a complex fit.  We will explore this in detail in the “Feature Selection” lesson, but you’ll get a sneak preview now.

* find the number of features in your data.  The data is organized into a numpy array where the number of rows is the number of data points and the number of columns is the number of features; so to extract this number, use a line of code like `len(features_train[0]`)
* go into `tools/email_preprocess.py`, and find the line of code that looks like this:

$$selector = SelectPercentile(\text{f_classif}, percentile=1)$$

Change percentile from 10 to 1.

* What’s the number of features now?
* What do you think SelectPercentile is doing?  Would a large value for percentile lead to a more complex or less complex decision tree, all other things being equal?
* Note the difference in training time depending on the number of features.  
* What’s the accuracy when `percentile = 1`?


## Given Code <a id='code'></a>

In [1]:
#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use a SVM to identify emails from the Enron corpus by their authors:    
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from email_preprocess_dt import preprocess_dt


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


### Importing Packages

In [2]:
# Importing packages.

# Importing the Scikit Learn package of Support Vector Machine
from sklearn import tree

# Importing the Scikit Learn to calcutate the accuracy.
from sklearn.metrics import accuracy_score

### Defining a function <a id='function'></a>

This function aims to save lines of code.

In [3]:
# Function to calculate the accuracy.
def my_dt(min_samples_split, features_train = features_train, labels_train = labels_train,
                   features_test = features_test, labels_test = labels_test, prediction = False):
    """
    This function will calculate the accuracy given the training and test inputs.
    """
    
    # Creating the classifier using the linear kernel.
    clf = tree.DecisionTreeClassifier(min_samples_split = 50)

    # Saving time to compute the elapse time of fitting process.
    t0 = time()

    # Fitting/Training clf based on training dataframes.
    clf.fit(features_train, labels_train)

    # Calculating the elapse time of fit calculation.
    print "training time:", round(time()-t0, 3), "s"

    # Saving time to compute the elapse time of predicting process. 
    t1 = time()

    # Storing the predict from features_test in pred.
    pred = clf.predict(features_test)

    # Calculating the elapse time of predicting calculation.
    print "predict time:", round(time()-t1, 3), "s"

    # Calculating the accuracy and storing in acc.
    acc = accuracy_score(pred, labels_test)

    # Printing the acc.
    print "Accuracy:", round(acc,4)
    
    # Returning or not
    if prediction == True:
        return pred
    

## Part 1 <a id='part_1'></a>

Get the decision tree up and running as a classifier, setting `min_samples_split=40`.  It will probably take a while to train.  What’s the accuracy?


In [4]:
my_dt(min_samples_split = 40)

training time: 45.212 s
predict time: 0.05 s
Accuracy: 0.9721


>What is the accuracy of your decision tree?

Accuracy: 0.9733

## Part 2 <a id='part_2'></a>

You found in the SVM mini-project that the parameter tune can significantly speed up the training time of a machine learning algorithm. A general rule is that the parameters can tune the complexity of the algorithm, with more complex algorithms generally running more slowly.

Another way to control the complexity of an algorithm is via the number of features that you use in training/testing. The more features the algorithm has available, the more potential there is for a complex fit. We will explore this in detail in the “Feature Selection” lesson, but you’ll get a sneak preview now.

**What's the number of features in your data?** (Hint: the data is organized into a numpy array where the number of rows is the number of data points and the number of columns is the number of features; so to extract this number, use a line of code like len(features_train[0]).)

In [5]:
len(features_train[0])

3785

>How many features are in the data?

3785.

***

go into ../tools/email_preprocess.py, and find the line of code that looks like this:

selector = SelectPercentile(f_classif, percentile=10)

Change percentile from 10 to 1, and rerun dt_author_id.py. **What’s the number of features now?**

In [6]:
# I have create an other email_preprocess.py with the change in percentile from 10 to 1.
features_train_1, features_test_1, labels_train_1, labels_test_1 = preprocess_dt()

# Number of features
len(features_train_1[0])

no. of Chris training emails: 7936
no. of Sara training emails: 7884


379

>How many features are in the data now?

379.

***

What do you think SelectPercentile is doing? Would a large value for percentile lead to a more complex or less complex decision tree, all other things being equal? Note the difference in training time depending on the number of features.

>Would a large number of features give you a more or less complex decision tree, all other things being equal?

more complex

***

What's the accuracy of your decision tree when you use only 1% of your available features (i.e. percentile=1)?

In [7]:
my_dt(min_samples_split = 40, features_train = features_train_1, labels_train = labels_train_1,
                              features_test = features_test_1, labels_test = labels_test_1, prediction = False)

training time: 4.14 s
predict time: 0.0 s
Accuracy: 0.9636


>What's the accuracy of your decision tree when you use only 1% of your available features (i.e. percentile=1)?

Accuracy: 0.9636