# Using a Decision Tree to categorize emails

## Decision Tree Mini Project

In this project, we will again try to classify emails, this time using a decision tree.   The starter code is in decision_tree/dt_author_id.py.

### Part 1: Get the Decision Tree Running
Get the decision tree up and running as a classifier, setting min_samples_split=40.  It will probably take a while to train.  What’s the accuracy?
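
As a rough sketch of what Part 1 asks for, the snippet below trains a DecisionTreeClassifier with min_samples_split=40 and reports fit time and accuracy. The data here is a synthetic stand-in generated with make_classification (an assumption made so the example runs on its own); in the project the features come from the starter code's preprocessing step, so the numbers will differ.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email features (assumption: the real data
# comes from the preprocessing module in the starter code).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# min_samples_split=40: a node with fewer than 40 samples is never split.
clf = DecisionTreeClassifier(min_samples_split=40, random_state=0)

t0 = time()
clf.fit(X_train, y_train)
fit_time = time() - t0

# accuracy_score takes the true labels first, then the predictions.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"fit time: {fit_time:.3f}s, accuracy: {accuracy:.3f}")
```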

### Part 2: Speed It Up
You found in the SVM mini-project that parameter tuning can significantly speed up the training time of a machine learning algorithm.  A general rule is that the parameters control the complexity of the algorithm, and more complex algorithms generally run more slowly.
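
To make the link between a parameter and model complexity concrete, here is a small sketch that fits the same tree with three values of min_samples_split and prints the resulting node count and depth. The dataset is synthetic and the three values are arbitrary choices for illustration, not settings taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the email features).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=0
)

node_counts = {}
for min_samples_split in (2, 40, 400):
    clf = DecisionTreeClassifier(
        min_samples_split=min_samples_split, random_state=0
    ).fit(X, y)
    # tree_.node_count and get_depth() describe the size of the fitted tree.
    node_counts[min_samples_split] = clf.tree_.node_count
    print(
        f"min_samples_split={min_samples_split}: "
        f"{clf.tree_.node_count} nodes, depth {clf.get_depth()}"
    )
```

A larger min_samples_split stops splitting earlier, so the tree ends up smaller and there is less work to do during training.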

Another way to control the complexity of an algorithm is via the number of features that you use in training/testing.  The more features the algorithm has available, the more potential there is for a complex fit.  We will explore this in detail in the “Feature Selection” lesson, but you’ll get a sneak preview now.

- Find the number of features in your data.  The data is organized into a numpy array where the number of rows is the number of data points and the number of columns is the number of features, so to extract this number, use a line of code like len(features_train[0])
- Go into tools/email_preprocess.py and find the line of code that looks like this:     selector = SelectPercentile(f_classif, percentile=10)  Change percentile from 10 to 1.
- What’s the number of features now?
- What do you think SelectPercentile is doing?  Would a large value for percentile lead to a more complex or less complex decision tree, all other things being equal?
- Note the difference in training time depending on the number of features.  
- What’s the accuracy when percentile = 1?
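
The steps in this list can be tried out in isolation with the sketch below. It counts the columns of a feature matrix and then applies SelectPercentile with percentile set to 10 and to 1. The matrix is synthetic with 1000 columns (an assumption, so the example is self-contained); the project's real matrix comes from tools/email_preprocess.py and has a different width.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic feature matrix (assumption: stands in for the email features).
X, y = make_classification(
    n_samples=500, n_features=1000, n_informative=20, random_state=0
)

# Rows are data points, columns are features.
# len(X[0]) and X.shape[1] both give the number of features.
n_features = len(X[0])
print("features before selection:", n_features)

feature_counts = {}
for percentile in (10, 1):
    # Keep only the top `percentile` percent of features, ranked by
    # the ANOVA F-score that f_classif computes against the labels.
    selector = SelectPercentile(f_classif, percentile=percentile)
    X_reduced = selector.fit_transform(X, y)
    feature_counts[percentile] = X_reduced.shape[1]
    print(f"percentile={percentile}: {X_reduced.shape[1]} features kept")
```

Fewer features give the tree fewer candidate splits to evaluate at every node, which is one reason training gets faster as percentile shrinks.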

In [1]:
from time import time
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from email_preprocess import preprocess_emails

In [2]:
features_train, features_test, labels_train, labels_test = preprocess_emails()

restore from cache


In [3]:
from cache_sklearn_model import retrieve_cached_model, save_cached_model

def optimize_dt(amount_of_training_data = 1,
                features = None,
                labels = None,
                **kwargs,
               ):
    if features is not None and labels is not None:
        pass
    elif amount_of_training_data == 1:
        features = features_train
        labels = labels_train
    else:
        features,_,labels,_ = train_test_split(
            features_train,
            labels_train,
            train_size=amount_of_training_data,
            random_state=91,
        )

    print("training on", len(features), "out of", len(features_train),
          "(", len(features)/len(features_train)*100 ,"%)"
         )
    
    clf = DecisionTreeClassifier(random_state = 45, **kwargs)
    
    data_desc = f'preprocess_emails_{amount_of_training_data*100}'
    [is_restored, clf, meta] = retrieve_cached_model(clf, data_desc)
    
    fit_delta = meta.get("fit_time_sec")
    if not is_restored:
        t = time()
        clf.fit(features, labels)
        fit_delta = round(time()-t, 3)
        save_cached_model(clf, data_desc, {"fit_time_sec": fit_delta})
    print("clf fit time:", fit_delta, "s")
    
    # output predictions
    t = time()
    labels_pred = clf.predict(features_test)
    pred_delta = round(time()-t, 3)
    print("clf predict time:", pred_delta, "s")
    
    accuracy = accuracy_score(labels_test, labels_pred)  # (y_true, y_pred)
    print("accuracy:", accuracy)
    
    print(f'| {amount_of_training_data} | {kwargs} | {fit_delta}s | {pred_delta}s | {accuracy} |')
    
    return clf

## How long do DTs take to compute?

In [4]:
_ = optimize_dt(amount_of_training_data = 0.10)

training on 1582 out of 15820 ( 10.0 %)
clf fit time: 0.94 s
clf predict time: 0.016 s
accuracy: 0.8799772468714449
| 0.1 | {} | 0.94s | 0.016s | 0.8799772468714449 |


In [5]:
_ = optimize_dt(amount_of_training_data = 0.20)

training on 3164 out of 15820 ( 20.0 %)
clf fit time: 4.844 s
clf predict time: 0.011 s
accuracy: 0.934584755403868
| 0.2 | {} | 4.844s | 0.011s | 0.934584755403868 |


In [6]:
_ = optimize_dt(amount_of_training_data = 0.40)

training on 6328 out of 15820 ( 40.0 %)
clf fit time: 19.802 s
clf predict time: 0.01 s
accuracy: 0.9556313993174061
| 0.4 | {} | 19.802s | 0.01s | 0.9556313993174061 |


In [7]:
_ = optimize_dt(amount_of_training_data = 0.80)

training on 12656 out of 15820 ( 80.0 %)
clf fit time: 57.472 s
clf predict time: 0.011 s
accuracy: 0.9732650739476678
| 0.8 | {} | 57.472s | 0.011s | 0.9732650739476678 |


In [8]:
_ = optimize_dt(amount_of_training_data = 1)

training on 15820 out of 15820 ( 100.0 %)
clf fit time: 80.054 s
clf predict time: 0.013 s
accuracy: 0.9908987485779295
| 1 | {} | 80.054s | 0.013s | 0.9908987485779295 |
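
One way to summarize the five runs above is to estimate how fit time grows with the number of training samples. The sketch below fits a straight line to the timings on a log-log scale; the slope is an empirical scaling exponent. The sample counts and times are copied from the outputs above. They come from a single machine and a single run each, so the estimate is only a rough indication.

```python
import numpy as np

# Training set sizes and fit times (seconds) copied from the runs above.
n_samples = np.array([1582, 3164, 6328, 12656, 15820])
fit_times = np.array([0.94, 4.844, 19.802, 57.472, 80.054])

# If fit_time is roughly c * n**k, then log(fit_time) = k * log(n) + log(c),
# so the slope of a degree-1 fit in log-log space estimates k.
slope, intercept = np.polyfit(np.log(n_samples), np.log(fit_times), 1)
print(f"estimated scaling exponent: {slope:.2f}")

# Prediction time, by contrast, stays roughly constant across the runs.
predict_times = np.array([0.016, 0.011, 0.010, 0.011, 0.013])
print(f"predict time range: {predict_times.min()}s to {predict_times.max()}s")
```

A slope above 1 means training time grows faster than linearly with the amount of data, which matches the jump from under a second at 10% to over a minute at 100%.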


## How do different params perform on this dataset?