# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Before exploring, get an overview of the dataset. 
- Display data as HTML (with help of direct output of pandas)
- Display the data attribute details (from README.md)

In [3]:
# Display dataset
student_data

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes
5,GP,M,16,U,LE3,T,4,3,services,other,...,yes,no,5,4,2,1,2,5,10,yes
6,GP,M,16,U,LE3,T,2,2,other,other,...,yes,no,4,4,4,1,1,3,0,yes
7,GP,F,17,U,GT3,A,4,4,other,teacher,...,no,no,4,1,4,1,1,1,6,no
8,GP,M,15,U,LE3,A,3,2,services,other,...,yes,no,4,2,2,1,1,1,0,yes
9,GP,M,15,U,GT3,T,3,4,other,other,...,yes,no,5,5,1,1,1,5,0,yes


Attributes for student-data:

- school - student's school (binary: "GP" or "MS")
- sex - student's sex (binary: "F" - female or "M" - male)
- age - student's age (numeric: from 15 to 22)
- address - student's home address type (binary: "U" - urban or "R" - rural)
- famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
- Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- guardian - student's guardian (nominal: "mother", "father" or "other")
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
- passed - did the student pass the final exam (binary: yes or no)

Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [4]:
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1
n_passed = 0
for entry in student_data['passed']:
    if entry == 'yes':
        n_passed = n_passed + 1
n_failed = n_students - n_passed
grad_rate = float(n_passed) / float(n_students) * 100.0
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
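
The counting loop above can also be written with vectorized pandas operations. Below is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration only, not the student dataset.

```python
import pandas as pd

# Hypothetical stand-in for student_data: two feature columns plus the target
toy = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "internet": ["yes", "no", "yes", "yes"],
    "passed": ["yes", "no", "yes", "yes"],
})

n_students = toy.shape[0]
n_features = toy.shape[1] - 1                    # every column except the target
n_passed = int((toy["passed"] == "yes").sum())   # a boolean mask sums to a count
n_failed = n_students - n_passed
grad_rate = 100.0 * n_passed / n_students

print("Students: {}, passed: {}, failed: {}".format(n_students, n_passed, n_failed))
print("Features: {}, graduation rate: {:.2f}%".format(n_features, grad_rate))
```

Comparing the column against `'yes'` and summing the resulting mask avoids the explicit Python loop and gives the same count.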


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [5]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.

In [6]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
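
To see what the two conversions do in isolation, here is a small sketch on a made-up three-row frame. The `toy` data is an assumption for illustration. The yes/no column is mapped with `Series.map` rather than `replace`, and `dtype=int` is passed to `get_dummies` so that the indicator columns hold `1`/`0` on newer pandas versions as well.

```python
import pandas as pd

# Hypothetical toy frame: one yes/no column, one multi-valued column, one numeric column
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Mjob": ["teacher", "other", "health"],
    "age": [15, 16, 17],
})

# Binary yes/no column -> 1/0
toy_internet = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued categorical column -> one indicator (dummy) column per value
toy_mjob = pd.get_dummies(toy["Mjob"], prefix="Mjob", dtype=int)

# Numeric columns pass through unchanged
encoded = pd.concat([toy[["age"]], toy_internet, toy_mjob], axis=1)
print(encoded)
```

Each row has exactly one `1` among the `Mjob_*` columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.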


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [7]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# Then, select features (X) and corresponding labels (y) for the training and test sets
def shuffle_data(data_features, data_label):
    '''
    Given data as two different pandas objects (features and labels),
    shuffle both with the same random permutation of the index and
    return the shuffled features and labels.
    '''
    data_indices = data_features.index
    shuffled_data_indices = np.random.permutation(data_indices)
    return (data_features.reindex(shuffled_data_indices), 
            data_label.reindex(shuffled_data_indices))

X_all_shuffled, y_all_shuffled = shuffle_data(X_all, y_all)

# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
def split_data(data_features, data_label, num_train_samples, no_print=False):
    '''
    Split data from two different pandas arrays (features and labels),
    as training features, training labels, testing features and testing 
    labels arrays. No. of training samples desired is used as a measure 
    to split.
    '''
    # Fail loudly on mismatched inputs instead of silently returning None
    if data_features.shape[0] != data_label.shape[0]:
        raise ValueError("Features and labels must have the same number of rows")
    if not no_print:
        print "Splitting features and labels into {} training and {} testing samples".format(
            num_train_samples, data_features.shape[0] - num_train_samples)
    return (data_features[:num_train_samples], data_label[:num_train_samples],
            data_features[num_train_samples:], data_label[num_train_samples:])

X_train, y_train, X_test, y_test = split_data(X_all_shuffled, y_all_shuffled, num_train)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])

Splitting features and labels into 300 training and 95 testing samples
Training set: 300 samples
Test set: 95 samples
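
The shuffle above is unseeded, so the training and test sets differ from run to run. If reproducible splits are wanted, one option is to seed the permutation. This is a minimal sketch under that assumption; `seeded_split`, the toy data and the seed value are hypothetical names and values introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_all / y_all
X_toy = pd.DataFrame({"absences": [0, 2, 4, 6, 8, 10, 12, 14]})
y_toy = pd.Series(["yes", "yes", "no", "yes", "no", "yes", "yes", "no"])

def seeded_split(features, labels, num_train, seed):
    """Shuffle the row labels with a seeded generator, then slice into train/test."""
    rng = np.random.RandomState(seed)             # fixed seed -> same shuffle every run
    shuffled = rng.permutation(features.index.values)
    train_idx, test_idx = shuffled[:num_train], shuffled[num_train:]
    return (features.loc[train_idx], labels.loc[train_idx],
            features.loc[test_idx], labels.loc[test_idx])

X_tr, y_tr, X_te, y_te = seeded_split(X_toy, y_toy, num_train=6, seed=42)
X_tr2, _, _, _ = seeded_split(X_toy, y_toy, num_train=6, seed=42)

print("Train: {} samples, test: {} samples".format(len(X_tr), len(X_te)))
print("Same split on repeat: {}".format(X_tr.index.equals(X_tr2.index)))
```

Features and labels are selected with the same shuffled index, so each row keeps its own label.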


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
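
As a reminder of what the F<sub>1</sub> score measures, here is a small hand-rolled sketch of the formula F<sub>1</sub> = 2 · precision · recall / (precision + recall), with `'yes'` as the positive label. The function name `f1_from_labels` and the example labels are made up for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, pos_label="yes"):
    """Compute F1 = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)  # true positives
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)  # false positives
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)  # false negatives
    if tp == 0:
        return 0.0  # no true positives: treat the score as zero
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["yes", "yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "yes", "no", "yes"]

# 3 true positives, 1 false positive, 1 false negative
score = f1_from_labels(y_true, y_pred)
print("F1 score: {:.2f}".format(score))
```

Here precision and recall are both 3/4, so the F<sub>1</sub> score is 0.75.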

### Sample training and prediction

In [8]:
# Train a model
import time

def train_classifier(clf, X_train, y_train, no_print=False):
    '''
    Given a classifier, fit data to it and measure time taken.
    '''
    if no_print == False:
        print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    total_time = end - start
    if no_print == False:
        print "Done!\nTraining time (secs): {:.3f}".format(total_time)
    return total_time
    
# Choose a model, import it and instantiate an object
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it

Training DecisionTreeClassifier...
Done!
Training time (secs): 0.003
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')


In [9]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target, is_train_data=False, no_print=False):
    '''
    Given a classifier that has already been fit, predict
    outputs on data, and measure time taken. Depending on the type
    of data viz., training or testing, the std outputs differ.
    '''
    if is_train_data:
        data_type_string = "train data"
    else:
        data_type_string = "test data"
    if no_print == False:
        print "Predicting labels on {} using {}...".format(data_type_string, clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    total_time = end - start
    if no_print == False:
        print "Done!\nPrediction time on {} (secs): {:.3f}".format(data_type_string, total_time)
    return f1_score(target.values, y_pred, pos_label='yes'), total_time

# Predict with the classifier and print its F1 score
print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train, is_train_data=True)[0])

Predicting labels on train data using DecisionTreeClassifier...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 1.0


In [10]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)[0])

Predicting labels on test data using DecisionTreeClassifier...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.75


#### Helper functions for different models

In [11]:
def train_predict(clf, X_train, y_train, X_test, y_test):
    '''
    Helper function to perform training and prediction given a classifier,
    with outputs. Useful for small or single runs.
    '''
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train, is_train_data=True)[0])
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)[0])

In [12]:
from sklearn.cross_validation import train_test_split # For stratified split

def shuffle_and_split_data(data_features, data_labels, num_train_samples, stratify=False):
    '''
    Shuffle and split data; Wrapper function for shuffle_data() 
    and split_data() with no_print enabled in case of normal split.
    
    If stratify flag is enabled, split the data in a stratified
    manner i.e., Make sure the balance of labels in train and 
    test is the same.
    '''
    if not stratify:
        X_shuffled, y_shuffled = shuffle_data(data_features, data_labels)
        X_train, y_train, X_test, y_test = split_data(
            X_shuffled, y_shuffled, num_train_samples, no_print=True)
    else:
        X_train, X_test, y_train, y_test = train_test_split(
        data_features, data_labels, train_size=num_train_samples, stratify=data_labels)
    
    return X_train, y_train, X_test, y_test
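
To check what the `stratify` option buys, the sketch below splits a made-up label array and compares the share of `'yes'` labels in each part. The toy data is an assumption. The import path `sklearn.model_selection` assumes scikit-learn 0.18 or later, whereas the cell above uses the older `sklearn.cross_validation` module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 14 'yes' and 6 'no' labels (70% / 30%)
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array(["yes"] * 14 + ["no"] * 6)

# stratify=y_toy keeps the yes/no proportion the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=10, stratify=y_toy, random_state=0)

train_yes = float(np.mean(y_tr == "yes"))
test_yes = float(np.mean(y_te == "yes"))
print("Share of 'yes' in train: {:.2f}, in test: {:.2f}".format(train_yes, test_yes))
```

Without stratification the two shares can drift apart by chance, which matters more as the training subsets get smaller.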

In [13]:
# For convenience
num_train_samples = num_train
training_sample_sizes = [100, 200, 300]

In [14]:
from sklearn.preprocessing import StandardScaler

# Helper function to perform sample runs on models with desired subsets of training data
def sample_run_train_predict(clf, data_features, data_labels, num_train_samples, 
                             training_sample_sizes, stratify=False, scale=False):
    # Shuffle and split data
    X_train, y_train, X_test, y_test = shuffle_and_split_data(
        data_features, data_labels, num_train_samples, stratify)
    
    if scale:
        scaler = StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)

    # Run train_predict on various subsets of training data
    for size in training_sample_sizes:
        train_predict( clf, X_train[:size], y_train[:size], X_test, y_test )

In [15]:
# Helper function to perform several runs on a given classifier and print model performance
def n_runs_classifier(clf, num_runs, data_features, data_labels, num_train_samples, 
                      stratify=False, scale=False):
    # Arrays to store each run's run times and f1 scores
    run_times_fit = np.zeros([num_runs])
    run_times_predict_train = np.zeros([len(training_sample_sizes), num_runs])
    run_times_predict_test = np.zeros([len(training_sample_sizes), num_runs])
    run_f1_scores_train = np.zeros([len(training_sample_sizes), num_runs])
    run_f1_scores_test = np.zeros([len(training_sample_sizes), num_runs])

    start = time.time()
    # Run fit and predict multiple times for each training set size
    for idx in range(num_runs):
        X_train, y_train, X_test, y_test = shuffle_and_split_data(
            data_features, data_labels, num_train_samples, stratify)
        
        if scale:
            scaler = StandardScaler().fit(X_train)
            X_train = scaler.transform(X_train)
            X_test = scaler.transform(X_test)

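        for size_idx, size in enumerate(training_sample_sizes):
            # Note: run_times_fit[idx] is overwritten for each size, so after this loop
            # it holds the fit time for the last (largest) training size only
            run_times_fit[idx] = train_classifier(clf, X_train[:size], y_train[:size], no_print=True)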
            run_f1_scores_train[size_idx, idx], run_times_predict_train[size_idx, idx] = predict_labels(
                clf, X_train[:size], y_train[:size], is_train_data=True, no_print=True)
            run_f1_scores_test[size_idx, idx], run_times_predict_test[size_idx, idx] = predict_labels(
                clf, X_test, y_test, no_print=True)
    end = time.time()
    
    total_time = end - start
    
    # Print the run statistics
    print "Completed {} runs with {} classifier in {:.3f} seconds".format(
        num_runs, clf.__class__.__name__, total_time)
    print "Mean training/fitting run times (secs): {:.6f}".format(np.mean(run_times_fit))

    for size_idx, size in enumerate(training_sample_sizes):
        print "\nFor {} training samples:".format(size)
        print "\tMean prediction times (secs):"
        print "\t\tTrain data: {:.6f}".format(np.mean(run_times_predict_train[size_idx]))
        print "\t\tTest data: {:.6f}".format(np.mean(run_times_predict_test[size_idx]))
        print "\tF1 scores:"
        print "\t\tTrain data: {:.6f}".format(np.mean(run_f1_scores_train[size_idx]))
        print "\t\tTest data: {:.6f}".format(np.mean(run_f1_scores_test[size_idx]))

### Model 1: Naive Bayes Classifier

A Gaussian naive Bayes classifier (`GaussianNB`) is trained on training sets of 100, 200 and 300 samples. First, a sample run with *default parameters* is executed. It is followed by 100 runs of the classifier on freshly shuffled data for all three sample sizes, and the mean prediction times and F<sub>1</sub> scores on the training and test sets are reported.

#### Sample run:

In [16]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
print "Sample run with {} classifier".format(clf.__class__.__name__)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes)

Sample run with GaussianNB classifier
------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.002
Predicting labels on train data using GaussianNB...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.541666666667
Predicting labels on test data using GaussianNB...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.340425531915
------------------------------------------
Training set size: 200
Training GaussianNB...
Done!
Training time (secs): 0.002
Predicting labels on train data using GaussianNB...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.805970149254
Predicting labels on test data using GaussianNB...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.725806451613
------------------------------------------
Training set size: 300
Training GaussianNB...
Done!
Training time (secs): 0.003
Predicting labels on train da

#### Full run - 100 iterations:

In [17]:
# Number of runs desired for GaussianNB
num_runs_gaussian_nb = 100

# Perform num_runs on GaussianNB
n_runs_classifier(clf, num_runs_gaussian_nb, X_all, y_all, num_train_samples)

Completed 100 runs with GaussianNB classifier in 1.663 seconds
Mean training/fitting run times (secs): 0.001667

For 100 training samples:
	Mean prediction times (secs):
		Train data: 0.000536
		Test data: 0.000530
	F1 scores:
		Train data: 0.654711
		Test data: 0.563431

For 200 training samples:
	Mean prediction times (secs):
		Train data: 0.000735
		Test data: 0.000540
	F1 scores:
		Train data: 0.781360
		Test data: 0.715857

For 300 training samples:
	Mean prediction times (secs):
		Train data: 0.000920
		Test data: 0.000538
	F1 scores:
		Train data: 0.797191
		Test data: 0.753155


### Model 2: SVM

A Support Vector Machine classifier (`SVC`) is tried with a few readily available kernel configurations on training sets of 100, 200 and 300 samples, with the features standardized first. The most promising configurations are then run 100 times on freshly shuffled data for all three sample sizes, and the mean prediction times and F<sub>1</sub> scores on the training and test sets are reported. Note that the _best_ kernel is picked manually by comparing these runs, without grid search, to narrow down the most suitable kernel configuration for this problem.
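
For comparison, a grid search over the same kernels could look roughly like the sketch below. This is an alternative technique, not what the notebook does. It runs on synthetic data from `make_classification` (an assumption, with 0/1 labels instead of `'yes'`/`'no'`) and assumes scikit-learn 0.18 or later for `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Scaling sits inside the pipeline, so each CV fold is scaled using its own training part
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__kernel": ["linear", "rbf", "poly", "sigmoid"]}

# Score each kernel by cross-validated F1 on the positive class (label 1)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print("Best kernel: {}".format(search.best_params_["svc__kernel"]))
print("Mean cross-validated F1: {:.3f}".format(search.best_score_))
```

The manual comparison in this section serves the same purpose; grid search mainly automates the bookkeeping and makes it easy to add further parameters such as `C` or `degree`.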

#### 1. Linear kernel

In [18]:
from sklearn.svm import SVC

clf = SVC(kernel='linear')
print "Sample run with {}".format(clf)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes, scale=True)

Sample run with SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.008
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.88188976378
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.000
F1 score for test set: 0.671875
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.015
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.002
F1 score for training set: 0.85409252669
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.7591240875

#### 2. RBF kernel

In [19]:
clf = SVC(kernel='rbf')
print "Sample run with {}".format(clf)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes, scale=True)

Sample run with SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.002
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.918238993711
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.758169934641
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.006
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.004
F1 score for training set: 0.903846153846
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.003
F1 score for test set: 0.75496

#### 3. Poly kernel (Degree 2)

In [20]:
clf = SVC(kernel='poly', degree=2)
print "Sample run with {}".format(clf)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes, scale=True)

Sample run with SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=2, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.89932885906
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.79746835443
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.004
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.003
F1 score for training set: 0.89932885906
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.8181818

#### 4. Poly kernel (Degree 3)

In [21]:
clf = SVC(kernel='poly', degree=3)
print "Sample run with {}".format(clf)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes, scale=True)

Sample run with SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.002
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.938775510204
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.771241830065
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.004
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.003
F1 score for training set: 0.938511326861
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.002
F1 score for test set: 0.7712

#### 5. Sigmoid kernel

In [22]:
clf = SVC(kernel='sigmoid')
print "Sample run with {}".format(clf)
sample_run_train_predict(clf, X_all, y_all, num_train_samples, training_sample_sizes, scale=True)

Sample run with SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.802395209581
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.819875776398
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.003
Predicting labels on train data using SVC...
Done!
Prediction time on train data (secs): 0.002
F1 score for training set: 0.805970149254
Predicting labels on test data using SVC...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.8

#### Best kernel for SVC
The performance of the rbf kernel and the poly (degree 2) kernel looks very similar in the sample runs. Each configuration is therefore run 100 times, and the better-performing of the two is picked.

#### 100 runs poly (Degree 2)

In [23]:
# Number of runs desired with poly kernel
num_runs_svc_poly_2 = 100

clf = SVC(kernel='poly', degree=2)

n_runs_classifier(clf, num_runs_svc_poly_2, X_all, y_all, num_train_samples, scale=True)

Completed 100 runs with SVC classifier in 3.573 seconds
Mean training/fitting run times (secs): 0.008088

For 100 training samples:
	Mean prediction times (secs):
		Train data: 0.000823
		Test data: 0.000782
	F1 scores:
		Train data: 0.913766
		Test data: 0.793857

For 200 training samples:
	Mean prediction times (secs):
		Train data: 0.002667
		Test data: 0.001337
	F1 scores:
		Train data: 0.902932
		Test data: 0.791767

For 300 training samples:
	Mean prediction times (secs):
		Train data: 0.005381
		Test data: 0.001811
	F1 scores:
		Train data: 0.896085
		Test data: 0.788523


#### 100 runs rbf

In [24]:
# Number of runs desired with rbf kernel
num_runs_svc_rbf = 100

clf = SVC(kernel='rbf')

n_runs_classifier(clf, num_runs_svc_rbf, X_all, y_all, num_train_samples, scale=True)

Completed 100 runs with SVC classifier in 4.578 seconds
Mean training/fitting run times (secs): 0.010747

For 100 training samples:
	Mean prediction times (secs):
		Train data: 0.001101
		Test data: 0.001049
	F1 scores:
		Train data: 0.928500
		Test data: 0.796965

For 200 training samples:
	Mean prediction times (secs):
		Train data: 0.003860
		Test data: 0.001873
	F1 scores:
		Train data: 0.915469
		Test data: 0.800702

For 300 training samples:
	Mean prediction times (secs):
		Train data: 0.008179
		Test data: 0.002671
	F1 scores:
		Train data: 0.906537
		Test data: 0.805171


rbf seems to be (very) marginally better than poly (degree 2) in terms of F1 scores on test data (0.805 vs 0.789 with 300 training samples). The degree 2 poly kernel is less resource intensive, judging by the fitting and prediction times, but the difference is unlikely to be significant in practice.

### Model 3: Boosting

AdaBoostClassifier, an ensemble classifier, is used with two settings: a base estimator (classifier) that performs relatively well on its own, and one that is (theoretically) a _weak learner_. First, sample runs with both settings are performed. They are followed by 100 runs of the _best_ setting on shuffled data for all three sample sizes, and the mean prediction times and F<sub>1</sub> scores on the training and test sets are reported.
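
As a rough illustration of the _weak learner_ setting, the sketch below boosts a depth-1 decision tree (a "stump") and compares it with the stump alone. It uses synthetic data from `make_classification` (an assumption, not the student dataset), reports plain accuracy via `score`, and assumes scikit-learn 0.18 or later for `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# A single depth-1 tree: the weak learner on its own
stump = DecisionTreeClassifier(max_depth=1)
stump_acc = stump.fit(X_tr, y_tr).score(X_te, y_te)

# The same kind of stump, boosted. The base estimator is passed positionally
# because its keyword name differs across scikit-learn versions.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
boosted_acc = boosted.fit(X_tr, y_tr).score(X_te, y_te)

print("Stump accuracy:   {:.3f}".format(stump_acc))
print("Boosted accuracy: {:.3f}".format(boosted_acc))
```

Whether boosting helps on a given dataset is an empirical question, which is why both settings are tried.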

#### A helper function to balance data
This helper function ensures that the base estimators (decision trees) are not biased by the imbalance between `yes` and `no` labels in the dataset.

In [25]:
def balance_dataset(data_features, data_labels):
    '''
    Helper function to balance the dataset so that it has an equal
    number of yes and no labels. Call it afresh for each training
    run: the rows kept from the more frequent label ('yes') are
    drawn at random, so each call returns a different subset.
    '''
    # Locate each class from the labels passed in, then shuffle the indices
    y_yes_indices = data_labels.index[np.where(data_labels == 'yes')[0]]
    y_yes_indices = np.random.permutation(y_yes_indices)
    y_no_indices = data_labels.index[np.where(data_labels == 'no')[0]]
    y_no_indices = np.random.permutation(y_no_indices)
    # Keep as many 'yes' rows as there are 'no' rows
    processed_indices = y_yes_indices[:len(y_no_indices)].tolist()
    processed_indices.extend(y_no_indices.tolist())
    X_balanced = data_features.reindex(processed_indices)
    y_balanced = data_labels.reindex(processed_indices)
    return X_balanced, y_balanced

X_balanced, y_balanced = balance_dataset(X_all, y_all)
num_total_samples_adaboost = len(y_balanced)
num_train_samples_adaboost = num_total_samples_adaboost - 60 # hold out 60 samples for testing;
                                                             # a round number rather than a ratio
adaboost_training_sizes = [100, 150, 200]
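
As a quick check of the undersampling idea behind this helper, the sketch below applies it to a small toy dataset. `undersample_majority`, the toy `absences` column and the seeded `RandomState` are assumptions for illustration; unlike the helper above, the sketch infers the smaller class size from the labels.

```python
import numpy as np
import pandas as pd

def undersample_majority(features, labels, rng):
    # Size of the least frequent label decides how many rows each label keeps
    counts = labels.value_counts()
    minority_size = counts.min()
    kept = []
    for label in counts.index:
        label_index = labels.index[(labels == label).values]
        # Shuffle, then keep at most `minority_size` rows for this label
        kept.extend(rng.permutation(label_index)[:minority_size].tolist())
    return features.loc[kept], labels.loc[kept]

# Toy data: 8 'yes' rows and 4 'no' rows
toy_features = pd.DataFrame({'absences': list(range(12))})
toy_labels = pd.Series(['yes'] * 8 + ['no'] * 4)

rng = np.random.RandomState(0)
X_bal, y_bal = undersample_majority(toy_features, toy_labels, rng)
print(y_bal.value_counts())
```

Every 'no' row survives, while only a random half of the 'yes' rows does, which is why rebalancing before each run gives the classifier a different 'yes' subset each time.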

#### 1. Sample run on a _strong_ base estimator

In [26]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_balanced, y_balanced = balance_dataset(X_all, y_all)

# First, see performance of a decision tree that can classify relatively well
# This will be the base estimator for AdaBoost
clf_base = DecisionTreeClassifier(max_depth = 15)
print "Base estimator {}".format(clf_base)
sample_run_train_predict(clf_base, X_balanced, y_balanced, num_train_samples_adaboost, 
                         adaboost_training_sizes, stratify=True)

# Then boost the base estimator 
clf = AdaBoostClassifier(clf_base)
print "\nSample run with boosted classifier {}".format(clf)
sample_run_train_predict(clf, X_balanced, y_balanced, num_train_samples_adaboost, 
                         adaboost_training_sizes, stratify=True)

Base estimator DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=15,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.003
Predicting labels on train data using DecisionTreeClassifier...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 1.0
Predicting labels on test data using DecisionTreeClassifier...
Done!
Prediction time on test data (secs): 0.001
F1 score for test set: 0.561403508772
------------------------------------------
Training set size: 150
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.003
Predicting labels on train data using DecisionTreeClassifier...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 1.0
Pre

#### 2. Sample run on a _weak_ base estimator

In [27]:
# Likewise, see performance with a "weak" decision tree
X_balanced, y_balanced = balance_dataset(X_all, y_all)

clf_base = DecisionTreeClassifier(max_depth=1)
print "Base estimator {}".format(clf_base)
sample_run_train_predict(clf_base, X_balanced, y_balanced, num_train_samples_adaboost, 
                         adaboost_training_sizes, stratify=True)

# Then boost the base estimator 
clf = AdaBoostClassifier(clf_base)
print "\nSample run with boosted classifier {}".format(clf)
sample_run_train_predict(clf, X_balanced, y_balanced, num_train_samples_adaboost, 
                         adaboost_training_sizes, stratify=True)

Base estimator DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels on train data using DecisionTreeClassifier...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.758064516129
Predicting labels on test data using DecisionTreeClassifier...
Done!
Prediction time on test data (secs): 0.000
F1 score for test set: 0.72
------------------------------------------
Training set size: 150
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels on train data using DecisionTreeClassifier...
Done!
Prediction time on train data (secs): 0.001
F1 score for training set: 0.74468

#### 100 runs on weak base estimator
Note: Although the weak base estimator looks similar to the strong one with respect to F1 scores on test data, the F1 score on the training data for the strong estimator is almost always 1.0, which strongly suggests overfitting. So the _best_ classifier should objectively be the one with the weak base estimator.

In [28]:
# Number of runs desired with AdaBoost
num_runs_adaboost = 100

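# Not using the n_runs_classifier() helper function here 
# because data balancing needs to be performed for each run.
# clf is the boosted classifier left over from the previous cell
# (AdaBoost with the weak base estimator).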

# Arrays to store each run's run times and f1 scores
run_times_fit = np.zeros([num_runs_adaboost])
run_times_predict_train = np.zeros([len(adaboost_training_sizes), num_runs_adaboost])
run_times_predict_test = np.zeros([len(adaboost_training_sizes), num_runs_adaboost])
run_f1_scores_train = np.zeros([len(adaboost_training_sizes), num_runs_adaboost])
run_f1_scores_test = np.zeros([len(adaboost_training_sizes), num_runs_adaboost])

start = time.time()
# Run fit and predict multiple times for each training set size
for idx in range(num_runs_adaboost):
    X_balanced, y_balanced = balance_dataset(X_all, y_all)
    
    X_train, y_train, X_test, y_test = shuffle_and_split_data(
        X_balanced, y_balanced, num_train_samples_adaboost, stratify=True)

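    for size_idx, size in enumerate(adaboost_training_sizes):
        # run_times_fit[idx] is overwritten for each size, so it ends up
        # holding the fit time for the last (largest) training size
        run_times_fit[idx] = train_classifier(clf, X_train[:size], y_train[:size], no_print=True)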
        run_f1_scores_train[size_idx, idx], run_times_predict_train[size_idx, idx] = predict_labels(
            clf, X_train[:size], y_train[:size], is_train_data=True, no_print=True)
        run_f1_scores_test[size_idx, idx], run_times_predict_test[size_idx, idx] = predict_labels(
            clf, X_test, y_test, no_print=True)
end = time.time()

total_time = end - start

# Print the run statistics
print "Completed {} runs with {} classifier in {:.3f} seconds".format(
    num_runs_adaboost, clf.__class__.__name__, total_time)
print "Mean training/fitting run times (secs): {:.3f}".format(np.mean(run_times_fit))

for size_idx, size in enumerate(adaboost_training_sizes):
    print "\nFor {} training samples:".format(size)
    print "\tMean prediction times (secs):"
    print "\t\tTrain data: {:.3f}".format(np.mean(run_times_predict_train[size_idx]))
    print "\t\tTest data: {:.3f}".format(np.mean(run_times_predict_test[size_idx]))
    print "\tF1 scores:"
    print "\t\tTrain data: {:.3f}".format(np.mean(run_f1_scores_train[size_idx]))
    print "\t\tTest data: {:.3f}".format(np.mean(run_f1_scores_test[size_idx]))

Completed 100 runs with AdaBoostClassifier classifier in 41.426 seconds
Mean training/fitting run times (secs): 0.121

For 100 training samples:
	Mean prediction times (secs):
		Train data: 0.009
		Test data: 0.008
	F1 scores:
		Train data: 0.950
		Test data: 0.587

For 150 training samples:
	Mean prediction times (secs):
		Train data: 0.010
		Test data: 0.008
	F1 scores:
		Train data: 0.865
		Test data: 0.606

For 200 training samples:
	Mean prediction times (secs):
		Train data: 0.010
		Test data: 0.008
	F1 scores:
		Train data: 0.820
		Test data: 0.616
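
One way to read these numbers is to look at the gap between the training and test F1 scores, a rough indicator of overfitting. The short sketch below only re-uses the mean scores printed above; the variable names are made up for illustration.

```python
# Mean F1 scores reported above (100 runs): size -> (train, test)
reported_f1 = {100: (0.950, 0.587), 150: (0.865, 0.606), 200: (0.820, 0.616)}

gaps = []
for size in sorted(reported_f1):
    train_f1, test_f1 = reported_f1[size]
    gaps.append(train_f1 - test_f1)
    print("{:>3} samples: train - test F1 gap = {:.3f}".format(size, gaps[-1]))
```

The gap narrows from roughly 0.36 at 100 samples to roughly 0.20 at 200 samples, so more training data appears to reduce overfitting here, although the test F1 itself improves only slightly.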


## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

In [29]:
# TODO: Fine-tune your model and report the best F1 score
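
A possible starting point for the fine-tuning step is sketched below. It assumes, purely for illustration, that the rbf SVC is the model being tuned, and it uses synthetic data from `make_classification` in place of `student-data.csv`. The grid values are arbitrary. The import from `sklearn.model_selection` assumes scikit-learn 0.18 or later (older releases expose `GridSearchCV` in `sklearn.grid_search`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the student data, with 'yes'/'no' string labels
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
y = np.where(y == 1, 'yes', 'no')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# With string labels, F1 needs to be told which label is the positive class
f1_yes = make_scorer(f1_score, pos_label='yes')

# Scaling lives inside the pipeline so each CV fold is scaled on its own
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {
    'svc__C': [0.1, 1, 10],            # three settings per parameter
    'svc__gamma': [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, scoring=f1_yes, cv=5)
search.fit(X_train, y_train)           # uses the entire training set

final_f1 = f1_score(y_test, search.predict(X_test), pos_label='yes')
print("Best parameters: {}".format(search.best_params_))
print("F1 score on held-out test data: {:.3f}".format(final_f1))
```

To apply this to the real data, `X_train`, `y_train`, `X_test` and `y_test` would come from the notebook's own split of `X_all` and `y_all`, and the estimator and grid would be replaced by whichever model is chosen above.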