# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

The goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

### A1:
This is a classification problem. Each student in the dataset is assigned, based on their features, to one of two discrete categories: "needs early intervention" or "does not need early intervention". A regression problem would instead predict a continuous value, such as a final grade.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.



In [1]:
# Import libraries
import time
import numpy as np
import pandas as pd
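from sklearn.metrics import f1_score, make_scorer
# Note: sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; sklearn.model_selection provides the same functions
from sklearn import model_selection as cross_validation
from sklearn.model_selection import GridSearchCV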


In [2]:
# Read student data
student_data = pd.read_csv("F:\\Nanodegree\\student_intervention\\student-data.csv")
print ("Student data read successfully!")
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Some summary statistics about the dataset:
- Total number of students: 395
- Number of students who passed: 265
- Number of students who failed: 130
- Graduation rate of the class: 67%
- Number of columns: 31 (30 features plus the target column `passed`)



In [3]:
# Compute desired values
n_students = len(student_data)
n_features = len(student_data.columns)  # note: this count includes the target column 'passed'
n_passed = len(student_data[student_data['passed'] == 'yes'])
n_failed = len(student_data[student_data['passed'] == 'no'])
grad_rate = 100.0 * n_passed / n_students

print ("Total number of students: {}".format(n_students))
print ("Number of students who passed: {}".format(n_passed))
print ("Number of students who failed: {}".format(n_failed))
print ("Number of features: {}".format(n_features))
print ("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 31
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns

Separate the data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print ("Feature column(s):-\n{}".format(feature_cols))
print ("Target column: {}".format(target_col))

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print ("\nFeature values:-")
X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-


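|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 |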


### Preprocess feature columns

As we can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
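
As a small illustration of both conversions, the sketch below applies them to a made-up three-row frame. The toy values and the variable names (`toy`, `internet_binary`, `fjob_dummies`) are assumptions for this example only and are not taken from the student dataset.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the two conversions
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary yes/no column -> 1/0
internet_binary = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued categorical column -> one indicator column per value
# (astype(int) gives 1/0 regardless of the pandas version's default dtype)
fjob_dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob").astype(int)

print(internet_binary.tolist())
print(fjob_dummies)
```

Each row of `fjob_dummies` contains exactly one `1`, in the column that matches the original value.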

In [5]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.items():  # .items() replaces .iteritems(), which was removed in pandas 2.0
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  
            # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print ("Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [6]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
# test_size=num_test requests exactly 95 test samples, leaving 300 for training
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X_all, y_all, test_size=num_test, random_state=0)

print ("Training set: {} samples".format(X_train.shape[0]))
print ("Test set: {} samples".format(X_test.shape[0]))
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.
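
Since every comparison in this section rests on the F<sub>1</sub> score, it helps to see how it is computed. The sketch below is a plain-Python illustration; the function name `f1_from_labels` and the sample labels are made up for this example. For a binary problem it should agree with `sklearn.metrics.f1_score` called with `pos_label='yes'`.

```python
def f1_from_labels(y_true, y_pred, pos_label="yes"):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: report 0 instead of dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = ["yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no"]
score = f1_from_labels(y_true, y_pred)

# Baseline on the class counts reported earlier: label all 395 students "yes"
baseline = f1_from_labels(["yes"] * 265 + ["no"] * 130, ["yes"] * 395)

print(round(score, 4), round(baseline, 4))
```

The baseline works out to 530/660, roughly 0.80 on the full dataset, which is a useful reference point when reading the model scores.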



## Support Vector Machine (SVM) Classification, up to O(n<sup>3</sup>) training time
### General applications:
Given data points that belong to two classes, each point can be represented as an n-dimensional vector, and there may be many (n-1)-dimensional hyperplanes that separate the two classes. An SVM finds the separating hyperplane with the maximum margin, that is, the largest distance to the nearest points of either class. SVMs are used in text categorization, image classification, protein classification, and handwritten character recognition.
### Strengths:
Several specialized algorithms exist for quickly solving the quadratic programming problem that arises from the model, and kernel SVMs are available in many toolkits. In image retrieval, SVMs have been reported to achieve significantly higher search accuracy than traditional query refinement schemes.
### Weaknesses:
The model is not easy to interpret. The basic SVM separates only two classes, and its output is uncalibrated (it does not directly give the degree of certainty of a given prediction). All training data must be labeled (it is a supervised method).
### Why did you choose this model to apply?
The dataset is a classic case for an SVM: each student can be described by a vector of numeric values, and the task is binary classification. I expected training to be fast on a dataset of this size. The answer I need is binary, and I do not need a certainty estimate for each student.
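
To make the maximum-margin idea concrete, the sketch below compares two hand-picked separating lines for a tiny 2-D dataset and reports the margin of each. The points, the two candidate lines, and the helper name `margin` are invented for illustration. An actual SVM solves an optimization problem to find the best hyperplane rather than comparing a fixed list of candidates.

```python
import numpy as np

# Toy 2-D points: class +1 on the right, class -1 on the left (illustrative values)
X = np.array([[2.0, 1.0], [3.0, 2.0], [-2.0, 0.0], [-3.0, 1.0]])
y = np.array([1, 1, -1, -1])

def margin(w, b, X, y):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on its correct side.
    """
    w = np.asarray(w, dtype=float)
    return float(np.min(y * (X @ w + b) / np.linalg.norm(w)))

# Two candidate separating lines (neither is claimed to be the optimum)
m_vertical = margin([1.0, 0.0], 0.0, X, y)  # the line x1 = 0
m_tilted = margin([1.0, 1.0], 0.0, X, y)    # the line x1 + x2 = 0

print(m_vertical, m_tilted)
```

Both lines separate the classes, but the vertical one keeps the nearest point farther away, so a maximum-margin classifier would prefer it over the tilted one.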

In [7]:
# Train a model

def train_classifier(clf, X_train, y_train):
    print ("Training {}...".format(clf.__class__.__name__))
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print ("Done!\nTraining time (secs): {:.3f}".format(end - start))

# TODO: Choose a model, import it and instantiate an object
from sklearn import svm
clf = svm.SVC(C=2)

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
#print clf  # you can inspect the learned model by printing it

Training SVC...
Done!
Training time (secs): 0.011


In [8]:
# Predict on training set and compute F1 score

def predict_labels(clf, features, target):
    print ("Predicting labels using {}...".format(clf.__class__.__name__))
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print ("Done!\nPrediction time (secs): {:.3f}".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))

Predicting labels using SVC...
Done!
Prediction time (secs): 0.011
F1 score for training set: 0.9135254988913526


In [9]:
# Predict on test data
print ("F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)))


Predicting labels using SVC...
Done!
Prediction time (secs): 0.006
F1 score for test set: 0.7464788732394366


In [10]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print ("------------------------------------------")
    print ("Training set size: {}".format(len(X_train)))
    train_classifier(clf, X_train, y_train)
    print ("F1 score for training set: {}".format(predict_labels(clf, X_train, y_train)))
    print ("F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)))

train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant

------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.003
Predicting labels using SVC...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.9343065693430657
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.7801418439716312
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.008
Predicting labels using SVC...
Done!
Prediction time (secs): 0.005
F1 score for training set: 0.910958904109589
Predicting labels using SVC...
Done!
Prediction time (secs): 0.003
F1 score for test set: 0.7746478873239436
------------------------------------------
Training set size: 300
Training SVC...
Done!
Training time (secs): 0.016
Predicting labels using SVC...
Done!
Prediction time (secs): 0.012
F1 score for training set: 0.9135254988913526
Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for test set: 0.7464788732394366

## AdaBoost (Adaptive Boosting) Classification
### General applications:
AdaBoost is a meta-algorithm that can be used in conjunction with other learning algorithms to improve their performance. The outputs of those algorithms ("weak learners") are combined as a weighted sum to form the boosted classifier. AdaBoost is a standard algorithm for face and object detection.
### Strengths:
It is often less susceptible to overfitting than other learning algorithms, and AdaBoost with decision trees as weak learners has been called the best out-of-the-box classifier.
### Weaknesses:
It is sensitive to noisy data and outliers. Its performance depends on the data and on the weak learner, and it can fail if the weak classifiers are too complex (overfit) or too weak (underfit).
### Why did you choose this model to apply?
As part of the exploration I wanted to test a predictive model that is less susceptible to overfitting. Since AdaBoost is considered a strong out-of-the-box classifier, I wanted to check whether a boosting model would do better.
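
The weighted-sum idea can be sketched from scratch. The code below is a minimal illustration of discrete AdaBoost with one-feature threshold rules ("decision stumps") as weak learners. It is not scikit-learn's implementation, and the function names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are assumptions made for this example. Labels are encoded as -1/+1 here, whereas the notebook uses `'no'`/`'yes'`.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict -sign where feature j <= thr, and +sign elsewhere."""
    return np.where(X[:, j] <= thr, -sign, sign)

def fit_stump(X, y, w):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = np.sum(w[stump_predict(X, j, thr, sign) != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost_fit(X, y, n_rounds):
    """Train n_rounds stumps; y must hold -1/+1 labels."""
    w = np.full(len(y), 1.0 / len(y))          # start with equal sample weights
    model = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight: larger for lower error
        pred = stump_predict(X, j, thr, sign)
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified samples
        w = w / w.sum()
        model.append((alpha, j, thr, sign))
    return model

def adaboost_predict(model, X):
    """Sign of the alpha-weighted sum of all stump votes."""
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in model)
    return np.sign(score)

# Toy 1-D data that no single threshold can separate
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])

acc_1 = np.mean(adaboost_predict(adaboost_fit(X, y, 1), X) == y)
acc_3 = np.mean(adaboost_predict(adaboost_fit(X, y, 3), X) == y)
print(acc_1, acc_3)
```

On this toy data a single stump misclassifies two of the six points, while three boosted stumps classify all six correctly, which shows how combining weak rules yields a stronger one.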

In [11]:
# Train a model

# TODO: Choose a model, import it and instantiate an object
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=2)

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
#print clf  # you can inspect the learned model by printing it

Training AdaBoostClassifier...
Done!
Training time (secs): 0.005


In [12]:
# Predict on training set and compute F1 score

train_f1_score = predict_labels(clf, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))

Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.003
F1 score for training set: 0.8278867102396513


In [13]:
# Predict on test data
print ("F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)))

Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.7887323943661971


In [14]:
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

------------------------------------------
Training set size: 100
Training AdaBoostClassifier...
Done!
Training time (secs): 0.007
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.8169014084507042
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.7887323943661971
------------------------------------------
Training set size: 200
Training AdaBoostClassifier...
Done!
Training time (secs): 0.005
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.8243243243243242
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.7971014492753624
------------------------------------------
Training set size: 300
Training AdaBoostClassifier...
Done!
Training time (secs): 0.005
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training

## Extremely Randomized Trees
### General applications:
An ensemble learning method for classification that constructs a multitude of decision trees at training time and outputs the class chosen by the majority of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. The algorithm combines the "bagging" idea with random selection of features to build a collection of decision trees with controlled variance: at each candidate split, the tree learner considers only a random subset of the features. Extremely randomized trees (ExtraTrees) are trained like an ordinary random forest, but the split thresholds in the tree learner are additionally randomized. Random forests can also be used to rank the importance of variables in a regression or classification problem in a natural way.
### Strengths:
Accuracy tends to improve almost monotonically as the classifier becomes more complex (a larger forest).
### Weaknesses:
For data that include categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. If the data contain groups of correlated features of similar relevance for the output, smaller groups are favored over larger groups.
### Why did you choose this model to apply?
We are dealing with students, and human behavior is not easy to predict. A model that includes randomization when building its classifiers might capture something that more rigidly formulated models miss.
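
The difference between the two split rules can be shown on a single feature. The sketch below contrasts a random-forest-style search for the best threshold with an extra-trees-style random threshold, using Gini impurity to score each split. The helper names, the toy feature values, and the seed are assumptions for illustration. In a real extra-trees learner, one random threshold is drawn for each candidate feature and the best of those random splits is kept; this sketch covers one feature only.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(x, y, threshold):
    """Size-weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_threshold(x, y):
    """Random-forest style: scan midpoints between sorted values, keep the best."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    return min(candidates, key=lambda t: split_impurity(x, y, t))

def random_threshold(x, rng):
    """Extra-trees style: draw the threshold uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

# Made-up feature values and labels
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, 1, 0, 0, 0])

rng = np.random.default_rng(0)
t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)

print(t_best, split_impurity(x, y, t_best))
print(t_rand, split_impurity(x, y, t_rand))
```

The random threshold can never beat the exhaustively searched one on the training data, but it is cheaper to compute and adds variety between trees, which is the trade-off extra-trees makes.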

In [15]:
# Train a model

# TODO: Choose a model, import it and instantiate an object
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=40)

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
#print clf  # you can inspect the learned model by printing it

Training ExtraTreesClassifier...
Done!
Training time (secs): 0.093


In [16]:
# Predict on training set and compute F1 score

train_f1_score = predict_labels(clf, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))

Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.019
F1 score for training set: 1.0


In [17]:
# Predict on test data
print ("F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)))

Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.013
F1 score for test set: 0.7913669064748202


In [18]:
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

------------------------------------------
Training set size: 100
Training ExtraTreesClassifier...
Done!
Training time (secs): 0.086
Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.007
F1 score for training set: 1.0
Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.007
F1 score for test set: 0.7659574468085107
------------------------------------------
Training set size: 200
Training ExtraTreesClassifier...
Done!
Training time (secs): 0.068
Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.008
F1 score for training set: 1.0
Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.008
F1 score for test set: 0.75
------------------------------------------
Training set size: 300
Training ExtraTreesClassifier...
Done!
Training time (secs): 0.077
Predicting labels using ExtraTreesClassifier...
Done!
Prediction time (secs): 0.009
F1 score for training set: 1.0
Predicting labels 

## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?

With the full 300-sample training set, the logged training times were about 0.005 s for AdaBoost (with `n_estimators=2`), 0.011 to 0.016 s for the SVM, and 0.08 to 0.09 s for Extremely Randomized Trees (with `n_estimators=40`). All three are negligible at this data size. The differences matter only as the dataset grows, and SVM training cost grows faster than linearly with the number of samples.

Prediction on the 95-sample test set is faster than training in every case: roughly 0.001 to 0.002 s for AdaBoost, 0.001 to 0.006 s for the SVM, and 0.007 to 0.013 s for Extremely Randomized Trees.

The F1 score of the SVM on the test set decreases as the training set grows (0.780, 0.775, and 0.746 for 100, 200, and 300 samples), while its training F1 stays above 0.91, which suggests overfitting. AdaBoost stays roughly stable on the test set (0.789, 0.797, and 0.789), with a much smaller gap between training and test scores. Extremely Randomized Trees score a perfect 1.0 on the training set at every size, a sign of overfitting, with test F1 between 0.75 and 0.79.

Considering the above and the constraints of the problem, the most appropriate algorithm is AdaBoost. The main reason is the nature of the dataset, which is expected to grow in the future. The SVM's test performance dropped as the training set grew, whereas AdaBoost gave stable test performance with the smallest gap between training and test scores, and its computation time was the lowest of the three in these runs.
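
The timings and scores compared above are easier to read side by side in a table. The sketch below shows one way to collect them into rows and print a Markdown table. The names `benchmark`, `to_markdown`, `MajorityClassifier`, and `accuracy`, along with the toy data, are hypothetical and exist only for this example. With the notebook's data, one would pass the scikit-learn classifiers and a scoring function that wraps `f1_score` with `pos_label='yes'`.

```python
import time

class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)

def accuracy(y_true, y_pred):
    """Fraction of matching labels (stand-in for an F1 scoring function)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def benchmark(clf, X_train, y_train, X_test, y_test, score_fn):
    """Fit once, then return one table row with timings and scores."""
    start = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - start

    start = time.time()
    test_pred = clf.predict(X_test)
    predict_time = time.time() - start

    return {
        "model": clf.__class__.__name__,
        "train_size": len(X_train),
        "train_time_s": train_time,
        "predict_time_s": predict_time,
        "train_score": score_fn(y_train, clf.predict(X_train)),
        "test_score": score_fn(y_test, test_pred),
    }

def to_markdown(rows):
    """Render a list of row dicts as a Markdown table."""
    header = list(rows[0])
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join(["---"] * len(header)) + "|"]
    for row in rows:
        cells = ["{:.3f}".format(v) if isinstance(v, float) else str(v)
                 for v in row.values()]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Toy data, made up for the example
X_train = [[0], [1], [2], [3]]
y_train = ["yes", "yes", "yes", "no"]
X_test = [[4], [5]]
y_test = ["yes", "no"]

# One row per training-set size, with the test set held constant
rows = [benchmark(MajorityClassifier(), X_train[:n], y_train[:n],
                  X_test, y_test, accuracy) for n in (2, 4)]
print(to_markdown(rows))
```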


- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).

AdaBoost builds its prediction from many simple rules ("weak classifiers"), each of which is only slightly better than guessing. It trains the first rule on the student data, then gives more weight to the students that rule got wrong and trains the next rule to focus on them. This repeats for a fixed number of rounds. To predict whether a new student will pass, every rule casts a vote, the more accurate rules get a larger say, and the weighted majority decides.

- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

The final F<sub>1</sub> score on the test set is 0.79.

In [19]:
# TODO: Fine-tune your model and report the best F1 score

# Grid search over the number of boosting rounds
base_clf = AdaBoostClassifier()
parameters = {'n_estimators': (2, 4, 8, 20, 40, 80, 150)}

def performance_metric(label, prediction):
    # F1 score with 'yes' (passed) as the positive class
    return f1_score(label, prediction, pos_label='yes')

scorer = make_scorer(performance_metric, greater_is_better=True)

# Search on the training set only, so the test set stays unseen during tuning
grid = GridSearchCV(base_clf, parameters, scoring=scorer, cv=5)
grid.fit(X_train, y_train)
clf = grid.best_estimator_

print (clf)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=2, random_state=None)


In [20]:

train_predict(clf, X_train, y_train, X_test, y_test)


------------------------------------------
Training set size: 300
Training AdaBoostClassifier...
Done!
Training time (secs): 0.007
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.8278867102396513
Predicting labels using AdaBoostClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.7887323943661971


