# Project: Building a Student Intervention System

In this project, our goal is to identify students who might need early intervention before they fail to graduate.

This is a supervised classification problem, because the output takes one of two discrete values: the student either needs early intervention or does not.

## Exploring the Data
Let's run the code cell below to load necessary Python libraries and load the student data. The last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")

Student data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students.

In [2]:
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1  # The 'passed' column is the target, not a feature
n_passed = len(student_data[student_data.passed == "yes"])
n_failed = len(student_data[student_data.passed == "no"])
grad_rate = (n_passed / float(n_students)) * 100

print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identifying feature and target columns
It is often the case that the data we obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's separate the student data into feature and target columns to see if any features are non-numeric.

In [3]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As we can observe, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
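
To make the two conversions concrete, here is a minimal sketch on a made-up four-row frame. The values are illustrative and not taken from `student-data.csv`. It assumes Python 3 and a recent pandas; the `map` call is just one way to do the `yes`/`no` conversion and is an assumption of this example.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes", "yes"],
    "Fjob": ["teacher", "other", "services", "other"],
})

# Binary yes/no column -> 1/0
internet_binary = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued categorical column -> one dummy column per distinct value
fjob_dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob")

print(internet_binary.tolist())
print(list(fjob_dummies.columns))
```

Each row of `fjob_dummies` has exactly one active column, which is what "assign a `1` to one of them and `0` to all others" means in practice.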

In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets.

In [5]:
# train_test_split lives in sklearn.model_selection (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split

num_train = 300
num_test = X_all.shape[0] - num_train
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, stratify = y_all, random_state=42)

print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 300 samples.
Testing set has 95 samples.
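
The `stratify` argument keeps the class proportions roughly equal in both splits. The sketch below illustrates this on synthetic labels that only mimic the 265/130 class counts reported above. It assumes Python 3 and `sklearn.model_selection` (scikit-learn 0.18 or later), and the placeholder feature matrix is invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (265 yes, 130 no)
y = np.array(["yes"] * 265 + ["no"] * 130)
# Placeholder features: one column holding the row index (illustration only)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=95, stratify=y, random_state=42)

# Share of 'yes' labels in each split
train_rate = np.mean(y_tr == "yes")
test_rate = np.mean(y_te == "yes")
print("Pass rate in train: {:.3f}, in test: {:.3f}".format(train_rate, test_rate))
```

Both rates should land close to the overall graduation rate, so neither split is dominated by one class more than the full dataset is.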


## Training and Evaluating Models
In this section, we will choose three supervised learning models that are appropriate for this problem and available in `scikit-learn`. We will first discuss the reasoning behind choosing these three models by considering what we know about the data and each model's strengths and weaknesses. We will then fit each model to training sets of varying size (100, 200, and 300 data points) and measure the F<sub>1</sub> score. We will produce three tables (one for each model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.
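
Since everything below is scored with F<sub>1</sub>, it may help to see how the score is built from precision and recall. The helper `f1_from_counts` is a hypothetical name introduced here and is not part of `scikit-learn`. As a reference point, the sketch evaluates a trivial model that predicts `'yes'` for every student, using the class counts reported earlier (265 passed, 130 failed).

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Trivial baseline: predict 'yes' for all 395 students.
# True positives = 265 (everyone who passed), false positives = 130, false negatives = 0.
baseline_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("All-'yes' baseline F1: {:.4f}".format(baseline_f1))
```

Under these assumptions the baseline comes out at about 0.80 on the full dataset, which can serve as a yardstick when reading the scores below.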

###  Model Application

- KNN
Real-world application: predicting the compressive strength of various mixtures of cement ingredients. The ingredients were relatively non-volatile with respect to the response and to each other, and KNN made reliable predictions. A parametric model summarizes the data in a fixed set of parameters and uses those parameters to estimate the output; KNN is non-parametric and estimates the output directly from the stored data. Changing only the number of neighbors can change the output considerably, so the `n_neighbors` parameter has to be tuned carefully. The pros are fast training and the ease of adding new points; the cons are slow querying and the need to store the whole training set. This model could be a good candidate because we do not have much data, so slow querying is not an issue for us, and with proper parameter tuning we can get decent results.


- Stochastic Gradient Descent (`SGDClassifier`)
Real-world application: training neural networks to recognize handwritten digits. The idea is to estimate the gradient from a small sample of randomly chosen training inputs instead of the full training set. Averaging over this small sample quickly gives a good estimate of the true gradient, which speeds up gradient descent and thus learning. One advantage is that gradient descent is a well-studied algorithm, so most of its problems have known solutions. Computing the full gradient can be very expensive or intractable when the dataset is large, and this is where stochastic gradient descent is a good option: in theory it converges to a local minimum of the loss. Our dataset is small, so training cost is not a concern, and with parameter tuning this should be a good option for us.


- Support Vector Machines
Real-world application: binary classification. An advantage is that they can learn non-linear decision boundaries. A disadvantage is that the result is hard to interpret: the model does not explain why a sample with features X was classified as y. The idea behind an SVM is to find the hyperplane that separates the two classes with the greatest margin. If the data is not linearly separable in its original space, a kernel maps it into a higher-dimensional space where a separating hyperplane may exist. The hyperplane is determined by the training points closest to it, the support vectors, hence the name Support Vector Machines. SVMs perform poorly on large datasets and on datasets with a lot of noise. Neither is the case in our problem, so I think this is a good option.
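
The gradient estimate described under stochastic gradient descent can be illustrated with a small NumPy sketch. It assumes a mean squared error loss on synthetic data, which is a choice made for this example and not the default loss of `SGDClassifier`; `mse_gradient` is a hypothetical helper.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 3))           # synthetic features
true_w = np.array([1.5, -2.0, 0.5])      # weights used to generate the targets
y = X.dot(true_w) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)                          # current parameter estimate

def mse_gradient(X_part, y_part, w):
    """Gradient of 0.5 * mean((Xw - y)^2) with respect to w."""
    residual = X_part.dot(w) - y_part
    return X_part.T.dot(residual) / len(y_part)

# Exact gradient over the full dataset
full_grad = mse_gradient(X, y, w)

# Estimate from one random mini-batch of 100 samples
batch = rng.choice(len(y), size=100, replace=False)
batch_grad = mse_gradient(X[batch], y[batch], w)

# Averaging the gradients of 10 disjoint, equal-sized batches recovers the full gradient
parts = np.array_split(rng.permutation(len(y)), 10)
mean_of_batches = np.mean([mse_gradient(X[p], y[p], w) for p in parts], axis=0)

print("full gradient:      ", full_grad)
print("one mini-batch:     ", batch_grad)
print("mean of 10 batches: ", mean_of_batches)
```

A single mini-batch costs a tenth of the full computation here and gives a noisy but usable estimate of the full gradient.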

### Setup
Let's run the code cell below to initialize three helper functions which we can use for training and testing the three supervised learning models we've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling, makes predictions, and returns the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [6]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''

    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()

    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))


def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier and returns the F1 score. '''

    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()

    # Print the prediction time and return the F1 score
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports its F1 score on the training and test sets. '''

    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))

    # Train the classifier
    train_classifier(clf, X_train, y_train)

    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))

### Implementation: Model Performance Metrics
With the predefined functions above, we will now import the three supervised learning models of our choice and run the `train_predict` function for each one. We will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, we should expect to have 9 different outputs below — 3 for each model using the varying training set sizes.

In [7]:
# Import the three supervised learning models from sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

# Initialize the three models
clf_A = KNeighborsClassifier()
clf_B = SVC(random_state=42)
clf_C = SGDClassifier(random_state=42)

# Train and evaluate each model on the first 100, 200, and 300 training samples
for clf in [clf_A, clf_B, clf_C]:
    print("\n{}: \n".format(clf.__class__.__name__))
    for n in [100, 200, 300]:
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)


KNeighborsClassifier: 

Training a KNeighborsClassifier using a training set size of 100. . .
Trained model in 0.0038 seconds
Made predictions in 0.0018 seconds.
F1 score for training set: 0.8252.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.7586.
Training a KNeighborsClassifier using a training set size of 200. . .
Trained model in 0.0006 seconds
Made predictions in 0.0025 seconds.
F1 score for training set: 0.8097.
Made predictions in 0.0014 seconds.
F1 score for test set: 0.7857.
Training a KNeighborsClassifier using a training set size of 300. . .
Trained model in 0.0009 seconds
Made predictions in 0.0048 seconds.
F1 score for training set: 0.8539.
Made predictions in 0.0017 seconds.
F1 score for test set: 0.8138.

SVC: 

Training a SVC using a training set size of 100. . .
Trained model in 0.0060 seconds
Made predictions in 0.0016 seconds.
F1 score for training set: 0.8354.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.8025.
Training a SVC using a t

### Tabular Results

**Classifier 1 - KNN**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               | 0.0006                  | 0.0014                 |   0.8252         | 0.7586          |
| 200               | 0.0006                  | 0.0027                 |   0.8097         | 0.7857          |
| 300               | 0.0009                  | 0.0062                 |   0.8539         | 0.8138          |

**Classifier 2 - Support Vector Classifier**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               | 0.0011                  | 0.0007                 | 0.8354           | 0.8025          |
| 200               | 0.0035                  | 0.0024                 | 0.8431           | 0.8105          |
| 300               | 0.0060                  | 0.0040                 | 0.8664           | 0.8052          |

**Classifier 3 - Stochastic Gradient Descent**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               | 0.0005                  | 0.0001                 | 0.8075           | 0.8025          |
| 200               | 0.0006                  | 0.0001                 | 0.8212           | 0.7703          |
| 300               | 0.0008                  | 0.0002                 | 0.7960           | 0.7500          |

## Choosing the Best Model
In this final section, we will choose from the three supervised learning models the *best* model to use on the student data. We will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 

### Choosing the Best Model

Based on the experiments above, the Support Vector Classifier gives the most consistent F1 score on the test set: it stays above 0.80 at every training set size (0.8025, 0.8105, 0.8052), which is decent. KNN reaches 0.8138 with 300 training points but is lower with 100 and 200, and SGD falls from 0.8025 to 0.7500 as the training set grows. With parameter tuning we may achieve an even higher score. The Support Vector Classifier also has the highest training time of the three, but since we do not have much data, that matters less to us than the F1 score on the test set. The best model for this problem is SVC.

### Model in Layman's Terms

SVC is an algorithm with a strong mathematical foundation. It relies heavily on vectors and is based on linear separation: it looks for the boundary with the biggest margin (the biggest empty space) between the two groups. If the data is not linearly separable in its original dimension, SVC can map it into a higher dimension and look there for a hyperplane that separates the data linearly. It is well suited to binary classification, and our problem is binary classification because we need to decide whether a student will pass the year or not.
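
The idea of moving to a higher dimension can be seen on the classic XOR layout, where no straight line separates the two classes. The following sketch assumes scikit-learn's `SVC`; the four points and the `C` and `gamma` values are chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# XOR layout: opposite corners share a label, so no single line separates the classes
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y_xor = np.array(["no", "no", "yes", "yes"])

linear_svc = SVC(kernel="linear", C=10).fit(X_xor, y_xor)
rbf_svc = SVC(kernel="rbf", C=10, gamma=1.0).fit(X_xor, y_xor)

# Training accuracy of each model on the four points
linear_acc = linear_svc.score(X_xor, y_xor)
rbf_acc = rbf_svc.score(X_xor, y_xor)
print("linear kernel accuracy: {:.2f}".format(linear_acc))
print("rbf kernel accuracy:    {:.2f}".format(rbf_acc))
```

The linear kernel cannot classify all four points correctly, while the RBF kernel can, because it implicitly compares points in a higher-dimensional space.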

### Implementation: Model Tuning
Let's fine tune our chosen model now. We will use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. We will need to use the entire training set for this.

In [8]:
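# GridSearchCV lives in sklearn.model_selection (scikit-learn 0.18 and later)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Parameter grid to search. Note: 'degree' only affects the 'poly' kernel and is
# ignored by 'rbf', so 'tol' is the only parameter that actually varies here.
parameters = {'degree': [1, 2, 3], 'kernel': ['rbf'], 'tol': [0.3, 0.5, 0.8]}

# Initialize the classifier
clf = SVC(random_state=42)

# Make an F1 scoring function with 'yes' as the positive label
f1_scorer = make_scorer(f1_score, pos_label='yes')

# Perform grid search on the classifier using f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)

# Fit the grid search object to the full training data
grid_obj = grid_obj.fit(X_train, y_train)

# Get the best estimator found by the search
clf = grid_obj.best_estimator_

print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))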

Made predictions in 0.0050 seconds.
Tuned model has a training F1 score of 0.8739.
Made predictions in 0.0018 seconds.
Tuned model has a testing F1 score of 0.8212.
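
To see what the grid search iterates over, the grid can be enumerated with `ParameterGrid`. The sketch below also checks, on synthetic data from `make_classification`, scikit-learn's documented behaviour that `degree` applies only to the polynomial kernel. It assumes Python 3 and `sklearn.model_selection`; the synthetic data is a stand-in for the student data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

parameters = {'degree': [1, 2, 3], 'kernel': ['rbf'], 'tol': [0.3, 0.5, 0.8]}

# Every combination that a grid search over `parameters` would evaluate
combos = list(ParameterGrid(parameters))
print("Number of parameter combinations: {}".format(len(combos)))

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# With an RBF kernel, changing `degree` leaves the predictions unchanged
pred_deg1 = SVC(kernel='rbf', degree=1, random_state=42).fit(X_demo, y_demo).predict(X_demo)
pred_deg3 = SVC(kernel='rbf', degree=3, random_state=42).fit(X_demo, y_demo).predict(X_demo)
same_predictions = np.array_equal(pred_deg1, pred_deg3)
print("Same predictions for degree=1 and degree=3: {}".format(same_predictions))
```

In a grid like this one, only `tol` distinguishes the candidate models; adding values for `C` or `gamma` would be one way to widen the search.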


### Final F<sub>1</sub> Score

The training F1 score increased from 86.64% to 87.39%, and the testing F1 score increased from 80.52% to 82.12%. With slight tuning, we managed to get better results with SVC.