# Project: Building a Student Intervention System

### Determining the problem: Classification vs Regression

This is a supervised classification problem. Before elaborating on why, it helps to define both problem types:
- Classification predicts a discrete label (a category).
- Regression predicts a continuous value.

Here the target, `passed`, takes only the values `yes` or `no`, so the task is classification.
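
To make the distinction concrete, here is a minimal sketch with made-up numbers. The grades and the `PASS_MARK` threshold are assumptions for illustration only; they are not taken from `student-data.csv`.

```python
# Hypothetical final grades on a 0-20 scale (illustrative values only)
final_grades = [14.5, 8.0, 11.25, 16.0, 9.75]

# Regression target: the grade itself, a continuous number
regression_targets = final_grades

# Classification target: a discrete label derived from the grade.
# PASS_MARK is an assumed threshold, not something stated in the dataset.
PASS_MARK = 10.0
classification_targets = ["yes" if g >= PASS_MARK else "no" for g in final_grades]

print(regression_targets)
print(classification_targets)
```

A model trained on the first list would output numbers; a model trained on the second can only output one of the two labels.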

## Exploring the Data

Let's go ahead and read in the student dataset first.

In [52]:
# Import libraries
import numpy as np
import pandas as pd

In [53]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print ("Student data read successfully!")
# Note: The last column 'passed' is the target/label; all other columns are features

Student data read successfully!


In [54]:
student_data

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes
5,GP,M,16,U,LE3,T,4,3,services,other,...,yes,no,5,4,2,1,2,5,10,yes
6,GP,M,16,U,LE3,T,2,2,other,other,...,yes,no,4,4,4,1,1,3,0,yes
7,GP,F,17,U,GT3,A,4,4,other,teacher,...,no,no,4,1,4,1,1,1,6,no
8,GP,M,15,U,LE3,A,3,2,services,other,...,yes,no,4,2,2,1,1,1,0,yes
9,GP,M,15,U,GT3,T,3,4,other,other,...,yes,no,5,5,1,1,1,5,0,yes


Now, we can find the following facts about the dataset:
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

In [55]:
# Computing the desired values 
n_students = len(student_data)
n_features = (student_data.shape[1]) - 1
n_passed = len(student_data[student_data['passed'] == "yes"])
n_failed = len(student_data[student_data['passed'] == "no"])
grad_rate = 100.0 * n_passed / n_students  # multiply by a float first so the division is never truncated
 
print ("Total number of students: {}".format(n_students))
print ("Number of students who passed: {}".format(n_passed))
print ("Number of students who failed: {}".format(n_failed))
print ("Number of features: {}".format(n_features))
print ("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
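
As a cross-check, the same numbers can be obtained with `Series.value_counts()`. The sketch below rebuilds a stand-in for the `passed` column from the counts printed above, so it runs without the CSV file; with the real data one would call `student_data['passed'].value_counts()` directly.

```python
import pandas as pd

# Stand-in for student_data['passed'], rebuilt from the counts printed above
passed = pd.Series(["yes"] * 265 + ["no"] * 130, name="passed")

counts = passed.value_counts()                 # absolute counts per label
shares = passed.value_counts(normalize=True)   # fraction of rows per label

print(counts)
print("Graduation rate: {:.2f}%".format(100 * shares["yes"]))
```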


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [56]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print ("Feature column(s):-\n{}".format(feature_cols))
print ("Target column: {}".format(target_col))

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print ("\nFeature values:-")
print (X_all.head())  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
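
Before applying this to the full dataset, a small sketch on a toy frame shows both conversions. The three rows are hypothetical; only the column names `internet` and `Fjob` come from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame mimicking two of the columns described above
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary yes/no column -> 1/0
internet_numeric = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column -> one indicator column per distinct value
fjob_dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob", dtype=int)

print(internet_numeric.tolist())
print(fjob_dummies)
```

Each row of `fjob_dummies` contains exactly one `1`, marking which job category that row had.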

In [57]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.items():  # items() replaces iteritems(), which was removed in pandas 2.0
        # If the column holds only yes/no values, map them to 1/0
        if col_data.dtype == object and set(col_data.unique()) <= {'yes', 'no'}:
            col_data = col_data.map({'yes': 1, 'no': 0})
        # Note: This changes the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print ("Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [58]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: train_test_split shuffles the data by default, which avoids bias due to ordering in the dataset
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=1)
print ("Training set: {} samples".format(X_train.shape[0]))
print ("Test set: {} samples".format(X_test.shape[0]))
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
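
Because the classes are imbalanced (about two thirds of the students passed), one optional variation is a stratified split, which keeps the yes/no ratio the same in both subsets. This is a sketch on hypothetical toy data, not the split used in this notebook; `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 samples, 28 labelled 'yes' and 12 labelled 'no'
X_toy = np.arange(40).reshape(-1, 1)
y_toy = np.array(["yes"] * 28 + ["no"] * 12)

# stratify=y_toy asks for the same yes/no ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=10, stratify=y_toy, random_state=1
)

print("Train 'yes' share: {:.2f}".format(np.mean(y_tr == "yes")))
print("Test  'yes' share: {:.2f}".format(np.mean(y_te == "yes")))
```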


## Training and Evaluating Models
For this section we are going to choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model we will:

- Fit this model to the training data, try to predict labels (for both training and test sets)
- Measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.
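
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall for the positive class (`yes` here). The sketch below computes it by hand from confusion-matrix counts; the helper name `f1_from_counts` and the counts are hypothetical and are not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the positive class ('yes'); not results from this project
score = f1_from_counts(tp=60, fp=20, fn=10)
print("F1: {:.3f}".format(score))
```

Unlike accuracy, this score ignores true negatives, so it is not inflated when the positive class dominates.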


In [59]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print ("Training {}...".format(clf.__class__.__name__))
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print ("Done!\nTraining time (secs): {:.3f}".format(end - start))

# Choose a model, import it and instantiate an object
from sklearn import tree
clf_decision_tree = tree.DecisionTreeClassifier()

# Fit model to training data
train_classifier(clf_decision_tree, X_train, y_train)  # note: using entire training set here
# print(clf_decision_tree)  # you can inspect the learned model by printing it

Training DecisionTreeClassifier...
Done!
Training time (secs): 0.008


In [60]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print ("Predicting labels using {}...".format(clf.__class__.__name__))
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print ("Done!\nPrediction time (secs): {:.3f}".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf_decision_tree, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.008
F1 score for training set: 1.0


In [61]:
# Predict on test data
print ("F1 score for test set: {}".format(predict_labels(clf_decision_tree, X_test, y_test)))

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.71875


In [62]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print ("------------------------------------------")
    print ("Training set size: {}".format(len(X_train)))
    train_classifier(clf, X_train, y_train)
    print ("F1 score for training set: {}".format(predict_labels(clf, X_train, y_train)))
    print ("F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)))

# TODO: Run the helper function above for desired subsets of training data
X_train_200 = X_train[:200]
y_train_200 = y_train[:200]
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]
train_predict(clf_decision_tree, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_decision_tree, X_train_100, y_train_100, X_test, y_test)


# Note: Keep the test set constant

------------------------------------------
Training set size: 200
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.015
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.75
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.6386554621848739


In [63]:
# TODO: Train and predict using two other models
from sklearn import svm 
clf_svm = svm.SVC()
train_classifier(clf_svm, X_train, y_train)
train_f1_score = predict_labels(clf_svm, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))
print ("F1 score for test set: {}".format(predict_labels(clf_svm, X_test, y_test)))
X_train_200 = X_train[:200]
y_train_200 = y_train[:200]
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]
train_predict(clf_svm, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_svm, X_train_100, y_train_100, X_test, y_test)

Training SVC...
Done!
Training time (secs): 0.016
Predicting labels using SVC...
Done!
Prediction time (secs): 0.010
F1 score for training set: 0.8583877995642701
Predicting labels using SVC...
Done!
Prediction time (secs): 0.018
F1 score for test set: 0.8461538461538461
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.006
Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for training set: 0.8580645161290322
Predicting labels using SVC...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.8407643312101911
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.002
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.8590604026845637
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.8333333333333333


In [64]:
from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier(n_neighbors=2)
train_classifier(clf_knn, X_train, y_train)
train_f1_score = predict_labels(clf_knn, X_train, y_train)
print ("F1 score for training set: {}".format(train_f1_score))
print ("F1 score for test set: {}".format(predict_labels(clf_knn, X_test, y_test)))
X_train_200 = X_train[:200]
y_train_200 = y_train[:200]
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]
train_predict(clf_knn, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_knn, X_train_100, y_train_100, X_test, y_test)

Training KNeighborsClassifier...
Done!
Training time (secs): 0.002
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.012
F1 score for training set: 0.8249258160237388
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.006
F1 score for test set: 0.6306306306306306
------------------------------------------
Training set size: 200
Training KNeighborsClassifier...
Done!
Training time (secs): 0.001
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.008
F1 score for training set: 0.8125000000000001
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.008
F1 score for test set: 0.6126126126126127
------------------------------------------
Training set size: 100
Training KNeighborsClassifier...
Done!
Training time (secs): 0.002
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.7254901960784313
Predicting labels using K

## Choosing the Best Model

- In this section we fine-tune the model using grid search (`GridSearchCV`), tuning at least one important parameter over at least 3 settings. We use the entire training set for this.
- We also determine the model's final F<sub>1</sub> score.

In [80]:
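# Fine-tune the SVC with a grid search over the kernel parameter
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf', 'poly'), 'C': [1]}
clf_grid_search = GridSearchCV(estimator=clf_svm, param_grid=parameters)
clf_grid_search.fit(X_train, y_train)

# Note: with no `scoring` argument, best_score_ is the mean cross-validated accuracy, not F1
print (clf_grid_search.best_score_)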

0.686666666667
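
To have the grid search select parameters by F<sub>1</sub> rather than by the estimator's default score, a custom scorer can be passed through the `scoring` argument. The following is a minimal sketch on synthetic data generated with `make_classification`; the data, the parameter grid, and the `cv=3` setting are assumptions for illustration and do not reproduce the result above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (not the student data)
X_demo, y_int = make_classification(n_samples=120, n_features=8, random_state=0)
y_demo = np.where(y_int == 1, "yes", "no")

# Score candidates by F1 on the 'yes' class instead of the default accuracy
f1_scorer = make_scorer(f1_score, pos_label="yes")

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid=param_grid, scoring=f1_scorer, cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1: {:.3f}".format(search.best_score_))
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the tuned model could then be scored on the held-out test set with `predict_labels`.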
