<h1>Predicting Students' Grades</h1>

In my data mining class we had two hours to make predictions on this dataset. Given that short development time, the results are pretty good. Note that I skipped feature engineering completely to get to modeling quickly, so the results can certainly be improved.

Dataset: https://archive.ics.uci.edu/ml/datasets/student+performance

Good random forest tutorial: https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))  # Use all space available in the browser
pd.options.mode.chained_assignment = None                             # silence pandas' chained-assignment (SettingWithCopy) warning

In [4]:
data = pd.read_csv("student-por.csv", sep=";")
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


<h3>Transform text-based features</h3>

This can be done in two ways:
1. Encode the categories as integers in a single column. E.g. for Job: at_home -> 0, health -> 1 ...
[This can be done with the pandas factorize method or with scikit-learn's LabelEncoder]
This is not always a good idea! Coding ['teacher','service','other'] as [1,2,3], for example, implies an order that does not exist: 'teacher' becomes lower than 'service', and the average of a 'teacher' and an 'other' is a 'service'. In other cases, however, it makes total sense, for example when you code ['low', 'medium', 'high'] as [1,2,3].
2. Make a new column for each value with one-hot encoding:
[This can be done with pandas get_dummies or scikit-learn's OneHotEncoder]
E.g. ['teacher','service','other'] each get their own column, with ones where the value occurs and zeroes elsewhere.

However: with decision trees in any form this matters much less, because a tree splits on thresholds instead of relying on the numeric distance between the codes. So, choosing factorize is fine here.
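
The two encodings described above can be sketched on a toy column. The `jobs` series below is made up for illustration and is not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy column, only for illustration
jobs = pd.Series(["teacher", "services", "other", "teacher"], name="Mjob")

# Option 1: integer codes in a single column (codes follow order of first appearance)
codes, uniques = pd.factorize(jobs)
print(list(codes))    # [0, 1, 2, 0]
print(list(uniques))  # ['teacher', 'services', 'other']

# Option 2: one indicator column per category (one-hot encoding)
one_hot = pd.get_dummies(jobs, prefix="Mjob")
print(one_hot.shape)  # (4, 3)
```

With factorize the column count stays the same, while get_dummies adds one column per distinct value, which can matter for a setting like `max_features`.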

In [7]:
# text-based (categorical) columns that get converted to integer codes
atts = ["school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic"]
for col in atts:
    # factorize returns (codes, uniques); keep only the integer codes
    data[col] = pd.factorize(data[col])[0]
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,0,0,18,0,0,0,4,4,0,0,...,4,3,4,1,1,3,4,0,11,11
1,0,0,17,0,0,1,1,1,0,1,...,5,3,3,1,1,3,2,9,11,11
2,0,0,15,0,1,1,1,1,0,1,...,4,3,2,2,3,3,6,12,13,12
3,0,0,15,0,0,1,4,2,1,2,...,3,2,2,1,1,5,0,14,14,14
4,0,0,16,0,0,1,3,3,2,1,...,4,3,2,1,2,5,0,11,13,13


In [57]:
# Split data into training and test set

training_data, test_data = train_test_split(data, test_size=0.2, shuffle=True)
print("training shape:", training_data.shape, "test shape:", test_data.shape)

training shape: (519, 33) test shape: (130, 33)
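
Because the split above is shuffled without a seed, every run trains and tests on different rows, which is one reason the accuracies later in this notebook move around. A minimal sketch of a reproducible split, assuming a fixed `random_state` (the value 42 and the toy frame are arbitrary choices for illustration, not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the student data (values are made up for illustration)
toy = pd.DataFrame({
    "absences": [4, 2, 6, 0, 0, 6, 0, 2, 0, 0],
    "G3": [11, 11, 12, 14, 13, 13, 13, 13, 17, 13],
})

# With a fixed random_state the same rows end up in the test set on every run
train_a, test_a = train_test_split(toy, test_size=0.2, shuffle=True, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, shuffle=True, random_state=42)

print(test_a.index.equals(test_b.index))  # True
```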


<h2> Create a random forest classifier </h2>

In [58]:
# By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_estimators=130, max_features=15, max_depth=7, n_jobs=3, random_state=0)

# use all columns except the last one (G3), which is the one we want to predict
features = training_data.columns[:-1]

# Train the classifier to learn how the training features relate to the target (the final grade G3)
clf.fit(training_data[features], training_data['G3'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features=15, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=130, n_jobs=3,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [59]:
# predict grades for training and test dataset
pred_results_test  = clf.predict(test_data[features])
pred_results_train = clf.predict(training_data[features])

<h2> Results </h2>

In [61]:

correct_percent_test = 1 - (np.count_nonzero(pred_results_test - test_data['G3'])/pred_results_test.shape[0])
correct_percent_train = 1 - np.count_nonzero(pred_results_train - training_data['G3'])/training_data['G3'].shape[0]
l1_loss = np.sum(np.abs(pred_results_test - test_data['G3'].values)) # L1 Loss
l2_loss = np.sum(np.square(pred_results_test - test_data['G3'].values)) # L2 Loss

print("Accuracy test:     ", correct_percent_test) # ~0.523
print("Accuracy training: ", correct_percent_train)

print("\nL1 Loss test:      ", l1_loss)
print("L2 Loss test:      ",   l2_loss)

Accuracy test:      0.5461538461538462
Accuracy training:  0.9364161849710982

L1 Loss test:       83
L2 Loss test:       247
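
The same kind of numbers can also be computed with scikit-learn's metric helpers. A minimal sketch on made-up grades (the arrays below are hypothetical, not the notebook's predictions); note that the L1 loss above is a sum over the test set, while `mean_absolute_error` divides by the number of samples:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical true and predicted grades, only for illustration
y_true = np.array([11, 11, 12, 14, 13])
y_pred = np.array([11, 10, 12, 14, 15])

accuracy = accuracy_score(y_true, y_pred)    # share of exactly matching grades: 3 of 5
mae = mean_absolute_error(y_true, y_pred)    # average absolute error: (0+1+0+0+2) / 5
l1_sum = np.sum(np.abs(y_pred - y_true))     # summed absolute error, as in the cell above

print(accuracy, mae, l1_sum)
```

A per-sample average such as the MAE is easier to compare across test sets of different sizes than a summed loss.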


In our data mining course we did not find the ideal parameters for the random forest classifier; our accuracy was around 0.49. Only one group did better, reaching an accuracy of 0.52 on the test data with the AdaBoost algorithm, which I try below. However, after lowering max_features and max_depth I got a fairly consistent accuracy of around 0.52 with the random forest, which matches the other group's result.
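
Instead of tuning max_features and max_depth by hand, a grid search with cross-validation could try the combinations systematically. This is only a sketch of that idea, not what we did in class: it runs on synthetic data from `make_classification`, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the student dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Candidate values are illustrative assumptions
param_grid = {"max_depth": [3, 5, 7], "max_features": [5, 10, 15]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

On the real data this would be fitted on the training split only, so that the test set stays untouched until the final evaluation.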

<h2> My attempt at the AdaBoost algorithm </h2>

After the course I was curious and tried out the AdaBoost algorithm on my own, but I was somewhat disappointed by the results.

In [64]:
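# Boosted decision trees. Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use that name on newer versions.
bdt = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=7), n_estimators=80, algorithm="SAMME", learning_rate=0.9)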
bdt.fit(training_data[features], training_data['G3'])

correct_percent_test = bdt.score(X=test_data[features],y=test_data['G3'])
correct_percent_training = bdt.score(X=training_data[features],y=training_data['G3'])

print("Accuracy test:     ",correct_percent_test) # ~ 0.46 to 0.515
print("Accuracy training: ",correct_percent_training) 

Accuracy test:      0.530769230769
Accuracy training:  1.0


The results vary a lot between runs (from 0.42 up to 0.53 accuracy). So in my case AdaBoost does not give better results; they are the same as or worse than those of the random forest. I don't know whether the other group removed some features beforehand or did some other optimization to improve their results.
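
One way to put a number on this run-to-run variation is k-fold cross-validation, which reports one accuracy per fold instead of a single score from one random split. A minimal sketch on synthetic data (the dataset and the parameter values are assumptions for illustration; the base tree is passed positionally so the call does not depend on the `base_estimator`/`estimator` naming):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the student dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Boosted shallow trees; depth and number of estimators are illustrative
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=30, random_state=0)

# One accuracy value per fold; the spread shows how stable the model is
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```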

<h2>In conclusion</h2> 

An important takeaway: decision trees tend to overfit the data, so limiting max_features and max_depth is a good idea. AdaBoost seems to overfit much more easily here (its training accuracy reaches 1.0), so its results on the test set look fairly random and are worse than those of the random forest.
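
The overfitting effect can be made visible by comparing training and test accuracy for different depth limits. The sketch below uses a single decision tree on synthetic data, so the numbers it prints say nothing about the student dataset; it only illustrates how the gap between the two accuracies can be inspected:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the student dataset)
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (3, 7, None):  # None means the tree may grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, results[depth])
```

A large gap between the first and second value of a pair is the sign of overfitting described above.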