# Supervised Learning - Artificial Intelligence

## Students' Dropout and Success

### Notebook by Henrique Pinho, João Lopes and Luís Marques

## Introduction

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. 

In this notebook, we will use supervised learning to predict whether a student graduates or drops out.

## Required libraries

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.

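To make sure you have all of the packages you need, install them with `conda`. The oversampling step also relies on `imbalanced-learn` (imported as `imblearn`), which is installed from conda-forge:

    conda install numpy pandas scikit-learn matplotlib seaborn

    conda install -c conda-forge watermark imbalanced-learn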

`conda` may ask you to update some of them if you don't have the most recent version. Allow it to do so.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpt
import sklearn as sk
import seaborn as sb
import time
from sklearn import tree
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

## Checking the data


The next step is to look at the data we're working with. Even curated data sets from the government can have errors in them, and it's vital that we spot these errors before investing too much time in our analysis.

Generally, we're looking to answer the following questions:

* Is there anything wrong with the data?
* Are there any quirks with the data?
* Do I need to fix or remove any of the data?

Let's start by reading the data into a pandas DataFrame.

In [67]:
student_data = pd.read_csv('data.csv', delimiter=';')
student_data.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [68]:
student_data.isnull().any().sum()

0

No missing values were found.
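Missing values are only one of the things that can be wrong with a dataset. As a minimal sketch on a made-up toy frame (the `toy` data and its `grade` column are illustrative assumptions, not part of the student dataset), a few other quick checks look like this:

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "Course": [171, 9254, 9254, 9070],
    "grade": [122.0, 160.0, 160.0, None],
    "Target": ["Dropout", "Graduate", "Graduate", "Enrolled"],
})

missing_per_column = toy.isnull().sum()      # number of NaN values per column
n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
class_counts = toy["Target"].value_counts()  # number of rows per class

print(missing_per_column)
print("Duplicated rows:", n_duplicates)
print(class_counts)
```

The same three calls can be run on `student_data` to see how many rows each class has before deciding how to balance it.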

In [None]:
student_data.describe()

Next, we split the data by its three target classes.

In [None]:
enrolled = student_data[student_data.Target == "Enrolled"].drop(columns=['Target'])
graduated = student_data[student_data.Target == "Graduate"].drop(columns=['Target'])
dropout = student_data[student_data.Target == "Dropout"].drop(columns=['Target'])

In [None]:
plt.title('Number of students per Target')
plt.bar(['Graduated', 'Enrolled', 'Dropout'], [len(graduated), len(enrolled), len(dropout)])

In this project we will focus only on the success or dropout of the students. All rows with target Enrolled will be left out.

In [None]:
# Keep only the Graduate and Dropout rows
student_data_clean = student_data[student_data['Target'] != 'Enrolled'].copy()
student_data_clean.describe()

With the Enrolled rows removed, we describe the Graduate and Dropout subsets separately.

In [None]:
graduate_data = student_data_clean[student_data_clean.Target == 'Graduate']
graduate_data.describe()

In [None]:
dropout_data = student_data_clean[student_data_clean.Target == 'Dropout']
dropout_data.describe()

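Since there are more graduates than dropouts, we balance the dataset by sampling the graduates: the graduate rows are split into four consecutive blocks and 350 rows are drawn at random from each block.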

In [None]:
# Split the graduates into four consecutive blocks of rows
graduate_data_1 = graduate_data.iloc[:500,:]
graduate_data_2 = graduate_data.iloc[500:1000,:]
graduate_data_3 = graduate_data.iloc[1000:1500,:]
graduate_data_4 = graduate_data.iloc[1500:,:]

# Draw 350 rows from each block; a fixed random_state makes the sample reproducible
graduate_data_1_sample = graduate_data_1.sample(n=350, axis=0, random_state=1)
graduate_data_2_sample = graduate_data_2.sample(n=350, axis=0, random_state=1)
graduate_data_3_sample = graduate_data_3.sample(n=350, axis=0, random_state=1)
graduate_data_4_sample = graduate_data_4.sample(n=350, axis=0, random_state=1)

graduate_sample = pd.concat([graduate_data_1_sample, graduate_data_2_sample, graduate_data_3_sample, graduate_data_4_sample])
graduate_sample.describe()
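For comparison, here is a sketch of a different way to balance the classes: undersampling every class down to the size of the smallest one with `groupby(...).sample(...)`. This is an alternative shown on made-up data, not the block-wise procedure used in this notebook, and the `toy` frame is a hypothetical stand-in.

```python
import pandas as pd

# Toy data: 6 graduates and 3 dropouts (made-up values).
toy = pd.DataFrame({
    "grade": [12, 13, 14, 15, 16, 17, 8, 9, 10],
    "Target": ["Graduate"] * 6 + ["Dropout"] * 3,
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = int(toy["Target"].value_counts().min())

# Sample n_min rows from each class; random_state makes the draw repeatable.
balanced = toy.groupby("Target").sample(n=n_min, random_state=0)

print(balanced["Target"].value_counts())
```

Both classes end up with three rows here, because the smaller class has three.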

In [None]:
student_data_clean = pd.concat([graduate_sample, dropout_data])
student_data_clean.describe()

In [None]:
# This line tells the notebook to show plots inside of the notebook
%matplotlib inline

# Pair plot on a 100-row sample (disabled; uncomment to draw it)
#sb.pairplot(student_data_clean.sample(100), hue='Target');

In [None]:
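# Target holds strings, so restrict the correlation to the numeric columns
student_data_corr = student_data_clean.corr(numeric_only=True)
# Mask the upper triangle (and diagonal) so each pair is drawn only once
mask = np.zeros_like(student_data_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True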
plt.figure(figsize=(15,15))
with sb.axes_style("white"):
    ax = sb.heatmap(student_data_corr, linewidths=0.1, cmap="YlGnBu", annot=True, square=True, mask=mask, fmt='.2f', annot_kws={"size": 6}, vmax=1, vmin=-1)
    plt.show()

From this correlation matrix we can extract the features that are strongly correlated with each other. Our criterion for strongly correlated features is an absolute correlation above 0.9.

In [None]:
# Absolute correlations between the numeric features
student_data_corr = student_data_clean.corr(numeric_only=True).abs()

# Keep only the strict upper triangle so each pair is considered once
upper = student_data_corr.where(np.triu(np.ones(student_data_corr.shape), k=1).astype(bool))

# Drop the second feature of every pair correlated above 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]

old_n_columns = len(student_data_clean.columns)

student_data_clean.drop(to_drop, axis=1, inplace=True)

print('Dropped ' + str(old_n_columns-len(student_data_clean.columns)) + ' columns')
student_data_clean.describe()
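To see which member of a correlated pair the upper-triangle trick removes, here is a small sketch on synthetic data. The columns `a`, `b` and `c` are made up for illustration: `b` is a noisy copy of `a`, and `c` is independent noise.

```python
import numpy as np
import pandas as pd

# Toy features: "b" is almost a copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)
```

Only `b` is flagged: of each correlated pair, the column that appears later in the frame is dropped and the earlier one is kept.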

Auxiliary function to retrieve the inputs and labels from the provided dataset:

In [None]:
def get_inputs_labels(dataset, scaler=None):
    all_inputs = dataset.drop('Target', axis=1)
    all_labels = dataset['Target']
    
    if scaler is not None:
        scaler = scaler.fit(all_inputs)
        all_inputs = scaler.transform(all_inputs)

    return all_inputs, all_labels
        
all_inputs, all_labels = get_inputs_labels(student_data_clean)

Auxiliary function to perform parameter tuning with cross-validation:

In [None]:



def tune_model(dataset, model_instance, parameter_grid, cross_validation=StratifiedKFold(n_splits=10), scaler=None, oversample=False): 
    all_inputs, all_labels = get_inputs_labels(dataset, scaler)
    
    if oversample:
        steps = [('sampling', SMOTE()), ('model', model_instance)]
        model_instance = Pipeline(steps=steps)


    grid_search = GridSearchCV(
        model_instance,
        param_grid=parameter_grid,
        cv=cross_validation,
        scoring="f1_weighted"
    )

    grid_search.fit(all_inputs, all_labels)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))

    # The fitted search object exposes best_estimator_, best_score_ and best_params_
    return grid_search
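The grid search above ranks parameter sets by `f1_weighted`. As a small worked example on made-up labels, the weighted F1 score is the per-class F1 averaged with each class's number of true samples as the weight:

```python
from sklearn.metrics import f1_score

# Made-up ground truth and predictions: 2 dropouts, 4 graduates.
y_true = ["Dropout", "Dropout", "Graduate", "Graduate", "Graduate", "Graduate"]
y_pred = ["Dropout", "Graduate", "Graduate", "Graduate", "Graduate", "Dropout"]

labels = ["Dropout", "Graduate"]

# F1 of each class on its own, in the order given by `labels`.
per_class = f1_score(y_true, y_pred, labels=labels, average=None)

# Support = number of true samples of each class.
support = [y_true.count(label) for label in labels]

# Weighted F1 computed by hand, then by scikit-learn.
manual = sum(f * s for f, s in zip(per_class, support)) / sum(support)
weighted = f1_score(y_true, y_pred, average="weighted")

print(per_class, support, manual, weighted)
```

Here the Dropout class has an F1 of 0.5 and the Graduate class 0.75, so the weighted score is (0.5 × 2 + 0.75 × 4) / 6 = 2/3.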

### Time Measure

In [None]:



def measure_time(dataset, model_instance, params, scaler=None, oversample=False):
    all_inputs, all_labels = get_inputs_labels(dataset, scaler)

    if oversample:
        steps = [('sampling', SMOTE()), ('model', model_instance)]
        model_instance = Pipeline(steps=steps)
    model_instance.set_params(**params)

    (training_inputs,
    testing_inputs,
    training_classes,
    testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25, random_state=1)
    
    start = time.time()
    model_instance.fit(training_inputs, training_classes)
    end = time.time()
    return end - start
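A single wall-clock measurement can be noisy. As a hedged sketch of one way to reduce that noise (the `time_fit` helper and the synthetic data are assumptions for illustration, not part of this notebook), the fit can be repeated and the shortest time kept:

```python
import time

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def time_fit(model, X, y, repeats=5):
    """Return the shortest wall-clock fit time over `repeats` runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock intended for timing
        model.fit(X, y)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Synthetic data standing in for the student dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
elapsed = time_fit(DecisionTreeClassifier(random_state=0), X, y)

print(f"Best of 5 fits: {elapsed:.4f} s")
```

Taking the minimum discards runs that were slowed down by other processes on the machine.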

### Decision Tree

In [None]:


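# Hold out 25% of the data so the tree can be scored on unseen rows
# (same split settings as in measure_time)
(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25, random_state=1)

decision_tree_classifier = DecisionTreeClassifier()

parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': [8]}

decision_tree_w_parameters = GridSearchCV(decision_tree_classifier,
                                          param_grid=parameters)

decision_tree_w_parameters.fit(training_inputs, training_classes)

# Export the best tree in Graphviz format for inspection
with open('decision_tree.dot', 'w') as out_file:
    tree.export_graphviz(decision_tree_w_parameters.best_estimator_, out_file=out_file)

decision_tree_w_parameters.score(testing_inputs, testing_classes)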


In [None]:


parameter_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': range(1, 9),
    'max_features': range(1, 9)
}

dt_original = tune_model(student_data, DecisionTreeClassifier(), parameter_grid)

In [None]:
dt = tune_model(student_data_clean, DecisionTreeClassifier(), parameter_grid)

In [None]:
parameter_grid = {
    'model__criterion': ['gini', 'entropy'],
    'model__splitter': ['best', 'random'],
    'model__max_depth': range(1, 7),
    'model__max_features': range(1, 7)
}

dt_os_fs = tune_model(student_data_clean, DecisionTreeClassifier(), parameter_grid, oversample=True)
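The `model__` prefix in the grid above comes from how pipelines name their parameters: a parameter of a step is addressed as `<step name>__<parameter>`. The sketch below illustrates this with scikit-learn's own `Pipeline` and a `StandardScaler` step standing in for SMOTE, so that it runs without `imbalanced-learn`. The `imblearn` pipeline is expected to follow the same naming convention, but that is an assumption here.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Two-step pipeline; the second step is named "model", as in tune_model.
pipe = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("model", DecisionTreeClassifier()),
])

# Parameters of a step are addressed as "<step name>__<parameter>".
pipe.set_params(model__max_depth=3)

# List every tunable parameter that belongs to the "model" step.
model_params = sorted(k for k in pipe.get_params() if k.startswith("model__"))

print(model_params)
```

Any key printed here can be used in a `GridSearchCV` parameter grid when the pipeline is the estimator being tuned.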

### SVM

In [None]:
X, y = get_inputs_labels(student_data_clean)

# Without standardizing the data:
svc = SVC()

# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(svc, X, y, cv=10)

plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

In [None]:
# Standardizing the data:
standardized_X, y = get_inputs_labels(student_data_clean, scaler = StandardScaler())

svc = SVC()

# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(svc, standardized_X, y, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

Comparing the two histograms shows that standardization matters for the SVM: once the features are scaled, the scores are higher and more consistent across folds.
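As a reminder of what `StandardScaler` does, the sketch below scales two columns with very different ranges. The values are the `Course` and `Unemployment rate` entries of the first four rows shown earlier, used here only as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: Course code and Unemployment rate.
X = np.array([
    [171.0, 10.8],
    [9254.0, 13.9],
    [9070.0, 10.8],
    [9773.0, 9.4],
])

# Each column is shifted to mean 0 and rescaled to standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, no single feature dominates distance or kernel computations simply because it is measured in larger numbers.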

Still, the cross-validation scores vary considerably depending on which folds are used for training. We should therefore tune the parameters to find the values that work best for our dataset without overfitting it. This can be achieved with a grid search, which is addressed below.

In [None]:
parameter_grid = {
    'C': [1, 10, 50], 
    'gamma': [0.001, 0.0001],
    'kernel': ['linear', 'poly', 'rbf']
    #'kernel': ['linear', 'rbf', 'sigmoid']
}

# No oversampling / No feature selection
svc_original = tune_model(student_data, SVC(), parameter_grid, scaler=StandardScaler())

In [None]:
# No oversampling / Feature selection
svc = tune_model(student_data_clean, SVC(), parameter_grid, scaler=StandardScaler())

In [None]:
parameter_grid = {
    'model__C': [1, 10, 50], 
    'model__gamma': [0.001, 0.0001],
    # 'kernel': ['linear', 'poly', 'rbf']
    'model__kernel': ['linear', 'rbf', 'sigmoid']
}

# Oversampling / Feature Selection
svc_os_fs = tune_model(student_data_clean, SVC(), parameter_grid, scaler=StandardScaler(), oversample=True)

### K-nearest neighbours (KNN)

In [None]:
# Without standardizing the data

X, y = get_inputs_labels(student_data_clean)

knn = neighbors.KNeighborsClassifier()

# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(knn, X, y, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)));

In [None]:
# Standardizing the data
standardized_X, y = get_inputs_labels(student_data_clean, scaler=StandardScaler())

knn = neighbors.KNeighborsClassifier()

cv_scores = cross_val_score(knn, standardized_X, y, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)));

In [None]:
parameter_grid =  {
    'n_neighbors':[4,5,6,7,10,15],
    'leaf_size':[5, 10, 15, 20, 50, 100],
    'n_jobs':[-1],
    'algorithm':['auto']
}

# No oversampling / No feature selection
knn = neighbors.KNeighborsClassifier()
knn_original = tune_model(student_data, knn, parameter_grid, scaler=StandardScaler())

In [None]:
# No oversampling / Feature selection
knn = neighbors.KNeighborsClassifier()
knn = tune_model(student_data_clean, knn, parameter_grid, scaler=StandardScaler())

In [None]:
parameter_grid = {
    'model__n_neighbors':[4,5,6,7,10,15],
    'model__leaf_size':[5, 10, 15, 20, 50, 100],
    'model__n_jobs':[-1],
    'model__algorithm':['auto']
}

# Oversampling / Feature Selection
knn_os_fs = tune_model(student_data_clean, neighbors.KNeighborsClassifier(), parameter_grid, scaler=StandardScaler(), oversample=True)

### Naive Bayes

In [None]:


parameter_grid = {}

# No oversampling / No feature selection
nb_original = tune_model(student_data, GaussianNB(), parameter_grid, scaler=StandardScaler())

In [None]:
# No oversampling / Feature selection
nb = tune_model(student_data_clean, GaussianNB(), parameter_grid, scaler=StandardScaler())

In [None]:
parameter_grid = {}

# Oversampling / Feature Selection
nb_os_fs = tune_model(student_data_clean, GaussianNB(), parameter_grid, scaler=StandardScaler(), oversample=True)

### Random Forest Classifier

In [None]:
parameter_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [5, 10, 15],
    'n_jobs': [-1], #Use all cores
    'criterion': ['gini', 'entropy']
}

# No oversampling / No feature selection
rfc_original = tune_model(student_data, RandomForestClassifier(), parameter_grid)

In [None]:
# No oversampling / Feature selection
rfc = tune_model(student_data_clean, RandomForestClassifier(), parameter_grid)

In [None]:
parameter_grid = {
    'model__n_estimators': [10, 50, 100, 200],
    'model__max_depth': [5, 10, 15],
    'model__n_jobs': [-1], #Use all cores
    'model__criterion': ['gini', 'entropy']
}

# Oversampling / Feature Selection
rfc_os_fs = tune_model(student_data_clean, RandomForestClassifier(), parameter_grid, oversample=True)

### Comparing Models

In [None]:
scores = {
    "Decision Tree" : [dt_original, dt, dt_os_fs],
    "SVC" : [svc_original, svc, svc_os_fs],
    "K-nearest Neighbours" : [knn_original, knn, knn_os_fs],
    "Naive Bayes" : [nb_original, nb, nb_os_fs],
    "Random Forest" : [rfc_original, rfc, rfc_os_fs]
}

labels = ["Original Data","Modified Data", "Oversampled Modified Data"]

ind = np.arange(5)

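plt.figure(figsize=(11,11))
# One group of three bars per model, one bar per data configuration
plt.bar(ind, [i[0].best_score_ for i in scores.values()], 0.2)
plt.bar(ind + 0.2, [i[1].best_score_ for i in scores.values()], 0.2)
plt.bar(ind + 0.4, [i[2].best_score_ for i in scores.values()], 0.2)
# Centre each tick under the middle bar of its group
plt.xticks(ind + 0.2, list(scores.keys()))
plt.legend(labels,loc=2)
plt.ylabel('Best weighted F1 score (cross-validated)')
plt.ylim(0, 1)
plt.show()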
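The bar chart only uses `best_score_`. As a sketch of how to look at every parameter combination a search tried, the `cv_results_` attribute can be loaded into a DataFrame. The example runs a deliberately tiny grid on synthetic data from `make_classification`, which is an assumption made so that the block is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the student dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_weighted",
)
search.fit(X, y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
```

The `std_test_score` column shows how much the score varies between folds, which helps judge whether a small difference between two models is meaningful.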

### Analysing Times

In [None]:
times = {
    "Decision Tree" : [
        measure_time(student_data, DecisionTreeClassifier(), dt_original.best_params_),
        measure_time(student_data_clean, DecisionTreeClassifier(), dt.best_params_),
        measure_time(student_data_clean, DecisionTreeClassifier(), dt_os_fs.best_params_, oversample=True)
    ],
    "SVC" : [
        measure_time(student_data, SVC(), svc_original.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, SVC(), svc.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, SVC(), svc_os_fs.best_params_, oversample=True, scaler=StandardScaler())
    ],
    "K-nearest Neighbours" : [
        measure_time(student_data, neighbors.KNeighborsClassifier(), knn_original.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, neighbors.KNeighborsClassifier(), knn.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, neighbors.KNeighborsClassifier(), knn_os_fs.best_params_, oversample=True, scaler=StandardScaler())
    ],
    "Naive Bayes" : [
        measure_time(student_data, GaussianNB(), nb_original.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, GaussianNB(), nb.best_params_, scaler=StandardScaler()),
        measure_time(student_data_clean, GaussianNB(), nb_os_fs.best_params_, oversample=True, scaler=StandardScaler())
    ],
    "Random Forest" : [
        measure_time(student_data, RandomForestClassifier(), rfc_original.best_params_),
        measure_time(student_data_clean, RandomForestClassifier(), rfc.best_params_),
        measure_time(student_data_clean, RandomForestClassifier(), rfc_os_fs.best_params_, oversample=True)
    ]
}

labels = ["No oversampling/No feature selection","No oversampling/Feature selection", "Oversampling/Feature selection"]

ind = np.arange(5)

plt.figure(figsize=(10,9))
# One group of three bars per model, one bar per data configuration
plt.bar(ind, [i[0] for i in times.values()], 0.2)
plt.bar(ind + 0.2, [i[1] for i in times.values()], 0.2)
plt.bar(ind + 0.4, [i[2] for i in times.values()], 0.2)
# Centre each tick under the middle bar of its group
plt.xticks(ind + 0.2, list(times.keys()))
plt.legend(labels,loc=1)
plt.ylabel('Training time (s)')
plt.show()

### Conclusion

The proposed work was to test and compare different Supervised Machine Learning models for classification of the **Students' Success or Dropout** dataset. The tested models were **Decision Tree**, **Support Vector Machines**, **K-nearest Neighbours**, **Naive Bayes** and **Random Forest**.

After some exploratory data analysis we decided to drop some features based on their correlation with each other. This only proved effective for the **Naive Bayes** and **Random Forest** classifiers.

To evaluate each model and choose its best parameters, we used scikit-learn's `GridSearchCV` to test different sets of parameters, scoring the models with the weighted F1 score. We also tried combining oversampling with feature selection. Looking at the benchmarks, we can conclude that oversampling does not improve the scores of our models while significantly increasing the training time.

In terms of scoring, the best model for our classification problem is the **Support Vector Machine**, followed closely by **K-nearest Neighbours**. However, when we look at the time needed to train each model, the **Support Vector Machine** takes much longer than **K-nearest Neighbours**, making **K-nearest Neighbours** the best model overall. Part of this gap is expected: **K-nearest Neighbours** does very little work at training time, since fitting mostly amounts to storing or indexing the training data. In addition, it accepts the flag `n_jobs=-1`, which lets it use all CPU cores for the neighbour search, while **Support Vector Machine** does not support this option.