## Comparison of the 5 supervised machine learning techniques

1. Using full model (16 features)
2. Using full model + Data Balancing


### Basic Data Preparation

In [1]:
# import everything we need first
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('xAPI-Edu-Data.csv')

In [3]:
# Make the inconsistent capitalisation of some column names uniform
df.rename(columns={'gender': 'Gender', 'NationalITy': 'Nationality',
                   'raisedhands': 'RaisedHands', 'VisITedResources': 'VisitedResources'},
          inplace=True)


### Importing classifiers, metrics, train/test split, time

In [4]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import time


### 1. Using full model

#### 1.1 Label Encoding of Categorical Data

In [5]:
from sklearn.preprocessing import LabelEncoder

target = {
    'M': 1,
    'L': 0,
    'H': 2
}
df['Class'] = df['Class'].map(target)

X = df.drop('Class', axis=1)
y = df['Class']

# Encode the categorical (object-dtype) columns of X as integer codes
labelEncoder = LabelEncoder()
cat_columns = X.select_dtypes(include='object').columns
for col in cat_columns:
    X[col] = labelEncoder.fit_transform(X[col])
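
To make the encoding step easier to inspect, the sketch below records which integer each category receives. It is a minimal illustration on a toy frame: the column values and the `mappings` dictionary are assumptions introduced here, not part of the notebook. `LabelEncoder` assigns codes in sorted order of the labels, so the codes carry an ordering that nominal features do not really have, which can matter for distance-based and linear models.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the values are illustrative and not taken from the dataset
toy = pd.DataFrame({"Gender": ["M", "F", "M", "F"],
                    "Topic": ["Math", "IT", "Arabic", "IT"]})

mappings = {}
for col in toy.columns:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    # classes_ is sorted, so a label's position is its integer code
    mappings[col] = {label: code for code, label in enumerate(encoder.classes_)}

print(mappings)
print(toy)
```

Keeping the mapping around makes it possible to translate encoded values back to their original labels when reading model output.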

#### 1.2 Split dataset into train and test data

In [6]:
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, random_state=42)
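
Since no `test_size` is passed, `train_test_split` falls back to its default of 25% test data. The sketch below shows the resulting sizes on a synthetic stand-in with the same number of rows (480) and the same class counts as the dataset. The `stratify` call is an optional variation that the notebook does not use; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 480 rows with class counts 127 / 211 / 142
X_demo = np.arange(480).reshape(-1, 1)
y_demo = np.array([0] * 127 + [1] * 211 + [2] * 142)

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # 360 120

# Optional variation (not in the notebook): stratify on the target
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo)
print(np.bincount(y_te_s))  # per-class counts in the stratified test set
```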

#### 1.3 Model training and performance evaluation

In [7]:
keys = []
scores = []
timings = []
# Note: multi_class is deprecated since scikit-learn 1.5
models = {'Logistic Regression': LogisticRegression(solver='lbfgs', multi_class='multinomial'),
          'K-NN Classification': KNeighborsClassifier(n_neighbors=3),
          'Support Vector Machines': SVC(kernel='linear'),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(n_estimators=150, random_state=42)}

for k,v in models.items():
    mod = v
    start_time = time.time()
    mod.fit(X_train_full, y_train_full)
    time_elapsed = time.time() - start_time
    pred = mod.predict(X_test_full)
    print(str(k) + '\n')
    print("Confusion matrix:")
    print(confusion_matrix(y_test_full, pred))
    accuracy = accuracy_score(y_test_full, pred)
    print("Accuracy score: "+ str(accuracy))
    print("Training took %s seconds." % (time_elapsed))
    print('\n' + '\n')
    keys.append(k)
    scores.append(accuracy)
    timings.append(time_elapsed)

# Build the summary table once, after every model has been evaluated
table_fullmodel = pd.DataFrame({'Model': keys, 'Accuracy score': scores, 'Training time': timings})
print("Comparison of the 5 supervised machine learning techniques using full model")
print(table_fullmodel)



Logistic Regression

Confusion matrix:
[[31  3  0]
 [ 8 37 13]
 [ 0 13 15]]
Accuracy score: 0.6916666666666667
Training took 0.061884164810180664 seconds.



K-NN Classification

Confusion matrix:
[[28  6  0]
 [11 33 14]
 [ 2  9 17]]
Accuracy score: 0.65
Training took 0.002994060516357422 seconds.



Support Vector Machines

Confusion matrix:
[[30  4  0]
 [ 5 46  7]
 [ 0  8 20]]
Accuracy score: 0.8
Training took 0.712092399597168 seconds.



Decision Tree

Confusion matrix:
[[28  6  0]
 [ 6 43  9]
 [ 0  6 22]]
Accuracy score: 0.775
Training took 0.003991127014160156 seconds.



Random Forest

Confusion matrix:
[[31  3  0]
 [ 4 48  6]
 [ 0  7 21]]
Accuracy score: 0.8333333333333334
Training took 0.23934626579284668 seconds.



Comparison of the 5 supervised machine learning techniques using full model
                     Model  Accuracy score  Training time
0      Logistic Regression        0.691667       0.061884
1      K-NN Classification        0.650000       0.002994
2  Support Vector Machines        0.800000       0.712092
3            Decision Tree        0.775000       0.003991
4            Random Forest        0.833333       0.239346
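
Accuracy alone hides how each class fares. As a rough sketch, the per-class recall and precision can be read off a confusion matrix directly. The matrix below is the Random Forest one copied from the output above; the label order L, M, H follows the 0/1/2 encoding of `Class`, and the variable names are chosen here for illustration.

```python
import numpy as np

# Random Forest confusion matrix copied from the output above:
# rows = true class, columns = predicted class, in the order 0 (L), 1 (M), 2 (H)
cm = np.array([[31, 3, 0],
               [4, 48, 6],
               [0, 7, 21]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class

for label, r, p in zip(["L", "M", "H"], recall, precision):
    print(f"{label}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

For this matrix the weakest recall is on class H (21 of 28 test samples), even though the overall accuracy is the highest of the five models.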

### 2. Using full model + Data Balancing

#### 2.1 Performing data balancing

Before data balancing

In [8]:
df['Class'].value_counts()

1    211
2    142
0    127
Name: Class, dtype: int64

After data balancing

In [9]:
# Randomly drop 30% of the class 'M' rows (keeping 70%) to bring the class sizes closer together.
# No random_state is set, so the rows that are dropped differ between runs.
df_balanced = pd.read_csv('xAPI-Edu-Data.csv')
df_balanced = df_balanced.drop(df_balanced.query("Class == 'M'").sample(frac = 0.3).index)
df_balanced['Class'].value_counts()

M    148
H    142
L    127
Name: Class, dtype: int64
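
Because the sampling above is unseeded, the exact rows that are dropped (and so the downstream scores) can change from run to run. A minimal sketch of a reproducible variant follows, assuming a synthetic frame that only mimics the class counts shown earlier; the `random_state` argument and its value 42 are additions for illustration.

```python
import pandas as pd

# Synthetic frame with the original class counts; only the 'Class' column is modelled
demo = pd.DataFrame({"Class": ["M"] * 211 + ["H"] * 142 + ["L"] * 127})

# Seeded undersampling: pick 30% of the 'M' rows and drop them
to_drop = demo[demo["Class"] == "M"].sample(frac=0.3, random_state=42).index
demo_balanced = demo.drop(to_drop)

counts = demo_balanced["Class"].value_counts()
print(counts)
```

With 211 rows in class M, 30% rounds to 63 dropped rows, which leaves the 148 reported above.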

#### 2.2 Label Encoding of Categorical Data

In [10]:
from sklearn.preprocessing import LabelEncoder

target = {
    'M': 1,
    'L': 0,
    'H': 2
}
df_balanced['Class'] = df_balanced['Class'].map(target)

X_bal = df_balanced.drop('Class', axis=1)
y_bal = df_balanced['Class']

# Encode the categorical (object-dtype) columns of X_bal as integer codes
labelEncoder = LabelEncoder()
cat_columns = X_bal.select_dtypes(include='object').columns
for col in cat_columns:
    X_bal[col] = labelEncoder.fit_transform(X_bal[col])

#### 2.3 Split dataset into train and test data

In [11]:
X_train_balanced, X_test_balanced, y_train_balanced, y_test_balanced = train_test_split(X_bal, y_bal, random_state=42)
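
The fit, time and score loop from section 1.3 is applied unchanged to this second split, so it could be wrapped in a helper. The sketch below is one possible shape for it: `evaluate_models` is a hypothetical name, `time.perf_counter` is swapped in for `time.time` as a clock better suited to short intervals, and the demo data comes from `make_classification` rather than the real dataset.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model, time the fit, and return one summary row per model."""
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        accuracy = accuracy_score(y_test, model.predict(X_test))
        rows.append({"Model": name, "Accuracy score": accuracy, "Training time": elapsed})
    return pd.DataFrame(rows)


# Synthetic 3-class data, used only to exercise the helper
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_models = {"Decision Tree": DecisionTreeClassifier(random_state=0),
               "K-NN Classification": KNeighborsClassifier(n_neighbors=3)}
table_demo = evaluate_models(demo_models, X_tr, X_te, y_tr, y_te)
print(table_demo)
```

Calling the helper once per split would keep the two experiments identical apart from the data they receive.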

#### 2.4 Model training and performance evaluation

In [12]:
keys = []
scores = []
timings = []
models = {'Logistic Regression': LogisticRegression(solver='lbfgs', multi_class='multinomial'),
          'K-NN Classification': KNeighborsClassifier(n_neighbors=3),
          'Support Vector Machines': SVC(kernel='linear'),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(n_estimators=150, random_state=42)}

for k,v in models.items():
    mod = v
    start_time = time.time()
    mod.fit(X_train_balanced, y_train_balanced)
    time_elapsed = time.time() - start_time
    pred = mod.predict(X_test_balanced)
    print(str(k) + '\n')
    print("Confusion matrix:")
    print(confusion_matrix(y_test_balanced, pred))
    accuracy = accuracy_score(y_test_balanced, pred)
    print("Accuracy score: "+ str(accuracy))
    print("Training took %s seconds." % (time_elapsed))
    print('\n' + '\n')
    keys.append(k)
    scores.append(accuracy)
    timings.append(time_elapsed)

# Build the summary table once, after every model has been evaluated
table_fullmodel_balanced = pd.DataFrame({'Model': keys, 'Accuracy score': scores, 'Training time': timings})
print("Comparison of the 5 supervised machine learning techniques using full model with balanced dataset")
print(table_fullmodel_balanced)



Logistic Regression

Confusion matrix:
[[34  2  0]
 [ 8 17 13]
 [ 1  5 25]]
Accuracy score: 0.7238095238095238
Training took 0.05484938621520996 seconds.



K-NN Classification

Confusion matrix:
[[32  3  1]
 [ 9 18 11]
 [ 3  9 19]]
Accuracy score: 0.6571428571428571
Training took 0.0029909610748291016 seconds.



Support Vector Machines

Confusion matrix:
[[30  6  0]
 [ 2 27  9]
 [ 0  8 23]]
Accuracy score: 0.7619047619047619
Training took 0.6273202896118164 seconds.



Decision Tree

Confusion matrix:
[[30  6  0]
 [ 4 23 11]
 [ 1  8 22]]
Accuracy score: 0.7142857142857143
Training took 0.0049822330474853516 seconds.



Random Forest

Confusion matrix:
[[32  4  0]
 [ 3 27  8]
 [ 0  6 25]]
Accuracy score: 0.8
Training took 0.22344398498535156 seconds.



Comparison of the 5 supervised machine learning techniques using full model with balanced dataset
                     Model  Accuracy score  Training time
0      Logistic Regression        0.723810       0.054849
1      K-NN Classific