# Working with a student performance dataset

https://www.kaggle.com/csafrit2/higher-education-students-performance-evaluation

Abstract
The data was collected from the Faculty of Engineering and Faculty of Educational Sciences students in 2019. The purpose is to predict students' end-of-term performances using ML techniques.

Attribute Information:
Student ID
1- Student Age (1: 18-21, 2: 22-25, 3: above 26)
2- Sex (1: female, 2: male)
3- Graduated high-school type: (1: private, 2: state, 3: other)
4- Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full)
5- Additional work: (1: Yes, 2: No)
6- Regular artistic or sports activity: (1: Yes, 2: No)
7- Do you have a partner: (1: Yes, 2: No)
8- Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410)
9- Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other)
10- Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other)
11- Mother's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
12- Father's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
13- Number of sisters/brothers (if available): (1: 1, 2: 2, 3: 3, 4: 4, 5: 5 or above)
14- Parental status: (1: married, 2: divorced, 3: died - one of them or both) (note: this column is named KIDS in the dataset)
15- Mother's occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other)
16- Father's occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other)
17- Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)
18- Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
19- Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often)
20- Attendance to the seminars/conferences related to the department: (1: Yes, 2: No)
21- Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral)
22- Attendance to classes (1: always, 2: sometimes, 3: never)
23- Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable)
24- Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)
25- Taking notes in classes: (1: never, 2: sometimes, 3: always)
26- Listening in classes: (1: never, 2: sometimes, 3: always)
27- Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always)
28- Flip-classroom: (1: not useful, 2: useful, 3: not applicable)
29- Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
30- Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
31- Course ID
32- OUTPUT Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA)
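
Because every attribute is stored as an integer code, it can help to decode a column back to its label when reading results. The snippet below is a minimal sketch for the GRADE column only; the mapping is copied from attribute 32 above, while `GRADE_LABELS`, `demo`, and the sample rows are hypothetical stand-ins rather than values from the real file.

```python
import pandas as pd

# Code -> letter grade, copied from attribute 32 in the list above.
GRADE_LABELS = {0: "Fail", 1: "DD", 2: "DC", 3: "CC", 4: "CB", 5: "BB", 6: "BA", 7: "AA"}

# Hypothetical stand-in rows; the real codes come from student_prediction.csv.
demo = pd.DataFrame({"GRADE": [1, 7, 0, 4]})

# Series.map looks each code up in the dictionary and returns the label.
demo["GRADE_LABEL"] = demo["GRADE"].map(GRADE_LABELS)
print(demo)
```

The same pattern would apply to any other coded attribute, given a dictionary built from its entry in the list.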

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('../data/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../data/._student_prediction.csv
../data/student_prediction.csv


In [2]:
df = pd.read_csv("../data/student_prediction.csv")
df.head()

Unnamed: 0,STUDENTID,AGE,GENDER,HS_TYPE,SCHOLARSHIP,WORK,ACTIVITY,PARTNER,SALARY,TRANSPORT,...,PREP_STUDY,PREP_EXAM,NOTES,LISTENS,LIKES_DISCUSS,CLASSROOM,CUML_GPA,EXP_GPA,COURSE ID,GRADE
0,STUDENT1,2,2,3,3,1,2,2,1,1,...,1,1,3,2,1,2,1,1,1,1
1,STUDENT2,2,2,3,3,1,2,2,1,1,...,1,1,3,2,3,2,2,3,1,1
2,STUDENT3,2,2,2,3,2,2,2,2,4,...,1,1,2,2,1,1,2,2,1,1
3,STUDENT4,1,1,1,3,1,2,1,2,1,...,1,2,3,2,2,1,3,2,1,1
4,STUDENT5,2,2,1,3,2,2,1,3,1,...,2,1,2,2,2,1,2,2,1,1


In [15]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif

import pickle

In [4]:
df = df.drop('STUDENTID', axis=1)

In [5]:
X = df.drop('GRADE', axis=1)
y = df['GRADE']

# Boolean mask marking integer-typed columns as discrete for MI (Mutual Information)
discrete_features = X.dtypes == int

In [6]:
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores  # show all features ranked by MI score

COURSE ID        0.623271
CUML_GPA         0.250852
MOTHER_EDU       0.184561
STUDY_HRS        0.152261
SCHOLARSHIP      0.142022
FATHER_JOB       0.134495
EXP_GPA          0.127521
GENDER           0.120415
SALARY           0.117921
#_SIBLINGS       0.104088
MOTHER_JOB       0.102861
AGE              0.100847
FATHER_EDU       0.097289
TRANSPORT        0.093659
IMPACT           0.081894
LIVING           0.077654
READ_FREQ        0.075782
READ_FREQ_SCI    0.066456
NOTES            0.065915
PREP_STUDY       0.059436
LIKES_DISCUSS    0.052028
HS_TYPE          0.050828
KIDS             0.047981
LISTENS          0.047148
ATTEND           0.039263
PARTNER          0.037048
WORK             0.032927
ACTIVITY         0.032521
CLASSROOM        0.029650
PREP_EXAM        0.029195
ATTEND_DEPT      0.028835
Name: MI Scores, dtype: float64
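
To get a feel for how to read these scores, the sketch below runs `mutual_info_classif` on synthetic data, which is an assumption made purely for illustration: `informative` is an exact copy of the target and `noise` is drawn independently of it. The names `X_demo`, `y_demo`, and `scores` are hypothetical. On a finite sample an unrelated feature typically still gets a small positive score, so low values in a ranking like the one above should be read with caution on only 145 rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 3, size=300)  # synthetic 3-class target
X_demo = pd.DataFrame({
    "informative": y_demo,                  # exact copy of the target
    "noise": rng.integers(0, 3, size=300),  # drawn independently of the target
})

# Treat both columns as discrete, as the notebook does for integer columns.
scores = pd.Series(
    mutual_info_classif(X_demo, y_demo, discrete_features=True, random_state=0),
    index=X_demo.columns,
    name="MI Scores",
).sort_values(ascending=False)
print(scores)
```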

In [7]:
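def drop_uninformative(df, mi_scores):
    # Keep only the columns whose MI score is strictly positive.
    return df.loc[:, mi_scores > 0]

# Every score listed above is positive, so no column is dropped here.
X = drop_uninformative(X, mi_scores)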

In [8]:
kmeans = KMeans(n_clusters=8, random_state=0)
X["Cluster"] = kmeans.fit_predict(X)
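
Note that the cluster labels above are fitted on all rows before any cross-validation takes place. One possible alternative, sketched here under stated assumptions, is to fit KMeans inside each training fold through a pipeline. `ClusterLabelAdder` is a hypothetical helper, not part of scikit-learn, and the data is synthetic; `n_init=10` is set explicitly only so that behaviour does not depend on the library version's default.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


class ClusterLabelAdder(TransformerMixin, BaseEstimator):
    """Hypothetical helper: append a KMeans cluster label as an extra column."""

    def __init__(self, n_clusters=8, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # Clusters are learned only from the rows passed to fit (the training fold).
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X)
        return np.column_stack([np.asarray(X), labels])


# Synthetic ordinal features standing in for the survey codes.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(145, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = rng.integers(0, 3, size=145)

pipeline = make_pipeline(ClusterLabelAdder(n_clusters=8), DecisionTreeClassifier(random_state=0))
pred_demo = cross_val_predict(pipeline, X_demo, y_demo, cv=5)
print(pred_demo[:10])
```

Since KMeans never sees the target, the effect of fitting it on all rows is likely small; the pipeline mainly keeps the evaluation procedure strict.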

In [9]:
decision_tree = dtc(random_state=0)
decision_tree.fit(X,y)

predict = cross_val_predict(estimator = decision_tree, X = X, y = y, cv = 5)
print("Classification Report: \n",classification_report(y, predict))

Classification Report: 
               precision    recall  f1-score   support

           0       0.14      0.12      0.13         8
           1       0.33      0.43      0.37        35
           2       0.37      0.42      0.39        24
           3       0.19      0.19      0.19        21
           4       0.00      0.00      0.00        10
           5       0.00      0.00      0.00        17
           6       0.17      0.08      0.11        13
           7       0.38      0.35      0.36        17

    accuracy                           0.26       145
   macro avg       0.20      0.20      0.19       145
weighted avg       0.23      0.26      0.24       145
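
For context, the largest class in the report above (grade 1) has 35 of 145 rows, so always predicting it would score 35/145, about 0.24 accuracy. The sketch below reproduces that baseline with `DummyClassifier`; the target is rebuilt from the support column only, and `X_demo` is a placeholder, so this is an illustration under those assumptions rather than a run on the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Class supports copied from the report above (grades 0..7, 145 rows in total).
support = [8, 35, 24, 21, 10, 17, 13, 17]
y_demo = np.repeat(np.arange(8), support)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the baseline ignores them

# Always predict the most frequent class seen in the training fold.
baseline = DummyClassifier(strategy="most_frequent")
pred = cross_val_predict(baseline, X_demo, y_demo, cv=5)
baseline_accuracy = accuracy_score(y_demo, pred)
print(f"majority-class accuracy: {baseline_accuracy:.2f}")
```

Any model whose accuracy is close to this figure is doing little better than guessing the most common grade.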



In [10]:
random_forest = RandomForestClassifier(random_state = 0)
random_forest.fit(X, y)
predict = cross_val_predict(estimator = random_forest, X = X, y = y, cv = 5)
print("Classification Report: \n",classification_report(y, predict))

Classification Report: 
               precision    recall  f1-score   support

           0       0.17      0.12      0.14         8
           1       0.34      0.66      0.45        35
           2       0.37      0.29      0.33        24
           3       0.29      0.19      0.23        21
           4       0.00      0.00      0.00        10
           5       0.11      0.06      0.08        17
           6       0.20      0.08      0.11        13
           7       0.42      0.59      0.49        17

    accuracy                           0.32       145
   macro avg       0.24      0.25      0.23       145
weighted avg       0.27      0.32      0.28       145



In [11]:
knn = KNeighborsClassifier()
knn.fit(X,y)
predict = cross_val_predict(estimator = knn, X = X, y = y, cv = 5)
print("Classification Report: \n",classification_report(y, predict))

Classification Report: 
               precision    recall  f1-score   support

           0       0.12      0.12      0.12         8
           1       0.28      0.54      0.37        35
           2       0.27      0.17      0.21        24
           3       0.00      0.00      0.00        21
           4       0.00      0.00      0.00        10
           5       0.20      0.18      0.19        17
           6       0.18      0.15      0.17        13
           7       0.33      0.35      0.34        17

    accuracy                           0.24       145
   macro avg       0.17      0.19      0.17       145
weighted avg       0.20      0.24      0.21       145



In [12]:
gnb = GaussianNB()
gnb.fit(X,y)
predict = cross_val_predict(estimator = gnb, X = X, y = y, cv = 5)
print("Classification Report: \n",classification_report(y, predict))

Classification Report: 
               precision    recall  f1-score   support

           0       0.17      0.50      0.26         8
           1       0.36      0.11      0.17        35
           2       0.20      0.04      0.07        24
           3       0.14      0.05      0.07        21
           4       0.06      0.20      0.09        10
           5       0.12      0.06      0.08        17
           6       0.05      0.08      0.06        13
           7       0.25      0.53      0.34        17

    accuracy                           0.16       145
   macro avg       0.17      0.20      0.14       145
weighted avg       0.20      0.16      0.14       145



In [13]:
svc = SVC()
svc.fit(X,y)
predict = cross_val_predict(estimator = svc, X = X, y = y, cv = 5)
print("Classification Report: \n",classification_report(y, predict))

Classification Report: 
               precision    recall  f1-score   support

           0       0.50      0.12      0.20         8
           1       0.25      0.63      0.36        35
           2       0.33      0.12      0.18        24
           3       0.00      0.00      0.00        21
           4       0.00      0.00      0.00        10
           5       0.30      0.18      0.22        17
           6       0.00      0.00      0.00        13
           7       0.21      0.41      0.27        17

    accuracy                           0.25       145
   macro avg       0.20      0.18      0.15       145
weighted avg       0.20      0.25      0.19       145
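
Rows showing 0.00 precision may correspond to classes the model never predicts, in which case precision is undefined and scikit-learn reports it as 0 with a warning. A minimal sketch of making that choice explicit with the `zero_division` parameter follows; the tiny `y_true` and `y_pred` lists are made up for illustration.

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]  # class 2 is never predicted

# zero_division=0 reports undefined precision as 0 without emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["2"]["precision"])
```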





In [16]:
with open('best_model_students.pkl', 'wb') as model_file:
    pickle.dump(random_forest, model_file)
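
Reading a pickled model back might look like the sketch below. It trains a small stand-in forest on synthetic data and writes to a temporary directory, so the file name `demo_model.pkl` and all variables are hypothetical. Pickle files should only be loaded from trusted sources, and a model is best loaded with the same scikit-learn version that saved it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for the trained model.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(60, 4))
y_demo = rng.integers(0, 3, size=60)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```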