# Predict Students' Dropout and Academic Success

**For what purpose was the dataset created?**

The dataset was created in a project that aims to contribute to the reduction of academic dropout and failure in higher education, by using machine learning techniques to identify students at risk at an early stage of their academic path, so that strategies to support them can be put into place. 


**Dataset Overview**

* The dataset has `4424` instances and `36` features, plus the `Target` column (`37` columns in total).

* The dataset includes information known at the time of student enrollment (academic path, demographics, and socioeconomic factors), together with academic performance in the first and second semesters.

* The problem is formulated as a three-category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

More information about the features is available at https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success




## Data Preparation 

### Import necessary libraries

In [4]:
# For data exploration and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For modelling
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.metrics import precision_recall_curve, roc_curve, RocCurveDisplay, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings



### Load the data

In [9]:
df = pd.read_csv('data.csv',sep=';')
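The file is semicolon-separated, so `sep=';'` is required. As a minimal sketch of what that parsing step does, the snippet below reads a small in-memory sample instead of `data.csv` and lists any headers that carry stray whitespace. The sample rows and the names `sample`, `toy`, and `dirty_headers` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout as data.csv.
sample = (
    "Marital status;Daytime/evening attendance\t;Target\n"
    "1;1;Dropout\n"
    "2;0;Graduate\n"
)

# sep=";" tells pandas to split fields on semicolons instead of commas.
toy = pd.read_csv(io.StringIO(sample), sep=";")

# Headers that differ from their stripped form contain leading/trailing whitespace.
dirty_headers = [c for c in toy.columns if c != c.strip()]
print(dirty_headers)
```

Running the same check on the full `df` is a quick way to spot headers that will be awkward to type later.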

In [10]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [12]:
pd.set_option('display.max_columns',37)
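`pd.set_option` changes the display limit for the rest of the session. If the wider display is only needed for one cell, `pd.option_context` is an alternative that restores the previous setting afterwards. A small sketch on a made-up 40-column frame (`toy` is illustrative only):

```python
import pandas as pd

# Made-up wide frame: one row, 40 columns.
toy = pd.DataFrame([[0] * 40], columns=[f"c{i}" for i in range(40)])

default_limit = pd.get_option("display.max_columns")

# Inside the block the column limit is lifted (None means "no limit").
with pd.option_context("display.max_columns", None):
    inside_limit = pd.get_option("display.max_columns")
    print(toy.head())

# After the block the previous limit is back in effect.
restored_limit = pd.get_option("display.max_columns")
```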

In [13]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Admission grade,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,Age at enrollment,International,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


### Quick review of data

In [14]:
df.columns

Index(['Marital status', 'Application mode', 'Application order', 'Course',
       'Daytime/evening attendance\t', 'Previous qualification',
       'Previous qualification (grade)', 'Nacionality',
       'Mother's qualification', 'Father's qualification',
       'Mother's occupation', 'Father's occupation', 'Admission grade',
       'Displaced', 'Educational special needs', 'Debtor',
       'Tuition fees up to date', 'Gender', 'Scholarship holder',
       'Age at enrollment', 'International',
       'Curricular units 1st sem (credited)',
       'Curricular units 1st sem (enrolled)',
       'Curricular units 1st sem (evaluations)',
       'Curricular units 1st sem (approved)',
       'Curricular units 1st sem (grade)',
       'Curricular units 1st sem (without evaluations)',
       'Curricular units 2nd sem (credited)',
       'Curricular units 2nd sem (enrolled)',
       'Curricular units 2nd sem (evaluations)',
       'Curricular units 2nd sem (approved)',
       'Curricular units 2nd

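Let's rename the columns: the original names are long, and some contain stray whitespace (for example, the trailing tab in `'Daytime/evening attendance\t'`) or a misspelling (`'Nacionality'`).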

In [15]:
df.columns = ['MaritalStat', 'AppMode', 'AppOrder', 'Course', 
              'DayEvenAttend', 'PrevQual', 'PrevQualGrade', 'Nationality', 
              'MotherQual', 'FatherQual', 'MotherOcc', 'FatherOcc', 
              'AdmGrade', 'Displaced', 'EduSpecNeeds', 'Debtor', 
              'FeesUpToDate', 'Gender', 'Scholarship', 'AgeEnroll', 'International', 
              'CurrUnits1stCred', 'CurrUnits1stEnroll', 'CurrUnits1stEval', 
              'CurrUnits1stAppr', 'CurrUnits1stGrade', 'CurrUnits1stNoEval', 
              'CurrUnits2ndCred', 'CurrUnits2ndEnroll', 'CurrUnits2ndEval', 
              'CurrUnits2ndAppr', 'CurrUnits2ndGrade', 'CurrUnits2ndNoEval', 
              'UnempRate', 'InflRate', 'GDP', 'Target']
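Assigning a full list to `df.columns` relies on the new names being in exactly the same order as the old ones. One alternative, sketched here on a made-up three-column frame, is an explicit old-to-new mapping passed to `DataFrame.rename`. Only three of the notebook's names are shown, and `toy` and `rename_map` are illustrative names.

```python
import pandas as pd

# Toy frame using three of the original headers (values are made up).
toy = pd.DataFrame({
    "Marital status": [1, 2],
    "Daytime/evening attendance\t": [1, 0],
    "Nacionality": [1, 1],
})

# Explicit old -> new mapping, so column order no longer matters.
rename_map = {
    "Marital status": "MaritalStat",
    "Daytime/evening attendance\t": "DayEvenAttend",
    "Nacionality": "Nationality",
}

# errors="raise" makes pandas fail if a key in the mapping is not a column,
# instead of silently skipping it.
toy = toy.rename(columns=rename_map, errors="raise")
print(toy.columns.tolist())
```

The trade-off is verbosity: the mapping is longer than a plain list, but a typo or a reordered file raises an error rather than mislabeling a column.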

In [16]:
df.head()

Unnamed: 0,MaritalStat,AppMode,AppOrder,Course,DayEvenAttend,PrevQual,PrevQualGrade,Nationality,MotherQual,FatherQual,MotherOcc,FatherOcc,AdmGrade,Displaced,EduSpecNeeds,Debtor,FeesUpToDate,Gender,Scholarship,AgeEnroll,International,CurrUnits1stCred,CurrUnits1stEnroll,CurrUnits1stEval,CurrUnits1stAppr,CurrUnits1stGrade,CurrUnits1stNoEval,CurrUnits2ndCred,CurrUnits2ndEnroll,CurrUnits2ndEval,CurrUnits2ndAppr,CurrUnits2ndGrade,CurrUnits2ndNoEval,UnempRate,InflRate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [17]:
df.shape

(4424, 37)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MaritalStat         4424 non-null   int64  
 1   AppMode             4424 non-null   int64  
 2   AppOrder            4424 non-null   int64  
 3   Course              4424 non-null   int64  
 4   DayEvenAttend       4424 non-null   int64  
 5   PrevQual            4424 non-null   int64  
 6   PrevQualGrade       4424 non-null   float64
 7   Nationality         4424 non-null   int64  
 8   MotherQual          4424 non-null   int64  
 9   FatherQual          4424 non-null   int64  
 10  MotherOcc           4424 non-null   int64  
 11  FatherOcc           4424 non-null   int64  
 12  AdmGrade            4424 non-null   float64
 13  Displaced           4424 non-null   int64  
 14  EduSpecNeeds        4424 non-null   int64  
 15  Debtor              4424 non-null   int64  
 16  FeesUp

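Except for the target variable, every column is stored as a numeric type (`int64` or `float64`). Note that many of the integer columns, such as `MaritalStat` and `Course`, hold category codes rather than quantities.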
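The same observation can be checked programmatically rather than by reading the `df.info()` listing. The sketch below uses a made-up three-column frame; the cutoff of 20 distinct values for flagging an integer column as a likely category code is an assumption, not something taken from the dataset documentation.

```python
import pandas as pd

# Toy frame with one coded column, one continuous column, and the target.
toy = pd.DataFrame({
    "MaritalStat": [1, 2, 1, 4],
    "AdmGrade": [127.3, 142.5, 124.8, 119.6],
    "Target": ["Dropout", "Graduate", "Dropout", "Graduate"],
})

# Split columns by storage type.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Heuristic (assumed cutoff): integer columns with few distinct values
# are probably category codes rather than measurements.
MAX_CODES = 20
likely_codes = [
    c for c in numeric_cols
    if pd.api.types.is_integer_dtype(toy[c]) and toy[c].nunique() <= MAX_CODES
]

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
print("likely category codes:", likely_codes)
```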

In [20]:
df.describe()

Unnamed: 0,MaritalStat,AppMode,AppOrder,Course,DayEvenAttend,PrevQual,PrevQualGrade,Nationality,MotherQual,FatherQual,MotherOcc,FatherOcc,AdmGrade,Displaced,EduSpecNeeds,Debtor,FeesUpToDate,Gender,Scholarship,AgeEnroll,International,CurrUnits1stCred,CurrUnits1stEnroll,CurrUnits1stEval,CurrUnits1stAppr,CurrUnits1stGrade,CurrUnits1stNoEval,CurrUnits2ndCred,CurrUnits2ndEnroll,CurrUnits2ndEval,CurrUnits2ndAppr,CurrUnits2ndGrade,CurrUnits2ndNoEval,UnempRate,InflRate,GDP
count,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0
mean,1.178571,18.669078,1.727848,8856.642631,0.890823,4.577758,132.613314,1.873192,19.561935,22.275316,10.960895,11.032324,126.978119,0.548373,0.011528,0.113698,0.880651,0.351718,0.248418,23.265145,0.024864,0.709991,6.27057,8.299051,4.7066,10.640822,0.137658,0.541817,6.232143,8.063291,4.435805,10.230206,0.150316,11.566139,1.228029,0.001969
std,0.605747,17.484682,1.313793,2063.566416,0.311897,10.216592,13.188332,6.914514,15.603186,15.343108,26.418253,25.26304,14.482001,0.497711,0.10676,0.31748,0.324235,0.47756,0.432144,7.587816,0.155729,2.360507,2.480178,4.179106,3.094238,4.843663,0.69088,1.918546,2.195951,3.947951,3.014764,5.210808,0.753774,2.66385,1.382711,2.269935
min,1.0,1.0,0.0,33.0,0.0,1.0,95.0,1.0,1.0,1.0,0.0,0.0,95.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,-0.8,-4.06
25%,1.0,1.0,1.0,9085.0,1.0,1.0,125.0,1.0,2.0,3.0,4.0,4.0,117.9,0.0,0.0,0.0,1.0,0.0,0.0,19.0,0.0,0.0,5.0,6.0,3.0,11.0,0.0,0.0,5.0,6.0,2.0,10.75,0.0,9.4,0.3,-1.7
50%,1.0,17.0,1.0,9238.0,1.0,1.0,133.1,1.0,19.0,19.0,5.0,7.0,126.1,1.0,0.0,0.0,1.0,0.0,0.0,20.0,0.0,0.0,6.0,8.0,5.0,12.285714,0.0,0.0,6.0,8.0,5.0,12.2,0.0,11.1,1.4,0.32
75%,1.0,39.0,2.0,9556.0,1.0,1.0,140.0,1.0,37.0,37.0,9.0,9.0,134.8,1.0,0.0,0.0,1.0,1.0,0.0,25.0,0.0,0.0,7.0,10.0,6.0,13.4,0.0,0.0,7.0,10.0,6.0,13.333333,0.0,13.9,2.6,1.79
max,6.0,57.0,9.0,9991.0,1.0,43.0,190.0,109.0,44.0,44.0,194.0,195.0,190.0,1.0,1.0,1.0,1.0,1.0,1.0,70.0,1.0,20.0,26.0,45.0,26.0,18.875,12.0,19.0,23.0,33.0,20.0,18.571429,12.0,16.2,3.7,3.51


In [21]:
df.duplicated().sum()

0

There are no duplicate rows.

In [23]:
df.isnull().sum().sort_values()

MaritalStat           0
International         0
CurrUnits1stCred      0
CurrUnits1stEnroll    0
CurrUnits1stEval      0
CurrUnits1stAppr      0
CurrUnits1stGrade     0
CurrUnits1stNoEval    0
CurrUnits2ndCred      0
CurrUnits2ndEnroll    0
CurrUnits2ndEval      0
CurrUnits2ndAppr      0
CurrUnits2ndGrade     0
CurrUnits2ndNoEval    0
UnempRate             0
InflRate              0
AgeEnroll             0
GDP                   0
Scholarship           0
FeesUpToDate          0
AppMode               0
AppOrder              0
Course                0
DayEvenAttend         0
PrevQual              0
PrevQualGrade         0
Nationality           0
MotherQual            0
FatherQual            0
MotherOcc             0
FatherOcc             0
AdmGrade              0
Displaced             0
EduSpecNeeds          0
Debtor                0
Gender                0
Target                0
dtype: int64

Good, there are no missing values.
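With 37 columns, a long list of zeros is easy to misread. As a sketch, the duplicate and missing-value checks can be reduced to a compact summary that shows only the columns with a problem. The toy frame below deliberately contains one missing value and one duplicated row so that both checks have something to report; its values are made up.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 3 repeats row 0, and row 2 has a missing age.
toy = pd.DataFrame({
    "AgeEnroll": [20, 19, np.nan, 20],
    "Target": ["Dropout", "Graduate", "Dropout", "Dropout"],
})

# Keep only the columns that actually have missing values.
missing = toy.isna().sum()
missing = missing[missing > 0]

# Count rows that exactly repeat an earlier row.
n_dupes = int(toy.duplicated().sum())

print(missing)
print("duplicate rows:", n_dupes)
```

On the notebook's `df`, the filtered `missing` series would be empty and `n_dupes` would be `0`, matching the outputs above.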

Let's look at the distribution of values in each column.

In [27]:
for col in df.columns:
    print(df[col].value_counts())
    print()

MaritalStat
1    3919
2     379
4      91
5      25
6       6
3       4
Name: count, dtype: int64

AppMode
1     1708
17     872
39     785
43     312
44     213
7      139
18     124
42      77
51      59
16      38
53      35
15      30
5       16
10      10
2        3
57       1
26       1
27       1
Name: count, dtype: int64

AppOrder
1    3026
2     547
3     309
4     249
5     154
6     137
9       1
0       1
Name: count, dtype: int64

Course
9500    766
9147    380
9238    355
9085    337
9773    331
9670    268
9991    268
9254    252
9070    226
171     215
8014    215
9003    210
9853    192
9119    170
9130    141
9556     86
33       12
Name: count, dtype: int64

DayEvenAttend
1    3941
0     483
Name: count, dtype: int64

PrevQual
1     3717
39     219
19     162
3      126
12      45
40      40
42      36
2       23
6       16
9       11
4        8
38       7
43       6
10       4
15       2
5        1
14       1
Name: count, dtype: int64

PrevQualGrade
133.1    491
130
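
Printing `value_counts()` for every column produces very long output for continuous columns such as `PrevQualGrade`. One possible refinement, sketched below on a made-up frame, is to summarize only low-cardinality columns and to show shares alongside counts. The cutoff `MAX_LEVELS` and the names `toy` and `summaries` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame: a binary flag, a continuous grade, and the target.
toy = pd.DataFrame({
    "DayEvenAttend": [1, 1, 0, 1],
    "PrevQualGrade": [122.0, 160.0, 133.1, 100.0],
    "Target": ["Dropout", "Graduate", "Dropout", "Enrolled"],
})

MAX_LEVELS = 3  # assumed cutoff: skip columns with more distinct values than this

summaries = {}
for col in toy.columns:
    if toy[col].nunique() > MAX_LEVELS:
        continue  # too many distinct values to read as a frequency table
    summaries[col] = pd.DataFrame({
        "count": toy[col].value_counts(),
        "share": toy[col].value_counts(normalize=True).round(2),
    })

for col, table in summaries.items():
    print(col)
    print(table)
    print()
```

For the `Target` column in particular, the share column gives a direct view of how balanced the three classes are.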

## Exploratory Data Analysis