# Student Dropout Prediction Challenge

## Introduction

Every year, a large number of students drop out of college, which affects both the students and their institutions. Predicting dropout can help identify at-risk students early. Universities can use such predictions to find the students who are likely to drop out and work with them ahead of time to address the obstacles they face.

### Objective

The goal of this project is to predict whether a student will complete his or her education,
that is, whether or not the student will drop out of an enrolled course.

### Data

The data cover students pursuing a bachelor's degree from 2012 to 2017. The dataset is
divided into three parts:
1. Static Data
2. Progress Data
3. Financial Aid Data

In [1440]:
#import Python modules
import pandas as pd   
import numpy as np
import random  
import matplotlib.pyplot as plt
import seaborn as sns

# Module for Train Test split
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn import metrics
from sklearn.metrics import confusion_matrix
%matplotlib inline

## Student Static Data

In [1441]:
#Import the datasets 
df1 = pd.read_csv('Student Static Data/Fall 2011_ST.csv')
df2 = pd.read_csv('Student Static Data/Fall 2012.csv')
df3 = pd.read_csv('Student Static Data/Fall 2013.csv')
df4 = pd.read_csv('Student Static Data/Fall 2014.csv')
df5 = pd.read_csv('Student Static Data/Fall 2015.csv')
df6 = pd.read_csv('Student Static Data/Fall 2016.csv')
df7 = pd.read_csv('Student Static Data/Spring 2012_ST.csv')
df8 = pd.read_csv('Student Static Data/Spring 2013.csv')
df9 = pd.read_csv('Student Static Data/Spring 2014.csv')
df11 = pd.read_csv('Student Static Data/Spring 2015.csv')
df12 = pd.read_csv('Student Static Data/Spring 2016.csv')

In [1442]:
frames = [df1, df2, df3,df4,df5,df6,df7,df8,df9,df11,df12]

In [1443]:
Student_Static_Data = pd.concat(frames)
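
The `info()` output further down reports an index running "0 to 502" for 13,261 rows, which suggests that each file's own row labels were kept when the frames were stacked. The following is a minimal sketch, using two tiny made-up frames in place of the real CSV files, of how `ignore_index=True` gives a unique index. The `SourceFile` column is a hypothetical addition that records where each row came from.

```python
import pandas as pd

# Two toy frames standing in for the per-term CSV files (values are made up).
fall = pd.DataFrame({"StudentID": [1, 2], "CohortTerm": [1, 1]})
spring = pd.DataFrame({"StudentID": [3], "CohortTerm": [3]})

# Default concat keeps each frame's own row labels, so labels repeat.
stacked = pd.concat([fall, spring])
print(stacked.index.tolist())   # [0, 1, 0]

# ignore_index=True renumbers the rows 0..n-1.
# "SourceFile" is a hypothetical column added here to remember the origin.
labelled = [
    frame.assign(SourceFile=name)
    for name, frame in {"fall": fall, "spring": spring}.items()
]
combined = pd.concat(labelled, ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```

A unique index matters mostly for label-based selection with `.loc`, where repeated labels would return several rows at once.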

In [1444]:
Student_Static_Data.head(10)

| | StudentID | Cohort | CohortTerm | Campus | Address1 | Address2 | City | State | Zip | RegistrationDate | ... | DualHSSummerEnroll | EnrollmentStatus | NumColCredAttemptTransfer | NumColCredAcceptTransfer | CumLoanAtEntry | HighDeg | MathPlacement | EngPlacement | GatewayMathStatus | GatewayEnglishStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285848 | 2011-12 | 1 | | 328 Adams St Apt 1 | | Hoboken | NJ | 7030.0 | 20110808 | ... | 0 | 2 | 0.0 | 0.0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 302176 | 2011-12 | 1 | | 142 Cherry St | | Jersey City | NJ | 7305.0 | 20110804 | ... | 0 | 2 | 96.0 | 45.0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 301803 | 2011-12 | 1 | | 12 Rainbow Street | | Presque Isle | ME | 4769.0 | 20110809 | ... | 0 | 2 | 0.0 | 0.0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 302756 | 2011-12 | 1 | | 345 4th St Apt 2 | | Jersey City | NJ | 7302.0 | 20110823 | ... | 0 | 2 | 54.0 | 87.5 | -1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 300304 | 2011-12 | 1 | | 6600 Broadway | Apt 3D | West New York | NJ | 7093.0 | 20110725 | ... | 0 | 1 | -2.0 | -2.0 | -2 | 0 | 1 | 0 | 0 | 0 |
| 5 | 301067 | 2011-12 | 1 | | 240 3rd St | | Jersey City | NJ | 7302.0 | 20110420 | ... | 0 | 2 | 70.0 | 66.0 | -1 | 2 | 0 | 0 | 0 | 0 |
| 6 | 297371 | 2011-12 | 1 | | 15A Claremont Ave | | Jersey City | NJ | 7305.0 | 20110628 | ... | 0 | 1 | -2.0 | -2.0 | -2 | 0 | 0 | 1 | 1 | 0 |
| 7 | 273211 | 2011-12 | 1 | | 274 Jersey Ave Apt 7 | | Cliffside Park | NJ | 7306.0 | 20110810 | ... | 0 | 2 | 62.0 | 66.0 | -1 | 2 | 0 | 0 | 0 | 0 |
| 8 | 302772 | 2011-12 | 1 | | 6 Dogwood Ct | | Sayreville | NJ | 8872.0 | 20110908 | ... | 0 | 2 | 53.0 | 45.0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 280023 | 2011-12 | 1 | | 266 Palisade Ave Apt 4D | | Jersey City | NJ | 7307.0 | 20110714 | ... | 0 | 2 | 52.0 | 66.0 | -1 | 0 | 0 | 0 | 0 | 0 |


In [1445]:
Student_Static_Data.shape

(13261, 35)

In [1446]:
Student_Static_Data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13261 entries, 0 to 502
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  13261 non-null  int64  
 1   Cohort                     13261 non-null  object 
 2   CohortTerm                 13261 non-null  int64  
 3   Campus                     0 non-null      float64
 4   Address1                   13148 non-null  object 
 5   Address2                   389 non-null    object 
 6   City                       13147 non-null  object 
 7   State                      13148 non-null  object 
 8   Zip                        13127 non-null  float64
 9   RegistrationDate           13261 non-null  int64  
 10  Gender                     13261 non-null  int64  
 11  BirthYear                  13260 non-null  float64
 12  BirthMonth                 13261 non-null  int64  
 13  Hispanic                   13261 non-null  int64

In [1447]:
Student_Static_Data[['BirthYear','Campus','HSGPAUnwtd','HSGPAWtd','FirstGen','DualHSSummerEnroll','NumColCredAttemptTransfer','NumColCredAcceptTransfer','CumLoanAtEntry']].describe()

| | BirthYear | Campus | HSGPAUnwtd | HSGPAWtd | FirstGen | DualHSSummerEnroll | NumColCredAttemptTransfer | NumColCredAcceptTransfer | CumLoanAtEntry |
|---|---|---|---|---|---|---|---|---|---|
| count | 13260.0 | 0.0 | 13261.0 | 13261.0 | 13261.0 | 13261.0 | 13261.0 | 13261.0 | 13261.0 |
| mean | 1988.916591 | | 0.162432 | -1.0 | -1.0 | 0.0 | 36.970166 | 31.765685 | -1.41113 |
| std | 8.46401 | | 1.808477 | 0.0 | 0.0 | 0.0 | 43.403093 | 34.535089 | 0.492057 |
| min | 1945.0 | | -1.0 | -1.0 | -1.0 | 0.0 | -2.0 | -2.0 | -2.0 |
| 25% | 1986.0 | | -1.0 | -1.0 | -1.0 | 0.0 | -2.0 | -2.0 | -2.0 |
| 50% | 1992.0 | | -1.0 | -1.0 | -1.0 | 0.0 | 14.0 | 22.0 | -1.0 |
| 75% | 1995.0 | | 2.4 | -1.0 | -1.0 | 0.0 | 73.0 | 66.0 | -1.0 |
| max | 2000.0 | | 4.0 | -1.0 | -1.0 | 0.0 | 150.0 | 96.0 | -1.0 |
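
The summary statistics above are distorted by the codes -1 and -2: `HSGPAUnwtd`, for example, has a mean of about 0.16 even though its 75th percentile is 2.4. Before cleaning column by column, it can help to see how much of each column consists of these codes. The sketch below runs on toy values and assumes that both -1 and -2 are sentinel codes rather than measurements. The notebook treats -1 as missing, while the meaning of -2 is an assumption here.

```python
import pandas as pd

# Toy frame; the values are illustrative, not taken from the real files.
toy = pd.DataFrame({
    "HSGPAUnwtd": [-1.0, 2.7, -1.0, 3.1],
    "NumColCredAttemptTransfer": [-2.0, 0.0, 65.0, -1.0],
    "CumLoanAtEntry": [-1, -2, -1, -2],
})

# Assumption: -1 and -2 are sentinel codes rather than real measurements.
SENTINELS = [-1, -2]

# Fraction of rows in each column that hold a sentinel code.
sentinel_share = toy.isin(SENTINELS).mean().sort_values(ascending=False)
print(sentinel_share)
```

Columns whose share is close to 1.0 are natural candidates for dropping, which is what the cleaning steps in this section do by hand.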


### Data Cleaning for Student Static Data

In [1448]:
#All Campus values are missing, so drop this column
Student_Static_Data=Student_Static_Data.drop(columns=['Campus'])

In [1449]:
#Fill the one missing BirthYear with the rounded mean (1989)
Student_Static_Data[['BirthYear']]=Student_Static_Data[['BirthYear']].fillna(1989)

In [1450]:
Student_Static_Data['HSDip'].value_counts()

 1    12887
-1      289
 2       75
 4       10
Name: HSDip, dtype: int64

In [1451]:
#Missing values for HSDip can be filled with 1, as a high school certificate is required to join a university course
Student_Static_Data[['HSDip']]=Student_Static_Data[['HSDip']].replace(to_replace =-1,value =1)

In [1452]:
Student_Static_Data['HSGPAUnwtd'].value_counts()

-1.00    9318
 2.70     213
 2.80     199
 2.50     192
 2.60     166
         ... 
 2.05       1
 2.11       1
 3.86       1
 1.89       1
 1.91       1
Name: HSGPAUnwtd, Length: 214, dtype: int64

In [1453]:
#Replace the missing-value code (-1) in HSGPAUnwtd with zero
Student_Static_Data[['HSGPAUnwtd']]=Student_Static_Data[['HSGPAUnwtd']].replace(to_replace =-1,value =0)
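
Replacing -1 with 0 makes a missing GPA indistinguishable from a genuine 0.0 and pulls the column's average down. One possible alternative, sketched here on toy values, is to keep a separate indicator column and fill the gaps with the median of the observed GPAs. This is a different imputation choice from the one the notebook makes, and the column name `HSGPAUnwtd_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"HSGPAUnwtd": [-1.0, 2.7, 2.5, -1.0, 3.1]})

# Treat the -1 sentinel as a true missing value.
gpa = toy["HSGPAUnwtd"].replace(-1, np.nan)

# Hypothetical indicator column: 1 where the GPA was not reported.
toy["HSGPAUnwtd_missing"] = gpa.isna().astype(int)

# Fill the gaps with the median of the observed GPAs (2.7 for these values).
toy["HSGPAUnwtd"] = gpa.fillna(gpa.median())
print(toy)
```

The indicator lets a model learn whether the absence of a GPA is itself informative, instead of reading it as a very low grade.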

In [1454]:
Student_Static_Data['HSGPAWtd'].value_counts()

-1    13261
Name: HSGPAWtd, dtype: int64

In [1455]:
#All values of HSGPAWtd are missing,drop this column
Student_Static_Data=Student_Static_Data.drop(columns=['HSGPAWtd'])

In [1456]:
Student_Static_Data['FirstGen'].value_counts()

-1    13261
Name: FirstGen, dtype: int64

In [1457]:
#All values of FirstGen are missing, so drop this column
Student_Static_Data=Student_Static_Data.drop(columns=['FirstGen'])

In [1458]:
Student_Static_Data['DualHSSummerEnroll'].value_counts()

0    13261
Name: DualHSSummerEnroll, dtype: int64

In [1459]:
#All values of DualHSSummerEnroll are 0, so it has no effect on the analysis,drop this column
Student_Static_Data=Student_Static_Data.drop(columns=['DualHSSummerEnroll'])

In [1460]:
Student_Static_Data['NumColCredAttemptTransfer'].value_counts()

-2.00      5452
 0.00       683
-1.00       397
 65.00      213
 68.00      168
           ... 
 44.10        1
 114.50       1
 50.75        1
 126.50       1
 23.50        1
Name: NumColCredAttemptTransfer, Length: 310, dtype: int64

In [1461]:
#Fill the missing value of NumColCredAttemptTransfer as zero
Student_Static_Data[['NumColCredAttemptTransfer']]=Student_Static_Data[['NumColCredAttemptTransfer']].replace(to_replace =-1,value =0)

In [1462]:
Student_Static_Data['NumColCredAcceptTransfer'].value_counts()

-2.0     5452
 66.0    1560
 0.0      682
 96.0     487
 65.0     391
         ... 
 32.5       1
 91.5       1
 84.5       1
 66.5       1
 18.5       1
Name: NumColCredAcceptTransfer, Length: 178, dtype: int64

In [1463]:
#Fill the missing value of NumColCredAcceptTransfer as zero
Student_Static_Data[['NumColCredAcceptTransfer']]=Student_Static_Data[['NumColCredAcceptTransfer']].replace(to_replace =-1,value =0)

In [1464]:
Student_Static_Data['CumLoanAtEntry'].value_counts()

-1    7809
-2    5452
Name: CumLoanAtEntry, dtype: int64

In [1465]:
#All values of CumLoanAtEntry are -1 or -2 (no actual loan amounts), so it has no effect on the analysis; drop this column
Student_Static_Data=Student_Static_Data.drop(columns=['CumLoanAtEntry'])

In [1466]:
Student_Static_Data['MathPlacement'].value_counts()

 0    8415
 1    4275
-1     571
Name: MathPlacement, dtype: int64

In [1467]:
#Fill the missing value of MathPlacement as zero
Student_Static_Data[['MathPlacement']]=Student_Static_Data[['MathPlacement']].replace(to_replace =-1,value =0)

In [1468]:
Student_Static_Data['EngPlacement'].value_counts()

 0    9640
 1    3050
-1     571
Name: EngPlacement, dtype: int64

In [1469]:
#Fill the missing value of EngPlacement as zero
Student_Static_Data[['EngPlacement']]=Student_Static_Data[['EngPlacement']].replace(to_replace =-1,value =0)

### Combine the Separate Race Columns into One Column to Simplify the Analysis

In [1470]:
Student_Static_Data.loc[Student_Static_Data['Hispanic'] == 1,'Race'] = 'Hispanic'

In [1471]:
Student_Static_Data.loc[Student_Static_Data['AmericanIndian'] == 1,'Race'] = 'AmericanIndian'

In [1472]:
Student_Static_Data.loc[Student_Static_Data['NativeHawaiian'] == 1,'Race'] = 'NativeHawaiian'

In [1473]:
Student_Static_Data.loc[Student_Static_Data['Asian'] == 1,'Race'] = 'Asian'

In [1474]:
Student_Static_Data.loc[Student_Static_Data['Black'] == 1,'Race'] = 'Black'

In [1475]:
Student_Static_Data.loc[Student_Static_Data['White'] == 1,'Race'] = 'White'

In [1476]:
Student_Static_Data.loc[Student_Static_Data['TwoOrMoreRace'] == 1,'Race'] = 'TwoOrMoreRace'

In [1477]:
#Fill missing Race values with 'Unknown'
Student_Static_Data[['Race']]=Student_Static_Data[['Race']].fillna('Unknown')
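
Because the assignments above run one after another, a student flagged in more than one race column ends up with whichever label was assigned last (`TwoOrMoreRace` over `White`, `White` over `Black`, and so on). The sketch below expresses the same precedence in a single step with `numpy.select`, which takes the first matching condition. It runs on toy flags, not the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Hispanic":       [1, 0, 1, 0],
    "AmericanIndian": [0, 0, 0, 0],
    "NativeHawaiian": [0, 0, 0, 0],
    "Asian":          [0, 1, 0, 0],
    "Black":          [0, 0, 0, 0],
    "White":          [0, 0, 1, 0],
    "TwoOrMoreRace":  [0, 0, 0, 0],
})

# Listed from highest to lowest precedence: the reverse of the order in which
# the notebook's .loc assignments run, since the last assignment wins there.
priority = ["TwoOrMoreRace", "White", "Black", "Asian",
            "NativeHawaiian", "AmericanIndian", "Hispanic"]

conditions = [toy[col] == 1 for col in priority]
toy["Race"] = np.select(conditions, priority, default="Unknown")
print(toy["Race"].tolist())  # ['Hispanic', 'Asian', 'White', 'Unknown']
```

Writing the order down explicitly makes it easier to review whether, for instance, a student flagged as both Hispanic and White should really be labelled White.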

In [1478]:
#Drop the Race Columns
Student_Static_Data=Student_Static_Data.drop(columns=['Hispanic','AmericanIndian','Asian','Black','White','TwoOrMoreRace','NativeHawaiian'])

In [1479]:
#Drop unwanted columns
Student_Static_Data=Student_Static_Data.drop(columns=['Address1','Address2','Zip','RegistrationDate','City','State'])

In [1480]:
#import DropoutTrainLabels dataset
DropoutTrainLabels = pd.read_csv('DropoutTrainLabels.csv')

In [1481]:
#merge the two datasets
Student_Static_Data_to_model= pd.merge( Student_Static_Data,DropoutTrainLabels,on='StudentID')
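
`pd.merge` performs an inner join by default, so students without a row in the label file are silently dropped, which is consistent with the row count going from 13,261 to 12,261. A minimal sketch, on toy IDs, of how `indicator=True` can show which students have no label before they are discarded:

```python
import pandas as pd

static = pd.DataFrame({"StudentID": [1, 2, 3], "Gender": [1, 2, 1]})
labels = pd.DataFrame({"StudentID": [1, 3], "Dropout": [0, 1]})

# A left merge keeps every student; the _merge column records whether a
# matching label row was found. validate guards against duplicate IDs.
checked = pd.merge(static, labels, on="StudentID", how="left",
                   indicator=True, validate="one_to_one")
unlabelled = checked.loc[checked["_merge"] == "left_only", "StudentID"]
print(unlabelled.tolist())  # [2]

# Keeping only matched rows reproduces the inner merge.
modelling = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(len(modelling))       # 2
```

Whether `one_to_one` is the right check for the real files is an assumption. If a student can appear in more than one term file, `many_to_one` would be the appropriate setting.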

In [1482]:
Student_Static_Data_to_model['Dropout'].value_counts()

0    7527
1    4734
Name: Dropout, dtype: int64

In [1483]:
Student_Static_Data_to_model.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12261 entries, 0 to 12260
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  12261 non-null  int64  
 1   Cohort                     12261 non-null  object 
 2   CohortTerm                 12261 non-null  int64  
 3   Gender                     12261 non-null  int64  
 4   BirthYear                  12261 non-null  float64
 5   BirthMonth                 12261 non-null  int64  
 6   HSDip                      12261 non-null  int64  
 7   HSDipYr                    12261 non-null  int64  
 8   HSGPAUnwtd                 12261 non-null  float64
 9   EnrollmentStatus           12261 non-null  int64  
 10  NumColCredAttemptTransfer  12261 non-null  float64
 11  NumColCredAcceptTransfer   12261 non-null  float64
 12  HighDeg                    12261 non-null  int64  
 13  MathPlacement              12261 non-null  int

### Create the Dummy Variables

In [1484]:
Race_dummies = pd.get_dummies(Student_Static_Data_to_model.Race, prefix="Race")

In [1485]:
df_with_dummies = pd.concat([Student_Static_Data_to_model,Race_dummies],axis='columns')

In [1486]:
df_with_dummies=df_with_dummies.drop(columns=['Race'])

In [1487]:
df_with_dummies=df_with_dummies.drop(columns=['StudentID','Dropout','HSDip','Cohort'])

In [1488]:
X = df_with_dummies

In [1489]:
y = Student_Static_Data_to_model.Dropout

In [1490]:
X = sm.add_constant(X)



In [1491]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.617833
         Iterations 7


In [1492]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12239
Method:                           MLE   Df Model:                           21
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                 0.07368
Time:                        04:16:40   Log-Likelihood:                -7575.3
converged:                       True   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                5.028e-242
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                        71.5662        nan        nan        nan         nan         nan
CohortTerm                    0.2826      0.025     11.393      0.000       0.234     

In [1493]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [1494]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.63
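
The convergence warning above suggests either raising `max_iter` or scaling the data. A sketch of the scaling option follows: a scikit-learn pipeline that standardizes the features before fitting the logistic regression. The data are synthetic, with deliberately mismatched scales, and stand in for the real feature matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: one column near 1990 (like a
# birth year) and one small-valued column (like a GPA).
rng = np.random.default_rng(0)
birth_year = rng.normal(1990, 8, size=500)
gpa = rng.uniform(0, 4, size=500)
X_demo = np.column_stack([birth_year, gpa])
y_demo = (gpa + rng.normal(0, 1, size=500) < 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fitted on the training split only, then applied to both.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_tr, y_tr)
print('Test accuracy: {:.2f}'.format(scaled_logreg.score(X_te, y_te)))
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step, which a separate call to `fit_transform` on the full data would not.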


#### Drop the race dummy columns: their p-values are NaN

In [1497]:
X=X.drop(columns=['Race_AmericanIndian','Race_Asian','Race_Black','Race_Hispanic','Race_NativeHawaiian','Race_TwoOrMoreRace','Race_Unknown','Race_White'])
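
The NaN standard errors in the first summary are consistent with what happens when a constant is combined with a full set of dummy columns: the dummies add up to one in every row and so duplicate the constant. Rather than removing race altogether, one option is to leave out a single reference level. The sketch below illustrates the idea on toy data by comparing the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Race": ["White", "Black", "Asian", "White", "Unknown"]})

# Full dummy set plus a constant: the dummy columns add up to the constant,
# so the design matrix has fewer independent columns than it has columns.
full = pd.get_dummies(toy["Race"], prefix="Race").astype(float)
full.insert(0, "const", 1.0)
print(full.shape[1], np.linalg.matrix_rank(full.to_numpy()))        # 5 4

# drop_first=True leaves out one level, which becomes the reference category.
reduced = pd.get_dummies(toy["Race"], prefix="Race", drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)
print(reduced.shape[1], np.linalg.matrix_rank(reduced.to_numpy()))  # 4 4
```

With a reference level omitted, each remaining race coefficient is read as a difference from that reference group.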

In [1499]:
X = sm.add_constant(X)



In [1500]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.618954
         Iterations 5


In [1501]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12246
Method:                           MLE   Df Model:                           14
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                 0.07199
Time:                        04:25:02   Log-Likelihood:                -7589.0
converged:                       True   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                1.187e-242
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                        82.3434      5.422     15.187      0.000      71.717      92.970
CohortTerm                    0.2833      0.025     11.430      0.000       0.235     

In [1507]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=6000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.65
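
To put the 0.63 and 0.65 accuracies in context, it helps to compare them with a model that always predicts the majority class. The calculation below uses the label counts printed earlier (7,527 non-dropouts and 4,734 dropouts). It is worked out over the full labelled set, so the proportion in a particular test split may differ slightly.

```python
# Label counts taken from the Dropout value_counts() output above.
n_stay = 7527      # Dropout == 0
n_dropout = 4734   # Dropout == 1

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy = max(n_stay, n_dropout) / (n_stay + n_dropout)
print('Majority-class baseline: {:.2f}'.format(baseline_accuracy))  # 0.61
```

The static-data models are therefore only a few points above this baseline, which is worth keeping in mind when comparing them with the later models.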


## Student Progress Data

In [1651]:
#Import the datasets 
df1 = pd.read_csv('Student Progress Data/Fall 2011_SP.csv')
df2 = pd.read_csv('Student Progress Data/Fall 2012_SP.csv')
df3 = pd.read_csv('Student Progress Data/Fall 2013_SP.csv')
df4 = pd.read_csv('Student Progress Data/Fall 2014_SP.csv')
df5 = pd.read_csv('Student Progress Data/Fall 2015_SP.csv')
df6 = pd.read_csv('Student Progress Data/Fall 2016_SP.csv')
df7 = pd.read_csv('Student Progress Data/Spring 2012_SP.csv')
df8 = pd.read_csv('Student Progress Data/Spring 2013_SP.csv')
df9 = pd.read_csv('Student Progress Data/Spring 2014_SP.csv')
df11 = pd.read_csv('Student Progress Data/Spring 2015_SP.csv')
df12 = pd.read_csv('Student Progress Data/Spring 2016_SP.csv')
df13 = pd.read_csv('Student Progress Data/Spring 2017_SP.csv')
df14 = pd.read_csv('Student Progress Data/Sum 2012.csv')
df15 = pd.read_csv('Student Progress Data/Sum 2013.csv')
df16 = pd.read_csv('Student Progress Data/Sum 2014.csv')
df17 = pd.read_csv('Student Progress Data/Sum 2015.csv')
df18 = pd.read_csv('Student Progress Data/Sum 2016.csv')
df19 = pd.read_csv('Student Progress Data/Sum 2017.csv')

In [1652]:
frames = [df1, df2, df3,df4,df5,df6,df7,df8,df9,df11,df12,df13,df14,df15,df16,df17,df18,df19]

In [1653]:
Student_Progress_Data = pd.concat(frames)

In [1654]:
Student_Progress_Data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57945 entries, 0 to 1218
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   StudentID           57945 non-null  int64  
 1   Cohort              57945 non-null  object 
 2   CohortTerm          57945 non-null  int64  
 3   Term                57945 non-null  int64  
 4   AcademicYear        57945 non-null  object 
 5   CompleteDevMath     57945 non-null  int64  
 6   CompleteDevEnglish  57945 non-null  int64  
 7   Major1              57945 non-null  float64
 8   Major2              57945 non-null  float64
 9   Complete1           57945 non-null  int64  
 10  Complete2           57945 non-null  int64  
 11  CompleteCIP1        57945 non-null  float64
 12  CompleteCIP2        57945 non-null  int64  
 13  TransferIntent      57945 non-null  int64  
 14  DegreeTypeSought    57945 non-null  int64  
 15  TermGPA             57945 non-null  float64
 16  CumGP

In [1655]:
Student_Progress_Data=Student_Progress_Data.drop_duplicates(subset=['StudentID'])
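
`drop_duplicates(subset=['StudentID'])` keeps only the first row encountered for each student, so all other terms are discarded. If the aim is one row per student, an alternative is to summarize every recorded term. The sketch below uses toy data, and the aggregate names (`FirstTermGPA`, `MeanTermGPA`, `TermsRecorded`) are hypothetical. Features built from later terms can partly encode the outcome itself, since a student who left has fewer recorded terms, so they would need care before being used for prediction.

```python
import pandas as pd

toy = pd.DataFrame({
    "StudentID": [1, 1, 1, 2, 2],
    "Term":      [1, 3, 1, 1, 3],
    "TermGPA":   [3.0, 3.4, 3.8, 2.0, 1.0],
})

# What the notebook does: keep the first row encountered per student.
first_rows = toy.drop_duplicates(subset=["StudentID"])
print(first_rows["TermGPA"].tolist())  # [3.0, 2.0]

# Alternative: one summary row per student across all recorded terms.
per_student = toy.groupby("StudentID").agg(
    FirstTermGPA=("TermGPA", "first"),
    MeanTermGPA=("TermGPA", "mean"),
    TermsRecorded=("TermGPA", "size"),
).reset_index()
print(per_student)
```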

In [1656]:
#import DropoutTrainLabels dataset
DropoutTrainLabels = pd.read_csv('DropoutTrainLabels.csv')

In [1657]:
#merge the two datasets
Student_Progress_Data_to_model= pd.merge( Student_Progress_Data,DropoutTrainLabels,on='StudentID')

In [1658]:
Student_Progress_Data_stastics=Student_Progress_Data_to_model[['TermGPA','CumGPA','CompleteCIP1','CompleteCIP2']]

In [1659]:
Student_Progress_Data_stastics.describe()

| | TermGPA | CumGPA | CompleteCIP1 | CompleteCIP2 |
|---|---|---|---|---|
| count | 12261.0 | 12261.0 | 12261.0 | 12261.0 |
| mean | 2.857813 | 2.853022 | -1.960287 | -2.0 |
| std | 1.102113 | 1.077837 | 1.365114 | 0.0 |
| min | 0.0 | 0.0 | -2.0 | -2.0 |
| 25% | 2.43 | 2.42 | -2.0 | -2.0 |
| 50% | 3.18 | 3.17 | -2.0 | -2.0 |
| 75% | 3.68 | 3.66 | -2.0 | -2.0 |
| max | 4.0 | 4.0 | 52.1401 | -2.0 |


### Data Cleaning for Student Progress Data

In [1660]:
Student_Progress_Data_to_model['Major2'].value_counts()

-1.0000     12036
 13.1209       84
 13.1202       38
 13.1001       24
 43.0199       10
 52.1401       10
 45.1101        9
 52.0201        8
 50.0701        7
 43.0399        5
 42.0101        5
 52.0801        5
 27.0101        4
 52.0301        3
 54.0101        2
 11.0101        2
 40.0501        2
 40.0801        2
 45.0601        2
 50.0901        1
 45.1001        1
 38.0101        1
Name: Major2, dtype: int64

In [1661]:
#Most Major2 values are -1 (missing), so drop this column
Student_Progress_Data_to_model=Student_Progress_Data_to_model.drop(columns=['Major2'])

In [1662]:
Student_Progress_Data_to_model['Complete2'].value_counts()

0    12261
Name: Complete2, dtype: int64

In [1663]:
#All Complete2 values are 0, so drop this column
Student_Progress_Data_to_model=Student_Progress_Data_to_model.drop(columns=['Complete2'])

In [1664]:
Student_Progress_Data_to_model['CompleteCIP1'].value_counts()

-2.0000     12250
 43.0399        2
 43.0199        1
 52.0801        1
 51.0000        1
 16.0905        1
 26.0101        1
 51.3801        1
 52.1401        1
 45.1101        1
 42.0101        1
Name: CompleteCIP1, dtype: int64

In [1665]:
#Almost all CompleteCIP1 values are -2, so drop this column
Student_Progress_Data_to_model=Student_Progress_Data_to_model.drop(columns=['CompleteCIP1'])

In [1666]:
Student_Progress_Data_to_model['DegreeTypeSought'].value_counts()

6    12261
Name: DegreeTypeSought, dtype: int64

In [1667]:
#All students are seeking a bachelor's degree, so drop this column
Student_Progress_Data_to_model=Student_Progress_Data_to_model.drop(columns=['DegreeTypeSought'])

In [1668]:
Student_Progress_Data_to_model['TransferIntent'].value_counts()

-1    12261
Name: TransferIntent, dtype: int64

In [1669]:
#All TransferIntent values are -1, so drop this column
Student_Progress_Data_to_model=Student_Progress_Data_to_model.drop(columns=['TransferIntent'])

### Create the Dummy Variables

In [1670]:
Cohort_dummies = pd.get_dummies(Student_Progress_Data_to_model.Cohort, prefix="Cohort")

In [1671]:
df_with_dummies = pd.concat([Student_Progress_Data_to_model,Cohort_dummies],axis='columns')

In [1672]:
df_with_dummies=df_with_dummies.drop(columns=['Cohort'])

In [1673]:
AcademicYear_dummies = pd.get_dummies(df_with_dummies.AcademicYear, prefix="AcademicYear")

In [1674]:
df_with_dummies = pd.concat([df_with_dummies,AcademicYear_dummies],axis='columns')

In [1675]:
df_with_dummies=df_with_dummies.drop(columns=['AcademicYear'])

In [1677]:
y = df_with_dummies.Dropout

In [1678]:
df_with_dummies=df_with_dummies.drop(columns=['StudentID','Dropout','CompleteDevMath','CompleteDevEnglish','CohortTerm','Term','Complete1','Major1'])
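
The `describe()` output earlier shows `CompleteCIP2` with a standard deviation of 0 (every value is -2), yet it is not among the columns dropped here. A constant column behaves like an intercept, and `sm.add_constant` by default leaves the data unchanged when it detects one, which may explain why the summaries that follow list `CompleteCIP2` but no `const` row. A small sketch, on toy values, for spotting such columns automatically:

```python
import pandas as pd

toy = pd.DataFrame({
    "CompleteCIP2": [-2, -2, -2, -2],
    "TermGPA": [3.1, 2.4, 0.0, 3.9],
    "CumGPA": [3.0, 2.5, 0.0, 3.8],
})

# Columns with a single distinct value carry no information for the model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['CompleteCIP2']

features = toy.drop(columns=constant_cols)
print(features.columns.tolist())  # ['TermGPA', 'CumGPA']
```

Dropping the constant column first would let `add_constant` insert a proper `const` term, so the intercept is reported under its own name.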

In [1679]:
X = df_with_dummies

In [1680]:
X = sm.add_constant(X)



In [1681]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.504799
         Iterations 34


In [1683]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12248
Method:                           MLE   Df Model:                           12
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                  0.2431
Time:                        18:25:15   Log-Likelihood:                -6189.3
converged:                       True   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                     0.000
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
CompleteCIP2             7.4421   2.46e+07   3.02e-07      1.000   -4.82e+07    4.82e+07
TermGPA                 -0.5627      0.069     -8.192      0.000      -0.697      -0.428
CumGPA      

In [1684]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=6000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.71


### Drop the Cohort and Academic Year Dummies: Their p-values Are High

In [1686]:
X=X.drop(columns=['Cohort_2011-12','Cohort_2012-13','Cohort_2013-14','Cohort_2014-15','Cohort_2015-16','Cohort_2016-17','AcademicYear_2011-12','AcademicYear_2012-13','AcademicYear_2013-14','AcademicYear_2014-15','AcademicYear_2015-16','AcademicYear_2016-17'])

In [1687]:
X = sm.add_constant(X)



In [1688]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.621915
         Iterations 5


In [1689]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12258
Method:                           MLE   Df Model:                            2
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                 0.06756
Time:                        18:35:20   Log-Likelihood:                -7625.3
converged:                       True   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                1.185e-240
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
CompleteCIP2    -0.5975      0.028    -21.012      0.000      -0.653      -0.542
TermGPA         -0.3683      0.060     -6.130      0.000      -0.486      -0.251
CumGPA          -0.2197      0.061     -3.57

In [1690]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=6000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.67


## Student Financial Data

In [1823]:
#Import the dataset 
Student_finacial_data = pd.read_csv('Student Financial Aid Data/2011-2017_Cohorts_Financial_Aid_and_Fafsa_Data.csv')

In [1824]:
Student_finacial_data_statics=Student_finacial_data.drop(columns=['ID with leading','cohort term',])

In [1825]:
Student_finacial_data_statics.describe()

| | Adjusted Gross Income | Parent Adjusted Gross Income | 2012 Loan | 2012 Scholarship | 2012 Work/Study | 2012 Grant | 2013 Loan | 2013 Scholarship | 2013 Work/Study | 2013 Grant | ... | 2015 Work/Study | 2015 Grant | 2016 Loan | 2016 Scholarship | 2016 Work/Study | 2016 Grant | 2017 Loan | 2017 Scholarship | 2017 Work/Study | 2017 Grant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11615.0 | 11615.0 | 1237.0 | 171.0 | 103.0 | 1354.0 | 2187.0 | 310.0 | 179.0 | 2319.0 | ... | 249.0 | 3404.0 | 3175.0 | 685.0 | 272.0 | 3694.0 | 3324.0 | 985.0 | 367.0 | 4037.0 |
| mean | 13124.92 | 28101.521653 | 7169.025869 | 5224.741813 | 1872.995146 | 6660.931617 | 7156.096278 | 4792.646903 | 2084.268156 | 7094.158098 | ... | 2127.195783 | 7369.863796 | 7625.026822 | 4897.317679 | 2036.352941 | 7458.963519 | 8256.243983 | 5024.025198 | 1928.544441 | 7794.156061 |
| std | 35574.85 | 42988.074677 | 6087.970826 | 5002.498203 | 665.153279 | 3811.967801 | 4807.548156 | 4413.27938 | 663.526606 | 3905.860943 | ... | 850.855342 | 4018.332071 | 4885.988167 | 3870.333736 | 597.392613 | 4058.66836 | 5472.801128 | 3891.554164 | 530.66608 | 4173.771027 |
| min | -24326.0 | -62979.0 | 337.0 | 283.0 | 200.0 | 79.09 | 103.0 | 23.0 | 25.0 | 162.0 | ... | 10.0 | 209.0 | 103.0 | 28.3 | 75.0 | 9.69 | 103.0 | 100.0 | 45.0 | 0.1 |
| 25% | 0.0 | 0.0 | 3500.0 | 2000.0 | 1700.0 | 3368.25 | 3500.0 | 2000.0 | 2000.0 | 3683.0 | ... | 2000.0 | 3880.0 | 4500.0 | 2000.0 | 2000.0 | 3963.25 | 5353.75 | 2000.0 | 1500.0 | 4261.0 |
| 50% | 2637.0 | 12372.0 | 5500.0 | 4000.0 | 2000.0 | 5794.0 | 5500.0 | 3548.65 | 2000.0 | 6089.0 | ... | 2000.0 | 6358.0 | 6420.0 | 4000.0 | 2000.0 | 6428.0 | 6500.0 | 4000.0 | 2000.0 | 7305.0 |
| 75% | 16323.0 | 38587.0 | 9500.0 | 6000.0 | 2121.0 | 10714.0 | 9500.0 | 6408.8 | 2200.0 | 11040.0 | ... | 2800.0 | 11592.0 | 10500.0 | 6000.0 | 2000.0 | 11717.5 | 11812.5 | 6906.1 | 2000.0 | 12173.0 |
| max | 2576425.0 | 657631.0 | 55626.0 | 27631.9 | 3000.0 | 13263.0 | 50555.0 | 28737.1 | 4000.0 | 13790.0 | ... | 4600.0 | 19038.0 | 52880.0 | 31265.5 | 4000.0 | 18505.0 | 60118.0 | 33847.9 | 3000.0 | 19823.0 |


In [1826]:
Student_finacial_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13769 entries, 0 to 13768
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ID with leading               13769 non-null  int64  
 1   cohort                        13769 non-null  object 
 2   cohort term                   13769 non-null  int64  
 3   Marital Status                11615 non-null  object 
 4   Adjusted Gross Income         11615 non-null  float64
 5   Parent Adjusted Gross Income  11615 non-null  float64
 6   Father's Highest Grade Level  11477 non-null  object 
 7   Mother's Highest Grade Level  11249 non-null  object 
 8   Housing                       11605 non-null  object 
 9   2012 Loan                     1237 non-null   float64
 10  2012 Scholarship              171 non-null    float64
 11  2012 Work/Study               103 non-null    float64
 12  2012 Grant                    1354 non-null   float64
 13  2

In [1827]:
Student_finacial_data['Marital Status'].value_counts()

Single       10155
Married       1024
Divorced       236
Separated      200
Name: Marital Status, dtype: int64

In [1828]:
#The majority of students are single, so fill the missing values with the majority class
Student_finacial_data[['Marital Status']]=Student_finacial_data[['Marital Status']].fillna('Single')

In [1829]:
Student_finacial_data['Housing'].value_counts()

Off Campus           5373
With Parent          4608
On Campus Housing    1624
Name: Housing, dtype: int64

In [1830]:
#Majority of the students are Off Campus, so fill the missing values with majority
Student_finacial_data[['Housing']]=Student_finacial_data[['Housing']].fillna('Off Campus')

In [1831]:
Student_finacial_data['Father\'s Highest Grade Level'].value_counts()

High School      5092
College          3284
Unknown          1771
Middle School    1330
Name: Father's Highest Grade Level, dtype: int64

In [1832]:
#fill the missing values with Unknown
Student_finacial_data[['Father\'s Highest Grade Level']]=Student_finacial_data[['Father\'s Highest Grade Level']].fillna('Unknown')

In [1833]:
Student_finacial_data['Mother\'s Highest Grade Level'].value_counts()

High School      5024
College          3215
Unknown          1714
Middle School    1296
Name: Mother's Highest Grade Level, dtype: int64

In [1834]:
#fill the missing values with Unknown
Student_finacial_data[['Mother\'s Highest Grade Level']]=Student_finacial_data[['Mother\'s Highest Grade Level']].fillna('Unknown')

In [1835]:
Student_finacial_data['Adjusted Gross Income'].mean()

13124.92363323289

In [1836]:
#Fill missing values with mean
Student_finacial_data[['Adjusted Gross Income']]=Student_finacial_data[['Adjusted Gross Income']].fillna(13124.92)

In [1837]:
Student_finacial_data['Parent Adjusted Gross Income'].mean()

28101.52165303487

In [1839]:
#Fill missing values with mean
Student_finacial_data[['Parent Adjusted Gross Income']]=Student_finacial_data[['Parent Adjusted Gross Income']].fillna(28101.52)
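
Both income columns are strongly right-skewed in the `describe()` output above: Adjusted Gross Income has a mean of about 13,125 but a median of 2,637 and a maximum above 2.5 million. The mean is therefore sensitive to a few very large values. The sketch below, on toy values, computes the fill value from the column instead of typing it in, and uses the median as one possible alternative to the mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Adjusted Gross Income": [0.0, 2500.0, np.nan, 16000.0, 2500000.0],
})

col = "Adjusted Gross Income"
print(toy[col].mean())    # 629625.0, dominated by the single large value
print(toy[col].median())  # 9250.0

# Fill from the computed statistic instead of a hard-coded number.
toy[col] = toy[col].fillna(toy[col].median())
print(toy[col].tolist())
```

Computing the statistic in code also keeps the fill value correct if the input file changes.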

In [1840]:
#import DropoutTrainLabels dataset
DropoutTrainLabels = pd.read_csv('DropoutTrainLabels.csv')

In [1841]:
#Rename 'ID with leading' to StudentID
Student_finacial_data = Student_finacial_data.rename(columns={'ID with leading': 'StudentID'})

In [1842]:
#merge the two datasets
Student_finacial_Data_to_model= pd.merge( Student_finacial_data,DropoutTrainLabels,on='StudentID')

In [1843]:
Student_finacial_Data_to_model.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12261 entries, 0 to 12260
Data columns (total 34 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   StudentID                     12261 non-null  int64  
 1   cohort                        12261 non-null  object 
 2   cohort term                   12261 non-null  int64  
 3   Marital Status                12261 non-null  object 
 4   Adjusted Gross Income         12261 non-null  float64
 5   Parent Adjusted Gross Income  12261 non-null  float64
 6   Father's Highest Grade Level  12261 non-null  object 
 7   Mother's Highest Grade Level  12261 non-null  object 
 8   Housing                       12261 non-null  object 
 9   2012 Loan                     1151 non-null   float64
 10  2012 Scholarship              160 non-null    float64
 11  2012 Work/Study               96 non-null     float64
 12  2012 Grant                    1257 non-null   float64
 13  2

### Create the Dummy Variables

In [1844]:
Marital_Status_dummies = pd.get_dummies(Student_finacial_Data_to_model['Marital Status'], prefix="Marital Status")

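In [1845]:
# Note: this result is overwritten by the concat in the next cell, which starts again from the merged frame
df_with_dummies=Student_finacial_Data_to_model.drop(columns=['Marital Status'])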

In [1846]:
df_with_dummies = pd.concat([Student_finacial_Data_to_model,Marital_Status_dummies],axis='columns')

In [1847]:
df_with_dummies=df_with_dummies.drop(columns=['Marital Status'])

In [1848]:
Housing_dummies = pd.get_dummies(df_with_dummies['Housing'], prefix="Housing")

In [1849]:
df_with_dummies=df_with_dummies.drop(columns=['Housing'])

In [1850]:
df_with_dummies = pd.concat([df_with_dummies,Housing_dummies],axis='columns')

In [1851]:
FatherHighestGradeLevel_dummies = pd.get_dummies(df_with_dummies['Father\'s Highest Grade Level'], prefix="FatherHighestGradeLevel")

In [1852]:
df_with_dummies=df_with_dummies.drop(columns=['Father\'s Highest Grade Level'])

In [1853]:
df_with_dummies = pd.concat([df_with_dummies,FatherHighestGradeLevel_dummies],axis='columns')

In [1854]:
MotherHighestGradeLevel_dummies = pd.get_dummies(df_with_dummies['Mother\'s Highest Grade Level'], prefix="MotherHighestGradeLevel")

In [1855]:
df_with_dummies=df_with_dummies.drop(columns=['Mother\'s Highest Grade Level'])

In [1856]:
df_with_dummies = pd.concat([df_with_dummies,MotherHighestGradeLevel_dummies],axis='columns')
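Each categorical variable above is expanded into a complete set of dummies, so every set sums to 1 in each row. Once `sm.add_constant` adds an intercept, the design matrix is perfectly collinear. That is consistent with the `converged: False` flag, the Df Model of 13 for 17 predictors, and the standard error of 7.64e+05 on the constant in the first summary below. A minimal sketch of the usual remedy, on a hypothetical toy frame, drops one reference level per variable with `drop_first=True`.

```python
import pandas as pd

# Hypothetical miniature of the merged frame
df = pd.DataFrame({
    "Adjusted Gross Income": [13124.92, 0.0, 52000.0, 8000.0],
    "Marital Status": ["Single", "Married", "Single", "Divorced"],
    "Housing": ["Off Campus", "With Parent", "On Campus Housing", "Off Campus"],
})

categorical = ["Marital Status", "Housing"]

# drop_first=True omits the first (alphabetically sorted) level of each variable;
# that level becomes the reference category absorbed by the intercept
encoded = pd.get_dummies(df, columns=categorical, drop_first=True)

print(list(encoded.columns))
```

This also replaces the repeated `get_dummies` / `drop` / `concat` cells with a single call, since `columns=` encodes and removes the original columns in one step.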

In [1857]:
y = df_with_dummies.Dropout

In [1858]:
# Drop identifiers, the target, and the yearly aid columns (2012-2017)
aid_columns = [f'{year} {kind}' for year in range(2012, 2018) for kind in ('Loan', 'Scholarship', 'Work/Study', 'Grant')]
df_with_dummies=df_with_dummies.drop(columns=['StudentID','Dropout','cohort','cohort term'] + aid_columns)

In [1859]:
df_with_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12261 entries, 0 to 12260
Data columns (total 17 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Adjusted Gross Income                  12261 non-null  float64
 1   Parent Adjusted Gross Income           12261 non-null  float64
 2   Marital Status_Divorced                12261 non-null  uint8  
 3   Marital Status_Married                 12261 non-null  uint8  
 4   Marital Status_Separated               12261 non-null  uint8  
 5   Marital Status_Single                  12261 non-null  uint8  
 6   Housing_Off Campus                     12261 non-null  uint8  
 7   Housing_On Campus Housing              12261 non-null  uint8  
 8   Housing_With Parent                    12261 non-null  uint8  
 9   FatherHighestGradeLevel_College        12261 non-null  uint8  
 10  FatherHighestGradeLevel_High School    12261 non-null  uint8  
 11  Fa

In [1860]:
X = df_with_dummies

In [1861]:
X = sm.add_constant(X)


In [1862]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

         Current function value: 0.658107
         Iterations: 35




In [1863]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12247
Method:                           MLE   Df Model:                           13
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                 0.01329
Time:                        20:44:05   Log-Likelihood:                -8069.1
converged:                      False   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                 3.595e-39
                                            coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------
const                                    -0.2658   7.64e+05  -3.48e-07      1.000    -1.5e+06     1.5e+06
Adjusted Gross Income                 -1.003e-06  

In [1864]:
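# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# max_iter is raised well above the scikit-learn default of 100
logreg = LogisticRegression(solver='lbfgs', max_iter=6000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# score() returns the mean accuracy on the held-out rows
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))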

Accuracy of logistic regression classifier on test set: 0.61


### Drop Marital Status, Housing, FatherHighestGradeLevel, and MotherHighestGradeLevel, as their p-values are high

In [1866]:
# Keep only the constant and the two income columns
X=X.drop(columns=['Marital Status_Divorced','Marital Status_Married','Marital Status_Separated','Marital Status_Single','Housing_Off Campus','Housing_On Campus Housing','Housing_With Parent','FatherHighestGradeLevel_College','FatherHighestGradeLevel_High School','FatherHighestGradeLevel_Middle School','FatherHighestGradeLevel_Unknown','MotherHighestGradeLevel_High School','MotherHighestGradeLevel_Middle School','MotherHighestGradeLevel_College','MotherHighestGradeLevel_Unknown'])

In [1868]:
X = sm.add_constant(X)


In [1869]:
model=sm.Logit(y,X)
result = model.fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.665634
         Iterations 5


In [1870]:
print(result.summary())
print(result.summary2())

                           Logit Regression Results                           
Dep. Variable:                Dropout   No. Observations:                12261
Model:                          Logit   Df Residuals:                    12258
Method:                           MLE   Df Model:                            2
Date:                Thu, 08 Jul 2021   Pseudo R-squ.:                0.002007
Time:                        20:56:14   Log-Likelihood:                -8161.3
converged:                       True   LL-Null:                       -8177.8
Covariance Type:            nonrobust   LLR p-value:                 7.419e-08
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -0.3631      0.027    -13.459      0.000      -0.416      -0.310
Adjusted Gross Income        -1.858e-06   8.54e-07     -2.176      0.030   -3
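Individual p-values do not show whether a group of variables matters jointly. One common check for nested models is a likelihood-ratio test, sketched below with the log-likelihoods and Df Model values printed in the two summaries. The full model is reported as not converged, so its log-likelihood, and therefore this comparison, should be read with caution. The critical value is taken from a standard chi-square table.

```python
# Figures copied from the two Logit summaries above
ll_full = -8069.1      # all 17 predictors, reported as not converged
ll_reduced = -8161.3   # income variables only
df_full, df_reduced = 13, 2

# Likelihood-ratio statistic and its degrees of freedom
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

# 5% critical value of the chi-square distribution with 11 degrees of freedom
critical_11 = 19.675

print(round(lr_stat, 1), df_diff, lr_stat > critical_11)
# scipy.stats.chi2.sf(lr_stat, df_diff) would give the exact p-value if SciPy is available
```

A statistic of about 184.4 on 11 degrees of freedom is far above the critical value, which suggests the dropped categorical variables carry some joint signal even though their individual p-values look high.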

In [1871]:
# Same split and classifier as before, now using only the constant and the income columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=6000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# score() returns the mean accuracy on the held-out rows
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.61
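Both models score 0.61 on the test set, so it helps to compare that figure with the accuracy of always predicting the majority class. The null log-likelihood reported above (-8177.8 over 12261 rows) suggests a dropout rate of roughly 39%, which would put that baseline at about 0.61 as well. The sketch below uses hypothetical labels and predictions in place of `y_test` and `y_pred` to show the comparison, together with the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions standing in for y_test and y_pred
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# Accuracy obtained by always predicting the most common class
baseline = max(y_true.mean(), 1 - y_true.mean())
accuracy = (y_true == y_hat).mean()

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Share of actual dropouts that the classifier catches
recall = tp / (tp + fn)

print(f"baseline={baseline:.2f} accuracy={accuracy:.2f} recall={recall:.2f}")
```

When accuracy sits close to the baseline, recall on the dropout class is usually the more informative number, since the stated goal is to identify students at risk.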
