# Predicting College Graduation Rates (Model Validation)

### - Bassam Atheeque
---

## Data description:

This dataset contains a number of variables for over 700 different universities and colleges in the U.S.

**Private**: Public/private indicator

**Apps**: Number of applications received

**Accept**: Number of applicants accepted

**Enroll**: Number of new students enrolled

**Top10perc**: New students from top 10% of high school class

**Top25perc**: New students from top 25% of high school class

**F.Undergrad**: Number of full-time undergraduates

**P.Undergrad**: Number of part-time undergraduates

**Outstate**: Out-of-state tuition

**Room.Board**: Room and board costs

**Books**: Estimated book costs

**Personal**: Estimated personal spending

**PhD**: Percent of faculty with Ph.D.s

**Terminal**: Percent of faculty with terminal degree

**S.F.Ratio**: Student/faculty ratio

**perc.alumni**: Percent of alumni who donate

**Expend**: Instructional expenditure per student

**Grad.Rate**: Graduation rate

In [1]:
# Importing the libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats

In [2]:
collegedata = pd.read_csv('./college.csv')

In [3]:
collegedata.head()

,Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


### First, we rename the columns to make them more readable. Dot-free names are also easier to reference later when we specify the regression formulas in statsmodels, where a name such as `F.Undergrad` would be parsed as a Python attribute access.

In [4]:
#Renaming the column names

collegedata = collegedata.rename(columns=
                                 {'Unnamed: 0':'College Name',
                                  'F.Undergrad':'F_Undergrad',
                                 'P.Undergrad':'P_Undergrad',
                                 'Room.Board':'Room_Board',
                                 'S.F.Ratio':'SF_Ratio',
                                 'perc.alumni':'perc_alumni',
                                 'Grad.Rate':'Grad_Rate'})


collegedata.head()                              

,College Name,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F_Undergrad,P_Undergrad,Outstate,Room_Board,Books,Personal,PhD,Terminal,SF_Ratio,perc_alumni,Expend,Grad_Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


### Creating new columns for:
#### (1) Acceptance Rate: the ratio of the number of students who are accepted to the number of applications received
#### (2) Yield Rate: the ratio of the number of students who enroll to the number of students who are accepted

In [5]:
# Calculating Acceptance Rate and Yield Rate and representing as a percentage (multiplying by 100)

collegedata['Acceptance_Rate'] = (collegedata['Accept']/collegedata['Apps'])*100

collegedata['Yield_Rate'] = (collegedata['Enroll']/collegedata['Accept'])*100


# Displaying the rows with new columns
collegedata[['College Name','Acceptance_Rate','Yield_Rate']].head()

,College Name,Acceptance_Rate,Yield_Rate
0,Abilene Christian University,74.216867,58.522727
1,Adelphi University,88.014639,26.611227
2,Adrian College,76.820728,30.628988
3,Agnes Scott College,83.693046,39.255014
4,Alaska Pacific University,75.647668,37.671233
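
As a quick sanity check, the first row of this output can be reproduced by hand from the `Apps`, `Accept`, and `Enroll` values shown earlier for Abilene Christian University. This is only a sketch in plain Python; the variable names are illustrative and not part of the notebook.

```python
# Values for Abilene Christian University, copied from the first table
apps, accept, enroll = 1660, 1232, 721

# Acceptance rate: accepted applicants as a percentage of applications
acceptance_rate = accept / apps * 100

# Yield rate: enrolled students as a percentage of accepted applicants
yield_rate = enroll / accept * 100

print(round(acceptance_rate, 6), round(yield_rate, 6))
```

Both values should agree with the first row of the output above to six decimal places.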


## Statsmodels

#### Now, we will use statsmodels to fit six regression models of the graduation rate (`Grad_Rate`), each with a different set of predictors:

#### Model 1: (1) the acceptance rate, (2) the percent of new students from top 25% of high school class, (3) the out-of-state tuition.

#### Model 2: Predictors same as Model 1, with an additional predictor of the yield rate.

#### Model 3: Predictors same as Model 1, with an additional predictor of the instructional expenditure per student.

#### Model 4: Predictors same as Model 1, with an additional predictor of the percent of alumni who donate.

#### Model 5: Predictors same as Model 1, with three additional predictors from Model 2, 3 and 4 (i.e., the yield rate, the instructional expenditure per student, and the percent of alumni who donate.)

#### Model 6: Predictors same as Model 5, with an additional predictor of the student/faculty ratio.


### For each model, we will report three performance metrics: MSE, $R^2$, and adjusted $R^2$.
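
For reference, the three metrics can be computed by hand from the residuals. The sketch below uses made-up numbers and a hypothetical helper, `regression_metrics`; neither comes from the notebook. It follows the statsmodels convention, where `mse_resid` divides the sum of squared residuals by the residual degrees of freedom $n - p - 1$ (with $n$ observations and $p$ predictors) rather than by $n$. scikit-learn's `mean_squared_error`, used later in this notebook, divides by $n$.

```python
def regression_metrics(y_true, y_pred, n_predictors):
    """Return (mse_resid, r2, r2_adj) for a model that includes an intercept."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)             # total sum of squares
    df_resid = n - n_predictors - 1                             # residual degrees of freedom
    mse_resid = ss_res / df_resid
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / df_resid
    return mse_resid, r2, r2_adj

# Toy values for illustration only (not from the college data)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
mse_resid, r2, r2_adj = regression_metrics(y_true, y_pred, n_predictors=1)
print(mse_resid, r2, r2_adj)
```

Adjusted $R^2$ penalizes each extra predictor through the $n - p - 1$ term, which is why it can fall when a predictor adds little.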

### Model 1:

In [6]:
col_grad_rate1 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate ', data =collegedata).fit()

print("MSE: ", col_grad_rate1.mse_resid)
print("R2: ", col_grad_rate1.rsquared)
print("R2_adj: ", col_grad_rate1.rsquared_adj)

MSE:  182.67745860719873
R2:  0.38330247445562793
R2_adj:  0.38090908173035865


### Model 2:

In [7]:
col_grad_rate2 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + Yield_Rate', data =collegedata).fit()

print("MSE: ", col_grad_rate2.mse_resid)
print("R2: ", col_grad_rate2.rsquared)
print("R2_adj: ", col_grad_rate2.rsquared_adj)

MSE:  182.027128539413
R2:  0.3862928693817719
R2_adj:  0.38311303968944943


### Model 3:

In [8]:
col_grad_rate3 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + Expend', data =collegedata).fit()

print("MSE: ", col_grad_rate3.mse_resid)
print("R2: ", col_grad_rate3.rsquared)
print("R2_adj: ", col_grad_rate3.rsquared_adj)

MSE:  180.67513988370257
R2:  0.39085111894150604
R2_adj:  0.38769490712255006


### Model 4:

In [9]:
col_grad_rate4 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + perc_alumni', data =collegedata).fit()

print("MSE: ", col_grad_rate4.mse_resid)
print("R2: ", col_grad_rate4.rsquared)
print("R2_adj: ", col_grad_rate4.rsquared_adj)

MSE:  174.45129434711376
R2:  0.41183490534916756
R2_adj:  0.4087874178121166


### Model 5:

In [10]:
col_grad_rate5 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + Yield_Rate+ Expend + perc_alumni', data =collegedata).fit()

print("MSE: ", col_grad_rate5.mse_resid)
print("R2: ", col_grad_rate5.rsquared)
print("R2_adj: ", col_grad_rate5.rsquared_adj)

MSE:  171.37533584518496
R2:  0.4237024213297752
R2_adj:  0.4192117908466306


### Model 6:

In [11]:
col_grad_rate6 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + Yield_Rate+ Expend + perc_alumni + SF_Ratio', data =collegedata).fit()

print("MSE: ", col_grad_rate6.mse_resid)
print("R2: ", col_grad_rate6.rsquared)
print("R2_adj: ", col_grad_rate6.rsquared_adj)

MSE:  171.59191875536084
R2:  0.42372348490214096
R2_adj:  0.41847779490775217


### Comparing the above six models:

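#### Though Model 5 and Model 6 are very close, **Model 5** is the best of the six: it has the lowest MSE (171.38) and the highest adjusted $R^2$ (0.4192).
#### A lower MSE means that the differences between the actual and predicted graduation rates are smaller on average.
#### A higher adjusted $R^2$ means that the predictors explain more of the variation in the graduation rate once the number of predictors is taken into account. Model 6 has a marginally higher $R^2$ (0.42372 vs. 0.42370), but $R^2$ never decreases when a predictor is added, so adjusted $R^2$ is the fairer comparison here.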
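
The same comparison can be made programmatically. The sketch below copies the metrics printed above (rounded to four decimals) into a plain dictionary and picks the model with the lowest MSE and the one with the highest adjusted $R^2$. The dictionary and variable names are illustrative.

```python
# (MSE, R2, adjusted R2) as printed above, rounded to four decimals
results = {
    "Model 1": (182.6775, 0.3833, 0.3809),
    "Model 2": (182.0271, 0.3863, 0.3831),
    "Model 3": (180.6751, 0.3909, 0.3877),
    "Model 4": (174.4513, 0.4118, 0.4088),
    "Model 5": (171.3753, 0.4237, 0.4192),
    "Model 6": (171.5919, 0.4237, 0.4185),
}

# Lowest residual MSE and highest adjusted R2 across the six models
best_by_mse = min(results, key=lambda name: results[name][0])
best_by_r2_adj = max(results, key=lambda name: results[name][2])
print(best_by_mse, best_by_r2_adj)
```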

## Train-Test Split method
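#### We will use the train/test split method from scikit-learn to find the best model among the six.
#### We will split the data into a training set (85%) and a test set (15%), with random_state set to 42 so that the split is reproducible. Each model is fit on the training set only, and its MSE is computed on the test set.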
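
To make the idea concrete, here is a hand-rolled hold-out split in plain Python. It is a sketch of the concept, not scikit-learn's implementation: the helper `holdout_split` is hypothetical, and its shuffle will not select the same rows that `train_test_split` selects for the same seed.

```python
import random

def holdout_split(rows, test_percent, seed):
    """Shuffle the rows and hold out about test_percent % of them as a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Ceiling division in integers, so 15% of 20 rows gives exactly 3 test rows
    n_test = -(-len(shuffled) * test_percent // 100)
    return shuffled[n_test:], shuffled[:n_test]

# Twenty made-up row indices stand in for the colleges
train_rows, test_rows = holdout_split(range(20), test_percent=15, seed=42)
print(len(train_rows), len(test_rows))
```

A model would then be fit on `train_rows` only and scored on `test_rows`, which it has never seen.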

In [12]:
# Importing libraries
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics

In [13]:
# Reading y
y_data = collegedata['Grad_Rate']

### Model 1 (train_test_split):

In [14]:
X1_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate']]

X1_train, X1_test, y1_train, y1_test = train_test_split (X1_data, y_data, test_size = 0.15, random_state= 42)

lr1 = linear_model.LinearRegression()

lr1.fit(X1_train, y1_train)

y1_pred = lr1.predict(X1_test)

print ('MSE for Model 1:',metrics.mean_squared_error (y1_test, y1_pred))

MSE for Model 1: 141.1352414049783


### Model 2 (train_test_split):

In [15]:
X2_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate', 'Yield_Rate']]

X2_train, X2_test, y2_train, y2_test = train_test_split (X2_data, y_data, test_size = 0.15, random_state= 42)

lr2 = linear_model.LinearRegression()

lr2.fit(X2_train, y2_train)

y2_pred = lr2.predict(X2_test)

print ('MSE for Model 2:',metrics.mean_squared_error (y2_test, y2_pred))

MSE for Model 2: 138.12127271829294


### Model 3 (train_test_split):

In [16]:
X3_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate', 'Expend']]

X3_train, X3_test, y3_train, y3_test = train_test_split (X3_data, y_data, test_size = 0.15, random_state= 42)

lr3 = linear_model.LinearRegression()

lr3.fit(X3_train, y3_train)

y3_pred = lr3.predict(X3_test)

print ('MSE for Model 3:',metrics.mean_squared_error (y3_test, y3_pred))

MSE for Model 3: 141.82011995650237


### Model 4 (train_test_split):

In [17]:
X4_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate', 'perc_alumni']]

X4_train, X4_test, y4_train, y4_test = train_test_split (X4_data, y_data, test_size = 0.15, random_state= 42)

lr4 = linear_model.LinearRegression()

lr4.fit(X4_train, y4_train)

y4_pred = lr4.predict(X4_test)

print ('MSE for Model 4:',metrics.mean_squared_error (y4_test, y4_pred))

MSE for Model 4: 137.6916368463751


### Model 5 (train_test_split):

In [18]:
X5_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate', 'Yield_Rate' , 'Expend' , 'perc_alumni']]

X5_train, X5_test, y5_train, y5_test = train_test_split (X5_data, y_data, test_size = 0.15, random_state= 42)

lr5 = linear_model.LinearRegression()

lr5.fit(X5_train, y5_train)

y5_pred = lr5.predict(X5_test)

print ('MSE for Model 5:',metrics.mean_squared_error (y5_test, y5_pred))

MSE for Model 5: 135.54129991259958


### Model 6 (train_test_split):

In [19]:
X6_data = collegedata[['Acceptance_Rate', 'Top25perc', 'Outstate', 'Yield_Rate' , 'Expend' , 'perc_alumni', 'SF_Ratio']]

X6_train, X6_test, y6_train, y6_test = train_test_split (X6_data, y_data, test_size = 0.15, random_state= 42)

lr6 = linear_model.LinearRegression()

lr6.fit(X6_train, y6_train)

y6_pred = lr6.predict(X6_test)

print ('MSE for Model 6:',metrics.mean_squared_error (y6_test, y6_pred))

MSE for Model 6: 136.16950452518563


### Comparing the above six models:
#### Model 5 is the best of the six, as it has the lowest test MSE (about 135.54).
#### A lower test MSE means that the model's predictions are closer, on average, to the true graduation rates of colleges it was not trained on. Note that these values come from a single random split, so they depend on which colleges happen to fall in the test set.

## Cross-Validation:
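
In $k$-fold cross-validation the data is cut into $k$ folds. Each fold serves once as the test set while the model is fit on the remaining folds, and the $k$ test errors are averaged. The sketch below illustrates this in plain Python with a one-predictor least-squares line on synthetic data. The helpers `fit_line` and `k_fold_mse` are hypothetical stand-ins for `LinearRegression` and `cross_val_score`. Like `cross_val_score` with an integer `cv` on a regressor, they use contiguous folds taken without shuffling.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def k_fold_mse(xs, ys, k):
    """Average test MSE over k contiguous, unshuffled folds."""
    n = len(xs)
    # The first n % k folds get one extra observation
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    fold_scores, start = [], 0
    for size in sizes:
        stop = start + size
        # Fit on everything outside the current fold
        intercept, slope = fit_line(xs[:start] + xs[stop:], ys[:start] + ys[stop:])
        # Score on the held-out fold only
        errors = [(y - (intercept + slope * x)) ** 2
                  for x, y in zip(xs[start:stop], ys[start:stop])]
        fold_scores.append(sum(errors) / size)
        start = stop
    return sum(fold_scores) / k

# Synthetic data for illustration only: y = 20 + 0.5 x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 100) for _ in range(50)]
ys = [20 + 0.5 * x + rng.gauss(0, 5) for x in xs]
cv_mse = k_fold_mse(xs, ys, k=10)
print(cv_mse)
```

Because every observation is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single hold-out estimate does.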

In [20]:
#importing library

from sklearn.model_selection import cross_val_score

### Model 1 (cross-validation):

In [21]:
Model1_CV = -cross_val_score (lr1, X1_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 1:',Model1_CV)

MSE for Model 1: 184.95846670455708


### Model 2 (cross-validation):

In [22]:
Model2_CV = -cross_val_score (lr2, X2_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 2:',Model2_CV )

MSE for Model 2: 184.99168199166132


### Model 3 (cross-validation):

In [23]:
Model3_CV = -cross_val_score (lr3, X3_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 3:',Model3_CV)

MSE for Model 3: 183.4607896806073


### Model 4 (cross-validation):

In [24]:
Model4_CV = -cross_val_score (lr4, X4_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 4:',Model4_CV)

MSE for Model 4: 176.1842003986497


### Model 5 (cross-validation):

In [25]:
Model5_CV = -cross_val_score (lr5, X5_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 5:',Model5_CV)

MSE for Model 5: 174.51244212033296


### Model 6 (cross-validation):

In [26]:
Model6_CV = -cross_val_score (lr6, X6_data, y_data,
                cv = 10,
                scoring = 'neg_mean_squared_error').mean()

print ('MSE for Model 6:',Model6_CV)

MSE for Model 6: 175.13718723110142


### Comparing the above six models:
#### The best model among all is **Model 5**.
#### It has the lowest cross-validated MSE (174.51244212033296), which means that its average squared prediction error across the 10 held-out folds is lower than that of the other five models. This comparatively lower error makes Model 5 the best.
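
As a cross-check, the sketch below ranks the six models by the test MSE from the single train/test split and by the 10-fold cross-validated MSE, using the values printed above (rounded to two decimals). The variable names are illustrative.

```python
# MSE values as printed above, rounded to two decimals
holdout_mse = {"Model 1": 141.14, "Model 2": 138.12, "Model 3": 141.82,
               "Model 4": 137.69, "Model 5": 135.54, "Model 6": 136.17}
cv_mse = {"Model 1": 184.96, "Model 2": 184.99, "Model 3": 183.46,
          "Model 4": 176.18, "Model 5": 174.51, "Model 6": 175.14}

# Sort model names from lowest to highest MSE under each method
holdout_ranking = sorted(holdout_mse, key=holdout_mse.get)
cv_ranking = sorted(cv_mse, key=cv_mse.get)
print(holdout_ranking)
print(cv_ranking)
```

Both rankings put Model 5 first, followed by Model 6 and Model 4, but they order Models 1 to 3 differently. This illustrates how a single split can rank closely matched models differently from cross-validation.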

In [27]:
# Displaying the intercept and the coefficients from my Model 5:

col_grad_rate5 = smf.ols(formula = 'Grad_Rate ~ Acceptance_Rate + Top25perc + Outstate + Yield_Rate+ Expend + perc_alumni', data =collegedata).fit()
col_grad_rate5.params

Intercept          50.637950
Acceptance_Rate    -0.146102
Top25perc           0.178278
Outstate            0.001579
Yield_Rate         -0.087575
Expend             -0.000411
perc_alumni         0.302697
dtype: float64

### Final Model:

### $$ \text{Grad_Rate} = 50.637950 - 0.146102\,\text{(Acceptance_Rate)} + 0.178278\,\text{(Top25perc)} + 0.001579\,\text{(Outstate)} - 0.087575\,\text{(Yield_Rate)} - 0.000411\,\text{(Expend)} + 0.302697\,\text{(perc_alumni)} $$
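
To show how the fitted equation is used, the sketch below plugs in the values for Abilene Christian University from the tables earlier in the notebook. The coefficients are the rounded ones printed by `params`, so the result is approximate, and the dictionary names are illustrative.

```python
# Rounded coefficients copied from the `params` output above
coefficients = {
    "Intercept": 50.637950,
    "Acceptance_Rate": -0.146102,
    "Top25perc": 0.178278,
    "Outstate": 0.001579,
    "Yield_Rate": -0.087575,
    "Expend": -0.000411,
    "perc_alumni": 0.302697,
}

# Predictor values for Abilene Christian University (first row of the data)
college = {
    "Acceptance_Rate": 74.216867,
    "Top25perc": 52,
    "Outstate": 7440,
    "Yield_Rate": 58.522727,
    "Expend": 7041,
    "perc_alumni": 12,
}

# Intercept plus the sum of coefficient * value over all predictors
predicted = coefficients["Intercept"] + sum(
    coefficients[name] * value for name, value in college.items()
)
residual = 60 - predicted  # the observed Grad_Rate for this college is 60
print(round(predicted, 2), round(residual, 2))
```

The prediction comes out at roughly 56.4, a few points below the observed graduation rate of 60.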

## Thank You!
---