This notebook takes the Curr_Completion_Rate variable and splits the observations into two groups at the median (below the median is "Low"; at or above the median is "High").

The models in this notebook predict whether a school is in the high or low group. The models used are:

Logistic Regression - Baseline

Logistic Regression - PCA

Logistic Regression - Grid Search

Random Forest - Baseline

Random Forest - Grid Search

Decision Tree

Nearest Centroid - Baseline

Nearest Centroid - Grid Search

K-nearest - Grid Search



In [155]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsClassifier

In [156]:
# Load in merged community college data
url = 'https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/2016/Machine%20Learning%20Datasets/NCCCData_ML.csv?raw=true'
NCCCData = pd.read_csv(url)

In [157]:
list(NCCCData)

['AdvESL_MeasureableSkills_Participant_POP_MSG',
 'AdvESL_MeasureableSkills_ParticipServed',
 'AdvESL_MeasureableSkills_AHSGrad',
 'AdvESL_MeasureableSkills_HSE',
 'AdvESL_MeasureableSkills_Postsecondary\r\nEnrollment',
 'AdvESL_MeasureableSkills_MSG',
 'Beg_ESL_PCTProgress',
 'LowBeg_ESL_PCTProgress',
 'HighBeg_ESL_PCTProgress',
 'LowInt_ESL_PCTProgress',
 'HighInt_ESL_PCTProgress',
 'Advanced_ESL_PCTProgress',
 'Basic_Skills_CompletingLevel\r\nLEVEL',
 'Basic_Skills_PCTCompleting',
 'Beg_ABE_Lit_PCTProgress',
 'Beg_BasicEd_PCTProgress',
 'LowInt_BasicEd_PCTProgress',
 'HighInt_BasicEd_PCTProgress',
 'Low_AdultSecondary_Students',
 'Low_AdultSecondary_PCTProgress',
 'BegABELit_Participant_POP_MSG',
 'BegABELit_IndividualsServed',
 'BegABELit_Particip/Served',
 'BegABELit_POPs',
 'BegABELit_AHSGrad',
 'BegABELit_HSE',
 'BegABELit_Postsecondary',
 'BegABELit_Posttest',
 'BegABELit_MSG',
 'BegBasicEd_Participant_POP_MSG',
 'BegBasicEd_Particip/Served',
 'BegBasicEd_AHSGrad',
 'BegBasicEd

In [158]:
#_________________________HAVE TO CHANGE THIS CALCULATION______________________________

#create percentage variables for retention rate
NCCCData['Full_time_rentention_percent']=NCCCData['Full-time retention rate  2016']/10
NCCCData['Part_time_rentention_percent']=NCCCData['Part-time retention rate  2016']/10

In [159]:
#Get the median value of each variable in order to create two groups
cur_median=NCCCData['Curr_Completion_Rate'].median()
first_median=NCCCData['First_Year_Progression'].median()
full_median=NCCCData['Full_time_rentention_percent'].median()
part_median=NCCCData['Part_time_rentention_percent'].median()

#create high and low groups for each variable
NCCCData['Curr_Completion_Rate_TwoGroups']=np.where(NCCCData['Curr_Completion_Rate']<cur_median,"Low","High")
NCCCData['First_Year_Progression_TwoGroups']=np.where(NCCCData['First_Year_Progression']<first_median,"Low","High")
NCCCData['Full_time_rentention_percent_TwoGroups']=np.where(NCCCData['Full_time_rentention_percent']<full_median,"Low","High")
NCCCData['Part_time_rentention_percent_TwoGroups']=np.where(NCCCData['Part_time_rentention_percent']<part_median,"Low","High")

#get the 33rd and 66th percentiles for each variable
cur_33=np.percentile(NCCCData['Curr_Completion_Rate'],33)
cur_66=np.percentile(NCCCData['Curr_Completion_Rate'],66)
first_33=np.percentile(NCCCData['First_Year_Progression'],33)
first_66=np.percentile(NCCCData['First_Year_Progression'],66)
full_33=np.percentile(NCCCData['Full_time_rentention_percent'],33)
full_66=np.percentile(NCCCData['Full_time_rentention_percent'],66)
part_33=np.percentile(NCCCData['Part_time_rentention_percent'],33)
part_66=np.percentile(NCCCData['Part_time_rentention_percent'],66)

#split into three groups
NCCCData['Curr_Completion_Rate_ThreeGroups']=pd.cut(NCCCData['Curr_Completion_Rate'],[0,cur_33,cur_66,np.inf],labels=['low','medium','high'])
NCCCData['First_Year_Progression_ThreeGroups']=pd.cut(NCCCData['First_Year_Progression'],[0,first_33,first_66,np.inf],labels=['low','medium','high'])
NCCCData['Full_time_rentention_percent_ThreeGroups']=pd.cut(NCCCData['Full_time_rentention_percent'],[0,full_33,full_66,np.inf],labels=['low','medium','high'])
NCCCData['Part_time_rentention_percent_ThreeGroups']=pd.cut(NCCCData['Part_time_rentention_percent'],[0,part_33,part_66,np.inf],labels=['low','medium','high'])
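The grouping logic above can be checked on a small toy series. This is a minimal sketch with made-up values (not taken from NCCCData); the names `rates`, `default_cut`, and `inclusive_cut` are illustrative only. It also shows one behavior worth being aware of: `pd.cut` uses right-closed bins by default, so a value exactly equal to the lowest bin edge (0 here) is left ungrouped unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Toy completion rates (hypothetical values, not taken from NCCCData).
rates = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])

# Two groups: strictly below the median is "Low", everything else is "High".
median = rates.median()
two_groups = np.where(rates < median, "Low", "High")

# Three groups: cut at the 33rd and 66th percentiles.
p33, p66 = np.percentile(rates, [33, 66])
bins = [0, p33, p66, np.inf]
labels = ["low", "medium", "high"]

# Default bins are right-closed, so the value 0 falls outside the first bin
# and becomes NaN; include_lowest=True closes the first bin on the left.
default_cut = pd.cut(rates, bins, labels=labels)
inclusive_cut = pd.cut(rates, bins, labels=labels, include_lowest=True)

print(list(two_groups))
print("ungrouped (default):", default_cut.isna().sum())
print("ungrouped (include_lowest):", inclusive_cut.isna().sum())
```

Whether any school in the real data has a rate of exactly 0 is not shown in this notebook, so this may not matter in practice.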


These are the variables we want to try to predict.

Curr_Completion_Rate

First_Year_Progression

Full_time_rentention_percent

Part_time_rentention_percent

DECISION TREE/RANDOM FOREST ONLY:
    
Curr_Completion_Rate_ThreeGroups

First_Year_Progression_ThreeGroups

Full_time_rentention_percent_ThreeGroups

Part_time_rentention_percent_ThreeGroups

LOGISTIC REGRESSION ONLY:
    
Curr_Completion_Rate_TwoGroups

First_Year_Progression_TwoGroups

Full_time_rentention_percent_TwoGroups

Part_time_rentention_percent_TwoGroups

########################################################

These are the models we would like to try:

Linear Regression (numeric variables)

Logistic Regression (two group variables)

Decision Tree (numeric and three group)

Random Forest (numeric and three group)

In [160]:
#Create data frame of target variables

target_variables=NCCCData[["Curr_Completion_Rate", "First_Year_Progression", "Full_time_rentention_percent", "Part_time_rentention_percent", "Curr_Completion_Rate_ThreeGroups", "First_Year_Progression_ThreeGroups", "Full_time_rentention_percent_ThreeGroups", "Part_time_rentention_percent_ThreeGroups", "Curr_Completion_Rate_TwoGroups", "First_Year_Progression_TwoGroups", "Full_time_rentention_percent_TwoGroups", "Part_time_rentention_percent_TwoGroups"]]

In [161]:
#Remove all the target variables from the data set
target_cols=["Curr_Completion_Rate", "First_Year_Progression", "Full_time_rentention_percent", "Part_time_rentention_percent", "Curr_Completion_Rate_ThreeGroups", "First_Year_Progression_ThreeGroups", "Full_time_rentention_percent_ThreeGroups", "Part_time_rentention_percent_ThreeGroups", "Curr_Completion_Rate_TwoGroups", "First_Year_Progression_TwoGroups", "Full_time_rentention_percent_TwoGroups", "Part_time_rentention_percent_TwoGroups"]
targets_dropped=NCCCData.drop(target_cols,axis=1)

In [162]:
#check that columns were dropped
list(targets_dropped)

['AdvESL_MeasureableSkills_Participant_POP_MSG',
 'AdvESL_MeasureableSkills_ParticipServed',
 'AdvESL_MeasureableSkills_AHSGrad',
 'AdvESL_MeasureableSkills_HSE',
 'AdvESL_MeasureableSkills_Postsecondary\r\nEnrollment',
 'AdvESL_MeasureableSkills_MSG',
 'Beg_ESL_PCTProgress',
 'LowBeg_ESL_PCTProgress',
 'HighBeg_ESL_PCTProgress',
 'LowInt_ESL_PCTProgress',
 'HighInt_ESL_PCTProgress',
 'Advanced_ESL_PCTProgress',
 'Basic_Skills_CompletingLevel\r\nLEVEL',
 'Basic_Skills_PCTCompleting',
 'Beg_ABE_Lit_PCTProgress',
 'Beg_BasicEd_PCTProgress',
 'LowInt_BasicEd_PCTProgress',
 'HighInt_BasicEd_PCTProgress',
 'Low_AdultSecondary_Students',
 'Low_AdultSecondary_PCTProgress',
 'BegABELit_Participant_POP_MSG',
 'BegABELit_IndividualsServed',
 'BegABELit_Particip/Served',
 'BegABELit_POPs',
 'BegABELit_AHSGrad',
 'BegABELit_HSE',
 'BegABELit_Postsecondary',
 'BegABELit_Posttest',
 'BegABELit_MSG',
 'BegBasicEd_Participant_POP_MSG',
 'BegBasicEd_Particip/Served',
 'BegBasicEd_AHSGrad',
 'BegBasicEd

In [163]:
#Split into train and test data
#Chose 75/25 train/test split
x_train, x_test, y_train, y_test=train_test_split(targets_dropped,target_variables,test_size=0.25, random_state=0)

print("The train data set shape is: ",x_train.shape)
print("The test data set shape is: ",x_test.shape)

The train data set shape is:  (44, 321)
The test data set shape is:  (15, 321)
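With only 15 test rows, a purely random split can leave the high and low groups unevenly represented in the test set. One possible variation, sketched here on synthetic data, is to pass `stratify=` to `train_test_split`. The arrays `X_demo` and `y_demo` are made up for illustration; in this notebook the stratification column would presumably be a single target such as `target_variables["Curr_Completion_Rate_TwoGroups"]`, which is an assumption rather than something the original code does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 60 rows, 5 features, balanced two-class labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 5))
y_demo = np.array(["High"] * 30 + ["Low"] * 30)

# stratify keeps the class proportions (nearly) equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

n_high = int((y_te == "High").sum())
n_low = int((y_te == "Low").sum())
print(X_tr.shape, X_te.shape, n_high, n_low)
```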


In [164]:
#Classifier Evaluation
#The following function performs cross validation using cross_validate() for classification estimators and returns accuracy,
#precision, and recall for each fold. It will be called several times throughout the notebook to evaluate models tested.

def EvaluateClassifierEstimator(classifierEstimator, X, y, cv):

    #Macro-averaged precision and recall are used because the targets are string labels ("Low"/"High"),
    #which the default binary 'precision' and 'recall' scorers (pos_label=1) cannot score.
    scoring = {'accuracy': 'accuracy', 'precision': 'precision_macro', 'recall': 'recall_macro'}

    #Perform cross validation on the data passed in (X, y) rather than on global variables
    scores = cross_validate(classifierEstimator, X, y, scoring=scoring
                            , cv=cv, return_train_score=True)

    Accavg = scores['test_accuracy'].mean()
    Preavg = scores['test_precision'].mean()
    Recavg = scores['test_recall'].mean()

    print_str = "The average accuracy for all cv folds is: \t\t\t {Accavg:.5}"
    print_str2 = "The average precision for all cv folds is: \t\t\t {Preavg:.5}"
    print_str3 = "The average recall for all cv folds is: \t\t\t {Recavg:.5}"

    print(print_str.format(Accavg=Accavg))
    print(print_str2.format(Preavg=Preavg))
    print(print_str3.format(Recavg=Recavg))
    print('*********************************************************')

    #Collect the per-fold scores in a data frame
    print('Cross Validation Fold Scores')
    scoresResults = pd.DataFrame()
    scoresResults['Accuracy'] = scores['test_accuracy']
    scoresResults['Precision'] = scores['test_precision']
    scoresResults['Recall'] = scores['test_recall']

    return scoresResults
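The evaluation helper is defined but not called in this part of the notebook. As a rough illustration of the underlying `cross_validate()` call, here is a self-contained sketch on synthetic data with string class labels. The data (`X_demo`, `y_demo`) and the choice of macro-averaged precision and recall are assumptions made for this example; the default `'precision'` and `'recall'` scorers expect a positive label of 1 and so do not work directly with "High"/"Low" labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data with string labels, as in the two-group targets.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = np.where(X_demo[:, 0] > 0, "High", "Low")

cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# A dict maps short names to scorer strings; results appear as 'test_<name>'.
scoring = {
    "accuracy": "accuracy",
    "precision": "precision_macro",
    "recall": "recall_macro",
}
scores = cross_validate(LogisticRegression(), X_demo, y_demo,
                        scoring=scoring, cv=cv_demo)

for name in scoring:
    print(name, scores["test_" + name].mean())
```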


In [165]:
#create cross validation object
cv = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

##############################################################

Curr_Completion_Rate_TwoGroups Models

In [166]:
#logistic regression baseline
log_reg=LogisticRegression().fit(x_train,y_train["Curr_Completion_Rate_TwoGroups"])

print("The mean accuracy on test data is: ", log_reg.score(x_test,y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.333333333333
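To put a two-class accuracy in context, it can help to compare it with a trivial reference model. With 15 test rows and two classes, always predicting the more frequent test class would score at least 8/15. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the arrays are illustrative and the notebook itself does not compute such a reference.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 30 "High" and 10 "Low"; the features are ignored by the dummy model.
X_demo = np.zeros((40, 1))
y_demo = np.array(["High"] * 30 + ["Low"] * 10)

# Always predicts the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = dummy.score(X_demo, y_demo)
print("Majority-class accuracy:", baseline_acc)
```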


In [167]:
#logistic regression PCA
pca = PCA(.95)

pca.fit(x_train)
pca.n_components_

print("In this PCA model, 95% of the variance amounts to", pca.n_components_, "principal components.")

#create model
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

log_reg_pca = LogisticRegression()
log_reg_pca.fit(x_train_pca, y_train["Curr_Completion_Rate_TwoGroups"])
print("The mean accuracy on test data is: ", log_reg_pca.score(x_test_pca, y_test["Curr_Completion_Rate_TwoGroups"]))

In this PCA model, 95% of the variance amounts to 4 principal components.
The mean accuracy on test data is:  0.6
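PCA is sensitive to the scale of each column, so a few large-valued columns can account for most of the variance on their own. Whether that is what happens with the 321 features here is not shown, but a common variation is to standardize before PCA and to wrap the steps in a pipeline so the scaler and PCA are fit on training data only. The following sketch uses synthetic data in which one column is deliberately inflated; all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: six independent features, one on a much larger scale.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 6))
X_demo[:, 0] *= 1000.0
y_demo = np.where(X_demo[:, 1] > 0, "High", "Low")

# Without scaling, the large column alone explains over 95% of the variance.
raw_pca = PCA(0.95).fit(X_demo)

# With scaling, the variance is spread across several components.
pipe = make_pipeline(StandardScaler(), PCA(0.95), LogisticRegression())
pipe.fit(X_demo, y_demo)
scaled_pca = pipe.named_steps["pca"]

print("components without scaling:", raw_pca.n_components_)
print("components with scaling:", scaled_pca.n_components_)
```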


In [168]:
#logistic regression grid search
log_reg_grid = LogisticRegression()


parameters = { 'penalty':['l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced', None] # None (not the string 'none') means no class weighting
              ,'random_state': [0]
              ,'solver': ['lbfgs']
              ,'max_iter':[100,500]
             }

#Create a grid search object 
regGridSearch = GridSearchCV(estimator=log_reg_grid
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # ShuffleSplit: 10 random 80/20 splits
                   , scoring='accuracy')

regGridSearch.fit(x_train,y_train["Curr_Completion_Rate_TwoGroups"])


Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:  2.0min
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:  2.1min
[Parallel(n_jobs=8)]: Done 280 out of 280 | elapsed:  2.2min finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'class_weight': ['balanced', 'none'], 'random_state': [0], 'solver': ['lbfgs'], 'max_iter': [100, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [169]:
#Use the best parameters for our model
log_reg_grid = regGridSearch.best_estimator_

print("The mean accuracy on test data is: ", log_reg_grid.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.333333333333
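After a grid search, the fitted search object also exposes which parameter combination won and its mean cross-validated score, which can be compared against the held-out test accuracy. The sketch below runs a small search on synthetic data (the data and the reduced grid are assumptions for illustration) and reads `best_params_` and `best_score_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic stand-in data with a noisy dependence on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 4))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=60) > 0, "High", "Low")

# A reduced grid; None (the Python object) disables class weighting.
param_grid = {"C": [0.01, 1, 100], "class_weight": ["balanced", None]}

search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=500),
    param_grid,
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best candidate:", search.best_score_)
```

A large gap between `best_score_` and the test accuracy can indicate that the search has overfit the cross-validation splits, which is plausible with a training set this small.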


In [170]:
#random forest baseline
rf=RandomForestClassifier().fit(x_train,y_train["Curr_Completion_Rate_TwoGroups"])

print("The mean accuracy on test data is: ", rf.score(x_test,y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.4


In [171]:
#random forest grid search
rf_grid=RandomForestClassifier()

parameters = { 'n_estimators': [1,5,10,15,20,25,30,35,40,45,50]
              ,'criterion': ['gini','entropy']
              ,'bootstrap': [True,False]
              }

regGridSearch = GridSearchCV(estimator=rf_grid
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # ShuffleSplit: 10 random 80/20 splits
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
regGridSearch.fit(x_train, y_train["Curr_Completion_Rate_TwoGroups"])

Fitting 10 folds for each of 44 candidates, totalling 440 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:  1.6min
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:  1.7min
[Parallel(n_jobs=8)]: Done 425 out of 440 | elapsed:  1.8min remaining:    3.8s
[Parallel(n_jobs=8)]: Done 440 out of 440 | elapsed:  1.8min finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'n_estimators': [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50], 'criterion': ['gini', 'entropy'], 'bootstrap': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [172]:
#Use the best parameters for our model
rf_grid = regGridSearch.best_estimator_

print("The mean accuracy on test data is: ", rf_grid.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.533333333333
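The random forests above are created without a `random_state`, so rerunning the cells can give different accuracies, and on 15 test rows a single changed prediction moves accuracy by about 0.067. A minimal sketch of fixing the seed, on synthetic data with illustrative names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 5))
y_demo = np.where(X_demo[:, 0] > 0, "High", "Low")

# Two forests with the same seed are built from the same random draws.
rf_a = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

same = np.array_equal(rf_a.predict_proba(X_demo), rf_b.predict_proba(X_demo))
print("identical predictions with a fixed seed:", same)
```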


In [173]:
#decision tree

dt=DecisionTreeClassifier().fit(x_train,y_train["Curr_Completion_Rate_TwoGroups"])

print("The mean accuracy on test data is: ", dt.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.733333333333


In [174]:
#Nearest Centroid baseline
nc = NearestCentroid().fit(x_train,y_train["Curr_Completion_Rate_TwoGroups"])

print("The mean accuracy on test data is: ", nc.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.666666666667


In [175]:
#Nearest Centroid grid search
nc_grid = NearestCentroid()

parameters = { 'metric': ['euclidean','cosine','manhattan']
              ,'shrink_threshold': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
              }

regGridSearch = GridSearchCV(estimator=nc_grid
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # ShuffleSplit: 10 random 80/20 splits
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
regGridSearch.fit(x_train, y_train["Curr_Completion_Rate_TwoGroups"])


Fitting 10 folds for each of 63 candidates, totalling 630 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:  1.7min
[Parallel(n_jobs=8)]: Done 494 tasks      | elapsed:  1.7min
[Parallel(n_jobs=8)]: Done 630 out of 630 | elapsed:  1.7min finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=NearestCentroid(metric='euclidean', shrink_threshold=None),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'metric': ['euclidean', 'cosine', 'manhattan'], 'shrink_threshold': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [176]:
#Use the best parameters for our model
nc_grid = regGridSearch.best_estimator_

print("The mean accuracy on test data is: ", nc_grid.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.666666666667


In [177]:
#k-nearest neighbor grid search
kn_grid = KNeighborsClassifier()

parameters = { 'n_neighbors':[1,2,3,4,5,6,7,8,9,10]
              ,'weights': ['uniform','distance']
              ,'algorithm': ['auto', 'ball_tree','kd_tree','brute']
              ,'leaf_size': [1,5,10,15,20,25,30]
              ,'p':[1,2]
              ,'metric': ['euclidean','manhattan','chebyshev','minkowski']
              ,'n_jobs':[1,2,4,6,8]
             }

regGridSearch = GridSearchCV(estimator=kn_grid
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # ShuffleSplit: 10 random 80/20 splits
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
regGridSearch.fit(x_train, y_train["Curr_Completion_Rate_TwoGroups"])




Fitting 10 folds for each of 22400 candidates, totalling 224000 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:  1.8min
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:  1.8min
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:  2.1min
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed:  2.3min
[Parallel(n_jobs=8)]: Done 1784 tasks      | elapsed:  2.7min
[Parallel(n_jobs=8)]: Done 3322 tasks      | elapsed:  3.3min
[Parallel(n_jobs=8)]: Done 5122 tasks      | elapsed:  4.1min
[Parallel(n_jobs=8)]: Done 6822 tasks      | elapsed:  4.9min
[Parallel(n_jobs=8)]: Done 9156 tasks      | elapsed:  6.0min
[Parallel(n_jobs=8)]: Done 11256 tasks      | elapsed:  7.0min
[Parallel(n_jobs=8)]: Done 13556 tasks      | elapsed:  8.1min
[Parallel(n_jobs=8)]: Done 16536 tasks      | elapsed:  9.4min
[Parallel(n_jobs=8)]: Done 19602 tasks      | elapsed: 10.9min
[Parallel(n_jobs=8)]: Done 22754 tasks      | elapsed: 12.3min
[Parallel(n_jobs=8)]: Done 26214 tasks      | elapsed: 14.0min
[Paral

GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'], 'leaf_size': [1, 5, 10, 15, 20, 25, 30], 'p': [1, 2], 'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski'], 'n_jobs': [1, 2, 4, 6, 8]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [178]:
#Use the best parameters for our model
kn_grid = regGridSearch.best_estimator_

print("The mean accuracy on test data is: ", kn_grid.score(x_test, y_test["Curr_Completion_Rate_TwoGroups"]))

The mean accuracy on test data is:  0.333333333333
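Most of the 22400 candidates in the k-nearest neighbor search are redundant. `algorithm`, `leaf_size`, and `n_jobs` control how and how fast neighbors are found rather than which neighbors are found, so they should generally not change predictions; `p` is only used by the `minkowski` metric, where `p=1` is the Manhattan distance and `p=2` the Euclidean distance. One possible trimmed grid, sketched on synthetic data with illustrative names, uses a list of dicts so that `p` is searched only where it applies:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid, ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = np.where(X_demo[:, 0] > 0, "High", "Low")

neighbors = list(range(1, 11))
weights = ["uniform", "distance"]

# minkowski with p=1 or p=2 covers manhattan and euclidean; chebyshev ignores p.
param_grid = [
    {"n_neighbors": neighbors, "weights": weights, "metric": ["minkowski"], "p": [1, 2]},
    {"n_neighbors": neighbors, "weights": weights, "metric": ["chebyshev"]},
]
n_candidates = len(ParameterGrid(param_grid))
print("candidates:", n_candidates)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print("best parameters:", search.best_params_)
```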
