# Data to Policy Spring 2021

Weston Grewe and Angela Morrison

University of Colorado Denver

Math 7594 Integer Programming

Instructor: Dr. Steffen Borgwardt

## Summary
A college degree is increasingly a prerequisite for entering the middle class, and a college education is also an excellent way to lift people out of poverty. Some high schools have high immediate college enrollment while others have low immediate college enrollment. For any one high school, it is nearly impossible to determine which factors contribute significantly to college enrollment. Often it is a blend of many factors, such as class size, proportion of low-income students, teacher pay, and number of AP classes offered.

In this project, we create an interpretable optimal decision tree to understand which factors have the greatest impact. We use 2017 data on Massachusetts public schools, which can be found on Kaggle.

In [2]:
import numpy as np
import pandas as pd

In [3]:
raw_data = pd.read_csv('MA_Public_Schools_2017.csv')

The dataset contains 302 fields for 1861 schools. This includes elementary, middle, and high schools, as well as schools that serve many grade levels. We begin by selecting only schools that serve high school seniors, since a school that does not serve seniors cannot have immediate college enrollment. We then select only the fields that are useful for this analysis. For example, the number of AP classes taken is relevant, while the name of the school's principal is not.

In [4]:
slice1 = raw_data[raw_data['12_Enrollment'] > 1]
slice1

Unnamed: 0,School Code,School Name,School Type,Function,Contact Name,Address 1,Address 2,Town,State,Zip,...,MCAS_10thGrade_English_Incl. in SGP(#),Accountability and Assistance Level,Accountability and Assistance Description,School Accountability Percentile (1-99),Progress and Performance Index (PPI) - All Students,Progress and Performance Index (PPI) - High Needs Students,District_Accountability and Assistance Level,District_Accountability and Assistance Description,District_Progress and Performance Index (PPI) - All Students,District_Progress and Performance Index (PPI) - High Needs Students
0,10505,Abington High,Public School,Principal,Teresa Sullivan-Cruz,201 Gliniewicz Way,,Abington,MA,2351,...,111.0,Level 1,Meeting gap narrowing goals,42.0,76.0,75.0,Level 3,One or more schools in the district classified...,63.0,60.0
8,50505,Agawam High,Public School,Principal,Thomas Schnepp,760 Cooper Street,,Agawam,MA,1001,...,263.0,Level 2,Not meeting gap narrowing goals,41.0,65.0,61.0,Level 2,One or more schools in the district classified...,54.0,56.0
16,70505,Amesbury High,Public School,Principal,Elizabeth McAndrews,5 Highland Street,,Amesbury,MA,1913,...,133.0,Level 2,Not meeting gap narrowing goals,53.0,67.0,66.0,Level 2,One or more schools in the district classified...,50.0,46.0
17,70515,Amesbury Innovation High School,Public School,Principal,Eryn Maguire,71 Friend Street,,Amesbury,MA,1913,...,,Insufficient data,,,,,Level 2,One or more schools in the district classified...,50.0,46.0
23,90505,Andover High,Public School,Principal,Philip Conrad,80 Shawsheen Road,,Andover,MA,1810,...,310.0,Level 2,Not meeting gap narrowing goals,81.0,85.0,64.0,Level 2,One or more schools in the district classified...,83.0,57.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1851,39010900,Massachusetts Virtual Academy at Greenfield Co...,Public School,Principal,Greg Runyan,278 Main St.,Ste. 205,Greenfield,MA,1301,...,20.0,Level 3,Among lowest performing 20% of schools and sub...,7.0,48.0,47.0,Level 3,Among lowest performing 20% of schools and sub...,48.0,47.0
1854,35010505,Paulo Freire Social Justice Charter School,Charter School,Principal,Melissa Mirhej,161 Lower Westfield Road,P O Box 1009,Holyoke,MA,1041,...,48.0,Insufficient data,,,,,Insufficient data,,,
1855,35080505,Phoenix Academy Public Charter High School Spr...,Charter School,Principal,Jacqueline Adam-Taylor,65 Lincoln Street,,Springfield,MA,1105,...,3.0,Insufficient data,,,,,Insufficient data,,,
1856,35060505,Pioneer Charter School of Science II (PCSS-II),Charter School,Principal,Vahit Sevinc,97 Main Street,,Saugus,MA,1906,...,30.0,Insufficient data,,,,,Insufficient data,,,
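
The effect of the threshold in the filter above can be illustrated on a small made-up table. This is only a sketch: the column name `12_Enrollment` comes from the notebook, while the school names and values are invented for illustration. It shows that `> 1` drops schools with exactly one senior, that `> 0` would keep them, and that rows with a missing enrollment are excluded either way.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; only the '12_Enrollment' column name
# is taken from the notebook, the rows are invented.
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C', 'D'],
    '12_Enrollment': [120, 1, 0, np.nan],
})

strict = toy[toy['12_Enrollment'] > 1]  # threshold used in the notebook
loose = toy[toy['12_Enrollment'] > 0]   # alternative: any seniors at all

print(strict['School Name'].tolist())
print(loose['School Name'].tolist())
```

Comparisons against `NaN` evaluate to `False`, so school `D` is dropped by both masks without any explicit handling of missing values.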


Now we must choose which columns to include in our analysis. It would also be interesting to study mutable and immutable factors in two separate analyses to see how the results change. For instance, some factors, e.g. poverty or wealth, may be the most decisive, but schools have no control over them. When making decisions, schools can only act on mutable factors, e.g. teacher pay or the number of AP classes.

Factors (in order of column number):
- School type (Public/Charter)
- ZIP 
- District/District Code
- Total Enrollment
- First Lang Not English
- English Lang Learner
- Disability
- High Need
- Economically Disadvantaged
- Race Makeup
- Average Class Size
- Average Salary
- Average Expenditure per Pupil
- % Graduated
- % Dropped Out
- AP Test takers
- Number of Tests Taken
- AP Score
- Average SAT Math
- Average SAT Reading
- Average SAT Writing
- 10th Grade MCAS (If used, filter for 10th graders)
- Accountability Metrics
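
One way the two separate analyses mentioned above might be organized is to keep the mutable and immutable factors in separate column lists and build a feature table from each. The sketch below is hypothetical: the grouping of columns, the helper `split_features`, and the choice of `Percent Attending College` as the target are assumptions made for illustration, and the column names are written as they appear once the `%` symbols have been replaced later in the notebook.

```python
import pandas as pd

# Hypothetical grouping; which factors count as mutable is a judgment call.
immutable_cols = ['Percent Economically Disadvantaged', 'Percent High Needs']
mutable_cols = ['Average Class Size', 'AP_Tests Taken']
target_col = 'Percent Attending College'  # assumed target column

def split_features(df, feature_cols, target):
    """Return (X, y) restricted to the requested feature columns."""
    missing = [c for c in feature_cols + [target] if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[feature_cols], df[target]

# Invented rows, used only to show the call pattern.
toy = pd.DataFrame({
    'Percent Economically Disadvantaged': [12.0, 45.5],
    'Percent High Needs': [25.0, 60.1],
    'Average Class Size': [18.2, 21.7],
    'AP_Tests Taken': [310, 42],
    'Percent Attending College': [82.0, 55.3],
})

X_mutable, y = split_features(toy, mutable_cols, target_col)
X_immutable, _ = split_features(toy, immutable_cols, target_col)
print(X_mutable.columns.tolist(), X_immutable.columns.tolist())
```

Keeping the lists explicit makes it easy to move a factor from one group to the other and rerun both analyses.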

Replace the "%" symbols in column names with "Percent" to avoid possible errors when referencing these columns later.

In [37]:
slice1.columns = slice1.columns.str.replace('%', 'Percent', regex=False)  # literal replacement, not a regex
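
The renaming step can be checked in isolation on a few column names. In this minimal sketch the three names are written as they would appear before renaming (inferred from the renamed columns used later), and `regex=False` is passed so that `%` is always treated as a literal character.

```python
import pandas as pd

# A few column names as they would look before renaming.
cols = pd.Index(['% Graduated', '% Dropped Out', 'Average Class Size'])

# Literal replacement; names without '%' pass through unchanged.
renamed = cols.str.replace('%', 'Percent', regex=False)
print(renamed.tolist())
```

Because the space after `%` is preserved, `'% Graduated'` becomes `'Percent Graduated'`, which matches the names used in the column selection that follows.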

Keep only the columns with information we care about.

In [39]:
important_cols = slice1[[
    'School Name', 'School Type', 'TOTAL_Enrollment', 'First Language Not English',
    'Students With Disabilities', 'Percent Students With Disabilities',
    'High Needs', 'Percent High Needs',
    'Economically Disadvantaged', 'Percent Economically Disadvantaged',
    'Percent African American', 'Percent Asian', 'Percent Hispanic', 'Percent White',
    'Percent Native American', 'Percent Native Hawaiian, Pacific Islander',
    'Percent Multi-Race, Non-Hispanic', 'Percent Males', 'Percent Females',
    'Average Class Size', 'Percent Graduated', 'Percent Non-Grad Completers',
    'Percent GED', 'Percent Dropped Out',
    'High School Graduates (#)', 'Attending Coll./Univ. (#)', 'Percent Attending College',
    'Percent Private Two-Year', 'Percent Private Four-Year',
    'Percent Public Two-Year', 'Percent Public Four-Year',
    'Percent MA Community College', 'Percent MA State University', 'Percent UMass',
    'AP_Test Takers', 'AP_Tests Taken', 'AP_One Test', 'AP_Two Tests', 'AP_Three Tests',
    'AP_Four Tests', 'AP_Five or More Tests',
    'AP_Score=1', 'AP_Score=2', 'AP_Score=3', 'AP_Score=4', 'AP_Score=5',
    'Percent AP_Score 1-2', 'Percent AP_Score 3-5',
    'SAT_Tests Taken', 'Average SAT_Reading', 'Average SAT_Writing', 'Average SAT_Math',
]]

Get the shape of the new dataframe, remove rows with any missing data, and get the shape of the resulting dataset.

In [44]:
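print(important_cols.shape)  # shape before removing incomplete rows
n_missing = important_cols.isnull().sum().sum()  # total number of missing cells (not displayed)
clean_imp_cols = important_cols.dropna()  # drop every row with at least one missing value
print(clean_imp_cols.shape)  # shape after removing incomplete rows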

(393, 52)
(292, 52)
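
Dropping incomplete rows reduces the data from 393 to 292 schools, so it can be useful to see which columns are responsible for the loss. The following is a sketch on invented data, not the notebook's actual output: it counts missing values per column and the number of rows that `dropna()` would remove. The column names are borrowed from the selection above and the values are made up.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names are taken from the notebook.
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C', 'D'],
    'Average Class Size': [18.0, np.nan, 21.5, 19.0],
    'Average SAT_Math': [510.0, 480.0, np.nan, np.nan],
})

# Missing values per column, largest first.
missing_per_col = toy.isnull().sum().sort_values(ascending=False)

# Rows with at least one missing value, i.e. the rows dropna() removes.
rows_with_missing = toy.isnull().any(axis=1).sum()

clean = toy.dropna()
print(missing_per_col)
print(rows_with_missing, clean.shape)
```

If a single column accounts for most of the missing values, dropping that column instead of the affected rows may retain more schools; whether that trade-off is worthwhile depends on how important the column is to the analysis.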


Print the names of the columns in the cleaned dataset.

In [45]:
clean_imp_cols.columns.tolist()

['School Name',
 'School Type',
 'TOTAL_Enrollment',
 'First Language Not English',
 'Students With Disabilities',
 'Percent Students With Disabilities',
 'High Needs',
 'Percent High Needs',
 'Economically Disadvantaged',
 'Percent Economically Disadvantaged',
 'Percent African American',
 'Percent Asian',
 'Percent Hispanic',
 'Percent White',
 'Percent Native American',
 'Percent Native Hawaiian, Pacific Islander',
 'Percent Multi-Race, Non-Hispanic',
 'Percent Males',
 'Percent Females',
 'Average Class Size',
 'Percent Graduated',
 'Percent Non-Grad Completers',
 'Percent GED',
 'Percent Dropped Out',
 'High School Graduates (#)',
 'Attending Coll./Univ. (#)',
 'Percent Attending College',
 'Percent Private Two-Year',
 'Percent Private Four-Year',
 'Percent Public Two-Year',
 'Percent Public Four-Year',
 'Percent MA Community College',
 'Percent MA State University',
 'Percent UMass',
 'AP_Test Takers',
 'AP_Tests Taken',
 'AP_One Test',
 'AP_Two Tests',
 'AP_Three Tests',
 'AP_Fo

Save the cleaned data to a CSV file to be used in AMPL.

In [47]:
clean_imp_cols.to_csv('clean_school_data', index=False)
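
Before handing the file to another tool, a quick round-trip check can confirm that it was written as expected. The sketch below uses invented data and a temporary directory, and reuses the extension-less file name from the cell above. It only verifies that the shape and column names survive a write and a read.

```python
import os
import tempfile

import pandas as pd

# Invented rows standing in for the cleaned data.
toy = pd.DataFrame({
    'School Name': ['A', 'B'],
    'Percent Attending College': [71.2, 64.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'clean_school_data')  # same extension-less name as above
    toy.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
print(same_shape, same_columns)
```

Writing with `index=False` matters here: without it, the row index is written as an extra first column, which shows up as `Unnamed: 0` when the file is read back.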