### Title: High School Graduation Rate Predictor Based on Exam Statistics

#### Project Motivation

Education is a vital and fundamental part of any country. Potential applications of big data and data science have been explored and deployed in K-12 education [1,2]. One of these applications is capturing useful information about both students and teachers to predict progress and outcomes for individual students. This makes it possible for data scientists to build an early warning system that helps teachers prevent future dropouts.

In this project, I want to explore the relationship between high school graduation rates and students' exam statistics. The goal is to predict the high school graduation rate of a school or county from the related exam data in elementary and middle school. Based on this model, we could give schools or counties suggestions or warnings when the predictions based on their current data fall well below their expectations.

#### Data Source

I downloaded the data from the website of the U.S. Department of Education [3]. The first part of the Department of Education's EDFacts data tracks schools' participation and proficiency rates on standardized math (MATH) and reading/language arts (RLA) exams from grade 3 to grade 8. These files provide data on all students and on several subgroups, broken down by race/ethnicity, sex, disability status, homelessness, and more. The data are recorded or summarized by school, county, or city. The second part of the data tracks high school graduation rates.

#### Tools

1, IPython - write algorithms, train and test models, and generate inline plots.

2, xgboost - for model training and prediction.

3, sklearn  - for preprocessing, cross-validation, and evaluation.

4, pandas  - for data exploration and preprocessing.

#### Data Preprocessing:

I had to convert two kinds of special records before continuing. First, instead of an exact value, some counts or percentages are reported as ranges, such as 50-80 or GE90. In these cases, I used a regex to extract the smallest number of each range and let it represent the whole range. Second, some records are missing to protect student privacy. In this initial analysis, I simply dropped all records with missing data; in future analysis, I may fill them with median or mean values instead. Finally, I standardized all data by removing the mean and scaling to unit variance.
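The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper names `lower_bound` and `standardize` are hypothetical, the sample cells are made up, and `"PS"` is only a placeholder for a suppressed value. It assumes that the smallest number of a range is the first number written in the cell.

```python
import re
from statistics import mean, pstdev

# Matches the first run of digits (optionally with a decimal part) in a cell.
_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def lower_bound(cell):
    """Return the first number in a cell such as '50-80', 'GE90' or '73'.

    Returns None when the cell holds no number (e.g. a suppressed value).
    """
    match = _FIRST_NUMBER.search(str(cell))
    return float(match.group()) if match else None


def standardize(values):
    """Remove the mean and scale to unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up cells for illustration; "PS" is a placeholder for a missing record.
raw_column = ["50-80", "GE90", "73", "PS"]
parsed = [lower_bound(cell) for cell in raw_column]
kept = [v for v in parsed if v is not None]  # drop missing records
scaled = standardize(kept)
```

In the project itself the same logic would more naturally be applied column by column with pandas and sklearn; the plain-Python version only makes each step explicit.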

#### Model

##### 1, Feature Engineering:

Since this is an initial analysis, I only considered some obviously important features from a single year. There are two kinds of exams: mathematics (MATH) and reading/language arts (RLA). Students can take each exam only once per year. Each student who takes an exam is assigned a performance level designated by the state, but only some students are rated “proficient or above”, which is defined as achieving the “proficient” or “advanced” level. Here is a list of six features that can be read directly from the raw data after some preprocessing steps:

Percentage_Participate_MATH   -- percentage of students who participated in the math exam

Number_Participate_MATH         -- number of students who participated in the math exam

Percentage_Proficiency_MATH   -- percentage of students scoring at or above proficiency level on the math exam

Percentage_Participate_RLA      -- percentage of students who participated in the RLA exam

Number_Participate_RLA            -- number of students who participated in the RLA exam

Percentage_Proficiency_RLA      -- percentage of students scoring at or above proficiency level on the RLA exam

I constructed two more features by combining the existing ones. The test results below show that these two features have the highest feature importance.

Number_Proficiency_MATH  = Number_Participate_MATH  * Percentage_Proficiency_MATH

-- Number of students scoring at or above proficiency level on math exam 

Number_Proficiency_RLA  = Number_Participate_RLA  * Percentage_Proficiency_RLA

-- Number of students scoring at or above proficiency level on RLA exam
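The two combined features can be sketched in pandas as follows. The helper name `add_proficiency_counts` and the sample values are hypothetical, and the sketch assumes the raw, unstandardized columns. If the percentages are stored on a 0-100 scale, the product is 100 times the head count; dividing by 100 would give actual counts, and a constant factor does not change the splits of a tree-based model.

```python
import pandas as pd


def add_proficiency_counts(df):
    """Return a copy of df with the two combined features appended."""
    out = df.copy()
    for exam in ("MATH", "RLA"):
        out[f"Number_Proficiency_{exam}"] = (
            out[f"Number_Participate_{exam}"] * out[f"Percentage_Proficiency_{exam}"]
        )
    return out


# Two made-up schools, for illustration only.
schools = pd.DataFrame(
    {
        "Number_Participate_MATH": [200, 50],
        "Percentage_Proficiency_MATH": [60.0, 90.0],
        "Number_Participate_RLA": [210, 48],
        "Percentage_Proficiency_RLA": [55.0, 75.0],
    }
)
features = add_proficiency_counts(schools)
```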

##### 2, Model Selection and Parameter Tuning

The training data contains 11672 samples with 8 features, and the response ranges from 0 to 100. This is only the core part of the data for one year; in future analysis, I will import more than 200 features from 5 years of data. The distribution of the training data is not clear. For this kind of real-world regression problem, a simple linear regression is unlikely to capture the relationship well. An SVM would likely give a more reasonable result, but a tree-based model is an ideal tool for analyzing this data. XGBoost uses both the first- and second-order terms of the Taylor expansion of the loss function, and its objective is regularized by an L1 or L2 penalty. XGBoost also borrows the learning rate (shrinkage) and subsampling techniques from other tree-based ensemble methods such as gradient boosting and random forests. So I decided to run XGBoost first to test my assumptions on this data. For parameter tuning, I used 10-fold cross-validation and a grid search over four parameters to find the best combination.
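A grid search with 10-fold cross-validation could look roughly like the sketch below. It does not reproduce the original tuning: the four parameters and their candidate values are assumptions, the data is synthetic, and the code falls back to sklearn's `GradientBoostingRegressor` (which accepts the same four parameter names) when xgboost is not installed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Stand-in with the same four parameter names when xgboost is not installed.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic stand-in for the 8 standardized features and the 0-100 response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = np.clip(70 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=2, size=200), 0, 100)

# Hypothetical grid: both the choice of parameters and the values are assumptions.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    estimator=Booster(random_state=0),
    param_grid=param_grid,
    scoring="r2",  # same metric as the score reported in the results
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all of the data with the best combination, so it can be used directly for prediction afterwards.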

##### 3, Results and Plot Explanations

The first plot compares the real values with the predicted values. The R^2 score is about 0.95, so the model predicts the graduation rate well. In this plot, the graduation rates of the 11672 samples are sorted from low to high, and the red line links these 11672 points. The blue dots around the red line are the corresponding predicted values. The close agreement between the two supports the proposed model.

<img src="./result/prediction.png" alt="name" width="500" height="300" />

The second plot is a bar chart of feature importance. The two most important features are the numbers of students scoring at or above proficiency level on the math and RLA exams. This result is consistent with my assumption that the graduation rate of a high school is high if many of its students obtained high exam grades when they were in middle school. The two least important features are the percentages of students who participated in the exams. This is because most students in each school or city take the exams every year, so the values of these two features are almost the same in all samples. We can therefore discard these two features in future analysis.

<img src="./result/feature_importances.png" alt="name" width="500" height="250" />
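For reference, the R^2 score quoted for the comparison plot, and the sorting used to draw it, can be written out in a few lines. This is a minimal sketch with made-up graduation rates; the function names `r_squared` and `sort_by_truth` are hypothetical, and sklearn's `r2_score` computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sort_by_truth(y_true, y_pred):
    """Order both series by the real values, as in the comparison plot."""
    order = sorted(range(len(y_true)), key=lambda i: y_true[i])
    return [y_true[i] for i in order], [y_pred[i] for i in order]


# Made-up graduation rates, for illustration only.
real = [90.0, 60.0, 80.0, 70.0]
predicted = [88.0, 63.0, 79.0, 72.0]
score = r_squared(real, predicted)
real_sorted, predicted_sorted = sort_by_truth(real, predicted)
```

Plotting `real_sorted` as a line and `predicted_sorted` as dots against the sample index gives the layout described for the comparison plot.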

#### Challenges and Future Plans

1, I will continue to explore more features based on race/ethnicity, sex, disability status, homelessness, and more. Interesting results could be obtained by studying the performance of students in different groups. Educators could use these results to allocate limited resources so as to serve students in different groups better. The challenge is that too much data is missing for these subgroups.

2, In the initial analysis of this project, I only used data from one school year, 2012-2013. I will train the model on more data. This raises an interesting question: is there a significant difference in student performance across years or states?

3, I did not use PCA to preprocess the data. However, there are likely similar patterns across different schools or cities, so PCA or other dimensionality reduction methods may improve the prediction results.

#### References:

[1] The Future of Big Data and Analytics in K-12 Education. http://www.edweek.org/ew/articles/2016/01/13/the-future-of-big-data-and-analytics.html
[2] Big Data and Analytics in K-12 Education: The Time is Right. http://www.hmhco.com/~/media/sites/home/Teachers/Files/HMH-CDE_Issue%20Brief_DataAnalytics.pdf
[3] EDFacts Data Files. http://www2.ed.gov/about/inits/ed/edfacts/data-files/index.html
