# Leveraging Student Information to Enhance College Persistence

## Objective of the Project

- Develop a predictive analytics scoring system that ranks students' likelihood of persisting to their second year of college within the 10,000 Degrees program
- Identify key factors that impact student persistence
- Assist 10,000 Degrees in integrating the scoring system into their decision-making process
- Utilize data insights to continually improve the accuracy and effectiveness of the scoring system

## Data Sources

The student data is provided by 10,000 Degrees and obtained through the National Student Clearinghouse (NSC). It includes the following:
- Demographics data such as gender, race/ethnicity, English as a second language, and highest degree earned before starting college
- High school data such as name, region, graduation year, participation in the 10,000 Degrees program, and dual enrollment during high school
- College data such as name, region, 2-year vs. 4-year program, participation in the 10,000 Degrees program, persistence indicator for the second year of college (target variable), and transfer status to a 4-year college

Publicly available data sources include:
- GreatSchools.org for school ratings
- County information for residence

![image.png](attachment:image.png)

## Review of the Data Sample for the Development of the Scoring System

| #     | Category |# Records    |# Records Excluded |Comments|
| :---        |    :----   |    :----   |   :---|   :---|
|0|Sample from 10,000 Degrees|9,945|--|Sample was pulled by 10,000 Degrees in January 2023.|
|1|Persistence Indicator == 'Yes' or 'No'|4,990|4,955|Remove 'Ineligible' students. Some of them took time off before starting college, so we do not know whether they persisted to their second year of college. They are treated as indeterminate.|
|2|High school Grad Year >= 2010|4,975|15|Students who graduated from high school before 2010 may not have reliable data.|
|3|Participated in a 10,000 Degrees program|4,663|312|Students with 'null' values did not get support from 10,000 Degrees.|
|4|Attended a 2-year program|2,019|2,644|The study will primarily focus on students attending a 2-year program, as the dropout rate for 4-year students is very low.|
|5|Started college after high school|1,393|626|The focus of the study is on high school students, and those who have obtained prior degrees such as a certificate or another college degree will not be included in the study, as their dropout rate is also very low.|
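
The sequence of exclusions in the table can be sketched as an ordered list of filters, each reporting how many records it keeps and removes. This is a minimal illustration, not the project's actual code: the field names (`persistence`, `hs_grad_year`, `program`, `college_type`, `prior_degree`) and the six sample records are hypothetical, since the real column names in the 10,000 Degrees extract are not given here.

```python
from typing import Callable

# Hypothetical record layout: one dict per student.
Student = dict

# Filters applied in the same order as the table rows.
# Each predicate returns True for students that stay in the sample.
STEPS: list[tuple[str, Callable[[Student], bool]]] = [
    ("Persistence Indicator is 'Yes' or 'No'",
     lambda s: s["persistence"] in ("Yes", "No")),
    ("High school grad year >= 2010",
     lambda s: s["hs_grad_year"] >= 2010),
    ("Participated in a 10,000 Degrees program",
     lambda s: s["program"] is not None),
    ("Attended a 2-year program",
     lambda s: s["college_type"] == "2-year"),
    ("Started college after high school",
     lambda s: s["prior_degree"] is None),
]


def waterfall(students):
    """Apply the filters in order.

    Returns (rows, remaining), where each row is (label, kept, excluded).
    """
    rows = [("Starting sample", len(students), 0)]
    remaining = list(students)
    for label, keep in STEPS:
        kept = [s for s in remaining if keep(s)]
        rows.append((label, len(kept), len(remaining) - len(kept)))
        remaining = kept
    return rows, remaining


# Made-up records for illustration only; each fails a different step.
sample = [
    {"persistence": "Yes", "hs_grad_year": 2019, "program": "Institute",
     "college_type": "2-year", "prior_degree": None},
    {"persistence": "Ineligible", "hs_grad_year": 2018, "program": "Success",
     "college_type": "2-year", "prior_degree": None},
    {"persistence": "No", "hs_grad_year": 2008, "program": "Success",
     "college_type": "2-year", "prior_degree": None},
    {"persistence": "Yes", "hs_grad_year": 2020, "program": None,
     "college_type": "2-year", "prior_degree": None},
    {"persistence": "No", "hs_grad_year": 2021, "program": "ECCA",
     "college_type": "4-year", "prior_degree": None},
    {"persistence": "Yes", "hs_grad_year": 2017, "program": "Success",
     "college_type": "2-year", "prior_degree": "Certificate"},
]

rows, final_sample = waterfall(sample)
for label, kept, excluded in rows:
    print(f"{label:<42} kept={kept} excluded={excluded}")
```

Keeping the filters in a single ordered list makes it easy to reproduce the record counts after each step and to check that kept plus excluded always equals the previous row's count.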



## Review of the Data Features for the Development of the Scoring System

### Dependent Variable

The dependent variable was derived from NSC data. It covers students who first enrolled in college in the Fall semester immediately after high school graduation. This group was split into those who returned to any college the following Fall (persisted) and those who withdrew from the college system (dropped out).
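
As a rough sketch of this definition, the function below assigns a persistence label from a student's Fall enrollment history. It assumes a hypothetical flattened form of the NSC records (a set of calendar years in which the student was enrolled at any college in the Fall term). The `'Ineligible'` outcome for students who did not start in the Fall right after graduation follows the description in the sample review table. The actual derivation may handle cases that this sketch ignores.

```python
def persistence_indicator(hs_grad_year, fall_enrollment_years):
    """Return 'Yes', 'No', or 'Ineligible' for one student.

    hs_grad_year: calendar year of high school graduation.
    fall_enrollment_years: set of calendar years with a Fall-term
        enrollment at any college (hypothetical flattened NSC data).
    """
    first_fall = hs_grad_year  # Fall term right after graduation
    if first_fall not in fall_enrollment_years:
        # Did not start college right after high school.
        return "Ineligible"
    # Persisted if enrolled at any college the following Fall.
    return "Yes" if (first_fall + 1) in fall_enrollment_years else "No"


print(persistence_indicator(2019, {2019, 2020}))  # enrolled both Falls
print(persistence_indicator(2019, {2019}))        # did not return
print(persistence_indicator(2019, {2020, 2021}))  # delayed start
```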

Of the 1,393 students in the data sample, 33% dropped out. 10,000 Degrees aims to evaluate the risk factors associated with these students and introduce interventions that support their academic progress.

### Independent Variables

After an extensive exploratory data analysis, the following variables were used for model development:

| #     | Variable |Relationship with the Dependent Variable|
| :---        |    :----   |    :----   |
|1|High school rating|Dropout rate decreases for students who graduated from schools with higher ratings.|
|2|ESL|Students whose first language is not English are at a higher risk of dropping out.|
|3|High school program participant|Attending the ECCA program during high school reduces the likelihood of dropping out.|
|4|Summer program participant| Students who participate in the Institute program during the summer before starting college have a lower dropout rate in their second year of college.|
|5|Success program participant with scholarship|Students who receive a scholarship through the Success program are more likely to persist in college than those who only attended the program.|
|6|Dual enrollment in college during high school|Students who enrolled in community college during high school are more likely to persist to their second year of college.|
|7|Gender|Female students are more likely to persist.|
|8|Race|Asian students have a lower likelihood of dropping out, whereas African American students are more likely to drop out.|
|9|Geographic location of the high school attended|Students from Marin and Napa counties are more likely to persist.|
|10|Time until First College Enrollment|Students who attend the summer program offered by colleges immediately after high school graduation are more likely to persist.|
|11|Geographic location of the college attended|Students who attend colleges located outside of California are at a higher risk of dropping out, compared to those who attend colleges within the state. However, students who remain in California but move away from the Bay Area exhibit a moderate level of risk.|
|12|Geographic location change after high school|Students who attend college in a different county than their high school have a slightly higher likelihood of dropping out.|

## Model Validation Results

After testing several models on the analysis sample, the gradient boosting model was chosen due to its superior performance on both the training and test samples.

The following observations were made based on the model's ability to rank students by their probability of dropping out of college:

- The chart below shows that students in the top score deciles are considerably more likely to drop out than those in the lower deciles.
- On the training sample, the dropout rate for the highest-scoring decile is more than nine times that of the lowest-scoring decile. The gap is much smaller on the test sample but still notable: 65% versus 20%.
- On the training sample, the dropout rate changes monotonically from decile to decile, so the score rank-orders students perfectly. Fluctuations on the test sample are small, which suggests the rank ordering carries over reasonably well to unseen students.

![image.png](attachment:image.png)
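
A decile view like the one charted above can be computed along the following lines. This is a minimal sketch under stated assumptions: the score is taken to be a predicted dropout risk (higher means more likely to drop out), the function names are illustrative, and the ten scores and outcomes are made up. Five bins are used here only because the toy sample is small.

```python
def dropout_rate_by_bin(scores, dropped_out, n_bins=10):
    """Dropout rate per score bin, highest-scoring bin first.

    scores: predicted dropout risk per student (higher = riskier).
    dropped_out: 1 if the student dropped out, 0 if they persisted.
    """
    if len(scores) < n_bins:
        raise ValueError("need at least one student per bin")
    # Rank students from highest to lowest score.
    ranked = sorted(zip(scores, dropped_out), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    rates = []
    for b in range(n_bins):
        # Integer boundaries give bins of near-equal size.
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        group = ranked[lo:hi]
        rates.append(sum(d for _, d in group) / len(group))
    return rates


def rank_order_breaks(rates):
    """Count adjacent bins where the lower-scored bin has a higher dropout rate."""
    return sum(1 for a, b in zip(rates, rates[1:]) if b > a)


# Made-up scores and outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
dropped_out = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

rates = dropout_rate_by_bin(scores, dropped_out, n_bins=5)
print(rates)
print("rank-order breaks:", rank_order_breaks(rates))
```

A count of zero breaks corresponds to a perfect rank ordering. Running the same check on both the training and the test sample makes the comparison in the bullets above explicit.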

## Factors Impacting Persistence

Based on our analysis, the likelihood of a student persisting to their second year of college is most strongly influenced by the following top five factors:
- Participation in a summer program offered by their college after high school graduation (time until first college enrollment)
- Attending a high school with a high rating
- Enrollment in a Success program with a scholarship
- Race
- Geographic location of the college attended

![image-2.png](attachment:image-2.png)

Students who participated in a summer program offered by their college after high school graduation are more likely to persist to their second year of college. The same holds for students who attended a highly rated high school, enrolled in the Success program with a scholarship, or belong to certain racial groups. The geographic location of the college attended also affects a student's likelihood of persisting.
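
One common way to rank factors for a fitted model is permutation importance: shuffle one feature across students and measure how much the model's accuracy drops. This write-up does not state how the ranking above was produced, so the sketch below is only an illustration of that general technique. The `toy_model` function is a hypothetical stand-in for the fitted gradient boosting model, and the feature names and data are made up.

```python
import random


def accuracy(model, rows, labels):
    """Share of students whose predicted label matches the actual label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Average drop in accuracy when one feature is shuffled across students."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row with only the chosen feature replaced.
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in: predicts dropout (1) without the summer program."""
    return 1 if row["summer_program"] == 0 else 0


# Made-up data: 20 students, two binary features.
rows = [{"summer_program": i % 2, "gender_female": (i // 2) % 2} for i in range(20)]
labels = [toy_model(r) for r in rows]  # toy labels the model fits perfectly

importance = {
    f: permutation_importance(toy_model, rows, labels, f)
    for f in ("summer_program", "gender_female")
}
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {value:.3f}")
```

Because the toy model ignores `gender_female`, shuffling that feature cannot change its predictions, while shuffling `summer_program` does. With a real model, the same comparison would be run on held-out data.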

## Further Research

We're actively collaborating with 10,000 Degrees to improve our model's performance by addressing some critical areas:
- To increase the sample size, we are working with 10,000 Degrees to retrieve the persistence status of the excluded students with Persistence Indicator = 'Ineligible'. These students didn't start college right after high school and did not receive a Yes/No value for Persistence Indicator. 10,000 Degrees is collaborating with their partner who flattens the NSC data to explore alternative ways of obtaining their persistence status. If we can get a valid Persistence Indicator for these students, we can possibly double our data size and enhance our analysis.
- The Success program participation factor needs to be further investigated as there are indications that the students who didn't attend the program have lower persistence rates. We'll be closely working with 10,000 Degrees to identify the underlying reasons.
- Another crucial aspect we're exploring is the impact of full-time student status on persistence rates. 10,000 Degrees' prior research suggests that full-time students are more likely to persist, and we're trying to access this data to improve our model.
- We also plan to isolate the COVID-19 period in our analysis and compare the model's results with and without the pandemic data once we have a bigger sample. This will help us gain more insights into the factors affecting student persistence during challenging times.
- As we work to expand our data sample, we may want to explore other modeling techniques such as neural networks. While the gradient boosting model has shown promise, we are interested in evaluating whether a neural network could lead to further improvements in our predictive performance.
- It's important for our model to be interpretable so that we can provide meaningful insights to 10,000 Degrees. To that end, we will research how to build more explainable models and explore ways to provide detailed explanations at the individual record level to better understand the factors impacting a student's persistence score.