# COGS 108 - Project Proposal

## Authors

- Michelle Ma: Data
- Yves Mojica: Hypothesis & Timeline
- Edgar Seecof: Ethics
- Travon Williams: Ethics
- Felix Xie: Background Information

## Research Question

To what extent does early academic performance in college predict student dropout? Specifically, using students’ first-year GPA, course completion rates, and credit accumulation as predictors, can we model the probability that a student drops out within one year?
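
One way to make this question concrete, assuming we start with logistic regression (a candidate model, not a final choice), is to write the probability that student $i$ drops out within one year as

$$
\Pr(\text{dropout}_i \mid \text{GPA}_i, \text{CR}_i, \text{CA}_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{GPA}_i + \beta_2 \text{CR}_i + \beta_3 \text{CA}_i)}}
$$

Here $\text{GPA}_i$ is first-year GPA, $\text{CR}_i$ is the course completion rate, $\text{CA}_i$ is credits accumulated, and the $\beta$ coefficients would be estimated from data. These symbols are our own notation for illustration and are not taken from any of the cited papers.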



## Background and Prior Work

Predicting student dropout in higher education has become a prominent topic in educational research because early identification of at-risk students enables universities and colleges to support them proactively and reduce the chance that they leave. Student attrition is costly to higher education institutions: it represents lost tuition revenue, lower completion metrics that reflect poorly on the institution, and an inefficient allocation of resources. Arguably, the consequences are sometimes worse for students, who may incur financial debt, delayed career entry, and negative psychological consequences from leaving college early. At a societal level, dropout undermines workforce development and deepens educational and economic inequality.

These costs have motivated a growing body of work that models student dropout risk using early academic data, which can serve as a basis for data-driven intervention strategies that mitigate the consequences of dropout. Previous research suggests that much dropout occurs in the first years of college and is related to students' early academic performance, which can affect their beliefs about their academic fit and future success. The proposed mechanism is that early academic successes and failures act as a feedback loop that updates students' beliefs about their own abilities. Furthermore, when students fail to meet expectations early on, they may question whether higher education is a worthwhile financial investment, leading to a higher chance of dropout.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Building on this theoretical research, researchers in educational data mining have framed dropout prediction as a supervised machine learning problem. Studies applying traditional ML classification models such as logistic regression, decision trees, and boosting methods have found that academic performance is a relatively consistent predictor of dropout (treated as a binary classification task), an effect that is stronger when the scope is limited to first-year dropout.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This work highlights the feasibility of using a traditional ML approach to predict dropout and motivates us to focus on early academic indicators.

Expanding on this, a related study, whose dataset is hosted on the UCI Machine Learning Repository, frames the problem as a multi-class classification task with three labels: graduated, dropped out, or still enrolled after the expected time to degree. The authors report similar results, arguing that early academic performance is a consistent predictor of student outcomes, but they also note that many other factors and variables play a role, such as a student's financial situation and socioeconomic status.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) This study provides both a validated, cleaned dataset and a methodological reference point for modeling dropout.

1. <a name="cite_note-1"></a> Stinebrickner, T., & Stinebrickner, R. (2014). A major in science? Initial beliefs and final outcomes for college major and dropout. NBER Working Paper No. 18945. https://www.nber.org/papers/w18945
2. <a name="cite_note-2"></a> Lakkaraju, H., Aguiar, E., Shan, C., Miller, D., Bhanpuri, N., Ghani, R., & Addison, K. (2015). A machine learning framework to identify students at risk of adverse academic outcomes. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://dl.acm.org/doi/10.1145/2783258.2788620
3. <a name="cite_note-3"></a> Martins, M. V., Tolledo, D., Oliveira, J., & Gonçalves, R. (2021). Early prediction of student’s performance in higher education: A case study. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Predict+Students+Dropout+and+Academic+Success

## Hypothesis


After reviewing the data available to us, we predict a slight negative correlation between a student's early academic performance and their likelihood of dropping out of university: students with weaker early performance will be somewhat more likely to drop out. Poor early academic performance does not mean that a student will inevitably drop out, and other factors likely have a stronger correlation with early dropout. Students can always catch up, but those who struggle early on have a harder start to their college career, which makes dropping out more likely.
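
One way to check this hypothesis is to compute the correlation between an early-performance measure and a binary dropout indicator. The sketch below uses made-up numbers purely for illustration; the `pearson` helper, the variable names, and the coding of 1 = dropped out are assumptions for this example, not part of any dataset. With a binary variable, the Pearson coefficient is the same quantity as the point-biserial correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical first-year GPAs and dropout flags (1 = dropped out, 0 = did not).
first_year_gpa = [3.8, 3.5, 2.1, 3.0, 1.9, 2.7]
dropped_out = [0, 0, 1, 0, 1, 1]

r = pearson(first_year_gpa, dropped_out)
print(f"correlation between GPA and dropout: {r:.2f}")
```

A negative `r` would be consistent with the hypothesis, and its magnitude would indicate whether the relationship is slight or strong. Correlation alone would not rule out the other factors mentioned above.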

## Data

1. Our ideal dataset would include first-year GPA, course completion rate, credit accumulation, and dropout status within one year, documented as a binary variable (0 = student dropped out, 1 = student remained enrolled) to allow for efficient data processing. Depending on the direction of the project, we may also want to incorporate additional variables such as demographics (age, gender, race/ethnicity), socioeconomic factors (family income bracket, first-generation status), high school GPA, field of study, institution type, and campus size. These additional variables would provide more context for each observation and allow us to draw more nuanced conclusions.
   
    In terms of sample size, we would aim to include several thousand students, ideally over 5,000, to ensure sufficient statistical power and representation of dropout cases. The data would come from undergraduate students entering college for the first time and would be collected using academic records and enrollment statuses during students’ first year of college. These data could be obtained through institutional records, such as registrar data for GPA, credits earned, and course completion, and enrollment databases for registration and withdrawal statuses.

    The data should be stored in a clean, tidy, and structured dataset where each row represents a single student and each column represents one variable. Time-based academic variables, such as term GPAs, should be stored in separate columns (e.g., “Fall GPA” and “Spring GPA”). To protect student privacy, each record should also include a unique anonymized student ID rather than personally identifiable information.


2. One potential dataset for this project is the Predict Students’ Dropout and Academic Success dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/697/predict+students%27+dropout+and+academic+success). This dataset is publicly available and can be downloaded directly without requesting permission, as it is released under a Creative Commons license. It contains data on approximately 4,400 undergraduate students, including demographic, socioeconomic, and academic performance information collected at enrollment and during the first year. Important variables for this project include semester grades, number of curricular units approved, number of units enrolled, and a categorical outcome variable indicating whether a student dropped out, remained enrolled, or graduated. This outcome variable can be converted into a binary indicator of dropout within one year.

    Another useful dataset is the University Student Dropout Longitudinal Dataset hosted on Zenodo and described in an academic data paper (https://zenodo.org/records/17239943). The data are publicly accessible and can be downloaded as CSV files without special permission, though proper citation is required. This dataset tracks students across multiple academic terms and includes detailed information on course enrollments, grades, and credits earned. Key variables relevant to this project include course completion records, cumulative credits earned by term, and enrollment status across semesters, which can be used to derive first-year GPA, completion rates, and dropout within one year. The dataset also includes variables such as parental education level, placement exam results, age, number of assignments submitted, and number of exams taken, which may provide additional context in the analysis.

    A third potential dataset is the Tecnológico de Monterrey Student Dropout Dataset, available through a public DOI repository (https://datahub.tec.mx/dataset.xhtml?persistentId=doi:10.57687/FK2/PWJRSJ). This dataset can be accessed freely and does not require institutional approval. It includes a large number of student records across multiple cohorts and contains academic performance indicators as well as information on the semester in which a student dropped out. Important variables for this study include grades, number of failed courses, and enrollment status by semester. The dataset also provides demographic information such as age, gender, and place of origin, along with variables related to institution type, financial aid eligibility, and extracurricular involvement.
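
As a rough sketch of how the outcome and completion-rate variables described above could be derived, the example below works on a few hypothetical rows. The field names (`units_enrolled`, `units_approved`, `target`) and the label strings are assumptions for illustration and would need to be matched to the actual column names in whichever dataset we use. The binary outcome follows the coding from our ideal dataset (0 = dropped out, 1 = remained enrolled).

```python
# Hypothetical rows; field names and values are illustrative only.
records = [
    {"student_id": "s1", "units_enrolled": 6, "units_approved": 6, "target": "Graduate"},
    {"student_id": "s2", "units_enrolled": 6, "units_approved": 2, "target": "Dropout"},
    {"student_id": "s3", "units_enrolled": 5, "units_approved": 4, "target": "Enrolled"},
    {"student_id": "s4", "units_enrolled": 0, "units_approved": 0, "target": "Dropout"},
]

def completion_rate(approved, enrolled):
    """Share of enrolled units that were approved; None if no units were enrolled."""
    if enrolled == 0:
        return None
    return approved / enrolled

def remained_enrolled(target):
    """Binary outcome: 0 = dropped out, 1 = any other outcome."""
    return 0 if target == "Dropout" else 1

# One row per student, one column per variable (tidy format).
tidy = [
    {
        "student_id": r["student_id"],
        "completion_rate": completion_rate(r["units_approved"], r["units_enrolled"]),
        "remained_enrolled": remained_enrolled(r["target"]),
    }
    for r in records
]

for row in tidy:
    print(row)
```

Returning `None` for students with zero enrolled units avoids a division by zero and makes those cases explicit, so we can decide later whether to drop or impute them.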

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> We will have to read through the paper describing the original data collection and determine whether the data was collected appropriately and whether students were given adequate notice of what was being collected.

 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The data was collected by several researchers in Portugal for their own research, so we will have to consider both the bias the researchers may have introduced during data collection and the biases that may arise because the data comes from Portugal.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Although we currently believe the data we have is anonymous, we will need to re-examine the source and determine whether any additional information needs to be redacted.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We will have to run our analysis both with and without gender, our protected attribute, to see whether including it introduces some form of algorithmic bias, so that we can correct for it where possible and provide a less biased analysis.


### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Since we are not generating any new data and are instead using an existing dataset, there is nothing for us to secure, although we could keep the conclusions of our analysis private once it is complete.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The data was already collected by others so we cannot support this for the original participants.

 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> The data was already collected by others so we cannot support this for the original participants.

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> While we intend for our analysis to be qualitative, we could reach out to experts in education inequity to get a better sense of how accurate our analysis is from their perspective.

 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Since many variables in the data are discrete, including some that might have been better recorded as continuous, we will take particular care in deciding which of them to use and how best to interpret them.

 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will have to ensure that different data points are weighted correctly so that we are not forcing our analysis toward a particular conclusion.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> Since the data does not have any specific PII, we do not anticipate having to do anything specific for our analysis.

 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> We intend to keep a well-formatted, easily readable Jupyter notebook so that it is easy to see what we did, making the analysis easy to audit.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We will have to be careful with the careers of the parents, as these may act as proxies that lead us to conclude that socioeconomic status is the sole determinant of success.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will have to test whether the model is appropriately fair across the binary groupings in the data, such as rural versus urban addresses, which are not actual addresses but a single value for rural and another for urban.

 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Since we intend to find which factors can be used as predictors of a student's chance of dropping out, we need to be cautious about false positives and false negatives. If our model places certain students in the dropout or graduate pile, this could change how money is spent in a school, which we would like to avoid.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We hope to make the model very clear and to explain all of our analysis in the Jupyter notebook so that we can go back and understand and justify the decisions that the model made.

 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will have to ensure that we communicate the various limitations of our data and analysis so that whoever looks at our model understands the potential flaws in our analysis and modeling.
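
The disparate error rate check in D.2 and the false positive and false negative concern in D.3 could be examined together with a small helper like the one below. This is a minimal sketch on made-up predictions: the function name, the group labels, and the coding of 1 = dropped out are assumptions for illustration, not part of our dataset or final method.

```python
from collections import defaultdict

def error_rates_by_group(rows):
    """Compute false positive and false negative rates per group.

    rows: iterable of (group, actual, predicted), where 1 = dropped out.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, actual, predicted in rows:
        c = counts[group]
        if actual == 1:
            c["pos"] += 1
            if predicted == 0:
                c["fn"] += 1  # model missed a student who dropped out
        else:
            c["neg"] += 1
            if predicted == 1:
                c["fp"] += 1  # model flagged a student who did not drop out
    return {
        group: {
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }

# Hypothetical (group, actual, predicted) triples.
rows = [
    ("urban", 1, 1), ("urban", 1, 0), ("urban", 0, 0), ("urban", 0, 0),
    ("rural", 1, 0), ("rural", 1, 0), ("rural", 0, 1), ("rural", 0, 0),
]
rates = error_rates_by_group(rows)
for group, r in rates.items():
    print(group, r)
```

Large gaps between groups in either rate would suggest that the model's mistakes fall unevenly, which is the kind of limitation we would need to report under D.5.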

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> We do not yet have a clear plan for how to monitor the model after it is deployed but, if it were to be used, the users would likely have to be careful with how they use it when modifying school programs and spending.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> No, we have not discussed this, but it would be difficult to address if it were to happen since we do not know the respondents and are geographically very far away from them.

 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> We think that we would be able to just turn off the script. In the end our analysis will not continue to aggregate more data so we do not anticipate any issues that require deleting the model.

 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> We have yet to take steps toward this, but we need to ensure that the model's results cannot be misinterpreted or abused for someone else's purposes.


## Team Expectations 

* Our primary form of communication is Discord. We expect all members to respond to and/or acknowledge all members' messages within one day. We plan to meet once a week, either in person or virtually, on Monday afternoons.

* We expect all members to maintain a respectful and polite tone when communicating with others. Don't be mean, even if there are disagreements. We want to keep an open mind, value everyone's opinions equally, and be proactive in brainstorming solutions for the good of the team. For example, if there are conflicting perspectives, we can communicate our opinions by saying "I don't think moving forward with X is within our group's best interest because of Y. Instead, we should explore Z."

* Ideally, we want to make unanimous decisions. However, this is not always possible, so we will default to majority vote. If a member does not reply to or acknowledge a proposal/message within a day, we can move forward without their input. Team members can react to Discord messages as a form of acknowledgement, especially if they're unable to respond immediately.

* Every member will get first-hand experience pertaining to all aspects of our project. Having members do a little bit of everything will ensure that we are all able to develop our skills individually. We will delegate tasks during our weekly meetings and send a message in our "to-do" channel on Discord.

* If there are any issues, we expect each other to speak up EARLY before the deadline. As a general rule of thumb, we expect members to raise any issues or concerns at least a day or two PLUS the expected time it takes to complete the specific task ahead of the deadline.


## Project Timeline Proposal

Tentative timeline that is subject to change throughout the rest of the quarter

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/26  |  3 PM | Reviewed Lecture slides and any information related to our project  | Determine best form of communication; Introduce Ourselves; Review previous projects; Begin brainstorming possible project ideas  | 
| 2/2  |  2 PM |  Brainstorm Ideas For Final Project | Discuss ideal dataset(s) and project ideas; Draft project proposal/Assign Individual parts  | 
| 2/9  | 2 PM| Finish and finalize project proposal  | Discuss Wrangling and possible analytical approaches; Discuss overall organization of project and procedures; Work on data  |
| 2/16 or 2/18  | 2 PM  | Review dataset and have it prepared for analysis | Work on Data Checkpoint; Discuss Analysis Plan   |
| 2/23  | 2 PM  | Finish data checkpoint and all things related to data | Work on analysis of our data  |
| 3/2  | 2 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/9  | 2 PM  | Work on Final Project | Finishing touches; Turn in Final Project; Group Project Surveys |
| 3/16(?)  | 2 PM  | Work on Final Project | Buffer Day if necessary |