# COGS 108 - Project Proposal

## Authors

Team members and credits:
- Shaila Valenzuela
- Nicholas Nurwinata
- Khuyen Lai
- Weng Lok

## Research Question

How do student demographics, study habits, and non-academic behaviors relate to academic performance (e.g., semester grades) in a recent higher-education context? Specifically, which factors (attendance, study preparation time, gaming habits, family income, etc.) are most predictive of a student’s final GPA, and can we build a predictive model that outperforms a simple baseline? This is a regression problem where the outcome variable will be GPA or equivalent grade metrics, and predictors include socio-economic and behavior attributes. The research will include controlling for confounders such as department/major and gender. It will explore both statistical significance of variables and predictive power via machine learning models.
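As a rough illustration of what "outperforms a simple baseline" could mean in practice, the sketch below compares a mean-GPA baseline with an ordinary least squares fit on held-out data. The data is synthetic, and the names `attendance`, `study_hours`, and `compare_to_baseline` are placeholders for illustration only. The actual features, models, and metrics will be chosen once the dataset has been inspected.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_to_baseline(X_train, y_train, X_test, y_test):
    """Return (baseline_rmse, model_rmse) measured on the held-out test set."""
    # Baseline: always predict the mean GPA seen in the training set.
    baseline_pred = np.full(len(y_test), np.mean(y_train))

    # Model: ordinary least squares with an explicit intercept column.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    return rmse(y_test, baseline_pred), rmse(y_test, A_test @ coef)


# Synthetic placeholder data (NOT real student records; effect sizes are made up).
rng = np.random.default_rng(0)
n = 500
attendance = rng.uniform(0.4, 1.0, n)   # fraction of classes attended
study_hours = rng.uniform(0, 20, n)     # weekly study hours
gpa = 1.5 + 1.5 * attendance + 0.04 * study_hours + rng.normal(0, 0.3, n)

X = np.column_stack([attendance, study_hours])
split = 400  # first 400 rows for training, remaining 100 for testing
baseline_rmse, model_rmse = compare_to_baseline(
    X[:split], gpa[:split], X[split:], gpa[split:]
)
print(f"baseline RMSE: {baseline_rmse:.3f}, model RMSE: {model_rmse:.3f}")
```

Evaluating both predictors on the same held-out rows keeps the comparison fair: the model only counts as an improvement if its test error is lower than that of the constant prediction.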



## Background and Prior Work

Academic performance in higher education is influenced by a mix of student habits, demographic characteristics, and lifestyle behaviors. We define academic success using quantitative measures such as GPA or final course grades. GPA is not only important within the academic environment but is also connected to future opportunities such as graduate school admissions and job offers. Employer surveys indicate that GPA has historically been used as an initial screening tool in entry-level hiring, meaning that stronger academic performance can increase access to early career opportunities, even as some employers shift toward skills-based hiring practices ([NACE, 2023](https://www.naceweb.org/talent-acquisition/trends-and-predictions/nearly-two-thirds-of-employers-use-skills-based-hiring-practices-for-new-entry-level-hires)). Understanding what influences academic performance may therefore help students improve both academic and post-graduate outcomes. Prior research suggests that both academic behaviors and non-academic factors, such as sleep patterns or leisure activities, play roles in shaping student outcomes, though the relative strength of these influences varies by context and population.

Studies have consistently shown that class attendance and engagement are strongly associated with higher academic achievement. Research across institutions has found that increased attendance reduces the likelihood of poor academic outcomes such as failing or withdrawing from courses, suggesting that attendance is a strong predictor of academic success ([Credé et al., 2010](https://files.eric.ed.gov/fulltext/EJ1248452.pdf)). In contrast, excessive video game use and screen time have been linked to lower GPA. Research examining video game usage among college students finds that higher levels of gaming are associated with reduced study time and weaker time management skills, which can negatively affect academic performance ([Anand, 2017](https://www.researchgate.net/publication/303226124_The_impact_of_video_games_on_student_GPA_study_habits_and_time_management_skills_What's_the_big_deal)).

Demographic and socioeconomic factors have also been associated with variation in academic performance. Research on socioeconomic status suggests that indicators such as family income and parental education are correlated with GPA, likely because these factors influence access to educational resources, academic support systems, and the amount of time students can devote to studying. These findings highlight the importance of considering both individual behaviors and background characteristics when modeling academic performance, as academic outcomes are shaped by structural as well as personal factors ([Credé et al., 2010](https://files.eric.ed.gov/fulltext/EJ1248452.pdf)).

Non-academic lifestyle factors, particularly sleep and time management, have also been identified as important predictors of academic performance. Studies of college students show that sleep duration and sleep quality are positively associated with GPA, while chronic sleep deprivation is linked to lower academic performance and increased academic stress ([Short et al., 2023](https://www.jahonline.org/article/S1054-139X(23)00373-7/fulltext)). Additionally, research comparing working and non-working students suggests that employment responsibilities can indirectly affect academic outcomes by reducing time available for sleep and study. Together, these findings suggest that lifestyle constraints outside the classroom can meaningfully influence academic success and should be accounted for in analyses of student performance.


## Hypothesis


We hypothesize that students with better academic habits, especially higher attendance and more consistent study time, will have higher grades. We also predict that non-academic behaviors that take time away from schoolwork, such as gaming and alcohol use, will be negatively associated with GPA. We expect demographic and socioeconomic factors to explain less variation in GPA than academic behaviors. Overall, we expect that a predictive model using these habit variables will outperform a simple baseline by meaningfully reducing prediction error, and that attendance and study time will appear as the most important predictors across models.
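One possible way to check which predictors matter most in a linear model is to compare standardized regression coefficients, as sketched below. This is only an assumption about how the comparison might be done, not a committed method. The data is synthetic, the effect sizes are invented so the example runs, and the names `standardized_coefficients` and `gaming_hours` are hypothetical.

```python
import numpy as np


def standardized_coefficients(X, y, names):
    """Fit OLS on z-scored data and return {name: coefficient}, largest |coef| first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Z-score predictors and outcome so coefficients are on a comparable scale.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()

    # No intercept column is needed because every variable is centered at zero.
    coef, *_ = np.linalg.lstsq(Xz, yz, rcond=None)

    pairs = sorted(zip(names, coef), key=lambda kv: abs(kv[1]), reverse=True)
    return {name: float(c) for name, c in pairs}


# Synthetic placeholder data (NOT real student records; effect sizes are made up).
rng = np.random.default_rng(1)
n = 500
attendance = rng.uniform(0.4, 1.0, n)
study_hours = rng.uniform(0, 20, n)
gaming_hours = rng.uniform(0, 6, n)
gpa = (
    1.5
    + 2.0 * attendance
    + 0.04 * study_hours
    - 0.05 * gaming_hours
    + rng.normal(0, 0.3, n)
)

X = np.column_stack([attendance, study_hours, gaming_hours])
ranked = standardized_coefficients(
    X, gpa, ["attendance", "study_hours", "gaming_hours"]
)
for name, coef in ranked.items():
    print(f"{name}: {coef:+.2f}")
```

Standardized coefficients are only meaningful for roughly linear relationships and can be distorted by correlated predictors, so a model-agnostic measure would be a sensible cross-check for non-linear models.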


## Data

Ideal dataset:
- Individual student records with GPA or final course grades
- Demographic features (gender, age, income, urban/rural status, major)
- Behavioural features (attendance rate, hours studied, hours gaming, part-time job status, extracurricular activities)
- Academic history (high school scores, prior GPA)
- Size: 5000+ records

Potential datasets:
- https://data.mendeley.com/datasets/5b82ytz489/1 Student Performance Metrics Dataset
- https://zenodo.org/records/16459132 Student Performance and Learning Behavior Dataset for Educational Analytics
- https://www.kaggle.com/datasets/rabieelkharoua/students-performance-dataset/ Kaggle dataset

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

This data was collected through a survey of undergraduate students from various departments at Universiti Malaya. Because the dataset was meant to be an open resource for academic research, the survey collected responses with the students' consent and with their full knowledge of how the data would be used.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 Yes. Access to the data will be limited to this GitHub repository, which is restricted from the rest of the open internet.
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
The individuals (students, in this case) are not identified in the data. Their identities are not needed for the analysis; only the features will be analyzed for the project.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
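
For item D.2, one simple check is to compute prediction error separately for each group and compare the results. The sketch below assumes a regression setting with RMSE as the error measure. The function name `error_by_group`, the group labels, and the toy values are hypothetical and are not taken from any of the candidate datasets.

```python
import math
from collections import defaultdict


def error_by_group(y_true, y_pred, groups):
    """Return {group: RMSE} so error rates can be compared across groups."""
    squared_errors = defaultdict(list)
    for actual, predicted, group in zip(y_true, y_pred, groups):
        squared_errors[group].append((actual - predicted) ** 2)
    return {
        group: math.sqrt(sum(errs) / len(errs))
        for group, errs in squared_errors.items()
    }


# Toy values for illustration only (NOT real student records).
y_true = [3.0, 3.5, 2.0, 4.0]
y_pred = [3.0, 3.0, 2.5, 3.0]
groups = ["A", "A", "B", "B"]

group_rmse = error_by_group(y_true, y_pred, groups)
for group, value in group_rmse.items():
    print(f"group {group}: RMSE = {value:.3f}")
```

A large gap between groups would suggest the model serves some students worse than others and would need to be reported as a limitation.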

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 The model is not pre-trained and runs only in a Jupyter notebook, so turning it off or rolling it back is possible.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

- Communication: Weekly check-ins via Discord; urgent issues addressed within 24 hours.
- Work division: One member leads data wrangling, another leads EDA and modeling, another handles background & writing. All team members review and contribute to each section.
- Respect & accountability: All members share drafts at least 24 hours before deadlines; conflicts are resolved through structured discussion.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/9 (Mon) | 7 PM | Proposal finalized and submitted, dataset downloaded and inspected | Confirm dataset understanding, finalize research variables; assign roles for wrangling, EDA, and analysis |
| 2/16 (Mon) | 7 PM | Data cleaning and preprocessing completed, initial EDA started | Review data quality issues, refine EDA questions, adjust analysis plan if needed |
| 2/23 (Mon) | 7 PM | EDA finalized, visualizations and feature engineering completed | Discuss findings from EDA, finalize modeling approach and evaluation metrics |
| 3/2 (Mon) | 7 PM | Initial models implemented and evaluated | Compare models, interpret results, decide on final model(s) |
| 3/9 (Mon) | 7 PM | Final analysis completed, draft results and discussion sections | Edit and refine analysis narrative, connect results back to hypothesis and literature |
| 3/16 (Mon) | 7 PM | Full project draft completed, ethics and limitations finalized | Final edits, rubric check, prepare figures and tables for submission |
| 3/20 (Fri) | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys |
