# COGS 108 - Project Proposal

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Hannah Yuan: Project administration, Data curation, Writing - original draft
- Sakura Nishikawa: Analysis, Software, Visualization
- Scarlett Wu: Methodology, Software, Writing - review & editing
- Tania Jain: Analysis, Visualization, Writing - original draft
- Tanya Bhat: Background Research, Software, Writing - review & editing

## Research Question

To what extent can enrollment behavior observed in the first 72 hours of course registration predict the final maximum waitlist size for UCSD courses across multiple academic terms?

## Background and Prior Work

At UC San Diego, course registration occurs through WebReg and is structured around priority enrollment windows tied to academic standing. Students receive two enrollment appointments (“first pass” and “second pass”), typically a week apart. During first pass, students may enroll in up to 11.5 units; during second pass, they may enroll (and waitlist) up to 19.5 units, increasing to 22 units once instruction begins. Waitlists are only available during second pass, and students may waitlist only one section per course. If a desired section is full, students must either choose another section or join the waitlist. These rules create a competitive, time-sensitive environment where students actively monitor seat availability and react to changes during registration. As a result, the enrollment process itself generates rich behavioral traces that reflect how students compete for limited seats over time. Accurate forecasting is also valuable for students, who must decide which courses to prioritize during first pass and which courses are realistically worth waitlisting.

Universities must forecast course demand each term to determine how many seats, sections, instructors, and classrooms to allocate. When demand is underestimated, courses rapidly fill and waitlists grow, forcing departments to add seats or deny access. When demand is overestimated, sections may run under-enrolled or be cancelled, disrupting schedules and wasting instructional resources. Because of these operational consequences, predicting course enrollment has become an important problem.

<a name="cite_ref-1"></a >[<sup>1</sup>](#cite_note-1) Prior work applies machine learning techniques to predict final course enrollment using historical enrollment patterns, departmental trends, and course offering history. The study finds that historical enrollment data is the single strongest predictor of future course demand, outperforming other features such as course descriptions and scheduling patterns. However, the study focuses on historical signals available before enrollment begins, rather than examining whether early behavior during registration itself can serve as a predictive signal.

<a name="cite_ref-2"></a >[<sup>2</sup>](#cite_note-2) Another paper shows that an automated time-series–based tool—especially using Gaussian Processes—can predict CSUN undergraduate CS course enrollments within one class size for most courses when forecasting up to one academic year ahead, making it practically useful for departmental scheduling despite using only historical enrollment data.
This result suggests that behavioral traces from prior terms carry substantial predictive power. Yet, these signals are also available only before registration begins and do not address whether information revealed during the early hours of enrollment can forecast later enrollment pressure.

<a name="cite_ref-3"></a >[<sup>3</sup>](#cite_note-3) In addition to formal research, prior student analyses using the UCSD Historical Enrollment dataset have attempted to model enrollment speed (how quickly courses fill) using static course attributes such as CAPE ratings, professor evaluations, GPA distributions, and estimated study time. These projects found very poor predictive performance, with regression models producing extremely low R-squared values. The authors concluded that static course metadata was insufficient to explain enrollment behavior and suggested that meaningful prediction would likely require observing the dynamic enrollment process itself. This indicates that the most informative signals may not be who teaches a course or how students rate it, but rather how students behave during registration as seats disappear and waitlists grow.

Together, these works show that course enrollment forecasting is an important problem for universities and their students, and that historical enrollment patterns are useful tools for prediction. However, little prior work has examined whether the early, high-frequency behavioral signals that appear during the first hours of registration, such as rapid seat loss, waitlist growth, and enrollment rate, contain enough information to predict later enrollment pressure. This project addresses that gap by using detailed time-series enrollment data from UC San Diego to investigate whether patterns observed in the first 72 hours of registration can predict the final maximum waitlist size for a course offering.

1. <a name="cite_note-1"></a > [^](#cite_ref-1) Lee, Dianne. 2020. A Classy Affair: Modeling Course Enrollment Prediction. Bachelor's thesis, Harvard College. https://dash.harvard.edu/entities/publication/36459815-6238-4384-82b2-958b5a7b840a

2. <a name="cite_note-2"></a > [^](#cite_ref-2) Modeling in R and Weka for Course Enrollment Prediction. https://www.iaiai.org/journals/index.php/IJIRM/article/view/212

3. <a name="cite_note-3"></a > [^](#cite_ref-3) UCSD Course Enrollment Speed Analysis And Prediction https://colab.research.google.com/drive/1NbM8z0QhziBPPLQWjJMJBkh18vvzOBN7?usp=sharing#scrollTo=yMJmggrMp9Uo

## Hypothesis


We hypothesize that courses with faster seat fill rates and greater enrollment increases during the first 72 hours of registration will exhibit larger final maximum waitlist sizes at UC San Diego. This expectation is based on the idea that early registration behavior reflects underlying student demand and that limited course capacity prevents that demand from being immediately satisfied, leading to sustained waitlists over the remainder of the enrollment period.

## Data


1.  To accurately predict the final maximum waitlist size for undergraduate courses in the Cognitive Science and Mathematics departments at UC San Diego based on the first 72 hours of enrollment, our ideal dataset would be longitudinal and rich in categorized metadata.
   1. For the variables, we need timestamped snapshots anchored to (enrollment_start_time), such as enrollment data recorded every 15 minutes once enrollment opens. Predictors, like the rate of seats filled per minute, would be derived from (enrolled_count), (waitlist_count), and (total_seats_available). The target variable is (max_waitlist_size), and other descriptive variables include (course_subject), (course_level), (rmp_quality_score), and (course_time).
   2. We would need MATH and COGS data for at least the past 3 academic years (excluding graduate-only courses and summer sessions), or approximately 9 quarters, to account for different behaviors between seasons. This would be approximately 5,000 course selections.
   3. This data can be collected by an automated scraper or direct API access to TritonLink or WebReg.  
   4. These data can be stored in a SQL database or a time-series database, which would be organized into two tables: a course info table that's static, and a real-time enrollment log table. 
2. Potential Real Dataset 1: UCSD Historical Enrollment Data(GitHub)
   This is a public open-source project that scrapes data from the UCSD WebReg system. The dataset is organized by academic term, like WI25, SP22, or S123, and the file for the selected term contains the static schedule information. It is essential for linking course metadata to the enrollment counts.
   1. [UCSD Historical Enrollment Data](https://github.com/UCSD-Historical-Enrollment-Data/UCSDHistEnrollData/blob/master/data/schedules/WI25.tsv)
      No special permission or application is required to use it (MIT License).
   2. (subj_course_id): Identifies the course (e.g., "MATH 18"), allowing us to group data by department (MATH or COGS) and course level.
      (sec_id): Section ID that distinguishes the different discussion sections within a course.
      (instructor): Full name of the instructor, which is one of the key factors in course selection.
      (total_seats): The capacity of the section.
      (meetings): Contains the day, time, and building (e.g., "LE,MW'18:30 - 19:50"), allowing us to extract the (course_time).

   Potential Real Dataset 2: RateMyProfessors (RMP) Dataset
   This is a public website that allows users to submit or view ratings of professors on their campus. Students often consult it during course registration to learn about the professors whose courses they are considering.
   1. [RateMyProfessor.com](https://www.ratemyprofessors.com/school/1079)
      There are open-source Python libraries like RateMyProfessorAPI on GitHub designed to scrape this information.
   2. (quality_score): Past students' ratings on a 1-5 scale, which we expect to strongly influence enrollment behavior.
      (difficulty_score): Students often avoid instructors with high difficulty ratings, so this could predict slower waitlist growth.
      (tags): Categorical and qualitative descriptors beyond the numeric scores.

   Potential Real Dataset 3: UCSD Academic Calendar & Registrar Deadlines
   It provides the essential timeline for enrollment activities, which can be used to predict the increase or decrease of the waitlist counts.
   1. [UCSD Academic Calendar & Registrar Deadlines](https://blink.ucsd.edu/instructors/courses/enrollment/calendars/index.html)
      It's publicly available on the Blink (UCSD faculty/staff portal) website. No permission is needed. We would be able to create a small CSV file containing the key dates for each quarter we are analyzing.
   2. (enrollment_start_date): The exact date/time registration opened.
      (second_pass_start_date): The date waitlists become available; waitlist counts often spike when the unit cap rises from 11.5 to 19.5 at second pass.
      (drop_deadline): The date around which enrolled students tend to drop, opening seats for waitlisted students and shrinking the waitlist.
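
As a rough illustration of how the snapshot variables described for the ideal dataset could be turned into early-window predictors and the target variable, here is a minimal Python sketch. It assumes snapshots for one section arrive as `(timestamp, enrolled_count, waitlist_count)` tuples. The `early_features` helper, the feature names, the start date, and all values are hypothetical and are not taken from the real dataset.

```python
from datetime import datetime, timedelta


def early_features(snapshots, enrollment_start, total_seats, window_hours=72):
    """Summarize the first `window_hours` of registration plus the final target.

    `snapshots` is a list of (timestamp, enrolled_count, waitlist_count) tuples
    for a single course section (an assumed layout, not the real file format).
    """
    snapshots = sorted(snapshots, key=lambda s: s[0])
    cutoff = enrollment_start + timedelta(hours=window_hours)
    early = [s for s in snapshots if enrollment_start <= s[0] <= cutoff]
    if not early:
        raise ValueError("no snapshots inside the early window")

    first, last = early[0], early[-1]
    elapsed_hours = (last[0] - first[0]).total_seconds() / 3600
    enrolled_gain = last[1] - first[1]
    return {
        "enrolled_gain_72h": enrolled_gain,
        "fill_fraction_72h": last[1] / total_seats,
        "seats_per_hour_72h": enrolled_gain / elapsed_hours if elapsed_hours > 0 else 0.0,
        "waitlist_at_72h": last[2],
        # Target: the largest waitlist observed anywhere in the full series.
        "max_waitlist_size": max(s[2] for s in snapshots),
    }


# Made-up example: a 100-seat section that is 90% full after 72 hours.
start = datetime(2025, 2, 10, 8, 0)  # hypothetical enrollment_start_time
snapshots = [
    (start, 0, 0),
    (start + timedelta(hours=24), 60, 0),
    (start + timedelta(hours=72), 90, 0),
    (start + timedelta(days=8), 100, 12),   # waitlist opens at second pass
    (start + timedelta(days=20), 100, 25),
]
features = early_features(snapshots, start, total_seats=100)
print(features)
```

Keeping the target (`max_waitlist_size`) computed over the full series while the predictors only see the first 72 hours is what prevents later information from leaking into the features.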

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [x] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
    >In this project, an existing publicly available dataset containing aggregated data for course enrollments at UCSD is used. There is no direct interaction with human subjects, so informed consent is not applicable.
- [x] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
    >We are aware of potential biases related to course types, departments, terms, and enrollment policies, which may affect the generalizability of the findings. To limit this variation, we are focusing on the Cognitive Science (COGS) and Mathematics (MATH) departments.
- [x] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
    >The data does not contain personally identifiable information about students and is aggregated at the course level, so privacy risks are minimal. The analysis does not increase the exposure of the data.
- [x] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
    >The dataset lacks protected attributes, and as such, it is difficult for us to determine the differential effects on different student groups. We will interpret the results cautiously and avoid making prescriptive claims that could create unintended bias in course planning.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
    > The dataset used in this project is obtained from a publicly available GitHub repository and does not contain private credentials or direct student identifiers. However, the group will still take precautions when handling local copies of the data. Files will be stored on password protected devices and restricted to team members only. The team will avoid linking the dataset with other sources that could increase the risk of reidentification and will not redistribute modified datasets that could introduce new privacy concerns.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
     > The dataset used in this project contains only aggregated course information (e.g., time stamps, number enrolled, seats available, and waitlist counts) and does not include any personally identifiable information. As a result, individuals cannot be identified from the data, and the risk of personal data exposure is minimal. The group does not control the original public dataset hosted on GitHub; however, if any privacy concerns were raised or if personally identifiable information were later discovered, the team would remove the affected records from all local copies and exclude them from further analysis.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

    > The research team will retain local copies of the publicly available, aggregate enrollment dataset obtained from GitHub only for the duration of the course project and grading period. After the project is complete, all local raw data files will be deleted from personal devices and shared storage locations. Only summary statistics, figures, and trained model outputs that do not contain raw time-series records will be preserved for documentation purposes. Any backups containing the raw dataset will also be removed according to this schedule.
    
### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
    > Our analysis uses only course-level seat and waitlist data and does not capture individual student experiences, motivations, or constraints (e.g., enrollment time, major requirements). We therefore interpret results as system-level patterns rather than claims about student intent.
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
    > We focus on repeatedly offered, high-demand STEM courses, which introduces selection bias. Our conclusions will be limited to similar competitive courses and not generalized to all university classes.
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to represent the underlying data honestly?
    > We will clearly document preprocessing steps and avoid misleading visualizations or cherry-picked examples, reporting trends across many course offerings.
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
   > The dataset does not contain any personally identifiable information (PII). It consists only of course-level seat counts, waitlist counts, timestamps, and instructor names as publicly displayed on WebReg. No student identities, grades, or personal records are used in this analysis.
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
   > All data processing and modeling steps will be implemented in documented, reproducible notebooks using a publicly available dataset.

### D. Modeling
  - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

    > There is a risk that the model could pick up on student biases against instructors of certain genders, ethnicities, or backgrounds through the (instructor) column, for example by predicting lower waitlists for female instructors in STEM fields.

 - [x] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

    > The model may perform unevenly across different subjects, defined by the (subj_course_id). It might be highly accurate for large, predictable majors like CSE or Biology, but fail for smaller departments where waitlist behavior is erratic, resulting in higher error rates for students in smaller majors. We could evaluate the mean absolute error separately for different departments.

 - [x] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

    > Our current metric is accuracy, which treats overestimates and underestimates the same. Underestimating a waitlist is the worse outcome for students, because it could lead them to count on a seat they never get. We should therefore consider a metric that penalizes underestimation more heavily than overestimation.

 - [x] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

    > Our approach is to prioritize interpretable insights for students, so if the model predicts a massive waitlist, we should be able to explain why, like "This course filled 50% of its (total_seats) in the first 10 minutes, which historically leads to a high waitlist."

 - [x] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

    > Our plan is to clearly state that the model's output is a prediction, not a guarantee, and to specify that the dataset cannot account for sudden administrative changes, like 20 seats suddenly opening in Week 2. We would explicitly state these limitations to stakeholders.
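
The checks mentioned under D.2 and D.3 could be sketched as follows. This is only an illustration under stated assumptions: `mae_by_department` groups errors by the department prefix of (subj_course_id), and `pinball_loss` (the quantile loss) is one possible asymmetric metric that we are suggesting here, not a metric the project has committed to. Both function names, the `quantile=0.75` setting, and the waitlist numbers are hypothetical.

```python
from collections import defaultdict


def mae_by_department(records):
    """Mean absolute error of waitlist predictions, grouped by department.

    `records` holds (subj_course_id, actual_max_waitlist, predicted_max_waitlist).
    """
    errors = defaultdict(list)
    for subj_course_id, actual, predicted in records:
        department = subj_course_id.split()[0]  # "MATH 18" -> "MATH"
        errors[department].append(abs(actual - predicted))
    return {dept: sum(errs) / len(errs) for dept, errs in errors.items()}


def pinball_loss(actual, predicted, quantile=0.75):
    """Quantile (pinball) loss; a quantile above 0.5 penalizes underestimates more."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        # diff > 0 means the model underestimated the waitlist.
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(actual)


# Made-up predictions for illustration only.
records = [("MATH 18", 40, 30), ("MATH 20C", 25, 35), ("COGS 108", 10, 4)]
per_department = mae_by_department(records)
under = pinball_loss([40], [30])  # underestimate by 10
over = pinball_loss([30], [40])   # overestimate by 10
print(per_department, under, over)
```

With a quantile of 0.75, an underestimate costs three times as much as an overestimate of the same size, which matches the concern that underestimated waitlists are the worse outcome for students.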

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
    > If the analysis leads to misleading or biased conclusions that negatively affect course planning, the analysis should be re-examined and revised. This may include adjusting assumptions, incorporating additional variables, or clearly communicating uncertainty to prevent repeated harm.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
    > If the predictive relationship no longer holds or produces unreliable results, the use of this analysis should be paused or discontinued until it is revalidated. Rolling back to alternative or descriptive approaches would help avoid reinforcing incorrect decisions.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
    > The results of this project may be misused if interpreted as measures of course quality or instructor performance. To reduce this risk the intended scope and limitations of the analysis should be clearly stated, emphasizing that it is meant for aggregate-level planning rather than individual evaluation.

## Team Expectations 

  
Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1* - Accountability and Shared Responsibility: It is expected that all team members are responsible for their designated tasks and ensure they are completed on time and to the best of their ability. Tasks will be assigned according to votes on a first-come, first-served basis. While there are specific tasks for each team member, there is also an understanding that the final product is a shared responsibility, and all team members are expected to contribute to the team's success.

* *Team Expectation 2* - Consistent and Respectful Communication: Our team will communicate regularly throughout the quarter using a consistent means of communication, such as through Instagram and iMessage. We will make every effort to reply within a timely manner. We plan to meet weekly over Zoom and in person if required. We will also respect each individual’s communication style and provide feedback constructively.

* *Team Expectation 3* - Collaboration, Inclusion, and Mutual Respect: We believe that every member of the team has valuable ideas and skills to bring to the table. We will strive to create an environment where everyone feels comfortable contributing. We will use majority vote to make decisions for our project and respect the final choice. We are committed to respecting each other’s work styles and schedules.

* *Team Expectation 4* - Conflict Resolution: In the event that conflicts or disagreements arise, we will resolve the issues immediately, rather than letting the problem escalate. We will trust that all individuals have good intentions and will listen to each other’s views and opinions, with the goal of resolving the problem. If we are unable to resolve the problem, we will seek advice from the course staff early on to maintain a productive and respectful working environment.


## Project Timeline Proposal


Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/6  |  3 PM | Conduct preliminary data inspection (reviewing structure, missing values, and data types)  | Review datasets and all progress made before the meeting | 
| 2/13  |  3 PM | Address missing values, outliers, and inconsistent data types; Document all cleaning steps and decisions in the notebook; Confirm the dataset is fully clean and ready for analysis | Revise and clarify data cleaning explanations | 
| 2/20  | 3 PM  | Create at least three appropriate data visualizations; Use suitable plot types and clearly labeled axes; Interpret each visualization in the notebook text  | Conduct Exploratory Data Analysis (EDA) to identify key patterns; Evaluate whether EDA findings align with the original hypothesis   |
| 2/27  | 3 PM  | Perform the main Data Analysis; Apply appropriate analytical methods to answer the research question; Interpret analysis results and assess hypothesis support  |  Draft the Overview section (3–4 sentences)   |
| 3/6  | 3 PM  | Write Privacy and Ethics Considerations; Address potential bias, ethical concerns, and responsible data use; Complete Conclusion & Discussion; Clearly answer the research question  | Discuss limitations of the analysis; Prepare and record the Final Video (3–5 minutes); Ensure the question, methods, results, and takeaway are clear|
| 3/13  | 3 PM  | Conduct Final Checks; Remove all instructions; Ensure all text and visuals display correctly; Include all group members’ names; Rename the final notebook| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |