# COGS 108 - Data Checkpoint

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

- Yuchen Hou: Background research
- Minkyung Gwak: Background research
- Nicolas Leedy: Background research
- Mowen Tan: Background research
- Iris Liu: Background research

## Research Question

How do admission rates and yield rates differ for international, out-of-state, and in-state undergraduate applicants to UC San Diego, and have the gaps between these groups widened or narrowed between 2020 and 2025?

## Background and Prior Work

Admission decisions at selective public universities not only shape the composition of the student body but also reflect institutional priorities and structural constraints. Within the University of California (UC) system, campuses receive applications from multiple residency categories, including California residents (in-state), domestic non-residents (out-of-state), and international applicants. These groups differ in tuition structure, financial contribution, and application volume, making comparisons of admission and enrollment outcomes particularly important. UC San Diego (UCSD), as a highly selective public research institution, provides a useful context for examining how admission rates and yield rates vary across residency groups. Public UC admissions datasets provide counts of applicants, admitted students, and enrolled students by residency status, enabling direct analysis of both admission probabilities and post-admission enrollment behavior.<a href="#ref1">1</a >

Prior research and institutional analyses have emphasized that understanding university access requires examining rates rather than only aggregate enrollment numbers. Summaries provided by the University of California Office of the President highlight how Proposition 209 reshaped admissions practices and underscore the importance of evaluating application, admission, and enrollment patterns separately.<a href="#ref3">3</a > These analyses suggest that raw enrollment totals alone may obscure underlying differences in admissions dynamics. Empirical work examining UC admissions policies similarly demonstrates that changes in student composition may arise from shifts in applicant behavior, selection processes, and institutional factors, motivating closer examination of admission probabilities across groups.<a href="#ref3">3</a >

Contemporary reporting further illustrates the relevance of residency-based comparisons. Recent coverage by the Los Angeles Times documents record numbers of California applicants alongside continued international enrollment growth, highlighting how applicant pools and admissions outcomes may evolve over time.<a href="#ref2">2</a > Policy commentary also points to structural barriers affecting California students, including disparities in academic preparation and access to advising resources, which may influence both application behavior and admissions outcomes.<a href="#ref4">4</a > Together, these discussions motivate systematic quantitative investigation of how admission rates and yield rates differ across residency categories.

Building on prior work, this project investigates how admission rates and yield rates differ for international, out-of-state, and in-state undergraduate applicants to UC San Diego between 2020 and 2025, and whether gaps between these groups have widened or narrowed over time.


References

<a name="ref1">1</a > University of California Admissions. UC San Diego First-Year Admit Data.
https://admission.universityofcalifornia.edu/campuses-majors/san-diego/first-year-admit-data.html

<a name="ref2">2</a > Los Angeles Times. UC Admissions and Residency Trends.
https://www.latimes.com/california/story/2025-07-28/uc-fall-2025-admissions-record-number-from-california-international-students-racial-diversity

<a name="ref3">3</a > Bleemer, Z. The Impact of Proposition 209 and Access-Oriented UC Admissions Policies.
https://www.ucop.edu/institutional-research-academic-planning/_files/uc-affirmative-action.pdf

<a name="ref4">4</a > CalMatters. Barriers Facing California Students Seeking UC Admission.
https://calmatters.org/commentary/2026/01/college-barriers-uc-california-students/

## Hypothesis


We hypothesize that there are distinct patterns across first-generation, undocumented, and international students admitted to UC San Diego (2020-2025) in ethnicity, gender, and intended major. Specifically, due to cultural and economic factors, we believe that first-generation and international students will both be heavily concentrated in STEM majors, that first-generation students will include a disproportionate percentage of underrepresented ethnicities, and that international students will show a different distribution of ethnicities and genders.

We expect the focus on STEM majors to be driven by economic stability. International students face higher costs from travel and education expenses, which may push them toward majors with predictable finances and a high potential return. Similarly, first-generation students may prefer STEM because it offers a higher financial floor and more opportunities after graduation. First-generation students may also come disproportionately from underrepresented ethnicities because of historical disparities in K-12 education, college counseling, and opportunity. International students face different constraints, including restrictions that affect students from certain countries coming to the US and socioeconomic differences across the world.


## Data

- Dataset #1
  - Dataset Name: UC Admissions and Enrollment by Residency
  - Link to the dataset: https://www.universityofcalifornia.edu/about-us/information-center/admissions-residency-and-ethnicity
  - Number of observations: 
  - Number of variables: 8
  - Description of the variables most relevant to this project (all 8 variables are relevant):
    - Total first-time, first-year students who applied: application counts for all 3 residency groups
    - Total first-time, first-year students who were admitted: admission counts for all 3 residency groups
    - Total first-time, first-year students who enrolled: enrollment counts for all 3 residency groups
    - in-state: the three counts above for in-state students only
    - out-of-state: the three counts above for out-of-state students only
    - international: the three counts above for international students only
    - percentage: each residency group's share of the yearly total (applied, admitted, or enrolled)
    - year: 2017-2025
  - Descriptions of any shortcomings this dataset has with respect to the project
    - The biggest shortcoming of this dataset is that it does not contain individual-level data. Because all of the data are aggregated, we can work only with group-level counts and percentages.
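The `percentage` variable described above can be recomputed from the counts as a consistency check. The sketch below is only an illustration: the column names (`year`, `residency`, `applied`) and all numbers are made up for the example and are not taken from the UC dataset.

```python
import pandas as pd

# Hypothetical counts for illustration only; NOT values from the UC dataset.
counts = pd.DataFrame(
    {
        "year": [2024, 2024, 2024, 2025, 2025, 2025],
        "residency": ["in-state", "out-of-state", "international"] * 2,
        "applied": [600, 250, 150, 500, 300, 200],
    }
)

# Total applicants per year, repeated on every row of that year.
year_totals = counts.groupby("year")["applied"].transform("sum")

# Each residency group's share of its year's total.
counts["applied_share"] = counts["applied"] / year_totals

print(counts)
```

The same pattern would apply to admitted and enrolled counts, and within each year the shares should sum to 1.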



In [5]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv





### Admission and yield rates over 9 years at UCSD Dataset

Dataset Description: UCSD Common Data Set (2017–2025)

The UCSD Common Data Set (CDS) is an annual institutional report that provides standardized information on admissions, enrollment, and financial characteristics. The dataset consists of aggregated statistics, typically reported as counts, proportions, or percentages rather than individual-level records. Central variables for this project include the admission rate, defined as the proportion of applicants offered admission, and the yield rate, defined as the percentage of admitted students who ultimately enroll. These metrics are unitless proportions that capture different stages of the admissions pipeline: selection and post-admission decision-making. Because the CDS reports values by residency categories (in-state, out-of-state, international), it enables comparisons across applicant groups.
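As a minimal sketch of the two definitions above, the snippet below derives admission rate (admitted / applied) and yield rate (enrolled / admitted) from counts, then takes the difference between two groups. The function name `add_rates`, the column names, and the numbers are all assumptions made for illustration, not the project's actual data or code.

```python
import pandas as pd


def add_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with admission-rate and yield-rate columns added."""
    out = df.copy()
    out["admit_rate"] = out["admitted"] / out["applied"]    # share of applicants admitted
    out["yield_rate"] = out["enrolled"] / out["admitted"]   # share of admits who enroll
    return out


# Hypothetical counts for illustration only; NOT values from the CDS.
example = pd.DataFrame(
    {
        "year": [2020, 2020],
        "residency": ["in-state", "international"],
        "applied": [1000, 500],
        "admitted": [400, 100],
        "enrolled": [100, 20],
    }
)

rates = add_rates(example)

# Gap in admission rate between two residency groups in the same year.
by_group = rates.set_index("residency")
admit_gap = by_group.loc["in-state", "admit_rate"] - by_group.loc["international", "admit_rate"]

print(rates)
print("admission-rate gap:", admit_gap)
```

Computing the gap per year and comparing it across years is one way to describe whether the groups are converging or diverging.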

Financial variables in the CDS provide important context for interpreting these outcomes. Total Estimated Expenses, measured in U.S. Dollars (USD), represent the annual cost of attendance. Non-resident students (out-of-state and international) face substantially higher tuition than California residents, often exceeding $70,000 per year. Academic preparation measures, such as High School GPA, are reported on a weighted 4.0 scale and serve as standardized indicators of applicant academic performance.

Major Concerns and Potential Biases

Several limitations of the CDS dataset should be considered. First, policy changes affect comparability across years. Following the UC system’s adoption of a test-blind admissions policy, standardized test metrics were removed or inconsistently reported, which alters how academic preparation is represented. Second, because CDS values are aggregated, the data are subject to aggregation bias, limiting analysis to population-level trends rather than individual behavior.

Additionally, financial aid metrics may underrepresent non-resident experiences, particularly for international students who are generally ineligible for federal aid programs. Finally, reported outcomes may reflect survivorship bias, as summary statistics describe enrolled students but do not capture those who decline admission or withdraw. These factors suggest that CDS analyses are best interpreted descriptively rather than causally.



In [16]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

import pandas as pd

# read the raw Excel workbook with pandas (reading .xlsx requires the openpyxl package)
df = pd.read_excel("data/00-raw/Cogs108_dataset.xlsx")

# save a CSV copy, then reload it so later steps work from the CSV
df.to_csv("data/00-raw/Cogs108_dataset.csv", index=False)
df = pd.read_csv("data/00-raw/Cogs108_dataset.csv")

# preview the first 18 rows
df.head(18)




| Unnamed: 0.1 | Unnamed: 0 | Unnamed: 1 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California resident | adm redacted calc along Characteristics FR | 59093.0 | 65735.0 | 66064.0 | 66399.0 | 76398.0 | 84344.0 | 84930.0 | 88415.0 | 87574.0 |
| 1 | California resident | adm redacted percent calc along Characteristic... | 0.67 | 0.67 | 0.67 | 0.66 | 0.65 | 0.64 | 0.65 | 6.65 | 0.64 |
| 2 | Domestic nonresident | adm redacted calc along Characteristics FR | 11443.0 | 12414.0 | 13202.0 | 14389.0 | 20852.0 | 23826.0 | 23996.0 | 24235.0 | 24929.0 |
| 3 | Domestic nonresident | adm redacted percent calc along Characteristic... | 0.13 | 0.13 | 0.13 | 0.14 | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 |
| 4 | International nonresident | adm redacted calc along Characteristics FR | 17912.0 | 19749.0 | 19858.0 | 19247.0 | 21133.0 | 23059.0 | 21909.0 | 21800.0 | 24224.0 |
| 5 | International nonresident | adm redacted percent calc along Characteristic... | 0.2 | 0.2 | 0.2 | 0.19 | 0.18 | 0.18 | 0.17 | 0.16 | 0.18 |
| 6 | California resident | adm redacted calc along Characteristics FR | 18489.0 | 17426.0 | 17849.0 | 21898.0 | 21700.0 | 20117.0 | 20685.0 | 22948.0 | 21542.0 |
| 7 | California resident | adm redacted percent calc along Characteristic... | 0.62 | 0.59 | 0.57 | 0.6 | 0.54 | 0.65 | 0.65 | 0.64 | 0.56 |
| 8 | Domestic nonresident | adm redacted calc along Characteristics FR | 5813.0 | 6315.0 | 7692.0 | 8356.0 | 12300.0 | 7466.0 | 7583.0 | 8106.0 | 9771.0 |
| 9 | Domestic nonresident | adm redacted percent calc along Characteristic... | 0.19 | 0.21 | 0.25 | 0.23 | 0.3 | 0.24 | 0.24 | 0.23 | 0.25 |
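The preview above is in wide form (one column per year), and its percent rows should hold proportions between 0 and 1. The sketch below shows one possible way to reshape such a table to long form and flag proportions outside that range. The column names `residency` and `measure` are assumed stand-ins for the unnamed columns, and the small frame is illustrative; its 6.65 entry copies the 2024 California resident value shown in the preview, which would be worth checking against the source spreadsheet.

```python
import pandas as pd

# Small illustrative frame; `residency` and `measure` are assumed column names.
wide = pd.DataFrame(
    {
        "residency": ["California resident", "Domestic nonresident"],
        "measure": ["percent", "percent"],
        "2023": [0.65, 0.18],
        "2024": [6.65, 0.18],
    }
)

# Reshape from one column per year to one row per (group, measure, year).
tidy = wide.melt(
    id_vars=["residency", "measure"],
    var_name="year",
    value_name="value",
)
tidy["year"] = tidy["year"].astype(int)

# A proportion outside [0, 1] cannot be valid, so flag it for manual review.
is_share = tidy["measure"] == "percent"
suspect = tidy[is_share & ~tidy["value"].between(0, 1)]

print(suspect)
```

Flagged rows are only candidates for review; whether to correct or drop them is a decision to document after checking the original file.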


## Ethics

## Team Expectations 

* Individual members are expected to complete their assigned parts of a given task on time.
* Members are expected to communicate to each other through text, Zoom, and in-person meetings every week.
* Members are expected to address conflicts calmly and focus on solutions beneficial to the team.
* Members are expected to support each other, share resources, and ask for help when needed to ensure the team succeeds. 

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---| 
| 2/1  | 4:00 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/3  |  8:00 PM |  Read 3–5 peer-reviewed articles related to International Student Admissions; summarize key findings in shared doc (at least 5 bullet points per person). | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/4  | 7:00 PM | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14 | 8:00 PM | Identify the problems with the project proposal; plan a workflow for each member to refine their assigned part. | Discuss how to refine our project proposal as a whole; give each other advice on resolving existing problems. |
| 2/18  | 8:30 PM  | Work session: Data Integration: Merge multiple datasets; Initial Cleaning: Quantify missingness (% per column); decide between drop vs. imputation (mean/median/mode); document decision in notebook markdown. | Wrangling Review: Verify data integrity after merging; EDA Deep Dive: Analyze distributions of key variables & identify outliers; Refine Analysis Plan: Select specific statistical tests/models based on EDA trends.   |
| 2/22 | 8:00 PM | Advanced EDA: Generate correlation matrix & scatterplot matrices; Initial Hypothesis Testing.| Feature Selection: Discuss which variables to keep/drop based on correlation; Address Data Bias: Identify any systematic bias in the dataset (Ethics).|
| 2/28 | 8:00 PM | Feature Engineering: One-hot encoding for categorical data; Scaling/Normalization of numerical data | Define Analysis Pipeline: Finalize the choice of ML models (e.g., Regression vs. Classification); Assign members to code specific models. |
| 3/4  | 8:00 PM | Execute Baseline Models; Generate performance metrics (e.g., Accuracy, MSE, R-squared). | Model Evaluation: Compare performance across different models; Identify potential underfitting or overfitting; Complete project check-in. |
| 3/11  | 8:00 PM (Tentative) | Hyperparameter Tuning: Refine models for better performance; Finalize Visualizations (Final polished plots). | Result Interpretation: Explain the "Why" behind the results; Finalize Ethics & Privacy discussion based on final results; Relate findings back to original hypothesis; evaluate whether results support or reject it (statistical significance threshold α=0.05). Review project limitations. |
| 3/12  | 8:00 PM (Tentative) | Draft Results, Conclusion, and Discussion sections; Clean up Jupyter Notebook code and comments. | Full Project Review: Peer-edit the narrative for clarity and flow; Ensure reproducibility (Run the notebook from top to bottom). |
| 3/18  | Before 11:59 PM  | Final proofreading; Ensure all citations and references are formatted correctly. | Final Submission: Turn in Final Project & complete Group Project Surveys. |

Version Control & Workflow

We will use GitHub for version control. Each team member will work on separate feature branches and submit pull requests before merging into the main branch. All major analytical decisions (e.g., data cleaning strategies, model selection, evaluation metrics) will be documented in markdown cells within the notebook. We will ensure full reproducibility by running the notebook from top to bottom before submission.