## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint, they will earn these points back.


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

For the years 2018, 2020, 2022, and 2024 at UC San Diego (UCSD), how do transfer-admit selectivity patterns differ across ten selected majors, and how are these patterns associated with major-level alumni earnings? Using major × year aggregated data, we will characterize selectivity using admitted GPA ranges (converted to midpoint and spread) and admission rates, and we will estimate year-to-year trends for each major. We will then merge in major-level earnings outcomes (25th/median/75th percentile annual earnings and earner counts) for the corresponding year and test whether majors with higher earnings tend to have higher admitted GPA midpoints and/or lower admission rates. Our analysis will use descriptive summaries, trend comparisons, and correlation/OLS regression models (with controls for year, major fixed effects, and applicant volume where available) to assess the strength and significance of these relationships. The final deliverables will identify (1) which majors appear most selective in each year, (2) which majors show the fastest changes in selectivity across 2018, 2020, 2022, and 2024, and (3) whether major-level earnings are statistically associated with selectivity measures in this period.
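As a minimal sketch of the two selectivity measures named above, the snippet below converts an admitted GPA range into a midpoint and spread and computes an admission rate. It assumes the raw range arrives as a string such as `3.40 - 3.80`; that format, the function names, and the example numbers are illustrative assumptions, not the actual schema of our data.

```python
import re


def gpa_range_to_mid_spread(gpa_range):
    """Split a range such as '3.40 - 3.80' into (midpoint, spread).

    Assumes the string holds exactly two GPA values (hypothetical format).
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", gpa_range)
    if len(numbers) != 2:
        raise ValueError(f"expected two GPA values, got: {gpa_range!r}")
    low, high = sorted(float(n) for n in numbers)
    # Round to hide floating-point noise such as 0.3999999999999999
    return round((low + high) / 2, 3), round(high - low, 3)


def admission_rate(admits, applicants):
    """Share of applicants admitted, as a fraction between 0 and 1."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return admits / applicants


# Made-up values for illustration only
mid, spread = gpa_range_to_mid_spread("3.40 - 3.80")
print(mid, spread, admission_rate(250, 1000))
```

A higher midpoint or a lower admission rate would both be read as greater selectivity, while the spread indicates how wide the admitted GPA band is.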

## Background and Prior Work

Choosing a university major is a crucial decision that shapes a student's academic trajectory and future career prospects. In recent years, there has been increasing attention to the "return on investment" of higher education, with prospective students often weighing the academic difficulty of a major against its potential graduate earnings. For transfer students, who typically have more defined career goals upon entering university, the competitiveness of admission is a primary evaluation criterion; we measure it by the GPA distribution of admitted students (25th, 50th, and 75th percentiles). This study uses the University of California, San Diego (UCSD), a leading public research university where transfer admission selectivity and GPA thresholds vary significantly across disciplines, as a case study.

Previous research has confirmed a significant but nuanced link between university admission difficulty and post-graduation earnings. Data from the University of California Information Center suggests this relationship is particularly pronounced at highly competitive institutions like UCSD, especially in fields such as mechanical engineering and computer science, where graduates consistently rank among the highest earners five years after graduation. On the other hand, some "pre-professional" majors, such as human biology, maintain high GPA admission thresholds despite potentially lower undergraduate earnings compared to technical fields. This is often because these students go on to pursue advanced medical degrees, indicating that "admission difficulty" is not solely determined by current average salaries but also by long-term academic value.

Labor market conditions and alumni salary signals have also been shown to influence student demand, and consequently, the academic standards required for admission. It is widely believed that students tend to gravitate towards majors perceived to have higher economic returns, such as STEM fields and applied social sciences like business psychology. This shift in demand leads to increased admission difficulty and higher GPA admission thresholds. For example, UCSD's computer science program consistently maintains higher GPA requirements than humanities disciplines like history, directly reflecting its leading position in alumni earnings data.

Despite these observations, there is a lack of localized research specifically examining the year-to-year changes in UCSD's admission GPA distribution and its correlation with real-time average salary trends for specific majors. While general research suggests that a 10% increase in expected salary increases the likelihood of students choosing a particular major by roughly 14% to 18%, it remains unclear whether these enrollment preferences immediately translate into quantifiable changes in the 25th and 75th percentiles of matriculation GPA. This project aims to bridge the gap between educational enrollment data and labor market wage signals by analyzing fifteen different majors at the University of California, San Diego (UCSD), spanning fields from history to mechanical engineering.

## Hypothesis


Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
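
A minimal sketch of steps 3.4 to 3.6, using only the standard library so it runs without extra packages. The CSV text, column names, and helper functions are made up for illustration (the real files live under `data/00-raw/`); in the notebook the same checks would usually be done on a pandas DataFrame.

```python
import csv
import io
import statistics

# Made-up rows standing in for a raw file; not real admissions or earnings data.
RAW_CSV = """major,year,admit_rate,earn_med
Major A,2018,45%,"$62,000"
Major B,2018,,"$81,500"
Major C,2018,38%,
Major D,2018,52%,"$58,250"
"""


def parse_percent(text):
    """'45%' -> 0.45; empty or missing -> None."""
    text = (text or "").strip()
    return float(text.rstrip("%")) / 100 if text else None


def parse_currency(text):
    """'$62,000' -> 62000.0; empty or missing -> None."""
    text = (text or "").strip()
    return float(text.replace("$", "").replace(",", "")) if text else None


def load_rows(csv_text):
    """Read CSV text into a list of dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "major": raw["major"],
            "year": int(raw["year"]),
            "admit_rate": parse_percent(raw["admit_rate"]),
            "earn_med": parse_currency(raw["earn_med"]),
        })
    return rows


def missing_counts(rows):
    """Count missing (None) values per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


rows = load_rows(RAW_CSV)
print(len(rows), "rows")
print(missing_counts(rows))
```

Flagged values are only candidates for review; whether to drop, fill, or keep them is a choice that should be justified in the write-up.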


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.
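
A minimal sketch of the join step, assuming both datasets have already been tidied to one row per major and year. The rows, column names, and the `merge_on_major_year` helper are hypothetical; in pandas the same join would usually be a `merge` on `["major", "year"]`.

```python
# Made-up rows keyed by (major, year); not real data.
admissions = [
    {"major": "Major A", "year": 2018, "gpa_mid": 3.60, "admit_rate": 0.45},
    {"major": "Major A", "year": 2020, "gpa_mid": 3.65, "admit_rate": 0.41},
    {"major": "Major B", "year": 2018, "gpa_mid": 3.80, "admit_rate": 0.30},
]
earnings = [
    {"major": "Major A", "year": 2018, "earn_med": 62000.0},
    {"major": "Major B", "year": 2018, "earn_med": 81500.0},
]


def merge_on_major_year(left, right):
    """Inner-join two row lists on (major, year).

    Returns the merged rows plus the left-side keys that found no match,
    so dropped rows are visible instead of disappearing silently.
    """
    right_by_key = {(r["major"], r["year"]): r for r in right}
    merged, unmatched = [], []
    for row in left:
        key = (row["major"], row["year"])
        match = right_by_key.get(key)
        if match is None:
            unmatched.append(key)
        else:
            merged.append({**row, **match})
    return merged, unmatched


merged, unmatched = merge_on_major_year(admissions, earnings)
print(len(merged), "merged rows; unmatched keys:", unmatched)
```

Reporting the unmatched keys makes it easy to spot major names or years that are spelled or coded differently across sources before they reduce the sample size.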

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics


Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?  

    Admissions and earnings data may reflect structural inequalities, differences in applicant pools, and labor market demand. These systemic factors may influence observed associations.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?  

    The dataset contains only major-by-year aggregated statistics and no personally identifiable information.
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?  

    Because we analyze aggregated data, we cannot evaluate individual-level fairness. We explicitly state this limitation in our discussion.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?  

    Data is stored locally on password-protected devices and in a private GitHub repository accessible only to team members.
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?  

    No individual-level data is collected, so this is not applicable.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?  

    Data will be retained only for the duration of the course project.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?  

    Our analysis does not account for demographic factors, socioeconomic background, or non-monetary educational value.
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?  

    Earnings may reflect broader economic trends, and admission rates may depend on applicant volume. These may confound interpretation.
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?  

    We report correlations and regression results transparently and avoid causal claims.
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?  

    All results are based on aggregated statistics; no individual data is displayed.
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?  

    All preprocessing and modeling steps are documented in the notebook to ensure reproducibility.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?  

    Although no sensitive attributes are used, selectivity and earnings may indirectly reflect systemic inequality.
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?  

    We use multiple selectivity measures and regression models rather than relying on a single metric.
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?  

    We use interpretable linear regression models.
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?  

    We clearly state that correlation does not imply causation.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?  

    This is an academic analysis and not a deployed system.
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?  

    Findings could be misinterpreted as ranking majors solely by income; we caution against oversimplification.

## Team Expectations 

Below are our team expectations.

### Communication

Primary Communication: WeChat will be our main communication platform. We will create a group chat for all project-related discussions. For important discussions, we may also schedule in-person meetings.

Response Time: All team members are expected to respond to messages within 24 hours. If there is an urgent issue (e.g., upcoming deadline or ready check before submission), members should respond as soon as possible.

Meetings: We will hold in-person meetings when necessary (especially before major deadlines). If someone cannot attend, they must inform the group in advance and review meeting notes afterward.

Respectful Communication: We will communicate respectfully and professionally. It is okay to disagree, but we will discuss issues constructively and focus on problem-solving rather than personal criticism.

### Task Completion & Responsibility

Each member is responsible for completing their assigned tasks on time.

If someone anticipates difficulty meeting a deadline, they must inform the group as soon as possible so we can adjust responsibilities.

We will use GitHub (Issues/Projects board) to track task progress and ensure transparency.

All members should contribute meaningfully to coding, analysis, writing, and presentation work.

### Addressing Problems

If a team member becomes unresponsive or does not complete tasks:

1. First, we will check in privately to understand the situation.
2. If the issue continues, we will discuss it as a group.
3. If the problem persists and affects project progress, we will inform the TA and/or professor.

### Commitment

All team members agree to:

- Communicate clearly and respectfully.
- Contribute fairly to the project.
- Support each other throughout the quarter.
- Work toward completing the project successfully and professionally.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 1/20 | 1 PM | Read COGS 108 expectations; brainstorm UCSD-focused research ideas | Finalize research question: UCSD transfer selectivity and major-level earnings |
| 1/26 | 10 AM | Collect UCSD transfer admissions data and major-level earnings data | Review dataset structure; identify key variables (major, year, gpa_mid, admit_rate, earn_med, earn_p25, earn_p50, earn_p75) |
| 2/1 | 10 AM | Begin data cleaning; convert percentages and currency formats | Discuss construction of selectivity metrics (GPA midpoint, admission rate); check missing values |
| 2/16 | 6 PM | Complete full data cleaning and merge by major × year (due 2/16) | Confirm final cleaned dataset; verify merged variables and sample size |
| 2/23 | 12 PM | Complete EDA: trends in gpa_mid and admit_rate by major; earnings distributions | Interpret EDA results; refine hypotheses about selectivity–earnings relationship |
| 2/28 | 6 PM | Finalize EDA checkpoint (due end of Feb) | Plan statistical analysis: Pearson/Spearman correlations; OLS regressions (gpa_mid ~ earn_med, admit_rate ~ earn_med + year + C(major)) |
| 3/10 | 12 PM | Run regression models; include year and major fixed effects | Interpret coefficients; discuss limitations and potential confounders |
| 3/17 | 6 PM | Draft final results, discussion, and ethics section | Final review and edits before submission |
| 3/18 | Before 11:59 PM | NA | Submit Final Project & Group Surveys |
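
The correlation and regression steps planned in the timeline can be sketched as follows. This is a bivariate illustration only, written with the standard library; the numbers are made up, the function names are our own, and the fixed-effects model `admit_rate ~ earn_med + year + C(major)` would need a dedicated regression library rather than this hand-rolled version.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ols_line(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Made-up values for illustration only: median earnings and GPA midpoints
earn_med = [52000, 61000, 70000, 88000]
gpa_mid = [3.30, 3.45, 3.50, 3.80]
print("Pearson:", pearson(earn_med, gpa_mid))
print("Spearman:", spearman(earn_med, gpa_mid))
print("OLS (slope, intercept):", ols_line(earn_med, gpa_mid))
```

With only a handful of major × year rows, these coefficients describe association in the sample and should not be read as causal estimates.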
