# COGS 108 - Project Proposal

## Authors

**Eric Badilla:** Conceptualization, Data curation, Methodology, Writing – original draft 

**Nishka Vaghela:** Background research, Writing – original draft 

**Niharika Sapre:** Conceptualization, Writing – original draft 

**Renee Li:** Methodology, Writing – original draft 

**Jenny Fu:** Project administration, Writing – original draft

## Research Question

How does the number of new Computer Science graduates compare to the number of entry-level software engineering job openings in the US from 2020 to 2025? Specifically, is the gap between new graduates and available beginner jobs getting wider each year?

## Background and Prior Work

Computer science has been one of the fastest-growing academic disciplines in the U.S., driven by the perception of strong job prospects in software engineering and related roles. At the same time, structural changes in the technology labor market, such as economic fluctuations, layoffs, and the growing role of automation, have reshaped demand for early-career engineers.

Educational Output of CS Graduates:

According to the National Student Clearinghouse Research Center, the number of U.S. students earning bachelor’s degrees in computer and information sciences more than doubled over the past decade, reaching over 112,000 degrees (https://www.studentclearinghouse.org/nscblog/computer-science-has-highest-increase-in-bachelors-earners/).

Moreover, data from the National Center for Education Statistics show that CS degree completions contributed significantly to the expanding pool of STEM graduates (https://nces.ed.gov/programs/coe/indicator/cta).

While no definitive dataset currently tracks both sides of the supply-demand equation (graduates and job openings) in exactly the same time series from 2020 to 2025, several prior studies offer useful context:

- National Student Clearinghouse Research Center trend analysis shows a rapid increase in CS degree earners through 2022–23, which suggests a growing supply of potential software engineers.
- Lightcast and hack training reports provide data on entry-level software engineer job postings, with evidence of growth in listings for early-career roles in the 2023-24 period.
- Federal Reserve/St. Louis Fed labor posting indices track macro trends in job postings on platforms like Indeed, indicating that software development postings have declined relative to 2020.
- Industry hiring analyses discuss how layoffs and economic pressures have led tech employers, in some cases, to prioritize experienced talent over entry-level hires, a factor that shapes demand relative to graduate supply.
- Educational statistics from NCES and NSF provide multi-year degree award trends in STEM fields, showing sustained increases that feed into the pool of graduates seeking software engineering roles.

Taken together, these sources indicate that CS degree trends point to rising supply, while labor market and hiring analyses show fluctuating demand.

The Computer Science labor market has been shaped by a growing number of graduates alongside fluctuating numbers of entry-level software jobs. The growth in graduates reflects long-standing expectations that software engineering offers strong and stable employment opportunities. However, prior work suggests that labor market demand has not increased at the same pace.


## Hypothesis


We predict that the gap between new graduates and available entry-level software engineering jobs has widened year over year from 2020 to the present.

As described in the background, we expect this because computer science bachelor's degree enrollment and completions have risen significantly. At the same time, hiring growth has been more volatile because of macroeconomic conditions, the rapid rise of AI contributing to tech layoffs, and the resulting shifts in company hiring strategies.
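
One way to make "the gap is widening" concrete is to define the gap as graduates minus openings for each year and check the sign of its year-over-year change. The following is only a minimal sketch: the numbers are made up for illustration, and the column names (`new_graduates`, `entry_level_openings`) are placeholders rather than fields from our actual datasets.

```python
import pandas as pd

# Hypothetical yearly totals, invented purely to illustrate the calculation.
yearly = pd.DataFrame(
    {
        "year": [2020, 2021, 2022, 2023],
        "new_graduates": [100, 110, 120, 130],
        "entry_level_openings": [90, 95, 85, 80],
    }
).set_index("year")

# Gap = supply minus demand; a positive value means more graduates than openings.
yearly["gap"] = yearly["new_graduates"] - yearly["entry_level_openings"]

# Year-over-year change in the gap; the first year has no prior year, so it is NaN.
yearly["gap_change"] = yearly["gap"].diff()

# Under the hypothesis, every observed year-over-year change would be positive.
gap_widens_every_year = bool((yearly["gap_change"].dropna() > 0).all())

print(yearly)
print("Gap widens every year:", gap_widens_every_year)
```

A stricter or looser test (for example, a positive linear trend in the gap rather than an increase in every single year) may be more appropriate once the real data are in hand.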

## Data

1. Ideal Dataset
    1. What variables?
       We would want a table with columns for:
       - Date (Month/Year)
       - New_Graduates (exact count of students finishing a CS degree in that specific month)
       - Entry_Level_Openings (count of jobs for beginners)
       - Location (State/City)
    2. How many observations are needed?
       We would need monthly data points for every state in the US over the last 5 years. That would be roughly 50 states × 60 months = 3,000 observations.
    3. Who/what/how would these data be collected?
       Ideally, this wouldn't be manual. It would be collected via a real-time API that aggregates registrar data from every accredited US university and simultaneously scrapes all major job boards (LinkedIn, Indeed).
    4. How would these data be stored/organized?
       We would store this in a single, clean CSV file or SQL database, indexed by date, so we could easily graph "Supply" vs. "Demand."
2. Real Datasets
Dataset 1: Supply (Graduates) Data.gov
    1. Location/Access: We started by looking for official government data (data.gov), as suggested in the course resource list. This led us to the NCES IPEDS website (https://nces.ed.gov/ipeds/use-the-data), which is the Department of Education's public data portal. No permission is required; we can navigate to the Survey Data tab and download the "Completions" survey CSV files for 2019-2023 directly from the data center.
    2. Important Variables:
        - CIPCODE (defined in the documentation as the "Classification of Instructional Program" code): we will select the 2-digit series 11, which the documentation labels "Computer and Information Sciences", to identify Computer Science degrees.
        - CTOTALT (identified in the Data Dictionary as the "Grand Total" of awards): the total degrees awarded, which we will sum to count the number of new graduates.

    These are the variables we will use to count the "Supply" of new graduates.
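
The supply count described above could be sketched roughly as follows. This is an illustration on a tiny made-up table, not our final wrangling code. `CIPCODE` and `CTOTALT` are the columns named above; the `UNITID` and `AWLEVEL` columns, the `"11."` prefix match, and the use of award level code 5 for bachelor's degrees are assumptions based on our reading of the IPEDS documentation and should be verified against the data dictionary for each survey year. The helper name `count_cs_graduates` is our own.

```python
import pandas as pd

# Tiny made-up sample shaped like an IPEDS Completions file (real files have many more columns).
completions = pd.DataFrame(
    {
        "UNITID": [1001, 1001, 1002, 1002, 1003],
        "CIPCODE": ["11.0701", "52.0201", "11.0101", "11.0701", "99"],
        "AWLEVEL": [5, 5, 5, 7, 5],
        "CTOTALT": [120, 300, 80, 40, 900],
    }
)


def count_cs_graduates(df, award_level=5):
    """Sum CTOTALT over rows in CIP series 11 at one award level.

    award_level=5 is assumed to mean bachelor's degrees; check the data dictionary.
    """
    # Normalize codes to stripped strings so values like " 11.0701" still match.
    cip = df["CIPCODE"].astype(str).str.strip()
    is_cs = cip.str.startswith("11.")
    is_level = df["AWLEVEL"] == award_level
    return int(df.loc[is_cs & is_level, "CTOTALT"].sum())


total_cs_bachelors = count_cs_graduates(completions)
print(total_cs_bachelors)  # 200 for the sample rows above
```

In practice, each year's Completions CSV would be loaded with `pd.read_csv` and passed through the same function to build one supply total per year.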

Dataset 2: Demand (Job Postings) Kaggle
1. Location/Access: We also checked Kaggle, which was listed in the course's recommended resource list. We searched for "Tech Jobs" under the dataset filter and found the "100k US Tech Jobs (Winter 2024)" dataset (https://www.kaggle.com/datasets/christopherkverne/100k-us-tech-jobs-winter-2024). We can download it directly as a CSV file with a free account.
2. Important variables:
    - Title: a column we will filter for "Software Engineer".
    - Description: a column whose text we will scan for keywords like "Entry Level" or "0-2 years experience".

These are the variables we will use to measure the "Demand" side of our equation.
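
The title filter and keyword scan described above could look something like the sketch below. The sample rows are invented, the helper name `flag_entry_level_swe` is our own, and the regular expression is only a first guess at what counts as "entry level"; we assume the real CSV exposes `Title` and `Description` columns as named above.

```python
import re

import pandas as pd

# Made-up postings used only to demonstrate the filtering logic.
postings = pd.DataFrame(
    {
        "Title": [
            "Software Engineer I",
            "Senior Software Engineer",
            "Data Analyst",
            "Junior Software Engineer",
        ],
        "Description": [
            "Entry level role for new grads.",
            "Requires 8+ years of experience.",
            "Entry-level position, 0-2 years experience.",
            "Looking for 0-2 years experience in Python.",
        ],
    }
)

# Matches "entry level", "entry-level", "entrylevel", or "0-2 years" (spacing around the dash is flexible).
ENTRY_LEVEL_PATTERN = r"entry[\s-]?level|0\s*-\s*2\s+years"


def flag_entry_level_swe(df):
    """Return a boolean Series: True where a posting looks like an entry-level SWE job."""
    is_swe = df["Title"].str.contains("software engineer", case=False, na=False)
    is_entry = df["Description"].str.contains(
        ENTRY_LEVEL_PATTERN, flags=re.IGNORECASE, na=False
    )
    return is_swe & is_entry


entry_level_mask = flag_entry_level_swe(postings)
entry_level_count = int(entry_level_mask.sum())
print(postings.loc[entry_level_mask, "Title"].tolist())
print(entry_level_count)  # 2 for the sample rows above
```

Keyword matching like this will miss postings that describe seniority in other words and may include postings that mention "entry level" only in passing, so the pattern would need to be checked against a hand-labeled sample.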


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?


 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> We are concerned that the data itself may be biased or incomplete. Not all entry-level software engineering jobs are posted online, since some people get jobs through referrals, internal hiring, or campus recruiting that does not appear on job boards. Because of this, it is difficult to collect data that fully represents the true number of available jobs, which may affect the accuracy of the comparison.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> The data used in this project is publicly available and does not include any private information. It is stored locally during the analysis, and basic care is taken to avoid accidentally changing or sharing the files. Because the data is low risk, no special security measures are needed.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> This project looks at overall numbers and trends, so it does not include personal experiences from recent graduates or employers. Because of this, the results may not fully explain why certain trends happen, and they should be understood as showing general patterns rather than individual experiences.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> The datasets used in this analysis may introduce bias because they rely on broad categories and assumptions. Not all Computer Science graduates are looking for software engineering jobs, and not all jobs labeled as “entry-level” are actually accessible to new graduates. These mismatches can affect the comparison between supply and demand, so the results should be interpreted as approximate trends rather than exact measurements.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> The results are presented in a way that avoids oversimplifying or exaggerating the data. Instead of focusing on single-year changes, the analysis looks at overall trends across multiple years to reduce the impact of short-term fluctuations. All figures and summaries are explained in context so readers understand what the data shows and what it does not show.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> The steps of the analysis, including data sources and processing methods, are documented so that the work can be checked or repeated by others if needed, which also helps maintain transparency and accountability.


### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> The metrics used, such as the number of graduates and the number of entry-level job postings, were chosen because they directly relate to the research question. However, these metrics do not capture job quality or underemployment, which is a limitation.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> The final results clearly explain the limitations of the analysis, including data gaps and simplifying assumptions, so readers understand what the results do and do not show.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> This analysis is based on data from a specific time period. If it were updated in the future, the data and methods would need to be checked again to make sure they still reflect current job market conditions.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> There is a possibility that the results could be misunderstood or used to discourage students from majoring in Computer Science. To reduce this risk, the analysis emphasizes that it describes overall trends in the job market and does not predict individual outcomes or career success.


## Team Expectations 

* *We will communicate primarily through messages for quick updates and questions, and use Google Docs for longer-form work and progress tracking. We will meet once per week (virtually) and schedule additional meetings as needed*
* *We agree to communicate in a blunt but polite manner. Team members should feel comfortable expressing disagreement or concerns respectfully and constructively*
* *For major project decisions, we will aim for consensus and fall back on a majority vote if consensus cannot be reached. For time-sensitive decisions, the member responsible may make a temporary decision in the moment and must inform the group afterward.*
* *Tasks will be divided based on each member's preferences. We will track tasks and progress using a shared document, so responsibilities and deadlines are visible to everyone.*
* *We will follow the agreed-upon project timeline and update it as needed. Team members are expected to complete assigned tasks by internal deadlines so the group has time to double-check the work.*
* *If a team member is struggling to complete a task, they should notify the group as early as possible, so we can work together to redistribute work temporarily or provide support. If a member consistently misses deadlines without communication, the group will address the issue directly and follow course guidelines if needed.*
* *All team members are expected to contribute equally in effort, communicate regularly, and respect each other’s time and commitments. We recognize that everyone has different strengths, schedules, and working styles, and we will support one another to ensure the project progresses smoothly.*


## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/4  |  1 PM | Review COGS 108 project expectations; brainstorm project ideas related to Big Tech hiring  | Determine best form of communication; Discuss and decide on final research question; discuss hypothesis; begin background research | 
| 2/11  |  10 AM |  Do background research on CS graduate trends and Big Tech hiring patterns | Identify potential datasets (education + job postings) and ethics; draft project proposal | 
| 2/18  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/25  | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 3/4  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/11  | 12 PM  | Complete analysis; Draft results/conclusion/discussion| Discuss/edit full project |
| 3/18  | Before 11:59 PM  | Double-check that assigned parts are completed and polished | Turn in Final Project & Group Project Surveys |
