# COGS 108 - Project Proposal

## Authors

- Ahmon Embaye: Conceptualization, Data curation, Methodology, Analysis, Background research
- Eugene Bertrand: Analysis, Background research
- George Robin: Analysis, Background research, Project administration, Writing - review & editing
- Hemendra Ande: Analysis, Background research, Writing - original draft
- Harvir Ghuman: Analysis, Background research, Writing - original draft

## Research Question

Is a higher level of formal education (Master's/PhD) or more years of professional experience associated with a statistically significant reduction in AI automation risk, and does that association differ between technical and service job sectors?

Specifically, we aim to determine if the protective nature of a college degree is being superseded by practical experience as AI begins to handle cognitive tasks traditionally reserved for highly educated professionals.

## Background and Prior Work

The rapid integration of Generative AI into the global workforce has sparked an intense debate regarding which demographics are most susceptible to job displacement. Historically, automation primarily replaced manual labor; however, modern AI is increasingly capable of performing cognitive tasks that previously required a college degree. This shift raises critical questions for students and professionals about whether advanced formal education or extensive professional experience serves as a more effective shield against automation.

Existing research using the "AI Exposure Index" has found that roles requiring high levels of literacy and analytical writing, tasks often associated with Master's and PhD holders, actually have some of the highest exposure scores. This suggests that high-level formal education may no longer guarantee career safety. In contrast, some evidence suggests that mid-to-late career professionals may benefit from "complementarity," where AI functions as a tool that enhances productivity rather than replacing expert human labor.

This project builds upon previous data science explorations found on platforms like Kaggle and GitHub that utilize occupational data to model AI impact. We will analyze these trends to see if the data supports the traditional view that education reduces risk, or if we are entering an era where practical experience provides a stronger defense in specific labor sectors.

1. Kaggle - AI Impact on Jobs 2030 dataset exploring occupational risk projections.
2. OECD AI and Employment Surveys - Public indicators on AI workforce skills and automation risk by occupation.
3. O*NET/BLS - Task-based modeling for employment projections.

## Hypothesis

We hypothesize that a PhD will provide a greater reduction in automation risk than 10 years of experience in technical sectors, but experience will be more protective in service sectors. This is based on the idea that technical sectors may still value specialized, high-level theoretical knowledge, while service sectors rely more on practical, interpersonal, and experiential skills that are harder for AI to replicate.
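
One way to make this hypothesis testable, sketched here under our own assumptions, is a multiple linear regression with sector interaction terms. The symbols below are illustrative and are not taken from either dataset: $\text{Risk}_i$ is the automation risk score for job $i$, $\text{PhD}_i$ is an indicator for a PhD-level education requirement, $\text{Exp}_i$ is years of experience, and $\text{Service}_i$ is 1 for service-sector jobs and 0 for technical-sector jobs.

$$
\text{Risk}_i = \beta_0 + \beta_1\,\text{PhD}_i + \beta_2\,\text{Exp}_i + \beta_3\,\text{Service}_i + \beta_4\,(\text{PhD}_i \times \text{Service}_i) + \beta_5\,(\text{Exp}_i \times \text{Service}_i) + \varepsilon_i
$$

Under this setup, the hypothesis would correspond to $\beta_1 < 10\,\beta_2$ in technical sectors (a PhD lowers risk more than 10 years of experience) and $10\,(\beta_2 + \beta_5) < \beta_1 + \beta_4$ in service sectors (10 years of experience lowers risk more than a PhD).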

## Data

### Ideal Dataset
The ideal dataset would be a longitudinal panel study tracking individual career trajectories alongside AI adoption metrics. It would include variables such as Education Level, Years of Experience, AI Exposure Index, Job Sector, and Skill Profile. We would require at least 10,000 to 20,000 observations across diverse job titles to ensure sufficient statistical power for multiple linear regression. Ideally, this data would be collected via Bureau of Labor Statistics (BLS) projections and task-based modeling from organizations like O*NET, and stored in tidy format in a relational database or Parquet file.

### Real Datasets
**1. AI Impact on Jobs 2030 (Kaggle)**
- This dataset is publicly available on Kaggle and designed for 2030 projections.
- It contains over 500 job roles with realistic projections based on economic trends.
- Key variables include Job_Title, AI_Exposure_Index, Automation_Probability_2030, Education_Level, and Years_Experience.
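
As a first sanity check after downloading the file, we could confirm that the variables listed above are actually present and count blank entries. The sketch below uses only the Python standard library. The column names come from the description above and may not match the real file exactly, `check_columns` is a hypothetical helper, and the two-row sample is made up purely for illustration.

```python
import csv
import io

# Column names as described in this proposal; the real file may differ.
EXPECTED_COLUMNS = [
    "Job_Title",
    "AI_Exposure_Index",
    "Automation_Probability_2030",
    "Education_Level",
    "Years_Experience",
]


def check_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return (missing_columns, blank_counts) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in expected if col not in header]
    present = [col for col in expected if col in header]
    blank_counts = {col: 0 for col in present}
    for row in reader:
        for col in present:
            # Treat empty or whitespace-only cells as missing values.
            if (row[col] or "").strip() == "":
                blank_counts[col] += 1
    return missing, blank_counts


# Made-up sample standing in for the downloaded CSV (not real data).
sample = io.StringIO(
    "Job_Title,AI_Exposure_Index,Automation_Probability_2030,Education_Level\n"
    "Data Analyst,0.71,0.48,Master's\n"
    "Barista,,0.62,High School\n"
)
missing, blanks = check_columns(sample)
print("Missing columns:", missing)
print("Blank cells per column:", blanks)
```

With the real file, the same function could be called on `open(path, newline="")`, where `path` is wherever the downloaded CSV is saved.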

**2. OECD AI and Employment Surveys**
- The OECD provides live data on AI jobs and skills across different countries.
- While microdata may require academic requests, public CSV exports for the AI workforce are accessible.
- Important variables include Industry_Sector, Skill_Demand_Intensity, Tertiary_Education_Share, and Automation_Risk_by_Occupation.

## Ethics 

### Data Collection and Bias
Our team has carefully evaluated the potential for collection bias within the Kaggle "AI Impact on Jobs 2030" dataset. Because the risk scores are based on future projections, they are subject to the subjective judgment of the researchers and experts who originally labeled the data. This creates a risk of bias, where high-level formal education (like a Master's or PhD) may be viewed as more irreplaceable simply because the experts themselves possess those credentials. We will mitigate this by comparing these projections against historical trends of automation to see if the protective nature of education remains consistent across different data models.

### Data Retention and Auditability
In accordance with the project's lifecycle, we have established a data retention plan where all processed datasets and intermediate analysis files will be deleted upon the final grading of the project. To maintain auditability, we will document our entire methodology, from data cleaning to the final statistical tests, within our Jupyter Notebook. This ensures that our process is transparent and that any biases introduced during the wrangling phase can be identified and corrected by others in the research community.

### Honest Representation and Limitations
To ensure the honest representation of our findings, our analysis will include measures of variance and uncertainty rather than just average risk scores. AI development is notoriously non-linear, as seen with the sudden emergence of Large Language Models, and a "2030 projection" is a snapshot of current expectations that could change rapidly. We will communicate these limitations clearly to prevent our model from being used as a definitive tool for high-stakes career or policy decisions.
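
One way we might report uncertainty alongside an average risk score is a percentile bootstrap confidence interval. The sketch below is a minimal standard-library version, not a final method: `bootstrap_mean_ci` is a hypothetical helper, and the risk scores are invented placeholder values rather than numbers from either dataset.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (sample_mean, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = n_resamples - lo_idx - 1
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


# Invented placeholder risk scores for one sector (not real data).
risk_scores = [0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.30]
mean, lower, upper = bootstrap_mean_ci(risk_scores)
print(f"mean risk = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval next to the mean makes it clearer when two groups' average risk scores are too close to distinguish.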

### Proxy Discrimination and Social Impact
A central ethical concern of this study is the risk of proxy discrimination. Formal education levels and years of professional experience are often highly correlated with socioeconomic status, race, and geographic access to resources. By investigating whether a PhD provides a "statistically significant reduction" in automation risk, we acknowledge that we are studying a variable that is not accessible to everyone. We must be careful not to frame our results in a way that suggests lower-educated workers are at fault for their automation risk, but rather highlight how systemic barriers to education may create a disadvantage in the future labor market.

## Team Expectations 

* We will communicate through Discord or iMessage, and expect a response within 6 hours.
* We will meet once a week through a Discord call.
* No strict expectations around tone; just be respectful. Informal or joking comments are okay.
* Decisions will be made by majority vote among the group members who are present.
* For tasks, everyone will do a bit of everything.
* Make sure to update the group when you make a change.
* Be sure to help group members struggling with a task.

## Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/8/26 | 8PM | Download Kaggle and OECD datasets; set up shared GitHub Repo. | Review variables like Education_Level and AI_Exposure_Index; assign initial data cleaning tasks. |
| 2/15/26 | 8PM | Complete data cleaning (handling missing values and standardizing Job_Titles). | Look at initial Summary Statistics; do certain sectors already show higher risk in the raw data? |
| 2/22/26 | 8PM | Run the first draft of the multiple linear regression model. | Evaluate the p-values for PhD vs. 10 years of experience; discuss whether the "Technical vs. Service" split is appearing. |
| 3/1/26 | 8PM | Create final visualizations (Correlation heatmaps, Regression plots). | Interpret findings; check for Proxy Discrimination or biases in the results before the project check-in. |
| 3/8/26 | 8PM | Each member writes a summary of the findings for their assigned variables. | Discuss/edit Analysis; Complete project check-in. |
| 3/15/26 | 8PM | Combine the Introduction, Methodology, Ethics, and Analysis into one master document. | Discuss/edit full project. |
| 3/20/26 | 8PM | Finish project. | Turn in Final Project & Group Project Surveys. |