# COGS 108 - Project Proposal

## Authors

- Ahmon Embaye: Conceptualization, Data curation, Methodology, Analysis, Background research
- Eugene Bertrand: Analysis, Background research
- George Robin: Analysis, Background research, Project administration, Writing - review & editing
- Hemendra Ande: Analysis, Background research, Writing - original draft
- Harvir Ghuman: Analysis, Background research, Writing - original draft

## Research Question

Is a higher level of formal education (Master's/PhD) or more years of professional experience associated with a statistically significant reduction in AI automation risk, and does this relationship differ between technical and service job sectors?

Specifically, we aim to determine whether the protective effect of a college degree is being superseded by practical experience as AI begins to handle cognitive tasks traditionally reserved for highly educated professionals.

## Background and Prior Work

The rapid integration of Generative AI into the global workforce has sparked an intense debate regarding which demographics are most susceptible to job displacement. Historically, automation primarily replaced manual labor; however, modern AI is increasingly capable of performing cognitive tasks that previously required a college degree. This shift raises critical questions for students and professionals about whether advanced formal education or extensive professional experience serves as a more effective shield against automation.

Existing research using the "AI Exposure Index" has found that roles requiring high levels of literacy and analytical writing, tasks often associated with Master's and PhD holders, have some of the highest exposure scores. This suggests that high-level formal education may no longer guarantee career safety. In contrast, some evidence suggests that mid-to-late career professionals may benefit from "complementarity," where AI functions as a tool that enhances productivity rather than replacing expert human labor.

This project builds upon previous data science explorations found on platforms like Kaggle and GitHub that utilize occupational data to model AI impact. We will analyze these trends to see if the data supports the traditional view that education reduces risk, or if we are entering an era where practical experience provides a stronger defense in specific labor sectors.

1. Kaggle - AI Impact on Jobs 2030 dataset exploring occupational risk projections.
2. OECD AI and Employment Surveys - Public indicators on AI workforce skills and automation risk by occupation.
3. O*NET/BLS - Task-based modeling for employment projections.

## Hypothesis

We hypothesize that a PhD will provide a greater reduction in automation risk than 10 years of experience in technical sectors, but experience will be more protective in service sectors. This is based on the idea that technical sectors may still value specialized, high-level theoretical knowledge, while service sectors rely more on practical, interpersonal, and experiential skills that are harder for AI to replicate.
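
One way to make this hypothesis testable, offered as a sketch rather than a final specification, is a linear model with sector interaction terms. The symbols are assumptions introduced here for illustration: $R_i$ is the automation risk of job $i$, $\text{PhD}_i$ is 1 if the role is associated with a PhD and 0 otherwise, $\text{Exp}_i$ is years of experience, and $S_i$ is 1 for service-sector jobs and 0 for technical-sector jobs.

$$
R_i = \beta_0 + \beta_1 \text{PhD}_i + \beta_2 \text{Exp}_i + \beta_3 S_i + \beta_4 (\text{PhD}_i \times S_i) + \beta_5 (\text{Exp}_i \times S_i) + \varepsilon_i
$$

Under this specification, a more negative effect means a larger reduction in risk. The hypothesis would then correspond to $\beta_1 < 10\,\beta_2$ in technical sectors ($S_i = 0$) and $10\,(\beta_2 + \beta_5) < \beta_1 + \beta_4$ in service sectors ($S_i = 1$).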

## Data

### Ideal Dataset
The ideal dataset would be a longitudinal panel study tracking individual career trajectories alongside AI adoption metrics. It would include variables such as Education Level, Years of Experience, AI Exposure Index, Job Sector, and Skill Profile. We would require at least 10,000 to 20,000 observations across diverse job titles to ensure sufficient statistical power for multiple linear regression. Ideally, this data would be drawn from Bureau of Labor Statistics (BLS) projections and task-based modeling from organizations like O*NET, and stored in a tidy relational database or Parquet file.

### Real Datasets
**1. AI Impact on Jobs 2030 (Kaggle)**
- This dataset is publicly available on Kaggle and provides job-level projections for 2030. It contains over 500 job roles, with projections based on economic trends. Key variables include Job_Title, AI_Exposure_Index, Automation_Probability_2030, Education_Level, and Years_Experience.

**2. OECD AI and Employment Surveys**
- The OECD provides live data on AI jobs and skills across different countries. While microdata may require academic requests, public CSV exports for the AI workforce are accessible. Important variables include Industry_Sector, Skill_Demand_Intensity, Tertiary_Education_Share, and Automation_Risk_by_Occupation.
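
As a first cleaning step for the Kaggle data, a minimal sketch along the following lines could confirm that the expected columns are present and turn education into an ordered numeric code. The column names come from the dataset description above; the education labels in `EDUCATION_ORDER`, the helper `prepare_jobs`, and the three sample rows are hypothetical and would need to be adjusted to match the actual file.

```python
import pandas as pd

# Column names as listed in the Kaggle dataset description above.
REQUIRED_COLUMNS = [
    "Job_Title",
    "AI_Exposure_Index",
    "Automation_Probability_2030",
    "Education_Level",
    "Years_Experience",
]

# Hypothetical category labels; the real file may spell these differently.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def prepare_jobs(df: pd.DataFrame) -> pd.DataFrame:
    """Validate expected columns and add an ordinal education code."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    out = df.copy()
    education = pd.Categorical(
        out["Education_Level"], categories=EDUCATION_ORDER, ordered=True
    )
    # Unrecognized labels get code -1, so they are easy to spot and review.
    out["Education_Code"] = education.codes
    return out


# Tiny made-up sample for illustration only; not real data.
sample = pd.DataFrame(
    {
        "Job_Title": ["Data Analyst", "Barista", "Research Scientist"],
        "AI_Exposure_Index": [0.7, 0.2, 0.6],
        "Automation_Probability_2030": [0.5, 0.3, 0.2],
        "Education_Level": ["Bachelor's", "High School", "PhD"],
        "Years_Experience": [3, 5, 10],
    }
)
prepared = prepare_jobs(sample)
print(prepared[["Job_Title", "Education_Code"]])
```

Note that the variables listed for the Kaggle dataset do not include a sector label, so the technical versus service split would likely need to be derived separately, for example from Job_Title or from the OECD Industry_Sector variable.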

## Ethics 

### A. Data Collection
 - [X] **A.2 Collection bias**: We have evaluated the Kaggle dataset for bias, noting that its risk scores are future projections that depend on expert judgment. We will mitigate this by comparing projections against historical trends.

### B. Data Storage
 - [X] **B.3 Data retention plan**: All processed datasets and intermediate analysis files will be deleted upon the final grading of the project.

### C. Analysis
 - [X] **C.3 Honest representation**: Analysis will report measures of variance and uncertainty, since AI development is non-linear and point projections alone could overstate confidence.
 - [X] **C.5 Auditability**: We will document the entire methodology from cleaning to statistical testing in a Jupyter Notebook to ensure transparency.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Education and experience often correlate with race and socioeconomic status. We will acknowledge systemic barriers and avoid framing results as the fault of workers with less formal education.
 - [X] **D.5 Communicate limitations**: We will clearly state that these 2030 projections are snapshots and should not be used for high-stakes policy decisions.

## Team Expectations 

* Communicate through Discord or iMessage with a response expected within 6 hours.
* Meet once a week via Discord call.
* Tone should be respectful; group decisions will be made by majority vote of present members.
* Everyone will contribute to all tasks, updating the group on changes and helping those who struggle.

## Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/8/26 | 8PM | Download Kaggle/OECD datasets; set up GitHub. | Review variables; assign cleaning tasks. |
| 2/15/26 | 8PM | Complete data cleaning. | Look at Summary Statistics and sector risk. |
| 2/22/26 | 8PM | Run first draft of multiple linear regression. | Evaluate p-values for PhD vs. Experience. |
| 3/1/26 | 8PM | Create final visualizations (Heatmaps, etc.). | Interpret findings and check for bias. |
| 3/8/26 | 8PM | Write summaries of findings for variables. | Discuss/edit Analysis; Project check-in. |
| 3/15/26 | 8PM | Combine all sections into master document. | Discuss/edit full project. |
| 3/20/26 | 8PM | Finish project. | Turn in Final Project & Surveys. |
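
For the regression step scheduled for 2/22, a prototype could look like the sketch below. It fits ordinary least squares with sector interaction terms using NumPy only. Everything here is an assumption for illustration: the data is synthetic, the indicator variables `phd` and `service` are hypothetical encodings, and the coefficients in `true_beta` were chosen to mirror our hypothesis, so the output says nothing about the real datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic predictors (made up for illustration, not real data).
phd = rng.integers(0, 2, size=n)          # 1 = holds a PhD
experience = rng.uniform(0, 30, size=n)   # years of experience
service = rng.integers(0, 2, size=n)      # 1 = service sector, 0 = technical

# Assumed "true" coefficients, used only to generate the toy data.
true_beta = np.array([0.60, -0.15, -0.005, 0.05, 0.10, -0.010])

# Design matrix: intercept, main effects, and sector interactions.
X = np.column_stack([
    np.ones(n),
    phd,
    experience,
    service,
    phd * service,
    experience * service,
])
risk = X @ true_beta + rng.normal(0, 0.05, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, risk, rcond=None)

# Standard errors from the residual variance, then t-statistics.
residuals = risk - X @ beta_hat
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / std_err

names = ["intercept", "phd", "experience", "service",
         "phd:service", "experience:service"]
for name, b, t in zip(names, beta_hat, t_stats):
    print(f"{name:>20s}  coef={b:+.4f}  t={t:+.2f}")

# Compare a PhD with 10 years of experience within each sector.
phd_technical = beta_hat[1]
exp10_technical = 10 * beta_hat[2]
phd_service = beta_hat[1] + beta_hat[4]
exp10_service = 10 * (beta_hat[2] + beta_hat[5])
print(f"technical: PhD {phd_technical:+.3f} vs 10 yrs {exp10_technical:+.3f}")
print(f"service:   PhD {phd_service:+.3f} vs 10 yrs {exp10_service:+.3f}")
```

This version reports t-statistics rather than p-values to avoid extra dependencies. For the actual analysis, a statistics library such as statsmodels would report p-values and confidence intervals directly.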