# AI_Impact_on_Jobs_2030 Dataset Exploration

## 1. Dataset Introduction

This dataset models how different jobs may be affected by AI-driven automation.  
It includes variables such as salary, years of experience, education level, AI exposure index, automation probability, risk category, and ten skill-score columns.  

## 2. Conclusion

After evaluating this dataset, I decided **not to select** the AI Jobs dataset for my final project.  
Although the topic is interesting, the numerical distributions appear highly symmetric and artificial, several value combinations are unrealistic, and many “skill” columns lack clear definitions.  

## 3. Setup and Initial Inspection

In [1]:
import pandas as pd

df = pd.read_csv("datasets/AI_Impact_on_Jobs_2030.csv")
df.head()

Unnamed: 0,Job_Title,Average_Salary,Years_Experience,Education_Level,AI_Exposure_Index,Tech_Growth_Factor,Automation_Probability_2030,Risk_Category,Skill_1,Skill_2,Skill_3,Skill_4,Skill_5,Skill_6,Skill_7,Skill_8,Skill_9,Skill_10
0,Security Guard,45795,28,Master's,0.18,1.28,0.85,High,0.45,0.1,0.46,0.33,0.14,0.65,0.06,0.72,0.94,0.0
1,Research Scientist,133355,20,PhD,0.62,1.11,0.05,Low,0.02,0.52,0.4,0.05,0.97,0.23,0.09,0.62,0.38,0.98
2,Construction Worker,146216,2,High School,0.86,1.18,0.81,High,0.01,0.94,0.56,0.39,0.02,0.23,0.24,0.68,0.61,0.83
3,Software Engineer,136530,13,PhD,0.39,0.68,0.6,Medium,0.43,0.21,0.57,0.03,0.84,0.45,0.4,0.93,0.73,0.33
4,Financial Analyst,70397,22,High School,0.52,1.46,0.64,Medium,0.75,0.54,0.59,0.97,0.61,0.28,0.3,0.17,0.02,0.42


### Summary of Initial Observations from head()

From the first few rows, this dataset appears very clean and highly structured.
It contains columns such as job title, salary, years of experience, education level, risk category, AI exposure index, tech growth factor, automation probability, and several skill score fields.

### Core analytical columns:
Average_Salary, Years_Experience, Education_Level,
AI_Exposure_Index, Automation_Probability_2030.

### Strange or suspicious elements (visible from head):

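- Some rows combine attributes implausibly (e.g., a Construction Worker with 2 years of experience, a high-school education, and a salary of 146,216).
- Education levels appear mismatched to the role (e.g., a Security Guard with a Master's degree, a Financial Analyst with only a high-school education).
- Skill scores are all rounded to two decimals and span almost the whole 0 to 1 range even within five rows, which hints at synthetic generation.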
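
A check like this can also be done in code rather than by eye. The following is a minimal sketch, assuming only the five rows printed by `head()` (re-typed inline so the snippet runs on its own). The rule and the `SALARY_CUTOFF` value are hypothetical choices made for illustration, not properties of the dataset.

```python
import io
import pandas as pd

# First five rows from df.head(), key columns only.
sample_csv = """Job_Title,Average_Salary,Years_Experience,Education_Level
Security Guard,45795,28,Master's
Research Scientist,133355,20,PhD
Construction Worker,146216,2,High School
Software Engineer,136530,13,PhD
Financial Analyst,70397,22,High School
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Hypothetical rule: a six-figure salary combined with under 5 years of
# experience and a high-school education is treated as implausible.
SALARY_CUTOFF = 100_000  # assumption, for illustration only
suspicious = sample[
    (sample["Average_Salary"] > SALARY_CUTOFF)
    & (sample["Years_Experience"] < 5)
    & (sample["Education_Level"] == "High School")
]
print(suspicious["Job_Title"].tolist())  # ['Construction Worker']
```

On the full data, the same filter could be applied to `df` to count how many of the 3000 rows match a given rule.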

### Initial Potential Analysis Directions
Even with its limitations, possible analysis directions might include:
1. Relationship between years of experience and salary.  
2. Factors associated with higher automation probability.  
3. Whether certain skill scores correlate with salary or automation probability.


## 4. Basic Structure

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Job_Title                    3000 non-null   object 
 1   Average_Salary               3000 non-null   int64  
 2   Years_Experience             3000 non-null   int64  
 3   Education_Level              3000 non-null   object 
 4   AI_Exposure_Index            3000 non-null   float64
 5   Tech_Growth_Factor           3000 non-null   float64
 6   Automation_Probability_2030  3000 non-null   float64
 7   Risk_Category                3000 non-null   object 
 8   Skill_1                      3000 non-null   float64
 9   Skill_2                      3000 non-null   float64
 10  Skill_3                      3000 non-null   float64
 11  Skill_4                      3000 non-null   float64
 12  Skill_5                      3000 non-null   float64
 13  Skill_6           

### Summary from info()

- Shape: 3000 rows × 18 columns  
- Fifteen columns are numerical (`int64` or `float64`); `Job_Title`, `Education_Level`, and `Risk_Category` are stored as `object` (text).  
- No missing values in the entire dataset.  
- The absence of missing data across 3000 records is unusual for real-world job data, further suggesting synthetic generation.  
- Data types are consistent; the numerical columns need no type conversion, but the three text columns must be encoded before correlation or modeling.
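
The points above can be verified directly instead of read off the `info()` printout. This is a minimal sketch on a three-row sample taken from the `head()` output (a subset of columns, re-typed inline so it runs on its own); on the real data the same calls would be made on `df`.

```python
import io
import pandas as pd

# Three rows and six columns copied from the head() output above.
sample_csv = """Job_Title,Average_Salary,Years_Experience,Education_Level,Automation_Probability_2030,Risk_Category
Security Guard,45795,28,Master's,0.85,High
Research Scientist,133355,20,PhD,0.05,Low
Construction Worker,146216,2,High School,0.81,High
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Shape, split of text vs. numeric columns, and total missing values.
n_rows, n_cols = sample.shape
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
missing_total = int(sample.isna().sum().sum())

print(f"{n_rows} rows x {n_cols} columns")
print("text columns:   ", text_cols)
print("numeric columns:", numeric_cols)
print("missing values: ", missing_total)
```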

In [3]:
df.describe()

Unnamed: 0,Average_Salary,Years_Experience,AI_Exposure_Index,Tech_Growth_Factor,Automation_Probability_2030,Skill_1,Skill_2,Skill_3,Skill_4,Skill_5,Skill_6,Skill_7,Skill_8,Skill_9,Skill_10
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,89372.279,14.677667,0.501283,0.995343,0.501503,0.496973,0.497233,0.499313,0.503667,0.49027,0.499807,0.49916,0.502843,0.501433,0.493627
std,34608.088767,8.739788,0.284004,0.287669,0.247881,0.287888,0.288085,0.288354,0.287063,0.285818,0.28605,0.288044,0.289832,0.285818,0.286464
min,30030.0,0.0,0.0,0.5,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,58640.0,7.0,0.26,0.74,0.31,0.24,0.25,0.25,0.26,0.24,0.26,0.25,0.25,0.26,0.25
50%,89318.0,15.0,0.5,1.0,0.5,0.505,0.5,0.5,0.51,0.49,0.5,0.49,0.5,0.5,0.49
75%,119086.5,22.0,0.74,1.24,0.7,0.74,0.74,0.75,0.75,0.73,0.74,0.75,0.75,0.74,0.74
max,149798.0,29.0,1.0,1.5,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Summary from describe()

- Numerical columns show almost perfectly symmetric distributions (mean ≈ median), which is rare in real labor datasets.
- Salary has a standard deviation of about 34,608 over a range of 30,030 to 149,798, almost exactly what a uniform distribution over that range would give (range / √12 ≈ 34,574); real salary data is typically right-skewed.
- Years of experience is spread almost evenly from 0 to 29 (quartiles at 7, 15, and 22), which does not reflect typical workforce patterns.
- Automation probability ranges cleanly between 0.05 and 0.95, with no outliers or jagged behavior, which is typical of generated data.
- Skill columns (`Skill_1` to `Skill_10`) have nearly identical statistical profiles (means ≈ 0.50, standard deviations ≈ 0.29), strongly indicating formulaic generation.

Overall, the distributions lack natural skew, long tails, or realistic irregularities that appear in genuine human workforce data.
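
One way to quantify the "too uniform" impression is a little arithmetic: a continuous uniform distribution on [a, b] has standard deviation (b - a) / √12. The sketch below compares that value with the min, max, and std that `describe()` reports for a few columns. The numbers are copied from the output above; the helper name `uniform_std` is introduced here for illustration.

```python
import math

# (min, max, reported std) copied from the describe() output above.
reported = {
    "Average_Salary": (30030.0, 149798.0, 34608.088767),
    "AI_Exposure_Index": (0.0, 1.0, 0.284004),
    "Tech_Growth_Factor": (0.5, 1.5, 0.287669),
    "Skill_1": (0.0, 1.0, 0.287888),
}

def uniform_std(lo, hi):
    """Standard deviation of a continuous uniform distribution on [lo, hi]."""
    return (hi - lo) / math.sqrt(12)

# A ratio close to 1 means the spread matches a uniform distribution.
ratios = {}
for col, (lo, hi, std) in reported.items():
    expected = uniform_std(lo, hi)
    ratios[col] = std / expected
    print(f"{col}: reported={std:.4f}  uniform={expected:.4f}  ratio={ratios[col]:.3f}")
```

A ratio near 1 is only suggestive, since other distributions can share the same spread; a histogram of each column on the full data would be a more direct check.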

## 5. Confirmed Possible Analysis Directions

Despite the dataset’s limitations, these analytical directions are technically possible:

- Examine which factors (skills, experience, education) correlate with salary.
- Explore relationships between automation probability and job characteristics.
- Investigate whether higher skill scores link to lower automation probability.

However, because the dataset appears synthetic and overly uniform, the insights would likely be artificial rather than meaningful.
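
As a rough illustration of the first direction, the sketch below computes Pearson correlations with salary on the five `head()` rows (re-typed inline so it runs on its own). Five rows are far too few for the numbers to mean anything; the point is only the shape of the call, which would be applied to `df` on the full data.

```python
import io
import pandas as pd

# Numeric columns from the five head() rows shown earlier.
sample_csv = """Average_Salary,Years_Experience,AI_Exposure_Index,Automation_Probability_2030
45795,28,0.18,0.85
133355,20,0.62,0.05
146216,2,0.86,0.81
136530,13,0.39,0.6
70397,22,0.52,0.64
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Pearson correlation of every numeric column with salary, sorted ascending.
corr_with_salary = (
    sample.select_dtypes(include="number")
    .corr()["Average_Salary"]
    .drop("Average_Salary")
    .sort_values()
)
print(corr_with_salary)
```

If the columns were generated independently, as the uniform profiles suggest, these correlations would be expected to sit near zero on the full 3000 rows.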


## 6. Strengths and Weaknesses

### Strengths
- No missing values, making preprocessing minimal.
- Most columns (15 of 18) are numerical, allowing easy correlation and modeling.
- Topic is modern and conceptually interesting.

### Weaknesses
- Distributions appear artificial and overly symmetric, indicating synthetic data.
- Education level is a text column with a few oversimplified categories and needs encoding before modeling.
- Skill columns lack definitions, severely limiting interpretability.
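
Because `Education_Level` is stored as text in the `info()` output, it has to be encoded before it can enter a correlation or model. A minimal sketch using an ordered categorical follows. The `levels` list contains only the three values visible in `head()`; the full dataset may contain others, which is an assumption to check with `df["Education_Level"].unique()`.

```python
import pandas as pd

# Education values from the five head() rows, in order.
education = pd.Series(
    ["Master's", "PhD", "High School", "PhD", "High School"],
    name="Education_Level",
)

# Assumed ordering, limited to the levels seen in head(); extend as needed.
levels = ["High School", "Master's", "PhD"]
encoded = pd.Categorical(education, categories=levels, ordered=True)

# Integer codes follow the order of `levels`; unlisted values would get -1.
codes = pd.Series(encoded.codes, name="Education_Code")
print(codes.tolist())  # [1, 2, 0, 2, 0]
```

An ordinal code treats the gaps between levels as equal, which is itself a simplification; one-hot encoding is an alternative when that assumption is uncomfortable.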
