In [29]:
# %% [markdown]
# # Strategic Workforce Analysis: AI Integration vs. Structural Risk (2010-2025)
# ## Task 1: Data Preprocessing & Outlier Handling

# %%
import pandas as pd
import numpy as np
import os

# --- 1. Path Setup ---
# Assumes the notebook is run from the notebooks/ folder, so the parent of the
# current working directory is the project root that contains data/
BASE_DIR = os.path.dirname(os.getcwd())
RAW_DATA_PATH = os.path.join(BASE_DIR, 'data', 'raw', 'ai_impact_jobs_2010_2025.csv')
PROCESSED_DIR = os.path.join(BASE_DIR, 'data', 'processed')

# Ensure folder exists
os.makedirs(PROCESSED_DIR, exist_ok=True)

# 2. Load Dataset
df = pd.read_csv(RAW_DATA_PATH)

# 3. Data Cleaning (Nulls)
df['ai_skills'] = df['ai_skills'].fillna('Not Specified')
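# Avoid the literal string 'None' as a placeholder: recent pandas versions treat it
# as a missing value in read_csv, so it would come back as NaN when the CSV is reloaded.
df['ai_keywords'] = df['ai_keywords'].fillna('No Keywords')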

# 4. NESTED IQR LOGIC (the core of the project)
# We compute the IQR within each (region, seniority_level) group, so that each salary
# is judged against its local market rather than the global distribution.
cleaned_chunks = []

# Iterate over every (region, seniority_level) combination.
# Note: rows with a missing region or seniority_level match no combination and are dropped.
for region in df['region'].unique():
    for level in df['seniority_level'].unique():
        # Create a subset
        subset = df[(df['region'] == region) & (df['seniority_level'] == level)]
        
        if len(subset) > 3:  # Filter only groups with more than 3 rows; smaller groups are kept as-is
            q1 = subset['salary_usd'].quantile(0.25)
            q3 = subset['salary_usd'].quantile(0.75)
            iqr = q3 - q1
            lower = q1 - 1.5 * iqr
            upper = q3 + 1.5 * iqr
            # Filter the subset
            subset = subset[(subset['salary_usd'] >= lower) & (subset['salary_usd'] <= upper)]
        
        cleaned_chunks.append(subset)

# Recombine the groups. Every chunk is a slice of df, so all columns are preserved
# (row order now follows the group loop, not the raw file).
df_cleaned = pd.concat(cleaned_chunks).reset_index(drop=True)

# 5. Save the Cleaned Dataset
CLEANED_FILE_PATH = os.path.join(PROCESSED_DIR, 'ai_impact_jobs_cleaned.csv')
df_cleaned.to_csv(CLEANED_FILE_PATH, index=False)

print(f"Task 1 Success! Cleaned data saved at: {CLEANED_FILE_PATH}")
print(f"Final Columns Check: {df_cleaned.columns.tolist()}")

Task 1 Success! Cleaned data saved at: /Users/miraekang/proyectos/eda/data/processed/ai_impact_jobs_cleaned.csv
Final Columns Check: ['job_id', 'posting_year', 'country', 'region', 'city', 'company_name', 'company_size', 'industry', 'job_title', 'seniority_level', 'ai_mentioned', 'ai_keywords', 'ai_intensity_score', 'core_skills', 'ai_skills', 'salary_usd', 'salary_change_vs_prev_year_percent', 'automation_risk_score', 'reskilling_required', 'ai_job_displacement_risk', 'job_description_embedding_cluster', 'industry_ai_adoption_stage']
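
# %% [markdown]
# **Sketch: the same per-group IQR rule without explicit loops.**
# The nested loop in Task 1 can also be written with `groupby(...).transform(...)`.
# This is a minimal sketch, not the notebook's actual implementation: the helper name
# `filter_iqr_by_group` and the toy rows are hypothetical, and only the column names
# (`region`, `seniority_level`, `salary_usd`) come from the dataset.
# Unlike the loop, this version keeps the original row order.

```python
# %%
import pandas as pd


def filter_iqr_by_group(frame, group_cols, value_col, k=1.5, min_rows=4):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] of their own group.

    Hypothetical helper: groups with fewer than `min_rows` rows are left untouched,
    which mirrors the `len(subset) > 3` check in the loop version.
    """
    grouped = frame.groupby(group_cols)[value_col]
    # transform() broadcasts each group's statistic back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    group_size = grouped.transform("size")
    iqr = q3 - q1
    in_range = frame[value_col].between(q1 - k * iqr, q3 + k * iqr)
    keep = in_range | (group_size < min_rows)
    return frame[keep].reset_index(drop=True)


# Toy data (made-up values, for illustration only)
toy = pd.DataFrame({
    "region": ["North America"] * 5 + ["South Asia"] * 3,
    "seniority_level": ["Senior"] * 5 + ["Junior"] * 3,
    "salary_usd": [100, 105, 110, 115, 1000, 50, 55, 5000],
})

toy_cleaned = filter_iqr_by_group(toy, ["region", "seniority_level"], "salary_usd")
print(toy_cleaned)
```

# %% [markdown]
# In the toy frame, the 1000 in the five-row group falls outside that group's fences
# and is removed, while the three-row group is too small to be filtered, so its 5000 stays.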


In [None]:
# %% [markdown]
# ## Task 2: Establishing the Salary Baseline (Q1)
# **Objective:** To define the global salary standard and identify regional variations.

# %%
import plotly.express as px
import pandas as pd
import os

# Redefine the paths so this cell can run independently of Task 1
BASE_DIR = os.path.dirname(os.getcwd())
CLEANED_FILE_PATH = os.path.join(BASE_DIR, 'data', 'processed', 'ai_impact_jobs_cleaned.csv')

# Load the clean data
df_cleaned = pd.read_csv(CLEANED_FILE_PATH)

# %% [markdown]
# ### 2.1 The Global Distribution (Post-Nested IQR)
# **Story:** After removing outliers within each (region, seniority) group, the histogram shows the "True Market Range".
# We use this range as the salary baseline for the comparisons that follow.

# %%
fig1 = px.histogram(df_cleaned, x="salary_usd", marginal="box",
                   title="<b>Graph 1: Global Salary Baseline - Mainstream Market Range</b>",
                   labels={'salary_usd': 'Salary (USD)'},
                   color_discrete_sequence=['#27ae60'], 
                   template="plotly_white")
fig1.show()

# %% [markdown]
# ### 2.2 The Locality Paradox: North America vs. South Asia
# **Question:** Why did our global seniority analysis look "flat" earlier? 
# **Result:** As shown below, an Intern in North America often earns more than a Senior in South Asia.
# **Story:** In this comparison, "Where" (region) appears to be a stronger salary driver than "Who" (seniority).

# %%
# Comparing two extreme markets to tell a story
target_regions = ['North America', 'South Asia']
df_comparison = df_cleaned[df_cleaned['region'].isin(target_regions)]

fig2 = px.box(df_comparison, x="seniority_level", y="salary_usd", color="region",
             category_orders={"seniority_level": ["Intern", "Junior", "Mid", "Senior", "Lead", "Executive"]},
             title="<b>Graph 2: Market Maturity Gap - North America vs. South Asia</b>",
             labels={'salary_usd': 'Annual Salary (USD)', 'seniority_level': 'Seniority Level'},
             template="simple_white")

fig2.update_layout(boxmode='group')
fig2.show()
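
# %% [markdown]
# **Sketch: putting numbers behind the box plot.**
# The region-versus-seniority comparison can be checked numerically with a median table.
# This is a minimal sketch under stated assumptions: the helper `median_salary_table`
# is hypothetical, and the toy salaries below are invented for illustration.
# They are not results from the dataset. To check the real data, pass `df_comparison`
# to the helper instead of `toy`.

```python
# %%
import pandas as pd


def median_salary_table(frame):
    """Median salary_usd for each (seniority_level, region) pair (hypothetical helper)."""
    return frame.pivot_table(index="seniority_level", columns="region",
                             values="salary_usd", aggfunc="median")


# Toy data (made-up values, for illustration only)
toy = pd.DataFrame({
    "region": ["North America"] * 4 + ["South Asia"] * 4,
    "seniority_level": ["Intern", "Intern", "Senior", "Senior"] * 2,
    "salary_usd": [60000, 64000, 140000, 150000, 8000, 10000, 30000, 34000],
})

table = median_salary_table(toy)
intern_na = table.loc["Intern", "North America"]
senior_sa = table.loc["Senior", "South Asia"]

# Spread of medians: how far apart groups are when split by one factor only
region_medians = toy.groupby("region")["salary_usd"].median()
seniority_medians = toy.groupby("seniority_level")["salary_usd"].median()
region_spread = region_medians.max() - region_medians.min()
seniority_spread = seniority_medians.max() - seniority_medians.min()

print(table)
print(f"Intern (North America) median: {intern_na}, Senior (South Asia) median: {senior_sa}")
print(f"Spread across regions: {region_spread}, across seniority levels: {seniority_spread}")
```

# %% [markdown]
# A larger spread across regions than across seniority levels would be consistent with
# the "Where over Who" reading; medians are used because they are less sensitive to
# any remaining extreme salaries.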

# %% [markdown]
# **Strategic Bridge:** 
# If regional geography dictates the baseline, can **AI Intensity** be the catalyst that breaks this geographic barrier? 
# In Task 3, we will analyze the ROI of AI Integration.