# ETL Step 1: Combining CSV Files

## Objective
Merge the two raw CSV files (**ai_job_dataset.csv** and **ai_job_dataset1.csv**) into a single consolidated dataset while handling differences in their columns.

## Dataset
- **ai_job_dataset.csv** - Part 1 (19 columns, no `salary_local`)
- **ai_job_dataset1.csv** - Part 2 (20 columns, includes `salary_local`)

##  Column Differences
- The second file includes an extra column: `salary_local`.

##  Process
1. Load both CSV files with **pandas**.
2. Align their columns by adding any missing columns (like `salary_local`) with `NaN` values in the first dataset.
3. Concatenate the datasets into one.
4. Save the combined dataset as `ai_job_dataset_combined.csv`.

##  Expected Output
- A single CSV file with **all rows from both parts** and **consistent columns**.
- Missing values for `salary_local` in the first file are represented as `NaN` for later cleaning in ETL.
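
As a minimal sketch of the alignment described above, the toy frames below (hypothetical values, not the real files) show a column that exists in only one part being created in the other and filled with `NaN` before concatenation. It uses `DataFrame.reindex` rather than an explicit loop; the names `part1`, `part2` and `part1_aligned` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two raw files (hypothetical values).
part1 = pd.DataFrame({
    "job_id": ["AI00001", "AI00002"],
    "salary_usd": [90376, 61895],
})
part2 = pd.DataFrame({
    "job_id": ["AI00003"],
    "salary_usd": [152626],
    "salary_local": [152626.0],
})

# Give part1 the same columns, in the same order, as part2.
# Columns missing from part1 (here salary_local) are created and filled with NaN.
part1_aligned = part1.reindex(columns=part2.columns)

combined = pd.concat([part1_aligned, part2], ignore_index=True)

print(combined.shape)                         # rows from both parts, shared columns
print(combined["salary_local"].isna().sum())  # number of rows that came from part1
```

Comparing `len(combined)` with `len(part1) + len(part2)` is a cheap way to confirm that no rows were lost in the merge.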


In [13]:
import os

# Show the working directory so the relative paths used below can be checked
current_dir = os.getcwd()
current_dir

'/Users/anita/Documents/vscode-projects/global-ai-job-markets-and-salary-trends-2025/jupyter_notebooks'

In [16]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


ModuleNotFoundError: No module named 'numpy'

In [None]:
# Load the dataset
df1 = pd.read_csv("../data/inputs/raw/ai_job_dataset.csv")
df2 = pd.read_csv("../data/inputs/raw/ai_job_dataset1.csv")

# Align the columns of the two dataframes
for col in df2.columns:
    if col not in df1.columns:
        df1[col] = pd.NA # Fill missing columns with NA

# Ensure both datasets have same column order
df1 = df1[df2.columns]

# Concatenate the two datasets
combined_df = pd.concat([df1, df2], ignore_index=True)

# Save the combined dataset to a new CSV file
combined_df.to_csv("../data/inputs/raw/ai_job_dataset_combined.csv", index=False)

# Report the row count and the final column list
print(f" Combined dataset saved. Total rows: {combined_df.shape[0]}")
print(f" Final columns: {combined_df.columns.tolist()}")

 Combined dataset saved. Total rows: 30000
 Final columns: ['job_id', 'job_title', 'salary_usd', 'salary_currency', 'salary_local', 'experience_level', 'employment_type', 'company_location', 'company_size', 'employee_residence', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name']


# ETL Step 2: Handling Missing Values

## Objective
Identify and handle missing values in the combined dataset (**ai_job_dataset_combined.csv**) to ensure clean and reliable data for analysis.

## What We'll Do
1. **Load** the combined dataset.
2. **Check** for missing values using `isnull()` and `sum()`.
3. **Categorize missing values**:
   - **Critical columns** (e.g., `job_id`, `job_title`, `salary_usd`) – cannot have missing data.
   - **Optional columns** (e.g., `salary_local`, `benefits_score`) – can be missing and filled logically.
4. **Decide on handling strategy**:
   - **Drop rows** if critical data is missing.
   - **Fill NaNs** for non-critical data (e.g., replace missing `salary_local` with `Not Provided` or median).
5. **Save** the cleaned dataset as `ai_jobs_cleaned.csv`.

##  Expected Output
- Clean dataset with **no missing critical values** and logical handling of optional ones.
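
The plan above can be sketched on a toy frame. This is an illustration under assumed data, not the notebook's actual run: the values and the names `jobs`, `optional_cols` and `cleaned` are hypothetical, while `critical_cols` follows the list given above.

```python
import pandas as pd

# Toy data with gaps in one critical and one optional column (hypothetical values).
jobs = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003"],
    "job_title": ["Data Analyst", None, "NLP Engineer"],
    "salary_usd": [90376, 61895, 152626],
    "salary_local": [None, 61895.0, None],
})

critical_cols = ["job_id", "job_title", "salary_usd"]
optional_cols = [c for c in jobs.columns if c not in critical_cols]

# Count missing values per column, split by how each group will be handled.
missing = jobs.isnull().sum()
print("Critical:\n", missing[critical_cols])
print("Optional:\n", missing[optional_cols])

# Rows with a gap in any critical column are dropped; optional gaps are kept for filling.
cleaned = jobs.dropna(subset=critical_cols)
print(len(jobs) - len(cleaned), "row(s) dropped")
```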


In [None]:
# Load the combined dataset
df = pd.read_csv('../data/inputs/raw/ai_job_dataset_combined.csv')

In [None]:
df.isnull().sum()  # Check for missing values

job_id                        0
job_title                     0
salary_usd                    0
salary_currency               0
salary_local              15000
experience_level              0
employment_type               0
company_location              0
company_size                  0
employee_residence            0
remote_ratio                  0
required_skills               0
education_required            0
years_experience              0
industry                      0
posting_date                  0
application_deadline          0
job_description_length        0
benefits_score                0
company_name                  0
dtype: int64

In [None]:
# Define critical columns for analysis
critical_cols = ["job_id", "job_title", "salary_usd"]

# Drop rows with missing critical values
# Capture initial rows for reporting
initial_rows_before_drop = len(df)
df.dropna(subset=critical_cols, inplace=True)
rows_dropped = initial_rows_before_drop - len(df)

if rows_dropped > 0:
    print(f"\nDropped {rows_dropped} rows due to missing critical values in {critical_cols}.")
else:
    print("\nNo rows dropped. Critical columns have no missing values.")


No rows dropped. Critical columns have no missing values.


In [None]:
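# Fill missing values for optional columns
# salary_local becomes a text (object) column once the placeholder string is added
df['salary_local'] = df['salary_local'].fillna('Not Provided')
# benefits_score has no missing values in this dataset; the median fill is a safeguard
df['benefits_score'] = df['benefits_score'].fillna(df['benefits_score'].median())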

# Handling Missing Values – Final Summary

**Findings**
- Only one column had missing values: `salary_local` (15,000 rows).
- All other 19 columns had **0 missing values**.

**Action Taken**
- We filled missing `salary_local` values with `"Not Provided"`.
- No rows were dropped since critical columns (`job_id`, `job_title`, `salary_usd`) had no missing data.

**Result**
- Total rows: **30,000**
- Total columns: **20**
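
Because `"Not Provided"` is a string, `salary_local` can no longer be used directly in arithmetic after the fill. The sketch below shows one possible way to recover a numeric view when it is needed, assuming the placeholder should count as missing again. The toy values and the name `salary_local_numeric` are hypothetical.

```python
import pandas as pd

# Toy column in its post-fill state: numbers mixed with the text placeholder.
salary_local = pd.Series(["Not Provided", 61895.0, 54624.0], dtype=object)

# errors="coerce" turns anything that is not a number (the placeholder) into NaN.
salary_local_numeric = pd.to_numeric(salary_local, errors="coerce")

# Statistics skip NaN by default, so only the real salaries are used.
print(salary_local_numeric.median())
```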


# ETL Step 3: Data Cleaning

##  Objective
Ensure the dataset is **clean, standardized, and analysis-ready** by removing inconsistencies and formatting errors.

## 🔍 What We'll Do
1. **Standardize column names**
   - Make all lowercase
   - Replace spaces with underscores

2. **Clean text fields**
   - Trim extra spaces
   - Ensure consistent capitalization for `job_title`, `company_name`, `industry`

3. **Validate date columns**
   - Convert `posting_date` & `application_deadline` to datetime format

4. **Check salary fields**
   - Ensure `salary_usd` is numeric
   - Keep `salary_local` as string since some values are "Not Provided"

5. **Save cleaned dataset**
   - File name: `ai_jobs_cleaned_v2.csv`

##  Expected Output
- Dataset with **clean, standardized column names**
- Text fields formatted consistently
- Dates properly converted
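
A small sketch of these cleaning steps on toy rows follows. The data and the frame name `raw` are hypothetical. The point it illustrates is that `errors="coerce"` replaces unparseable entries with `NaT` or `NaN` instead of raising, so counting the resulting gaps makes any silent loss visible.

```python
import pandas as pd

# Toy rows with untidy text, one bad date and one bad salary (hypothetical values).
raw = pd.DataFrame({
    "Job Title": ["  data analyst ", "NLP engineer"],
    "posting_date": ["2024-10-18", "not a date"],
    "salary_usd": ["90376", "n/a"],
})

# Lowercase the column names and replace spaces with underscores.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Trim surrounding whitespace in a text field.
raw["job_title"] = raw["job_title"].str.strip()

# Unparseable dates become NaT and unparseable numbers become NaN.
raw["posting_date"] = pd.to_datetime(raw["posting_date"], errors="coerce")
raw["salary_usd"] = pd.to_numeric(raw["salary_usd"], errors="coerce")

# Count what was coerced so the loss is not silent.
print(raw[["posting_date", "salary_usd"]].isna().sum())
```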



In [None]:
# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean text fields: strip spaces and standardize capitalization
# Note: .str.title() also rewrites codes and acronyms; the df.head() output
# below shows values such as "Usd", "Se" and "Nlp Engineer"
text_cols = ['job_title', 'company_name', 'industry', 'salary_currency',
             'experience_level', 'employment_type', 'company_location',
             'company_size', 'employee_residence']
for col in text_cols:
    if col in df.columns and df[col].dtype == 'object':
        df[col] = df[col].astype(str).str.strip().str.title()


# Validate and convert date columns
df["posting_date"] = pd.to_datetime(df["posting_date"], errors='coerce')
df["application_deadline"] = pd.to_datetime(df["application_deadline"], errors='coerce')


# Ensure salary_usd is numeric
df['salary_usd'] = pd.to_numeric(df['salary_usd'], errors='coerce')

# Remove duplicates based on 'job_id' (assuming 'job_id' is unique identifier)
initial_rows_before_dedupe = len(df)
df.drop_duplicates(subset='job_id', inplace=True)
duplicates_removed = initial_rows_before_dedupe - len(df)
if duplicates_removed > 0:
    print(f"Removed {duplicates_removed} duplicate rows based on 'job_id'.")
else:
    print("No duplicate rows found based on 'job_id'.")

# Validate salary ranges - remove highly unrealistic salaries
initial_rows_before_salary_check = len(df)
df = df[(df['salary_usd'] >= 1) & (df['salary_usd'] <= 1000000)].copy() 
salaries_out_of_range = initial_rows_before_salary_check - len(df)
if salaries_out_of_range > 0:
    print(f"Removed {salaries_out_of_range} rows with unrealistic 'salary_usd' values (<1 or >1M).")
else:
    print("No unrealistic salary_usd values found within range (1 to 1,000,000 USD).")

# Quick check columns and data types after cleaning
df.dtypes.head(10)

No duplicate rows found based on 'job_id'.
No unrealistic salary_usd values found within range (1 to 1,000,000 USD).


job_id                object
job_title             object
salary_usd             int64
salary_currency       object
salary_local          object
experience_level      object
employment_type       object
company_location      object
company_size          object
employee_residence    object
dtype: object

In [None]:
df.head()

Unnamed: 0,job_id,job_title,salary_usd,salary_currency,salary_local,experience_level,employment_type,company_location,company_size,employee_residence,remote_ratio,required_skills,education_required,years_experience,industry,posting_date,application_deadline,job_description_length,benefits_score,company_name
0,AI00001,Ai Research Scientist,90376,Usd,Not Provided,Se,Ct,China,M,China,50,"Tableau, PyTorch, Kubernetes, Linux, NLP",Bachelor,9,Automotive,2024-10-18,2024-11-07,1076,5.9,Smart Analytics
1,AI00002,Ai Software Engineer,61895,Usd,Not Provided,En,Ct,Canada,M,Ireland,100,"Deep Learning, AWS, Mathematics, Python, Docker",Master,1,Media,2024-11-20,2025-01-11,1268,5.2,Techcorp Inc
2,AI00003,Ai Specialist,152626,Usd,Not Provided,Mi,Fl,Switzerland,L,South Korea,0,"Kubernetes, Deep Learning, Java, Hadoop, NLP",Associate,2,Education,2025-03-18,2025-04-07,1974,9.4,Autonomous Tech
3,AI00004,Nlp Engineer,80215,Usd,Not Provided,Se,Fl,India,M,India,50,"Scala, SQL, Linux, Python",PhD,7,Consulting,2024-12-23,2025-02-24,1345,8.6,Future Systems
4,AI00005,Ai Consultant,54624,Eur,Not Provided,En,Pt,France,S,Singapore,100,"MLOps, Java, Tableau, Python",Master,0,Media,2025-04-15,2025-06-23,1989,6.6,Advanced Robotics


# ETL Step 4: Data Transformation

##  Objective
Make the dataset analysis-friendly by creating new calculated fields and ensuring all formats are correct.

##  Planned Transformations
1. **Create `salary_category`**
   - Low (< $50,000)
   - Mid ($50,000–$100,000)
   - High (> $100,000)

2. **Create `remote_status`**
   - From `remote_ratio`:
     - 0 → On-site
     - 50 → Hybrid
     - 100 → Fully Remote

3. **Extract date parts**
   - From `posting_date` → new columns: `posting_year`, `posting_month`

4. **Reorder columns**
   - Place the most important ones (`job_id`, `job_title`, `salary_usd`, `salary_category`, `remote_status`) up front

## ✅ Expected Output
- A richer dataset with **categorical features** and **date insights**
- All fields ready for EDA and dashboarding
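
The planned transformations can also be written in vectorized form. The sketch below is an alternative to row-by-row `apply`, not necessarily the approach used in this notebook: it relies on `numpy.select` for the salary bands and a dictionary lookup via `Series.map` for the remote status. The toy values and the names `demo` and `remote_labels` are hypothetical, chosen to hit each boundary.

```python
import numpy as np
import pandas as pd

# Toy values chosen to hit each boundary (hypothetical data).
demo = pd.DataFrame({
    "salary_usd": [49999, 50000, 100000, 100001],
    "remote_ratio": [0, 50, 100, 25],
    "posting_date": pd.to_datetime(
        ["2024-10-18", "2024-11-20", "2025-03-18", "2025-04-15"]
    ),
})

# Salary bands: the first matching condition wins, anything else is "High".
demo["salary_category"] = np.select(
    [demo["salary_usd"] < 50000, demo["salary_usd"] <= 100000],
    ["Low", "Mid"],
    default="High",
)

# Dictionary lookup for remote status; unmapped ratios become "Unknown".
remote_labels = {0: "On-site", 50: "Hybrid", 100: "Fully Remote"}
demo["remote_status"] = demo["remote_ratio"].map(remote_labels).fillna("Unknown")

# Date parts via the .dt accessor.
demo["posting_year"] = demo["posting_date"].dt.year
demo["posting_month"] = demo["posting_date"].dt.month

print(demo)
```

Testing the exact boundary values (49,999, 50,000, 100,000, 100,001) guards against off-by-one mistakes in the band definitions.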


In [None]:
# Create `salary_category`
def categorize_salary(salary):
    if salary < 50000:
        return 'Low'
    elif 50000 <= salary <= 100000:
        return 'Mid'
    else: # salary > 100000
        return 'High'

df['salary_category'] = df['salary_usd'].apply(categorize_salary)
df['salary_category'].value_counts()

# Create `remote_status` from `remote_ratio` 
def map_remote_ratio(ratio):
    if ratio == 0:
        return 'On-site'
    elif ratio == 50:
        return 'Hybrid'
    elif ratio == 100:
        return 'Fully Remote'
    else:
        return 'Unknown'

df['remote_status'] = df['remote_ratio'].apply(map_remote_ratio)
df['remote_status'].value_counts()


#  Extract date parts from `posting_date` 
# Ensure 'posting_date' is datetime (should be from previous cleaning step)
df['posting_year'] = df['posting_date'].dt.year
df['posting_month'] = df['posting_date'].dt.month

# Reorder columns
# Define the desired order, putting important columns first
desired_column_order = [
    'job_id',
    'job_title',
    'salary_usd',
    'salary_category',
    'remote_status',
    'experience_level',
    'employment_type',
    'company_location',
    'employee_residence',
    'company_name',
    'company_size',
    'remote_ratio', 
    'salary_currency',
    'salary_local', 
    'required_skills',
    'education_required',
    'years_experience',
    'industry',
    'posting_date',
    'posting_year',
    'posting_month',
    'application_deadline',
    'job_description_length',
    'benefits_score'
]

# Get columns that exist in our DataFrame and are in the desired order
final_columns = [col for col in desired_column_order if col in df.columns]
df_final = df[final_columns].copy() 

print("\nColumns reordered.")
print(f"Final dataset shape: {df_final.shape}")
print(f"Final columns: {df_final.columns.tolist()}")



Columns reordered.
Final dataset shape: (15000, 24)
Final columns: ['job_id', 'job_title', 'salary_usd', 'salary_category', 'remote_status', 'experience_level', 'employment_type', 'company_location', 'employee_residence', 'company_name', 'company_size', 'remote_ratio', 'salary_currency', 'salary_local', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'posting_year', 'posting_month', 'application_deadline', 'job_description_length', 'benefits_score']


## Save Cleaned Data
 
The final step in the ETL process is to save the cleaned and transformed DataFrame. This ensures that the prepared data can be easily accessed for subsequent analysis phases (Exploratory Data Analysis and Machine Learning Modeling) without needing to re-run the entire cleaning script.

In [None]:
# Save the cleaned DataFrame to a new CSV file
df_final.to_csv("../data/inputs/cleaned/ai_job_cleaned_dataset.csv", index=False)
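
CSV files do not store column types, so datetime columns come back as text when the file is read again. The sketch below shows one way to restore them with the `parse_dates` argument of `pd.read_csv`. It writes to an in-memory buffer rather than the real output path, and the toy values and the names `frame`, `plain` and `parsed` are hypothetical.

```python
import io

import pandas as pd

# Toy frame with a datetime column (hypothetical values).
frame = pd.DataFrame({
    "job_id": ["AI00001", "AI00002"],
    "posting_date": pd.to_datetime(["2024-10-18", "2024-11-20"]),
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the date column comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: the date column is datetime again.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["posting_date"])

print(plain["posting_date"].dtype, parsed["posting_date"].dtype)
```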
