# ETL Step 1: Combining CSV Files

## Objective
Merge the two raw CSV files (**ai_job_dataset.csv** and **ai_job_dataset1.csv**) into a single consolidated dataset while handling differences in their columns.

## Dataset
- **ai_job_dataset.csv** - Part 1 (19 columns, no `salary_local`)
- **ai_job_dataset1.csv** - Part 2 (20 columns, includes `salary_local`)

## Column Differences
- The second file includes an extra column: `salary_local`.

## Process
1. Load both CSV files with **pandas**.
2. Align their columns by adding any missing columns (like `salary_local`) with `NaN` values in the first dataset.
3. Concatenate the datasets into one.
4. Save the combined dataset as `ai_job_dataset_combined.csv`.

## Expected Output
- A single CSV file with **all rows from both parts** and **consistent columns**.
- Missing values for `salary_local` in the first file are represented as `NaN` for later cleaning in ETL.
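
The alignment step can be illustrated on toy data. The sketch below assumes two small, made-up frames standing in for the raw files; it relies on `pd.concat` taking the union of column names by default, so the explicit column-adding loop is one option rather than a requirement.

```python
import pandas as pd

# Toy stand-ins for the two raw files (values are invented for illustration)
part1 = pd.DataFrame({"job_id": ["A1", "A2"], "salary_usd": [100, 200]})
part2 = pd.DataFrame(
    {"job_id": ["B1"], "salary_usd": [300], "salary_local": [280.0]}
)

# concat uses join="outer" by default: columns are matched by name and
# any column missing from one frame is filled with NaN for its rows
combined = pd.concat([part1, part2], ignore_index=True)

# Put the columns in the wider file's order
combined = combined[list(part2.columns)]
print(combined)
```

Either approach should give the same shape; adding the column explicitly first mainly makes the intent visible to a reader.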


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [7]:
# Load the dataset
df1 = pd.read_csv("../data/inputs/raw/ai_job_dataset.csv")
df2 = pd.read_csv("../data/inputs/raw/ai_job_dataset1.csv")

# Align the columns of the two dataframes
for col in df2.columns:
    if col not in df1.columns:
        df1[col] = pd.NA # Fill missing columns with NA

# Ensure both datasets have same column order
df1 = df1[df2.columns]

# Concatenate the two datasets
combined_df = pd.concat([df1, df2], ignore_index=True)

# Save the combined dataset to a new CSV file
combined_df.to_csv("../data/inputs/raw/ai_job_dataset_combined.csv", index=False)

# Print the shape of the combined dataset
print(f" Combined dataset saved. Total rows: {combined_df.shape[0]}")
print(f" Final columns: {combined_df.columns.tolist()}")

 Combined dataset saved. Total rows: 30000
 Final columns: ['job_id', 'job_title', 'salary_usd', 'salary_currency', 'salary_local', 'experience_level', 'employment_type', 'company_location', 'company_size', 'employee_residence', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name']


# ETL Step 2: Handling Missing Values

## Objective
Identify and handle missing values in the combined dataset (**ai_job_dataset_combined.csv**) to ensure clean and reliable data for analysis.

## What We'll Do
1. **Load** the combined dataset.
2. **Check** for missing values using `isnull()` and `sum()`.
3. **Categorize missing values**:
   - **Critical columns** (e.g., `job_id`, `job_title`, `salary_usd`) – cannot have missing data.
   - **Optional columns** (e.g., `salary_local`, `benefits_score`) – can be missing and filled logically.
4. **Decide on handling strategy**:
   - **Drop rows** if critical data is missing.
   - **Fill NaNs** for non-critical data (e.g., replace missing `salary_local` with `"Not Provided"`, or fill a numeric column such as `benefits_score` with its median).
5. **Save** the cleaned dataset as `ai_jobs_cleaned.csv`.

## Expected Output
- Clean dataset with **no missing critical values** and logical handling of optional ones.
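
As a rough sketch of the strategy above, the helper below drops rows that lack critical values and then fills optional columns. The function name `handle_missing`, its parameters, and the toy data are hypothetical; the column is cast to `object` before the text placeholder is inserted so that numbers and text can share it.

```python
import pandas as pd

def handle_missing(df, critical_cols, sentinel_cols, median_cols):
    """Drop rows missing critical values, then fill optional columns."""
    cleaned = df.dropna(subset=critical_cols).copy()
    for col in sentinel_cols:
        # Cast to object so the text placeholder can sit next to numbers
        cleaned[col] = cleaned[col].astype("object").fillna("Not Provided")
    for col in median_cols:
        # The median is computed after the critical-row drop
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned

# Invented example: one row lacks job_id, two optional values are missing
toy = pd.DataFrame({
    "job_id": ["A1", "A2", None, "A4"],
    "salary_local": [None, 180.0, 250.0, 90.0],
    "benefits_score": [5.0, None, 7.0, 9.0],
})
result = handle_missing(toy, ["job_id"], ["salary_local"], ["benefits_score"])
print(result)
```

Note that a text placeholder turns `salary_local` into a mixed-type column, so any later numeric work on it would need to filter or convert those rows first.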


In [8]:
# Load the combined dataset
df = pd.read_csv('../data/inputs/raw/ai_job_dataset_combined.csv')

In [9]:
df.isnull().sum()  # Check for missing values

job_id                        0
job_title                     0
salary_usd                    0
salary_currency               0
salary_local              15000
experience_level              0
employment_type               0
company_location              0
company_size                  0
employee_residence            0
remote_ratio                  0
required_skills               0
education_required            0
years_experience              0
industry                      0
posting_date                  0
application_deadline          0
job_description_length        0
benefits_score                0
company_name                  0
dtype: int64

In [10]:
# Drop rows with missing critical values
critical_cols = ["job_id", "job_title", "salary_usd"]
df = df.dropna(subset=critical_cols)

In [13]:
# Fill missing values for optional columns
df['salary_local'] = df['salary_local'].fillna('Not Provided')
df['benefits_score'] = df['benefits_score'].fillna(df['benefits_score'].median())

# Handling Missing Values – Final Summary

**Findings**
- Only one column had missing values: `salary_local` (15,000 rows).
- All other 19 columns had **0 missing values**.

**Action Taken**
- We filled missing `salary_local` values with `"Not Provided"`.
- No rows were dropped since critical columns (`job_id`, `job_title`, `salary_usd`) had no missing data.

**Result**
- Total rows: **30,000**
- Total columns: **20**


# ETL Step 3: Data Cleaning

## Objective
Ensure the dataset is **clean, standardized, and analysis-ready** by removing inconsistencies and formatting errors.

## 🔍 What We'll Do
1. **Standardize column names**
   - Make all lowercase
   - Replace spaces with underscores

2. **Clean text fields**
   - Trim extra spaces
   - Ensure consistent capitalization for `job_title`, `company_name`, `industry`

3. **Validate date columns**
   - Convert `posting_date` & `application_deadline` to datetime format

4. **Check salary fields**
   - Ensure `salary_usd` is numeric
   - Keep `salary_local` as string since some values are "Not Provided"

5. **Save cleaned dataset**
   - File name: `ai_jobs_cleaned_v2.csv`

## Expected Output
- Dataset with **clean, standardized column names**
- Text fields formatted consistently
- Dates properly converted
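
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can help to count what was lost after conversion. The sketch below uses invented values; the check that a deadline should not precede its posting date is an assumption about the data, not something stated in the source files.

```python
import pandas as pd

# Invented rows: one unparseable posting date, one deadline before posting
toy = pd.DataFrame({
    "posting_date": ["2024-10-18", "not a date", "2025-03-18"],
    "application_deadline": ["2024-11-07", "2025-01-11", "2025-03-01"],
})

for col in ["posting_date", "application_deadline"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Values that failed to parse are now NaT
unparsed = int(toy["posting_date"].isna().sum())

# Comparisons involving NaT evaluate to False, so only fully parsed rows count
bad_order = int((toy["application_deadline"] < toy["posting_date"]).sum())

print(f"Unparsed posting dates: {unparsed}")
print(f"Deadlines before posting date: {bad_order}")
```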



In [14]:
# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean text fields: strip spaces and standardize capitalization
text_cols = ['job_title', 'company_name', 'industry']
for col in text_cols:
    df[col] = df[col].str.strip().str.title()

# Validate and convert date columns
df["posting_date"] = pd.to_datetime(df["posting_date"], errors='coerce')
df["application_deadline"] = pd.to_datetime(df["application_deadline"], errors='coerce')

# Ensure salary_usd is numeric
df['salary_usd'] = pd.to_numeric(df['salary_usd'], errors='coerce')

# Quick check columns and data types
df.dtypes.head(10)

job_id                object
job_title             object
salary_usd             int64
salary_currency       object
salary_local          object
experience_level      object
employment_type       object
company_location      object
company_size          object
employee_residence    object
dtype: object

In [16]:
df.head()

| | job_id | job_title | salary_usd | salary_currency | salary_local | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AI00001 | Ai Research Scientist | 90376 | USD | Not Provided | SE | CT | China | M | China | 50 | Tableau, PyTorch, Kubernetes, Linux, NLP | Bachelor | 9 | Automotive | 2024-10-18 | 2024-11-07 | 1076 | 5.9 | Smart Analytics |
| 1 | AI00002 | Ai Software Engineer | 61895 | USD | Not Provided | EN | CT | Canada | M | Ireland | 100 | Deep Learning, AWS, Mathematics, Python, Docker | Master | 1 | Media | 2024-11-20 | 2025-01-11 | 1268 | 5.2 | Techcorp Inc |
| 2 | AI00003 | Ai Specialist | 152626 | USD | Not Provided | MI | FL | Switzerland | L | South Korea | 0 | Kubernetes, Deep Learning, Java, Hadoop, NLP | Associate | 2 | Education | 2025-03-18 | 2025-04-07 | 1974 | 9.4 | Autonomous Tech |
| 3 | AI00004 | Nlp Engineer | 80215 | USD | Not Provided | SE | FL | India | M | India | 50 | Scala, SQL, Linux, Python | PhD | 7 | Consulting | 2024-12-23 | 2025-02-24 | 1345 | 8.6 | Future Systems |
| 4 | AI00005 | Ai Consultant | 54624 | EUR | Not Provided | EN | PT | France | S | Singapore | 100 | MLOps, Java, Tableau, Python | Master | 0 | Media | 2025-04-15 | 2025-06-23 | 1989 | 6.6 | Advanced Robotics |
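
The preview shows a side effect of `.str.title()`: it lowercases everything after the first letter of each word, so titles read "Ai Research Scientist" and "Nlp Engineer", and a company appears as "Techcorp Inc". If acronyms should stay uppercase, one option is a small helper like the sketch below. The `ACRONYMS` set and the `smart_title` function are hypothetical and would need tuning against the real values.

```python
import pandas as pd

# Hypothetical acronym list; extend it to match the dataset
ACRONYMS = {"ai", "nlp", "ml"}

def smart_title(text):
    """Title-case each word, keeping known acronyms fully uppercase."""
    words = text.strip().split()
    return " ".join(
        w.upper() if w.lower() in ACRONYMS else w.capitalize() for w in words
    )

titles = pd.Series(["  AI research scientist ", "NLP Engineer", "data analyst"])
cleaned = titles.map(smart_title)
print(cleaned.tolist())
```

This still lowercases internal capitals in ordinary words, so mixed-case company names would need their own handling or could simply be left untouched.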


# ETL Step 4: Data Transformation

## Objective
Make the dataset analysis-friendly by creating new calculated fields and ensuring all formats are correct.

## Planned Transformations
1. **Create `salary_category`**
   - Low (< $50,000)
   - Mid ($50,000–$100,000)
   - High (> $100,000)

2. **Create `remote_status`**
   - From `remote_ratio`:
     - 0 → Onsite
     - 50 → Hybrid
     - 100 → Fully Remote

3. **Extract date parts**
   - From `posting_date` → new columns: `posting_year`, `posting_month`

4. **Reorder columns**
   - Place the most important ones (`job_id`, `job_title`, `salary_usd`, `salary_category`, `remote_status`) up front

## ✅ Expected Output
- A richer dataset with **categorical features** and **date insights**
- All fields ready for EDA and dashboarding
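
The planned transformations could be sketched as follows on a small invented frame. This is one possible implementation, not the notebook's own: it assumes `salary_usd` holds whole numbers (as the `int64` dtype earlier suggests) and treats the Mid band as inclusive at both ends.

```python
import pandas as pd

# Invented sample rows with the columns the transformations need
toy = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003"],
    "job_title": ["AI Research Scientist", "AI Software Engineer", "AI Specialist"],
    "salary_usd": [45000, 100000, 152626],
    "remote_ratio": [0, 50, 100],
    "posting_date": pd.to_datetime(["2024-10-18", "2024-11-20", "2025-03-18"]),
})

# 1. Salary bands. Bins are right-closed, so with whole-dollar salaries:
#    Low <= 49,999; Mid 50,000 to 100,000; High > 100,000
toy["salary_category"] = pd.cut(
    toy["salary_usd"],
    bins=[-float("inf"), 49_999, 100_000, float("inf")],
    labels=["Low", "Mid", "High"],
)

# 2. Remote status. Ratios outside the mapping become NaN
remote_map = {0: "Onsite", 50: "Hybrid", 100: "Fully Remote"}
toy["remote_status"] = toy["remote_ratio"].map(remote_map)

# 3. Date parts from posting_date
toy["posting_year"] = toy["posting_date"].dt.year
toy["posting_month"] = toy["posting_date"].dt.month

# 4. Move the key columns to the front, keeping the rest in place
front = ["job_id", "job_title", "salary_usd", "salary_category", "remote_status"]
toy = toy[front + [c for c in toy.columns if c not in front]]
print(toy)
```

`pd.cut` returns a categorical column, which keeps the Low/Mid/High ordering for sorting and plotting.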


In [21]:
# Check and remove duplicates
duplicates = df.duplicated(subset='job_id').sum()
if duplicates > 0:
    df = df.drop_duplicates(subset='job_id')

# Validate salary ranges
invalid_salaries = df[(df['salary_usd'] < 1) | (df['salary_usd'] > 1000000)]
df = df[(df['salary_usd'] >= 1) & (df['salary_usd'] <= 1000000)]

# Validate remote_ratio
df["remote_ratio"].unique()

# Check categories integrity
df["salary_currency"].unique()


df.shape

(15000, 20)
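
The final shape shows 15,000 rows, down from the 30,000 reported after Step 2, but the cell does not print which check removed them. In a notebook, only the last expression in a cell is displayed, so the intermediate `.unique()` calls produce no visible output. A hedged sketch of the same checks with explicit counts, on invented data:

```python
import pandas as pd

# Invented rows: one repeated job_id, one zero salary, one odd remote_ratio
toy = pd.DataFrame({
    "job_id": ["A1", "A1", "A2", "A3"],
    "salary_usd": [90000, 90000, 0, 120000],
    "remote_ratio": [0, 0, 50, 75],
})

# Count and drop repeated job_id values (the first occurrence is kept)
n_duplicates = int(toy.duplicated(subset="job_id").sum())
deduped = toy.drop_duplicates(subset="job_id")

# Flag salaries outside the 1 to 1,000,000 range used in the cell above
in_range = deduped["salary_usd"].between(1, 1_000_000)
n_invalid_salary = int((~in_range).sum())
valid = deduped[in_range]

# Report remote_ratio values other than the expected 0, 50, 100
unexpected_ratios = sorted(set(valid["remote_ratio"]) - {0, 50, 100})

print(f"Duplicate job_id rows dropped: {n_duplicates}")
print(f"Rows with out-of-range salary dropped: {n_invalid_salary}")
print(f"Unexpected remote_ratio values: {unexpected_ratios}")
print(f"Remaining shape: {valid.shape}")
```

Printing these counts next to `df.shape` makes it easier to attribute a drop in row count to a specific step.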