# ETL Step 1: Combining CSV Files

## Objective
Merge the two raw CSV files (**ai_job_dataset.csv** and **ai_job_dataset1.csv**) into a single consolidated dataset, while handling the difference in their columns.

## Dataset
- **ai_job_dataset.csv** - Part 1 (19 columns, no `salary_local`)
- **ai_job_dataset1.csv** - Part 2 (20 columns, includes `salary_local`)

## Column Differences
- The second file includes an extra column: `salary_local`.

## Process
1. Load both CSV files with **pandas**.
2. Align their columns by adding any missing columns (like `salary_local`) with `NaN` values in the first dataset.
3. Concatenate the datasets into one.
4. Save the combined dataset as `ai_job_dataset_combined.csv`.
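
Steps 2 and 3 can be tried out on two small frames before running them on the real files. The sketch below is illustrative only: the values are made up, and it uses `DataFrame.reindex(columns=...)`, which adds any missing column as `NaN` and applies the target column order in a single call, as one possible way to do the alignment.

```python
import pandas as pd

# Toy stand-ins for the two raw files; the values are illustrative only.
part1 = pd.DataFrame({"job_id": ["A1", "A2"], "salary_usd": [100, 200]})
part2 = pd.DataFrame(
    {"job_id": ["B1"], "salary_usd": [300], "salary_local": [280.0]}
)

# Add any column that part2 has but part1 lacks (filled with NaN)
# and put the columns in part2's order.
aligned_part1 = part1.reindex(columns=part2.columns)

# Stack the rows; ignore_index=True gives a fresh 0..n-1 index.
combined = pd.concat([aligned_part1, part2], ignore_index=True)

print(combined)
print(combined["salary_local"].isna().sum())  # number of rows that came from part1
```

Note that `reindex(columns=...)` also drops any column that exists only in the first frame, so it is only a good fit when the second file's columns are a superset of the first's, as they are here.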

## Expected Output
- A single CSV file with **all rows from both parts** and **consistent columns**.
- Missing values for `salary_local` in rows from the first file are represented as `NaN`, to be cleaned in a later ETL step.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [7]:
# Load the two raw datasets
df1 = pd.read_csv("../data/inputs/raw/ai_job_dataset.csv")
df2 = pd.read_csv("../data/inputs/raw/ai_job_dataset1.csv")

# Align the columns of the two dataframes
for col in df2.columns:
    if col not in df1.columns:
        df1[col] = pd.NA # Fill missing columns with NA

# Ensure both datasets have same column order
df1 = df1[df2.columns]

# Concatenate the two datasets
combined_df = pd.concat([df1, df2], ignore_index=True)

# Save the combined dataset to a new CSV file
combined_df.to_csv("../data/inputs/raw/ai_job_dataset_combined.csv", index=False)

# Print the row count and column names of the combined dataset
print(f" Combined dataset saved. Total rows: {combined_df.shape[0]}")
print(f" Final columns: {combined_df.columns.tolist()}")

 Combined dataset saved. Total rows: 30000
 Final columns: ['job_id', 'job_title', 'salary_usd', 'salary_currency', 'salary_local', 'experience_level', 'employment_type', 'company_location', 'company_size', 'employee_residence', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name']
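
The printed row count and column list can be backed up with a few programmatic checks against the expected output described earlier. The helper below is a minimal sketch, not part of the original notebook: the name `summarize_combined` is hypothetical, and it is demonstrated on toy frames rather than the real files.

```python
import pandas as pd


def summarize_combined(part1, part2, combined, extra_col="salary_local"):
    """Return simple consistency checks for a combined frame (hypothetical helper)."""
    # Rows from part1 contribute a missing value wherever extra_col is absent or NaN.
    if extra_col in part1.columns:
        missing_p1 = int(part1[extra_col].isna().sum())
    else:
        missing_p1 = len(part1)
    missing_p2 = int(part2[extra_col].isna().sum())
    return {
        # All rows from both parts are present.
        "rows_match": len(combined) == len(part1) + len(part2),
        # The combined frame uses the second file's column layout.
        "columns_match": list(combined.columns) == list(part2.columns),
        # Missing values in extra_col are fully explained by the inputs.
        "missing_match": int(combined[extra_col].isna().sum()) == missing_p1 + missing_p2,
        # Reported rather than asserted: unique IDs across files is an assumption.
        "duplicate_job_ids": int(combined["job_id"].duplicated().sum()),
    }


# Toy usage; values are illustrative only.
part1 = pd.DataFrame({"job_id": ["A1", "A2"], "salary_usd": [100, 200]})
part2 = pd.DataFrame({"job_id": ["B1"], "salary_usd": [300], "salary_local": [280.0]})
combined = pd.concat(
    [part1.reindex(columns=part2.columns), part2], ignore_index=True
)
checks = summarize_combined(part1, part2, combined)
print(checks)
```

In the notebook itself, the same call would be `summarize_combined(df1, df2, combined_df)`.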


# ETL Step 2: Handling Missing Values

## Objective
Identify and handle missing values in the combined dataset (**ai_job_dataset_combined.csv**) to ensure clean and reliable data for analysis.

## What We'll Do
1. **Load** the combined dataset.
2. **Check** for missing values using `isnull()` and `sum()`.
3. **Categorize missing values**:
   - **Critical columns** (e.g., `job_id`, `job_title`, `salary_usd`) – cannot have missing data.
   - **Optional columns** (e.g., `salary_local`, `benefits_score`) – can be missing and filled logically.
4. **Decide on handling strategy**:
   - **Drop rows** if critical data is missing.
   - **Fill NaNs** for non-critical data (e.g., replace missing `salary_local` with `Not Provided` or median).
5. **Save** the cleaned dataset as `ai_jobs_cleaned.csv`.

## Expected Output
- Clean dataset with **no missing critical values** and logical handling of optional ones.


In [8]:
# Load the combined dataset
df = pd.read_csv('../data/inputs/raw/ai_job_dataset_combined.csv')

In [9]:
df.isnull().sum()  # Check for missing values

job_id                        0
job_title                     0
salary_usd                    0
salary_currency               0
salary_local              15000
experience_level              0
employment_type               0
company_location              0
company_size                  0
employee_residence            0
remote_ratio                  0
required_skills               0
education_required            0
years_experience              0
industry                      0
posting_date                  0
application_deadline          0
job_description_length        0
benefits_score                0
company_name                  0
dtype: int64
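
To connect these counts to the critical/optional split described in the plan, the columns with gaps can be listed together with their share of rows and whether they are critical. This is a minimal sketch under stated assumptions: `missing_report` is a hypothetical helper name, the critical list is the one given in this notebook, and the demo frame is made up.

```python
import pandas as pd

# Critical columns as named in this notebook.
CRITICAL_COLS = ["job_id", "job_title", "salary_usd"]


def missing_report(df, critical_cols=CRITICAL_COLS):
    """List columns that have missing values, largest gap first (hypothetical helper)."""
    counts = df.isnull().sum()
    report = pd.DataFrame(
        {
            "missing": counts,
            "pct": (counts / len(df) * 100).round(2),
            "critical": counts.index.isin(critical_cols),
        }
    )
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy usage; values are illustrative only.
toy = pd.DataFrame(
    {
        "job_id": ["A1", "A2", None],
        "job_title": ["x", "y", "z"],
        "salary_usd": [1, 2, 3],
        "salary_local": [None, 5.0, None],
    }
)
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)` would be expected to list only `salary_local`, flagged as non-critical.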

In [10]:
# Drop rows with missing critical values
critical_cols = ["job_id", "job_title", "salary_usd"]
df = df.dropna(subset=critical_cols)

In [13]:
# Fill missing values for optional columns
df['salary_local'] = df['salary_local'].fillna('Not Provided')
df['benefits_score'] = df['benefits_score'].fillna(df['benefits_score'].median())
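
One consequence of the fill above is worth keeping in mind: putting the string `'Not Provided'` into a numeric column turns it into an `object` column, so numeric operations on `salary_local` are no longer reliable afterwards. If later steps need the numbers, one possible alternative is to leave the `NaN` values in place and record missingness in a separate flag. The sketch below uses made-up values, and the column name `salary_local_missing` is hypothetical, not part of the dataset.

```python
import pandas as pd

# Toy column; values are illustrative only.
toy = pd.DataFrame({"salary_local": [280.0, None, 310.0]})

# Approach used in this notebook: a text placeholder.
# The result mixes floats and strings, so its dtype is object.
as_text = toy["salary_local"].fillna("Not Provided")
print(as_text.dtype)

# Alternative sketch: keep the numbers and flag the gaps instead.
flagged = toy.copy()
flagged["salary_local_missing"] = flagged["salary_local"].isna()  # hypothetical column
print(flagged["salary_local"].dtype)
print(flagged["salary_local"].median())  # NaN values are skipped by default
```

Which option fits better depends on whether `salary_local` is used for display only or for calculations later in the pipeline.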

# Handling Missing Values – Final Summary

**Findings**
- Only one column had missing values: `salary_local` (15,000 rows).
- All other 19 columns had **0 missing values**.

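**Action Taken**
- Filled the missing `salary_local` values with `"Not Provided"`.
- The median fill for `benefits_score` changed nothing, since that column had no missing values.
- No rows were dropped, since the critical columns (`job_id`, `job_title`, `salary_usd`) had no missing data.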

**Result**
- Total rows: **30,000**
- Total columns: **20**
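
The plan for this step lists saving the cleaned dataset as `ai_jobs_cleaned.csv`, but no cell above writes the file. A minimal sketch of that last step follows. The helper name `save_cleaned` is hypothetical, and the demo writes a toy frame to a temporary directory because the intended output folder is not stated in the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(df, out_path):
    """Write df to out_path as CSV, creating parent folders if needed (hypothetical helper)."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path


# Toy usage in a temporary directory; values are illustrative only.
toy = pd.DataFrame(
    {"job_id": ["A1", "A2"], "salary_local": ["Not Provided", "280.0"]}
)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_cleaned(toy, Path(tmp) / "ai_jobs_cleaned.csv")
    reloaded = pd.read_csv(saved)

# The placeholder text survives the round trip and is not read back as NaN.
print(reloaded.shape)
print(reloaded["salary_local"].isna().sum())
```

In the notebook, the call would be `save_cleaned(df, <output folder> / "ai_jobs_cleaned.csv")`, with the folder chosen to match the project layout.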
