# ETL Step 1: Combining CSV Files

## Objective
Merge the two raw CSV files (**ai_job_dataset.csv** and **ai_job_dataset1.csv**) into a single consolidated dataset, handling the difference in their columns.

## Datasets
- **ai_job_dataset.csv** - Part 1 (19 columns, no `salary_local`)
- **ai_job_dataset1.csv** - Part 2 (20 columns, includes `salary_local`)

## Column Differences
- The second file includes an extra column: `salary_local`.
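
A quick way to confirm this difference before merging is to compare the column names of the two frames. The sketch below uses two tiny in-memory frames as stand-ins for the real files (the sample values are invented), and `column_diff` is a hypothetical helper introduced here for illustration, not something the notebook defines.

```python
import pandas as pd

def column_diff(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    return {
        "only_in_left": [c for c in left.columns if c not in right.columns],
        "only_in_right": [c for c in right.columns if c not in left.columns],
    }

# Toy stand-ins for Part 1 and Part 2 (values are illustrative only)
part1 = pd.DataFrame({"job_id": ["A1"], "salary_usd": [90000]})
part2 = pd.DataFrame(
    {"job_id": ["B1"], "salary_usd": [85000], "salary_local": [78000.0]}
)

diff = column_diff(part1, part2)
print(diff)
```

On the real files, the same call on `df1` and `df2` should report `salary_local` as the only column missing from the first file, if the description above holds.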

## Process
1. Load both CSV files with **pandas**.
2. Align the columns: add any column the first dataset lacks (here, `salary_local`) filled with `NaN`, then reorder the first dataset's columns to match the second.
3. Concatenate the datasets into one.
4. Save the combined dataset as `ai_job_dataset_combined.csv`.
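
The steps above can be sketched end to end on small in-memory frames. This is a minimal illustration rather than the notebook's actual run: the sample rows are invented, and `DataFrame.reindex` is used here as a compact alternative to adding the missing columns one at a time.

```python
import pandas as pd

# Step 1 (stand-in): toy frames instead of reading the real CSV files
part1 = pd.DataFrame({"job_id": ["A1", "A2"], "salary_usd": [90000, 120000]})
part2 = pd.DataFrame(
    {"job_id": ["B1"], "salary_usd": [85000], "salary_local": [78000.0]}
)

# Step 2: give part1 the same columns, in the same order, as part2.
# Columns that part1 lacks are created and filled with NaN.
aligned1 = part1.reindex(columns=part2.columns)

# Step 3: stack the rows and renumber the index from 0
combined = pd.concat([aligned1, part2], ignore_index=True)

# Step 4 (stand-in): with no path argument, to_csv returns the CSV text
csv_text = combined.to_csv(index=False)

print(combined.shape)
print(combined["salary_local"].isna().tolist())
```

Because `reindex` both adds the missing column and fixes the column order, it covers the alignment step in a single call. Passing a file path to `to_csv` would write the result to disk instead of returning it as a string.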

## Expected Output
- A single CSV file with **all rows from both parts** and **consistent columns**.
- Rows from the first file have no `salary_local` value; these are left as `NaN` to be cleaned in a later ETL step.
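
These expectations can be checked programmatically once the frames are combined. The helper below, `check_combined`, is a hypothetical name introduced for illustration, and the toy frames stand in for the real data. It assumes the first file's rows come first in the combined frame, which holds when the first file is listed first in the `pd.concat` call.

```python
import pandas as pd

def check_combined(part1, part2, combined, new_col="salary_local"):
    """Raise AssertionError if the combined frame violates the expected output."""
    # All rows from both parts are present
    assert len(combined) == len(part1) + len(part2)
    # The columns match those of the wider (second) file, in the same order
    assert list(combined.columns) == list(part2.columns)
    # Rows that came from part 1 (the first len(part1) rows) have no value
    # in the column that only part 2 provides
    assert combined[new_col].iloc[: len(part1)].isna().all()
    return True

# Toy stand-ins for the two files (values are illustrative only)
part1 = pd.DataFrame({"job_id": ["A1", "A2"], "salary_usd": [90000, 120000]})
part2 = pd.DataFrame(
    {"job_id": ["B1"], "salary_usd": [85000], "salary_local": [78000.0]}
)
combined = pd.concat(
    [part1.reindex(columns=part2.columns), part2], ignore_index=True
)

ok = check_combined(part1, part2, combined)
print(ok)
```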


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [7]:
# Load the two raw datasets
df1 = pd.read_csv("../data/inputs/raw/ai_job_dataset.csv")
df2 = pd.read_csv("../data/inputs/raw/ai_job_dataset1.csv")

# Align the columns of the two dataframes
for col in df2.columns:
    if col not in df1.columns:
        df1[col] = np.nan  # Fill missing columns with NaN (keeps a float dtype)

# Ensure both datasets have same column order
df1 = df1[df2.columns]

# Concatenate the two datasets
combined_df = pd.concat([df1, df2], ignore_index=True)

# Save the combined dataset to a new CSV file
combined_df.to_csv("../data/inputs/raw/ai_job_dataset_combined.csv", index=False)

# Report the total row count and the final column list
print(f" Combined dataset saved. Total rows: {combined_df.shape[0]}")
print(f" Final columns: {combined_df.columns.tolist()}")

 Combined dataset saved. Total rows: 30000
 Final columns: ['job_id', 'job_title', 'salary_usd', 'salary_currency', 'salary_local', 'experience_level', 'employment_type', 'company_location', 'company_size', 'employee_residence', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name']
