## 1) Imports & paths
##### 💡 Concept
Before manipulating data, we set up the environment: where the raw data lives and where cleaned outputs will go. Think of this as laying out your lab tools before starting an experiment. If every cell builds its own paths, the notebook stops being reproducible on another machine.

In [1]:
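import json
import re
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display  # explicit, so the cells also run outside Jupyter

RNG = 42  # random seed, kept fixed for reproducibility

# Use the parent folder when the notebook runs from notebooks/, else the current folder.
ROOT = Path.cwd().parents[0] if (Path.cwd().name == "notebooks") else Path.cwd()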
DATA_RAW = ROOT / "data" / "raw"
DATA_PROC = ROOT / "data" / "processed"
REPORTS = ROOT / "reports"
FIGS = ROOT / "outputs" / "figures"

DATA_PROC.mkdir(parents=True, exist_ok=True)
REPORTS.mkdir(parents=True, exist_ok=True)
FIGS.mkdir(parents=True, exist_ok=True)
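
The `ROOT` rule above only works when the notebook is launched from a folder named `notebooks` or from the project root itself. A minimal sketch of a more forgiving alternative is to walk upward until a marker directory appears. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration; they are not part of the original setup.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the closest ancestor of `start` (or `start` itself) that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demo on a throwaway project layout (hypothetical folders, created only for this check).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    (project / "data" / "raw").mkdir(parents=True)
    (project / "notebooks").mkdir()
    found = find_project_root(project / "notebooks")
    root_matches = found == project
```

In the notebook this would be called as `ROOT = find_project_root(Path.cwd())`, which fails loudly instead of silently pointing at the wrong folder.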

## 2) Load raw datasets

##### 💡 Concept
At its core, “loading” isn’t just reading CSVs — it’s about validating the data contract.
Each dataset has a schema (columns, types, units). Before merging, you must check whether those schemas align or conflict.

In [None]:
df1 = pd.read_csv(DATA_RAW / "dataset1_data_science_job.csv")
df2 = pd.read_csv(DATA_RAW / "dataset2_all_job_post.csv")
df3 = pd.read_csv(DATA_RAW / "dataset3_ai_job_dataset.csv")

for name, df in {"df1": df1, "df2": df2, "df3": df3}.items():
    print(name, df.shape)
    display(df.head(2))
    df.info()  # info() prints its report and returns None, so it needs no display()
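
The "data contract" idea can be made concrete by putting the schemas side by side before any merge. The sketch below uses two toy frames rather than the real CSVs, and the helper name `compare_schemas` is hypothetical. It lists every column, the dtype each dataset gives it, and whether those dtypes disagree.

```python
import pandas as pd


def compare_schemas(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Build a column-by-dataset table of dtypes; a column absent from a dataset shows NaN."""
    names = list(frames)
    table = pd.DataFrame({name: df.dtypes.astype(str) for name, df in frames.items()})
    # How many datasets contain the column at all.
    table["n_present"] = table[names].notna().sum(axis=1)
    # True when the datasets that do contain the column disagree on its dtype.
    table["dtype_conflict"] = table[names].nunique(axis=1) > 1
    return table.sort_index()


# Toy inputs (invented values, for illustration only).
a = pd.DataFrame({"job_title": ["Data Scientist"], "salary_usd": [120000]})
b = pd.DataFrame({"title": ["ML Engineer"], "salary_usd": ["95,000"]})

report = compare_schemas({"a": a, "b": b})
print(report)
```

Here `salary_usd` is flagged because one source stores it as a number and the other as text, while `job_title` and `title` each appear in only one dataset, which hints that they may need a shared name.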

## 3) Profiling snapshot (lightweight)

##### 💡 Concept
Profiling is the diagnostic stage of cleaning — like running blood tests before prescribing medicine.
It tells you what’s wrong: missing values, strange datatypes, duplicates, etc.
Without this, cleaning becomes random guessing.

In [None]:
def profile(df: pd.DataFrame, name: str) -> dict:
    """Return basic profile stats for df.

    Time: O(n * c). Space: O(c)."""
    return {
        "name": name,
        "rows": len(df),
        "cols": df.shape[1],
        # Plain ints keep the result JSON-serializable.
        "na_counts": {col: int(n) for col, n in df.isna().sum().items()},
        "dup_rows": int(df.duplicated().sum()),
        "numeric_cols": df.select_dtypes(include="number").columns.tolist(),
        "object_cols": df.select_dtypes(include="object").columns.tolist(),
    }

profiles = {k: profile(v, k) for k, v in {"df1": df1, "df2": df2, "df3": df3}.items()}
print(json.dumps(profiles, indent=2)[:2000], "...")
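
The JSON dump is handy for logging, but a small table is often easier to scan when deciding which columns need attention. As a sketch (the helper name `na_summary` and the toy data are assumptions, not part of the notebook), missing counts and shares can be ranked per column:

```python
import pandas as pd


def na_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, and missing share, with the worst columns first."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "na_count": df.isna().sum(),
        "na_share": df.isna().mean().round(3),
    })
    return summary.sort_values("na_share", ascending=False)


# Toy frame with invented values.
toy = pd.DataFrame({
    "job_title": ["Analyst", None, "Engineer", "Analyst"],
    "salary_usd": [50000.0, None, None, 50000.0],
})

summary = na_summary(toy)
print(summary)
```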


## 4) Schema harmonization

##### 💡 Concept
Datasets from different sources often call the same thing by different names — e.g., job_title vs title.
Before merging, we need a shared vocabulary.
This is the lingua franca of your data — making sure everyone (and every dataset) “speaks the same language”.

In [6]:
COLMAP = {
    # job title and category
    "title": "job_title",
    "jobTitle": "job_title",
    "category": "job_category",
    # skills
    "skills": "required_skills",
    "job_skill_set": "required_skills",
    # experience
    "experience": "experience_level",
    "exp_level": "experience_level",
    # salary (identity entries document names that are already canonical)
    "salary_in_usd": "salary_usd",
    "salary_usd": "salary_usd",
    "salaryLocal": "salary_local",  # defensive: may not appear in any dataset
    "salary": "salary",
    # location and dates
    "location": "company_location",
    "posted_at": "posting_date",
}

DATE_COLS = ["posting_date", "application_deadline"]
NUM_COLS  = ["salary_usd", "salary", "salary_local",
             "remote_ratio", "years_experience",
             "benefits_score", "job_description_length"]

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
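    """Standardize column names and dtypes.

    Time: O(n * c) for parsing. Space: O(n * c), because a renamed copy is returned.
    Values that cannot be parsed become NaT/NaN (errors="coerce") instead of raising."""
    out = df.rename(columns={k: v for k, v in COLMAP.items() if k in df.columns}).copy()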
    for d in DATE_COLS:
        if d in out.columns:
            out[d] = pd.to_datetime(out[d], errors="coerce")
    for n in NUM_COLS:
        if n in out.columns:
            out[n] = pd.to_numeric(out[n], errors="coerce")
    return out

df1h, df2h, df3h = map(harmonize, (df1, df2, df3))
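
One caveat with a many-to-one mapping: if a dataset contains two source names that map to the same target (say both `title` and `jobTitle`), the rename produces duplicate column labels. A small pre-check can surface that before it causes confusing errors downstream. This is only a sketch; `rename_collisions` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def rename_collisions(columns, colmap: dict[str, str]) -> dict[str, list[str]]:
    """Return {target: [source columns]} for targets that more than one column would land on."""
    targets: dict[str, list[str]] = {}
    for col in columns:
        # Columns without a mapping keep their own name.
        targets.setdefault(colmap.get(col, col), []).append(col)
    return {target: srcs for target, srcs in targets.items() if len(srcs) > 1}


# Toy frame where two source columns share one target name.
toy = pd.DataFrame({"title": ["Analyst"], "jobTitle": ["Data Analyst"], "salary": [1]})
toy_map = {"title": "job_title", "jobTitle": "job_title"}

clashes = rename_collisions(toy.columns, toy_map)
print(clashes)
```

An empty result means the mapping is safe for that dataset; a non-empty one means the clashing columns should be merged or one of them dropped before calling `harmonize`.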


## 5) Missing values & “Unknown” categories

##### 💡 Concept
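Missing data is information: it tells you where the system failed to observe.  
We never fill it at random; we reason about why it is missing.  
Here, imputation is an informed guess: text columns get an explicit “Unknown” category and numeric columns get the column median.  
Duplicates distort the picture too: one job posted twice looks like double demand, so exact duplicate rows are dropped.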


In [7]:
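def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill text NaNs with "Unknown" and numeric NaNs with the column median, then drop duplicates."""
    # Drop exact duplicates first so repeated postings do not skew the medians.
    out = df.drop_duplicates().copy()
    for c in out.select_dtypes("object"):
        out[c] = out[c].fillna("Unknown")
    for c in out.select_dtypes("number"):
        out[c] = out[c].fillna(out[c].median())
    # Filling can make previously distinct rows identical, so deduplicate again.
    return out.drop_duplicates()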

df1c, df2c, df3c = map(fill_missing, (df1h, df2h, df3h))
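
Because missingness itself carries information, one common variation is to record which values were imputed before overwriting them. The sketch below is not the notebook's method; `fill_with_flags` and the `_was_missing` suffix are assumptions for illustration, and the data is a toy frame.

```python
import pandas as pd


def fill_with_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute numeric columns and add a boolean flag for each column that had gaps."""
    out = df.copy()
    # The column list is fixed before the loop, so the new flag columns are not revisited.
    for col in out.select_dtypes("number").columns:
        missing = out[col].isna()
        if missing.any():
            out[f"{col}_was_missing"] = missing
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy frame with one gap in salary_usd and none in remote_ratio.
toy = pd.DataFrame({"salary_usd": [100.0, None, 300.0], "remote_ratio": [0, 50, 100]})

flagged = fill_with_flags(toy)
print(flagged)
```

With the flag in place, later analysis can compare imputed and observed rows, or exclude the imputed ones, instead of treating the median as a real observation.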
