# AI Job Market - Data Cleaning & Enrichment Documentation

**Purpose**: This notebook documents the complete data cleaning and enrichment pipeline for the AI Job Market dataset.

**Dataset Source**: 
- Kaggle: AI Job Market Dataset
- Local: `data/raw/ai_job_market.csv`

**Workflow**:
1. Data Loading (Kaggle + Local)
2. Data Exploration
3. Data Cleaning (duplicates, missing values, validation)
4. Data Enrichment (salary, skills, tools, location, experience, date)
5. Save Cleaned & Enriched Data

**Output**:
- Cleaned dataset: `data/cleaned/ai_job_market_cleaned.csv`
- Enriched datasets: `data/enriched/` (one CSV per category)
- Frequency dictionaries: `data/dictionary/` (skills, tools, locations)

---

## 1. Setup and Imports

Import all necessary libraries and configure the environment for data loading, cleaning, and enrichment.

In [1]:
# Standard library imports
import sys
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add project root to path for imports
project_root = Path.cwd().parent
sys.path.append(str(project_root / 'src'))

# Data manipulation
import pandas as pd
import numpy as np

# Project-specific imports
from utils.config_loader import get_config_loader
from utils.file_handler import FileHandler
from utils.data_cleaner import DataCleaner
from utils.data_validator import DataValidator
from utils.logger import get_logger
from utils.enrichers import (
    SalaryEnricher, SkillsEnricher, ToolsEnricher,
    ExperienceEnricher, LocationEnricher, DateEnricher,
    AdditionalFeaturesEnricher
)

# Initialize
config = get_config_loader()
file_handler = FileHandler()
logger = get_logger(__name__)

print("✓ All imports successful!")
print(f"✓ Project root: {project_root}")
print(f"✓ Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✓ All imports successful!
✓ Project root: c:\Users\Admin\project\Data Analysis\ai_job_market
✓ Current time: 2025-12-19 12:53:13


## 2. Data Loading

Load the dataset from one of two sources:
- **Kaggle API**: Download directly from Kaggle (requires configured credentials)
- **Local Storage**: Read `data/raw/ai_job_market.csv`

### 2.1 Configure Kaggle API (Optional)

In [2]:
def load_from_kaggle(dataset_name: str, download_path: str = 'data/raw') -> bool:
    """Download and unzip a Kaggle dataset; return True on success, False otherwise."""
    try:
        # Importing the submodule also imports the kaggle package itself
        from kaggle.api.kaggle_api_extended import KaggleApi

        # Authenticate with the locally configured Kaggle credentials
        api = KaggleApi()
        api.authenticate()

        # Download dataset
        print(f"Downloading dataset: {dataset_name}")
        api.dataset_download_files(dataset_name, path=download_path, unzip=True)
        print(f"✓ Dataset downloaded to: {download_path}")
        return True

    except ImportError:
        print("⚠ Kaggle API not installed. Install with: pip install kaggle")
        return False
    except Exception as e:
        print(f"⚠ Error downloading from Kaggle: {e}")
        print("Falling back to local data...")
        return False

# Uncomment and modify to download from Kaggle
# KAGGLE_DATASET = 'your-username/ai-job-market'
# load_from_kaggle(KAGGLE_DATASET)
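
`KaggleApi.authenticate()` only succeeds when credentials are already in place. As a rough pre-flight check, the sketch below looks in the two places the Kaggle client conventionally reads them from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, and a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical; it is not part of this project or of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_available() -> bool:
    """Return True if Kaggle credentials appear to be configured (hypothetical helper)."""
    # Option 1: credentials supplied through environment variables.
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the Kaggle config directory.
    config_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


if __name__ == "__main__":
    print(f"Kaggle credentials found: {kaggle_credentials_available()}")
```

A check like this could guard the call to `load_from_kaggle` so that the notebook falls back to the local file without attempting a download.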

### 2.2 Load from Local Storage

In [4]:
# Load configuration
# Fix config path for notebook context
config.config_dir = project_root / 'config'

paths_config = config.load('paths')
raw_data_path = paths_config['paths']['raw_data_file']
loading_config = paths_config['data_processing']['loading']

print(f"Loading data from: {raw_data_path}")
print(f"Encoding: {loading_config['encoding']}")
print(f"Delimiter: {loading_config['delimiter']}")
print("-" * 60)

# Load raw data with absolute path
raw_data_path_abs = project_root / raw_data_path
df_raw = file_handler.load_csv(
    str(raw_data_path_abs),
    encoding=loading_config['encoding'],
    delimiter=loading_config['delimiter']
)

print("\n✓ Data loaded successfully!")
print(f"  Shape: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
print(f"  Columns: {list(df_raw.columns)}")
print(f"  Memory usage: {df_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

2025-12-19 12:59:03,997 - utils.file_handler - INFO - Loaded CSV: c:\Users\Admin\project\Data Analysis\ai_job_market\data\raw\ai_job_market.csv with shape (2000, 12)


Loading data from: data/raw/ai_job_market.csv
Encoding: utf-8
Delimiter: ,
------------------------------------------------------------

✓ Data loaded successfully!
  Shape: 2,000 rows × 12 columns
  Columns: ['job_id', 'company_name', 'industry', 'job_title', 'skills_required', 'experience_level', 'employment_type', 'location', 'salary_range_usd', 'posted_date', 'company_size', 'tools_preferred']
  Memory usage: 1.33 MB
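
`FileHandler.load_csv` is a project-specific helper whose source is not shown in this notebook. A minimal sketch of what such a loader might do, assuming it simply forwards the configured encoding and delimiter to `pandas.read_csv`, is given below. The name `load_csv_sketch` and the in-memory sample are illustrative only.

```python
import io

import pandas as pd


def load_csv_sketch(source, encoding: str = "utf-8", delimiter: str = ",") -> pd.DataFrame:
    """Read a CSV with an explicit encoding and delimiter (assumed behavior, not the project's code)."""
    df = pd.read_csv(source, encoding=encoding, sep=delimiter)
    print(f"Loaded CSV with shape {df.shape}")
    return df


# A small in-memory sample stands in for data/raw/ai_job_market.csv.
sample = io.StringIO(
    "job_id,company_name,salary_range_usd\n"
    "1,Foster and Sons,92860-109598\n"
    "3,King Inc,124496-217204\n"
)
df_demo = load_csv_sketch(sample)
```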


## 3. Data Exploration

Initial exploration to understand the dataset structure, quality, and issues before cleaning.

In [5]:
# Display first and last rows
print("First 5 rows:")
display(df_raw.head())

print("\nLast 5 rows:")
display(df_raw.tail())

# Dataset info
print("\n" + "="*60)
print("DATASET INFORMATION")
print("="*60)
df_raw.info()

First 5 rows:


Unnamed: 0,job_id,company_name,industry,job_title,skills_required,experience_level,employment_type,location,salary_range_usd,posted_date,company_size,tools_preferred
0,1,Foster and Sons,Healthcare,Data Analyst,"NumPy, Reinforcement Learning, PyTorch, Scikit...",Mid,Full-time,"Tracybury, AR",92860-109598,2025-08-20,Large,"KDB+, LangChain"
1,2,"Boyd, Myers and Ramirez",Tech,Computer Vision Engineer,"Scikit-learn, CUDA, SQL, Pandas",Senior,Full-time,"Lake Scott, CU",78523-144875,2024-03-22,Large,"FastAPI, KDB+, TensorFlow"
2,3,King Inc,Tech,Quant Researcher,"MLflow, FastAPI, Azure, PyTorch, SQL, GCP",Entry,Full-time,"East Paige, CM",124496-217204,2025-09-18,Large,"BigQuery, PyTorch, Scikit-learn"
3,4,"Cooper, Archer and Lynch",Tech,AI Product Manager,"Scikit-learn, C++, Pandas, LangChain, AWS, R",Mid,Full-time,"Perezview, FI",50908-123743,2024-05-08,Large,"TensorFlow, BigQuery, MLflow"
4,5,Hall LLC,Finance,Data Scientist,"Excel, Keras, SQL, Hugging Face",Senior,Contract,"North Desireeland, NE",98694-135413,2025-02-24,Large,"PyTorch, LangChain"



Last 5 rows:


Unnamed: 0,job_id,company_name,industry,job_title,skills_required,experience_level,employment_type,location,salary_range_usd,posted_date,company_size,tools_preferred
1995,1996,"Mueller, Ellis and Clark",Finance,NLP Engineer,"Flask, FastAPI, Power BI",Senior,Internship,"Washingtonmouth, SD",90382-110126,2024-04-22,Large,MLflow
1996,1997,Roberts-Yu,Automotive,AI Product Manager,"R, Flask, Excel, C++, CUDA, Scikit-learn",Mid,Remote,"Joshuafort, ZA",47848-137195,2023-12-02,Large,"KDB+, LangChain, MLflow"
1997,1998,"Brooks, Williams and Randolph",Education,Data Analyst,"Hugging Face, Excel, Scikit-learn, R, MLflow",Entry,Contract,"West Brittanyburgh, CG",134994-180108,2023-10-29,Large,PyTorch
1998,1999,Castaneda-Smith,Education,Quant Researcher,"AWS, Python, Scikit-learn",Senior,Contract,"Anthonyshire, OM",62388-82539,2024-08-10,Large,"MLflow, TensorFlow, FastAPI"
1999,2000,Estes Group,Finance,Quant Researcher,"Flask, TensorFlow, Power BI",Senior,Full-time,"Benjaminview, NE",55835-97374,2025-02-20,Startup,MLflow



DATASET INFORMATION
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   job_id            2000 non-null   int64 
 1   company_name      2000 non-null   object
 2   industry          2000 non-null   object
 3   job_title         2000 non-null   object
 4   skills_required   2000 non-null   object
 5   experience_level  2000 non-null   object
 6   employment_type   2000 non-null   object
 7   location          2000 non-null   object
 8   salary_range_usd  2000 non-null   object
 9   posted_date       2000 non-null   object
 10  company_size      2000 non-null   object
 11  tools_preferred   2000 non-null   object
dtypes: int64(1), object(11)
memory usage: 187.6+ KB


In [9]:
# Data quality check using DataValidator
validator = DataValidator(df_raw)

print("="*60)
print("DATA QUALITY ASSESSMENT")
print("="*60)

# Check missing values
missing_info = validator.check_missing_values()
print(f"\n1. MISSING VALUES: {missing_info['total_missing']} total")
if missing_info['total_missing'] > 0:
    missing_df = pd.DataFrame(missing_info['by_column']).T
    missing_df = missing_df[missing_df['count'] > 0].sort_values('count', ascending=False)
    display(missing_df)

# Check duplicates
duplicate_info = validator.check_duplicates()
print(f"\n2. DUPLICATES: {duplicate_info['count']} rows")
if duplicate_info['count'] > 0:
    print(f"   Percentage: {duplicate_info['percentage']:.2f}%")

# Get validation summary
summary = validator.get_summary()
print(f"\n3. SUMMARY:")
for key, value in summary.items():
    if key == 'shape':
        print(f"   Total Rows: {value[0]:,}")
        print(f"   Total Columns: {value[1]}")
    elif key == 'validation_results':
        print(f"   Missing Values: {value['missing_values']['total_missing']}")
        print(f"   Duplicates: {value['duplicates']['count']}")
    elif key == 'data_quality_score':
        print(f"   Data Quality Score: {value:.2f}%")

DATA QUALITY ASSESSMENT

1. MISSING VALUES: 0 total

2. DUPLICATES: 0 rows

3. SUMMARY:
   Total Rows: 2,000
   Total Columns: 12
   Missing Values: 0
   Duplicates: 0
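
The checks reported above can be approximated with plain pandas. The sketch below is not the project's `DataValidator` (its source is not shown here); `quality_report` is a hypothetical function that counts missing cells per column and fully duplicated rows.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows (illustrative stand-in for DataValidator)."""
    missing_by_column = df.isna().sum()
    duplicate_count = int(df.duplicated().sum())
    return {
        "total_missing": int(missing_by_column.sum()),
        # Keep only the columns that actually contain missing values.
        "missing_by_column": missing_by_column[missing_by_column > 0].to_dict(),
        "duplicate_count": duplicate_count,
        "duplicate_percentage": duplicate_count / len(df) * 100 if len(df) else 0.0,
    }


demo = pd.DataFrame(
    {"job_id": [1, 2, 2, 4], "industry": ["Tech", "Finance", "Finance", None]}
)
report = quality_report(demo)
print(report)
```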


## 4. Data Cleaning

Clean the dataset by:
1. **Removing duplicates**: Eliminate duplicate job postings
2. **Handling missing values**: Apply the configured strategy (`drop` in this run) to rows with missing data
3. **Validating results**: Confirm that no duplicates or missing values remain

The `DataCleaner` class uses a fluent interface for chainable operations.
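
A fluent interface means each cleaning method returns the cleaner itself, so calls can be chained. The sketch below illustrates the pattern with a hypothetical `MiniCleaner` class; it is an assumption about the general design, not the project's `DataCleaner` implementation.

```python
import pandas as pd


class MiniCleaner:
    """Tiny fluent cleaner: every operation records a report entry and returns self."""

    def __init__(self, df: pd.DataFrame):
        self._df = df.copy()
        self._operations = []

    def remove_duplicates(self, subset=None, keep="first"):
        before = len(self._df)
        self._df = self._df.drop_duplicates(subset=subset, keep=keep)
        self._operations.append(
            {"operation": "remove_duplicates", "rows_removed": before - len(self._df)}
        )
        return self  # returning self is what makes chaining possible

    def handle_missing_values(self, strategy="drop", fill_value=None):
        before = len(self._df)
        if strategy == "drop":
            self._df = self._df.dropna()
        elif strategy == "fill":
            self._df = self._df.fillna(fill_value)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        self._operations.append(
            {"operation": "handle_missing_values", "rows_removed": before - len(self._df)}
        )
        return self

    def get_cleaned_data(self) -> pd.DataFrame:
        return self._df.reset_index(drop=True)

    def get_report(self) -> list:
        return list(self._operations)


demo = pd.DataFrame({"job_id": [1, 1, 2, 3], "industry": ["Tech", "Tech", None, "Finance"]})
cleaner_demo = MiniCleaner(demo).remove_duplicates().handle_missing_values(strategy="drop")
cleaned_demo = cleaner_demo.get_cleaned_data()
report_demo = cleaner_demo.get_report()
```

The cell that follows calls the operations one at a time instead, which lets each step be switched on or off from the configuration.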

In [10]:
# Initialize DataCleaner with the raw dataset
cleaner = DataCleaner(df_raw)

print("="*60)
print("DATA CLEANING PROCESS")
print("="*60)
print(f"\nOriginal dataset shape: {df_raw.shape}")

# Step 1: Remove duplicates
cleaning_config = paths_config['data_processing']['cleaning']
if cleaning_config['remove_duplicates']:
    cleaner.remove_duplicates()
    print("✓ Step 1: Removed duplicates")

# Step 2: Handle missing values
if cleaning_config['handle_missing_values']:
    strategy = cleaning_config['missing_strategy']
    fill_value = cleaning_config.get('fill_value')
    cleaner.handle_missing_values(strategy=strategy, fill_value=fill_value)
    print(f"✓ Step 2: Handled missing values (strategy: {strategy})")

# Get cleaned data and report
df_cleaned = cleaner.get_cleaned_data()
cleaning_report = cleaner.get_report()

print(f"\nCleaned dataset shape: {df_cleaned.shape}")
print(f"Total rows removed: {cleaning_report['total_rows_removed']}")
print(f"Rows retained: {df_cleaned.shape[0] / df_raw.shape[0] * 100:.2f}%")

# Display cleaning operations
print("\n" + "="*60)
print("CLEANING OPERATIONS SUMMARY")
print("="*60)
for i, operation in enumerate(cleaning_report['operations'], 1):
    print(f"\n{i}. {operation['operation'].upper()}")
    for key, value in operation.items():
        if key != 'operation':
            print(f"   {key}: {value}")

2025-12-19 13:03:56,941 - utils.data_cleaner - INFO - Removed 0 duplicate rows
2025-12-19 13:03:56,977 - utils.data_cleaner - INFO - Handled missing values with strategy 'drop': 0 -> 0


DATA CLEANING PROCESS

Original dataset shape: (2000, 12)
✓ Step 1: Removed duplicates
✓ Step 2: Handled missing values (strategy: drop)

Cleaned dataset shape: (2000, 12)
Total rows removed: 0
Rows retained: 100.00%

CLEANING OPERATIONS SUMMARY

1. REMOVE_DUPLICATES
   rows_removed: 0
   subset: None
   keep: first

2. HANDLE_MISSING_VALUES
   strategy: drop
   missing_before: 0
   missing_after: 0
   rows_removed: 0
   columns: ['job_id', 'company_name', 'industry', 'job_title', 'skills_required', 'experience_level', 'employment_type', 'location', 'salary_range_usd', 'posted_date', 'company_size', 'tools_preferred']


In [11]:
# Validate cleaned data
validator_cleaned = DataValidator(df_cleaned)
missing_after = validator_cleaned.check_missing_values()
duplicates_after = validator_cleaned.check_duplicates()

print("="*60)
print("POST-CLEANING VALIDATION")
print("="*60)
print(f"\n✓ Missing values: {missing_after['total_missing']}")
print(f"✓ Duplicates: {duplicates_after['count']}")

# Display sample of cleaned data
print("\nFirst 3 rows of cleaned data:")
display(df_cleaned.head(3))

POST-CLEANING VALIDATION

✓ Missing values: 0
✓ Duplicates: 0

First 3 rows of cleaned data:


Unnamed: 0,job_id,company_name,industry,job_title,skills_required,experience_level,employment_type,location,salary_range_usd,posted_date,company_size,tools_preferred
0,1,Foster and Sons,Healthcare,Data Analyst,"NumPy, Reinforcement Learning, PyTorch, Scikit...",Mid,Full-time,"Tracybury, AR",92860-109598,2025-08-20,Large,"KDB+, LangChain"
1,2,"Boyd, Myers and Ramirez",Tech,Computer Vision Engineer,"Scikit-learn, CUDA, SQL, Pandas",Senior,Full-time,"Lake Scott, CU",78523-144875,2024-03-22,Large,"FastAPI, KDB+, TensorFlow"
2,3,King Inc,Tech,Quant Researcher,"MLflow, FastAPI, Azure, PyTorch, SQL, GCP",Entry,Full-time,"East Paige, CM",124496-217204,2025-09-18,Large,"BigQuery, PyTorch, Scikit-learn"


## 5. Data Enrichment

Enrich the cleaned dataset with additional features across multiple categories:

### Enrichment Categories:
1. **Salary**: Parse min/max/avg, create salary clusters
2. **Skills**: Extract top 20 skills, count skills, create binary features
3. **Tools**: Extract up to 15 top tools (only 8 distinct tools occur in this dataset), count tools, create binary features
4. **Experience**: Convert to ordinal encoding (Entry=1, Mid=2, Senior=3)
5. **Location**: Parse city/state, cluster locations, USA vs International
6. **Date**: Extract year/month/quarter, calculate job age
7. **Additional**: Employment type flags, company features

Each enricher is a separate class for modularity and reusability.

### 5.1 Salary Enrichment

Parse salary ranges and create salary clusters for analysis.

In [12]:
print("="*60)
print("ENRICHMENT 1: SALARY FEATURES")
print("="*60)

# Create a copy for enrichment
df_enriched = df_cleaned.copy()

# Enrich salary data
salary_enricher = SalaryEnricher(df_enriched)
df_enriched = salary_enricher.enrich()

print("\n✓ Parsed salary ranges:")
print("  - salary_min: minimum salary")
print("  - salary_max: maximum salary")
print("  - salary_avg: average salary")

print("\n✓ Created salary clusters:")
print(df_enriched['salary_cluster'].value_counts().sort_index())

print("\nSample salary data:")
display(df_enriched[['salary_range_usd', 'salary_min', 'salary_max', 'salary_avg', 'salary_cluster']].head(3))

2025-12-19 13:04:36,349 - utils.enrichers - INFO - Parsed salary ranges into min, max, and avg
2025-12-19 13:04:36,398 - utils.enrichers - INFO - Created salary clusters with 7 categories


ENRICHMENT 1: SALARY FEATURES

✓ Parsed salary ranges:
  - salary_min: minimum salary
  - salary_max: maximum salary
  - salary_avg: average salary

✓ Created salary clusters:
salary_cluster
<60K         58
60-80K      199
80-100K     322
100-120K    360
120-150K    540
150-200K    521
200K+         0
Name: count, dtype: int64

Sample salary data:


Unnamed: 0,salary_range_usd,salary_min,salary_max,salary_avg,salary_cluster
0,92860-109598,92860,109598,101229.0,100-120K
1,78523-144875,78523,144875,111699.0,100-120K
2,124496-217204,124496,217204,170850.0,150-200K
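
The sample rows above suggest that the cluster is assigned from `salary_avg` (for example, an average of 170,850 falls in `150-200K` even though the maximum exceeds 200K). The sketch below reproduces that behavior with `str.split` and `pd.cut`. The bin edges and the left-closed intervals are assumptions inferred from the cluster labels, and `add_salary_features` is a hypothetical name, not the `SalaryEnricher` implementation.

```python
import numpy as np
import pandas as pd

# Assumed bin edges, inferred from the cluster labels in the output above.
SALARY_BINS = [0, 60_000, 80_000, 100_000, 120_000, 150_000, 200_000, np.inf]
SALARY_LABELS = ["<60K", "60-80K", "80-100K", "100-120K", "120-150K", "150-200K", "200K+"]


def add_salary_features(df: pd.DataFrame, column: str = "salary_range_usd") -> pd.DataFrame:
    """Split 'min-max' strings into numeric columns and bin the average salary."""
    out = df.copy()
    bounds = out[column].str.split("-", n=1, expand=True).astype(int)
    out["salary_min"] = bounds[0]
    out["salary_max"] = bounds[1]
    out["salary_avg"] = (out["salary_min"] + out["salary_max"]) / 2
    # right=False makes each bin left-closed, e.g. [100000, 120000).
    out["salary_cluster"] = pd.cut(
        out["salary_avg"], bins=SALARY_BINS, labels=SALARY_LABELS, right=False
    )
    return out


demo = pd.DataFrame({"salary_range_usd": ["92860-109598", "78523-144875", "124496-217204"]})
salary_demo = add_salary_features(demo)
print(salary_demo)
```

Binning on the average would also explain why the `200K+` cluster is empty in this run: a posting lands there only if the midpoint of its range reaches 200,000.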


### 5.2 Skills Enrichment

Extract skills from the comma-separated `skills_required` column, count them, and create binary indicator features for the 20 most frequent skills.

In [13]:
print("="*60)
print("ENRICHMENT 2: SKILLS FEATURES")
print("="*60)

# Enrich skills data
skills_enricher = SkillsEnricher(df_enriched, top_n=20)
df_enriched, skill_counts = skills_enricher.enrich()

print("\n✓ Created binary features for top 20 skills")
print("✓ Created skill category flags:")
print(f"  - has_programming_lang: {df_enriched['has_programming_lang'].sum()} jobs")
print(f"  - has_cloud_platform: {df_enriched['has_cloud_platform'].sum()} jobs")
print(f"  - has_ml_framework: {df_enriched['has_ml_framework'].sum()} jobs")

print("\nTop 10 most demanded skills:")
print(skill_counts.head(10))

# Show skill columns
skill_columns = [col for col in df_enriched.columns if col.startswith('skill_')]
print(f"\n✓ Created {len(skill_columns)} skill binary features")
print(f"Sample: {skill_columns[:5]}")

2025-12-19 13:04:48,798 - utils.enrichers - INFO - Parsed 20 top skills as binary features
2025-12-19 13:04:48,819 - utils.enrichers - INFO - Created skill category flags


ENRICHMENT 2: SKILLS FEATURES

✓ Created binary features for top 20 skills
✓ Created skill category flags:
  - has_programming_lang: 1202 jobs
  - has_cloud_platform: 1005 jobs
  - has_ml_framework: 1247 jobs

Top 10 most demanded skills:
TensorFlow                452
Excel                     432
Pandas                    427
FastAPI                   419
NumPy                     416
Reinforcement Learning    414
Azure                     413
SQL                       408
Hugging Face              408
Keras                     406
Name: count, dtype: int64

✓ Created 20 skill binary features
Sample: ['skill_tensorflow', 'skill_excel', 'skill_pandas', 'skill_fastapi', 'skill_numpy']
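
Turning a comma-separated text column into binary indicator columns can be sketched as follows. This is a minimal illustration, not the `SkillsEnricher` code: the function name `add_top_item_flags` and the column-naming rule (lowercase, `+` replaced by `plus`, spaces replaced by `_`) are assumptions chosen to resemble names such as `skill_tensorflow` and `tool_kdbplus` seen in the outputs.

```python
import pandas as pd


def add_top_item_flags(df: pd.DataFrame, column: str, prefix: str, top_n: int):
    """Add a count column and one 0/1 column per top-N item of a comma-separated field."""
    out = df.copy()
    # Split "A, B, C" into ["A", "B", "C"], trimming stray whitespace.
    items = out[column].str.split(",").apply(lambda parts: [p.strip() for p in parts])
    counts = items.explode().value_counts()

    out[f"{prefix}s_count"] = items.str.len()
    # head(top_n) returns fewer rows when fewer distinct items exist.
    for name in counts.head(top_n).index:
        slug = name.lower().replace("+", "plus").replace(" ", "_")
        out[f"{prefix}_{slug}"] = items.apply(lambda parts, n=name: int(n in parts))
    return out, counts


skills = pd.DataFrame({"skills_required": ["SQL, Pandas", "SQL, Excel", "SQL"]})
skills_demo, skill_counts_demo = add_top_item_flags(skills, "skills_required", "skill", top_n=1)

tools = pd.DataFrame({"tools_preferred": ["KDB+, LangChain", "LangChain"]})
tools_demo, tool_counts_demo = add_top_item_flags(tools, "tools_preferred", "tool", top_n=15)
```

In the second call only two distinct tools exist, so only two indicator columns are created even though `top_n=15` was requested.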


### 5.3 Tools Enrichment

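Extract preferred tools from the `tools_preferred` column and create binary features for up to 15 of the most frequent tools. Only 8 distinct tools occur in this dataset, so 8 features are created.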

In [14]:
print("="*60)
print("ENRICHMENT 3: TOOLS FEATURES")
print("="*60)

# Enrich tools data
tools_enricher = ToolsEnricher(df_enriched, top_n=15)
df_enriched, tool_counts = tools_enricher.enrich()

print("\n✓ Created binary features for top 15 tools")
print("✓ Created tools_count feature")

print("\nTop 10 most preferred tools:")
print(tool_counts.head(10))

# Show tool columns
tool_columns = [col for col in df_enriched.columns if col.startswith('tool_')]
print(f"\n✓ Created {len(tool_columns)} tool binary features")
print(f"Sample: {tool_columns[:5]}")

2025-12-19 13:05:12,701 - utils.enrichers - INFO - Parsed 8 top tools as binary features


ENRICHMENT 3: TOOLS FEATURES

✓ Created binary features for top 15 tools
✓ Created tools_count feature

Top 10 most preferred tools:
MLflow          513
LangChain       511
FastAPI         505
KDB+            499
BigQuery        494
TensorFlow      487
PyTorch         475
Scikit-learn    474
Name: count, dtype: int64

✓ Created 8 tool binary features
Sample: ['tool_mlflow', 'tool_langchain', 'tool_fastapi', 'tool_kdbplus', 'tool_bigquery']


### 5.4 Experience Level Enrichment

Convert experience levels to ordinal encoding for analysis.

In [15]:
print("="*60)
print("ENRICHMENT 4: EXPERIENCE LEVEL")
print("="*60)

# Enrich experience data
experience_enricher = ExperienceEnricher(df_enriched)
df_enriched = experience_enricher.enrich()

print("\n✓ Converted experience level to ordinal encoding:")
print("  Entry Level = 1")
print("  Mid Level = 2")
print("  Senior Level = 3")

print("\nExperience level distribution:")
print(df_enriched['experience_level_ordinal'].value_counts().sort_index())

2025-12-19 13:05:19,903 - utils.enrichers - INFO - Created ordinal encoding for experience levels


ENRICHMENT 4: EXPERIENCE LEVEL

✓ Converted experience level to ordinal encoding:
  Entry Level = 1
  Mid Level = 2
  Senior Level = 3

Experience level distribution:
experience_level_ordinal
1    702
2    668
3    630
Name: count, dtype: int64
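
Ordinal encoding of this kind is a dictionary lookup. The sketch below maps the raw `experience_level` labels seen in the data (`Entry`, `Mid`, `Senior`) to 1, 2, and 3 with `Series.map`; the function name `add_experience_ordinal` is hypothetical, and the `ExperienceEnricher` may handle unknown labels differently.

```python
import pandas as pd

# Mapping taken from the encoding described above.
EXPERIENCE_ORDER = {"Entry": 1, "Mid": 2, "Senior": 3}


def add_experience_ordinal(df: pd.DataFrame, column: str = "experience_level") -> pd.DataFrame:
    """Add an integer rank for each experience level; unmapped labels become NaN."""
    out = df.copy()
    out["experience_level_ordinal"] = out[column].map(EXPERIENCE_ORDER)
    return out


demo = pd.DataFrame({"experience_level": ["Mid", "Senior", "Entry"]})
experience_demo = add_experience_ordinal(demo)
print(experience_demo)
```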


### 5.5 Location Enrichment

Parse location data and create geographic clusters.

In [17]:
print("="*60)
print("ENRICHMENT 5: LOCATION FEATURES")
print("="*60)

# Enrich location data
location_enricher = LocationEnricher(df_enriched)
df_enriched, state_counts = location_enricher.enrich()

print("\n✓ Parsed location into:")
print("  - location_city")
print("  - location_state")
print("  - location_region (e.g., Northeast, West)")
print("  - is_usa (1 for USA, 0 for International)")

print("\nTop 10 states by job count:")
print(state_counts.head(10))

# Check if is_usa column exists, if not derive it from location_region
if 'is_usa' in df_enriched.columns:
    print(f"\n✓ USA jobs: {df_enriched['is_usa'].sum()}")
    print(f"✓ International jobs: {(df_enriched['is_usa'] == 0).sum()}")
elif 'location_region' in df_enriched.columns:
    usa_count = (df_enriched['location_region'] == 'USA').sum()
    intl_count = (df_enriched['location_region'] == 'International').sum()
    print(f"\n✓ USA jobs: {usa_count}")
    print(f"✓ International jobs: {intl_count}")
else:
    print("\n⚠ Location region information not available")

ENRICHMENT 5: LOCATION FEATURES


2025-12-19 13:06:55,134 - utils.enrichers - INFO - Parsed location into city, state, cluster, and region



✓ Parsed location into:
  - location_city
  - location_state
  - location_region (e.g., Northeast, West)
  - is_usa (1 for USA, 0 for International)

Top 10 states by job count:
location_state
PG    19
BB    18
FJ    18
HR    18
BT    18
IQ    17
JO    17
UZ    16
JM    16
GQ    16
Name: count, dtype: int64

✓ USA jobs: 227
✓ International jobs: 1773


### 5.6 Date Enrichment

Parse posted dates and create temporal features including job aging.

In [18]:
print("="*60)
print("ENRICHMENT 6: DATE FEATURES")
print("="*60)

# Enrich date data (using reference date: 2025-12-09)
reference_date = datetime(2025, 12, 9)
date_enricher = DateEnricher(df_enriched, reference_date=reference_date)
df_enriched = date_enricher.enrich()

print(f"\n✓ Reference date: {reference_date.strftime('%Y-%m-%d')}")
print("\n✓ Created temporal features:")
print("  - posted_year")
print("  - posted_month")
print("  - posted_quarter")
print("  - posted_day_of_week")
print("  - days_since_posted")
print("  - job_age_category (Very Recent, Recent, Moderate, Old, Very Old)")

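print("\nJob age distribution:")
# Nothing is printed in this run because no 'job_age_category' column exists;
# the final summary counts the age feature under the name 'aging_feature'.
if 'job_age_category' in df_enriched.columns:
    print(df_enriched['job_age_category'].value_counts().sort_index())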

2025-12-19 13:07:01,420 - utils.enrichers - INFO - Extracted date features
2025-12-19 13:07:01,426 - utils.enrichers - INFO - Created aging feature categories
2025-12-19 13:07:01,438 - utils.enrichers - INFO - Created monthly date clusters


ENRICHMENT 6: DATE FEATURES

✓ Reference date: 2025-12-09

✓ Created temporal features:
  - posted_year
  - posted_month
  - posted_quarter
  - posted_day_of_week
  - days_since_posted
  - job_age_category (Very Recent, Recent, Moderate, Old, Very Old)

Job age distribution:
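
The temporal features listed above can be derived with the pandas `.dt` accessor and a fixed reference date. The sketch below is illustrative rather than the `DateEnricher` code: the function name `add_date_features` is hypothetical, and the age-category thresholds (30, 90, 180, 365 days) are assumptions, since the notebook names the categories but not their boundaries.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Assumed thresholds in days; only the category labels come from the notebook.
AGE_BINS = [-np.inf, 30, 90, 180, 365, np.inf]
AGE_LABELS = ["Very Recent", "Recent", "Moderate", "Old", "Very Old"]


def add_date_features(df: pd.DataFrame, reference_date, column: str = "posted_date") -> pd.DataFrame:
    """Derive calendar parts and posting age relative to a reference date."""
    out = df.copy()
    posted = pd.to_datetime(out[column])
    out["posted_year"] = posted.dt.year
    out["posted_month"] = posted.dt.month
    out["posted_quarter"] = posted.dt.quarter
    out["posted_day_of_week"] = posted.dt.day_name()
    out["days_since_posted"] = (pd.Timestamp(reference_date) - posted).dt.days
    # Default right-closed bins: e.g. 91 to 180 days maps to "Moderate".
    out["job_age_category"] = pd.cut(out["days_since_posted"], bins=AGE_BINS, labels=AGE_LABELS)
    return out


demo = pd.DataFrame({"posted_date": ["2025-08-20", "2024-03-22"]})
date_demo = add_date_features(demo, datetime(2025, 12, 9))
print(date_demo)
```

Using a fixed reference date rather than `datetime.now()` keeps `days_since_posted` reproducible across reruns.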


### 5.7 Additional Features

Create derived features for employment type and company characteristics.

In [19]:
print("="*60)
print("ENRICHMENT 7: ADDITIONAL FEATURES")
print("="*60)

# Enrich additional features
additional_enricher = AdditionalFeaturesEnricher(df_enriched)
df_enriched = additional_enricher.enrich()

print("\n✓ Created employment type flags:")
if 'is_remote' in df_enriched.columns:
    print(f"  - is_remote: {df_enriched['is_remote'].sum()} jobs")
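# Not printed in this run: no 'is_full_time' column exists; the final summary
# counts the full-time flag under the name 'is_fulltime'.
if 'is_full_time' in df_enriched.columns:
    print(f"  - is_full_time: {df_enriched['is_full_time'].sum()} jobs")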
if 'is_contract' in df_enriched.columns:
    print(f"  - is_contract: {df_enriched['is_contract'].sum()} jobs")

print("\n✓ Created company features:")
print("  - company_size_category (if available)")
print("  - industry_category (if available)")

print(f"\nFinal enriched dataset shape: {df_enriched.shape}")
print(f"Total features created: {df_enriched.shape[1] - df_cleaned.shape[1]} new columns")

2025-12-19 13:07:11,600 - utils.enrichers - INFO - Created additional derived features


ENRICHMENT 7: ADDITIONAL FEATURES

✓ Created employment type flags:
  - is_remote: 452 jobs
  - is_contract: 465 jobs

✓ Created company features:
  - company_size_category (if available)
  - industry_category (if available)

Final enriched dataset shape: (2000, 74)
Total features created: 62 new columns


## 6. Save Processed Data

Save both cleaned and enriched datasets for downstream analysis.

### 6.1 Save Cleaned Data

In [20]:
# Save cleaned data
cleaned_path = config.get_path('paths.cleaned_data_file')
file_handler.save_csv(df_cleaned, cleaned_path)
print(f"✓ Cleaned data saved to: {cleaned_path}")
print(f"  Shape: {df_cleaned.shape}")

2025-12-19 13:07:21,501 - utils.file_handler - INFO - Created directory: data\cleaned
2025-12-19 13:07:21,551 - utils.file_handler - INFO - Saved CSV: data\cleaned\ai_job_market_cleaned.csv with shape (2000, 12)


✓ Cleaned data saved to: data/cleaned/ai_job_market_cleaned.csv
  Shape: (2000, 12)


### 6.2 Save Enriched Data by Category

Organize enriched data into separate files by category for efficient access.

In [21]:
from utils.constant import COMMON_COLUMNS

# Define enriched data directory
enriched_dir = 'data/enriched'
file_handler.ensure_directory(enriched_dir)

# Define common columns for all categories
common_cols = ['job_id', 'job_title', 'company_name', 'location']

# Category configurations
category_configs = {
    'salary': common_cols + [
        'salary_range_usd', 'salary_min', 'salary_max', 'salary_avg', 'salary_cluster'
    ],
    'skills': common_cols + ['skills_required', 'skills_count',
                              'has_programming_lang', 'has_cloud_platform', 'has_ml_framework'] + 
              [col for col in df_enriched.columns if col.startswith('skill_')],
    'tools': common_cols + ['tools_preferred', 'tools_count'] +
             [col for col in df_enriched.columns if col.startswith('tool_')],
    'experience': common_cols + ['experience_level', 'experience_level_ordinal'],
    'location': common_cols + ['location_city', 'location_state', 'location_region', 'is_usa'],
    'date': common_cols + ['posted_date', 'posted_year', 'posted_month', 'posted_quarter',
                           'posted_day_of_week', 'days_since_posted', 'job_age_category'],
    'employment': common_cols + ['employment_type', 'is_remote', 'is_full_time', 'is_contract'],
    'company': common_cols + ['company_name', 'company_size', 'industry']
}

print("="*60)
print("SAVING ENRICHED DATA BY CATEGORY")
print("="*60)

# Save each category
for category, columns in category_configs.items():
    # Filter columns that actually exist in the dataframe
    available_cols = [col for col in columns if col in df_enriched.columns]
    
    if available_cols:
        df_category = df_enriched[available_cols]
        filepath = f"{enriched_dir}/{category}_enriched.csv"
        file_handler.save_csv(df_category, filepath)
        print(f"✓ {category}_enriched.csv ({len(available_cols)} columns)")

print(f"\n✓ All enriched data saved to: {enriched_dir}/")

2025-12-19 13:07:27,644 - utils.file_handler - INFO - Created directory: data\enriched
2025-12-19 13:07:27,711 - utils.file_handler - INFO - Saved CSV: data\enriched\salary_enriched.csv with shape (2000, 9)
2025-12-19 13:07:27,753 - utils.file_handler - INFO - Saved CSV: data\enriched\skills_enriched.csv with shape (2000, 29)
2025-12-19 13:07:27,781 - utils.file_handler - INFO - Saved CSV: data\enriched\tools_enriched.csv with shape (2000, 14)
2025-12-19 13:07:27,801 - utils.file_handler - INFO - Saved CSV: data\enriched\experience_enriched.csv with shape (2000, 6)
2025-12-19 13:07:27,822 - utils.file_handler - INFO - Saved CSV: data\enriched\location_enriched.csv with shape (2000, 7)
2025-12-19 13:07:27,846 - utils.file_handler - INFO - Saved CSV: data\enriched\date_enriched.csv with shape (2000, 10)
2025-12-19 13:07:27,866 - utils.file_handler - INFO - Saved CSV: data\enriched\employment_enriched.csv with shape (2000, 7)


SAVING ENRICHED DATA BY CATEGORY
✓ salary_enriched.csv (9 columns)
✓ skills_enriched.csv (29 columns)
✓ tools_enriched.csv (14 columns)
✓ experience_enriched.csv (6 columns)
✓ location_enriched.csv (7 columns)
✓ date_enriched.csv (10 columns)
✓ employment_enriched.csv (7 columns)


2025-12-19 13:07:27,887 - utils.file_handler - INFO - Saved CSV: data\enriched\company_enriched.csv with shape (2000, 7)


✓ company_enriched.csv (7 columns)

✓ All enriched data saved to: data/enriched/
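
Writing one CSV per category amounts to selecting a column subset and saving it. The sketch below uses `pathlib` and `DataFrame.to_csv` directly instead of the project's `FileHandler`; `save_by_category` is a hypothetical name. It also removes repeated column names before selecting. In the cell above, `company_name` appears both in `common_cols` and in the `company` list, so that column is selected twice, which accounts for the 7 columns reported for `company_enriched.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_by_category(df: pd.DataFrame, category_columns: dict, out_dir) -> dict:
    """Write one '<category>_enriched.csv' per entry, skipping missing or repeated columns."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = {}
    for category, columns in category_columns.items():
        # dict.fromkeys drops repeated names while preserving their order.
        wanted = [col for col in dict.fromkeys(columns) if col in df.columns]
        if not wanted:
            continue
        df[wanted].to_csv(out_path / f"{category}_enriched.csv", index=False)
        written[category] = wanted
    return written


demo = pd.DataFrame(
    {"job_id": [1, 2], "company_name": ["King Inc", "Hall LLC"], "company_size": ["Large", "Large"]}
)
requested = {"company": ["job_id", "company_name", "company_name", "company_size", "industry"]}

with tempfile.TemporaryDirectory() as tmp:
    written = save_by_category(demo, requested, tmp)
    saved_columns = list(pd.read_csv(Path(tmp) / "company_enriched.csv").columns)
```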


### 6.3 Save Data Dictionaries

Save frequency counts for skills, tools, and locations for reference.

In [22]:
# Save data dictionaries
dictionary_dir = 'data/dictionary'
file_handler.ensure_directory(dictionary_dir)

print("="*60)
print("SAVING DATA DICTIONARIES")
print("="*60)

# Save skill frequency
if 'skill_counts' in locals():
    skill_freq_df = pd.DataFrame({
        'skill': skill_counts.index,
        'frequency': skill_counts.values
    })
    file_handler.save_csv(skill_freq_df, f"{dictionary_dir}/skill_frequency.csv")
    print(f"✓ skill_frequency.csv ({len(skill_freq_df)} skills)")

# Save tool frequency
if 'tool_counts' in locals():
    tool_freq_df = pd.DataFrame({
        'tool': tool_counts.index,
        'frequency': tool_counts.values
    })
    file_handler.save_csv(tool_freq_df, f"{dictionary_dir}/tool_frequency.csv")
    print(f"✓ tool_frequency.csv ({len(tool_freq_df)} tools)")

# Save location frequency
if 'state_counts' in locals():
    location_freq_df = pd.DataFrame({
        'location': state_counts.index,
        'frequency': state_counts.values
    })
    file_handler.save_csv(location_freq_df, f"{dictionary_dir}/location_frequency.csv")
    print(f"✓ location_frequency.csv ({len(location_freq_df)} locations)")

print(f"\n✓ Data dictionaries saved to: {dictionary_dir}/")

2025-12-19 13:07:44,478 - utils.file_handler - INFO - Created directory: data\dictionary
2025-12-19 13:07:44,486 - utils.file_handler - INFO - Saved CSV: data\dictionary\skill_frequency.csv with shape (22, 2)
2025-12-19 13:07:44,494 - utils.file_handler - INFO - Saved CSV: data\dictionary\tool_frequency.csv with shape (8, 2)
2025-12-19 13:07:44,520 - utils.file_handler - INFO - Saved CSV: data\dictionary\location_frequency.csv with shape (195, 2)


SAVING DATA DICTIONARIES
✓ skill_frequency.csv (22 skills)
✓ tool_frequency.csv (8 tools)
✓ location_frequency.csv (195 locations)

✓ Data dictionaries saved to: data/dictionary/
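
Each frequency dictionary is a `value_counts` result reshaped into two columns. As a compact alternative to building the `DataFrame` by hand, the sketch below uses `rename_axis` and `reset_index`; the helper name `counts_to_frame` is hypothetical.

```python
import pandas as pd


def counts_to_frame(counts: pd.Series, label: str) -> pd.DataFrame:
    """Turn a value_counts Series into a two-column frame: <label>, frequency."""
    return counts.rename_axis(label).reset_index(name="frequency")


tool_counts_demo = pd.Series(["MLflow", "KDB+", "MLflow"]).value_counts()
tool_freq_demo = counts_to_frame(tool_counts_demo, "tool")
print(tool_freq_demo)
```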


## 7. Final Summary

Summary of the complete cleaning and enrichment process.

In [25]:
print("="*60)
print("DATA CLEANING & ENRICHMENT SUMMARY")
print("="*60)

print("\n📊 DATASET TRANSFORMATION:")
print(f"   Raw Data:       {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
print(f"   Cleaned Data:   {df_cleaned.shape[0]:,} rows × {df_cleaned.shape[1]} columns")
print(f"   Enriched Data:  {df_enriched.shape[0]:,} rows × {df_enriched.shape[1]} columns")
print(f"   New Features:   {df_enriched.shape[1] - df_cleaned.shape[1]} columns added")

print("\n🧹 CLEANING OPERATIONS:")
print(f"   Duplicates removed:     {cleaning_report['total_rows_removed']}")
# Quality score of the raw data, taken from the original validator's summary
# (falls back to 0.0 when the summary has no 'data_quality_score' key)
raw_summary = validator.get_summary()
raw_quality_score = raw_summary.get('data_quality_score', 0.0)
print(f"   Data quality improved:  {raw_quality_score:.1f}% → 100%")

print("\n✨ ENRICHMENT CATEGORIES:")
enrichment_summary = {
    'Salary': ['salary_min', 'salary_max', 'salary_avg', 'salary_cluster'],
    'Skills': [col for col in df_enriched.columns if col.startswith('skill_')],
    'Tools': [col for col in df_enriched.columns if col.startswith('tool_')],
    'Experience': ['experience_level_ordinal'],
    'Location': ['location_city', 'location_state', 'location_region', 'location_cluster'],
    'Date': ['posted_year', 'posted_month', 'posted_quarter', 'aging_feature'],
    'Additional': ['is_remote', 'is_fulltime', 'is_contract']
}

for category, features in enrichment_summary.items():
    available_features = [f for f in features if f in df_enriched.columns]
    print(f"   {category:12s}: {len(available_features):3d} features")

print("\n💾 OUTPUT FILES:")
print("   Cleaned: data/cleaned/ai_job_market_cleaned.csv")
print("   Enriched: data/enriched/ (8 category files)")
print("   Dictionary: data/dictionary/ (3 frequency files)")

print("\n✅ DATA READY FOR ANALYSIS!")
print("="*60)

DATA CLEANING & ENRICHMENT SUMMARY

📊 DATASET TRANSFORMATION:
   Raw Data:       2,000 rows × 12 columns
   Cleaned Data:   2,000 rows × 12 columns
   Enriched Data:  2,000 rows × 74 columns
   New Features:   62 columns added

🧹 CLEANING OPERATIONS:
   Duplicates removed:     0
   Data quality improved:  0.0% → 100%

✨ ENRICHMENT CATEGORIES:
   Salary      :   4 features
   Skills      :  20 features
   Tools       :   8 features
   Experience  :   1 features
   Location    :   4 features
   Date        :   4 features
   Additional  :   3 features

ðŸ’¾ OUTPUT FILES:
   Cleaned: data/cleaned/ai_job_market_cleaned.csv
   Enriched: data/enriched/ (8 category files)
   Dictionary: data/dictionary/ (3 frequency files)

âœ… DATA READY FOR ANALYSIS!
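
The category counts in the summary add up to 44 features, while the enriched frame gained 62 columns, so some new columns fall outside `enrichment_summary`. A minimal sketch for listing them follows. It works on plain column-name lists so it runs on its own; `uncategorized_columns` is a hypothetical helper, not part of the project, and the column names are toy values.

```python
from typing import Dict, Iterable, List


def uncategorized_columns(
    cleaned_cols: Iterable[str],
    enriched_cols: Iterable[str],
    summary: Dict[str, List[str]],
) -> List[str]:
    """Return new columns that no enrichment category accounts for."""
    cleaned = set(cleaned_cols)
    categorized = {col for cols in summary.values() for col in cols}
    # Preserve the enriched frame's column order in the result.
    return [
        col for col in enriched_cols
        if col not in cleaned and col not in categorized
    ]


# Toy column names for illustration only (not the real dataset schema).
cleaned_cols = ["job_id", "salary"]
enriched_cols = ["job_id", "salary", "salary_min", "skill_python", "title_length"]
summary = {"Salary": ["salary_min"], "Skills": ["skill_python"]}

leftover = uncategorized_columns(cleaned_cols, enriched_cols, summary)
print(leftover)
```

In this notebook the call would presumably be `uncategorized_columns(df_cleaned.columns, df_enriched.columns, enrichment_summary)`.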


---

## 8. Reusable Functions for Other Notebooks

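These helper functions load the pipeline's saved outputs, so other analysis notebooks can reuse them without rerunning the cleaning and enrichment steps. Importing them as shown below assumes the definitions have been saved to a module named `cleaning` on the Python path.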

### Usage Example:
```python
# In another notebook:
from cleaning import load_cleaned_data, load_enriched_data

# Load cleaned data
df = load_cleaned_data()

# Load specific enriched category
df_salary = load_enriched_data('salary')
df_skills = load_enriched_data('skills')
```

In [26]:
from typing import Optional


def load_cleaned_data() -> pd.DataFrame:
    """Load the cleaned dataset from the path set in the project config."""
    config = get_config_loader()
    file_handler = FileHandler()
    cleaned_path = config.get_path('paths.cleaned_data_file')
    return file_handler.load_csv(cleaned_path)


def load_enriched_data(category: Optional[str] = None) -> Optional[pd.DataFrame]:
    """Load one enriched category, or merge all categories when none is given.

    Returns None if no category file could be loaded.
    """
    file_handler = FileHandler()
    enriched_dir = 'data/enriched'

    if category:
        filepath = f"{enriched_dir}/{category}_enriched.csv"
        return file_handler.load_csv(filepath)

    # Load every category file that is available
    categories = ['salary', 'skills', 'tools', 'experience',
                  'location', 'date', 'employment', 'company']
    dfs = []
    for cat in categories:
        filepath = f"{enriched_dir}/{cat}_enriched.csv"
        try:
            dfs.append(file_handler.load_csv(filepath))
        except Exception as exc:
            # Skip categories that cannot be loaded, but leave a trace.
            # `logger` is the module-level logger created in Section 1.
            logger.warning(f"Skipping {filepath}: {exc}")

    if not dfs:
        return None

    # Merge on whichever key columns both frames share
    common_cols = ['job_id', 'job_title', 'company_name', 'location']
    df_merged = dfs[0]
    for df_cat in dfs[1:]:
        merge_cols = [col for col in common_cols
                      if col in df_merged.columns and col in df_cat.columns]
        if merge_cols:
            df_merged = df_merged.merge(df_cat, on=merge_cols, how='outer')
    return df_merged


def get_skill_frequency() -> pd.DataFrame:
    """Load the skill frequency dictionary."""
    file_handler = FileHandler()
    return file_handler.load_csv('data/dictionary/skill_frequency.csv')


def get_tool_frequency() -> pd.DataFrame:
    """Load the tool frequency dictionary."""
    file_handler = FileHandler()
    return file_handler.load_csv('data/dictionary/tool_frequency.csv')


def get_location_frequency() -> pd.DataFrame:
    """Load the location frequency dictionary."""
    file_handler = FileHandler()
    return file_handler.load_csv('data/dictionary/location_frequency.csv')


print("✓ Reusable functions defined:")
print("  - load_cleaned_data()")
print("  - load_enriched_data(category=None)")
print("  - get_skill_frequency()")
print("  - get_tool_frequency()")
print("  - get_location_frequency()")

✓ Reusable functions defined:
  - load_cleaned_data()
  - load_enriched_data(category=None)
  - get_skill_frequency()
  - get_tool_frequency()
  - get_location_frequency()
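
When called without a category, `load_enriched_data()` combines the per-category files with an outer merge on their shared key columns. A small sketch of that step on toy frames, assuming `job_id` is the only shared column; the values are made up for illustration:

```python
import pandas as pd

# Toy per-category frames; real files carry many more columns.
salary = pd.DataFrame({"job_id": [1, 2], "salary_avg": [90000, 120000]})
skills = pd.DataFrame({"job_id": [2, 3], "skill_python": [1, 0]})

# An outer merge keeps every job_id seen in either frame and fills gaps with NaN.
merged = salary.merge(skills, on=["job_id"], how="outer")
print(merged)
```

Rows present in only one category file come back with missing values in the other file's columns, so code that consumes a full load should expect NaN entries.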


---

## Appendix: Technical Details

### Data Cleaning Strategy
- **Duplicates**: Removed using pandas `drop_duplicates()` with `keep='first'`
- **Missing Values**: Dropped rows with missing critical fields
- **Validation**: Post-cleaning validation checks that the cleaned data reaches a 100% quality score
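
The first two steps above can be sketched with plain pandas calls. This is an illustration under stated assumptions, not the project's `DataCleaner`: the `CRITICAL_FIELDS` list and the `clean` function are hypothetical, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical choice of critical fields; the project defines its own list.
CRITICAL_FIELDS = ["job_title", "company_name"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then rows missing any critical field."""
    deduped = df.drop_duplicates(keep="first")
    complete = deduped.dropna(subset=CRITICAL_FIELDS)
    return complete.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "job_title": ["ML Engineer", "ML Engineer", None, "Data Analyst"],
        "company_name": ["Acme", "Acme", "Globex", "Initech"],
    }
)
cleaned = clean(raw)
print(f"{len(raw)} rows -> {len(cleaned)} rows")
```

Deduplicating before dropping incomplete rows keeps the two removal counts separate, which makes a cleaning report easier to read.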

### Enrichment Architecture
Each enricher class follows the **Single Responsibility Principle**:
- `SalaryEnricher`: Salary parsing and clustering
- `SkillsEnricher`: Skill extraction and binary encoding
- `ToolsEnricher`: Tool extraction and binary encoding
- `ExperienceEnricher`: Ordinal encoding of experience levels
- `LocationEnricher`: Geographic parsing and clustering
- `DateEnricher`: Temporal feature extraction
- `AdditionalFeaturesEnricher`: Derived features
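
A minimal sketch of what one such single-purpose enricher might look like, assuming each class exposes an `enrich(df)` method that returns a copy with new columns. The class name, the method name, the `experience_level` input column, and the level ordering are assumptions for illustration; only the `experience_level_ordinal` output name comes from this notebook.

```python
import pandas as pd


class ExperienceEnricherSketch:
    """Illustrative stand-in: adds one ordinal column and nothing else."""

    # Assumed ordering of levels; the project's actual mapping may differ.
    LEVELS = {"Entry": 0, "Mid": 1, "Senior": 2}

    def enrich(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()  # leave the caller's frame untouched
        out["experience_level_ordinal"] = out["experience_level"].map(self.LEVELS)
        return out


jobs = pd.DataFrame({"experience_level": ["Senior", "Entry", "Mid"]})
enriched = ExperienceEnricherSketch().enrich(jobs)
print(enriched["experience_level_ordinal"].tolist())
```

Keeping each enricher to one concern means a category can be rerun or tested without touching the others.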

### File Organization
```
data/
├── raw/                         # Original dataset
├── cleaned/                     # Cleaned dataset
├── enriched/                    # Enriched by category
│   ├── salary_enriched.csv
│   ├── skills_enriched.csv
│   ├── tools_enriched.csv
│   ├── experience_enriched.csv
│   ├── location_enriched.csv
│   ├── date_enriched.csv
│   ├── employment_enriched.csv
│   └── company_enriched.csv
└── dictionary/                  # Reference data
    ├── skill_frequency.csv
    ├── tool_frequency.csv
    └── location_frequency.csv
```

---

**End of Notebook** 

For questions or issues, refer to the project documentation in `README.md` and `ARCHITECTURE.md`.