# 03. Feature Engineering

## Goal
Create new features that will make our analysis and dashboard more powerful. We will simplify job titles, categorize salaries, and group locations.

## Steps
1. Load the cleaned dataset.
2. **Job Title Categorization**: Group hundreds of specific titles into broader categories (e.g., 'Data Scientist', 'Data Analyst', 'ML Engineer').
3. **Salary Tiers**: Create a categorical column for salary ranges (Low, Medium, High) based on the 33rd and 66th percentiles.
4. **Location Grouping**: Create an 'India vs Global' column to specifically target our project goal.
5. Save the enhanced dataset.

In [1]:
import pandas as pd

### 1. Load Cleaned Data

In [2]:
file_path = "../data/cleaned/cleaned_jobs_data.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,job_title,experience_level,employment_type,work_models,work_year,employee_residence,salary,salary_currency,salary_in_usd,company_location,company_size
0,Data Engineer,Mid-level,Full-time,Remote,2024,United States,148100,USD,148100,United States,Medium
1,Data Engineer,Mid-level,Full-time,Remote,2024,United States,98700,USD,98700,United States,Medium
2,Data Scientist,Senior-level,Full-time,Remote,2024,United States,140032,USD,140032,United States,Medium
3,Data Scientist,Senior-level,Full-time,Remote,2024,United States,100022,USD,100022,United States,Medium
4,BI Developer,Mid-level,Full-time,On-site,2024,United States,120000,USD,120000,United States,Medium


### 2. Job Title Categorization
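There are many variations of job titles. Let's group them into standard roles: 'Data Scientist', 'Data Analyst', 'Data Engineer', 'ML/AI Engineer', 'Manager/Lead', 'Data Architect', and 'Other'. The checks run in order, so each title is assigned to the first category it matches.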

In [3]:
def categorize_job_title(title):
    """Map a raw job title to a broad category; the first matching rule wins."""
    title = title.lower()
    # Note: these are substring checks, so short tokens such as 'ml' and 'ai'
    # also match inside longer words.
    if 'scientist' in title or 'science' in title:
        return 'Data Scientist'
    elif 'analyst' in title or 'analytics' in title:
        return 'Data Analyst'
    elif 'engineer' in title and 'data' in title:
        return 'Data Engineer'
    elif 'machine learning' in title or 'ml' in title or 'ai' in title:
        return 'ML/AI Engineer'
    elif 'manager' in title or 'lead' in title or 'head' in title or 'director' in title:
        return 'Manager/Lead'
    elif 'architect' in title:
        return 'Data Architect'
    else:
        return 'Other'

# Apply the function
df['job_category'] = df['job_title'].apply(categorize_job_title)

# Check distribution
df['job_category'].value_counts()

job_category
Data Scientist    1985
Data Analyst      1476
Data Engineer     1380
ML/AI Engineer     944
Other              499
Data Architect     181
Manager/Lead       134
Name: count, dtype: int64
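
Because the rules above use plain substring checks, a short token like 'ai' or 'ml' can match inside an unrelated word. The following is a minimal sketch of whole-word matching with a regular expression. The helper name `is_ml_or_ai_title` and the example titles are hypothetical and are not taken from the dataset.

```python
import re

# Hypothetical helper: match 'ml', 'ai' or 'machine learning' only as whole words.
_ML_AI_PATTERN = re.compile(r"\b(?:ml|ai|machine learning)\b")

def is_ml_or_ai_title(title: str) -> bool:
    """Return True if the lower-cased title contains one of the tokens as a whole word."""
    return bool(_ML_AI_PATTERN.search(title.lower()))

# Illustrative titles (made up for this sketch).
for example in ["AI Developer", "Machine Learning Researcher", "Head of Retail"]:
    substring_hit = "ai" in example.lower() or "ml" in example.lower()
    print(f"{example!r}: substring={substring_hit}, whole_word={is_ml_or_ai_title(example)}")
```

One trade-off of this approach: compound tokens such as 'MLOps' no longer match, since 'ml' is not a separate word there, so such titles would need their own rule.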

### 3. Salary Tiers
We'll divide salaries into 3 groups (Low, Medium, High) using the 33rd and 66th percentiles of `salary_in_usd` as cut points, so each tier holds roughly a third of the rows. This helps with visualizing salary spread.

In [4]:
# Cut points for the salary tiers: the 33rd and 66th percentiles of salary_in_usd
low_cutoff = df['salary_in_usd'].quantile(0.33)
high_cutoff = df['salary_in_usd'].quantile(0.66)

def categorize_salary(salary):
    if salary < low_cutoff:
        return 'Low'
    elif salary < high_cutoff:
        return 'Medium'
    else:
        return 'High'

df['salary_tier'] = df['salary_in_usd'].apply(categorize_salary)
df[['salary_in_usd', 'salary_tier']].head()

,salary_in_usd,salary_tier
0,148100,Medium
1,98700,Low
2,140032,Medium
3,100022,Low
4,120000,Medium
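
As an alternative to computing the cut points by hand, pandas can bin values into equal-sized quantile groups with `pd.qcut`. The sketch below uses a toy series with made-up salaries rather than the real column. It also differs slightly from the cell above: `qcut` with `q=3` uses exact thirds instead of 0.33/0.66, and its bins include their right edge, so a value sitting exactly on a boundary lands in the lower tier.

```python
import pandas as pd

# Toy salaries standing in for df['salary_in_usd'] (illustrative values only).
salaries = pd.Series(
    [50_000, 60_000, 70_000, 80_000, 90_000, 100_000, 110_000, 120_000, 130_000]
)

# Split into three quantile-based bins and label them in ascending order.
tiers = pd.qcut(salaries, q=3, labels=["Low", "Medium", "High"])

print(tiers.value_counts().sort_index())
```

The result is an ordered categorical, which tends to sort correctly (Low, Medium, High) in charts and group-bys without extra work.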


### 4. Location Grouping (India vs Global)
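For our specific analysis goal, we want to easily compare India against the rest of the world. `company_location` stores full country names (e.g. 'United States'), so the check compares against 'India'; the ISO code 'IN' matches no rows, which is why the run recorded below labels all 6599 rows 'Global'.

In [5]:
def categorize_location(country):
    # company_location holds country names, not ISO codes
    if country == 'India':
        return 'India'
    else:
        return 'Global'

df['job_location_type'] = df['company_location'].apply(categorize_location)
df['job_location_type'].value_counts()

job_location_type
Global    6599
Name: count, dtype: int64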
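
A single-class result like this is worth checking against the values the column actually contains. The sketch below builds a toy frame (made-up rows, not the real data), inspects the distinct values, and then creates the flag with a vectorized `np.where`. It assumes `company_location` stores full country names, as the `df.head()` output in step 1 suggests; the exact spelling 'India' is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame; country names are spelled out, as in the df.head() output in step 1.
toy = pd.DataFrame(
    {"company_location": ["United States", "India", "Germany", "India"]}
)

# First look at how countries are encoded (names vs. ISO codes).
print(toy["company_location"].unique())

# Vectorized flag: 'India' where the name matches, 'Global' everywhere else.
toy["job_location_type"] = np.where(
    toy["company_location"].eq("India"), "India", "Global"
)
print(toy["job_location_type"].value_counts())
```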

### 5. Save Feature-Rich Dataset
We save the result as a new file, `processed_jobs_data.csv`, rather than overwriting the cleaned file, so the name indicates that engineered features have been added.

In [6]:
output_path = "../data/cleaned/processed_jobs_data.csv"
df.to_csv(output_path, index=False)
print(f"✅ Feature engineering complete! Saved to {output_path}")

✅ Feature engineering complete! Saved to ../data/cleaned/processed_jobs_data.csv
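
A quick round-trip check can confirm that the saved file contains the expected columns. The sketch below writes a toy frame to an in-memory buffer instead of disk so it runs anywhere; in the notebook, the same check would read `output_path` instead. The toy rows and the `expected_columns` list are illustrative.

```python
import io

import pandas as pd

# Toy frame with two of the columns used in this notebook (illustrative rows).
toy = pd.DataFrame(
    {"salary_in_usd": [98700, 148100], "salary_tier": ["Low", "Medium"]}
)

# Write without the row index, then read the CSV text back in.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should have the same columns and row count.
expected_columns = ["salary_in_usd", "salary_tier"]
assert list(reloaded.columns) == expected_columns
assert len(reloaded) == len(toy)
print("Round-trip check passed")
```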
