# Week 9: Data Cleaning & Preparation

Welcome to one of the most important and time-consuming aspects of data analysis! This week we'll learn how to transform messy, real-world data into a clean, analysis-ready format.

## Key Topics:
- **Conceptual Framework**: Cleaning for Descriptive vs. Predictive analysis
- **Missing Value Imputation**: Strategies for handling missing data
- **Outlier Handling**: Identifying and treating extreme values
- **Categorical Encoding**: Converting text data into a format for analysis
- **Distributional Transformations**: Dealing with skewed data

These skills are essential for ensuring the accuracy and reliability of any analysis you perform.

## Your Mission: Analyse Employee Performance at TechCorp

You're a Data Analyst at **TechCorp**. The HR department wants to understand factors driving employee promotion and performance, but their data is messy and comes from multiple sources.

**Your Challenge:**
- The dataset contains common data quality issues:
  - **Missing information** for some employees
  - **Inconsistent text entries** and typos
  - **Extreme outliers** that could skew results
  - **Skewed data distributions**

**Your Goal:**
- Prepare two versions of the dataset:
  1. A version for **descriptive analysis** (e.g., for an HR dashboard)
  2. A version for **predictive modeling** (e.g., to predict future promotions)

## The Core Concept: Descriptive vs. Predictive Cleaning

This is the most important concept of this workshop. The way you clean your data depends entirely on your goal.

### Cleaning for Descriptive Analytics
- **Goal**: To present the data as accurately and honestly as possible. We want to understand the data's current state, including its flaws.
- **Approach**: Gentle and transparent. We fix obvious errors (like typos) but we don't fundamentally change the data's shape or values. We might report on missing values rather than filling them in.
- **Example**: An HR dashboard showing the current distribution of salaries, including outliers, and explicitly stating how many employees have missing performance ratings.

### Cleaning for Predictive Modeling
- **Goal**: To prepare the data for a machine learning algorithm. Many algorithms have strict requirements (e.g., no missing values, all numerical input).
- **Approach**: More aggressive and transformative. We fill missing values, convert all text to numbers, and may even alter distributions to help the model learn better.
- **Example**: A dataset where all missing values are filled, all categories are converted to numbers, and salary is log-transformed to reduce the impact of outliers on a promotion prediction model.

## Step 1: Load and Explore the Raw Data

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df_raw = pd.read_csv('employee_performance_data.csv')

print('Dataset loaded successfully!')
df_raw.head()

Let's get a quick overview of the data types and missing values.

In [None]:
# Get initial info
df_raw.info()

And now for a statistical summary of the numerical columns.

In [None]:
# Get statistical summary
df_raw.describe()

**Initial Observations:**
- **Missing Values**: `education`, `age`, and `previous_year_rating` have nulls.
- **Outliers**: The `max` salary (around 337k) is roughly four times the 75th percentile (around 83k), which points to a long right tail. `tenure_months` also has a high max value.
- **Categorical Data**: `department`, `region`, `education`, `gender`, and `recruitment_channel` will need to be handled.
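
The observations above can be checked programmatically rather than read off `info()` by eye. The following is a minimal sketch on a hypothetical toy frame (the column values are invented for illustration and are not taken from `employee_performance_data.csv`); it tabulates the count and percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_raw (values are invented)
toy = pd.DataFrame({
    "age": [25, np.nan, 41, 38],
    "education": ["Bachelor's", None, "PhD", "Master's"],
    "salary": [52000, 61000, 58000, 337000],
})

# Count and share of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing_summary)
```

The same two expressions should work unchanged on `df_raw`.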

## Part 1: Cleaning for Descriptive Analytics

Our goal here is to create a clean dataset for an HR dashboard. We want to fix errors but preserve the original story of the data.

In [None]:
# Create a copy for descriptive cleaning
df_desc = df_raw.copy()

### 1.1 Handling Categorical Messiness (Typos)

Let's check the `department` column for inconsistent values.

In [None]:
# Check for unique values in department
df_desc['department'].value_counts()

We can see 'Enginering' and 'ENGINEERING'. Let's consolidate these into 'Engineering'.

In [None]:
# Consolidate department names
department_map = {
    'Enginering': 'Engineering',
    'ENGINEERING': 'Engineering'
}
df_desc['department'] = df_desc['department'].replace(department_map)

print('Cleaned department values:')
df_desc['department'].value_counts()
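
When a column has many case or whitespace variants, listing each one in a mapping becomes tedious. One possible approach, sketched below on hypothetical values, is to normalise whitespace and capitalisation first and then map only the genuine misspellings.

```python
import pandas as pd

# Hypothetical department values with case/whitespace variants and one typo
dept = pd.Series(["Engineering", "ENGINEERING", " engineering ", "Enginering", "Sales"])

# Step 1: normalise whitespace and capitalisation
normalised = dept.str.strip().str.title()

# Step 2: map the remaining genuine misspellings explicitly
cleaned = normalised.replace({"Enginering": "Engineering"})
print(cleaned.value_counts())
```

Note that `.str.title()` would also turn an acronym such as `HR` into `Hr`, so it is worth checking `value_counts()` again after normalising.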

### 1.2 Handling Missing Values

For descriptive analysis, we often replace missing categorical values with a placeholder like 'Unknown' to be explicit.

In [None]:
# Fill missing education with 'Unknown'
df_desc['education'] = df_desc['education'].fillna('Unknown')

print('Value counts for education after filling missing values:')
df_desc['education'].value_counts()

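For numerical columns, we'll leave them as `NaN`. pandas aggregations such as `.mean()` and `.sum()` skip `NaN` by default (`skipna=True`), so summary statistics are computed from the observed values only, which is what we want for honest reporting. Report the number of missing values alongside each statistic so readers know how many employees it covers.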

In [None]:
# Confirm that missing values still exist in numerical columns
print(f"Missing values in 'age': {df_desc['age'].isnull().sum()}")
print(f"Missing values in 'previous_year_rating': {df_desc['previous_year_rating'].isnull().sum()}")
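
To see why leaving `NaN` in place is the safer choice for reporting, here is a small sketch on hypothetical ratings. It compares the default mean with the mean after filling gaps with 0, a shortcut that silently treats "unknown" as "zero".

```python
import numpy as np
import pandas as pd

ratings = pd.Series([3.0, np.nan, 5.0, 4.0])  # hypothetical ratings

mean_skipping_nan = ratings.mean()               # NaN skipped (skipna=True is the default)
mean_with_zero_fill = ratings.fillna(0).mean()   # NaN treated as a real 0
mean_strict = ratings.mean(skipna=False)         # NaN propagates to the result
n_observed = ratings.count()                     # number of non-missing values

print(mean_skipping_nan, mean_with_zero_fill, mean_strict, n_observed)
```

Here the default mean is 4.0 over 3 observed ratings, while the zero-filled version drops to 3.0.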

### 1.3 Outlier Handling

For descriptive analysis, we don't remove outliers. They are part of the story! An HR manager would want to know who has an unusually high salary. Our job is to identify and report on them, not hide them.

In [None]:
# Identify salary outliers for reporting
salary_q99 = df_desc['salary'].quantile(0.99)
outliers = df_desc[df_desc['salary'] > salary_q99]

print(f'There are {len(outliers)} employees with salaries above the 99th percentile (${salary_q99:,.2f}).')
outliers[['employee_id', 'department', 'salary']].head()
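
A percentile cutoff always flags a fixed share of rows, whether or not they are unusual. As an alternative for comparison, the sketch below uses the IQR rule (Tukey's fences), a different technique from the 99th percentile cutoff in the cell above, on hypothetical salaries. It flags values only; nothing is removed.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salary = pd.Series([48000, 52000, 55000, 58000, 61000, 64000, 70000, 337000])

q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag for reporting; the original series is left untouched
flagged = salary[(salary < lower_fence) | (salary > upper_fence)]
print(flagged)
```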

### 1.4 Binning Numeric Data for Descriptive Analysis

Sometimes it's useful to group continuous numbers into categories (or 'bins') for reporting. Let's create age groups.

In [None]:
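# Define the bins and labels
# right=False makes each bin include its left edge: [20, 30), [30, 40), ...
# np.inf as the last edge keeps the '60+' bin open-ended
age_bins = [20, 30, 40, 50, 60, np.inf]
age_labels = ['20-29', '30-39', '40-49', '50-59', '60+']

df_desc['age_group'] = pd.cut(df_desc['age'], bins=age_bins, labels=age_labels, right=False)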

print('Age group distribution:')
df_desc['age_group'].value_counts().sort_index()
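
Bin edges are easy to get subtly wrong, so it helps to test them on boundary values. The sketch below uses hypothetical ages and `np.inf` as the top edge to show how `right=False` assigns ages that sit exactly on an edge, and what happens to ages outside the bins.

```python
import numpy as np
import pandas as pd

ages = pd.Series([19, 20, 29, 30, 60, 72, np.nan])  # hypothetical ages

bins = [20, 30, 40, 50, 60, np.inf]
labels = ['20-29', '30-39', '40-49', '50-59', '60+']

# right=False: each bin is [left, right), so 30 falls in '30-39', not '20-29'
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'age': ages, 'group': groups}))
```

The age of 19 falls below the first edge and becomes `NaN`, just like the genuinely missing age, so values outside the bins are worth counting separately.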

### Summary of Descriptive Cleaning

Our `df_desc` DataFrame is now ready for reporting. We've:
- Corrected typos in categorical data.
- Explicitly labeled missing categorical data as 'Unknown'.
- Kept numerical missing values and outliers to ensure our reports reflect the true state of the data.

## Part 2: Cleaning for Predictive Modeling

Our goal now is to prepare the data for a hypothetical model to predict `is_promoted`. This requires more aggressive transformations.

In [None]:
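# Create a copy for predictive cleaning and reuse the department typo fix from Part 1
df_pred = df_raw.copy()
df_pred['department'] = df_pred['department'].replace(department_map)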

### 2.1 Handling Missing Values (Imputation)

Many machine learning models can't handle missing values. We need to fill them using a reasonable strategy (imputation).

For numerical columns, we can use the median. The median is generally safer than the mean when we have skewed data or outliers.

In [None]:
# Impute numerical columns with the median
for col in ['age', 'previous_year_rating']:
    median_val = df_pred[col].median()
    df_pred[col] = df_pred[col].fillna(median_val)
    print(f"Filled missing '{col}' with median value: {median_val}")
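
The preference for the median can be illustrated with a small sketch. The tenure values below are hypothetical and include one extreme entry; the point is how far that single entry pulls the mean compared with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure in months, with one extreme value and one gap
tenure = pd.Series([12, 18, 24, 30, 400, np.nan])

mean_val = tenure.mean()      # pulled upwards by the 400
median_val = tenure.median()  # middle of the observed values

filled_mean = tenure.fillna(mean_val)
filled_median = tenure.fillna(median_val)
print(f"mean fill: {mean_val:.1f}, median fill: {median_val:.1f}")
```

The mean-based fill lands well above four of the five observed values, while the median-based fill sits in the middle of them.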

For categorical columns, we can use the mode (the most frequent value).

In [None]:
# Impute categorical column with the mode
mode_val = df_pred['education'].mode()[0]
df_pred['education'] = df_pred['education'].fillna(mode_val)

print(f"Filled missing 'education' with mode value: '{mode_val}'")
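
The `[0]` after `.mode()` is needed because `.mode()` returns a Series: if two categories tie, both are returned in sorted order. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical education values with a tie between two categories
education = pd.Series(["Master's", "Bachelor's", "Master's", "Bachelor's", None])

modes = education.mode()   # missing values are ignored; ties are all returned, sorted
fill_value = modes[0]      # picks the first of the tied values
filled = education.fillna(fill_value)
print(modes.tolist(), fill_value)
```

With a tie, `[0]` picks the alphabetically first category, which is an arbitrary choice, so it is worth checking `len(modes)` before imputing.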

### 2.2 Outlier Handling

Extreme outliers can negatively impact some models. A common strategy is 'capping' or 'winsorizing', where we cap the values at a certain percentile.

In [None]:
# Cap salary at the 99th percentile
q99_salary = df_pred['salary'].quantile(0.99)
df_pred['salary'] = df_pred['salary'].clip(upper=q99_salary)

print(f'Salaries are now capped at the 99th percentile: ${q99_salary:,.2f}')
df_pred.describe()

### 2.3 Distributional Transformations

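Features with highly right-skewed distributions (like salary) can often be made more symmetric with a log transformation, which compresses large values. This tends to help models that are sensitive to feature scale, such as linear models; tree-based models are largely unaffected. We use `np.log1p`, which computes log(1 + x) and so is defined for zero values.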

In [None]:
# Apply a log transformation to salary
df_pred['salary_log'] = np.log1p(df_pred['salary'])

print('Salary distribution before and after log transform:')
df_pred[['salary', 'salary_log']].hist(bins=30, figsize=(10, 4))
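
A histogram shows the effect visually; the skewness statistic puts a number on it. The sketch below uses hypothetical right-skewed salaries to compare skewness before and after `np.log1p`, and shows that `np.expm1` reverses the transform when values need to be reported in the original units.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed salaries
salary = pd.Series([30000, 35000, 40000, 50000, 65000, 90000, 150000, 337000], dtype=float)
salary_log = np.log1p(salary)

skew_before = salary.skew()
skew_after = salary_log.skew()

# expm1 is the inverse of log1p
recovered = np.expm1(salary_log)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```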

### 2.4 Categorical Encoding

Models need numerical input. We must convert our categorical columns into numbers.

#### Ordinal Encoding

For columns with a clear order (like `education`), we can map them to numbers manually.

In [None]:
# Manually encode education level
education_map = {
    'High School': 1,
    "Bachelor's": 2,
    "Master's": 3,
    'PhD': 4
}
df_pred['education_encoded'] = df_pred['education'].map(education_map)

df_pred[['education', 'education_encoded']].head()
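
One caveat with `.map()` is that any value missing from the dictionary becomes `NaN` without warning, which would reintroduce missing values after imputation. A possible safeguard, sketched on hypothetical values with a deliberate typo, is to list the unmapped categories right after encoding.

```python
import pandas as pd

education_order = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}

# Hypothetical values; 'Masters' (no apostrophe) is a deliberate typo
education = pd.Series(["Bachelor's", 'PhD', 'Masters', 'High School'])

encoded = education.map(education_order)

# Values that were present in the data but absent from the mapping
unmapped = education[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every category was encoded.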

#### One-Hot Encoding

For columns with no inherent order (like `department`), we use one-hot encoding. This creates a new binary column for each category (0/1, or `True`/`False` in pandas 2.0 and later, where `get_dummies` returns booleans by default).

In [None]:
# One-hot encode department
department_dummies = pd.get_dummies(df_pred['department'], prefix='dept')

# Join the new dummy variables back to our dataframe
df_pred = pd.concat([df_pred, department_dummies], axis=1)

print('DataFrame with one-hot encoded departments:')
df_pred.head()
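
Two optional `get_dummies` parameters are worth knowing about. The sketch below, on a hypothetical toy frame, uses `dtype=int` to get 0/1 columns regardless of pandas version, and `drop_first=True` to drop one redundant column, which some linear models prefer.

```python
import pandas as pd

toy = pd.DataFrame({'department': ['Engineering', 'Sales', 'HR', 'Sales']})  # hypothetical

# One 0/1 column per category
dummies = pd.get_dummies(toy['department'], prefix='dept', dtype=int)

# Same encoding with the first category dropped; it is implied when all others are 0
dummies_reduced = pd.get_dummies(toy['department'], prefix='dept', dtype=int, drop_first=True)

print(dummies)
print(dummies_reduced)
```

Each row of `dummies` sums to exactly 1, which is why one of the columns is redundant.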

### Summary of Predictive Cleaning

Our `df_pred` DataFrame is now almost ready for a model. We have:
- Filled all missing values.
- Capped extreme outliers.
- Transformed skewed data.
- Converted categorical features into numerical formats.

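The final step is to keep only the numeric and boolean columns and to set aside the target, `is_promoted`, so it is not used as a feature. Identifier columns such as `employee_id` carry no predictive signal and should be dropped as well.

In [None]:
# Select numeric columns AND boolean columns (from one-hot encoding),
# then drop the target and the identifier (skipped if it was not selected)
df_model_ready = df_pred.select_dtypes(include=[np.number, bool]).drop(
    columns=['is_promoted', 'employee_id'], errors='ignore'
)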

print('Final model-ready dataset shape:')
print(df_model_ready.shape)
df_model_ready.head()
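
Before handing data to a model, it is common to separate the features from the target and to verify the claims made about the dataset. The following is a sketch on a hypothetical toy frame; the names `X` and `y` are conventions assumed here, not names used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_pred after cleaning
toy = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'department': ['Sales', 'HR', 'Sales'],
    'salary_log': np.log1p([52000.0, 61000.0, 58000.0]),
    'dept_HR': [False, True, False],
    'dept_Sales': [True, False, True],
    'is_promoted': [0, 1, 0],
})

y = toy['is_promoted']  # target
X = toy.select_dtypes(include=[np.number, bool]).drop(columns=['is_promoted', 'employee_id'])

# Sanity checks: no missing values and no text columns left in the features
has_missing = X.isna().any().any()
has_text = not X.select_dtypes(include='object').empty
print(X.columns.tolist(), has_missing, has_text)
```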

## Conclusion: Two Goals, Two Different Datasets

Let's compare our two final datasets.

**`df_desc` (for Descriptive Analytics):**
- Still contains text and missing values (NaNs).
- Preserves original data distributions, including outliers.
- Perfect for creating honest, transparent business reports and dashboards.

**`df_model_ready` (for Predictive Modeling):**
- Is fully numerical with no missing values.
- Has outliers and distributions transformed to suit algorithms.
- Ready to be used to train a machine learning model.

Understanding this distinction is a key skill for any data analyst. The right cleaning strategy always depends on your end goal!