# Data Cleaning and Preprocessing Crash Course

**Date Created:** 20 January 2026

This notebook provides a comprehensive guide to data cleaning and preprocessing techniques commonly tested in data science interviews and assessments. Each section includes explanations, examples, and practice questions to solidify your understanding.

## Table of Contents

1. [Identifying and Handling Missing Values](#1-identifying-and-handling-missing-values)
2. [Detecting and Handling Outliers](#2-detecting-and-handling-outliers)
3. [Data Type Conversions](#3-data-type-conversions)
4. [Encoding Categorical Variables](#4-encoding-categorical-variables)
5. [Feature Scaling and Normalisation](#5-feature-scaling-and-normalisation)
6. [Handling Duplicates](#6-handling-duplicates)
7. [String Cleaning and Text Preprocessing](#7-string-cleaning-and-text-preprocessing)
8. [Date Parsing and Feature Extraction](#8-date-parsing-and-feature-extraction)
9. [Binning and Discretisation](#9-binning-and-discretisation)
10. [Feature Engineering Basics](#10-feature-engineering-basics)
11. [Practice Questions](#11-practice-questions)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder, 
    OneHotEncoder, 
    StandardScaler, 
    MinMaxScaler, 
    RobustScaler,
    KBinsDiscretizer
)
from sklearn.impute import SimpleImputer, KNNImputer
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

---

## 1. Identifying and Handling Missing Values

Missing values occur when no data is recorded for certain entries in a dataset. This can happen due to:
- Data collection errors
- Incomplete surveys or forms
- System failures
- Data integration issues

### 1.1 Identifying Missing Values

In [None]:
# Create sample data with missing values
data = {
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, np.nan, 35, 40, np.nan],
    'salary': [50000, 60000, np.nan, 80000, 55000],
    'department': ['HR', 'IT', 'IT', None, 'Finance'],
    'years_experience': [2, 5, np.nan, 10, 3]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\n" + "="*50)

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))
print("\nTotal missing values:", df.isnull().sum().sum())

In [None]:
# Show the missing value pattern as a boolean mask
print("Missing value pattern (True = Missing):")
print(df.isnull())

### 1.2 Handling Missing Values

#### Method 1: Dropping Missing Values (`dropna`)

In [None]:
def demonstrate_dropna(df: pd.DataFrame) -> None:
    """Demonstrate various dropna options.
    
    Args:
        df: DataFrame with missing values.
    """
    print("Original shape:", df.shape)
    
    # Drop rows with ANY missing values
    df_drop_any = df.dropna()
    print("\nAfter dropna() - drop rows with any NaN:")
    print(df_drop_any)
    print("Shape:", df_drop_any.shape)
    
    # Drop rows only if ALL values are missing
    df_drop_all = df.dropna(how='all')
    print("\nAfter dropna(how='all') - drop rows where all values are NaN:")
    print(df_drop_all)
    
    # Drop rows based on specific columns
    df_drop_subset = df.dropna(subset=['salary'])
    print("\nAfter dropna(subset=['salary']):")
    print(df_drop_subset)
    
    # Drop rows with threshold (keep rows with at least n non-null values)
    df_thresh = df.dropna(thresh=4)
    print("\nAfter dropna(thresh=4) - keep rows with at least 4 non-null values:")
    print(df_thresh)

demonstrate_dropna(df)

#### Method 2: Filling Missing Values (`fillna`)

In [None]:
def demonstrate_fillna(df: pd.DataFrame) -> None:
    """Demonstrate various fillna strategies.
    
    Args:
        df: DataFrame with missing values.
    """
    # Fill with a constant value
    print("Fill with constant value (0):")
    print(df['age'].fillna(0))
    
    # Fill with mean (for numerical columns)
    print("\nFill age with mean:")
    print(df['age'].fillna(df['age'].mean()))
    
    # Fill with median (more robust to outliers)
    print("\nFill salary with median:")
    print(df['salary'].fillna(df['salary'].median()))
    
    # Fill with mode (for categorical columns)
    print("\nFill department with mode:")
    print(df['department'].fillna(df['department'].mode()[0]))
    
    # Forward fill (propagate last valid observation)
    print("\nForward fill (ffill):")
    print(df['age'].ffill())
    
    # Backward fill
    print("\nBackward fill (bfill):")
    print(df['age'].bfill())

demonstrate_fillna(df)

#### Method 3: Interpolation

In [None]:
# Create time series data for interpolation example
ts_data = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=10),
    'temperature': [20, 22, np.nan, np.nan, 28, 30, np.nan, 32, 31, 29]
})
print("Time series data with gaps:")
print(ts_data)

# Linear interpolation
ts_data['temp_linear'] = ts_data['temperature'].interpolate(method='linear')

# Polynomial interpolation
ts_data['temp_polynomial'] = ts_data['temperature'].interpolate(method='polynomial', order=2)

print("\nAfter interpolation:")
print(ts_data)

#### Method 4: Using sklearn Imputers

In [None]:
def impute_with_sklearn(df: pd.DataFrame) -> pd.DataFrame:
    """Demonstrate sklearn imputation methods.
    
    Args:
        df: DataFrame with missing values.
        
    Returns:
        DataFrame with imputed values.
    """
    numerical_cols = ['age', 'salary', 'years_experience']
    df_numeric = df[numerical_cols].copy()
    
    # SimpleImputer with mean strategy
    imputer_mean = SimpleImputer(strategy='mean')
    df_mean_imputed = pd.DataFrame(
        imputer_mean.fit_transform(df_numeric),
        columns=numerical_cols
    )
    print("SimpleImputer (mean strategy):")
    print(df_mean_imputed)
    
    # KNN Imputer (considers relationships between features)
    knn_imputer = KNNImputer(n_neighbors=2)
    df_knn_imputed = pd.DataFrame(
        knn_imputer.fit_transform(df_numeric),
        columns=numerical_cols
    )
    print("\nKNN Imputer:")
    print(df_knn_imputed)
    
    return df_knn_imputed

impute_with_sklearn(df)
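
In a modelling workflow the imputer is usually fitted on the training data only and then reused on new data, so that the fill value does not depend on rows the model should not have seen. A minimal sketch, assuming two small hypothetical frames `train` and `test`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets
train = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

# Learn the median from the training rows only
imputer = SimpleImputer(strategy='median')
imputer.fit(train)

# Apply the training median to the test rows
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)

print("Learned fill value:", imputer.statistics_)
print(test_imputed)
```

The missing test value is filled with the training median (30), not with a statistic computed from the test set.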

---

## 2. Detecting and Handling Outliers

An outlier is a data point that deviates significantly from other observations. Outliers can be caused by:
- Data entry errors
- Measurement errors
- Genuine extreme values

### 2.1 IQR (Interquartile Range) Method

In [None]:
# Create data with outliers
np.random.seed(42)
data_with_outliers = np.concatenate([
    np.random.normal(50, 10, 100),  # Normal data
    [150, 160, -20, -30]  # Outliers
])
df_outliers = pd.DataFrame({'value': data_with_outliers})

print("Data statistics:")
print(df_outliers.describe())

In [None]:
def detect_outliers_iqr(data: pd.Series, multiplier: float = 1.5) -> tuple:
    """Detect outliers using the IQR method.
    
    Args:
        data: Series containing numerical data.
        multiplier: IQR multiplier for bounds (default 1.5).
        
    Returns:
        Tuple of (lower_bound, upper_bound, outlier_mask).
    """
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    outlier_mask = (data < lower_bound) | (data > upper_bound)
    
    print(f"Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
    print(f"Number of outliers: {outlier_mask.sum()}")
    
    return lower_bound, upper_bound, outlier_mask

lower, upper, outliers = detect_outliers_iqr(df_outliers['value'])
print("\nOutlier values:")
print(df_outliers[outliers]['value'].values)
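
The arithmetic behind the IQR bounds is easier to follow on a tiny series. The values below are made up for illustration, and the quartiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical data: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)      # 3.25
q3 = values.quantile(0.75)      # 7.75
iqr = q3 - q1                   # 4.5

lower_bound = q1 - 1.5 * iqr    # 3.25 - 6.75 = -3.5
upper_bound = q3 + 1.5 * iqr    # 7.75 + 6.75 = 14.5

# Anything outside [lower_bound, upper_bound] is flagged
flagged = values[(values < lower_bound) | (values > upper_bound)]
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Flagged:", flagged.tolist())
```

Only the value 100 falls outside the bounds, so it is the single flagged point.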

### 2.2 Z-Score Method

In [None]:
def detect_outliers_zscore(data: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Detect outliers using the Z-score method.
    
    Args:
        data: Series containing numerical data.
        threshold: Z-score threshold (default 3.0).
        
    Returns:
        Boolean Series indicating outliers.
    """
    z_scores = np.abs(stats.zscore(data))
    outlier_mask = z_scores > threshold
    
    print(f"Z-score threshold: {threshold}")
    print(f"Number of outliers: {outlier_mask.sum()}")
    
    return outlier_mask

zscore_outliers = detect_outliers_zscore(df_outliers['value'])
print("\nOutlier values (Z-score method):")
print(df_outliers[zscore_outliers]['value'].values)

### 2.3 Handling Outliers

In [None]:
def handle_outliers(df: pd.DataFrame, column: str, method: str = 'cap') -> pd.DataFrame:
    """Handle outliers using different methods.
    
    Args:
        df: DataFrame containing the data.
        column: Column name to process.
        method: Method to handle outliers ('cap', 'remove', 'impute').
        
    Returns:
        DataFrame with handled outliers.
    """
    df_result = df.copy()
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    if method == 'cap':
        # Cap/Winsorise: Replace outliers with boundary values
        df_result[column] = df_result[column].clip(lower=lower, upper=upper)
    elif method == 'remove':
        # Remove outliers
        mask = (df_result[column] >= lower) & (df_result[column] <= upper)
        df_result = df_result[mask]
    elif method == 'impute':
        # Replace outliers with median
        median = df_result[column].median()
        outlier_mask = (df_result[column] < lower) | (df_result[column] > upper)
        df_result.loc[outlier_mask, column] = median
    
    return df_result

# Demonstrate each method
print("Original shape:", df_outliers.shape)
print("Original stats:")
print(df_outliers.describe().T[['min', 'max', 'mean']])

df_capped = handle_outliers(df_outliers, 'value', method='cap')
print("\nAfter capping:")
print(df_capped.describe().T[['min', 'max', 'mean']])

df_removed = handle_outliers(df_outliers, 'value', method='remove')
print(f"\nAfter removal (shape: {df_removed.shape}):")
print(df_removed.describe().T[['min', 'max', 'mean']])

---

## 3. Data Type Conversions

Ensuring correct data types is crucial for proper analysis and model training.

In [None]:
# Create sample data with incorrect types
df_types = pd.DataFrame({
    'id': ['001', '002', '003', '004'],
    'age': ['25', '30', '35', '28'],
    'salary': ['50000.50', '60000.75', '55000.00', '70000.25'],
    'is_active': ['True', 'False', 'True', 'True'],
    'join_date': ['2023-01-15', '2022-06-20', '2024-03-10', '2023-09-05']
})

print("Original data types:")
print(df_types.dtypes)
print("\nSample data:")
print(df_types)

In [None]:
def convert_data_types(df: pd.DataFrame) -> pd.DataFrame:
    """Convert columns to appropriate data types.
    
    Args:
        df: DataFrame with columns to convert.
        
    Returns:
        DataFrame with converted data types.
    """
    df_converted = df.copy()
    
    # Convert to integer
    df_converted['age'] = df_converted['age'].astype(int)
    
    # Convert to float
    df_converted['salary'] = df_converted['salary'].astype(float)
    
    # Convert to boolean
    df_converted['is_active'] = df_converted['is_active'].map({'True': True, 'False': False})
    
    # Convert to datetime
    df_converted['join_date'] = pd.to_datetime(df_converted['join_date'])
    
    # Convert to category (memory efficient for repeated values)
    df_converted['id'] = df_converted['id'].astype('category')
    
    return df_converted

df_converted = convert_data_types(df_types)
print("Converted data types:")
print(df_converted.dtypes)

In [None]:
# Handling conversion errors
messy_numbers = pd.Series(['100', '200', 'N/A', '400', 'missing', '600'])

# Using errors='coerce' converts invalid values to NaN
converted = pd.to_numeric(messy_numbers, errors='coerce')
print("Converting messy data with errors='coerce':")
print(converted)
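
Coercing invalid entries to `NaN` forces the column to `float64`. If whole numbers are wanted, pandas' nullable `Int64` dtype can hold integers alongside missing values. A small sketch with a hypothetical `messy` series:

```python
import pandas as pd

# Hypothetical column with one invalid entry
messy = pd.Series(['100', '200', 'N/A', '400'])

# Step 1: coerce to numeric; the invalid entry becomes NaN and the dtype is float64
as_float = pd.to_numeric(messy, errors='coerce')

# Step 2: cast to the nullable integer dtype; NaN becomes <NA>
as_nullable_int = as_float.astype('Int64')

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int)
```

Note the capital `I` in `'Int64'`: the lowercase `'int64'` is the NumPy dtype, which cannot represent missing values.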

---

## 4. Encoding Categorical Variables

Most machine learning models require numerical input, so categorical data must be encoded.

### 4.1 Label Encoding

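Converts each category to a unique integer. `LabelEncoder` assigns the integers in sorted (alphabetical) order, so the codes do not reflect any meaningful ranking. It is designed for target labels; on input features it is mainly acceptable for tree-based models. For ordinal data, use an explicit order instead (see 4.3).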

In [None]:
# Sample categorical data
df_cat = pd.DataFrame({
    'colour': ['red', 'blue', 'green', 'blue', 'red', 'green'],
    'size': ['small', 'medium', 'large', 'small', 'large', 'medium'],
    'quality': ['low', 'medium', 'high', 'medium', 'high', 'low']
})
print("Original categorical data:")
print(df_cat)

In [None]:
# Label Encoding
label_encoder = LabelEncoder()

df_label_encoded = df_cat.copy()
df_label_encoded['colour_encoded'] = label_encoder.fit_transform(df_cat['colour'])

print("Label Encoded (colour):")
print(df_label_encoded[['colour', 'colour_encoded']])
print("\nMapping:", dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))

### 4.2 One-Hot Encoding

Creates binary columns for each category. Best for nominal data (no inherent order).

In [None]:
# One-Hot Encoding with pandas
df_onehot = pd.get_dummies(df_cat['colour'], prefix='colour')
print("One-Hot Encoded (using pandas):")
print(df_onehot)

In [None]:
# One-Hot Encoding with sklearn
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity

colour_encoded = onehot_encoder.fit_transform(df_cat[['colour']])
feature_names = onehot_encoder.get_feature_names_out(['colour'])

df_onehot_sklearn = pd.DataFrame(colour_encoded, columns=feature_names)
print("One-Hot Encoded (sklearn with drop='first'):")
print(df_onehot_sklearn)

### 4.3 Ordinal Encoding

For categorical variables with a natural order.

In [None]:
# Manual ordinal encoding with custom order
size_order = {'small': 0, 'medium': 1, 'large': 2}
quality_order = {'low': 0, 'medium': 1, 'high': 2}

df_ordinal = df_cat.copy()
df_ordinal['size_encoded'] = df_cat['size'].map(size_order)
df_ordinal['quality_encoded'] = df_cat['quality'].map(quality_order)

print("Ordinal Encoded:")
print(df_ordinal)
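
The same ordering can be expressed with scikit-learn's `OrdinalEncoder`, which accepts an explicit category order per column. The sketch below contrasts it with `LabelEncoder`, which sorts categories alphabetically. The `sizes` frame is a hypothetical example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column
sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'small']})

# LabelEncoder sorts classes alphabetically: large=0, medium=1, small=2
alphabetical = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered = encoder.fit_transform(sizes[['size']]).ravel()

print("Alphabetical codes:", alphabetical.tolist())
print("Ordered codes:     ", ordered.tolist())
```

Unlike a plain dictionary mapping, a fitted encoder can be reused inside a scikit-learn pipeline.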

### 4.4 Target Encoding (Mean Encoding)

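Replaces categories with the mean of the target variable. Useful for high-cardinality features. Because the encoding is derived from the target, the category means should be computed on training data only; otherwise target information leaks into the features.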

In [None]:
def target_encode(df: pd.DataFrame, column: str, target: str) -> pd.Series:
    """Apply target encoding to a categorical column.
    
    Args:
        df: DataFrame containing the data.
        column: Categorical column to encode.
        target: Target column for calculating means.
        
    Returns:
        Series with target-encoded values.
    """
    target_means = df.groupby(column)[target].mean()
    return df[column].map(target_means)

# Example with target encoding
df_target = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Berlin', 'Paris', 'Berlin', 'London', 'Paris'],
    'price': [500, 400, 550, 300, 420, 320, 480, 450]
})

df_target['city_encoded'] = target_encode(df_target, 'city', 'price')
print("Target Encoded:")
print(df_target)
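
One common way to limit target leakage is out-of-fold encoding: each row is encoded with category means computed from the other folds. The function below is an illustrative sketch, not a reference implementation; `target_encode_oof`, its parameters, and the fallback to the global mean are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, column, target, n_splits=4, seed=0):
    """Out-of-fold target encoding (hypothetical helper for illustration)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for fit_idx, enc_idx in kfold.split(df):
        # Category means from the rows NOT being encoded in this fold
        fold_means = df.iloc[fit_idx].groupby(column)[target].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][column].map(fold_means).to_numpy()

    # Categories unseen in the fitting folds fall back to the global mean
    return encoded.fillna(global_mean)

prices_df = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Berlin', 'Paris', 'Berlin', 'London', 'Paris'],
    'price': [500, 400, 550, 300, 420, 320, 480, 450]
})
encoded_prices = target_encode_oof(prices_df, 'city', 'price')
print(prices_df.assign(city_oof=encoded_prices))
```

With only eight rows the fold means are noisy; on real data, smoothing towards the global mean is often added as well.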

---

## 5. Feature Scaling and Normalisation

Scaling puts features on comparable ranges, which matters for distance-based algorithms (e.g., k-NN, k-means) and for models trained with gradient descent.

### 5.1 StandardScaler (Z-score Normalisation)

Transforms data to have mean=0 and std=1.

In [None]:
# Sample data for scaling
df_scale = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50],
    'income': [30000, 50000, 60000, 80000, 100000, 150000],
    'score': [0.5, 0.6, 0.7, 0.8, 0.85, 0.9]
})
print("Original data:")
print(df_scale)
print("\nOriginal statistics:")
print(df_scale.describe().round(2))

In [None]:
# StandardScaler
scaler = StandardScaler()
df_standard = pd.DataFrame(
    scaler.fit_transform(df_scale),
    columns=df_scale.columns
)
print("After StandardScaler:")
print(df_standard.round(2))
print("\nMean:", df_standard.mean().round(4).values)
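# StandardScaler uses the population std (ddof=0); pandas defaults to the sample std (ddof=1)
print("Std:", df_standard.std(ddof=0).round(4).values)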
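
The transformation can be reproduced by hand as `z = (x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). The sketch below checks this against `StandardScaler` on a hypothetical `ages` frame and shows why pandas' default `std()` reports a value slightly above 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
ages = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50]})

# Manual z-score using the population standard deviation
manual = (ages - ages.mean()) / ages.std(ddof=0)

# The same transformation with scikit-learn
sklearn_scaled = StandardScaler().fit_transform(ages)

print("Matches sklearn:", np.allclose(manual.to_numpy(), sklearn_scaled))
print("Population std (ddof=0):", round(manual['age'].std(ddof=0), 4))
print("Sample std (ddof=1):    ", round(manual['age'].std(ddof=1), 4))
```

With six rows the sample standard deviation of the scaled column is `sqrt(6/5)`, about 1.0954, rather than exactly 1.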

### 5.2 MinMaxScaler (Min-Max Normalisation)

Scales data to a fixed range, typically [0, 1].

In [None]:
# MinMaxScaler
minmax_scaler = MinMaxScaler()
df_minmax = pd.DataFrame(
    minmax_scaler.fit_transform(df_scale),
    columns=df_scale.columns
)
print("After MinMaxScaler:")
print(df_minmax.round(2))
print("\nMin:", df_minmax.min().values)
print("Max:", df_minmax.max().values)

### 5.3 RobustScaler

Uses median and IQR, making it robust to outliers.

In [None]:
# RobustScaler (robust to outliers)
robust_scaler = RobustScaler()
df_robust = pd.DataFrame(
    robust_scaler.fit_transform(df_scale),
    columns=df_scale.columns
)
print("After RobustScaler:")
print(df_robust.round(2))
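
The formulas behind the other two scalers are short enough to verify by hand: min-max scaling computes `(x - min) / (max - min)`, and robust scaling computes `(x - median) / (Q3 - Q1)`. The sketch below compares manual results with scikit-learn's defaults on a hypothetical `incomes` frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single-feature data
incomes = pd.DataFrame({'income': [30000, 50000, 60000, 80000, 100000, 150000]})
x = incomes['income']

# Manual versions of the two formulas
manual_minmax = (x - x.min()) / (x.max() - x.min())
manual_robust = (x - x.median()) / (x.quantile(0.75) - x.quantile(0.25))

# scikit-learn versions with default settings
sk_minmax = MinMaxScaler().fit_transform(incomes).ravel()
sk_robust = RobustScaler().fit_transform(incomes).ravel()

print("MinMax matches:", np.allclose(manual_minmax, sk_minmax))
print("Robust matches:", np.allclose(manual_robust, sk_robust))
```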

### When to Use Which Scaler

| Scaler | Use When | Not Suitable For |
|--------|----------|------------------|
| StandardScaler | Data is approximately normally distributed | Data with significant outliers |
| MinMaxScaler | Need bounded values (e.g., neural networks) | Data with outliers |
| RobustScaler | Data contains outliers | When exact bounds are required |

---

## 6. Handling Duplicates

Duplicate records can skew analysis and model training.

In [None]:
# Create data with duplicates
df_dup = pd.DataFrame({
    'id': [1, 2, 3, 2, 4, 3, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Charlie', 'Eve'],
    'value': [100, 200, 300, 200, 400, 300, 500]
})
print("Data with duplicates:")
print(df_dup)

In [None]:
# Identify duplicates
print("Duplicate rows (all columns):")
print(df_dup[df_dup.duplicated(keep=False)])

print("\nDuplicate rows (based on 'id'):")
print(df_dup[df_dup.duplicated(subset=['id'], keep=False)])

In [None]:
def handle_duplicates(df: pd.DataFrame, subset: list = None, keep: str = 'first') -> pd.DataFrame:
    """Remove duplicate rows from a DataFrame.
    
    Args:
        df: DataFrame to process.
        subset: Columns to consider for identifying duplicates.
        keep: Which duplicate to keep ('first', 'last', False).
        
    Returns:
        DataFrame with duplicates removed.
    """
    n_before = len(df)
    df_clean = df.drop_duplicates(subset=subset, keep=keep)
    n_after = len(df_clean)
    
    print(f"Removed {n_before - n_after} duplicate rows")
    return df_clean

df_no_dup = handle_duplicates(df_dup, subset=['id'], keep='first')
print("\nAfter removing duplicates:")
print(df_no_dup)
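
Exact matching misses duplicates that differ only in case or whitespace. One possible approach is to build a normalised key first and deduplicate on that key. The `people` frame below is a hypothetical example.

```python
import pandas as pd

# Hypothetical records: the first two emails differ only in case and padding
people = pd.DataFrame({
    'email': ['a@x.com', ' A@X.com ', 'b@x.com'],
    'value': [1, 2, 3]
})

# Exact comparison finds no duplicates
exact_dupes = int(people.duplicated(subset=['email']).sum())

# Normalise the key, then look for duplicates on the normalised values
key = people['email'].str.strip().str.lower()
normalised_dupes = int(key.duplicated().sum())

# Keep the first row for each normalised key
deduped = people[~key.duplicated(keep='first')]

print("Exact duplicates:", exact_dupes)
print("Duplicates after normalising:", normalised_dupes)
print(deduped)
```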

---

## 7. String Cleaning and Text Preprocessing

Text data often requires cleaning before analysis.

In [None]:
# Sample messy text data
df_text = pd.DataFrame({
    'name': ['  Alice Smith  ', 'BOB JONES', 'charlie brown', '  David Lee'],
    'email': ['ALICE@EMAIL.COM', 'bob@email.com', 'Charlie@Email.Com', 'david@email.com'],
    'phone': ['123-456-7890', '(234) 567-8901', '345.678.9012', '456 789 0123'],
    'description': ['Good product!', 'Bad...    very bad!!!', 'OK  product', 'Great!!!   ']
})
print("Messy text data:")
print(df_text)

In [None]:
def clean_text_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean text columns in a DataFrame.
    
    Args:
        df: DataFrame with text columns.
        
    Returns:
        DataFrame with cleaned text.
    """
    df_clean = df.copy()
    
    # Strip whitespace
    df_clean['name'] = df_clean['name'].str.strip()
    
    # Convert to title case
    df_clean['name'] = df_clean['name'].str.title()
    
    # Convert to lowercase
    df_clean['email'] = df_clean['email'].str.lower()
    
    # Remove non-numeric characters from phone
    df_clean['phone'] = df_clean['phone'].str.replace(r'[^0-9]', '', regex=True)
    
    # Clean description: remove extra spaces and punctuation
    df_clean['description'] = (
        df_clean['description']
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)  # Multiple spaces to single
        .str.replace(r'[!]+', '!', regex=True)  # Multiple ! to single
        .str.replace(r'\.+', '.', regex=True)  # Multiple . to single
    )
    
    return df_clean

df_text_clean = clean_text_data(df_text)
print("Cleaned text data:")
print(df_text_clean)

In [None]:
# Additional text operations
sample_text = pd.Series(['Hello World', 'Python Programming', 'Data Science'])

print("Original:", sample_text.tolist())
print("Upper:", sample_text.str.upper().tolist())
print("Lower:", sample_text.str.lower().tolist())
print("Length:", sample_text.str.len().tolist())
print("Contains 'Python':", sample_text.str.contains('Python').tolist())
print("Replace:", sample_text.str.replace('World', 'Universe').tolist())
print("Split (first word):", sample_text.str.split().str[0].tolist())

---

## 8. Date Parsing and Feature Extraction

Datetime features can provide valuable information for models.

In [None]:
# Sample date data in various formats
df_dates = pd.DataFrame({
    'date_str': ['2025-01-15', '15/02/2025', 'March 20, 2025', '2025-04-25 14:30:00'],
    'timestamp': [1705312800, 1708041600, 1710892800, 1714052400]
})
print("Raw date data:")
print(df_dates)

In [None]:
def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Parse date strings to datetime objects.
    
    Args:
        df: DataFrame with date columns.
        
    Returns:
        DataFrame with parsed dates.
    """
    df_parsed = df.copy()
    
    # Parse date strings (format='mixed' infers the format per element; requires pandas >= 2.0)
    df_parsed['date_parsed'] = pd.to_datetime(df_parsed['date_str'], format='mixed')
    
    # Parse Unix timestamps
    df_parsed['timestamp_parsed'] = pd.to_datetime(df_parsed['timestamp'], unit='s')
    
    return df_parsed

df_dates_parsed = parse_dates(df_dates)
print("Parsed dates:")
print(df_dates_parsed)
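
Format inference can misread ambiguous strings such as `03/04/2025`. When the format of a column is known, passing it explicitly is safer, and `errors='coerce'` turns unparseable entries into `NaT`. A minimal sketch, assuming a hypothetical day-first column:

```python
import pandas as pd

# Hypothetical column known to use day/month/year
ambiguous = pd.Series(['03/04/2025', '15/02/2025', 'not a date'])

# An explicit format removes the day/month ambiguity; invalid strings become NaT
parsed = pd.to_datetime(ambiguous, format='%d/%m/%Y', errors='coerce')

print(parsed)
print("Unparseable entries:", int(parsed.isna().sum()))
```

Here `03/04/2025` is read as 3 April 2025, not 4 March 2025.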

In [None]:
def extract_date_features(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Extract features from a datetime column.
    
    Args:
        df: DataFrame with datetime column.
        date_col: Name of the datetime column.
        
    Returns:
        DataFrame with extracted date features.
    """
    df_features = df.copy()
    dt = df[date_col].dt
    
    # Basic components
    df_features['year'] = dt.year
    df_features['month'] = dt.month
    df_features['day'] = dt.day
    df_features['hour'] = dt.hour
    
    # Derived features
    df_features['day_of_week'] = dt.dayofweek  # Monday=0, Sunday=6
    df_features['day_name'] = dt.day_name()
    df_features['is_weekend'] = dt.dayofweek >= 5
    df_features['quarter'] = dt.quarter
    df_features['week_of_year'] = dt.isocalendar().week
    df_features['day_of_year'] = dt.dayofyear
    
    # Is it a month start/end?
    df_features['is_month_start'] = dt.is_month_start
    df_features['is_month_end'] = dt.is_month_end
    
    return df_features

df_date_features = extract_date_features(df_dates_parsed, 'date_parsed')
print("Extracted date features:")
print(df_date_features[['date_parsed', 'year', 'month', 'day', 'day_name', 'is_weekend', 'quarter']])
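
Components such as month or hour are cyclical: December (12) and January (1) are adjacent, although their integer codes are far apart. A common optional extra is to encode such features with sine and cosine. The sketch below is illustrative; the column names `month_sin` and `month_cos` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical month values
months = pd.Series([1, 2, 6, 12], name='month')

# Map months onto a circle: January at angle 0, one twelfth of a turn per month
angle = 2 * np.pi * (months - 1) / 12
cyc = pd.DataFrame({
    'month': months,
    'month_sin': np.sin(angle),
    'month_cos': np.cos(angle),
})

def distance(i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    diff = cyc.loc[i, ['month_sin', 'month_cos']] - cyc.loc[j, ['month_sin', 'month_cos']]
    return float(np.sqrt((diff.astype(float) ** 2).sum()))

dec_to_jan = distance(3, 0)
jan_to_feb = distance(0, 1)
print(cyc.round(3))
print("Dec-Jan distance:", round(dec_to_jan, 4))
print("Jan-Feb distance:", round(jan_to_feb, 4))
```

In the encoded space December is as close to January as January is to February.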

---

## 9. Binning and Discretisation

Converting continuous variables into categorical bins.

In [None]:
# Sample continuous data
np.random.seed(42)
df_bins = pd.DataFrame({
    'age': np.random.randint(18, 80, 20),
    'income': np.random.randint(20000, 150000, 20)
})
print("Continuous data:")
print(df_bins.head(10))

In [None]:
# Equal-width binning with pd.cut
df_bins['age_bins_equal'] = pd.cut(df_bins['age'], bins=4, labels=['Young', 'Adult', 'Middle', 'Senior'])
print("Equal-width binning (age):")
print(df_bins[['age', 'age_bins_equal']].head(10))

In [None]:
# Equal-frequency binning with pd.qcut
df_bins['income_quantile'] = pd.qcut(df_bins['income'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print("Equal-frequency binning (income):")
print(df_bins[['income', 'income_quantile']].head(10))

In [None]:
# Custom bins
age_bins = [0, 25, 35, 50, 65, 100]
age_labels = ['Gen Z', 'Millennial', 'Gen X', 'Boomer', 'Silent']
df_bins['age_custom'] = pd.cut(df_bins['age'], bins=age_bins, labels=age_labels)
print("Custom binning:")
print(df_bins[['age', 'age_custom']].head(10))
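
Values that sit exactly on a bin edge are assigned according to the `right` parameter of `pd.cut`. By default intervals are closed on the right, e.g. `(25, 35]`; with `right=False` they are closed on the left, e.g. `[25, 35)`. The labels below are placeholder names chosen for this sketch.

```python
import pandas as pd

# Hypothetical ages that sit exactly on bin edges
edge_ages = pd.Series([25, 35, 50])
edges = [0, 25, 35, 50, 65]
names = ['bin_1', 'bin_2', 'bin_3', 'bin_4']

# Default: intervals (0, 25], (25, 35], (35, 50], (50, 65]
right_closed = pd.cut(edge_ages, bins=edges, labels=names)

# right=False: intervals [0, 25), [25, 35), [35, 50), [50, 65)
left_closed = pd.cut(edge_ages, bins=edges, labels=names, right=False)

print("right=True: ", list(right_closed))
print("right=False:", list(left_closed))
```

Each edge value moves up one bin when the intervals are closed on the left, so the choice should match how the bin labels are defined.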

In [None]:
# Using sklearn KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df_bins['income_sklearn'] = discretizer.fit_transform(df_bins[['income']]).ravel()  # flatten the (n, 1) output to 1-D
print("KBinsDiscretizer (quantile strategy):")
print(df_bins[['income', 'income_sklearn']].head(10))

---

## 10. Feature Engineering Basics

Creating new features from existing data to improve model performance.

In [None]:
# Sample data for feature engineering
df_eng = pd.DataFrame({
    'length': [10, 20, 15, 25, 30],
    'width': [5, 10, 8, 12, 15],
    'height': [2, 4, 3, 5, 6],
    'price': [100, 200, 150, 300, 400],
    'quantity': [10, 5, 8, 3, 2]
})
print("Original data:")
print(df_eng)

In [None]:
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create new features from existing columns.
    
    Args:
        df: DataFrame with source columns.
        
    Returns:
        DataFrame with engineered features.
    """
    df_new = df.copy()
    
    # Mathematical combinations
    df_new['area'] = df['length'] * df['width']
    df_new['volume'] = df['length'] * df['width'] * df['height']
    
    # Ratios
    df_new['aspect_ratio'] = df['length'] / df['width']
    df_new['price_per_unit'] = df['price'] / df['quantity']
    
    # Power transformations
    df_new['price_squared'] = df['price'] ** 2
    df_new['price_sqrt'] = np.sqrt(df['price'])
    
    # Log transformation (useful for skewed data)
    df_new['price_log'] = np.log1p(df['price'])  # log1p handles zero values
    
    # Products of columns
    df_new['total_value'] = df['price'] * df['quantity']
    
    return df_new

df_engineered = engineer_features(df_eng)
print("Engineered features:")
print(df_engineered)

In [None]:
# Interaction features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
features = df_eng[['length', 'width']]
poly_features = poly.fit_transform(features)

df_poly = pd.DataFrame(
    poly_features,
    columns=poly.get_feature_names_out(['length', 'width'])
)
print("Polynomial interaction features:")
print(df_poly)

---

## 11. Practice Questions

Test your understanding with the following practice questions. Try to solve each problem before revealing the answer.

### Question 1: Missing Value Analysis

Given the following DataFrame, write a function that:
1. Calculates the percentage of missing values for each column
2. Returns only columns with more than 20% missing values

In [None]:
# Setup data for Question 1
np.random.seed(42)
q1_data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, np.nan, 9, 10],
    'B': [np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan, 8, np.nan, 10],
    'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
    'D': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
print("Question 1 Data:")
print(q1_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def get_high_missing_columns(df: pd.DataFrame, threshold: float = 20.0) -> pd.Series:
    """Find columns with missing values above threshold.
    
    Args:
        df: DataFrame to analyse.
        threshold: Percentage threshold for missing values.
        
    Returns:
        Series with missing percentages for qualifying columns.
    """
    missing_pct = (df.isnull().sum() / len(df) * 100)
    return missing_pct[missing_pct > threshold]

result = get_high_missing_columns(q1_data, threshold=20.0)
print("Columns with >20% missing values:")
print(result)
```

</details>

### Question 2: Outlier Detection and Treatment

Write a function that:
1. Detects outliers using the IQR method
2. Replaces outliers with the median value
3. Returns the cleaned DataFrame and the number of outliers replaced

In [None]:
# Setup data for Question 2
np.random.seed(42)
q2_data = pd.DataFrame({
    'values': list(np.random.normal(50, 10, 20)) + [150, -30, 200]
})
print("Question 2 Data (with outliers):")
print(q2_data.describe())

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def replace_outliers_with_median(df: pd.DataFrame, column: str) -> tuple:
    """Replace outliers with median using IQR method.
    
    Args:
        df: DataFrame containing the data.
        column: Column name to process.
        
    Returns:
        Tuple of (cleaned DataFrame, number of outliers replaced).
    """
    df_clean = df.copy()
    
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outlier_mask = (df[column] < lower_bound) | (df[column] > upper_bound)
    n_outliers = outlier_mask.sum()
    
    median = df[column].median()
    df_clean.loc[outlier_mask, column] = median
    
    return df_clean, n_outliers

cleaned_df, n_replaced = replace_outliers_with_median(q2_data, 'values')
print(f"Number of outliers replaced: {n_replaced}")
print("\nCleaned data statistics:")
print(cleaned_df.describe())
```

</details>

### Question 3: Data Type Validation

Write a function that validates and converts data types:
1. Convert 'age' to integer (handle invalid values)
2. Convert 'date' to datetime
3. Convert 'price' to float
4. Return the number of conversion errors for each column

In [None]:
# Setup data for Question 3
q3_data = pd.DataFrame({
    'age': ['25', '30', 'invalid', '40', 'N/A'],
    'date': ['2025-01-15', '2025-02-20', 'bad_date', '2025-04-10', '2025-05-25'],
    'price': ['100.50', '200.75', '300.00', 'free', '500.25']
})
print("Question 3 Data:")
print(q3_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def validate_and_convert(df: pd.DataFrame) -> tuple:
    """Validate and convert data types with error tracking.
    
    Args:
        df: DataFrame with columns to convert.
        
    Returns:
        Tuple of (converted DataFrame, dictionary of error counts).
    """
    df_converted = df.copy()
    errors = {}
    
    # Age to integer (nullable Int64 keeps invalid entries as <NA>)
    df_converted['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')
    errors['age'] = df_converted['age'].isna().sum()
    
    # Date to datetime
    df_converted['date'] = pd.to_datetime(df['date'], errors='coerce')
    errors['date'] = df_converted['date'].isna().sum()
    
    # Price to float
    df_converted['price'] = pd.to_numeric(df['price'], errors='coerce')
    errors['price'] = df_converted['price'].isna().sum()
    
    return df_converted, errors

converted_df, error_counts = validate_and_convert(q3_data)
print("Converted DataFrame:")
print(converted_df)
print("\nConversion errors per column:")
print(error_counts)
```

</details>

### Question 4: Categorical Encoding Pipeline

Create a function that:
1. Applies one-hot encoding to nominal columns
2. Applies ordinal encoding to ordinal columns with specified order
3. Returns the encoded DataFrame

In [None]:
# Setup data for Question 4
q4_data = pd.DataFrame({
    'city': ['London', 'Paris', 'Berlin', 'London', 'Paris'],
    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'],
    'colour': ['red', 'blue', 'green', 'red', 'blue']
})
print("Question 4 Data:")
print(q4_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def encode_categorical_columns(
    df: pd.DataFrame,
    nominal_cols: list,
    ordinal_cols: dict
) -> pd.DataFrame:
    """Encode categorical columns with appropriate methods.
    
    Args:
        df: DataFrame with categorical columns.
        nominal_cols: List of columns for one-hot encoding.
        ordinal_cols: Dict mapping column names to ordered categories.
        
    Returns:
        DataFrame with encoded columns.
    """
    df_encoded = df.copy()
    
    # One-hot encode nominal columns
    for col in nominal_cols:
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df_encoded = pd.concat([df_encoded, dummies], axis=1)
        df_encoded = df_encoded.drop(col, axis=1)
    
    # Ordinal encode with custom order
    for col, order in ordinal_cols.items():
        mapping = {cat: i for i, cat in enumerate(order)}
        df_encoded[f'{col}_encoded'] = df[col].map(mapping)
        df_encoded = df_encoded.drop(col, axis=1)
    
    return df_encoded

result = encode_categorical_columns(
    q4_data,
    nominal_cols=['city', 'colour'],
    ordinal_cols={'education': ['Bachelor', 'Master', 'PhD']}
)
print("Encoded DataFrame:")
print(result)
```

</details>
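
As an alternative to the dictionary mapping in the answer above, pandas can carry the ordering in the column's dtype. The sketch below is illustrative only: the `education` Series and the extra `'Diploma'` value are made up here to show what happens to a category missing from the declared order.

```python
import pandas as pd

# Hypothetical sample mirroring the Question 4 'education' column,
# plus one value ('Diploma') that is not in the declared order.
education = pd.Series(['Bachelor', 'Master', 'PhD', 'Diploma'])
order = ['Bachelor', 'Master', 'PhD']

# An ordered categorical stores the ranking in the dtype itself
edu_cat = pd.Categorical(education, categories=order, ordered=True)

# .codes gives the integer rank; values outside `order` are coded as -1
codes = pd.Series(edu_cat.codes, index=education.index)
print(codes.tolist())  # [0, 1, 2, -1]
```

With the `map` approach the unknown value would become `NaN` instead of `-1`, so in either case it is worth checking for unmapped categories before modelling.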

### Question 5: Feature Scaling Comparison

Write a function that:
1. Applies StandardScaler, MinMaxScaler, and RobustScaler to the data
2. Returns a comparison of the scaled values showing min, max, mean, and std for each method

In [None]:
# Setup data for Question 5
q5_data = pd.DataFrame({
    'feature': [10, 20, 30, 100, 50, 60, 70, 80, 90, 1000]  # Note the outlier
})
print("Question 5 Data:")
print(q5_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def compare_scalers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Compare different scaling methods.
    
    Args:
        df: DataFrame with data to scale.
        column: Column name to scale.
        
    Returns:
        DataFrame comparing scaler statistics.
    """
    data = df[[column]]
    
    scalers = {
        'StandardScaler': StandardScaler(),
        'MinMaxScaler': MinMaxScaler(),
        'RobustScaler': RobustScaler()
    }
    
    results = {'Original': data[column]}
    
    for name, scaler in scalers.items():
        scaled = scaler.fit_transform(data)
        results[name] = scaled.flatten()
    
    comparison = pd.DataFrame(results)
    
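    # ddof=0 gives the population standard deviation, matching the
    # convention StandardScaler uses (its scaled column then shows Std = 1).
    # Named `summary` to avoid shadowing the `scipy.stats` import.
    summary = pd.DataFrame({
        'Min': comparison.min(),
        'Max': comparison.max(),
        'Mean': comparison.mean(),
        'Std': comparison.std(ddof=0)
    }).round(3)
    
    return summary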

comparison_result = compare_scalers(q5_data, 'feature')
print("Scaler Comparison:")
print(comparison_result)
```

</details>
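
To see why `RobustScaler` copes better with the outlier in this data, it can help to reproduce its arithmetic by hand. The sketch below assumes the default settings (centre on the median, scale by the 25th to 75th percentile range) and reuses the Question 5 values.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

values = np.array([10, 20, 30, 100, 50, 60, 70, 80, 90, 1000], dtype=float)

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

# Compare with scikit-learn's implementation using default parameters
sk_scaled = RobustScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(f"median = {median}, IQR = {q3 - q1}")  # median = 65.0, IQR = 52.5
print(np.allclose(manual, sk_scaled))         # True
```

Because the median and IQR barely move when 1000 is added, the other nine values keep a usable spread, whereas min-max scaling would compress them into a narrow band near zero.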

### Question 6: Duplicate Detection and Aggregation

Write a function that:
1. Identifies duplicate entries based on a subset of columns
2. Aggregates the duplicates by taking the mean of numerical columns
3. Returns the deduplicated DataFrame with aggregated values

In [None]:
# Setup data for Question 6
q6_data = pd.DataFrame({
    'customer_id': [1, 2, 1, 3, 2, 1],
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'quantity': [10, 5, 15, 8, 7, 20],
    'price': [100, 200, 100, 150, 200, 100]
})
print("Question 6 Data:")
print(q6_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def aggregate_duplicates(
    df: pd.DataFrame,
    group_cols: list,
    agg_cols: list
) -> pd.DataFrame:
    """Aggregate duplicate rows by specified columns.
    
    Args:
        df: DataFrame with potential duplicates.
        group_cols: Columns to group by.
        agg_cols: Columns to aggregate (mean).
        
    Returns:
        Aggregated DataFrame without duplicates.
    """
    agg_dict = {col: 'mean' for col in agg_cols}
    
    df_aggregated = df.groupby(group_cols, as_index=False).agg(agg_dict)
    
    return df_aggregated

result = aggregate_duplicates(
    q6_data,
    group_cols=['customer_id', 'product'],
    agg_cols=['quantity', 'price']
)
print("Aggregated DataFrame:")
print(result)
```

</details>
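
The answer above goes straight to aggregation. For step 1 of the question (identifying the duplicates), one possible sketch uses `DataFrame.duplicated` with `keep=False`. The `orders` frame is a trimmed copy of the Question 6 data.

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3, 2, 1],
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'quantity': [10, 5, 15, 8, 7, 20],
})
key_cols = ['customer_id', 'product']

# keep=False flags every member of a duplicated group, not just the repeats
dup_mask = orders.duplicated(subset=key_cols, keep=False)
duplicate_rows = orders[dup_mask].sort_values(key_cols)

# Group sizes show how many rows will be collapsed into each output row
group_sizes = orders.groupby(key_cols).size()

print(duplicate_rows)
print(group_sizes)
```

Inspecting `duplicate_rows` before aggregating makes it easier to judge whether a mean is the right summary or whether a sum or the latest record would be more appropriate.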

### Question 7: Text Cleaning Pipeline

Create a function that cleans text data by:
1. Converting to lowercase
2. Removing special characters (keep only alphanumeric and spaces)
3. Removing extra whitespace
4. Removing leading/trailing whitespace

In [None]:
# Setup data for Question 7
q7_data = pd.DataFrame({
    'text': [
        '  Hello WORLD!!!  ',
        'This   is   a    TEST...',
        '@User: Check #this out!!!',
        'Email: test@email.com',
        '   Multiple   Spaces   Here   '
    ]
})
print("Question 7 Data:")
print(q7_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
import re

def clean_text_column(df: pd.DataFrame, column: str) -> pd.Series:
    """Clean text data in a DataFrame column.
    
    Args:
        df: DataFrame containing text data.
        column: Name of the text column to clean.
        
    Returns:
        Series with cleaned text.
    """
    cleaned = (
        df[column]
        .str.lower()  # Convert to lowercase
        .str.replace(r'[^a-z0-9\s]', '', regex=True)  # Remove special characters
        .str.replace(r'\s+', ' ', regex=True)  # Remove extra whitespace
        .str.strip()  # Remove leading/trailing whitespace
    )
    return cleaned

q7_data['cleaned_text'] = clean_text_column(q7_data, 'text')
print("Cleaned text:")
print(q7_data)
```

</details>
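
The same four steps can be checked on individual strings with the standard `re` module, which is handy for unit-testing the rules before applying them to a whole column. The helper name `clean_text` is an assumption made for this sketch.

```python
import re

def clean_text(text: str) -> str:
    """Apply the Question 7 cleaning steps to a single string."""
    text = text.lower()                      # 1. lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # 2. drop special characters
    text = re.sub(r'\s+', ' ', text)         # 3. collapse whitespace runs
    return text.strip()                      # 4. trim the ends

print(clean_text('  Hello WORLD!!!  '))         # hello world
print(clean_text('@User: Check #this out!!!'))  # user check this out
print(clean_text('Email: test@email.com'))      # email testemailcom
```

The last line shows a side effect worth mentioning in an interview: removing punctuation glues the parts of an email address together, so structured tokens may need extracting before this kind of cleaning.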

### Question 8: Date Feature Engineering

Write a function that extracts the following features from a date column:
1. Year, month, day
2. Day of week (as number and name)
3. Is weekend (boolean)
4. Days since a reference date (e.g., '2020-01-01')

In [None]:
# Setup data for Question 8
q8_data = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-15', '2025-03-20', '2025-06-07', '2025-09-13', '2025-12-25'])
})
print("Question 8 Data:")
print(q8_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def extract_date_features(
    df: pd.DataFrame,
    date_col: str,
    reference_date: str = '2020-01-01'
) -> pd.DataFrame:
    """Extract features from a datetime column.
    
    Args:
        df: DataFrame with datetime column.
        date_col: Name of the datetime column.
        reference_date: Reference date for calculating days since.
        
    Returns:
        DataFrame with extracted date features.
    """
    df_features = df.copy()
    dt = df[date_col].dt
    ref = pd.to_datetime(reference_date)
    
    df_features['year'] = dt.year
    df_features['month'] = dt.month
    df_features['day'] = dt.day
    df_features['day_of_week'] = dt.dayofweek
    df_features['day_name'] = dt.day_name()
    df_features['is_weekend'] = dt.dayofweek >= 5
    df_features['days_since_ref'] = (df[date_col] - ref).dt.days
    
    return df_features

result = extract_date_features(q8_data, 'date', '2020-01-01')
print("Date features:")
print(result)
```

</details>
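
A quick way to sanity-check the extracted features is to recompute a couple of them with the standard library. This sketch uses `datetime.date`, whose `weekday()` follows the same Monday = 0 convention as pandas' `dayofweek`.

```python
from datetime import date

ref = date(2020, 1, 1)
first = date(2025, 1, 15)

# Days since the reference date: plain date subtraction gives a timedelta
days_since = (first - ref).days
print(days_since)       # 1841

# Monday = 0 ... Sunday = 6, so 2 is a Wednesday
weekday = first.weekday()
print(weekday)          # 2

# 7 June 2025 falls on a Saturday, so it counts as a weekend
is_weekend = date(2025, 6, 7).weekday() >= 5
print(is_weekend)       # True
```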

### Question 9: Custom Binning with Labels

Write a function that:
1. Bins a continuous variable into custom ranges
2. Assigns meaningful labels to each bin
3. Returns the binned column and a summary of the bin distribution

In [None]:
# Setup data for Question 9
np.random.seed(42)
q9_data = pd.DataFrame({
    'income': np.random.randint(15000, 200000, 50)
})
print("Question 9 Data:")
print(q9_data.describe())

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def create_income_bins(
    df: pd.DataFrame,
    column: str,
    bins: list = None,
    labels: list = None
) -> tuple:
    """Bin income data into categories.
    
    Args:
        df: DataFrame with income data.
        column: Name of the income column.
        bins: List of bin edges.
        labels: List of labels for bins.
        
    Returns:
        Tuple of (binned Series, distribution summary).
    """
    if bins is None:
        bins = [0, 30000, 50000, 75000, 100000, float('inf')]
    if labels is None:
        labels = ['Low', 'Lower-Middle', 'Middle', 'Upper-Middle', 'High']
    
    binned = pd.cut(df[column], bins=bins, labels=labels)
    distribution = binned.value_counts().sort_index()
    
    return binned, distribution

q9_data['income_bracket'], dist = create_income_bins(q9_data, 'income')
print("Binned data sample:")
print(q9_data.head(10))
print("\nDistribution:")
print(dist)
```

</details>
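
One detail that often comes up with `pd.cut` is which side of each bin is closed. By default the right edge is included, so a value exactly on a boundary falls into the lower bin. The sketch below uses three made-up incomes placed on or next to the bin edges.

```python
import pandas as pd

incomes = pd.Series([30000, 50000, 50001])
bins = [0, 30000, 50000, 75000]
labels = ['Low', 'Lower-Middle', 'Middle']

# Default right=True: intervals are (0, 30000], (30000, 50000], (50000, 75000]
right_closed = pd.cut(incomes, bins=bins, labels=labels)

# right=False: intervals are [0, 30000), [30000, 50000), [50000, 75000)
left_closed = pd.cut(incomes, bins=bins, labels=labels, right=False)

print(right_closed.tolist())  # ['Low', 'Lower-Middle', 'Middle']
print(left_closed.tolist())   # ['Lower-Middle', 'Middle', 'Middle']
```

With the default setting a value of exactly 0 would fall outside the first bin and become `NaN`; passing `include_lowest=True` is one way to keep it.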

### Question 10: Complete Data Preprocessing Pipeline

Write a comprehensive preprocessing function that:
1. Handles missing values (numeric: median, categorical: mode)
2. Handles outliers using the IQR method (cap or remove)
3. Encodes categorical variables (one-hot)
4. Scales numeric features (StandardScaler)
5. Returns the preprocessed DataFrame

In [None]:
# Setup data for Question 10
np.random.seed(42)
q10_data = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 150, 30, np.nan, 45, 50, 28],  # 150 is outlier
    'income': [50000, 60000, np.nan, 80000, 70000, 55000, 65000, 90000, np.nan, 52000],
    'category': ['A', 'B', 'A', None, 'C', 'B', 'A', 'C', 'B', 'A']
})
print("Question 10 Data:")
print(q10_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Complete data preprocessing pipeline.
    
    Args:
        df: Raw DataFrame to preprocess.
        
    Returns:
        Preprocessed DataFrame.
    """
    df_processed = df.copy()
    
    # Identify column types
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    
    # Step 1: Handle missing values
    for col in numeric_cols:
        df_processed[col] = df_processed[col].fillna(df_processed[col].median())
    
    for col in categorical_cols:
        mode_val = df_processed[col].mode()
        if len(mode_val) > 0:
            df_processed[col] = df_processed[col].fillna(mode_val[0])
    
    # Step 2: Cap outliers at the IQR fences (clip keeps every row; values are limited, not dropped)
    for col in numeric_cols:
        Q1 = df_processed[col].quantile(0.25)
        Q3 = df_processed[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df_processed[col] = df_processed[col].clip(lower=lower, upper=upper)
    
    # Step 3: One-hot encode categorical columns
    df_processed = pd.get_dummies(df_processed, columns=categorical_cols, drop_first=True)
    
    # Step 4: Scale numeric features
    scaler = StandardScaler()
    df_processed[numeric_cols] = scaler.fit_transform(df_processed[numeric_cols])
    
    return df_processed

result = preprocess_data(q10_data)
print("Preprocessed DataFrame:")
print(result)
```

</details>
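
The answer above limits extreme values with `clip`, which keeps every row. If rows with outliers should be dropped instead, a filter on the same IQR fences is one option. The sketch below uses a small made-up `ages` frame and is not part of the original solution.

```python
import pandas as pd

ages = pd.DataFrame({'age': [25, 35, 40, 150, 30, 45, 50, 28]})

# Same IQR fences as in the pipeline
q1 = ages['age'].quantile(0.25)
q3 = ages['age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences (between is inclusive on both ends)
within_fences = ages['age'].between(lower, upper)
ages_filtered = ages[within_fences].reset_index(drop=True)

print(len(ages), '->', len(ages_filtered))  # 8 -> 7
```

Dropping rows shrinks the dataset and, with several numeric columns, can remove a large share of it, so capping is often the safer default on small samples.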

### Question 11: Analysing Missing Data Patterns

Write a function that:
1. Analyses the pattern of missing data (MCAR, MAR, MNAR)
2. Returns a summary indicating whether missing values appear to be random or correlated with other features

In [None]:
# Setup data for Question 11
np.random.seed(42)
n = 100
q11_data = pd.DataFrame({
    'income': np.random.normal(50000, 15000, n),
    'age': np.random.randint(20, 70, n).astype(float)  # float dtype so NaN can be assigned below
})
# Make missing values correlated with income (high income -> more likely missing age)
missing_mask = q11_data['income'] > 60000
q11_data.loc[missing_mask, 'age'] = np.where(
    np.random.random(missing_mask.sum()) < 0.7,
    np.nan,
    q11_data.loc[missing_mask, 'age']
)
print("Question 11 Data sample:")
print(q11_data.head(20))
print(f"\nMissing age values: {q11_data['age'].isna().sum()}")

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def analyse_missing_pattern(
    df: pd.DataFrame,
    target_col: str
) -> dict:
    """Analyse the pattern of missing data.
    
    Args:
        df: DataFrame to analyse.
        target_col: Column with missing values to analyse.
        
    Returns:
        Dictionary with analysis results.
    """
    results = {}
    
    # Create missing indicator
    df_analysis = df.copy()
    df_analysis['is_missing'] = df[target_col].isna().astype(int)
    
    # Get numeric columns (excluding target)
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col in numeric_cols:
        numeric_cols.remove(target_col)
    
    # Check correlation between missingness and other features
    correlations = {}
    for col in numeric_cols:
        corr = df_analysis[[col, 'is_missing']].corr().iloc[0, 1]
        correlations[col] = corr
    
    results['correlations'] = correlations
    
    # Compare means of other features for missing vs non-missing
    mean_comparison = {}
    for col in numeric_cols:
        mean_missing = df_analysis[df_analysis['is_missing'] == 1][col].mean()
        mean_not_missing = df_analysis[df_analysis['is_missing'] == 0][col].mean()
        mean_comparison[col] = {
            'mean_when_missing': mean_missing,
            'mean_when_present': mean_not_missing,
            'difference': mean_missing - mean_not_missing
        }
    
    results['mean_comparison'] = mean_comparison
    
    # Determine likely pattern (heuristic thresholds; MNAR cannot be
    # confirmed or excluded from the observed data alone)
    max_corr = max(abs(c) for c in correlations.values()) if correlations else 0
    if max_corr < 0.1:
        results['likely_pattern'] = 'Consistent with MCAR (Missing Completely At Random)'
    elif max_corr < 0.3:
        results['likely_pattern'] = 'Possibly MAR (Missing At Random)'
    else:
        results['likely_pattern'] = 'Likely MAR (Missing At Random); MNAR (Missing Not At Random) cannot be ruled out'
    
    return results

analysis = analyse_missing_pattern(q11_data, 'age')
print("Missing data analysis:")
print(f"\nCorrelations with missingness: {analysis['correlations']}")
print(f"\nMean comparison: {analysis['mean_comparison']}")
print(f"\nLikely pattern: {analysis['likely_pattern']}")
```

</details>
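
The correlation thresholds in the answer are rules of thumb. A complementary check is a two-sample test on an observed feature, split by whether the target is missing. The sketch below generates its own data with an assumed mechanism (age dropped for about 70% of high earners) and applies Welch's t-test via `scipy.stats.ttest_ind`.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'income': rng.normal(50000, 15000, n),
    'age': rng.integers(20, 70, n).astype(float),
})

# Assumed mechanism for illustration: age is missing for ~70% of high earners
high_income = demo['income'] > 60000
drop_age = high_income & (rng.random(n) < 0.7)
demo.loc[drop_age, 'age'] = np.nan

# Welch's t-test: does income differ between rows with and without age?
is_missing = demo['age'].isna()
t_stat, p_value = stats.ttest_ind(
    demo.loc[is_missing, 'income'],
    demo.loc[~is_missing, 'income'],
    equal_var=False,
)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A small p-value is evidence against MCAR, because missingness is related to an observed variable. It does not distinguish MAR from MNAR, which would require knowledge of the unobserved values themselves.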

### Question 12: Multi-Column Imputation Strategy

Implement a function that:
1. Uses KNN imputation for numerical columns
2. Uses mode imputation for categorical columns
3. Tracks which values were imputed by creating indicator columns

In [None]:
# Setup data for Question 12
q12_data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50, 45],
    'income': [50000, np.nan, 70000, 80000, np.nan, 60000, 90000, 75000],
    'education': ['Bachelor', 'Master', 'PhD', None, 'Bachelor', 'Master', None, 'PhD'],
    'city': ['London', 'Paris', None, 'Berlin', 'London', None, 'Paris', 'Berlin']
})
print("Question 12 Data:")
print(q12_data)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
from sklearn.impute import KNNImputer

def impute_with_tracking(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values with tracking indicators.
    
    Args:
        df: DataFrame with missing values.
        
    Returns:
        DataFrame with imputed values and indicator columns.
    """
    df_result = df.copy()
    
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    
    # Create missing indicators
    for col in df.columns:
        df_result[f'{col}_was_missing'] = df[col].isna().astype(int)
    
    # KNN imputation for numerical columns
    if numeric_cols:
        knn_imputer = KNNImputer(n_neighbors=3)
        df_result[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
    
    # Mode imputation for categorical columns
    for col in categorical_cols:
        mode_val = df[col].mode()
        if len(mode_val) > 0:
            df_result[col] = df_result[col].fillna(mode_val[0])
    
    return df_result

result = impute_with_tracking(q12_data)
print("Imputed DataFrame with tracking:")
print(result)
```

</details>
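
`KNNImputer` finds neighbours by distance, so a column with a large range such as income can dominate the choice of neighbours. One possible refinement, sketched below, is to standardise before imputing and then invert the scaling. The `tenure_years` column is hypothetical and added only so that more than one feature contributes to each distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

people = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, np.nan, 50, 45],
    'income': [50000, np.nan, 70000, 80000, np.nan, 60000, 90000, 75000],
    'tenure_years': [2, 5, 8, 12, 7, 4, 20, 15],  # hypothetical extra feature
})

# StandardScaler ignores NaNs when fitting and leaves them in place on transform
scaler = StandardScaler()
scaled = scaler.fit_transform(people)

# Impute in the scaled space, where each column has a comparable range
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)

# Map the result back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=people.columns,
    index=people.index,
)
print(imputed.round(1))
```

Values that were observed come back unchanged (up to floating-point rounding), and only the gaps are filled from neighbours chosen on the scaled features.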

---

## Summary

This notebook covered the essential data cleaning and preprocessing techniques:

1. **Missing Values**: `dropna()`, `fillna()`, interpolation, sklearn imputers
2. **Outliers**: IQR method, Z-score, capping/removal/imputation
3. **Data Types**: Type conversion, handling conversion errors
4. **Encoding**: Label encoding, one-hot encoding, ordinal encoding, target encoding
5. **Scaling**: StandardScaler, MinMaxScaler, RobustScaler
6. **Duplicates**: Detection and removal strategies
7. **Text Cleaning**: String operations, regex cleaning
8. **Date Features**: Parsing and feature extraction
9. **Binning**: Equal-width, equal-frequency, custom bins
10. **Feature Engineering**: Mathematical transformations, polynomial features

### Key Interview Tips

- Always explore your data first before deciding on preprocessing steps
- Document your preprocessing decisions and rationale
- Consider the impact of preprocessing on model interpretability
- Use pipelines in production to ensure consistent preprocessing
- Be prepared to explain the trade-offs of different approaches
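
The tip about pipelines can be illustrated with scikit-learn's `Pipeline` and `ColumnTransformer`. The sketch below is a minimal example under stated assumptions: the `train` and `new_rows` frames are made up, and `sparse_threshold=0` is set only to force a dense array for easy printing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 30, 45],
    'income': [50000, 60000, np.nan, 80000, 55000, 90000],
    'category': pd.Series(['A', 'B', 'A', np.nan, 'C', 'B'], dtype=object),
})
new_rows = pd.DataFrame({
    'age': [33, np.nan],
    'income': [np.nan, 70000],
    'category': pd.Series(['D', 'A'], dtype=object),  # 'D' never appears in train
})

numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
preprocess = ColumnTransformer(
    [
        ('num', numeric, ['age', 'income']),
        ('cat', categorical, ['category']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Medians, modes, scaling statistics and categories are learned from train only
X_train = preprocess.fit_transform(train)

# The same fitted statistics are reused on unseen rows
X_new = preprocess.transform(new_rows)

print(X_train.shape, X_new.shape)  # (6, 5) (2, 5)
```

Fitting on the training data and calling only `transform` on new data avoids leaking information from unseen rows, and `handle_unknown='ignore'` encodes the unseen category `'D'` as all zeros instead of raising an error.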