# Data Preprocessing

**Author:** Nino Gagnidze  
**Date:** 2026-01-11  
**Purpose:** Clean and preprocess the Mall Customers dataset for analysis and modeling

## Objectives
- Handle missing values and duplicates
- Detect and handle outliers
- Create derived features (age groups, income categories, spending categories)
- Encode categorical variables
- Save processed data for downstream analysis

## Preprocessing Decisions
Based on the data exploration notebook:
- Missing values: To be handled if any exist
- Duplicates: To be removed
- Outliers: To be kept (they represent valid customer segments)
- Feature engineering: Create categorical groupings for better analysis
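
The first two decisions can be sketched in plain pandas. This is only an illustration: `basic_clean` is a hypothetical helper, and the median/mode imputation is an assumption, since the strategy used by `handle_missing_values` in `data_processing` is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute missing values, then drop duplicate rows."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # Assumed strategy: median for numeric columns
                out[col] = out[col].fillna(out[col].median())
            else:
                # Assumed strategy: most frequent value for non-numeric columns
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Remove exact duplicate rows and renumber the index
    return out.drop_duplicates().reset_index(drop=True)


# Toy data with one missing value per column (not the real dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female", "Female"],
    "Age": [19.0, 35.0, 40.0, np.nan, 35.0],
})
cleaned = basic_clean(toy)
print(cleaned)
```

Imputing before de-duplicating means rows that become identical after filling are also collapsed; whether that order is desirable depends on the data.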

## 1. Setup and Import

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
import warnings
warnings.filterwarnings('ignore')

# Add src directory to path
sys.path.append('../src')

# Import custom preprocessing functions
from data_processing import (
    load_raw_data,
    check_missing_values,
    handle_missing_values,
    remove_duplicates,
    detect_outliers_iqr,
    handle_outliers,
    create_age_groups,
    create_income_categories,
    create_spending_categories,
    encode_categorical_features,
    preprocess_pipeline,
    save_processed_data,
    generate_preprocessing_report
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

## 2. Load Raw Data

In [None]:
# Load the raw dataset
raw_data_path = '../data/raw/mall_customers.csv'
df_original = load_raw_data(raw_data_path)

print(f"\nOriginal dataset shape: {df_original.shape}")
df_original.head()

## 3. Check Data Quality

In [None]:
# Check for missing values
print("Missing Values Summary:")
missing_summary = check_missing_values(df_original)

if len(missing_summary) == 0:
    print("No missing values found.")
else:
    print(missing_summary)

In [None]:
# Check for duplicates
duplicate_count = df_original.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

## 4. Outlier Analysis

In [None]:
# Analyze outliers for each numerical feature
numerical_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

print("Outlier Analysis (IQR Method):")
print("=" * 80)

outlier_summary = {}

for feature in numerical_features:
    outliers, stats = detect_outliers_iqr(df_original, feature)
    outlier_summary[feature] = stats
    
    print(f"\n{feature}:")
    print(f"  Lower Bound: {stats['lower_bound']:.2f}")
    print(f"  Upper Bound: {stats['upper_bound']:.2f}")
    print(f"  Outlier Count: {stats['outlier_count']}")
    print(f"  Outlier Percentage: {stats['outlier_percentage']:.2f}%")
    
    if stats['outlier_count'] > 0:
        print(f"  Outlier values: {sorted(outliers[feature].tolist())}")

## 5. Preprocessing Decision: Outlier Handling

**Decision:** Keep outliers

**Justification:**
In customer segmentation, outliers often represent valid and important customer segments (e.g., high-income high-spenders or low-income low-spenders). These are real customers with distinct behaviors that should be included in our analysis and clustering. Removing them would:
1. Reduce the diversity of customer segments identified
2. Potentially miss important business insights
3. Decrease the practical applicability of the model

Therefore, we will keep all outliers in the dataset.
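
For reference, the bounds reported in the outlier analysis follow the IQR rule. The sketch below assumes the conventional 1.5 × IQR fences and pandas' default linear quantile interpolation; `iqr_bounds` is a hypothetical name, and the actual `detect_outliers_iqr` implementation may differ.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy income values in k$ (not the real dataset)
income = pd.Series([15, 40, 55, 60, 62, 70, 78, 137], name="Annual Income (k$)")
lower, upper = iqr_bounds(income)

# Flag values outside the fences instead of dropping them
outliers = income[(income < lower) | (income > upper)]
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"outliers: {outliers.tolist()}")
```

Flagging rather than dropping matches the decision above: the rows stay in the dataset, and the fences only document which customers sit in the tails.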

## 6. Apply Preprocessing Pipeline

In [None]:
# Apply complete preprocessing pipeline
df_processed = preprocess_pipeline(
    df_original,
    handle_missing=True,
    remove_duplicates_flag=True,
    handle_outliers_flag=False,  # Keep outliers
    outlier_method='keep',
    create_features=True
)

## 7. Verify Processed Data

In [None]:
# Display processed data info
print("Processed Data Information:")
print("=" * 80)
df_processed.info()

In [None]:
# Display first few rows of processed data
print("First 10 rows of processed data:")
df_processed.head(10)

In [None]:
# Check new features
print("New Features Created:")
new_features = [col for col in df_processed.columns if col not in df_original.columns]
print(new_features)

print("\nSample of new features:")
df_processed[new_features].head(10)

## 8. Feature Distribution Analysis

In [None]:
# Analyze Age Groups distribution
print("Age Group Distribution:")
print(df_processed['Age_Group'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Age_Group'].value_counts(normalize=True) * 100).sort_index().round(2))

In [None]:
# Analyze Income Categories distribution
print("Income Category Distribution:")
print(df_processed['Income_Category'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Income_Category'].value_counts(normalize=True) * 100).sort_index().round(2))

In [None]:
# Analyze Spending Categories distribution
print("Spending Category Distribution:")
print(df_processed['Spending_Category'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Spending_Category'].value_counts(normalize=True) * 100).sort_index().round(2))

In [None]:
# Verify Gender encoding
print("Gender Encoding Verification:")
print(df_processed[['Gender', 'Gender_Encoded']].value_counts().sort_index())
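
The encoding itself happens inside `encode_categorical_features`, which is not shown here. Assuming the Male = 1 / Female = 0 mapping documented in the preprocessing report, a minimal sketch could look like this (`GENDER_MAP` is a hypothetical name):

```python
import pandas as pd

# Assumed mapping, taken from the feature descriptions in the report
GENDER_MAP = {"Male": 1, "Female": 0}

# Toy data (not the real dataset)
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})
toy["Gender_Encoded"] = toy["Gender"].map(GENDER_MAP)

# Labels missing from the mapping become NaN, so count them explicitly
unmapped = int(toy["Gender_Encoded"].isna().sum())
print(toy)
print(f"Unmapped gender labels: {unmapped}")
```

Checking for unmapped labels catches spelling or casing variants such as `"male"` that `Series.map` would silently turn into missing values.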

## 9. Generate Preprocessing Report

In [None]:
# Generate comprehensive preprocessing report
report = generate_preprocessing_report(df_original, df_processed)

print("Preprocessing Report:")
print("=" * 80)
print(f"Original Shape: {report['Original_Shape']}")
print(f"Processed Shape: {report['Processed_Shape']}")
print(f"Rows Removed: {report['Rows_Removed']}")
print(f"Features Added: {report['Features_Added']}")
print(f"\nNew Features: {report['New_Features']}")

## 10. Save Processed Data

In [None]:
# Save processed data
processed_data_path = '../data/processed/mall_customers_processed.csv'
save_processed_data(df_processed, processed_data_path)

In [None]:
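# Save preprocessing report to file
import os

report_path = '../reports/results/preprocessing_report.txt'
os.makedirs(os.path.dirname(report_path), exist_ok=True)  # create the folder if it does not exist yet

with open(report_path, 'w') as f: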
    f.write("DATA PREPROCESSING REPORT\n")
    f.write("=" * 80 + "\n\n")
    
    f.write("1. OVERVIEW\n")
    f.write("-" * 80 + "\n")
    f.write(f"Original Shape: {report['Original_Shape']}\n")
    f.write(f"Processed Shape: {report['Processed_Shape']}\n")
    f.write(f"Rows Removed: {report['Rows_Removed']}\n")
    f.write(f"Features Added: {report['Features_Added']}\n\n")
    
    f.write("2. PREPROCESSING STEPS\n")
    f.write("-" * 80 + "\n")
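    # Report the values computed in Section 3 instead of hard-coding the outcome
    missing_status = "None found" if len(missing_summary) == 0 else "Found and handled"
    f.write(f"- Checked for missing values: {missing_status}\n")
    f.write(f"- Removed duplicate rows: {duplicate_count} duplicates found\n")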
    f.write("- Outlier handling: Kept all outliers (valid customer segments)\n")
    f.write("- Feature engineering: Created derived categorical features\n\n")
    
    f.write("3. NEW FEATURES CREATED\n")
    f.write("-" * 80 + "\n")
    for feature in report['New_Features']:
        f.write(f"  - {feature}\n")
    f.write("\n")
    
    f.write("4. FEATURE DESCRIPTIONS\n")
    f.write("-" * 80 + "\n")
    f.write("Age_Group: Categorical age ranges\n")
    f.write("  - Young (18-25)\n")
    f.write("  - Adult (26-35)\n")
    f.write("  - Middle-Aged (36-50)\n")
    f.write("  - Senior (50+)\n\n")
    
    f.write("Income_Category: Annual income ranges\n")
    f.write("  - Low Income (<40k)\n")
    f.write("  - Medium Income (40-70k)\n")
    f.write("  - High Income (70-100k)\n")
    f.write("  - Very High Income (>100k)\n\n")
    
    f.write("Spending_Category: Spending score ranges\n")
    f.write("  - Low Spender (1-35)\n")
    f.write("  - Medium Spender (36-65)\n")
    f.write("  - High Spender (66-100)\n\n")
    
    f.write("Gender_Encoded: Numerical encoding of gender\n")
    f.write("  - Male = 1\n")
    f.write("  - Female = 0\n\n")
    
    f.write("5. OUTLIER ANALYSIS\n")
    f.write("-" * 80 + "\n")
    for feature, stats in outlier_summary.items():
        f.write(f"\n{feature}:\n")
        f.write(f"  Lower Bound: {stats['lower_bound']:.2f}\n")
        f.write(f"  Upper Bound: {stats['upper_bound']:.2f}\n")
        f.write(f"  Outlier Count: {stats['outlier_count']}\n")
        f.write(f"  Outlier Percentage: {stats['outlier_percentage']:.2f}%\n")
    
    f.write("\n6. DATA QUALITY DECISIONS\n")
    f.write("-" * 80 + "\n")
    f.write("Outliers: KEPT - They represent valid customer segments\n")
    f.write("Duplicates: REMOVED - Ensured data uniqueness\n")
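    # Derive the note from the Section 3 check instead of assuming a clean dataset
    missing_note = "N/A - No missing values found" if len(missing_summary) == 0 else "HANDLED - Missing values were found"
    f.write(f"Missing Values: {missing_note}\n")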

print(f"\nPreprocessing report saved to: {report_path}")

## 11. Summary of Transformations

Run all cells above and document your findings here:

### Data Cleaning:
- Missing Values: [To be filled after running]
- Duplicates Removed: [To be filled after running]
- Outliers: Kept (represent valid customer segments)

### Feature Engineering:
- Created Age_Group: 4 categories based on age ranges
- Created Income_Category: 4 categories based on income levels
- Created Spending_Category: 3 categories based on spending scores
- Created Gender_Encoded: Binary encoding for machine learning
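
The categorical groupings above can be sketched with `pd.cut`. The bin edges below are assumptions read off the ranges in the preprocessing report (for example, Young 18-25 and Low Spender 1-35); the real `create_age_groups` and `create_spending_categories` functions may place the edges differently. Income is omitted because the documented ranges (40-70k, 70-100k) do not say which category the shared edges belong to.

```python
import pandas as pd

# Toy data covering the edge of every bin (not the real dataset)
toy = pd.DataFrame({
    "Age": [18, 25, 26, 35, 50, 51, 70],
    "Spending Score (1-100)": [1, 35, 36, 65, 66, 100, 50],
})

# Right-closed bins: (17, 25], (25, 35], (35, 50], (50, inf)
toy["Age_Group"] = pd.cut(
    toy["Age"],
    bins=[17, 25, 35, 50, float("inf")],
    labels=["Young", "Adult", "Middle-Aged", "Senior"],
)

# Right-closed bins: (0, 35], (35, 65], (65, 100]
toy["Spending_Category"] = pd.cut(
    toy["Spending Score (1-100)"],
    bins=[0, 35, 65, 100],
    labels=["Low Spender", "Medium Spender", "High Spender"],
)
print(toy)
```

`pd.cut` returns an ordered categorical, which is why `value_counts().sort_index()` lists the groups in bin order rather than alphabetically.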

### Final Dataset:
- Total Records: [To be filled after running]
- Total Features: [To be filled after running]
- Ready for EDA and Machine Learning

### Next Steps:
1. Proceed to exploratory data analysis with visualizations
2. Use processed data for correlation analysis
3. Apply machine learning models on clean data