# Data Preprocessing

**Author:** Nino Gagnidze  
**Purpose:** Clean and preprocess the Mall Customers dataset for analysis and modeling

## Objectives
- Handle missing values and duplicates
- Detect and handle outliers
- Create derived features (age groups, income categories, spending categories)
- Encode categorical variables
- Save processed data for downstream analysis

## Preprocessing Decisions
Based on the data exploration notebook:
- Missing values: To be handled if any exist
- Duplicates: To be removed
- Outliers: To be kept (they represent valid customer segments)
- Feature engineering: Create categorical groupings for better analysis

## 1. Setup and Import

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
import warnings
warnings.filterwarnings('ignore')

# Add src directory to path
sys.path.append('../src')

# Import custom preprocessing functions
from data_processing import (
    load_raw_data,
    check_missing_values,
    handle_missing_values,
    remove_duplicates,
    detect_outliers_iqr,
    handle_outliers,
    create_age_groups,
    create_income_categories,
    create_spending_categories,
    encode_categorical_features,
    preprocess_pipeline,
    save_processed_data,
    generate_preprocessing_report
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

## 2. Load Raw Data

In [2]:
# Load the raw dataset
raw_data_path = '../data/raw/mall_customers.csv'
df_original = load_raw_data(raw_data_path)

print(f"\nOriginal dataset shape: {df_original.shape}")
df_original.head()

Data loaded successfully. Shape: (200, 5)

Original dataset shape: (200, 5)


|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


## 3. Check Data Quality

In [3]:
# Check for missing values
print("Missing Values Summary:")
missing_summary = check_missing_values(df_original)

if len(missing_summary) == 0:
    print("No missing values found.")
else:
    print(missing_summary)

Missing Values Summary:
No missing values found.


In [4]:
# Check for duplicates
duplicate_count = df_original.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

Number of duplicate rows: 0
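
The helper `check_missing_values` lives in `src/data_processing.py` and is not shown in this notebook. As a rough sketch of what such a check might look like, assuming it returns one row per column that contains missing values (so an empty result means clean data), the function name `summarize_missing` and the toy DataFrame below are hypothetical:

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing_Count": counts,
        "Missing_Percentage": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["Missing_Count"] > 0]


# Toy data for illustration only (not the Mall Customers dataset)
toy = pd.DataFrame({
    "Age": [19, np.nan, 20, 20],
    "Gender": ["Male", "Female", "Female", "Female"],
})

missing = summarize_missing(toy)
print(missing)

# Same duplicate check as in the cell above: counts fully identical rows
print("Duplicate rows:", toy.duplicated().sum())
```

Returning an empty DataFrame for clean data is what makes the `len(missing_summary) == 0` test in the cell above work.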


## 4. Outlier Analysis

In [5]:
# Analyze outliers for each numerical feature
numerical_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

print("Outlier Analysis (IQR Method):")
print("=" * 80)

outlier_summary = {}

for feature in numerical_features:
    outliers, stats = detect_outliers_iqr(df_original, feature)
    outlier_summary[feature] = stats
    
    print(f"\n{feature}:")
    print(f"  Lower Bound: {stats['lower_bound']:.2f}")
    print(f"  Upper Bound: {stats['upper_bound']:.2f}")
    print(f"  Outlier Count: {stats['outlier_count']}")
    print(f"  Outlier Percentage: {stats['outlier_percentage']:.2f}%")
    
    if stats['outlier_count'] > 0:
        print(f"  Outlier values: {sorted(outliers[feature].tolist())}")

Outlier Analysis (IQR Method):

Age:
  Lower Bound: -1.62
  Upper Bound: 79.38
  Outlier Count: 0
  Outlier Percentage: 0.00%

Annual Income (k$):
  Lower Bound: -13.25
  Upper Bound: 132.75
  Outlier Count: 2
  Outlier Percentage: 1.00%
  Outlier values: [137, 137]

Spending Score (1-100):
  Lower Bound: -22.62
  Upper Bound: 130.38
  Outlier Count: 0
  Outlier Percentage: 0.00%
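
The implementation of `detect_outliers_iqr` is not shown here. The reported bounds are consistent with the common 1.5 × IQR fences, so a minimal sketch under that assumption could look like the following. The function name `iqr_bounds`, the multiplier default `k`, and the toy income values are illustrative, not taken from `data_processing.py`:

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy incomes in k$ for illustration only (not the real dataset)
incomes = pd.Series([15, 40, 50, 60, 70, 80, 137], name="Annual Income (k$)")

lower, upper = iqr_bounds(incomes)

# A value is flagged when it falls outside the fences
outliers = incomes[(incomes < lower) | (incomes > upper)]

print(f"Lower Bound: {lower:.2f}")
print(f"Upper Bound: {upper:.2f}")
print(f"Outlier values: {outliers.tolist()}")
```

Negative lower bounds, as seen for all three features above, simply mean that no value can be flagged on the low side.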


## 5. Preprocessing Decision: Outlier Handling

**Decision:** Keep outliers

**Justification:**
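In customer segmentation, outliers often represent valid and important customer segments (e.g., high-income high-spenders or low-income low-spenders). In this dataset the IQR method flags only two customers, both with an annual income of 137k$, slightly above the upper bound of 132.75k$. These are real customers with distinct behaviors that should be included in our analysis and clustering. Removing them would: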
1. Reduce the diversity of customer segments identified
2. Potentially miss important business insights
3. Decrease the practical applicability of the model

Therefore, we will keep all outliers in the dataset.

## 6. Apply Preprocessing Pipeline

In [6]:
# Apply complete preprocessing pipeline
df_processed = preprocess_pipeline(
    df_original,
    handle_missing=True,
    remove_duplicates_flag=True,
    handle_outliers_flag=False,  # Keep outliers
    outlier_method='keep',
    create_features=True
)

Starting preprocessing pipeline...

1. Handling missing values...
No missing values found.

2. Removing duplicates...
Removed 0 duplicate rows.

4. Creating derived features...
Age groups created successfully.
Income categories created successfully.
Spending categories created successfully.
Gender encoded: Male=1, Female=0

Preprocessing complete! Final shape: (200, 9)
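
The feature-creation helpers are defined in `src/data_processing.py` and are not shown here. One plausible way to build such groupings is `pd.cut`, sketched below on five rows copied from the dataset. The bin edges and the handling of boundary values are assumptions inferred from the category labels; in particular, the labels "Middle-Aged (36-50)" and "Senior (50+)" both mention age 50, and this sketch assigns 50 to Middle-Aged.

```python
import numpy as np
import pandas as pd

# Rows 0, 4, 6, 8 and 9 of the dataset (numeric columns only)
sample = pd.DataFrame({
    "Age": [19, 31, 35, 64, 30],
    "Annual Income (k$)": [15, 17, 18, 19, 19],
    "Spending Score (1-100)": [39, 40, 6, 3, 72],
})

# Assumed edges, right-closed intervals: 25 -> Young, 35 -> Adult, 50 -> Middle-Aged
sample["Age_Group"] = pd.cut(
    sample["Age"],
    bins=[0, 25, 35, 50, np.inf],
    labels=["Young (18-25)", "Adult (26-35)", "Middle-Aged (36-50)", "Senior (50+)"],
)

# Assumed edges, left-closed intervals so that exactly 40 is NOT "<40k"
sample["Income_Category"] = pd.cut(
    sample["Annual Income (k$)"],
    bins=[0, 40, 70, 100, np.inf],
    right=False,
    labels=[
        "Low Income (<40k)",
        "Medium Income (40-70k)",
        "High Income (70-100k)",
        "Very High Income (>100k)",
    ],
)

# Assumed ranges 1-35, 36-65, 66-100 (right-closed)
sample["Spending_Category"] = pd.cut(
    sample["Spending Score (1-100)"],
    bins=[0, 35, 65, 100],
    labels=["Low Spender", "Medium Spender", "High Spender"],
)

print(sample)
```

`pd.cut` returns ordered categoricals by default, which would be consistent with the `category` dtype of the engineered columns and lets `sort_index()` list the groups from lowest to highest rather than alphabetically.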


## 7. Verify Processed Data

In [7]:
# Display processed data info
print("Processed Data Information:")
print("=" * 80)
df_processed.info()

Processed Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   CustomerID              200 non-null    int64   
 1   Gender                  200 non-null    object  
 2   Age                     200 non-null    int64   
 3   Annual Income (k$)      200 non-null    int64   
 4   Spending Score (1-100)  200 non-null    int64   
 5   Age_Group               200 non-null    category
 6   Income_Category         200 non-null    category
 7   Spending_Category       200 non-null    category
 8   Gender_Encoded          200 non-null    int64   
dtypes: category(3), int64(5), object(1)
memory usage: 10.6+ KB


In [8]:
# Display first few rows of processed data
print("First 10 rows of processed data:")
df_processed.head(10)

First 10 rows of processed data:


|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Age_Group | Income_Category | Spending_Category | Gender_Encoded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 | Young (18-25) | Low Income (<40k) | Medium Spender | 1 |
| 1 | 2 | Male | 21 | 15 | 81 | Young (18-25) | Low Income (<40k) | High Spender | 1 |
| 2 | 3 | Female | 20 | 16 | 6 | Young (18-25) | Low Income (<40k) | Low Spender | 0 |
| 3 | 4 | Female | 23 | 16 | 77 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 4 | 5 | Female | 31 | 17 | 40 | Adult (26-35) | Low Income (<40k) | Medium Spender | 0 |
| 5 | 6 | Female | 22 | 17 | 76 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 6 | 7 | Female | 35 | 18 | 6 | Adult (26-35) | Low Income (<40k) | Low Spender | 0 |
| 7 | 8 | Female | 23 | 18 | 94 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 8 | 9 | Male | 64 | 19 | 3 | Senior (50+) | Low Income (<40k) | Low Spender | 1 |
| 9 | 10 | Female | 30 | 19 | 72 | Adult (26-35) | Low Income (<40k) | High Spender | 0 |


In [9]:
# Check new features
print("New Features Created:")
new_features = [col for col in df_processed.columns if col not in df_original.columns]
print(new_features)

print("\nSample of new features:")
df_processed[new_features].head(10)

New Features Created:
['Age_Group', 'Income_Category', 'Spending_Category', 'Gender_Encoded']

Sample of new features:


|   | Age_Group | Income_Category | Spending_Category | Gender_Encoded |
|---|---|---|---|---|
| 0 | Young (18-25) | Low Income (<40k) | Medium Spender | 1 |
| 1 | Young (18-25) | Low Income (<40k) | High Spender | 1 |
| 2 | Young (18-25) | Low Income (<40k) | Low Spender | 0 |
| 3 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 4 | Adult (26-35) | Low Income (<40k) | Medium Spender | 0 |
| 5 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 6 | Adult (26-35) | Low Income (<40k) | Low Spender | 0 |
| 7 | Young (18-25) | Low Income (<40k) | High Spender | 0 |
| 8 | Senior (50+) | Low Income (<40k) | Low Spender | 1 |
| 9 | Adult (26-35) | Low Income (<40k) | High Spender | 0 |


## 8. Feature Distribution Analysis

In [10]:
# Analyze Age Groups distribution
print("Age Group Distribution:")
print(df_processed['Age_Group'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Age_Group'].value_counts(normalize=True) * 100).sort_index().round(2))

Age Group Distribution:
Age_Group
Young (18-25)          38
Adult (26-35)          60
Middle-Aged (36-50)    62
Senior (50+)           40
Name: count, dtype: int64

Percentage:
Age_Group
Young (18-25)          19.0
Adult (26-35)          30.0
Middle-Aged (36-50)    31.0
Senior (50+)           20.0
Name: proportion, dtype: float64


In [11]:
# Analyze Income Categories distribution
print("Income Category Distribution:")
print(df_processed['Income_Category'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Income_Category'].value_counts(normalize=True) * 100).sort_index().round(2))

Income Category Distribution:
Income_Category
Low Income (<40k)           50
Medium Income (40-70k)      76
High Income (70-100k)       60
Very High Income (>100k)    14
Name: count, dtype: int64

Percentage:
Income_Category
Low Income (<40k)           25.0
Medium Income (40-70k)      38.0
High Income (70-100k)       30.0
Very High Income (>100k)     7.0
Name: proportion, dtype: float64


In [12]:
# Analyze Spending Categories distribution
print("Spending Category Distribution:")
print(df_processed['Spending_Category'].value_counts().sort_index())
print("\nPercentage:")
print((df_processed['Spending_Category'].value_counts(normalize=True) * 100).sort_index().round(2))

Spending Category Distribution:
Spending_Category
Low Spender       55
Medium Spender    87
High Spender      58
Name: count, dtype: int64

Percentage:
Spending_Category
Low Spender       27.5
Medium Spender    43.5
High Spender      29.0
Name: proportion, dtype: float64


In [13]:
# Verify Gender encoding
print("Gender Encoding Verification:")
print(df_processed[['Gender', 'Gender_Encoded']].value_counts().sort_index())

Gender Encoding Verification:
Gender  Gender_Encoded
Female  0                 112
Male    1                  88
Name: count, dtype: int64
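
The encoding step itself happens inside `encode_categorical_features`, which is not shown. A minimal sketch of an equivalent mapping, assuming a plain dictionary lookup (the variable names and the toy Series are illustrative), with a guard against unexpected labels:

```python
import pandas as pd

# Toy values for illustration; the mapping matches the pipeline log (Male=1, Female=0)
gender = pd.Series(["Male", "Male", "Female", "Female", "Female"], name="Gender")
mapping = {"Male": 1, "Female": 0}

encoded = gender.map(mapping)

# Labels missing from the mapping become NaN, so fail loudly instead of
# silently propagating missing values into the model
if encoded.isna().any():
    raise ValueError("Unexpected value in Gender column")

encoded = encoded.astype("int64").rename("Gender_Encoded")
print(pd.concat([gender, encoded], axis=1))
```

The explicit check matters because `Series.map` does not raise on unknown keys; a misspelled or differently cased label would otherwise go unnoticed.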


## 9. Generate Preprocessing Report

In [14]:
# Generate comprehensive preprocessing report
report = generate_preprocessing_report(df_original, df_processed)

print("Preprocessing Report:")
print("=" * 80)
print(f"Original Shape: {report['Original_Shape']}")
print(f"Processed Shape: {report['Processed_Shape']}")
print(f"Rows Removed: {report['Rows_Removed']}")
print(f"Features Added: {report['Features_Added']}")
print(f"\nNew Features: {report['New_Features']}")

Preprocessing Report:
Original Shape: (200, 5)
Processed Shape: (200, 9)
Rows Removed: 0
Features Added: 4

New Features: ['Age_Group', 'Income_Category', 'Spending_Category', 'Gender_Encoded']
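
The report is a dictionary whose keys appear in the cell above. As a sketch of how such a dictionary could be assembled from the two DataFrames (the function name `build_report` and the toy data are hypothetical; the actual `generate_preprocessing_report` may compute more):

```python
import pandas as pd


def build_report(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Compare shapes and columns of the raw and processed DataFrames."""
    new_features = [col for col in after.columns if col not in before.columns]
    return {
        "Original_Shape": before.shape,
        "Processed_Shape": after.shape,
        "Rows_Removed": len(before) - len(after),
        "Features_Added": len(new_features),
        "New_Features": new_features,
    }


# Toy data for illustration: one duplicate row, one engineered column
before = pd.DataFrame({"Gender": ["Male", "Female", "Female"], "Age": [19, 20, 20]})
after = before.drop_duplicates().assign(
    Gender_Encoded=lambda d: d["Gender"].map({"Male": 1, "Female": 0})
)

toy_report = build_report(before, after)
print(toy_report)
```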


## 10. Save Processed Data

In [15]:
# Save processed data
processed_data_path = '../data/processed/mall_customers_processed.csv'
save_processed_data(df_processed, processed_data_path)

Processed data saved to: ../data/processed/mall_customers_processed.csv
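
CSV files store only text, so the `category` dtype of the engineered columns does not survive a save and reload. The sketch below illustrates this on a toy DataFrame and shows one way to restore an ordered categorical afterwards. The temporary path, the `index=False` choice, and the variable names are assumptions; `save_processed_data` may behave differently.

```python
import os
import tempfile

import pandas as pd

labels = ["Low Spender", "Medium Spender", "High Spender"]

# Toy data for illustration only
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Spending_Category": pd.Categorical(
        ["Medium Spender", "High Spender", "Low Spender"],
        categories=labels,
        ordered=True,
    ),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed", "sample.csv")
    # Create the target folder first; to_csv does not create missing directories
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# After reloading, the column holds plain strings, not categories
dtype_after_reload = reloaded["Spending_Category"].dtype
print("dtype after reload:", dtype_after_reload)

# Restore the ordered categorical so sorting and comparisons follow Low < Medium < High
reloaded["Spending_Category"] = pd.Categorical(
    reloaded["Spending_Category"], categories=labels, ordered=True
)
print("dtype after restore:", reloaded["Spending_Category"].dtype)
```

Downstream notebooks that rely on the category order (for example `value_counts().sort_index()`) would need a restore step like this after reading the processed CSV.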


In [16]:
# Save preprocessing report to file
report_path = '../reports/results/preprocessing_report.txt'

with open(report_path, 'w') as f:
    f.write("DATA PREPROCESSING REPORT\n")
    f.write("=" * 80 + "\n\n")
    
    f.write("1. OVERVIEW\n")
    f.write("-" * 80 + "\n")
    f.write(f"Original Shape: {report['Original_Shape']}\n")
    f.write(f"Processed Shape: {report['Processed_Shape']}\n")
    f.write(f"Rows Removed: {report['Rows_Removed']}\n")
    f.write(f"Features Added: {report['Features_Added']}\n\n")
    
    f.write("2. PREPROCESSING STEPS\n")
    f.write("-" * 80 + "\n")
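    # Use the results computed in cells 3 and 4 instead of hard-coded text,
    # so the report stays accurate if the input data changes
    missing_note = "None found" if len(missing_summary) == 0 else f"{len(missing_summary)} column(s) affected"
    f.write(f"- Checked for missing values: {missing_note}\n")
    f.write(f"- Removed duplicate rows: {duplicate_count} duplicates found\n")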
    f.write("- Outlier handling: Kept all outliers (valid customer segments)\n")
    f.write("- Feature engineering: Created derived categorical features\n\n")
    
    f.write("3. NEW FEATURES CREATED\n")
    f.write("-" * 80 + "\n")
    for feature in report['New_Features']:
        f.write(f"  - {feature}\n")
    f.write("\n")
    
    f.write("4. FEATURE DESCRIPTIONS\n")
    f.write("-" * 80 + "\n")
    f.write("Age_Group: Categorical age ranges\n")
    f.write("  - Young (18-25)\n")
    f.write("  - Adult (26-35)\n")
    f.write("  - Middle-Aged (36-50)\n")
    f.write("  - Senior (50+)\n\n")
    
    f.write("Income_Category: Annual income ranges\n")
    f.write("  - Low Income (<40k)\n")
    f.write("  - Medium Income (40-70k)\n")
    f.write("  - High Income (70-100k)\n")
    f.write("  - Very High Income (>100k)\n\n")
    
    f.write("Spending_Category: Spending score ranges\n")
    f.write("  - Low Spender (1-35)\n")
    f.write("  - Medium Spender (36-65)\n")
    f.write("  - High Spender (66-100)\n\n")
    
    f.write("Gender_Encoded: Numerical encoding of gender\n")
    f.write("  - Male = 1\n")
    f.write("  - Female = 0\n\n")
    
    f.write("5. OUTLIER ANALYSIS\n")
    f.write("-" * 80 + "\n")
    for feature, stats in outlier_summary.items():
        f.write(f"\n{feature}:\n")
        f.write(f"  Lower Bound: {stats['lower_bound']:.2f}\n")
        f.write(f"  Upper Bound: {stats['upper_bound']:.2f}\n")
        f.write(f"  Outlier Count: {stats['outlier_count']}\n")
        f.write(f"  Outlier Percentage: {stats['outlier_percentage']:.2f}%\n")
    
    f.write("\n6. DATA QUALITY DECISIONS\n")
    f.write("-" * 80 + "\n")
    f.write("Outliers: KEPT - They represent valid customer segments\n")
    f.write("Duplicates: REMOVED - Ensured data uniqueness\n")
    f.write("Missing Values: N/A - No missing values found\n")

print(f"\nPreprocessing report saved to: {report_path}")


Preprocessing report saved to: ../reports/results/preprocessing_report.txt


## 11. Summary of Transformations

### Data Cleaning:
- **Missing Values:** 0 - No missing values found in any column
- **Duplicates Removed:** 0 - No duplicate rows were present
- **Outliers:** Kept - 2 high-income customers (1%) identified as outliers but retained as they represent valid customer segments

### Feature Engineering:
Successfully created 4 new features:

1. **Age_Group:** Categorical classification with 4 categories
   - Young (18-25): 38 customers (19%)
   - Adult (26-35): 60 customers (30%)
   - Middle-Aged (36-50): 62 customers (31%)
   - Senior (50+): 40 customers (20%)

2. **Income_Category:** Income level classification with 4 categories
   - Low Income (<40k): 50 customers (25%)
   - Medium Income (40-70k): 76 customers (38%)
   - High Income (70-100k): 60 customers (30%)
   - Very High Income (>100k): 14 customers (7%)

3. **Spending_Category:** Spending behavior classification with 3 categories
   - Low Spender (1-35): 55 customers (27.5%)
   - Medium Spender (36-65): 87 customers (43.5%)
   - High Spender (66-100): 58 customers (29%)

4. **Gender_Encoded:** Binary encoding for ML models
   - Female = 0: 112 customers
   - Male = 1: 88 customers

### Final Dataset:
- **Total Records:** 200 (no rows removed)
- **Total Features:** 9 (5 original + 4 engineered)
- **Data Quality:** No missing values and no duplicate rows
- **File Saved:** `data/processed/mall_customers_processed.csv`
- **Status:** Ready for EDA and Machine Learning

### Transformation Summary:
- **Original Shape:** (200, 5)
- **Processed Shape:** (200, 9)
- **Rows Removed:** 0
- **Features Added:** 4
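- **Memory Usage:** The three engineered grouping columns (`Age_Group`, `Income_Category`, `Spending_Category`) use the `category` dtype; this dtype is not preserved in the saved CSV and has to be restored after reloading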

### Key Decisions Made:
1. **Outlier Handling:** Kept all outliers as they represent valuable high-income customer segments
2. **Missing Values:** No imputation needed (no missing values in any column)
3. **Duplicates:** No removal needed (all records unique)
4. **Feature Engineering:** Created categorical groupings for better interpretability and analysis
5. **Encoding:** Used binary encoding for gender (Female = 0, Male = 1) for ML compatibility

### Next Steps:
1. Proceed to exploratory data analysis with comprehensive visualizations
2. Use processed data for correlation analysis and pattern discovery
3. Apply K-Means clustering on clean data to identify customer segments
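4. Build classification models on the processed features, excluding `CustomerID` (an identifier, not a predictor) and using `Gender_Encoded` in place of the text column `Gender`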