## **3. Data Preprocessing**

### **3.1 Overview**
This notebook demonstrates the data preprocessing pipeline for customer segmentation. The code has been modularized into the `src/data_preprocessing.py` module for production use.

**Key Steps:**
1. Load raw transaction data
2. Handle missing values (Description, CustomerID)
3. Remove duplicate records
4. Enforce proper data types
5. Remove outliers using IQR method
6. Filter out cancelled orders

**Production Usage:**
```python
from src.data_preprocessing import preprocess_data
df_clean = preprocess_data('../data/raw/Online_Retail.csv')
```

### **3.2 Checking Missing Values and Duplicates**

In [None]:
# Import libraries and load data
import pandas as pd
import sys
import os

# Add src directory to path for importing our module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Load raw data
df = pd.read_csv('../data/raw/Online_Retail.csv')
print(f"Raw data shape: {df.shape}")
df.head()

In [None]:
# Check missing values using our modular function
from data_preprocessing import check_missing_values

missing_summary = check_missing_values(df)
print("Missing Values Summary:")
print(missing_summary)

# Note: This functionality is now available in src/data_preprocessing.py
# Function: check_missing_values(df)
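The module's `check_missing_values` is not reproduced in this notebook. As a rough illustration only, a summary of this kind could be built as below. The name `check_missing_values_sketch`, the output column names, and the toy data are assumptions, not the module's actual code.

```python
import pandas as pd


def check_missing_values_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: per-column missing counts and percentages."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Report only the columns that actually contain gaps, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy frame (made-up values) with gaps in the two columns named in 3.1
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", None],
    "CustomerID": [17850.0, None, 13047.0, 12583.0],
    "Quantity": [6, 2, 1, 3],
})
missing_demo = check_missing_values_sketch(toy)
print(missing_demo)
```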

In [None]:
# Handle missing values using our modular function
from data_preprocessing import handle_missing_values

df_clean = handle_missing_values(df)
print(f"Shape after handling missing values: {df_clean.shape}")

# Verify no missing values remain
missing_after = check_missing_values(df_clean)
print("Missing values after cleaning:")
print(missing_after)

# Check how CustomerID nulls were handled
null_customer_ids = df['CustomerID'].isnull().sum()
# Cast to str first so the .str accessor works even if some IDs are still numeric
new_customer_ids = df_clean['CustomerID'].astype(str).str.startswith('N').sum()
print(f"\nOriginal null CustomerIDs: {null_customer_ids}")
print(f"New CustomerIDs created: {new_customer_ids}")

# Note: This functionality is now available in src/data_preprocessing.py
# Function: handle_missing_values(df)
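The check above implies that null CustomerIDs are replaced with new IDs starting with `N`. One way this might work is sketched below. The `"Unknown"` placeholder, the `N<InvoiceNo>` scheme, and the name `handle_missing_values_sketch` are assumptions made for illustration; the real function may differ.

```python
import pandas as pd


def handle_missing_values_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: fill Description and synthesize missing CustomerIDs."""
    out = df.copy()
    # Assumption: a fixed placeholder is acceptable for missing descriptions
    out["Description"] = out["Description"].fillna("Unknown")

    missing = out["CustomerID"].isna()
    # Known IDs load as floats because the column contains NaN: 17850.0 -> "17850"
    known = out.loc[~missing, "CustomerID"].astype(int).astype(str)
    # Assumption: one synthetic ID per invoice, prefixed with "N"
    synthetic = "N" + out.loc[missing, "InvoiceNo"].astype(str)
    # Restore the original row order (requires a unique index)
    out["CustomerID"] = pd.concat([known, synthetic]).reindex(out.index)
    return out


toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Description": ["MUG", None, "LAMP"],
    "CustomerID": [17850.0, None, 13047.0],
})
cleaned_demo = handle_missing_values_sketch(toy)
print(cleaned_demo)
```

Tying the synthetic ID to the invoice keeps all lines of one anonymous purchase under the same customer, which matters if later steps aggregate per customer.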

In [None]:
# Remove duplicates using our modular function
from data_preprocessing import remove_duplicates

df_no_duplicates = remove_duplicates(df_clean)
print(f"Shape after removing duplicates: {df_no_duplicates.shape}")
print(f"Removed {df_clean.shape[0] - df_no_duplicates.shape[0]} duplicate records")

# Note: This functionality is now available in src/data_preprocessing.py
# Function: remove_duplicates(df, keep='last')
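The `keep='last'` default noted above decides which copy of a repeated row survives. A minimal sketch, assuming the function is a thin wrapper around pandas' `drop_duplicates` (the name `remove_duplicates_sketch` and the toy rows are hypothetical):

```python
import pandas as pd


def remove_duplicates_sketch(df: pd.DataFrame, keep: str = "last") -> pd.DataFrame:
    """Hypothetical stand-in: drop fully identical rows, keeping one copy."""
    return df.drop_duplicates(keep=keep)


# Rows 0 and 2 are identical; row 1 is unique
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536365"],
    "StockCode": ["85123A", "22633", "85123A"],
    "Quantity": [6, 2, 6],
})
kept_last = remove_duplicates_sketch(toy, keep="last")
kept_first = remove_duplicates_sketch(toy, keep="first")
print(kept_last.index.tolist(), kept_first.index.tolist())
```

The surviving rows have the same values either way; only the retained index label differs, which matters if the index carries meaning downstream.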

### **3.3 Data Type Enforcement**

In [None]:
# Enforce proper data types using our modular function
from data_preprocessing import enforce_dtypes

df_typed = enforce_dtypes(df_no_duplicates)
print("Data types after enforcement:")
print(df_typed.dtypes)
print(f"\nShape: {df_typed.shape}")

# Note: This functionality is now available in src/data_preprocessing.py
# Function: enforce_dtypes(df)
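For orientation, dtype enforcement on this kind of transaction data might look like the sketch below. The target types and the presence of an `InvoiceDate` column are assumptions based on the usual Online Retail layout, and `enforce_dtypes_sketch` is a hypothetical name, not the module's function.

```python
import pandas as pd


def enforce_dtypes_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: coerce columns to analysis-friendly types."""
    out = df.copy()
    # Unparseable values become NaT/NaN instead of raising
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], errors="coerce")
    out["Quantity"] = pd.to_numeric(out["Quantity"], errors="coerce").astype("Int64")
    out["UnitPrice"] = pd.to_numeric(out["UnitPrice"], errors="coerce").astype(float)
    # IDs are labels, not numbers, so keep them as strings
    out["CustomerID"] = out["CustomerID"].astype(str)
    return out


toy = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00"],
    "Quantity": ["6", "2"],
    "UnitPrice": ["2.55", "1.85"],
    "CustomerID": ["17850", "N536366"],
})
typed_demo = enforce_dtypes_sketch(toy)
print(typed_demo.dtypes)
```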

### **3.4 Outlier Detection and Removal**

In [None]:
# Visualize outliers before removal
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y=df_typed['UnitPrice'])
plt.title('UnitPrice Outliers (Before)')

plt.subplot(1, 2, 2)
sns.boxplot(y=df_typed['Quantity'])
plt.title('Quantity Outliers (Before)')

plt.tight_layout()
plt.show()

In [None]:
# Remove outliers using IQR method
from data_preprocessing import remove_outliers_iqr

df_no_outliers = remove_outliers_iqr(df_typed, ['UnitPrice', 'Quantity'])
print(f"Original shape: {df_typed.shape}")
print(f"Shape after removing outliers: {df_no_outliers.shape}")
print(f"Removed {df_typed.shape[0] - df_no_outliers.shape[0]} outlier records")

# Note: This functionality is now available in src/data_preprocessing.py
# Function: remove_outliers_iqr(df, columns)
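The IQR method drops values outside `[Q1 - k*IQR, Q3 + k*IQR]`, where `IQR = Q3 - Q1`. The sketch below assumes the conventional multiplier `k = 1.5` and computes every column's fences on the input frame before filtering; the module's actual multiplier and ordering are not shown in this notebook, and `remove_outliers_iqr_sketch` is a hypothetical name.

```python
import pandas as pd


def remove_outliers_iqr_sketch(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical stand-in: keep rows inside the IQR fences of every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Made-up values: UnitPrice has Q1 = 2.0 and Q3 = 3.0, so the fences are 0.5 and 4.5
toy = pd.DataFrame({
    "UnitPrice": [1.0, 2.0, 2.5, 3.0, 100.0],
    "Quantity": [1, 2, 3, 4, 5],
})
trimmed_demo = remove_outliers_iqr_sketch(toy, ["UnitPrice", "Quantity"])
print(trimmed_demo)
```

Only the row with `UnitPrice` 100.0 falls outside its fences here; all `Quantity` values lie within theirs.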

In [None]:
# Visualize outliers after removal
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y=df_no_outliers['UnitPrice'])
plt.title('UnitPrice Outliers (After)')

plt.subplot(1, 2, 2)
sns.boxplot(y=df_no_outliers['Quantity'])
plt.title('Quantity Outliers (After)')

plt.tight_layout()
plt.show()

### **3.5 Filter Viable Orders**

In [None]:
# Remove cancelled orders (negative quantities)
from data_preprocessing import filter_viable_orders

viable_orders = filter_viable_orders(df_no_outliers)
print(f"Shape after filtering viable orders: {viable_orders.shape}")
print(f"Removed {df_no_outliers.shape[0] - viable_orders.shape[0]} cancelled orders")

# Display sample of viable orders
viable_orders.head()

# Note: This functionality is now available in src/data_preprocessing.py
# Function: filter_viable_orders(df)
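The cell above treats negative quantities as cancellations. A minimal sketch of such a filter follows. The additional check for an `InvoiceNo` starting with `C` reflects how cancellations are typically marked in the Online Retail data and is an assumption about this module, as is the name `filter_viable_orders_sketch`.

```python
import pandas as pd


def filter_viable_orders_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: keep only completed orders with positive quantity."""
    # Assumption: cancelled invoices carry a "C" prefix
    is_cancelled = df["InvoiceNo"].astype(str).str.startswith("C")
    return df[(df["Quantity"] > 0) & ~is_cancelled]


toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380"],
    "Quantity": [6, -1, 3],
})
viable_demo = filter_viable_orders_sketch(toy)
print(viable_demo)
```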

### **3.6 Complete Pipeline**

Run the complete preprocessing pipeline using our modular function.

In [None]:
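# preprocess_data has not been imported in any earlier cell
from data_preprocessing import preprocess_data

df_final = preprocess_data('../data/raw/Online_Retail.csv')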
print(f"Final processed data shape: {df_final.shape}")
print("\nFinal data info:")
df_final.info()

This single function performs all the steps above:
1. Load data
2. Handle missing values
3. Remove duplicates
4. Enforce data types
5. Remove outliers
6. Filter viable orders
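To illustrate how steps like these can be chained while keeping the per-step row counts printed throughout this notebook, here is a small sketch. `run_pipeline_sketch` and the two inline steps are hypothetical and far simpler than `preprocess_data`, whose internals are not shown here.

```python
import pandas as pd


def run_pipeline_sketch(df: pd.DataFrame, steps):
    """Apply (name, function) steps in order; return the result and a row-count log."""
    log = [("raw", len(df))]
    for name, step in steps:
        df = step(df)
        log.append((name, len(df)))
    return df, log


# Made-up rows: one exact duplicate and one negative quantity
toy = pd.DataFrame({"InvoiceNo": ["1", "1", "2"], "Quantity": [6, 6, -1]})
steps = [
    ("remove_duplicates", lambda d: d.drop_duplicates(keep="last")),
    ("filter_viable_orders", lambda d: d[d["Quantity"] > 0]),
]
result_demo, log_demo = run_pipeline_sketch(toy, steps)
for name, rows in log_demo:
    print(f"{name}: {rows} rows")
```

Logging the row count after each step makes it easy to see which stage removes the most data.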

In [None]:
# Save processed data
from data_preprocessing import save_processed_data
save_processed_data(df_final, '../data/processed/Online_Retail_Cleaned.csv')
print("\nProcessed data saved to: ../data/processed/Online_Retail_Cleaned.csv")
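A save helper of this kind usually needs to create the output directory and skip the index column. The sketch below shows one way to do that; `save_processed_data_sketch` is a hypothetical name, and whether the real `save_processed_data` does either of these things is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data_sketch(df: pd.DataFrame, path) -> Path:
    """Hypothetical stand-in: write a CSV, creating parent folders as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids an extra unnamed column when the file is read back
    df.to_csv(target, index=False)
    return target


toy = pd.DataFrame({"InvoiceNo": [536365, 536366], "Quantity": [6, 2]})
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_processed_data_sketch(
        toy, Path(tmp) / "processed" / "toy_cleaned.csv"
    )
    reloaded = pd.read_csv(saved_path)
print(reloaded)
```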