## **Exploratory Data Analysis**

### **EDA Overview**
This notebook demonstrates exploratory data analysis for customer segmentation. The plotting code lives in `src/visualization.py` so the same functions can be reused in production.

**EDA Components:**
1. **Data Quality Analysis**: Missing values, duplicates, outliers
2. **Statistical Analysis**: Distributions, correlations
3. **Geographic Analysis**: Country distribution
4. **Temporal Analysis**: Sales trends over time
5. **Customer Analysis**: RFM distributions and insights
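
The data quality checks listed above (missing values and duplicates) can be sketched in a few lines of pandas. This is a minimal illustration on toy data, not the code in `src/visualization.py`; the helper name `summarize_data_quality` and the sample values are assumptions made for the example.

```python
import pandas as pd


def summarize_data_quality(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count rows, missing values per column, and duplicate rows."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data standing in for the order table; values are illustrative only.
toy_orders = pd.DataFrame({
    "UnitPrice": [2.5, 2.5, None, 4.0],
    "Quantity": [6, 6, 3, 1],
    "Country": ["France", "France", "Spain", None],
})

quality = summarize_data_quality(toy_orders)
print(quality)
```

Running the same helper on `viable_orders` before plotting gives a quick sanity check that the cleaning step left no unexpected gaps.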

**Production Usage:**
```python
from src.visualization import create_customer_dashboard
create_customer_dashboard(df_segments)
```

### **Load Processed Data**

In [None]:
# Import libraries and load processed data
import pandas as pd
import numpy as np
import sys
import os

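# Add src directory to path for importing our module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Create the image output directory up front; the plotting cells below save into it
os.makedirs('../data/processed/images', exist_ok=True)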

# Load processed data
viable_orders = pd.read_csv('../data/processed/Online_Retail_Cleaned.csv')
print(f"Processed data shape: {viable_orders.shape}")
viable_orders.head()

### **Outlier Detection**

In [None]:
# Visualize outliers using our modular function
from visualization import plot_outlier_detection

plot_outlier_detection(viable_orders, ['UnitPrice', 'Quantity'],
                      save_path='../data/processed/images/outliers_notebook.png')

# Statistical summary
df_stats = viable_orders[['UnitPrice', 'Quantity']].describe()
print("Statistical Summary:")
print(df_stats)

# Note: Function available in src/visualization.py
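
The rule that `plot_outlier_detection` applies is not shown in this notebook. One common choice is the interquartile range (IQR) rule, sketched below under that assumption; the helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy prices are illustrative, not taken from the module.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy prices with one obviously extreme value.
toy_prices = pd.Series([1.0, 1.2, 1.1, 0.9, 1.3, 25.0], name="UnitPrice")
mask = iqr_outlier_mask(toy_prices)
print(toy_prices[mask])
```

Applied to `viable_orders['UnitPrice']` or `viable_orders['Quantity']`, the mask gives a count of flagged rows to read alongside the `describe()` summary.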

### **Distribution Analysis**

In [None]:
# Plot distributions using our modular function
from visualization import plot_distributions

plot_distributions(viable_orders, ['UnitPrice', 'Quantity'],
                 save_path='../data/processed/images/distributions_notebook.png')

# Note: Function available in src/visualization.py
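
Price and quantity columns in retail data are often heavily right-skewed, which can make raw histograms hard to read. As a rough sketch, assuming non-negative values, the skewness can be compared before and after a `log1p` transform; the toy quantities below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy quantities with a long right tail (illustrative values only).
toy_quantity = pd.Series([1, 2, 2, 3, 4, 6, 12, 480], name="Quantity")

raw_skew = toy_quantity.skew()
# log1p computes log(1 + x), so it is defined at zero; it assumes x >= 0.
log_skew = np.log1p(toy_quantity).skew()

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

If the transformed skewness is much smaller, plotting the log-scaled column may show the bulk of the distribution more clearly.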

### **Geographic Analysis**


In [None]:
# Plot country distribution using our modular function
from visualization import plot_country_distribution

plot_country_distribution(viable_orders, 
                        save_path='../data/processed/images/country_distribution_notebook.png')

# Country statistics
country_stats = viable_orders['Country'].value_counts()
print(f"Total countries: {len(country_stats)}")
print("Top 5 countries:")
print(country_stats.head())

# Note: Function available in src/visualization.py
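
Raw counts can hide how dominant a single country is. A small sketch, using toy data rather than the real order table, shows how `value_counts(normalize=True)` turns the counts into percentage shares.

```python
import pandas as pd

# Toy country column; the proportions are illustrative only.
toy_countries = pd.Series(
    ["United Kingdom"] * 6 + ["Germany"] * 2 + ["France", "Spain"],
    name="Country",
)

# Share of rows per country, as a percentage rounded to one decimal place.
country_share = toy_countries.value_counts(normalize=True).mul(100).round(1)
print(country_share)
```

The same expression on `viable_orders['Country']` would report each country's share of order lines.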

### **Sales Trend Analysis**

In [None]:
# Prepare data for sales trend analysis
df_sales = viable_orders[['UnitPrice', 'Quantity', 'InvoiceDate']].copy()
df_sales['InvoiceDate'] = pd.to_datetime(df_sales['InvoiceDate'], errors='coerce').dt.normalize()
# Drop rows whose dates could not be parsed rather than assigning them an arbitrary date
df_sales = df_sales.dropna(subset=['InvoiceDate'])
df_sales['Sales'] = df_sales['UnitPrice'] * df_sales['Quantity']

# Plot sales trend using our modular function
from visualization import plot_sales_trend

plot_sales_trend(df_sales, start_date='2010-04-01',
                save_path='../data/processed/images/sales_trend_notebook.png')

# Note: Function available in src/visualization.py
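
How `plot_sales_trend` aggregates the data internally is not shown here. A plausible sketch, assuming it sums sales per calendar day, is given below on toy data; rows with unparseable dates are dropped before aggregating.

```python
import pandas as pd

# Toy order lines; the last row has an invalid date on purpose.
toy_sales = pd.DataFrame({
    "InvoiceDate": ["2011-01-04 10:00", "2011-01-04 15:30", "2011-01-05 09:15", "not a date"],
    "UnitPrice": [2.0, 1.5, 4.0, 3.0],
    "Quantity": [3, 4, 1, 2],
})

# Parse dates, truncate to midnight, and compute line-level sales.
toy_sales["InvoiceDate"] = pd.to_datetime(toy_sales["InvoiceDate"], errors="coerce").dt.normalize()
toy_sales["Sales"] = toy_sales["UnitPrice"] * toy_sales["Quantity"]

# Remove rows with unparseable dates, then sum sales per day.
daily_sales = (
    toy_sales.dropna(subset=["InvoiceDate"])
    .groupby("InvoiceDate")["Sales"]
    .sum()
)
print(daily_sales)
```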

### **Customer RFM Analysis**

In [None]:
# Load customer features for RFM analysis
customer_data = pd.read_csv('../data/processed/Customer_RFM_Features.csv')
print(f"Customer data shape: {customer_data.shape}")
customer_data.head()
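
The feature file is produced elsewhere in the pipeline, so the exact definitions are not visible in this notebook. For orientation, a common way to derive Recency, Frequency, and Monetary from order lines is sketched below. The `CustomerID` and `InvoiceNo` columns, the snapshot date convention, and the toy values are assumptions.

```python
import pandas as pd

# Toy order lines (illustrative values only).
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-08", "2011-01-05"]),
    "UnitPrice": [2.0, 1.0, 5.0, 10.0],
    "Quantity": [1, 3, 2, 1],
})
toy_orders["Sales"] = toy_orders["UnitPrice"] * toy_orders["Quantity"]

# Assumed convention: measure recency from one day after the last invoice.
snapshot = toy_orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy_orders.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("Sales", "sum"),                                     # total spend
)
print(rfm)
```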

### **Customer Insights Analysis**

In [None]:
# Find the top customer for each metric
most_recent_customer = customer_data.loc[customer_data['Recency'].idxmin()]
most_frequent_customer = customer_data.loc[customer_data['Frequency'].idxmax()]
highest_monetary_customer = customer_data.loc[customer_data['Monetary'].idxmax()]
most_variety_customer = customer_data.loc[customer_data['UniqueProducts'].idxmax()]
highest_aov_customer = customer_data.loc[customer_data['AvgOrderValue'].idxmax()]

print("=== CUSTOMER INSIGHTS ===")
print(f"Most Recent Customer: {most_recent_customer['CustomerID']} (Recency: {most_recent_customer['Recency']} days)")
print(f"Most Frequent Customer: {most_frequent_customer['CustomerID']} (Frequency: {most_frequent_customer['Frequency']} orders)")
print(f"Highest Monetary Customer: {highest_monetary_customer['CustomerID']} (Monetary: ${highest_monetary_customer['Monetary']:,.2f})")
print(f"Most Variety Customer: {most_variety_customer['CustomerID']} (Unique Products: {most_variety_customer['UniqueProducts']})")
print(f"Highest Avg Order Value Customer: {highest_aov_customer['CustomerID']} (Avg Order Value: ${highest_aov_customer['AvgOrderValue']:,.2f})")
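
A single top row can be an outlier, so it may help to look at the top few customers per metric instead. The sketch below uses `DataFrame.nlargest` on toy data; the customer IDs and amounts are invented for the example.

```python
import pandas as pd

# Toy customer features (illustrative values only).
toy_customers = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104],
    "Monetary": [250.0, 1200.0, 90.0, 640.0],
})

# The two customers with the highest total spend, largest first.
top_two = toy_customers.nlargest(2, "Monetary")
print(top_two)
```

With `customer_data`, the same call works for `'Frequency'`, `'UniqueProducts'`, or `'AvgOrderValue'`.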

### **Country-Based Analysis**

In [None]:
# Average order value by country
country_avg_order = (
    customer_data.groupby('Country')['AvgOrderValue']
    .mean()
    .reset_index()
    .rename(columns={'AvgOrderValue': 'MeanAvgOrderValue'})
)

country_avg_order_sorted = country_avg_order.sort_values(by='MeanAvgOrderValue', ascending=False)
print("=== TOP 5 COUNTRIES BY AVERAGE ORDER VALUE ===")
print(country_avg_order_sorted.head())

In [None]:
# Mean monetary value by country
country_monetary = (
    customer_data.groupby('Country')['Monetary']
    .mean()
    .reset_index()
    .rename(columns={'Monetary': 'MeanMonetaryValue'})
)

country_monetary_sorted = country_monetary.sort_values(by='MeanMonetaryValue', ascending=False)
print("=== TOP 5 COUNTRIES BY MEAN MONETARY VALUE ===")
print(country_monetary_sorted.head())
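
A country-level mean can be dominated by a country with only one or two customers. As a hedged sketch, the ranking can carry customer counts and totals alongside the mean, with a minimum-count filter. The threshold `min_customers = 2` and the toy data are arbitrary choices for illustration.

```python
import pandas as pd

# Toy customer features; Singapore has a single, very large customer.
toy_customers = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "Germany", "Germany", "Singapore"],
    "Monetary": [100.0, 300.0, 200.0, 400.0, 600.0, 5000.0],
})

# Count, mean, and total per country in one pass (named aggregation).
country_summary = (
    toy_customers.groupby("Country")["Monetary"]
    .agg(Customers="count", MeanMonetaryValue="mean", TotalMonetaryValue="sum")
    .reset_index()
)

# Keep only countries with enough customers for the mean to be meaningful.
min_customers = 2  # illustrative threshold, not taken from the project
reliable = (
    country_summary[country_summary["Customers"] >= min_customers]
    .sort_values("MeanMonetaryValue", ascending=False)
)
print(reliable)
```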

### **Complete EDA Dashboard**

In [None]:
# Create comprehensive EDA dashboard using our modular function
from visualization import create_customer_dashboard

# Create output directory if it doesn't exist
os.makedirs('../data/processed/images', exist_ok=True)

# Generate complete dashboard
if os.path.exists('../data/processed/Customer_Segments.csv'):
    customer_segments = pd.read_csv('../data/processed/Customer_Segments.csv')
    create_customer_dashboard(customer_segments, '../data/processed/images/')
    print("EDA Dashboard created successfully!")
else:
    print("Customer segments not available. Run clustering first to generate complete dashboard.")
    
# Note: Function available in src/visualization.py

### **Segment Analysis (if available)**

In [None]:
# Load customer segments if available
if os.path.exists('../data/processed/Customer_Segments.csv'):
    customer_segments = pd.read_csv('../data/processed/Customer_Segments.csv')
    
    # Visualize RFM segments using our modular function
    from visualization import plot_rfm_segments
    
    plot_rfm_segments(customer_segments, 
                    save_path='../data/processed/images/rfm_segments_eda_notebook.png')
    
    # Segment distribution
    print("=== SEGMENT DISTRIBUTION ===")
    segment_dist = customer_segments['Segment'].value_counts()
    print(segment_dist)
    
    print(f"\nTotal customers segmented: {len(customer_segments)}")
    print(f"Number of segments: {customer_segments['Segment'].nunique()}")
else:
    print("Customer segments not found. Run modeling notebook first.")
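
Beyond the segment counts, a per-segment profile of the RFM features helps in interpreting what each segment represents. The sketch below assumes the segments file contains `Recency`, `Frequency`, and `Monetary` columns next to `Segment`, which is not confirmed by this notebook; the toy values are illustrative.

```python
import pandas as pd

# Toy segmented customers (illustrative values only).
toy_segments = pd.DataFrame({
    "Segment": [0, 0, 1, 1, 1],
    "Recency": [5, 15, 200, 180, 220],
    "Frequency": [12, 8, 1, 2, 3],
    "Monetary": [900.0, 1100.0, 50.0, 70.0, 60.0],
})

# Mean RFM values per segment, plus the number of customers in each segment.
segment_profile = toy_segments.groupby("Segment")[["Recency", "Frequency", "Monetary"]].mean()
segment_profile["Customers"] = toy_segments.groupby("Segment").size()
print(segment_profile)
```

In this toy example, segment 0 looks like recent, frequent, high-spend customers, while segment 1 looks lapsed and low-spend.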