# Exploratory Data Analysis

## Business Questions
This analysis addresses five key business questions:

1. **Customer Segmentation**: How do customer segments differ in value and behavior?
2. **Geographic Performance**: Which markets drive revenue and have growth potential?
3. **Temporal Patterns**: When do sales peak and what drives seasonality?
4. **Customer Retention**: How important are returning customers?
5. **Product Analysis**: Which products drive revenue?

Each section includes **insights** and **recommendations**.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set style for all plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

In [None]:
# Load data
df = pd.read_csv('data_final.csv')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

print(f"Dataset: {len(df):,} transactions")
print(f"Period: {df['InvoiceDate'].min().date()} to {df['InvoiceDate'].max().date()}")
print(f"Customers: {df['CustomerID'].nunique():,}")
print(f"Products: {df['StockCode'].nunique():,}")
print(f"Countries: {df['Country'].nunique()}")

---
## 1. Customer Segmentation Analysis

**Question**: How do customer segments differ in purchasing behavior and value?

### Methodology
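Customers are split into **quartiles** on two dimensions. Quartiles adapt to the data, unlike a single median split or hand-picked cutoffs:
- **CLV quartiles** (CLV is measured here as total historical spend): Low, Medium-Low, Medium-High, High
- **Frequency quartiles** (number of distinct invoices): Low, Medium-Low, Medium-High, High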
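
As a minimal sketch of the quartile step, the snippet below runs `pd.qcut` on a small made-up table (the `toy` frame and its values are hypothetical, not taken from the dataset). It also illustrates why order counts are ranked before cutting: heavily tied values would otherwise produce duplicate bin edges.

```python
import pandas as pd

# Hypothetical toy data: eight customers with made-up spend and order counts.
toy = pd.DataFrame({
    "CustomerID": range(1, 9),
    "TotalSpend": [20.0, 55.0, 80.0, 120.0, 300.0, 450.0, 900.0, 2500.0],
    "OrderCount": [1, 1, 1, 1, 2, 2, 5, 12],
})

labels = ["Low", "Medium-Low", "Medium-High", "High"]

# Spend values are all distinct, so qcut can cut on the raw values.
toy["CLV_Segment"] = pd.qcut(toy["TotalSpend"], q=4, labels=labels)

# Order counts are heavily tied (four customers with a single order), which
# would give duplicate bin edges. Ranking with method="first" breaks ties by
# row order, so every quartile receives the same number of customers.
toy["Frequency_Segment"] = pd.qcut(
    toy["OrderCount"].rank(method="first"), q=4, labels=labels
)

print(toy[["CustomerID", "OrderCount", "CLV_Segment", "Frequency_Segment"]])
```

One trade-off to keep in mind: with rank-based tie-breaking, customers with identical order counts can land in different frequency segments, so the boundary between adjacent segments is somewhat arbitrary when ties are common.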

In [None]:
# Create customer-level summary (one row per customer; rows without a CustomerID are dropped by groupby)
customer_summary = df.groupby('CustomerID').agg({
    'TotalAmount': 'sum',
    'InvoiceNo': 'nunique',
    'InvoiceDate': ['min', 'max']
}).reset_index()
customer_summary.columns = ['CustomerID', 'TotalSpend', 'OrderCount', 'FirstOrder', 'LastOrder']

# Calculate customer tenure in days
customer_summary['TenureDays'] = (customer_summary['LastOrder'] - customer_summary['FirstOrder']).dt.days

print(f"Unique customers: {len(customer_summary):,}")
customer_summary.describe()

In [None]:
# Segment using quartiles (more statistically sound than arbitrary cutoffs)
customer_summary['CLV_Segment'] = pd.qcut(
    customer_summary['TotalSpend'], 
    q=4, 
    labels=['Low', 'Medium-Low', 'Medium-High', 'High']
)

customer_summary['Frequency_Segment'] = pd.qcut(
    customer_summary['OrderCount'].rank(method='first'),  # Handle ties
    q=4, 
    labels=['Low', 'Medium-Low', 'Medium-High', 'High']
)

# Create combined segment
customer_summary['Segment'] = customer_summary['CLV_Segment'].astype(str) + ' CLV / ' + customer_summary['Frequency_Segment'].astype(str) + ' Freq'

# Simplified 2x2 segmentation for clearer visualization
customer_summary['SimpleSegment'] = customer_summary.apply(
    lambda x: 'Champions' if x['CLV_Segment'] in ['High', 'Medium-High'] and x['Frequency_Segment'] in ['High', 'Medium-High']
    else 'Loyal' if x['Frequency_Segment'] in ['High', 'Medium-High']
    else 'Big Spenders' if x['CLV_Segment'] in ['High', 'Medium-High']
    else 'At Risk',
    axis=1
)

In [None]:
# Visualize segments - using BAR CHART (not pie chart - easier to compare)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Segment distribution
segment_counts = customer_summary['SimpleSegment'].value_counts()
colors = {'Champions': '#2ecc71', 'Loyal': '#3498db', 'Big Spenders': '#f39c12', 'At Risk': '#e74c3c'}
segment_colors = [colors[s] for s in segment_counts.index]

axes[0].barh(segment_counts.index, segment_counts.values, color=segment_colors)
axes[0].set_xlabel('Number of Customers')
axes[0].set_title('Customer Segment Distribution')
for i, v in enumerate(segment_counts.values):
    axes[0].text(v + 50, i, f'{v:,} ({v/len(customer_summary)*100:.1f}%)', va='center')

# Revenue contribution by segment
segment_revenue = customer_summary.groupby('SimpleSegment')['TotalSpend'].sum().sort_values(ascending=True)
segment_colors_rev = [colors[s] for s in segment_revenue.index]

axes[1].barh(segment_revenue.index, segment_revenue.values, color=segment_colors_rev)
axes[1].set_xlabel('Total Revenue')
axes[1].set_title('Revenue Contribution by Segment')
total_rev = segment_revenue.sum()
for i, v in enumerate(segment_revenue.values):
    axes[1].text(v + 10000, i, f'${v:,.0f} ({v/total_rev*100:.1f}%)', va='center')

plt.tight_layout()
plt.show()

In [None]:
# Segment statistics
segment_stats = customer_summary.groupby('SimpleSegment').agg({
    'CustomerID': 'count',
    'TotalSpend': ['sum', 'mean'],
    'OrderCount': 'mean',
    'TenureDays': 'mean'
}).round(2)
segment_stats.columns = ['Count', 'Total Revenue', 'Avg CLV', 'Avg Orders', 'Avg Tenure (days)']
segment_stats['% of Customers'] = (segment_stats['Count'] / segment_stats['Count'].sum() * 100).round(1)
segment_stats['% of Revenue'] = (segment_stats['Total Revenue'] / segment_stats['Total Revenue'].sum() * 100).round(1)
segment_stats = segment_stats[['Count', '% of Customers', 'Total Revenue', '% of Revenue', 'Avg CLV', 'Avg Orders', 'Avg Tenure (days)']]
segment_stats

### Insight 1: Customer Segmentation

**Finding**: 
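- **Champions** (top-half CLV and top-half frequency) drive a disproportionate share of revenue relative to their share of customers
- **At Risk** customers (bottom-half CLV and bottom-half frequency) represent a large portion of the customer base but contribute less revenue. This label is based on value and frequency only, not on how recently the customer ordered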

**Recommendation**:
1. **Protect Champions**: VIP program, dedicated support, early access to new products
2. **Convert At Risk**: Re-engagement campaigns, win-back offers
3. **Grow Big Spenders**: Encourage repeat purchases with loyalty rewards
4. **Upsell Loyal**: Introduce premium products to increase average order value
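
The four segment labels come from a row-wise `apply` in the segmentation cell. A possible vectorized alternative, sketched here on a hypothetical four-row frame, uses `np.select`, which picks the first matching condition in the same order as the original if/else chain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input using the same quartile labels as the notebook.
toy = pd.DataFrame({
    "CLV_Segment": ["High", "Low", "Medium-High", "Medium-Low"],
    "Frequency_Segment": ["Medium-High", "High", "Low", "Low"],
})

top_half = ["High", "Medium-High"]
high_value = toy["CLV_Segment"].isin(top_half)
high_freq = toy["Frequency_Segment"].isin(top_half)

# Conditions are checked in order and the first match wins,
# mirroring the if/else chain in the row-wise lambda.
toy["SimpleSegment"] = np.select(
    [high_value & high_freq, high_freq, high_value],
    ["Champions", "Loyal", "Big Spenders"],
    default="At Risk",
)

print(toy)
```

On a customer table of a few thousand rows the speed difference is unlikely to matter; the main benefit is that the two boolean masks make the 2x2 definition explicit.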

---
## 2. Geographic Analysis

**Question**: Which markets drive revenue and where are growth opportunities?

In [None]:
# Country-level metrics
country_metrics = df.groupby('Country').agg({
    'TotalAmount': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique'
}).reset_index()
country_metrics.columns = ['Country', 'Revenue', 'Orders', 'Customers']
country_metrics['Avg Order Value'] = (country_metrics['Revenue'] / country_metrics['Orders']).round(2)
country_metrics['Revenue per Customer'] = (country_metrics['Revenue'] / country_metrics['Customers']).round(2)
country_metrics = country_metrics.sort_values('Revenue', ascending=False)

print("Top 10 Countries by Revenue:")
country_metrics.head(10)

In [None]:
# Visualize top markets
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Top 10 by revenue
top_10 = country_metrics.head(10)
axes[0].barh(top_10['Country'][::-1], top_10['Revenue'][::-1], color='steelblue')
axes[0].set_xlabel('Total Revenue')
axes[0].set_title('Top 10 Countries by Revenue')

# Average order value for the top 10 non-UK markets by revenue
non_uk = country_metrics[country_metrics['Country'] != 'United Kingdom'].head(10)
axes[1].barh(non_uk['Country'][::-1], non_uk['Avg Order Value'][::-1], color='coral')
axes[1].set_xlabel('Average Order Value')
axes[1].set_title('Avg Order Value: Top 10 Non-UK Markets by Revenue')

plt.tight_layout()
plt.show()

In [None]:
# UK vs International breakdown
uk_revenue = country_metrics[country_metrics['Country'] == 'United Kingdom']['Revenue'].values[0]
intl_revenue = country_metrics[country_metrics['Country'] != 'United Kingdom']['Revenue'].sum()
total_revenue = uk_revenue + intl_revenue

print("=== UK vs International ===")
print(f"UK Revenue: ${uk_revenue:,.2f} ({uk_revenue/total_revenue*100:.1f}%)")
print(f"International Revenue: ${intl_revenue:,.2f} ({intl_revenue/total_revenue*100:.1f}%)")
print("\nTop international markets by AOV:")
print(country_metrics[country_metrics['Country'] != 'United Kingdom'].nlargest(5, 'Avg Order Value')[['Country', 'Avg Order Value', 'Customers']])

### Insight 2: Geographic Performance

**Finding**:
- **UK dominates** (~82% of revenue) - this is the home market
- **Netherlands, Australia, Japan** have highest average order values
- International markets are underserved but show premium buying behavior

**Recommendation**:
1. **Expand into Netherlands/Australia**: High AOV may indicate wholesale or affluent buyers; confirm that customer counts are large enough to be representative before investing
2. **Investigate EIRE (Ireland)**: Geographic proximity + high revenue = easy expansion
3. **Localization**: Consider local payment methods, currency for top international markets
4. **Marketing spend allocation**: Test increased ad spend in high-AOV countries
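
AOV rankings can be dominated by countries with only a handful of customers. One way to guard against that, sketched below on made-up figures, is to apply a minimum customer count before ranking. The `MIN_CUSTOMERS` threshold and the `toy_metrics` values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical country metrics; the figures are illustrative, not from the dataset.
toy_metrics = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Revenue": [50000.0, 12000.0, 9000.0, 4000.0],
    "Orders": [100, 40, 3, 20],
    "Customers": [25, 12, 1, 8],
})
toy_metrics["Avg Order Value"] = toy_metrics["Revenue"] / toy_metrics["Orders"]

MIN_CUSTOMERS = 5  # assumed threshold; tune it to the real data

# Country C has the highest AOV but only one customer, so it is filtered out.
reliable = toy_metrics[toy_metrics["Customers"] >= MIN_CUSTOMERS]
ranked = reliable.nlargest(3, "Avg Order Value")

print(ranked[["Country", "Avg Order Value", "Customers"]])
```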

---
## 3. Temporal Analysis

**Question**: What are the sales patterns over time?

In [None]:
# Monthly trend
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
monthly_sales = df.groupby('YearMonth').agg({
    'TotalAmount': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique'
}).reset_index()
monthly_sales.columns = ['Month', 'Revenue', 'Orders', 'Active Customers']
monthly_sales['Month'] = monthly_sales['Month'].astype(str)

# Calculate MoM growth
monthly_sales['Revenue_Growth'] = monthly_sales['Revenue'].pct_change() * 100

In [None]:
# Plot monthly trends
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Revenue trend
axes[0].plot(monthly_sales['Month'], monthly_sales['Revenue'], marker='o', linewidth=2, color='#2ecc71')
axes[0].fill_between(monthly_sales['Month'], monthly_sales['Revenue'], alpha=0.3, color='#2ecc71')
axes[0].set_ylabel('Revenue')
axes[0].set_title('Monthly Revenue Trend')
axes[0].tick_params(axis='x', rotation=45)

# Highlight peak season
peak_months = monthly_sales[monthly_sales['Revenue'] > monthly_sales['Revenue'].quantile(0.75)]['Month'].tolist()
for month in peak_months:
    idx = monthly_sales[monthly_sales['Month'] == month].index[0]
    axes[0].axvspan(idx-0.5, idx+0.5, alpha=0.2, color='gold')

# Growth rate
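# Use a separate name so the segment `colors` dict from the earlier cell is not overwritten
growth_colors = ['green' if x > 0 else 'red' for x in monthly_sales['Revenue_Growth'].fillna(0)]
axes[1].bar(monthly_sales['Month'], monthly_sales['Revenue_Growth'].fillna(0), color=growth_colors)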
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[1].set_ylabel('MoM Growth %')
axes[1].set_title('Month-over-Month Revenue Growth')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
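# Day of week analysis
# 'DayOfWeek' is expected to hold day names; derive it from InvoiceDate if the column is missing
if 'DayOfWeek' not in df.columns:
    df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
dow_sales = df.groupby('DayOfWeek')['TotalAmount'].sum()
# Reorder days properly; days with no transactions show as 0 instead of NaN
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_sales = dow_sales.reindex(day_order, fill_value=0)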

plt.figure(figsize=(10, 5))
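# Use a separate name so the segment `colors` dict from the earlier cell is not overwritten
day_colors = ['#e74c3c' if day in ['Saturday', 'Sunday'] else '#3498db' for day in day_order]
plt.bar(dow_sales.index, dow_sales.values, color=day_colors)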
plt.xlabel('Day of Week')
plt.ylabel('Total Revenue')
plt.title('Revenue by Day of Week (Red = Weekend)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Weekend vs Weekday comparison
weekend_rev = dow_sales[['Saturday', 'Sunday']].sum()
weekday_rev = dow_sales[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']].sum()
print(f"Weekday revenue: ${weekday_rev:,.2f} ({weekday_rev/(weekend_rev+weekday_rev)*100:.1f}%)")
print(f"Weekend revenue: ${weekend_rev:,.2f} ({weekend_rev/(weekend_rev+weekday_rev)*100:.1f}%)")

### Insight 3: Temporal Patterns

**Finding**:
- **Peak season**: September-November (late Q3 into Q4, ahead of the holidays)
  - Gift retailer = customers buying for Christmas/holiday gifts
- **Weekdays dominate**: consistent with B2B customers (wholesalers) ordering on business days
- **December drop**: The data ends on Dec 9, 2011, so December is a partial month and its MoM decline should not be read as a real downturn

**Recommendation**:
1. **Inventory planning**: Stock up before September
2. **Marketing calendar**: Increase ad spend August-October to capture holiday demand
3. **Staffing**: Scale customer service for Q4 peak
4. **B2C opportunity**: Low weekend sales may point to an untapped B2C market; test this with weekend promotions
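
Because the final month is incomplete, its month-over-month growth figure is misleading. A minimal sketch of one way to handle this, assuming the only partial month is the last one, is to drop it before calling `pct_change`. The `toy` transactions below are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: two full months and one that stops on the 9th.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-10-03", "2011-10-31",
        "2011-11-01", "2011-11-30",
        "2011-12-02", "2011-12-09",
    ]),
    "TotalAmount": [100.0, 150.0, 200.0, 300.0, 60.0, 40.0],
})

toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")
monthly = toy.groupby("YearMonth")["TotalAmount"].sum()

# Treat the last month as complete only if the data reaches its final calendar day.
last_date = toy["InvoiceDate"].max()
last_period = last_date.to_period("M")
if last_date.normalize() < last_period.end_time.normalize():
    monthly = monthly.drop(last_period)

growth = monthly.pct_change() * 100
print(growth)
```

An alternative would be to keep the partial month in the revenue plot but annotate it, and exclude it only from the growth calculation.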

---
## 4. Customer Retention Analysis

**Question**: How important are returning customers?

In [None]:
# New vs returning customer analysis
customer_type = customer_summary.copy()
customer_type['CustomerType'] = customer_type['OrderCount'].apply(lambda x: 'One-time' if x == 1 else 'Returning')

retention_stats = customer_type.groupby('CustomerType').agg({
    'CustomerID': 'count',
    'TotalSpend': ['sum', 'mean']
}).round(2)
retention_stats.columns = ['Customers', 'Total Revenue', 'Avg CLV']
retention_stats['% of Customers'] = (retention_stats['Customers'] / retention_stats['Customers'].sum() * 100).round(1)
retention_stats['% of Revenue'] = (retention_stats['Total Revenue'] / retention_stats['Total Revenue'].sum() * 100).round(1)

retention_stats

In [None]:
# Visualize retention impact
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Customer count
axes[0].bar(['One-time', 'Returning'], retention_stats['Customers'], color=['#e74c3c', '#2ecc71'])
axes[0].set_ylabel('Number of Customers')
axes[0].set_title('Customer Distribution')
for i, v in enumerate(retention_stats['Customers']):
    axes[0].text(i, v + 50, f'{v:,}\n({retention_stats["% of Customers"].iloc[i]}%)', ha='center')

# Revenue
axes[1].bar(['One-time', 'Returning'], retention_stats['Total Revenue'], color=['#e74c3c', '#2ecc71'])
axes[1].set_ylabel('Total Revenue')
axes[1].set_title('Revenue Contribution')
for i, v in enumerate(retention_stats['Total Revenue']):
    axes[1].text(i, v + 50000, f'${v:,.0f}\n({retention_stats["% of Revenue"].iloc[i]}%)', ha='center')

plt.tight_layout()
plt.show()

In [None]:
# CLV comparison
returning_clv = retention_stats.loc['Returning', 'Avg CLV']
onetime_clv = retention_stats.loc['One-time', 'Avg CLV']
clv_multiplier = returning_clv / onetime_clv

print("=== Customer Value Comparison ===")
print(f"One-time customer avg value: ${onetime_clv:,.2f}")
print(f"Returning customer avg value: ${returning_clv:,.2f}")
print(f"\nReturning customers are {clv_multiplier:.1f}x more valuable")

### Insight 4: Retention Economics

**Finding**:
- Returning customers generate the **vast majority of revenue**
- A returning customer is **~Xx more valuable** than a one-time buyer
- One-time customers represent potential churn/missed opportunity

**Recommendation**:
1. **First-purchase follow-up**: Email sequence after first order to drive repeat
2. **Loyalty program**: Reward repeat purchases with points/discounts
3. **Reactivation campaigns**: Target one-time buyers with win-back offers
4. **CAC justification**: The higher CLV of returning customers can justify a higher acquisition cost, provided new customers are converted into repeat buyers
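
The one-time vs. returning split is a static view. A cohort table, sketched below on a hypothetical order log, shows what share of each first-purchase cohort orders again in later months. The `orders` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical order log: one row per order.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3, 4],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-03-15",
        "2011-01-20", "2011-03-02",
        "2011-02-08",
        "2011-02-25",
    ]),
})

orders["OrderMonth"] = orders["InvoiceDate"].dt.to_period("M")
# Cohort = month of each customer's first order.
orders["Cohort"] = (
    orders.groupby("CustomerID")["InvoiceDate"].transform("min").dt.to_period("M")
)
# Whole months elapsed between the cohort month and the order month.
orders["MonthsSince"] = (
    (orders["OrderMonth"].dt.year - orders["Cohort"].dt.year) * 12
    + (orders["OrderMonth"].dt.month - orders["Cohort"].dt.month)
)

# Distinct active customers per cohort and month offset.
active = (
    orders.groupby(["Cohort", "MonthsSince"])["CustomerID"]
    .nunique()
    .unstack(fill_value=0)
)
# Divide by cohort size (month 0) to get retention rates.
retention = active.div(active[0], axis=0)
print(retention)
```

Later cohorts have fewer observable months, so zeros near the end of the data window reflect missing observation time as much as churn.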

---
## 5. Product Analysis

**Question**: Which products drive revenue, and does revenue follow the Pareto principle?

In [None]:
# Product-level analysis
# Group by StockCode only: a code may appear with more than one description,
# and grouping on both columns would count each variant as a separate product
product_stats = df.groupby('StockCode').agg(
    Description=('Description', 'first'),
    Revenue=('TotalAmount', 'sum'),
    Quantity=('Quantity', 'sum'),
    Orders=('InvoiceNo', 'nunique')
).reset_index()
product_stats = product_stats.sort_values('Revenue', ascending=False)

print(f"Total products: {len(product_stats):,}")
print("\nTop 10 Products by Revenue:")
product_stats.head(10)

In [None]:
# Pareto analysis
product_stats['Cumulative Revenue'] = product_stats['Revenue'].cumsum()
product_stats['Cumulative %'] = product_stats['Cumulative Revenue'] / product_stats['Revenue'].sum() * 100
product_stats['Product Rank'] = range(1, len(product_stats) + 1)
product_stats['Product Rank %'] = product_stats['Product Rank'] / len(product_stats) * 100

# Find 80/20 point
pareto_products = product_stats[product_stats['Cumulative %'] <= 80]
pareto_pct = len(pareto_products) / len(product_stats) * 100

plt.figure(figsize=(12, 6))
plt.plot(product_stats['Product Rank %'], product_stats['Cumulative %'], linewidth=2, color='#2ecc71')
plt.axhline(y=80, color='red', linestyle='--', alpha=0.7, label='80% Revenue')
plt.axvline(x=pareto_pct, color='blue', linestyle='--', alpha=0.7, label=f'{pareto_pct:.1f}% Products')
plt.fill_between(product_stats['Product Rank %'], product_stats['Cumulative %'], alpha=0.3, color='#2ecc71')
plt.xlabel('% of Products (Ranked by Revenue)')
plt.ylabel('Cumulative % of Revenue')
plt.title('Product Revenue Concentration (Pareto Analysis)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n=== Pareto Analysis ===")
print(f"Top {pareto_pct:.1f}% of products generate 80% of revenue")
print(f"That's {len(pareto_products):,} out of {len(product_stats):,} products")

In [None]:
# Price category performance
price_performance = df.groupby('PriceCategory').agg({
    'TotalAmount': 'sum',
    'InvoiceNo': 'nunique',
    'Quantity': 'sum'
}).reset_index()
price_performance.columns = ['PriceCategory', 'Revenue', 'Orders', 'Units']
price_performance['Avg Revenue per Order'] = (price_performance['Revenue'] / price_performance['Orders']).round(2)

print("Performance by Price Category:")
price_performance

### Insight 5: Product Strategy

**Finding**:
- **Pareto principle confirmed**: Small % of products drive majority of revenue
- **Long tail exists**: Many products with minimal sales
- **Medium/High price items** generate most revenue

**Recommendation**:
1. **Protect top performers**: Ensure stock availability for top-selling items
2. **SKU rationalization**: Consider discontinuing bottom 20% of products
3. **Bundle strategy**: Pair slow movers with popular items
4. **Pricing optimization**: Test price increases on top sellers to check whether their demand is inelastic
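
The Pareto cell keeps products whose cumulative share is at or below 80%, which stops one product short of actually reaching the threshold. On thousands of products the difference is negligible, but an inclusive count can be sketched as follows. The helper `pareto_share` and the toy revenue values are hypothetical, and the sketch assumes non-negative product revenue.

```python
import numpy as np
import pandas as pd

def pareto_share(revenue: pd.Series, target: float = 0.80) -> tuple[int, float]:
    """Return (number of products, share of products) needed to reach `target` of revenue.

    Assumes non-negative revenue so the cumulative share is non-decreasing.
    """
    ranked = revenue.sort_values(ascending=False).to_numpy()
    cumulative = np.cumsum(ranked) / ranked.sum()
    # Index of the first product at which the cumulative share reaches the target,
    # converted to an inclusive count and capped at the number of products.
    n_products = min(int(np.searchsorted(cumulative, target, side="left")) + 1, len(ranked))
    return n_products, n_products / len(ranked)

# Hypothetical revenue per product (total = 1,000).
toy_revenue = pd.Series([500.0, 250.0, 100.0, 50.0, 40.0, 30.0, 20.0, 10.0])
n, share = pareto_share(toy_revenue)
print(f"{n} products ({share:.1%}) reach 80% of revenue")
```

Here the first two products cover 75% of revenue and the third brings the total to 85%, so three products are needed, whereas a `<= 80` filter would report two.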

---
## Summary: Key Findings & Recommendations

| Area | Key Finding | Recommendation | Priority |
|------|-------------|----------------|----------|
| **Customers** | Champions drive disproportionate revenue | Implement VIP loyalty program | High |
| **Geography** | UK dominates; Netherlands/Australia high AOV | Expand international marketing | Medium |
| **Timing** | September-November peak; weekday-heavy (B2B) | Inventory planning for Q3-Q4 | High |
| **Retention** | Returning customers Xx more valuable | First-purchase nurture sequence | High |
| **Products** | Pareto effect; ~X% drive 80% revenue | Protect top SKUs; rationalize tail | Medium |

### Next Steps
1. Build predictive churn model to identify at-risk Champions
2. A/B test international ad campaigns in high-AOV markets
3. Implement RFM scoring for targeted marketing
4. Develop cohort analysis to track retention over time
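
As a starting point for the RFM step, the sketch below scores recency, frequency, and monetary value on a 1-4 quartile scale, reusing the rank-then-`qcut` approach from the segmentation section. The `toy` data, the `quartile_score` helper, and the choice of snapshot date (one day after the latest order) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical customer summary with the same column names as customer_summary.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8],
    "LastOrder": pd.to_datetime([
        "2011-12-08", "2011-12-01", "2011-11-15", "2011-10-20",
        "2011-09-05", "2011-06-30", "2011-03-12", "2011-01-10",
    ]),
    "OrderCount": [12, 1, 6, 2, 3, 1, 1, 1],
    "TotalSpend": [3000.0, 40.0, 900.0, 150.0, 400.0, 60.0, 25.0, 80.0],
})

# Recency is measured against an assumed snapshot date.
snapshot = toy["LastOrder"].max() + pd.Timedelta(days=1)
toy["RecencyDays"] = (snapshot - toy["LastOrder"]).dt.days

def quartile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score 1 (worst) to 4 (best) by quartile, breaking ties by row order."""
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)

toy["R"] = quartile_score(toy["RecencyDays"], higher_is_better=False)  # fewer days is better
toy["F"] = quartile_score(toy["OrderCount"])
toy["M"] = quartile_score(toy["TotalSpend"])
toy["RFM"] = toy["R"].astype(str) + toy["F"].astype(str) + toy["M"].astype(str)

print(toy[["CustomerID", "RecencyDays", "R", "F", "M", "RFM"]])
```

Adding recency would also give the "At Risk" label a firmer basis, since the current segmentation uses only value and frequency.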