# 📊 Professional Data Analysis & EDA Report

## 1. Executive Summary
This analysis aims to uncover patterns in customer purchasing behavior to drive segmentation and predictive modeling. We use the `Online Retail II` dataset, which contains transactions from a UK-based, non-store online retailer.

**Key Objectives:**
1.  **Data Quality Assessment**: Evaluate the integrity of the raw data.
2.  **Sales Trends**: Identify seasonal patterns and peak sales periods.
3.  **Product Analysis**: Determine top-performing products.
4.  **Customer Behavior (RFM)**: Analyze Recency, Frequency, and Monetary distributions for segmentation.

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load the processed data produced by our pipeline
try:
    df = pd.read_csv('../data/processed/cleaned_transactions.csv')
    rfm = pd.read_csv('../data/processed/rfm_customer_data.csv')
    print("✅ Data loaded successfully")
except FileNotFoundError:
    print("❌ Processed data not found. Please run src/data_prep.py first.")
    raise  # Stop here: every later cell depends on df and rfm
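
The first objective in the summary, data quality assessment, can be sketched as a per-column report. This is a minimal illustration, assuming the `Quantity` and `Price` columns used later in this notebook; the helper `quality_report` and the small sample frame are hypothetical stand-ins for the real `df`.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values, and distinct values per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(2),
        "n_unique": frame.nunique(),
    })


# Hypothetical sample standing in for the cleaned transactions.
sample = pd.DataFrame({
    "Quantity": [1, 2, None, -4],
    "Price": [2.5, 0.0, 3.0, 1.25],
})

report = quality_report(sample)
n_duplicates = int(sample.duplicated().sum())            # fully repeated rows
n_nonpositive_qty = int((sample["Quantity"] <= 0).sum())  # returns or bad entries
n_nonpositive_price = int((sample["Price"] <= 0).sum())   # free or mispriced items

print(report)
print(n_duplicates, n_nonpositive_qty, n_nonpositive_price)
```

On the real data, non-zero counts for the last three checks would suggest that some cleaning rules still need to be applied upstream.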

## 2. Univariate Analysis

### Distribution of Key Metrics
Understanding the distributions of `Quantity` and `Price` (the unit price) helps us detect any remaining outliers and understand the typical transaction size.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 5))

# Quantity distribution (log-scaled y-axis so that rare bulk orders stay visible)
sns.histplot(df['Quantity'], bins=50, ax=ax[0], color='skyblue')
ax[0].set_title('Distribution of Quantity (Log Scale)', fontsize=14)
ax[0].set_yscale('log')
ax[0].set_xlabel('Quantity Ordered')

# Price Distribution
sns.histplot(df['Price'], bins=50, ax=ax[1], color='salmon')
ax[1].set_title('Distribution of Unit Price (Log Scale)', fontsize=14)
ax[1].set_yscale('log')
ax[1].set_xlabel('Unit Price (£)')

plt.tight_layout()
plt.show()

**📝 Insight:**
- Both distributions are highly right-skewed, which is typical of retail data and consistent with a heavy-tailed, power-law-like pattern.
- Most transactions involve small quantities (likely consumers), but there are significant bulk purchases (likely wholesalers).
- Price similarly follows a long-tailed distribution.
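
The skew described above can also be checked numerically rather than by eye. The sketch below compares the mean with the median and reports sample skewness; the function `skew_summary` and the sample values are hypothetical, and on the real data it would be applied to `df['Quantity']` and `df['Price']`.

```python
import pandas as pd


def skew_summary(series: pd.Series) -> dict:
    """Return simple indicators of right skew for a numeric series."""
    return {
        "mean": float(series.mean()),
        "median": float(series.median()),
        "p99": float(series.quantile(0.99)),
        "skewness": float(series.skew()),
    }


# Hypothetical quantities: many small orders plus one bulk order.
quantities = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 6, 500])
summary = skew_summary(quantities)
print(summary)
```

A mean well above the median together with positive skewness is the numerical signature of the long right tail seen in the histograms.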

## 3. Sales Trends (Time Series Analysis)

Analyzing sales over time helps us understand seasonality, such as holiday spikes.

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
daily_sales = df.set_index('InvoiceDate').resample('D')['TotalAmount'].sum()

plt.figure(figsize=(15, 6))
daily_sales.plot(color='teal', linewidth=1.5)
plt.title('Daily Sales Revenue Trend', fontsize=16)
plt.ylabel('Total Revenue (£)')
plt.xlabel('Date')
plt.show()

**📝 Insight:**
- We observe clear spikes in Q4 (November/December), indicating strong seasonality likely driven by holiday shopping.
- There are occasional dips which might correspond to weekends or stockouts.
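
Both observations can be tested directly: grouping revenue by day of week shows whether the dips line up with weekends, and grouping by calendar month shows how strong the Q4 spike is. This is a minimal sketch, assuming the `InvoiceDate` and `TotalAmount` columns used above; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical transactions standing in for df.
sample = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-11-07", "2011-11-08", "2011-11-12",  # Mon, Tue, Sat
        "2011-11-14", "2011-12-05", "2011-06-06",  # Mon, Mon, Mon
    ]),
    "TotalAmount": [100.0, 150.0, 10.0, 120.0, 300.0, 50.0],
})

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Revenue per weekday (0 = Monday); days with no sales are filled with 0.
by_weekday = (
    sample.groupby(sample["InvoiceDate"].dt.dayofweek)["TotalAmount"]
    .sum()
    .reindex(range(7), fill_value=0.0)
)
by_weekday.index = day_names

# Revenue per calendar month (1 = January, 12 = December).
by_month = sample.groupby(sample["InvoiceDate"].dt.month)["TotalAmount"].sum()

weekend_share = by_weekday[["Saturday", "Sunday"]].sum() / by_weekday.sum()

print(by_weekday)
print(by_month)
print(round(weekend_share, 4))
```

A weekend share close to zero on the real data would support the weekend explanation for the dips; stockouts would need inventory data that this dataset does not include.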

## 4. Geographic Analysis
Where are our customers located?

In [None]:
top_countries = df.groupby('Country')['TotalAmount'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Countries by Revenue', fontsize=16)
plt.xlabel('Total Revenue (£)')
plt.show()
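
Absolute revenue per country can hide how concentrated sales are. The sketch below expresses each country's revenue as a share of the total and accumulates it, assuming the `Country` and `TotalAmount` columns used above; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical transactions standing in for df.
sample = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "Germany"],
    "TotalAmount": [700.0, 100.0, 150.0, 50.0],
})

revenue = (
    sample.groupby("Country")["TotalAmount"]
    .sum()
    .sort_values(ascending=False)
)

# Share of total revenue per country, and the running total of that share.
share = revenue / revenue.sum()
cumulative_share = share.cumsum()

print(pd.DataFrame({"share": share, "cumulative": cumulative_share}))
```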

## 5. Customer Segmentation (RFM Analysis)

We use **RFM (Recency, Frequency, Monetary)** analysis, a widely used framework for retail customer segmentation.

- **Recency**: Days since last purchase.
- **Frequency**: Total number of transactions.
- **Monetary**: Total revenue contributed.
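
The three definitions above map onto a single group-by over the transaction table. The sketch below is one way to derive them and is not necessarily how `src/data_prep.py` builds `rfm_customer_data.csv`; the column names `CustomerID` and `Invoice`, the snapshot date, and the sample rows are all assumptions.

```python
import pandas as pd


def compute_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp,
                customer_col: str = "CustomerID",
                invoice_col: str = "Invoice") -> pd.DataFrame:
    """Derive Recency, Frequency, and Monetary per customer."""
    grouped = tx.groupby(customer_col)
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase.
        "Recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        # Number of distinct invoices, so multi-line invoices count once.
        "Frequency": grouped[invoice_col].nunique(),
        # Total revenue contributed by the customer.
        "Monetary": grouped["TotalAmount"].sum(),
    })


# Hypothetical transactions: customer A has two invoices, B has one.
tx = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "Invoice": [1001, 1001, 1002, 2001],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-05", "2011-11-10"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 100.0],
})

rfm_sample = compute_rfm(tx, snapshot=pd.Timestamp("2011-12-10"))
print(rfm_sample)
```

Counting distinct invoices rather than rows keeps Frequency from being inflated by invoices with many line items.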

Let's visualize the relationships.

In [None]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older Matplotlib to register the '3d' projection)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Apply log1p to each axis so that a few extreme customers do not dominate the plot
x = np.log1p(rfm['Recency'])
y = np.log1p(rfm['Frequency'])
z = np.log1p(rfm['Monetary'])

sc = ax.scatter(x, y, z, c=z, cmap='plasma', s=20, alpha=0.6)
ax.set_xlabel('Log(Recency)')
ax.set_ylabel('Log(Frequency)')
ax.set_zlabel('Log(Monetary)')
plt.colorbar(sc, label='Log(Monetary Value)')
plt.title('3D View of Customer RFM Profile', fontsize=16)
plt.show()

### Correlation Matrix of RFM Variables

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(rfm[['Recency', 'Frequency', 'Monetary']].corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of RFM Variables', fontsize=14)
plt.show()

**📝 Insight:**
- **Frequency and Monetary** value are highly correlated (0.8+): customers who buy often tend to spend more.
- **Recency** has a negative correlation with Frequency: customers who purchased recently tend to purchase more often. Evaluating 'Churn' often involves looking at customers with high Recency and low Frequency.
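
Because the RFM variables contain outliers, a Pearson coefficient can be driven by a handful of extreme customers. A rank-based (Spearman) correlation is a common robustness check; the sketch below computes it as the Pearson correlation of the ranks. The sample values are hypothetical, and on the real data the same two lines would be applied to `rfm[['Recency', 'Frequency', 'Monetary']]`.

```python
import pandas as pd

# Hypothetical RFM rows with one very large spender.
sample_rfm = pd.DataFrame({
    "Recency": [5, 30, 60, 120, 300],
    "Frequency": [20, 8, 5, 2, 1],
    "Monetary": [5000.0, 900.0, 400.0, 150.0, 20.0],
})

# Pearson on the raw values is sensitive to the outlier.
pearson = sample_rfm.corr()

# Spearman correlation is the Pearson correlation of the ranks.
spearman = sample_rfm.rank().corr()

print(pearson.round(3))
print(spearman.round(3))
```

If the two matrices disagree noticeably on the real data, the rank-based figures are the safer ones to quote.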

## 6. Recommendations
Based on the analysis:
1. **Target Seasonal Spikes**: Prepare inventory and marketing campaigns for Q4.
2. **Wholesaler Strategy**: Separate 'Whales' (high quantity or monetary value) from regular retail customers for tailored B2B service.
3. **Re-engagement**: Customers with high Monetary value but high Recency (haven't bought in a while) should be targeted with 'We Miss You' campaigns.
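
Recommendations 2 and 3 can be turned into simple flags on the RFM table. The sketch below uses quantile cut-offs chosen purely for illustration and identifies 'Whales' by Monetary value only; the function `flag_segments`, the 0.75 thresholds, and the sample rows are hypothetical and would need tuning against the real `rfm` data.

```python
import pandas as pd


def flag_segments(rfm: pd.DataFrame, monetary_q: float = 0.75,
                  recency_q: float = 0.75) -> pd.DataFrame:
    """Label customers as Whale, Re-engage, or Regular using quantile cut-offs."""
    out = rfm.copy()
    high_monetary = out["Monetary"] >= out["Monetary"].quantile(monetary_q)
    high_recency = out["Recency"] >= out["Recency"].quantile(recency_q)

    out["Segment"] = "Regular"
    # High spenders who bought recently: candidates for B2B-style service.
    out.loc[high_monetary & ~high_recency, "Segment"] = "Whale"
    # High spenders who have gone quiet: candidates for win-back campaigns.
    out.loc[high_monetary & high_recency, "Segment"] = "Re-engage"
    return out


# Hypothetical RFM rows.
sample_rfm = pd.DataFrame({
    "Recency": [5, 10, 280, 300, 20, 40, 15, 250],
    "Monetary": [5000.0, 100.0, 4000.0, 50.0, 200.0, 300.0, 150.0, 80.0],
})

flagged = flag_segments(sample_rfm)
print(flagged)
```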