# Capstone Business Analysis Project

**Objective:** Perform an end-to-end business analysis on sales data and deliver actionable recommendations suitable for portfolio presentation.

**This project includes:**
- Problem definition & scope
- Data ingestion & cleaning
- Exploratory Data Analysis (summary stats, trends)
- Visualizations: time series, box plots, violin plots, heatmaps, multiple subplots
- Customer segmentation (RFM features + KMeans)
- Product & category performance analysis
- Recommendations & implementation plan
- Appendix: reproducibility, next steps, files included

**Files in this bundle:** `analysis.ipynb`, `sales_data.csv`, `README.md`, `presentation.pptx`, `requirements.txt`.


In [None]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
df = pd.read_csv('sales_data.csv', parse_dates=['order_date'])
df.head()


## Data Cleaning & Overview
- Check missing values
- Basic summary statistics
- Create aggregated monthly revenue

In [None]:

# Basic checks
print('Rows:', len(df))
df.info()  # info() prints its report itself and returns None, so no print() wrapper
print(df.isna().sum())

# Aggregations
df['month'] = df['order_date'].dt.to_period('M').dt.to_timestamp()
monthly = df.groupby('month').agg({'revenue':'sum','order_id':'count'}).rename(columns={'order_id':'orders'}).reset_index()
monthly.head()
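
The checks above only report problems. A minimal sketch of one way to act on them follows. It runs on a small made-up frame (`raw` and all of its values are hypothetical) that reuses the notebook's column names. Dropping duplicate orders and rows with missing dates or revenue is an assumed policy, not something the dataset dictates; imputing or flagging may suit real data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing the notebook's column names.
raw = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4],
    'order_date': pd.to_datetime(
        ['2024-01-05', '2024-01-09', '2024-01-09', '2024-02-01', None]
    ),
    'revenue': [120.0, 80.0, 80.0, np.nan, 45.0],
})

clean = (
    raw.drop_duplicates(subset='order_id')        # keep the first row per order
       .dropna(subset=['order_date', 'revenue'])  # assumed policy: drop incomplete orders
       .reset_index(drop=True)
)
print(f'Kept {len(clean)} of {len(raw)} rows')
```

Running the cleaning step before the monthly aggregation keeps duplicated orders from inflating the `orders` count.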


## Visualizations
- Monthly revenue time series
- Top categories by revenue
- Box and violin plots of per-order revenue by category
- Heatmap: revenue by region and month

In [None]:

# Monthly revenue plot
plt.figure(figsize=(10,4))
plt.plot(monthly['month'], monthly['revenue'], marker='o')
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
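
To put numbers on the trend in the chart, month-over-month growth and a rolling mean can be derived from the monthly totals. The sketch below uses a hypothetical `monthly_demo` frame shaped like `monthly`; the three-month window is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical monthly totals shaped like the notebook's `monthly` frame.
monthly_demo = pd.DataFrame({
    'month': pd.date_range('2024-01-01', periods=4, freq='MS'),
    'revenue': [100.0, 110.0, 99.0, 120.0],
})

# Month-over-month change as a fraction of the previous month's revenue.
monthly_demo['mom_growth'] = monthly_demo['revenue'].pct_change()

# Three-month rolling mean to smooth short-term noise (window size is an assumption).
monthly_demo['rolling_3m'] = monthly_demo['revenue'].rolling(window=3).mean()

print(monthly_demo)
```

The first row of `mom_growth` and the first two rows of `rolling_3m` are NaN because there is not enough history to compute them.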


In [None]:

# Top categories
cat_rev = df.groupby('category').revenue.sum().sort_values(ascending=False)
cat_rev


In [None]:

# Box plot for revenue by category (per-order revenue)
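# Pass an explicit Axes: with by=..., DataFrame.boxplot otherwise opens its own figure
# and leaves the one created here empty.
fig, ax = plt.subplots(figsize=(8,4))
df.boxplot(column='revenue', by='category', rot=45, ax=ax)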
plt.suptitle(''); plt.title('Revenue distribution by category')
plt.tight_layout(); plt.show()


In [None]:

# Violin plot of per-order revenue by category (uses seaborn)
import seaborn as sns
plt.figure(figsize=(8,4))
sns.violinplot(x='category', y='revenue', data=df)
plt.xticks(rotation=45)
plt.title('Revenue distribution by category (violin)')
plt.tight_layout(); plt.show()


In [None]:

# Heatmap: pivot table region vs month
pivot = df.pivot_table(index='region', columns=df['month'].dt.strftime('%Y-%m'), values='revenue', aggfunc='sum', fill_value=0)
plt.figure(figsize=(10,4))
sns.heatmap(pivot, annot=True, fmt='.0f')
plt.title('Revenue by Region and Month')
plt.tight_layout(); plt.show()
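
The project overview lists multiple subplots, which the cells above draw one figure at a time. A minimal sketch of combining two views in one figure with `plt.subplots` follows. The `demo` frame and its values are hypothetical and only reuse the notebook's column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data reusing the notebook's column names.
demo = pd.DataFrame({
    'month': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-02-01', '2024-02-01']),
    'category': ['A', 'B', 'A', 'B'],
    'revenue': [100.0, 40.0, 130.0, 60.0],
})

# One row, two columns of axes in a single figure.
fig, (ax_trend, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: total revenue per month.
by_month = demo.groupby('month')['revenue'].sum()
ax_trend.plot(by_month.index, by_month.values, marker='o')
ax_trend.set_title('Monthly Revenue')
ax_trend.tick_params(axis='x', labelrotation=45)

# Right panel: total revenue per category, largest first.
by_cat = demo.groupby('category')['revenue'].sum().sort_values(ascending=False)
ax_cat.bar(by_cat.index, by_cat.values)
ax_cat.set_title('Revenue by Category')

fig.tight_layout()
```

With `%matplotlib inline` the figure renders when the cell finishes; in a plain script, add `plt.show()` at the end.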


## Customer Segmentation
- RFM (Recency, Frequency, Monetary)
- KMeans clustering to create segments

In [None]:

# RFM features
# Snapshot date: one day after the last order, so the most recent customer has recency 1
snapshot = df['order_date'].max() + pd.Timedelta(days=1)
rfm = df.groupby('customer_id').agg({
    'order_date': lambda x: (snapshot - x.max()).days,
    'order_id': 'count',
    'revenue': 'sum'
}).rename(columns={'order_date':'recency','order_id':'frequency','revenue':'monetary'}).reset_index()

rfm.describe()


In [None]:

# KMeans clustering (scale features)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = rfm[['recency','frequency','monetary']].copy()
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
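# n_init is set explicitly so results do not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(Xs)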
rfm['segment'] = kmeans.labels_

rfm.groupby('segment').agg({'recency':'mean','frequency':'mean','monetary':'mean','customer_id':'count'})
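
The cell above fixes the number of segments at 4 without showing how that value was chosen. One common check is to compare silhouette scores across candidate values of k. The sketch below does this on synthetic RFM-like data. `X_demo`, its three planted groups, and the candidate range 2 to 6 are all assumptions for illustration, not results from `sales_data.csv`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-like features: three well-separated synthetic groups
# (columns stand in for recency, frequency, monetary).
rng = np.random.default_rng(42)
centers = np.array([[10, 20, 2000], [90, 5, 400], [200, 1, 50]], dtype=float)
X_demo = np.vstack(
    [c + rng.normal(scale=[3, 1, 20], size=(50, 3)) for c in centers]
)
Xs_demo = StandardScaler().fit_transform(X_demo)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(Xs_demo)
    scores[k] = silhouette_score(Xs_demo, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette:', best_k)
```

On real data the scores are often close together, so the choice of k should also weigh whether the resulting segments are interpretable for the business.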


## Product & Category Performance
- Top products by revenue
- Recommendations

In [None]:

top_products = df.groupby('product_id').revenue.sum().sort_values(ascending=False).head(10)
top_products


### Recommendations (example)
- Focus marketing on top categories (e.g., Electronics)
- Retention campaigns for customers with high monetary value
- Pricing experiments for mid-performing products
- Inventory adjustments by region based on heatmap insights

## Appendix
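- How to run: install the packages in `requirements.txt`, then run `analysis.ipynb` from top to bottom with `sales_data.csv` in the same folder
- Requirements: pandas, numpy, matplotlib, seaborn, scikit-learn (the libraries imported in this notebook)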
- Next steps for real data


In [None]:

# Save RFM and monthly summary for reference
rfm.to_csv('rfm_segments.csv', index=False)
monthly.to_csv('monthly_summary.csv', index=False)
print('Saved rfm_segments.csv and monthly_summary.csv')
