# Olist E-Commerce Analytics — NTU SCTP Module 2 (Team 3)

This notebook reproduces the **business insights** used in the final presentation.

## What this notebook does
1. Loads the exported dataset (monthly revenue by product category) **from CSV** (default)
2. Produces key charts:
   - Monthly revenue trend
   - Monthly orders trend
   - Top categories by revenue
3. Generates NTU-safe summary insights (no causal overclaims)

> If you prefer loading directly from **BigQuery**, see the optional section at the end.


## 0) Setup

Install dependencies (if needed):

```bash
pip install pandas matplotlib numpy
```


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 140)


## 1) Load data (recommended: from CSV)

Expected columns (based on your export):
- `month` (e.g., `2017-11-01` or `2017-11`)
- `category_name`
- `total_orders`
- `product_revenue`
- `total_revenue_with_freight` (optional)

Update `CSV_PATH` to point to your exported file.
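
As a rough illustration of the expected shape, the sketch below reads a tiny in-memory CSV with the same column names. The rows are made-up placeholders (not real Olist figures), and `sample_csv` and `sample` are hypothetical names used only here.

```python
from io import StringIO

import pandas as pd

# Hypothetical rows that mirror the expected export schema (placeholder values)
sample_csv = StringIO(
    "month,category_name,total_orders,product_revenue,total_revenue_with_freight\n"
    "2017-11-01,health_beauty,120,15000.50,16250.75\n"
    "2017-11-01,toys,80,6400.00,7100.00\n"
    "2017-12-01,health_beauty,95,11800.25,12900.00\n"
)

# One row per (month, category) pair, as in the export
sample = pd.read_csv(sample_csv)
print(sample.dtypes)
```

If your export has different column names, rename them to the ones listed above before running the remaining cells.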


In [None]:
# ✅ Change this path if needed
CSV_PATH = "results-20251216-151001 - results-20251216-151001.csv"

df = pd.read_csv(CSV_PATH)
df.head()

## 2) Clean & validate schema

We normalize `month` to `YYYY-MM`, coerce the numeric columns (values that fail to parse become 0), and run quick sanity checks.
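
Because filling unparseable values with 0 can hide data problems, it may help to count them before filling. The sketch below shows the idea on a small made-up frame; `raw` and `n_bad` are hypothetical names, and the values are placeholders.

```python
import pandas as pd

# Hypothetical input with mixed month formats and one unparseable revenue value
raw = pd.DataFrame({
    "month": ["2017-11-01", "2017-12", "2018-01-01"],
    "product_revenue": ["100.5", "n/a", "42"],
})

# Keep only the 'YYYY-MM' prefix of each month string
raw["month"] = raw["month"].astype(str).str.slice(0, 7)

# Coerce to numeric, count what failed to parse, then fill with 0
revenue = pd.to_numeric(raw["product_revenue"], errors="coerce")
n_bad = int(revenue.isna().sum())
raw["product_revenue"] = revenue.fillna(0)

print(f"Unparseable product_revenue values: {n_bad}")
```

A non-zero count is a prompt to inspect the export rather than proof of an error.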


In [None]:
required_cols = {"month", "category_name", "total_orders", "product_revenue"}
missing = required_cols - set(df.columns)
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

# Month normalization
df["month"] = df["month"].astype(str).str.slice(0, 7)  # 'YYYY-MM'

# Enforce numerics
for col in ["total_orders", "product_revenue"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)

# Optional column
if "total_revenue_with_freight" in df.columns:
    df["total_revenue_with_freight"] = pd.to_numeric(df["total_revenue_with_freight"], errors="coerce").fillna(0)

print("Rows:", len(df))
print("Date range:", df["month"].min(), "to", df["month"].max())
print("Categories:", df["category_name"].nunique())
df.describe(include="all")

## 3) Aggregate metrics

- Monthly totals (all categories)
- Category totals (entire period)
- Top categories by revenue
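
One caveat on the monthly order totals: if `total_orders` in the export is a distinct order count per month and category (as in the optional BigQuery query at the end of this notebook), then summing it across categories counts an order once for every category it contains. The sketch below shows the effect on made-up order-item rows; `items`, `per_category`, `summed`, and `distinct` are hypothetical names.

```python
import pandas as pd

# Hypothetical order-item rows: order "o2" has items in two categories
items = pd.DataFrame({
    "month": ["2017-11"] * 4,
    "order_id": ["o1", "o2", "o2", "o3"],
    "category_name": ["toys", "toys", "health_beauty", "health_beauty"],
})

# Distinct orders per (month, category), the same grain as the export
per_category = (
    items.groupby(["month", "category_name"])["order_id"]
         .nunique()
         .rename("total_orders")
         .reset_index()
)

summed = int(per_category["total_orders"].sum())  # sums per-category counts
distinct = int(items["order_id"].nunique())       # true distinct orders

print("Sum of per-category counts:", summed)
print("Distinct orders:", distinct)
```

Whether this matters for your numbers depends on how the export was built, so treat the monthly order totals as an upper bound unless you can confirm the grain.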


In [None]:
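# Named aggregations; include the optional freight column only when it exists,
# so product revenue is never reported under the freight column's name
monthly_aggs = {
    "total_orders": ("total_orders", "sum"),
    "product_revenue": ("product_revenue", "sum"),
}
if "total_revenue_with_freight" in df.columns:
    monthly_aggs["total_revenue_with_freight"] = ("total_revenue_with_freight", "sum")

monthly = (
    df.groupby("month", as_index=False)
      .agg(**monthly_aggs)
      .sort_values("month")
)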

category_totals = (
    df.groupby("category_name", as_index=False)
      .agg(total_orders=("total_orders", "sum"),
           product_revenue=("product_revenue", "sum"))
      .sort_values("product_revenue", ascending=False)
)

top10 = category_totals.head(10)

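from IPython.display import display

# Render each table separately instead of as a tuple
display(monthly.head())
display(top10)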

## 4) Charts

> Note: We intentionally keep the charts simple and readable for the NTU presentation.
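
Since the two trend charts differ only in the column plotted and the labels, one option is a small helper. This is a minimal sketch, not part of the original notebook: `plot_monthly_trend` and the `demo` frame are hypothetical, and the values are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_monthly_trend(data, value_col, title, ylabel):
    """Hypothetical helper: line chart of one metric by month."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(data["month"], data[value_col], marker="o")
    ax.set_title(title)
    ax.set_xlabel("Month")
    ax.set_ylabel(ylabel)
    ax.tick_params(axis="x", rotation=90)  # keep month labels readable
    fig.tight_layout()
    return fig, ax


# Placeholder data with the same columns as the `monthly` table
demo = pd.DataFrame({
    "month": ["2017-10", "2017-11", "2017-12"],
    "product_revenue": [100.0, 180.0, 150.0],
})
fig, ax = plot_monthly_trend(
    demo, "product_revenue", "Monthly Product Revenue", "Product Revenue"
)
```

In the notebook, the same call could be made with `monthly` in place of `demo`, once for `product_revenue` and once for `total_orders`.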


In [None]:
# Monthly revenue trend
plt.figure()
plt.plot(monthly["month"], monthly["product_revenue"])
plt.xticks(rotation=90)
plt.title("Monthly Product Revenue")
plt.xlabel("Month")
plt.ylabel("Product Revenue")
plt.tight_layout()
plt.show()

In [None]:
# Monthly orders trend
plt.figure()
plt.plot(monthly["month"], monthly["total_orders"])
plt.xticks(rotation=90)
plt.title("Monthly Orders")
plt.xlabel("Month")
plt.ylabel("Total Orders")
plt.tight_layout()
plt.show()

In [None]:
# Top 10 categories by revenue
plt.figure()
plt.bar(top10["category_name"], top10["product_revenue"])
plt.xticks(rotation=90)
plt.title("Top 10 Categories by Product Revenue")
plt.xlabel("Category")
plt.ylabel("Product Revenue")
plt.tight_layout()
plt.show()

## 5) NTU-safe insights (auto-generated)

We avoid causal claims and keep conclusions as **observations from the data**.
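
The concentration figures are plain ratios: the revenue of the top N categories divided by total revenue. A minimal worked example on made-up totals is sketched below; `top_n_share` and `toy_revenue` are hypothetical names used only for illustration.

```python
import pandas as pd


def top_n_share(totals: pd.Series, n: int) -> float:
    """Share of the overall sum held by the n largest values (0.0 if the sum is zero)."""
    overall = totals.sum()
    if overall == 0:
        return 0.0
    return float(totals.nlargest(n).sum() / overall)


# Placeholder category totals: the top 2 hold (50 + 30) / 100 of revenue
toy_revenue = pd.Series({"a": 50.0, "b": 30.0, "c": 10.0, "d": 10.0})
print(round(top_n_share(toy_revenue, 2) * 100, 1))
```

A high share describes how revenue is distributed in the data; it does not by itself explain why those categories lead.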


In [None]:
total_rev = category_totals["product_revenue"].sum()
top5_share = category_totals.head(5)["product_revenue"].sum() / total_rev if total_rev else 0
top10_share = category_totals.head(10)["product_revenue"].sum() / total_rev if total_rev else 0

peak_month = monthly.loc[monthly["product_revenue"].idxmax(), "month"]

insights = {
    "date_range": f"{monthly['month'].min()} to {monthly['month'].max()}",
    "peak_month": peak_month,
    "top5_share_pct": round(top5_share * 100, 1),
    "top10_share_pct": round(top10_share * 100, 1),
    "top_categories": category_totals.head(5)["category_name"].tolist(),
}

insights

### Suggested slide bullets (copy/paste)


In [None]:
print("Revenue Trend Slide bullets:")
print(f"- Data range: {insights['date_range']}")
print(f"- Peak revenue month observed: {insights['peak_month']}")
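# Describe the trend from the data instead of asserting growth
first_rev = monthly["product_revenue"].iloc[0]
last_rev = monthly["product_revenue"].iloc[-1]
direction = "higher than" if last_rev > first_rev else "not higher than"
print(f"- Revenue in the last month is {direction} in the first month (either month may be partial).\n")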

print("Category Concentration Slide bullets:")
print(f"- Revenue is concentrated: Top 5 categories ≈ {insights['top5_share_pct']}% of total revenue")
print(f"- Top 10 categories ≈ {insights['top10_share_pct']}% of total revenue")
print(f"- Leading categories: {', '.join(insights['top_categories'])}")


## (Optional) Load directly from BigQuery

Use this section only if you want to query tables directly (e.g., `fct_order_items`).

Prereqs:
- `pip install google-cloud-bigquery db-dtypes`
- Authentication set up (ADC or service account)


In [None]:
# Uncomment to use BigQuery
# from google.cloud import bigquery
# client = bigquery.Client(project="your-gcp-project-id")
#
# query = """
# SELECT
#   FORMAT_DATE('%Y-%m', DATE(order_purchase_timestamp)) AS month,
#   product_category_name AS category_name,
#   COUNT(DISTINCT order_id) AS total_orders,
#   SUM(price) AS product_revenue
# FROM `your-gcp-project-id.olist_analytics.fct_order_items`
# GROUP BY 1,2
# ORDER BY 1,4 DESC
# """
# df_bq = client.query(query).to_dataframe()
# df_bq.head()