## Phase 5: Transformation & Features

### Motivation
In this phase, we create transformed features that make patterns clearer, reduce skew, and support interactive dashboard exploration.

### Questions to answer
1. What derived variables would help answer my questions?
2. How should I bin continuous variables for analysis?
3. What date features are relevant?
4. Do I need numeric representations of categories?
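For question 2, one possible approach is to bin a continuous column with `pd.cut` (fixed edges) or `pd.qcut` (equal-sized quantile groups). The sketch below uses made-up prices, and the cut points and tier labels are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Made-up prices standing in for the Unit_Price_INR column.
prices = pd.Series([250, 900, 1500, 4800, 12000, 52000], name="Unit_Price_INR")

# Fixed-edge bins: the cut points and labels are hypothetical.
# Intervals are right-closed by default: (0, 1000], (1000, 10000], (10000, inf].
price_tier = pd.cut(
    prices,
    bins=[0, 1000, 10000, float("inf")],
    labels=["Low", "Mid", "High"],
)

# Quantile bins: three groups holding roughly the same number of rows.
price_tercile = pd.qcut(prices, q=3, labels=["Bottom", "Middle", "Top"])

print(pd.DataFrame({"price": prices, "tier": price_tier, "tercile": price_tercile}))
```

Fixed edges are easier to explain in a dashboard legend, while quantile bins keep group sizes balanced when the distribution is skewed.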

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("ggplot")

df = pd.read_csv("../data/raw/amazon_sales_2025_INR.csv")
df["Date"] = pd.to_datetime(df["Date"])

df.head()


In [None]:
df["Month"] = df["Date"].dt.month
df["Month_Name"] = df["Date"].dt.strftime("%b")
df["Quarter"] = df["Date"].dt.quarter

df[["Date", "Month", "Month_Name", "Quarter"]].head()


**Interpretation**
1. These features allow us to analyze monthly sales trends (RQ4).
2. Quarter helps group seasonal shopping patterns.
3. Month_Name improves interpretability in visual dashboards.
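One practical detail: `Month_Name` is a plain string column, so groupings and plots sort it alphabetically (Apr, Aug, Dec, ...). A minimal sketch of one way around this, assuming toy data in place of the real file, is to convert it to an ordered categorical:

```python
import pandas as pd

# Toy data standing in for the real sales file (values are made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2025-03-05", "2025-01-12", "2025-02-20", "2025-01-30"]),
    "Total_Sales_INR": [1200.0, 800.0, 450.0, 300.0],
})
toy["Month_Name"] = toy["Date"].dt.strftime("%b")

# Build the chronological label order with the same strftime format,
# so the labels match whatever locale is in use.
month_order = pd.date_range("2025-01-01", periods=12, freq="MS").strftime("%b").tolist()
toy["Month_Name"] = pd.Categorical(toy["Month_Name"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data.
monthly_sales = toy.groupby("Month_Name", observed=True)["Total_Sales_INR"].sum()
print(monthly_sales)
```

With the ordered categorical in place, the grouped result follows calendar order rather than alphabetical order.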

In [None]:
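# 1 if the order was delivered, 0 for every other status
df["Delivered_Flag"] = (df["Delivery_Status"] == "Delivered").astype(int)

# One 0/1 column per delivery status; dtype=int keeps the columns numeric
# (pandas 2.x returns boolean columns by default)
delivery_dummies = pd.get_dummies(df["Delivery_Status"], prefix="Delivery", dtype=int)
df = pd.concat([df, delivery_dummies], axis=1)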

df[["Delivery_Status", "Delivered_Flag"]].head()


**Interpretation**
1. Delivered_Flag separates delivered orders from all other delivery statuses, so a value of 0 does not necessarily mean a failed delivery.
2. Full dummy encoding allows filtering by each delivery type in dashboards.
3. These features support deeper investigation in RQ2.
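Because `Delivered_Flag` is 0/1, its mean within a group is the share of delivered orders in that group. The sketch below illustrates this on toy data; the status values "Returned" and "Pending" are assumptions for illustration and may not match the labels in the real file.

```python
import pandas as pd

# Toy data; the non-"Delivered" status labels are hypothetical.
toy = pd.DataFrame({
    "Product_Category": ["Electronics", "Electronics", "Books", "Books", "Books"],
    "Delivery_Status": ["Delivered", "Returned", "Delivered", "Delivered", "Pending"],
})
toy["Delivered_Flag"] = (toy["Delivery_Status"] == "Delivered").astype(int)

# The mean of a 0/1 flag is the fraction of rows where the flag is 1.
delivery_rate = toy.groupby("Product_Category")["Delivered_Flag"].mean()
print(delivery_rate)

# The dummy column for "Delivered" carries the same information as the flag.
dummies = pd.get_dummies(toy["Delivery_Status"], prefix="Delivery", dtype=int)
print(dummies.columns.tolist())
```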

In [None]:
df["Satisfied"] = (df["Review_Rating"] >= 4).astype(int)

df[["Review_Rating", "Satisfied"]].head()

**Interpretation**
1. Simplifies the rating scale for subgroup analysis.
2. Helps identify conditions that lead to high satisfaction.
3. Useful for filtering in visual dashboards.
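One caveat worth checking: a comparison against `NaN` evaluates to `False`, so any unrated order would be counted as not satisfied. The sketch below, on a made-up ratings series, shows the difference and one possible way to keep missing ratings missing.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing value.
ratings = pd.Series([5, 3, np.nan, 4], name="Review_Rating")

# Plain comparison: NaN >= 4 is False, so the unrated row becomes 0.
satisfied_plain = (ratings >= 4).astype(int)

# Alternative: store the flag as float and blank it where the rating is missing.
satisfied_keep_na = (ratings >= 4).astype(float).where(ratings.notna())

print(satisfied_plain.mean())    # share computed over all four rows
print(satisfied_keep_na.mean())  # share computed over rated rows only
```

If the rating column has no missing values, both versions give identical results.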

In [None]:
df["Log_Total_Sales"] = np.log1p(df["Total_Sales_INR"])

fig, ax = plt.subplots(1, 2, figsize=(10,4))
sns.histplot(df["Total_Sales_INR"], bins=20, ax=ax[0])
ax[0].set_title("Original Total Sales")

sns.histplot(df["Log_Total_Sales"], bins=20, ax=ax[1])
ax[1].set_title("Log-Transformed Sales")

plt.tight_layout()
plt.show()


**Interpretation**
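1. The log-transformed histogram should show visibly less right skew than the original; comparing `df["Total_Sales_INR"].skew()` with `df["Log_Total_Sales"].skew()` quantifies the change.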
2. Patterns among medium and high-value purchases become clearer.
3. The compressed scale is useful for later modeling and for dashboard trend plots, where a few very large orders would otherwise dominate the axis.
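The effect of the transform can be measured rather than judged by eye. The sketch below uses a made-up, right-skewed series (not the real sales column) to compare sample skewness before and after `np.log1p`, and shows that `np.expm1` recovers the original values for reporting.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for Total_Sales_INR.
sales = pd.Series([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0],
                  name="Total_Sales_INR")
log_sales = np.log1p(sales)  # log(1 + x), defined at x = 0

skew_before = sales.skew()
skew_after = log_sales.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 is the inverse of log1p, so values can be mapped back to rupees.
recovered = np.expm1(log_sales)
```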

In [None]:
category_avg = df.groupby("Product_Category")["Unit_Price_INR"].mean()
df["Category_Avg_Price"] = df["Product_Category"].map(category_avg)

df[["Product_Category", "Category_Avg_Price"]].head()


**Interpretation**
1. Electronics likely has a higher average unit price, which may partly explain its higher revenue.
2. Price levels may affect return patterns or satisfaction rates.
3. Supports deeper category-based analysis (RQ1, RQ2).
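An equivalent way to attach a group average to each row is `groupby(...).transform("mean")`, which avoids the separate `map` step. The sketch below uses toy prices; the `Price_vs_Category_Avg` column is a hypothetical extra feature, not one defined in this notebook.

```python
import pandas as pd

# Toy data with made-up prices.
toy = pd.DataFrame({
    "Product_Category": ["Electronics", "Electronics", "Books", "Books"],
    "Unit_Price_INR": [30000.0, 50000.0, 400.0, 600.0],
})

# transform("mean") returns one value per row, aligned with the original index.
toy["Category_Avg_Price"] = (
    toy.groupby("Product_Category")["Unit_Price_INR"].transform("mean")
)

# Hypothetical follow-up feature: price relative to the category average.
toy["Price_vs_Category_Avg"] = toy["Unit_Price_INR"] / toy["Category_Avg_Price"]
print(toy)
```

A ratio like this makes items comparable across categories with very different price levels.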

In [None]:
# One 0/1 column per payment method; dtype=int keeps the columns numeric
payment_dummies = pd.get_dummies(df["Payment_Method"], prefix="Pay", dtype=int)
df = pd.concat([df, payment_dummies], axis=1)

df.head()


**Interpretation**
1. Enables comparing return rates or spending levels by payment type.
2. Supports visualization grouping in the dashboard.
3. Helps answer RQ3.
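A comparison of spending by payment type could look like the sketch below. The rows are made up for illustration; only the column names and the method labels (UPI, COD, Credit Card) come from this notebook.

```python
import pandas as pd

# Toy data with made-up order values.
toy = pd.DataFrame({
    "Payment_Method": ["UPI", "COD", "Credit Card", "COD", "UPI"],
    "Total_Sales_INR": [2000.0, 500.0, 3500.0, 700.0, 1000.0],
})

# Count, mean, and median order value per payment method.
summary = (
    toy.groupby("Payment_Method")["Total_Sales_INR"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

Reporting the median alongside the mean helps here because order values are right-skewed, so a few large orders can pull the mean upward.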

In [None]:
df.head()

## Conclusion
- Added multiple new variables that directly support RQ1–RQ4.
- Reduced skew for Total Sales.
- Created interpretable binary flags (Satisfied, Delivered_Flag).
- Improved temporal analysis (Month, Quarter).
- Enhanced dashboard filter capability via dummies.

## Hypothesis Generation
1. Customers using Credit Card / UPI spend more than COD customers.
2. Delivered_Flag is a stronger predictor of satisfaction than Unit_Price_INR.
3. Categories with higher Category_Avg_Price produce more stable log-sales patterns.
4. Quarter may predict spikes in demand (e.g., Q4 = festival season).
5. Satisfied customers are more common in states with higher revenue tiers.

## Iteration Signals
1. Skew reduction via Log_Total_Sales suggests new plots will be easier to interpret.
2. Dummy variables enable multi-filter dashboards in Phase 6.
3. Satisfaction flag simplifies modeling and subgroup analysis.
4. Month and Quarter features support deeper time-series slicing.
5. No missing values were found, so there is no need to revisit Phase 2 cleaning.
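The missing-value signal can be confirmed with a quick per-column count. The sketch below uses a toy frame with one deliberately missing rating to show what a non-clean result would look like; on the real data the same two lines would be run against `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "Review_Rating": [5.0, np.nan, 4.0],
    "Total_Sales_INR": [1200.0, 800.0, 450.0],
})

# Count of missing entries in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that would need attention; empty means nothing to clean.
needs_cleaning = missing_per_column[missing_per_column > 0]
print(needs_cleaning)
```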