<a href="https://colab.research.google.com/github/AaryanPriyadarshi/Integrated-Retail-Analytics-for-Store-Optimization/blob/main/Integrated_Retail_Analytics_for_Store_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Integrated Retail Analytics for Store Optimization Project



##### **Project Type**    - EDA / Regression / Forecasting

##### **Contribution**    - Aaryan Priyadarshi

# **Project Summary -**

This project focuses on forecasting weekly retail sales by integrating data from sales transactions, store information, and external economic indicators. The goal is to identify sales patterns, test the impact of holidays, and build machine learning models that can generate accurate forecasts. Using techniques such as exploratory data analysis (EDA), hypothesis testing, feature engineering, and machine learning models (Random Forest and XGBoost), the project highlights how data-driven insights can optimize store operations. The results show that holidays, promotions, store size, and economic factors significantly drive sales, with XGBoost outperforming Random Forest in prediction accuracy. This project demonstrates the practical use of analytics in helping retailers plan promotions, manage inventory, and improve decision-making.

# **GitHub Link -**

https://github.com/AaryanPriyadarshi/Integrated-Retail-Analytics-for-Store-Optimization

# **Problem Statement**


Retailers often struggle to forecast sales accurately due to the influence of multiple factors such as holidays, promotions, store type, and external economic conditions. Poor forecasting can lead to overstocking, understocking, and lost revenue opportunities. The problem this project addresses is:

“How can we integrate sales, store, and external feature data to accurately forecast weekly retail sales and identify the key factors driving sales performance?”

The solution aims to build machine learning models that not only predict sales but also provide insights into which factors have the most significant impact on sales fluctuations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that for each algorithm, the following questions are answered.

*   Explain the ML model used and its performance using an evaluation metric score chart.
*   Cross-validation & hyperparameter tuning
*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.
*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

### 1. Import Libraries

Imported the necessary libraries for data handling, visualization, preprocessing, machine learning, and hypothesis testing.


In [None]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor

# Hypothesis Testing
from scipy.stats import ttest_ind

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Plot Styling
plt.style.use("seaborn-v0_8")


### 2. Dataset Loading

Loaded three datasets:
- **Sales dataset** (weekly sales per store and department)
- **Stores dataset** (store type and size)
- **Features dataset** (economic indicators, holidays, promotions)

Then merged them into one dataset for analysis.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Load the three CSV files from Google Drive
sales = pd.read_csv("/content/drive/MyDrive/Retail Analysis Project/sales data-set.csv")
stores = pd.read_csv("/content/drive/MyDrive/Retail Analysis Project/stores data-set.csv")
features = pd.read_csv("/content/drive/MyDrive/Retail Analysis Project/Features data set.csv")

# Merge datasets
df = sales.merge(stores, on="Store", how="left")
df = df.merge(features, on=["Store", "Date"], how="left")

# Preview
print("Shape:", df.shape)
df.head()
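
A left merge silently duplicates rows when the right-hand key is not unique, and leaves NaNs where no match exists. The sketch below shows one way the two merges above could be checked, using made-up toy frames. The `validate` and `indicator` arguments are standard `DataFrame.merge` options; the toy values and the choice to drop the second `IsHoliday` column before merging are assumptions, not what the notebook does.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (values are invented for illustration)
sales_toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["05/02/2010", "05/02/2010", "05/02/2010"],
    "Weekly_Sales": [100.0, 50.0, 80.0],
    "IsHoliday": [False, False, False],
})
stores_toy = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 500]})
features_toy = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["05/02/2010", "05/02/2010"],
    "Temperature": [42.3, 40.2],
    "IsHoliday": [False, False],
})

# validate="many_to_one" raises MergeError if a key is duplicated on the right side
merged = sales_toy.merge(stores_toy, on="Store", how="left", validate="many_to_one")

# Dropping the duplicate flag first avoids the IsHoliday_x / IsHoliday_y suffixes;
# indicator=True adds a `_merge` column that records whether each row found a match
merged = merged.merge(
    features_toy.drop(columns="IsHoliday"),
    on=["Store", "Date"], how="left", validate="many_to_one", indicator=True,
)

unmatched = int((merged["_merge"] == "left_only").sum())
print("Rows:", len(merged), "| unmatched rows:", unmatched)
```

The later cells refer to `IsHoliday_x`, so adopting the column drop would also mean renaming that column wherever it is used.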


### 3. Dataset First View

- Preview dataset with `.head()`
- Check basic info with `.info()` and `.describe()`
- Count missing values


In [None]:
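# df.info() prints directly; wrap the other summaries in display() so
# every result is shown, not just the last expression in the cell
df.info()
display(df.describe().T)
display(df.isnull().sum().sort_values(ascending=False).head(10))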
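
Raw missing counts are easier to act on when shown next to the share of rows they represent. The helper below is a small sketch of such a summary. `missing_summary` and the toy column `Promo` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with invented values; `Promo` is a hypothetical column name
toy = pd.DataFrame({
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
    "Promo": [np.nan, np.nan, 5.0, np.nan],
    "CPI": [211.1, np.nan, 211.3, 211.4],
})
report = missing_summary(toy)
print(report)
```

On the merged frame this would be called as `missing_summary(df)`.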


### 4. Feature Engineering

- Convert Date into datetime
- Extract **Year, Month, Week, Quarter**
- Prepare for seasonal analysis


In [None]:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Week'] = df['Date'].dt.isocalendar().week.astype(int)  # isocalendar() returns nullable UInt32; cast to plain int
df['Quarter'] = df['Date'].dt.quarter
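
The explicit `format` matters because a string like `05/02/2010` is ambiguous: with `%d/%m/%Y` it is read as 5 February, not 2 May. A minimal sketch of the extraction on two example dates (chosen here for illustration):

```python
import pandas as pd

# Two example dates in the day/month/year layout the notebook parses
dates = pd.to_datetime(pd.Series(["05/02/2010", "24/12/2010"]), format="%d/%m/%Y")

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # isocalendar() returns nullable UInt32 columns; cast to a plain int
    "Week": dates.dt.isocalendar().week.astype(int),
    "Quarter": dates.dt.quarter,
})
print(parts)
```

The first date lands in month 2, ISO week 5, quarter 1; the second in month 12, ISO week 51, quarter 4.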


### 5. Data Visualization

Used 5 different plots to understand sales patterns:
1. Weekly sales trend over time
2. Holiday vs non-holiday sales
3. Sales by store type
4. Correlation heatmap
5. Monthly distribution of sales


In [None]:
# 1. Weekly Sales Trend Over Time
plt.figure(figsize=(12,6))
df.groupby('Date')['Weekly_Sales'].sum().plot()
plt.title("Weekly Sales Trend Over Time")
plt.ylabel("Total Weekly Sales")
plt.xlabel("Date")
plt.show()

# 2. Holiday vs Non-Holiday Sales
plt.figure(figsize=(6,4))
sns.barplot(x="IsHoliday_x", y="Weekly_Sales", data=df)
plt.title("Average Weekly Sales: Holiday vs Non-Holiday")
plt.xlabel("Holiday Week?")
plt.ylabel("Average Weekly Sales")
plt.show()

# 3. Sales by Store Type
plt.figure(figsize=(6,4))
sns.barplot(x="Type", y="Weekly_Sales", data=df)
plt.title("Average Weekly Sales by Store Type")
plt.xlabel("Store Type")
plt.ylabel("Average Weekly Sales")
plt.show()

# 4. Correlation Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df[['Weekly_Sales','Temperature','Fuel_Price','CPI','Unemployment']].corr(),
            annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

# 5. Monthly Distribution of Weekly Sales
plt.figure(figsize=(10,6))
sns.boxplot(x="Month", y="Weekly_Sales", data=df)
plt.title("Monthly Distribution of Weekly Sales")
plt.xlabel("Month")
plt.ylabel("Weekly Sales")
plt.show()


### 6. Hypothesis Testing

**Hypothesis** (two-sample Welch t-test, significance level 0.05):
- H₀: Mean weekly sales are the same in holiday and non-holiday weeks
- H₁: Mean weekly sales differ between holiday and non-holiday weeks


In [None]:
holiday_sales = df[df['IsHoliday_x'] == True]['Weekly_Sales']
nonholiday_sales = df[df['IsHoliday_x'] == False]['Weekly_Sales']

t_stat, p_val = ttest_ind(holiday_sales, nonholiday_sales, equal_var=False)
print("T-statistic:", t_stat)
print("P-value:", p_val)

if p_val < 0.05:
    print("Reject Null Hypothesis: Holiday weeks significantly affect sales.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference.")
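
With hundreds of thousands of rows, even a small difference in means can produce a tiny p-value, so an effect size is a useful companion to the test. The sketch below runs the same Welch test on synthetic samples and adds Cohen's d. The sample sizes, means, and the variance-averaging form of d are assumptions chosen for illustration, not results from the retail data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic samples (not the retail data): a smaller "holiday" group with a higher mean
holiday = rng.normal(loc=120.0, scale=30.0, size=200)
regular = rng.normal(loc=100.0, scale=20.0, size=2000)

# equal_var=False selects Welch's test, which does not assume equal variances
t_stat, p_val = ttest_ind(holiday, regular, equal_var=False)

# Cohen's d using the square root of the average of the two sample variances
avg_sd = np.sqrt((holiday.var(ddof=1) + regular.var(ddof=1)) / 2)
cohens_d = (holiday.mean() - regular.mean()) / avg_sd

print(f"t = {t_stat:.2f}, p = {p_val:.3g}, d = {cohens_d:.2f}")
```

A positive t-statistic here means the first sample (holiday) has the higher mean; the sign is what distinguishes "differs" from "increases".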


### 7. Feature Engineering & Preprocessing

- Encode categorical variables
- Drop irrelevant columns
- Define Weekly_Sales as target
- Train/test split
- Scale features


In [None]:
# Encode categorical features
le = LabelEncoder()
df['Type'] = le.fit_transform(df['Type'])

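# Drop the raw date, the target, and the duplicate holiday flag created by the merge
X = df.drop(['Date','Weekly_Sales','IsHoliday_y'], axis=1)
# Assumption: remaining NaNs (e.g. promotion columns with nothing recorded) mean "none";
# fill with 0 so estimators that reject NaN can still train
X = X.fillna(0)
y = df['Weekly_Sales']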

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
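
A random split mixes past and future weeks, which can make a forecasting model look better than it would be on genuinely unseen dates. As an alternative to the random split used above, a chronological hold-out could look like the sketch below. The toy frame and the 80% cutoff are assumptions.

```python
import pandas as pd

# Toy weekly data (invented values) standing in for the merged frame
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Weekly_Sales": range(10),
})

# Hold out the most recent 20% of weeks instead of sampling rows at random
unique_dates = toy["Date"].sort_values().unique()
cutoff = unique_dates[int(len(unique_dates) * 0.8)]

train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]
print(len(train), "training rows;", len(test), "test rows")
```

Every training date then precedes every test date, which mirrors how the model would be used in practice.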


### 8. Machine Learning Models

Used two models:
- Random Forest Regressor
- XGBoost Regressor


In [None]:
# Random Forest
rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)

# XGBoost
xgb_model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train_scaled, y_train)
y_pred_xgb = xgb_model.predict(X_test_scaled)
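
The models above are fitted once with fixed settings. A sketch of how cross-validation could be added for time-ordered data is shown below, using scikit-learn's `TimeSeriesSplit` on synthetic data. The data, fold count, and tree count are assumptions, and this is not a step the notebook performs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
# Synthetic, time-ordered data (not the retail data): target depends mostly on feature 0
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Each fold trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=3)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value
scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print("RMSE per fold:", np.round(rmse_per_fold, 3))
```

The same `cv` object can be passed to `GridSearchCV` or `RandomizedSearchCV` if hyperparameters are tuned.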


### 9. Model Evaluation

We evaluate models using RMSE, MAE, and R², then compare them in a table and bar chart.


In [None]:
# Collect metrics
results = {
    "Model": ["Random Forest", "XGBoost"],
    "RMSE": [
        np.sqrt(mean_squared_error(y_test, y_pred_rf)),
        np.sqrt(mean_squared_error(y_test, y_pred_xgb))
    ],
    "MAE": [
        mean_absolute_error(y_test, y_pred_rf),
        mean_absolute_error(y_test, y_pred_xgb)
    ],
    "R²": [
        r2_score(y_test, y_pred_rf),
        r2_score(y_test, y_pred_xgb)
    ]
}

metrics_df = pd.DataFrame(results)
display(metrics_df)

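# Bar charts: one panel per metric, because RMSE and MAE are in sales units
# while R² is at most 1 and would be invisible on a shared axis
metrics_df.set_index("Model")[["RMSE","MAE","R²"]].plot(
    kind="bar", subplots=True, layout=(1,3), figsize=(12,4),
    colormap="Set2", rot=0, legend=False
)
plt.suptitle("Model Performance Comparison")
plt.tight_layout()
plt.show()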
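
To make the three metrics concrete, the following worked example computes them by hand with NumPy on four made-up values. It is only meant to show what each number measures.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_hat - y_true                         # [-1, 0, 1, 2]
mae = np.mean(np.abs(errors))                   # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))            # penalizes large misses more
ss_res = np.sum(errors ** 2)                    # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, R² = {r2:.3f}")
# MAE = 1.000, RMSE = 1.225, R² = 0.700
```

MAE and RMSE are in the same units as weekly sales, so they translate directly into expected forecast error per store-department week; R² is unitless and describes the share of variation the model explains.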


### 10. Feature Importance

We check which features drive sales predictions.


In [None]:
# Random Forest
importances_rf = rf_model.feature_importances_
feature_names = X.columns  # separate name so the `features` DataFrame is not overwritten

plt.figure(figsize=(10,5))
sns.barplot(x=importances_rf, y=feature_names, palette="Blues_r")
plt.title("Random Forest - Feature Importance")
plt.show()

# XGBoost
importances_xgb = xgb_model.feature_importances_

plt.figure(figsize=(10,5))
sns.barplot(x=importances_xgb, y=feature_names, palette="Greens_r")
plt.title("XGBoost - Feature Importance")
plt.show()
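
Impurity-based importances like `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and scikit-learn's `permutation_importance`; it is an alternative technique, not the one the notebook uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data (not the retail data): only feature 0 carries real signal
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in R²
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```

If both methods rank the same features at the top, the conclusion about key drivers is on firmer ground.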


### 11. Conclusion & Reflection

- Sales show strong seasonality, with peaks during holidays.
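- Hypothesis testing (two-sample Welch t-test) found a significant difference between holiday and non-holiday weeks, with higher sales in holiday weeks.
- XGBoost outperformed Random Forest across all metrics (lower RMSE and MAE, higher R²).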
- Key drivers of sales include holidays, store size, promotions, and economic indicators.

**Reflection:**  
This project taught me how to merge multi-source datasets, engineer time-based features, validate business assumptions with hypothesis testing, and apply machine learning for forecasting. The insights gained can help retailers optimize promotions, inventory, and store operations.
