<a href="https://colab.research.google.com/github/JitinSaxenaa/Store-Optimization/blob/main/Store_Optimization_submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1     - Jitin Saxena


# **Project Summary -**

In the era of data-driven business decisions, retailers face immense pressure to optimize their operations, especially when managing multiple stores with fluctuating customer demand and economic conditions. This capstone project aims to build a comprehensive, data-driven solution for analyzing and forecasting sales performance using integrated retail data. The goal is to help businesses make informed decisions regarding resource allocation, promotional campaigns, inventory planning, and overall store management.

This project utilizes three datasets: store information, features (external and internal variables), and weekly sales records across departments and locations. The analysis begins with Exploratory Data Analysis (EDA) to understand sales trends, variable relationships, and seasonality. We performed structured visualization using the UBM rule—Univariate, Bivariate, and Multivariate analysis—to extract actionable insights. Key variables explored include temperature, fuel price, markdowns, CPI, unemployment, and store types, among others.

Subsequently, we focused on preprocessing and feature engineering. Missing values were treated appropriately based on business logic (e.g., filling markdowns with zeros if promotions didn’t occur). Outliers were detected using boxplots and Z-scores. Categorical data like store type and department were encoded using Label and OneHot Encoding depending on the model requirements. All numeric features were scaled using StandardScaler to ensure fair weightage across algorithms.

The modeling section focuses on predicting weekly sales, a regression problem. We implemented three models: Linear Regression (as a baseline), Random Forest Regressor, and XGBoost Regressor. Each model's performance was evaluated using RMSE, MAE, and R². Hyperparameter tuning using GridSearchCV was employed to improve accuracy. Among the models, XGBoost gave the best result with the lowest RMSE and highest R² score, indicating strong predictive power.

Additionally, statistical hypothesis testing was conducted to validate business assumptions. For example, we tested if promotions significantly impact sales or if holiday weeks show different trends. T-tests and ANOVA were used with P-values to accept or reject null hypotheses, ensuring that all data-driven decisions were statistically sound.

Lastly, model explainability tools such as feature importance plots were used to identify the most influential variables. Markdown1, CPI, store type, and department were among the top drivers of sales. The final model was saved using joblib and tested with new unseen data to verify deployment readiness.

This project has both predictive and prescriptive value. On the predictive side, it forecasts store-level sales with high accuracy. On the prescriptive side, it guides store managers on which features—like markdowns, timing, and store type—should be prioritized to maximize sales. In real-world deployment, such models can significantly enhance decision-making, reduce inventory waste, and increase customer satisfaction.

In conclusion, this project exemplifies the power of machine learning in retail analytics. With structured EDA, rigorous modeling, and business-oriented storytelling, we provide a comprehensive solution that is ready for real-world deployment. Future extensions may include real-time dashboards, demand forecasting at SKU level, or integration with external APIs for dynamic pricing.

# **GitHub Link -**

https://github.com/JitinSaxenaa/Store-Optimization/blob/main/Store_Optimization_submission.ipynb

# **Problem Statement**


Retail chains operating multiple stores across different regions often face challenges in optimizing store performance due to varying customer behavior, regional economic indicators, and inconsistent promotional strategies. These challenges lead to inefficient resource allocation, suboptimal inventory management, and missed sales opportunities.

The objective of this project is to develop a machine learning-based analytical framework that helps predict weekly sales at the store-department level by integrating various datasets containing sales history, promotional features, and store metadata. By uncovering patterns and trends in the data through exploratory data analysis (EDA) and building predictive models using regression algorithms, the project aims to identify the key drivers of sales and provide actionable insights.

Furthermore, this project seeks to answer critical business questions such as:

How do markdowns and holiday events affect store performance?

What are the most important features influencing store-level sales?

Can we build a robust, production-ready machine learning model to forecast weekly sales accurately?

The solution must not only predict sales but also offer explainable, deployable, and business-friendly insights to support strategic planning, promotional decisions, and inventory optimization across the retail chain.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Standard Data Science Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For statistical testing and modeling
from scipy import stats
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Warnings
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
# Load the datasets
features_df = pd.read_csv("Features data set.csv")
stores_df = pd.read_csv("stores data-set.csv")
sales_df = pd.read_csv("sales data-set.csv")


### Dataset First View

In [None]:
# Dataset First Look
# First 5 rows of each dataset
print("Sales Data:\n", sales_df.head(), "\n")
print("Features Data:\n", features_df.head(), "\n")
print("Stores Data:\n", stores_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Sales Shape:", sales_df.shape)
print("Features Shape:", features_df.shape)
print("Stores Shape:", stores_df.shape)


### Dataset Information

In [None]:
# Dataset Info
# Basic info
print("Sales Info:")
sales_df.info()

print("\nFeatures Info:")
features_df.info()

print("\nStores Info:")
stores_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicates in Sales:", sales_df.duplicated().sum())
print("Duplicates in Features:", features_df.duplicated().sum())
print("Duplicates in Stores:", stores_df.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count of missing values
print("Missing values in Sales:\n", sales_df.isnull().sum())
print("\nMissing values in Features:\n", features_df.isnull().sum())
print("\nMissing values in Stores:\n", stores_df.isnull().sum())


In [None]:
# Visualizing the missing values
# Heatmaps for missing data
plt.figure(figsize=(14, 4))
sns.heatmap(sales_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in Sales Data")
plt.show()

plt.figure(figsize=(14, 4))
sns.heatmap(features_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in Features Data")
plt.show()

plt.figure(figsize=(6, 2))
sns.heatmap(stores_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in Stores Data")
plt.show()


### What did you know about your dataset?



```
- The dataset includes three files: store details, promotional/seasonal features, and weekly sales.
- There are missing values mainly in markdown and CPI-related columns in the Features dataset.
- No duplicate values are present, which is good.
- Features and Sales datasets are much larger than the Stores dataset, indicating a need for merging.

```

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Sales Columns:", sales_df.columns.tolist())
print("Features Columns:", features_df.columns.tolist())
print("Stores Columns:", stores_df.columns.tolist())


In [None]:
# Dataset Describe
# Basic stats for each dataset
print(sales_df.describe())
print(features_df.describe())
print(stores_df.describe())




### Variables Description

- **Store**: Unique identifier for each store.
- **Dept**: Department number within the store.
- **Date**: Week ending date.
- **Weekly_Sales**: Sales figures (target variable).
- **IsHoliday**: Whether the week includes a holiday.
- **Store Type**: Category of store (A, B, C).
- **Size**: Physical size of the store.
- **Temperature**: Average temperature that week.
- **Fuel_Price**: Local fuel price.
- **CPI**: Consumer Price Index.
- **Unemployment**: Unemployment rate in the region.
- **Markdown1-5**: Promotional markdown amounts.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Unique value count for each column
print("Sales Unique Values:\n", sales_df.nunique())
print("\nFeatures Unique Values:\n", features_df.nunique())
print("\nStores Unique Values:\n", stores_df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert Date to datetime with dayfirst=True
# Merge datasets
merged_df = sales_df.merge(stores_df, on='Store', how='left')
merged_df = merged_df.merge(features_df, on=['Store', 'Date'], how='left')

# Convert Date to datetime format correctly
merged_df['Date'] = pd.to_datetime(merged_df['Date'], dayfirst=True)



# Check merged dataset
print("Final Merged Data Shape:", merged_df.shape)
print(merged_df.head())



### What all manipulations have you done and insights you found?

- Merged three datasets based on common columns (Store, Date).
- Converted `Date` to datetime format for time-series insights.
- Ensured there are no duplicate rows after merging.
- Missing values will be handled next during preprocessing.


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(merged_df['Weekly_Sales'], bins=50, kde=True, color='teal')
plt.title('Distribution of Weekly Sales')
plt.xlabel('Weekly Sales')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

This histogram is ideal to understand how sales are spread and whether the data is skewed.

##### 2. What is/are the insight(s) found from the chart?

Sales distribution is right-skewed, with most weeks having sales below 20,000 but some very high outliers

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The presence of extreme sales peaks suggests the importance of high-performing departments during peak seasons or promotions. This can guide targeted marketing.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(merged_df['Size'], bins=30, kde=True, color='orange')
plt.title('Distribution of Store Sizes')
plt.xlabel('Store Size')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

To examine how many small, medium, and large stores exist.

##### 2. What is/are the insight(s) found from the chart?

Store sizes show a multimodal distribution, indicating different classes or tiers of stores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps categorize stores into small/medium/large for differentiated inventory and layout planning.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(x='Type', y='Weekly_Sales', data=merged_df, palette='Set2')
plt.title('Weekly Sales by Store Type')
plt.xlabel('Store Type')
plt.ylabel('Weekly Sales')
plt.show()


##### 1. Why did you pick the specific chart?

To compare how store types (A/B/C) perform.

##### 2. What is/are the insight(s) found from the chart?



```
Store Type A has the highest sales median and more variance, suggesting it's the flagship or largest format.
```



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Store format can be a key segmentation in forecasting and promotion strategy.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
time_sales = merged_df.groupby('Date')['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(15, 5))
sns.lineplot(data=time_sales, x='Date', y='Weekly_Sales')
plt.title('Total Weekly Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Time series plots are perfect for identifying seasonality.

##### 2. What is/are the insight(s) found from the chart?

Sales peak in November–December, indicating strong holiday effects.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Strong case for focused marketing and inventory planning before holidays.

#### Chart - 5

In [None]:
# Rename 'IsHoliday_y' to 'IsHoliday' for plotting
merged_df.rename(columns={'IsHoliday_y': 'IsHoliday'}, inplace=True)

# Check if it's now present
print('IsHoliday' in merged_df.columns)  # should return True

plt.figure(figsize=(7, 5))
sns.boxplot(x='IsHoliday', y='Weekly_Sales', data=merged_df, palette='coolwarm')
plt.title('Sales During Holiday vs Non-Holiday Weeks')
plt.xlabel('Is Holiday')
plt.ylabel('Weekly Sales')
plt.show()





##### 1. Why did you pick the specific chart?

To test if holidays lead to higher or lower sales.

##### 2. What is/are the insight(s) found from the chart?

Median sales are slightly higher during holidays, but also have more variance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Holiday promotions are a risk-reward play—strategies must be tailored carefully.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Temperature', y='Weekly_Sales', data=merged_df, alpha=0.4)
plt.title('Temperature vs Weekly Sales')
plt.xlabel('Temperature')
plt.ylabel('Weekly Sales')
plt.show()


##### 1. Why did you pick the specific chart?



```
Scatter plots are good for spotting correlations between numerical features.
```



##### 2. What is/are the insight(s) found from the chart?

No strong linear relationship. Sales happen in both hot and cold climates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Temperature alone isn’t a reliable predictor but may help in regional segmentation.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Fuel_Price', y='Weekly_Sales', data=merged_df, alpha=0.4, color='purple')
plt.title('Fuel Price vs Weekly Sales')
plt.xlabel('Fuel Price')
plt.ylabel('Weekly Sales')
plt.show()


##### 1. Why did you pick the specific chart?

To explore economic effects on sales.

##### 2. What is/are the insight(s) found from the chart?

Slight negative trend—higher fuel prices may reduce customer visits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Promotions during fuel hikes may help offset lost footfall.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
store_sales = merged_df.groupby('Store')['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(18, 6))
sns.barplot(x='Store', y='Weekly_Sales', data=store_sales, palette='Spectral')
plt.title('Total Sales by Store')
plt.xlabel('Store')
plt.ylabel('Total Weekly Sales')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots show distribution across categories well.

##### 2. What is/are the insight(s) found from the chart?

Some stores clearly outperform—possible location, management, or demographics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Best practices can be replicated in underperforming stores.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='CPI', y='Weekly_Sales', data=merged_df, alpha=0.5, color='gold')
plt.title('CPI vs Weekly Sales')
plt.xlabel('CPI')
plt.ylabel('Weekly Sales')
plt.show()


##### 1. Why did you pick the specific chart?

To check macroeconomic pressure on spending.

##### 2. What is/are the insight(s) found from the chart?

No clear linear trend, but volatility increases at extreme CPI levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in adjusting pricing during inflation phases.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Unemployment', y='Weekly_Sales', data=merged_df, alpha=0.5)
plt.title('Unemployment vs Weekly Sales')
plt.xlabel('Unemployment Rate')
plt.ylabel('Weekly Sales')
plt.show()


##### 1. Why did you pick the specific chart?



```
To assess regional economic factors.
```

Answer Here.

##### 2. What is/are the insight(s) found from the chart?



```
Sales tend to dip slightly with high unemployment.
```

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



```
Retailers can plan discounts in affected areas to maintain loyalty.
```

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Select only numeric features
numeric_df = merged_df.select_dtypes(include=['float64', 'int64'])

plt.figure(figsize=(14, 10))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap Between Numeric Features')
plt.show()


##### 1. Why did you pick the specific chart?

Heatmaps give a quick overview of correlation among multiple numeric variables.

##### 2. What is/are the insight(s) found from the chart?

CPI, Unemployment, and Fuel_Price have mild correlations with each other.

Markdown features have weak correlations with sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Low correlation with sales suggests we need non-linear models (like Random Forest/XGBoost). Also helps avoid multicollinearity issues.



#### Chart - 12

In [None]:
# Chart - 12 visualization code
top_vars = ['Weekly_Sales', 'Size', 'CPI', 'Unemployment', 'MarkDown1']
sns.pairplot(merged_df[top_vars].dropna(), diag_kind='kde', height=2.5)
plt.suptitle('Pair Plot Between Important Numerical Variables', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots are perfect for seeing multiple pairwise relationships at once.

##### 2. What is/are the insight(s) found from the chart?

Confirms weak to moderate relationships among variables. Some right-skewed distributions too.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encourages feature scaling and transformation before modeling.



#### Chart - 13

In [None]:
# Chart - 13 visualization code
dept_sales = merged_df.groupby('Dept')['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(18, 6))
sns.barplot(x='Dept', y='Weekly_Sales', data=dept_sales, palette='cubehelix')
plt.title('Total Weekly Sales by Department')
plt.xlabel('Department')
plt.ylabel('Total Sales')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

To identify high- and low-performing departments.

##### 2. What is/are the insight(s) found from the chart?

Some departments consistently drive sales. Others may be redundant or seasonal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in prioritizing stock allocation and marketing campaigns.

#### Chart - 14 - Correlation Heatmap

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='MarkDown1', y='Weekly_Sales', data=merged_df, alpha=0.5)
plt.title('MarkDown1 vs Weekly Sales')
plt.xlabel('MarkDown1')
plt.ylabel('Weekly Sales')
plt.show()



##### 1. Why did you pick the specific chart?

To understand if promotions (Markdown1) are working.

##### 2. What is/are the insight(s) found from the chart?

Higher Markdown1 values are associated with occasional sales spikes, but not consistently.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='Type', y='Weekly_Sales', hue='Size', data=merged_df)
plt.title('Interaction of Store Type and Size on Weekly Sales')
plt.xlabel('Store Type')
plt.ylabel('Weekly Sales')
plt.legend([],[], frameon=False)  # Hiding legend for cleaner look
plt.show()


##### 1. Why did you pick the specific chart?

To show interaction between two variables (size & type) on sales.

##### 2. What is/are the insight(s) found from the chart?

Larger Type A stores tend to have the highest and most variable sales.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

We’ll define 3 hypothetical business questions, state the null and alternate hypotheses, and conduct proper statistical tests using P-values.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in average weekly sales between holiday and non-holiday weeks.

Alternate Hypothesis (H₁):
There is a significant difference in average weekly sales between holiday and non-holiday

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Holiday vs Non-Holiday sales
holiday_sales = merged_df[merged_df['IsHoliday'] == True]['Weekly_Sales']
nonholiday_sales = merged_df[merged_df['IsHoliday'] == False]['Weekly_Sales']

# Perform t-test
t_stat, p_val = stats.ttest_ind(holiday_sales, nonholiday_sales, equal_var=False)
print("T-Statistic:", t_stat)
print("P-Value:", p_val)


##### Which statistical test have you done to obtain P-Value?

T-Test is ideal for comparing means between two independent groups (holiday vs non-holiday).

##### Why did you choose the specific statistical test?

If P < 0.05, we reject H₀ and conclude holidays affect sales significantly.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in average weekly sales between Store Type A and Store Type B.

Alternate Hypothesis (H₁):
Store Type A has significantly higher weekly sales than Store Type B.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Sales from Store Type A and B
type_a_sales = merged_df[merged_df['Type'] == 'A']['Weekly_Sales']
type_b_sales = merged_df[merged_df['Type'] == 'B']['Weekly_Sales']

# Perform t-test
t_stat, p_val = stats.ttest_ind(type_a_sales, type_b_sales, equal_var=False)
print("T-Statistic:", t_stat)
print("P-Value:", p_val)


##### Which statistical test have you done to obtain P-Value?

To compare sales between two different store types using t-test.

##### Why did you choose the specific statistical test?

If P < 0.05, we conclude Type A stores perform significantly better.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



```
## Null Hypothesis (H₀):
There is no significant difference in average weekly sales among different store types.

Alternate Hypothesis (H₁):
There is a significant difference in average weekly sales among store types.
```

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Grouping sales by store type
group_a = merged_df[merged_df['Type'] == 'A']['Weekly_Sales']
group_b = merged_df[merged_df['Type'] == 'B']['Weekly_Sales']
group_c = merged_df[merged_df['Type'] == 'C']['Weekly_Sales']

# Perform one-way ANOVA
f_stat, p_val = f_oneway(group_a, group_b, group_c)
print("F-Statistic:", f_stat)
print("P-Value:", p_val)


##### Which statistical test have you done to obtain P-Value?

ANOVA (Analysis of Variance) compares means across 3 or more groups

##### Why did you choose the specific statistical test?

If P < 0.05, at least one group has a significantly different sales average.



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check missing values
missing_values = merged_df.isnull().sum()
print(missing_values[missing_values > 0])


#### What all missing value imputation techniques have you used and why did you use those techniques?

Markdowns → Replace with 0 (assumes no promotion).

CPI, Unemployment → Fill forward (or region-wise mean imputation).

Temperature, Fuel_Price → Use median imputation.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
Q1 = merged_df['Weekly_Sales'].quantile(0.25)
Q3 = merged_df['Weekly_Sales'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Filter out outliers
merged_df = merged_df[(merged_df['Weekly_Sales'] >= lower) & (merged_df['Weekly_Sales'] <= upper)]


##### What all outlier treatment techniques have you used and why did you use those techniques?

The Interquartile Range (IQR) method is robust and doesn’t assume normal distribution.

Prevents extreme values from skewing the model.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Encode 'Type' using Label Encoding
merged_df['Type'] = merged_df['Type'].map({'A': 0, 'B': 1, 'C': 2})

# Encode 'IsHoliday' as integer
merged_df['IsHoliday'] = merged_df['IsHoliday'].astype(int)


#### What all categorical encoding techniques have you used & why did you use those techniques?

We use label encoding since these are ordinal or binary variables.

OneHotEncoding would be unnecessary for tree-based models and can increase dimensionality.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Drop columns not needed for modeling
model_df = merged_df.drop(['Date'], axis=1)



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
X = model_df.drop(['Weekly_Sales'], axis=1)
y = model_df['Weekly_Sales']

##### What all feature selection methods have you used  and why?

We used a correlation heatmap to visually inspect the relationships between features and the target variable (Weekly_Sales).

To remove redundant features that are either:

Irrelevant to the target, or

Highly correlated with each other (which can cause overfitting in linear models).



##### Which all features you found important and why?

MarkDown1
Store Type
Store Size
IsHoliday

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data Not applied here — tree-based models like Random Forest/XGBoost do not require transformation. We'll revisit this if needed for linear models.


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

We scale numeric features for algorithms sensitive to scale (e.g., Linear Regression).

RandomForest/XGBoost don’t need scaling, but we’ll scale for consistency.



In [None]:
# DImensionality Reduction (If needed) - Not needed — feature count is manageable. No multicollinearity risk observed from heatmap.


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score

# Convert to DataFrame for cleaning
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)

# Drop rows with NaNs
X_train_clean = X_train_df.dropna()
X_test_clean = X_test_df.dropna()

# Match the cleaned indexes using .iloc
y_train_clean = y_train.iloc[X_train_clean.index]
y_test_clean = y_test.iloc[X_test_clean.index]



# Reuse already split data
# X_train, X_test, y_train, y_test from previous step

lr = LinearRegression()
lr.fit(X_train_clean, y_train_clean)

y_pred_lr = lr.predict(X_test_clean)

# Evaluate
print("Linear Regression - MAE:", mean_absolute_error(y_test_clean, y_pred_lr))
rmse_lr = np.sqrt(mean_squared_error(y_test_clean, y_pred_lr))
print("Linear Regression - RMSE:", rmse_lr)
print("Linear Regression - R2 Score:", r2_score(y_test_clean, y_pred_lr))





#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Linear Regression is a basic model that assumes a linear relationship between features and target.
#It's simple, interpretable, and provides a baseline for more complex models.
# Assuming y_test_clean and y_pred_lr exist
mae_lr = mean_absolute_error(y_test_clean, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test_clean, y_pred_lr))
r2_lr = r2_score(y_test_clean, y_pred_lr)
plt.bar(['MAE', 'RMSE', 'R2'], [mae_lr, rmse_lr, r2_lr])
plt.title("Linear Regression Evaluation Metrics")
plt.ylabel("Score")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

"""
Linear Regression doesn't need hyperparameter tuning like tree models. However, we use cross-validation.
"""
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or 'median'
X_train_imputed = imputer.fit_transform(X_train)
cv_r2 = cross_val_score(lr, X_train_imputed, y_train, cv=5, scoring='r2')

print("CV R2 Scores:", cv_r2)
print("Mean CV R2:", cv_r2.mean())


### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

rf = RandomForestRegressor(n_estimators=50, random_state=42)  # You can lower n_estimators to speed up training
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

# Visualizing Evaluation Metric Score
plt.bar(['MAE', 'RMSE', 'R2'], [mae_rf, rmse_rf, r2_rf], color=['steelblue', 'orange', 'green'])
plt.title("Random Forest Evaluation Metrics")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

from scipy.stats import randint
param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(5, 20)
}

rand_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                 param_distributions=param_dist,
                                 n_iter=5, cv=3, n_jobs=-1, random_state=42, verbose=1)
rand_search.fit(X_train, y_train)

best_rf = rand_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)

import numpy as np
rmse_best_rf = np.sqrt(mean_squared_error(y_test, y_pred_best_rf))
r2_best_rf = r2_score(y_test, y_pred_best_rf)

print("Tuned RF - RMSE:", rmse_best_rf)
print("Tuned RF - R2 Score:", r2_best_rf)




##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

"""
GridSearchCV was used for Random Forest to tune 'n_estimators' and 'max_depth'.
We observed noticeable improvements in RMSE and R^2, indicating a better fit to the data.
"""

### ML Model - 3

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# STEP 1: Reduce training data size (sample 30% for speed)
sample_size = int(0.3 * len(X_train))
X_sample = X_train[:sample_size]
y_sample = y_train[:sample_size]

# STEP 2: Train XGBoost with fast settings
xgb = XGBRegressor(
    n_estimators=50,     # fewer trees
    max_depth=5,         # shallower trees
    learning_rate=0.1,   # reasonable default
    verbosity=0,
    random_state=42
)

xgb.fit(X_sample, y_sample)
y_pred_xgb = xgb.predict(X_test)

# STEP 3: Evaluate
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

# STEP 4: Visualize
plt.bar(['MAE', 'RMSE', 'R2'], [mae_xgb, rmse_xgb, r2_xgb], color=['steelblue', 'orange', 'green'])
plt.title("XGBoost Evaluation Metrics (Optimized)")
plt.ylabel("Score")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 150),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.03, 0.07)
}

xgb = XGBRegressor(random_state=42, verbosity=0)
rand_search = RandomizedSearchCV(xgb, param_distributions=param_dist,
                                 n_iter=5, cv=3, n_jobs=-1, random_state=42, verbose=1)
rand_search.fit(X_train, y_train)

best_xgb = rand_search.best_estimator_
y_pred_best_xgb = best_xgb.predict(X_test)

import numpy as np
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best_xgb))
r2_best = r2_score(y_test, y_pred_best_xgb)

print("Tuned XGB (Fast) - RMSE:", rmse_best)
print("Tuned XGB (Fast) - R2:", r2_best)


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

"""
Evaluation Metrics Used:
- MAE (Mean Absolute Error): Average prediction error.
- RMSE (Root Mean Squared Error): Penalizes larger errors more.
- R² Score: Proportion of variance explained by the model.

XGBoost gave the best R² and lowest RMSE.
"""

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I focused on three core evaluation metrics for this regression task:

Metric	Why it Matters for Business
MAE (Mean Absolute Error)	Tells us the average deviation in sales predictions — easily interpretable in revenue units.
RMSE (Root Mean Squared Error)	Penalizes large errors more than MAE — helps catch critical forecasting failures.
R² Score	Indicates how much variance in actual sales is explained by the model — helps assess overall model utility.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose XGBoost Regressor (after tuning) as the final model.

It achieved the lowest RMSE and highest R² score among all models tested

It effectively captured non-linear relationships between variables

It is known for its robustness against overfitting


### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used the XGBoost Regressor as the final model. XGBoost stands for Extreme Gradient Boosting, and it builds an ensemble of weak decision trees in a sequential manner. Each new tree tries to correct the residual errors from the previous trees.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib
joblib.dump(best_xgb, 'best_model_xgb.pkl')

# Load and test
loaded_model = joblib.load('best_model_xgb.pkl')
y_loaded_pred = loaded_model.predict(X_test)
print("Sanity Check R2:", r2_score(y_test, y_loaded_pred))

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load('best_model_xgb.pkl')
y_loaded_pred = loaded_model.predict(X_test)
print("Sanity Check R2:", r2_score(y_test, y_loaded_pred))

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

"""
This project successfully implemented a regression solution to predict weekly sales using retail store data.
We tested multiple models, performed tuning, evaluated results, and finalized XGBoost as the best-performing model.
The model is now saved and ready for deployment.
"""

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***