<a href="https://colab.research.google.com/github/MonaliM5/rossmann_retail_sales_prediction/blob/main/Rossmann_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Rossmann Retail Sales Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Student Name**    - Monali Vijay Mhaske

# **Project Summary -**

* Rossmann is a large European drugstore chain operating over 3,000 stores across several countries. The company faces a key business challenge — accurately forecasting daily sales for each store. Reliable sales forecasts are crucial for effective inventory planning, workforce scheduling, and promotional strategy.

* This project aims to predict the daily sales of Rossmann stores using historical sales data combined with store-specific information such as promotions, holidays, competition, and assortment types. By developing a predictive model, Rossmann management can make data-driven decisions to optimize operations and improve profitability.

* The project will follow a structured data science life-cycle, which includes:

  1. Data Understanding - Exploring the historical sales and store datasets to identify patterns, data types, and business drivers.


  2. Data Wrangling & Cleaning - Handling missing values, correcting inconsistencies, encoding categorical variables, and preparing data for analysis.


  3. Exploratory Data Analysis (EDA) - Performing univariate, bivariate, and multivariate analysis to uncover relationships between features such as promotions, holidays, and sales trends.


  4. Feature Engineering - Creating new features like competition duration, promo activity flags, and temporal variables (month, week, weekday) to enhance model performance.


  5. Model Development & Evaluation - Building regression-based machine learning models (e.g., Linear Regression, Random Forest, XGBoost) to predict sales. Model performance will be evaluated primarily using Root Mean Squared Logarithmic Error (RMSLE), a suitable metric for skewed sales data.


  6. Model Interpretation & Business Insights - Interpreting the model results to generate actionable insights, such as how promotions or competition impact sales, and providing recommendations to management.



* The expected outcome of this project is a robust, data-driven forecasting model that can accurately estimate future store sales and highlight key factors influencing them. This will help Rossmann:

    * Ensure better inventory and staff management,

    * Improve promotion planning and marketing effectiveness, and

    * Ultimately increase overall profitability.
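
The RMSLE metric named in step 5 can be sketched as follows. This is a minimal illustration rather than the notebook's evaluation code: the helper name `rmsle` and the toy arrays are assumptions, and negative predictions are clipped to zero so that the logarithm stays defined.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions to 0 so log1p is defined (an assumption of this sketch)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values, not taken from the Rossmann data
actual = [0, 4500, 6200, 12000]
predicted = [0, 5000, 6000, 9000]
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is computed on `log1p`-transformed values, it penalizes relative rather than absolute differences, which is what makes it tolerant of a skewed target.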

# **GitHub Link -**

https://github.com/MonaliM5/rossmann_retail_sales_prediction

# **Problem Statement**


* In the retail industry, accurate sales forecasting is critical for effective decision-making. Retailers like Rossmann, one of Europe's largest drugstore chains, must regularly decide how much stock to order, how to schedule employees, and how to plan promotional campaigns - all of which depend heavily on anticipated sales.

* However, predicting store-level daily sales is challenging because it is influenced by multiple dynamic factors such as store location, promotions, holidays, competition, and seasonality. An incorrect forecast can lead to overstocking or understocking, resulting in financial losses and poor customer experience.

* The primary objective of this project is to develop a predictive model capable of accurately estimating daily sales for each Rossmann store using historical sales and store information. The model should capture the impact of various external and internal factors - including promotions, holidays, competition distance, assortment type, and time-related variables - on store performance.

* The predictive insights from this project will enable Rossmann's management to:

  * Optimize inventory and staffing levels,

  * Plan promotional activities more effectively,

  * Improve supply chain efficiency, and

  * Enhance overall business profitability.


* The project will apply systematic data analysis and machine learning techniques to derive actionable insights that directly support Rossmann's strategic and operational decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# System and warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

### Dataset Loading

In [None]:
# Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
Sales_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Capstone Project 2 - Regression Project/Rossmann Stores Data.csv", parse_dates=['Date'])
Store_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Capstone Project 2 - Regression Project/store.csv")

### Dataset First View

In [None]:
# Sales Dataset First Look
print("Sales Data First View ")
Sales_df.head()

In [None]:
# Store Dataset First Look
print("Store Data First View ")
Store_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Sales Dataset -\n Rows Count : {Sales_df.shape[0]}  \tColumns Count : {Sales_df.shape[1]}")
print(f"Store Dataset -\n Rows Count : {Store_df.shape[0]}  \tColumns Count : {Store_df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Information

# Sales dataset info
print("Sales Data Information:")
Sales_df.info()
print("\n" + "="*60 + "\n")

# Store dataset info
print("Store Data Information:")
Store_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Sales dataset duplicates
print(f"Sales Dataset → Duplicate rows: {Sales_df.duplicated().sum()}")

if Sales_df.duplicated().sum() > 0:
    print("\nSample duplicate rows from Sales_df:")
    display(Sales_df[Sales_df.duplicated()].head())

print("\n" + "-"*60 + "\n")

# Store dataset duplicates
print(f"Store Dataset → Duplicate rows: {Store_df.duplicated().sum()}")

if Store_df.duplicated().sum() > 0:
    print("\nSample duplicate rows from Store_df:")
    display(Store_df[Store_df.duplicated()].head())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Sales dataset nulls
print("Sales Dataset - Missing Values:\n")
print(Sales_df.isnull().sum())
print("\n" + "-"*60 + "\n")

# Store dataset nulls
print("Store Dataset - Missing Values:\n")
print(Store_df.isnull().sum())

In [None]:
# Visualizing the missing values

import missingno as msno

# Set up plot style
plt.style.use('seaborn-v0_8-whitegrid')

# Sales dataset missing values visualization
print("Sales Dataset - Missing Values Visualization:\n")
msno.matrix(Sales_df)
plt.title("Sales Dataset - Missing Values Overview")
plt.show()

# Store dataset missing values visualization
print("Store Dataset - Missing Values Visualization:\n")
msno.matrix(Store_df)
plt.title("Store Dataset - Missing Values Overview")
plt.show()

### What did you know about your dataset?

* The Rossmann dataset consists of two main files - Sales data and Store data - that together provide a comprehensive view of the company's retail operations.

1. Sales Data (Sales_df)

    * This dataset contains daily sales records for each Rossmann store.

    * Each record includes store ID, sales amount, number of customers, whether the store was open, ongoing promotions, state and school holidays, and the corresponding date.

    * These variables help capture short-term and seasonal trends, customer behavior, and the influence of holidays or promotions on sales.



2. Store Data (Store_df)

    * This dataset provides static information about each store, such as store type, assortment level, distance to the nearest competitor, the month and year when that competitor opened, and promotional program details (e.g., whether the store runs continuous promotions).

    * These variables explain store-level differences that affect long-term sales patterns.



* By combining these two datasets, we obtain both temporal (daily) and structural (store-level) information. This enables a deeper understanding of how different factors — such as competition, promotions, holidays, and assortment — influence store performance.
Such insights form the foundation for building a reliable predictive model to forecast future sales.
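
Combining the two files is a many-to-one join on `Store`. A minimal sketch of how that assumption could be checked, using toy frames in place of `Sales_df` and `Store_df` (the values and the names `toy_sales` and `toy_store` are illustrative only):

```python
import pandas as pd

# Toy frames standing in for Sales_df and Store_df (values are illustrative)
toy_sales = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Date": pd.to_datetime(["2015-07-30", "2015-07-31", "2015-07-31", "2015-07-31"]),
    "Sales": [5020, 5263, 6064, 8314],
})
toy_store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
})

# validate="many_to_one" raises MergeError if a store appears twice in toy_store;
# indicator=True adds a "_merge" column showing which rows found a match
merged = pd.merge(toy_sales, toy_store, on="Store", how="left",
                  validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows after merge: {len(merged)} (sales rows: {len(toy_sales)})")
print(f"Sales rows without store info: {len(unmatched)}")
```

A left join keeps every sales row, so the row count should equal that of the sales frame; any `left_only` rows point to store IDs missing from the store file.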

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Printing list of columns in Sales Dataset
print("Columns in Sales Dataset:\n")
print(Sales_df.columns.tolist())

print("\n" + "-"*60 + "\n")

#Printing list of columns in Store Dataset
print("Columns in Store Dataset:\n")
print(Store_df.columns.tolist())

In [None]:
# Dataset Describe

# Describing the Sales Dataset
print(" Sales Data Description : \n")
display(Sales_df.describe())

print("\n" + "-"*60 + "\n")

# Describing the Store Dataset
print(" Store Data Description :\n ")
display(Store_df.describe())

### Variables Description

📘 Sales Dataset (Sales_df) :


|Variable|Description|
|---|---|
|Store|Unique identifier for each store.|
|DayOfWeek|  Day of the week (1 = Monday, 7 = Sunday).|
|Date| Date of the record.|
|Sales|	Total sales made on that day — this is the target variable.|
|Customers|	Number of customers who visited the store on that day.|
|Open|	Indicates whether the store was open (1) or closed (0).|
|Promo|	Indicates if a promotion was running on that day (1 = Yes, 0 = No).|
|StateHoliday|	Denotes whether the day was a state/national/public holiday.|
|SchoolHoliday|	Indicates if the store was affected by public-school closures.|



---

🏪 Store Dataset (Store_df)

|Variable	|Description|
|---|---|
|Store|	Unique identifier for each store (key to merge with Sales_df).|
|StoreType|	Type of store (a, b, c, d) — represents different business formats.|
|Assortment|	Level of product assortment (a = basic, b = extra, c = extended).|
|CompetitionDistance|	Distance to the nearest competitor (in meters).|
|CompetitionOpenSinceMonth|	Month when the nearest competitor opened.|
|CompetitionOpenSinceYear|	Year when the nearest competitor opened.|
|Promo2|	Indicates whether the store participates in a continuing promotion (1 = Yes, 0 = No).|
|Promo2SinceWeek|	Week when the store began participating in Promo2.|
|Promo2SinceYear|	Year when the store began participating in Promo2.|
|PromoInterval|	Months when Promo2 is active (e.g., Jan, Apr, Jul, Oct).|



---

💡 Insights

* Sales_df provides time-based transactional information.

* Store_df adds store-level context such as competition and promotions.

* Together, these datasets form a powerful base to analyze sales drivers and build a predictive regression model for accurate forecasting.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Fetching and printing number of unique values in each column of Sales Dataset.
print("Unique values in Sales Dataset : \n")
for col in Sales_df.columns :
  unique_count = Sales_df[col].nunique()
  # Fetched total number of unique values
  print(f"{col} : {unique_count} unique values")
  print(Sales_df[col].unique())
  # Printing all those unique values of each column
  print("\n")

print("\n" + "-"*60 + "\n")

# Fetching and printing number of unique values in each column of Store Dataset.
print("Unique values in Store Dataset : \n")

for col in Store_df.columns :
  unique_count = Store_df[col].nunique()
  # Fetched total number of unique values.

  print(f"{col} : {unique_count} unique values")
  print(Store_df[col].unique())
  # Printing all those unique values of each column

  print("\n")

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.


# -----------------------------
# 🧹 DATA WRANGLING
# -----------------------------


# Making a copy of datasets to preserve raw data
sales_df = Sales_df.copy()
store_df = Store_df.copy()


# -----------------------------
# STEP 1: MERGING BOTH DATASETS
# -----------------------------
# Merging on 'Store' column (common key)
df = pd.merge(sales_df, store_df, on = "Store", how = "left")



# -----------------------------
# STEP 2 : Handle unrealistic values
# ----------------------------
# Handling only clearly unrealistic years (<1970)
df.loc[df['CompetitionOpenSinceYear'] < 1970, 'CompetitionOpenSinceYear'] = np.nan


# -----------------------------
# STEP 3a : Missing-value indicator flags
# -----------------------------

# Creating missing-value indicator flags before imputing, so missingness is preserved
df['CompetitionDistance_NA'] = df['CompetitionDistance'].isna().astype(int)
df['CompetitionOpenSince_NA'] = df[['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']].isna().any(axis=1).astype(int)
df['Promo2Since_NA'] = df[['Promo2SinceWeek', 'Promo2SinceYear']].isna().any(axis=1).astype(int)


# -----------------------------
# STEP 3b : Handle missing values
# -----------------------------

# Imputing missing values by assignment
# (calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas)
df['CompetitionDistance'] = df['CompetitionDistance'].fillna(df['CompetitionDistance'].median())
df['CompetitionOpenSinceMonth'] = df['CompetitionOpenSinceMonth'].fillna(0)
df['CompetitionOpenSinceYear'] = df['CompetitionOpenSinceYear'].fillna(0)
df['Promo2SinceWeek'] = df['Promo2SinceWeek'].fillna(0)
df['Promo2SinceYear'] = df['Promo2SinceYear'].fillna(0)
df['PromoInterval'] = df['PromoInterval'].fillna('None')


# -----------------------------
# STEP 4 : Feature extraction from date
# -----------------------------
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)
df['DayOfWeek'] = df['Date'].dt.dayofweek + 1   # 1=Monday, 7=Sunday


# -----------------------------
# STEP 5a : Competition open duration
# -----------------------------
# Calculating competition open months only for non-missing cases
df['CompetitionOpenMonths'] = np.where(
    df['CompetitionOpenSince_NA'] == 1,
    0,  # Set to 0 if missing
    ((df['Year'] - df['CompetitionOpenSinceYear']) * 12 +
     (df['Month'] - df['CompetitionOpenSinceMonth']))
)

# Replacing negative values (competitor opened after the record date) with 0
df['CompetitionOpenMonths'] = df['CompetitionOpenMonths'].clip(lower=0)


# -----------------------------
# STEP 5b : Fixing data type issues
# -----------------------------
# Normalizing StateHoliday to strings and relabeling '0' as 'None'
df['StateHoliday'] = df['StateHoliday'].astype(str).replace({'0': 'None'})
# Converting low-cardinality store attributes to the category dtype
df['StoreType'] = df['StoreType'].astype('category')
df['Assortment'] = df['Assortment'].astype('category')
df['PromoInterval'] = df['PromoInterval'].astype('category')

# -----------------------------
# STEP 6 : Promo2 active flag
# -----------------------------
month_map = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6,
             'Jul':7, 'Aug':8, 'Sept':9, 'Oct':10, 'Nov':11, 'Dec':12}

# Cast the categorical column to str, then map each comma-separated month abbreviation to its number
df['PromoMonths'] = df['PromoInterval'].astype(str).apply(
    lambda x: [month_map[m] for m in x.split(',')] if x != 'None' else []
)

# Row-wise check (not vectorized): Promo2 is enabled and the record's month is one of the store's promo months
df['IsPromo2Active'] = df.apply(
    lambda row: 1 if (row['Promo2'] == 1 and row['Month'] in row['PromoMonths']) else 0, axis=1
)



# -----------------------------
# STEP 7 : Business logic flags
# -----------------------------
df['IsOpen'] = df['Open'].fillna(1).astype(int)
df.drop(columns = ['Open'], inplace = True)
df['SalesPerCustomer'] = np.where(df['Customers'] > 0, df['Sales'] / df['Customers'], 0)
df['ZeroSalesWhileOpen'] = ((df['Sales'] == 0) & (df['IsOpen'] == 1)).astype(int)



# -----------------------------
# STEP 8 : Handle outliers
# -----------------------------
# Cap competition distance at 99th percentile
cap_value = df['CompetitionDistance'].quantile(0.99)
df.loc[df['CompetitionDistance'] > cap_value, 'CompetitionDistance'] = cap_value




# -----------------------------
# STEP 9 : Final formatting
# -----------------------------
# Drop temporary helper columns
df.drop(columns=['PromoMonths'], inplace=True)

# Reset index and sort
df.sort_values(['Store', 'Date'], inplace=True)
df.reset_index(drop=True, inplace=True)



print("✅ Data Wrangling Completed Successfully!")
print(f"Final Dataset Shape: {df.shape}")
print("\nMissing Values After Wrangling:")
print(df.isnull().sum()[df.isnull().sum() > 0])
display(df.head(3))

### What all manipulations have you done and insights you found?

* During the data wrangling phase, various preprocessing steps were performed to clean, correct, and enhance the dataset so that it becomes ready for analysis and modeling.
 * The manipulations focused on merging, handling missing values, creating new features, and ensuring data consistency.

---

**🔧 Manipulations Performed**



1. Merging Datasets:

    * The two datasets — Sales_df (daily sales data) and Store_df (store information) — were merged using the common key Store.

    * This allowed each sales record to include corresponding store-level attributes such as store type, assortment, and competition details.




2. Handling Missing Values:

    * Missing values in important columns such as CompetitionDistance, CompetitionOpenSinceMonth/Year, and Promo2SinceWeek/Year were handled logically.

    * Median values or zeros were used for imputation where appropriate.

    * Additional missing-value indicator columns (e.g., CompetitionDistance_NA, CompetitionOpenSince_NA) were created to preserve information about where data was missing — since missingness itself can sometimes carry predictive significance.



3. Data Type Corrections:

    * StoreType, Assortment, and PromoInterval were converted to the categorical data type for better memory efficiency and easier encoding later. StateHoliday was normalized to strings, with '0' relabeled as 'None'.



4. Date Feature Extraction:

    * The Date column was transformed to extract new temporal features such as Year, Month, Day, WeekOfYear, and DayOfWeek.

    * These features help capture seasonal and weekly sales patterns.



5. Feature Engineering:

    * CompetitionOpenMonths: Calculated as the number of months since a competing store opened, clipped at zero for invalid negative values.

    * IsPromo2Active: A binary feature created to indicate whether a store's second promotion (Promo2) was active during a given month.

    * SalesPerCustomer: Derived by dividing sales by the number of customers to understand average customer spending.

    * ZeroSalesWhileOpen: Flag created to identify abnormal cases where sales were zero even though the store was open.



6. Outlier Treatment:

    * Extreme values in CompetitionDistance were capped at the 99th percentile to reduce the influence of outliers on model training.



7. Sorting and Indexing:

    * The dataset was sorted by Store and Date to maintain chronological order and reset the index for consistency.
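
The missing-value handling and the competition-duration feature described above can be sketched on a few toy rows. The column names follow the notebook; the values are made up, and the use of `Series.where` plus `clip` is one possible formulation, not necessarily the one the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015],
    "Month": [7, 7, 7],
    "CompetitionDistance": [500.0, np.nan, 2000.0],
    "CompetitionOpenSinceYear": [2014.0, np.nan, 2016.0],
    "CompetitionOpenSinceMonth": [1.0, np.nan, 3.0],
})

# 1. Record missingness before imputing, so the information is not lost
toy["CompetitionDistance_NA"] = toy["CompetitionDistance"].isna().astype(int)
since_cols = ["CompetitionOpenSinceYear", "CompetitionOpenSinceMonth"]
toy["CompetitionOpenSince_NA"] = toy[since_cols].isna().any(axis=1).astype(int)

# 2. Impute by assignment: median for distance, 0 for the "since" columns
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(
    toy["CompetitionDistance"].median()
)
toy[since_cols] = toy[since_cols].fillna(0)

# 3. Months since the competitor opened; 0 where unknown or not yet open
months = ((toy["Year"] - toy["CompetitionOpenSinceYear"]) * 12
          + (toy["Month"] - toy["CompetitionOpenSinceMonth"]))
toy["CompetitionOpenMonths"] = (
    months.where(toy["CompetitionOpenSince_NA"] == 0, 0).clip(lower=0)
)

print(toy[["CompetitionDistance", "CompetitionOpenSince_NA", "CompetitionOpenMonths"]])
```

Creating the flags before imputation matters: once the gaps are filled with zeros, a missing opening date can no longer be told apart from a real one.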


---


💡 Insights Found After Data Wrangling

1. Some stores had missing competition data, indicating that either competition details were not recorded or not applicable — this might correlate with isolated or newer stores.

2. Certain stores had longer competition durations, which could influence their sales stability.

3. The number of active promotions (Promo2) varied across months, suggesting seasonal marketing strategies.

4. Many records showed Sales = 0 when stores were closed, confirming the business rule that closed stores do not generate revenue.

5. The derived feature SalesPerCustomer revealed that some stores have higher customer spending patterns, likely influenced by assortment or store type.
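
The month-to-month variation in insight 3 comes from the `IsPromo2Active` flag. As a hedged sketch, the same flag could be computed without a row-wise `apply`, which may be noticeably faster on a large frame. The toy rows and the `month_abbr` lookup are assumptions for illustration, and like the notebook's version this ignores the Promo2 start week and year.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the notebook, values are illustrative
toy = pd.DataFrame({
    "Month": [1, 2, 9, 4],
    "Promo2": [1, 1, 1, 0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", "Mar,Jun,Sept,Dec", "None"],
})

# Abbreviations as they appear in PromoInterval (note 'Sept', not 'Sep')
month_abbr = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

# One 0/1 column per abbreviation found in PromoInterval,
# reindexed so that column position i corresponds to month i + 1
interval_flags = (
    toy["PromoInterval"].astype(str)
    .str.get_dummies(sep=",")
    .reindex(columns=month_abbr, fill_value=0)
    .to_numpy()
)

# For every row, pick the indicator that belongs to that row's month
in_interval = interval_flags[np.arange(len(toy)), toy["Month"].to_numpy() - 1]

toy["IsPromo2Active"] = ((toy["Promo2"] == 1) & (in_interval == 1)).astype(int)
print(toy)
```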



---


✅ Outcome

* After data wrangling, the dataset became clean, consistent, and analysis-ready.

* All missing values were addressed, data types corrected, new features engineered, and outliers treated — providing a solid foundation for further Exploratory Data Analysis (EDA) and model building.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

#### Chart 1 - Distribution of Daily Sales

In [None]:
# Chart - 1 visualization code

# Features Used - Sales

# ✅ Why this chart is important to include ?
# The entire project aims to predict daily Sales - it’s the dependent variable.
# Before building any model, we must understand how this variable behaves.
# Without this chart, we wouldn’t know if the data is balanced, skewed, or has extreme outliers.

#----------------------------------------------------------------------------------------------------------#

# Setting plot style for cleaner aesthetics
sns.set(style='whitegrid')


# Creating the figure
plt.figure(figsize=(10,6))

# Plotting histogram with KDE curve for smooth density visualization
sns.histplot(
    data=df,
    x='Sales',
    bins=50,                 # Number of bars
    kde=True,                # Add smooth density curve
    color='royalblue',       # Chart color
    alpha=0.7                # Transparency
)

# Adding titles and labels
plt.title('Distribution of Daily Sales', fontsize=16, fontweight='bold', color='black')
plt.xlabel('Sales Amount', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing gridlines and frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)  # Removes the axes frame for a cleaner look

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* I selected a histogram with a KDE (Kernel Density Estimate) because it is the most effective visualization for understanding the distribution and spread of a continuous variable like Sales.

* It helps identify whether sales are normally distributed, skewed, or have multiple peaks, which is crucial before applying any transformations (like log scaling) or choosing suitable modeling techniques.
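
To make the point about log scaling concrete, the sketch below compares skewness before and after a `log1p` transform. It uses synthetic log-normal values as a stand-in for sales, so the numbers say nothing about the Rossmann data itself.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" (log-normal); not the Rossmann data
rng = np.random.default_rng(42)
toy_sales = pd.Series(rng.lognormal(mean=8.5, sigma=0.5, size=5000), name="Sales")

# Sample skewness: roughly 0 for a symmetric distribution, positive for a long right tail
raw_skew = toy_sales.skew()
log_skew = np.log1p(toy_sales).skew()

print(f"Skewness of raw values: {raw_skew:.2f}")
print(f"Skewness after log1p:   {log_skew:.2f}")
```

On the real data the same two lines could be applied to `df['Sales']` to decide whether a log transform of the target is worthwhile.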

##### 2. What is/are the insight(s) found from the chart?

* From the histogram, the sales distribution is highly right-skewed: most daily sales values are clustered toward the lower range, with relatively few store-days reaching very high sales. Because closed days are included in `df`, they also contribute a spike at zero.

* This suggests that the dataset has significant variance across stores and days, possibly influenced by promotions, weekends, and holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from this chart can help create a positive business impact.

* The distribution shows that most stores record low-to-moderate daily sales, while only a few achieve very high sales figures.

* This highlights an opportunity for the company to identify the success factors of those high-performing stores - such as location, promotions, or assortment - and apply similar strategies across other stores to boost overall sales performance.

---

* On the other hand, the right-skewed nature of the data also points to a potential negative growth concern - it suggests that a large number of stores are underperforming compared to the top ones.

* If this imbalance continues, it could indicate inefficient promotional strategies or resource allocation that favor only a few stores.

* By analyzing and addressing the reasons behind low sales clusters (e.g., lack of promotions, competition proximity, or smaller assortments), Rossmann can reduce disparities and achieve more consistent sales growth across stores.


#### Chart 2 - Distribution of Customers

In [None]:
# Chart - 2 visualization code

# Features Used - Customers

# Why this chart is important to include ?
# The number of Customers visiting each store per day determines the overall sales potential.
# Analyzing its distribution helps us:
# Understand the footfall variation across stores and days.
# Detect outliers (extremely busy or empty days).
# Identify whether customer traffic is evenly spread or dominated by a few high-volume stores.
# This chart provides crucial insight into customer behavior trends, which can later help the company optimize
# store operations, marketing, and staffing.

#-----------------------------------------------------------------------------------------------------------------------------------#


# Setting the plot style for professional appearance
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting the box plot to analyze the distribution and outliers
sns.boxplot(
    data=df,
    x='Customers',          # Setting the variable to visualize
    color='mediumseagreen', # Choosing a calm, professional color
    width=0.6,              # Adjusting box width
    fliersize=3             # Adjusting outlier marker size
)

# Adding title and labels
plt.title('Box Plot of Daily Customers', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Number of Customers per Day', fontsize=12)

# Customizing gridlines for cleaner presentation
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Removing unnecessary chart borders
plt.box(False)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* I selected a box plot because it efficiently shows the median, quartiles, and outliers of customer counts.

* It provides a compact summary of the data's spread and reveals the presence of extreme values that might represent very busy or very quiet store days.
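
The points a box plot draws as outliers follow the 1.5 × IQR rule (seaborn's default `whis=1.5`). A minimal sketch of that rule on made-up customer counts:

```python
import pandas as pd

# Made-up daily customer counts with one unusually busy day
toy_customers = pd.Series(
    [300, 400, 500, 600, 700, 800, 900, 1000, 1100, 5000], name="Customers"
)

# Quartiles and interquartile range
q1, q3 = toy_customers.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond these fences are the candidates drawn as individual points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_customers[(toy_customers < lower_fence) | (toy_customers > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers.tolist()}")
```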

##### 2. What is/are the insight(s) found from the chart?

* The box plot reveals that most stores have moderate customer traffic, but there are several high outliers - indicating days or stores with significantly higher footfall.

* The median value of customers lies in the lower range, suggesting that many stores experience fewer visitors on average, while only a few see very large customer numbers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights can lead to positive business impact by helping management identify peak-performing stores or days with unusually high customer counts.

* Studying those cases can uncover best practices in promotions, store layout, or timing that attract more visitors.

* However, the low median combined with many high outliers points to uneven customer traffic, where a small share of stores or days accounts for a disproportionate share of visits.

---

* If ignored, this imbalance could lead to negative growth, as underperforming stores may struggle with low footfall and reduced profitability.

* Hence, understanding this spread helps in rebalancing marketing efforts and improving traffic consistency across stores.

#### Chart 3 - Violin Plot of Competition Distance

In [None]:
# Chart - 3 visualization code

# Features Used - CompetitionDistance

# ✅ Why this chart is important to include
# This visualization helps us understand how CompetitionDistance is distributed among all Rossmann stores.
# It reveals whether most stores face nearby competitors or if some operate in low-competition areas.
# Understanding this spread helps identify how market competition might influence overall sales and customer reach.


#-----------------------------------------------------------------------------------------------------------------------------#


# Plotting the violin plot to visualize the distribution shape and density
sns.violinplot(
    data=df,
    x='CompetitionDistance',   # Setting the variable to visualize
    color='skyblue',           # Choosing a light color for better readability
    inner='quartile',          # Showing quartiles inside the violin
    cut=0                      # Preventing density from extending beyond data range
)

# Adding title and labels
plt.title('Violin Plot of Competition Distance', fontsize=15, fontweight='bold', color='black')
plt.xlabel('Distance to Nearest Competitor (meters)', fontsize=12)

# Customizing gridlines and chart borders
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* I selected a violin plot because it combines the advantages of both a box plot and a density plot.

* It effectively displays how the competition distances are distributed, highlighting whether most stores face nearby competition or are located in isolated markets.

* This visualization helps us clearly see the concentration and variation of competitor distances.

##### 2. What is/are the insight(s) found from the chart?

* The violin plot shows that most stores have competitors within a short to moderate distance, while a smaller number of stores are located far away from competitors.

* The long tail on the right indicates a few stores with very high competition distances, meaning they have almost no nearby competitors - these are likely stores operating in low-competition or rural areas.
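
A long right tail like this is what percentile capping is meant to tame. The sketch below applies a 99th-percentile cap to synthetic, exponentially distributed distances; the data and the name `toy_distance` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed distances in meters; not the Rossmann data
rng = np.random.default_rng(0)
toy_distance = pd.Series(rng.exponential(scale=5000, size=1000),
                         name="CompetitionDistance")

# Cap everything above the 99th percentile at that percentile
cap_value = toy_distance.quantile(0.99)
capped = toy_distance.clip(upper=cap_value)

n_capped = int((toy_distance > cap_value).sum())
print(f"Cap value (99th percentile): {cap_value:.0f} m")
print(f"Values capped: {n_capped} of {len(toy_distance)}")
print(f"Max before: {toy_distance.max():.0f} m, max after: {capped.max():.0f} m")
```

Capping keeps the affected rows in the dataset while limiting how far extreme distances can pull a model's fit.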

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight has positive business implications.
Stores located far from competitors likely enjoy a competitive advantage — higher sales potential and customer loyalty due to limited alternatives nearby.

* Rossmann can use this insight to identify profitable low-competition zones and target similar locations for new store openings.

---

* However, the negative aspect is that many stores face close competition, which can lead to reduced daily sales if pricing or promotions are not optimized.

* To avoid negative growth, Rossmann can analyze sales patterns of stores with nearby competitors and apply aggressive marketing or localized discount strategies to maintain market share.


#### Chart 4 - Count Plot of Promotion Activity

In [None]:
df.columns

In [None]:
# Chart - 4 visualization code

# Features used :- Promo

# ✅ Why this chart is important to include ?
# Promotions have a direct and measurable impact on sales — they drive customer traffic, influence purchasing behavior,
# and affect store profitability.
# By visualizing how frequently promotions occur, we can:
#     * Understand how often Rossmann runs promotions.
#     * Check for data balance (too few or too many promo days).
#     * Prepare for later bivariate analysis, where we’ll compare promotions with sales performance.
#     * This chart helps lay the groundwork for analyzing how promotional frequency affects sales trends.



#------------------------------------------------------------------------------------------------------------------------------------#

# Creating the figure and defining its size
plt.figure(figsize=(7,5))

# Plotting the count plot to show number of promo vs non-promo days
sns.countplot(
    data=df,
    x='Promo',                        # Selecting the 'Promo' variable
    palette=['salmon', 'lightgreen'],  # Choosing contrasting colors for clarity
    edgecolor='black'          # Adding borders to bars for visual definition
)

# Adding value annotations above bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Adding title and axis labels
plt.title('Count Plot of Promotion Activity', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Promotion Active (1 = Yes, 0 = No)', fontsize=12)
plt.ylabel('Count of Days', fontsize=12)

# Customizing grid and borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* I selected a count plot because Promo is a binary categorical variable.

* A count plot clearly shows how many observations fall into each category (1 for promo days, 0 for non-promo days).

* This makes it easy to visualize the overall proportion of promotional activity across the dataset.

##### 2. What is/are the insight(s) found from the chart?

* The count plot shows that non-promo days significantly outnumber promo days.
* This indicates that promotions are occasional and strategically planned rather than frequent.
* It suggests that the company likely runs promotions during specific periods to maximize customer engagement instead of keeping discounts active all the time.
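
To put a number on "significantly outnumber", the promo share can be computed directly. This is a minimal sketch on a made-up `promo` series; in the notebook it would be `df['Promo']`, assumed to be coded as 0/1.

```python
import pandas as pd

# Hypothetical promo flags for ten days (1 = promo day, 0 = no promo)
promo = pd.Series([0, 0, 0, 1, 1, 0, 0, 1, 0, 0], name='Promo')

# normalize=True returns proportions instead of raw counts
promo_share = promo.value_counts(normalize=True).sort_index()

print(promo_share)
```

Reporting the proportion alongside the count plot makes the class balance explicit before any modeling step.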

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights can create a positive business impact by helping management evaluate the effectiveness of promotional frequency.

* If limited promotions result in noticeable sales spikes, it confirms that targeted promotions are effective.

---

* However, having too few promotions could lead to negative growth, especially if competitors run frequent offers and attract Rossmann's potential customers.

* Balancing promotional frequency can therefore help retain customer interest and maintain competitiveness.

#### Chart 5 - Pie Chart of Store Type Distribution

In [None]:
# Chart - 5 visualization code

# Features used - StoreType

# ✅ Why this chart is important to include
# Different store types cater to different customer segments and have varying sales capacities.
# Visualizing the proportion of store types helps us:
#     Understand Rossmann’s store structure and market distribution.
#     Identify which store type dominates the business.
#     Prepare for later analyses (e.g., which store type drives the most sales).
# This chart provides a strategic overview of Rossmann's retail landscape.


#--------------------------------------------------------------------------------------------------------#


# Setting color palette for professional appearance
colors = sns.color_palette('pastel')

# Creating the figure and defining its size
plt.figure(figsize=(7,7))

# Calculating the count of each store type
store_counts = df['StoreType'].value_counts()

# Plotting the pie chart
plt.pie(
    store_counts,
    labels=store_counts.index,             # Adding store type labels
    autopct='%1.1f%%',                     # Displaying percentage values
    startangle=90,                         # Rotating for better visual balance
    colors=colors,                         # Using soft pastel palette
    wedgeprops={'edgecolor': 'black'},     # Adding edge for clarity
    textprops={'fontsize': 11}
)

# Adding title
plt.title('Distribution of Store Types', fontsize=15, fontweight='bold', color='darkblue')

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* A pie chart is chosen because it effectively illustrates the proportional composition of categorical variables.

* It provides a quick and intuitive view of how Rossmann's stores are distributed among different types.

* This helps in visually comparing which store type is most or least common in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* The pie chart shows that one or two store types (often Type a and Type d) dominate the dataset, while others represent smaller portions.

* This means that Rossmann's retail network is unevenly distributed across store categories — suggesting different operational focuses such as urban convenience stores vs. large suburban outlets.
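
One caveat: if `df` holds one row per store per day, `value_counts()` on `StoreType` measures the share of records, not the share of stores. The sketch below shows the difference on a toy frame; the `Store` identifier column is an assumption about the merged dataset.

```python
import pandas as pd

# Hypothetical daily records: store 3 has more rows than the others
daily = pd.DataFrame({
    'Store':     [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'StoreType': ['a', 'a', 'a', 'd', 'd', 'a', 'a', 'a', 'a'],
})

# Share of rows per store type (what a pie chart on the raw frame shows)
record_share = daily['StoreType'].value_counts(normalize=True)

# Share of distinct stores per store type (one row per store)
store_share = (
    daily.drop_duplicates('Store')['StoreType']
    .value_counts(normalize=True)
)

print(record_share)
print(store_share)
```

When every store contributes a similar number of days, the two views are close; otherwise the per-store view is the safer basis for statements about the store network.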

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this visualization provides valuable business insights.

* Understanding which store type dominates helps Rossmann align inventory, staffing, and marketing strategies with store format.

* For instance, if smaller stores are the majority, strategies can focus on high turnover products and space optimization.

---

* However, heavy reliance on a single store type increases business risk because of limited diversification.

* Balancing the number of store types across locations ensures resilience and stable revenue growth.

#### Chart 6 - Bar Chart of Assortment Type Distribution

In [None]:
# Chart - 6 visualization code

# Features used -  Assortment

# ✅ Why this chart is important to include
# Assortment type affects customer attraction and sales volume — stores with a wider assortment may appeal
# to a broader customer base but also face higher operating costs.
# Visualizing this helps in understanding:
#   The dominant assortment strategy across Rossmann stores.
#   Whether most stores carry a limited or diverse product range.
#   How product variety could influence customer footfall and revenue.
# This chart is critical for connecting product diversity with store performance in later analysis.

#------------------------------------------------------------------------------------------------------------------------------#


# Setting visual theme
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(8,5))

# Plotting horizontal bar chart for assortment type distribution
sns.countplot(
    data=df,
    y='Assortment',               # Using 'y' for horizontal layout
    order=df['Assortment'].value_counts().index,  # Sorting bars by count
    palette='coolwarm',           # Applying visually distinct color palette
    edgecolor='black'             # Adding border to bars
)

# Adding title and labels
plt.title('Distribution of Assortment Types', fontsize=15, fontweight='bold', color='darkslateblue')
plt.xlabel('Number of Records', fontsize=12)
plt.ylabel('Assortment Type', fontsize=12)

# Adding count labels beside bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing grid for clean appearance
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Removing unnecessary borders
plt.box(False)

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* A horizontal bar chart is chosen because it clearly displays categorical data while emphasizing the magnitude of each category.

* It's ideal for showing frequency comparisons when category labels (like a, b, c) are short but counts differ noticeably.

* This chart type also improves readability when there are few categories.

##### 2. What is/are the insight(s) found from the chart?

* The bar chart shows that the majority of stores have an assortment type “a”, meaning they carry basic product lines.

* A smaller proportion of stores have type “b” (extra) or type “c” (extended) assortments.

* This suggests that Rossmann's focus is primarily on maintaining standardized product offerings across most locations, with only select stores offering specialized or extended assortments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can generate positive business impact.

* Knowing that most stores follow a basic assortment strategy helps Rossmann streamline supply chains, reduce inventory complexity, and control costs.

---

* However, the negative aspect could be that limited variety may restrict customer choice, especially in competitive areas where shoppers expect more options.

* Balancing basic and extended assortments could help attract more customers while maintaining operational efficiency.

#### Chart 7 - Bar Chart of Day of Week Distribution

In [None]:
# Chart - 7 visualization code

# Features used - DayOfWeek

# ✅ Why this chart is important to include
# This visualization helps us:
#   Identify which days stores are most active.
#   Detect patterns like reduced store activity on Sundays or peak activity midweek.
#   Prepare for later bivariate analysis to see how sales vary across weekdays.
#   This chart adds valuable context about operational rhythm and customer behavior —
#   critical for both forecasting and staffing decisions.


#-------------------------------------------------------------------------------------------------#


# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting bar chart showing store activity by day of the week
sns.countplot(
    data=df,
    x='DayOfWeek',                       # Selecting variable to visualize
    palette='viridis',                   # Applying color gradient
    edgecolor='black'                    # Adding border for clear separation
)

# Adding title and labels
plt.title('Distribution of Day of Week', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Day of the Week (1=Mon, 7=Sun)', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding count labels on each bar
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is chosen because DayOfWeek is a discrete variable with seven numerically coded categories (1 to 7).

* The bar chart effectively shows how many records fall under each weekday, allowing quick identification of operational trends - such as which days have the most or fewest store openings.

##### 2. What is/are the insight(s) found from the chart?

* The bar chart typically shows fewer records for Sundays (Day 7), meaning many stores are closed or operate with reduced hours on that day.
* The rest of the days (1-6) have nearly consistent counts, indicating that stores maintain regular activity during weekdays and Saturdays.
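
Record counts alone cannot show whether stores are actually closed on a given weekday, because closed days may still appear as rows. A rough way to check is to compare the open rate per weekday, as sketched below on made-up data (column names follow this notebook's `DayOfWeek` and `IsOpen`).

```python
import pandas as pd

# Hypothetical records: Sunday (7) rows exist but are mostly closed
daily = pd.DataFrame({
    'DayOfWeek': [1, 1, 6, 6, 7, 7, 7, 7],
    'IsOpen':    [1, 1, 1, 1, 0, 0, 0, 1],
})

# Number of rows per weekday (what the count plot shows)
records = daily['DayOfWeek'].value_counts().sort_index()

# Fraction of rows where the store was open, per weekday
open_rate = daily.groupby('DayOfWeek')['IsOpen'].mean()

print(records)
print(open_rate)
```

A low Sunday open rate with a normal Sunday record count would indicate closures that the count plot by itself does not reveal.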

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights are highly useful for business operations and workforce planning.

* Knowing that most stores are closed or less active on Sundays allows Rossmann to optimize staff scheduling, delivery logistics, and marketing campaigns.

* On the positive side, consistent weekday activity supports efficient resource utilization.

---

* However, the negative implication could be missed revenue opportunities on weekends, especially if competitors stay open.

* Rossmann could explore partial Sunday openings or online campaigns to capture unmet demand, minimizing potential negative growth.

#### Chart 8 - Count Plot of Store Opening Status  

In [None]:
# Chart - 8 visualization code

# Features Used - IsOpen


# ✅ Why this chart is important to include
# This chart helps identify:
#   The proportion of open vs closed days across all stores.
#   Whether missing or irregular patterns exist in store activity.
#   The operational consistency of the dataset (useful for model reliability).
# It’s a straightforward but crucial check to ensure that most data entries represent real,
# open-store scenarios contributing to actual sales.

#-------------------------------------------------------------------------------------------------#


# Creating the figure and setting its size
plt.figure(figsize=(7,5))

# Plotting count plot for store open vs closed days
sns.countplot(
    data=df,
    x='IsOpen',                        # Selecting the variable to visualize
    palette=['tomato', 'mediumseagreen'],  # Assigning contrasting colors
    edgecolor='black'                  # Adding bar borders for definition
)

# Adding value annotations above bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Adding title and axis labels
plt.title('Distribution of Store Opening Status', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Store Open Status (1 = Open, 0 = Closed)', fontsize=12)
plt.ylabel('Count of Records', fontsize=12)

# Customizing grid and removing chart borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* A count plot is chosen because IsOpen is a binary categorical variable.
* It provides a simple yet effective way to visualize how often stores are open or closed across all records, ensuring our dataset is operationally consistent before modeling.


##### 2. What is/are the insight(s) found from the chart?

* The count plot shows that the majority of records correspond to open days (IsOpen = 1), while a much smaller fraction are closed days.

* This confirms that the dataset mostly reflects active business days, which is ideal for sales prediction.

* The presence of a few closed days also shows the dataset captures realistic business operations.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights are important for ensuring data and operational integrity.

* The dominance of open days provides a strong foundation for accurate sales modeling - ensuring that most data points represent valid business activity.

---

* However, from a business standpoint, frequent closures (if any) could lead to lost revenue opportunities.

* Monitoring and minimizing unexpected closures can therefore help prevent negative growth and improve store-level performance consistency.


#### Chart 9 - Donut Chart of Promo2 (Long-Term Promotion Participation)



In [None]:
# Chart - 9 visualization code

# Features used - Promo2


# ✅ Why this chart is important to include ?
# The Promo2 feature indicates whether a store is enrolled in Rossmann’s long-term continuous promotion scheme.
# Visualizing this helps us:
#   Understand what portion of stores are part of ongoing promotions.
#   Identify how widespread the Promo2 program is within the business.
#   Prepare for later bivariate or multivariate analysis to compare sales behavior
#     between Promo2 and non-Promo2 stores.
#   It connects directly to strategic marketing and sales forecasting.


#------------------------------------------------------------------------------------------------------#



# Setting color palette
colors = sns.color_palette(['#66b3ff', '#ff9999'])  # Blue for active, red for not participating

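# Creating count of Promo2 values in a fixed order (1 = participating, 0 = not participating)
# so that the hard-coded labels and colors below always match the data
promo2_counts = df['Promo2'].value_counts().reindex([1, 0], fill_value=0)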

# Creating the figure and defining its size
plt.figure(figsize=(7,7))

# Plotting the donut (pie) chart
wedges, texts, autotexts = plt.pie(
    promo2_counts,
    labels=['Participating', 'Not Participating'],
    autopct='%1.1f%%',               # Showing percentage
    startangle=90,                   # Rotating for better balance
    colors=colors,
    wedgeprops={'edgecolor': 'black', 'linewidth': 1},
    textprops={'fontsize': 11}
)

# Creating the white circle in the center to form a donut shape
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Adding title
plt.title('Distribution of Promo2 (Long-Term Promotion Participation)', fontsize=15, fontweight='bold', color='darkblue')

# Displaying the chart
plt.show()




##### 1. Why did you pick the specific chart?

* A donut chart is chosen because it effectively represents part-to-whole proportions while offering a clean, modern visualization style.

* It helps clearly show how many stores are part of Rossmann's long-term promotion program (Promo2) versus those that are not.

##### 2. What is/are the insight(s) found from the chart?

* The donut chart shows that a smaller proportion of stores participate in Promo2, while most stores do not.

* This suggests that Rossmann's long-term promotion strategy is selective, possibly targeting specific store types or regions.

* The insight implies that Promo2 participation is not universal, giving a good opportunity to compare performance across both groups.
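
The group comparison mentioned above could start with a simple aggregation. This is only a sketch on invented numbers; it assumes the merged frame has `Sales`, `IsOpen`, and `Promo2` columns, and it restricts the comparison to open days so that closed-day zeros do not distort the averages.

```python
import pandas as pd

# Hypothetical daily records for Promo2 and non-Promo2 stores
daily = pd.DataFrame({
    'Promo2': [0, 0, 0, 1, 1, 1],
    'IsOpen': [1, 1, 0, 1, 1, 1],
    'Sales':  [5000, 7000, 0, 4000, 6000, 5000],
})

# Keep only days on which the store was open
open_days = daily[daily['IsOpen'] == 1]

# Mean, median, and number of open days per Promo2 group
avg_sales = open_days.groupby('Promo2')['Sales'].agg(['mean', 'median', 'count'])

print(avg_sales)
```

A difference in group means is descriptive only; store type, competition, and location would need to be controlled for before attributing it to Promo2.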

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes — this insight can lead to positive business impact by identifying how long-term promotions influence overall sales performance.

* If Promo2 stores consistently perform better, Rossmann could consider expanding the program to more locations.

---


* However, if Promo2 participation doesn't show strong results, it may indicate inefficient promotional costs or poor targeting, potentially leading to negative growth.

* Understanding this helps in optimizing promotional investments for better ROI.

#### Chart 10 - Area Plot of Monthly Distribution

In [None]:
# Chart - 10 visualization code


# Features Used - Month

# ✅ Why this chart is important to include
# The Month column captures the temporal dimension of the data —
# which is essential for understanding seasonality in retail.
# Analyzing it helps identify:
#     Which months are busiest, indicating peak sales or promotion periods.
#     Low-activity months, which might need marketing push or inventory optimization.
#     Any missing or uneven data distribution across months (important for modeling).
#     The area plot adds visual depth, showing both trends and magnitudes over time,
#         perfect for highlighting seasonal variation.



#------------------------------------------------------------------------------------------------------#



# Creating a series of month-wise counts and ensuring months 1..12 are present
month_counts = df['Month'].value_counts().sort_index()
month_counts = month_counts.reindex(range(1,13), fill_value=0)  # Ensuring complete month index

# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting the area chart using fill_between (x, y1)
# Passing x and y1 as positional arguments to satisfy matplotlib API
plt.fill_between(
    month_counts.index,            # x: month numbers 1..12
    month_counts.values,           # y1: counts for each month
    color='lightcoral',            # Choosing a soft red tone for retail theme
    alpha=0.7,                     # Adding transparency for aesthetic look
    linewidth=2,
    edgecolor='darkred'
)

# Overlaying a line for clearer trend visualization
plt.plot(
    month_counts.index,
    month_counts.values,
    marker='o',
    color='darkred',
    linewidth=1.8
)

# Adding title and axis labels
plt.title('Monthly Distribution of Records', fontsize=16, fontweight='bold', color='darkred')
plt.xlabel('Month (1 = Jan, 12 = Dec)', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing ticks to show month numbers
plt.xticks(ticks=np.arange(1,13), labels=[str(m) for m in range(1,13)])

# Customizing grid and removing the axes frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()



##### 1. Why did you pick the specific chart?

* An area plot is chosen because it visually emphasizes both the trend and volume of monthly records.

* It helps to quickly identify seasonal highs and lows across months, which are vital for understanding sales cycles, promotional timing, and annual business behavior.

##### 2. What is/are the insight(s) found from the chart?

* The area plot generally shows consistent record volume across months, confirming even data collection.

* However, this chart counts records rather than sales, so visible peaks or dips mainly reflect how many store-days were logged in each month.

* Holiday-season effects (for example in November or December) and off-peak months would need to be confirmed by aggregating sales by month.
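
To separate data coverage from seasonality, record counts and average sales can be computed side by side. The sketch below uses invented values and assumes a `Sales` column next to the `Month` column used in this chart.

```python
import pandas as pd

# Hypothetical daily records: December has fewer rows but higher sales
daily = pd.DataFrame({
    'Month': [1, 1, 1, 6, 6, 12, 12],
    'Sales': [4000, 5000, 6000, 5500, 4500, 9000, 8000],
})

# Named aggregation: rows per month and mean sales per month
monthly = daily.groupby('Month')['Sales'].agg(records='count', avg_sales='mean')

print(monthly)
```

In this toy example January has the most records while December has the highest average sales, which is why the two columns should be read separately.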

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights can directly support strategic planning and forecasting.

* Recognizing high-activity months enables Rossmann to plan inventory, staff, and marketing budgets more efficiently.

* Promotions can be scheduled ahead of peak months to maximize profits.

---

* However, identifying low-activity months is equally important — if left unaddressed, they can lead to negative growth due to underperformance.

* Rossmann can counteract that by running special seasonal offers or loyalty campaigns in quieter months.

#### Chart 11 - Bar Chart of Yearly Distribution

In [None]:
# Chart - 11 visualization code

# Features Used - Year

# ✅ Why this chart is important to include
# Visualizing the Year variable helps:
#   Confirm that data spans multiple years and check for imbalanced coverage.
#   Reveal if certain years had more store activity or data collection (possibly due to expansion).
#   Prepare for time-based modeling by understanding how many training samples come from each period.
#   From a business perspective, it highlights growth or contraction patterns in store operations.

#------------------------------------------------------------------------------------------------------#

# Creating figure and defining its size
plt.figure(figsize=(8,5))

# Creating a year-wise count plot to see record distribution by year
sns.countplot(
    data=df,
    x='Year',                        # Selecting the 'Year' variable
    palette='coolwarm',              # Using balanced warm-cool color scheme
    edgecolor='black'                # Adding black borders for clear separation
)

# Adding title and axis labels
plt.title('Yearly Distribution of Records', fontsize=15, fontweight='bold', color='darkslateblue')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding count labels on each bar
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is chosen because Year is a discrete, time-based categorical variable with a small range of unique values.

* It clearly shows how many records belong to each year, making it easy to spot data imbalance or trend progression across years.

##### 2. What is/are the insight(s) found from the chart?

* The bar chart shows that most data points belong to recent years, with fewer records in earlier ones.

* This suggests that Rossmann's store activity and data logging have expanded over time, aligning with possible business growth or modernization.

* It also confirms that the dataset is chronologically diverse, useful for time-series forecasting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights directly relate to strategic decision-making.

* A steady or increasing number of records each year indicates business expansion, which supports positive growth trends.

---

* However, if the chart shows a drop in data for certain years, it might signal reduced operations or missing data, which could distort model accuracy or indicate negative growth.

* Rossmann can use this insight to review operational changes and ensure data completeness for future planning.


#### Chart 12 - PromoInterval (Months When Long-Term Promotions Are Active)

In [None]:
# Chart - 12 visualization code


# Features Used - PromoInterval

# ✅ Why this chart is important to include ?
# The PromoInterval column defines which months (e.g., "Feb,May,Aug,Nov") a store’s long-term
# promotion (Promo2) is active.
# Analyzing this helps Rossmann understand:
#   Which months are most commonly used for recurring promotions.
#   Whether promotional activity is evenly spread or clustered in specific months.
#   Seasonal promotional strategy — for instance, heavier promotions in mid-year or near holidays.
#   This chart provides clear visibility into Rossmann’s annual promotion schedule.


#--------------------------------------------------------------------------------------------------------#

# Keeping only non-null PromoInterval values
promo_df = df.dropna(subset=['PromoInterval']).copy()

# Splitting comma-separated months into lists efficiently
promo_df['PromoInterval'] = promo_df['PromoInterval'].str.split(',')

# Using explode() to create one row per month (much faster than sum())
promo_month_counts = (
    promo_df['PromoInterval']
    .explode()                              # Expanding month list into separate rows
    .value_counts()                         # Counting occurrences of each month
    .sort_index()                           # Sorting months alphabetically
)

# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting bar chart for frequency of months in PromoInterval
sns.barplot(
    x=promo_month_counts.index,
    y=promo_month_counts.values,
    palette='crest',
    edgecolor='black'
)

# Adding title and labels
plt.title('Frequency of Months in PromoInterval', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Records Listing the Month', fontsize=12)

# Adding count labels on bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing grid and borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A bar chart is chosen because it effectively displays how frequently each month appears in PromoInterval.

* It's simple, interpretable, and perfect for showing categorical frequency — i.e., which months most stores run long-term promotions.

##### 2. What is/are the insight(s) found from the chart?

* The bar chart shows that February, May, August, and November appear most often, confirming Rossmann's quarterly promotion strategy.

* This pattern suggests that the company aligns major promotional periods with seasonal shopping trends — possibly before spring, summer, fall, and holiday seasons.
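
Alphabetical sorting places "Apr" before "Jan", which makes a seasonal pattern harder to read. One option is to reindex the counts in calendar order, as sketched below. The month labels in `calendar_order` are an assumption and must match the exact spellings used in `PromoInterval` (some datasets write "Sept" instead of "Sep", for example).

```python
import pandas as pd

# Hypothetical PromoInterval values, including a missing entry
intervals = pd.Series(['Jan,Apr,Jul,Oct', 'Feb,May,Aug,Nov', 'Jan,Apr,Jul,Oct', None])

# Assumed month labels in calendar order; adjust to the data's spellings
calendar_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

month_counts = (
    intervals.dropna()
    .str.split(',')          # one list of months per row
    .explode()               # one row per month
    .str.strip()             # removing stray whitespace around labels
    .value_counts()
    .reindex(calendar_order, fill_value=0)  # calendar order, 0 for absent months
)

print(month_counts)
```

Months that never appear show up with a count of 0 instead of being silently dropped, which makes gaps in the promotion calendar visible.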

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights can lead to a positive business impact by confirming that Rossmann strategically schedules promotions throughout the year for steady customer engagement.

* By analyzing which promo months generate the highest sales spikes, Rossmann can fine-tune the optimal timing and duration of its campaigns.


---


* However, if certain months (like January or June) rarely appear, those months could see periodic sales slumps, a potential source of negative growth that can be mitigated by introducing promotions during those low-activity months.


#### Chart 13 - Distribution of Sales per Customer

In [None]:
# Chart - 13 visualization code

# Features Used - SalesPerCustomer

# ✅ Why this chart is important to include
# The SalesPerCustomer feature measures store efficiency and customer spending behavior —
# it tells how much each customer spends on average when visiting a store.
# Analyzing it helps us:
#   Identify high-value stores or customers.
#   Detect spending variation across stores or promotions.
#   Understand whether sales growth comes from more customers or higher spend per visit.
#   It’s one of the most actionable insights for business decision-making.


#--------------------------------------------------------------------------------------------------#


# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting histogram with KDE for Sales per Customer
sns.histplot(
    data=df,
    x='SalesPerCustomer',
    bins=50,
    kde=True,
    color='mediumseagreen',
    alpha=0.7
)

# Adding title and axis labels
plt.title('Distribution of Sales per Customer', fontsize=16, fontweight='bold', color='darkgreen')
plt.xlabel('Average Sales per Customer', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Customizing grid and removing unnecessary borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()



##### 1. Why did you pick the specific chart?

* A histogram with a KDE curve is chosen because SalesPerCustomer is a continuous numeric variable.

* It's the best way to understand the spread, skewness, and outliers in customer spending.

* This visualization helps determine whether most customers spend similarly or if there are wide variations in spending behavior.

##### 2. What is/are the insight(s) found from the chart?

* The distribution is typically right-skewed, meaning most customers spend within a moderate range, but there are a few instances where average spend per customer is very high.

* This could indicate premium stores, loyalty-based customers, or successful promotions.

* It also confirms that spending patterns vary significantly across stores and customer groups.
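
For reference, a feature like `SalesPerCustomer` is commonly derived as sales divided by customer count. The sketch below is one possible derivation, not necessarily the one used earlier in this notebook; it assumes `Sales` and `Customers` columns and treats days with zero customers as missing to avoid division by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records, including one closed day with no customers
daily = pd.DataFrame({
    'Sales':     [5000, 7200, 0, 9000, 30000],
    'Customers': [500, 800, 0, 900, 1000],
})

# Replacing 0 customers with NaN so the ratio is undefined rather than infinite
spc = daily['Sales'] / daily['Customers'].replace(0, np.nan)

# Positive skewness indicates a right-skewed distribution
skewness = spc.skew()

print(spc)
print(f"Skewness: {skewness:.2f}")
```

Checking the skewness value gives a quick numeric confirmation of the right tail seen in the histogram.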

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes — this insight can directly improve marketing and pricing strategies.

* Stores with higher SalesPerCustomer can serve as benchmarks — Rossmann can analyze what differentiates them (location, promotions, product mix).

* Encouraging similar behavior in other stores can drive positive business impact through better conversion and upselling.

---

* On the other hand, stores with very low average spend might indicate ineffective pricing, poor assortment, or promotions that attract low-value customers — a sign of potential negative growth.

* Targeted improvements in product placement or promotional design can address this imbalance.

#### Chart 14 - Zero Sales While Open

In [None]:
# Chart - 14 visualization code


# Features Used - ZeroSalesWhileOpen


# ✅ Why this chart is important to include
# This chart highlights how often stores were open but recorded zero sales —
# an indicator of data issues or operational inefficiency.
# It’s an excellent inclusion because it bridges the gap between data validation
# and business performance, helping Rossmann ensure model accuracy and store productivity.



#------------------------------------------------------------------------------------------------------#


# Calculating total count of such occurrences
zero_sales_counts = df['ZeroSalesWhileOpen'].value_counts()

# Setting visual theme
sns.set(style='whitegrid')

# Creating figure and defining size
plt.figure(figsize=(7,6))

# Plotting a bar chart to show count of zero sales occurrences
sns.barplot(
    x=zero_sales_counts.index.map({0: 'No', 1: 'Yes'}),
    y=zero_sales_counts.values,
    palette=['mediumseagreen', 'salmon'],
    edgecolor='black'
)

# Adding title and labels
plt.title('Occurrences of Zero Sales While Store Was Open', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Zero Sales While Open', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding data labels on bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing the axes frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

* A bar chart is chosen because the variable ZeroSalesWhileOpen is binary (Yes/No).

* This makes a bar chart ideal for showing how often such cases occur in the dataset.

* It clearly visualizes the frequency of potential anomalies — stores open but not generating sales.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that while most store records reflect sales during open hours, there are a few cases where the store was open but reported zero sales.
* These records are rare but significant, as they could indicate:

    * Data entry errors

    * System outages

    * Or genuinely unproductive business days.
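To make the flag behind this chart concrete, here is a minimal sketch of how such a column could be derived and summarized. The definition used (`Open == 1` and `Sales == 0`) is an assumption, since the cell that creates `ZeroSalesWhileOpen` is not shown in this section, and the `demo` rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample in the shape of the Rossmann training data (values invented).
demo = pd.DataFrame(
    {
        "Store": [1, 1, 2, 2, 3, 3],
        "Open": [1, 0, 1, 1, 1, 0],
        "Sales": [5263, 0, 0, 6064, 8314, 0],
    }
)

# Assumed definition: the store was open but recorded no sales that day.
demo["ZeroSalesWhileOpen"] = (
    (demo["Open"] == 1) & (demo["Sales"] == 0)
).astype(int)

# Count and share of flagged records.
n_flagged = int(demo["ZeroSalesWhileOpen"].sum())
share_flagged = demo["ZeroSalesWhileOpen"].mean()

# Stores with at least one flagged day, as a starting point for investigation.
flagged_stores = demo.loc[demo["ZeroSalesWhileOpen"] == 1, "Store"].unique().tolist()

print(n_flagged, round(share_flagged, 3), flagged_stores)
```

Listing the affected stores, rather than only counting records, makes it easier to tell a data-entry problem from a genuinely unproductive day.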

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes — this insight has strong positive business implications.
Identifying and addressing “zero sales while open” cases ensures clean, accurate data for future modeling, leading to more reliable forecasts.

* Operationally, Rossmann can investigate these instances to prevent future inefficiencies, improving store uptime and staff productivity.

* If ignored, such cases could mislead analysis or mask revenue loss, leading to negative growth.

#### Chart 15 - Distribution of Competition Open Months

In [None]:
# Chart - 15 visualization code


# Features Used - CompetitionOpenMonths


# ✅ Why this chart is important to include
# The CompetitionOpenMonths feature shows how long each store’s competitor has been active near it.
# Analyzing this helps us:
#   Understand how competitive exposure varies across stores.
#   Identify whether most stores are newly exposed or have been long-term competitors.
#   Anticipate how competition affects store performance over time.
#   It’s an important variable for understanding market saturation and competitive pressure.


#------------------------------------------------------------------------------------------------------#


# Setting the visual theme
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting histogram with KDE
sns.histplot(
    data=df,
    x='CompetitionOpenMonths',
    bins=40,
    kde=True,
    color='royalblue',
    alpha=0.7
)

# Adding title and labels
plt.title('Distribution of Competition Open Months', fontsize=16, fontweight='bold', color='navy')
plt.xlabel('Months Since Competition Opened', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing grid and frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

* A histogram with a KDE curve is chosen because CompetitionOpenMonths is a numeric variable measured in months.

* This makes it suitable for showing how competitive exposure is distributed across the records.

* It shows whether most stores are newly exposed to competition or have faced it for a long time.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows how the records are spread over the number of months a competitor has been open near each store.
* It indicates whether most stores are newly exposed to competition or have had long-term competitors nearby, and how widely competitive exposure varies across stores.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes. Knowing how long competitors have been active near each store helps Rossmann understand market saturation and competitive pressure.

* It also helps anticipate how competition affects store performance over time.

* If ignored, the effect of competition on sales could be missed in the analysis, leading to less reliable forecasts for the affected stores.
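For reference, a feature like `CompetitionOpenMonths` can be derived from the record date and the competitor's opening year and month. The sketch below is an assumption about how that might be done, since the cell that builds the feature is not shown in this section. The column names `CompetitionOpenSinceYear` and `CompetitionOpenSinceMonth` follow the public Rossmann store file, and the `demo` rows are invented.

```python
import pandas as pd

# Hypothetical records; column names follow the public Rossmann store file.
demo = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-07-31", "2015-07-31", "2015-07-31"]),
        "CompetitionOpenSinceYear": [2008, 2015, 2016],
        "CompetitionOpenSinceMonth": [9, 7, 1],
    }
)

# Whole months between the competitor's opening month and the record date.
months = 12 * (demo["Date"].dt.year - demo["CompetitionOpenSinceYear"]) + (
    demo["Date"].dt.month - demo["CompetitionOpenSinceMonth"]
)

# A competitor that opens after the record date would give a negative value;
# clipping at 0 treats it as "no competition yet" (an assumed convention).
demo["CompetitionOpenMonths"] = months.clip(lower=0)

print(demo["CompetitionOpenMonths"].tolist())
```

Missing opening dates would need separate handling before this calculation, for example through the imputation step later in the notebook.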

#### Chart 16 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart 17 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
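As an illustration of how a p-value could be obtained here, the sketch below runs a two-sided permutation test on the difference in mean sales between two groups. Both the hypothesis (promo days versus non-promo days) and the synthetic numbers are assumptions made for the example. They are not results from this notebook, and a t-test or another test may be a better fit for the real data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic daily sales for two hypothetical groups (NOT real Rossmann data).
sales_promo = rng.normal(loc=8000, scale=1500, size=300)
sales_no_promo = rng.normal(loc=6000, scale=1500, size=300)

# H0: mean sales are equal in both groups. H1: mean sales differ.
observed_diff = sales_promo.mean() - sales_no_promo.mean()

pooled = np.concatenate([sales_promo, sales_no_promo])
n_promo = sales_promo.size
n_permutations = 2_000

# Under H0 the group labels are exchangeable, so shuffle them repeatedly and
# count how often a difference at least as large as the observed one appears.
count_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_promo].mean() - shuffled[n_promo:].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

# The add-one correction keeps the estimated p-value strictly positive.
p_value = (count_extreme + 1) / (n_permutations + 1)
reject_null = p_value < 0.05

print(f"observed difference={observed_diff:.1f}, p-value={p_value:.4f}")
```

A permutation test makes no normality assumption, which can be useful when sales are skewed. Note that it still treats the observations as independent, which daily store records may not be.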

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
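One common option for low-cardinality categorical columns is one-hot encoding. The sketch below uses `pandas.get_dummies` on a hypothetical `demo` frame. The column names `StoreType` and `Assortment` are taken from the public Rossmann store file and are assumptions here, as are the values.

```python
import pandas as pd

# Hypothetical sample; StoreType and Assortment are assumed column names.
demo = pd.DataFrame(
    {
        "StoreType": ["a", "c", "d", "a"],
        "Assortment": ["a", "c", "a", "b"],
        "Sales": [5263, 6064, 8314, 4822],
    }
)

# One-hot encode both columns. drop_first=True drops one level per column so
# the dummies are not perfectly collinear, which matters for linear models.
encoded = pd.get_dummies(
    demo,
    columns=["StoreType", "Assortment"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

Tree-based models are generally less sensitive to the dropped level, so `drop_first` is mainly a consideration for linear regression.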

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?
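Since the project summary describes sales as skewed and names RMSLE as the main metric, a log transform of the target is one candidate. The sketch below shows `log1p` and its inverse on invented sales values. Whether it is the right transformation for this dataset is a judgment the analysis itself has to support.

```python
import numpy as np

# Invented daily sales values, including a zero and one very large value.
sales = np.array([0.0, 1200.0, 5300.0, 8400.0, 41000.0])

# log1p(x) = log(1 + x) stays defined at 0 and compresses the long right tail.
log_sales = np.log1p(sales)

# expm1 is the exact inverse, used to bring predictions back to sales units.
restored = np.expm1(log_sales)

print(np.round(log_sales, 3))
```

A model trained on `log_sales` minimizes squared error in log space, which lines up with an RMSLE-style evaluation once predictions are mapped back with `expm1`.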

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
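Because the task is daily sales forecasting, one option worth considering is a time-ordered split, where the most recent days are held out instead of a random sample. The sketch below is a hypothetical illustration: the 20-day holdout and the `demo` frame are made up, and the actual ratio should follow the forecast horizon the business needs.

```python
import pandas as pd

# Hypothetical daily records for a single store (values invented).
demo = pd.DataFrame(
    {
        "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
        "Sales": range(100),
    }
).sort_values("Date")

# Hold out the most recent days so the model is always evaluated on dates
# that come after everything it was trained on.
holdout_days = 20  # assumed value for illustration
cutoff = demo["Date"].max() - pd.Timedelta(days=holdout_days)

train_df = demo[demo["Date"] <= cutoff]
test_df = demo[demo["Date"] > cutoff]

print(len(train_df), len(test_df))
```

A random split would let the model see days that fall after some of its test days, which can make the reported error look better than it would be in production.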

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
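The project summary names Root Mean Squared Logarithmic Error (RMSLE) as the primary metric. A minimal NumPy sketch of that metric follows. The function name `rmsle` and the sample values are illustrative, not taken from this notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    # log1p keeps the metric defined when sales or predictions are zero.
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))


# Invented example values.
actual = [5000.0, 7000.0, 0.0, 12000.0]
predicted = [5200.0, 6500.0, 100.0, 11000.0]
print(round(rmsle(actual, predicted), 4))
```

Because the error is computed on logarithms, RMSLE penalizes relative rather than absolute differences, so a 10% miss counts about the same for a small store as for a large one.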

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
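The save-and-reload sanity check can be sketched with the standard-library `pickle` module. In the example below the "model" is only a stand-in: a dictionary of straight-line coefficients fitted with `np.polyfit` on invented data. The file name `rossmann_model.pkl` and the helper `predict` are likewise hypothetical. A fitted estimator from the notebook would be dumped and loaded the same way.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Stand-in "model": a straight-line fit of sales on customers (invented data).
customers = np.array([300.0, 450.0, 600.0, 750.0])
sales = np.array([3100.0, 4400.0, 6050.0, 7500.0])
slope, intercept = np.polyfit(customers, sales, deg=1)
model = {"slope": float(slope), "intercept": float(intercept)}


def predict(fitted, x):
    """Apply the stored line to new customer counts."""
    return fitted["slope"] * np.asarray(x, dtype=float) + fitted["intercept"]


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rossmann_model.pkl"  # hypothetical file name

    # Save the fitted object.
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)

    # Load it back, as a deployment process would.
    with open(model_path, "rb") as fh:
        loaded = pickle.load(fh)

# Sanity check: the reloaded model gives the same predictions on unseen inputs.
unseen = [500.0, 900.0]
same_predictions = bool(np.allclose(predict(model, unseen), predict(loaded, unseen)))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.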

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***