
# **Project Name**    - Rossmann Retail Sales Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Student Name**    - Monali Vijay Mhaske

# **Project Summary -**

* Rossmann is a large European drugstore chain operating over 3,000 stores across several countries. The company faces a key business challenge: accurately forecasting daily sales for each store. Reliable sales forecasts are crucial for effective inventory planning, workforce scheduling, and promotional strategy.

* This project aims to predict the daily sales of Rossmann stores using historical sales data combined with store-specific information such as promotions, holidays, competition, and assortment types. By developing a predictive model, Rossmann management can make data-driven decisions to optimize operations and improve profitability.

* The project will follow a structured data science life-cycle, which includes:

  1. Data Understanding - Exploring the historical sales and store datasets to identify patterns, data types, and business drivers.


  2. Data Wrangling & Cleaning - Handling missing values, correcting inconsistencies, encoding categorical variables, and preparing data for analysis.


  3. Exploratory Data Analysis (EDA) - Performing univariate, bivariate, and multivariate analysis to uncover relationships between features such as promotions, holidays, and sales trends.


  4. Feature Engineering - Creating new features like competition duration, promo activity flags, and temporal variables (month, week, weekday) to enhance model performance.


  5. Model Development & Evaluation - Building regression-based machine learning models (e.g., Linear Regression, Random Forest, XGBoost) to predict sales. Model performance will be evaluated primarily using Root Mean Squared Logarithmic Error (RMSLE), a suitable metric for skewed sales data.


  6. Model Interpretation & Business Insights - Interpreting the model results to generate actionable insights, such as how promotions or competition impact sales, and providing recommendations to management.



* The expected outcome of this project is a robust, data-driven forecasting model that can accurately estimate future store sales and highlight key factors influencing them. This will help Rossmann:

    * Ensure better inventory and staff management,

    * Improve promotion planning and marketing effectiveness, and

    * Ultimately increase overall profitability.
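Step 5 above names RMSLE as the primary evaluation metric. As a minimal sketch of how that metric can be computed with NumPy, the helper name `rmsle` and the sample numbers below are illustrative assumptions, not values taken from the Rossmann data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Clip negative predictions to 0, since log1p is undefined below -1
    y_pred = np.clip(y_pred, 0, None)
    # log1p(x) = log(1 + x), so zero-sales days are handled without error
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up daily sales and predictions
actual = [0, 5000, 7500, 12000]
predicted = [0, 5500, 7000, 11000]
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is measured on the log scale, a miss of 500 on a 5,000-sales day is penalised more than the same miss on a 12,000-sales day.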

# **GitHub Link -**

https://github.com/MonaliM5/rossmann_retail_sales_prediction

# **Problem Statement**


* In the retail industry, accurate sales forecasting is critical for effective decision-making. Retailers like Rossmann, one of Europe's largest drugstore chains, must regularly decide how much stock to order, how to schedule employees, and how to plan promotional campaigns - all of which depend heavily on anticipated sales.

* However, predicting store-level daily sales is challenging because it is influenced by multiple dynamic factors such as store location, promotions, holidays, competition, and seasonality. An incorrect forecast can lead to overstocking or understocking, resulting in financial losses and poor customer experience.

* The primary objective of this project is to develop a predictive model capable of accurately estimating daily sales for each Rossmann store using historical sales and store information. The model should capture the impact of various external and internal factors - including promotions, holidays, competition distance, assortment type, and time-related variables - on store performance.

* The predictive insights from this project will enable Rossmann's management to:

  * Optimize inventory and staffing levels,

  * Plan promotional activities more effectively,

  * Improve supply chain efficiency, and

  * Enhance overall business profitability.


* The project will apply systematic data analysis and machine learning techniques to derive actionable insights that directly support Rossmann's strategic and operational decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following points are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, make sure the following points are addressed.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Datetime
from datetime import datetime


# Machine learning and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# System and warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

### Dataset Loading

In [None]:
# Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
Sales_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Module 6 - Machine Learning/Chapt 1.5 - Capstone Project Regression/Rossmann Stores Data.csv", parse_dates=['Date'])
Store_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Module 6 - Machine Learning/Chapt 1.5 - Capstone Project Regression/store.csv")

### Dataset First View

In [None]:
# Sales Dataset First Look
print("Sales Data First View ")
Sales_df.head()

In [None]:
# Store Dataset First Look
print("Store Data First View ")
Store_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Sales Dataset -\n Rows Count : {Sales_df.shape[0]}  \tColumns Count : {Sales_df.shape[1]}")
print(f"Store Dataset -\n Rows Count : {Store_df.shape[0]}  \tColumns Count : {Store_df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Information

# Sales dataset info
print("Sales Data Information:")
Sales_df.info()
print("\n" + "="*60 + "\n")

# Store dataset info
print("Store Data Information:")
Store_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Sales dataset duplicates (count computed once and reused)
sales_dup_count = Sales_df.duplicated().sum()
print(f"Sales Dataset → Duplicate rows: {sales_dup_count}")

if sales_dup_count > 0:
    print("\nSample duplicate rows from Sales_df:")
    display(Sales_df[Sales_df.duplicated()].head())

print("\n" + "-"*60 + "\n")

# Store dataset duplicates
store_dup_count = Store_df.duplicated().sum()
print(f"Store Dataset → Duplicate rows: {store_dup_count}")

if store_dup_count > 0:
    print("\nSample duplicate rows from Store_df:")
    display(Store_df[Store_df.duplicated()].head())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Sales dataset nulls
print("Sales Dataset - Missing Values:\n")
print(Sales_df.isnull().sum())
print("\n" + "-"*60 + "\n")

# Store dataset nulls
print("Store Dataset - Missing Values:\n")
print(Store_df.isnull().sum())

In [None]:
# Visualizing the missing values

import missingno as msno

# Set up plot style
plt.style.use('seaborn-v0_8-whitegrid')

# Sales dataset missing values visualization
print("Sales Dataset - Missing Values Visualization:\n")
msno.matrix(Sales_df)
plt.title("Sales Dataset - Missing Values Overview")
plt.show()

# Store dataset missing values visualization
print("Store Dataset - Missing Values Visualization:\n")
msno.matrix(Store_df)
plt.title("Store Dataset - Missing Values Overview")
plt.show()

### What did you know about your dataset?

* The Rossmann dataset consists of two main files - Sales data and Store data - that together provide a comprehensive view of the company's retail operations.

1. Sales Data (Sales_df)

    * This dataset contains daily sales records for each Rossmann store.

    * Each record includes store ID, sales amount, number of customers, whether the store was open, ongoing promotions, state and school holidays, and the corresponding date.

    * These variables help capture short-term and seasonal trends, customer behavior, and the influence of holidays or promotions on sales.



2. Store Data (Store_df)

    * This dataset provides static information about each store, such as store type, assortment level, distance to the nearest competitor, duration since the nearest competition opened, and promotional program details (e.g., whether the store runs continuous promotions).

    * These variables explain store-level differences that affect long-term sales patterns.



* By combining these two datasets, we obtain both temporal (daily) and structural (store-level) information. This enables a deeper understanding of how different factors, such as competition, promotions, holidays, and assortment, influence store performance.
Such insights form the foundation for building a reliable predictive model to forecast future sales.
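To make the idea of combining the two files concrete, here is a minimal sketch of a store-level join on toy data. The frames `sales` and `stores` and all their values are made up for illustration. The `validate` and `indicator` arguments are optional safety checks and are not necessarily what the wrangling step later in this notebook uses.

```python
import pandas as pd

# Toy stand-ins for Sales_df and Store_df (values are made up)
sales = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Date": pd.to_datetime(["2015-07-30", "2015-07-31", "2015-07-31", "2015-07-31"]),
    "Sales": [5020, 5263, 6064, 8314],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Left join keeps every daily record; validate raises if a Store repeats in `stores`
merged = sales.merge(stores, on="Store", how="left",
                     validate="many_to_one", indicator=True)

# Rows tagged 'left_only' are sales records without matching store metadata
unmatched = merged.loc[merged["_merge"] == "left_only", "Store"].unique()
print(merged)
print("Stores without metadata:", unmatched.tolist())
```

A left join is a reasonable default here because no sales record is dropped, and missing store attributes show up as NaN, where they can be inspected.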

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Printing list of columns in Sales Dataset
print("Columns in Sales Dataset:\n")
print(Sales_df.columns.tolist())

print("\n" + "-"*60 + "\n")

#Printing list of columns in Store Dataset
print("Columns in Store Dataset:\n")
print(Store_df.columns.tolist())

In [None]:
# Dataset Describe

# Describing the Sales Dataset
print(" Sales Data Description : \n")
display(Sales_df.describe())

print("\n" + "-"*60 + "\n")

# Describing the Store Dataset
print(" Store Data Description :\n ")
display(Store_df.describe())

### Variables Description

📘 Sales Dataset (Sales_df) :


|Variable|Description|
|---|---|
|Store|Unique identifier for each store.|
|DayOfWeek|Day of the week (1 = Monday, 7 = Sunday).|
|Date|Date of the record.|
|Sales|Total sales made on that day; this is the target variable.|
|Customers|Number of customers who visited the store on that day.|
|Open|Indicates whether the store was open (1) or closed (0).|
|Promo|Indicates if a promotion was running on that day (1 = Yes, 0 = No).|
|StateHoliday|Denotes whether the day was a state holiday (a = public holiday, b = Easter holiday, c = Christmas, 0 = none).|
|SchoolHoliday|Indicates if the store was affected by public-school closures on that day.|



---

🏪 Store Dataset (Store_df)

|Variable|Description|
|---|---|
|Store|Unique identifier for each store (key to merge with Sales_df).|
|StoreType|Type of store (a, b, c, d), representing different business formats.|
|Assortment|Level of product assortment (a = basic, b = extra, c = extended).|
|CompetitionDistance|Distance to the nearest competitor (in meters).|
|CompetitionOpenSinceMonth|Month when the nearest competitor opened.|
|CompetitionOpenSinceYear|Year when the nearest competitor opened.|
|Promo2|Indicates whether the store participates in a continuing promotion (1 = Yes, 0 = No).|
|Promo2SinceWeek|Week when the store began participating in Promo2.|
|Promo2SinceYear|Year when the store began participating in Promo2.|
|PromoInterval|Months when Promo2 is active (e.g., Jan, Apr, Jul, Oct).|



---

💡 Insights

* Sales_df provides time-based transactional information.

* Store_df adds store-level context such as competition and promotions.

* Together, these datasets form a powerful base to analyze sales drivers and build a predictive regression model for accurate forecasting.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Fetching and printing number of unique values in each column of Sales Dataset.
print("Unique values in Sales Dataset : \n")
for col in Sales_df.columns :
  unique_count = Sales_df[col].nunique()
  # Fetched total number of unique values
  print(f"{col} : {unique_count} unique values")
  print(Sales_df[col].unique())
  # Printing all those unique values of each column
  print("\n")

print("\n" + "-"*60 + "\n")

# Fetching and printing number of unique values in each column of Store Dataset.
print("Unique values in Store Dataset : \n")

for col in Store_df.columns :
  unique_count = Store_df[col].nunique()
  # Fetched total number of unique values.

  print(f"{col} : {unique_count} unique values")
  print(Store_df[col].unique())
  # Printing all those unique values of each column

  print("\n")

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# -----------------------------
# STEP 1: MERGING BOTH DATASETS
# -----------------------------
# 📌 Copying data to preserve raw versions
sales_df = Sales_df.copy()
store_df = Store_df.copy()

# 🧩 Merging both datasets using 'Store' as the key
df = pd.merge(sales_df, store_df, on = "Store", how = "left")


#==============================================================================================#


# -----------------------------
# STEP 2 : Handle unrealistic values
# -----------------------------
# 🧠 Replacing invalid competition years (before 1970) with NaN
df.loc[df['CompetitionOpenSinceYear'] < 1970, 'CompetitionOpenSinceYear'] = np.nan

# 🏫 Ensuring 'SchoolHoliday' is stored as an integer flag
df['SchoolHoliday'] = df['SchoolHoliday'].astype(int)


#==============================================================================================#



# -----------------------------
# STEP 3: Missing-value indicator flags
# -----------------------------
# 🚩 Creating missing-value indicator flags before imputation, so records that were originally null stay identifiable afterwards.
df['CompetitionDistance_NA'] = df['CompetitionDistance'].isna().astype(int)
df['CompetitionOpenSince_NA'] = df[['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']].isna().any(axis=1).astype(int)
df['Promo2Since_NA'] = df[['Promo2SinceWeek', 'Promo2SinceYear']].isna().any(axis=1).astype(int)


#==============================================================================================#


# -----------------------------
# STEP 4 : Handle missing values
# -----------------------------
# 🎯 Handling promotion-related columns
# (assigning the result back instead of calling fillna(..., inplace=True) on a column selection,
#  which newer pandas versions warn about and may not apply to the DataFrame)
df['Promo2SinceWeek'] = df['Promo2SinceWeek'].fillna(0)
df['Promo2SinceYear'] = df['Promo2SinceYear'].fillna(0)
df['PromoInterval'] = df['PromoInterval'].fillna('None')

# 🧹 Handling missing values for 'CompetitionDistance'
df['CompetitionDistance'] = df['CompetitionDistance'].fillna(df['CompetitionDistance'].median())

# 🧩 Handling missing values for competition opening details
df['CompetitionOpenSinceMonth'] = df['CompetitionOpenSinceMonth'].fillna(0)
df['CompetitionOpenSinceYear'] = df['CompetitionOpenSinceYear'].fillna(0)



#==============================================================================================#


# -----------------------------
# STEP 5 : Feature extraction from date
# -----------------------------
# 🕒 Converting 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# 📆 Extracting useful time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month



#==============================================================================================#


# -----------------------------
# STEP 6 : Competition open duration
# -----------------------------
# 🕓 Creating 'CompetitionOpenMonths' - duration of competition presence
df['CompetitionOpenMonths'] = np.where(
    df['CompetitionOpenSince_NA'] == 1,
    0,  # Set to 0 if the opening month or year is missing
    ((df['Year'] - df['CompetitionOpenSinceYear']) * 12 +
     (df['Month'] - df['CompetitionOpenSinceMonth']))
)

# Replacing negative values (competitor opened after the record date) with 0
df['CompetitionOpenMonths'] = df['CompetitionOpenMonths'].clip(lower=0)



#==============================================================================================#



# -----------------------------
# STEP 7 : Fixing data type issues
# -----------------------------
df['StateHoliday'] = df['StateHoliday'].astype(str).replace({'0': 'None'})
df['StoreType'] = df['StoreType'].astype('category')
df['Assortment'] = df['Assortment'].astype('category')
df['PromoInterval'] = df['PromoInterval'].astype('category')


#==============================================================================================#


# -----------------------------
# STEP 8 : Promo2 active flag
# -----------------------------
# 🧠 Creating 'IsPromo2Active' - checks if store's long-term promo is active
month_map = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6,
             'Jul':7, 'Aug':8, 'Sept':9, 'Oct':10, 'Nov':11, 'Dec':12}

# Convert to string before applying function to avoid 'unhashable list' error
df['PromoMonths'] = df['PromoInterval'].astype(str).apply(
    lambda x: [month_map[m] for m in x.split(',')] if x != 'None' else []
)

# Iterating over zipped columns gives the same result as a row-wise df.apply,
# but is considerably faster on roughly one million rows
df['IsPromo2Active'] = [
    int(promo2 == 1 and month in promo_months and year > since_year)
    for promo2, month, promo_months, year, since_year in zip(
        df['Promo2'], df['Month'], df['PromoMonths'], df['Year'], df['Promo2SinceYear'])
]



#==============================================================================================#



# -----------------------------
# STEP 9 : Business logic flags
# -----------------------------
# 💰 Creating 'SalesPerCustomer' - derived metric for customer efficiency
df['SalesPerCustomer'] = np.where(df['Customers'] > 0, df['Sales'] / df['Customers'], 0)

# 💰 Creating 'ZeroSalesWhileOpen' to flag anomalies or unproductive days.
df['ZeroSalesWhileOpen'] = ((df['Sales'] == 0) & (df['Open'] == 1)).astype(int)



#==============================================================================================#



# -----------------------------
# STEP 10 : Handle outliers
# -----------------------------
# Cap competition distance at 99th percentile
cap_value = df['CompetitionDistance'].quantile(0.99)
df.loc[df['CompetitionDistance'] > cap_value, 'CompetitionDistance'] = cap_value



#==============================================================================================#



# -----------------------------
# STEP 11 : Final formatting
# -----------------------------
# 🚫 Dropping redundant and unnecessary columns
df.drop(columns=[
    'Promo2SinceYear',   # Used only for IsPromo2Active
    'PromoMonths',       # Temporary calculation column
    ], inplace=True)


# 🧾 Reordering columns for better readability
df = df[['Store', 'Date', 'Year', 'Month', 'DayOfWeek', 'Open', 'Promo', 'Promo2', 'IsPromo2Active',
         'Sales', 'Customers', 'SalesPerCustomer','ZeroSalesWhileOpen', 'StateHoliday', 'SchoolHoliday',
         'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionDistance_NA',
         'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'CompetitionOpenSince_NA',
         'CompetitionOpenMonths', 'Promo2SinceWeek', 'Promo2Since_NA', 'PromoInterval']]


# Reset index and sort
df.sort_values(['Store', 'Date'], inplace=True)
df.reset_index(drop=True, inplace=True)


# ✅ Displaying final info
print("✅ Data Wrangling Completed Successfully!")
print(f"Final Dataset Shape: {df.shape}")
print("\nMissing Values After Wrangling:")
print(df.isnull().sum()[df.isnull().sum() > 0])
display(df.head(3))

### What all manipulations have you done and insights you found?

* During the data wrangling phase, multiple preprocessing and feature-engineering steps were performed to clean, correct, and enhance the dataset, making it ready for meaningful analysis and modeling.

* The manipulations focused on merging data, handling missing values, correcting data types, creating new features, and ensuring data consistency and reliability.


---

🔧 Manipulations Performed

1️⃣ Merging Datasets

* The two datasets, Sales_df (daily sales data) and Store_df (store information), were merged on the common key Store.

* This merge allowed each sales record to include corresponding store-level details such as store type, assortment, and competition data.



---

2️⃣ Handling Unrealistic and Missing Values

* Unrealistic competition years (before 1970) were replaced with NaN to maintain data validity.

* Missing values in important fields were handled as follows: CompetitionDistance was imputed with the median, CompetitionOpenSinceMonth/Year and Promo2SinceWeek/Year were filled with 0, and PromoInterval was filled with 'None'.

* Additional missing-value indicator columns (CompetitionDistance_NA, CompetitionOpenSince_NA, Promo2Since_NA) were created to capture the information about missingness, since even the absence of data can be predictive.
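The flag-then-impute pattern described above can be sketched on a toy column as follows. The frame `toy` and its values are made up. The key point is the order of operations: the indicator has to be created before the gap is filled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0, np.nan]})

# 1) Record missingness first; after imputation this information is gone
toy["CompetitionDistance_NA"] = toy["CompetitionDistance"].isna().astype(int)

# 2) Impute with the median of the observed values (NaN is skipped by default)
median_distance = toy["CompetitionDistance"].median()
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(median_distance)

print(toy)
```

Assigning the filled column back, rather than calling `fillna(..., inplace=True)` on a column selection, avoids the chained-assignment warnings that recent pandas versions emit.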



---

3️⃣ Data Type Corrections

* Columns such as StoreType, Assortment, and PromoInterval were converted to categorical data types to improve memory efficiency and interpretability.

* The StateHoliday column was converted to strings, with '0' relabelled as 'None', so that it can be one-hot encoded later during the modeling phase.
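As a small illustration of the type fixes above, the sketch below normalises a holiday column that mixes the integer 0 with string codes and then one-hot encodes it. The toy frame is made up, and `pd.get_dummies` is only one possible encoder; the modeling phase may choose a different one.

```python
import pandas as pd

# Mixed int/str zeros, as can happen when a CSV column mixes 0 with letter codes
toy = pd.DataFrame({
    "StateHoliday": [0, "0", "a", "b", "c"],
    "StoreType": ["a", "b", "a", "d", "c"],
})

# Normalise everything to strings so 0 and "0" collapse into a single level
toy["StateHoliday"] = toy["StateHoliday"].astype(str).replace({"0": "None"})
toy["StoreType"] = toy["StoreType"].astype("category")

# One-hot encode the holiday codes; the original frame is left unchanged
encoded = pd.get_dummies(toy, columns=["StateHoliday"], prefix="StateHoliday", dtype=int)
print(encoded)
```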



---

4️⃣ Feature Extraction from Date

* The Date column was transformed into datetime format, and the temporal features Year and Month were extracted from it. DayOfWeek was already present in the sales data.

* These features will help identify seasonal trends, monthly variations, and weekday vs. weekend effects on sales.
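For reference, temporal features of this kind can be derived with the pandas `.dt` accessor. In this minimal sketch the dates are arbitrary, and `WeekOfYear` and `IsWeekend` are hypothetical extra columns shown for illustration only; the wrangling code above does not create them.

```python
import pandas as pd

toy = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-08-01", "2015-08-02"])})

toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
# ISO week number; isocalendar() returns a DataFrame with year/week/day columns
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)
# pandas counts Monday as 0, so add 1 to match the dataset's 1 = Monday convention
toy["DayOfWeek"] = toy["Date"].dt.dayofweek + 1
toy["IsWeekend"] = (toy["DayOfWeek"] >= 6).astype(int)

print(toy)
```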



---

5️⃣ Feature Engineering

* CompetitionOpenMonths → Calculated the number of months since a competitor store opened, set to 0 for missing or invalid entries.

* IsPromo2Active → A binary flag indicating if a store's long-term promotion (Promo2) was active during the specific month, based on PromoInterval and Promo2SinceYear. The flag requires the record year to be later than Promo2SinceYear and does not use Promo2SinceWeek, so the year in which a store joined Promo2 is treated as inactive.

* SalesPerCustomer → Derived metric showing average customer spending, helping evaluate store-level performance efficiency.

* ZeroSalesWhileOpen → Business rule flag identifying anomalies where sales were 0 despite the store being open.
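The two main engineered features can also be written without any row-wise `apply`. The sketch below is one possible vectorised formulation on made-up rows, not the notebook's exact code. It keeps the same rules as the wrangling cell: unknown or future competitor openings become 0, and Promo2 counts as active only when the record year is later than `Promo2SinceYear`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Year": [2015, 2015, 2014, 2015],
    "Month": [7, 7, 1, 4],
    "CompetitionOpenSinceYear": [2008, 0, 2014, 2013],
    "CompetitionOpenSinceMonth": [9, 0, 6, 4],
    "CompetitionOpenSince_NA": [0, 1, 0, 0],
    "Promo2": [1, 0, 1, 1],
    "Promo2SinceYear": [2010, 0, 2011, 2015],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "None", "Feb,May,Aug,Nov", "Jan,Apr,Jul,Oct"],
})

# Months elapsed since the competitor opened; unknown or future openings become 0
elapsed = ((toy["Year"] - toy["CompetitionOpenSinceYear"]) * 12
           + (toy["Month"] - toy["CompetitionOpenSinceMonth"]))
toy["CompetitionOpenMonths"] = np.where(
    toy["CompetitionOpenSince_NA"] == 1, 0, elapsed.clip(lower=0)
)

# Promo2 flag: handle each distinct interval string once instead of once per row
month_map = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
             "Jul": 7, "Aug": 8, "Sept": 9, "Oct": 10, "Nov": 11, "Dec": 12}
in_interval = pd.Series(False, index=toy.index)
for interval in toy["PromoInterval"].unique():
    if interval == "None":
        continue
    active_months = [month_map[m] for m in interval.split(",")]
    in_interval |= (toy["PromoInterval"] == interval) & toy["Month"].isin(active_months)

toy["IsPromo2Active"] = (
    (toy["Promo2"] == 1) & in_interval & (toy["Year"] > toy["Promo2SinceYear"])
).astype(int)

print(toy[["CompetitionOpenMonths", "IsPromo2Active"]])
```

The last toy row shows the effect of the strict year comparison: April 2015 falls inside the interval, but because the store joined Promo2 in 2015 the flag stays 0.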



---

6️⃣ Outlier Treatment

* Extreme values in CompetitionDistance were capped at the 99th percentile to minimize the influence of outliers and maintain stable model behavior.
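Percentile capping of this kind can be sketched in two lines with `quantile` and `clip`. The series below is synthetic (99 regular values plus one extreme store), so the resulting cap is only illustrative.

```python
import numpy as np
import pandas as pd

# 99 evenly spaced made-up distances (100 m to 9,900 m) plus one extreme value
distances = pd.Series(np.r_[np.arange(100, 10000, 100), 50000.0],
                      name="CompetitionDistance")

# Values above the 99th percentile are pulled down to that percentile
cap_value = distances.quantile(0.99)
capped = distances.clip(upper=cap_value)

print(f"cap = {cap_value:.1f}, max before = {distances.max():.0f}, "
      f"max after = {capped.max():.1f}")
```

Capping keeps every row in the dataset and only limits the leverage of the extreme value, which is usually preferable to dropping records when each row is a real trading day.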



---

7️⃣ Sorting, Indexing, and Final Formatting

* The dataset was sorted by Store and Date so that records are in chronological order within each store.

* Redundant columns like Promo2SinceYear and intermediate helper variables were dropped for clarity.

* Index was reset to ensure a clean, analysis-ready DataFrame.




---

💡 Insights Found After Data Wrangling -

1. Several stores had missing competition details, implying that either the data was unavailable or competition was absent in that region.


2. Some stores have long-standing competitors, indicating that mature markets may exhibit more stable sales.


3. The number of active promotions (Promo2) varies by month, revealing seasonal marketing patterns.


4. Records with zero sales while open highlight potentially unproductive days or data inconsistencies that warrant deeper business review.


5. The derived feature SalesPerCustomer exposed variation in average customer spending, potentially linked to store type, assortment, or location.




---

✅ Outcome -

* The dataset is now clean, consistent, and analysis-ready.

* All missing values have been addressed, data types standardized, and new business-relevant features engineered.

* Outliers were treated effectively, ensuring that the data provides a solid foundation for Exploratory Data Analysis (EDA) and predictive modeling.



---

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

#### Chart 1 - Distribution of Daily Sales

In [None]:
# Chart - 1 visualization code

# Features Used - Sales

# ✅ Why this chart is important to include ?
# The entire project aims to predict daily Sales - it's the dependent variable.
# Before building any model, we must understand how this variable behaves.
# Without this chart, we wouldn't know if the data is balanced, skewed, or has extreme outliers.

#----------------------------------------------------------------------------------------------------------#

# Setting plot style for cleaner aesthetics
sns.set(style='whitegrid')


# Creating the figure
plt.figure(figsize=(10,6))

# Plotting histogram with KDE curve for smooth density visualization
sns.histplot(
    data=df,
    x='Sales',
    bins=50,                 # Number of bars
    kde=True,                # Add smooth density curve
    color='royalblue',       # Chart color
    alpha=0.7                # Transparency
)

# Adding titles and labels
plt.title('Distribution of Daily Sales', fontsize=16, fontweight='bold', color='black')
plt.xlabel('Sales Amount', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing gridlines and frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)  # Turns off the axes frame (all four borders) for a cleaner look

# Displaying the plot
plt.show()

**Summary -**

1,017,209 records

17% zero sales (closed or failed days).

Mean ≈ 5,774

Median ≈ 5,744 → nearly balanced.

Right skew (0.64) and a few high-sale outliers (up to 41,551).

Suggests mostly steady performance with isolated extreme highs.

##### 1. Why did you pick the specific chart?

* I have used a histogram with a KDE curve to observe how daily sales are distributed across all stores.

* This chart helps me identify the shape of the distribution, including skewness, zero-sales days, and extreme high-sale days, which are important for cleaning, modeling, and understanding business trends.

##### 2. What is/are the insight(s) found from the chart?

* The histogram shows a moderate right-skewed distribution with most daily sales concentrated between 3,000 and 8,000.

* The mean and median are almost equal (Mean = 5,774, Median = 5,744), suggesting that the bulk of the distribution is fairly balanced around a typical daily sales level.

* Around 17% of the records have Sales = 0, which occurs mostly when stores are closed or during operational downtimes.

* There are a few extremely high values (above 17,000), showing occasional high-performing days due to major events or promotions.

* Overall, the pattern indicates that sales are mostly stable with occasional spikes.
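The headline figures quoted above (record count, share of zeros, mean, median, skewness, maximum) can all be read directly off the sales column. A minimal sketch follows. The helper `sales_summary` is hypothetical and the toy series is made up, so its numbers are unrelated to the Rossmann values.

```python
import pandas as pd

def sales_summary(sales: pd.Series) -> dict:
    """Hypothetical helper: headline statistics for a sales column."""
    return {
        "records": int(sales.size),
        "zero_share": float((sales == 0).mean()),  # fraction of zero-sales rows
        "mean": float(sales.mean()),
        "median": float(sales.median()),
        "skew": float(sales.skew()),               # > 0 indicates a right tail
        "max": float(sales.max()),
    }

toy_sales = pd.Series([0, 4000, 5000, 6000, 20000])
summary = sales_summary(toy_sales)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

On the real data this would be called as `sales_summary(df['Sales'])`.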

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive business impacts / what to do:**

 * Handling zeros carefully:

    * About 17% of records have zero sales.

    * Many are due to stores being closed (Open == 0), but a few are open with zero sales, which should be investigated.


 * Modeling improvements:

    * Because of moderate right skew and zeros, I should use log1p(Sales) or RMSLE during modeling to stabilize variance and reduce the impact of extreme values.


 * High-performing days:

    * I should identify top sales days (>10k) and analyze if they coincide with promotions or holidays so similar events can be replicated to boost sales in low periods.
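The `log1p` idea mentioned under modeling improvements can be sketched as a transform and its inverse. The values below are made up. Whether the target should actually be modelled on the log scale is a choice to confirm during the modeling phase.

```python
import numpy as np
import pandas as pd

# Made-up sales with zeros (closed days) and one large spike
sales = pd.Series([0, 0, 3200, 4100, 5000, 5600, 6100, 7400, 9800, 41000], dtype=float)

log_sales = np.log1p(sales)     # log(1 + x) is defined at x = 0
restored = np.expm1(log_sales)  # inverse transform, back to the sales scale

print(f"skew (raw):   {sales.skew():.2f}")
print(f"skew (log1p): {log_sales.skew():.2f}")
print(f"max (raw): {sales.max():.0f}, max (log1p): {log_sales.max():.2f}")
```

The transform compresses the spike, but the zero rows still sit apart at 0 after `log1p`. This is one reason closed-day records may need to be handled separately rather than relying on the transform alone.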




---

**Negative-growth / risk signals & mitigation:**

 * Zero-sales while open:

    * If some stores show zero sales while open, this could point to system or operational issues, leading to missed revenue.

    * These cases need to be analyzed to avoid recurring negative growth.


 * Impact of extreme outliers:
    * A small number of very high-sale days could distort mean sales and model predictions.

    * I have to cap or treat these outliers to ensure fair comparison across stores.

#### Chart 2 - Distribution of Customers

In [None]:
# Chart - 2 visualization code

# Features Used - Customers

# Why this chart is important to include ?
# The number of Customers visiting each store per day determines the overall sales potential.
# Analyzing its distribution helps us:
# Understand the footfall variation across stores and days.
# Detect outliers (extremely busy or empty days).
# Identify whether customer traffic is evenly spread or dominated by a few high-volume stores.
# This chart provides crucial insight into customer behavior trends, which can later help the company optimize
# store operations, marketing, and staffing.

#-----------------------------------------------------------------------------------------------------------------------------------#


# Setting the plot style for professional appearance
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting the box plot to analyze the distribution and outliers
sns.boxplot(
    data=df,
    x='Customers',          # Setting the variable to visualize
    color='mediumseagreen', # Choosing a calm, professional color
    width=0.6,              # Adjusting box width
    fliersize=3             # Adjusting outlier marker size
)

# Adding title and labels
plt.title('Box Plot of Daily Customers', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Number of Customers per Day', fontsize=12)

# Customizing gridlines for cleaner presentation
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Removing unnecessary chart borders
plt.box(False)

# Displaying the plot
plt.show()

**Summary -**

Median ≈ 630  

Mean ≈ 695 → mild right skew.

IQR between 466 - 844 ‚Üí consistent footfall range.

A few extreme outliers (>1,700) show very high-traffic days.

Mostly stable customer flow with minor anomalies that should be examined.

##### 1. Why did you pick the specific chart?

* I have used a boxplot because it helps me visualize the spread, median, and outliers in the number of customers visiting the stores each day.

* It clearly shows the central tendency (median) and the variation in footfall between stores, while also highlighting days or stores with unusually high or low customer counts.

##### 2. What is/are the insight(s) found from the chart?

* The median customer count is around 630, while the mean is slightly higher (~695), which shows a mild right skew, indicating that a few days have exceptionally high customer numbers.

* The interquartile range (IQR) lies between 466 and 844, meaning 50% of the records fall within this range.

* Some extreme outliers can be observed above 1,700 customers, showing rare but very high-traffic days (possibly due to promotions or holidays).

* The lower whisker shows very few days with near-zero customers, likely when the stores were closed.

* Overall, the distribution looks fairly consistent with limited variability across most records.
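The quantities a box plot draws can also be computed directly. By default, seaborn and matplotlib extend the whiskers to the most extreme points within 1.5 × IQR of the box and draw anything beyond that as an outlier. The sketch below applies that rule to a made-up series, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up daily customer counts with one closed day and one very busy day
customers = pd.Series([0, 420, 480, 560, 610, 650, 720, 800, 880, 2400])

q1, q3 = customers.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey fences: points outside these limits are plotted as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = customers[(customers < lower_fence) | (customers > upper_fence)]

print(f"Q1 = {q1:.0f}, Q3 = {q3:.0f}, IQR = {iqr:.0f}")
print(f"fences = [{lower_fence:.0f}, {upper_fence:.0f}]")
print("outliers:", outliers.tolist())
```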


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

1. Workforce and resource planning:

    * Since the majority of days have 400-800 customers, I can plan staff schedules, stock replenishment, and customer service levels accordingly.


2. Opportunity analysis for high footfall days:

    * High outlier days (above 1,700 customers) can be analyzed to identify successful events, marketing campaigns, or external factors that drove higher traffic.

    * Learning from these patterns can help boost regular-day sales.


3. Predictable customer trends:

    * The compact box shape indicates a stable customer base, which is good for creating consistent marketing and forecasting models.





---


**Negative-growth / risk signals & mitigation:**

1. Low-customer stores:

    * Some stores have very low median customers, possibly due to poor location or lack of local promotion.

    * I should identify and support these stores with targeted offers or awareness drives.


2. Zero or near-zero customer days:

    * Even though few, these records should be checked for data entry issues or operational closures to prevent misinterpretation during analysis.


3. High outliers:

    * Very large spikes in customer count may distort statistical analysis.

    * I should consider capping or transforming these values during modeling to prevent bias.
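
The capping and transformation mentioned above could look roughly like this. It is a sketch on hypothetical values, not the notebook's actual preprocessing; the 99th percentile cap is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample standing in for df['Customers']
customers = pd.Series([450, 520, 600, 630, 700, 810, 900, 2500], dtype=float)

# Option 1: cap (winsorize) values above an upper percentile
cap = customers.quantile(0.99)
capped = customers.clip(upper=cap)

# Option 2: log1p compresses large values and stays defined when the count is 0
logged = np.log1p(customers)

print("max before/after capping:", customers.max(), capped.max())
print("skew raw vs log1p:", round(customers.skew(), 2), round(logged.skew(), 2))
```

Capping keeps the original scale, while the log transform changes it, so predictions made on a transformed target would need `np.expm1` to map them back.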




---

#### Chart 3 - Violin Plot of Competition Distance

In [None]:
# Chart - 3 visualization code

# Features Used - CompetitionDistance

# ✅ Why this chart is important to include
# This visualization helps us understand how CompetitionDistance is distributed among all Rossmann stores.
# It reveals whether most stores face nearby competitors or if some operate in low-competition areas.
# Understanding this spread helps identify how market competition might influence overall sales and customer reach.


#-----------------------------------------------------------------------------------------------------------------------------#


# Plotting the violin plot to visualize the distribution shape and density
sns.violinplot(
    data=df,
    x='CompetitionDistance',   # Setting the variable to visualize
    color='skyblue',           # Choosing a light color for better readability
    inner='quartile',          # Showing quartiles inside the violin
    cut=0                      # Preventing density from extending beyond data range
)

# Adding title and labels
plt.title('Violin Plot of Competition Distance', fontsize=15, fontweight='bold', color='black')
plt.xlabel('Distance to Nearest Competitor (meters)', fontsize=12)

# Customizing gridlines and chart borders
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

**Summary -**

Distribution is heavily right-skewed; most stores have their nearest competitor within 2 km.

Median ≈ 1,000 m, upper tail capped at ≈ 7,400 m.

Few stores are isolated, which suggests potential expansion or benchmarking opportunities.

High-competition zones should apply targeted promotions to avoid negative growth.


##### 1. Why did you pick the specific chart?

* I have used a violin plot because it helps me visualize both the distribution shape and density of the competition distance.

* Unlike a boxplot, the violin plot gives me a clear idea of how the values are concentrated, for example whether most stores are close to competitors or located farther away.

* This visualization also makes it easy to identify if the distances are skewed or have multiple peaks.

##### 2. What is/are the insight(s) found from the chart?

* The plot shows a heavily right-skewed distribution, meaning that most stores are located very close to their nearest competitors, while a smaller number of stores are located much farther away.

* The major concentration of competition distance lies below 2,000 meters, showing that many stores have competitors within a short distance.

* The median competition distance is around 1,000 meters, and the IQR (50% of the data) lies approximately between 700-2,500 meters.

* The upper tail extends toward 7,400 meters (after capping at the 99th percentile), indicating a few stores are isolated from competition.

* The violin is widest near zero and gradually thins out as distance increases, confirming that most stores are in competitive zones.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

 * Strategic advantage for nearby competition:

    * Since most stores have competitors nearby, I can analyze whether close-proximity competition correlates with higher or lower sales.

    * This insight can guide store placement strategies, for instance whether clustering stores drives demand or cannibalizes sales.


 * Market gap identification:

    * The few stores with very high competition distance (>7 km) indicate less saturated regions.

    * These areas can be studied to understand if they maintain steady sales despite limited competition; such zones might represent opportunities for new stores.


 * Targeted promotions:

    * Stores with close competition might benefit from localized promotions and loyalty programs to retain customers in high-competition areas.

---

**Negative-growth / risk signals & mitigation:**

 * High competition density risk:

    * Stores located within 500 meters of competitors may face price pressure and customer switching, which could lead to declining margins.

    * These stores should be monitored and possibly offered exclusive deals or distinct assortments to stay competitive.


 * Uneven competition landscape:

    * Since the distribution is highly skewed, modeling might overemphasize isolated stores.

    * I should scale or transform this variable (e.g., log transform) to normalize its effect and avoid bias in the model.
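
To check whether proximity to a competitor relates to sales, distances can be grouped into bands and compared. This is a minimal sketch on hypothetical per-store values; the band edges (500 m and 2 km) are illustrative choices taken from the thresholds discussed above, and the column names are assumed to match the notebook's.

```python
import pandas as pd

# Hypothetical per-store toy data; column names follow the notebook
stores = pd.DataFrame({
    "CompetitionDistance": [120, 300, 450, 900, 1500, 2400, 5200, 7400],
    "Sales": [5200, 6100, 5800, 6400, 6000, 6900, 7100, 6600],
})

# Left-closed bands: [0, 500), [500, 2000), [2000, inf)
bands = pd.cut(
    stores["CompetitionDistance"],
    bins=[0, 500, 2000, float("inf")],
    labels=["<500 m", "500 m-2 km", ">2 km"],
    right=False,
)

# Number of stores and mean sales per distance band
summary = stores.groupby(bands, observed=True)["Sales"].agg(["count", "mean"])
print(summary)
```

Banding is only a first look; it does not control for store type, promotions, or location, so any difference between bands should be read as descriptive.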

#### Chart 4 - Count Plot of Promotion Activity

In [None]:
# Chart - 4 visualization code

# Features used - Promo

# ✅ Why this chart is important to include ?
# Promotions have a direct and measurable impact on sales: they drive customer traffic, influence purchasing behavior,
# and affect store profitability.
# By visualizing how frequently promotions occur, we can:
#     * Understand how often Rossmann runs promotions.
#     * Check for data balance (too few or too many promo days).
#     * Prepare for later bivariate analysis, where we'll compare promotions with sales performance.
# This chart helps lay the groundwork for analyzing how promotional frequency affects sales trends.



#------------------------------------------------------------------------------------------------------------------------------------#

# Creating the figure and defining its size
plt.figure(figsize=(7,5))

# Plotting the count plot to show number of promo vs non-promo days
sns.countplot(
    data=df,
    x='Promo',                 # Selecting the 'Promo' variable
    palette=['salmon', 'lightgreen'],  # Choosing contrasting colors for clarity
    edgecolor='black'          # Adding borders to bars for visual definition
)

# Adding value annotations above bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Adding title and axis labels
plt.title('Count Plot of Promotion Activity', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Promotion Active (1 = Yes, 0 = No)', fontsize=12)
plt.ylabel('Count of Days', fontsize=12)

# Customizing grid and borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

**Summary -**


About 37% of days had an active promotion.


Non-promo days (63%) dominate but still leave enough promo variety for analysis.


Promotions appear well-balanced, neither rare nor overused.


Insights from this will help optimize promotion frequency and timing.


##### 1. Why did you pick the specific chart?

* I have used a count plot because Promo is a binary categorical variable that represents whether a short-term promotion was active (1) or not (0) on a given day.

* A count plot helps me clearly visualize how frequently promotions were running compared to non-promo days.

* It also gives me an idea of whether the dataset is balanced or imbalanced, which is important for further analysis and modeling.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that the number of days with no promotion (Promo = 0) is significantly higher than days with an active promotion (Promo = 1).

* Approximately 63% of the records belong to non-promo days, while only 37% are promo-active days.

* This indicates that promotions are not always running; they are scheduled strategically during certain periods.

* The presence of a decent proportion of promo-active days ensures that there is enough variability in data to study the impact of promotions on sales.

* This balance also suggests that stores frequently rely on promotions as a sales strategy but not excessively.
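
The promo and non-promo percentages can be read off directly with `value_counts(normalize=True)` rather than estimated from bar heights. A minimal sketch on a hypothetical toy column (the real column would be `df['Promo']`):

```python
import pandas as pd

# Hypothetical toy column: 1 = promo active, 0 = no promo
promo = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Promo")

# Share of records in each class, as percentages, ordered 0 then 1
share = promo.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```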

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

 * Promotion planning efficiency:

    * Since promotions are active on about one-third of the days, I can analyze whether those days lead to significant sales lifts in bivariate analysis.

    * This helps me understand the effectiveness of promotional campaigns and fine-tune their duration and timing.


 * Balanced data for modeling:

    * Having both promo and non-promo days well represented will make my models learn the true promotional impact effectively.


 * Customer engagement strategy:

    * Frequent but not continuous promotions suggest a planned approach to maintaining customer interest while avoiding "promotion fatigue."




---

**Negative-growth / risk signals & mitigation:**

 * Limited promo days:

    * Some stores might be underusing promotions. If they are located in competitive regions, fewer promo days could result in lost sales opportunities.

    * I can cross-check such stores with low sales to identify if more promotional support is needed.


 * Dependence on promotions:

    * If future analysis shows that sales drop sharply when promotions are inactive, it may mean the store has become too dependent on discounts, which can hurt long-term profit margins.

    * The business should ensure promotions are combined with loyalty strategies rather than price cuts alone.


#### Chart 5 - Pie Chart of Store Type Distribution

In [None]:
# Chart - 5 visualization code

# Features used - StoreType

# ✅ Why this chart is important to include
# Different store types cater to different customer segments and have varying sales capacities.
# Visualizing the proportion of store types helps us:
#     Understand Rossmann's store structure and market distribution.
#     Identify which store type dominates the business.
#     Prepare for later analyses (e.g., which store type drives the most sales).
# This chart provides a strategic overview of Rossmann's retail landscape.


#--------------------------------------------------------------------------------------------------------#


# Setting color palette for professional appearance
colors = sns.color_palette('pastel')

# Creating the figure and defining its size
plt.figure(figsize=(7,7))

# Calculating the count of each store type
store_counts = df['StoreType'].value_counts()

# Plotting the pie chart
plt.pie(
    store_counts,
    labels=store_counts.index,             # Adding store type labels
    autopct='%1.1f%%',                     # Displaying percentage values
    startangle=90,                         # Rotating for better visual balance
    colors=colors,                         # Using soft pastel palette
    wedgeprops={'edgecolor': 'black'},     # Adding edge for clarity
    textprops={'fontsize': 11}
)

# Adding title
plt.title('Distribution of Store Types', fontsize=15, fontweight='bold', color='darkblue')

# Displaying the chart
plt.show()

**Summary -**

Store Type a ≈ 54%, Type d ≈ 30%, Type c ≈ 13.5%, Type b ≈ 1.6%.

Dataset dominated by Type a stores, which strongly influence sales trends.

Insight helps in resource allocation, model balancing, and expansion strategy.

##### 1. Why did you pick the specific chart?

* I have used a pie chart to represent the distribution of store types (StoreType).

* Since this variable has only four categories (a, b, c, d), a pie chart gives a quick and clear visual of how many stores belong to each category and their proportional contribution to the total.

* This helps me understand whether the dataset is dominated by a particular store type or if it is fairly balanced, which is important for sales pattern comparison.

##### 2. What is/are the insight(s) found from the chart?

* The pie chart shows that Store Type a occupies the largest proportion, covering around 54% of all stores.

* Store Type d follows next with about 30%, while Type c and Type b make up the remaining 15% combined (roughly 13.5% and 1.6%, respectively).

* This clearly shows that Type a stores dominate the dataset, meaning any overall sales trend will likely be influenced heavily by them.

* The presence of multiple store types indicates diversity in business formats, for example larger supermarkets, smaller local stores, and specialty outlets.
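
One caveat when reading these percentages: if `df` holds one row per store per day (an assumption about the wrangled dataset), `value_counts()` counts records rather than distinct stores, and the two shares can differ. A small sketch with hypothetical data shows the difference:

```python
import pandas as pd

# Hypothetical toy frame: one row per store per day
df_toy = pd.DataFrame({
    "Store":     [1, 1, 1, 2, 3],
    "StoreType": ["a", "a", "a", "b", "b"],
})

# Share of records per store type (what value_counts on the full frame gives)
record_share = df_toy["StoreType"].value_counts(normalize=True)

# Share of distinct stores per store type
store_share = df_toy.drop_duplicates("Store")["StoreType"].value_counts(normalize=True)

print(record_share.round(3), store_share.round(3), sep="\n")
```

When every store has roughly the same number of rows the two views nearly coincide, so this matters mainly if some stores have gaps in their history.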

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

 * Understanding dominant store formats:

    * Since Type a stores make up more than half the total, their operational patterns and marketing performance will have the highest influence on total sales.

    * This insight helps me prioritize analysis and optimization strategies for Type a stores first.


 * Balanced analysis approach:

    * Even though Type a dominates, other store types (b, c, d) still make up nearly half of the stores combined.

    * I should compare their sales trends separately later to identify which formats perform best under different conditions (e.g., promotions, holidays).


 * Expansion insights:

    * If certain smaller store types (like c or b) show high efficiency despite smaller count, it might be a good signal for scalable expansion.




---

**Negative-growth / risk signals & mitigation:**

 * Type imbalance:

    * Since Type a stores dominate, there is a chance that the model could become biased toward their sales behavior.

    * I should ensure that during modeling, store type effects are properly encoded so that smaller types (b, c, d) are not underrepresented.


 * Dependence on one format:

    * Heavy dependence on Type a stores for overall revenue could be risky if those stores underperform.

    * The business should maintain a diverse mix of store types to balance risk across different formats and customer segments.

#### Chart 6 - Bar Chart of Assortment Type Distribution

In [None]:
# Chart - 6 visualization code

# Features used - Assortment

# ✅ Why this chart is important to include
# Assortment type affects customer attraction and sales volume: stores with a wider assortment may appeal
# to a broader customer base but also face higher operating costs.
# Visualizing this helps in understanding:
#   The dominant assortment strategy across Rossmann stores.
#   Whether most stores carry a limited or diverse product range.
#   How product variety could influence customer footfall and revenue.
# This chart is critical for connecting product diversity with store performance in later analysis.

#------------------------------------------------------------------------------------------------------------------------------#


# Setting visual theme
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(8,5))

# Plotting horizontal bar chart for assortment type distribution
sns.countplot(
    data=df,
    y='Assortment',               # Using 'y' for horizontal layout
    order=df['Assortment'].value_counts().index,  # Sorting bars by count
    palette='coolwarm',           # Applying visually distinct color palette
    edgecolor='black'             # Adding border to bars
)

# Adding title and labels
plt.title('Distribution of Assortment Types', fontsize=15, fontweight='bold', color='darkslateblue')
plt.xlabel('Number of Stores', fontsize=12)
plt.ylabel('Assortment Type', fontsize=12)

# Adding count labels beside bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing grid for clean appearance
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Removing unnecessary borders
plt.box(False)

# Displaying the chart
plt.show()

##### 1. Why did you pick the specific chart?

* I have used a horizontal bar chart because Assortment is a categorical variable with three levels, and a horizontal bar chart makes the category proportions easy to read (especially with long labels or when I want to add exact percentage labels).

* It clearly shows which assortment types are most common and helps me see the relative footprint of each assortment in the store network.

##### 2. What is/are the insight(s) found from the chart?

* Counts / proportions (from the dataset):

    1. Assortment a: 593 stores ≈ 53.18%

    2. Assortment c: 513 stores ≈ 46.01%

    3. Assortment b: 9 stores ≈ 0.81%


* Interpretation:

    * Most stores follow either basic (a) or extended (c) assortments; together they make up virtually the entire dataset.

    * Assortment b is extremely rare (less than 1%), so it's effectively negligible for aggregate analysis but may still be interesting if those 9 stores behave unusually.

    * The near parity between a and c (about 53:46) means the company is split between standard and extended formats, which gives a good basis for comparing performance across two large, roughly equal-sized groups.
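
The percentages follow from the quoted counts by simple division, as this short check shows (only the three counts listed above are used):

```python
# Store counts per assortment type, as quoted above
counts = {"a": 593, "c": 513, "b": 9}
total = sum(counts.values())

# Percentage share of each assortment type, rounded to two decimals
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(total, shares)
```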



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

 * Comparative performance analysis:

    * Because a and c together represent almost all stores, I will compare their average sales, margin, and customer metrics to identify whether extended assortments (c) deliver better revenue per customer.

    * If c outperforms a, I can recommend a targeted rollout of c assortments in regions with similar demographics.


 * Resource & merchandising planning:

    * Knowing the split helps me decide inventory mixes and promotional strategies tailored to each assortment type.


 * Investigate rare cases (b):

    * I will inspect those 9 b stores individually, since they might be special pilot stores or data-entry anomalies; if they perform very differently, they can yield actionable lessons.


---

**Negative-growth / risk signals & mitigation:**

 * Risk of bias in aggregate metrics:

    * Since a is slightly dominant, overall averages will be influenced by a stores; I must stratify by assortment type when reporting metrics or training models so c behavior is not washed out.


 * Neglecting tiny segments (b):

    * While b is negligible numerically, if those stores show consistent underperformance it could indicate a format that doesn't work; conversely, if they overperform, they may be candidates for targeted scaling. I should not ignore them outright.

#### Chart 7 - Bar Chart of Day of Week Distribution

In [None]:
# Chart - 7 visualization code

# Features used - DayOfWeek

# ✅ Why this chart is important to include
# This visualization helps us:
#   Identify which days stores are most active.
#   Detect patterns like reduced store activity on Sundays or peak activity midweek.
#   Prepare for later bivariate analysis to see how sales vary across weekdays.
# This chart adds valuable context about operational rhythm and customer behavior,
# which is critical for both forecasting and staffing decisions.


#-------------------------------------------------------------------------------------------------#


# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting bar chart showing store activity by day of the week
sns.countplot(
    data=df,
    x='DayOfWeek',                       # Selecting variable to visualize
    palette='viridis',                   # Applying color gradient
    edgecolor='black'                    # Adding border for clear separation
)

# Adding title and labels
plt.title('Distribution of Day of Week', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Day of the Week (1=Mon, 7=Sun)', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding count labels on each bar
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the chart
plt.show()

**Summary -**

Day-of-week counts are nearly equal across 1–7 in the wrangled dataset (no big Sunday dip).

This balance supports unbiased weekday-level analysis; next I should run Sales vs DayOfWeek to inspect actual sales differences by weekday.

##### 1. Why did you pick the specific chart?

* I have used a countplot because DayOfWeek is a discrete categorical variable (1 = Monday, 7 = Sunday).

* A countplot helps me check whether the dataset has even representation across weekdays, which is important for fair time-series analysis and for avoiding weekday bias in modeling.

##### 2. What is/are the insight(s) found from the chart?

* The countplot (on the wrangled dataset) shows that all seven weekdays have almost identical record counts; there is no pronounced dip on any single weekday.

* This means the dataset is evenly balanced across weekdays, so daily-seasonality patterns won't be biased by unequal day counts.

* The near-uniformity arises because the dataset contains many stores across many dates; therefore, the total number of store-day records per weekday evens out.

* Practically, this tells me the data collection is consistent and there isn't systematic missingness for any particular weekday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

 * Confidence in weekday analysis:

    * Because day counts are balanced, I can compare sales across weekdays without worrying that differences are caused by record imbalance, which makes weekday-level inferences more reliable.


* Cleaner modeling:

    * I can include DayOfWeek as a feature in models without applying special weighting for under/over-sampled days. This simplifies preprocessing and reduces the risk of introducing sampling bias.


* Operational planning:

    * Since there is no underrepresented weekday, operational recommendations (staffing, promotions) drawn from weekday-level analysis will be based on equally-sampled evidence.




---

**Negative-growth / risk signals & mitigation:**

 * Watch for hidden operational differences:

    * Even though counts are balanced, sales behavior can still differ by day (e.g., Monday vs Saturday). Balanced counts simply make those differences trustworthy; I still need to analyze Sales vs DayOfWeek (boxplot/violin) to detect performance gaps.


 * Model caution for per-store trends:

    * Balance at the aggregate level does not guarantee balance per-store. I should verify store-level weekday coverage for stores with sparse records before making store-level operational decisions.
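
The store-level coverage check described above could be done with a grouped `nunique`. This is a sketch on hypothetical data, assuming the `Store` and `DayOfWeek` columns are named as in the notebook:

```python
import pandas as pd

# Hypothetical toy frame: store 1 has all seven weekdays, store 2 only three
df_toy = pd.DataFrame({
    "Store":     [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
})

# Number of distinct weekdays observed per store
coverage = df_toy.groupby("Store")["DayOfWeek"].nunique()

# Stores missing at least one weekday
incomplete = coverage[coverage < 7].index.tolist()
print(incomplete)
```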


#### Chart 8 - Count Plot of Store Opening Status  

In [None]:
# Chart - 8 visualization code

# Features Used - Open


# ✅ Why this chart is important to include
# This chart helps identify:
#   The proportion of open vs closed days across all stores.
#   Whether missing or irregular patterns exist in store activity.
#   The operational consistency of the dataset (useful for model reliability).
# It's a straightforward but crucial check to ensure that most data entries represent real,
# open-store scenarios contributing to actual sales.

#-------------------------------------------------------------------------------------------------#


# Creating the figure and setting its size
plt.figure(figsize=(7,5))

# Plotting count plot for store open vs closed days
sns.countplot(
    data=df,
    x='Open',                        # Selecting the variable to visualize
    palette=['tomato', 'mediumseagreen'],  # Assigning contrasting colors
    edgecolor='black'                  # Adding bar borders for definition
)

# Adding value annotations above bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Adding title and axis labels
plt.title('Distribution of Store Opening Status', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Store Open Status (1 = Open, 0 = Closed)', fontsize=12)
plt.ylabel('Count of Records', fontsize=12)

# Customizing grid and removing chart borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

**Summary -**

Most records correspond to stores being open (1), indicating normal operations and consistent data.

Closure days (0) are rare and should be treated separately in modeling to avoid skewing sales predictions.

##### 1. Why did you pick the specific chart?

* I have used a countplot for this variable because Open is a binary categorical column (0/1).

* The countplot clearly shows how many records correspond to open vs. closed stores, helping me understand if the dataset has a balanced representation of both.

* This is also important for verifying whether sales = 0 corresponds to store closure or not, which is a crucial business sanity check.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that the number of records where stores are open (1) is much higher compared to when they are closed (0).

* This is expected because stores are open most days, and closure happens only on holidays or maintenance days.

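* This imbalance is consistent with sales = 0 on closed days being genuine closures rather than missing data, although the count plot alone cannot confirm it; that requires comparing Open against Sales directly.

* It also indicates that most of the dataset represents normal business operation days, which is good for model learning.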


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* The high frequency of open days means the model will have sufficient data to learn real sales patterns rather than being biased by closure days.

* This check ensures data validity: I can confidently remove or separately handle closed days for accurate revenue forecasting.

* Understanding the distribution also helps in capacity planning, since fewer closed days means better operational utilization.

---

**Negative-growth / risk signals & mitigation:**

* If the dataset had shown a significant number of closed days, that would have hinted at operational inefficiencies or maintenance issues, leading to revenue loss.

* Fortunately, that's not the case here. Still, if some stores have unusually high closure rates, I should analyze those stores individually for possible logistical or staffing problems.
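
Both checks mentioned in this section, that closed days carry zero sales and that no store has an unusually high closure rate, can be expressed in a few lines. The data below is hypothetical; the column names are assumed to match the notebook's.

```python
import pandas as pd

# Hypothetical toy frame: one row per store per day
df_toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Open":  [1, 1, 1, 0, 1, 0, 0, 1],
    "Sales": [5000, 5200, 4800, 0, 6100, 0, 0, 5900],
})

# Sanity check: records marked closed that still report sales (should be empty)
closed_with_sales = df_toy[(df_toy["Open"] == 0) & (df_toy["Sales"] > 0)]

# Closure rate per store = share of that store's records with Open == 0
closure_rate = 1 - df_toy.groupby("Store")["Open"].mean()

print(len(closed_with_sales), closure_rate.to_dict())
```

Sorting `closure_rate` in descending order would surface the stores worth inspecting individually.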


#### Chart 9 - Donut Chart of Promo2 (Long-Term Promotion Participation)



In [None]:
# Chart - 9 visualization code

# Features used - Promo2


# ✅ Why this chart is important to include ?
# The Promo2 feature indicates whether a store is enrolled in Rossmann's long-term continuous promotion scheme.
# Visualizing this helps us:
#   Understand what portion of stores are part of ongoing promotions.
#   Identify how widespread the Promo2 program is within the business.
#   Prepare for later bivariate or multivariate analysis to compare sales behavior
#     between Promo2 and non-Promo2 stores.
#   It connects directly to strategic marketing and sales forecasting.


#------------------------------------------------------------------------------------------------------#



# Setting color palette
colors = sns.color_palette(['#66b3ff', '#ff9999'])  # Blue for active, red for not participating

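# Creating count of Promo2 values, ordered as 1 then 0 so the counts line up with
# the labels and colors below (value_counts alone sorts by frequency, not by value)
promo2_counts = df['Promo2'].value_counts().reindex([1, 0], fill_value=0)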

# Creating the figure and defining its size
plt.figure(figsize=(7,7))

# Plotting the donut (pie) chart
wedges, texts, autotexts = plt.pie(
    promo2_counts,
    labels=['Participating', 'Not Participating'],
    autopct='%1.1f%%',               # Showing percentage
    startangle=90,                   # Rotating for better balance
    colors=colors,
    wedgeprops={'edgecolor': 'black', 'linewidth': 1},
    textprops={'fontsize': 11}
)

# Creating the white circle in the center to form a donut shape
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Adding title
plt.title('Distribution of Promo2 (Long-Term Promotion Participation)', fontsize=15, fontweight='bold', color='darkblue')

# Displaying the chart
plt.show()




**Summary -**

Promo2 adoption is 50.1% vs 49.9% non-adopters, which is nearly balanced.


This balance enables robust comparative analysis; next steps: test Promo2's impact on sales and customer metrics.

##### 1. Why did you pick the specific chart?

* I have used a donut chart because Promo2 is a binary categorical variable (1 = participating in long-term promotions, 0 = not).

* The donut visually shows the proportion of stores in each group and makes it easy to spot if Promo2 is broadly adopted or limited to a subset.


##### 2. What is/are the insight(s) found from the chart?

* The dataset shows nearly half-and-half split between stores that participate in the long-term promotion program and those that do not.

* This balanced adoption means I can fairly compare the two groups without worrying about severe class imbalance.

* It also indicates the program is mature enough to be widespread, not just a niche pilot.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Fair comparison for impact analysis:

    * Since adoption is about 50:50, I can compare sales, customers, and retention between Promo2 and non-Promo2 stores with good statistical power.

    * I should run Sales vs Promo2 and SalesPerCustomer vs Promo2 next to quantify uplift and ROI.


* Segmentation & rollout strategy:

    * If Promo2 is found to increase average sales, the near-equal split suggests a clear opportunity to scale the program to the remaining stores where it's not active.

    * If the program is ineffective, I can focus on improving content/timing rather than rollout decisions.



---


**Negative-growth / risk signals & mitigation:**


* Risk of promotional dependency:

    * If Promo2 stores rely on discounts to sustain sales, expansion might erode margins. I should check profitability, not only revenue uplift.


* Heterogeneous effect risk:

    * Promo2 may perform differently across store types or regions. I must analyze interaction effects (e.g., Promo2 × StoreType) before making wide rollout decisions.
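
A first look at the Promo2 × StoreType interaction could use a pivot table of mean sales. The sketch below uses hypothetical values chosen so that the effect differs by store type; it is descriptive only and does not test significance.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the notebook
df_toy = pd.DataFrame({
    "StoreType": ["a", "a", "a", "a", "d", "d", "d", "d"],
    "Promo2":    [0, 0, 1, 1, 0, 0, 1, 1],
    "Sales":     [6000, 6200, 5800, 6000, 5000, 5200, 5600, 5800],
})

# Mean sales for each StoreType / Promo2 combination
mean_sales = df_toy.pivot_table(
    index="StoreType", columns="Promo2", values="Sales", aggfunc="mean"
)

# Difference between Promo2 and non-Promo2 mean sales, per store type
mean_sales["uplift"] = mean_sales[1] - mean_sales[0]
print(mean_sales)
```

In this toy example the uplift is negative for one store type and positive for the other, which is the kind of heterogeneous effect that an overall average would hide.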

#### Chart 10 - Area Plot of Monthly Distribution

In [None]:
# Chart - 10 visualization code


# Features Used - Month

# ✅ Why this chart is important to include
# The Month column captures the temporal dimension of the data,
# which is essential for understanding seasonality in retail.
# Analyzing it helps identify:
#     Which months are busiest, indicating peak sales or promotion periods.
#     Low-activity months, which might need marketing push or inventory optimization.
#     Any missing or uneven data distribution across months (important for modeling).
#     The area plot adds visual depth, showing both trends and magnitudes over time,
#         perfect for highlighting seasonal variation.



#------------------------------------------------------------------------------------------------------#



# Creating a series of month-wise counts and ensuring months 1..12 are present
month_counts = df['Month'].value_counts().sort_index()
month_counts = month_counts.reindex(range(1,13), fill_value=0)  # Ensuring complete month index

# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting the area chart using fill_between (x, y1)
# Passing x and y1 as positional arguments to satisfy matplotlib API
plt.fill_between(
    month_counts.index,            # x: month numbers 1..12
    month_counts.values,           # y1: counts for each month
    color='lightcoral',            # Choosing a soft red tone for retail theme
    alpha=0.7,                     # Adding transparency for aesthetic look
    linewidth=2,
    edgecolor='darkred'
)

# Overlaying a line for clearer trend visualization
plt.plot(
    month_counts.index,
    month_counts.values,
    marker='o',
    color='darkred',
    linewidth=1.8
)

# Adding title and axis labels
plt.title('Monthly Distribution of Records', fontsize=16, fontweight='bold', color='darkred')
plt.xlabel('Month (1 = Jan, 12 = Dec)', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing ticks to show month numbers
plt.xticks(ticks=np.arange(1,13), labels=[str(m) for m in range(1,13)])

# Customizing grid and removing top/right borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()



Data volume is highest in Jan-Jul (about 94K-104K records each) and drops sharply from August onward (about 61K-64K).


This may reflect seasonality or partial data coverage and should be examined before modeling.

##### 1. Why did you pick the specific chart?

* I have used an area plot because it gives a clear and continuous visualization of how the number of records varies across different months.

* Since Month is a time-based feature, an area plot helps me spot seasonal activity patterns and check whether any month has missing or reduced data.

* It is also more visually appealing than a simple countplot for showing monthly volume trends.


##### 2. What is/are the insight(s) found from the chart?

* Exact month-wise counts (1 = Jan … 12 = Dec):

Jan = 103,694   Feb = 93,660     Mar = 103,695

Apr = 100,350   May = 103,695    Jun = 100,350

Jul = 98,115    Aug = 63,550     Sep = 61,500

Oct = 63,550    Nov = 61,500     Dec = 63,550


* The data shows that the first half of the year (Jan - Jun) has consistently higher counts (about 94K-104K records each) than the later months (Aug - Dec, about 61K-64K records each).

* Because each record is one store-day, this points to denser data coverage in the first half of the year, not necessarily to higher sales.

* Months 8 to 12 show a notable drop, which could be due to partial data capture, fewer open stores, or a natural business slowdown later in the year.
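
To separate the effect of month length from the drop itself, the counts listed above can be divided by the number of days in each month. This is a small arithmetic sketch that uses only those counts; the 28-day February is safe here because 2013, 2014, and 2015 are not leap years.

```python
import calendar

# Month-wise record counts as reported above
month_counts = {
    1: 103_694, 2: 93_660, 3: 103_695, 4: 100_350, 5: 103_695, 6: 100_350,
    7: 98_115, 8: 63_550, 9: 61_500, 10: 63_550, 11: 61_500, 12: 63_550,
}

# Days per month in a non-leap year
days_in_month = {m: calendar.monthrange(2013, m)[1] for m in month_counts}

# Records per calendar day (summed over all years), which removes month-length effects
records_per_day = {m: month_counts[m] / days_in_month[m] for m in month_counts}

for m, v in records_per_day.items():
    print(f"{calendar.month_abbr[m]}: {v:,.0f} records per calendar day")
```

Per-day coverage is flat at about 3,345 from February to June and flat at 2,050 from August to December, so the drop looks like a step change in how many store-days were recorded, not a gradual seasonal pattern.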

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Seasonality awareness:

    * The clear difference between the first and second half may reflect seasonal variation in operations or customer flow, if it is not simply a data-coverage gap.

    * I can use this insight for better forecasting and resource planning, ensuring stock and staff allocation matches demand cycles.


 * Data quality check:

    * The even coverage in Jan-Jun confirms that the dataset is robust enough for modeling that period, helping ensure more accurate predictions for months with complete data.


---

**Negative-growth / risk signals & mitigation:**


* The sharp decline from August onward might point to missing or incomplete data rather than true business downturns.

* Before modeling, I should validate whether these months have fewer active stores or truncated data collection.

* If the drop is genuine, the business can introduce targeted promotions or seasonal campaigns in late-year months to balance annual revenue.
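
A minimal sketch of the validation step mentioned above: counting how many distinct stores report in each month. The toy frame stands in for `df`, and the `Store`, `Year`, and `Month` column names are assumptions about the merged dataset.

```python
import pandas as pd

# Toy stand-in for df: 3 stores, where store 3 stops reporting after January
toy = pd.DataFrame({
    "Store": [1, 2, 3, 1, 2, 1, 2],
    "Year":  [2014, 2014, 2014, 2014, 2014, 2014, 2014],
    "Month": [1, 1, 1, 2, 2, 3, 3],
})

# Number of distinct stores reporting in each (Year, Month)
stores_per_month = toy.groupby(["Year", "Month"])["Store"].nunique()

# Months where fewer stores report than the overall store count
total_stores = toy["Store"].nunique()
incomplete = stores_per_month[stores_per_month < total_stores]
print(incomplete)
```

Months that appear in `incomplete` have fewer active stores; year-month combinations that are absent from `stores_per_month` altogether point to truncated data collection instead.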

#### Chart 11 -  Bar Chart of Yearly Distribution

In [None]:
# Chart - 11 visualization code

# Features Used - Year

# ✅ Why this chart is important to include
# Visualizing the Year variable helps:
#   Confirm that data spans multiple years and check for imbalanced coverage.
#   Reveal if certain years had more store activity or data collection (possibly due to expansion).
#   Prepare for time-based modeling by understanding how many training samples come from each period.
#   From a business perspective, it highlights growth or contraction patterns in store operations.

#------------------------------------------------------------------------------------------------------#

# Creating figure and defining its size
plt.figure(figsize=(8,5))

# Creating a year-wise count plot to see record distribution by year
sns.countplot(
    data=df,
    x='Year',                        # Selecting the 'Year' variable
    palette='coolwarm',              # Using balanced warm-cool color scheme
    edgecolor='black'                # Adding black borders for clear separation
)

# Adding title and axis labels
plt.title('Yearly Distribution of Records', fontsize=15, fontweight='bold', color='darkslateblue')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding count labels on each bar
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the chart
plt.show()

**Summary -**


Yearly data shows a steady decline from 2013 to 2015 (407 K → 374 K → 236 K).

This suggests either partial 2015 data or reduced store coverage, which is important to verify before modeling trends.

##### 1. Why did you pick the specific chart?

* I have used a countplot because Year is a categorical time-based feature and a countplot makes it simple to compare how many records are available for each year.

* This helps me understand the data coverage over multiple years and check if there's any imbalance that could influence long-term trend analysis or modeling.

##### 2. What is/are the insight(s) found from the chart?

* Exact year-wise counts:

      2013 → 406,974 records

      2014 → 373,855 records

      2015 → 236,380 records


* The chart clearly shows that the data volume decreases steadily over the years.

* 2013 has the highest number of records, followed by 2014, and then a sharp drop in 2015.

* This pattern indicates that the data might not cover the full year of 2015, possibly ending mid-year.

* The gradual reduction from 2013 to 2015 suggests either fewer stores recorded, data truncation, or a decline in activity in the later period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Improved modeling awareness:

    * Knowing that 2015 has fewer records allows me to adjust my analysis window or weight older years appropriately so models aren't biased by underrepresented recent data.


 * Operational benchmarking:

    * 2013 can serve as a baseline year with full activity coverage.

    * Comparing 2013-2015 helps identify long-term performance trends and the effect of promotions, competition, or seasonal shifts.


---


**Negative-growth / risk signals & mitigation:**

 * The decline in data volume after 2013 may mislead trend analysis if treated as a pure sales downturn.

 * If the 2015 drop is due to incomplete data capture, not actual business decline, it must be handled carefully in forecasting, for example by excluding incomplete months or normalizing records.

 * Misinterpreting this reduction as negative growth could lead to wrong strategic conclusions, so ensuring data completeness is crucial.
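
One way the "excluding incomplete months" idea could look in practice is sketched below. The `Date` column name and the toy date range are assumptions for illustration; with several stores the same check would be run per store or on the distinct calendar dates.

```python
import pandas as pd

# Toy daily records for one store: all of June 2015, but only 10 days of July 2015
dates = pd.date_range("2015-06-01", "2015-07-10", freq="D")
toy = pd.DataFrame({"Date": dates, "Sales": 100})

# Year-month label for each record
toy["YearMonth"] = toy["Date"].dt.strftime("%Y-%m")

# Distinct days observed per month versus days the calendar month actually has
days_seen = toy.groupby("YearMonth")["Date"].nunique()
days_expected = toy.groupby("YearMonth")["Date"].first().dt.days_in_month
complete_months = days_seen[days_seen == days_expected].index

# Keep only records that fall in fully covered months
complete = toy[toy["YearMonth"].isin(complete_months)]
print(len(toy), "->", len(complete))
```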



#### Chart 12 - PromoInterval (Months When Long-Term Promotions Are Active)

In [None]:
# Chart - 12 visualization code


# Features Used - PromoInterval

# ✅ Why this chart is important to include?
# The PromoInterval column defines which months (e.g., "Feb,May,Aug,Nov") a store's long-term
# promotion (Promo2) is active.
# Analyzing this helps Rossmann understand:
#   Which months are most commonly used for recurring promotions.
#   Whether promotional activity is evenly spread or clustered in specific months.
#   Seasonal promotional strategy, for instance heavier promotions in mid-year or near holidays.
#   This chart provides clear visibility into Rossmann's annual promotion schedule.


#--------------------------------------------------------------------------------------------------------#

# Keeping only non-null PromoInterval values
promo_df = df.dropna(subset=['PromoInterval']).copy()

# Splitting comma-separated months into lists efficiently
promo_df['PromoInterval'] = promo_df['PromoInterval'].str.split(',')

# Using explode() to create one row per month (much faster than sum())
promo_month_counts = (
    promo_df['PromoInterval']
    .explode()                              # Expanding month list into separate rows
    .value_counts()                         # Counting occurrences of each month
    .sort_index()                           # Sorting months alphabetically
)

# Creating the figure and defining its size
plt.figure(figsize=(9,5))

# Plotting bar chart for frequency of months in PromoInterval
sns.barplot(
    x=promo_month_counts.index,
    y=promo_month_counts.values,
    palette='crest',
    edgecolor='black'
)

# Adding title and labels
plt.title('Frequency of Months in PromoInterval', fontsize=15, fontweight='bold', color='darkgreen')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)  # df is record-level, so bars count records, not stores

# Adding count labels on bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing grid and borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()


**Summary -**

PromoInterval months Jan, Apr, Jul, Oct (≈ 293 K each) dominate, reflecting Rossmann's quarterly promotion cycle.


Around 508 K records have “No PromoInterval”, meaning no long-term promo was active for those periods.


##### 1. Why did you pick the specific chart?

* I have used a bar chart of month counts because PromoInterval is a categorical column that lists the months in which recurring, long-term promotions (Promo2) are active.

* The bar chart helps me clearly visualize which months are most frequently used for continuous promotions, giving an overview of the marketing calendar pattern adopted by Rossmann stores.

* This insight is valuable for identifying promotional clusters and understanding if the campaigns are evenly spread or concentrated in certain months.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that the months January, April, July, and October have the highest frequency (≈ 293 K each); these are the four most common intervals for recurring promotions.

* This confirms that Rossmann's long-term promotions follow a quarterly pattern, launched roughly every three months.

* The large count under “None” (≈ 508 K) indicates that many records correspond to stores or periods without any long-term promo active, which aligns with about half the stores not participating in Promo2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Marketing alignment:

    * The clear quarterly peaks (Jan-Apr-Jul-Oct) show a structured promotion strategy.

    * Marketing and inventory teams can prepare in advance for these months with stock build-ups and staff scheduling.


* Performance tracking:

    * Knowing the typical promotion months allows comparison of sales uplift in promo vs non-promo months to quantify promotion effectiveness.


* Opportunity for fine-tuning:

    * If promotions show diminishing impact in repeated quarters, the company can test different months or product mixes to maintain customer excitement.
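
A minimal sketch of the promo-month versus non-promo-month comparison, assuming `Month`, `PromoInterval`, and `Sales` columns like those used above. The toy values are made up, and the spelling "Sept" for September is an assumption about how the interval strings are written, so it should be checked against the actual data.

```python
import pandas as pd

# Toy rows; PromoInterval strings follow the "Jan,Apr,Jul,Oct" style described above
toy = pd.DataFrame({
    "Month":         [1, 2, 4, 5, 7, 8],
    "PromoInterval": ["Jan,Apr,Jul,Oct"] * 4 + [None, None],
    "Sales":         [6000, 5000, 6200, 5100, 5500, 5400],
})

# Month abbreviations; September is assumed to be spelled "Sept" in the interval strings
abbr = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
toy["MonthAbbr"] = toy["Month"].map(lambda m: abbr[m - 1])

# True when the row's month is one of the months listed in its PromoInterval
toy["InPromoMonth"] = [
    isinstance(interval, str) and month in interval.split(",")
    for interval, month in zip(toy["PromoInterval"], toy["MonthAbbr"])
]

# Average sales inside versus outside the listed promotion months
print(toy.groupby("InPromoMonth")["Sales"].mean())
```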


---


**Negative-growth / risk signals & mitigation:**


* The “None” category being the largest shows that many stores are missing out on promotional visibility.

* If sales are significantly lower in “None” months, that represents a lost revenue opportunity.


* Repetition fatigue:

    * Quarterly repetition may reduce novelty; varying timing or themes could prevent customer desensitization.

#### Chart 13 - Distribution of Sales per Customer

In [None]:
# Chart - 13 visualization code

# Features Used - SalesPerCustomer

# ✅ Why this chart is important to include
# The SalesPerCustomer feature measures store efficiency and customer spending behavior:
# it tells how much each customer spends on average when visiting a store.
# Analyzing it helps us:
#   Identify high-value stores or customers.
#   Detect spending variation across stores or promotions.
#   Understand whether sales growth comes from more customers or higher spend per visit.
#   It's one of the most actionable insights for business decision-making.


#--------------------------------------------------------------------------------------------------#


# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting histogram with KDE for Sales per Customer
sns.histplot(
    data=df,
    x='SalesPerCustomer',
    bins=50,
    kde=True,
    color='mediumseagreen',
    alpha=0.7
)

# Adding title and axis labels
plt.title('Distribution of Sales per Customer', fontsize=16, fontweight='bold', color='darkgreen')
plt.xlabel('Average Sales per Customer', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Customizing grid and removing unnecessary borders
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()



**Summary -**

* SalesPerCustomer shows a right-skewed distribution, with most values between 5-12 and few high-value outliers.

* Indicates stable spending patterns with occasional premium purchases, which is useful for targeted marketing and revenue optimization.

##### 1. Why did you pick the specific chart?

* I have used a histogram with KDE (Kernel Density Estimate) because it effectively displays the distribution of the continuous numeric feature SalesPerCustomer.

* This chart helps me understand how much revenue each customer typically generates on average.

* The combination of histogram and KDE helps to identify central tendencies, spread, and potential outliers in customer spending behavior.


##### 2. What is/are the insight(s) found from the chart?

* The distribution is highly right-skewed: most values are concentrated between 5 and 12, with a sharp peak near 8-9, indicating that this is the typical spending range per customer.

* There is a small spike near 0, representing records where customers made minimal purchases or possibly where the store had zero or very low sales.

* A few outliers can be seen in the 20-60 range, but they are extremely rare, suggesting occasional high-value purchase days at some stores.

* Overall, the distribution shows that most customers spend within a stable mid-range, but a small portion of high-spending customers could significantly impact total sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Customer segmentation:

    * The main cluster around 8 - 10 suggests a core customer base with predictable spending behavior.

    * These insights can be used to design targeted promotions, for example encouraging frequent moderate spenders to increase their basket size.


* Identifying premium customers:

    * The long right tail represents high-spending customers.

    * The company can introduce loyalty or premium programs to retain these customers and enhance their lifetime value.


* Modeling advantage:

    * The skewed shape supports applying a log transformation on SalesPerCustomer before modeling to stabilize variance and improve predictive accuracy.
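
A minimal sketch of the log transformation, using made-up values in place of the real `SalesPerCustomer` column. `np.log1p` is one reasonable choice here because it stays defined for the zero values seen near the left edge of the histogram; it is not necessarily the transform used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for SalesPerCustomer (made-up numbers)
spc = pd.Series([5.2, 6.5, 7.8, 8.2, 8.9, 9.4, 10.1, 11.7, 25.0, 60.0])

# log1p computes log(1 + x), so a value of 0 would map to 0 instead of -inf
spc_log = np.log1p(spc)

# Skewness before and after; values closer to 0 mean a more symmetric shape
print(f"skew before: {spc.skew():.2f}, after: {spc_log.skew():.2f}")

# expm1 inverts log1p when values need to go back to the original scale
restored = np.expm1(spc_log)
```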




---

**Negative-growth / risk signals & mitigation:**


* The small cluster near zero could indicate unproductive store days or low-traffic periods, which may reduce overall profitability.

* I need to verify if those zero or near-zero points align with operational anomalies (e.g., staff shortages, incorrect sales logging).

* Ignoring these cases might bias demand forecasts downward, so it's better to handle or investigate them separately.

#### Chart 14 - Zero Sales While Open

In [None]:
# Chart - 14 visualization code


# Features Used - ZeroSalesWhileOpen


# ✅ Why this chart is important to include
# This chart highlights how often stores were open but recorded zero sales,
# an indicator of data issues or operational inefficiency.
# It's an excellent inclusion because it bridges the gap between data validation
# and business performance, helping Rossmann ensure model accuracy and store productivity.



#------------------------------------------------------------------------------------------------------#


# Calculating total count of such occurrences
zero_sales_counts = df['ZeroSalesWhileOpen'].value_counts()

# Setting visual theme
sns.set(style='whitegrid')

# Creating figure and defining size
plt.figure(figsize=(7,6))

# Plotting a bar chart to show count of zero sales occurrences
sns.barplot(
    x=zero_sales_counts.index.map({0: 'No', 1: 'Yes'}),
    y=zero_sales_counts.values,
    palette=['mediumseagreen', 'salmon'],
    edgecolor='black'
)

# Adding title and labels
plt.title('Occurrences of Zero Sales While Store Was Open', fontsize=15, fontweight='bold', color='darkred')
plt.xlabel('Zero Sales While Open', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Adding data labels on bars
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Customizing gridlines and removing top/right spines
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()


**Summary -**

Out of ~1 million records, only 54 cases (≈0.005%) show stores open but with zero sales, confirming excellent data integrity and very few anomalies.

##### 1. Why did you pick the specific chart?

* I have used a barplot because ZeroSalesWhileOpen is a binary categorical column (0 or 1).

* This chart helps me verify whether there are any instances where stores were open but recorded zero sales, which can indicate operational anomalies or data entry issues.

* It's an important step to validate the accuracy of the data before modeling, ensuring that zero sales only occur when stores are closed.


##### 2. What is/are the insight(s) found from the chart?

* Exact value counts:

      0 → 1,017,155 records (all other records: open days with sales, plus closed days)

      1 → 54 records (stores open but with zero sales)


* The chart shows that the vast majority (≈ 99.99%) of records are not flagged, meaning open stores almost always reported sales.

* Only 54 records have zero sales even though the stores were open ‚Äî this is an extremely small fraction, practically negligible compared to total data volume.

* These few cases might represent data logging errors, returns offsetting sales, or rare operational issues.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Data quality validation:

    * This check confirms that the dataset is highly reliable, as almost all records align with expected business logic (open stores have sales).


* Operational diagnostics:

    * The few exceptions (54 cases) can be investigated to find potential causes such as POS system issues, refund-only days, or incorrect reporting.

    * Identifying and fixing such inconsistencies improves future data accuracy and decision reliability.


* Modeling advantage:

    * Knowing that zeros are rare ensures that zero values in the Sales column truly represent closed stores, simplifying filtering before model training.
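
A minimal sketch of the flag and the pre-training filter, assuming `Open` and `Sales` columns as used elsewhere in this notebook. The flag definition below is inferred from the column name `ZeroSalesWhileOpen` and may differ from the one created during feature engineering.

```python
import pandas as pd

# Toy rows covering the three cases: open with sales, open with zero sales, closed
toy = pd.DataFrame({
    "Open":  [1, 1, 0, 1, 0],
    "Sales": [5200, 0, 0, 6100, 0],
})

# Flag that is 1 only when the store was open and still recorded zero sales
toy["ZeroSalesWhileOpen"] = ((toy["Open"] == 1) & (toy["Sales"] == 0)).astype(int)

# Zero-sales rows split by whether the store was open
zero_rows = toy[toy["Sales"] == 0]
print(zero_rows["Open"].value_counts())

# Training subset: open days with positive sales only
train = toy[(toy["Open"] == 1) & (toy["Sales"] > 0)]
print(len(train), "of", len(toy), "rows kept")
```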




---

**Negative-growth / risk signals & mitigation:**

* Even though minimal, these 54 instances may point to lost revenue opportunities if they represent real business days with operational failures.

* Continuous monitoring can prevent such cases from expanding, ensuring smooth sales recording and accurate forecasting in production systems.

#### Chart 15 - Distribution of Competition Open Months

In [None]:
# Chart - 15 visualization code


# Features Used - CompetitionOpenMonths


# ✅ Why this chart is important to include
# The CompetitionOpenMonths feature shows how long each store's competitor has been active near it.
# Analyzing this helps us:
#   Understand how competitive exposure varies across stores.
#   Identify whether most stores are newly exposed or have been long-term competitors.
#   Anticipate how competition affects store performance over time.
#   It's an important variable for understanding market saturation and competitive pressure.


#------------------------------------------------------------------------------------------------------#


# Setting the visual theme
sns.set(style='whitegrid')

# Creating the figure and defining its size
plt.figure(figsize=(10,6))

# Plotting histogram with KDE
sns.histplot(
    data=df,
    x='CompetitionOpenMonths',
    bins=40,
    kde=True,
    color='royalblue',
    alpha=0.7
)

# Adding title and labels
plt.title('Distribution of Competition Open Months', fontsize=16, fontweight='bold', color='navy')
plt.xlabel('Months Since Competition Opened', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# Customizing grid and frame
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.box(False)

# Displaying the plot
plt.show()

**Summary -**

* The CompetitionOpenMonths feature is heavily right-skewed, with most stores having 0-10 months of competition and very few beyond 150 months.

* Indicates that Rossmann mostly operates in newly competitive environments, with limited long-term rivalry.

##### 1. Why did you pick the specific chart?

* I have used a histogram with KDE because CompetitionOpenMonths is a continuous numeric feature that represents the number of months since a store's nearest competitor opened.

* This chart helps visualize how long competitors have been active across all stores and whether most stores face new or established competition.

* It is crucial for understanding the competitive landscape and how long-term competition might influence store sales.

##### 2. What is/are the insight(s) found from the chart?

* The distribution is highly right-skewed, with a sharp peak near 0-10 months.

* A large number of records cluster around 0, meaning many stores either have very recent competitors or no competition at all (0 indicates missing or new competition).

* As months increase, the frequency drops significantly: very few stores have had competition for more than 100 months.

* Beyond 200 months, the records are extremely rare, showing that long-term competition presence is uncommon.

* This pattern suggests that Rossmann continuously faces emerging competition rather than long-standing rivals in most markets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impacts / what to do:**

* Competitive readiness:

    * The fact that most competitors are new gives Rossmann a first-mover advantage in many areas.

    * Stores can act proactively, e.g., intensifying local marketing or customer loyalty programs before new competitors gain traction.


* Strategic modeling:

    * CompetitionOpenMonths can be a strong feature in predictive models because it captures how competition age affects sales.

    * Early-stage competition (low months) may have smaller impact compared to mature competitors.


* Regional planning:

    * The data supports targeted competitive response strategies: newer competitors require defensive campaigns, while long-term competition may require differentiation strategies.
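
To check whether early-stage and mature competition really relate to sales differently, `CompetitionOpenMonths` can be bucketed and compared. This is a sketch on made-up values; the bin edges and labels are illustrative assumptions, not thresholds taken from the analysis.

```python
import pandas as pd

# Toy values standing in for CompetitionOpenMonths and Sales (made-up numbers)
toy = pd.DataFrame({
    "CompetitionOpenMonths": [0, 3, 8, 15, 40, 90, 130, 210],
    "Sales":                 [6000, 6400, 6100, 5800, 5600, 5500, 5700, 5300],
})

# Bin edges are illustrative; pd.cut uses right-closed intervals by default
bins = [-1, 12, 60, 150, float("inf")]
labels = ["0-12", "13-60", "61-150", "151+"]
toy["CompetitionAge"] = pd.cut(toy["CompetitionOpenMonths"], bins=bins, labels=labels)

# Mean sales per competition-age bucket
print(toy.groupby("CompetitionAge", observed=True)["Sales"].mean())
```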




---

**Negative-growth / risk signals & mitigation:**

* The high number of zeros may indicate data gaps for stores where competition details weren't entered.

* Such cases should be flagged (via CompetitionOpenSince_NA) to prevent misleading interpretation in modeling.


* Stores with long-established competition (>150 months) might suffer from market saturation, requiring innovation or pricing adjustments to maintain sales.
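
A minimal sketch of the flagging idea, so that "unknown competitor opening date" stays distinguishable from "competitor opened this month". The `CompetitionOpenSinceYear` and `CompetitionOpenSinceMonth` column names and the month-difference formula are assumptions about how `CompetitionOpenMonths` was derived.

```python
import numpy as np
import pandas as pd

# Toy store rows; NaN means the competitor's opening date was never recorded
toy = pd.DataFrame({
    "Year":  [2015, 2015, 2015],
    "Month": [7, 7, 7],
    "CompetitionOpenSinceYear":  [2010.0, np.nan, 2015.0],
    "CompetitionOpenSinceMonth": [1.0, np.nan, 7.0],
})

# Indicator that keeps "unknown" distinguishable from a true value of 0
toy["CompetitionOpenSince_NA"] = toy["CompetitionOpenSinceYear"].isna().astype(int)

# Months elapsed since the competitor opened (NaN where the date is unknown)
months = (
    12 * (toy["Year"] - toy["CompetitionOpenSinceYear"])
    + (toy["Month"] - toy["CompetitionOpenSinceMonth"])
)

# Fill unknowns with 0 only after the indicator has been created
toy["CompetitionOpenMonths"] = months.fillna(0).clip(lower=0).astype(int)
print(toy[["CompetitionOpenMonths", "CompetitionOpenSince_NA"]])
```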

#### Chart 16 - Distribution of Months Having Active Promo2

In [None]:
# Chart - 16 visualization code

# Features Used - IsPromo2Active


# ✅ Why this chart is important to include
# It's important to understand dataset balance, i.e., whether most stores usually have Promo2 running or not.
# This information does matter when building models or interpreting results,
# because an imbalanced distribution might reduce the effect of Promo2 during training.


#------------------------------------------------------------------------------------------------------#


# 🎨 Setting the visual style
plt.figure(figsize=(7,5))
sns.set(style="whitegrid")

# 📊 Creating a bar plot to visualize the count of stores with and without active Promo2
sns.countplot(x='IsPromo2Active', data=df, palette='Set2')

# 🏷️ Adding chart title and labels
plt.title('Distribution of Long-term Promotion Activity (IsPromo2Active)', fontsize=14, fontweight='bold')
plt.xlabel('Is Promo2 Active?', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)

# 🧾 Annotating bar values
for container in plt.gca().containers:
    plt.bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# 🖼️ Displaying the plot
plt.tight_layout()
plt.show()

In [None]:
df['IsPromo2Active'].value_counts()

##### 1. Why did you pick the specific chart?

* A count plot is chosen because the variable IsPromo2Active is binary (0/1).

* This makes a bar-style chart ideal for showing how many records fall in periods with and without an active Promo2.

* It clearly visualizes dataset balance, which matters because an imbalanced distribution might reduce the effect of Promo2 during training.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that while most store records reflect sales during open hours, there are a few cases where the store was open but reported zero sales.
* These records are rare but significant, as they could indicate:

    * Data entry errors

    * System outages

    * Or genuinely unproductive business days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight has strong positive business implications.
Identifying and addressing “zero sales while open” cases ensures clean, accurate data for future modeling, leading to more reliable forecasts.

* Operationally, Rossmann can investigate these instances to prevent future inefficiencies, improving store uptime and staff productivity.

* If ignored, such cases could mislead analysis or mask revenue loss, leading to negative growth.

### **Bivariate Analysis**

#### Chart 17 - Promo vs Sales

In [None]:
# Chart - 17 visualization code

# Features Used - Promo, Sales

# ✅ Why this chart is important to include?
# I am creating this chart to visually compare sales performance between promotional and non-promotional days.
# This helps in understanding how promotions influence daily sales levels and overall business revenue patterns.
# It also helps confirm if promotional campaigns are truly effective or just causing short-term spikes.

# ---------------------------------------------------------------------------------------------------------- #

# Setting visual style for professional presentation
sns.set(style='whitegrid')

# Creating the figure
plt.figure(figsize=(10,6))

# Plotting the boxplot to compare sales distribution between promo and non-promo days
sns.boxplot(
    data=df,
    x='Promo',
    y='Sales',
    palette=['#2E86C1', '#E67E22'],   # List palette: works whether Promo is stored as int or str
    showfliers=False
)

# Adding median annotations
medians = df.groupby('Promo')['Sales'].median().values
for i, median in enumerate(medians):
    plt.text(i, median + 300, f"Median = {int(median):,}",
             ha='center', fontweight='semibold', color='black')

# Adding titles and labels
plt.title('Distribution of Sales during Promo vs Non-Promo Days', fontsize=16, fontweight='bold', color='black')
plt.xlabel('Promo (0 = No Promo, 1 = Promo)', fontsize=12)
plt.ylabel('Sales Amount', fontsize=12)

# Customizing gridlines and layout
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()

# Displaying the plot
plt.show()


**Summary -**

Total Records: 1,017,209

Promo Days: 388,080 (≈ 38.2%)

Non-Promo Days: 629,129 (≈ 61.8%)

Median (Promo = 0): 4,622

Median (Promo = 1): 7,553 → ≈ +63.4% uplift

Mean (Promo = 0): 4,406

Mean (Promo = 1): 7,991 → ≈ +81.4% uplift

The chart clearly shows that stores experience a substantial increase in sales during promotion periods.
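
The uplift percentages in this summary follow from a simple relative-change formula. The sketch below reproduces them from the medians and means listed above; `pct_uplift` is a hypothetical helper name, not a function defined elsewhere in this notebook.

```python
def pct_uplift(baseline: float, treated: float) -> float:
    """Relative change of `treated` over `baseline`, in percent."""
    return (treated - baseline) / baseline * 100

# Figures taken from the summary above
median_uplift = pct_uplift(4_622, 7_553)
mean_uplift = pct_uplift(4_406, 7_991)

print(f"Median uplift: {median_uplift:.1f}%")
print(f"Mean uplift:   {mean_uplift:.1f}%")
```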

##### 1. Why did you pick the specific chart?

* I have used a boxplot because it visually represents both the median and spread of sales values between promo and non-promo days.

* It helps me see if promotions truly drive higher sales and whether the sales boost is consistent or highly variable.

* Since promotions are key business levers, analyzing their impact on sales distribution is critical for campaign strategy and revenue forecasting.


##### 2. What is/are the insight(s) found from the chart?

* There is a clear and strong uplift in sales during promo periods: median sales are about 63% higher and mean sales about 81% higher than on non-promo days.

* The IQR (spread) during promo days is wider, indicating that some promotions perform exceptionally well while others have moderate results.

* The median sales during non-promo days (4,622) are considerably lower than during promo days (7,553), confirming that promotions generally help increase sales volume.

* Overall, the chart confirms that promotions have a positive impact on sales performance but with variable outcomes across stores or time periods.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impacts / What to Do:**

* The data clearly shows that promotions significantly boost sales.
→ I can recommend continuing short-term promotional campaigns as they yield tangible benefits.

* Since variance is high, I should focus on identifying which types of promos, months, or store types give the strongest lift and standardize those campaigns.

* Promotions can also be strategically used during slower months to maintain stable cash flow.

* Including Promo as a feature in modeling will improve forecast accuracy and help the business estimate campaign effects.


---


**Negative-Growth / Risk Signals & Mitigation:**

* The large spread in promo-day sales indicates inconsistent promo performance; some promotions may underperform.
→ This requires better targeting, monitoring, and post-analysis to avoid wasted marketing costs.

* Over-reliance on promotions can cause customer habituation (waiting for discounts) and margin erosion.
→ To mitigate this, balance promotional intensity with regular pricing strategies.

#### Chart - 18

In [None]:
# Chart - 18 visualization code

# Features Used -


# ✅ Why this chart is important to include
#

#------------------------------------------------------------------------------------------------------#


##### 1. Why did you pick the specific chart?

##### 2. What is/are the insight(s) found from the chart?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory only for textual datasets, e.g., NLP, Sentiment Analysis, Text Clustering.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
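
As a reference point for this question, the project summary names Root Mean Squared Logarithmic Error (RMSLE) as the primary metric. The block below is a minimal NumPy sketch of that metric, not the evaluation code used for the models in this notebook. It takes the log of `1 + value` so that zero-sales days are handled, and as an assumption of this sketch it rejects negative inputs.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption of this sketch: sales and predictions are never negative
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("rmsle expects non-negative values")
    # log1p(x) = log(1 + x), so zero sales do not produce -inf
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Hypothetical daily sales and predictions
actual = [0, 4500, 7500, 12000]
predicted = [0, 5000, 7000, 10000]
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is computed on log values, it penalises relative rather than absolute differences, which is why it suits skewed sales data: being 500 off on a 5,000-sale day costs more than being 500 off on a 12,000-sale day.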

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***