# Exploratory Data Analysis (EDA) - Rossmann Stores

In this notebook, we explore and visualize the cleaned dataset of **Rossmann Stores** in order to uncover hidden patterns, relationships, and trends that may support our predictive modeling and business understanding.

The goal of this analysis is to understand how **store sales** behave, how different features (e.g., promotions, holidays, customers, store type) affect performance, and whether any notable correlations or anomalies exist.

We will use a combination of:
- **Static visualizations** using Matplotlib and Seaborn.
- **Interactive charts** using Plotly for deeper exploration.

This step is essential for informing the next stages of the project, including **feature engineering**, **model building**, and generating meaningful **business recommendations**.

### Importing required libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Static data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Interactive data visualization
import plotly.express as px
import plotly.graph_objects as go

# Suppressing warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

### Loading Data

In [None]:
# Reading the dataset from a CSV file
df = pd.read_csv(r"clean_data.csv")
df.head()  # Display the first 5 rows of the dataset to check its structure
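
Before plotting, it can help to confirm that the columns used later in this notebook are actually present. The sketch below runs on a tiny hypothetical frame rather than on `clean_data.csv`; `REQUIRED_COLUMNS` and `missing_columns` are illustrative names, and the column list is inferred from the cells below, not from the file itself.

```python
import pandas as pd

# Columns referenced by the plots in this notebook (inferred from the cells below)
REQUIRED_COLUMNS = [
    'Store', 'Date', 'Year', 'Month', 'DayOfWeek', 'Sales', 'Customers',
    'StoreType', 'Promo', 'Promo2', 'StateHoliday', 'SchoolHoliday',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Hypothetical two-row frame standing in for the real dataset
toy = pd.DataFrame({
    'Store': [1, 2],
    'Date': ['2015-07-31', '2015-07-30'],
    'Sales': [5263, 6064],
})

# Parse dates explicitly so time-series plots get a datetime axis
toy['Date'] = pd.to_datetime(toy['Date'])

missing = missing_columns(toy)
print(missing)
```

On the real data, the same check would be `missing_columns(df)`; an empty list means every column used below is available.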

### Monthly total sales by store type

In [None]:
# Group sales by Year, Month, and StoreType, and calculate the sum of Sales
monthly_store_sales = df.groupby(['Year', 'Month', 'StoreType'])['Sales'].sum().reset_index()

# Create a new column that combines Year and Month as a single string
monthly_store_sales['Year-Month'] = monthly_store_sales['Year'].astype(str) + '-' + monthly_store_sales['Month'].astype(str)

# Create a line plot showing monthly sales by store type
fig = px.line(
    monthly_store_sales,        # Data source
    x='Year-Month',             # X-axis: combined year and month
    y='Sales',                  # Y-axis: total sales
    color='StoreType',          # Line color based on store type
    markers=True,               # Show markers on lines
    title='Monthly Sales by Store Type Over Time',  # Chart title
    labels={'Year-Month': 'Year-Month', 'Sales': 'Total Sales', 'StoreType': 'Store Type'}  # Axis labels
)

# Display the plot
fig.show()
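
The `Year-Month` labels above are built without zero-padding the month (for example `2013-1`). Because `groupby` returns rows already sorted by `Year` and `Month`, the plot order is normally unaffected, but the labels stop being chronological as soon as they are sorted as plain strings. The sketch below, on hypothetical toy data, shows the difference; the `Year-Month-Padded` column is an illustrative name, not part of the dataset.

```python
import pandas as pd

# Hypothetical monthly totals (values are made up for illustration)
toy = pd.DataFrame({
    'Year': [2013, 2013, 2013],
    'Month': [2, 10, 1],
    'Sales': [120, 150, 100],
})

# Unpadded labels, built the same way as in the cell above
toy['Year-Month'] = toy['Year'].astype(str) + '-' + toy['Month'].astype(str)

# Zero-padded labels sort chronologically even as plain strings
toy['Year-Month-Padded'] = (
    toy['Year'].astype(str) + '-' + toy['Month'].astype(str).str.zfill(2)
)

unpadded_order = sorted(toy['Year-Month'])
padded_order = sorted(toy['Year-Month-Padded'])

print(unpadded_order)  # October sorts before February
print(padded_order)    # January, February, October
```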


### Monthly total sales overall

In [None]:
# Group sales by Year and Month, and calculate the total sales
monthly_sales = df.groupby(['Year', 'Month'])['Sales'].sum().reset_index()

# Create a new column that combines Year and Month into a single string
monthly_sales['Year-Month'] = monthly_sales['Year'].astype(str) + '-' + monthly_sales['Month'].astype(str)

# Create a line plot to show total monthly sales over time
fig = px.line(
    monthly_sales,         # Data source
    x='Year-Month',        # X-axis: combined year and month
    y='Sales',             # Y-axis: total sales
    markers=True,          # Show data points on the line
    title='Total Sales Over Time (Monthly)',  # Chart title
    labels={'Year-Month': 'Year-Month', 'Sales': 'Total Sales'}  # Axis labels
)

# Display the plot
fig.show()


### Monthly average sales

In [None]:
# Group sales by Year and Month, and calculate the average sales
monthly_avg_sales = df.groupby(['Year', 'Month'])['Sales'].mean().reset_index()

# Create a new column that combines Year and Month into a single string
monthly_avg_sales['Year-Month'] = monthly_avg_sales['Year'].astype(str) + '-' + monthly_avg_sales['Month'].astype(str)

# Create a line plot to show average monthly sales over time
fig = px.line(
    monthly_avg_sales,         # Data source
    x='Year-Month',            # X-axis: combined year and month
    y='Sales',                 # Y-axis: average sales
    markers=True,              # Show data points on the line
    title='Average Sales Over Time (Monthly)',  # Chart title
    labels={'Year-Month': 'Year-Month', 'Sales': 'Average Sales'},  # Axis labels
    color_discrete_sequence=['orange']  # Set line color to orange
)

# Display the plot
fig.show()


### Sales trend by Promo2 status

In [None]:
# Group sales by Year, Month, and Promo2 status, then calculate total sales
monthly_promo2 = df.groupby(['Year', 'Month', 'Promo2'])['Sales'].sum().reset_index()

# Create a new column that combines Year and Month into one string
monthly_promo2['Year-Month'] = monthly_promo2['Year'].astype(str) + '-' + monthly_promo2['Month'].astype(str)

# Create a line plot to show sales trend based on Promo2 status
fig = px.line(
    monthly_promo2,         # Data source
    x='Year-Month',         # X-axis: combined year and month
    y='Sales',              # Y-axis: total sales
    color='Promo2',         # Line color based on Promo2 status
    markers=True,           # Show data points
    title='Sales Trend by Promo2 Status',  # Chart title
    labels={
        'Year-Month': 'Year-Month',
        'Sales': 'Sales',
        'Promo2': 'Promo2 Status'
    }  # Axis labels
)

# Display the plot
fig.show()


### Average sales by day of the week

In [None]:
# Group sales by day of the week and calculate the average sales
avg_sales_day = df.groupby('DayOfWeek')['Sales'].mean().reset_index()

# Set the size of the figure
plt.figure(figsize=(8, 5))

# Create a bar chart for average sales per day of the week
plt.bar(avg_sales_day['DayOfWeek'], avg_sales_day['Sales'], color='skyblue')

# Add chart title and axis labels
plt.title('Average Sales by Day of Week')
plt.xlabel('Day of Week (1=Monday)')
plt.ylabel('Average Sales')

# Show all day labels on the x-axis
plt.xticks(avg_sales_day['DayOfWeek'])

# Display the plot
plt.show()


### Boxplot: Sales distribution by store type

In [None]:
# Set the size of the figure
plt.figure(figsize=(8, 6))

# Create a boxplot to show the distribution of sales for each store type
sns.boxplot(data=df, x='StoreType', y='Sales', palette='rocket')

# Add chart title and axis labels
plt.title('Sales Distribution by Store Type')
plt.xlabel('Store Type')
plt.ylabel('Sales')

# Display the plot
plt.show()


### Boxplot: Sales distribution by Promo

In [None]:
# Set the size of the figure
plt.figure(figsize=(8, 6))

# Create a boxplot to show how promotions affect sales
sns.boxplot(data=df, x='Promo', y='Sales', palette='rocket')

# Add chart title and axis labels
plt.title('Effect of Promo on Sales')
plt.xlabel('Promo (0=No, 1=Yes)')
plt.ylabel('Sales')

# Display the plot
plt.show()
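
A boxplot shows medians and quartiles visually; the same quantities can also be read off as numbers. The following is a minimal sketch on hypothetical values (not taken from the dataset), assuming the `Promo` and `Sales` column names used above.

```python
import pandas as pd

# Hypothetical sales figures for days without (0) and with (1) a promotion
toy = pd.DataFrame({
    'Promo': [0, 0, 0, 0, 1, 1, 1, 1],
    'Sales': [4000, 5000, 6000, 7000, 6000, 8000, 9000, 11000],
})

# Quartiles per group: the numbers behind the box edges and the median line
summary = toy.groupby('Promo')['Sales'].describe()[['25%', '50%', '75%']]

# Interquartile range: the height of each box
summary['IQR'] = summary['75%'] - summary['25%']

print(summary)
```

Replacing `toy` with `df` should give the quartiles for the real data, which makes the gap between the two groups easier to quote in a report.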


### Average sales by state holiday

In [None]:
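# Group sales by StateHoliday and calculate the average sales
# (cast to str first so that an integer 0 and the string '0' fall into the same group)
avg_sales_holiday = df.groupby(df['StateHoliday'].astype(str))['Sales'].mean().reset_index()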

# Map holiday codes to more descriptive labels
holiday_labels = {
    '0': 'No Holiday',
    'a': 'Public Holiday',
    'b': 'Easter Holiday',
    'c': 'Christmas Holiday'
}

# Replace holiday codes with readable labels
avg_sales_holiday['StateHoliday'] = avg_sales_holiday['StateHoliday'].map(holiday_labels)

# Define the desired order of categories for the x-axis
category_order = ['No Holiday', 'Public Holiday', 'Easter Holiday', 'Christmas Holiday']
avg_sales_holiday['StateHoliday'] = pd.Categorical(avg_sales_holiday['StateHoliday'], categories=category_order, ordered=True)

# Create a bar chart
fig = px.bar(
    avg_sales_holiday,              # Data source
    x='StateHoliday',               # X-axis: holiday types
    y='Sales',                      # Y-axis: average sales
    color='StateHoliday',           # Color by holiday type
    title='Average Sales during State Holidays',
    labels={'StateHoliday': 'Holiday Type', 'Sales': 'Average Sales'},
    color_discrete_sequence=px.colors.qualitative.Set2,
    text='Sales'                    # Show values on bars
)

# Customize the text and layout
fig.update_traces(
    texttemplate='%{text:.2f}',    # Format text labels
    textposition='outside'         # Position text above bars
)

fig.update_layout(
    showlegend=False,
    height=500,
    width=800,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='white',
    font=dict(size=14),
    title_x=0.5                    # Center the title
)

# Show the final plot
fig.show()

### Scatter plot: Customers vs Sales

In [None]:
# Create a scatter plot using Plotly to visualize the relationship between 'Customers' and 'Sales'
fig = px.scatter(
    df, x='Customers', y='Sales',
    title="Customers vs Sales",
    labels={'Customers': 'Number of Customers', 'Sales': 'Sales'},
    template='plotly_dark'  # Use a dark theme for better contrast
)

# Customize the scatter plot: set marker color to gold and marker size to 7
fig.update_traces(marker=dict(color='gold', size=7))

# Display the scatter plot
fig.show()


### Monthly average sales for sample stores

In [None]:
# Group sales by Year, Month, and Store, then calculate average sales
monthly_store_avg = df.groupby(['Year', 'Month', 'Store'])['Sales'].mean().reset_index()

# Create a new column combining Year and Month as a string
monthly_store_avg['Year-Month'] = monthly_store_avg['Year'].astype(str) + '-' + monthly_store_avg['Month'].astype(str)

# Select the first 10 store IDs as a sample (not a random sample)
sample_stores = monthly_store_avg['Store'].unique()[:10]

# Filter data to include only the sample stores
filtered_data = monthly_store_avg[monthly_store_avg['Store'].isin(sample_stores)]

# Create a line plot showing average monthly sales for the sample stores
fig = px.line(
    filtered_data,           # Data source
    x='Year-Month',         # X-axis: combined year and month
    y='Sales',              # Y-axis: average sales
    color='Store',          # Different lines by store
    markers=True,           # Show data points on lines
    title='Average Monthly Sales for Sample Stores'  # Chart title
)

# Customize layout: axis titles, rotate x-axis labels, legend title, and figure size
fig.update_layout(
    xaxis_title='Year-Month',
    yaxis_title='Average Sales',
    xaxis_tickangle=45,
    legend_title='Store',
    height=600,
    width=1000
)

# Display the plot
fig.show()


### Boxplot: Effect of school holidays on sales

In [None]:
# Set the size of the figure
plt.figure(figsize=(8, 6))

# Create a boxplot to show how school holidays affect sales
sns.boxplot(data=df, x='SchoolHoliday', palette='rocket', y='Sales')

# Add title and axis labels
plt.title('Effect of School Holiday on Sales')
plt.xlabel('School Holiday (0=No, 1=Yes)')
plt.ylabel('Sales')

# Display the plot
plt.show()


### Average sales by day of week and store type

In [None]:
# Group data by day of the week and store type, calculate average sales
dayofweek_store_sales = df.groupby(['DayOfWeek', 'StoreType'])['Sales'].mean().reset_index()

# Create a grouped bar chart to show average sales by day and store type
fig = px.bar(
    dayofweek_store_sales,    # Data source
    x='DayOfWeek',            # X-axis: day of the week
    y='Sales',                # Y-axis: average sales
    color='StoreType',        # Bars colored by store type
    barmode='group',          # Group bars side by side
    title='Average Sales by Day of Week and Store Type'  # Chart title
)

# Set axis and legend titles
fig.update_layout(
    xaxis_title='Day of Week',
    yaxis_title='Average Sales',
    legend_title='Store Type'
)

# Show the plot
fig.show()


### Sales trend by day of week over time

In [None]:
# Group data by year, month, and day of the week; calculate average sales
dayofweek_sales = df.groupby(['Year', 'Month', 'DayOfWeek'])['Sales'].mean().reset_index()

# Create a new column combining year and month as a string
dayofweek_sales['Year-Month'] = dayofweek_sales['Year'].astype(str) + '-' + dayofweek_sales['Month'].astype(str)

# Create a line chart showing sales trends over time by day of the week
fig = px.line(
    dayofweek_sales,          # Data source
    x='Year-Month',           # X-axis: combined year and month
    y='Sales',                # Y-axis: average sales
    color='DayOfWeek',        # Lines colored by day of the week
    markers=True,             # Show markers on data points
    title='Sales Trend by Day of Week Over Time'  # Chart title
)

# Update layout with axis titles and rotate x-axis labels for readability
fig.update_layout(
    xaxis_title='Year-Month',
    yaxis_title='Average Sales',
    xaxis_tickangle=45
)

# Display the plot
fig.show()


### Daily total sales

In [None]:
# Group data by date and sum sales for each day
daily_sales = df.groupby('Date')['Sales'].sum().reset_index()

# Create a line chart showing total daily sales over time
fig = px.line(
    daily_sales,       # Data source
    x='Date',          # X-axis: dates
    y='Sales',         # Y-axis: total sales
    title='Total Sales Time Series (Daily)'  # Chart title
)

# Display the plot
fig.show()


### Correlation heatmap

In [None]:
# Calculate correlation matrix for numerical columns only
corr = df.select_dtypes(include='number').corr()

# Create a mask for the upper triangle (to hide duplicate correlations)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set figure size
plt.figure(figsize=(14, 10))

# Plot heatmap of correlations with annotations and a reversed rocket color map
sns.heatmap(
    corr,              # Correlation matrix data
    mask=mask,         # Apply mask to show only lower triangle
    annot=True,        # Show correlation values on the heatmap
    fmt=".2f",         # Format numbers to 2 decimal places
    cmap='rocket_r',   # Color map (rocket reversed)
    square=True        # Make cells square-shaped
)

# Add title to the heatmap
plt.title('Heatmap of Correlations Between Numerical Features')

# Show the plot
plt.show()
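
When the heatmap has many cells, it can be easier to look only at the `Sales` column of the correlation matrix, ranked by absolute value. This is a minimal sketch on a hypothetical toy frame; the values are invented, and only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical data; StoreType is non-numeric and is excluded from the correlation
toy = pd.DataFrame({
    'Sales': [10, 20, 30, 40, 50],
    'Customers': [1, 2, 3, 4, 5],
    'Promo': [0, 0, 1, 1, 1],
    'DayOfWeek': [5, 4, 3, 2, 1],
    'StoreType': list('aabbc'),
})

# Correlation matrix over numeric columns only
corr = toy.select_dtypes(include='number').corr()

# Correlation of every other feature with Sales, strongest (by magnitude) first
sales_corr = corr['Sales'].drop('Sales').sort_values(key=abs, ascending=False)

print(sales_corr)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.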


## Financial Suggestions Based on Data Analysis

- Identify and address missing-data patterns to improve data quality and the reliability of downstream decisions.
- Analyze customer behavior trends during holidays and promotional periods to optimize store operations and marketing campaigns.
- Investigate the impact of competitor distance (`compdistance`) on sales to develop localized marketing strategies.
- Examine the performance differences between various store types (`StoreType`) and product assortments (`Assortment`) to tailor inventory and promotions accordingly.
- Utilize insights from sales and customer counts over different days of the week and months to plan staffing and inventory levels effectively.
- Monitor open and closed store days to understand their effect on sales fluctuations and adapt business hours if necessary.
- Review the effectiveness of promotional intervals (`PromoInterval`) and long-term promotions (`Promo2`) in driving customer engagement and sales.
- Explore the effect of school holidays and state holidays on customer footfall and sales volume to improve seasonal planning.
- Leverage visualization insights to communicate key trends and anomalies to stakeholders for informed business decisions.
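
As one possible starting point for the competition-distance suggestion above, the sketch below bins distance into bands and compares average sales per band. It is only an illustration: the data is hypothetical, the column name `compdistance` is taken from the bullet above, the distance is assumed to be in meters, and the bin edges and the `DistanceBand` name are arbitrary choices.

```python
import pandas as pd

# Hypothetical stores (values invented for illustration)
toy = pd.DataFrame({
    'compdistance': [100, 400, 900, 2500, 4000, 12000],
    'Sales': [7000, 6500, 6000, 5500, 5000, 5800],
})

# Arbitrary distance bands; intervals are closed on the right, e.g. (0, 500]
bins = [0, 500, 5000, float('inf')]
labels = ['<=500 m', '500 m-5 km', '>5 km']
toy['DistanceBand'] = pd.cut(toy['compdistance'], bins=bins, labels=labels)

# Average sales per band; observed=True skips bands that contain no stores
band_avg = toy.groupby('DistanceBand', observed=True)['Sales'].mean()

print(band_avg)
```

Differences between bands in a table like this are descriptive only; store type, assortment, and promotions would need to be controlled for before drawing conclusions.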