# Project Overview

This project is structured into **four major components**, each addressing a critical phase of **time series forecasting** using the **Rossmann Store Sales dataset**:

🔹 **Part 1**: Data Ingestion and Exploratory Data Analysis (EDA)

We begin by importing the dataset and performing a thorough exploratory data analysis to assess data quality, uncover patterns, and identify potential issues such as missing values, anomalies, and inconsistencies.

🔹 **Part 2**: Feature Engineering and Data Visualization

This section delves into engineering relevant features (predictors) to prepare the data for modeling and presents key visualizations to uncover underlying trends, recurring patterns, and seasonal effects.

🔹 **Part 3**: Classical Time Series Analysis and Forecasting

This section focuses on widely adopted forecasting techniques in the data science domain. We will implement and evaluate several standard algorithms, including:

 - **Statistical Models**: ARIMA, SARIMA
    
 - **Ensemble Methods:** XGBoost, LightGBM
    
 - **Facebook Prophet:** A robust model for time series forecasting with built-in seasonality and holiday effects
    
 - **Deep Learning Models:** LSTM, Temporal Fusion Transformers (TFT), N-BEATS

🔹 **Part 4**: Hybrid Time Series Forecasting

This advanced section explores hybrid modeling approaches typically used by experienced data scientists. These models combine the strengths of multiple algorithms to improve forecasting accuracy:

 - **ARIMA + XGBoost**
    
 - **Prophet + LightGBM / XGBoost**
    
 - **Prophet + LSTM**
    
 - **TFT + ARIMA**

 # Model Selection Strategy
----------------------------

The choice of forecasting algorithm depends on the characteristics of the dataset and the domain expertise of the practitioner. In this project, we will experiment with all the above methods and compare their performance to determine the most effective approach for our data.
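
One way to make that comparison concrete is to score every model's forecasts on the same hold-out period. The sketch below is illustrative only: the helper names (`rmse`, `mape`, `compare_models`), the choice of metrics, and the toy numbers are assumptions, not results from this project.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # zero-sales days would divide by zero
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)


def compare_models(y_true, forecasts):
    """Score each forecast against the same actuals; best (lowest RMSE) first."""
    rows = [
        {"model": name, "rmse": rmse(y_true, pred), "mape": mape(y_true, pred)}
        for name, pred in forecasts.items()
    ]
    return sorted(rows, key=lambda row: row["rmse"])


# Toy hold-out values (hypothetical, for illustration only)
actual = [100, 120, 0, 90]
forecasts = {
    "model_a": [110, 115, 5, 85],
    "model_b": [100, 100, 0, 100],
}

ranking = compare_models(actual, forecasts)
for row in ranking:
    print(f"{row['model']}: RMSE={row['rmse']:.2f}, MAPE={row['mape']:.2f}%")
```

Scoring all candidates with the same function and the same hold-out window keeps the comparison fair; which metric matters most depends on the business question.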

**Let’s dive in and explore which model delivers the most accurate forecasts!**

## Part 1: Data Ingestion and Exploratory Data Analysis (EDA)
-------------------------------------------------------------

In this section, we will focus on **ingesting** the Rossmann Store Sales dataset and conducting a comprehensive **Exploratory Data Analysis (EDA)** prior to applying time series modeling and forecasting techniques. The workflow will proceed through the following steps:

1. **Import Libraries and Dependencies:** Load all necessary Python libraries and packages required for data manipulation, visualization, and analysis.

2. **Data Ingestion:** Load the Rossmann Store Sales dataset into the working environment for analysis.

3. **Exploratory Data Analysis (EDA):** Perform a detailed examination of the dataset to understand its structure and key characteristics:

    - **Inspect dataset metadata**: data types, number of observations (rows), and variables (columns)

    - **Identify and quantify missing values**

    - **Detect and handle duplicate records**

    - **Generate summary statistics** (mean, median, standard deviation, etc.)

    - **Analyze individual features and their distributions**

    - **Apply feature engineering techniques to enhance model readiness**

    - **Evaluate feature correlations to identify relationships**

    - **Visualize data using appropriate plots and charts**

    - **Conduct deeper analysis to uncover trends, patterns, and seasonality within the time series**
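
The first few inspection steps listed above (metadata, missing values, duplicates) can be gathered into one small profile table. This is a minimal sketch on a toy frame; `quick_profile` and the toy columns are hypothetical names used for illustration, not part of the notebook.

```python
import pandas as pd


def quick_profile(df):
    """Return one row per column: dtype, missing count, missing share, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
        "n_unique": df.nunique(dropna=True),
    })


# Toy data (hypothetical): the last row duplicates the third one
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "sales": [100.0, None, 250.0, 250.0],
    "date": ["2015-01-01", "2015-01-02", "2015-01-01", "2015-01-01"],
})

profile = quick_profile(toy)
n_duplicates = int(toy.duplicated().sum())

print(profile)
print(f"Duplicated rows: {n_duplicates}")
```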

## 1. Setup & Import Libraries
---------------------------------------

In [None]:
import time 

In [None]:
# Step 1: Setup and Import Libraries
print("Step 1: Setup and Import Libraries in progress...")
time.sleep(1)  # Simulate processing time

In [None]:
# Data Manipulation & Processing
import os
import holidays
import pandas as pd
import numpy as np
from pathlib import Path
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Warnings
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings for cleaner notebook output

print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 1")
print("="*60)
print("All libraries imported successfully!")

In [None]:
print("✅ Setup and Import Libraries completed.\n")

In [None]:
# Start analysis

analysis_begin = pd.Timestamp.now()

bold_start = '\033[1m'
bold_end = '\033[0m'

print("🔍 Analysis Started")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")


## 2. Data Ingestion
----------------------------

In [None]:
# Step 2: Data Ingestion
print("Step 2: Data Ingestion in progress...")
time.sleep(1)  # Simulate processing time

In [None]:
# Use absolute path to avoid pwd issues
base_path = Path.home() /"Desktop"/"Time_Series_Analysis"/"data"/"raw"
train_path = base_path /"Retail_train_data.csv"
test_path = base_path / "Retail_test_data.csv"

# Check if files exist
print(f"Looking for files in: {base_path}\n")
print(f"Train file exists: {train_path.exists()}")
print(f"Test file exists: {test_path.exists()}")

if train_path.exists() and test_path.exists():
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    print("\nFiles loaded successfully!\n")
    print(f"Train data shape: {train_df.shape}")
    print(f"Test data shape: {test_df.shape}\n")
    print('There are : %s Rows and %s Columns in training data' % (str(train_df.shape[0]) ,str(train_df.shape[1])))
    print('There are : %s Rows and %s Columns in testing data' % (str(test_df.shape[0]) ,str(test_df.shape[1])))
else:
    print("Files not found. Please check the file paths and names.")

In [None]:
print("✅ Data Ingestion completed.\n")

# 3. Exploratory Data Analysis (EDA)
## 3.1. Basic Inspection
-------------------

In [None]:
# Step 3: Exploratory Data Analysis (EDA)
print("📊 Step 3: Exploratory Data Analysis in progress...")
time.sleep(1)  # Simulate processing time

In [None]:
train_df.columns = train_df.columns.str.lower()
test_df.columns = test_df.columns.str.lower()

# --- BASIC INFO AND DUPLICATES ---
print("DataFrame Info:")
train_df.info()

## View or Display Dataset

In [None]:
print("\nTrain Data Preview:")
print(train_df.head(), "\n")

In [None]:
train_df['date'].min(), train_df['date'].max()

In [None]:
train_df.sort_values('date', inplace =True, ascending =True)

print("\nTrain Data Preview:")
print(train_df.head(), "\n")
print("\nTest Data Preview:")
print(test_df.tail())

In [None]:

test_df.sort_values('date', inplace =True, ascending =True)

print("\nTrain Data Preview:")
print(train_df.head(), "\n")
print("\nTest Data Preview:")
print(test_df.tail())

In [None]:
print("\nMissing Values per Column:")
print(train_df.isna().sum())
print('\nNumber of duplicated rows:', train_df.duplicated().sum())

## 3.2 Summary Statistics
---------------------

In [None]:
print("\nSummary Statistics:")
print(train_df.describe())

----------------------------------------------------

### This is my previous version of the Statistical Analysis

1. **Store-Level Insights :**
 *  Store IDs range from **1 to 1115**, suggesting over a thousand **(1115)** unique stores in the dataset.
 *  The store ID summary **(mean ≈ 558, std ≈ 322)** is what evenly spread IDs from 1 to 1115 would produce, so each store appears to contribute a similar number of rows. These figures describe the ID numbering, not a business pattern.
   
2. **Temporal Patterns :**
 *  DayOfWeek ranges from **1 to 7**, with a mean of **~4.00**—indicating a fairly uniform distribution of entries across all days.
 *  Could be interesting to examine how sales and customers vary by weekday, especially near the weekend peak **(e.g., Friday/Saturday)**.

3. **Sales & Customer Traffic:**
 *  Average daily sales per entry: **5760.84** with a very wide spread **(std ≈ 3857.57)**, and a maximum single-day value of **€41,551**, a strong hint at major spikes for certain stores or events.
 *  Customer count averages **~633** per day, also with high variability **(max nearly 7,400)**, pointing to inconsistent foot traffic across locations and days.
 *  Zero-sales and zero-customer days are quite high **(≈ 17% of records)**. These likely correspond to closed stores or unusual business days and should be accounted for in any forecasting or revenue analysis.
 *  The mean for sales is **5760.84** and the median **5731.00**. The mean sitting slightly above the median suggests the **sales distribution is mildly right-skewed**.

4. **Store Availability:**
 *  Open flag mean is **0.829**, meaning stores are open roughly **83%** of the time.

5. **Promotional Impact**
 *  38% of entries have active promotions, and sales/customer variability suggests these promotions could have significant but uneven impact.
 *  Worth analyzing sales uplift due to **Promo = 1 compared to Promo = 0**.

6. **School Holidays**
 *  **Only 17.2%** of entries fall on a school holiday. This share alone does not show whether sales are higher or lower on those days; that needs a comparison of average sales with and without school holidays.
 *  This feature may influence customer counts, especially for stores near schools or those that rely on family shoppers.

--------------------------------------------------------
### This is my current (updated) version of the Statistical Analysis

Looking at these retail summary statistics, several key insights emerge about this dataset's business patterns and data quality:

**Business Operations Pattern**
Stores are open on **82.9%** of records, which corresponds to roughly **5.8 of 7 days per week**. The day-of-week summary **(mean ~4.0, std 2.0)** is what an even spread over days 1 to 7 produces, so every weekday is covered about equally. Whether weekends behave differently cannot be read from this statistic; it requires comparing sales by day of week.

**Sales Performance Distribution**
Daily sales **average €5,761** with substantial variation **(std €3,858)**, indicating significant differences between high- and low-performing periods. The distribution appears right-skewed: the mean sits slightly above the **median (€5,731)**, and the **maximum of €41,551** lies far above the **75th percentile (€7,847)**, which points to a long upper tail of exceptional peak days.

**Customer traffic** follows a similar pattern, **averaging 633 customers per day** with the same **right-skewed distribution**. The sales-to-customer ratio of approximately **€9.11 per customer** points to small, everyday baskets, which fits a drugstore chain such as Rossmann.
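
A skewness claim of this kind can be checked directly rather than inferred from percentiles. The sketch below uses made-up numbers and a hypothetical helper, `skew_summary`; in the notebook the same call could be pointed at `train_df['sales']`.

```python
import pandas as pd


def skew_summary(values):
    """Compare mean and median and report the sample skewness."""
    s = pd.Series(values, dtype="float64")
    return {
        "mean": s.mean(),
        "median": s.median(),
        "skew": s.skew(),  # > 0 indicates a longer right tail
        "mean_above_median": bool(s.mean() > s.median()),
    }


# Toy values (hypothetical): one extreme peak day stretches the right tail
toy_sales = [0, 4000, 5000, 5500, 6000, 7000, 41000]
summary = skew_summary(toy_sales)
print(summary)
```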

**Promotional and Seasonal Effects**
Promotions are active on **38%** of records, suggesting strategic rather than constant promotional activity. School holidays cover only **17.2%** of records, so most observations fall in regular school periods.

**Data Quality Concerns**
The most striking finding is that **168,494 rows (17.1%)** show zero sales, with an almost identical number showing zero customers **(168,492)**. These near-identical counts, together with the 17.1% of records where the store is closed, suggest genuine store closures rather than data collection errors. The closed days pull the overall averages down and explain why the "Open" variable shows **82.9% rather than 100%**.
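
Whether zero-sales rows really coincide with closed days can be verified with a cross-tabulation. This is a minimal sketch on toy data; the column names follow the lower-cased names used in this notebook, and the values are invented.

```python
import pandas as pd

# Toy data (hypothetical): one open day with zero sales, two closed days
toy = pd.DataFrame({
    "open": [1, 1, 0, 0, 1],
    "sales": [5000, 0, 0, 0, 6200],
    "customers": [600, 0, 0, 0, 700],
})

# Rows: is the store closed? Columns: did it record zero sales?
closure_check = pd.crosstab(
    toy["open"] == 0,
    toy["sales"] == 0,
    rownames=["closed"],
    colnames=["zero_sales"],
)

# Open days with zero sales are the ones worth a closer look
open_but_zero = int(((toy["open"] == 1) & (toy["sales"] == 0)).sum())

print(closure_check)
print(f"Open days with zero sales: {open_but_zero}")
```

If nearly all zero-sales rows fall in the `closed = True` row, the closure explanation holds; any remaining open-but-zero rows can be inspected individually.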

**Strategic Implications**
The wide performance range and promotional frequency suggest opportunities for optimizing both timing and targeting of marketing efforts. The substantial day-to-day variation in both sales and customer counts indicates strong external factors **(likely day-of-week effects, promotions, and seasonal patterns)** that could be leveraged for better forecasting and inventory management.

This appears to be a comprehensive retail dataset suitable for time series analysis, promotional effectiveness studies, and customer behavior modeling.

## 3.3 Variables Analysis
-------------------------

#### 1. Suspicious Days (Quality Check)

In [None]:
# Days with zero customers and non-zero sales (quality check)
odd_days = train_df[(train_df["customers"] == 0) & (train_df["sales"] > 0)]
print(f"Suspicious days with sales but no customers: {odd_days.shape[0]}")

#### 2. Unique days of the week

In [None]:
unique_days = train_df['dayofweek'].sort_values().unique()
print(f"Unique days of the week in the dataset: {unique_days}")

#### 3. Unique State Holidays

In [None]:
unique_stateholiday = train_df['stateholiday'].unique()
print(f"Unique State holiday in the dataset: {unique_stateholiday}")

In [None]:
# -- A mix of '0' (string) and 0 (integer) in the output above means the column has inconsistent data types, which can break mappings and group-bys.

# -- Clean the Data Types
train_df['stateholiday'] = train_df['stateholiday'].astype(str)

# --- Sorted Output with counts
print("State Holiday Value Counts:")
print(train_df['stateholiday'].value_counts().sort_index())

-------------------------------------------------------
### State Holidays naming convention

 - **a** stands for **Public Holiday**

 - **b** stands for **Easter Holiday**

 - **c** stands for **Christmas Holiday**

 - **0** or None means **No holiday**

These codes were defined by the dataset creators to simplify **holiday categorization across German states**. So when we preprocess the data, we map those codes to their actual meanings to make analysis and visualization more intuitive.
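
The cleaning and mapping described above can be sketched as one helper. This is an illustration under assumptions: `HOLIDAY_LABELS` and `label_state_holiday` are hypothetical names, and the toy series only mimics the mixed `0` / `'0'` values seen in the raw column.

```python
import pandas as pd

# Code-to-label mapping, following the naming convention above
HOLIDAY_LABELS = {"0": "None", "a": "Public", "b": "Easter", "c": "Christmas"}


def label_state_holiday(series):
    """Normalise mixed int/str codes to strings, then map them to labels."""
    codes = series.astype(str).str.strip()
    # Codes outside the mapping become NaN, which makes them easy to spot
    return codes.map(HOLIDAY_LABELS)


# Toy column (hypothetical) with the integer 0 and the string '0' mixed together
raw = pd.Series([0, "0", "a", "b", "c"], dtype=object)
labels = label_state_holiday(raw)
print(labels.tolist())
```

Casting to string before mapping is what makes the integer `0` and the string `'0'` land in the same category.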

-------------------------------------------------------

#### 4. Closed Count

In [None]:
closed_count = (train_df['open'] == 0).sum()
print(f"\nTotal number of times stores were closed: {bold_start}{closed_count}{bold_end}")

#### 5. Open Count

In [None]:
opened_count = (train_df['open'] == 1).sum()
print(f"\nTotal number of times stores were opened: {bold_start}{opened_count}{bold_end}")

#### 6. Compare Open vs Closed

In [None]:
total_days = len(train_df)

closed_count = (train_df['open'] == 0).sum()
opened_count = (train_df['open'] == 1).sum()

print(f"Opened: {bold_start}{opened_count}{bold_end} times ({bold_start}{(opened_count / total_days) * 100:.2f}%){bold_end}")
print(f"Closed: {bold_start}{closed_count}{bold_end} times ({bold_start}{(closed_count / total_days) * 100:.2f}%){bold_end}")

#### 7. Check for zero Sales

In [None]:
no_sales_count = (train_df['sales'] == 0).sum()
print(f"\nThere are {bold_start}{no_sales_count}{bold_end} times when stores made no sales.")

#### 8. Check for Sales

In [None]:
sales_count = (train_df['sales'] != 0).sum()
print(f"\nThere are {bold_start}{sales_count}{bold_end} times when stores recorded sales.")

#### 9. Compare Sales vs No Sales

In [None]:
total_days = len(train_df)

print(f"Sales: {bold_start}{sales_count}{bold_end}  times ({bold_start}{(sales_count / total_days) * 100:.2f}%){bold_end}")
print(f"No-sales: {bold_start}{no_sales_count}{bold_end} times ({bold_start}{(no_sales_count / total_days) * 100:.2f}%){bold_end}")

#### 10. Customers Check

In [None]:
total_records = len(train_df)  # number of store-day records, used as the denominator
no_customers_count = (train_df['customers'] == 0).sum()

print(f"\nThere are {bold_start}{no_customers_count}{bold_end} times when no customers visited the stores.")
print(f"No-Customers: {bold_start}{no_customers_count}{bold_end} on ({bold_start}{(no_customers_count / total_records) * 100:.2f}%){bold_end}")

#### 11. Impact of Day of the Week - Average Customers and Sales by Day

In [None]:
day_analysis = (
    train_df.groupby('dayofweek')[['customers', 'sales']]
    .mean()
    .reset_index()
    .rename(columns={'customers': 'avg_customers', 'sales': 'avg_sales'})
)

print(day_analysis)


#### 12. Impact of Promotion - Average Customers and Sales by Promotion Status

In [None]:
promo_analysis = (
    train_df.groupby('promo')[['customers', 'sales']]
    .mean()
    .reset_index()
    .rename(columns={'customers': 'avg_customers', 'sales': 'avg_sales'})
)

print(promo_analysis)

#### 13. Day of Week Impact

In [None]:
# Average customers and sales by day of the week
dow_analysis = (
    train_df
    .groupby("dayofweek")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
)

print(dow_analysis)


#### 14. SchoolHoliday Impact

In [None]:
schoolholiday_analysis = (
    train_df
    .groupby("schoolholiday")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
)

print(schoolholiday_analysis)

#### 15. Top 10 Most Crowded Stores

In [None]:
top10_crowded_store = (
    train_df.groupby('store')['customers']
    .mean()
    .nlargest(10)
    .reset_index()
    .rename(columns={'customers': 'avg_customers'})
)

print(top10_crowded_store)

#### 16. Top 10 Highest-Selling Stores

In [None]:
top10_selling_store = (
    train_df.groupby('store')['sales']
    .mean()
    .nlargest(10)
    .reset_index()
    .rename(columns={'sales': 'avg_sales'})
)

print(top10_selling_store)

#### 17. Crowded vs. Selling Stores - Sort by avg_sales in descending order

In [None]:
# Outer merge keeps stores that appear in only one of the two top-10 lists;
# fillna(0) fills the average they are missing from the other list
comparison_df = (
    pd.merge(top10_crowded_store, top10_selling_store, on='store', how='outer')
    .fillna(0)
    .sort_values(by='avg_sales', ascending=False)
    .reset_index(drop=True)
)

print(comparison_df)

#### 18. Dual Comparison

In [None]:
comparison_df['top_both'] = (
    comparison_df['store'].isin(top10_crowded_store['store']) &
    comparison_df['store'].isin(top10_selling_store['store'])
)

print(comparison_df)
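
As an alternative to filling missing averages with 0 and checking membership afterwards, pandas' `merge` can record where each row came from via `indicator=True`. The sketch below uses toy frames with invented values; `top_crowded`, `top_selling`, and `overlap` are hypothetical names.

```python
import pandas as pd

# Toy top lists (hypothetical values): stores 2 and 3 appear in both
top_crowded = pd.DataFrame({"store": [1, 2, 3], "avg_customers": [900.0, 850.0, 800.0]})
top_selling = pd.DataFrame({"store": [2, 3, 4], "avg_sales": [9000.0, 8800.0, 8700.0]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
overlap = pd.merge(top_crowded, top_selling, on="store", how="outer", indicator=True)
overlap["top_both"] = overlap["_merge"] == "both"

both_stores = sorted(overlap.loc[overlap["top_both"], "store"].tolist())
print(overlap)
print(f"Stores in both lists: {both_stores}")
```

Leaving the missing averages as NaN avoids reading a filled-in 0 as a real average.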

## 4. Feature Engineering
--------------------------

In [None]:
# Make a copy of the original dataframe to avoid modifying it
df_features = train_df.copy()

# Ensure date column is in datetime format
if not pd.api.types.is_datetime64_any_dtype(df_features['date']):
    df_features['date'] = pd.to_datetime(df_features['date'])


#### 4.1. Create Time-based Covariates -  Basic Temporal Features

In [None]:
df_features['day'] = df_features['date'].dt.strftime('%a')
df_features['week'] = df_features['date'].dt.isocalendar().week
df_features['month'] = df_features['date'].dt.strftime('%b')
df_features['quarter'] = df_features['date'].dt.quarter
df_features['year'] = df_features['date'].dt.year

#### 4.2. StateHoliday Mapping

In [None]:
# Define the holiday type mapping
holiday_map = {
    "0": "None",
    "a": "Public",
    "b": "Easter",
    "c": "Christmas"  
}

df_features['stateholiday']= df_features['stateholiday'].map(holiday_map)

#### Promo mapping

In [None]:
df_features['promo'] = df_features['promo'].astype(str).map({'1': 'Promo', '0': 'No Promo'})

#### Feature Engineering Check

In [None]:
print(f'The shape after feature Engineering : {df_features.shape}')

In [None]:
df_features.head()

## Feature Engineering for ML
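
The cell below includes sine/cosine ("cyclical") features. The idea is that day 7 and day 1 are neighbours in the week even though their raw values are far apart. A minimal sketch of that property, using a hypothetical helper `cyclical_encode`:

```python
import numpy as np


def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# Monday = 1, Tuesday = 2, Sunday = 7
days = np.array([1, 2, 7])
day_sin, day_cos = cyclical_encode(days, period=7)
points = np.column_stack([day_sin, day_cos])

# On the circle, Sunday is as close to Monday as Tuesday is
dist_mon_sun = np.linalg.norm(points[0] - points[2])
dist_mon_tue = np.linalg.norm(points[0] - points[1])
print(f"Mon-Sun distance: {dist_mon_sun:.4f}, Mon-Tue distance: {dist_mon_tue:.4f}")
```

With the raw integer, Monday and Sunday differ by 6; in the encoded space they are one step apart, which is the behaviour a model needs for weekly seasonality.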

In [None]:
# Filter for clean and unbiased data
ts_train = train_df[(train_df['open'] == 1) & (train_df['sales'] > 0)].copy()

# Ensure 'date' is datetime before sorting so the order is chronological
ts_train['date'] = pd.to_datetime(ts_train['date'])

# Sort by date
ts_train.sort_values('date', ascending=True, inplace=True)

# Temporal features
ts_train['dayofmonth'] = ts_train['date'].dt.day
ts_train['dayofyear'] = ts_train['date'].dt.dayofyear
ts_train['weekofyear'] = ts_train['date'].dt.isocalendar().week.astype(int)
ts_train['month'] = ts_train['date'].dt.month
ts_train['quarter'] = ts_train['date'].dt.quarter
ts_train['year'] = ts_train['date'].dt.year

# Cyclical features
ts_train['day_sin'] = np.sin(2 * np.pi * ts_train['dayofweek'] / 7)
ts_train['day_cos'] = np.cos(2 * np.pi * ts_train['dayofweek'] / 7)
ts_train['month_sin'] = np.sin(2 * np.pi * ts_train['month'] / 12)
ts_train['month_cos'] = np.cos(2 * np.pi * ts_train['month'] / 12)
ts_train['week_sin'] = np.sin(2 * np.pi * ts_train['weekofyear'] / 52)
ts_train['week_cos'] = np.cos(2 * np.pi * ts_train['weekofyear'] / 52)

# Business features
ts_train['isweekend'] = (ts_train['dayofweek'] > 5).astype(int)
ts_train['ismonthstart'] = ts_train['date'].dt.is_month_start.astype(int)
ts_train['ismonthend'] = ts_train['date'].dt.is_month_end.astype(int)
ts_train['isquarterstart'] = ts_train['date'].dt.is_quarter_start.astype(int)
ts_train['isquarterend'] = ts_train['date'].dt.is_quarter_end.astype(int)

# Lag features
for lag in [1, 2, 3, 7, 14, 30]:
    ts_train[f'sales_lag_{lag}'] = ts_train.groupby('store')['sales'].shift(lag)

# Rolling window features
# shift(1) keeps the current day's sales (the target) out of its own window,
# so each feature only uses information available before that day
for window in [7, 14, 30]:
    for stat in ['mean', 'std', 'min', 'max']:
        ts_train[f'sales_rolling_{stat}_{window}'] = (
            ts_train.groupby('store')['sales']
            .transform(lambda s, w=window, f=stat: s.shift(1).rolling(w).agg(f))
        )

# Exponential moving averages (also shifted by one day to avoid target leakage)
for alpha in [0.1, 0.3, 0.5]:
    ts_train[f'sales_ema_{alpha}'] = (
        ts_train.groupby('store')['sales']
        .transform(lambda s, a=alpha: s.shift(1).ewm(alpha=a).mean())
    )

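# Interaction features
ts_train['promo_schoolholiday'] = ts_train['promo'] * ts_train['schoolholiday']
# 'stateholiday' holds string codes ('0', 'a', 'b', 'c'), so convert it to a 0/1 flag first
ts_train['promo_stateholiday'] = ts_train['promo'] * (ts_train['stateholiday'] != '0').astype(int)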


# Set date as index
ts_train.set_index('date', inplace=True)


#### 4.3. StateHoliday Impact Analysis

In [None]:
# Create summary table for holiday impact
stateholiday_analysis = (
    df_features
    .groupby("stateholiday")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
    .sort_values(by="avg_sales", ascending=False)
)

print(stateholiday_analysis)

# Count times stores were closed during holidays (using temp labels)
closed_holiday_days = df_features[(df_features["open"] == 0) & (df_features.stateholiday != "None")].shape[0]
print(f"Number of closed times during holidays: {bold_start}{closed_holiday_days}{bold_end}")

#### 4.4. Day & Seasonality Effects

In [None]:
day_analysis = (
    df_features
    .groupby("day")[["customers","sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers","sales": "avg_sales"})
    .sort_values(by="avg_sales", ascending=False)
)

print(day_analysis)

#### 4.5. Month & Seasonality Effects

In [None]:

month_analysis = (
    df_features
    .groupby("month")[["customers","sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(month_analysis)


#### 4.6. Year & Seasonality Effects

In [None]:

year_analysis = (
    df_features
    .groupby("year")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(year_analysis)

#### 4.7. Promo × DayOfWeek Interaction

In [None]:
promo_dow_analysis = (
    df_features
    .groupby(["promo", "day"])[["customers", "sales",]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(promo_dow_analysis)


In [None]:
# Create a copy to avoid modifying the original DataFrame
df_temp = df_features.copy()

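# Insert promo_flag immediately after the 'promo' column
# ('promo' was mapped to the labels 'Promo' / 'No Promo' above, so compare against the label)
promo_index = df_temp.columns.get_loc("promo")
df_temp.insert(promo_index + 1, "promo_flag", df_temp["promo"] == "Promo")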

promo_dow_analysis = (
    df_temp
    .groupby(["promo", "promo_flag", "day"])[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(promo_dow_analysis)


### Quarterly Analysis

In [None]:
# Simple and robust quarterly analysis
quarter_avg = df_features.groupby('quarter')['sales'].mean().sort_values(ascending=False)

print("Quarterly Sales Analysis:")
print("=" * 50)
print(f"{'Quarter':<10} {'Avg Sales':<12} {'Rank':<6} {'vs Q1':<8}")
print("-" * 50)

for i, (quarter, sales) in enumerate(quarter_avg.items(), 1):
    vs_q1 = ((sales - quarter_avg[1]) / quarter_avg[1]) * 100
    vs_q1_str = f"{vs_q1:+.1f}%" if quarter != 1 else "Base"
    print(f"Q{quarter:<9} €{sales:>8,.0f}    {i:<6} {vs_q1_str:<8}")

print(f"\nQuarterly Summary:")
print("-" * 20)
best_q = quarter_avg.index[0]
worst_q = quarter_avg.index[-1]
range_pct = ((quarter_avg.max() - quarter_avg.min()) / quarter_avg.mean()) * 100

print(f"Best quarter: Q{best_q} (€{quarter_avg[best_q]:,.0f})")
print(f"Worst quarter: Q{worst_q} (€{quarter_avg[worst_q]:,.0f})")
print(f"Performance gap: {range_pct:.1f}%")

# Growth pattern
print(f"\nQuarter-to-Quarter Growth:")
print("-" * 25)
for q in [2, 3, 4]:
    growth = ((quarter_avg[q] - quarter_avg[q-1]) / quarter_avg[q-1]) * 100
    print(f"Q{q-1} to Q{q}: {growth:+.1f}%")

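# Share of total recorded sales, computed from sums rather than from daily averages
quarter_total = df_features.groupby('quarter')['sales'].sum()
print(f"\nKey Insight: Q{best_q} accounts for {(quarter_total[best_q] / quarter_total.sum()) * 100:.1f}% of total recorded sales")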

# ✨ Pro Tips: Best Practices for Reusable Functions in Data Analysis
-----------
Reusable functions streamline analytical workflows by promoting consistency, reducing redundancy, and improving maintainability. To maximize their effectiveness:

 - Design for flexibility: Use parameters and config objects to adapt logic across datasets and use cases.

 - Keep functions atomic: Focus each function on a single task—cleaning, aggregating, visualizing, or exporting.

 - Avoid side effects: Return outputs explicitly; defer file I/O or plotting to higher-level orchestration.

 - Document clearly: Use concise docstrings and intuitive naming for better readability and future reuse.

 - Centralize configuration: Store defaults and settings in external files or global dictionaries for easy updates.

Efficient function design leads to cleaner notebooks, faster iteration, and scalable analysis pipelines.
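
As an illustration of these points, the repeated group-and-average cells earlier in this notebook could be served by one small helper. This is a sketch, not the notebook's own function: `average_by` and its parameters are hypothetical, and the toy frame only mimics the `promo`, `customers`, and `sales` columns.

```python
import pandas as pd


def average_by(df, keys, value_cols=("customers", "sales"), sort_by=None):
    """Average `value_cols` per group of `keys` and prefix the results with 'avg_'.

    Returns a new DataFrame; the input is not modified and nothing is printed.
    """
    group_keys = keys if isinstance(keys, str) else list(keys)
    summary = (
        df.groupby(group_keys)[list(value_cols)]
        .mean()
        .reset_index()
        .rename(columns={col: f"avg_{col}" for col in value_cols})
    )
    if sort_by is not None:
        summary = summary.sort_values(by=sort_by, ascending=False).reset_index(drop=True)
    return summary


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "promo": [0, 0, 1, 1],
    "customers": [500, 700, 800, 1000],
    "sales": [4000, 6000, 8000, 10000],
})

promo_summary = average_by(toy, "promo", sort_by="avg_sales")
print(promo_summary)
```

The function takes its grouping keys and value columns as parameters, does one job, and returns its result instead of printing, which matches the flexibility, atomicity, and no-side-effects guidelines above.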


# 🔧 Generalized Reusable Function

## Impact promo analysis

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns

def clean_promo_analysis(df, sales_col='sales', customers_col='customers', 
                        store_col='store', promo_col='promo', date_col='date', top_n=10):
    """
    Clean and comprehensive promotional impact analysis
    """
    print("🎯 PROMOTIONAL IMPACT ANALYSIS REPORT")
    print("="*60)
    
    # Data preprocessing
    df_clean = df.copy()
    
    # Remove closed stores (sales = 0)
    df_clean = df_clean[df_clean[sales_col] > 0]
    print(f"📊 Data Overview: {len(df_clean):,} records after removing closed days")
    
    # Create binary promo flag (accepts the mapped label 'Promo' as well as raw 1 / '1')
    df_clean['promo_flag'] = df_clean[promo_col].isin(['Promo', 1, '1']).astype(int)
    
    # Get top stores by average sales
    top_stores = df_clean.groupby(store_col)[sales_col].mean().nlargest(top_n).index
    df_analysis = df_clean[df_clean[store_col].isin(top_stores)]
    
    print(f"🏪 Analyzing top {len(top_stores)} stores: {list(top_stores)}")
    print(f"📈 Analysis dataset: {len(df_analysis):,} records")
    
    # Split data
    promo_data = df_analysis[df_analysis['promo_flag'] == 1]
    non_promo_data = df_analysis[df_analysis['promo_flag'] == 0]
    
    print(f"🎯 Promotional days: {len(promo_data):,} ({len(promo_data)/len(df_analysis)*100:.1f}%)")
    print(f"📅 Regular days: {len(non_promo_data):,} ({len(non_promo_data)/len(df_analysis)*100:.1f}%)")
    
    # Calculate key metrics
    results = {}
    
    # Sales metrics
    promo_avg_sales = promo_data[sales_col].mean()
    non_promo_avg_sales = non_promo_data[sales_col].mean()
    sales_lift = promo_avg_sales - non_promo_avg_sales
    sales_lift_pct = (sales_lift / non_promo_avg_sales) * 100
    
    # Customer metrics  
    promo_avg_customers = promo_data[customers_col].mean()
    non_promo_avg_customers = non_promo_data[customers_col].mean()
    customer_lift = promo_avg_customers - non_promo_avg_customers
    customer_lift_pct = (customer_lift / non_promo_avg_customers) * 100
    
    # Efficiency metrics
    promo_sales_per_customer = promo_avg_sales / promo_avg_customers
    non_promo_sales_per_customer = non_promo_avg_sales / non_promo_avg_customers
    efficiency_improvement = ((promo_sales_per_customer - non_promo_sales_per_customer) / 
                             non_promo_sales_per_customer) * 100
    
    # Statistical test (Welch's t-test, which does not assume equal variances)
    t_stat, p_value = ttest_ind(promo_data[sales_col], non_promo_data[sales_col], equal_var=False)
    is_significant = p_value < 0.05
    
    # Store-level analysis
    store_results = []
    for store in top_stores:
        store_data = df_analysis[df_analysis[store_col] == store]
        store_promo = store_data[store_data['promo_flag'] == 1]
        store_regular = store_data[store_data['promo_flag'] == 0]
        
        if len(store_promo) > 0 and len(store_regular) > 0:
            store_sales_lift = ((store_promo[sales_col].mean() - store_regular[sales_col].mean()) / 
                              store_regular[sales_col].mean()) * 100
            store_customer_lift = ((store_promo[customers_col].mean() - store_regular[customers_col].mean()) / 
                                 store_regular[customers_col].mean()) * 100
            
            store_results.append({
                'Store': store,
                'Promo Days': len(store_promo),
                'Regular Days': len(store_regular),
                'Promo Rate (%)': len(store_promo) / len(store_data) * 100,
                'Sales Lift (%)': store_sales_lift,
                'Customer Lift (%)': store_customer_lift,
                'Promo Avg Sales': store_promo[sales_col].mean(),
                'Regular Avg Sales': store_regular[sales_col].mean(),
                'Promo Avg Customers': store_promo[customers_col].mean(),
                'Regular Avg Customers': store_regular[customers_col].mean()
            })
    
    store_df = pd.DataFrame(store_results)
    
    # Print results
    print(f"\n💰 SALES PERFORMANCE ANALYSIS")
    print(f"="*40)
    print(f"🎯 Average Sales (Promotional): €{promo_avg_sales:,.0f}")
    print(f"📊 Average Sales (Regular): €{non_promo_avg_sales:,.0f}")
    print(f"⬆️  Absolute Sales Lift: €{sales_lift:,.0f}")
    # ':+' prints the real sign, so a negative lift is not shown as '+-'
    print(f"📈 Percentage Sales Lift: {sales_lift_pct:+.2f}%")
    
    print(f"\n👥 CUSTOMER TRAFFIC ANALYSIS")
    print(f"="*40)
    print(f"🎯 Average Customers (Promotional): {promo_avg_customers:,.0f}")
    print(f"📊 Average Customers (Regular): {non_promo_avg_customers:,.0f}")
    print(f"⬆️  Customer Traffic Lift: {customer_lift:+.0f}")
    print(f"📈 Customer Traffic Lift: {customer_lift_pct:+.2f}%")
    
    print(f"\n🎯 EFFICIENCY & PROFITABILITY")
    print(f"="*40)
    print(f"💳 Sales per Customer (Promotional): €{promo_sales_per_customer:.2f}")
    print(f"💳 Sales per Customer (Regular): €{non_promo_sales_per_customer:.2f}")
    print(f"📊 Spending Efficiency Gain: {efficiency_improvement:+.2f}%")
    
    print(f"\n📊 STATISTICAL VALIDATION")
    print(f"="*40)
    print(f"🧮 T-Statistic: {t_stat:.2f}")
    print(f"📈 P-Value: {p_value:.6f}")
    print(f"✅ Statistically Significant: {'YES' if is_significant else 'NO'} (α=0.05)")
    
    # Business insights
    print(f"\n💡 KEY BUSINESS INSIGHTS")
    print(f"="*40)
    
    if sales_lift_pct > 50:
        print(f"🚀 EXCEPTIONAL PERFORMANCE: Promotions drive outstanding sales growth!")
        recommendation = "MAXIMIZE promotional frequency - ROI is excellent"
    elif sales_lift_pct > 25:
        print(f"✅ STRONG PERFORMANCE: Promotions are highly effective")
        recommendation = "INCREASE promotional activities strategically"
    elif sales_lift_pct > 10:
        print(f"👍 GOOD PERFORMANCE: Promotions show solid results")
        recommendation = "MAINTAIN current promotional strategy"
    elif sales_lift_pct > 0:
        print(f"⚠️  WEAK PERFORMANCE: Minimal promotional benefit")
        recommendation = "REVIEW promotional costs vs benefits"
    else:
        print(f"❌ NEGATIVE IMPACT: Promotions may be hurting performance")
        recommendation = "URGENT REVIEW of promotional strategy needed"
    
    print(f"📋 RECOMMENDATION: {recommendation}")
    
    # Traffic vs Spending analysis
    if customer_lift_pct > efficiency_improvement:
        print(f"👥 PRIMARY DRIVER: Promotions mainly drive FOOT TRAFFIC (+{customer_lift_pct:.1f}%)")
        print(f"   → Focus on conversion and upselling during promotions")
    elif efficiency_improvement > customer_lift_pct:
        print(f"💰 PRIMARY DRIVER: Promotions increase SPENDING PER VISIT (+{efficiency_improvement:.1f}%)")
        print(f"   → Excellent basket size improvement")
    else:
        print(f"⚖️  BALANCED IMPACT: Both traffic and spending improve equally")
    
    # Store performance insights
    if not store_df.empty:
        best_store = store_df.loc[store_df['Sales Lift (%)'].idxmax()]
        worst_store = store_df.loc[store_df['Sales Lift (%)'].idxmin()]
        
        print(f"\n🏆 TOP PERFORMING STORE: #{int(best_store['Store'])}")
        print(f"   📈 Sales Lift: +{best_store['Sales Lift (%)']:.1f}%")
        print(f"   👥 Customer Lift: +{best_store['Customer Lift (%)']:.1f}%")
        print(f"   🎯 Promo Rate: {best_store['Promo Rate (%)']:.1f}%")
        
        print(f"\n📉 LOWEST PERFORMING STORE: #{int(worst_store['Store'])}")
        print(f"   📈 Sales Lift: {worst_store['Sales Lift (%)']:+.1f}%")
        print(f"   👥 Customer Lift: {worst_store['Customer Lift (%)']:+.1f}%")
        print(f"   🎯 Promo Rate: {worst_store['Promo Rate (%)']:.1f}%")
        
        avg_lift = store_df['Sales Lift (%)'].mean()
        consistent_stores = ((store_df['Sales Lift (%)'] - avg_lift).abs() < 10).sum()
        
        print(f"\n📊 CONSISTENCY ANALYSIS:")
        print(f"   🎯 Average Lift Across Stores: +{avg_lift:.1f}%")
        print(f"   📏 Performance Consistency: {consistent_stores}/{len(store_df)} stores within ±10 percentage points of the average lift")
        
        if consistent_stores / len(store_df) > 0.8:
            print(f"   ✅ HIGHLY CONSISTENT: Promotional lift is similar across most stores")
        elif consistent_stores / len(store_df) > 0.6:
            print(f"   👍 MODERATELY CONSISTENT: Most stores benefit similarly")
        else:
            print(f"   ⚠️  INCONSISTENT: Results vary significantly by store")
            print(f"   → Investigate store-specific factors affecting promotional performance")
    
    # Time-based insights (if date available)
    if date_col in df_analysis.columns:
        dates = pd.to_datetime(df_analysis[date_col])  # parse once, reuse for both columns
        df_analysis['month'] = dates.dt.month_name()
        df_analysis['weekday'] = dates.dt.day_name()
        
        # Monthly performance
        monthly_promo = df_analysis[df_analysis['promo_flag']==1].groupby('month')[sales_col].mean()
        monthly_regular = df_analysis[df_analysis['promo_flag']==0].groupby('month')[sales_col].mean()
        monthly_lift = ((monthly_promo - monthly_regular) / monthly_regular * 100).round(1)
        
        best_month = monthly_lift.idxmax()
        worst_month = monthly_lift.idxmin()
        
        print(f"\n📅 SEASONAL INSIGHTS:")
        print(f"   🏆 Best Month for Promos: {best_month} ({monthly_lift[best_month]:+.1f}%)")
        print(f"   📉 Worst Month for Promos: {worst_month} ({monthly_lift[worst_month]:+.1f}%)")
    
    print(f"\n🎉 ANALYSIS COMPLETE!")
    
    # Return structured results
    return {
        'summary_metrics': {
            'sales_lift_pct': sales_lift_pct,
            'customer_lift_pct': customer_lift_pct,
            'efficiency_improvement': efficiency_improvement,
            'statistical_significance': is_significant,
            'p_value': p_value
        },
        'store_performance': store_df,
        'raw_data': {
            'promo_avg_sales': promo_avg_sales,
            'regular_avg_sales': non_promo_avg_sales,
            'promo_avg_customers': promo_avg_customers,
            'regular_avg_customers': non_promo_avg_customers
        }
    }

def create_promo_visualization(results_dict, store_df):
    """
    Create visualizations for promotional analysis
    """
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Sales comparison
    metrics = ['Promo', 'Regular']
    sales_values = [results_dict['raw_data']['promo_avg_sales'], 
                   results_dict['raw_data']['regular_avg_sales']]
    
    bars1 = ax1.bar(metrics, sales_values, color=['#ff6b6b', '#4ecdc4'], alpha=0.8)
    ax1.set_title('Average Sales: Promotional vs Regular Days', fontsize=14, fontweight='bold')
    ax1.set_ylabel('Average Sales ($)')
    
    # Add value labels on bars
    for bar in bars1:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'${height:,.0f}', ha='center', va='bottom', fontsize=12)
    
    # 2. Customer traffic comparison
    customer_values = [results_dict['raw_data']['promo_avg_customers'],
                      results_dict['raw_data']['regular_avg_customers']]
    
    bars2 = ax2.bar(metrics, customer_values, color=['#ff9f43', '#54a0ff'], alpha=0.8)
    ax2.set_title('Average Customer Traffic: Promotional vs Regular Days', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Average Customers')
    
    for bar in bars2:
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:,.0f}', ha='center', va='bottom', fontsize=12)
    
    # 3. Store performance distribution
    if not store_df.empty:
        ax3.hist(store_df['Sales Lift (%)'], bins=8, alpha=0.7, color='#ff6b6b', edgecolor='black')
        ax3.set_title('Distribution of Sales Lift Across Stores', fontsize=14, fontweight='bold')
        ax3.set_xlabel('Sales Lift (%)')
        ax3.set_ylabel('Number of Stores')
        ax3.axvline(store_df['Sales Lift (%)'].mean(), color='red', linestyle='--', 
                   label=f'Mean: {store_df["Sales Lift (%)"].mean():.1f}%')
        ax3.legend()
    
    # 4. Key metrics summary
    lift_pct = results_dict['summary_metrics']['sales_lift_pct']
    customer_lift_pct = results_dict['summary_metrics']['customer_lift_pct']
    efficiency = results_dict['summary_metrics']['efficiency_improvement']
    
    metrics_names = ['Sales Lift', 'Customer Lift', 'Efficiency Gain']
    metrics_values = [lift_pct, customer_lift_pct, efficiency]
    colors = ['#ff6b6b', '#4ecdc4', '#45aaf2']
    
    bars4 = ax4.bar(metrics_names, metrics_values, color=colors, alpha=0.8)
    ax4.set_title('Key Performance Metrics (%)', fontsize=14, fontweight='bold')
    ax4.set_ylabel('Improvement (%)')
    
    for bar in bars4:
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:+.1f}%', ha='center', va='bottom', fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    return fig


results = clean_promo_analysis(df_features)

In [None]:
# Persist train_df with IPython's %store magic; another notebook can restore it with `%store -r train_df`
%store train_df


---------------------------

In [None]:
print("✅ Data Ingestion and Exploratory Data Analysis completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")

--------------------------------

In [None]:
# End analysis
analysis_end = pd.Timestamp.now()
duration = analysis_end - analysis_begin

# Final summary print
print("\n📋 Analysis Summary")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date:   {bold_start}{analysis_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration:   {bold_start}{str(duration)}{bold_end}")

-------------------------
## Project Design Rationale: Notebook Separation

To promote **clarity, maintainability, and scalability**, **data engineering** and **visualization tasks** are intentionally separated into distinct notebooks. This modular approach keeps any single notebook from accumulating excessive code, which makes it easier to **debug, update, and collaborate across different stages of the workflow**. Isolating data transformation logic from visual analysis keeps **each notebook focused and purpose-driven** and **improves the efficiency and readability of the project**.
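
One way to hand data from the data-engineering notebook to the visualization notebook is to persist the DataFrame to disk and reload it, as an alternative to the `%store` magic used earlier. The sketch below is a minimal illustration rather than the project's actual hand-off: the file name `train_df.pkl`, the temporary directory, and the toy column values are assumptions made for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the engineered DataFrame (hypothetical values).
train_df = pd.DataFrame(
    {
        "Store": [1, 1, 2],
        "Date": pd.to_datetime(["2015-07-29", "2015-07-30", "2015-07-30"]),
        "Sales": [5000, 5200, 6100],
        "Promo": [1, 0, 1],
    }
)

# A location both notebooks can reach; a temp dir keeps this sketch self-contained.
shared_dir = Path(tempfile.mkdtemp())
handoff_path = shared_dir / "train_df.pkl"  # hypothetical file name

# --- In the data-engineering notebook: persist the result ---
train_df.to_pickle(handoff_path)

# --- In the visualization notebook: reload it ---
restored_df = pd.read_pickle(handoff_path)

# Pickle preserves dtypes (including datetime64), so the frames should match.
print(restored_df.equals(train_df))
print(restored_df.dtypes)
```

Unlike `%store`, a file on disk can also be read outside IPython (for example, from a plain script), at the cost of managing the path yourself.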


ARNAUD DAVY - MUKWA NDUDI
-----------------------------------------------------------------