# 🛠️ Feature Engineering Overview – Rossmann Store Sales

This step transforms raw inputs into features that capture temporal patterns, store-specific characteristics, and external influences, all of which matter for forecasting daily sales accurately.

Key feature engineering strategies:

- **Date-based decomposition:** Extracted features like day of week, month, year, and week of year to model seasonality and calendar effects.

- **Lag and rolling statistics:** Introduced lagged sales values and rolling averages to reflect short-term trends and autocorrelation.

- **Store-level attributes:** Included static features such as store type, assortment level, and competition distance to differentiate store behavior.

- **Promotion indicators:** Engineered features from Promo, Promo2, and PromoInterval to capture promotional impact over time.

- **Competition and holiday effects:** Added flags and time-since metrics for competition openings and holidays to account for external disruptions.

- **Interaction terms:** Created combined features (e.g., StoreType × DayOfWeek) to capture complex relationships.

By enriching the dataset with these engineered features, the model is better equipped to learn from historical patterns and contextual signals, ultimately enhancing predictive performance.
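
As a rough illustration of the competition and promotion bullets above, the sketch below derives a "months since competition opened" metric and a Promo2 month flag. It is not the notebook's own code: the helper name `add_competition_and_promo2_features` is hypothetical, and the column names (`CompetitionOpenSinceYear`, `CompetitionOpenSinceMonth`, `Promo2`, `PromoInterval`) and the month abbreviations (including `Sept`) are assumed to match the public Rossmann store file. The Promo2 start date is ignored to keep the example short.

```python
import numpy as np
import pandas as pd

# Month abbreviations assumed to match the PromoInterval strings (note "Sept").
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]


def add_competition_and_promo2_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a time-since-competition metric and a Promo2 month flag."""
    out = df.copy()
    date = pd.to_datetime(out["Date"])

    # Months elapsed since the nearest competitor opened.
    # NaN when the opening date is unknown, 0 when it has not opened yet.
    months_open = (
        (date.dt.year - out["CompetitionOpenSinceYear"]) * 12
        + (date.dt.month - out["CompetitionOpenSinceMonth"])
    )
    out["competition_months_open"] = months_open.clip(lower=0)

    # 1 if the store takes part in Promo2 and the row's month is listed in PromoInterval.
    month_name = date.dt.month.map(lambda m: MONTH_ABBR[m - 1])
    intervals = out["PromoInterval"].fillna("").str.split(",")
    in_interval = pd.Series(
        [m in months for m, months in zip(month_name, intervals)],
        index=out.index,
    )
    out["promo2_month"] = (out["Promo2"].eq(1) & in_interval).astype(int)
    return out


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Date": ["2015-07-01", "2015-08-01", "2015-07-01"],
    "CompetitionOpenSinceYear": [2014.0, np.nan, 2016.0],
    "CompetitionOpenSinceMonth": [7.0, np.nan, 1.0],
    "Promo2": [1, 1, 0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", None],
})
features = add_competition_and_promo2_features(demo)
print(features[["competition_months_open", "promo2_month"]])
```

Returning a new frame instead of mutating the input keeps the helper safe to call repeatedly in a notebook.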

## 1. Setup & Import Libraries
---------------------------------------

In [1]:
import time 

In [2]:
# Step 1: Setup and Import Libraries
print("Step 1: Setup and Import Libraries in progress...")
time.sleep(1)  # Simulate processing time

Step 1: Setup and Import Libraries in progress...


In [3]:
# Data Manipulation & Processing
import os
import holidays
import pandas as pd
import numpy as np
from pathlib import Path
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Warnings
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')

print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 1")
print("="*60)
print("All libraries imported successfully!")

Rossmann Store Sales Time Series Analysis - Part 1
All libraries imported successfully!


In [4]:
print("✅ Setup and Import Libraries completed.\n")

✅ Setup and Import Libraries completed.



In [5]:
# Start analysis

analysis_begin = pd.Timestamp.now()

bold_start = '\033[1m'
bold_end = '\033[0m'

print("🔍 Analysis Started")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")

🔍 Analysis Started
🟢 Begin Date: [1m2025-08-13 21:19:58[0m




## 2. Restore Stored DataFrames
----------------------------

In [6]:
# Step 2: Data Ingestion
print("Step 2: Data Ingestion in progress...")
time.sleep(1)  # Simulate processing time

Step 2: Data Ingestion in progress...


In [7]:
# Restore train_df and df_viz_feat, saved with %store in a previous notebook (JupyterLab)
%store -r train_df
%store -r df_viz_feat

In [8]:
print("✅ Data Ingestion completed.\n")

✅ Data Ingestion completed.



# 3. Exploratory Data Analysis (EDA)
## 3.1. Basic Inspection
-------------------

In [9]:
# Step 3: Exploratory Data Analysis (EDA)
print("📊 Step 3: Exploratory Data Analysis in progress...")
time.sleep(1)  # Simulate processing time

📊 Step 3: Exploratory Data Analysis in progress...


In [10]:
train_df.columns = train_df.columns.str.lower()

# --- BASIC INFO AND DUPLICATES ---
print("DataFrame Info:")
train_df.info()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 982644 entries, 982643 to 0
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   store          982644 non-null  int64 
 1   dayofweek      982644 non-null  int64 
 2   date           982644 non-null  object
 3   sales          982644 non-null  int64 
 4   customers      982644 non-null  int64 
 5   open           982644 non-null  int64 
 6   promo          982644 non-null  int64 
 7   stateholiday   982644 non-null  object
 8   schoolholiday  982644 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 75.0+ MB


## Feature Engineering for ML
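
The sine and cosine features in this section place periodic values on a circle, so that the last value of a cycle sits next to the first. The sketch below is only an illustration of that idea; `cyclical_encode` and `circle_distance` are hypothetical helper names, not part of the notebook.

```python
import numpy as np


def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def circle_distance(a, b, period):
    """Euclidean distance between two values after cyclical encoding."""
    sin_a, cos_a = cyclical_encode(a, period)
    sin_b, cos_b = cyclical_encode(b, period)
    return float(np.hypot(sin_a - sin_b, cos_a - cos_b))


# December (12) and January (1) are 11 apart as raw integers,
# but on the circle they are as close as June and July.
dec_jan = circle_distance(12, 1, period=12)
jun_jul = circle_distance(6, 7, period=12)
dec_jun = circle_distance(12, 6, period=12)  # opposite sides of the circle

print(f"Dec-Jan: {dec_jan:.3f}, Jun-Jul: {jun_jul:.3f}, Dec-Jun: {dec_jun:.3f}")
```

Both the sine and the cosine are needed: either one alone maps two different points of the cycle to the same value.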

In [11]:
# Keep only open days with positive sales (closed days carry no demand signal)
ts_train = train_df[(train_df['open'] == 1) & (train_df['sales'] > 0)].copy()

# Ensure 'date' is datetime before sorting
ts_train['date'] = pd.to_datetime(ts_train['date'])

# Sort chronologically so lags and rolling windows follow calendar order
ts_train.sort_values('date', ascending=True, inplace=True)

# Temporal features
ts_train['dayofmonth'] = ts_train['date'].dt.day
ts_train['dayofyear'] = ts_train['date'].dt.dayofyear
ts_train['weekofyear'] = ts_train['date'].dt.isocalendar().week
ts_train['month'] = ts_train['date'].dt.month
ts_train['quarter'] = ts_train['date'].dt.quarter
ts_train['year'] = ts_train['date'].dt.year

# Cyclical features
ts_train['day_sin'] = np.sin(2 * np.pi * ts_train['dayofweek'] / 7)
ts_train['day_cos'] = np.cos(2 * np.pi * ts_train['dayofweek'] / 7)
ts_train['month_sin'] = np.sin(2 * np.pi * ts_train['month'] / 12)
ts_train['month_cos'] = np.cos(2 * np.pi * ts_train['month'] / 12)
ts_train['week_sin'] = np.sin(2 * np.pi * ts_train['weekofyear'] / 52)
ts_train['week_cos'] = np.cos(2 * np.pi * ts_train['weekofyear'] / 52)

# Business features
ts_train['isweekend'] = (ts_train['dayofweek'] > 5).astype(int)
ts_train['ismonthstart'] = ts_train['date'].dt.is_month_start.astype(int)
ts_train['ismonthend'] = ts_train['date'].dt.is_month_end.astype(int)
ts_train['isquarterstart'] = ts_train['date'].dt.is_quarter_start.astype(int)
ts_train['isquarterend'] = ts_train['date'].dt.is_quarter_end.astype(int)

# Lag features
for lag in [1, 2, 3, 7, 14, 30]:
    ts_train[f'sales_lag_{lag}'] = ts_train.groupby('store')['sales'].shift(lag)

# Rolling window features
# shift(1) keeps the current day's sales (the target) out of its own window
sales_by_store = ts_train.groupby('store')['sales']
for window in [7, 14, 30]:
    for stat in ['mean', 'std', 'min', 'max']:
        ts_train[f'sales_rolling_{stat}_{window}'] = sales_by_store.transform(
            lambda s: s.shift(1).rolling(window).agg(stat)
        )

# Exponential moving averages (also computed on past sales only)
for alpha in [0.1, 0.3, 0.5]:
    ts_train[f'sales_ema_{alpha}'] = sales_by_store.transform(
        lambda s: s.shift(1).ewm(alpha=alpha).mean()
    )

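# Interaction features
ts_train['promo_schoolholiday'] = ts_train['promo'] * ts_train['schoolholiday']
# stateholiday holds string codes, so convert it to a 0/1 flag before multiplying
# (assumes the raw Rossmann coding, where '0' means no state holiday)
is_stateholiday = (ts_train['stateholiday'].astype(str) != '0').astype(int)
ts_train['promo_stateholiday'] = ts_train['promo'] * is_stateholiday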


# Set date as index
ts_train.set_index('date', inplace=True)
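

Lag and rolling features are only safe for forecasting when they use information available before the day being predicted. The toy sketch below (made-up numbers, hypothetical column names) shows one way to compute per-store lags and a rolling mean over previous days only.

```python
import pandas as pd

# Made-up sales for two stores over four days.
toy = pd.DataFrame({
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"] * 2
    ),
    "sales": [10, 20, 30, 40, 100, 200, 300, 400],
}).sort_values(["store", "date"])

by_store = toy.groupby("store")["sales"]

# Lag: the previous row's sales for the same store.
toy["sales_lag_1"] = by_store.shift(1)

# Rolling mean over the two previous rows; shift(1) keeps the current
# row's sales out of its own window.
toy["sales_prev_mean_2"] = by_store.transform(
    lambda s: s.shift(1).rolling(2).mean()
)

print(toy)
```

The first rows of each store are NaN by construction, so they need to be dropped or imputed before training.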


#### 4.3. StateHoliday Impact Analysis

In [12]:
# Create summary table for holiday impact
stateholiday_analysis = (
    df_viz_feat
    .groupby("stateholiday")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
    .sort_values(by="avg_sales", ascending=False)
)

print(stateholiday_analysis)

# Count closed store-days (all days). To count only holiday closures,
# add the filter `stateholiday != "Normal Day"`.
closed_days = df_viz_feat[df_viz_feat["open"] == 0].shape[0]
print(f"Number of closed store-days: {bold_start}{closed_days}{bold_end}")

  stateholiday  avg_customers  avg_sales
2   Normal Day         652.11    5940.39
3       Public          43.82     290.74
1       Easter          36.56     214.31
0    Christmas          27.17     168.73
Number of closed store-days: 168440


#### 4.4. Day & Seasonality Effects

In [13]:
day_analysis = (
    df_viz_feat
    .groupby("day")[["customers","sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers","sales": "avg_sales"})
    .sort_values(by="avg_sales", ascending=False)
)

print(day_analysis)

   day  avg_customers  avg_sales
1  Mon         812.93    7797.64
5  Tue         761.86    7005.52
0  Fri         742.53    6703.50
6  Wed         721.20    6536.45
4  Thu         695.78    6216.11
2  Sat         658.76    5856.78
3  Sun          35.58     202.62


#### 4.5. Month & Seasonality Effects

In [14]:

month_analysis = (
    df_viz_feat
    .groupby("month")[["customers","sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(month_analysis)


   month  avg_customers  avg_sales
2    Dec         703.07    6826.61
5    Jul         663.59    6022.61
9    Nov         654.15    6008.11
7    Mar         629.40    5784.58
6    Jun         624.79    5760.96
0    Apr         630.61    5738.87
1    Aug         642.50    5693.02
3    Feb         626.72    5645.25
11   Sep         634.44    5570.25
10   Oct         631.10    5537.04
8    May         601.99    5489.64
4    Jan         601.62    5465.40


#### 4.6. Year & Seasonality Effects

In [15]:

year_analysis = (
    df_viz_feat
    .groupby("year")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(year_analysis)

   year  avg_customers  avg_sales
1  2014         643.27    5833.29
2  2015         620.84    5832.95
0  2013         629.04    5658.53


#### 4.7. Promo × DayOfWeek Interaction

In [16]:
promo_dow_analysis = (
    df_viz_feat
    .groupby(["promo", "day"])[["customers", "sales",]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(promo_dow_analysis)


       promo  day  avg_customers  avg_sales
8      Promo  Mon         938.67    9709.13
10     Promo  Tue         837.47    8226.49
11     Promo  Wed         785.91    7540.85
9      Promo  Thu         774.00    7241.50
7      Promo  Fri         765.53    7168.90
0   No Promo  Fri         716.68    6180.31
2   No Promo  Sat         658.76    5856.78
5   No Promo  Tue         675.35    5608.49
1   No Promo  Mon         666.24    5567.56
6   No Promo  Wed         648.26    5404.23
4   No Promo  Thu         607.85    5063.39
3   No Promo  Sun          35.58     202.62


In [17]:
# Create a copy to avoid modifying the original DataFrame
df_temp = df_viz_feat.copy()

# Insert promo_flag immediately after the 'promo' column
promo_index = df_temp.columns.get_loc("promo")
df_temp.insert(promo_index + 1, "promo_flag", df_temp["promo"] == "Promo")

promo_dow_analysis = (
    df_temp
    .groupby(["promo", "promo_flag", "day"])[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"sales": "avg_sales", "customers": "avg_customers"})
    .sort_values(by="avg_sales", ascending=False)
)

print(promo_dow_analysis)


       promo  promo_flag  day  avg_customers  avg_sales
8      Promo        True  Mon         938.67    9709.13
10     Promo        True  Tue         837.47    8226.49
11     Promo        True  Wed         785.91    7540.85
9      Promo        True  Thu         774.00    7241.50
7      Promo        True  Fri         765.53    7168.90
0   No Promo       False  Fri         716.68    6180.31
2   No Promo       False  Sat         658.76    5856.78
5   No Promo       False  Tue         675.35    5608.49
1   No Promo       False  Mon         666.24    5567.56
6   No Promo       False  Wed         648.26    5404.23
4   No Promo       False  Thu         607.85    5063.39
3   No Promo       False  Sun          35.58     202.62


### Quarterly Analysis

In [18]:
# Simple and robust quarterly analysis
quarter_avg = df_viz_feat.groupby('quarter')['sales'].mean().sort_values(ascending=False)

print("Quarterly Sales Analysis:")
print("=" * 50)
print(f"{'Quarter':<10} {'Avg Sales':<12} {'Rank':<6} {'vs Q1':<8}")
print("-" * 50)

for i, (quarter, sales) in enumerate(quarter_avg.items(), 1):
    vs_q1 = ((sales - quarter_avg[1]) / quarter_avg[1]) * 100
    vs_q1_str = f"{vs_q1:+.1f}%" if quarter != 1 else "Base"
    print(f"Q{quarter:<9} €{sales:>8,.0f}    {i:<6} {vs_q1_str:<8}")

print(f"\nQuarterly Summary:")
print("-" * 20)
best_q = quarter_avg.index[0]
worst_q = quarter_avg.index[-1]
range_pct = ((quarter_avg.max() - quarter_avg.min()) / quarter_avg.mean()) * 100

print(f"Best quarter: Q{best_q} (€{quarter_avg[best_q]:,.0f})")
print(f"Worst quarter: Q{worst_q} (€{quarter_avg[worst_q]:,.0f})")
print(f"Performance gap: {range_pct:.1f}%")

# Growth pattern
print(f"\nQuarter-to-Quarter Growth:")
print("-" * 25)
for q in [2, 3, 4]:
    growth = ((quarter_avg[q] - quarter_avg[q-1]) / quarter_avg[q-1]) * 100
    print(f"Q{q-1} to Q{q}: {growth:+.1f}%")

print(f"\nKey Insight: Q{best_q} accounts for {((quarter_avg[best_q]/quarter_avg.sum())*100):.1f}% of the sum of quarterly average daily sales")

Quarterly Sales Analysis:
Quarter    Avg Sales    Rank   vs Q1   
--------------------------------------------------
Q4         €   6,125    1      +8.8%   
Q3         €   5,764    2      +2.4%   
Q2         €   5,661    3      +0.5%   
Q1         €   5,631    4      Base    

Quarterly Summary:
--------------------
Best quarter: Q4 (€6,125)
Worst quarter: Q1 (€5,631)
Performance gap: 8.5%

Quarter-to-Quarter Growth:
-------------------------
Q1 to Q2: +0.5%
Q2 to Q3: +1.8%
Q3 to Q4: +6.3%

Key Insight: Q4 accounts for 26.4% of the sum of quarterly average daily sales
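
A share of total revenue is usually computed from summed sales rather than from per-quarter averages, because quarters can contain different numbers of store-days. The sketch below uses made-up numbers to show how the two ratios can diverge; it does not use the notebook's data.

```python
import pandas as pd

# Made-up data: Q4 has fewer rows but higher sales per row.
toy = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "sales":   [50, 50, 50, 60, 60, 60, 60, 60, 60, 90, 90],
})

by_quarter = toy.groupby("quarter")["sales"]

# Share of the sum of quarterly averages (what a mean-based ratio measures).
mean_share = by_quarter.mean() / by_quarter.mean().sum() * 100

# Share of total revenue (sum-based).
revenue_share = by_quarter.sum() / toy["sales"].sum() * 100

comparison = pd.DataFrame(
    {"mean_share_pct": mean_share, "revenue_share_pct": revenue_share}
).round(1)
print(comparison)
```

When every quarter has the same number of rows the two columns agree; otherwise the sum-based share is the one that describes revenue.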


# ✨ Pro Tips: Best Practices for Reusable Functions in Data Analysis
-----------
Reusable functions streamline analytical workflows by promoting consistency, reducing redundancy, and improving maintainability. To maximize their effectiveness:

 - Design for flexibility: Use parameters and config objects to adapt logic across datasets and use cases.

 - Keep functions atomic: Focus each function on a single task—cleaning, aggregating, visualizing, or exporting.

 - Avoid side effects: Return outputs explicitly; defer file I/O or plotting to higher-level orchestration.

 - Document clearly: Use concise docstrings and intuitive naming for better readability and future reuse.

 - Centralize configuration: Store defaults and settings in external files or global dictionaries for easy updates.

Efficient function design leads to cleaner notebooks, faster iteration, and scalable analysis pipelines.
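
As one possible illustration of these guidelines, the repeated group-and-average tables in this notebook could come from a single helper. The sketch below is hypothetical: `mean_by` and `DEFAULT_METRICS` are names introduced here, not part of the original notebook, and the demo data is made up.

```python
import pandas as pd

DEFAULT_METRICS = ("customers", "sales")  # central default, easy to override


def mean_by(df: pd.DataFrame, keys, metrics=DEFAULT_METRICS, sort_by=None) -> pd.DataFrame:
    """Return the mean of `metrics` per group in `keys`, as avg_* columns.

    Pure function: no printing, plotting, or file I/O. The caller decides
    what to do with the returned table.
    """
    metrics = list(metrics)
    keys = [keys] if isinstance(keys, str) else list(keys)
    out = (
        df.groupby(keys)[metrics]
        .mean()
        .reset_index()
        .rename(columns={m: f"avg_{m}" for m in metrics})
    )
    # Default: sort by the last metric, highest first.
    sort_col = sort_by or f"avg_{metrics[-1]}"
    return out.sort_values(by=sort_col, ascending=False, ignore_index=True)


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "day": ["Mon", "Mon", "Sun"],
    "customers": [800, 820, 30],
    "sales": [7000, 8000, 200],
})
table = mean_by(demo, "day")
print(table)
```

Assuming `df_viz_feat` has the columns used earlier, calls such as `mean_by(df_viz_feat, "day")` or `mean_by(df_viz_feat, ["promo", "day"])` would produce the same kind of summary tables, apart from the row index.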


# 🔧 Generalized Reusable Function

## Impact promo analysis

In [19]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns

def clean_promo_analysis(df, sales_col='sales', customers_col='customers', 
                        store_col='store', promo_col='promo', date_col='date', top_n=10):
    """
    Clean and comprehensive promotional impact analysis
    """
    print("🎯 PROMOTIONAL IMPACT ANALYSIS REPORT")
    print("="*60)
    
    # Data preprocessing: drop closed days (sales = 0) and work on an independent copy
    df_clean = df[df[sales_col] > 0].copy()
    print(f"📊 Data Overview: {len(df_clean):,} records after removing closed days")
    
    # Create binary promo flag
    df_clean['promo_flag'] = (df_clean[promo_col] == 'Promo').astype(int)
    
    # Get top stores by average sales
    top_stores = df_clean.groupby(store_col)[sales_col].mean().nlargest(top_n).index
    df_analysis = df_clean[df_clean[store_col].isin(top_stores)]
    
    print(f"🏪 Analyzing top {len(top_stores)} stores: {list(top_stores)}")
    print(f"📈 Analysis dataset: {len(df_analysis):,} records")
    
    # Split data
    promo_data = df_analysis[df_analysis['promo_flag'] == 1]
    non_promo_data = df_analysis[df_analysis['promo_flag'] == 0]
    
    print(f"🎯 Promotional days: {len(promo_data):,} ({len(promo_data)/len(df_analysis)*100:.1f}%)")
    print(f"📅 Regular days: {len(non_promo_data):,} ({len(non_promo_data)/len(df_analysis)*100:.1f}%)")
    
    # Calculate key metrics
    results = {}
    
    # Sales metrics
    promo_avg_sales = promo_data[sales_col].mean()
    non_promo_avg_sales = non_promo_data[sales_col].mean()
    sales_lift = promo_avg_sales - non_promo_avg_sales
    sales_lift_pct = (sales_lift / non_promo_avg_sales) * 100
    
    # Customer metrics  
    promo_avg_customers = promo_data[customers_col].mean()
    non_promo_avg_customers = non_promo_data[customers_col].mean()
    customer_lift = promo_avg_customers - non_promo_avg_customers
    customer_lift_pct = (customer_lift / non_promo_avg_customers) * 100
    
    # Efficiency metrics
    promo_sales_per_customer = promo_avg_sales / promo_avg_customers
    non_promo_sales_per_customer = non_promo_avg_sales / non_promo_avg_customers
    efficiency_improvement = ((promo_sales_per_customer - non_promo_sales_per_customer) / 
                             non_promo_sales_per_customer) * 100
    
    # Statistical test
    t_stat, p_value = ttest_ind(promo_data[sales_col], non_promo_data[sales_col])
    is_significant = p_value < 0.05
    
    # Store-level analysis
    store_results = []
    for store in top_stores:
        store_data = df_analysis[df_analysis[store_col] == store]
        store_promo = store_data[store_data['promo_flag'] == 1]
        store_regular = store_data[store_data['promo_flag'] == 0]
        
        if len(store_promo) > 0 and len(store_regular) > 0:
            store_sales_lift = ((store_promo[sales_col].mean() - store_regular[sales_col].mean()) / 
                              store_regular[sales_col].mean()) * 100
            store_customer_lift = ((store_promo[customers_col].mean() - store_regular[customers_col].mean()) / 
                                 store_regular[customers_col].mean()) * 100
            
            store_results.append({
                'Store': store,
                'Promo Days': len(store_promo),
                'Regular Days': len(store_regular),
                'Promo Rate (%)': len(store_promo) / len(store_data) * 100,
                'Sales Lift (%)': store_sales_lift,
                'Customer Lift (%)': store_customer_lift,
                'Promo Avg Sales': store_promo[sales_col].mean(),
                'Regular Avg Sales': store_regular[sales_col].mean(),
                'Promo Avg Customers': store_promo[customers_col].mean(),
                'Regular Avg Customers': store_regular[customers_col].mean()
            })
    
    store_df = pd.DataFrame(store_results)
    
    # Print results
    print(f"\n💰 SALES PERFORMANCE ANALYSIS")
    print(f"="*40)
    print(f"🎯 Average Sales (Promotional): ${promo_avg_sales:,.0f}")
    print(f"📊 Average Sales (Regular): ${non_promo_avg_sales:,.0f}")
    print(f"⬆️  Absolute Sales Lift: ${sales_lift:,.0f}")
    print(f"📈 Percentage Sales Lift: +{sales_lift_pct:.2f}%")
    
    print(f"\n👥 CUSTOMER TRAFFIC ANALYSIS") 
    print(f"="*40)
    print(f"🎯 Average Customers (Promotional): {promo_avg_customers:,.0f}")
    print(f"📊 Average Customers (Regular): {non_promo_avg_customers:,.0f}")
    print(f"⬆️  Customer Traffic Lift: +{customer_lift:.0f}")
    print(f"📈 Customer Traffic Lift: +{customer_lift_pct:.2f}%")
    
    print(f"\n🎯 EFFICIENCY & PROFITABILITY")
    print(f"="*40)
    print(f"💳 Sales per Customer (Promotional): ${promo_sales_per_customer:.2f}")
    print(f"💳 Sales per Customer (Regular): ${non_promo_sales_per_customer:.2f}")
    print(f"📊 Spending Efficiency Gain: +{efficiency_improvement:.2f}%")
    
    print(f"\n📊 STATISTICAL VALIDATION")
    print(f"="*40)
    print(f"🧮 T-Statistic: {t_stat:.2f}")
    print(f"📈 P-Value: {p_value:.6f}")
    print(f"✅ Statistically Significant: {'YES' if is_significant else 'NO'} (α=0.05)")
    
    # Business insights
    print(f"\n💡 KEY BUSINESS INSIGHTS")
    print(f"="*40)
    
    if sales_lift_pct > 50:
        print(f"🚀 EXCEPTIONAL PERFORMANCE: Promotions drive outstanding sales growth!")
        recommendation = "MAXIMIZE promotional frequency - ROI is excellent"
    elif sales_lift_pct > 25:
        print(f"✅ STRONG PERFORMANCE: Promotions are highly effective")
        recommendation = "INCREASE promotional activities strategically"
    elif sales_lift_pct > 10:
        print(f"👍 GOOD PERFORMANCE: Promotions show solid results")
        recommendation = "MAINTAIN current promotional strategy"
    elif sales_lift_pct > 0:
        print(f"⚠️  WEAK PERFORMANCE: Minimal promotional benefit")
        recommendation = "REVIEW promotional costs vs benefits"
    else:
        print(f"❌ NEGATIVE IMPACT: Promotions may be hurting performance")
        recommendation = "URGENT REVIEW of promotional strategy needed"
    
    print(f"📋 RECOMMENDATION: {recommendation}")
    
    # Traffic vs Spending analysis
    if customer_lift_pct > efficiency_improvement:
        print(f"👥 PRIMARY DRIVER: Promotions mainly drive FOOT TRAFFIC (+{customer_lift_pct:.1f}%)")
        print(f"   → Focus on conversion and upselling during promotions")
    elif efficiency_improvement > customer_lift_pct:
        print(f"💰 PRIMARY DRIVER: Promotions increase SPENDING PER VISIT (+{efficiency_improvement:.1f}%)")
        print(f"   → Excellent basket size improvement")
    else:
        print(f"⚖️  BALANCED IMPACT: Both traffic and spending improve equally")
    
    # Store performance insights
    if not store_df.empty:
        best_store = store_df.loc[store_df['Sales Lift (%)'].idxmax()]
        worst_store = store_df.loc[store_df['Sales Lift (%)'].idxmin()]
        
        print(f"\n🏆 TOP PERFORMING STORE: #{int(best_store['Store'])}")
        print(f"   📈 Sales Lift: +{best_store['Sales Lift (%)']:.1f}%")
        print(f"   👥 Customer Lift: +{best_store['Customer Lift (%)']:.1f}%")
        print(f"   🎯 Promo Rate: {best_store['Promo Rate (%)']:.1f}%")
        
        print(f"\n📉 LOWEST PERFORMING STORE: #{int(worst_store['Store'])}")
        print(f"   📈 Sales Lift: {worst_store['Sales Lift (%)']:+.1f}%")
        print(f"   👥 Customer Lift: +{worst_store['Customer Lift (%)']:.1f}%")
        print(f"   🎯 Promo Rate: {worst_store['Promo Rate (%)']:.1f}%")
        
        avg_lift = store_df['Sales Lift (%)'].mean()
        consistent_stores = ((store_df['Sales Lift (%)'] - avg_lift).abs() < 10).sum()
        
        print(f"\n📊 CONSISTENCY ANALYSIS:")
        print(f"   🎯 Average Lift Across Stores: +{avg_lift:.1f}%")
        print(f"   📏 Performance Consistency: {consistent_stores}/{len(store_df)} stores within ±10%")
        
        if consistent_stores / len(store_df) > 0.8:
            print(f"   ✅ HIGHLY CONSISTENT: Promotions work well across all stores")
        elif consistent_stores / len(store_df) > 0.6:
            print(f"   👍 MODERATELY CONSISTENT: Most stores benefit similarly")
        else:
            print(f"   ⚠️  INCONSISTENT: Results vary significantly by store")
            print(f"   → Investigate store-specific factors affecting promotional performance")
    
    # Time-based insights (if date available)
    if date_col in df_analysis.columns:
        # Work on a copy so the new columns are not written into a slice of df_clean
        df_analysis = df_analysis.copy()
        dates = pd.to_datetime(df_analysis[date_col])
        df_analysis['month'] = dates.dt.month_name()
        df_analysis['weekday'] = dates.dt.day_name()
        
        # Monthly performance
        monthly_promo = df_analysis[df_analysis['promo_flag']==1].groupby('month')[sales_col].mean()
        monthly_regular = df_analysis[df_analysis['promo_flag']==0].groupby('month')[sales_col].mean()
        monthly_lift = ((monthly_promo - monthly_regular) / monthly_regular * 100).round(1)
        
        best_month = monthly_lift.idxmax()
        worst_month = monthly_lift.idxmin()
        
        print(f"\n📅 SEASONAL INSIGHTS:")
        print(f"   🏆 Best Month for Promos: {best_month} (+{monthly_lift[best_month]:.1f}%)")
        print(f"   📉 Worst Month for Promos: {worst_month} (+{monthly_lift[worst_month]:.1f}%)")
    
    print(f"\n🎉 ANALYSIS COMPLETE!")
    
    # Return structured results
    return {
        'summary_metrics': {
            'sales_lift_pct': sales_lift_pct,
            'customer_lift_pct': customer_lift_pct,
            'efficiency_improvement': efficiency_improvement,
            'statistical_significance': is_significant,
            'p_value': p_value
        },
        'store_performance': store_df,
        'raw_data': {
            'promo_avg_sales': promo_avg_sales,
            'regular_avg_sales': non_promo_avg_sales,
            'promo_avg_customers': promo_avg_customers,
            'regular_avg_customers': non_promo_avg_customers
        }
    }

def create_promo_visualization(results_dict, store_df):
    """
    Create visualizations for promotional analysis
    """
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Sales comparison
    metrics = ['Promo', 'Regular']
    sales_values = [results_dict['raw_data']['promo_avg_sales'], 
                   results_dict['raw_data']['regular_avg_sales']]
    
    bars1 = ax1.bar(metrics, sales_values, color=['#ff6b6b', '#4ecdc4'], alpha=0.8)
    ax1.set_title('Average Sales: Promotional vs Regular Days', fontsize=14, fontweight='bold')
    ax1.set_ylabel('Average Sales ($)')
    
    # Add value labels on bars
    for bar in bars1:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'${height:,.0f}', ha='center', va='bottom', fontsize=12)
    
    # 2. Customer traffic comparison
    customer_values = [results_dict['raw_data']['promo_avg_customers'],
                      results_dict['raw_data']['regular_avg_customers']]
    
    bars2 = ax2.bar(metrics, customer_values, color=['#ff9f43', '#54a0ff'], alpha=0.8)
    ax2.set_title('Average Customer Traffic: Promotional vs Regular Days', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Average Customers')
    
    for bar in bars2:
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:,.0f}', ha='center', va='bottom', fontsize=12)
    
    # 3. Store performance distribution
    if not store_df.empty:
        ax3.hist(store_df['Sales Lift (%)'], bins=8, alpha=0.7, color='#ff6b6b', edgecolor='black')
        ax3.set_title('Distribution of Sales Lift Across Stores', fontsize=14, fontweight='bold')
        ax3.set_xlabel('Sales Lift (%)')
        ax3.set_ylabel('Number of Stores')
        ax3.axvline(store_df['Sales Lift (%)'].mean(), color='red', linestyle='--', 
                   label=f'Mean: {store_df["Sales Lift (%)"].mean():.1f}%')
        ax3.legend()
    
    # 4. Key metrics summary
    lift_pct = results_dict['summary_metrics']['sales_lift_pct']
    customer_lift_pct = results_dict['summary_metrics']['customer_lift_pct']
    efficiency = results_dict['summary_metrics']['efficiency_improvement']
    
    metrics_names = ['Sales Lift', 'Customer Lift', 'Efficiency Gain']
    metrics_values = [lift_pct, customer_lift_pct, efficiency]
    colors = ['#ff6b6b', '#4ecdc4', '#45aaf2']
    
    bars4 = ax4.bar(metrics_names, metrics_values, color=colors, alpha=0.8)
    ax4.set_title('Key Performance Metrics (%)', fontsize=14, fontweight='bold')
    ax4.set_ylabel('Improvement (%)')
    
    for bar in bars4:
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'+{height:.1f}%', ha='center', va='bottom', fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    return fig


results = clean_promo_analysis(df_viz_feat)

🎯 PROMOTIONAL IMPACT ANALYSIS REPORT
📊 Data Overview: 814,150 records after removing closed days
🏪 Analyzing top 10 stores: [817, 262, 1114, 251, 842, 513, 562, 788, 383, 756]
📈 Analysis dataset: 7,702 records
🎯 Promotional days: 3,325 (43.2%)
📅 Regular days: 4,377 (56.8%)

💰 SALES PERFORMANCE ANALYSIS
🎯 Average Sales (Promotional): $20,779
📊 Average Sales (Regular): $17,473
⬆️  Absolute Sales Lift: $3,306
📈 Percentage Sales Lift: +18.92%

👥 CUSTOMER TRAFFIC ANALYSIS
🎯 Average Customers (Promotional): 2,636
📊 Average Customers (Regular): 2,469
⬆️  Customer Traffic Lift: +167
📈 Customer Traffic Lift: +6.78%

🎯 EFFICIENCY & PROFITABILITY
💳 Sales per Customer (Promotional): $7.88
💳 Sales per Customer (Regular): $7.08
📊 Spending Efficiency Gain: +11.37%

📊 STATISTICAL VALIDATION
🧮 T-Statistic: 39.17
📈 P-Value: 0.000000
✅ Statistically Significant: YES (α=0.05)

💡 KEY BUSINESS INSIGHTS
👍 GOOD PERFORMANCE: Promotions show solid results
📋 RECOMMENDATION: MAINTAIN current promotional strategy
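
The three lifts in this report are linked: sales equal customers times sales per customer, so (1 + sales lift) = (1 + traffic lift) × (1 + spend-per-customer lift). The sketch below checks that identity using the rounded averages printed above; `decompose_lift` is a hypothetical helper, and because the inputs are rounded, the percentages can differ slightly from the report.

```python
def decompose_lift(promo_sales, regular_sales, promo_customers, regular_customers):
    """Split a sales lift into a traffic component and a spend-per-customer component."""
    traffic_lift = promo_customers / regular_customers - 1
    spend_lift = (
        (promo_sales / promo_customers) / (regular_sales / regular_customers) - 1
    )
    sales_lift = promo_sales / regular_sales - 1
    return {
        "traffic_lift": traffic_lift,
        "spend_lift": spend_lift,
        "sales_lift": sales_lift,
    }


# Rounded averages copied from the report above.
parts = decompose_lift(
    promo_sales=20779, regular_sales=17473,
    promo_customers=2636, regular_customers=2469,
)

# Recombine the two components multiplicatively.
recombined = (1 + parts["traffic_lift"]) * (1 + parts["spend_lift"]) - 1

print(f"Traffic lift:    {parts['traffic_lift'] * 100:+.2f}%")
print(f"Spend lift:      {parts['spend_lift'] * 100:+.2f}%")
print(f"Sales lift:      {parts['sales_lift'] * 100:+.2f}%")
print(f"Recombined lift: {recombined * 100:+.2f}%")
```

The components multiply rather than add, which is why the traffic and spend percentages do not sum exactly to the sales lift.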


In [20]:
# Persist train_df so other notebooks can restore it with `%store -r train_df`
%store train_df


Stored 'train_df' (DataFrame)


---------------------------

In [21]:
print("✅ Data Ingestion and Exploratory Data Analysis completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")

✅ Data Ingestion and Exploratory Data Analysis completed successfully!
🗓️ Analysis Date: 2025-08-13 21:20:08


--------------------------------

In [22]:
# End analysis
analysis_end = pd.Timestamp.now()
duration = analysis_end - analysis_begin

# Final summary print
print("\n📋 Analysis Summary")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date:   {bold_start}{analysis_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration:   {bold_start}{str(duration)}{bold_end}")


📋 Analysis Summary
🟢 Begin Date: [1m2025-08-13 21:19:58[0m
✅ End Date:   [1m2025-08-13 21:20:08[0m
⏱️ Duration:   [1m0 days 00:00:10.512181[0m


-------------------------
## Project Design Rationale: Notebook Separation

To ensure **clarity, maintainability, and scalability** while staying within **GitHub's file size limits**, each **.ipynb notebook** should cover a single task. This modular layout supports streamlined version control, easier collaboration, and more efficient long-term project management.
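
When notebooks are split by task, intermediate DataFrames have to travel between them. Besides IPython's `%store` magic, one file-based option is to write each frame to disk and read it back in the next notebook. The sketch below is an assumption-laden illustration: `save_frame`, `load_frame`, and the `interim` folder name are hypothetical, and a temporary directory stands in for a real project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(df: pd.DataFrame, directory: Path, name: str) -> Path:
    """Write df to <directory>/<name>.pkl and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.pkl"
    df.to_pickle(path)
    return path


def load_frame(directory: Path, name: str) -> pd.DataFrame:
    """Read back a frame written by save_frame."""
    return pd.read_pickle(directory / f"{name}.pkl")


# Round trip in a temporary directory (made-up rows).
with tempfile.TemporaryDirectory() as tmp:
    handoff_dir = Path(tmp) / "interim"
    original = pd.DataFrame({"store": [1, 2], "sales": [5263, 6064]})
    saved_path = save_frame(original, handoff_dir, "train_df")
    restored = load_frame(handoff_dir, "train_df")

print(restored)
```

Pickle files preserve dtypes and the index, but they should only be loaded from trusted sources.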