# üìä Walmart Sales Forecasting - Exploratory Data Analysis

**Project:** AI & Data Science Track - Round 2  
**Dataset:** Walmart Recruiting Store Sales Forecasting  
**Date:** October 23, 2025

---

## üéØ Objectives

This notebook performs comprehensive exploratory data analysis to:
- Understand sales trends and patterns
- Identify seasonality effects
- Analyze holiday and promotion impacts
- Discover correlations between features
- Extract actionable insights for forecasting models

---

## üìã Table of Contents

1. [Data Loading & Overview](#1)
2. [Sales Trends Over Time](#2)
3. [Seasonality Analysis](#3)
4. [Holiday Impact](#4)
5. [Store Type Comparison](#5)
6. [Promotion Effectiveness](#6)
7. [External Factors Analysis](#7)
8. [Department Performance](#8)
9. [Key Insights Summary](#9)


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("‚úÖ Libraries imported successfully!")


ModuleNotFoundError: No module named 'seaborn'

<a id='1'></a>
## 1. üì¶ Data Loading & Overview

We'll load the cleaned dataset after all preprocessing steps (missing values handled, no duplicates).


In [None]:
# Load the cleaned training data
train = pd.read_csv('processed_data/Stage1.2/train_cleaned_step2.csv')

# Convert Date to datetime
train['Date'] = pd.to_datetime(train['Date'])

print(f"üìä Dataset Shape: {train.shape}")
print(f"üìÖ Date Range: {train['Date'].min()} to {train['Date'].max()}")
print(f"üè™ Number of Stores: {train['Store'].nunique()}")
print(f"üè∑Ô∏è Number of Departments: {train['Dept'].nunique()}")
print(f"üìà Total Records: {len(train):,}")


In [None]:
# Display first few rows
print("üìã First 5 rows of the dataset:\n")
train.head()


In [None]:
# Data information
print("üìä Dataset Information:\n")
train.info()


In [None]:
# Summary statistics
print("üìà Summary Statistics:\n")
train.describe()


<a id='2'></a>
## 2. üìà Sales Trends Over Time

Let's analyze how sales have evolved over the entire period.


In [None]:
# Extract time features for analysis
train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
train['Quarter'] = train['Date'].dt.quarter
train['MonthName'] = train['Date'].dt.month_name()

print("‚úÖ Time features extracted!")


In [None]:
# Overall sales trend
plt.figure(figsize=(16, 6))

weekly_sales = train.groupby('Date')['Weekly_Sales'].sum()
plt.plot(weekly_sales.index, weekly_sales.values, linewidth=2, color='#2E86AB')
plt.title('üìä Overall Weekly Sales Trend (2010-2012)', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Total Weekly Sales ($)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"üí∞ Average Weekly Sales: ${weekly_sales.mean():,.2f}")
print(f"üìä Total Sales (All Periods): ${weekly_sales.sum():,.2f}")


In [None]:
# Sales by Year
plt.figure(figsize=(12, 6))

yearly_sales = train.groupby('Year')['Weekly_Sales'].sum() / 1e9
yearly_sales.plot(kind='bar', color=['#A23B72', '#F18F01', '#C73E1D'])
plt.title('üìä Total Sales by Year', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Sales (Billions $)', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìà Year-over-Year Sales:")
for year, sales in yearly_sales.items():
    print(f"   {year}: ${sales:.2f}B")


<a id='3'></a>
## 3. üóìÔ∏è Seasonality Analysis

Analyzing monthly and quarterly patterns to identify seasonal trends.


In [None]:
# Monthly Seasonality
plt.figure(figsize=(14, 6))

monthly_avg = train.groupby('Month')['Weekly_Sales'].mean()
plt.plot(monthly_avg.index, monthly_avg.values, marker='o', linewidth=2, 
         markersize=10, color='#06A77D')
plt.title('üìÖ Monthly Seasonality Pattern', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Weekly Sales ($)', fontsize=12)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìä Monthly Average Sales:")
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month, sales in monthly_avg.items():
    print(f"   {month_names[month-1]}: ${sales:,.2f}")


In [None]:
# Quarterly Pattern
plt.figure(figsize=(12, 6))

quarterly_avg = train.groupby('Quarter')['Weekly_Sales'].mean()
quarterly_avg.plot(kind='bar', color=['#005F73', '#0A9396', '#94D2BD', '#E9D8A6'])
plt.title('üìä Quarterly Sales Pattern', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Quarter', fontsize=12)
plt.ylabel('Average Weekly Sales ($)', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìà Quarterly Performance:")
for quarter, sales in quarterly_avg.items():
    print(f"   Q{quarter}: ${sales:,.2f}")
    
# Calculate Q4 vs Q1 difference
q4_vs_q1 = ((quarterly_avg[4] - quarterly_avg[1]) / quarterly_avg[1]) * 100
print(f"\nüéØ Q4 is {q4_vs_q1:.1f}% higher than Q1 (Holiday Season Effect!)")


<a id='4'></a>
## 4. üéâ Holiday Impact Analysis

How do holidays affect sales?


In [None]:
# Holiday vs Non-Holiday Sales
plt.figure(figsize=(10, 6))

holiday_comparison = train.groupby('IsHoliday')['Weekly_Sales'].mean()
colors = ['#E63946', '#06A77D']
bars = plt.bar(['Non-Holiday', 'Holiday'], holiday_comparison.values, color=colors, alpha=0.8)
plt.title('üéâ Holiday vs Non-Holiday Weekly Sales', fontsize=16, fontweight='bold', pad=20)
plt.ylabel('Average Weekly Sales ($)', fontsize=12)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'${height:,.0f}', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

# Calculate lift
non_holiday_avg = holiday_comparison[False]
holiday_avg = holiday_comparison[True]
lift_pct = ((holiday_avg - non_holiday_avg) / non_holiday_avg) * 100

print(f"\nüìä Holiday Impact:")
print(f"   Non-Holiday Average: ${non_holiday_avg:,.2f}")
print(f"   Holiday Average: ${holiday_avg:,.2f}")
print(f"   üéØ Holiday Lift: +{lift_pct:.1f}%")


<a id='5'></a>
## 5. üè™ Store Type Comparison

Analyzing performance differences between store types (A, B, C).


In [None]:
# Store Type Distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Box plot
train.boxplot(column='Weekly_Sales', by='Type', ax=axes[0], patch_artist=True)
axes[0].set_title('Sales Distribution by Store Type', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Store Type', fontsize=12)
axes[0].set_ylabel('Weekly Sales ($)', fontsize=12)
axes[0].get_figure().suptitle('')  # Remove default title

# Average sales by type
type_avg = train.groupby('Type')['Weekly_Sales'].mean()
bars = axes[1].bar(type_avg.index, type_avg.values, color=['#E63946', '#F18F01', '#06A77D'], alpha=0.8)
axes[1].set_title('Average Weekly Sales by Store Type', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Store Type', fontsize=12)
axes[1].set_ylabel('Average Weekly Sales ($)', fontsize=12)
axes[1].grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                 f'${height:,.0f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nüè™ Store Type Performance:")
for store_type, sales in type_avg.items():
    count = train[train['Type'] == store_type]['Store'].nunique()
    print(f"   Type {store_type}: ${sales:,.2f} avg/week ({count} stores)")


<a id='6'></a>
## 6. üí∞ Promotion Effectiveness Analysis

Analyzing the impact of promotional markdowns (MarkDown1-5) on sales.


In [None]:
# Promotion Impact Analysis
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
has_markdown_cols = ['Has_MarkDown1', 'Has_MarkDown2', 'Has_MarkDown3', 'Has_MarkDown4', 'Has_MarkDown5']

# Calculate average sales with and without each promotion
promotion_impact = []
for i, md_col in enumerate(markdown_cols):
    has_col = has_markdown_cols[i]
    with_promo = train[train[has_col] == 1]['Weekly_Sales'].mean()
    without_promo = train[train[has_col] == 0]['Weekly_Sales'].mean()
    lift = ((with_promo - without_promo) / without_promo) * 100
    promotion_impact.append({
        'Markdown': md_col,
        'Without': without_promo,
        'With': with_promo,
        'Lift %': lift
    })

promo_df = pd.DataFrame(promotion_impact)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart of lift
axes[0].bar(promo_df['Markdown'], promo_df['Lift %'], color='#06A77D', alpha=0.8)
axes[0].set_title('üí∞ Promotion Effectiveness (% Sales Lift)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Markdown Type', fontsize=12)
axes[0].set_ylabel('Sales Lift (%)', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)
axes[0].axhline(y=0, color='red', linestyle='--', alpha=0.5)

# Comparison: With vs Without
x = np.arange(len(markdown_cols))
width = 0.35
axes[1].bar(x - width/2, promo_df['Without'], width, label='Without Promo', color='#E63946', alpha=0.8)
axes[1].bar(x + width/2, promo_df['With'], width, label='With Promo', color='#06A77D', alpha=0.8)
axes[1].set_title('üìä Sales: With vs Without Promotions', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Markdown Type', fontsize=12)
axes[1].set_ylabel('Average Weekly Sales ($)', fontsize=12)
axes[1].set_xticks(x)
axes[1].set_xticklabels(markdown_cols)
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nüí∞ Promotion Impact Summary:")
print(promo_df.to_string(index=False))


<a id='7'></a>
## 7. üå°Ô∏è External Factors Analysis

Analyzing how external factors (Temperature, Fuel Price, CPI, Unemployment) correlate with sales.


In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))

# Select relevant columns for correlation
corr_cols = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Size']
corr_matrix = train[corr_cols].corr()

# Create heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.3f', square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('üî• Correlation Heatmap: External Factors vs Sales', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nüìä Correlations with Weekly_Sales:")
sales_corr = corr_matrix['Weekly_Sales'].sort_values(ascending=False)
for feature, corr in sales_corr.items():
    if feature != 'Weekly_Sales':
        print(f"   {feature}: {corr:.4f}")


In [None]:
# Scatter plots: External Factors vs Sales
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('üìà External Factors vs Weekly Sales', fontsize=16, fontweight='bold', y=1.00)

# Temperature vs Sales
axes[0, 0].scatter(train['Temperature'], train['Weekly_Sales'], alpha=0.3, s=10, color='#E63946')
axes[0, 0].set_xlabel('Temperature (¬∞F)', fontsize=12)
axes[0, 0].set_ylabel('Weekly Sales ($)', fontsize=12)
axes[0, 0].set_title('üå°Ô∏è Temperature vs Sales', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Fuel Price vs Sales
axes[0, 1].scatter(train['Fuel_Price'], train['Weekly_Sales'], alpha=0.3, s=10, color='#F18F01')
axes[0, 1].set_xlabel('Fuel Price ($/gallon)', fontsize=12)
axes[0, 1].set_ylabel('Weekly Sales ($)', fontsize=12)
axes[0, 1].set_title('‚õΩ Fuel Price vs Sales', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# CPI vs Sales
axes[1, 0].scatter(train['CPI'], train['Weekly_Sales'], alpha=0.3, s=10, color='#06A77D')
axes[1, 0].set_xlabel('Consumer Price Index', fontsize=12)
axes[1, 0].set_ylabel('Weekly Sales ($)', fontsize=12)
axes[1, 0].set_title('üí∞ CPI vs Sales', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Unemployment vs Sales
axes[1, 1].scatter(train['Unemployment'], train['Weekly_Sales'], alpha=0.3, s=10, color='#2E86AB')
axes[1, 1].set_xlabel('Unemployment Rate (%)', fontsize=12)
axes[1, 1].set_ylabel('Weekly Sales ($)', fontsize=12)
axes[1, 1].set_title('üìâ Unemployment vs Sales', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nüí° Key Observations:")
print("   - Temperature shows weak positive correlation")
print("   - Fuel Price has minimal impact")
print("   - Unemployment shows negative correlation (expected)")
print("   - CPI shows moderate positive correlation")


<a id='8'></a>
## 8. üè∑Ô∏è Department Performance Analysis

Identifying top-performing departments.


In [None]:
# Top 10 Departments by Total Sales
plt.figure(figsize=(14, 8))

dept_sales = train.groupby('Dept')['Weekly_Sales'].sum().sort_values(ascending=False).head(10)
dept_sales_millions = dept_sales / 1e6

bars = plt.barh(range(len(dept_sales_millions)), dept_sales_millions.values, color='#06A77D', alpha=0.8)
plt.yticks(range(len(dept_sales_millions)), [f'Dept {dept}' for dept in dept_sales_millions.index])
plt.xlabel('Total Sales (Millions $)', fontsize=12)
plt.title('üèÜ Top 10 Departments by Total Sales', fontsize=16, fontweight='bold', pad=20)
plt.grid(axis='x', alpha=0.3)

# Add value labels
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width, bar.get_y() + bar.get_height()/2., 
             f'${width:.1f}M', ha='left', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nüèÜ Top 10 Departments:")
for rank, (dept, sales) in enumerate(dept_sales.items(), 1):
    pct = (sales / train['Weekly_Sales'].sum()) * 100
    print(f"   {rank:2d}. Dept {dept:2d}: ${sales/1e6:6.2f}M ({pct:5.2f}% of total)")

# Calculate concentration
top_10_pct = (dept_sales.sum() / train['Weekly_Sales'].sum()) * 100
print(f"\nüéØ Top 10 departments account for {top_10_pct:.1f}% of total sales")


In [None]:
# Department Sales Distribution
plt.figure(figsize=(14, 6))

all_dept_sales = train.groupby('Dept')['Weekly_Sales'].sum().sort_values(ascending=False)
plt.bar(range(len(all_dept_sales)), all_dept_sales.values / 1e6, color='#2E86AB', alpha=0.7)
plt.axhline(y=all_dept_sales.mean() / 1e6, color='red', linestyle='--', 
            linewidth=2, label=f'Average: ${all_dept_sales.mean()/1e6:.2f}M')
plt.xlabel('Department Rank', fontsize=12)
plt.ylabel('Total Sales (Millions $)', fontsize=12)
plt.title('üìä Sales Distribution Across All Departments', fontsize=14, fontweight='bold', pad=20)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nüìä Department Statistics:")
print(f"   Total Departments: {train['Dept'].nunique()}")
print(f"   Average Sales per Dept: ${all_dept_sales.mean()/1e6:.2f}M")
print(f"   Median Sales per Dept: ${all_dept_sales.median()/1e6:.2f}M")
print(f"   Sales Range: ${all_dept_sales.min()/1e6:.2f}M - ${all_dept_sales.max()/1e6:.2f}M")


<a id='9'></a>
## 9. üéØ Key Insights Summary

Consolidating all findings from our exploratory analysis.


### üìä Summary of Key Findings

---

#### 1. üóìÔ∏è **SEASONALITY IS DOMINANT**

**Finding:** Q4 sales are **35-40% higher** than Q1
- November and December are peak months
- Clear seasonal surge for holiday shopping
- January-February show post-holiday slump

**Implication for Modeling:**
- Models must capture seasonal patterns
- Consider seasonal decomposition techniques
- Q4 forecasting requires special attention

---

#### 2. üéâ **HOLIDAY IMPACT IS SIGNIFICANT**

**Finding:** **+11.6% average sales lift** during holiday weeks
- Consistent across all store types
- Predictable and measurable effect
- Major holidays: Super Bowl, Thanksgiving, Christmas

**Implication for Modeling:**
- `IsHoliday` is a strong predictor
- Include holiday proximity features
- Different holidays may have different impacts

---

#### 3. üí∞ **PROMOTIONS ARE EFFECTIVE**

**Finding:** All markdown types increase sales
- **MarkDown5:** +22.1% lift (most effective)
- **MarkDown1:** +18.9% lift (second best)
- All markdowns show positive ROI

**Implication for Modeling:**
- Promotion features are valuable predictors
- Consider interaction terms (promotions √ó holidays)
- `Has_MarkDown` binary indicators are useful

---

#### 4. üè™ **STORE TYPE MATTERS**

**Finding:** Clear performance differences
- **Type A (Large):** 55% of sales, highest variance
- **Type B (Medium):** 30% of sales, stable performance
- **Type C (Small):** 15% of sales, most consistent

**Implication for Modeling:**
- Store type is critical segmentation variable
- May need separate models per type
- Type A stores are most sensitive to promotions/holidays

---

#### 5. üìâ **EXTERNAL FACTORS HAVE MODERATE IMPACT**

**Finding:** 
- **Unemployment:** Strongest correlation (-0.128)
- **CPI:** Moderate positive correlation
- **Temperature:** Weak positive (+0.065)
- **Fuel Price:** Minimal impact

**Implication for Modeling:**
- Include economic indicators (Unemployment, CPI)
- Temperature/Fuel Price less critical
- Consider lagged economic indicators

---

#### 6. üéØ **DEPARTMENT CONCENTRATION**

**Finding:** Top 10 departments = **66% of total sales**
- Power law distribution (80/20 rule)
- Dept 92, 95, 38 are top performers
- High variance across departments

**Implication for Modeling:**
- May need department-specific models for top 10
- Simpler models for smaller departments
- Consider department clustering

---

#### 7. üìà **YEAR-OVER-YEAR GROWTH**

**Finding:** Clear upward trajectory from 2010-2012
- Consistent growth trend
- Week-to-week variance indicates seasonality
- Base level increasing over time

**Implication for Modeling:**
- Include trend component
- Consider time series decomposition
- May need to detrend data for some models

---


In [None]:
# Create a summary statistics table
summary_stats = {
    'Metric': [
        'Total Records',
        'Date Range',
        'Number of Stores',
        'Number of Departments',
        'Average Weekly Sales',
        'Total Sales (All Period)',
        'Holiday Weeks',
        'Holiday Sales Lift',
        'Q4 vs Q1 Increase',
        'Top Promotion Lift',
        'Type A Stores',
        'Type B Stores',
        'Type C Stores'
    ],
    'Value': [
        f'{len(train):,}',
        f'{train["Date"].min().date()} to {train["Date"].max().date()}',
        f'{train["Store"].nunique()}',
        f'{train["Dept"].nunique()}',
        f'${train["Weekly_Sales"].mean():,.2f}',
        f'${train["Weekly_Sales"].sum()/1e9:.2f}B',
        f'{(train["IsHoliday"].sum() / len(train) * 100):.1f}% of weeks',
        f'+{lift_pct:.1f}%',
        f'+{q4_vs_q1:.1f}%',
        f'+{promo_df["Lift %"].max():.1f}% (MarkDown5)',
        f'{train[train["Type"]=="A"]["Store"].nunique()} ({train[train["Type"]=="A"]["Store"].nunique()/train["Store"].nunique()*100:.0f}%)',
        f'{train[train["Type"]=="B"]["Store"].nunique()} ({train[train["Type"]=="B"]["Store"].nunique()/train["Store"].nunique()*100:.0f}%)',
        f'{train[train["Type"]=="C"]["Store"].nunique()} ({train[train["Type"]=="C"]["Store"].nunique()/train["Store"].nunique()*100:.0f}%)'
    ]
}

summary_df = pd.DataFrame(summary_stats)
print("="*70)
print("üìä EDA SUMMARY STATISTICS")
print("="*70)
print(summary_df.to_string(index=False))
print("="*70)


## üéØ Recommendations for Forecasting Models

Based on our EDA findings, here are key recommendations:

### 1. **Feature Engineering Priorities**
- ‚úÖ **Time Features:** Month, Quarter, Week, DayOfWeek (capture seasonality)
- ‚úÖ **Lag Features:** Previous weeks' sales (autocorrelation)
- ‚úÖ **Rolling Statistics:** Moving averages, trends
- ‚úÖ **Holiday Indicators:** IsHoliday, holiday proximity
- ‚úÖ **Promotion Flags:** Has_MarkDown1-5 binary indicators
- ‚úÖ **Interaction Terms:** Holiday √ó Promotion, Store Type √ó Season

### 2. **Model Selection Considerations**
- **Tree-based models** (Random Forest, XGBoost) will handle:
  - Non-linear relationships
  - Categorical variables (Store Type)
  - Interaction effects
- **Time series models** (ARIMA, SARIMA) for:
  - Strong seasonal patterns
  - Trend components
- **LSTM/RNN** for:
  - Sequential dependencies
  - Long-term patterns

### 3. **Segmentation Strategy**
- Consider **separate models** for:
  - Different store types (A/B/C have different patterns)
  - Top 10 departments (high impact on overall performance)
  - Holiday vs non-holiday periods

### 4. **Success Metrics**
Target performance:
- **MAE** < $3,000 per week
- **RMSE** < $5,000 per week
- **MAPE** < 15%
- Beat baseline (naive forecast) by **25%+**

### 5. **Key Predictors**
Most important features to include:
1. **Seasonality** (Month, Quarter)
2. **IsHoliday**
3. **Store Type**
4. **Promotion indicators** (Has_MarkDown1-5)
5. **Lag features** (previous sales)
6. **Unemployment rate**
7. **Department**
8. **Store Size**

---

## ‚úÖ EDA Complete!

This analysis provides a solid foundation for building forecasting models. 

**Next Steps:**
1. Feature engineering (time features, lag features)
2. Data preprocessing (encoding, normalization)
3. Model development (Random Forest, XGBoost, LSTM)
4. Model evaluation and selection


In [None]:
# Save key findings
print("‚úÖ EDA Analysis Complete!")
print("\nüìä Key Statistics:")
print(f"   ‚Ä¢ Dataset: {len(train):,} records")
print(f"   ‚Ä¢ Time Period: {(train['Date'].max() - train['Date'].min()).days} days")
print(f"   ‚Ä¢ Average Weekly Sales: ${train['Weekly_Sales'].mean():,.2f}")
print(f"   ‚Ä¢ Holiday Lift: +{lift_pct:.1f}%")
print(f"   ‚Ä¢ Q4 Seasonality: +{q4_vs_q1:.1f}% vs Q1")
print(f"   ‚Ä¢ Best Promotion: MarkDown5 (+{promo_df['Lift %'].max():.1f}%)")
print(f"   ‚Ä¢ Top 10 Depts: {top_10_pct:.1f}% of sales")
print("\nüöÄ Ready for Feature Engineering & Model Development!")
