# Sales Forecasting Analytics Project

## Overview
This notebook demonstrates a complete end-to-end sales forecasting solution for retail stores. We'll cover:
1. Data exploration and visualization
2. Feature engineering with holiday and seasonality features
3. Several regression models (Linear Regression, Random Forest, Gradient Boosting, XGBoost)
4. Model evaluation and comparison
5. Feature importance analysis
6. Results interpretation and insights

---

## 1. Setup and Imports

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Import our custom modules
import sys
sys.path.append('../src')

from data_preprocessing import SalesDataPreprocessor, create_sample_data
from models import SalesForecaster
from evaluation import ModelEvaluator

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("Setup complete!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Data Loading and Initial Exploration

In [None]:
# Create sample sales data
# In a real scenario, you would load your data from a CSV file:
# df = pd.read_csv('../data/sales_data.csv')

print("Creating sample sales data...")
df = create_sample_data(num_rows=1000, start_date='2010-01-01')

print(f"\nDataset shape: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print("\nFirst few rows:")
df.head()

In [None]:
# Basic statistics
print("Dataset Statistics:")
print("="*50)
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print("="*50)
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values found!")
else:
    print(missing[missing > 0])

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Sales over time
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(df['Date'], df['Weekly_Sales'], alpha=0.7, linewidth=1)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Weekly Sales', fontsize=12)
ax.set_title('Sales Over Time', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Sales distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df['Weekly_Sales'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Weekly Sales', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Sales Distribution', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(df['Weekly_Sales'])
axes[1].set_ylabel('Weekly Sales', fontsize=12)
axes[1].set_title('Sales Box Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
correlation_matrix = df.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 4. Feature Engineering

Now we'll create comprehensive features including:
- Date-based features (day of week, month, quarter, season)
- Holiday flags and days to holidays
- Lag features
- Rolling window statistics
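
The `SalesDataPreprocessor` internals are not shown in this notebook, so the following is only a minimal pandas sketch of the four feature groups listed above, not the module's actual implementation. The function name `add_basic_features`, the `holidays` argument, and the column names `Quarter`, `DaysToNextHoliday`, `Lag_1`, and `RollingMean_4` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_basic_features(df, date_col="Date", target_col="Weekly_Sales",
                       holidays=("2010-12-25",)):
    """Illustrative feature builder (hypothetical, not SalesDataPreprocessor)."""
    out = df.sort_values(date_col).reset_index(drop=True).copy()
    dates = pd.to_datetime(out[date_col]).dt.normalize()

    # Date-based features
    out["DayOfWeek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["Month"] = dates.dt.month
    out["Quarter"] = dates.dt.quarter

    # Holiday flag: 1 when the row's date is in the holiday list
    holiday_dates = pd.to_datetime(list(holidays)).sort_values()
    out["IsHoliday"] = dates.isin(holiday_dates).astype(int)

    # Days until the next listed holiday (NaN once no holiday remains)
    hol = holiday_dates.values
    d = dates.values
    pos = np.searchsorted(hol, d, side="left")
    has_next = pos < len(hol)
    days = np.full(len(d), np.nan)
    days[has_next] = (hol[pos[has_next]] - d[has_next]) / np.timedelta64(1, "D")
    out["DaysToNextHoliday"] = days

    # Lag feature: the previous row's sales
    out["Lag_1"] = out[target_col].shift(1)

    # Rolling mean over the 4 previous rows; shifting first keeps the
    # current row's target out of its own feature
    out["RollingMean_4"] = out[target_col].shift(1).rolling(window=4).mean()
    return out
```

Lag and rolling features leave `NaN` values in the first few rows, which is why rows with missing values are dropped before modeling in Section 7.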

In [None]:
# Initialize preprocessor
preprocessor = SalesDataPreprocessor()

# Run complete preprocessing pipeline
df_processed = preprocessor.preprocess_pipeline(
    df=df.copy(),
    date_column='Date',
    target_column='Weekly_Sales',
    handle_outliers_cols=['Weekly_Sales']
)

print(f"\nProcessed dataset shape: {df_processed.shape}")
print(f"Number of features created: {df_processed.shape[1] - df.shape[1]}")

In [None]:
# Display sample of processed data
print("Sample of processed data:")
df_processed.head(10)

In [None]:
# View all created features
print("All features:")
print("="*50)
for i, col in enumerate(df_processed.columns, 1):
    print(f"{i:2d}. {col}")

## 5. Seasonality Analysis

In [None]:
# Sales by day of week
if 'DayOfWeek' in df_processed.columns:
    day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    weekly_sales = df_processed.groupby('DayOfWeek')['Weekly_Sales'].mean().reset_index()
    
    plt.figure(figsize=(12, 5))
    bars = plt.bar(weekly_sales['DayOfWeek'], weekly_sales['Weekly_Sales'], 
                   alpha=0.8, edgecolor='black')
    
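    # Color weekend bars differently. Key on the day value rather than the bar
    # position so colors stay correct if some weekdays are absent from the data.
    for day, bar in zip(weekly_sales['DayOfWeek'], bars):
        bar.set_facecolor('coral' if day >= 5 else 'skyblue')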
    
    plt.xticks(range(7), day_names, rotation=45)
    plt.xlabel('Day of Week', fontsize=12)
    plt.ylabel('Average Sales', fontsize=12)
    plt.title('Average Sales by Day of Week', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

In [None]:
# Sales by month
if 'Month' in df_processed.columns:
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    monthly_sales = df_processed.groupby('Month')['Weekly_Sales'].mean().reset_index()
    
    plt.figure(figsize=(12, 5))
    plt.plot(monthly_sales['Month'], monthly_sales['Weekly_Sales'], 
             marker='o', linewidth=2, markersize=8, alpha=0.8)
    plt.xticks(range(1, 13), month_names)
    plt.xlabel('Month', fontsize=12)
    plt.ylabel('Average Sales', fontsize=12)
    plt.title('Average Sales by Month (Seasonality)', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Sales by season
if 'Season' in df_processed.columns:
    seasonal_sales = df_processed.groupby('Season')['Weekly_Sales'].agg(['mean', 'std']).reset_index()
    
    plt.figure(figsize=(10, 6))
    plt.bar(seasonal_sales['Season'], seasonal_sales['mean'], 
            yerr=seasonal_sales['std'], alpha=0.8, edgecolor='black', capsize=10)
    plt.xlabel('Season', fontsize=12)
    plt.ylabel('Average Sales', fontsize=12)
    plt.title('Average Sales by Season', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

## 6. Holiday Impact Analysis

In [None]:
# Compare sales on holidays vs non-holidays
if 'IsHoliday' in df_processed.columns:
    holiday_comparison = df_processed.groupby('IsHoliday')['Weekly_Sales'].agg(['mean', 'std', 'count']).reset_index()
    # Cast to int first so the mapping works whether IsHoliday is stored as bool or 0/1
    holiday_comparison['IsHoliday'] = holiday_comparison['IsHoliday'].astype(int).map({0: 'Non-Holiday', 1: 'Holiday'})
    
    print("Holiday Impact on Sales:")
    print("="*50)
    print(holiday_comparison)
    print()
    
    plt.figure(figsize=(10, 6))
    plt.bar(holiday_comparison['IsHoliday'], holiday_comparison['mean'],
            yerr=holiday_comparison['std'], alpha=0.8, edgecolor='black', capsize=10)
    plt.xlabel('Period', fontsize=12)
    plt.ylabel('Average Sales', fontsize=12)
    plt.title('Sales Comparison: Holidays vs Non-Holidays', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

In [None]:
# Sales around specific holidays
holiday_cols = [col for col in df_processed.columns if col.startswith('Is') and 
                'Holiday' in col and col != 'IsHoliday' and col != 'IsPreHoliday' and col != 'IsPostHoliday']

if holiday_cols:
    holiday_sales = []
    for col in holiday_cols:
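        # Strip only the leading 'Is' prefix; str.replace would also remove
        # any 'Is' occurring later in the column name
        holiday_name = col[len('Is'):]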
        avg_sales = df_processed[df_processed[col] == 1]['Weekly_Sales'].mean()
        holiday_sales.append({'Holiday': holiday_name, 'Avg_Sales': avg_sales})
    
    holiday_sales_df = pd.DataFrame(holiday_sales)
    
    if len(holiday_sales_df) > 0:
        plt.figure(figsize=(12, 6))
        plt.bar(holiday_sales_df['Holiday'], holiday_sales_df['Avg_Sales'], 
                alpha=0.8, edgecolor='black')
        plt.xlabel('Holiday', fontsize=12)
        plt.ylabel('Average Sales', fontsize=12)
        plt.title('Average Sales During Specific Holidays', fontsize=14, fontweight='bold')
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.show()

## 7. Model Training and Comparison

Now we'll train multiple regression models and compare their performance.
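
How `forecaster.prepare_data` splits the rows is not visible in this notebook. Because lag and rolling features are built from earlier observations, a shuffled split can leak future information into training, so a time-ordered hold-out is worth considering. A minimal sketch, assuming a `Date` column and a hypothetical helper named `chronological_split`:

```python
import pandas as pd


def chronological_split(df, date_col="Date", test_size=0.2):
    """Hold out the most recent `test_size` fraction of rows as the test set.

    Hypothetical helper for illustration; not part of the project's modules.
    """
    ordered = df.sort_values(date_col).reset_index(drop=True)
    split_at = int(round(len(ordered) * (1 - test_size)))
    # Everything before split_at is training data, the rest is test data
    return ordered.iloc[:split_at], ordered.iloc[split_at:]
```

With this split, every test row is dated after every training row, which mirrors how a forecast would be used in practice.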

In [None]:
# Prepare data for modeling
forecaster = SalesForecaster(random_state=42)

# Remove rows with NaN values (from lag features)
df_model = df_processed.dropna().reset_index(drop=True)

print(f"Data after removing NaN: {df_model.shape}")

# Prepare train/test split
X_train, X_test, y_train, y_test = forecaster.prepare_data(
    df=df_model,
    target_column='Weekly_Sales',
    test_size=0.2
)

In [None]:
# Initialize and train all models
forecaster.initialize_models()
trained_models = forecaster.train_all_models(X_train, y_train)

# Display training summary
print("\nTraining Summary:")
print(forecaster.get_model_summary())

## 8. Model Evaluation

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Evaluate each model
model_predictions = {}

for model_name in trained_models.keys():
    print(f"\nEvaluating {model_name}...")
    
    # Make predictions
    y_pred = forecaster.predict(model_name, X_test)
    model_predictions[model_name] = y_pred
    
    # Calculate metrics
    metrics = evaluator.evaluate_model(y_test, y_pred, model_name)

In [None]:
# Compare all models
comparison_df = evaluator.compare_models()
comparison_df

In [None]:
# Visualize model comparison
evaluator.plot_model_comparison(comparison_df, metric='RMSE')

In [None]:
# Multi-metric comparison
evaluator.plot_multiple_metrics_comparison(comparison_df)

## 9. Detailed Analysis of Best Model

In [None]:
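# Get best model (lowest RMSE)
# Note: index[0] is the best model only if compare_models() returns rows
# indexed by model name and sorted by ascending RMSE
best_model_name = comparison_df.index[0]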
best_predictions = model_predictions[best_model_name]

print(f"Best Model: {best_model_name}")
print("="*50)

# Generate detailed evaluation report
evaluator.generate_evaluation_report(
    y_test, 
    best_predictions, 
    best_model_name,
    save_dir='../results/visualizations'
)

## 10. Feature Importance Analysis

In [None]:
# Analyze feature importance for tree-based models
tree_models = ['Random Forest', 'Gradient Boosting', 'XGBoost']

for model_name in tree_models:
    if model_name in trained_models:
        print(f"\nFeature Importance - {model_name}:")
        print("="*50)
        
        importance_df = forecaster.get_feature_importance(model_name, top_n=15)
        
        if importance_df is not None:
            print(importance_df)
            print()
            evaluator.plot_feature_importance(
                importance_df, 
                model_name, 
                top_n=15,
                save_path=f'../results/visualizations/{model_name.replace(" ", "_")}_feature_importance.png'
            )

## 11. Cross-Validation Analysis
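
The fold strategy used by `forecaster.cross_validate_model` is not shown here. For time-ordered data, one common option is an expanding window, where each fold trains on all earlier rows and validates on the block that follows. The generator below is a minimal NumPy sketch of that idea; the name `expanding_window_folds` is hypothetical, and scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
import numpy as np


def expanding_window_folds(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs where training rows precede validation rows."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows
        val_end = train_end + fold_size if k < n_folds else n_samples
        yield np.arange(train_end), np.arange(train_end, val_end)
```

Each pair of index arrays can be applied with `.iloc` to rows that are already sorted by date.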

In [None]:
# Perform cross-validation for all models
print("Cross-Validation Results:")
print("="*70)

cv_results = {}
for model_name in trained_models.keys():
    cv_result = forecaster.cross_validate_model(
        model_name, 
        X_train, 
        y_train, 
        cv_folds=5
    )
    cv_results[model_name] = cv_result

# Create CV comparison DataFrame
cv_comparison = pd.DataFrame({
    model: {'Mean CV RMSE': results['mean_rmse'], 'Std CV RMSE': results['std_rmse']}
    for model, results in cv_results.items()
}).T

print("\n" + "="*70)
print(cv_comparison)

## 12. Predictions Visualization for All Models

In [None]:
# Plot predictions for all models
fig, axes = plt.subplots(2, 2, figsize=(18, 12))
axes = axes.ravel()

for idx, (model_name, y_pred) in enumerate(model_predictions.items()):
    if idx < 4:
        ax = axes[idx]
        
        # Plot first 100 samples for clarity
        sample_size = min(100, len(y_test))
        indices = np.arange(sample_size)
        
        # np.asarray gives positional slicing even if y_test is a pandas Series
        ax.plot(indices, np.asarray(y_test)[:sample_size], label='Actual', alpha=0.8, linewidth=2)
        ax.plot(indices, np.asarray(y_pred)[:sample_size], label='Predicted', alpha=0.8, linewidth=2)
        
        ax.set_xlabel('Sample Index', fontsize=10)
        ax.set_ylabel('Sales', fontsize=10)
        ax.set_title(f'{model_name}', fontsize=12, fontweight='bold')
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)

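plt.tight_layout()

# Make sure the output directory exists before saving the figure
import os
os.makedirs('../results/visualizations', exist_ok=True)
plt.savefig('../results/visualizations/all_models_predictions.png', dpi=300, bbox_inches='tight')
plt.show()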

## 13. Key Insights and Conclusions

### Model Performance Summary
Based on the evaluation metrics, we can draw the following conclusions:

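1. **Best Performing Model**: The model with the lowest test-set RMSE in the Section 8 comparison is treated as the most accurate; check that its ranking also holds in the cross-validation results (Section 11) before relying on it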

2. **Feature Importance**: Key factors affecting sales include:
   - Historical sales patterns (lag features)
   - Seasonality (month, day of week)
   - Holiday effects
   - Rolling statistics

3. **Seasonality Patterns**: 
   - Clear weekly patterns with weekend effects
   - Monthly and seasonal variations
   - Holiday-driven spikes

4. **Model Complexity vs Performance**:
   - Linear Regression provides a simple, interpretable baseline
   - Tree-based models (Random Forest, Gradient Boosting, XGBoost) can capture non-linear patterns and feature interactions that a linear model misses
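
For reference, the RMSE used to rank models in item 1 can be computed by hand alongside two other common regression metrics. This is a minimal NumPy sketch; the function name `regression_metrics` is hypothetical, and the exact set of metrics reported by `ModelEvaluator` may differ.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    rmse = float(np.sqrt(np.mean(errors ** 2)))  # penalizes large errors more
    mae = float(np.mean(np.abs(errors)))         # average absolute error
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                   # share of variance explained
    return {"RMSE": rmse, "MAE": mae, "R2": r2}
```

RMSE and MAE are in the same units as `Weekly_Sales`, so they can be read directly as a typical forecast error.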

### Recommendations

1. **Inventory Management**: Use predictions to optimize stock levels, especially around holidays
2. **Promotional Planning**: Time promotions based on predicted low-sales periods
3. **Staffing**: Adjust staffing levels based on predicted sales volume
4. **Model Deployment**: The best model can be deployed for real-time forecasting

### Next Steps

1. Fine-tune hyperparameters for better performance
2. Incorporate external factors (weather, economic indicators)
3. Implement ensemble methods combining multiple models
4. Set up automated retraining pipeline
5. Deploy model as a web service for real-time predictions
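
As a starting point for step 3, the simplest ensemble averages the predictions already stored in `model_predictions`. The sketch below assumes a dict mapping model names to equal-length prediction arrays; `average_ensemble` and its optional `weights` argument are hypothetical names, not part of the project's modules.

```python
import numpy as np


def average_ensemble(predictions, weights=None):
    """Combine per-model predictions by a plain or weighted average.

    predictions: dict of model name -> sequence of predictions (same length)
    weights: optional dict of model name -> non-negative weight
    """
    names = list(predictions)
    stacked = np.vstack([np.asarray(predictions[n], dtype=float) for n in names])
    if weights is None:
        return stacked.mean(axis=0)
    w = np.array([weights[n] for n in names], dtype=float)
    # Normalize so the weights sum to 1 before combining
    return (w[:, None] * stacked).sum(axis=0) / w.sum()
```

The combined predictions can be scored with the same metrics as the individual models to see whether averaging helps on this dataset.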

## 14. Save Models and Results

In [None]:
# Save best model
import os
os.makedirs('../models', exist_ok=True)

best_model_path = f'../models/{best_model_name.replace(" ", "_")}_best_model.pkl'
forecaster.save_model(best_model_name, best_model_path)

print(f"Best model ({best_model_name}) saved successfully!")

In [None]:
# Save comparison results
os.makedirs('../results', exist_ok=True)
comparison_df.to_csv('../results/model_comparison.csv')

print("Model comparison results saved to ../results/model_comparison.csv")

---

## Conclusion

This notebook demonstrated a complete end-to-end sales forecasting solution including:
- Comprehensive data preprocessing and feature engineering
- Multiple machine learning models for forecasting
- Thorough evaluation and comparison
- Feature importance analysis
- Actionable insights for business decisions

The modular code structure allows for easy customization and extension for specific business needs.

**Project Repository**: [Sales-Forecasting](https://github.com/Harshitpal1/Sales-Forecasting)

---