# Sales Forecast Model - Big Data Hackathon 2025

This notebook implements a sales forecasting model that uses CatBoost regression to predict retail sales for the next five weeks. It loads transaction, product, and store data, engineers features, and generates weekly predictions for each store-product combination.

## Overview
- **Data Sources**: Transactions, Products, and Stores data in Parquet format
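- **Model**: CatBoost Regressor (up to 1,000 iterations, learning rate 0.1, depth 6, RMSE loss, early stopping)
- **Features**: One-week lag, two-week rolling average, cosine week encoding, store-product statistics, and label-encoded store category and brand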
- **Output**: Weekly sales predictions for January 2023

## 1. Import Required Libraries

First, let's import all the necessary libraries for data processing, machine learning, and model evaluation.

In [5]:
%pip install catboost

Note: you may need to restart the kernel to use updated packages.





In [6]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from catboost import CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error
import os
import sys

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Define SalesForecastModel Class

The `SalesForecastModel` class handles every stage of the forecasting pipeline: data loading, cleansing, merging, weekly aggregation, feature engineering, model training, evaluation, and prediction generation.

In [42]:
class SalesForecastModel:
    def __init__(self, data_path="data/"):
        self.data_path = data_path
        self.transactions = None
        self.products = None
        self.stores = None
        self.model = None
        self.label_encoders = {}
        self.validation_metrics = {}
        
    def load_data(self):
        print("Loading data...")
        parquet_files = [f for f in os.listdir(self.data_path) if f.endswith('.parquet')]
        print(f"Found {len(parquet_files)} parquet files")
        
        for file in parquet_files:
            df = pd.read_parquet(os.path.join(self.data_path, file))
            print(f"{file}: Shape {df.shape}")
            
            if 'internal_store_id' in df.columns and 'quantity' in df.columns:
                self.transactions = df
                print("-> Identified as TRANSACTIONS data")
            elif 'produto' in df.columns and 'categoria' in df.columns:
                self.products = df
                print("-> Identified as PRODUCTS data")
            elif 'pdv' in df.columns and 'premise' in df.columns:
                self.stores = df
                print("-> Identified as STORES data")
        
        print(f"Data loaded successfully:")
        print(f"- Transactions: {self.transactions.shape[0]:,} rows")
        print(f"- Products: {self.products.shape[0]:,} rows")
        print(f"- Stores: {self.stores.shape[0]:,} rows")
        
    def cleanse_data(self):
        print("\nCleansing data...")
        
        initial_rows = len(self.transactions)
        self.transactions = self.transactions.dropna(subset=['internal_store_id', 'internal_product_id', 'quantity', 'transaction_date'])
        print(f"Removed {initial_rows - len(self.transactions):,} rows with null values")
        
        self.transactions = self.transactions[self.transactions['quantity'] > 0]
        print(f"Kept {len(self.transactions):,} rows with positive quantities")
        
        self.transactions[['transaction_date', 'reference_date']] = self.transactions[['transaction_date', 'reference_date']].apply(pd.to_datetime)
        
        self.transactions = self.transactions[
            (self.transactions['transaction_date'].dt.year == 2022)
        ]
        print(f"Filtered to 2022 data: {len(self.transactions):,} rows")
        
        print("Cleaning products and stores...")
        self.products['descricao'] = self.products['descricao'].fillna('Unknown')
        self.products['categoria'] = self.products['categoria'].fillna('Other')
        self.products['marca'] = self.products['marca'].fillna('Unknown')
        
        self.stores['categoria_pdv'] = self.stores['categoria_pdv'].fillna('Other')
        self.stores['premise'] = self.stores['premise'].fillna('Unknown')
        
    def merge_data(self):
        print("\nMerging data...")
        
        merged_data = self.transactions.merge(
            self.products, 
            left_on='internal_product_id', 
            right_on='produto', 
            how='left'
        )
        print(f"After product merge: {len(merged_data):,} rows")
        
        merged_data = merged_data.merge(
            self.stores,
            left_on='internal_store_id',
            right_on='pdv',
            how='left'
        )
        print(f"After store merge: {len(merged_data):,} rows")
        
        self.merged_data = merged_data
        print("Data merge completed")
        
    def create_weekly_aggregations(self):
        print("\nCreating weekly aggregations...")
        
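        dt_info = self.merged_data['transaction_date'].dt
        self.merged_data['year'] = dt_info.year
        # Note: ISO week numbers are paired with the calendar year here, so
        # 2022-01-01 and 2022-01-02 (ISO week 52 of 2021) are labeled '2022_52',
        # the same label as the last week of December 2022.
        self.merged_data['week'] = dt_info.isocalendar().week
        self.merged_data['year_week'] = self.merged_data['year'].astype(str) + '_' + self.merged_data['week'].astype(str).str.zfill(2)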
        
        weekly_data = self.merged_data.groupby([
            'year_week', 'week', 'internal_store_id', 'internal_product_id',
            'categoria', 'marca', 'premise', 'categoria_pdv'
        ]).agg({
            'quantity': ['sum', 'mean', 'count'],
            'gross_value': ['sum', 'mean'],
            'net_value': ['sum', 'mean'],
            'gross_profit': ['sum', 'mean']
        }).reset_index()
        
        weekly_data.columns = ['_'.join(col).strip() if col[1] else col[0] for col in weekly_data.columns.values]
        
        column_mapping = {
            'quantity_sum': 'total_quantity',
            'quantity_mean': 'avg_quantity_per_transaction',
            'quantity_count': 'num_transactions',
            'gross_value_sum': 'total_gross_value',
            'gross_value_mean': 'avg_gross_value',
            'net_value_sum': 'total_net_value',
            'net_value_mean': 'avg_net_value',
            'gross_profit_sum': 'total_gross_profit',
            'gross_profit_mean': 'avg_gross_profit'
        }
        weekly_data.rename(columns=column_mapping, inplace=True)
        
        self.weekly_data = weekly_data
        print(f"Created weekly aggregations: {len(self.weekly_data):,} rows")
        
    def build_features(self):
        print("\nBuilding optimized features (11 selected features)...")
        
        self.weekly_data = self.weekly_data.sort_values(['internal_store_id', 'internal_product_id', 'week'])
        
        # Create only quantity_lag_1 (not all lags)
        print("Creating lag feature (lag_1 only)...")
        grouped = self.weekly_data.groupby(['internal_store_id', 'internal_product_id'])['total_quantity']
        self.weekly_data['quantity_lag_1'] = grouped.shift(1)
        
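        # Create only quantity_rolling_avg_2 (not all windows).
        # Note: the window includes the current week's total_quantity (the target),
        # so this feature is not strictly historical.
        print("Creating rolling average (window=2 only)...")
        self.weekly_data['quantity_rolling_avg_2'] = grouped.rolling(window=2, min_periods=1).mean().reset_index(level=[0,1], drop=True)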
        
        # Create only week_cos (not week_sin)
        print("Creating seasonal feature (cos only)...")
        self.weekly_data['week_cos'] = np.cos(2 * np.pi * self.weekly_data['week'] / 52)
        
        # Create store-product interaction features (all needed) - only if not already exist
        store_product_cols = ['store_product_std_quantity', 'store_product_min_quantity', 
                             'store_product_max_quantity', 'store_product_avg_transactions']
        
        if not any(col in self.weekly_data.columns for col in store_product_cols):
            print("Creating store-product interaction features...")
            agg_dict = {
                'total_quantity': ['std', 'min', 'max'],
                'num_transactions': 'mean'
            }
            store_product_stats = self.weekly_data.groupby(['internal_store_id', 'internal_product_id']).agg(agg_dict)
            
            store_product_stats.columns = ['_'.join(col).strip() if col[1] else col[0] for col in store_product_stats.columns.values]
            store_product_stats.rename(columns={
                'total_quantity_std': 'store_product_std_quantity',
                'total_quantity_min': 'store_product_min_quantity',
                'total_quantity_max': 'store_product_max_quantity',
                'num_transactions_mean': 'store_product_avg_transactions'
            }, inplace=True)
            
            self.weekly_data = self.weekly_data.merge(
                store_product_stats,
                left_on=['internal_store_id', 'internal_product_id'],
                right_index=True,
                how='left'
            )
        else:
            print("Store-product features already exist, skipping...")
        
        self.weekly_data = self.weekly_data.fillna(0)
        print("Optimized feature engineering completed")
        
    def prepare_training_data(self):
        print("\nPreparing training data with 11 selected features...")
        
        # Create only needed categorical encodings
        categorical_features = ['categoria_pdv', 'marca']  # Only these 2
        
        for feature in categorical_features:
            if feature not in self.label_encoders:
                self.label_encoders[feature] = LabelEncoder()
                self.weekly_data[f'{feature}_encoded'] = self.label_encoders[feature].fit_transform(
                    self.weekly_data[feature].astype(str)
                )
        
        # Define exactly the 11 features requested
        feature_columns = [
            'avg_quantity_per_transaction',
            'store_product_max_quantity', 
            'categoria_pdv_encoded',
            'marca_encoded',
            'store_product_min_quantity',
            'quantity_rolling_avg_2',
            'quantity_lag_1',
            'store_product_avg_transactions',
            'store_product_std_quantity',
            'week_cos',
            'total_gross_value'
        ]
        
        train_data = self.weekly_data[self.weekly_data['week'] >= 5].copy()
        
        # Verify all features exist
        missing_features = [f for f in feature_columns if f not in train_data.columns]
        if missing_features:
            print(f"WARNING: Missing features: {missing_features}")
        
        X = train_data[feature_columns]
        y = train_data['total_quantity']
        
        print(f"Training data shape: X={X.shape}, y={y.shape}")
        print(f"Selected features ({len(feature_columns)}):")
        for i, feature in enumerate(feature_columns, 1):
            print(f"  {i:2d}. {feature}")
        
        return X, y, feature_columns, train_data
    
    def calculate_wmape(self, y_true, y_pred):
        # Weighted MAPE (%): total absolute error divided by total actual quantity
        return np.sum(np.abs(y_true - y_pred)) / np.sum(y_true) * 100
    
    def evaluate_model(self, X, y, train_data):
        # Use temporal split: train on earlier weeks, validate on later weeks
        max_week = train_data['week'].max()
        # Use last 20% of weeks for validation (approximately)
        week_cutoff = max_week - int(0.2 * (max_week - train_data['week'].min()))
        
        train_mask = train_data['week'] <= week_cutoff
        val_mask = train_data['week'] > week_cutoff
        
        X_train, X_val = X[train_mask], X[val_mask]
        y_train, y_val = y[train_mask], y[val_mask]
        
        print(f"Temporal split: Train weeks ≤ {week_cutoff}, Validation weeks > {week_cutoff}")
        print(f"Train weeks range: {train_data[train_mask]['week'].min()}-{train_data[train_mask]['week'].max()}")
        print(f"Validation weeks range: {train_data[val_mask]['week'].min()}-{train_data[val_mask]['week'].max()}")
        
        train_pred = self.model.predict(X_train)
        val_pred = self.model.predict(X_val)
        
        train_mape = mean_absolute_percentage_error(y_train, train_pred) * 100
        val_mape = mean_absolute_percentage_error(y_val, val_pred) * 100
        
        train_wmape = self.calculate_wmape(y_train, train_pred)
        val_wmape = self.calculate_wmape(y_val, val_pred)
        
        self.validation_metrics = {
            'train_mape': train_mape,
            'val_mape': val_mape,
            'train_wmape': train_wmape,
            'val_wmape': val_wmape,
            'train_samples': len(y_train),
            'val_samples': len(y_val)
        }
        
        print(f"Training MAPE: {train_mape:.2f}%")
        print(f"Validation MAPE: {val_mape:.2f}%")
        print(f"Training WMAPE: {train_wmape:.2f}%")
        print(f"Validation WMAPE: {val_wmape:.2f}%")
        
        return self.validation_metrics
        
    def train_model(self, X, y, train_data):
        print("\nTraining CatBoost model...")
        
        # Use temporal split: train on earlier weeks, validate on later weeks
        max_week = train_data['week'].max()
        # Use last 20% of weeks for validation (approximately)
        week_cutoff = max_week - int(0.2 * (max_week - train_data['week'].min()))
        
        train_mask = train_data['week'] <= week_cutoff
        val_mask = train_data['week'] > week_cutoff
        
        X_train, X_val = X[train_mask], X[val_mask]
        y_train, y_val = y[train_mask], y[val_mask]
        
        print(f"Temporal split for training: Train weeks ≤ {week_cutoff}, Validation weeks > {week_cutoff}")
        print(f"Training samples: {len(y_train):,}, Validation samples: {len(y_val):,}")
        
        self.model = CatBoostRegressor(
            iterations=1000,
            learning_rate=0.1,
            depth=6,
            loss_function='RMSE',
            random_seed=42,
            verbose=100
        )
        
        self.model.fit(
            X_train, y_train,
            eval_set=(X_val, y_val),
            early_stopping_rounds=100,
            verbose=100
        )
        
        self.evaluate_model(X, y, train_data)
        
    def generate_predictions(self, feature_columns, train_data, max_rows=1500000, weeks_to_predict=5, recent_weeks=8):
        print("\nGenerating predictions for January 2023...")
        
        # Discover unique pairs
        store_product_combinations = train_data[['internal_store_id', 'internal_product_id']].drop_duplicates()
        print(f"Found {len(store_product_combinations):,} unique store-product combinations")
        
        # Build latest snapshot per pair
        latest_records = train_data.loc[train_data.groupby(['internal_store_id', 'internal_product_id'])['week'].idxmax()].copy()
        
        # Select top active pairs to respect the 1.5M-row portal limit
        try:
            pairs_limit = max_rows // weeks_to_predict
            max_week = int(self.weekly_data['week'].max())
            start_week = max(1, max_week - int(recent_weeks) + 1)
            recent_slice = self.weekly_data[self.weekly_data['week'] >= start_week]
            activity = recent_slice.groupby(['internal_store_id', 'internal_product_id']).agg(
                recent_total_qty=('total_quantity', 'sum'),
                weeks_with_sales=('total_quantity', lambda s: int((s > 0).sum())),
                last_week_seen=('week', 'max')
            ).reset_index()
            activity = activity.sort_values(
                by=['recent_total_qty', 'weeks_with_sales', 'last_week_seen'],
                ascending=[False, False, False]
            )
            selected_pairs = activity.head(pairs_limit)[['internal_store_id', 'internal_product_id']]
            before = len(latest_records)
            latest_records = latest_records.merge(selected_pairs, on=['internal_store_id', 'internal_product_id'], how='inner')
            after = len(latest_records)
            print(f"Selected top {after:,} active pairs out of {before:,} (recent_weeks={recent_weeks})")
        except Exception as e:
            print(f"Pair selection step skipped due to error: {e}")
        
        all_predictions = []
        
        for week in range(1, weeks_to_predict + 1):
            print(f"Predicting week {week}...")
            
            week_data = latest_records.copy()
            week_data['week'] = week
            week_data['week_cos'] = np.cos(2 * np.pi * week / 52)
            
            features_matrix = week_data[feature_columns].values
            predictions = self.model.predict(features_matrix)
            predictions = np.maximum(0, predictions)
            
            week_predictions = pd.DataFrame({
                'semana': week,
                'pdv': week_data['internal_store_id'].astype(int),
                'produto': week_data['internal_product_id'].astype(int),
                'quantidade': predictions.round().astype(int)
            })
            
            all_predictions.append(week_predictions)
            print(f"Generated {len(week_predictions):,} predictions for week {week}")
        
        predictions_df = pd.concat(all_predictions, ignore_index=True)
        
        # Final safeguard
        if len(predictions_df) > max_rows:
            print(f"Limiting predictions to {max_rows:,} rows (was {len(predictions_df):,})")
            predictions_df = predictions_df.head(max_rows)
        
        print(f"Total predictions generated: {len(predictions_df):,}")
        self.analyze_predictions(predictions_df)
        return predictions_df
    
    def analyze_predictions(self, predictions_df):
        print("\nPrediction Analysis:")
        print(f"Total predictions: {len(predictions_df):,}")
        print(f"Zero predictions: {(predictions_df['quantidade'] == 0).sum():,}")
        print(f"Non-zero predictions: {(predictions_df['quantidade'] > 0).sum():,}")
        print(f"Mean prediction: {predictions_df['quantidade'].mean():.2f}")
        print(f"Median prediction: {predictions_df['quantidade'].median():.2f}")
        print(f"Max prediction: {predictions_df['quantidade'].max():,}")
        print(f"Std prediction: {predictions_df['quantidade'].std():.2f}")
        
        quantiles = predictions_df['quantidade'].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
        print("Prediction quantiles:")
        for q, val in quantiles.items():
            print(f"  {q*100:.0f}%: {val:.2f}")
        
        weekly_stats = predictions_df.groupby('semana')['quantidade'].agg(['count', 'mean', 'sum']).round(2)
        print("\nWeekly prediction summary:")
        print(weekly_stats)
        
    def save_predictions(self, predictions_df, filename="sales_predictions.csv"):
        print(f"\nSaving predictions to {filename}...")
        
        predictions_df.to_csv(filename, sep=';', index=False, encoding='utf-8')
        
        print(f"Predictions saved successfully!")
        print(f"File: {filename}")
        print(f"Rows: {len(predictions_df):,}")
        print(f"Sample:")
        print(predictions_df.head(10))
        
    def print_performance_report(self):
        print("\n" + "="*50)
        print("PERFORMANCE REPORT")
        print("="*50)
        
        if self.validation_metrics:
            print("Model Validation Metrics:")
            print(f"  Training MAPE: {self.validation_metrics['train_mape']:.2f}%")
            print(f"  Validation MAPE: {self.validation_metrics['val_mape']:.2f}%")
            print(f"  Training WMAPE: {self.validation_metrics['train_wmape']:.2f}%")
            print(f"  Validation WMAPE: {self.validation_metrics['val_wmape']:.2f}%")
            print(f"  Training samples: {self.validation_metrics['train_samples']:,}")
            print(f"  Validation samples: {self.validation_metrics['val_samples']:,}")
            
            wmape_diff = abs(self.validation_metrics['val_wmape'] - self.validation_metrics['train_wmape'])
            mape_diff = abs(self.validation_metrics['val_mape'] - self.validation_metrics['train_mape'])
            
            print(f"\nOverfitting Check:")
            print(f"  MAPE difference: {mape_diff:.2f}%")
            print(f"  WMAPE difference: {wmape_diff:.2f}%")
            
            if wmape_diff < 5 and mape_diff < 5:
                print("  Status: Good generalization")
            elif wmape_diff < 10 and mape_diff < 10:
                print("  Status: Moderate overfitting")
            else:
                print("  Status: High overfitting risk")
        
        print("="*50)

print("SalesForecastModel class defined successfully!")
print("✅ Updated to use only 11 optimized features:")
print("   1. avg_quantity_per_transaction")
print("   2. store_product_max_quantity") 
print("   3. categoria_pdv_encoded")
print("   4. marca_encoded")
print("   5. store_product_min_quantity")
print("   6. quantity_rolling_avg_2")
print("   7. quantity_lag_1")
print("   8. store_product_avg_transactions")
print("   9. store_product_std_quantity")
print("  10. week_cos")
print("  11. total_gross_value")

SalesForecastModel class defined successfully!
✅ Updated to use only 11 optimized features:
   1. avg_quantity_per_transaction
   2. store_product_max_quantity
   3. categoria_pdv_encoded
   4. marca_encoded
   5. store_product_min_quantity
   6. quantity_rolling_avg_2
   7. quantity_lag_1
   8. store_product_avg_transactions
   9. store_product_std_quantity
  10. week_cos
  11. total_gross_value
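
The class reports both MAPE and WMAPE. The sketch below compares the two on a small made-up example. The `wmape` function mirrors `calculate_wmape` above; the `mape` helper and the sample arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def wmape(y_true, y_pred):
    """Weighted MAPE in percent: total absolute error over total actual volume."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(y_true) * 100

def mape(y_true, y_pred):
    """Plain MAPE in percent: every row counts equally, regardless of volume."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100

# Hypothetical weekly quantities: one slow seller and one high-volume item.
y_true = [1, 100]
y_pred = [2, 90]

print(f"MAPE:  {mape(y_true, y_pred):.2f}%")   # (1/1 + 10/100) / 2 = 55.00%
print(f"WMAPE: {wmape(y_true, y_pred):.2f}%")  # (1 + 10) / 101 = 10.89%
```

A one-unit miss on a slow seller dominates MAPE but barely moves WMAPE, which is one reason to read the two metrics together when most weekly quantities are small.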


## 3. Initialize Model and Load Data

Create an instance of `SalesForecastModel` and load the data from the Parquet files. `load_data` identifies the transaction, product, and store tables by checking each file for characteristic column names (for example, `internal_store_id` and `quantity` for transactions).

In [43]:
# Initialize the model with the data path (fresh instance)
model = SalesForecastModel(data_path="data/")

# Load the data
model.load_data()

# Display basic information about the loaded datasets
print("\n=== Data Summary ===")
if model.transactions is not None:
    print(f"Transactions dataset columns: {list(model.transactions.columns)}")
    print(f"Transaction date range: {model.transactions['transaction_date'].min()} to {model.transactions['transaction_date'].max()}")
    
if model.products is not None:
    print(f"Products dataset columns: {list(model.products.columns)}")
    print(f"Unique categories: {model.products['categoria'].nunique()}")
    
if model.stores is not None:
    print(f"Stores dataset columns: {list(model.stores.columns)}")
    print(f"Unique store categories: {model.stores['categoria_pdv'].nunique()}")

Loading data...
Found 3 parquet files
part-00000-tid-2779033056155408584-f6316110-4c9a-4061-ae48-69b77c7c8c36-4-1-c000.snappy.parquet: Shape (14419, 4)
-> Identified as STORES data
part-00000-tid-5196563791502273604-c90d3a24-52f2-4955-b4ec-fb143aae74d8-4-1-c000.snappy.parquet: Shape (6560698, 11)
-> Identified as TRANSACTIONS data
part-00000-tid-7173294866425216458-eae53fbf-d19e-4130-ba74-78f96b9675f1-4-1-c000.snappy.parquet: Shape (7092, 8)
-> Identified as PRODUCTS data
Data loaded successfully:
- Transactions: 6,560,698 rows
- Products: 7,092 rows
- Stores: 14,419 rows

=== Data Summary ===
Transactions dataset columns: ['internal_store_id', 'internal_product_id', 'distributor_id', 'transaction_date', 'reference_date', 'quantity', 'gross_value', 'net_value', 'gross_profit', 'discount', 'taxes']
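
The column-based identification used by `load_data` can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: the `SIGNATURES` table and the `identify_dataset` helper are hypothetical names, while the column names are the ones `load_data` checks.

```python
# Hypothetical signature table: dataset role -> columns that must all be present.
SIGNATURES = {
    "transactions": {"internal_store_id", "quantity"},
    "products": {"produto", "categoria"},
    "stores": {"pdv", "premise"},
}

def identify_dataset(columns):
    """Return the first role whose signature columns are all present, else None."""
    present = set(columns)
    for role, required in SIGNATURES.items():
        if required <= present:  # subset test: every required column exists
            return role
    return None

print(identify_dataset(["pdv", "premise", "categoria_pdv"]))  # stores
print(identify_dataset(["foo", "bar"]))                       # None
```

Returning `None` for an unmatched file makes it possible to fail early with a clear message, rather than reaching `self.transactions.shape` while the attribute is still `None`.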

## 4. Data Cleansing and Preprocessing

Clean the data by dropping rows with nulls in key columns, keeping only positive quantities, converting date columns to datetimes, restricting transactions to 2022, and filling missing categorical values with placeholders.
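
The transaction-side steps can be tried on a toy frame first. The values below are made up; only the column names and the order of operations follow `cleanse_data`.

```python
import pandas as pd

# Toy transactions (made-up values) covering each case the cleansing step handles.
raw = pd.DataFrame({
    "internal_store_id": [1, 1, None, 2, 2],
    "internal_product_id": [10, 10, 11, 12, 12],
    "quantity": [3.0, -1.0, 2.0, 5.0, 4.0],
    "transaction_date": ["2022-03-01", "2022-03-02", "2022-03-03",
                         "2021-12-31", "2022-07-15"],
})

key_cols = ["internal_store_id", "internal_product_id", "quantity", "transaction_date"]
clean = raw.dropna(subset=key_cols)            # drop rows with null keys
clean = clean[clean["quantity"] > 0].copy()    # keep positive quantities only
clean["transaction_date"] = pd.to_datetime(clean["transaction_date"])
clean = clean[clean["transaction_date"].dt.year == 2022]  # keep 2022 only

print(clean)
print(f"Kept {len(clean)} of {len(raw)} rows")
```

The explicit `.copy()` after filtering is a small design choice: it makes the later column assignment operate on an independent frame, which avoids the chained-assignment warning that the notebook otherwise silences with `warnings.filterwarnings('ignore')`.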

In [44]:
# Cleanse the data
model.cleanse_data()

# Display data quality information after cleansing
print("\n=== Data Quality Check ===")
print(f"Final transaction data shape: {model.transactions.shape}")
print(f"Date range after filtering: {model.transactions['transaction_date'].min()} to {model.transactions['transaction_date'].max()}")
print(f"Positive quantities: {(model.transactions['quantity'] > 0).sum():,}")
print(f"Zero quantities: {(model.transactions['quantity'] == 0).sum():,}")

# Check for missing values in key columns
print("\n=== Missing Values Check ===")
print("Transactions missing values:")
print(model.transactions[['internal_store_id', 'internal_product_id', 'quantity', 'transaction_date']].isnull().sum())

print("\nProducts missing values:")
print(model.products[['produto', 'categoria', 'marca', 'descricao']].isnull().sum())

print("\nStores missing values:")
print(model.stores[['pdv', 'categoria_pdv', 'premise']].isnull().sum())


Cleansing data...
Removed 0 rows with null values
Kept 6,430,161 rows with positive quantities
Filtered to 2022 data: 6,430,161 rows
Cleaning products and stores...

=== Data Quality Check ===
Final transaction data shape: (6430161, 11)
Date range after filtering: 2022-01-01 00:00:00 to 2022-12-31 00:00:00
Positive quantities: 6,430,161
Zero quantities: 0

=== Missing Values Check ===
Transactions missing values:
internal_store_id      0
internal_product_id    0
quantity               0
transaction_date       0
dtype: int64

Products missing values:
produto      0
categoria    0
marca        0

## 5. Data Merging and Weekly Aggregations

Merge transaction data with product and store information, then create weekly aggregations of sales metrics including quantity, revenue, and profit measures.
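
A left merge keeps the transaction row count only if the lookup key is unique. The sketch below, on made-up miniature tables, shows one way to enforce that with the `validate` argument of `DataFrame.merge`, followed by the same column-flattening idiom used in `create_weekly_aggregations`. The `"Draft"` category is invented for illustration.

```python
import pandas as pd

# Made-up miniature tables using the notebook's key column names.
transactions = pd.DataFrame({
    "internal_product_id": [10, 10, 11],
    "quantity": [3.0, 2.0, 5.0],
})
products = pd.DataFrame({
    "produto": [10, 11],
    "categoria": ["Package", "Draft"],
})

# validate="many_to_one" raises pandas.errors.MergeError if `produto` is duplicated
# in the lookup table, which would otherwise silently multiply transaction rows.
merged = transactions.merge(
    products,
    left_on="internal_product_id",
    right_on="produto",
    how="left",
    validate="many_to_one",
)
assert len(merged) == len(transactions)

# Multi-function aggregation yields two-level column names; join them with "_".
weekly = merged.groupby("categoria").agg({"quantity": ["sum", "mean", "count"]}).reset_index()
weekly.columns = ["_".join(col) if col[1] else col[0] for col in weekly.columns]
print(weekly)
```

In the notebook the row count is unchanged after both merges, which is consistent with unique keys in the product and store tables.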

In [45]:
# Merge transaction data with product and store information
model.merge_data()

# Create weekly aggregations
model.create_weekly_aggregations()

# Display aggregated data information
print("\n=== Weekly Aggregation Summary ===")
print(f"Weekly data shape: {model.weekly_data.shape}")
print(f"Columns in weekly data: {list(model.weekly_data.columns)}")
print(f"Week range: {model.weekly_data['week'].min()} to {model.weekly_data['week'].max()}")
print(f"Unique store-product combinations: {model.weekly_data[['internal_store_id', 'internal_product_id']].drop_duplicates().shape[0]:,}")

# Show sample of weekly aggregated data
print("\n=== Sample Weekly Data ===")
display(model.weekly_data.head())

# Statistics on weekly quantities
print("\n=== Weekly Quantity Statistics ===")
print(model.weekly_data['total_quantity'].describe())


Merging data...
After product merge: 6,430,161 rows
After store merge: 6,430,161 rows
Data merge completed

Creating weekly aggregations...
Created weekly aggregations: 6,133,925 rows

=== Weekly Aggregation Summary ===
Weekly data shape: (6133925, 17)
Columns in weekly data: ['year_week', 'week', 'internal_store_id', 'internal_product_id', 'categoria', 'marca', 'premise', 'categoria_pdv', 'total_quantity', 'avg_quantity_per_transaction', 'num_transactions', 'total_gross_value', 'avg_gross_value', 'total_net_value', 'avg_net_value', 'total_gross_profit', 'avg_gross_profit']
Week range: 1 to 52

Unnamed: 0,year_week,week,internal_store_id,internal_product_id,categoria,marca,premise,categoria_pdv,total_quantity,avg_quantity_per_transaction,num_transactions,total_gross_value,avg_gross_value,total_net_value,avg_net_value,total_gross_profit,avg_gross_profit
0,2022_01,1,1001371918471115422,1029370090212151375,Package,Michelob Ultra,Off Premise,Convenience,3.0,3.0,1,80.25,80.25,73.410728,73.410728,23.136728,23.136728
1,2022_01,1,1001371918471115422,1120490062981954254,Package,Bud Light,Off Premise,Convenience,18.0,18.0,1,262.350006,262.350006,234.992905,234.992905,62.210903,62.210903
2,2022_01,1,1001371918471115422,2239307647969388381,Package,Natural Light,Off Premise,Convenience,2.0,2.0,1,37.950001,37.950001,31.870647,31.870647,8.182648,8.182648
3,2022_01,1,1001371918471115422,4353552881410365573,Package,Natural Light,Off Premise,Convenience,7.0,7.0,1,138.25,138.25,118.302116,118.302116,33.567116,33.567116
4,2022_01,1,1001371918471115422,4797439216678436447,Package,Bud Light Lime,Off Premise,Convenience,1.0,1.0,1,26.75,26.75,24.470242,24.470242,7.524241,7.524241



=== Weekly Quantity Statistics ===
count    6.133925e+06
mean     8.690512e+00
std      8.317377e+01
min      1.192093e-07
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      9.423000e+04
Name: total_quantity, dtype: float64
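
One detail of the weekly key is worth checking: `create_weekly_aggregations` combines the calendar year with the ISO week number, and the two disagree at year boundaries. The sketch below uses only the standard library; the helper names are hypothetical.

```python
from datetime import date

def calendar_year_week(d):
    """Label as built in create_weekly_aggregations: calendar year + ISO week."""
    return f"{d.year}_{d.isocalendar()[1]:02d}"

def iso_year_week(d):
    """Alternative label that pairs the ISO week with its ISO year."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}_{iso_week:02d}"

for d in (date(2022, 1, 1), date(2022, 1, 3), date(2022, 12, 31)):
    print(d, calendar_year_week(d), iso_year_week(d))
# 2022-01-01 2022_52 2021_52
# 2022-01-03 2022_01 2022_01
# 2022-12-31 2022_52 2022_52
```

With the calendar-year label, transactions from 1 and 2 January 2022 share week 52 with the last week of December 2022. Whether this matters in practice depends on how many transactions fall on those two days.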


## 6. Feature Engineering

Build the engineered features for the model: a one-week lag, a two-week rolling average, a cosine encoding of the week number, and per-pair (store-product) summary statistics.
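
The lag and rolling features can be sketched on a single made-up store-product pair. The first two feature columns follow `build_features`; the `rolling_past_only` variant is an assumption added for comparison, not part of the notebook.

```python
import pandas as pd

# Made-up weekly totals for one store-product pair, already sorted by week.
weekly = pd.DataFrame({
    "internal_store_id": [1, 1, 1, 1],
    "internal_product_id": [10, 10, 10, 10],
    "week": [5, 6, 7, 8],
    "total_quantity": [4.0, 2.0, 6.0, 8.0],
})

keys = ["internal_store_id", "internal_product_id"]
grouped = weekly.groupby(keys)["total_quantity"]

# Lag 1: the previous observed row's quantity for the same pair.
weekly["quantity_lag_1"] = grouped.shift(1)

# Rolling mean as in build_features: the window includes the current week.
weekly["rolling_incl_current"] = (
    grouped.rolling(window=2, min_periods=1).mean().reset_index(level=[0, 1], drop=True)
)

# Variant that only uses past weeks: shift first, then roll within each pair.
weekly["rolling_past_only"] = (
    weekly.groupby(keys)["quantity_lag_1"]
    .rolling(window=2, min_periods=1).mean()
    .reset_index(level=[0, 1], drop=True)
)

print(weekly[["week", "total_quantity", "quantity_lag_1",
              "rolling_incl_current", "rolling_past_only"]])
```

Note that `shift(1)` returns the previous observed row rather than the previous calendar week, so for a pair with gaps in its sales history the lag can span several weeks.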

In [46]:
# Build features for the model
model.build_features()

# Display feature engineering results
print("\n=== Feature Engineering Summary ===")
feature_columns = [col for col in model.weekly_data.columns if any(x in col for x in ['lag', 'rolling', 'cos', 'store_product'])]
print(f"Created {len(feature_columns)} engineered features:")
for feature in feature_columns:
    print(f"  - {feature}")

# Show sample of data with new features (updated to use actual columns)
print("\n=== Sample Data with Features ===")
sample_cols = ['internal_store_id', 'internal_product_id', 'week', 'total_quantity', 
               'quantity_lag_1', 'quantity_rolling_avg_2', 'week_cos', 'store_product_max_quantity']
display(model.weekly_data[sample_cols].head(10))

# Check for missing values after feature engineering
print("\n=== Missing Values After Feature Engineering ===")
missing_counts = model.weekly_data.isnull().sum()
print(f"Total missing values: {missing_counts.sum()}")
print(f"Columns with missing values: {missing_counts[missing_counts > 0].to_dict()}")


Building optimized features (11 selected features)...
Creating lag feature (lag_1 only)...
Creating rolling average (window=2 only)...
Creating seasonal feature (cos only)...
Creating store-product interaction features...
Optimized feature engineering completed

=== Feature Engineering Summary ===
Created 7 engineered features:
  - quantity_lag_1
  - quantity_rolling_avg_2
  - week_cos
  - store_product_std_quantity
  - store_product_min_quantity
  - store_product_max_quantity
  - store_product_avg_transactions

=== Sample Data with Features ===


| | internal_store_id | internal_product_id | week | total_quantity | quantity_lag_1 | quantity_rolling_avg_2 | week_cos | store_product_max_quantity |
|---|---|---|---|---|---|---|---|---|
| 433888 | 1000237487041964405 | 1837429607327399565 | 6 | 1.0 | 0.0 | 1.0 | 0.748511 | 2.0 |
| 529002 | 1000237487041964405 | 1837429607327399565 | 7 | 2.0 | 1.0 | 1.5 | 0.663123 | 2.0 |
| 2042472 | 1000237487041964405 | 1837429607327399565 | 21 | 1.0 | 2.0 | 1.5 | -0.822984 | 2.0 |
| 2521528 | 1000237487041964405 | 1837429607327399565 | 25 | 2.0 | 1.0 | 1.5 | -0.992709 | 2.0 |
| 2891177 | 1000237487041964405 | 1837429607327399565 | 28 | 2.0 | 2.0 | 2.0 | -0.970942 | 2.0 |
| 3599754 | 1000237487041964405 | 1837429607327399565 | 34 | 2.0 | 2.0 | 2.0 | -0.568065 | 2.0 |
| 4475319 | 1000237487041964405 | 1837429607327399565 | 39 | 1.0 | 2.0 | 1.5 | -0.0 | 2.0 |
| 5314756 | 1000237487041964405 | 1837429607327399565 | 46 | 2.0 | 1.0 | 1.5 | 0.748511 | 2.0 |
| 1565795 | 1000237487041964405 | 4038588102284338370 | 17 | 1.0 | 0.0 | 1.0 | -0.464723 | 1.0 |
| 433889 | 1000237487041964405 | 5429216175252037173 | 6 | 2.0 | 0.0 | 2.0 | 0.748511 | 2.0 |



=== Missing Values After Feature Engineering ===
Total missing values: 0
Columns with missing values: {}


## 7. Model Training with CatBoost

Prepare training data with label encoding for categorical features and train the CatBoost regression model with appropriate hyperparameters.
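
The two preparation steps named here, label encoding and a time-ordered train/validation split, can be sketched in plain Python. This is only an illustration, not the code inside `prepare_training_data` or `train_model`: the raw column names `categoria_pdv` and `marca` are inferred from the `_encoded` feature names, the toy rows are invented, and the helpers `fit_label_encoder` and `temporal_split` are hypothetical.

```python
# Hypothetical toy rows; the real notebook works on model.weekly_data.
rows = [
    {"week": 1, "categoria_pdv": "bar", "marca": "A", "total_quantity": 3.0},
    {"week": 2, "categoria_pdv": "market", "marca": "B", "total_quantity": 1.0},
    {"week": 3, "categoria_pdv": "bar", "marca": "B", "total_quantity": 2.0},
    {"week": 4, "categoria_pdv": "market", "marca": "A", "total_quantity": 5.0},
    {"week": 5, "categoria_pdv": "bar", "marca": "C", "total_quantity": 4.0},
]


def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def temporal_split(rows, holdout_weeks=1):
    """Hold out the last `holdout_weeks` distinct weeks for validation."""
    weeks = sorted({r["week"] for r in rows})
    cutoff = weeks[-holdout_weeks]
    train = [r for r in rows if r["week"] < cutoff]
    val = [r for r in rows if r["week"] >= cutoff]
    return train, val


# Encode each categorical column and keep the mapping for prediction time.
encoders = {}
for col in ("categoria_pdv", "marca"):
    encoders[col] = fit_label_encoder(r[col] for r in rows)
    for r in rows:
        r[f"{col}_encoded"] = encoders[col][r[col]]

train_rows, val_rows = temporal_split(rows, holdout_weeks=2)
print(encoders["marca"])
print(len(train_rows), len(val_rows))
```

Splitting by week rather than at random keeps every validation row later than every training row, which is closer to how the forecast is actually used.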

In [47]:
# Prepare training data
X, y, feature_columns, train_data = model.prepare_training_data()

print("\n=== Training Data Preparation ===")
print(f"Feature columns ({len(feature_columns)}):")
for i, feature in enumerate(feature_columns):
    print(f"  {i+1:2d}. {feature}")

print(f"\nTraining data statistics:")
print(f"- Features (X): {X.shape}")
print(f"- Target (y): {y.shape}")
print(f"- Target range: {y.min():.2f} to {y.max():.2f}")
print(f"- Target mean: {y.mean():.2f}")
print(f"- Target median: {y.median():.2f}")

# Train the CatBoost model with temporal split
print("\n" + "="*50)
print("TRAINING CATBOOST MODEL")
print("="*50)
model.train_model(X, y, train_data)


Preparing training data with 11 selected features...
Training data shape: X=(5800450, 11), y=(5800450,)
Selected features (11):
   1. avg_quantity_per_transaction
   2. store_product_max_quantity
   3. categoria_pdv_encoded
   4. marca_encoded
   5. store_product_min_quantity
   6. quantity_rolling_avg_2
   7. quantity_lag_1
   8. store_product_avg_transactions
   9. store_product_std_quantity
  10. week_cos
  11. total_gross_value

=== Training Data Preparation ===
Feature columns (11):
   1. avg_quantity_per_transaction
   2. store_product_max_quantity
   3. categoria_pdv_encoded
   4. marca_encoded
   5. store_product_min_quantity
   6. quantity_rolling_avg_2
   7. quantity_lag_1
   8. store_product_avg_transactions
   9. store_product_std_quantity
  10. week_cos
  11. total_gross_value

Training data statistics:
- Features (X): (5800450, 11)
- Target (y): (5800450,)
- Target range: 0.00 to 94230.00
- Target mean: 8.97
- Target median: 2.00

TRAINING CATBOOST MODEL

Training CatBoo

## 8. Model Evaluation and Validation

Evaluate the trained model using MAPE and WMAPE metrics on both training and validation sets to assess performance and check for overfitting.
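
For reference, a minimal sketch of the two metrics, assuming the usual definitions (the notebook's own metric code is not shown in this section, so the `eps` guard and function names here are assumptions). MAPE averages per-row percentage errors, so rows with a zero or near-zero actual can dominate it; WMAPE divides total absolute error by total actual volume and is far less sensitive to those rows.

```python
def mape(y_true, y_pred, eps=1e-9):
    """Mean absolute percentage error, in percent.

    `eps` guards against division by zero; its value is an assumption.
    """
    terms = [abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error over total actual volume, in percent."""
    total_actual = sum(abs(t) for t in y_true)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when all actuals are zero")
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return 100.0 * total_error / total_actual


# One zero actual is enough to blow up MAPE while WMAPE stays moderate.
y_true = [10.0, 2.0, 0.0, 8.0]
y_pred = [9.0, 3.0, 1.0, 8.0]
print(f"WMAPE: {wmape(y_true, y_pred):.1f}%")
print(f"MAPE:  {mape(y_true, y_pred):.3g}%")
```

Since the training target ranges from 0.00 upward, a very large MAPE next to a moderate WMAPE is plausibly a sign of zero-quantity rows rather than of poor fit overall.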

In [48]:
# Display detailed validation metrics
print("\n=== MODEL EVALUATION RESULTS ===")
if model.validation_metrics:
    metrics = model.validation_metrics
    
    print(f"Training MAPE: {metrics['train_mape']:.3f}%")
    print(f"Validation MAPE: {metrics['val_mape']:.3f}%")
    print(f"Training WMAPE: {metrics['train_wmape']:.3f}%")
    print(f"Validation WMAPE: {metrics['val_wmape']:.3f}%")
    
    print(f"\nData Split:")
    print(f"Training samples: {metrics['train_samples']:,}")
    print(f"Validation samples: {metrics['val_samples']:,}")
    
    # Calculate overfitting indicators (absolute gaps, so the sign of val - train is not preserved)
    mape_diff = abs(metrics['val_mape'] - metrics['train_mape'])
    wmape_diff = abs(metrics['val_wmape'] - metrics['train_wmape'])
    
    print(f"\nOverfitting Analysis:")
    print(f"MAPE difference (val - train): {mape_diff:.3f}%")
    print(f"WMAPE difference (val - train): {wmape_diff:.3f}%")
    
    if wmape_diff < 5 and mape_diff < 5:
        status = "✅ Good generalization"
    elif wmape_diff < 10 and mape_diff < 10:
        status = "⚠️ Moderate overfitting"
    else:
        status = "❌ High overfitting risk"
    
    print(f"Model Status: {status}")

# Feature importance analysis
if hasattr(model.model, 'feature_importances_'):
    print("\n=== TOP 10 FEATURE IMPORTANCES ===")
    importances = model.model.feature_importances_
    feature_importance = list(zip(feature_columns, importances))
    feature_importance.sort(key=lambda x: x[1], reverse=True)
    
    for i, (feature, importance) in enumerate(feature_importance[:10]):
        print(f"{i+1:2d}. {feature:<30} {importance:.4f}")
else:
    print("\nFeature importance not available for this model.")


=== MODEL EVALUATION RESULTS ===
Training MAPE: 8096.167%
Validation MAPE: 74.603%
Training WMAPE: 17.984%
Validation WMAPE: 30.828%

Data Split:
Training samples: 4,732,158
Validation samples: 1,068,292

Overfitting Analysis:
MAPE difference (val - train): 8021.565%
WMAPE difference (val - train): 12.844%
Model Status: ❌ High overfitting risk

=== TOP 10 FEATURE IMPORTANCES ===
 1. avg_quantity_per_transaction   55.0207
 2. store_product_max_quantity     17.7013
 3. categoria_pdv_encoded          10.8308
 4. marca_encoded                  7.5732
 5. store_product_min_quantity     4.5030
 6. quantity_rolling_avg_2         1.4686
 7. store_product_avg_transactions 1.1425
 8. quantity_lag_1                 1.0988
 9. store_product_std_quantity     0.6612
10. week_cos                       0.0000
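
The overfitting check in this cell compares absolute differences, so a training error far above the validation error is reported the same way as the reverse. A small sketch of a signed variant, using the metric values printed above; the helper names and the handling of negative gaps are assumptions, and the 5 and 10 point cutoffs mirror the ones in the cell.

```python
def generalization_gap(train_metric, val_metric):
    """Signed gap: positive when validation error exceeds training error."""
    return val_metric - train_metric


def overfitting_status(gap, ok=5.0, warn=10.0):
    """Classify a signed gap using the same 5 and 10 point cutoffs."""
    if gap < 0:
        return "validation better than training (inspect the metric)"
    if gap < ok:
        return "good generalization"
    if gap < warn:
        return "moderate overfitting"
    return "high overfitting risk"


# Values copied from the evaluation output above.
wmape_gap = generalization_gap(17.984, 30.828)
mape_gap = generalization_gap(8096.167, 74.603)
print(f"WMAPE gap: {wmape_gap:+.3f} -> {overfitting_status(wmape_gap)}")
print(f"MAPE gap:  {mape_gap:+.3f} -> {overfitting_status(mape_gap)}")
```

With the signed gap, WMAPE still points to overfitting, while the MAPE gap is negative and says more about the metric's sensitivity than about generalization.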


## 8.1. Feature Selection and Importance Analysis

Analyze feature importance and select the most impactful features to improve model performance and reduce overfitting.

In [18]:
def select_important_features(model, feature_columns, importance_threshold=0.01, max_features=20):
    """
    Select important features based on CatBoost feature importance
    
    Args:
        model: Trained CatBoost model
        feature_columns: List of all feature names
        importance_threshold: Minimum importance score on CatBoost's 0-100 scale (0.01 means 0.01%)
        max_features: Maximum number of features to keep
    """
    print(f"\n=== FEATURE SELECTION ===")
    print(f"Starting with {len(feature_columns)} features")
    
    if not hasattr(model.model, 'feature_importances_'):
        print("Feature importance not available")
        return feature_columns
    
    # Get feature importances
    importances = model.model.feature_importances_
    feature_importance = list(zip(feature_columns, importances))
    feature_importance.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\n=== ALL FEATURE IMPORTANCES ===")
    for i, (feature, importance) in enumerate(feature_importance):
        status = "✅" if importance >= importance_threshold else "❌"
        print(f"{i+1:2d}. {feature:<35} {importance:.4f} {status}")
    
    # Apply thresholds
    important_features = [feat for feat, imp in feature_importance 
                         if imp >= importance_threshold]
    
    # Limit to max_features if specified
    if max_features and len(important_features) > max_features:
        important_features = important_features[:max_features]
        print(f"\nLimited to top {max_features} features")
    
    removed_features = [feat for feat in feature_columns 
                       if feat not in important_features]
    
    print(f"\n=== FEATURE SELECTION RESULTS ===")
    print(f"Selected features: {len(important_features)}")
    print(f"Removed features: {len(removed_features)}")
    print(f"Importance threshold: {importance_threshold:.3f}")
    
    if removed_features:
        print(f"\n=== REMOVED FEATURES ===")
        for feat in removed_features:
            imp = next((imp for f, imp in feature_importance if f == feat), 0)
            print(f"  - {feat:<35} {imp:.4f}")
    
    print(f"\n=== SELECTED FEATURES ===")
    for i, feat in enumerate(important_features):
        imp = next((imp for f, imp in feature_importance if f == feat), 0)
        print(f"  {i+1:2d}. {feat:<35} {imp:.4f}")
    
    return important_features

# Test different importance thresholds
print("="*80)
print("FEATURE IMPORTANCE ANALYSIS AND SELECTION")
print("="*80)

print("Testing different importance thresholds:")
print("\n" + "="*60)

FEATURE IMPORTANCE ANALYSIS AND SELECTION
Testing different importance thresholds:



In [19]:
# Conservative threshold (keep more features)
print("1️⃣ CONSERVATIVE APPROACH (0.001% threshold)")
selected_features_001 = select_important_features(
    model, feature_columns, 
    importance_threshold=0.001,  # 0.001% on CatBoost's 0-100 importance scale
    max_features=25
)

print("\n" + "="*60)

# Moderate threshold (balanced approach)
print("2️⃣ MODERATE APPROACH (0.01% threshold) - RECOMMENDED")
selected_features_01 = select_important_features(
    model, feature_columns, 
    importance_threshold=0.01,   # 0.01%
    max_features=20
)

print("\n" + "="*60)

# Aggressive threshold (keep only most important)
print("3️⃣ AGGRESSIVE APPROACH (0.02% threshold)")
selected_features_02 = select_important_features(
    model, feature_columns, 
    importance_threshold=0.02,   # 0.02%
    max_features=15
)

1️⃣ CONSERVATIVE APPROACH (0.001% threshold)

=== FEATURE SELECTION ===
Starting with 27 features

=== ALL FEATURE IMPORTANCES ===
 1. avg_quantity_per_transaction        54.6173 ✅
 2. store_product_max_quantity          17.1364 ✅
 3. categoria_pdv_encoded               11.4949 ✅
 4. marca_encoded                       7.1158 ✅
 5. store_product_min_quantity          4.1466 ✅
 6. quantity_rolling_avg_2              2.2517 ✅
 7. quantity_lag_1                      1.2920 ✅
 8. store_product_avg_transactions      1.1230 ✅
 9. store_product_std_quantity          0.6537 ✅
10. week_cos                            0.1393 ✅
11. total_gross_value                   0.0173 ✅
12. total_net_value                     0.0070 ✅
13. quantity_rolling_avg_4              0.0051 ✅
14. week                                0.0000 ❌
15. week_sin                            0.0000 ❌
16. num_transactions                    0.0000 ❌
17. avg_gross_value                     0.0000 ❌
18. avg_net_value                  

In [27]:
# Update feature_columns variable for subsequent predictions
print(f"\n=== FEATURE SELECTION RESULTS ===")
print(f"Original feature count: {len(feature_columns)}")

# Use the selected features from the moderate threshold (0.01 on CatBoost's 0-100 scale)
final_features = selected_features_01
print(f"Selected feature count: {len(final_features)}")
print(f"Feature reduction: {(1 - len(final_features)/len(feature_columns))*100:.1f}%")

# Store original feature_columns for reference
original_feature_columns = feature_columns.copy()
feature_columns = final_features

print(f"\n=== SELECTED FEATURES ===")
for i, feature in enumerate(feature_columns, 1):
    print(f"{i:2d}. {feature}")

print(f"\n=== FEATURE SELECTION BENEFITS ===")
print("- Reduced model complexity by {:.1f}%".format((1 - len(feature_columns)/len(original_feature_columns))*100))
print("- Faster training and inference")
print("- Reduced overfitting risk")
print("- Focus on most impactful predictors")
print("- Better model interpretability")

print(f"\nFeature columns updated successfully!")
print(f"Model is now optimized with {len(feature_columns)} most important features!")
print("Ready to generate predictions with the improved model.")


=== FEATURE SELECTION RESULTS ===
Original feature count: 27
Selected feature count: 11
Feature reduction: 59.3%

=== SELECTED FEATURES ===
 1. avg_quantity_per_transaction
 2. store_product_max_quantity
 3. categoria_pdv_encoded
 4. marca_encoded
 5. store_product_min_quantity
 6. quantity_rolling_avg_2
 7. quantity_lag_1
 8. store_product_avg_transactions
 9. store_product_std_quantity
10. week_cos
11. total_gross_value

=== FEATURE SELECTION BENEFITS ===
- Reduced model complexity by 59.3%
- Faster training and inference
- Reduced overfitting risk
- Focus on most impactful predictors
- Better model interpretability

Feature columns updated successfully!
Model is now optimized with 11 most important features!
Ready to generate predictions with the improved model.


## 8.2. Feature Selection Summary

**Implementation Complete!** ✅

### What We Implemented:

1. **Feature Importance Analysis**: Analyzed all 27 features using CatBoost's built-in feature importance
2. **Multiple Thresholds**: Tested conservative (0.001), moderate (0.01), and aggressive (0.02) cutoffs on CatBoost's importance scale, where all features sum to 100 (so 0.01 means 0.01%)
3. **Smart Selection**: Applied the 0.01 importance cutoff plus a limit of 20 features
4. **Feature Reduction**: Reduced from 27 to 11 features (59.3% reduction)

### Selected Features (Final Set):
- **Transaction Features**: `avg_quantity_per_transaction`
- **Historical Features**: `quantity_lag_1`, `quantity_rolling_avg_2`
- **Store-Product Stats**: `store_product_max_quantity`, `store_product_min_quantity`, `store_product_avg_transactions`, `store_product_std_quantity`
- **Categorical**: `categoria_pdv_encoded`, `marca_encoded`
- **Seasonal**: `week_cos`
- **Financial**: `total_gross_value`

### Removed Low-Impact Features:
- Higher lag features (`quantity_lag_2`, `quantity_lag_3`, `quantity_lag_4`)
- Some rolling averages (`quantity_rolling_avg_4`, `quantity_rolling_avg_8`)
- Individual financial metrics (`avg_gross_value`, `net_value` features)
- Some categorical encodings (`categoria_encoded`, `premise_encoded`)
- Week number and sine component

### Benefits:
- **59.3% reduction** in feature count
- **Faster training** and prediction
- **Reduced overfitting** risk
- **Better interpretability**
- Focus on **most impactful** predictors

The model will now use these optimized features for all subsequent predictions!
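
A note on scale: the importances printed in section 8.1 sum to roughly 100, so a cutoff of `0.01` keeps any feature that contributes at least 0.01% of the total. The sketch below replays the selection on the importances visible in that output (the listing is truncated, so this is a subset of the 27 features) and compares it with a cutoff of `1.0`, which corresponds to 1%. The helper `select_features` is a simplified, hypothetical stand-in for `select_important_features`.

```python
# Importances copied from the section 8.1 output (visible subset only).
importances = {
    "avg_quantity_per_transaction": 54.6173,
    "store_product_max_quantity": 17.1364,
    "categoria_pdv_encoded": 11.4949,
    "marca_encoded": 7.1158,
    "store_product_min_quantity": 4.1466,
    "quantity_rolling_avg_2": 2.2517,
    "quantity_lag_1": 1.2920,
    "store_product_avg_transactions": 1.1230,
    "store_product_std_quantity": 0.6537,
    "week_cos": 0.1393,
    "total_gross_value": 0.0173,
    "total_net_value": 0.0070,
    "quantity_rolling_avg_4": 0.0051,
    "week": 0.0,
    "week_sin": 0.0,
    "num_transactions": 0.0,
    "avg_gross_value": 0.0,
}


def select_features(importances, threshold, max_features=None):
    """Keep features at or above `threshold`, most important first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept = [name for name, imp in ranked if imp >= threshold]
    return kept[:max_features] if max_features else kept


print(f"Sum of importances: {sum(importances.values()):.2f}")
print(f"Cutoff 0.01 keeps {len(select_features(importances, 0.01, 20))} features")
print(f"Cutoff 1.0 keeps {len(select_features(importances, 1.0, 20))} features")
```

The `0.01` cutoff reproduces the 11 selected features; a true 1% cutoff would drop `store_product_std_quantity`, `week_cos`, and `total_gross_value` as well.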

## 9. Generate Predictions

Generate sales forecasts for the next 5 weeks, select active store-product combinations, and create predictions while respecting data size limits.
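
The size limit translates directly into a pair budget: `max_rows=1500000` divided by `weeks_to_predict=5` leaves room for 300,000 store-product pairs. A minimal sketch of one way to pick those pairs, assuming "active" means highest total quantity over the last `recent_weeks` weeks; the ranking rule inside `generate_predictions` is not shown here, so that criterion, the toy rows, and `select_active_pairs` are all hypothetical.

```python
from collections import defaultdict


def select_active_pairs(weekly_rows, max_rows, weeks_to_predict, recent_weeks):
    """Rank store-product pairs by recent volume and keep as many as the row budget allows."""
    max_pairs = max_rows // weeks_to_predict
    last_week = max(r["week"] for r in weekly_rows)
    first_recent = last_week - recent_weeks + 1

    # Sum quantity per pair over the recent window only.
    volume = defaultdict(float)
    for r in weekly_rows:
        if r["week"] >= first_recent:
            volume[(r["store"], r["product"])] += r["quantity"]

    ranked = sorted(volume, key=lambda pair: volume[pair], reverse=True)
    return ranked[:max_pairs]


toy_rows = [
    {"store": "s1", "product": "p1", "week": 50, "quantity": 5.0},
    {"store": "s1", "product": "p2", "week": 51, "quantity": 1.0},
    {"store": "s2", "product": "p1", "week": 52, "quantity": 9.0},
    {"store": "s1", "product": "p1", "week": 52, "quantity": 2.0},
    {"store": "s3", "product": "p9", "week": 10, "quantity": 100.0},  # too old to count
]

pairs = select_active_pairs(toy_rows, max_rows=10, weeks_to_predict=5, recent_weeks=8)
print(pairs)
print(1_500_000 // 5)
```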

In [14]:
# Generate predictions for the next 5 weeks
print("="*60)
print("GENERATING SALES PREDICTIONS FOR JANUARY 2023")
print("="*60)

predictions_df = model.generate_predictions(
    feature_columns, 
    train_data, 
    max_rows=1500000, 
    weeks_to_predict=5, 
    recent_weeks=8
)

# Display prediction results
print("\n=== PREDICTION RESULTS ===")
print(f"Total predictions generated: {len(predictions_df):,}")
print(f"Unique stores: {predictions_df['pdv'].nunique():,}")
print(f"Unique products: {predictions_df['produto'].nunique():,}")
print(f"Weeks predicted: {sorted(predictions_df['semana'].unique())}")

# Sample of predictions
print("\n=== SAMPLE PREDICTIONS ===")
display(predictions_df.head(10))

# Predictions by week
print("\n=== PREDICTIONS BY WEEK ===")
weekly_summary = predictions_df.groupby('semana').agg({
    'quantidade': ['count', 'sum', 'mean', 'median', 'max'],
    'pdv': 'nunique',
    'produto': 'nunique'
}).round(2)
weekly_summary.columns = ['Count', 'Total_Qty', 'Mean_Qty', 'Median_Qty', 'Max_Qty', 'Unique_Stores', 'Unique_Products']
display(weekly_summary)

GENERATING SALES PREDICTIONS FOR JANUARY 2023

Generating predictions for January 2023...
Found 999,323 unique store-product combinations
Selected top 300,000 active pairs out of 999,323 (recent_weeks=8)
Predicting week 1...
Generated 300,000 predictions for week 1
Predicting week 2...
Generated 300,000 predictions for week 2
Predicting week 3...
Generated 300,000 predictions for week 3
Predicting week 4...
Generated 300,000 predictions for week 4
Predicting week 5...
Generated 300,000 predictions for week 5
Total predictions generated: 1,500,000

Prediction Analysis:
Total predictions: 1,500,000
Zero predictions: 0
Non-zero predictions: 1,500,000
Mean prediction: 7.91
M

| | semana | pdv | produto | quantidade |
|---|---|---|---|---|
| 0 | 1 | 1000237487041964405 | 777251454728290683 | 3 |
| 1 | 1 | 1001371918471115422 | 1009179103632945474 | 1 |
| 2 | 1 | 1001371918471115422 | 1029370090212151375 | 3 |
| 3 | 1 | 1001371918471115422 | 1120490062981954254 | 9 |
| 4 | 1 | 1001371918471115422 | 1371936917923350372 | 9 |
| 5 | 1 | 1001371918471115422 | 1394381856358939027 | 12 |
| 6 | 1 | 1001371918471115422 | 1454838625590783593 | 4 |
| 7 | 1 | 1001371918471115422 | 145852603040678098 | 1 |
| 8 | 1 | 1001371918471115422 | 1527082310248040324 | 1 |
| 9 | 1 | 1001371918471115422 | 1625722803643187564 | 1 |



=== PREDICTIONS BY WEEK ===


| semana | Count | Total_Qty | Mean_Qty | Median_Qty | Max_Qty | Unique_Stores | Unique_Products |
|---|---|---|---|---|---|---|---|
| 1 | 300000 | 2372848 | 7.91 | 3.0 | 2016 | 10565 | 4058 |
| 2 | 300000 | 2372848 | 7.91 | 3.0 | 2016 | 10565 | 4058 |
| 3 | 300000 | 2372848 | 7.91 | 3.0 | 2016 | 10565 | 4058 |
| 4 | 300000 | 2372848 | 7.91 | 3.0 | 2016 | 10565 | 4058 |
| 5 | 300000 | 2372848 | 7.91 | 3.0 | 2016 | 10565 | 4058 |
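
The weekly summary reports the same count, total, mean, and maximum for all five weeks. That pattern would be expected if every horizon is scored from the same feature row, although the prediction code is not shown here. One common alternative is recursive forecasting, where each week's prediction becomes the next week's `quantity_lag_1`. The sketch below illustrates the idea under stated assumptions: `toy_model` is a hypothetical stand-in for the trained regressor, the rolling average is taken over the last two known values, and `week_cos` uses a 52-week period, which is consistent with the sample values shown earlier.

```python
import math


def week_cos(week, period=52):
    """Cosine seasonal encoding of the week number."""
    return math.cos(2 * math.pi * week / period)


def recursive_forecast(history, predict_one, weeks_to_predict=5):
    """Forecast week by week, feeding each prediction back in as the next lag.

    `history` holds recent actual quantities for one store-product pair, oldest first.
    """
    values = list(history)
    forecasts = []
    for week in range(1, weeks_to_predict + 1):
        window = values[-2:]
        features = {
            "quantity_lag_1": values[-1],
            "quantity_rolling_avg_2": sum(window) / len(window),
            "week_cos": week_cos(week),
        }
        y_hat = max(0.0, predict_one(features))  # quantities cannot be negative
        forecasts.append(y_hat)
        values.append(y_hat)
    return forecasts


def toy_model(features):
    """Hypothetical stand-in for model.predict on a single row."""
    return (0.5 * features["quantity_lag_1"]
            + 0.5 * features["quantity_rolling_avg_2"]
            + features["week_cos"])


forecasts = recursive_forecast([2.0, 4.0], toy_model, weeks_to_predict=5)
print([round(f, 2) for f in forecasts])
```

Errors accumulate across horizons with this approach, so it trades per-week variation for some added drift.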


## 10. Save and Analyze Results

Save predictions to CSV format, analyze prediction statistics, and generate a comprehensive performance report with validation metrics.
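
A minimal sketch of the save step with the standard library, assuming a comma-delimited file with the four columns shown in the sample output (`semana`, `pdv`, `produto`, `quantidade`); the delimiter and encoding that `save_predictions` actually uses are not shown, and `write_predictions` is a hypothetical helper. The store and product identifiers are 18 to 19 digit integers, so they are kept as integers rather than floats to avoid losing digits on a round trip.

```python
import csv
import io

# Two rows copied from the sample predictions.
predictions = [
    {"semana": 1, "pdv": 1000237487041964405, "produto": 777251454728290683, "quantidade": 3},
    {"semana": 1, "pdv": 1001371918471115422, "produto": 1009179103632945474, "quantidade": 1},
]

FIELDS = ["semana", "pdv", "produto", "quantidade"]


def write_predictions(rows, handle):
    """Write prediction rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Round trip through an in-memory buffer to check that the IDs survive intact.
buffer = io.StringIO()
write_predictions(predictions, buffer)
buffer.seek(0)
restored = [{key: int(value) for key, value in row.items()} for row in csv.DictReader(buffer)]
print(restored == predictions)
```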

In [15]:
# Save predictions to CSV file
model.save_predictions(predictions_df, filename="sales_predictions_notebook.csv")

# Generate comprehensive performance report
model.print_performance_report()

# Additional analysis - prediction distribution
print("\n=== PREDICTION DISTRIBUTION ANALYSIS ===")

# Quantity distribution
print("Quantity Distribution:")
print(f"Zero predictions: {(predictions_df['quantidade'] == 0).sum():,} ({(predictions_df['quantidade'] == 0).mean()*100:.1f}%)")
print(f"Non-zero predictions: {(predictions_df['quantidade'] > 0).sum():,} ({(predictions_df['quantidade'] > 0).mean()*100:.1f}%)")

# Percentiles
percentiles = [25, 50, 75, 90, 95, 99]
print(f"\nQuantity Percentiles:")
for p in percentiles:
    value = predictions_df['quantidade'].quantile(p/100)
    print(f"  {p}th percentile: {value:.2f}")

# Top predicted products
print("\n=== TOP 10 PRODUCTS BY TOTAL PREDICTED QUANTITY ===")
top_products = predictions_df.groupby('produto')['quantidade'].sum().sort_values(ascending=False).head(10)
for i, (product, qty) in enumerate(top_products.items(), 1):
    print(f"{i:2d}. Product {product}: {qty:,} units")

# Top predicted stores
print("\n=== TOP 10 STORES BY TOTAL PREDICTED QUANTITY ===")
top_stores = predictions_df.groupby('pdv')['quantidade'].sum().sort_values(ascending=False).head(10)
for i, (store, qty) in enumerate(top_stores.items(), 1):
    print(f"{i:2d}. Store {store}: {qty:,} units")

print("\n" + "="*60)
print("SALES FORECAST PIPELINE COMPLETED SUCCESSFULLY!")
print("="*60)
print(f"Generated {len(predictions_df):,} predictions for 5 weeks of January 2023")
print(f"Predictions saved to: sales_predictions_notebook.csv")
print("Ready for submission to the Big Data Hackathon 2025!")


Saving predictions to sales_predictions_notebook.csv...
Predictions saved successfully!
File: sales_predictions_notebook.csv
Rows: 1,500,000
Sample:
   semana                  pdv              produto  quantidade
0       1  1000237487041964405   777251454728290683           3
1       1  1001371918471115422  1009179103632945474           1
2       1  1001371918471115422  1029370090212151375           3
3       1  1001371918471115422  1120490062981954254           9
4       1  1001371918471115422  1371936917923350372           9
5       1  1001371918471115422  1394381856358939027          12
6       1  1001371918471115422  1454838625590783593           4
7       1  1001371918471115422   145852603040678098           1
8       1  1001371918471115422  1527082310248040324           1
9       1  1001371918471115422  1625722803643187564           1

PERFORMANCE REPORT
Model Validation Metrics:
  Training MAPE: 9044.29%
  Validation MAPE: 54.45%
  Training WMAPE: 15.29%
  Validation WMAPE: 22.