# Rossmann Sales Forecasting — Time Series Modeling

---

## 1. Introduction

### Problem Statement

Rossmann operates over 3,000 drugstores across Europe. Due to the short shelf life of many pharmaceutical products, it's essential to forecast daily sales accurately.

Currently, store managers manually forecast daily sales for the next six weeks. To improve consistency and accuracy, we're tasked with building a **data-driven time-series model** to automate this process.

---

### Objective:

Build a robust **time-series forecasting pipeline** to predict **daily sales for the next 6 weeks**, for **9 key Rossmann stores** using:
- Time-series decomposition
- Stationarity checks (ADF test)
- Cointegration tests (Johansen)
- VAR or VARMAX modeling
- MAPE as the evaluation metric
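
Since MAPE is the metric every later model is judged on, a minimal sketch of how it is computed may help. The helper name `mape` and the sample numbers are illustrative assumptions, not part of the notebook. The sketch skips days with zero actual sales (for example, days a store is closed), because the percentage error is undefined there; whether to exclude such days is a modelling choice.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent, over days with non-zero actual sales."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = actual != 0  # assumption: skip zero-sales days to avoid division by zero
    return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100)

# Illustrative numbers only, not Rossmann data
actual = [100.0, 200.0, 0.0, 400.0]
forecast = [110.0, 180.0, 50.0, 400.0]
score = mape(actual, forecast)
print(f"MAPE: {score:.2f}%")
```

Note that `sklearn.metrics.mean_absolute_percentage_error`, imported in the next section, returns a fraction rather than a percentage, so its output would need multiplying by 100 to be comparable with this helper.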


## 2. Load Data & Basic Inspection

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.api import VARMAX
from statsmodels.tsa.stattools import ccf
from statsmodels.tsa.seasonal import seasonal_decompose

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_percentage_error
from datetime import datetime

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load CSVs
train_df = pd.read_csv("train.csv", parse_dates=["Date"])
store_df = pd.read_csv("store.csv")

In [None]:
# View data shapes
print("Train shape:", train_df.shape)
print("Store shape:", store_df.shape)

In [None]:
# Preview train data
train_df.head()

## 3. Data Preparation & Merging

In [None]:
# Merge train and store metadata
df = pd.merge(train_df, store_df, on='Store', how='left')

# Focus on 9 selected stores
selected_stores = [1, 3, 8, 9, 13, 25, 29, 31, 46]
df = df[df["Store"].isin(selected_stores)]

# Drop unnecessary columns
df.drop(columns=["Customers"], inplace=True)  # Customers highly correlated with Sales and not known in advance

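# Fill NA in competition and Promo2 columns (0 marks "no information")
# Note: a 0 in CompetitionDistance therefore means "unknown", not "competitor next door"
comp_cols = ["CompetitionDistance", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
             "Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]
df[comp_cols] = df[comp_cols].fillna(0)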

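# Feature engineering: Promo2 active flag
def is_promo2_active(row):
    """Return True if the row's month is listed in the store's PromoInterval."""
    if row["Promo2"] == 1 and row["PromoInterval"] != 0:
        # store.csv spells September as "Sept", while strftime("%b") yields "Sep"
        promo_months = row["PromoInterval"].replace("Sept", "Sep").split(",")
        return row["Date"].strftime("%b") in promo_months
    return False

df["Promo2Active"] = df.apply(is_promo2_active, axis=1).astype(int)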

# Convert Date to datetime if not already
df["Date"] = pd.to_datetime(df["Date"])

# Sort values by Store + Date
df.sort_values(by=["Store", "Date"], inplace=True)
df.reset_index(drop=True, inplace=True)

# Check cleaned dataset
df.head()
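
The Promo2 flag can be sanity-checked on a few hand-made rows. The sketch below is a hypothetical variant (`promo2_active`) that looks months up in a fixed English table instead of calling `strftime`, so the result does not depend on the system locale, and it treats "Sept" and "Sep" alike. The sample rows are invented for illustration.

```python
import pandas as pd

# Fixed English abbreviations, independent of the system locale
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def promo2_active(date, promo2, promo_interval):
    """Hypothetical helper: True if `date` falls in a month listed in `promo_interval`."""
    if promo2 != 1 or not isinstance(promo_interval, str):
        return False
    months = [m.strip().replace("Sept", "Sep") for m in promo_interval.split(",")]
    return MONTH_ABBR[date.month - 1] in months

# Invented rows for illustration
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-15", "2015-02-15", "2015-09-15", "2015-09-15"]),
    "Promo2": [1, 1, 1, 0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", "Mar,Jun,Sept,Dec", 0],
})
toy["Promo2Active"] = [
    int(promo2_active(d, p, i))
    for d, p, i in zip(toy["Date"], toy["Promo2"], toy["PromoInterval"])
]
print(toy)
```

This sketch, like the notebook's function, only checks the calendar month; it does not check whether the date falls after the store's Promo2 start (`Promo2SinceYear`, `Promo2SinceWeek`).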

## 4. Outlier Capping & Skewness Check

In [None]:
# Cap Sales at each store's 99th percentile (rows are kept, values are clipped)
for store in selected_stores:
    p99 = df[df.Store == store]["Sales"].quantile(0.99)
    df.loc[(df.Store == store) & (df["Sales"] > p99), "Sales"] = p99

In [None]:
# Check skewness
sns.histplot(df["Sales"], kde=True)
plt.title("Sales Distribution after Outlier Capping")
plt.show()

- Sales values above a store's 99th percentile can distort the fitted model and inflate the variance seen by the stationarity tests.
- We cap these values at the store's own 99th percentile instead of dropping the rows, so each daily series keeps its regular spacing.
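
The per-store capping can also be written without an explicit loop. The following is a minimal sketch on invented numbers, assuming the same 99th-percentile threshold; the column name `SalesCapped` is hypothetical and is used only to keep the original values visible for comparison.

```python
import pandas as pd

# Invented sales figures for two stores
toy = pd.DataFrame({
    "Store": [1] * 5 + [3] * 5,
    "Sales": [100.0, 110.0, 120.0, 130.0, 1000.0, 50.0, 55.0, 60.0, 65.0, 70.0],
})

# Per-store 99th percentile, broadcast back to every row of that store
p99 = toy.groupby("Store")["Sales"].transform(lambda s: s.quantile(0.99))

# Cap (winsorize) rather than drop, so no dates go missing from the series
toy["SalesCapped"] = toy["Sales"].clip(upper=p99)
print(toy)
```

Only the largest value in each store is pulled down to the threshold; every other row is unchanged and the row count stays the same.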

## 5. Temporal Train-Test Split (Per Store)

In [None]:
# Split last 6 weeks as test set (per store)
df["is_test"] = 0

for store in selected_stores:
    store_data = df[df.Store == store]
    cutoff_date = store_data["Date"].max() - pd.Timedelta(days=42)
    df.loc[(df.Store == store) & (df["Date"] > cutoff_date), "is_test"] = 1

# Confirm split
df["is_test"].value_counts()

- Because the data is a time series, a temporal split ensures the model is trained only on past data and tested on later, unseen periods.
- Holding out the final 6 weeks (42 days) of each store mirrors the real forecasting horizon.
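
To confirm that the cutoff logic holds out exactly 42 days per store, even when stores end on different dates, here is a small sketch on invented data. It uses a `groupby` transform in place of the loop; the toy frame and the `test_days` summary are illustrative assumptions.

```python
import pandas as pd

# Invented daily data: two stores whose histories end on different dates
toy = pd.concat([
    pd.DataFrame({"Store": 1, "Date": pd.date_range("2015-01-01", periods=100, freq="D")}),
    pd.DataFrame({"Store": 3, "Date": pd.date_range("2015-01-01", periods=90, freq="D")}),
], ignore_index=True)

# Each store's own last date, broadcast to every row of that store
last_date = toy.groupby("Store")["Date"].transform("max")

# Strictly after (last date - 42 days) leaves exactly 42 calendar days in the test set
toy["is_test"] = (toy["Date"] > last_date - pd.Timedelta(days=42)).astype(int)

test_days = toy[toy["is_test"] == 1].groupby("Store")["Date"].nunique()
print(test_days)
```

The count is 42 only if a store has a row for every calendar day in the window; stores with missing dates would end up with fewer test rows.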