
# Rossmann Store Sales Prediction using Gradient Boosting (XGBoost)

This project applies a complete machine learning workflow to the **Rossmann Store Sales** dataset.

The notebook includes:
- Exploratory Data Analysis (EDA)
- Domain-driven feature engineering
- PCA as an exploratory check on feature redundancy and dimensionality (the model itself is trained on the original features)
- Gradient Boosting with XGBoost
- K-Fold Cross Validation
- Hyperparameter tuning
- Feature importance interpretation
- Final prediction and submission generation


## Library Setup

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.max_columns", 120)
sns.set(style="whitegrid")


## Data Loading

In [None]:

ross_df = pd.read_csv("./rossmann-store-sales/train.csv", low_memory=False)
store_df = pd.read_csv("./rossmann-store-sales/store.csv")
test_df  = pd.read_csv("./rossmann-store-sales/test.csv")
submission_df = pd.read_csv("./rossmann-store-sales/sample_submission.csv")


## Data Merging

In [None]:

merged_df = ross_df.merge(store_df, how="left", on="Store")
merged_test_df = test_df.merge(store_df, how="left", on="Store")


## Exploratory Data Analysis (EDA)

In [None]:

plt.figure(figsize=(8,4))
sns.histplot(merged_df["Sales"], bins=50)
plt.title("Sales Distribution")
plt.show()

plt.figure(figsize=(12,8))
sns.heatmap(merged_df.select_dtypes(include=np.number).corr(), cmap="coolwarm")
plt.title("Numerical Feature Correlation")
plt.show()


## Date Feature Engineering

In [None]:

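def split_date(df):
    """Add Year, Month, Day and ISO WeekOfYear columns to df in place."""
    df["Date"] = pd.to_datetime(df["Date"])
    df["Year"] = df.Date.dt.year
    df["Month"] = df.Date.dt.month
    df["Day"] = df.Date.dt.day
    # isocalendar() returns a nullable UInt32 column; cast to a plain int
    # so later arithmetic and scaling work on a NumPy dtype.
    df["WeekOfYear"] = df.Date.dt.isocalendar().week.astype(int)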

split_date(merged_df)
split_date(merged_test_df)
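
As a quick sanity check of the date features, the sketch below applies the same accessors to two made-up dates. It also shows one property worth keeping in mind: `isocalendar().week` follows ISO 8601, so a late-December date can land in week 1 while `Year` still reports the calendar year. The `toy` frame and its dates are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical dates chosen for illustration only.
toy = pd.DataFrame({"Date": ["2015-07-31", "2014-12-29"]})
toy["Date"] = pd.to_datetime(toy["Date"])

toy["Year"] = toy.Date.dt.year
toy["Month"] = toy.Date.dt.month
toy["Day"] = toy.Date.dt.day
toy["WeekOfYear"] = toy.Date.dt.isocalendar().week.astype(int)

# 2014-12-29 is a Monday that ISO 8601 assigns to week 1 of 2015,
# so Year is 2014 while WeekOfYear is 1.
print(toy)
```

Any feature that combines `Year` with `WeekOfYear` inherits this mismatch for the few days around New Year.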


## Remove Closed Stores

In [None]:

merged_df = merged_df[merged_df.Open == 1].copy()


## Competition Features

In [None]:

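def comp_months(df):
    """Add CompetitionOpen: months since the nearest competitor opened."""
    df["CompetitionOpen"] = (
        12 * (df.Year - df.CompetitionOpenSinceYear) +
        (df.Month - df.CompetitionOpenSinceMonth)
    )
    # Negative values (competitor not yet open) and missing dates both become 0.
    df["CompetitionOpen"] = df["CompetitionOpen"].clip(lower=0).fillna(0)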

comp_months(merged_df)
comp_months(merged_test_df)
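
To make the three cases handled by `comp_months` concrete, here is a minimal sketch on a hand-built frame: a competitor that opened earlier, one that opens in the future, and one with no recorded opening date. The values in `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: all sales dates are in July 2015.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015],
    "Month": [7, 7, 7],
    "CompetitionOpenSinceYear": [2014.0, 2016.0, np.nan],
    "CompetitionOpenSinceMonth": [1.0, 3.0, np.nan],
})

months = (
    12 * (toy.Year - toy.CompetitionOpenSinceYear)
    + (toy.Month - toy.CompetitionOpenSinceMonth)
)
# Row 0: 12 * 1 + 6 = 18 months of competition.
# Row 1: 12 * (-1) + 4 = -8, clipped to 0 (competitor not open yet).
# Row 2: NaN, filled with 0 (no opening date recorded).
toy["CompetitionOpen"] = months.clip(lower=0).fillna(0)
print(toy["CompetitionOpen"].tolist())
```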


## Promotion Features

In [None]:

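def check_promo_month(row):
    # PromoInterval in store.csv abbreviates September as "Sept".
    month2str = {1:"Jan",2:"Feb",3:"Mar",4:"Apr",5:"May",6:"Jun",
                 7:"Jul",8:"Aug",9:"Sept",10:"Oct",11:"Nov",12:"Dec"}
    try:
        # PromoInterval is NaN for stores without Promo2; .split() then
        # raises AttributeError and the row is scored as 0.
        months = (row["PromoInterval"] or "").split(",")
        return int(row["Promo2Open"] and month2str[row["Month"]] in months)
    except Exception:
        return 0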

def promo_cols(df):
    df["Promo2Open"] = (
        12 * (df.Year - df.Promo2SinceYear) +
        (df.WeekOfYear - df.Promo2SinceWeek) * 7 / 30.5
    )
    df["Promo2Open"] = df["Promo2Open"].map(lambda x: 0 if x < 0 else x).fillna(0) * df["Promo2"]
    df["IsPromo2Month"] = df.apply(check_promo_month, axis=1) * df["Promo2"]

promo_cols(merged_df)
promo_cols(merged_test_df)
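
The month-membership test behind `IsPromo2Month` can be sketched without relying on exception handling. The helper `in_promo_month` and the `toy` rows below are hypothetical names introduced for illustration; the logic mirrors the lookup in `check_promo_month` but leaves out the `Promo2Open` condition.

```python
import numpy as np
import pandas as pd

# Month abbreviations as they appear in PromoInterval ("Sept", not "Sep").
MONTH_ABBR = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
              7: "Jul", 8: "Aug", 9: "Sept", 10: "Oct", 11: "Nov", 12: "Dec"}

def in_promo_month(interval, month):
    """Return 1 if `month` is listed in the comma-separated `interval`."""
    if not isinstance(interval, str):  # NaN: store does not run Promo2
        return 0
    return int(MONTH_ABBR[month] in interval.split(","))

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Mar,Jun,Sept,Dec", "Feb,May,Aug,Nov", np.nan],
    "Month": [7, 9, 7, 7],
})
toy["IsPromo2Month"] = [
    in_promo_month(interval, month)
    for interval, month in zip(toy.PromoInterval, toy.Month)
]
print(toy)
```

Checking the type explicitly makes the missing-interval case visible in the code instead of hiding it behind a caught exception.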


## Input & Target Selection

In [None]:

input_cols = [
    "Store","DayOfWeek","Promo","StateHoliday","SchoolHoliday",
    "StoreType","Assortment","CompetitionDistance","CompetitionOpen",
    "Day","Month","Year","WeekOfYear","Promo2","Promo2Open","IsPromo2Month"
]

target_col = "Sales"

inputs = merged_df[input_cols].copy()
targets = merged_df[target_col].copy()
test_inputs = merged_test_df[input_cols].copy()


## Scaling & Encoding

In [None]:

numeric_cols = [
    "Store","Promo","SchoolHoliday","CompetitionDistance",
    "CompetitionOpen","Promo2","Promo2Open","IsPromo2Month",
    "Day","Month","Year","WeekOfYear"
]

categorical_cols = ["DayOfWeek","StateHoliday","StoreType","Assortment"]

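# CompetitionDistance is missing for a few stores, and the PCA step below
# cannot handle NaN. Fill the gaps with a large distance
# (assumption: a missing value means no competitor nearby).
max_distance = inputs["CompetitionDistance"].max()
inputs["CompetitionDistance"] = inputs["CompetitionDistance"].fillna(max_distance * 2)
test_inputs["CompetitionDistance"] = test_inputs["CompetitionDistance"].fillna(max_distance * 2)

scaler = MinMaxScaler().fit(inputs[numeric_cols])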
inputs[numeric_cols] = scaler.transform(inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

inputs[encoded_cols] = encoder.transform(inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

X = inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]


## PCA Analysis (Exploratory)

In [None]:

std_scaler = StandardScaler()
X_std = std_scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_std)

plt.figure(figsize=(8,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker="o")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Explained Variance")
plt.show()



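PCA is used here as a diagnostic only; the model is trained on the original features, not on the principal components. The cumulative explained-variance curve helps to understand:
- Redundancy in engineered features (for example, `Month` and `WeekOfYear` carry overlapping information)
- Variance concentration: how few components account for most of the variance
- Dimensional structure of the dataset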
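
One way to read redundancy off the explained-variance curve is to count how many components are needed to reach a chosen threshold. The sketch below does this on synthetic data where two of four columns are exact linear combinations of the other two, so two components should be enough. The 95% threshold and the array `X_toy` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))

# Four columns, but only two independent signals:
# columns 2 and 3 are the sum and difference of columns 0 and 1.
X_toy = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    base[:, 0] - base[:, 1],
])

X_toy_std = StandardScaler().fit_transform(X_toy)
cum_var = np.cumsum(PCA().fit(X_toy_std).explained_variance_ratio_)

# Index of the first component at which cumulative variance reaches 95%.
n_components_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(cum_var.round(3), n_components_95)
```

A large gap between the number of columns and the number of components needed suggests that several features are close to linear combinations of others.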


## Evaluation Metric

In [None]:

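def rmse(a, b):
    # The `squared` argument was deprecated and later removed in newer
    # scikit-learn releases, so take the square root of the MSE explicitly.
    return np.sqrt(mean_squared_error(a, b))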
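
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A small worked example with invented numbers shows that the manual computation matches the scikit-learn route:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical sales figures for illustration.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Errors are -10, 10, -30; squared: 100, 100, 900; mean: 1100 / 3.
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
sklearn_rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(float(manual_rmse), 2))  # 19.15
```

Because the errors are squared before averaging, the single 30-unit miss dominates the score.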


## K-Fold Cross Validation

In [None]:

def train_and_evaluate(X_train, y_train, X_val, y_val, **params):
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, y_train)
    return rmse(model.predict(X_val), y_val)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for tr, vl in kfold.split(X):
    val_rmse = train_and_evaluate(
        X.iloc[tr], targets.iloc[tr],
        X.iloc[vl], targets.iloc[vl],
        max_depth=4, n_estimators=50
    )
    print("Validation RMSE:", val_rmse)
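
The loop above prints one score per fold. To compare hyperparameter settings, the fold scores are usually averaged into a single number per setting. The sketch below shows that pattern on synthetic data. It swaps `XGBRegressor` for scikit-learn's `DecisionTreeRegressor` so that it runs without XGBoost, and the helper `cv_rmse`, the candidate depths, and the data are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the Rossmann data).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = 10 * X_toy[:, 0] + 5 * np.sin(6 * X_toy[:, 1]) + rng.normal(0, 0.5, size=300)

def cv_rmse(max_depth, n_splits=5):
    """Mean validation RMSE over the folds for one max_depth setting."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for tr, vl in kfold.split(X_toy):
        # Stand-in model; the notebook itself uses XGBRegressor here.
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        model.fit(X_toy[tr], y_toy[tr])
        preds = model.predict(X_toy[vl])
        scores.append(np.sqrt(np.mean((y_toy[vl] - preds) ** 2)))
    return float(np.mean(scores))

# Evaluate a few candidate depths and keep the one with the lowest mean RMSE.
results = {depth: cv_rmse(depth) for depth in (1, 3, 6)}
best_depth = min(results, key=results.get)
print(results, best_depth)
```

Using the same `random_state` for the splitter across settings keeps the folds identical, so differences in the mean score come from the hyperparameter rather than from the split.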


## Final Model Training

In [None]:

final_model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.2,
    max_depth=10,
    subsample=0.9,
    colsample_bytree=0.7,
    random_state=42,
    n_jobs=-1
)

final_model.fit(X, targets)


## Feature Importance

In [None]:

importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance": final_model.feature_importances_
}).sort_values("importance", ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(data=importance_df.head(15), x="importance", y="feature")
plt.title("Top Feature Importances")
plt.show()
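
Because the categorical columns are one-hot encoded, their importance is spread over several indicator columns. A possible way to compare them with numeric features on equal footing is to sum the importances per source column. The importance values below are invented, and `source_column` is a hypothetical helper; with the real model, `importance_df` would take the place of `toy_importance`.

```python
import pandas as pd

# Made-up importances for illustration; they sum to 1.
toy_importance = pd.DataFrame({
    "feature": ["Promo", "DayOfWeek_1", "DayOfWeek_7", "StoreType_a", "StoreType_b"],
    "importance": [0.35, 0.15, 0.30, 0.12, 0.08],
})
categorical_cols = ["DayOfWeek", "StoreType"]

def source_column(feature):
    """Map an encoded column such as 'DayOfWeek_7' back to 'DayOfWeek'."""
    for col in categorical_cols:
        if feature.startswith(col + "_"):
            return col
    return feature

grouped = (
    toy_importance
    .assign(source=toy_importance["feature"].map(source_column))
    .groupby("source")["importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

In this toy case `DayOfWeek` ranks above `Promo` once its indicator columns are combined, although no single indicator does.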


## Prediction & Submission

In [None]:

test_preds = final_model.predict(X_test)
submission_df["Sales"] = test_preds * test_df.Open.fillna(1)
submission_df.to_csv("submission.csv", index=False)
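
Multiplying by `Open` forces the prediction for closed stores to zero, and `fillna(1)` treats rows with a missing `Open` flag as open. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical test rows: open, closed, and unknown status.
toy_test = pd.DataFrame({"Id": [1, 2, 3], "Open": [1.0, 0.0, np.nan]})
toy_preds = np.array([5200.0, 4100.0, 6300.0])

# Closed store -> 0 sales; missing Open flag -> keep the model's prediction.
toy_sales = toy_preds * toy_test["Open"].fillna(1)
print(toy_sales.tolist())  # [5200.0, 0.0, 6300.0]
```

This matters because closed days were removed from the training data, so the model has never seen a zero-sales closed store.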



## Project Summary

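This notebook demonstrates:
- EDA of the sales distribution and numerical feature correlations
- PCA as an exploratory diagnostic rather than a model input
- Gradient boosting with XGBoost on engineered date, competition, and promotion features
- 5-fold cross-validation scored with RMSE
- Feature importance as a starting point for interpreting the model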
