# 🛍️ ImpactStore Profit Prediction Challenge – Final Submission

This notebook presents a **complete and beginner-friendly pipeline** for the ImpactStore Hackathon, a data science competition to predict returns for each item-store pair (`Item_Store_Returns`) from real transactional retail data.

---

## 📚 Table of Contents

1. 🗂️ [Overview of the Dataset](#overview)
2. 📦 [Data Imports & Setup](#imports)
3. 👀 [Initial Data Exploration](#exploration)
4. ❓ [Missing Value Analysis](#missing-values)
5. 🧹 [Data Cleaning & Preprocessing](#cleaning)
6. 🧠 [Feature Engineering](#features)
7. 🧪 [Model Building: Base Models](#base-models)
8. 🔗 [Stacking Meta-Model](#stacking)
9. 📈 [Model Evaluation](#evaluation)
10. 💾 [Submission Generation](#submission)


## 📦 Data Imports & Setup
We install required libraries, import our tools, and load the datasets.

Installing the necessary libraries:

In [57]:
# 📦 Install dependencies
!pip install xgboost lightgbm catboost scikit-learn pandas --quiet
# 📚 Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor
from google.colab import files
import warnings
warnings.filterwarnings("ignore")

Loading our data into the notebook:

*I imported `test.csv`, `train.csv`, and the sample submission file, then loaded them.*

In [58]:
# 📊 Load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample = pd.read_csv("SampleSubmission.csv")


## 👀 Initial Data Exploration

Quickly understand what the dataset looks like — its shape, columns, and general structure — before diving into modeling.


In [59]:
train.head()

Unnamed: 0,Item_ID,Store_ID,Item_Store_ID,Item_Weight,Item_Sugar_Content,Item_Visibility,Item_Type,Item_Price,Store_Start_Year,Store_Size,Store_Location_Type,Store_Type,Item_Store_Returns
0,DRA12,BABATUNJI010,DRA12_BABATUNJI010,11.6,Low Sugar,0.068535,Soft Drinks,357.54,2005,,Cluster 3,Grocery Store,709.08
1,DRA12,BABATUNJI013,DRA12_BABATUNJI013,11.6,Low Sugar,0.040912,Soft Drinks,355.79,1994,High,Cluster 3,Supermarket Type1,6381.69
2,DRA12,BABATUNJI017,DRA12_BABATUNJI017,11.6,Low Sugar,0.041178,Soft Drinks,350.79,2014,,Cluster 2,Supermarket Type1,6381.69
3,DRA12,BABATUNJI018,DRA12_BABATUNJI018,11.6,Low Sugar,0.041113,Soft Drinks,355.04,2016,Medium,Cluster 3,Supermarket Type2,2127.23
4,DRA12,BABATUNJI035,DRA12_BABATUNJI035,11.6,Ultra Low Sugar,0.0,Soft Drinks,354.79,2011,Small,Cluster 2,Supermarket Type1,2481.77


In [60]:
test.head()

Unnamed: 0,Item_ID,Store_ID,Item_Store_ID,Item_Weight,Item_Sugar_Content,Item_Visibility,Item_Type,Item_Price,Store_Start_Year,Store_Size,Store_Location_Type,Store_Type
0,DRA59,BABATUNJI010,DRA59_BABATUNJI010,8.27,Normal Sugar,0.214125,Soft Drinks,459.98,2005,,Cluster 3,Grocery Store
1,DRA59,BABATUNJI013,DRA59_BABATUNJI013,8.27,Normal Sugar,0.127821,Soft Drinks,464.98,1994,High,Cluster 3,Supermarket Type1
2,DRB01,BABATUNJI013,DRB01_BABATUNJI013,7.39,Low Sugar,0.082171,Soft Drinks,477.38,1994,High,Cluster 3,Supermarket Type1
3,DRB13,BABATUNJI010,DRB13_BABATUNJI010,6.115,Normal Sugar,0.011791,Soft Drinks,472.63,2005,,Cluster 3,Grocery Store
4,DRB13,BABATUNJI013,DRB13_BABATUNJI013,6.115,Normal Sugar,0.007038,Soft Drinks,473.13,1994,High,Cluster 3,Supermarket Type1


In [61]:
sample.head()

Unnamed: 0,Item_Store_ID,Item_Store_Returns
0,DRA59_BABATUNJI010,100
1,DRA59_BABATUNJI013,100
2,DRB01_BABATUNJI013,100
3,DRB13_BABATUNJI010,100
4,DRB13_BABATUNJI013,100


In [62]:
train.shape

(4990, 13)

In [63]:
train.columns

Index(['Item_ID', 'Store_ID', 'Item_Store_ID', 'Item_Weight',
       'Item_Sugar_Content', 'Item_Visibility', 'Item_Type', 'Item_Price',
       'Store_Start_Year', 'Store_Size', 'Store_Location_Type', 'Store_Type',
       'Item_Store_Returns'],
      dtype='object')

In [64]:
train.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_Price,Store_Start_Year,Item_Store_Returns
count,4188.0,4990.0,4990.0,4990.0,4990.0
mean,12.908838,0.066916,391.803796,2004.783567,6103.520164
std,4.703256,0.053058,119.378259,8.283151,4447.333835
min,4.555,0.0,78.73,1992.0,83.23
25%,8.7675,0.027273,307.89,1994.0,2757.66
50%,12.6,0.053564,393.86,2006.0,5374.675
75%,17.1,0.095358,465.0675,2011.0,8522.24
max,21.35,0.328391,667.22,2016.0,32717.41


## ❓ Missing Value Analysis
Identify which columns contain missing values and how many are missing, so you can decide how to clean them later.

In [65]:
train.isnull().sum()

Unnamed: 0,0
Item_ID,0
Store_ID,0
Item_Store_ID,0
Item_Weight,802
Item_Sugar_Content,0
Item_Visibility,0
Item_Type,0
Item_Price,0
Store_Start_Year,0
Store_Size,1450


In [66]:
test.isnull().sum()

Unnamed: 0,0
Item_ID,0
Store_ID,0
Item_Store_ID,0
Item_Weight,661
Item_Sugar_Content,0
Item_Visibility,0
Item_Type,0
Item_Price,0
Store_Start_Year,0
Store_Size,959
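
The raw counts above are easier to compare across the two files as percentages. The following is a minimal sketch on a toy frame; `missing_report` is a hypothetical helper name and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy data standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "Item_Weight": [11.6, np.nan, 8.27, np.nan],
    "Store_Size": ["High", None, "Small", "Medium"],
    "Item_Price": [357.54, 355.79, 350.79, 355.04],
})
report = missing_report(toy)
print(report)
```

In the notebook this would be called as `missing_report(train)` and `missing_report(test)`. From the counts shown above, roughly 16% of `Item_Weight` (802 of 4990) and 29% of `Store_Size` (1450 of 4990) are missing in `train`.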


## 🧹 Data Cleaning & Preprocessing
Fill the missing `Item_Weight` and `Store_Size` values in `train` so the data is suitable for machine learning. This ensures high-quality input for the model.

In [67]:
# 📂 Load train data
train = pd.read_csv("train.csv")

# ✅ Fix missing Item_Weight
# Step 1: Group mean by Item_ID
train['Item_Weight'] = train.groupby('Item_ID')['Item_Weight'].transform(
    lambda x: x.fillna(x.mean())
)
# Step 2: Fallback to overall mean
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())

# ✅ Fix missing Store_Size
# Step 1: Fill with most frequent size per Store_Type
def fill_store_mode(series):
    mode = series.mode()
    return series.fillna(mode[0] if not mode.empty else "Medium")

train['Store_Size'] = train.groupby('Store_Type')['Store_Size'].transform(fill_store_mode)
# Step 2: Fill any remaining with global mode
train['Store_Size'] = train['Store_Size'].fillna(train['Store_Size'].mode()[0])

# ✅ Final safety check
print("✅ Remaining missing values:", train.isnull().sum().sum())  # should print 0


✅ Remaining missing values: 0


Now let's check again:

In [68]:
train.isnull().sum()

Unnamed: 0,0
Item_ID,0
Store_ID,0
Item_Store_ID,0
Item_Weight,0
Item_Sugar_Content,0
Item_Visibility,0
Item_Type,0
Item_Price,0
Store_Start_Year,0
Store_Size,0
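
The cleaning cell above touches only `train`, while the earlier counts show that `test` also has gaps in `Item_Weight` (661) and `Store_Size` (959). One way to treat both files consistently is to learn the fill values from `train` and apply them to both frames. The sketch below runs on toy data; `fit_imputer` and `apply_imputer` are hypothetical helpers, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def _mode_or_nan(s: pd.Series):
    """Most frequent non-missing value, or NaN if the group is entirely missing."""
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

def fit_imputer(train_df: pd.DataFrame) -> dict:
    """Learn fill values from the training frame only."""
    return {
        "weight_by_item": train_df.groupby("Item_ID")["Item_Weight"].mean(),
        "weight_global": train_df["Item_Weight"].mean(),
        "size_by_type": train_df.groupby("Store_Type")["Store_Size"].agg(_mode_or_nan),
        "size_global": train_df["Store_Size"].mode().iloc[0],
    }

def apply_imputer(df: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """Fill gaps with group-level values first, then global fallbacks."""
    out = df.copy()
    out["Item_Weight"] = (
        out["Item_Weight"]
        .fillna(out["Item_ID"].map(stats["weight_by_item"]))
        .fillna(stats["weight_global"])
    )
    out["Store_Size"] = (
        out["Store_Size"]
        .fillna(out["Store_Type"].map(stats["size_by_type"]))
        .fillna(stats["size_global"])
    )
    return out

# Toy frames standing in for train.csv and test.csv
train_toy = pd.DataFrame({
    "Item_ID": ["A", "A", "B", "B"],
    "Store_Type": ["Grocery Store", "Supermarket Type1", "Grocery Store", "Supermarket Type1"],
    "Item_Weight": [10.0, np.nan, 6.0, 6.0],
    "Store_Size": ["Small", "High", None, "High"],
})
test_toy = pd.DataFrame({
    "Item_ID": ["A", "C"],  # "C" never appears in train
    "Store_Type": ["Supermarket Type1", "Grocery Store"],
    "Item_Weight": [np.nan, np.nan],
    "Store_Size": [None, None],
})

stats = fit_imputer(train_toy)
train_clean = apply_imputer(train_toy, stats)
test_clean = apply_imputer(test_toy, stats)
print(test_clean)
```

Learning the statistics once from `train` keeps the two frames consistent and avoids using test-set values to fill training rows.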


Let's do some preprocessing:

In [69]:
# 🧹 Combine for feature processing
train['is_train'] = 1
test['is_train'] = 0
test['Item_Store_Returns'] = np.nan
data = pd.concat([train, test], ignore_index=True)


## 🧠 Feature Engineering
We extract meaningful features, encode categoricals, and apply transformations to improve model performance.

In [70]:
# ✅ Clean + Powerful Feature Engineering (Minimalist Version)

# 🎯 Target mean encoding for Item_Type (means computed from training rows only)
if 'Item_Store_Returns' in data.columns:
    type_target_mean = data[data['is_train'] == 1].groupby('Item_Type')['Item_Store_Returns'].mean().to_dict()
    data['Item_Type_Encoded'] = data['Item_Type'].map(type_target_mean)

# 🛍️ Store age in years (2025 is a hard-coded reference year)
data['Store_Age'] = 2025 - data['Store_Start_Year']

# 💰 Price per kg (the +1 offsets the denominator, so this is an approximate ratio)
data['Price_per_kg'] = data['Item_Price'] / (data['Item_Weight'] + 1)

# 🧱 Visibility per kg (same +1 offset in the denominator)
data['Visibility_Weight_Ratio'] = data['Item_Visibility'] / (data['Item_Weight'] + 1)

# 🔗 Price * Visibility (exposure value)
data['Price_Vis'] = data['Item_Price'] * data['Item_Visibility']

# 🔗 Price * Weight (logistics cost effect)
data['Price_Weight'] = data['Item_Price'] * data['Item_Weight']
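
The `Item_Type_Encoded` feature above uses category means computed over all training rows, so each training row's own target contributes to its feature value, which can make cross-validation scores look optimistic. A common alternative is out-of-fold target encoding. The sketch below is an illustration on synthetic data, not the notebook's method; `oof_target_encode` is a hypothetical helper and plain `KFold` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train_df, test_df, col, target, n_splits=5, seed=42):
    """Encode `col` by target mean without letting a row see its own target."""
    global_mean = train_df[target].mean()
    train_enc = pd.Series(np.nan, index=train_df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_df):
        # Means come from the other folds only
        fold_means = train_df.iloc[fit_idx].groupby(col)[target].mean()
        train_enc.iloc[enc_idx] = train_df.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    train_enc = train_enc.fillna(global_mean)
    # Test rows use means from the full training set
    full_means = train_df.groupby(col)[target].mean()
    test_enc = test_df[col].map(full_means).fillna(global_mean)
    return train_enc, test_enc

# Synthetic data standing in for the real frames
rng = np.random.default_rng(0)
train_toy = pd.DataFrame({
    "Item_Type": rng.choice(["Soft Drinks", "Dairy", "Snacks"], size=30),
    "Item_Store_Returns": rng.uniform(100, 9000, size=30),
})
test_toy = pd.DataFrame({"Item_Type": ["Dairy", "Frozen"]})  # "Frozen" is unseen

train_enc, test_enc = oof_target_encode(
    train_toy, test_toy, col="Item_Type", target="Item_Store_Returns"
)
print(train_enc.head())
print(test_enc)
```

The unseen category in the toy test frame receives the global training mean, which is the same fallback used for categories missing from a fitting fold.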


## 🧪 Model Building: Base Models
Train seven base models (XGBoost, LightGBM, CatBoost, ExtraTrees, Ridge, Lasso, and ElasticNet) with 5-fold cross-validation, collecting out-of-fold predictions on `train` and fold-averaged predictions on `test`.


In [71]:
# ✂️ Split back
train = data[data['is_train'] == 1].copy()
test = data[data['is_train'] == 0].copy()
features = [col for col in train.columns if col not in ['Item_Store_Returns', 'Item_Store_ID', 'is_train']]

# Identify categorical columns in the training data
categorical_features = train[features].select_dtypes(include='object').columns.tolist()

# Apply One-Hot Encoding to the categorical features
# We combine train and test again for consistent encoding before splitting
combined_data_for_encoding = pd.concat([train[features], test[features]], ignore_index=True)
combined_data_encoded = pd.get_dummies(combined_data_for_encoding, columns=categorical_features, dummy_na=False)

# --- Add imputation after one-hot encoding ---
# Check for NaNs after encoding and feature engineering
# print("NaNs before imputation:", combined_data_encoded.isnull().sum().sum())

# Impute remaining NaNs (likely numeric columns such as Item_Weight in the test rows) with the column mean.
# The mean is computed over train and test combined, so both sets receive the same fill values.
# Note: this uses test-set statistics, which this notebook accepts as a simplification.
combined_data_encoded.fillna(combined_data_encoded.mean(), inplace=True)

# print("NaNs after imputation:", combined_data_encoded.isnull().sum().sum())
# --- End of imputation ---


# Split back into training and testing sets after encoding and imputation
X = combined_data_encoded.iloc[:len(train)].copy()
X_test = combined_data_encoded.iloc[len(train):].copy()

# Ensure columns match after encoding (important for prediction)
# This handles cases where a category might be present in test but not train, or vice-versa
train_cols = X.columns
test_cols = X_test.columns

missing_in_test = set(train_cols) - set(test_cols)
for c in missing_in_test:
    X_test[c] = 0

missing_in_train = set(test_cols) - set(train_cols)
for c in missing_in_train:
    X[c] = 0

# Ensure the order of columns is the same
X_test = X_test[train_cols]


# 🎯 Target Transform with Capping
y_raw = train['Item_Store_Returns'].clip(0, 18000)  # cap outliers
y = np.log1p(y_raw)
# X and X_test are now encoded and imputed

# Stratified KFold based on binned target
bins = pd.qcut(y_raw, q=5, labels=False)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 💼 Define Base Models
# Note: newer XGBoost versions (>= 1.6) can handle categorical features natively via enable_categorical=True,
# provided the columns use the pandas 'category' dtype and the flag is passed to the XGBRegressor constructor.
# One-hot encoding is used here instead because it works for every model in the base_models dictionary.
base_models = {
    "xgb": XGBRegressor(n_estimators=400, learning_rate=0.05, max_depth=7, subsample=0.8, colsample_bytree=0.8, random_state=42),
    "lgb": lgb.LGBMRegressor(n_estimators=450, learning_rate=0.03, num_leaves=40, subsample=0.8, colsample_bytree=0.8, random_state=42),
    "cat": CatBoostRegressor(iterations=400, depth=6, learning_rate=0.05, verbose=0, random_state=42),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "enet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "extr": ExtraTreesRegressor(n_estimators=300, max_depth=12, random_state=42)
}

oof_preds = {name: np.zeros(X.shape[0]) for name in base_models}
test_preds = {name: np.zeros(X_test.shape[0]) for name in base_models}

print("🚀 Training models...")

for fold, (train_idx, val_idx) in enumerate(kf.split(X, bins)):
    print(f"🔁 Fold {fold + 1}")
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

    for name, model in base_models.items():
        # Fit the model on the training fold data
        model.fit(X_tr, y_tr)
        # Make predictions on the validation fold data for OOF
        oof_preds[name][val_idx] = model.predict(X_val)
        # Make predictions on the full test set for stacking
        # We average predictions over folds later
        test_preds[name] += model.predict(X_test) / kf.n_splits


print("✅ Training complete.")

🚀 Training models...
🔁 Fold 1
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000738 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1874
[LightGBM] [Info] Number of data points in the train set: 3992, number of used features: 49
[LightGBM] [Info] Start training from score 8.335701
🔁 Fold 2
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1874
[LightGBM] [Info] Number of data points in the train set: 3992, number of used features: 49
[LightGBM] [Info] Start training from score 8.336552
🔁 Fold 3
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1876
[LightGBM]
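
The training cell stratifies its folds on quantile bins of the target because `StratifiedKFold` expects discrete class labels rather than a continuous value. The sketch below reproduces just that step on a synthetic right-skewed target (an assumption standing in for `Item_Store_Returns`) and prints how many validation rows each fold draws from each bin.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
# Synthetic right-skewed target standing in for Item_Store_Returns
y_raw = pd.Series(rng.lognormal(mean=8.3, sigma=0.8, size=1000))

# Five quantile bins act as class labels for stratification
bins = pd.qcut(y_raw, q=5, labels=False)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_bin_counts = []
# The feature matrix is irrelevant for splitting, so a placeholder is enough
for fold, (train_idx, val_idx) in enumerate(kf.split(np.zeros(len(y_raw)), bins)):
    counts = np.bincount(bins.iloc[val_idx].to_numpy(), minlength=5)
    fold_bin_counts.append(counts)
    print(f"Fold {fold + 1}: validation rows per bin = {counts.tolist()}")
```

Each validation fold ends up with a near-equal share of low, middle, and high targets, which keeps the per-fold scores comparable when the target is skewed.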

## 🔗 Stacking Meta-Model

Use a simple model (RidgeCV) to combine the outputs of all base models. This helps reduce individual model weaknesses.

In [72]:
# 📊 Stack predictions using RidgeCV
oof_stack = np.column_stack([oof_preds[m] for m in base_models])
test_stack = np.column_stack([test_preds[m] for m in base_models])
meta_model = RidgeCV()
meta_model.fit(oof_stack, y)
final_oof = meta_model.predict(oof_stack)
final_test = meta_model.predict(test_stack)

# 🎯 Inverse transform
y_true = np.expm1(y)
y_pred = np.expm1(final_oof)
y_test = np.expm1(final_test)
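
To see how a `RidgeCV` meta-model weighs its inputs, the fitted `coef_` and `alpha_` attributes can be inspected. The sketch below uses synthetic out-of-fold predictions (the target plus increasing amounts of noise) rather than the notebook's real base models, so the model names are placeholders.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n = 500
y = rng.normal(loc=8.3, scale=1.0, size=n)  # stand-in for the log1p target

# Hypothetical out-of-fold predictions: the target plus increasing noise
oof = {
    "model_a": y + rng.normal(scale=0.3, size=n),
    "model_b": y + rng.normal(scale=0.6, size=n),
    "model_c": y + rng.normal(scale=1.2, size=n),
}
oof_stack = np.column_stack(list(oof.values()))

meta = RidgeCV()  # default candidate alphas are (0.1, 1.0, 10.0)
meta.fit(oof_stack, y)

weights = dict(zip(oof, meta.coef_))
for name, w in weights.items():
    print(f"{name}: weight = {w:.3f}")
print("chosen alpha:", meta.alpha_)
```

The least noisy input receives the largest weight. In the notebook, the equivalent check would be `dict(zip(base_models, meta_model.coef_))`.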

## 📈 Model Evaluation
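Evaluate the stacked out-of-fold predictions using R² and RMSE on the original scale. Because `y_true` is recovered from the capped target, both scores are measured against returns capped at 18,000.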

In [73]:
# 📈 Final Performance
print("\n✅ Pro-Level Stacked R²:", round(r2_score(y_true, y_pred), 4))
# RMSE is the square root of the mean squared error
print("✅ Pro-Level Stacked RMSE:", round(np.sqrt(mean_squared_error(y_true, y_pred)), 2))


✅ Pro-Level Stacked R²: 0.5491
✅ Pro-Level Stacked RMSE: 2875.19
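
The two metrics can be written out by hand to make their definitions explicit, and the same predictions can be scored against both the capped and the uncapped target to see how much the cap changes the numbers. Everything below is synthetic: the target distribution and the noise model for the predictions are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
y_raw = rng.lognormal(mean=8.3, sigma=0.8, size=2000)  # synthetic returns
y_capped = np.clip(y_raw, 0, 18000)                    # same cap as the notebook
# Hypothetical predictions: the capped target with multiplicative noise
y_pred = y_capped * rng.lognormal(mean=0.0, sigma=0.3, size=2000)

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

for label, ref in [("capped", y_capped), ("raw", y_raw)]:
    print(f"vs {label} target: R2 = {r2(ref, y_pred):.4f}, RMSE = {rmse(ref, y_pred):.2f}")
```

Scoring against the raw target is the closer proxy for a leaderboard that uses uncapped values, so it is worth reporting both when the target has been clipped.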


## 💾 Submission Generation
Save and download final predictions in submission format.

In [74]:
# 📁 Save submission
submission = sample.copy()
submission['Item_Store_Returns'] = y_test
submission.to_csv("pro_ensemble_submission.csv", index=False)
files.download("pro_ensemble_submission.csv")
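
The cell above assigns predictions to the sample file by row position, which is correct only if `test.csv` and `SampleSubmission.csv` list rows in the same order. A more defensive option is to join on `Item_Store_ID`. The sketch below uses toy data, and `build_submission` is a hypothetical helper rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

def build_submission(sample_df, test_ids, preds,
                     id_col="Item_Store_ID", target_col="Item_Store_Returns"):
    """Attach predictions to the sample file by ID rather than by row position."""
    pred_df = pd.DataFrame({id_col: np.asarray(test_ids), target_col: np.asarray(preds)})
    if pred_df[id_col].duplicated().any():
        raise ValueError("duplicate IDs in predictions")
    # A left merge keeps the sample file's row order
    out = sample_df[[id_col]].merge(pred_df, on=id_col, how="left", validate="one_to_one")
    if out[target_col].isnull().any():
        raise ValueError("some sample IDs have no prediction")
    return out

# Toy sample file and deliberately shuffled test IDs
sample_toy = pd.DataFrame({
    "Item_Store_ID": ["DRA59_BABATUNJI010", "DRA59_BABATUNJI013", "DRB01_BABATUNJI013"],
    "Item_Store_Returns": [100, 100, 100],
})
test_ids = ["DRB01_BABATUNJI013", "DRA59_BABATUNJI010", "DRA59_BABATUNJI013"]
preds = [3000.0, 1000.0, 2000.0]

submission_toy = build_submission(sample_toy, test_ids, preds)
print(submission_toy)
```

In the notebook, the call would be `build_submission(sample, test['Item_Store_ID'], y_test)`, followed by the same `to_csv` step.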
