# Brief Description:

In this notebook, I predict bus seat demand for January-February 2025, a problem set by Analytics Vidhya. After reading the problem statement, I tried several approaches. Classical machine learning models (random forest, decision tree, SVM, etc.) gave very poor results. Boosting algorithms (LightGBM, CatBoost, XGBoost) were still poor, even after hyperparameter tuning with Optuna. These experiments made it clear that this real-life data has many challenges, such as outliers and variables with very different statistical distributions. I also tried time-series methods, but every algorithm I tried on its own gave either a bad validation RMSE or heavy overfitting.

I then tried stacking, combining a couple of algorithms, which improved the validation RMSE without overfitting. That gave me my best baseline algorithm (the one described in this solution). On top of it I tried different feature engineering and added many derived features to improve the RMSE further.

Here is a brief description of my algorithm.
This solution is a blended ensemble for bus seat demand forecasting: LightGBM and XGBoost are the base models, and Ridge regression is the meta-learner. It starts by loading and preprocessing the train, test, and transactions datasets, keeping only the `dbd == 15` transaction records so that the features reflect booking behavior 15 days before the journey. The merged data then goes through feature engineering: temporal features (day of week, month, weekend flag), demand metrics (search-per-seat ratio), route-level aggregations of the target (mean, median, frequency), and interaction features such as `search_x_weekend` and `route_dow_final_seatcount_skew`.

Categorical variables are label-encoded, and several derived features like `is_same_region`, `route_tier_combo`, and `route_freq_rank` are created to capture region/tier dynamics and route popularity. After sorting chronologically, the last 15% of the data is held out for validation to preserve time-based integrity.

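The chronological holdout described above can be sketched on a toy frame. The data below is made up for illustration; only the column names `doj` and `final_seatcount` and the 85/15 ratio come from the notebook. Note that a row-based cut like this one can place rows with the same `doj` on both sides of the boundary when several routes share a date.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the merged training frame (values are hypothetical).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "doj": pd.date_range("2024-01-01", periods=20, freq="D"),
    "final_seatcount": rng.integers(0, 100, size=20),
}).sample(frac=1.0, random_state=0)  # shuffle to mimic unsorted input

# Sort by journey date, then hold out the most recent 15% of rows.
toy = toy.sort_values("doj").reset_index(drop=True)
split_idx = int(len(toy) * 0.85)
train_part = toy.iloc[:split_idx]
val_part = toy.iloc[split_idx:]

# Every validation date is later than every training date.
print(train_part["doj"].max(), "<", val_part["doj"].min())
```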
LightGBM and XGBoost are trained separately with tuned hyperparameters and early stopping on the validation set. Their predictions on the validation and test sets are combined into a second-level dataset, which is fed into a Ridge regression meta-model. The meta-model is trained on the validation predictions and then used to generate the final test predictions. The blended validation RMSE is printed to evaluate ensemble performance; because it is computed on the same validation rows the meta-model was fitted on, it is likely an optimistic estimate. This approach captures diverse model behaviors and route-specific patterns to enhance prediction accuracy.

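A minimal sketch of the stacking step, under stated assumptions: the data is synthetic, and scikit-learn's `GradientBoostingRegressor` and `RandomForestRegressor` stand in for LightGBM and XGBoost so the example has no extra dependencies. Unlike the notebook, it splits the holdout into a part used to fit the Ridge meta-model and a part used only to score it, which is one possible way to get a less optimistic blended RMSE.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data; rows are treated as already sorted by time.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=600)

# Three consecutive blocks: base-model training, meta-model fitting, meta-model scoring.
X_tr, y_tr = X[:400], y[:400]
X_fit, y_fit = X[400:500], y[400:500]
X_eval, y_eval = X[500:], y[500:]

# Stand-in base models (assumption: not the models used in the notebook).
base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]
for model in base_models:
    model.fit(X_tr, y_tr)

def stack(X_part):
    """Return one column of predictions per base model."""
    return np.column_stack([model.predict(X_part) for model in base_models])

# Fit the meta-model on one holdout block, score it on the other.
meta = Ridge(alpha=1.0).fit(stack(X_fit), y_fit)
rmse_in_sample = np.sqrt(mean_squared_error(y_fit, meta.predict(stack(X_fit))))
rmse_held_out = np.sqrt(mean_squared_error(y_eval, meta.predict(stack(X_eval))))
print("RMSE on rows used to fit the meta-model:", rmse_in_sample)
print("RMSE on rows the meta-model never saw:  ", rmse_held_out)
```

The second number is the one comparable to a leaderboard score, since no model in the stack was fitted on those rows.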
### Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

### Data Preprocessing

In [None]:
# Data Loading
train = pd.read_csv("/kaggle/input/redbus-dataset/train-zip/train/train.csv", parse_dates=["doj"])
transactions = pd.read_csv("/kaggle/input/redbus-dataset/train-zip/train/transactions.csv", parse_dates=["doj", "doi"])
test = pd.read_csv("/kaggle/input/redbus-dataset/test.csv", parse_dates=["doj"])

# Derived columns on the transactions table. add_features() below recomputes
# the same four columns on the merged frames.
transactions["doj"] = pd.to_datetime(transactions["doj"])
transactions["doj_month"] = transactions["doj"].dt.month
transactions["doj_dayofweek"] = transactions["doj"].dt.dayofweek
transactions["route_id"] = transactions["srcid"] * 10000 + transactions["destid"]
transactions["search_per_seat"] = transactions["cumsum_searchcount"] / (transactions["cumsum_seatcount"] + 1)

# Merge Data
db15 = transactions[transactions["dbd"] == 15]
train_merged = train.merge(db15, on=["doj", "srcid", "destid"], how="left")
test_merged = test.merge(db15, on=["doj", "srcid", "destid"], how="left")

# Feature Engineering
def add_features(df):
    df["doj_dayofweek"] = df["doj"].dt.dayofweek
    df["doj_month"] = df["doj"].dt.month
    df["doj_day"] = df["doj"].dt.day
    df["is_weekend"] = df["doj_dayofweek"].isin([5, 6]).astype(int)
    df["search_per_seat"] = df["cumsum_searchcount"] / (df["cumsum_seatcount"] + 1)
    df["route_id"] = df["srcid"] * 10000 + df["destid"]
    return df

train_merged = add_features(train_merged)
test_merged = add_features(test_merged)

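# Route-level target statistics. Note: these are computed on the full training
# frame, including the rows later held out for validation, so the validation
# score can look better than it would on truly unseen data.
route_stats = train_merged.groupby("route_id")["final_seatcount"].agg(["mean", "count", "median"]).reset_index()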
route_stats.columns = ["route_id", "route_avg_seatcount", "route_freq", "route_median"]
train_merged = train_merged.merge(route_stats, on="route_id", how="left")
test_merged = test_merged.merge(route_stats, on="route_id", how="left")

train_merged["search_x_weekend"] = train_merged["search_per_seat"] * train_merged["is_weekend"]
test_merged["search_x_weekend"] = test_merged["search_per_seat"] * test_merged["is_weekend"]

train_merged["is_month_end"] = train_merged["doj"].dt.is_month_end.astype(int)
test_merged["is_month_end"] = test_merged["doj"].dt.is_month_end.astype(int)
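# dayofweek is Monday=0 ... Sunday=6, so this counts the days until Sunday
# (Saturday -> 1, Sunday -> 0).
train_merged["days_to_weekend"] = 6 - train_merged["doj_dayofweek"]
test_merged["days_to_weekend"] = 6 - test_merged["doj_dayofweek"]

# Same formula as search_per_seat, so this column duplicates that feature.
train_merged["search_to_seat_ratio"] = train_merged["cumsum_searchcount"] / (train_merged["cumsum_seatcount"] + 1)
test_merged["search_to_seat_ratio"] = test_merged["cumsum_searchcount"] / (test_merged["cumsum_seatcount"] + 1)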

train_merged["is_same_region"] = (train_merged["srcid_region"] == train_merged["destid_region"]).astype(int)
test_merged["is_same_region"] = (test_merged["srcid_region"] == test_merged["destid_region"]).astype(int)

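cat_cols = ["srcid_region", "destid_region", "srcid_tier", "destid_tier"]
for col in cat_cols:
    # Fit on train and test values together so that a category appearing only
    # in test does not make transform() raise a ValueError.
    le = LabelEncoder()
    le.fit(pd.concat([train_merged[col], test_merged[col]]).astype(str))
    train_merged[col] = le.transform(train_merged[col].astype(str))
    test_merged[col] = le.transform(test_merged[col].astype(str))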

train_merged["route_tier_combo"] = train_merged["srcid_tier"] * 10 + train_merged["destid_tier"]
test_merged["route_tier_combo"] = test_merged["srcid_tier"] * 10 + test_merged["destid_tier"]

route_freq_rank = train_merged.groupby("route_id")["final_seatcount"].count().rank(method="min", ascending=False).to_dict()
train_merged["route_freq_rank"] = train_merged["route_id"].map(route_freq_rank)
test_merged["route_freq_rank"] = test_merged["route_id"].map(route_freq_rank)

# Skewness of the target per route and day of week
skew_feature = (
    train_merged.groupby(["route_id", "doj_dayofweek"])["final_seatcount"]
    .skew()
    .reset_index()
    .rename(columns={"final_seatcount": "route_dow_final_seatcount_skew"})
)
train_merged = train_merged.merge(skew_feature, on=["route_id", "doj_dayofweek"], how="left")
test_merged = test_merged.merge(skew_feature, on=["route_id", "doj_dayofweek"], how="left")

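The route-level aggregates above are statistics of the target, so where they are computed matters for validation. As a hedged sketch (toy data, a hypothetical split, and a fallback rule that is my assumption rather than the notebook's), the statistics can be computed on the training portion only and then mapped onto held-out rows, with routes unseen in training falling back to global values.

```python
import pandas as pd

# Toy data; the first four rows play the role of the training portion.
toy = pd.DataFrame({
    "route_id": [1, 1, 1, 2, 2, 3],
    "final_seatcount": [10, 20, 30, 100, 200, 50],
})
train_part = toy.iloc[:4].copy()
val_part = toy.iloc[4:].copy()

# Aggregate the target on the training portion only.
route_stats = (
    train_part.groupby("route_id")["final_seatcount"]
    .agg(route_avg_seatcount="mean", route_freq="count", route_median="median")
    .reset_index()
)
val_part = val_part.merge(route_stats, on="route_id", how="left")

# Routes never seen in training (route 3 here) get global fallbacks (assumed rule).
val_part["route_avg_seatcount"] = val_part["route_avg_seatcount"].fillna(
    train_part["final_seatcount"].mean()
)
val_part["route_median"] = val_part["route_median"].fillna(
    train_part["final_seatcount"].median()
)
val_part["route_freq"] = val_part["route_freq"].fillna(0).astype(int)
print(val_part)
```

The same pattern would apply to `route_freq_rank` and `route_dow_final_seatcount_skew`, since both are also derived from `final_seatcount`.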

### Model Training and Prediction of Test Data

In [None]:
train_merged = train_merged.sort_values("doj")
split_idx = int(len(train_merged) * 0.85)

features = [
    "doj_dayofweek", "doj_month", "doj_day", "is_weekend", "cumsum_seatcount", "cumsum_searchcount",
    "search_per_seat", "srcid_region", "destid_region", "srcid_tier", "destid_tier", "route_id",
    "route_avg_seatcount", "route_freq", "search_x_weekend", "route_median", "is_month_end",
    "days_to_weekend", "search_to_seat_ratio", "route_tier_combo", "is_same_region",
    "route_freq_rank", "route_dow_final_seatcount_skew"
]

X_train = train_merged.iloc[:split_idx][features]
y_train = train_merged.iloc[:split_idx]["final_seatcount"]
X_val = train_merged.iloc[split_idx:][features]
y_val = train_merged.iloc[split_idx:]["final_seatcount"]
X_test = test_merged[features]

# Setting the parameters for base models
params_lgb = {
    "objective": "regression", "metric": "rmse", "verbosity": -1,
    "boosting_type": "gbdt", "learning_rate": 0.03,
    "num_leaves": 50, "feature_fraction": 0.9,
    "bagging_fraction": 0.8, "bagging_freq": 5, "seed": 42
}
params_xgb = {
    "objective": "reg:squarederror", "eval_metric": "rmse",
    "learning_rate": 0.03, "max_depth": 7,
    "subsample": 0.8, "colsample_bytree": 0.9, "seed": 42
}

# Base Model 1
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val)
model_lgb = lgb.train(params_lgb, lgb_train, num_boost_round=2000, valid_sets=[lgb_val], callbacks=[lgb.early_stopping(50)])
preds_lgb_val = model_lgb.predict(X_val)
preds_lgb_test = model_lgb.predict(X_test)

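# Base Model 2
# early_stopping_rounds is passed to the constructor; recent XGBoost releases
# no longer accept it as a fit() argument.
model_xgb = xgb.XGBRegressor(**params_xgb, n_estimators=2000, early_stopping_rounds=50)
model_xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)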
preds_xgb_val = model_xgb.predict(X_val)
preds_xgb_test = model_xgb.predict(X_test)

# Meta-Model
stk_train = pd.DataFrame({"lgb": preds_lgb_val, "xgb": preds_xgb_val})
stk_test = pd.DataFrame({"lgb": preds_lgb_test, "xgb": preds_xgb_test})
meta = Ridge(alpha=1.0)

# Final Prediction on test data
meta.fit(stk_train, y_val)
meta_val_preds = meta.predict(stk_train)  # in-sample: the meta-model was fitted on these same validation rows
meta_test_preds = meta.predict(stk_test)

# RMSE as the square root of MSE; the squared=False argument has been removed
# from recent scikit-learn releases.
val_rmse = np.sqrt(mean_squared_error(y_val, meta_val_preds))
print("Validation RMSE (Blended):", val_rmse)


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[356]	valid_0's rmse: 635.15
Validation RMSE (Blended): 565.4098054288642


### Submission File

In [None]:
submission = test[["route_key"]].copy()
submission["final_seatcount"] = np.round(np.clip(meta_test_preds, 0, None)).astype(int)
submission.to_csv("submission_skew.csv", index=False)

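To make the post-processing in the submission step concrete, here is a small sketch on made-up predictions. It shows what clipping and rounding do to negative and fractional values; note that `np.round` rounds halves to the nearest even integer, so 12.5 becomes 12.

```python
import numpy as np

# Hypothetical raw model outputs, including a negative and a half value.
raw_preds = np.array([-3.2, 0.4, 12.5, 99.6])

# NaN or infinite values would not survive the integer cast, so check first.
assert np.isfinite(raw_preds).all()

# Clip negatives to zero, round to whole seats, cast to int.
seats = np.round(np.clip(raw_preds, 0, None)).astype(int)
print(seats.tolist())  # [0, 0, 12, 100]
```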