**1. Setup & Data Load**

- Load train and test CSVs (with all engineered features)

**2. Preprocessing for Modeling**

- Encode categorical features (LabelEncoder or OneHotEncoder)

- Separate X and y

- Train/Val split (Stratified)

**3. Baseline Modeling**

- LightGBM (fast, accurate, and handles nulls + cats well)

- Basic training with cross-validation

- Output log loss and accuracy

- Plot confusion matrix

**4. Feature Importance**

- LightGBM’s built-in importances

- Optional: SHAP (if you want deeper analysis)

**5. Submission**

- Predict on test

- Export predictions to CSV for Kaggle submission

### **Load Data & Required Packages**

Start by importing the core libraries needed for modeling:

- `pandas` / `numpy` for data manipulation
- `matplotlib` / `seaborn` for plotting
- `sklearn` for cross-validation and metrics
- `lightgbm` for efficient tree-based modeling

I'll also load the cleaned and feature-engineered training and test sets from disk and confirm their shapes.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from lightgbm import LGBMClassifier

# Load the feature-engineered datasets
train = pd.read_csv('../data/train_clean.csv')
test = pd.read_csv('../data/test_clean.csv')

# Check initial shape
print("Train shape:", train.shape)
print("Test shape:", test.shape)


Train shape: (8693, 34)
Test shape: (4277, 33)


**Separate Features and Target**

We split the cleaned training dataset into:
- `X`: features (everything except the target)
- `y`: target label (`Transported`, converted to integer)

The test dataset (`X_test`) is kept as-is (it has no label).

In [2]:
# Separate features and labels
X = train.drop(columns=['Transported'])
y = train['Transported'].astype(int)  # Convert boolean to int for modeling
X_test = test.copy()

**Set Up Cross-Validation**

Using **StratifiedKFold** with 5 splits to:
- Maintain class distribution in each fold
- Get a reliable measure of model performance across different subsets
- Shuffle the data before splitting (random_state ensures reproducibility)


In [3]:
# Set up stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
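
To make the effect of stratification concrete, here is a minimal sketch on synthetic labels. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the competition data. With a 30/70 class split, every validation fold should keep the same 30% positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical, imbalanced toy labels: 30 positives and 70 negatives
rng = np.random.default_rng(0)
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = rng.normal(size=(len(y_demo), 3))

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Record the share of positives in each validation fold
fold_rates = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    fold_rates.append(float(y_demo[val_idx].mean()))

print(fold_rates)
```

A plain `KFold` on the same data would let the positive rate drift from fold to fold, which makes per-fold metrics harder to compare.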

**Train Baseline LightGBM Model**

Training a default **LightGBM classifier** using stratified 5-fold cross-validation.

- Log loss is computed for each fold, then once more over all out-of-fold predictions to give an overall score.
- Out-of-fold (OOF) predictions help us evaluate performance without leaking validation data.
- Test set predictions are averaged across all folds to prepare for submission.


In [4]:
# Drop non-numeric columns for baseline
non_numeric_cols = X.select_dtypes(include='object').columns.tolist()
X = X.drop(columns=non_numeric_cols)
test_X = test.drop(columns=non_numeric_cols)

# Ensure alignment with test features
test_X = test_X[X.columns]

# Containers for out-of-fold predictions and fold-averaged test predictions
oof_preds = np.zeros(len(train))
test_preds = np.zeros(len(test))

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    model = LGBMClassifier(random_state=42)
    model.fit(X_train, y_train)

    oof_preds[val_idx] = model.predict_proba(X_val)[:, 1]
    test_preds += model.predict_proba(test_X)[:, 1] / cv.n_splits

    fold_loss = log_loss(y_val, oof_preds[val_idx])
    print(f'Fold {fold+1} Log Loss: {fold_loss:.5f}')

total_loss = log_loss(y, oof_preds)
print(f'\nOverall CV Log Loss: {total_loss:.5f}')


[LightGBM] [Info] Number of positive: 3502, number of negative: 3452
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000567 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1906
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503595 -> initscore=0.014380
[LightGBM] [Info] Start training from score 0.014380
Fold 1 Log Loss: 0.44683

| Fold    | Log Loss    |
| ------- | ----------- |
| 1       | 0.44683     |
| 2       | 0.44748     |
| 3       | 0.43954     |
| 4       | 0.43405     |
| 5       | 0.45521     |
| **Avg** | **0.44462** |
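
To put the fold scores in context, the sketch below compares them with the log loss of an uninformative model that always predicts 0.5. The labels in `y_demo` are a hypothetical balanced toy set; the fold values are copied from the table above.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical balanced labels with a constant 0.5 prediction for each
y_demo = np.array([0, 1] * 50)
p_const = np.full(len(y_demo), 0.5)

# A constant 0.5 prediction scores ln(2) on any binary labels
baseline = log_loss(y_demo, p_const)

# Per-fold log losses taken from the results table
fold_losses = [0.44683, 0.44748, 0.43954, 0.43405, 0.45521]
mean_fold = float(np.mean(fold_losses))

print(f"Baseline (always 0.5): {baseline:.5f}")
print(f"Mean fold log loss:    {mean_fold:.5f}")
```

The gap between the two numbers is a rough measure of how much information the model adds over guessing.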

**Interpretation**
- A log loss of ~0.44 is well below the ~0.693 that a constant 0.5 prediction would score on these nearly balanced classes, so the predicted probabilities carry real signal. Log loss alone does not show that they are well calibrated; that would need a separate check.

- LightGBM already picks up a useful signal from the numeric features alone.

- There's room for improvement, especially from:

    - Categorical encoding

    - Feature interactions (e.g., combining AgeGroup + VIP)

    - Hyperparameter tuning
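
As one possible route for the categorical encoding step, the sketch below converts a text column to the pandas `category` dtype instead of dropping it. LightGBM's scikit-learn interface can consume `category` columns directly. The frames `train_demo` and `test_demo` and their values are hypothetical; only the `AgeGroup` name is borrowed from the text above.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train and test sets
train_demo = pd.DataFrame({
    "AgeGroup": ["Child", "Adult", "Senior", "Adult"],
    "Spend": [0.0, 120.5, 33.0, 410.0],
})
test_demo = pd.DataFrame({
    "AgeGroup": ["Adult", "Teen", "Child"],  # "Teen" never appears in train
    "Spend": [15.0, 0.0, 7.5],
})

cat_cols = ["AgeGroup"]  # assumed list of categorical columns

for col in cat_cols:
    # Fix the category set from train so both frames share the same codes
    categories = train_demo[col].astype("category").cat.categories
    train_demo[col] = pd.Categorical(train_demo[col], categories=categories)
    test_demo[col] = pd.Categorical(test_demo[col], categories=categories)

print(train_demo.dtypes)
print(test_demo["AgeGroup"])
```

Values that only occur in the test set become missing under this scheme, which LightGBM treats like any other null.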



In [5]:
# Threshold the out-of-fold probabilities at 0.5 to get class labels
# (accuracy_score was imported in the first cell)
accuracy = accuracy_score(y, (oof_preds > 0.5).astype(int))
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.7952
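
The outline also calls for a confusion matrix. A minimal sketch of how one could be built from out-of-fold probabilities follows. The arrays `y_demo` and `oof_demo` are small hypothetical stand-ins for `y` and `oof_preds`, so the counts are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and out-of-fold probabilities
y_demo = np.array([0, 0, 1, 1, 1, 0])
oof_demo = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])

# Same 0.5 threshold as the accuracy calculation
labels_demo = (oof_demo > 0.5).astype(int)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_demo, labels_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

On the real data, passing the matrix to `sns.heatmap(cm, annot=True, fmt="d")` would give the plot mentioned in the outline.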
