# Architecture: Shinkansen Passenger Satisfaction
**Final Score: 0.9589 | Rank: 1**

This notebook details the core training and inference pipeline and the architectural decisions behind the Rank 1 submission. The approach deliberately avoids manual One-Hot Encoding and complex ensemble topologies, relying instead on a depth-8, L2-regularized CatBoost Classifier that processes categorical features natively through Ordered Target Statistics.

In [None]:
import pandas as pd
import numpy as np
import os
import zipfile
import logging
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import StratifiedKFold

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

ZIP_PATH = os.path.join('data', 'Olympus', 'archive.zip')

## 1. Data Ingestion & Behavioral Imputation

In subjective survey datasets, non-response is often a distinct behavioral signal rather than random data loss. Standard mean, median, or mode imputation erases this signal by folding non-responders into an existing answer category.

Missing values in categorical features are mapped to an explicit `Missing_Data` class. This allows the tree splits to isolate and map passenger cohorts that decline to answer specific survey dimensions.
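
As a minimal sketch of the difference, the toy example below contrasts mode imputation with an explicit missing level. The column name `Seat_Comfort` and its values are hypothetical and chosen only for illustration; they are not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey column; the values are illustrative only.
survey = pd.DataFrame({
    "Seat_Comfort": ["Good", "Poor", np.nan, "Good", np.nan, "Excellent"],
})

# Mode imputation: non-responders become indistinguishable from the modal answer.
mode_value = survey["Seat_Comfort"].mode().iloc[0]
mode_imputed = survey["Seat_Comfort"].fillna(mode_value)

# Explicit category: non-responders keep a level of their own that a tree can split on.
explicit = survey["Seat_Comfort"].fillna("Missing_Data").astype(str)

print(mode_imputed.value_counts().to_dict())
print(explicit.value_counts().to_dict())
```

After mode imputation the two non-responders are counted as "Good"; with the explicit level they remain a separate group of two rows.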

In [None]:
def load_and_preprocess(zip_path: str) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series, list[int]]:
    """
    Extracts and preprocesses datasets.
    Categorical NaNs cast to 'Missing_Data' to preserve behavioral non-response signals.
    Numeric NaNs leverage CatBoost's native missing value handling.
    """
    if not os.path.exists(zip_path):
        raise FileNotFoundError(f"Dataset not found at: {zip_path}")
        
    files = {
        'train_travel': 'Traveldata_train_(1).csv',
        'train_survey': 'Surveydata_train_(1).csv',
        'test_travel': 'Traveldata_test_(1).csv',
        'test_survey': 'Surveydata_test_(1).csv'
    }
    
    with zipfile.ZipFile(zip_path, 'r') as z:
        data = {k: pd.read_csv(z.open(v)) for k, v in files.items()}
        
    train = pd.merge(data['train_travel'], data['train_survey'], on='ID')
    test = pd.merge(data['test_travel'], data['test_survey'], on='ID')
    
    target = train['Overall_Experience']
    test_ids = test['ID']
    
    train.drop(['ID', 'Overall_Experience'], axis=1, inplace=True)
    test.drop(['ID'], axis=1, inplace=True)
    
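    # Concatenate train and test so feature engineering and NaN handling are applied identically.
    df = pd.concat([train, test], axis=0).reset_index(drop=True)
    
    # Total_Delay treats a missing delay as zero. Delay_Ratio stays NaN when either delay
    # is missing and is left to CatBoost's native missing-value handling; the +1 in the
    # denominator avoids division by zero for on-time departures.
    df['Total_Delay'] = df['Departure_Delay_in_Mins'].fillna(0) + df['Arrival_Delay_in_Mins'].fillna(0)
    df['Delay_Ratio'] = df['Arrival_Delay_in_Mins'] / (df['Departure_Delay_in_Mins'] + 1)
    
    # Give every categorical (object-dtype) column an explicit 'Missing_Data' level.
    cat_cols = []
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna("Missing_Data").astype(str)
            cat_cols.append(col)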
            
    X = df.iloc[:len(train)].copy()
    X_test = df.iloc[len(train):].copy()
    
    cat_feature_indices = [X.columns.get_loc(c) for c in cat_cols]
    
    return X, target, X_test, test_ids, cat_feature_indices

X, y, X_test, test_ids, cat_indices = load_and_preprocess(ZIP_PATH)

## 2. Model Topology & Execution

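The inference engine is a single CatBoost Classifier configuration. It is trained once per fold of a stratified 5-fold split, and the predicted test probabilities of the five fold models are averaged.

A depth of 8 (CatBoost's default is 6) is used to capture multi-layered feature interactions within the survey responses. To counter the variance that deeper trees introduce on a constrained dataset, L2 leaf regularization is applied (`l2_leaf_reg` = 1.96, somewhat below CatBoost's default of 3.0) together with early stopping on each validation fold. Categorical features are passed natively and encoded with Ordered Target Statistics, avoiding the column explosion and sparse matrices associated with One-Hot Encoding.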
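
To make the idea behind Ordered Target Statistics concrete, here is a simplified sketch. It is not CatBoost's internal implementation, which uses several permutations and additional details not reproduced here. The function name `ordered_target_statistic` and the `prior` and `seed` parameters are assumptions introduced for illustration. Each row is encoded with a smoothed target mean computed only from rows of the same category that precede it in a random permutation, so a row's own label never leaks into its encoding.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    """Encode each row using only the rows that precede it in a random permutation."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    order = np.random.default_rng(seed).permutation(len(categories))

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = np.empty(len(categories), dtype=float)

    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean over earlier rows only; the row's own label is excluded.
        encoded[idx] = (seen_sum + prior) / (seen_count + 1)
        # Only now does the current row become visible to later rows.
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded

# Toy usage with illustrative values.
print(ordered_target_statistic(["a", "a", "b", "a", "b"], [1, 0, 1, 1, 0]))
```

The first row of each category in the permutation receives the prior alone, which is why the encoding stays defined for categories with very few observations.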

In [None]:
PARAMS = {
    'iterations': 3500,
    'depth': 8,
    'learning_rate': 0.05,
    'l2_leaf_reg': 1.96,
    'border_count': 152,
    'loss_function': 'Logloss',
    'verbose': 500,
    'random_seed': 42
}

logger.info("Initializing Stratified 5-Fold Cross Validation")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_preds = np.zeros(len(X))
test_preds = np.zeros(len(X_test))

test_pool = Pool(X_test, cat_features=cat_indices)

os.makedirs('models', exist_ok=True)

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    logger.info(f"Training Fold {fold + 1}")
    
    X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    train_pool = Pool(X_tr, label=y_tr, cat_features=cat_indices)
    val_pool = Pool(X_val, label=y_val, cat_features=cat_indices)
    
    model = CatBoostClassifier(**PARAMS)
    model.fit(
        train_pool, 
        eval_set=val_pool, 
        early_stopping_rounds=150, 
        use_best_model=True
    )
    
    model_path = os.path.join('models', f'catboost_fold_{fold}.cbm')
    model.save_model(model_path)
    
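    # Store out-of-fold probabilities for this fold's validation rows.
    oof_preds[val_idx] = model.predict_proba(val_pool)[:, 1]
    # Accumulate the fold-averaged test probability.
    test_preds += model.predict_proba(test_pool)[:, 1] / skf.n_splits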

final_predictions = (test_preds > 0.5).astype(int)
sub = pd.DataFrame({'ID': test_ids, 'Overall_Experience': final_predictions})
sub.to_csv('Submission_Native_Cat_CV.csv', index=False)
logger.info("Inference complete. Artifacts serialized.")
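
The loop above fills `oof_preds` but never scores it. A minimal sketch of an out-of-fold accuracy check follows; the helper name `oof_accuracy` is an assumption, and it presumes the labels are encoded as 0/1, as the thresholded submission suggests.

```python
import numpy as np

def oof_accuracy(y_true, oof_probs, threshold=0.5):
    """Fraction of rows whose thresholded out-of-fold probability matches the label."""
    y_true = np.asarray(y_true)
    predicted = (np.asarray(oof_probs) > threshold).astype(int)
    return float((predicted == y_true).mean())

# Toy values for illustration; in this notebook the call would be oof_accuracy(y, oof_preds).
print(oof_accuracy([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7]))  # 0.75
```

Because each validation fold also drives early stopping, this estimate is likely to be slightly optimistic relative to truly unseen data.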

## 3. Iteration Tracking & Performance

The following visualization plots the accuracy recorded for each of the 20 submissions made during the exploration phase.

The severe degradations (e.g., Submission 13 at 0.909) correspond to deprecated feature transformations and unstable multi-model ensembles. This variance motivated the final, constrained CatBoost topology, which reaches the peak evaluation metric (0.9589) at Submission 17.

In [None]:
import matplotlib.pyplot as plt

submissions = list(range(1, 21))
# Historical accuracy metrics reflecting the exploration phase
accuracy = [
    0.955, 0.950, 0.957, 0.955, 0.958, 
    0.951, 0.951, 0.952, 0.955, 0.943, 
    0.935, 0.958, 0.909, 0.953, 0.957, 
    0.956, 0.9589, 0.934, 0.957, 0.957
]

plt.figure(figsize=(12, 5))
plt.plot(submissions, accuracy, marker='o', linestyle='-', color='#3b82f6', markersize=5)

plt.title('Accuracy vs Submission Trends', pad=15)
plt.xlabel('Submissions', labelpad=10)
plt.ylabel('Accuracy', labelpad=10)
plt.xticks(submissions)
plt.grid(True, linestyle='-', alpha=0.2)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
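
The extremes of the series can also be read off programmatically. The sketch below copies the `accuracy` values from the cell above so that it runs on its own; the variable names `best_submission`, `worst_submission`, and `spread` are illustrative.

```python
# Same values as the plotting cell above, repeated so this snippet is self-contained.
accuracy = [
    0.955, 0.950, 0.957, 0.955, 0.958,
    0.951, 0.951, 0.952, 0.955, 0.943,
    0.935, 0.958, 0.909, 0.953, 0.957,
    0.956, 0.9589, 0.934, 0.957, 0.957,
]

# Submissions are numbered from 1, so shift the zero-based index.
best_submission = max(range(len(accuracy)), key=accuracy.__getitem__) + 1
worst_submission = min(range(len(accuracy)), key=accuracy.__getitem__) + 1
spread = round(max(accuracy) - min(accuracy), 4)

print(best_submission, worst_submission, spread)  # 17 13 0.0499
```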