# Yandex Hackathon Indonesia - Final Submission Notebook

**Participant:** mikaelradityas@gmail.com  
**Model:** SGDRegressor with Huber Loss  
**Description:** This notebook contains a complete pipeline for the competition task, including preprocessing, feature engineering, model training, and submission generation.

### 1. Import Libraries
Import the Python libraries needed for data handling, preprocessing, modeling, and exporting the submission.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

### 2. Load Dataset
Load the training and test data provided in the competition. No additional external data is used.

In [2]:
# Load train and test datasets
train = pd.read_csv("train_sample.csv")
test = pd.read_csv("test_sample.csv")

### 3. Data Cleaning
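Coerce `vehicle_density` and `population_density` to numeric types. Values that cannot be parsed become `NaN` and are imputed later in the preprocessing pipeline.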

In [3]:
# Convert columns to numeric
for col in ['vehicle_density', 'population_density']:
    train[col] = pd.to_numeric(train[col], errors='coerce')
    test[col] = pd.to_numeric(test[col], errors='coerce')
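To illustrate what `errors='coerce'` does, here is a minimal sketch on a hypothetical series (the values below are made up and do not come from the competition data). Unparseable entries become `NaN` rather than raising an error.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as strings, one bad entry, one missing
raw = pd.Series(["120", "85.5", "n/a", None])

# errors="coerce" turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

# Count how many entries ended up missing after coercion
n_missing = int(clean.isna().sum())
print(clean.tolist(), n_missing)
```

Counting `NaN` values before and after coercion is a quick way to see how many entries were silently dropped.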


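### 4. Transform Target & Features
Apply a `log1p` transformation to the skewed feature `event_count` and to the target `travel_time`, so that extreme values have less influence on the linear model. Predictions are converted back to the original scale with `expm1` after inference.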

In [4]:
# Log transform event_count
train['event_count'] = np.log1p(train['event_count'])
test['event_count'] = np.log1p(test['event_count'])

# Log transform target
y_train = np.log1p(train['travel_time'])
X_train = train.drop(columns='travel_time')
X_test = test.copy()
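The pairing of `np.log1p` and `np.expm1` matters because the model is trained in log space and its predictions have to be mapped back. A small sketch with hypothetical travel times shows that the two functions are exact inverses and that zero is handled safely.

```python
import numpy as np

# Hypothetical travel times, including zero and one large value
travel_time = np.array([0.0, 5.0, 30.0, 600.0])

y_log = np.log1p(travel_time)   # log(1 + x), well defined at x = 0
y_back = np.expm1(y_log)        # exp(y) - 1, the inverse of log1p

# The round trip recovers the original values up to floating-point error
round_trip_ok = bool(np.allclose(y_back, travel_time))
print(y_log.round(3), round_trip_ok)
```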

### 5. Feature Engineering
Add two features derived from the start and end points:  
- `route`: the start and end points joined into a single categorical label  
- `same_area`: binary indicator of whether the start and end points are the same

In [5]:
# Create 'route' and 'same_area' features
for df in [X_train, X_test]:
    df['route'] = df['start_point'] + '->' + df['end_point']
    df['same_area'] = (df['start_point'] == df['end_point']).astype(int)
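As a quick illustration of these two features, the sketch below applies the same expressions to a hypothetical three-row frame (the area names `A` and `B` are placeholders). Note that `route` is directional: `A->B` and `B->A` are distinct categories.

```python
import pandas as pd

# Hypothetical start/end areas; the real column values are not shown here
toy = pd.DataFrame({
    "start_point": ["A", "A", "B"],
    "end_point": ["B", "A", "A"],
})

# Same expressions as in the cell above, applied to the toy frame
toy["route"] = toy["start_point"] + "->" + toy["end_point"]
toy["same_area"] = (toy["start_point"] == toy["end_point"]).astype(int)

print(toy)
```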


### 6. Preprocessing Pipelines
Build preprocessing pipelines using `ColumnTransformer` to scale and encode the features appropriately.

In [6]:
# Define feature groups
cat_cols = ['time_of_day', 'day_of_week', 'weather', 'route']
num_std_cols = ['historical_delay_factor', 'public_transport_availability', 'is_holiday', 'same_area']
num_mm_cols = ['traffic_condition', 'vehicle_density', 'population_density', 'event_count']

# Pipelines
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
num_std_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
num_mm_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

# Column transformer
preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_transformer, cat_cols),
    ('num_std', num_std_transformer, num_std_cols),
    ('num_mm', num_mm_transformer, num_mm_cols)
])
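Because `route` can take values in the test set that never appear in training, the encoder is configured with `handle_unknown='ignore'`. The sketch below, using hypothetical route labels, shows the effect: an unseen category is encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical routes; "C->A" appears only at prediction time
train_routes = pd.DataFrame({"route": ["A->B", "B->A", "A->B"]})
test_routes = pd.DataFrame({"route": ["A->B", "C->A"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_routes)

# transform returns a sparse matrix by default; densify it for display
encoded = encoder.transform(test_routes).toarray()
print(encoder.categories_[0], encoded)
```

The all-zero row means the model falls back on the remaining features for routes it has never seen.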


### 7. Model Definition & Pipeline
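Use `SGDRegressor` with Huber loss, a constant learning rate, and early stopping on a 15% validation split. The model is wrapped in a single pipeline together with the preprocessing steps, so the same transformations are applied at fit and predict time.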

In [7]:
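pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', SGDRegressor(
        loss='huber',              # quadratic for small residuals, linear for large ones
        alpha=5e-10,               # regularization strength (close to zero here)
        epsilon=1.2,               # residual size at which the Huber loss turns linear
        learning_rate='constant',  # keep the step size fixed at eta0
        eta0=0.02,
        early_stopping=True,       # monitor the score on a held-out validation split
        validation_fraction=0.15,  # share of the training data held out for that split
        n_iter_no_change=4,        # stop after 4 epochs without improvement of at least tol
        tol=1e-6,
        random_state=42
    ))
])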
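For reference, the Huber loss can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the standard formula, not scikit-learn's internal code; `huber_loss` is a hypothetical helper name. Since the target is log-transformed, the residuals here are differences in log space.

```python
import numpy as np

def huber_loss(residual, epsilon=1.2):
    """Huber loss: 0.5 * r^2 if |r| <= epsilon, else epsilon * |r| - 0.5 * epsilon^2."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = epsilon * abs_r - 0.5 * epsilon ** 2
    return np.where(abs_r <= epsilon, quadratic, linear)

# Residuals below, at, and above the threshold epsilon = 1.2
residuals = np.array([0.5, 1.2, 3.0])
losses = huber_loss(residuals)
print(losses)
```

Beyond the threshold the loss grows linearly rather than quadratically, which limits how much a single outlier can pull the fit.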


### 8. Train Model & Predict
Fit the model on the training data and generate predictions on the test set.

In [8]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_pred = np.expm1(y_pred)
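Before submitting, it can help to estimate the error on a held-out split in the original (non-log) scale. The sketch below is an assumption-laden illustration: it uses synthetic data, a simplified pipeline, and RMSE as the metric, none of which are confirmed by the competition setup.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): log-linear travel times
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.1, size=500)
travel_time = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2]) + noise)

X_fit, X_val, t_fit, t_val = train_test_split(
    X, travel_time, test_size=0.2, random_state=42
)

# Simplified pipeline (assumption): scaling plus an SGD model with Huber loss
model = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(loss="huber", epsilon=1.2, random_state=42)),
])
model.fit(X_fit, np.log1p(t_fit))  # train in log space, as in the notebook

# Map predictions back to the original scale before scoring
pred_val = np.expm1(model.predict(X_val))
rmse = float(np.sqrt(np.mean((pred_val - t_val) ** 2)))
print(f"validation RMSE: {rmse:.3f}")
```

Scoring after `expm1` keeps the estimate comparable to whatever metric is computed on the submitted, untransformed predictions.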


### 9. Save Submission
Save the predictions to `submission.csv` as required for competition submission.

In [9]:
# Note: the DataFrame column is unnamed, so the CSV header row is "0"
pd.DataFrame(y_pred).to_csv("submission.csv", index=False, header=True, float_format='%.6f')
print("submission.csv created successfully.")


submission.csv created successfully.
