## ⚡ Final Mission: Mapping SkyNet's Energy Nexus

### 🌐 The Discovery
SkyNet is harvesting energy from Trondheim's buildings. Some structures provide significantly more power than others.

### 🎯 Your Mission
Predict the **Nexus Rating** of unknown buildings in Trondheim (test set).

### 🧠 The Challenge
1. **Target**: Transform the Nexus Rating to reveal the true energy hierarchy
2. **Data Quality**: Handle missing values and categorical features
3. **Ensembling**: Use advanced models and ensemble learning
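
One common reading of the target-transform step is a `log1p` / `expm1` round trip. The sketch below uses made-up target values (not the mission data) to show that the transform compresses a skewed range and can be inverted exactly, so a model trained in log space can still report predictions on the original scale.

```python
import numpy as np

# Hypothetical skewed target values, chosen only for illustration.
y = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log1p(y) = log(1 + y): defined at zero and compresses large values.
y_log = np.log1p(y)

# expm1 is the inverse of log1p, so log-space predictions map back cleanly.
y_back = np.expm1(y_log)

print(y_log.round(3))
print(np.allclose(y, y_back))
```

Training on `y_log` with a squared-error loss targets the same quantity that RMSLE measures, which is one reason this transform is a natural candidate here.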

### 💡 Hint
You suspect that an insider has tampered with the columns in the testing data... 

Compare the training and test distributions and try to rectify the test dataset.
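
A minimal sketch of such a comparison, assuming that per-column medians are enough to expose the tampering. The helper `compare_medians` and the toy frames are hypothetical; on the real data you would pass `train` and `test` instead.

```python
import numpy as np
import pandas as pd

def compare_medians(train_df, test_df):
    """Side-by-side medians for the numeric columns both frames share."""
    shared = [c for c in train_df.columns if c in test_df.columns]
    train_med = train_df[shared].median(numeric_only=True)
    test_med = test_df[shared].median(numeric_only=True)
    out = pd.DataFrame({"train_median": train_med, "test_median": test_med})
    # A ratio far from 1 suggests the column holds different data in the two sets.
    out["ratio"] = out["test_median"] / out["train_median"]
    return out

# Toy data: the test columns hold the right values under the wrong names.
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"a": rng.normal(100, 5, 200), "b": rng.normal(3, 0.5, 200)})
toy_test = pd.DataFrame({"a": rng.normal(3, 0.5, 200), "b": rng.normal(100, 5, 200)})

report = compare_medians(toy_train, toy_test)
print(report)
```

If the test median of one column matches the training median of its neighbour, that points to a column shift rather than a genuine change in distribution.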

### 📊 Formal Requirements
1. **Performance**: Achieve RMSLE <= 0.294 on the test set
2. **Discussion**:

   a. Explain your threshold-breaking strategy

   b. Justify RMSLE usage. Why do we use this metric? Which loss function did you use?

   c. Plot and interpret feature importances

   d. Describe your ensembling techniques

   e. In real life, you do not have the test targets. How would you make sure your model generalizes well to unseen data?
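
As background for question (b), the sketch below writes RMSLE out with NumPy only. The helper name `rmsle_np` and the numbers are illustrative assumptions, not part of the mission code. It shows that RMSLE is an RMSE computed on `log1p` values, and that for the same absolute error an under-prediction costs more than an over-prediction.

```python
import numpy as np

def rmsle_np(y_true, y_pred):
    """RMSLE as plain RMSE on log1p-transformed values (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([100.0])

# Same absolute error of 50, in opposite directions.
err_under = rmsle_np(y_true, [50.0])
err_over = rmsle_np(y_true, [150.0])

print(f"under-prediction: {err_under:.3f}")
print(f"over-prediction:  {err_over:.3f}")
```

Because the error lives in log space, it reflects relative rather than absolute differences, which suits a target spanning several orders of magnitude.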

---

In [22]:
import pandas as pd
import numpy as np

train = pd.read_csv('final_mission_train.csv')
test = pd.read_csv('final_mission_test.csv')

In [23]:
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """ Root Mean Squared Logarithmic Error """
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

In [24]:
# Shift all columns in the test set right by 1, except ownership_type;
# the last column (grid_connections) wraps around into nexus_rating
original_grid = test['grid_connections'].copy()
copy = test.copy()
test.iloc[:, 1:] = copy.iloc[:, 1:].shift(1, axis=1)
test['nexus_rating'] = original_grid
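
The effect of that repair can be checked on a toy frame. The sketch below is not the cell above: it swaps in `np.roll` to do the shift and the wrap-around in one step, and it assumes the rotated columns share a numeric dtype. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: "id" is correct, but the other values sit one column too far left,
# with the true value of the first rotated column stored in the last one.
toy = pd.DataFrame({
    "id": [1, 2],
    "x": [20.0, 21.0],   # really belongs in "y"
    "y": [30.0, 31.0],   # really belongs in "z"
    "z": [10.0, 11.0],   # really belongs in "x"
})

cols = toy.columns[1:]
fixed = toy.copy()
# Rotate values one column to the right; the last column wraps to the first.
fixed[cols] = np.roll(toy[cols].to_numpy(), shift=1, axis=1)

print(fixed)
```

After the rotation each column name lines up with its intended values, which is the same outcome as shifting right and then restoring the wrapped column by hand.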

In [25]:
# Data preprocessing - the commented-out attempts below all gave worse results
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Identify continuous & categorical columns
# continous = train.nunique()[train.nunique() > 10].index.tolist()
# categorical = train.nunique()[train.nunique() <= 10].index.tolist() # Should really use set theory

# preprocess = ColumnTransformer([
#     ("num", SimpleImputer(strategy="mean"), continous),
#     ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
# ])

# # Fill missing values in continuous columns with the column mean
# # (could be factored into a helper function)
# test[continous] = test[continous].fillna(test[continous].mean())
# train[continous] = train[continous].fillna(train[continous].mean())


# # Fill missing values in categorical columns with the most frequent value
# train = train.apply(lambda x: x.fillna(x.value_counts().index[0]))
# test = test.apply(lambda x: x.fillna(x.value_counts().index[0]))

# # One hot encode categorical values
# train = pd.get_dummies(train, columns=categorical, drop_first=True)
# test = pd.get_dummies(test, columns=categorical, drop_first=True)

# train.isnull().sum()
# missing_percentage = test.isnull().sum() / len(test)
# missing_percentage.keys
threshold = 0.4
missing_frac = train.isna().mean()   # fraction of NaNs per column
keep_cols = missing_frac[missing_frac <= threshold].index

train = train[keep_cols]
test = test[keep_cols]
train.isna().mean()


ownership_type               0.379214
nexus_rating                 0.000000
energy_footprint             0.000000
core_reactor_size            0.202749
harvesting_space             0.166717
vertical_alignment           0.000000
power_chambers               0.000000
shared_conversion_units      0.166287
isolated_conversion_units    0.166287
internal_collectors          0.346661
external_collectors          0.346661
grid_connections             0.003436
dtype: float64

In [26]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_log_error

# -----------------------------
# Config
# -----------------------------
TARGET = "nexus_rating"
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# def rmsle(y_true, y_pred):
#     y_pred = np.clip(np.asarray(y_pred), 0, None)
#     return mean_squared_log_error(np.asarray(y_true), y_pred) ** 0.5

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# -----------------------------
# Split data
# -----------------------------
x_train = train.drop(columns=[TARGET]).copy()
y_train = train[TARGET].astype(float).copy()
x_test  = test.drop(columns=[TARGET], errors="ignore").copy()

# -----------------------------
# Column inference (minimal)
# - Only treat object/category as categorical
# - Everything numeric stays numeric (no cardinality heuristics)
# -----------------------------
cat_cols = x_train.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = x_train.select_dtypes(include=[np.number]).columns.tolist()

preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), num_cols),                 # robust, minimal
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),           # only for true string cats
    ],
    remainder="drop",
)

# -----------------------------
# Base RandomForest (solid defaults for tabular)
# -----------------------------
rf = RandomForestRegressor(
    n_estimators=1000,
    max_depth=None,
    min_samples_leaf=3,        # smooths and helps tails
    max_features=0.5,          # reduce tree correlation
    n_jobs=-1,
    random_state=42,
)

# -----------------------------
# Variant A: Train on RAW target (clip only at scoring)
# -----------------------------
pipe_raw = Pipeline([
    ("prep", preprocess),
    ("model", rf),
])

# -----------------------------
# Variant B: Train on LOG1P(target), invert with EXPM1 + clip≥0
# -----------------------------
def inv_expm1_clip(yhat_log):
    return np.clip(np.expm1(yhat_log), 0, None)

pipe_log = Pipeline([
    ("prep", preprocess),
    ("model", TransformedTargetRegressor(
        regressor=rf,
        func=np.log1p,
        inverse_func=inv_expm1_clip,
        check_inverse=False,
    )),
])

# -----------------------------
# Cross-validated comparison (RMSLE)
# -----------------------------
scores_raw = -cross_val_score(pipe_raw, x_train, y_train, cv=cv, scoring=rmsle_scorer, n_jobs=-1)
scores_log = -cross_val_score(pipe_log, x_train, y_train, cv=cv, scoring=rmsle_scorer, n_jobs=-1)

print(f"RF RAW    CV RMSLE: {scores_raw.mean():.6f} ± {scores_raw.std():.6f}")
print(f"RF LOG1P  CV RMSLE: {scores_log.mean():.6f} ± {scores_log.std():.6f}")

# -----------------------------
# Fit the better one and predict
# -----------------------------
best_pipe = pipe_log if scores_log.mean() <= scores_raw.mean() else pipe_raw
best_pipe.fit(x_train, y_train)

y_pred = best_pipe.predict(x_test)  # already non-negative in LOG1P case; RAW may need clipping if you compute RMSLE
y_pred_safe = np.clip(y_pred, 0, None)

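# Baseline for comparison: the bare forest on the unprocessed features and raw target
# (no imputation, so this needs a scikit-learn release whose forests accept NaN inputs)
rf_fit = rf.fit(x_train, y_train)
y_pred_ez = rf.predict(x_test)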


RF RAW    CV RMSLE: 0.322180 ± 0.006347
RF LOG1P  CV RMSLE: 0.311342 ± 0.006670


In [27]:
# Most important column by far is energy_footprint
# What does that mean, and how should we handle it?
# print(len(x_train.columns.tolist()), len(rf.feature_importances_))
# importances = zip(x_train.columns.tolist(), rf.feature_importances_)  # avoid overwriting `test`
# tuple(importances)

In [28]:
# Score predictions on the original nexus_rating scale for a fair comparison

print('Required RMSLE: ', 0.294)
print('RMSLE (best pipeline): ', rmsle(test['nexus_rating'], y_pred_safe))

print('RMSLE (bare RF): ', rmsle(test['nexus_rating'], y_pred_ez))

Required RMSLE:  0.294
RMSLE (best pipeline):  0.32589298204940365
RMSLE (bare RF):  0.3345966358547679
