## ⚡ Final Mission: Mapping SkyNet's Energy Nexus

### 🌐 The Discovery
SkyNet is harvesting energy from Trondheim's buildings. Some structures provide significantly more power than others.

### 🎯 Your Mission
Predict the **Nexus Rating** of unknown buildings in Trondheim (test set).

### 🧠 The Challenge
1. **Target**: Transform the Nexus Rating to reveal the true energy hierarchy
2. **Data Quality**: Handle missing values and categorical features
3. **Ensembling**: Use advanced models and ensemble learning

### 💡 Hint
You suspect that an insider has tampered with the columns in the testing data... 

Compare the training and test distributions and try to rectify the test dataset.
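
One way to run that comparison is to put per-column summary statistics for both datasets side by side. The sketch below uses small made-up frames (`train_demo`, `test_demo` and the column names `a`, `b`, `c` are hypothetical, not the mission data); in the toy test frame each column holds values that belong to a different column.

```python
import pandas as pd

# Hypothetical toy data: in test_demo the values have been moved between columns
# (a stand-in for the suspected tampering).
train_demo = pd.DataFrame({"a": [1, 2, 3], "b": [100, 200, 300], "c": [0.1, 0.2, 0.3]})
test_demo = pd.DataFrame({"a": [110, 190, 310], "b": [0.15, 0.25, 0.2], "c": [2, 1, 3]})

# Side-by-side medians: a test column whose median resembles a *different*
# training column hints at shifted or swapped columns.
summary = pd.DataFrame({
    "train_median": train_demo.median(),
    "test_median": test_demo.median(),
})
print(summary)
```

Here the test median of `a` looks like the training median of `b`, and so on around the cycle, which is the kind of pattern to look for in the real columns.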

### 📊 Formal Requirements
1. **Performance**: Achieve RMSLE <= 0.294 on the test set
2. **Discussion**:

   a. Explain your threshold-breaking strategy

   b. Justify RMSLE usage. Why do we use this metric? Which loss function did you use?

   c. Plot and interpret feature importances

   d. Describe your ensembling techniques

   e. In real life, you do not have the test targets. How would you make sure your model generalizes well to unseen data?
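
For point (e), a common answer is k-fold cross-validation on the training data. The following is a minimal sketch on synthetic data (the arrays `X` and `y` and all hyperparameters are assumptions for illustration), scoring each fold with scikit-learn's `neg_mean_squared_log_error` and converting it to RMSLE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data: a positive, right-skewed target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=200))

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# The scorer returns negated MSLE, so flip the sign before taking the root.
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_log_error")
cv_rmsle = np.sqrt(-scores)
print(cv_rmsle.mean(), cv_rmsle.std())
```

The spread across folds gives a rough sense of how stable the estimate is before any test target is seen.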

---

In [259]:
import pandas as pd
import numpy as np

train = pd.read_csv('final_mission_train.csv')
test = pd.read_csv('final_mission_test.csv')

In [260]:
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """ Root Mean Squared Logarithmic Error """
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
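
RMSLE is the ordinary RMSE computed on `log1p`-transformed values, which is why a model trained with a squared-error loss on `log1p(target)` is effectively optimizing this metric. A small check with made-up numbers (the arrays are illustrative only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Illustrative values, not taken from the mission data.
y_true = np.array([3.0, 50.0, 400.0])
y_pred = np.array([2.5, 60.0, 380.0])

# RMSLE computed directly ...
rmsle_value = np.sqrt(mean_squared_log_error(y_true, y_pred))
# ... and as plain RMSE after applying log1p to both arrays.
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(np.isclose(rmsle_value, rmse_on_logs))
```

Because the error is measured on a log scale, it penalizes relative rather than absolute deviations.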

In [261]:
# Shift all columns in the test set right by one position, except ownership_type
original_grid = test['grid_connections'].copy()
shifted = test.copy()
test.iloc[:, 1:] = shifted.iloc[:, 1:].shift(1, axis=1)
# Restore nexus_rating from the values originally stored under grid_connections
test['nexus_rating'] = original_grid

In [267]:
from sklearn.impute import SimpleImputer

# Data preprocessing
encode_cols = ['ownership_type']  # intended for one-hot encoding (not applied yet)
numeric_imputer = SimpleImputer(strategy="median")  # alternative to the mean fill below (unused)

# Split columns by cardinality: more than 10 unique values counts as continuous
n_unique = test.nunique()
continuous = n_unique[n_unique > 10].index.tolist()
categorical = n_unique[n_unique <= 10].index.tolist()
print(categorical)
print(continuous)

# Fill missing continuous values with each dataset's own column mean.
# Note: `continuous` includes nexus_rating, so missing targets are mean-filled too.
test[continuous] = test[continuous].fillna(test[continuous].mean())
train[continuous] = train[continuous].fillna(train[continuous].mean())



# TODO: fill missing values in the categorical columns

# TODO: one-hot encode the categorical columns

['ownership_type', 'power_chambers', 'energy_flow_design', 'shared_conversion_units', 'isolated_conversion_units', 'internal_collectors', 'external_collectors', 'ambient_harvesters', 'shielded_harvesters', 'efficiency_grade', 'grid_connections']
['nexus_rating', 'energy_footprint', 'core_reactor_size', 'harvesting_space', 'vertical_alignment', 'upper_collector_height']
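
The categorical fill and one-hot encoding steps are left empty in the cell above. A minimal sketch of one possible approach, using hypothetical toy frames (`train_demo`, `test_demo`) rather than the mission data: fill with the training mode, encode with `pd.get_dummies`, then align the test columns to the training columns.

```python
import pandas as pd

# Hypothetical toy frames; only the column names mirror the mission data.
train_demo = pd.DataFrame({
    "ownership_type": [0.0, 1.0, 1.0, None],
    "energy_footprint": [50.0, 60.0, 70.0, 80.0],
})
test_demo = pd.DataFrame({
    "ownership_type": [1.0, 3.0],
    "energy_footprint": [55.0, 65.0],
})

# Fill missing categories with the most frequent training value.
mode_value = train_demo["ownership_type"].mode()[0]
train_demo["ownership_type"] = train_demo["ownership_type"].fillna(mode_value)
test_demo["ownership_type"] = test_demo["ownership_type"].fillna(mode_value)

# One-hot encode each frame separately.
train_enc = pd.get_dummies(train_demo, columns=["ownership_type"], dtype=int)
test_enc = pd.get_dummies(test_demo, columns=["ownership_type"], dtype=int)

# Align test to the training columns: categories unseen in training are dropped,
# categories missing from test become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Taking the fill value and the column layout from the training set keeps the test features in the shape the fitted model expects.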


In [263]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(min_samples_leaf=10)

x_train = train.drop(columns=["nexus_rating"]).copy()
y_train = train['nexus_rating']

x_test = test.drop(columns=["nexus_rating"]).copy()

rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
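
For the feature-importance requirement, a fitted random forest exposes `feature_importances_`. The sketch below uses synthetic data with hypothetical column names (`strong`, `weak`, `noise`) to show how the values can be ranked before plotting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on "strong", slightly on "weak".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + 0.1 * rng.normal(size=300)

rf_demo = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them so the ranking is easy to read.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.barh()` would draw the bar chart. Impurity-based importances can overstate high-cardinality features, so they are best read as a rough ranking.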

In [264]:
# No target transform was applied above, so the predictions are already on the nexus_rating scale

print('Required RMSLE: ', 0.294)
print('RMSLE: ', rmsle(test['nexus_rating'], y_pred))

Required RMSLE:  0.294
RMSLE:  0.3426635263095595
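
The baseline above misses the 0.294 threshold. One direction the challenge list suggests is training on a log-transformed target and averaging several models. The following is a sketch on synthetic data, not the notebook's solution: the data, the choice of models, and the hyperparameters are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a positive, right-skewed target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = np.exp(1.0 + X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=400))

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

# Fit each model on log1p(y), so squared error on the logs matches RMSLE.
log_preds = [m.fit(X_tr, np.log1p(y_tr)).predict(X_val) for m in models]

# Average in log space, map back with expm1, and clip so the metric
# never receives a negative prediction.
ensemble_pred = np.clip(np.expm1(np.mean(log_preds, axis=0)), 0, None)
ensemble_rmsle = np.sqrt(mean_squared_log_error(y_val, ensemble_pred))
print(ensemble_rmsle)
```

Simple averaging is the most basic ensembling scheme; weighted averages or stacking follow the same pattern with a different combination step.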
