## Example notebook

This example notebook shows how to run a model pipeline with our custom functions. The functions assume they are called from files inside the models folder, but the paths to the data files and modules can be adjusted so that they work from any folder.

**Do not change the first three cells or the last one. Adapt the fourth cell to whichever model you want to use, and apply grid search if necessary.**

- I haven't had time to incorporate cross-validation, but it should be easy to add here.
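
A minimal sketch of how cross-validation could be added, assuming the rows are time-ordered so that scikit-learn's `TimeSeriesSplit` is a reasonable choice. The synthetic arrays `X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`, and the reduced `n_estimators` only keeps the demo fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic, time-ordered stand-in for X_train / y_train (assumption, not the project data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

cv_pipeline = Pipeline(steps=[
    ("regressor", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

# Each fold trains on earlier rows and validates on the rows that follow them
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    cv_pipeline, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse = -scores  # scikit-learn negates error metrics so that higher is better

print(f"RMSE per fold: {np.round(cv_rmse, 2)}")
print(f"Mean RMSE: {cv_rmse.mean():.2f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, provided the rows are still sorted by date after the temporal split.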

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

import sys
sys.path.append('..')
from utils import get_train_data, create_preprocessor, submit_test, train_test_split_temporal


In [2]:
X, y = get_train_data() # X comes out prepared with added columns

# Train-test split
X_train, X_valid, y_train, y_valid = train_test_split_temporal(X, y)

X_train = X_train.drop(columns=["date"])
X_valid = X_valid.drop(columns=["date"])


In [3]:
X_train.columns


Index(['site_id', 'latitude', 'longitude', 'numer_sta', 'pmer', 'tend',
       'cod_tend', 'dd', 't', 'td', 'u', 'vv', 'ww', 'w1', 'w2', 'nbas',
       'hbas', 'cl', 'cm', 'ch', 'pres', 'tend24', 'raf10', 'rafper', 'per',
       'etat_sol', 'ht_neige', 'perssfrai', 'rr1', 'rr3', 'rr6', 'rr12',
       'rr24', 'nnuage1', 'ctype1', 'hnuage1', 'is_bank_holiday', 'year',
       'month', 'day', 'hour', 'is_weekend', 'season', 'hour_sin', 'hour_cos',
       'day_sin', 'day_cos', 'time_calm', 'time_morning', 'time_peak_hours',
       'time_working_hours', 'day_0', 'day_1', 'day_2', 'day_3', 'day_4',
       'day_5', 'day_6', 'arrondissement'],
      dtype='object')

In [4]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Define the pipeline
pipeline = Pipeline(steps=[
    #('preprocessor', preprocessor),# This preprocessor should always be used
    ('regressor', GradientBoostingRegressor(random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline
y_pred = pipeline.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))

print(f"Root Mean Squared Error: {np.round(rmse, 2)}")

Root Mean Squared Error: 0.68


In [6]:
# Assuming X_train is the DataFrame used as input to the pipeline
feature_names = X_train.columns

gbr_model = pipeline.named_steps['regressor']

feature_importances = gbr_model.feature_importances_

# Compute the average value for each feature in X_train
column_averages = X_train.mean()

# Normalize feature importances by dividing by column averages
normalized_importances = feature_importances / column_averages.values

feature_weights = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances,
    'Normalized Importance': normalized_importances,
    'Absolute Normalized Importance': np.abs(normalized_importances)
}).sort_values(by='Absolute Normalized Importance', ascending=False)

# Display the sorted DataFrame
print(feature_weights[['Feature', 'Absolute Normalized Importance']])


               Feature  Absolute Normalized Importance
43            hour_sin                    1.925390e+03
44            hour_cos                    1.758393e+03
45             day_sin                    5.667942e+00
47           time_calm                    5.581745e-01
46             day_cos                    3.141649e-01
36     is_bank_holiday                    1.081119e-01
5                 tend                    2.490901e-02
2            longitude                    2.144225e-02
41          is_weekend                    2.036665e-02
57               day_6                    1.621579e-02
48        time_morning                    1.129914e-02
49     time_peak_hours                    1.038583e-02
50  time_working_hours                    9.553494e-03
42              season                    9.100285e-03
56               day_5                    7.259015e-03
58      arrondissement                    5.510631e-03
25            etat_sol                    5.302973e-03
40        

In [3]:
# Create the preprocessor
preprocessor = create_preprocessor(X_train)

In the next cell you can experiment with different models and try grid search over different hyperparameters.
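
As an illustration of the grid search mentioned above, here is a hedged sketch using scikit-learn's `GridSearchCV` on a pipeline with a single `regressor` step. The synthetic data, the parameter values in `param_grid`, and the choice of `TimeSeriesSplit` are all assumptions made for the example, not tuned settings for this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X_train / y_train (assumption, not the project data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

search_pipeline = Pipeline(steps=[
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [2, 3],
}

search = GridSearchCV(
    search_pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_  # undo the sign flip applied by the scorer
print(best_params, round(best_rmse, 3))
```

By default `GridSearchCV` refits the best configuration on all the data it was given, so with the real data `search.best_estimator_` could be passed to `submit_test` in place of `pipeline`.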

In [None]:
# Don't forget to include the pipeline and a name that will be used for the file
submit_test(pipeline, "GradientBoostingRegressor_default")

