# Introduction

The goal of this experiment is to create an initial version of the ML pipeline that predicts house prices for the melkor project.

## Experiment Setup

### Pipeline Flow

This pipeline is intentionally kept simple: a single preprocessing step transforms the input data, and a single model is scored on the result.

### Model Types

Model types are limited to Random Forest models, namely the [```sklearn.ensemble.RandomForestRegressor```](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). A sound introduction to Random Forest models can be found in [An Introduction to Statistical Learning](https://www.statlearning.com/).

#### Hyperparameters

* ```n_estimators```: $500$
* ```min_samples_split```: $2^{i}, i \in [1,~2,~3,~4,~5,~6]$
* ```max_features```: ```sqrt, None```
* ```max_samples```: $[0.7,~1.0]$ (fraction of the training rows drawn for each tree)

All other hyperparameters are kept at their default values.

Every combination in the hyperparameter grid is evaluated.
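
As a quick check of the grid size, the hyperparameter lists above can be expanded into their Cartesian product. This is a minimal sketch using only the values listed in this section; `max_samples` is written as the float `1.0` here, which is an assumption about the intent, because scikit-learn reads an integer `1` as "draw one sample per tree".

```python
import itertools

# Grid values copied from the hyperparameter list above.
dct_grid = {
    "n_estimators": [500],
    "min_samples_split": [2**i for i in range(1, 7)],  # 2, 4, 8, 16, 32, 64
    "max_features": ["sqrt", None],
    "max_samples": [0.7, 1.0],  # assumption: 1.0 (float) meaning all samples
}

# One dict per configuration: the Cartesian product of all value lists.
ls_grid = [
    dict(zip(dct_grid, combo))
    for combo in itertools.product(*dct_grid.values())
]

print(len(ls_grid))  # 1 * 6 * 2 * 2 = 24 configurations
print(ls_grid[0])
```

The count of 24 follows directly from the list lengths, so adding one more value to any list multiplies the number of model fits accordingly.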

### Evaluation

* The dataset is split into 60% for training and 40% for validation.
* The training data is split into 5 folds for cross-validation.
* The [RMSE](https://www.notion.so/prophecylabs/Project-Success-Metrics-68511ba4f8634756a440708cd0fd0829) is averaged over all folds. The model with the lowest average RMSE is used to predict prices for the validation set and is compared to the baseline model.
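
The evaluation scheme described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the data is synthetic (`make_regression`), the two candidate configurations and the helper `cv_rmse` are hypothetical, and the forests use only 20 trees so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (assumption); the project uses prepared CSV files.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 60% for training, 40% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=42
)

def cv_rmse(params, X, y, n_splits=5, seed=42):
    """Average RMSE of one configuration over n_splits cross-validation folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for idx_fit, idx_eval in kfold.split(X):
        model = RandomForestRegressor(**params, random_state=seed)
        model.fit(X[idx_fit], y[idx_fit])
        y_hat = model.predict(X[idx_eval])
        scores.append(np.sqrt(mean_squared_error(y[idx_eval], y_hat)))
    return float(np.mean(scores))

# Two hypothetical candidates, kept small for speed.
candidates = [
    {"n_estimators": 20, "min_samples_split": 2},
    {"n_estimators": 20, "min_samples_split": 64},
]
cv_scores = [cv_rmse(params, X_train, y_train) for params in candidates]
best_params = candidates[int(np.argmin(cv_scores))]

# Refit the winner on all training data and score it on the validation set.
best_model = RandomForestRegressor(**best_params, random_state=42)
best_model.fit(X_train, y_train)
val_rmse = float(np.sqrt(mean_squared_error(y_val, best_model.predict(X_val))))
print(best_params, val_rmse)
```

Keeping the validation set out of the fold loop means the final RMSE is measured on data that played no part in choosing the configuration.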

# Experiment Run

In [1]:
import pandas as pd
import numpy as np
import yaml
import itertools
import os
import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
from datetime import datetime
from tqdm import tqdm
from joblib import Parallel, delayed

rndseed = 42
os.chdir('../..')

In [2]:
# set timestamp to uniquely identify the experiment
tstamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')

In [3]:
with open("resources/paths.yaml") as f:
    dct_paths = yaml.safe_load(f)

In [4]:
def create_grid(dct_grid):
    """create a list of all combinations of elements of lists in a dictionary"""

    ls_grid = list(itertools.product(*dct_grid.values()))

    ls_grid = [dict(zip(tuple(dct_grid.keys()), i)) for i in ls_grid]

    return ls_grid

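def test_eval(dct_param):
    """Fit one model configuration on the training set and return its RMSE on the test set"""

    model = RandomForestRegressor(**dct_param, random_state = rndseed)

    # flatten the single-column target to 1-D, as scikit-learn expects
    model.fit(X_train, np.ravel(y_train))

    y_hat = model.predict(X_test)

    # take the root explicitly; the `squared` argument is deprecated in newer scikit-learn releases
    return np.sqrt(MSE(y_test, y_hat))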

## Data Import and Preparation 

In [5]:
with open(dct_paths["log"] + "/master_dtypes.yaml") as f:
    dct_dtypes = yaml.safe_load(f)

X_train = pd.read_csv(dct_paths['data']+'/X_train.csv', dtype = dct_dtypes)
y_train = pd.read_csv(dct_paths['data']+'/y_train.csv', dtype = dct_dtypes)

X_test = pd.read_csv(dct_paths['data']+'/X_test.csv', dtype = dct_dtypes)
y_test = pd.read_csv(dct_paths['data']+'/y_test.csv', dtype = dct_dtypes)

X_val = pd.read_csv(dct_paths['data']+'/X_val.csv', dtype = dct_dtypes)
y_val = pd.read_csv(dct_paths['data']+'/y_val.csv', dtype = dct_dtypes)


In [6]:
pd.concat([y_train, y_test]).shape

(2341, 1)

## Grid Setup

In [7]:
dct_grid = {
    'n_estimators':[500],
    'min_samples_split':[2**i for i in range(1, 7)],
    'max_features':['sqrt', None],
    # float 1.0 means "all samples"; an integer 1 would draw a single sample per tree
    'max_samples':[0.7, 1.0]
}

ls_grid = create_grid(dct_grid)

## Experiment Run

In [8]:
ls_out = Parallel(n_jobs=-1)(delayed(test_eval)(dct_param) for dct_param in tqdm(ls_grid))

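# attach each score to its configuration (the stored value is the RMSE, despite the key name)
for dct_param, score in zip(ls_grid, ls_out):
    dct_param['test_mse'] = score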

100%|██████████| 24/24 [00:44<00:00,  1.85s/it]


## Training of Final Model

### Redefine Training Set to Include Test Set

In [9]:
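df_tmp = pd.DataFrame(ls_grid).sort_values('test_mse')

# the first row after sorting has the lowest error; its index label is the position in ls_grid
# (df_tmp.loc[0, ...] would always return the first grid entry, regardless of its score)
dct_best = ls_grid[df_tmp.index[0]]
params_final = {k: v for k, v in dct_best.items() if k != 'test_mse'}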

X_train = pd.concat([X_train, X_test])
y_train = pd.concat([y_train, y_test]).values.ravel()

model_final = RandomForestRegressor(**params_final, random_state=rndseed)
model_final.fit(X_train, y_train)
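
When the best configuration is read from a sorted results table, label-based and position-based lookups can return different rows. The sketch below illustrates this with a small hypothetical score table (the values are made up for the example): `sort_values` reorders the rows but keeps their original index labels.

```python
import pandas as pd

# Hypothetical scores for three configurations; the row labelled 0 is not the best.
df_scores = pd.DataFrame(
    {"min_samples_split": [2, 4, 8], "test_mse": [0.30, 0.10, 0.20]}
)
df_sorted = df_scores.sort_values("test_mse")

# .loc selects by index label, so this is still the original first row.
by_label = df_sorted.loc[0, "min_samples_split"]

# .iloc selects by position, so this is the row with the lowest score.
by_position = df_sorted["min_samples_split"].iloc[0]

print(by_label, by_position)
```

Calling `reset_index(drop=True)` after sorting is another way to make labels and positions agree again.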

## Saving of Results

In [10]:
path_results = dct_paths['experiment_results']+'/'+tstamp
os.makedirs(path_results)

In [11]:
dct_config = {
    'grid':dct_grid,
    'features':list(X_train.columns),  # plain list, so the config unpickles without pandas
    'rndseed':rndseed
}

In [12]:
with open(path_results+'/model.pkl', 'wb') as f:
    pickle.dump(model_final, f)

with open(path_results+'/config.pkl', 'wb') as f:
    pickle.dump(dct_config, f)
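
To reuse the saved artifacts later, they can be read back with `pickle.load`. The following is a minimal round-trip sketch: the temporary directory stands in for `path_results`, and the config values are illustrative placeholders with the same keys as `dct_config`.

```python
import os
import pickle
import tempfile

# Stand-in for the results directory (assumption); the notebook uses path_results.
path_results = tempfile.mkdtemp()

# Placeholder config with the same keys as dct_config; the values are illustrative.
dct_config = {
    "grid": {"n_estimators": [500]},
    "features": ["feature_a", "feature_b"],
    "rndseed": 42,
}

with open(os.path.join(path_results, "config.pkl"), "wb") as f:
    pickle.dump(dct_config, f)

# Read the artifact back and confirm it matches what was written.
with open(os.path.join(path_results, "config.pkl"), "rb") as f:
    dct_loaded = pickle.load(f)

print(dct_loaded == dct_config)
```

The model file can be loaded the same way, provided the loading environment has a compatible scikit-learn version installed. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.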

In [13]:
print(path_results)

resources/results/2022_05_28_13_45_02
