<a href="https://colab.research.google.com/github/KotkaZ/journey-to-zero/blob/master/RandomForestRegressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest Regressor Notebook

This notebook contains the data processing steps and a Random Forest Regressor ensemble model. The input data comes from the JourneyToZero Kaggle competition:

https://www.kaggle.com/competitions/predict-electricity-consumption




In [1]:
# Here we use Numpy arrays and Pandas dataframes to pass and modify data.
import numpy as np
import pandas as pd

# We use OneHotEncoder from preprocessing sklearn module.
from sklearn import preprocessing

# We use PCA from decomposition sklearn module.
from sklearn import decomposition

# Used to split off a validation dataset.
from sklearn.model_selection import train_test_split

# Model used in notebook.
from sklearn.ensemble import RandomForestRegressor

# As the Kaggle competition uses MAE as its performance metric, we decided to use the same.
from sklearn.metrics import mean_absolute_error

import datetime

### Timestamp extraction

Because of the unusual events of the past year, we verified that some specific dates had significantly higher electricity prices. We therefore extract the weekday, month, and hour from the timestamp.

We only extract timestamp features.



In [2]:
def extract_weekday(dataset):
    splits = dataset['date'].astype(str).str.split('-')
    dataset['weekday'] = [datetime.date(int(year), int(month), int(day)).weekday() for (year, month, day) in splits]

In [3]:
def extract_month(dataset):
    # Cast to int so the month is stored as a number rather than a string such as '09'.
    dataset['month'] = [int(month) for (_, month, _) in dataset['date'].astype(str).str.split('-')]

In [4]:
def extract_datetime(dataset):
    # Assign the whole column so it takes the datetime dtype that the .dt accessor below needs.
    dataset['time'] = pd.to_datetime(dataset['time'], format="%Y-%m-%d %H:%M:%S", utc=True)
    dataset['date'] = dataset['time'].dt.date
    dataset['hour'] = dataset['time'].dt.hour

In [5]:
def extract_features(dataset):
    extract_datetime(dataset)
    extract_month(dataset)
    extract_weekday(dataset)
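
The same three features can also be read straight from the pandas `.dt` accessor, without splitting date strings. The following is only a sketch of that alternative, not the code used in this notebook; the `toy` frame and its two timestamps are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with a 'time' column in the format the notebook parses.
toy = pd.DataFrame({"time": ["2021-09-01 00:00:00", "2021-12-25 13:00:00"]})
toy["time"] = pd.to_datetime(toy["time"], format="%Y-%m-%d %H:%M:%S", utc=True)

# The .dt accessor returns integer features directly.
toy["hour"] = toy["time"].dt.hour
toy["month"] = toy["time"].dt.month
toy["weekday"] = toy["time"].dt.weekday  # Monday=0 ... Sunday=6

print(toy[["hour", "month", "weekday"]])
```

Here the month comes out as an integer, so no later string-to-number conversion is needed.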


### One Hot Encoder

Regressor models prefer numerical inputs, and this notebook was adapted from our Neural Networks notebook, so we kept this method to one-hot encode the categorical columns.

In [6]:
def one_hot_encode(dataset, columns, encoder=None) -> preprocessing.OneHotEncoder:
    # Reuse a fitted encoder (test set) or fit a new one (train set).
    if encoder:
        transformed = encoder.transform(dataset[columns])
    else:
        # Note: scikit-learn 1.2+ renames this argument to sparse_output.
        encoder = preprocessing.OneHotEncoder(sparse=False)
        transformed = encoder.fit_transform(dataset[columns])

    # Build one column name per (feature, category) pair.
    new_columns = []
    for i, column in enumerate(encoder.feature_names_in_):
        new_columns.extend([column + str(cat) for cat in encoder.categories_[i]])

    # Replace the original categorical columns with the encoded ones.
    encoder_df = pd.DataFrame(transformed, index=dataset.index)
    dataset[new_columns] = encoder_df
    dataset.drop(columns=columns, inplace=True)
    return encoder

### Feature dropping

Estonia receives approximately 500-800 millimeters of precipitation per year on average. Our dataset contained only about 140 mm of rain, which is definitely not correct. The amount of snow was implausible for the same reason. We could either integrate a new weather dataset or leave the data as it is. The simplest approach is to delete both columns entirely, which is what we decided to do.

Some of the rows contained null values, so we dropped those rows as well. There were also cases where electricity prices hit the market limit, which produced some outliers. We dropped those rows too.


In [7]:
def drop_features(dataset):
    dataset.drop(columns=['snow','prcp','time','date'], inplace=True)


In [8]:
def drop_rows(dataset):
    # Deal with NaN values
    initial_len = len(dataset)
    dataset.dropna(inplace=True)
    new_len = len(dataset)
    if (initial_len != new_len):
        print(f'Dropped {initial_len - new_len} row')

    # Deal with outliers
    dataset.drop(dataset[dataset['el_price'] > 1].index , inplace=True)
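
The two row filters can be seen in isolation on a toy frame. This is a sketch with invented values; only the `el_price` column name and the threshold of 1 come from the function above, and it uses a boolean mask instead of an in-place `drop`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing price and one price above the threshold.
toy = pd.DataFrame({
    "el_price": [0.10, 0.25, np.nan, 4.00],
    "temp": [5.0, 6.0, 7.0, 8.0],
})

cleaned = toy.dropna()                       # removes the row with the NaN price
cleaned = cleaned[cleaned["el_price"] <= 1]  # keeps only rows at or below the threshold

print(f"Kept {len(cleaned)} of {len(toy)} rows")
```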

### Normalize

Here we scale the numeric values to the range 0 to 1 with MinMaxScaler.

In [9]:
def normalize(dataset, scaler = None) -> (pd.DataFrame, preprocessing.MinMaxScaler):
    if scaler:
        dataset_scaled = scaler.transform(dataset)
        return (dataset_scaled, scaler)
    scaler = preprocessing.MinMaxScaler()
    dataset_scaled = scaler.fit_transform(dataset)
    return (dataset_scaled, scaler)
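
A point worth keeping in mind when reusing the scaler: it maps the training minimum and maximum to 0 and 1, so test values outside the training range land outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min -> 0, max -> 1
test_scaled = scaler.transform(test)        # uses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The second test value is twice the training maximum, so it scales to 2.0 rather than being clipped.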

### PCA

As OneHotEncoder introduced a lot of new features, we decided to apply PCA, keeping just enough components to explain 90% of the variance.

In [10]:
def reduce_dimensions(dataset, pca = None) -> (pd.DataFrame, decomposition.PCA):
    if pca:
        dataset_reduced = pca.transform(dataset)
        return (dataset_reduced, pca)
    pca = decomposition.PCA(n_components=0.9)
    dataset_reduced = pca.fit_transform(dataset)
    return (dataset_reduced, pca)
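
When `n_components` is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on synthetic data; the feature scales are arbitrary and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with very different variances.
scales = np.array([10, 5, 2, 1, 1, 0.5, 0.5, 0.1, 0.1, 0.1])
X = rng.normal(size=(200, 10)) * scales

pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

kept = pca.n_components_                       # number of components actually kept
explained = pca.explained_variance_ratio_.sum()
print(kept, round(explained, 3))
```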

### Preprocess

Here we combine all the previous methods into one.
Because the fitted encoder must be reused on the test set, the method returns it.

In [11]:
def preprocess(dataset, encoder=None) -> preprocessing.OneHotEncoder:
    extract_features(dataset)
    drop_features(dataset)
    encoder = one_hot_encode(dataset, ['coco', 'weekday'], encoder)
    drop_rows(dataset)
    return encoder


### Import dataset

Here we import the dataset, do the initial processing, and split it into train and validation sets. As we predict consumption, we extract this column into a separate series.

In [12]:
def read_dataset(file_name) -> pd.DataFrame:
    return pd.read_csv(file_name)

In [13]:
def extract_labels(dataset) -> (pd.DataFrame, pd.Series):
    X_train = dataset.loc[:, ~dataset.columns.isin(['consumption'])]
    y_train = dataset['consumption']
    return (X_train, y_train)

In [14]:
train_df = read_dataset('train.csv')
encoder = preprocess(train_df)


X_train, y_train = extract_labels(train_df)

X_train_norm, scaler = normalize(X_train)
X_train_reduced, pca = reduce_dimensions(X_train_norm)


Dropped 2 row


In [15]:
X_train_norm.shape

(8588, 41)

In [16]:
# Here we use all the previous encoders and scalers on test set.
X_test = read_dataset('test.csv')
preprocess(X_test, encoder)

X_test_norm, _ = normalize(X_test, scaler)
print(X_test_norm.shape)
X_test_reduced, _ = reduce_dimensions(X_test_norm, pca)

(168, 41)


In [17]:

X_train, X_val, y_train, y_val = train_test_split(X_train_reduced, y_train, test_size=0.2)

In [18]:
X_train.shape

(6870, 16)
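
The split above is random, so the validation score changes from run to run. A sketch of a reproducible split follows, using placeholder arrays with the same shape as the reduced training data; the seed value 42 is an arbitrary assumption and not something used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the reduced training data (8588 rows, 16 components).
X = np.zeros((8588, 16))
y = np.arange(8588)

# Assumption: a fixed random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, y_tr_again, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_va.shape)
```

With `test_size=0.2`, the 8588 rows split into 6870 training rows and 1718 validation rows, which matches the shape printed above.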

### RandomForestRegressor

We train a RandomForestRegressor model on the training data and validate its output on the validation set.

In [19]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)


RandomForestRegressor()
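
The model above uses the default hyperparameters and no fixed seed. The following sketch, on synthetic data, shows how `random_state` makes training repeatable and how `feature_importances_` can be inspected afterwards. The data, `n_estimators=50`, and the seed are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic regression problem: only the first feature drives the target.
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Two forests with the same seed produce identical predictions.
model_a = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
model_b = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
same = np.array_equal(model_a.predict(X), model_b.predict(X))

# Importances sum to 1; the informative feature should dominate.
importances = model_a.feature_importances_
print(same, importances.round(3))
```

Note that after PCA the inputs are principal components, so importances would refer to components rather than to the original columns.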

In [20]:
y_val_predicted = rfr.predict(X_val)
# We validate the model performance (scikit-learn expects y_true first, then y_pred).
mean_absolute_error(y_val, y_val_predicted)

0.5509063562281723
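
MAE is the mean of the absolute differences between true and predicted values. The sketch below computes it by hand and with scikit-learn on three made-up points, and also scores a constant "predict the mean" baseline, which is a common reference point for judging a number like the one above. The baseline comparison is a suggestion, not something done in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values for illustration.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 5.0])

mae_sklearn = mean_absolute_error(y_true, y_pred)
mae_manual = np.mean(np.abs(y_true - y_pred))  # (1 + 0 + 2) / 3

# Baseline: always predict the mean of the true values.
baseline_pred = np.full_like(y_true, y_true.mean())
mae_baseline = mean_absolute_error(y_true, baseline_pred)

print(mae_sklearn, mae_manual, mae_baseline)
```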

### Prediction 

We predict on the test dataset and write the output to a CSV file.

In [21]:
prediction = rfr.predict(X_test_reduced)
display(prediction)

array([0.3966 , 0.63415, 0.28624, 1.15627, 0.58738, 0.48478, 0.7549 ,
       0.65639, 0.75048, 1.7137 , 0.80325, 0.82963, 0.93394, 0.64838,
       0.62494, 0.65115, 1.08123, 1.01876, 0.95316, 0.94253, 0.87391,
       1.24036, 0.71869, 0.63841, 0.77301, 0.65657, 0.66852, 0.70384,
       0.5559 , 0.62269, 0.38614, 0.77853, 0.76771, 0.72248, 0.58045,
       0.70981, 0.69656, 0.81392, 0.6975 , 0.86664, 0.73248, 0.79423,
       0.78125, 0.70698, 0.85319, 0.69499, 0.59707, 1.22386, 0.99267,
       0.37775, 0.58194, 0.62324, 0.90012, 0.8435 , 0.88093, 0.83026,
       0.90277, 0.57675, 1.24543, 2.09502, 0.69632, 0.70843, 0.67123,
       0.83993, 0.67182, 0.5946 , 0.74624, 0.65486, 0.56741, 0.58268,
       0.72901, 0.67045, 0.74364, 0.46168, 0.3747 , 0.76511, 0.56798,
       0.48638, 0.35104, 0.49472, 0.92808, 1.02863, 1.04069, 0.83119,
       0.6467 , 0.75924, 0.75504, 1.37861, 0.9918 , 0.75365, 0.50903,
       0.52133, 0.52613, 0.65696, 0.48495, 0.32465, 0.50681, 0.49564,
       0.74685, 0.70

In [22]:
# We must reload the test dataframe, as we dropped the time column.
X_test = read_dataset('test.csv')

predictions_dict = {'time':X_test.time,'consumption':prediction}
pred_df = pd.DataFrame(predictions_dict)
pred_df.to_csv('submission_RFR_V1.csv',index=False)
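
Before uploading, it can help to confirm that the file has one prediction per test row and the expected two columns. This is a sketch using an in-memory buffer and invented values rather than the real submission file; the column names follow the cell above.

```python
import io
import pandas as pd

# Hypothetical timestamps and predictions standing in for the real ones.
times = ["2022-08-25 00:00:00+03:00", "2022-08-25 01:00:00+03:00"]
predictions = [0.40, 0.63]

# Fail early if rows were dropped during preprocessing.
assert len(times) == len(predictions), "prediction count must match test rows"

pred_df = pd.DataFrame({"time": times, "consumption": predictions})

# Write to a buffer instead of disk, then read it back to inspect the result.
buffer = io.StringIO()
pred_df.to_csv(buffer, index=False)
buffer.seek(0)
check_df = pd.read_csv(buffer)

print(check_df)
```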