# Random Forest Regressor

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import export_graphviz, plot_tree

### Importing data

The training and test sets are loaded and split into features (`X`) and target (`y`). The `date` and `count` columns are dropped from the features.

In [None]:
df_train = pd.read_pickle(r"../input/train.pkl")
X_train = df_train.drop(["date", "count"], axis=1)
y_train = df_train["count"]
df_train.head()

In [None]:
df_test = pd.read_pickle(r"../input/test.pkl")
X_test = df_test.drop(["date", "count"], axis=1)
y_test = df_test["count"]
df_test.head()

### Hyperparameter tuning using Grid Search

_GridSearchCV_ is used to tune the hyperparameters.
The hyperparameters tested are:
- _min\_samples\_leaf_
- _n\_estimators_

After tuning, _GridSearchCV_ refits the best model on the full training set.

In [None]:
parameters = {"min_samples_leaf" : list(range(1, 11)),
             "n_estimators" : list(range(80, 241, 20))}
clf = GridSearchCV(RandomForestRegressor(), parameters, n_jobs=4, scoring="neg_mean_squared_error")
clf.fit(X_train, y_train)

In [None]:
clf.best_params_
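
To get a feeling for the cost of the search, the grid above can be counted before fitting. The sketch below assumes the default of 5 cross-validation folds, which applies when `cv` is not passed to _GridSearchCV_ (scikit-learn 0.22 and later); `n_folds` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook
parameters = {"min_samples_leaf": list(range(1, 11)),
              "n_estimators": list(range(80, 241, 20))}

n_candidates = len(ParameterGrid(parameters))  # every combination of the two lists
n_folds = 5                                    # assumed: GridSearchCV default when cv is not set
n_fits = n_candidates * n_folds                # fits during the search, excluding the final refit

print(n_candidates, n_fits)
```

Each of these fits trains a complete forest, so widening either range increases the run time quickly.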

The performance of the model on the test set is measured with the RMSE (root mean squared error).

In [None]:
# clf.score returns the negative MSE here, because scoring="neg_mean_squared_error"
clf.score(X_test, y_test)
y_pred = clf.predict(X_test)
mean_squared_error(y_test, y_pred)**0.5  # RMSE
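
As a small worked example of how the RMSE relates to the negative MSE used for scoring during the grid search, the sketch below uses made-up numbers (`y_true` and `y_hat` are illustrative and not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, only for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = mean_squared_error(y_true, y_hat)  # mean of the squared errors
rmse = np.sqrt(mse)                      # same unit as the target
neg_mse = -mse                           # the value a "neg_mean_squared_error" scorer reports

# The RMSE can be recovered from the scorer's value
rmse_from_score = np.sqrt(-neg_mse)

print(mse, rmse, rmse_from_score)
```

The sign is flipped in the scorer because _GridSearchCV_ always maximises the score, while a smaller error is better.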

### Training regressor

The best Random Forest Regressor from the grid search is retrained on the training and test data combined.

In [None]:
regressor = clf.best_estimator_
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
regressor.fit(pd.concat([X_test, X_train]), pd.concat([y_test, y_train]))

#### Testing regressor

Testing the regressor on the test set. Because the regressor has now also been trained on this set, the RMSE can be much better than before and no longer reflects performance on unseen data.

In [None]:
regressor.score(X_test, y_test)  # R^2, the default score of a regressor

In [None]:
y_pred = regressor.predict(X_test)
mean_squared_error(y_test, y_pred)**0.5
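
The effect of evaluating on rows that were also used for training can be sketched on synthetic data. Everything in the block below (the data, the split sizes, the names `X_held` and `rmse_seen`) is an assumption for illustration and is not taken from the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data: one informative feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = 10 * X[:, 0] + rng.normal(scale=2.0, size=300)

X_fit, y_fit = X[:200], y[:200]
X_held, y_held = X[200:], y[200:]

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Case 1: the held-out rows are unseen during fitting
model.fit(X_fit, y_fit)
rmse_unseen = mean_squared_error(y_held, model.predict(X_held)) ** 0.5

# Case 2: the held-out rows are part of the fitting data
model.fit(X, y)
rmse_seen = mean_squared_error(y_held, model.predict(X_held)) ** 0.5

print(f"RMSE on unseen rows: {rmse_unseen:.2f}")
print(f"RMSE on rows seen in training: {rmse_seen:.2f}")
```

The second number is lower only because the forest has already seen those rows, so the earlier test-set RMSE remains the better estimate of performance on new data.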

### Plotting a decision tree

To gain more insight into the behaviour of the model, one of the decision trees in the forest is plotted.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
# Plot the first tree of the forest; feature_names is passed as a plain list
plot_tree(regressor.estimators_[0],
          feature_names=list(X_train.columns),
          filled=True,
          rounded=True,
          ax=axes)

plt.show()

Weekend is the most important feature in the tree, followed by mean temperature and days from epoch.
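
A single tree shows only part of the picture. As a possible complement, the impurity-based importances averaged over all trees can be read from the `feature_importances_` attribute of the fitted forest. The sketch below uses synthetic data with hypothetical column names (`strong`, `weak`, `noise`), not the features of this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names, only for illustration
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "strong": rng.uniform(size=200),
    "weak": rng.uniform(size=200),
    "noise": rng.uniform(size=200),
})
y = 10 * X["strong"] + 1 * X["weak"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One importance per feature, averaged over all trees; the values sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook the same attribute would be available as `regressor.feature_importances_`, indexed by `X_train.columns`.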

### Validating regressor

The regressor predicts the values for the dates in validation.pkl. These predictions are submitted to the Kaggle competition.

In [None]:
df_validation = pd.read_pickle(r"../input/validation.pkl")
df_validation.head()

In [None]:
X_validate = df_validation.drop(["date", "Predicted"], axis=1)

In [None]:
y_validate = regressor.predict(X_validate)
df_validation["Predicted"] = y_validate
df_validation.head()

To give an overview of all the predictions the model has made, they are plotted in a single graph together with the real values of the test set.

In [None]:
df_test["count"].plot(figsize=(14,7), label="real value")
df_validation["Predicted"].plot()

plt.legend()
plt.show()

### Writing validation data to .csv file

All predictions are written to a CSV file in the required format: an `id` column with the date as `YYYYMMDD` and a `Predicted` column.

In [None]:
df_validation.rename(columns= {"date" : "id"}, inplace=True)
df_validation["id"] = df_validation["id"].dt.strftime("%Y%m%d")
df_validation[["id", "Predicted"]].to_csv("../output/RFRval.csv", index=False)
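
To check the resulting file layout without touching the disk, the same steps can be run on a tiny made-up frame and written to an in-memory buffer. The dates and values below are invented for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for df_validation
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Predicted": [120.5, 98.0],
})

# Same steps as in the notebook cell
df = df.rename(columns={"date": "id"})
df["id"] = df["id"].dt.strftime("%Y%m%d")

# Write to a string buffer instead of a file
buffer = io.StringIO()
df[["id", "Predicted"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```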