# RandomForest

RandomForest is a tree-based bagging (bootstrap aggregating) algorithm in which a number of individual learners (decision trees) are combined into a more powerful prediction model. For every individual learner, a random sample of rows and a few randomly chosen variables are used to build a decision tree model. The final prediction is a function of all the predictions made by the individual learners. For a regression problem, the final prediction is typically the mean of all the predictions.
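
To make the averaging step concrete, here is a minimal sketch on synthetic data (the `make_regression` dataset and the `*_demo` names are assumptions for illustration, not part of this notebook's data). It checks that averaging the predictions of the individual fitted trees gives the same values as the forest's own `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should match the forest's prediction
manual_mean = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo)
print(np.allclose(manual_mean, forest_pred))
```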

In [21]:
#loading packages
import pandas as pd
import numpy as np #for mathematical calculations
import seaborn as sns
import math
import matplotlib.pyplot as plt #for plotting graphs
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from datetime import datetime #to access datetime
from pandas import Series #to work on series
from pathlib import Path #to create path to directories and files
%matplotlib inline
import warnings #to ignore the warnings
warnings.filterwarnings('ignore')

In [2]:
#https://pbpython.com/notebook-process.html
today = datetime.today()
train_original = Path.cwd() /'data'/'raw'/'Train_File.csv'
test_original = Path.cwd() /'data'/'raw'/'Test_File.csv'
summary_file_train = Path.cwd() /'data'/'processed'/f'summary_train{today:%b-%d-%Y}.pkl'
summary_file_test = Path.cwd() /'data'/'processed'/f'summary_test{today:%b-%d-%Y}.pkl'

In [3]:
#reading data
train = pd.read_pickle(summary_file_train)
test = pd.read_pickle(summary_file_test)

In [4]:
train = train.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)
test = test.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)

In [5]:
X = train.drop('Item_Outlet_Sales', axis=1)
y = train['Item_Outlet_Sales']

In [6]:
#importing the helper used to split off a validation set
from sklearn.model_selection import train_test_split

In [8]:
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)
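
The split above does not set `random_state`, so the rows that end up in the validation set can change from run to run. The sketch below, on toy arrays (the `*_demo` names and the seed `42` are assumptions for illustration), shows the 70/30 sizes produced by `test_size=0.3` and that fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 columns (illustration only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same seed twice: the two splits should be identical
a_train, a_val, b_train, b_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
a_train2, a_val2, b_train2, b_val2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print(a_train.shape, a_val.shape)
print(np.array_equal(a_val, a_val2))
```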

## Feature Engineering & Building Model

In [10]:
'''To summarize, we will scale our data, then create polynomial features, 
and then train a random forest regressor.
(Pipeline approach adapted from https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2)'''

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
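
As a small illustration of what the polynomial step does, the sketch below expands a single toy row with two features (the values `a=2` and `b=3` are made up for this example). With `degree=2`, the output holds the bias term, the original features, and all degree-2 products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features, a=2 and b=3 (illustration only)
row = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(row)

# Column order: 1, a, b, a^2, a*b, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

Two input columns become six, so the number of features fed to the model grows quickly as the original column count increases.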

In [23]:
steps = [
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', RandomForestRegressor(n_estimators=400, random_state=0))
]

In [24]:
random_pipe = Pipeline(steps)

In [25]:
random_pipe.fit(x_train, y_train)

#score() returns R^2; the 'Test Score' below is measured on the hold-out split (x_cv, y_cv)
print('Training Score: {}'.format(random_pipe.score(x_train, y_train)))
print('Test Score: {}'.format(random_pipe.score(x_cv, y_cv)))

Training Score: 0.9396580059037567
Test Score: 0.5875417102042929


In [26]:
#predicting on the hold-out (validation) split
pred_cv = random_pipe.predict(x_cv)

#calculating rmse
mse = mean_squared_error(y_cv, pred_cv)
rmse = math.sqrt(mse)

print('RMSE: {}'.format(rmse))

RMSE: 1083.857669069033
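
For reference, the RMSE calculation can be checked by hand on a tiny example. The two target values and predictions below are made up for illustration; the sketch compares a direct NumPy computation with the `mean_squared_error` route used above.

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (illustration only)
y_true = np.array([10.0, 20.0])
y_pred = np.array([11.0, 13.0])

# Errors are -1 and 7; squared errors are 1 and 49; their mean is 25
rmse_manual = math.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)  # 5.0 5.0
```

RMSE is expressed in the same units as the target, so the value reported above is in units of `Item_Outlet_Sales`.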
