<a href="https://colab.research.google.com/github/Pushkarp26/Machine-Learning-of-energy-use-of-appliances-in-alow-energy-house/blob/main/Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from google.colab import files
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

# **Data**
The predictor and target datasets were saved in the previous notebooks,
so they are loaded directly here instead of repeating the preliminary steps.

In [None]:
X = pd.read_csv("predictor.csv")
y = pd.read_csv("target.csv")
X.drop("Unnamed: 0",axis=1,inplace=True)
y.drop("Unnamed: 0",axis=1,inplace=True)

print("Predictor features:\n {}\n\nTarget Features:\n {}".format(X.head(),y.head()))

Predictor features:
    Kitchen_Temp  Kitchen_Humidity  Living_room_Temp  ...        rv2  Weekday    NSM
0         19.89         47.596667              19.2  ...  13.275433        0  61200
1         19.89         46.693333              19.2  ...  18.606195        0  61800
2         19.89         46.300000              19.2  ...  28.642668        0  62400
3         19.89         46.066667              19.2  ...  45.410389        0  63000
4         19.89         46.333333              19.2  ...  10.084097        0  63600

[5 rows x 28 columns]

Target Features:
    Total_Energy_Consumption
0                        90
1                        90
2                        80
3                        90
4                       100


Splitting the data into Training and Testing sets.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

y_array = np.asarray(y)
y_array = y_array.ravel()

X_train,X_test,y_train,y_test = train_test_split(X,y_array,test_size=0.3)
print("X_train:",X_train.shape,"\t","y_train:",y_train.shape,"\n",'X_test:',
       X_test.shape,"\t","y_test:",y_test.shape)

X_train: (13814, 28) 	 y_train: (13814,) 
 X_test: (5921, 28) 	 y_test: (5921,)
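The split above does not fix a seed, so the rows that land in each set, and every score that follows, change from run to run. A minimal sketch of a reproducible 70/30 split, assuming synthetic stand-in data (`X_demo`, `y_demo`) and an arbitrary seed of 42, neither of which comes from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Passing random_state makes the shuffle deterministic,
# so two calls with the same seed return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train:", X_tr.shape, "test:", X_te.shape)
print("identical splits:", np.array_equal(split_a[0], split_b[0]))
```

With a fixed seed, differences between later runs can be attributed to the model rather than to the split.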


Random Forest Regression modelling


In [None]:
rfc = RandomForestRegressor()
rfc.fit(X_train,y_train)                                                         #training the model

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

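Checking the score of our model. For a regressor, `score` returns the coefficient of determination (R²), not a classification accuracy, so the values labelled "Accuracy Score" below are R² values.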

In [None]:
print("Accuracy Score for Train data: {}\n Accuracy Score for Test data: {}".
      format(rfc.score(X_train,y_train),rfc.score(X_test,y_test)))

Accuracy Score for Train data: 0.9409394524045103
 Accuracy Score for Test data: 0.5867620131030638
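The value returned by `score` on a regressor is R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A small worked sketch follows; the true values are the first five targets shown earlier, while the predictions are made up for illustration (assumption):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([90.0, 90.0, 80.0, 90.0, 100.0])  # first five targets from the data preview
y_hat = np.array([85.0, 95.0, 80.0, 100.0, 90.0])   # made-up predictions (assumption)

ss_res = np.sum((y_true - y_hat) ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print("manual R2: ", r2_manual)
print("sklearn R2:", r2_score(y_true, y_hat))
```

Here the residuals are larger than the spread of the targets, so R² comes out negative: unlike an accuracy, R² is not bounded below by 0.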


Evaluating errors

In [None]:
from sklearn import metrics
pred = rfc.predict(X_test)
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, pred))   #mean absolute error
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, pred))     #mean squared error
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(
    y_test, pred)))                                                              #Root Mean Squared Error 
# mape = np.mean(np.abs((y_test - pred) / np.abs(y_test)))
# print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
# print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 32.17968248606655
Mean Squared Error (MSE): 4377.442325620673
Root Mean Squared Error (RMSE): 66.16224244703827
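For reference, the error metrics above reduce to a few lines of NumPy. The sketch below uses the first five targets from the data preview with made-up predictions (assumption), and includes the MAPE that is commented out in the cell above; MAPE is only meaningful when no target value is zero.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([90.0, 90.0, 80.0, 90.0, 100.0])  # first five targets from the data preview
y_hat = np.array([85.0, 95.0, 80.0, 100.0, 90.0])   # made-up predictions (assumption)

errors = y_true - y_hat
mae = np.mean(np.abs(errors))             # mean absolute error
mse = np.mean(errors ** 2)                # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error, same units as the target
mape = np.mean(np.abs(errors / y_true))   # mean absolute percentage error (as a fraction)

print("MAE:", mae, "| sklearn:", metrics.mean_absolute_error(y_true, y_hat))
print("MSE:", mse, "| sklearn:", metrics.mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("MAPE (%):", round(mape * 100, 2))
```

Because squaring weights large residuals more heavily, an RMSE well above the MAE, as in the output above, suggests that a few large errors dominate.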


Cross Validation of the model


In [None]:
cv = cross_validate(rfc, X_test, y_test, cv=10)
print(cv['test_score'])

[0.44304919 0.50878972 0.44661233 0.4136764  0.43701571 0.46374364
 0.39745992 0.43629888 0.29964489 0.40147436]


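The training R² (0.94) is far higher than the test R² (0.59), so the model is overfitting the training data. Note that the cell above cross-validates on the test split, so each fold refits the model on only about 5,300 samples, which partly explains why the fold scores (0.30 to 0.51) are lower than the held-out test score.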
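A more common way to diagnose overfitting is to cross-validate on the training split and compare the per-fold training and validation scores, keeping the test split untouched. A minimal sketch under stated assumptions: the data is synthetic (`make_regression`), and the forest size, fold count, and seeds are illustrative choices rather than the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in data (assumption), with noise so a perfect fit is impossible.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# Cross-validate on the training split only; return_train_score=True
# also reports how well each fold's model fits its own training rows.
cv_demo = cross_validate(model, X_tr, y_tr, cv=5, return_train_score=True)
mean_train = cv_demo["train_score"].mean()
mean_val = cv_demo["test_score"].mean()

print("mean train R2:     ", round(mean_train, 3))
print("mean validation R2:", round(mean_val, 3))
print("gap:               ", round(mean_train - mean_val, 3))
```

A large gap between the two means points to overfitting, and the test split stays available for a single final evaluation.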

Performing GridSearch for best parameters.


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = { 
    'n_estimators': [50,400],                                                    #two candidate tree counts, 50 and 400 (a list of values, not a range)
    'max_features': ['auto', 'sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
print(CV_rfc.best_params_)

{'max_features': 'sqrt', 'n_estimators': 400}
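A list such as `[50, 400]` gives the grid search exactly two tree counts to try, not every value in between. The sketch below counts the candidates with `ParameterGrid`, first for the grid as written and then for a hypothetical grid that steps from 50 to 400 in increments of 50 (the step size is an assumption). Note that recent scikit-learn releases deprecated `'auto'` for `max_features`; it is only enumerated here, never fitted.

```python
from sklearn.model_selection import ParameterGrid

# The grid as written in the cell above: 2 tree counts x 3 feature settings.
grid_as_written = {
    "n_estimators": [50, 400],
    "max_features": ["auto", "sqrt", "log2"],
}

# Hypothetical alternative (assumption): an actual range in steps of 50.
grid_as_range = {
    "n_estimators": list(range(50, 401, 50)),
    "max_features": ["auto", "sqrt", "log2"],
}

cv_folds = 10  # matches cv=10 in the grid search above
n_written = len(ParameterGrid(grid_as_written))
n_range = len(ParameterGrid(grid_as_range))

print("as written:", n_written, "candidates,", n_written * cv_folds, "fits")
print("as a range:", n_range, "candidates,", n_range * cv_folds, "fits")
```

Each candidate is fitted once per fold, so widening the grid multiplies the training cost accordingly.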


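We now have the best parameters for our model: `max_features='sqrt'` and `n_estimators=400`. Let's refit on the training dataset. Note that the cell below sets only `n_estimators=400` and leaves `max_features` at its default, so it applies only part of the selected configuration.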

In [None]:
rfc1 = RandomForestRegressor(n_estimators=400)
rfc1.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=400, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
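To carry every selected hyperparameter into the refit, the `best_params_` dictionary can be unpacked into the constructor, or the already refitted `best_estimator_` can be used directly. A minimal sketch on synthetic data with a deliberately tiny grid; the data, grid values, and seeds are all assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

# Tiny illustrative grid (assumption), far smaller than the one in the notebook.
small_grid = {"n_estimators": [10, 20], "max_features": ["sqrt", 1.0]}
search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=3)
search.fit(X_demo, y_demo)

# Option 1: unpack all selected parameters into a fresh model.
tuned = RandomForestRegressor(random_state=0, **search.best_params_)
tuned.fit(X_demo, y_demo)

# Option 2: reuse the model GridSearchCV already refitted on the full data
# (refit=True is the default).
best = search.best_estimator_

print("selected:", search.best_params_)
print("tuned max_features:", tuned.get_params()["max_features"])
```

Unpacking the dictionary avoids copying values by hand, which is where a parameter such as `max_features` can be dropped unnoticed.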

Model prediction on test set.

In [None]:
pred2 = rfc1.predict(X_test)
rfc1.score(X_test,y_test)

0.586569152196529

Cross-validating the refitted model again to compare its scores with the first run.

In [None]:
cv1 = cross_validate(rfc1, X_test, y_test, cv=10)
print(cv1['test_score'])

[0.44708764 0.50172681 0.46149811 0.40847528 0.43001645 0.46062774
 0.3960033  0.44558116 0.30147128 0.4191446 ]
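The two cross-validation runs are easier to compare as summary statistics than as raw arrays. A short sketch using the fold scores printed in this notebook, copied by hand:

```python
import numpy as np

# Fold scores from the first run (100 trees), copied from the output above.
cv_first = np.array([0.44304919, 0.50878972, 0.44661233, 0.4136764, 0.43701571,
                     0.46374364, 0.39745992, 0.43629888, 0.29964489, 0.40147436])

# Fold scores from the second run (400 trees), copied from the output above.
cv_second = np.array([0.44708764, 0.50172681, 0.46149811, 0.40847528, 0.43001645,
                      0.46062774, 0.3960033, 0.44558116, 0.30147128, 0.4191446])

for name, scores in [("100 trees", cv_first), ("400 trees", cv_second)]:
    print(f"{name}: mean R2 = {scores.mean():.4f}, std = {scores.std():.4f}")

mean_diff = cv_second.mean() - cv_first.mean()
print(f"difference in mean R2: {mean_diff:.4f}")
```

The difference in means is small compared with the spread across folds, so increasing the number of trees from 100 to 400 did not materially change the cross-validated score.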
