<a href="https://colab.research.google.com/github/Rahulchauhan1612/Bike-Sharing-Demand-Prediction/blob/main/Copy_of_Bike_Sharing_Demand_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

# Abstract:

Bike sharing is a transport service that lends conventional or electric bikes to an individual or a group so they can travel within a city or its outskirts, renting by the hour, the day or the month depending on their needs.

The global bike-sharing market was valued at around 3.39 billion dollars in 2019 and is projected to grow to 6.98 billion dollars by 2027, with a compound annual growth rate of around 14% from 2020 to 2027.

Several factors, such as low rental prices, increased capital investment, the introduction of e-bikes, technological advancement and government schemes for developing bike-sharing infrastructure, have increased the overall market size and are expected to create further opportunities during the forecast period. However, a rise in bike theft and the large initial investment required are key factors that could hinder the expected market growth.

*Keywords: Bike-Sharing, Data Mining, Predictive Analysis, Linear Regression, Machine Learning.*

# **Introduction:**

Demand for bike-sharing systems is increasing globally. These systems have gained a lot of attention because they are cost-effective and easy to use, and they have already attracted a large customer base in places such as South Korea, São Paulo, China and Australia. A bike-sharing system generally rents bikes by the hour, day or month, usually at a static price for the chosen period. Because renting is affordable and simple, anyone can pick up a bike and commute on arrival. Our main aim in this project is to build a predictive model that estimates the number of bikes rented from the given dataset.

## <b> Problem Description </b>

### Currently, rental bikes are being introduced in many urban cities to enhance mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to keep the supply of rental bikes stable.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

In [1]:
# Importing required modules and loading dataset
import numpy as np
import pandas as pd
from datetime import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid",{'grid.linestyle': '--'})


from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv("/content/drive/MyDrive/Supervised ML - Regression/SeoulBikeData.csv",encoding= 'unicode_escape')

df.head(10)

In [None]:
# Getting shape of the data
df.shape

In [None]:
# Getting details about all the features present in the dataset
df.info()

Here we can see that the dataset contains ***6 float, 4 int and 4 object (str) columns***.

There are ***8760 rows*** and ***14 columns*** (features), and every column has **8760 non-null values**, so nothing is missing.

These counts describe the raw data; they will change after feature engineering (feature creation, feature selection, etc.).
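
The dtype and null counts read off `df.info()` above can also be computed directly. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the real file, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Toy frame standing in for the bike data (hypothetical values).
toy = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017"],
    "Hour": [0, 1],
    "Temperature": [-5.2, -5.5],
    "Seasons": ["Winter", "Winter"],
})

# Number of columns per dtype, the same breakdown df.info() summarises.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Total number of missing values across the whole frame.
n_missing = int(toy.isna().sum().sum())
print("missing values:", n_missing)
```

Running the same two calls on `df` would give the counts quoted in this section.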

## **# Feature Description :-**

1. Date: stored as str in DD/MM/YYYY form; it needs to be converted to Datetime format.
2. Hour: hour of the day in 24-hour format for which the rented bike count is recorded; int type.
3. Temperature(°C): air temperature in degrees Celsius; float type.
4. Humidity(%): humidity of the air in %; int type.
5. Wind speed (m/s): wind speed in m/s; float type.
6. Visibility (10m): visibility in units of 10 m; int type.
7. Dew point temperature(°C): the temperature to which air must be cooled to become saturated with water vapour, in °C; float type.
8. Solar Radiation (MJ/m2): solar radiation in MJ/m2; float type.
9. Rainfall(mm): rainfall in mm, where 1 mm of rainfall equals 1 litre of water per square metre; float type.
10. Snowfall (cm): snowfall in cm; float type.
11. Seasons: the season of the observation (four seasons are present in the data); str type.
12. Holiday: whether the day is a holiday or not; str type.
13. Functioning Day: whether the day is a functioning day or not; str type.

# **#Processing the dataset :-**



In [None]:
# Checking for duplicated values
len(df[df.duplicated()])

In [None]:
# Checking for total null values
df.isna().sum()

**No duplicate records are found in the dataset. Now we can proceed further.**

**Breaking down Date column into 3 columns, namely Day, Month, Year.**

In [None]:
# Parsing the Date strings (DD/MM/YYYY) into Datetime format so that day, month and year can be retrieved
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y")
df['Day'] = df['Date'].dt.day_name()
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

In [None]:
df

In [None]:
# Lets add a new column named Weekend with binary values, indicating 1 for weekend and 0 for a weekday

df['Weekend']=df['Day'].apply(lambda x : 1 if x=='Saturday' or x=='Sunday' else 0 )

In [None]:
df.head()

In [None]:
# Dropping the columns that are no longer needed
df=df.drop(columns=['Date','Day','Year'])

In [None]:
# Counting the values of Functioning Day using value_counts
df['Functioning Day'].value_counts()


### **#Exploratory Data Analysis :-**

In [None]:
# Visualisation of number of rented bikes vs Months:

fig,ax=plt.subplots(figsize=(18,8))
# Each row is one hour, so the bar height is the average hourly count for that month
sns.barplot(data=df,x='Month',y='Rented Bike Count',color = 'brown' , ci=None, ax=ax)
ax.set_title('Average Rented bikes per hour Vs Month' , fontsize=18)
ax.set_xlabel('Months',fontsize=15)
ax.set_ylabel('Average Rented bikes per hour',fontsize=15)

In [None]:
# Visualisation of average number of rented bikes vs Weekday or Weekend:

fig,ax=plt.subplots(figsize=(5,8))
sns.barplot(data=df,x='Weekend',y='Rented Bike Count',ax=ax,ci=None , color ='brown')
ax.set_title('Rented bikes Vs Weekend ' , fontsize=18)
ax.set_xlabel('Weekend',fontsize=15)
ax.set_ylabel('Average Rented Bikes per hour',fontsize=15)

In [None]:
# Visualisation of Rented bikes vs Hour of the Day:

fig,ax=plt.subplots(figsize=(20,10))
sns.barplot(data=df,x='Hour',y='Rented Bike Count',ci= None ,  color ='brown', ax=ax)
ax.set_title('Average Rented Bikes Vs Hour', fontsize=18)
ax.set_xlabel('Hour',fontsize=15)
ax.set_ylabel('Average Rented Bikes',fontsize=15)

In [None]:
# Visualisation of Average Rented bikes vs Hour of the Day by Weekend or Weekday:

fig,ax=plt.subplots(figsize=(15,8))
sns.pointplot(data=df,x='Hour',y='Rented Bike Count',hue='Weekend',ci= None, color ='green', ax=ax)
ax.set_title('Average Rented bikes Vs Hour according to Weekend or Weekday' , fontsize=18)
ax.set_xlabel('Hour',fontsize=20)
ax.set_ylabel('Average Rented Bikes',fontsize=15)

From these visualisations we can conclude the following:

1. In ***Average Bike Rented vs Hour*** we can clearly see that ***on weekdays*** the average number of bikes rented peaks at ***6:00 PM***, at around ***1550***, while the lowest average, around ***550 bikes***, is at ***00:00 (midnight)***.

2. ***On weekends*** the peak is at ***5:00 PM***, at around ***1150*** bikes, while the lowest average, around ***650 bikes***, is again at ***00:00 (midnight)***.
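
Peak and off-peak hours like those read off the plot can also be pulled out numerically with a `groupby`. The sketch below uses a tiny made-up frame (`toy`, hypothetical counts), so its output illustrates the technique only and does not reproduce the figures above.

```python
import pandas as pd

# Hypothetical hourly counts; the notebook would use df directly.
toy = pd.DataFrame({
    "Hour":    [0, 8, 18, 0, 8, 18],
    "Weekend": [0, 0, 0, 1, 1, 1],
    "Rented Bike Count": [500, 1000, 1500, 650, 400, 1100],
})

# Mean count for every (Weekend, Hour) pair: the values a point plot draws.
hourly = toy.groupby(["Weekend", "Hour"])["Rented Bike Count"].mean()

# Busiest and quietest hour for weekdays (0) and weekends (1).
for flag, label in [(0, "weekday"), (1, "weekend")]:
    series = hourly.loc[flag]
    print(label, "peak hour:", series.idxmax(), "quietest hour:", series.idxmin())
```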

In [None]:
# Visualisation of Average Bike rented vs Month:

avg_bike=df.groupby('Month')['Rented Bike Count'].mean()
plt.figure(figsize=(12,4))
s=avg_bike.plot(legend=True, marker='o',title='Average Bike rented vs Month',color='green')
# Place the ticks at the actual month numbers (1-12) so each label lines up with its point
s.set_xticks(avg_bike.index.tolist())
s.set_xticklabels(avg_bike.index.tolist(),rotation = 85)
plt.show()



*  In ***Average Bike Rented vs Month*** we can clearly see that the average number of bikes rented was highest in ***July***, at around ***1250***, and lowest in ***February***, at just around **200** bikes.




In [None]:
# Analysis of rented Bikes Vs Functioning Day:

fig,ax=plt.subplots(figsize=(6,8))
sns.barplot(data=df,x='Functioning Day',y='Rented Bike Count', ci=None,color = 'brown' )
ax.set_title('Rented bikes Vs Functioning Day ', fontsize=18)
ax.set_xlabel('Functioning Day',fontsize=15)
ax.set_ylabel('Average Rented Bikes',fontsize=15)



*   From this bar plot we can conclude that bikes are rented only on **functioning days**.




In [None]:
# Analysis of Rented Bikes Vs Seasons:

fig,ax=plt.subplots(figsize=(10,7))
sns.barplot(data=df,x='Seasons',y='Rented Bike Count', color ='brown', ci= None , estimator= sum)
ax.set_title('Rented bikes Vs Seasons ' , fontsize=18)
ax.set_xlabel('Seasons',fontsize=15)
ax.set_ylabel('Rented Bikes',fontsize=15)



*   From this bar plot we can see that the ***highest number of bikes*** were rented during the ***Summer season***, while the **least number of bikes** were rented during the ***Winter season***.




In [None]:
# Analysis of Rented Bikes Vs Holiday or not:

fig,ax=plt.subplots(figsize=(10,7))
sns.barplot(data=df,x='Holiday',y='Rented Bike Count', color ='brown', ci= None , estimator= sum)
ax.set_title('Rented bikes Vs Holiday ' , fontsize=18)
ax.set_xlabel('Holiday',fontsize=15)
ax.set_ylabel('Rented Bikes',fontsize=15)



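*   Here we can see that far more bikes were rented on non-holidays than on holidays. Note that this plot shows totals (`estimator=sum`) and there are far more non-holiday hours in the data, so part of the gap reflects the number of hours in each group rather than demand per hour.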
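
To separate "more hours" from "more demand per hour", the total and the per-hour mean can be put side by side. This is a minimal sketch on made-up numbers (`toy` is hypothetical), not the notebook's data.

```python
import pandas as pd

# Hypothetical data: many ordinary hours, few holiday hours.
toy = pd.DataFrame({
    "Holiday": ["No Holiday"] * 8 + ["Holiday"] * 2,
    "Rented Bike Count": [700, 720, 690, 710, 705, 695, 715, 700, 500, 520],
})

# Total, per-hour average and number of rows for each group.
summary = toy.groupby("Holiday")["Rented Bike Count"].agg(["sum", "mean", "count"])
print(summary)
```

In this toy case the totals differ by a factor of more than five while the per-hour means differ far less, which is why comparing the `mean` column alongside the `sum` is worthwhile.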




### **#Analyze numerical Variables :-** 


In [None]:
#Storing all the numeric features in a variable list

numeric_features =['Rented Bike Count', 'Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

In [None]:
#printing distribution plots to analyze the distribution of all numerical features
for col in numeric_features:
  plt.figure(figsize=(10,6))
  # histplot with a KDE curve replaces the deprecated sns.distplot
  sns.histplot(df[col], kde=True, stat='density', color ='green')
  plt.xlabel(col)
  plt.axvline(df[col].mean(),color='magenta', linestyle='dashed',linewidth=2)    # mean
  plt.axvline(df[col].median(),color='cyan', linestyle='dashed',linewidth=2)     # median
  plt.show()

1. In the density plot for ***Rented Bike Count***, the mean and median both lie between 500 and 1000 and the mean is slightly greater than the median, which means the distribution is ***positively skewed***.

2. In the density plot for ***Temperature***, the median is greater than the mean, so to some extent the distribution is ***negatively skewed***.

3. In the density plot for ***Humidity***, the mean is greater than the median, so to some extent the distribution is ***positively skewed***.

4. In the density plot for ***Wind speed***, the mean is greater than the median, so to some extent the distribution is ***positively skewed***.

5. In the density plot for ***Visibility***, the median is greater than the mean, so to some extent the distribution is ***negatively skewed***.

6. In the density plot for ***Dew point temperature***, the median is greater than the mean, so to some extent the distribution is ***negatively skewed***.

7. In the density plot for ***Solar Radiation***, the mean is greater than the median, so the distribution is ***positively skewed***.
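
The mean-versus-median reading used in the list above can be backed up with a numeric skewness value. The sketch below draws a synthetic right-skewed sample (an assumption for illustration, not the bike data) and checks both indicators with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample (exponential), standing in for a column
# such as the rented bike count.
sample = pd.Series(rng.exponential(scale=500, size=5000))

print("mean  :", sample.mean())
print("median:", sample.median())
print("skew  :", sample.skew())  # a value above 0 indicates a longer right tail
```

On the real frame, `df[numeric_features].skew()` would give one value per column to compare with the plots.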

## **#Feature Engineering :-**

### Regression plot:-

In [None]:
#printing the regression plot for all the numerical features:

for col in numeric_features[1:]:
  feature = df[col]
  label = df['Rented Bike Count']
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=df[col],y=df['Rented Bike Count'],scatter_kws={"color": 'pink'}, line_kws={"color": "black"})
  correlation = feature.corr(label)
  ax.set_title('Rented Bike Count vs ' + col + '- correlation: ' + str(correlation))

From the above regression plots we can conclude that:

*   ***'Rainfall', 'Snowfall' and 'Humidity'*** are ***negatively*** correlated with the dependent variable.
*   ***'Temperature', 'Wind speed', 'Visibility', 'Dew point temperature' and 'Solar Radiation'*** are **positively** correlated with the dependent variable.



In [None]:
# Check the correlation plot (numeric columns only; the str columns cannot be correlated):

corr = df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

Temperature and Dew point temperature have the highest correlation with each other. Since a Linear Regression model assumes there is no multicollinearity between the independent variables, we have to remove multicollinearity from this dataset.

## **#Checking VIF:-**

In [None]:
# Defining function for VIF:

def calc_vif(X):

    # Calculating the VIF of every column of X:
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif
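
For intuition about what a VIF measures, here is a minimal NumPy sketch of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all the other columns. The helper `vif_manual` and the synthetic columns are hypothetical and only illustrate the idea. Note that statsmodels' `variance_inflation_factor` regresses each column on the others exactly as passed, so its values can differ from this sketch unless a constant column is included.

```python
import numpy as np

def vif_manual(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept plus other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
temp = rng.normal(15, 10, 1000)
dew = temp - rng.normal(5, 1, 1000)   # nearly a shifted copy of temp
wind = rng.normal(2, 1, 1000)         # independent of both
vifs = vif_manual(np.column_stack([temp, dew, wind]))
print(vifs)
```

The two nearly collinear columns get large VIFs while the independent one stays close to 1, which mirrors the Temperature and Dew point temperature situation described in this section.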

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Day', 'Month', 'Year']]])

Since **Temperature has highest VIF** followed by Dew Point Temperature, we will check for VIF of features without Temperature.

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Day', 'Month', 'Year','Temperature(°C)']]])

After **dropping temperature**, VIF is in acceptable range, therefore we will drop temperature from our dataset.

In [None]:
df=df.drop(['Temperature(°C)'],axis=1)

In [None]:
# Checking correlation plot after dropping Temperature (numeric columns only):
corr = df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')


In [None]:
# Binary encoding of the two yes/no columns:

df['Functioning Day']=df['Functioning Day'].map({'Yes':1,'No':0})
df['Holiday']=df['Holiday'].map({'No Holiday':0,'Holiday':1})

In [None]:
df

In [None]:
# One Hot Encoding:

df_seasons=pd.get_dummies( df['Seasons'] )
df_month=pd.get_dummies( df['Month'] , prefix = 'Month')
df_hour=pd.get_dummies( df['Hour'] ,prefix = 'Hour' )

In [None]:
# Join one hot encoded columns:

df=df.join([df_seasons,df_month,df_hour])

In [None]:
df=df.drop(columns = ['Hour', 'Seasons' ,'Month'])

In [None]:
df.columns

### **#Checking Distribution Rented Bike Count column data:-**

In [None]:
# Distribution plot of Rented Bike Count:

plt.figure(figsize=(10,6))
# histplot with a KDE curve replaces the deprecated sns.distplot
ax=sns.histplot(df['Rented Bike Count'], kde=True, stat='density', color="green")
ax.set_xlabel('Rented_Bike_Count')
ax.set_ylabel('Density')
ax.axvline(df['Rented Bike Count'].mean(), color='blue', linestyle='dashed', linewidth=2)    # mean
ax.axvline(df['Rented Bike Count'].median(), color='red', linestyle='dashed', linewidth=2)   # median
plt.show()



*   In the density plot for ***Rented Bike Count***, the mean and median both lie between 500 and 1000 and the mean is slightly **greater than the median**, which means the distribution is ***positively skewed***.




In [None]:
#Applying square root to Rented Bike Count to reduce skewness:

sqrt_count = np.sqrt(df['Rented Bike Count'])

plt.figure(figsize=(10,8))
# histplot with a KDE curve replaces the deprecated sns.distplot
ax=sns.histplot(sqrt_count, kde=True, stat='density', color="green")
ax.set_xlabel('Square root of Rented Bike Count')
ax.set_ylabel('Density')
ax.axvline(sqrt_count.mean(), color='blue', linestyle='dashed', linewidth=2)     # mean
ax.axvline(sqrt_count.median(), color='red', linestyle='dashed', linewidth=2)    # median

plt.show()

In [None]:
# Applying square root to Rented Bike Count column:

df['Rented Bike Count']=np.sqrt(df['Rented Bike Count'])
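
Because the target is now stored on the square-root scale, the models fitted afterwards predict square roots of counts, and errors computed on those predictions are in square-root units. The sketch below, with made-up numbers, shows how predictions could be squared to get back to bikes; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical hourly counts and made-up model predictions on the sqrt scale.
counts = np.array([100.0, 400.0, 900.0])
sqrt_target = np.sqrt(counts)               # what the model is trained on
sqrt_pred = np.array([11.0, 19.0, 31.0])    # invented predictions

# Square the predictions to return to "number of bikes".
pred_counts = np.square(sqrt_pred)

mae_sqrt = np.mean(np.abs(sqrt_target - sqrt_pred))
mae_counts = np.mean(np.abs(counts - pred_counts))
print("MAE on sqrt scale:", mae_sqrt)
print("MAE in bikes     :", mae_counts)
```

The same predictions give an error of 1 on the square-root scale but about 40 bikes on the original scale, so the scale should be stated whenever the metrics are reported.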

In [None]:
# Defining function for plotting predicted vs actual values (first 100 test points):

def get_linear_graph(pred_value , y_test ):
  plt.figure(figsize=(15,7))
  plt.plot(pred_value[:100])
  plt.plot(np.array(y_test[:100]))
  plt.legend(['Predicted','Actual'])
  plt.show()

In [None]:
# defining function for feature importance:

def get_feat_imp(model):
  feat_importances = pd.Series(model.feature_importances_, index=X.columns)
  plt.figure(figsize=(15,8))
  plt.title('Feature Importance')
  feat_importances.nlargest(10).plot(kind='barh', color= 'red')
  plt.show()

## #Data preparation:-

In [None]:
# Creating copy of data:

data= df.copy()

In [None]:
# Create the data of dependent and independent variables:

y = data['Rented Bike Count']
X = data.drop(columns=['Rented Bike Count'])

In [None]:
# Splitting the dataset into the Training set and Test set:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=12)
print(X_train.shape)
print(X_test.shape)

In [None]:
# Transforming data:

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
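
The scaler is fitted on the training split only and then reused on the test split, which keeps test information out of the preprocessing. A small sketch with toy arrays (hypothetical values) shows one consequence: test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[12.0], [-2.0]])   # values outside the training range

# Min and max are learned from the training data only.
scaler_toy = MinMaxScaler().fit(X_train_toy)
train_scaled = scaler_toy.transform(X_train_toy)
test_scaled = scaler_toy.transform(X_test_toy)

print(train_scaled.ravel())   # lies within [0, 1]
print(test_scaled.ravel())    # can fall outside [0, 1]
```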

### **#Linear Regression:-**

In [None]:
# Fitting onto Linear regression Model:

reg= LinearRegression().fit(X_train, y_train)


In [None]:
# Getting predictions for X_train and X_test:

y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

### **Evaluation Metrics for Linear Regression.**

In [None]:
# Calculate MSE, MAE, R2 for training data:


MSEl = mean_squared_error((y_train), (y_pred_train))
MAEl= mean_absolute_error(y_train, y_pred_train)
r2l = r2_score(y_train, y_pred_train)

In [None]:
# Calculate MSE, MAE, R2 for testing data:


MSEtestl = mean_squared_error((y_test), (y_pred_test))
MAEtestl= mean_absolute_error(y_test, y_pred_test)
r2testl = r2_score(y_test, y_pred_test)

In [None]:
# Printing Errors:

print('Training Errors\nMSE:', MSEl , '\nMAE:' , MAEl , '\nR2:',round((r2l),3))
print('\n\nTesting Errors\nMSE:', MSEtestl , '\nMAE:' , MAEtestl , '\nR2:',round((r2testl),3))
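
The three metrics printed above can be reproduced by hand, which makes their definitions explicit. This is a minimal NumPy sketch on four invented values, not the notebook's predictions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

# Mean squared error and mean absolute error.
mse = np.mean((y_true - y_hat) ** 2)
mae = np.mean(np.abs(y_true - y_hat))

# R2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse, "MAE:", mae, "R2:", r2)
```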

In [None]:
get_linear_graph(y_pred_test , y_test )

In [None]:
# storing the train set metrics value in a dataframe for later comparison:

dict1={'Model':'Linear regression ',
       'MAE':round((MAEl),3),
       'MSE':round((MSEl),3),
       'R2_score':round((r2l),3),
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
# storing the test set metrics value in a dataframe for later comparison:
dict2={'Model':'Linear regression ',
       'MAE':round((MAEtestl),3),
       'MSE':round((MSEtestl),3),
       'R2_score':round((r2testl),3)
       }
test_df=pd.DataFrame(dict2,index=[1])
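
Since the same three metrics are stored for every model, the bookkeeping could be wrapped in a small helper. The function `metrics_row` below is a hypothetical convenience, not part of the notebook, and the inputs are invented.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def metrics_row(model_name, y_true, y_pred):
    """Return one row of rounded metrics for a comparison table (hypothetical helper)."""
    return {
        "Model": model_name,
        "MAE": round(mean_absolute_error(y_true, y_pred), 3),
        "MSE": round(mean_squared_error(y_true, y_pred), 3),
        "R2_score": round(r2_score(y_true, y_pred), 3),
    }

# Collect one row per model, then build the table in a single step.
rows = [
    metrics_row("Model A", [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
    metrics_row("Model B", [1.0, 2.0, 3.0], [1.5, 2.0, 2.5]),
]
comparison = pd.DataFrame(rows)
print(comparison)
```

Building the frame once from a list of rows also avoids growing a DataFrame one row at a time.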

### **#Polynomial Regression:-**

In [None]:
# Creating the degree-2 polynomial feature transformer:

poly_reg = PolynomialFeatures(degree = 2)

In [None]:
X_poly = poly_reg.fit_transform(X_train)
X_poly_test = poly_reg.transform(X_test)  # reuse the transformer fitted on the training data

In [None]:
# Fitting training data onto Polynomial regression Model:

poly = LinearRegression().fit(X_poly, y_train)
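
A degree-2 expansion adds a bias column, the original features, their squares and all pairwise products, so the column count grows quickly: n features become C(n + 2, 2) columns. The sketch below checks this on a toy array with 5 features (hypothetical data). With about 50 input columns, which the encoded frame here appears to have, the same formula gives 1,326 columns.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 5
X_toy = np.arange(10, dtype=float).reshape(2, n_features)

poly_toy = PolynomialFeatures(degree=2)
X_toy_poly = poly_toy.fit_transform(X_toy)

# Bias + linear + squared + pairwise interaction terms.
expected = comb(n_features + 2, 2)
print(X_toy_poly.shape[1], expected)
```

That growth is one reason a polynomial model can fit the training data much more closely than the plain linear model, so the train and test metrics are worth comparing side by side.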

In [None]:
# Getting predictions for the train and test sets:

y_pred_poly_train = poly.predict(X_poly)
y_pred_poly_test= poly.predict(X_poly_test)

### Evaluation Metrics for Polynomial Regression:-

In [None]:
# Calculate MSE, MAE, R2 for training data:

MSEp = mean_squared_error((y_train), (y_pred_poly_train))
MAEp= mean_absolute_error(y_train, y_pred_poly_train)
r2p = r2_score(y_train, y_pred_poly_train)

In [None]:
# Calculate MSE, MAE, R2 for testing data:

MSEtestp = mean_squared_error((y_test), (y_pred_poly_test))
MAEtestp= mean_absolute_error(y_test, y_pred_poly_test)
r2testp = r2_score(y_test, y_pred_poly_test)

In [None]:
# Printing Errors:

print('Training Errors\nMSE:', MSEp , '\nMAE:' , MAEp , '\nR2:',round((r2p),2))
print('\n\nTesting Errors\nMSE:', MSEtestp , '\nMAE:' , MAEtestp , '\nR2:',round((r2testp),2))

In [None]:
get_linear_graph(y_pred_poly_test , y_test )

In [None]:
# storing the train set metrics value in a dataframe for later comparison:

dict1={'Model':'Polynomial regression ',
       'MAE':round((MAEp),3),
       'MSE':round((MSEp),3),
       'R2_score':round((r2p),3)
       }
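# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)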

In [None]:
# storing the test set metrics value in a dataframe for later comparison:

dict2={'Model':'Polynomial regression ',
       'MAE':round((MAEtestp),3),
       'MSE':round((MSEtestp),3),
       'R2_score':round((r2testp),3)
       }
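# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)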

### **#Decision Tree Regressor:-**

In [None]:
# Creating a Decision Tree regressor with the squared error criterion, a maximum depth of 10 and a maximum of 120 leaf nodes:

decision_regressor = DecisionTreeRegressor(criterion='squared_error', max_depth=10, max_leaf_nodes=120)
decision_regressor.fit(X_train, y_train)

In [None]:
# Getting predictions for the train and test sets:

y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d = decision_regressor.predict(X_test)

### **#Evaluation Metrics for Decision Tree Regressor:-**

In [None]:
# Calculate MSE, MAE, R2 for training data:


MSEdt = mean_squared_error((y_train), (y_pred_train_d))
MAEdt = mean_absolute_error(y_train, y_pred_train_d)
r2dt = r2_score(y_train, y_pred_train_d)

In [None]:
# Calculate MSE, MAE, R2 for testing data:


MSEtestdt = mean_squared_error((y_test), (y_pred_test_d))
MAEtestdt = mean_absolute_error(y_test, y_pred_test_d)
r2testdt = r2_score(y_test, y_pred_test_d)

In [None]:
# Printing Errors:

print('Training Errors\nMSE:', MSEdt , '\nMAE:' , MAEdt , '\nR2:',round((r2dt),3))
print('\n\nTesting Errors\nMSE:', MSEtestdt , '\nMAE:' , MAEtestdt , '\nR2:',round((r2testdt),3))

In [None]:
get_linear_graph(y_pred_test_d , y_test )

In [None]:
# storing the train set metrics value in a dataframe for later comparison:

dict1={'Model':'Decision Tree Regression ',
       'MAE':round((MAEdt),3),
       'MSE':round((MSEdt),3),
       'R2_score':round((r2dt),3),
}
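# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)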

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2 = {'Model':'Decision Tree Regression ',
       'MAE':round((MAEtestdt),3),
       'MSE':round((MSEtestdt),3),
       'R2_score':round((r2testdt),3),
}
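# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)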

In [None]:
get_feat_imp(decision_regressor)



*   Here we can see that, among the ten features plotted, ***Hour_20*** shows the least feature importance while **Winter season** shows the highest feature importance in model prediction.
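
As a reminder of what `feature_importances_` reports, the sketch below fits a small tree on synthetic data (an assumption for illustration) where the target depends mostly on the first column. The importances are normalised to sum to 1, so each value is a share of the total impurity reduction, not an absolute effect size.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(500, 3))
# Target depends strongly on column 0, weakly on column 1, not at all on column 2.
y_toy = 10 * X_toy[:, 0] + 1 * X_toy[:, 1]

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_toy, y_toy)
importances = tree.feature_importances_
print(importances, importances.sum())
```

Because the plotting helper in this notebook shows only the ten largest values, a feature at the bottom of the chart is the least important of those ten, not of all features.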




### **#Random Forest Regressor:-**

In [None]:
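# criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 (the name used for the decision tree above)
rfc = RandomForestRegressor(n_estimators = 180, random_state = 21 ,criterion= 'squared_error',max_depth=13 ,max_leaf_nodes= 80)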
rfc.fit(X_train,y_train)

In [None]:
# Prediction on train dataset:

y_pred_trainrfc = rfc.predict(X_train)

In [None]:
#Prediction on test dataset:

y_pred_testrfc = rfc.predict(X_test)

### **#Evaluation Metrics for Random Forest:-**

In [None]:
# Calculate MSE, MAE, R2 for training data:


MSErfc = mean_squared_error(y_train, y_pred_trainrfc)
MAErfc = mean_absolute_error(y_train, y_pred_trainrfc)
r2rfc = r2_score(y_train, y_pred_trainrfc)

In [None]:
# Calculate MSE, MAE, R2 for testing data:


MSEtestrf = mean_squared_error((y_test), (y_pred_testrfc))
MAEtestrf = mean_absolute_error(y_test, y_pred_testrfc)
r2testrf = r2_score(y_test, y_pred_testrfc)


In [None]:
# Printing Errors:

print('Training Errors\nMSE:', MSErfc , '\nMAE:' , MAErfc , '\nR2:',round((r2rfc),3))
print('\n\nTesting Errors\nMSE:', MSEtestrf , '\nMAE:' , MAEtestrf , '\nR2:',round((r2testrf),3))

In [None]:
get_linear_graph(y_pred_testrfc , y_test )

In [None]:
# storing the train set metrics value in a dataframe for later comparison:

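dict1={'Model':'Random Forest ',
       'MAE':round((MAErfc),3),
       'MSE':round((MSErfc),3),
       'R2_score':round((r2rfc),3)}

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)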

In [None]:
# storing the test set metrics value in a dataframe for later comparison:

dict2={'Model':'Random Forest ',
       'MAE':round((MAEtestrf),3),
       'MSE':round((MSEtestrf),3),
       'R2_score':round((r2testrf),3)}

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
get_feat_imp(rfc)

Here we can see that, among the ten features plotted, ***Month_3*** shows the least feature importance while **Winter season** shows the highest feature importance in model prediction.
