<a href="https://colab.research.google.com/github/ShivaniThakur-19/Bike-Sharing-Demand-Prediction-Capstone-I/blob/main/Bike_Sharing_Demand_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

## <b> Problem Description </b>

### Rental bikes have been introduced in many cities to improve mobility and comfort. Bikes must be available and accessible to the public at the right time, since this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the number of bikes required at each hour.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

# 1.Understanding the data:
- Import the Libraries
- Import the data and view its columns
- Check all the statistics and data types of the data
- Visualize the numerical and categorical data

In [None]:
#import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as pg
%matplotlib inline
# import warning 
import warnings
warnings.filterwarnings('ignore')
# import evaluation metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
#import datetime library to work with datetime values
from datetime import datetime
import datetime as dt

# import GridSearchCV and RandomizedSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import PowerTransformer
#import other important libraries 
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# import ML models 
from sklearn.linear_model import LinearRegression,Lasso,Ridge,SGDRegressor #linear , ridge , lasso and SGD regressor
from sklearn.preprocessing import PolynomialFeatures # for polynomial regression
from sklearn.tree import DecisionTreeRegressor # for decision tree regressor
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor # ensemble models
from xgboost import XGBRegressor #for XG boost
#
from sklearn.datasets import make_regression

In [None]:
# Mount your drive 
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# load the dataset

data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SeoulBikeData.csv', encoding = ('ISO-8859-1'))

In [None]:
#make copy of our datasets 
df=data.copy()

In [None]:
# column name of our dataframe
df.columns

In [None]:
# rename the columns to single-word names without units or spaces

df=df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# look at our new column name 
df.columns

In [None]:
# dataframe 
df

In [None]:
# shape of dataset  
df.shape

In [None]:
# some basic info  
df.info()

In [None]:
# to know about descriptive summary 
df.describe(include="all")

In [None]:
df.describe().T

In [None]:
# checking the null value in each column of datasets
df.isna().sum()

In [None]:
# check no. of unique values in each columns 
df.nunique()

In [None]:
df.columns

### Correlation between columns

In [None]:
# plot a heat map of the correlation between the numeric columns of the dataset
plt.figure(figsize = (14,8))
sns.heatmap(df.select_dtypes(include='number').corr() , annot = True )

# **Initial visualisation**

To get a feeling for the data it is a good idea to do some form of simple visualisation.  **Display a set of histograms for the features** as they are right now, prior to any cleaning steps.

In [None]:
# Plot the data
df.hist(bins=50,figsize=(10,15))
plt.show()

In [None]:
# plot the percentage of missing values in each column
missing_values = pd.DataFrame((df.isna().sum()) * 100 / df.shape[0]).reset_index()
missing_values.columns = ['column', 'percentage']
plt.figure(figsize = (10,5))
ax = sns.scatterplot(x='column', y='percentage', data=missing_values, hue='percentage', palette="YlOrBr")
plt.xticks(rotation =45,fontsize =12,fontweight='bold')
plt.yticks(fontsize =10,fontweight='bold')
plt.title("Percentage of Missing values",fontweight='bold')
plt.ylabel("PERCENTAGE",fontweight='bold')
plt.show()

As we can see above, there are no missing values.

In [None]:
# to know about duplicate data in our datasets 

df[df.duplicated()].shape

In [None]:
# Date is stored as an object (string) column, so convert it to datetime
# dayfirst=True assumes the CSV stores dates as day/month/year
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
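
Date strings such as `01/12/2017` are ambiguous: they can be read as 1 December or as 12 January. The sketch below assumes the dates are stored as day/month/year (an assumption about the CSV, not something the notebook confirms) and passes an explicit `format` so that nothing is left to inference. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of raw date strings, assumed to be day/month/year
raw_dates = pd.Series(["01/12/2017", "02/12/2017", "13/12/2017"])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month_name().tolist())
```

If the real file uses a different layout, only the `format` string needs to change.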

In [None]:
# now derive the weekday, month and year from the Date column

df['day_of_week'] = df['Date'].dt.day_name() # extract the weekday name from the date
df['month'] = df['Date'].dt.month_name() # extract the month name from the date
df['year'] = df['Date'].dt.year

In [None]:
# convert the year column into a categorical column for analysis
df['year'] = df['year'].astype('object')

In [None]:
# now see the unique values in the year column
df['year'].unique( )

In [None]:
# convert Hour into a categorical column: although time is continuous, here each hour acts as a discrete label
df['Hour']=df['Hour'].astype('object')

In [None]:
# we can also segregate our day into weekdays and weekend category 

df['weekend_col'] = df['day_of_week'].apply(lambda x:'Weekend'  if x=='Saturday' or  x== 'Sunday' else 'Weekdays')

In [None]:
# drop Date and day_of_week (the month and weekend features were extracted from them)
# and drop year, because the data only spans Dec 2017 to Nov 2018

df= df.drop(columns = ['Date' , 'day_of_week' , 'year'])

In [None]:
df.info()

In [None]:
# split the dataset into numerical and categorical features
numeric_df = df.select_dtypes(exclude='object')
categorical_df = df.select_dtypes(include='object') 

In [None]:
numeric_df.columns

# 2. Exploratory Data Analysis


In [None]:
### Visualizing
# plot only the numeric features against the target
cols = numeric_df.columns.drop('Rented_Bike_Count').tolist()

# scatter plot
df.plot(kind="line", x="Rented_Bike_Count", y=cols, subplots=True, sharex=True, ls="none", marker="o",figsize=(20,30),layout=(5, 3))

# box plot
df[cols].plot(kind="box", subplots=True, figsize=(10,15),layout=(5, 3))
plt.show()

# count the missing values in each column
print(df.isna().sum())

# Categorical columns EDA

In [None]:
categorical_df.columns

In [None]:
# first by months

sns.catplot(x = 'month' , y = 'Rented_Bike_Count' ,kind = 'bar', height= 4.5, aspect = 2.5 , data = df)
plt.title("Count of rented bikes according to month")

In [None]:
# now by weekday and weekend

sns.catplot( x= 'weekend_col' ,  y = 'Rented_Bike_Count' , data = df , kind = 'box' , height = 4.5, aspect = 2.0 )
plt.title('Count of rented bikes on weekdays and weekends')
plt.show()


In [None]:
# now by hour
res = sns.catplot(x= 'Hour' , y= 'Rented_Bike_Count' , data = df , kind= 'bar' , height = 10.0 , aspect = 7.0)
res.set_xticklabels(fontsize = 26)
res.set_yticklabels(fontsize = 30)



In [None]:
# now on the basis of functioning day 

sns.catplot(x = 'Functioning_Day',y = 'Rented_Bike_Count',data = df , kind = 'box' , height = 4.5, aspect = 2.5 )

In [None]:
# by season now 
sns.catplot(x = 'Seasons' , y = 'Rented_Bike_Count' , data = df , kind = 'bar' , height = 4.5 , aspect = 2.5 )
plt.title('Count of rented bikes according to season')

In [None]:
# now by holidays 

sns.catplot(x= 'Holiday' , y = 'Rented_Bike_Count' , data = df , kind = 'box' , height = 5.0 , aspect = 1.5)
plt.title("Count of rented bikes according to holidays")

In [None]:
# hourly rentals split by season
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=df,x='Hour',y='Rented_Bike_Count',hue='Seasons',ax=ax)
ax.set(title='Count of rented bikes by hour and season')

In [None]:
# point plots of Rented Bike Count by hour, split by each of the other categorical features
for i in categorical_df.columns:
  if i != 'Hour':
    plt.figure(figsize=(20,10))
    sns.pointplot(x=df["Hour"],y=df['Rented_Bike_Count'],hue=df[i])
    plt.title(f"Rented Bike Count during different {i} with respect to Hour")
    plt.show()

###### now focus on  numerical columns 

In [None]:
# boxplot to check outliers in numerical columns 
n = 1
plt.figure(figsize=(10,15))

for col in numeric_df.columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.boxplot(x=df[col])
  plt.title(col)
  plt.xlabel(col)
  plt.tight_layout()

# Removing the outliers in the Rainfall and Snowfall columns would discard almost all of their non-zero values, so we leave them as they are for now

In [None]:
numeric_df.columns

In [None]:
# Heatmap of all numerical variables against each other to see their correlations
plt.figure(figsize=(10,6))
sns.heatmap(numeric_df.corr(),annot=True,cmap='YlGnBu')
plt.title("Numerical columns correlation heatmap")
plt.show()

In [None]:
# distribution plots to analyze the distribution of all numerical features
numerical_columns=list(df.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
for col in numerical_features:
  plt.figure(figsize=(10,6))
  sns.distplot(x=df[col])
  plt.xlabel(col)
plt.show()

Numerical features vs. Rented_Bike_Count

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Temperature"
df.groupby('Temperature')['Rented_Bike_Count'].mean().plot(color='deeppink')

From the above plot we see that rentals peak in warm weather, at around 25°C on average.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Dew_point_temperature"
df.groupby('Dew_point_temperature')['Rented_Bike_Count'].mean().plot(color='hotpink')

The plot for 'Dew_point_temperature' looks almost the same as the one for 'Temperature', which suggests the two features are closely related. We check this in a later step.

In [None]:

# plot the mean "Rented_Bike_Count" for each value of "Solar_Radiation"
df.groupby('Solar_Radiation')['Rented_Bike_Count'].mean().plot(color="orchid")

From the above plot we see that rentals are high whenever there is solar radiation: the mean count is around 1000.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Snowfall"
df.groupby('Snowfall')['Rented_Bike_Count'].mean().plot(color="violet")

We can see from the plot that rentals fall as snowfall increases: with more than 4 cm of snow, the mean count is much lower.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Rainfall"
df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot(color="lightpink")

We can see from the above plot that demand does not fall steadily as rainfall increases: for example, there is a large peak in rented bikes at 20 mm of rain. Such peaks may rest on very few observations, so they should be read with caution.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Wind_speed"
df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot(color="hotpink")

We can see from the above plot that demand is fairly uniform across wind speeds, with an increase at around 7 m/s. This suggests that wind alone does not discourage riders, although a peak at such a high speed may rest on very few observations.

In [None]:
# take a look at the maximum value of each column to get an idea about outliers

print(df['Wind_speed'].max())
print(df['Solar_Radiation'].max())


In [None]:
# cap (clip) the outliers at upper limits chosen from the IQR bounds of each column

df.loc[df['Solar_Radiation']>=2,'Solar_Radiation']= 2
df.loc[df['Wind_speed' ]>=4,'Wind_speed']= 4
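
The limits 2 and 4 above are fixed by hand. A minimal sketch of how such limits could be derived from the IQR rule follows; the helper name `iqr_bounds`, the factor `k=1.5`, and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up wind speeds with one extreme value
wind = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 7.4])

lower, upper = iqr_bounds(wind)
# clip() caps values outside the fences instead of dropping the rows
capped = wind.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because the rows also carry the other features.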

In [None]:
# check the outliers again after capping
 
n = 1
plt.figure(figsize=(20,15))

for col in numeric_df.columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.boxplot(x=df[col],color='magenta')
  plt.title(col)
  plt.xlabel(col)
  plt.tight_layout()





Next, we look at the relationship between each numerical independent column and the rented bike count with the help of regression plots.


In [None]:
# relationship between the rented bike count and each numerical column
n=1
plt.figure(figsize=(15,15))
for i in numeric_df.columns :
  if i != 'Rented_Bike_Count':
    plt.subplot(3,3,n )
    n += 1
    sns.regplot(x=df[i], y=df['Rented_Bike_Count'] , scatter_kws={"color": "magenta"}, line_kws={"color": "red"})
    plt.title(f'Dependent variable and {i}')
    plt.tight_layout()

In [None]:
# look again at the maximum values of these columns to confirm the capping

print(df['Wind_speed'].max())
print(df['Solar_Radiation'].max())

In [None]:
# check the outliers once more by means of box plots
 
n = 1
plt.figure(figsize=(20,15))

for col in numeric_df.columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.boxplot(x=df[col],color='darkorchid')
  plt.title(col)
  plt.xlabel(col)
  plt.tight_layout()

In [None]:
# distribution of each numerical column with its mean (black) and median (red)
n = 1 
plt.figure(figsize = (20,15))
for col in numeric_df.columns :
  plt.subplot(3,3 ,n ) 
  n += 1 
  sns.distplot(df[col])
  feature = df[col]
  plt.axvline(feature.mean(), color='black', linestyle = 'dashed' , linewidth=3)
  plt.axvline(feature.median(), color='red', linestyle='dashed', linewidth=3)
plt.tight_layout()
plt.show()

# Regression plot
Regression plots in seaborn are primarily intended as a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. As the name suggests, a regression plot fits a regression line between two variables and helps to visualize their linear relationship.

In [None]:
#printing the regression plot for all the numerical features
for col in numerical_features:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=df[col],y=df['Rented_Bike_Count'],scatter_kws={"color": 'coral'}, line_kws={"color": "black"})

From the distribution plots we observe that some columns are right-skewed and some are left-skewed, which we have to keep in mind when we apply the algorithms.

* Right-skewed columns: Rented_Bike_Count (our dependent variable), Wind_speed, Solar_Radiation, Rainfall, Snowfall
* Left-skewed columns: Visibility, Dew_point_temperature

For the skewed features, the mean and the median are pulled apart, as the histograms show.


* ***The above graph shows that Rented Bike Count is moderately right-skewed. Linear regression assumes normally distributed residuals rather than a normally distributed dependent variable, but a strongly skewed target often leads to skewed residuals, so we transform it to bring it closer to normal.***

In [None]:
plt.figure(figsize = (12 ,8))
sqrt_count = np.sqrt(df['Rented_Bike_Count'])

ax = sns.distplot(sqrt_count , color = "blue")
# mark the mean and median of the transformed values
ax.axvline(sqrt_count.mean() , color = 'black' , linestyle = 'dashed' , linewidth = 2.5)
ax.axvline(sqrt_count.median() , color = 'red' , linestyle = 'dashed' , linewidth = 2.5)
ax.set_xlabel("Square root of Rented Bike Count")
ax.set_ylabel("Density")




# A common remedy for a right-skewed variable is a square-root transform. After applying the square root to Rented Bike Count, we get an almost normal distribution.
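
The effect of the square-root transform can also be checked numerically instead of by eye. The sketch below uses synthetic right-skewed data (a gamma sample, chosen only for illustration, not the bike data) and a hand-written `skewness` helper, which is a hypothetical name.

```python
import numpy as np

def skewness(values) -> float:
    """Sample skewness: third central moment divided by variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)
# Synthetic right-skewed "counts"; not the real Rented_Bike_Count column
counts = rng.gamma(shape=1.5, scale=400.0, size=5000)

skew_raw = skewness(counts)
skew_sqrt = skewness(np.sqrt(counts))

print("skewness before:", round(skew_raw, 2))
print("skewness after sqrt:", round(skew_sqrt, 2))
```

A value closer to zero after the transform indicates a more symmetric distribution; applying the same helper to `df['Rented_Bike_Count']` would give the figures for the real data.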

In [None]:
# after applying sqrt to Rented Bike Count, check whether we still have outliers
plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=np.sqrt(df['Rented_Bike_Count']))
plt.show()

After applying the square root to the Rented Bike Count column, we find that there are no outliers present.

# Checking of Correlation between variables

In [None]:
## correlation between the numerical variables and the rented bike count

df.select_dtypes(include='number').corr()['Rented_Bike_Count']

We observed similar things in the regression plots, where some features are negatively correlated and some positively correlated with the dependent variable.

In [None]:
## plot the correlation matrix
plt.figure(figsize=(20,8))
correlation=df.select_dtypes(include='number').corr()
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap((correlation),mask=mask, annot=True,cmap='coolwarm')

From the above correlation heatmap, we see a strong positive correlation of 0.91 between 'Temperature' and 'Dew point temperature'. Because the two columns vary together, dropping one of them should not materially affect the outcome of our analysis, so we drop 'Dew point temperature(°C)'.
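
Reading pairs off a heatmap by eye does not scale well, so the same check can be sketched in code. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the small table of values below are all illustrative assumptions.

```python
import pandas as pd

def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """List (column_a, column_b, correlation) for pairs with |corr| above threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

# Made-up values: dew point tracks temperature, humidity does not
demo = pd.DataFrame({
    "Temperature": [-5, 0, 5, 10, 15, 20, 25, 30],
    "Dew_point_temperature": [-12, -6, -2, 3, 9, 13, 19, 23],
    "Humidity": [60, 35, 70, 40, 65, 30, 75, 45],
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

Which column of a flagged pair to drop remains a judgment call; here the notebook keeps 'Temperature'.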

In [None]:
# dropping dew point column
df.drop(['Dew_point_temperature'] , axis = 1 , inplace = True)

# one hot encoding

###### one hot encoding to convert categorical columns into numerical columns for better analysis


In [None]:
df_enc = df.copy()

def one_hot_encoding(data , column ) :
  # dtype=int keeps the dummy columns numeric (0/1) across pandas versions
  data = pd.concat([data , pd.get_dummies(data[column] , prefix = column , drop_first = True , dtype = int)] , axis = 1)
  data = data.drop([column], axis =1 )
  return data

for col in categorical_df.columns :
  df_enc = one_hot_encoding(df_enc , col )
df_enc.head()
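
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up 'Seasons' column. With four categories it produces three dummy columns; the dropped category (the first in sorted order) is represented by a row of zeros.

```python
import pandas as pd

# Made-up sample; the category names follow the data description
seasons = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

encoded = pd.get_dummies(seasons["Seasons"], prefix="Seasons", drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, which matters for the linear models and for the VIF check.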


In [None]:
# now check multicollinearity with the help of VIF (variance inflation factor)
# define a function that calculates the VIF


from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)
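
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data. The function name `vif` and the data are illustrative assumptions, and this version adds an intercept column explicitly, so its numbers need not match `calc_vif` exactly.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    # Intercept column added explicitly (an assumption of this sketch)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1 - residuals.var() / target.var()
    return float(1 / (1 - r2))

rng = np.random.default_rng(0)
temperature = rng.normal(size=500)
dew_point = temperature + 0.2 * rng.normal(size=500)  # nearly a copy of temperature
humidity = rng.normal(size=500)                       # unrelated to the others

X_demo = np.column_stack([temperature, dew_point, humidity])
vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns get a large VIF while the unrelated one stays close to 1, which is the pattern that motivates dropping one of a correlated pair.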


In [None]:
# now calculate the VIF of our feature columns
calc_vif(df_enc[[i for i in df_enc.describe().columns if i not in ['Rented_Bike_Count',]]] )

In [None]:
# now we separate the dataset into independent (X) and dependent (Y) columns

X = df_enc.drop(columns = ['Rented_Bike_Count'])
Y = np.sqrt(df_enc['Rented_Bike_Count'])
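
Because `Y` is the square root of the count, every model below predicts on the square-root scale, and the reported errors are on that scale too. To report errors in bikes per hour, predictions can be squared first. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up hourly counts and made-up model predictions on the square-root scale
y_true = np.array([0.0, 100.0, 400.0, 900.0])
pred_sqrt = np.array([1.0, 9.0, 21.0, 29.0])

# Square the predictions to return to "bikes per hour"
pred_count = np.square(pred_sqrt)

rmse_sqrt_scale = np.sqrt(np.mean((np.sqrt(y_true) - pred_sqrt) ** 2))
rmse_count_scale = np.sqrt(np.mean((y_true - pred_count) ** 2))

print("RMSE on sqrt scale:", rmse_sqrt_scale)
print("RMSE in bikes per hour:", rmse_count_scale)
```

The two numbers are not comparable with each other, so metrics should only be compared across models when they are computed on the same scale.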

In [None]:
X.shape

In [None]:
Y.shape

# Model Training

Train Test split for regression

In [None]:
# now split the dataset into training and test sets

from sklearn.model_selection import train_test_split 
X_train , X_test , Y_train , Y_test = train_test_split(X , Y , test_size = 0.2 , random_state = 1)
print(X_train.shape)
print(X_test.shape)


In [None]:
df_enc.columns

# Now we move on to model building

Linear Regression

In [None]:
# fit a linear regression model on the training data
from sklearn.linear_model import LinearRegression
reg= LinearRegression().fit(X_train, Y_train)

In [None]:
# R2 score on the training dataset
reg.score(X_train , Y_train )

In [None]:
# check the coefficients
reg.coef_

In [None]:
# predictions for the training and test sets
Y_train_pred=reg.predict(X_train)
Y_test_pred=reg.predict(X_test)

In [None]:
# Score metrics on the training data
print("Linear regression training set metrics:")
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

MSE_lr1 = ( mean_squared_error(Y_train, Y_train_pred))
print("MSE", MSE_lr1)

MAE_lr1 = (mean_absolute_error(Y_train, Y_train_pred))
print("MAE", MAE_lr1)

RMSE_lr1 = (np.sqrt(mean_squared_error(Y_train, Y_train_pred)))
print("RMSE", RMSE_lr1)

R2_score_lr1 = r2_score(Y_train, Y_train_pred)
print("R2_score", R2_score_lr1)

# adjusted R2 uses the number of rows and features of the set being scored (here: training)
Adjusted_r2_lrl = 1-(1-R2_score_lr1)*(X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)
print("Adjusted_r2" , Adjusted_r2_lrl)
print()



In [None]:
# Score metrics on the test data
print(f"Linear regression testing set metrics:")

MSE_lr2 = ( mean_squared_error(Y_test, Y_test_pred))
print("MSE", MSE_lr2)

MAE_lr2 = (mean_absolute_error(Y_test, Y_test_pred))
print("MAE", MAE_lr2)

RMSE_lr2 = (np.sqrt(mean_squared_error(Y_test, Y_test_pred)))
print("RMSE", RMSE_lr2)

R2_score_lr2 = r2_score(Y_test, Y_test_pred)
print("R2_score", R2_score_lr2)

Adjusted_r2_lr2 = 1-((1-R2_score_lr2)* (X_test.shape[0]-1)/ (X_test.shape[0]-1 -(X_test.shape[1])))
print("Adjusted_r2" , Adjusted_r2_lr2)
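
The adjusted R² formula is easy to get wrong through misplaced parentheses, so it can help to keep it in one small function and reuse it for every model. A minimal sketch, where the function name `adjusted_r2` and the example numbers are illustrative:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Made-up example: R2 of 0.8 with 101 rows and 10 features
example = adjusted_r2(0.8, n_samples=101, n_features=10)
print(round(example, 4))
```

With the notebook's variables this would be called as `adjusted_r2(R2_score_lr2, X_test.shape[0], X_test.shape[1])`, always passing the shape of the same set the R² was computed on.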

In [None]:
reg.intercept_


The r2_score for the test set is 0.78, which means the linear model explains most of the variance in the data. Let us visualize the residuals to see whether there is heteroscedasticity (unequal variance or scatter).

In [None]:
### Heteroscedasticity: residuals vs. predicted values
plt.scatter((Y_test_pred),(Y_test)-(Y_test_pred))
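
A residual plot is judged by eye. As a rough numeric companion, one can check whether the size of the residuals grows with the fitted values. The sketch below does this on synthetic data that is heteroscedastic by construction; it is a crude diagnostic under those assumptions, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data whose noise grows with x (heteroscedastic by construction)
x = rng.uniform(1, 10, size=2000)
y = 2 * x + 0.5 * x * rng.normal(size=x.size)

# Ordinary least-squares line fit
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Correlation between fitted values and absolute residuals:
# near 0 suggests constant spread, clearly positive suggests growing spread
spread_trend = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(round(spread_trend, 2))
```

For the notebook's model the same quantity could be computed from `Y_test_pred` and `Y_test - Y_test_pred`.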

In [None]:
#visualization
plt.figure(figsize=(10,5))
plt.plot((Y_test_pred), color = 'green')
plt.plot(np.array(Y_test), color = 'yellow')
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title("Linear regression")
plt.show()

In [None]:
# creating a dict to concat linear training and test data score metrics
# storing the train set metrics value in a dict for later comparison
dict1={'Model':'Linear regression ',
       'MAE':round((MAE_lr1),2),
       'MSE':round((MSE_lr1),2),
       'RMSE':round((RMSE_lr1),2),
       'R2_score':round((R2_score_lr1),2),
       'Adjusted R2':round((Adjusted_r2_lrl),2)
       }
lr_dict1 = pd.DataFrame(dict1,index=[1])
training_df=pd.DataFrame(dict1,index=[1])

# storing the test set metrics value in a dict for later comparison
dict2={'Model':'Linear regression ',
       'MAE':round((MAE_lr2),2),
       'MSE':round((MSE_lr2),2),
       'RMSE':round((RMSE_lr2),2),
       'R2_score':round((R2_score_lr2),2),
       'Adjusted R2':round((Adjusted_r2_lr2 ),2)
       }
lr_dict2 = pd.DataFrame(dict2,index=[1])
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
# linear regression score for train and test data
result=pd.concat([lr_dict1,lr_dict2],keys=['Training set','Test set'])
result

# LASSO REGRESSION

In [None]:
# Create an instance of Lasso Regression implementation
from sklearn.linear_model import Lasso
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
# Fit the Lasso model
lasso_regressor.fit(X_train, Y_train)


In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
lasso = Lasso(alpha=0.001, max_iter=1000)
lasso.fit(X_train, Y_train)
Y_pred_train_lasso = lasso.predict(X_train)                 
Y_pred_test_lasso = lasso.predict(X_test)                     
print(Y_pred_train_lasso)

In [None]:
#Score matrics on train data
print(f"Lasso training set metrics:")

MSE_lasso1 = (mean_squared_error(Y_train, Y_pred_train_lasso))
print("MSE", MSE_lasso1)

MAE_lasso1 = (mean_absolute_error(Y_train, Y_pred_train_lasso))
print("MAE", MAE_lasso1)

RMSE_lasso1 = (np.sqrt(mean_squared_error(Y_train, Y_pred_train_lasso)))
print("RMSE", RMSE_lasso1)

R2_lasso1 = r2_score(Y_train, Y_pred_train_lasso)
print('R2', R2_lasso1)

Adjusted_r2_lasso1 = 1-(1-R2_lasso1)*(X_train.shape[0]-1)/(X_train.shape[0]-1-X_train.shape[1])
print("Adjusted_R2", Adjusted_r2_lasso1)
print()



In [None]:
#Score matrics on test data
print(f"Lasso test set metrics:")

MSE_lasso2 = (mean_squared_error(Y_test, Y_pred_test_lasso))
print("MSE", MSE_lasso2)

MAE_lasso2 = (mean_absolute_error(Y_test, Y_pred_test_lasso))
print("MAE", MAE_lasso2)

RMSE_lasso2 = (np.sqrt(mean_squared_error(Y_test, Y_pred_test_lasso)))
print("RMSE", RMSE_lasso2)

R2_lasso2 = r2_score(Y_test, Y_pred_test_lasso)
print('R2', R2_lasso2)

Adjusted_r2_lasso2 = 1-(1-R2_lasso2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2", Adjusted_r2_lasso2)

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l= mean_squared_error((Y_train), (Y_pred_train_lasso))
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)


#calculate MAE
MAE_l= mean_absolute_error(Y_train, Y_pred_train_lasso)
print("MAE :",MAE_l)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l= r2_score(Y_train, Y_pred_train_lasso)
print("R2 :",r2_l)
# training-set adjusted R2 uses the training shape
Adjusted_R2_l = 1-(1-r2_l)*(X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)
print("Adjusted R2 :",Adjusted_R2_l)

In [None]:
# creating a dict to concat lasso training and test data score metrics
# storing the Train set metrics value in a dict3 for later comparison
dict3={'Model':'Lasso regression ',
       'MAE':round((MAE_lasso1),2),
       'MSE':round((MSE_lasso1),2),
       'RMSE':round((RMSE_lasso1),2),
       'R2_score':round((R2_lasso1),2),
       'Adjusted R2':round((Adjusted_r2_lasso1),2)
       }
lasso_dict3 =pd.DataFrame(dict3,index=[1])
training_df=pd.concat([training_df,lasso_dict3],ignore_index=True)

In [None]:
# storing the test set metrics value in a dict4 for later comparison
dict4={'Model':'Lasso regression ',
       'MAE':round((MAE_lasso2),2),
       'MSE':round((MSE_lasso2),2),
       'RMSE':round((RMSE_lasso2),2),
       'R2_score':round((R2_lasso2),2),
       'Adjusted R2':round((Adjusted_r2_lasso2 ),2)
       }
lasso_dict4 =pd.DataFrame(dict4,index=[1])
test_df=pd.concat([test_df,lasso_dict4],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(10,4))
plt.plot(np.array(Y_pred_test_lasso))
plt.plot(np.array((Y_test)))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Regularization: Lasso')
plt.show()


In [None]:
### Heteroscedasticity
plt.scatter((Y_pred_test_lasso),(Y_test-Y_pred_test_lasso))

In [None]:
result=pd.concat([lasso_dict3,lasso_dict4],keys=['Training set','Test set'])
result

# RIDGE REGRESSION

In [None]:
#import the packages
from sklearn.linear_model import Ridge

ridge= Ridge(alpha=0.1)

In [None]:
#FIT THE MODEL
ridge.fit(X_train,Y_train)

In [None]:
#check the score
ridge.score(X_train, Y_train)

In [None]:
# predictions for the training and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r= mean_squared_error((Y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(Y_train, y_pred_train_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score(Y_train, y_pred_train_ridge)
print("R2 :",r2_r)
# training-set adjusted R2 uses the training shape
Adjusted_R2_r=1-(1-r2_r)*(X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)
print("Adjusted R2 :",Adjusted_R2_r)

In [None]:
# Score metrics on the test data
print("Ridge Regression: evaluation metrics on the testing set:")
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score      
MSE_ridge2 = (mean_squared_error(Y_test, y_pred_test_ridge))
print('MSE', MSE_ridge2)
# MAE must be computed on the test set here, not on the training set
MAE_ridge2 =  (mean_absolute_error(Y_test, y_pred_test_ridge))
print('MAE', MAE_ridge2)
RMSE_ridge2 =  np.sqrt(mean_squared_error(Y_test, y_pred_test_ridge))
print("RMSE", RMSE_ridge2)
R2_score_ridge2 = r2_score(Y_test,y_pred_test_ridge)
print("R2_score", R2_score_ridge2)
Adjusted_r2_ridge2 = 1-(1-R2_score_ridge2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2", Adjusted_r2_ridge2)

In [None]:
# creating a dict to concat ridge training and test data score metrics
# storing the Train set metrics value in a dict5 for later comparison
dict5={'Model':'Ridge regression ',
       'MAE':round((MAE_r),2),
       'MSE':round((MSE_r),2),
       'RMSE':round((RMSE_r),2),
       'R2_score':round((r2_r),2),
       'Adjusted R2':round((Adjusted_R2_r),2)
       }
ridge_dict5 =pd.DataFrame(dict5,index=[1])
training_df=pd.concat([training_df,ridge_dict5],ignore_index=True)

# storing the test set metrics value in a dict6 for later comparison
dict6={'Model':'Ridge regression ',
       'MAE':round((MAE_ridge2),2),
       'MSE':round((MSE_ridge2),2),
       'RMSE':round((RMSE_ridge2),2),
       'R2_score':round((R2_score_ridge2),2),
       'Adjusted R2':round((Adjusted_r2_ridge2 ),2)
       }
ridge_dict6 =pd.DataFrame(dict6,index=[1])
test_df=pd.concat([test_df,ridge_dict6],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(10,4))
plt.plot((y_pred_test_ridge))
plt.plot((np.array(Y_test)))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Regularization: Ridge')
plt.show()

In [None]:
### Heteroscedasticity
plt.scatter((y_pred_test_ridge),(Y_test)-(y_pred_test_ridge))

In [None]:
result=pd.concat([ridge_dict5,ridge_dict6],keys=['Training set','Test set'])
result

# ELASTIC NET REGRESSION

In [None]:
#import the packages
from sklearn.linear_model import ElasticNet
#a * L1 + 0.5 * b * L2
#alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
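
The comment above relates two ways of writing the penalty: separate L1 and L2 weights `a` and `b`, or scikit-learn's `alpha` and `l1_ratio`. A small sketch of the conversion in both directions, with hypothetical helper names:

```python
def to_alpha_l1_ratio(a: float, b: float):
    """Convert separate L1 weight a and L2 weight b to (alpha, l1_ratio)."""
    alpha = a + b
    if alpha == 0:
        raise ValueError("a + b must be positive")
    return alpha, a / alpha

def to_a_b(alpha: float, l1_ratio: float):
    """Convert (alpha, l1_ratio) back to the separate weights (a, b)."""
    return alpha * l1_ratio, alpha * (1 - l1_ratio)

# The settings used above: alpha=0.1, l1_ratio=0.5
a, b = to_a_b(0.1, 0.5)
alpha_back, ratio_back = to_alpha_l1_ratio(a, b)
print(a, b, alpha_back, ratio_back)
```

So `alpha=0.1, l1_ratio=0.5` weights the two penalties equally; `l1_ratio=1` would reduce to Lasso and `l1_ratio=0` to a pure L2 penalty.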

In [None]:
#FIT THE MODEL
elasticnet.fit(X_train,Y_train)

In [None]:
#check the score
elasticnet.score(X_train, Y_train)

In [None]:
# predictions for the training and test sets
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e= mean_squared_error((Y_train), (y_pred_train_en))
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(Y_train, y_pred_train_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e= r2_score(Y_train, y_pred_train_en)
print("R2 :",r2_e)

# training-set adjusted R2 uses the training shape
Adjusted_R2_e=1-(1-r2_e)*(X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)
print("Adjusted R2 :",Adjusted_R2_e)

In [None]:
# Score metrics on the test data
print("Elastic net regression: evaluation metrics on the testing set:")
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score      
# separate *_e2 names keep the training metrics above from being overwritten
MSE_e2 = (mean_squared_error(Y_test, y_pred_test_en))
print('MSE:', MSE_e2)

MAE_e2 =  (mean_absolute_error(Y_test, y_pred_test_en))
print('MAE', MAE_e2)

RMSE_e2 =  np.sqrt(mean_squared_error(Y_test, y_pred_test_en))
print("RMSE", RMSE_e2)

R2_e2 = r2_score(Y_test,y_pred_test_en)
print("R2_score", R2_e2)

Adjusted_r2_e2 = 1-(1-R2_e2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2:", Adjusted_r2_e2)

In [None]:
# storing the train set metrics value in a dict7 for later comparison
dict7={'Model':'Elastic net regression',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}

e_dict7 =pd.DataFrame(dict7,index=[1])
training_df=pd.concat([training_df,e_dict7],ignore_index=True)

# storing the test set metrics value in a dict8 for later comparison
dict8={'Model':'Elastic net regression',
       'MAE':round((MAE_e2),3),
       'MSE':round((MSE_e2),3),
       'RMSE':round((RMSE_e2),3),
       'R2_score':round((R2_e2),3),
       'Adjusted R2':round((Adjusted_r2_e2 ),2)
       }
e_dict8 =pd.DataFrame(dict8,index=[1])
test_df=pd.concat([test_df,e_dict8],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(10,4))
plt.plot(np.array(y_pred_test_en))
plt.plot((np.array(Y_test)))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Regularization: Elastic net regression ')
plt.show()

In [None]:
### Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_en),(Y_test)-(y_pred_test_en))

In [None]:
result=pd.concat([e_dict7,e_dict8],keys=['Training set','Test set'])
result
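
The adjusted R2 reported in the metric cells penalises R2 for the number of predictors: $1-(1-R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of rows and $p$ the number of feature columns of the split being scored. A minimal sketch of that formula follows; the helper name `adjusted_r2` and the numbers passed to it are illustrative assumptions, not part of the notebook or its results.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# hypothetical numbers, not results from this notebook
example = adjusted_r2(r2=0.90, n_samples=101, n_features=10)
print(round(example, 4))
```

For a training score the shape of `X_train` would be passed, and for a test score the shape of `X_test`, so that $n$ and $p$ always describe the data the R2 was computed on.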

## **Decision Tree**

In [None]:
# importing decision tree regressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
# decision tree regressor with max depth 12
dt_model = DecisionTreeRegressor(max_depth = 12)
# fit the decision tree on the training data
dt_model.fit(X_train,Y_train)

In [None]:
#Y_pred for training and testing dataset
Y_pred1_dt = dt_model.predict(X_train)
Y_pred2_dt = dt_model.predict(X_test)

In [None]:
# decision tree score
dt_model.score(X_train , Y_train )

In [None]:
#Score metrics on train data
print("Decision Tree: evaluation metrics on the training set:")
#importing metrics for training data
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
MSE_dt1 = mean_squared_error(Y_train, Y_pred1_dt)
print("MSE", MSE_dt1)
MAE_dt1 = mean_absolute_error(Y_train, Y_pred1_dt)
print('MAE', MAE_dt1)
RMSE_dt1 =  np.sqrt(MSE_dt1)
print("RMSE", RMSE_dt1)
R2_score_dt1 = r2_score(Y_train, Y_pred1_dt)
print("R2_score", R2_score_dt1)
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
Adjusted_r2_dt1 = 1-(1-R2_score_dt1)*(X_train.shape[0]-1)/(X_train.shape[0]-1-X_train.shape[1])
print("Adjusted_R2", Adjusted_r2_dt1)
print()

In [None]:
#Score metrics on test data
print("Decision Tree: evaluation metrics on the testing set:")
#importing metrics for test data
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
MSE_dt2 = mean_squared_error(Y_test, Y_pred2_dt)
print("MSE", MSE_dt2)
MAE_dt2 = mean_absolute_error(Y_test, Y_pred2_dt)
print('MAE', MAE_dt2)
RMSE_dt2 =  np.sqrt(MSE_dt2)
print("RMSE", RMSE_dt2)
R2_score_dt2 = r2_score(Y_test, Y_pred2_dt)
print("R2_score", R2_score_dt2)
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n and p taken from the test set
Adjusted_r2_dt2 = 1-(1-R2_score_dt2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2", Adjusted_r2_dt2)

In [None]:
plt.figure(figsize=(10,5))
plt.plot((Y_pred2_dt), color = 'k')
plt.plot(np.array(Y_test), color = 'c')
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Decision Tree')
plt.show()

In [None]:
# creating dicts to collect the decision tree training and test score metrics
# storing the train set metrics in dict9 for later comparison
dict9={'Model':'Decision Tree ',
       'MAE':round((MAE_dt1),2),
       'MSE':round((MSE_dt1),2),
       'RMSE':round((RMSE_dt1),2),
       'R2_score':round((R2_score_dt1),2),
       'Adjusted R2':round((Adjusted_r2_dt1),2)
       }
dt_dict9 =pd.DataFrame(dict9,index=[1])
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
training_df=pd.concat([training_df,dt_dict9],ignore_index=True)

# storing the test set metrics in dict10 for later comparison
dict10={'Model':'Decision Tree',
       'MAE':round((MAE_dt2),2),
       'MSE':round((MSE_dt2),2),
       'RMSE':round((RMSE_dt2),2),
       'R2_score':round((R2_score_dt2),2),
       'Adjusted R2':round((Adjusted_r2_dt2 ),2)
       }
dt_dict10 =pd.DataFrame(dict10,index=[1])
test_df=pd.concat([test_df,dt_dict10],ignore_index=True)

In [None]:
result=pd.concat([dt_dict9,dt_dict10],keys=['Training set','Test set'])
result
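
Each model section recomputes the same metrics by hand, which makes it easy to mix up variables between models or between the training and test splits. One possible way to keep the calculation in a single place is sketched below with plain NumPy; the function name `regression_metrics` and the toy values are assumptions made for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2_score": 1 - ss_res / ss_tot,
    }


# toy values for illustration only
toy = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(toy)
```

Calling the helper once with the training predictions and once with the test predictions would give two separate dictionaries, so neither split can overwrite the other.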

# **Random Forest**

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

In [None]:
# creating a param dict to try random forest with different parameter values through grid search
n_estimators=[60,80,100]
max_depth=[15,20]
max_leaf_nodes=[60,80]
max_features = [0.2, 0.5, 0.8 ]
params = {'n_estimators':n_estimators,'max_depth':max_depth ,'max_leaf_nodes':max_leaf_nodes, "max_features" : max_features }

In [None]:
# creating rf_grid to run the random forest model with grid search
from sklearn.model_selection import GridSearchCV
rf_grid= GridSearchCV(rf,param_grid=params,verbose=0, n_jobs = -1)

In [None]:
rf_grid.fit(X_train, Y_train)

In [None]:
# to see the best parameters
rf_grid.best_params_

In [None]:
rf_grid.best_score_
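
To get a feel for how much work the grid search above does, the candidate combinations can be enumerated by hand. This is only a sketch of the bookkeeping, not of scikit-learn's internals; it assumes the default 5-fold cross-validation, since no `cv` argument is passed to `GridSearchCV` in the random forest cell.

```python
from itertools import product

# same grid as the random forest search above
params = {
    "n_estimators": [60, 80, 100],
    "max_depth": [15, 20],
    "max_leaf_nodes": [60, 80],
    "max_features": [0.2, 0.5, 0.8],
}

# every combination of one value per parameter is one candidate model
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 5  # assumption: scikit-learn's default when cv is not passed
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

Adding another value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.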

In [None]:
# random forest with the best parameters from the grid search (values hard-coded; max_features is left at its default)
rf = RandomForestRegressor(max_depth = 20, max_leaf_nodes = 80, n_estimators =  80)

In [None]:
# fitting x-train and y-train
rf.fit(X_train, Y_train)

In [None]:
# predictions and score
rf_y_pred1 = rf.predict(X_train)
rf_y_pred2 = rf.predict(X_test)
rf.score(X_train, Y_train)

In [None]:
#Score metrics on train data
print("Random Forest: evaluation metrics on the training set:")
#importing metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
MSE_rf1 = mean_squared_error(Y_train, rf_y_pred1)
print("MSE", MSE_rf1)
MAE_rf1 = mean_absolute_error(Y_train, rf_y_pred1)
print('MAE', MAE_rf1)
RSME_rf1 = np.sqrt(MSE_rf1)
print("RMSE", RSME_rf1)
R2_score_rf1 = r2_score(Y_train, rf_y_pred1)
print("R2score", R2_score_rf1)
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
Adjusted_r2_rf1 = 1-(1-R2_score_rf1)*(X_train.shape[0]-1)/(X_train.shape[0]-1-X_train.shape[1])
print("Adjusted_R2", Adjusted_r2_rf1)
print()


#Score metrics on test data
print("Random Forest: evaluation metrics on the testing set:")
MSE_rf2 = mean_squared_error(Y_test, rf_y_pred2)
print("MSE", MSE_rf2)
MAE_rf2 = mean_absolute_error(Y_test, rf_y_pred2)
print('MAE', MAE_rf2)
RSME_rf2 = np.sqrt(MSE_rf2)
print("RMSE", RSME_rf2)
R2_score_rf2 = r2_score(Y_test, rf_y_pred2)
print("R2score", R2_score_rf2)
Adjusted_r2_rf2 = 1-(1-R2_score_rf2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2", Adjusted_r2_rf2)


In [None]:
# creating dicts to collect the random forest training and test score metrics
# storing the train set metrics in dict9 for later comparison
dict9={'Model':'Random Forest ',
       'MAE':round((MAE_rf1),2),
       'MSE':round((MSE_rf1),2),
       'RMSE':round((RSME_rf1),2),
       'R2_score':round((R2_score_rf1),2),
       'Adjusted R2':round((Adjusted_r2_rf1),2)
       }
rf_dict9 =pd.DataFrame(dict9,index=[1])
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
training_df=pd.concat([training_df,rf_dict9],ignore_index=True)

# storing the test set metrics in dict10 for later comparison
dict10={'Model':'Random Forest ',
       'MAE':round((MAE_rf2),2),
       'MSE':round((MSE_rf2),2),
       'RMSE':round((RSME_rf2),2),
       'R2_score':round((R2_score_rf2),2),
       'Adjusted R2':round((Adjusted_r2_rf2 ),2)
       }
rf_dict10 =pd.DataFrame(dict10,index=[1])
test_df=pd.concat([test_df,rf_dict10],ignore_index=True)

In [None]:
plt.figure(figsize=(10,5))
plt.plot((rf_y_pred2), color = 'k')
plt.plot(np.array(Y_test), color = 'c')
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Random Forest')
plt.show()

In [None]:
result=pd.concat([rf_dict9,rf_dict10],keys=['Training set','Test set'])
result

In [None]:
rf.feature_importances_

In [None]:
importances = rf.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
#FIT THE MODEL
rf.fit(X_train,Y_train)

In [None]:
features = X_train.columns
importances = rf.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plot the figure
plt.figure(figsize=(10,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()
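
The plot above relies on `np.argsort` to order the bars. Because `plt.barh` draws position 0 at the bottom, sorting the importances in ascending order places the most important feature at the top of the chart. A small sketch of that ordering step, using hypothetical importance values rather than the ones produced by this notebook:

```python
import numpy as np

# hypothetical importances, not values produced by this notebook
features = np.array(["Hour", "Temperature", "Humidity"])
importances = np.array([0.10, 0.60, 0.30])

indices = np.argsort(importances)              # positions in ascending order of importance
ordered_features = features[indices].tolist()  # labels for the y-axis, bottom to top
ordered_values = importances[indices].tolist()
print(list(zip(ordered_features, ordered_values)))
```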

# GRADIENT BOOSTING

In [None]:
#import the packages
from sklearn.ensemble import GradientBoostingRegressor
# Create an instance of the GradientBoostingRegressor
gb =GradientBoostingRegressor()


In [None]:
# creating a param dict to try different parameter values
n_estimators=[100,150, 180]
max_depth=[5, 8, 10, 12]

params = {'n_estimators':n_estimators,'max_depth':max_depth}

In [None]:
#grid search for gradient boosting
from sklearn.model_selection import GridSearchCV
gb_grid= GridSearchCV(gb,param_grid=params,verbose=0)

In [None]:
# fitting x-train and y-train
gb_grid.fit(X_train, Y_train)

In [None]:
gb_grid.best_params_

In [None]:
gb_grid.best_score_

In [None]:
#creating model of Gradient Boosting
gb =GradientBoostingRegressor(max_depth = 8, n_estimators = 180)
gb.fit(X_train, Y_train)

In [None]:
# predictions and score
gb_y_pred1 = gb.predict(X_train)
gb_y_pred2 = gb.predict(X_test)
gb.score(X_train, Y_train)

In [None]:
#Score metrics on train data
print("Gradient Boosting: evaluation metrics on the training set:")
#importing metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
MSE_gb1 = mean_squared_error(Y_train, gb_y_pred1)
print("MSE", MSE_gb1)
MAE_gb1 = mean_absolute_error(Y_train, gb_y_pred1)
print('MAE', MAE_gb1)
RSME_gb1 = np.sqrt(MSE_gb1)
print("RMSE", RSME_gb1)
R2_score_gb1 = r2_score(Y_train, gb_y_pred1)
print("R2score", R2_score_gb1)
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
Adjusted_r2_gb1 = 1-(1-R2_score_gb1)*(X_train.shape[0]-1)/(X_train.shape[0]-1-X_train.shape[1])
print("Adjusted_R2", Adjusted_r2_gb1)
print()


#Score metrics on test data
print("Gradient Boosting: evaluation metrics on the testing set:")
MSE_gb2 = mean_squared_error(Y_test, gb_y_pred2)
print("MSE", MSE_gb2)
MAE_gb2 = mean_absolute_error(Y_test, gb_y_pred2)
print('MAE', MAE_gb2)
RSME_gb2 = np.sqrt(MSE_gb2)
print("RMSE", RSME_gb2)
R2_score_gb2 = r2_score(Y_test, gb_y_pred2)
print("R2score", R2_score_gb2)
Adjusted_r2_gb2 = 1-(1-R2_score_gb2)*(X_test.shape[0]-1)/(X_test.shape[0]-1-X_test.shape[1])
print("Adjusted_R2", Adjusted_r2_gb2)

In [None]:
# creating dicts to collect the gradient boosting training and test score metrics
# storing the train set metrics in dict11 for later comparison
dict11={'Model':'Gradient Boosting',
       'MAE':round((MAE_gb1),2),
       'MSE':round((MSE_gb1),2),
       'RMSE':round((RSME_gb1),2),
       'R2_score':round((R2_score_gb1),2),
       'Adjusted R2':round((Adjusted_r2_gb1),2)
       }
gb_dict11 =pd.DataFrame(dict11,index=[1])
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
training_df=pd.concat([training_df,gb_dict11],ignore_index=True)

# storing the test set metrics in dict12 for later comparison
dict12={'Model':'Gradient Boosting',
       'MAE':round((MAE_gb2),2),
       'MSE':round((MSE_gb2),2),
       'RMSE':round((RSME_gb2),2),
       'R2_score':round((R2_score_gb2),2),
       'Adjusted R2':round((Adjusted_r2_gb2 ),2)
       }
gb_dict12 =pd.DataFrame(dict12,index=[1])
test_df=pd.concat([test_df,gb_dict12],ignore_index=True)

In [None]:
plt.figure(figsize=(10,5))
plt.plot((gb_y_pred2), color = 'r')
plt.plot(np.array(Y_test), color = 'y')
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.title('Gradient Boosting')
plt.show()

In [None]:
result=pd.concat([gb_dict11,gb_dict12],keys=['Training set','Test set'])
result

In [None]:
gb.feature_importances_

In [None]:
importances = gb.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.head()

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
gb.fit(X_train,Y_train)

In [None]:
features = X_train.columns
importances = gb.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

In [None]:
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# Hyperparameter grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

In [None]:
param_dict

In [None]:
from sklearn.model_selection import GridSearchCV
# Create an instance of the GradientBoostingRegressor
gb_model = GradientBoostingRegressor()

# Grid search
gb_grid = GridSearchCV(estimator=gb_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2)

gb_grid.fit(X_train,Y_train)

In [None]:
gb_optimal_model = gb_grid.best_estimator_

In [None]:
gb_grid.best_params_

In [None]:
# Making predictions on train and test data
y_pred_train_g_g = gb_optimal_model.predict(X_train)
y_pred_g_g= gb_optimal_model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

print("Model Score:",gb_optimal_model.score(X_train,Y_train))
print("Grid search CV: evaluation metrics on the training set:")
# training metrics get a _train suffix so the test metrics below do not overwrite them
MSE_gbh_train= mean_squared_error(Y_train, y_pred_train_g_g)
print("MSE :",MSE_gbh_train)

RMSE_gbh_train=np.sqrt(MSE_gbh_train)
print("RMSE :",RMSE_gbh_train)

MAE_gbh_train= mean_absolute_error(Y_train, y_pred_train_g_g)
print("MAE :",MAE_gbh_train)

r2_gbh_train= r2_score(Y_train, y_pred_train_g_g)
print("R2 :",r2_gbh_train)
# n and p come from X_train because these are training metrics
Adjusted_R2_gbh_train = 1-(1-r2_gbh_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_gbh_train)

print("Grid search CV: evaluation metrics on the testing set:")
MSE_gbh= mean_squared_error(Y_test, y_pred_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)

MAE_gbh= mean_absolute_error(Y_test, y_pred_g_g)
print("MAE :",MAE_gbh)

r2_gbh= r2_score(Y_test, y_pred_g_g)
print("R2 :",r2_gbh)
Adjusted_R2_gbh = 1-(1-r2_gbh)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_gbh)

In [None]:
# storing the training set metrics in a dataframe for later comparison
dict13={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh_train),3),
       'MSE':round((MSE_gbh_train),3),
       'RMSE':round((RMSE_gbh_train),3),
       'R2_score':round((r2_gbh_train),3),
       'Adjusted R2':round((Adjusted_R2_gbh_train ),2)
      }
cv_dict13 =pd.DataFrame(dict13,index=[1])
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
training_df=pd.concat([training_df,cv_dict13],ignore_index=True)

# storing the test set metrics in a dataframe for later comparison
dict14={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
cv_dict14 =pd.DataFrame(dict14,index=[1])
test_df=pd.concat([test_df,cv_dict14],ignore_index=True)


In [None]:
result=pd.concat([cv_dict13,cv_dict14],keys=['Training set','Test set'])
result

In [None]:
### Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_g_g),(Y_test)-(y_pred_g_g))
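
A residual scatter like the one above is read by eye: a funnel shape suggests that the error variance changes with the predicted value. As a rough numeric companion, the spread of the residuals can be compared between the smaller and the larger predictions. The sketch below is an informal heuristic and not a formal statistical test; the function name `residual_spread_ratio` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def residual_spread_ratio(y_true, y_pred):
    """Std of residuals for the larger predictions divided by the std for the smaller ones."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    order = np.argsort(y_pred)           # sort rows by predicted value
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return float(np.std(high) / np.std(low))


# synthetic data whose error grows with the prediction (illustration only)
pred = np.arange(1.0, 11.0)
signs = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
growing = residual_spread_ratio(pred + signs * 0.1 * pred, pred)

# synthetic data with a constant error size
constant = residual_spread_ratio(pred + signs * 0.5, pred)
print(round(growing, 2), round(constant, 2))
```

A ratio close to 1 is what a roughly constant error variance would look like, while a ratio well above or below 1 points to the funnel shape. The helper divides by the spread of the lower half, so it is undefined when those residuals are all identical.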

In [None]:
importances = gb_optimal_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
gb_optimal_model.feature_importances_

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.head()

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

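In [None]:
# plot the importances of the tuned model; gb_optimal_model was already fitted by the grid search,
# whereas gb_model is the untuned default regressor
features = X_train.columns
importances = gb_optimal_model.feature_importances_
indices = np.argsort(importances)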

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

# Evaluation Matrix

In [None]:
result=pd.concat([training_df,test_df],keys=['Training set','Test set'])
result

# **Observation**

During our analysis, we first performed EDA on all the features of the dataset. We analysed the dependent variable, 'Rented Bike Count', and transformed it. Next we analysed the categorical variables and dropped those dominated by a single class. We also analysed the numerical variables, examining their correlations, distributions, and relationships with the dependent variable. Finally, we removed some numerical features whose values were mostly 0 and one-hot encoded the categorical variables.

Next we implemented 7 machine learning algorithms: Linear Regression, Lasso, Ridge, Elastic Net, Decision Tree, Random Forest and Gradient Boosting.

There is a high correlation among the independent variables, specifically temperature.

Temperature, Wind Speed, Solar Radiation and Visibility are positively correlated with the target variable.

In general, people rented bikes during commuting hours, i.e. from 7am to 9am in the morning and from 5pm to 7pm in the evening.

Demand for bikes is higher on weekdays than on weekends.

Summer was the most popular season of the year, with a very high rental count.


After evaluating the various models, Random Forest and Gradient Boosting were found to be the best models for bike sharing demand prediction, since both show lower error metrics (MSE, RMSE) and higher R2 and adjusted R2 values than the other models.


# Final Conclusion

We can use either the Random Forest or the Gradient Boosting model for the bike rental stations, since both regressors tuned with Grid Search CV gave us the best results, and either can therefore be deployed for our predictions. Also, as this data is time dependent, the values of the variables will not always be consistent, so we need to keep checking the models regularly.
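
One simple way to act on that last point is to compare the error on recent data with the error measured at training time and to flag the model when it drifts too far. The sketch below is only an illustration of the idea: the function names, the 1.25 tolerance factor and all numbers are hypothetical and are not taken from this notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def needs_retraining(actual, predicted, baseline_rmse, tolerance=1.25):
    """Flag the model when the recent RMSE exceeds the baseline by the given factor."""
    return rmse(actual, predicted) > tolerance * baseline_rmse


# hypothetical recent observations and the model's predictions for them
recent_actual = [120, 340, 560, 610]
recent_pred = [100, 300, 500, 700]
flag = needs_retraining(recent_actual, recent_pred, baseline_rmse=40.0)
print(flag)
```

In practice the baseline would be the test RMSE of the deployed model, and the check would be run on each new batch of hourly rental counts.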