<a href="https://colab.research.google.com/github/Aakansha-Bansal/Capstone-Project--2--Regression/blob/main/Bike_sharing_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Seoul Bike Sharing Demand Prediction**



##### **Project Type**    - **Regression**
##### **Contribution**    - **Individual**


# **Project Summary -**

### This dataset has a total of 8760 rows and 14 columns.

### Count of Rented Bike : Dependent variable

### Normally distributed attributes: - Temperature - Humidity.

### Positively correlated variables to the bike rent are : - Temperature - Dew Point Temperature - Solar Radiation - Hour

### Negatively correlated variables are: - Humidity - Rainfall - Weekdays or Weekends

### The number of bikes rented is on average higher during the evening rush hours, i.e. from 6 p.m. to 8 p.m.

### The rented bike count is highest during the summer and lowest during the winter.

### The rented bike count is higher on working days than on non-working days.

### On non-functioning days, no bikes are rented in any instance in the data.

### The average number of bikes rented remains roughly constant from Monday to Saturday and dips on Sunday; on average, the rented bike count is lower on weekends than on weekdays.

### On regular days, the demand for bikes is higher during rush hours. On holidays or weekends, the demand is comparatively lower in the mornings and higher in the afternoons.

### After the EDA, I implemented 7 machine learning algorithms: Linear Regression, Lasso, Ridge, ElasticNet, Decision Tree, Random Forest and XGBoost. I also tuned the hyperparameters to improve model performance.




# **GitHub Link -**

https://github.com/Aakansha-Bansal/Capstone-Project--2--Regression

# **Problem Statement**


### Rental bikes have been introduced in many cities to make urban mobility more comfortable. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that the supply of rental bikes stays stable.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
bike_df=pd.read_csv('/content/drive/MyDrive/SeoulBikeData.csv',encoding ='latin')

### Dataset First View

In [None]:
# Dataset First Look
bike_df

In [None]:
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(bike_df[bike_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

### What did you know about your dataset?

### The dataset comes from a bike rental service, and we have to analyse the rental data and the insights behind it.

### The dataset has 8760 rows and 14 columns. Thankfully, it has neither duplicate values nor null values.

### A day has 24 hours and a year has 365 days, so 365*24 = 8760, which matches the number of rows: the dataset holds one row per hour for a whole year.
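The row-count argument above can be checked with a few lines of standard-library Python. This is only a sketch: it assumes one record per hour between the first and last dates of this dataset (01/12/2017 and 30/11/2018), and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# First and last hourly records, assuming one row per hour
start = datetime(2017, 12, 1, 0)
end = datetime(2018, 11, 30, 23)

# Number of hourly steps between start and end, counting both ends
n_hours = int((end - start) / timedelta(hours=1)) + 1
print(n_hours)  # 8760, i.e. 365 days * 24 hours
```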


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

### Changing the column names
### Some of the column names in the dataset are long and clumsy, so we rename them to simpler names; this doesn't affect our end results.

In [None]:
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# Dataset Describe
bike_df.describe().T

### Variables Description 

### **Date** : The date of the day, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type : str; we need to convert it into datetime format.

### **Rented_Bike_Count** : Number of bikes rented per hour; this is our dependent variable, which we need to predict, type : int

### **Hour**: The hour of the day, from 0 to 23 (24-hour format), type : int; we need to convert it into the category data type.

### **Temperature**: Temperature in Celsius, type : Float

### **Humidity** : Humidity in the air in %, type : int

### **Wind speed** : Speed of the wind in m/s, type : Float

### **Visibility** : Visibility in units of 10 m, type : int

### **Dew point temperature** : Temperature in Celsius to which the air must cool to become saturated with water vapour, type : Float

### **Solar Radiation** : Solar radiation in MJ/m2, type : Float

### **Rainfall** : Amount of rain in mm, type : Float

### **Snowfall** : Amount of snow in cm, type : Float

### **Seasons** : Season of the year, type : str; there are only 4 seasons in the data.

### **Holiday** : Whether the day is a holiday or not, type: str

### **Functioning Day** : Whether the day is a functioning day or not, type : str

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

### pandas reads the "Date" column as an object type, i.e. as a string. Since the date is important for analysing user behaviour, we convert it to datetime format and then derive 3 columns from it: 'year', 'month' and 'day'.

In [None]:
# Write your code to make your dataset analysis ready.
bike_df['Date'] = bike_df['Date'].apply(lambda x:  dt.datetime.strptime(x,"%d/%m/%Y"))

In [None]:
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
# create a new column "weekdays_weekend" (1 = Saturday/Sunday, 0 = weekday)
# and drop the columns "Date", "day" and "year"
bike_df['weekdays_weekend']=bike_df['day'].apply(lambda x : 1 if x=='Saturday' or x=='Sunday' else 0 )
bike_df=bike_df.drop(columns=['Date','day','year'])

### What all manipulations have you done and insights you found?

### We split the "Date" column into 3 columns: "year", "month" and "day".
### The "year" column contains only 2 unique values, because the data runs from December 2017 to November 2018. If we treat this period as a single year, the "year" column carries no information, so we drop it.
### The "day" column contains the name of the weekday. For our purposes we don't need each individual day; we only need to know whether a day is a weekday or a weekend, so we encode that in "weekdays_weekend" and drop the "day" column.
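The date handling described above can be illustrated on a tiny, hypothetical sample. This sketch uses `pd.to_datetime` and `dt.dayofweek` rather than the `strptime` and `day_name()` calls of the notebook; the frame `sample` and its four dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of raw "Date" strings in DD/MM/YYYY format
sample = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]})

# Parse the strings into datetimes in a single vectorised call
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
sample["month"] = sample["Date"].dt.month
sample["weekdays_weekend"] = (sample["Date"].dt.dayofweek >= 5).astype(int)
print(sample)
```

Here 01/12/2017 is a Friday, so only the second and third rows are flagged as weekend.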

In [None]:
bike_df

In [None]:
bike_df.head()

In [None]:
bike_df['weekdays_weekend'].value_counts()

### The "Hour", "month" and "weekdays_weekend" columns are stored as integers, but they are really categorical. We change their data type to category; otherwise later analysis and correlations would treat them as continuous quantities, which could mislead us.
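A small sketch of the effect of this conversion, on a hypothetical mini-frame `demo` (the values are invented): once a column is of category dtype, pandas no longer selects it as a numeric feature.

```python
import pandas as pd

# Hypothetical mini-frame with an integer-coded hour column
demo = pd.DataFrame({"Hour": [0, 8, 18, 23],
                     "Rented_Bike_Count": [120, 900, 1500, 300]})

demo["Hour"] = demo["Hour"].astype("category")

# Category columns are no longer picked up as numeric features
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(demo.dtypes)
print(numeric_cols)
```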

In [None]:
#Change the int64 columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:
#let's check the result of data type
bike_df.info()

In [None]:
bike_df.columns

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

### Which month has the most bikes on rent?

In [None]:
# Chart - 1 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.barplot(data=bike_df,x='month',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to month')

##### 1. Why did you pick the specific chart?

### A bar plot shows the average Rented_Bike_Count for each level of a categorical variable (here, month), with error bars indicating the uncertainty of each average, which makes the months easy to compare.
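By default, seaborn's `barplot` draws the mean of `y` within each category of `x`. The bar heights can therefore be reproduced with a plain groupby, sketched here on a hypothetical mini-frame (the numbers are invented, not taken from the dataset).

```python
import pandas as pd

# Hypothetical hourly records for two months
demo = pd.DataFrame({
    "month": [1, 1, 6, 6, 6],
    "Rented_Bike_Count": [100, 300, 1000, 1200, 1400],
})

# Mean rented bike count per month: the quantity each bar represents
monthly_mean = demo.groupby("month")["Rented_Bike_Count"].mean()
print(monthly_mean)
```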

##### 2. What is/are the insight(s) found from the chart?

### We can clearly conclude from the graph that summer is the busiest season for rented bikes and June is the month with the highest demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### This plot provides the positive insight that customers are more likely to rent bikes in the summer months.

#### Chart - 2

### On which days are more bikes rented?

In [None]:
# Chart - 2 visualization code
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=bike_df,x='weekdays_weekend',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

### This chart gives a clear visualization of the days on which bike demand is higher.

##### 2. What is/are the insight(s) found from the chart?

### From the above bar plot we can say that the demand for bikes is higher on weekdays, which are shown in blue.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### The insight is positive: weekdays are the busier days for bike demand, most likely because they are office (work) days.

#### Chart - 3

### At which time is the demand high?

In [None]:
# Chart - 3 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented bikes per hour according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

### A point plot with one line per group gives a clear hour-by-hour comparison between weekdays and weekends.

##### 2. What is/are the insight(s) found from the chart?

### Peak times are 7 am to 9 am and 5 pm to 7 pm. The orange colour represents weekend days: demand for rented bikes is very low on weekends, especially in the morning hours, and increases slightly from 4 pm to 8 pm.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### The insights are positive: they tell us at which times to advertise so as to attract more customers.

#### Chart - 4

### Is the demand for bikes higher on Functioning Days or on Non-Functioning Days?

In [None]:
# Chart - 4 visualization code
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=bike_df,x='Functioning_Day',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Functioning Day')

##### 1. Why did you pick the specific chart?

### This chart compares the average rented bike count on functioning and non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

### The above bar plot shows the use of rented bikes on functioning and non-functioning days. We can clearly see that people don't use rented bikes on non-functioning days; in other words, people use rented bikes on functioning days only.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### The insight is that functioning days are the busy days for rented bikes.

#### Chart - 5

### Which season is the busiest season for bike demand?

In [None]:
# Chart - 5 visualization code
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(data=bike_df,x='Seasons',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Seasons')

##### 1. Why did you pick the specific chart?



### This chart provides a clear comparison among seasons.

##### 2. What is/are the insight(s) found from the chart?

### The bar plot shows the use of rented bikes in the four seasons. It is clearly seen that the use of rented bikes is highest in summer and very low in winter.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### We got the insight that the season with the highest demand is Summer, followed by Autumn, Spring and Winter.

#### Chart - 6

### Seasons

In [None]:
# Chart - 6 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Seasons',ax=ax)
ax.set(title='Count of Rented bikes per hour according to Seasons')

##### 1. Why did you pick the specific chart?

### The point plot shows the hourly rented bike count for each of the four seasons.

##### 2. What is/are the insight(s) found from the chart?

### In the summer season the use of rented bikes is high, and the peak times are 7am-9am and 5pm-7pm.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### Summer is the season with the highest demand for rented bikes, and winter is the season with very low demand, likely because of snowfall.


#### Chart - 7

### Is the demand for bikes higher on holidays or on normal days?

In [None]:
# Chart - 7 visualization code
fig,ax=plt.subplots(figsize=(8,8))
sns.barplot(data=bike_df,x='Holiday',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Holiday')

##### 1. Why did you pick the specific chart?

### This chart gives a clear comparison of bike demand on Holiday and No Holiday days.

##### 2. What is/are the insight(s) found from the chart?

### The above bar plot shows the use of rented bikes on holidays and on normal days. The chart clearly shows that more people use rented bikes on normal days than on holidays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### Yes, this chart provides the positive insight that the demand for rented bikes is higher on No Holiday days.

#### Chart - 8

### Bikes count per hour

In [None]:
# Chart - 8 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.barplot(data=bike_df,x='Hour',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Hour')

##### 1. Why did you pick the specific chart?

### This chart shows the average count of rented bikes for each hour.

##### 2. What is/are the insight(s) found from the chart?

### The above plot shows the use of rented bikes by hour, using data from the whole year. People generally use rented bikes during commuting hours, from 7am to 9am and from 5pm to 7pm. 6 p.m. is the hour with the highest demand for rented bikes.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### The insight is that commuting hours are the busiest hours for rented bikes.

#### Chart - 9

### Is the hourly demand for rented bikes higher on Functioning Days or on Non-Functioning Days?

In [None]:
# Chart - 9 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Functioning_Day',ax=ax)
ax.set(title='Count of Rented bikes per hour according to Functioning Day')

##### 1. Why did you pick the specific chart?

### This chart gives a clear hour-by-hour comparison of functioning and non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

### This plot shows the use of rented bikes on functioning and non-functioning days, and it clearly shows that people don't use rented bikes on non-functioning days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

### The insight is that people use rented bikes on functioning days only.

### **Analyze Numerical variables**

### **What is numerical data?**
### Numerical data is a data type expressed in numbers rather than in language description. Numerical data is always collected in number form.

### Analysis of numerical variables with distribution plots

In [None]:
# Select your features wisely to avoid overfitting
numerical_columns=list(bike_df.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

In [None]:
for col in numerical_features:
  plt.figure(figsize=(10,6))
  # histplot with a KDE overlay replaces the deprecated sns.distplot
  sns.histplot(bike_df[col], kde=True, stat='density')
  plt.xlabel(col)
  plt.show()

### Numerical vs.Rented_Bike_Count

In [None]:
#relationship between "Rented_Bike_Count" and "Temperature" 
# select the target column before aggregating so non-numeric columns are not averaged
bike_df.groupby('Temperature')['Rented_Bike_Count'].mean().plot()

### This graph shows that the demand for rented bikes increases as the temperature increases. People like to ride bikes when it is fairly warm, around 25°C on average.

In [None]:
#relationship between "Rented_Bike_Count" and "Dew_point_temperature" 
bike_df.groupby('Dew_point_temperature')['Rented_Bike_Count'].mean().plot()

### This plot shows that the relationship with dew point temperature is almost the same as with temperature.

In [None]:
#relationship between "Rented_Bike_Count" and "Solar_Radiation"
bike_df.groupby('Solar_Radiation')['Rented_Bike_Count'].mean().plot()

### From the above plot we see that the number of rented bikes is high when there is solar radiation; the average count is around 1000.


In [None]:
#relationship between "Rented_Bike_Count" and "Snowfall" 
bike_df.groupby('Snowfall')['Rented_Bike_Count'].mean().plot()

### This graph clearly shows that when snowfall increases, the demand for rented bikes decreases.

In [None]:
#relationship between "Rented_Bike_Count" and "Rainfall" 
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot()

### In the above plot the demand for rented bikes does not decrease steadily as rainfall increases; for example, there is a big peak in rented bikes at 20 mm of rain. Such peaks may rest on very few observations, so they should be read with caution.


In [None]:
#relationship between "Rented_Bike_Count" and "Wind_speed" 
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

### This plot shows that the demand for rented bikes is roughly uniform across wind speeds, but at a wind speed of 7 m/s the count of rented bikes increases. This may suggest that people like to ride bikes when it is a little windy.

### **Regression plot**

### Regression plots in seaborn are primarily used to add a visual guide that helps to emphasize patterns in a dataset during EDA. As the name suggests, a regression plot draws a regression line between 2 variables and helps to visualize their linear relationship.
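The straight line in such a plot is a least-squares fit. A minimal sketch of that fit with `np.polyfit`, on synthetic data (the feature `temperature`, the target `rentals` and the coefficients 400 and 25 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(-15, 35, size=200)                      # synthetic feature
rentals = 400 + 25 * temperature + rng.normal(0, 50, size=200)    # synthetic target

# A degree-1 polynomial fit is the least-squares straight line
slope, intercept = np.polyfit(temperature, rentals, deg=1)
print(slope, intercept)
```

A positive slope corresponds to an upward-sloping line in the plot, a negative slope to a downward-sloping one.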

In [None]:
#printing the regression plot for all the numerical features
for col in numerical_features:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=bike_df[col],y=bike_df['Rented_Bike_Count'],scatter_kws={"color": 'orange'}, line_kws={"color": "black"})

### From the above regression plots of all numerical features we see that Temperature, Wind_speed and Solar_Radiation are positively related to the target variable, which means the rented bike count increases as these features increase.
### Rainfall, Snowfall and Humidity are negatively related to the target variable, which means the rented bike count decreases as these features increase.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

### Thankfully, there is no missing values in this dataset.

### 2. Handling Outliers

### Normalise Rented_Bike_Count column

### Normalising here means transforming the target variable so that its distribution is closer to a normal distribution. We first look at the distribution of Rented_Bike_Count and then apply a transformation to reduce its skewness before modelling.

In [None]:
#Distribution plot of Rented Bike Count
plt.figure(figsize=(10,6))
# histplot with a KDE overlay replaces the deprecated sns.distplot
ax=sns.histplot(bike_df['Rented_Bike_Count'], kde=True, stat='density', color="y")
ax.set(xlabel='Rented_Bike_Count', ylabel='Density')
# dashed lines mark the mean (magenta) and the median (black)
ax.axvline(bike_df['Rented_Bike_Count'].mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(bike_df['Rented_Bike_Count'].median(), color='black', linestyle='dashed', linewidth=2)
plt.show()

### The above graph shows that Rented Bike Count has moderate right skewness. Linear regression does not strictly require a normally distributed dependent variable (the normality assumption concerns the residuals), but a strongly skewed target often leads to skewed residuals, so we transform it to bring it closer to normal.


In [None]:
# Handling Outliers & Outlier treatments
#Boxplot of Rented Bike Count to check outliers
plt.figure(figsize=(10,6))
plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=bike_df['Rented_Bike_Count'])
plt.show()

### The above boxplot shows that there are outliers in the Rented Bike Count column.


In [None]:
#Applying square root to Rented Bike Count to improve skewness
plt.figure(figsize=(10,8))
sqrt_count = np.sqrt(bike_df['Rented_Bike_Count'])

# histplot with a KDE overlay replaces the deprecated sns.distplot
ax=sns.histplot(sqrt_count, kde=True, stat='density', color="y")
ax.set(xlabel='Square root of Rented Bike Count', ylabel='Density')
ax.axvline(sqrt_count.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(sqrt_count.median(), color='black', linestyle='dashed', linewidth=2)

plt.show()

### A common rule of thumb for a moderately right-skewed variable is to apply a square root transformation. After applying the square root to Rented Bike Count, we get an almost normal distribution.
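The effect of the square root on skewness can also be checked numerically. The sketch below uses synthetic, gamma-distributed counts as a stand-in for a right-skewed target; it is not the real rental data, and the distribution parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed counts (gamma distributed); not the real rental data
counts = pd.Series(rng.gamma(shape=2.0, scale=300.0, size=5000))

# Sample skewness before and after the square root transformation
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
print(round(skew_raw, 2), round(skew_sqrt, 2))
```

A skewness close to 0 indicates a roughly symmetric distribution; the same two calls can be applied to `bike_df['Rented_Bike_Count']`.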


In [None]:
#After applying sqrt on Rented Bike Count, check whether we still have outliers 
plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=np.sqrt(bike_df['Rented_Bike_Count']))
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

### After applying the square root to the Rented Bike Count column, the boxplot shows no outliers. I used the square root transformation to reduce the skewness.
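A boxplot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles, which is seaborn's default whisker length. The sketch below counts such points before and after the square root on synthetic skewed data; the helper `count_iqr_outliers` is a hypothetical name introduced here, not part of any library.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points outside the 1.5 * IQR whiskers of a standard boxplot."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=300.0, size=5000)  # synthetic skewed data

outliers_raw = count_iqr_outliers(counts)
outliers_sqrt = count_iqr_outliers(np.sqrt(counts))
print(outliers_raw, outliers_sqrt)
```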

### **Feature selection**
**Checking OLS model**

### Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable.

In [None]:
#import the module
#assign the 'x','y' value
import statsmodels.api as sm
x = bike_df[[ 'Temperature','Humidity',
       'Wind_speed', 'Visibility','Dew_point_temperature',
       'Solar_Radiation', 'Rainfall', 'Snowfall']]
y = bike_df['Rented_Bike_Count']

In [None]:
#add a constant column
x = sm.add_constant(x)   # here we are adding b0

#fit a OLS model 
model= sm.OLS(y,x).fit()
model.summary()

### R-squared and adjusted R-squared are close to each other. About 40% of the variance in the Rented Bike Count is explained by the model.

### For the F-statistic, the p-value is less than 0.05, so the model is significant at the 5% level of significance.

### The p-values of dew point temperature and visibility are very high, so these variables are not significant.

### Omnibus tests the skewness and kurtosis of the residuals. Here the value of Omnibus is high, which indicates that the residuals are skewed.

### The condition number is large, 3.11e+04. This might indicate that there is strong multicollinearity or other numerical problems.

### Durbin-Watson tests for autocorrelation of the residuals. Here the value is less than 0.5, which indicates positive autocorrelation among the residuals.
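To make the Durbin-Watson reading concrete: the statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It is close to 2 for uncorrelated residuals and approaches 0 under strong positive autocorrelation. A minimal sketch on synthetic residuals (the function name `durbin_watson_stat` and the AR(1) coefficient 0.9 are illustrative choices):

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
noise = rng.normal(size=2000)

# Independent residuals: statistic close to 2
dw_independent = durbin_watson_stat(noise)

# AR(1) residuals with strong positive autocorrelation: statistic close to 0
ar1 = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]
dw_autocorrelated = durbin_watson_stat(ar1)
print(round(dw_independent, 2), round(dw_autocorrelated, 2))
```

With hourly data, consecutive rows tend to be similar, so a low value is plausible when the rows are kept in time order.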

### **Correlation Map**

In [None]:
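# correlation map
plt.figure(figsize=(16,15))
# correlation is only defined for numeric columns, so select them explicitly
corr = bike_df.select_dtypes(include='number').corr()
sns.heatmap(corr,annot=True,fmt='.2f',annot_kws={'size':15})
plt.title('Correlation between features');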

### We can observe on the heatmap that on the target variable line the most positively correlated variables to the bike rent are :
### the temperature
### the dew point temperature
### the solar radiation

### And most negatively correlated variables are:
### Humidity
### Rainfall
### weekdays or weekends

### From the above correlation map, we can see that there is a high correlation between 'Dew Point Temperature' and 'Temperature'. To get a more suitable model, we drop one of the two features, either Temperature or Dew Point Temperature.
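One way to quantify why two highly correlated features are a problem is the variance inflation factor (VIF). With only two predictors it reduces to 1 / (1 - r^2), where r is their correlation. The sketch below uses synthetic values that merely mimic a tight temperature and dew point relationship; the numbers are assumptions, not measurements from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(13, 12, size=1000)                   # synthetic values
dew_point = temperature - 9 + rng.normal(0, 4, size=1000)     # strongly tied to temperature

r = np.corrcoef(temperature, dew_point)[0, 1]
# With only two predictors, the variance inflation factor is 1 / (1 - r^2)
vif = 1.0 / (1.0 - r ** 2)
print(round(r, 2), round(vif, 1))
```

A commonly used rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity.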


In [None]:
# dropping dew point temperature (highly correlated with temperature) and visibility (not significant in the OLS summary)
bike_df = bike_df.drop(['Dew_point_temperature','Visibility'],axis=1)

In [None]:
bike_df.info()

###  **Data preprocessing**


### **Create the dummy variables**
### A dataset may contain various types of values, including categorical ones. To use categorical values in a model, we create dummy variables.


In [None]:
categorical_features = bike_df.select_dtypes(['object','category']).columns
categorical_features    

### **One hot encoding**
### A one hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.
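As a small illustration of one hot encoding with `pd.get_dummies`, here is a sketch on a hypothetical four-row frame. With `drop_first=True` the first level in sorted order is dropped and serves as the baseline.

```python
import pandas as pd

demo = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn"]})

# drop_first=True drops the first (alphabetical) level, here "Autumn",
# which becomes the baseline encoded as all zeros
encoded = pd.get_dummies(demo["Seasons"], prefix="Seasons", drop_first=True)
print(encoded)
```

Dropping one level avoids a redundant column, since the dropped category is implied when all the other dummy columns are zero.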

In [None]:
#create a copy so that the original dataframe is left untouched
bike_df_copy = bike_df.copy()

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in categorical_features:
    bike_df_copy = one_hot_encoding(bike_df_copy, col)
bike_df_copy.head()       

# Model Training


### Train Test split for regression

In [None]:
#Assign the value in X and Y
X = bike_df_copy.drop(columns=['Rented_Bike_Count'], axis=1)
y = np.sqrt(bike_df_copy['Rented_Bike_Count'])

In [None]:
X.head()

In [None]:
y.head()

### We have now assigned the independent variables to X and the dependent variable to y.

### 8. Data Splitting

### Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why? 

### I have used 75% of the data for training and 25% for testing, so that the model can be evaluated on data it has not seen during training, which helps to detect overfitting.
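What a shuffled 75/25 split does can be sketched with plain NumPy. This is an illustration of the idea only; it does not reproduce the exact rows that `train_test_split` selects, and the variable names are hypothetical.

```python
import numpy as np

n_rows = 8760
test_fraction = 0.25

rng = np.random.default_rng(0)
shuffled = rng.permutation(n_rows)           # random order of the row indices

# The first quarter of the shuffled indices forms the test set
n_test = int(round(n_rows * test_fraction))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
print(len(train_idx), len(test_idx))  # 6570 2190
```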

In [None]:
bike_df_copy.describe().columns

## ***7. ML Model Implementation***

### The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them.
It’s called the mean squared error because you’re finding the average of a set of squared errors. The lower the MSE, the better the forecast.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\text{actual}_i - \text{forecast}_i)^2$$

Where:

n = number of items,

Σ = summation notation,

actual = original or observed y-value,

forecast = y-value from regression.

Root Mean Square Error (RMSE) is the square root of the MSE; it can be read as the standard deviation of the residuals (prediction errors).

Mean Absolute Error (MAE) is the average of the absolute errors. Here, errors are the differences between the predicted values (values predicted by our regression model) and the actual values of a variable.

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

Formula for R-Squared

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$

Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
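The metrics above can be written out in a few lines of NumPy. This is a sketch for illustration: the helper `regression_metrics` is a hypothetical name, the four data points are invented, and the adjusted R-squared uses the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1) with n observations and p predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}

scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_features=1)
print(scores)
```

For this toy input the errors are 1, 0, -1 and 0, which gives an MSE of 0.5, an MAE of 0.5, an R2 of 0.9 and an adjusted R2 of 0.85.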
​

 


### **LINEAR REGRESSION**

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line.

Linear regression uses a linear approach to model the relationship between independent and dependent variables. In simple words, it is a best-fit line drawn through the values of the independent variables and the dependent variable. In the case of a single variable, the formula is the same as the straight-line equation, with an intercept and a slope.

     y_pred = b0+b1x

where b0 and b1 are intercept and slope respectively.

In case of multiple features the formula translates into:

     y_pred = b0 + b1x1 + b2x2 + b3x3 + ....

where x1, x2, x3 are the feature values and b0, b1, b2, ... are the weights assigned to each of the features. These are the parameters the algorithm learns. They can be learned with gradient descent, although scikit-learn's LinearRegression solves the least-squares problem directly.

Gradient descent is the process by which the algorithm iteratively updates the parameters to reduce a loss function. The loss measures the difference between the actual values and the predicted values (the errors or residuals), for example the squared error. There are different types of loss functions, but this is one of the simplest. The loss summed (or averaged) over all observations gives the cost function. The role of gradient descent is to update the parameters until the cost function is minimized, i.e., the global minimum is reached.

Gradient descent uses a hyperparameter 'alpha', called the learning rate, which scales the gradient and so decides how big a step to take. It is necessary to choose alpha carefully: a value that is too high can make gradient descent overshoot the minimum, and a value that is too low makes convergence very slow.

There are also some basic assumptions that must be fulfilled before implementing this algorithm. They are:

1. No multicollinearity in the dataset.

2. Independent variables should show a linear relationship with the dependent variable.

3. Residual mean should be 0 or close to 0.

4. There should be no heteroscedasticity, i.e., the variance of the residuals should be constant along the line of best fit.
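The gradient descent update described above can be sketched for the single-feature case y_pred = b0 + b1x. This is an illustrative pure-Python sketch under stated assumptions: the function name `fit_line_gradient_descent`, the learning rate, the iteration count and the toy data are all hypothetical choices, and scikit-learn's `LinearRegression` does not work this way internally (it solves least squares directly).

```python
def fit_line_gradient_descent(x, y, alpha=0.05, n_iters=5000):
    """Fit y_pred = b0 + b1 * x by batch gradient descent on the MSE cost."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        # Errors of the current line on every observation
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        # Partial derivatives of the mean squared error w.r.t. b0 and b1
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * xi for e, xi in zip(errors, x))
        # Step against the gradient, scaled by the learning rate alpha
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1

# Toy data generated from y = 1 + 2x, so the fit should approach b0 = 1, b1 = 2
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line_gradient_descent(x, y)
print("intercept:", round(b0, 4), "slope:", round(b1, 4))
```

With a much larger `alpha` on this data the updates would grow instead of shrink, which is the overshooting behaviour mentioned above.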

Let us now implement our first model. We will be using LinearRegression from the scikit-learn library.



### ML Model - 1

### **Linear Regression**

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
# Fit the Algorithm
reg= LinearRegression().fit(X_train, y_train)
# R2 score of the fitted model on the training set
reg.score(X_train, y_train)

In [None]:
#check the coefficients
reg.coef_

In [None]:
#get the predicted values for the train and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

In [None]:
#calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)


#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)



#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
# n and p come from the training set because these are training metrics
Adjusted_R2_lr = (1-(1-r2_lr)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_lr )

### The training R2 score is 0.77, which means our model explains about 77% of the variance in the training data. Let's save it in a dataframe for later comparisons.

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
#calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)


#calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_lr )

### The r2_score for the test set is 0.78, close to the training score of 0.77, so the linear model generalizes well to unseen data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test),(y_test)-(y_pred_test))
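The residual plot is read by eye, so a rough number can help when comparing models. The sketch below is only a crude companion to the plot, not a formal test such as Breusch-Pagan: it sorts the residuals by predicted value, splits them into a lower and an upper half, and compares the two variances. The function name `residual_spread_ratio` and the toy values are hypothetical.

```python
from statistics import pvariance

def residual_spread_ratio(y_pred, residuals):
    """Variance of residuals for high predictions divided by that for low predictions."""
    # Sort residuals by the predicted value they belong to
    pairs = sorted(zip(y_pred, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pvariance(high) / pvariance(low)

# Toy example: the scatter of the residuals grows with the predicted value
toy_pred = [1, 2, 3, 4, 5, 6, 7, 8]
toy_resid = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
ratio = residual_spread_ratio(toy_pred, toy_resid)
print("spread ratio:", round(ratio, 2))
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 (as in this toy data) points to heteroscedasticity. For the notebook's data one could pass `y_pred_test` and `y_test - y_pred_test` converted to lists.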

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(y_pred_test)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

### ML Model - 2

### **LASSO REGRESSION**

In [None]:
# Create an instance of Lasso Regression implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0, max_iter=3000)
# Fit the Lasso model
lasso.fit(X_train, y_train)
# Create the model score
print(lasso.score(X_test, y_test), lasso.score(X_train, y_train))

In [None]:
#get the predicted values for the train and test sets
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

In [None]:
#calculate MSE
MSE_l= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)


#calculate MAE
MAE_l= mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE :",MAE_l)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l= r2_score(y_train, y_pred_train_lasso)
print("R2 :",r2_l)
# n and p come from the training set because these are training metrics
Adjusted_R2_l = (1-(1-r2_l)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_l )

### **The training R2 score is 0.40, which means the Lasso model explains less than half of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2)
       }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_l= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)


#calculate MAE
MAE_l= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE :",MAE_l)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l= r2_score((y_test), (y_pred_test_lasso))
print("R2 :",r2_l)
Adjusted_R2_l=(1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.38. This means the Lasso model with alpha=1.0 is not performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2),
       }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_lasso))
plt.plot(np.array((y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_lasso),(y_test-y_pred_test_lasso))

### ML Model - 3

### **RIDGE REGRESSION**

In [None]:
# ML Model - 3 Implementation
from sklearn.linear_model import Ridge
ridge= Ridge(alpha=0.1)
# Fit the Algorithm
ridge.fit(X_train,y_train)
# R2 score of the fitted model on the training set
ridge.score(X_train, y_train)

In [None]:
#get the predicted values for the train and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
#calculate MSE
MSE_r= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r)
# n and p come from the training set because these are training metrics
Adjusted_R2_r=(1-(1-r2_r)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_r )

### **The training R2 score is 0.77, which means the Ridge model explains about 77% of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_r= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.78, essentially the same as plain linear regression, so the Ridge model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot((y_pred_test_ridge))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_ridge),(y_test)-(y_pred_test_ridge))

### ML Model - 4

### **ELASTIC NET REGRESSION**

In [None]:
from sklearn.linear_model import ElasticNet
# The penalty is a * L1 + 0.5 * b * L2,
# where alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
#FIT THE MODEL
elasticnet.fit(X_train,y_train)
#check the score
elasticnet.score(X_train, y_train)
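The relationship between `alpha`, `l1_ratio` and the two penalty weights noted in the comments can be made concrete with a small calculation. This is a pure-Python sketch of the penalty term only, following the parameterization documented for scikit-learn's `ElasticNet`; the function name `elastic_net_penalty` and the coefficient values are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty added to the loss: a * ||w||_1 + 0.5 * b * ||w||_2^2."""
    a = alpha * l1_ratio          # weight of the L1 (lasso-style) part
    b = alpha * (1 - l1_ratio)    # weight of the L2 (ridge-style) part
    l1_norm = sum(abs(w) for w in coefs)
    l2_squared = sum(w ** 2 for w in coefs)
    return a * l1_norm + 0.5 * b * l2_squared

# Illustrative coefficient vector, not fitted on the bike sharing data
coefs = [1.0, -2.0, 0.0]
mixed = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.5)
pure_l1 = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.0)
print(round(mixed, 4), round(pure_l1, 4), round(pure_l2, 4))
```

With `l1_ratio=0.5` the model used here puts equal weight on both parts; moving `l1_ratio` towards 1 makes the penalty behave more like Lasso, and towards 0 more like Ridge.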

In [None]:
#get the predicted values for the train and test sets
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

In [None]:
#calculate MSE
MSE_e= mean_squared_error((y_train), (y_pred_train_en))
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(y_train, y_pred_train_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e= r2_score(y_train, y_pred_train_en)
print("R2 :",r2_e)
# n and p come from the training set because these are training metrics
Adjusted_R2_e=(1-(1-r2_e)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_e )

### **The training R2 score is 0.62, which means the Elastic Net model explains about 62% of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Elastic net regression ',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_e= mean_squared_error(y_test, y_pred_test_en)
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(y_test, y_pred_test_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e= r2_score((y_test), (y_pred_test_en))
print("R2 :",r2_e)
Adjusted_R2_e=(1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.62, which is lower than the 0.78 of plain linear regression, so the Elastic Net model performs only moderately on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Elastic net regression ',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_en))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_en),(y_test)-(y_pred_test_en))

### ML Model - 5

### **DECISION TREE**

In [None]:
# ML Model - 5 Implementation
from sklearn.tree import DecisionTreeRegressor
decision_regressor = DecisionTreeRegressor(criterion='squared_error', max_depth=8, max_features=9, max_leaf_nodes=100)
decision_regressor.fit(X_train, y_train)

In [None]:
#get the predicted values for the train and test sets
y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d = decision_regressor.predict(X_test)

In [None]:
print("Model Score:",decision_regressor.score(X_train,y_train))

#calculate MSE
MSE_d= mean_squared_error(y_train, y_pred_train_d)
print("MSE :",MSE_d)

#calculate RMSE
RMSE_d=np.sqrt(MSE_d)
print("RMSE :",RMSE_d)


#calculate MAE
MAE_d= mean_absolute_error(y_train, y_pred_train_d)
print("MAE :",MAE_d)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_d= r2_score(y_train, y_pred_train_d)
print("R2 :",r2_d)
# n and p come from the training set because these are training metrics
Adjusted_R2_d=(1-(1-r2_d)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_d )

### **The training R2 score is 0.71, which means the Decision Tree model explains about 71% of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Decision tree regression ',
       'MAE':round((MAE_d),3),
       'MSE':round((MSE_d),3),
       'RMSE':round((RMSE_d),3),
       'R2_score':round((r2_d),3),
       'Adjusted R2':round((Adjusted_R2_d),2)
      }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_d= mean_squared_error(y_test, y_pred_test_d)
print("MSE :",MSE_d)

#calculate RMSE
RMSE_d=np.sqrt(MSE_d)
print("RMSE :",RMSE_d)


#calculate MAE
MAE_d= mean_absolute_error(y_test, y_pred_test_d)
print("MAE :",MAE_d)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_d= r2_score((y_test), (y_pred_test_d))
print("R2 :",r2_d)
Adjusted_R2_d=(1-(1-r2_score((y_test), (y_pred_test_d)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_d)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.68. This means our Decision Tree model is performing reasonably well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Decision tree regression ',
       'MAE':round((MAE_d),3),
       'MSE':round((MSE_d),3),
       'RMSE':round((RMSE_d),3),
       'R2_score':round((r2_d),3),
       'Adjusted R2':round((Adjusted_R2_d),2)
      }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot((np.array(y_pred_test_d)))
plt.plot(np.array((y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_d),(y_test)-(y_pred_test_d))

### ML Model - 6

### **RANDOM FOREST**

In [None]:
# Create an instance of the RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()

rf_model.fit(X_train,y_train)

In [None]:
# Making predictions on train and test data
y_pred_train_r = rf_model.predict(X_train)
y_pred_test_r = rf_model.predict(X_test)

In [None]:
print("Model Score:",rf_model.score(X_train,y_train))

#calculate MSE
MSE_rf= mean_squared_error(y_train, y_pred_train_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)


#calculate MAE
MAE_rf= mean_absolute_error(y_train, y_pred_train_r)
print("MAE :",MAE_rf)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_rf= r2_score(y_train, y_pred_train_r)
print("R2 :",r2_rf)
# n and p come from the training set because these are training metrics
Adjusted_R2_rf=(1-(1-r2_rf)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_rf )

### **The training R2 score is 0.98, which means the Random Forest model explains almost all of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Random forest regression ',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_rf= mean_squared_error(y_test, y_pred_test_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)


#calculate MAE
MAE_rf= mean_absolute_error(y_test, y_pred_test_r)
print("MAE :",MAE_rf)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_rf= r2_score((y_test), (y_pred_test_r))
print("R2 :",r2_rf)
Adjusted_R2_rf=(1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.91. This means our Random Forest model is performing well on the data, although the gap to the training score of 0.98 suggests some overfitting. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Random forest regression ',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_r),(y_test)-(y_pred_test_r))

In [None]:
rf_model.feature_importances_

In [None]:
importances = rf_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
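# rf_model is already fitted above, so it is reused here without refitting
# (refitting without a fixed random_state would change the importances)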

In [None]:
features = X_train.columns
importances = rf_model.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

### Here we can see that Temperature, Functioning Day and Humidity have much higher relevance for the rented bike count. This means they are the most important features affecting the rented bike count in the Random Forest model.



### ML Model - 7

### **GRADIENT BOOSTING**

In [None]:
# Create an instance of the GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor()
#Fit the model
gb_model.fit(X_train,y_train)

In [None]:
# Making predictions on train and test data
y_pred_train_g = gb_model.predict(X_train)
y_pred_test_g = gb_model.predict(X_test)

In [None]:
print("Model Score:",gb_model.score(X_train,y_train))
#calculate MSE
MSE_gb= mean_squared_error(y_train, y_pred_train_g)
print("MSE :",MSE_gb)

#calculate RMSE
RMSE_gb=np.sqrt(MSE_gb)
print("RMSE :",RMSE_gb)


#calculate MAE
MAE_gb= mean_absolute_error(y_train, y_pred_train_g)
print("MAE :",MAE_gb)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_gb= r2_score(y_train, y_pred_train_g)
print("R2 :",r2_gb)
# n and p come from the training set because these are training metrics
Adjusted_R2_gb = (1-(1-r2_gb)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_gb )

### **The training R2 score is 0.87, which means the Gradient Boosting model explains about 87% of the variance in the training data. Let's save it in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Gradient boosting regression ',
       'MAE':round((MAE_gb),3),
       'MSE':round((MSE_gb),3),
       'RMSE':round((RMSE_gb),3),
       'R2_score':round((r2_gb),3),
       'Adjusted R2':round((Adjusted_R2_gb ),2),
       }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_gb= mean_squared_error(y_test, y_pred_test_g)
print("MSE :",MSE_gb)

#calculate RMSE
RMSE_gb=np.sqrt(MSE_gb)
print("RMSE :",RMSE_gb)


#calculate MAE
MAE_gb= mean_absolute_error(y_test, y_pred_test_g)
print("MAE :",MAE_gb)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_gb= r2_score((y_test), (y_pred_test_g))
print("R2 :",r2_gb)
Adjusted_R2_gb = (1-(1-r2_score((y_test), (y_pred_test_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### **The r2_score for the test set is 0.86, close to the training score of 0.87. This means our Gradient Boosting model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Gradient boosting regression ',
       'MAE':round((MAE_gb),3),
       'MSE':round((MSE_gb),3),
       'RMSE':round((RMSE_gb),3),
       'R2_score':round((r2_gb),3),
       'Adjusted R2':round((Adjusted_R2_gb ),2),
       }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_g),(y_test)-(y_pred_test_g))

In [None]:
gb_model.feature_importances_

In [None]:
importances = gb_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.head()

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
gb_model.fit(X_train,y_train)

In [None]:
features = X_train.columns
importances = gb_model.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

#### **Hyperparameter Tuning**

### Before proceeding to try the next models, let us try to tune some hyperparameters and see if the performance of our model improves.

### Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a model argument whose value is set before the learning process begins. Careful tuning is often key to getting good performance out of a machine learning algorithm.

### **Using GridSearchCV**

### GridSearchCV loops through every combination of the predefined hyperparameter values, evaluates each combination with cross-validation on the training set, and, in the end, lets us select the best parameters from the listed hyperparameters.


### **Gradient Boosting Regressor with GridSearchCV**

### **Provide the range of values for chosen hyperparameters**

In [None]:
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# Hyperparameter grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
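To get a feel for how much work this grid implies, the candidate combinations can be enumerated by hand. The sketch below uses only `itertools.product` on a copy of the grid above and assumes 5-fold cross-validation, as used in the grid search that follows; it imitates the enumeration rather than calling scikit-learn.

```python
from itertools import product

# Copy of the hyperparameter grid defined above
param_dict = {
    'n_estimators': [50, 80, 100],
    'max_depth': [4, 6, 8],
    'min_samples_split': [50, 100, 150],
    'min_samples_leaf': [40, 50],
}

# Every combination of one value per hyperparameter
names = list(param_dict)
combinations = [dict(zip(names, values)) for values in product(*param_dict.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds
print(len(combinations), "candidates,", total_fits, "cross-validation fits")
print("first candidate:", combinations[0])
```

The grid has 3 x 3 x 3 x 2 = 54 candidates, so 5-fold cross-validation trains 270 models, plus one final refit of the best candidate when `refit` is left at its default.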

In [None]:
param_dict

### **Importing Gradient Boosting Regressor**

In [None]:
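from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Create an instance of the GradientBoostingRegressor
gb_model = GradientBoostingRegressor()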

# Grid search
gb_grid = GridSearchCV(estimator=gb_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2)

gb_grid.fit(X_train,y_train)

In [None]:
gb_grid.best_estimator_

In [None]:
gb_optimal_model = gb_grid.best_estimator_

In [None]:
gb_grid.best_params_

In [None]:
# Making predictions on train and test data

y_pred_train_g_g = gb_optimal_model.predict(X_train)
y_pred_g_g= gb_optimal_model.predict(X_test)

In [None]:
print("Model Score:",gb_optimal_model.score(X_train,y_train))
MSE_gbh= mean_squared_error(y_train, y_pred_train_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_train, y_pred_train_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
r2_gbh= r2_score(y_train, y_pred_train_g_g)
print("R2 :",r2_gbh)
# n and p come from the training set because these are training metrics
Adjusted_R2_gbh = (1-(1-r2_gbh)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_gbh )

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
MSE_gbh= mean_squared_error(y_test, y_pred_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_test, y_pred_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
r2_gbh= r2_score((y_test), (y_pred_g_g))
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_test, y_pred_g_g))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_g_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_g_g),(y_test)-(y_pred_g_g))

In [None]:
gb_optimal_model.feature_importances_

In [None]:
importances = gb_optimal_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)
importance_df.head()

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
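# gb_optimal_model was already refitted on the training data by GridSearchCV,
# so the tuned model is plotted here instead of a freshly fitted default model

In [None]:
features = X_train.columns
importances = gb_optimal_model.feature_importances_
indices = np.argsort(importances)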

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

### **XGBoost with RandomizedSearchCV**

In [None]:
#Import Library
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
import xgboost

In [None]:
#hyperparameter optimization
params = {'learning_rate' : [0.05,0.1,0.15,0.2,0.25,0.3],
          'max_depth' : [3,4,5,6,8,10,12,15],
          'min_child_weight' : [1,3,5,7],
          'gamma' : [0,0.1,0.2,0.3,0.4],
          'colsample_bytree' : [0.3,0.4,0.5,0.7]
} 

In [None]:
xgb_regressor = xgboost.XGBRegressor()
random_search = RandomizedSearchCV(xgb_regressor, param_distributions=params, n_iter=5, scoring='neg_mean_squared_error',n_jobs=-1,cv=5,verbose=3)
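Unlike a grid search, a randomized search with `n_iter=5` only tries a handful of the possible combinations. The following pure-Python sketch imitates that idea on a copy of the `params` dictionary above; the seed and the use of `random.sample` are assumptions made for illustration and do not reproduce the exact candidates that `RandomizedSearchCV` would draw.

```python
import random
from itertools import product
from math import prod

# Copy of the search space defined above
params = {'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
          'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
          'min_child_weight': [1, 3, 5, 7],
          'gamma': [0, 0.1, 0.2, 0.3, 0.4],
          'colsample_bytree': [0.3, 0.4, 0.5, 0.7]}

# Size of the full grid that an exhaustive search would have to cover
grid_size = prod(len(values) for values in params.values())

# Draw 5 distinct combinations at random (hypothetical seed for repeatability)
rng = random.Random(0)
all_candidates = list(product(*params.values()))
sampled = rng.sample(all_candidates, 5)
candidates = [dict(zip(params, values)) for values in sampled]

print("full grid:", grid_size, "combinations; sampled:", len(candidates))
for candidate in candidates:
    print(candidate)
```

Only 5 of the 3840 possible combinations are evaluated, which keeps the search fast but means the reported best parameters can change from run to run unless a `random_state` is fixed.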

In [None]:
random_search.fit(X_train,y_train)

In [None]:
xgb_model = random_search.best_estimator_
xgb_model

In [None]:
random_search.best_params_

In [None]:
#get the y predicted value for train and test dataset
y_pred_train = xgb_model.predict(X_train)
y_pred_test = xgb_model.predict(X_test)

In [None]:
#Calculating metric value for training dataset
print('Metric value of training dataset')
#Calculate MSE
MSE_xgb = mean_squared_error(y_train, y_pred_train)    
print("MSE :",MSE_xgb)
#calculate RMSE 
RMSE_xgb = np.sqrt(MSE_xgb)   
print("RMSE :",RMSE_xgb)
#calculate MAE 
MAE_xgb = mean_absolute_error(y_train, y_pred_train)       
print("MAE :",MAE_xgb)
#calculate r2 and adjusted r2
r2_xgb = r2_score(y_train, y_pred_train)         
print("R2 :",r2_xgb)
# n and p come from the training set because these are training metrics
Adjusted_r2_xgb = 1-(((1-r2_xgb)*(X_train.shape[0]-1))/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_r2_xgb)
#creating table of metric values
dict1={'Model':'XgBoost Regressor with RandomisedSearchCV',
       'MAE':round((MAE_xgb),3),
       'MSE':round((MSE_xgb),3),
       'RMSE':round((RMSE_xgb),3),
       'R2_score':round((r2_xgb),3),
       'Adjusted R2':round((Adjusted_r2_xgb),3)
        }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
training_df = pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

#Calculating metric value for testing dataset
#calculate MSE
print('Metric value of testing dataset')
MSE_xgb = mean_squared_error(y_test, y_pred_test)    
print("MSE :",MSE_xgb)
#calculate RMSE 
RMSE_xgb = np.sqrt(MSE_xgb)   
print("RMSE :",RMSE_xgb)
#calculate MAE 
MAE_xgb = mean_absolute_error(y_test, y_pred_test)       
print("MAE :",MAE_xgb)
#calculate r2 and adjusted r2
r2_xgb = r2_score(y_test, y_pred_test)         
print("R2 :",r2_xgb)
Adjusted_r2_xgb = 1-(((1-r2_xgb)*(X_test.shape[0]-1))/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 :",Adjusted_r2_xgb)
#creating table of metric values
dict2={'Model':'XgBoost Regressor with RandomisedSearchCV',
       'MAE':round((MAE_xgb),3),
       'MSE':round((MSE_xgb),3),
       'RMSE':round((RMSE_xgb),3),
       'R2_score':round((r2_xgb),3),
       'Adjusted R2':round((Adjusted_r2_xgb),3)
        }
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df = pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

### XGBoost with RandomizedSearchCV Model
### R2 score of Train data : 0.96
### adjusted R2 score of Train data : 0.96
### R2 score of Test data : 0.92
### adjusted R2 score of Test data : 0.92

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

### RMSE is one of the most widely used evaluation metrics for regression problems. It assumes that the errors are unbiased and follow a normal distribution. The key points to consider about RMSE are:

### Because the errors are squared before averaging, large deviations stand out in this metric, and the square root brings the value back to the units of the target.
### Squaring also prevents positive and negative errors from cancelling each other out, so the metric reflects the plausible magnitude of the error term.
### It avoids absolute values, which are awkward to handle in mathematical calculations.
### With more samples, reconstructing the error distribution from RMSE is considered more reliable.
### RMSE is strongly affected by outliers, so outliers should be handled before relying on this metric.
### Compared with mean absolute error, RMSE gives more weight to large errors and penalises them more heavily.
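
### The difference between RMSE and MAE is easiest to see on two error vectors that have the same MAE. The block below is a small illustrative sketch with made-up error values (not errors from the bike models): one vector spreads the error evenly, the other concentrates it in a single large miss.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

# Hypothetical residuals: same total absolute error, distributed differently
even_errors = np.array([10.0, -10.0, 10.0, -10.0])
one_large_error = np.array([0.0, 0.0, 0.0, 40.0])

# The plain mean of signed errors cancels out and hides the error completely
print("Mean signed error (even):", float(np.mean(even_errors)))

print("MAE  even / one large:", mae(even_errors), mae(one_large_error))
print("RMSE even / one large:", rmse(even_errors), rmse(one_large_error))
```

### Both vectors have an MAE of 10, but the RMSE doubles from 10 to 20 when the error is concentrated in one prediction, which is why RMSE is sensitive to outliers.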

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

### **We chose Random Forest as our final prediction model because it gives the best R2 score among all the models.**

### Model Performance: Goodness of fit describes how well the regression model fits the set of observations. Choosing the best model among several candidates is called model selection, and it can be done with the method below:

### R-squared method:
### R-squared is a statistical measure of goodness of fit. It gives the proportion of the variance in the dependent variable that is explained by the independent variables, usually reported on a scale of 0-100%. A high R-squared value means a small difference between the predicted and actual values and hence indicates a good model.
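
### For reference, the standard definitions behind the R2 and adjusted R2 values reported in this notebook are written out below. The symbols are assumptions introduced here: $n$ is the number of observations in the split being scored, $p$ is the number of features, $y_i$ is the actual count, $\hat{y}_i$ the predicted count and $\bar{y}$ the mean of the actual counts.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}
```

### The adjusted form is the same expression used in the metric cells, with `shape[0]` as $n$ and `shape[1]` as $p$.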

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

### LIME (Local Interpretable Model-agnostic Explanations) is an explanation technique that explains the prediction of any classifier or regressor in an interpretable and faithful manner by learning an interpretable model locally around the prediction.

### It offers a consistent, model-agnostic explainer.
### It provides a method to select a representative set of predictions with explanations, to check that the model behaves consistently while replicating human logic. This representative set gives an intuitive global understanding of the model.
### LIME explains a prediction so that even non-experts can compare models and improve an untrustworthy model through feature engineering.
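
### The idea of "learning an interpretable model locally around the prediction" can be sketched in a few lines of NumPy. The block below is a simplified illustration and not the `lime` package's implementation: the function `local_linear_explanation`, its parameters and the toy `black_box` model are all assumptions made for this example. It perturbs one instance, weights the perturbed samples by their distance to the instance, and fits a weighted linear model whose coefficients act as local feature effects.

```python
import numpy as np

def local_linear_explanation(predict_fn, instance, feature_scales,
                             num_samples=2000, kernel_width=0.75, seed=0):
    """Fit a weighted linear surrogate around one instance (simplified LIME idea)."""
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    feature_scales = np.asarray(feature_scales, dtype=float)

    # 1. Perturb the instance with Gaussian noise (noise is in scaled units)
    noise = rng.normal(size=(num_samples, instance.shape[0]))
    samples = instance + noise * feature_scales

    # 2. Ask the black-box model for predictions on the perturbed samples
    predictions = np.asarray(predict_fn(samples), dtype=float)

    # 3. Weight each sample by its closeness to the instance
    distances = np.linalg.norm(noise, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))

    # 4. Weighted least squares: scale rows by sqrt(weight), then solve
    design = np.hstack([np.ones((num_samples, 1)), noise])
    sqrt_w = np.sqrt(weights)[:, None]
    coefs, *_ = np.linalg.lstsq(design * sqrt_w, predictions * sqrt_w[:, 0], rcond=None)

    # Drop the intercept; the rest are per-feature local effects
    return coefs[1:]

def black_box(X):
    """Toy stand-in for a fitted model's predict method."""
    return 4.0 * X[:, 0] - 2.5 * X[:, 1] + 10.0

effects = local_linear_explanation(black_box, instance=[1.0, 0.5],
                                   feature_scales=[1.0, 1.0])
print(effects)  # positive effect for feature 0, negative for feature 1
```

### A positive coefficient means the feature pushes the prediction up near this instance and a negative one pushes it down, which is how the bars in the LIME plots below are read.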

###  **Model Explainability by LIME**


In [None]:
# Extract features
float_columns=[]
cat_columns=[]
int_columns=[]

In [None]:
# Putting features into respective float, cat , int list.
for i in x.columns:
    if x[i].dtype == 'float' : 
        float_columns.append(i)
    elif x[i].dtype == 'int64':
        int_columns.append(i)
    elif x[i].dtype == 'object':
        cat_columns.append(i)

In [None]:
train_cat_features = x[cat_columns]
train_float_features = x[float_columns]
train_int_features = x[int_columns]

In [None]:
!pip install lime

In [None]:
import lime
import lime.lime_tabular

In [None]:
# Create the LIME Explainer (feature names are taken from the same frame as the training data)
explainer = lime.lime_tabular.LimeTabularExplainer(training_data = np.array(X_train),
                                                   feature_names = list(X_train.columns),
                                                   mode='regression')

In [None]:
# Get the explanation for RandomForest
exp = explainer.explain_instance(data_row = X_test.iloc[24], predict_fn = rf_model.predict)
exp.show_in_notebook(show_table=True)

### From the LIME explanation, we can see that Hour 4, Hour 5, Hour 3, Hour 2 and Rainfall have a positive impact, and Hour 18, Hour 20, Hour 8, Hour 19 and Hour 21 have a negative impact, on the predicted rented bike count for the Random Forest model.



In [None]:
# Get the explanation for Gradient Boosting with GridSearchCV
exp = explainer.explain_instance(data_row = X_test.iloc[24], predict_fn = gb_grid.predict)
exp.show_in_notebook(show_table=True)

### From the LIME explanation, we can see that Hour 4, Hour 5, Rainfall, Hour 3 and Hour 2 have a positive impact, and Hour 18, Hour 19, Hour 20, Hour 21 and Hour 22 have a negative impact, on the predicted rented bike count for the Gradient Boosting with GridSearchCV model.

In [None]:
# Get the explanation for XGBoost with RandomisedSearchCV
exp = explainer.explain_instance(data_row = X_test.iloc[24], predict_fn = xgb_model.predict)
exp.show_in_notebook(show_table=True)

### From the LIME explanation, we can see that Hour 4, Hour 3, Hour 5, Rainfall and Hour 2 have a positive impact, and Hour 18, Hour 19, Hour 8, Hour 20 and Hour 21 have a negative impact, on the predicted rented bike count for the XgBoost with RandomisedSearchCV model.



# **Conclusion**


### The number of bikes rented is on average higher during the rush hours, i.e. from 6 p.m. to 8 p.m.

### The rented bike count is highest during the summer and lowest during the winter.

### The rented bike count is higher on working days than on non-working days.

### On non-functioning days, no bikes are rented in any instance in the data.

### The number of bikes rented on average remains constant from Monday to Saturday and dips on Sunday; on average, the rented bike count is lower on weekends than on weekdays.

### On regular days, the demand for bikes is higher during rush hours. On holidays or weekends, the demand is comparatively lower in the mornings and higher in the afternoons.

###  **Results**

### I implemented 7 machine learning algorithms: Linear Regression, Lasso, Ridge, ElasticNet, Decision Tree, Random Forest and XGBoost. I did hyperparameter tuning to improve the model performance. The results of the evaluation are:



In [None]:
# displaying the results of evaluation metric values for all models
result=pd.concat([training_df,test_df],keys=['Training set','Test set'])
result

### No overfitting is seen.
### Random Forest Regressor, Gradient Boosting with GridSearchCV and XgBoost Regressor with RandomisedSearchCV give the highest R2 scores: 98%, 95% and 96% respectively on the train set and almost 92% on the test set.
### The feature importance values for Random Forest, Gradient Boosting and XgBoost are different.
### We can deploy the Random Forest, Gradient Boosting with GridSearchCV or XgBoost with RandomisedSearchCV model.
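
### One simple way to check for overfitting is to look at the gap between the train and test R2 of each model. The block below is a small sketch that uses only the rounded scores quoted above (the test score is "almost 92%" for all three models, taken here as 0.92); the dictionary layout and variable names are assumptions for illustration.

```python
# Rounded R2 scores as quoted in the text above
r2_scores = {
    "Random Forest": {"train": 0.98, "test": 0.92},
    "Gradient Boosting with GridSearchCV": {"train": 0.95, "test": 0.92},
    "XgBoost with RandomisedSearchCV": {"train": 0.96, "test": 0.92},
}

# Train-test gap per model; a large gap would point to overfitting
gaps = {
    model: round(scores["train"] - scores["test"], 2)
    for model, scores in r2_scores.items()
}

for model, gap in gaps.items():
    print(f"{model}: train-test R2 gap = {gap:.2f}")
```

### With the full results table, the same gap can be computed from the `R2_score` columns of `training_df` and `test_df`.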
 




### However, this is not the end of the work. Because this data is time dependent, the values of variables such as temperature, wind speed and solar radiation will not always be consistent, so there will be scenarios where the model does not perform well. Machine learning is a rapidly evolving field, so we have to be prepared for such contingencies and keep checking the model from time to time. Keeping pace with the field will help one stay a step ahead in the future.
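
### Checking the model from time to time can be as simple as recomputing the RMSE on each new period of data and comparing it with the RMSE seen at training time. The block below is a minimal sketch under stated assumptions: the function `flag_degraded_periods`, the `tolerance` threshold, the period labels and the toy counts are all hypothetical and are not results from this project.

```python
import numpy as np

def flag_degraded_periods(y_true, y_pred, period_labels, baseline_rmse, tolerance=1.25):
    """Return {period: rmse} for periods whose RMSE exceeds tolerance * baseline_rmse."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    period_labels = np.asarray(period_labels)

    flagged = {}
    for period in np.unique(period_labels):
        mask = period_labels == period
        period_rmse = float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
        if period_rmse > tolerance * baseline_rmse:
            flagged[str(period)] = period_rmse
    return flagged

# Toy hourly counts for two hypothetical periods
y_true = [100, 120, 110, 300, 320, 310]
y_pred = [98, 123, 108, 250, 380, 260]
periods = ["period_A"] * 3 + ["period_B"] * 3

flagged = flag_degraded_periods(y_true, y_pred, periods, baseline_rmse=5.0)
print(flagged)  # only period_B exceeds 1.25 * baseline
```

### A flagged period is a signal to inspect the inputs for that period and, if the shift persists, to retrain the model on more recent data.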
