# **Project Name**    - **Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Bike Seoul is a bike sharing service in the city of Seoul, South Korea. It is part of the city's efforts to promote sustainable transportation and reduce traffic congestion. The service allows residents and visitors to rent bicycles at various stations across the city and return them to any other station, providing a convenient and eco-friendly mode of transportation. In recent years, the demand for bike rentals in Seoul has increased, leading to the need for a more efficient and effective way to manage the bike sharing operations. Accurately predicting bike demand is crucial for optimizing fleet management, ensuring the availability of bikes at high-demand locations, and reducing waste and costs.

The main objective of this project is to develop a machine learning model that can accurately predict the demand for bike rentals in Seoul, South Korea, based on historical data and relevant factors such as weather conditions, time of day, and public holidays. We use regression techniques to model the bike demand data. The model is trained on a large dataset of past bike rental information, along with relevant weather and time data, and is then tested and evaluated using metrics such as mean squared error and R-squared. The data comes from the Seoul city government's open data portal and is also available on Kaggle.

We tried several regression algorithms, including linear regression, decision tree, random forest, gradient boosting, and extreme gradient boosting (XGBoost), and applied hyperparameter tuning and cross-validation to improve model performance. We finally selected XGBoost because it gave high accuracy, around 93% on the training data and 90% on the test data.

This project not only provided valuable insights into bike demand patterns in Seoul but also demonstrated the practical applications of machine learning in addressing real-world problems. The findings could potentially be extended to other cities with similar bike sharing systems, leading to improved services for bike users and more sustainable transportation systems.

# **GitHub Link -**

https://github.com/AshwiniSuryakar09/EDA-Bike-Sharing-Demand-Prediction/tree/main

# **Problem Statement**


Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that a stable supply can be maintained.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each and every algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime as dt


# Import Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Import warnings
import warnings
warnings.filterwarnings('ignore')

# Import preprocessing libraries
from sklearn.preprocessing import MinMaxScaler,StandardScaler

# Import model selection libraries
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV

# Import Outlier influence library
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Import Model
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from xgboost import XGBRegressor
import xgboost as xgb

# Import evaluation metric libraries
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

# Import tree for visualization
from sklearn.tree import export_graphviz
from sklearn import tree
from IPython.display import SVG,display
from graphviz import Source

### Dataset Loading

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data = pd.read_csv('SeoulBikeData.csv',encoding ='latin-1')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Checking number of rows and columns of the dataset using shape
print("Number of rows are: ",data.shape[0])
print("Number of columns are: ",data.shape[1])


### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(),cbar= False)

### What did you know about your dataset?

* There are a total of 14 columns, of which Rented Bike Count is the dependent variable. The total number of observations (rows) is 8760.

* There are no duplicate rows in the dataset.

* Also there are no missing values or Null values in the dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe().T

### Variables Description

* Date : Date in day/month/year format
* Rented Bike Count : Count of bikes rented per hour
* Hour : Hour of the day
* Temperature(°C) : Temperature in Celsius
* Humidity : %
* Wind speed : m/s
* Visibility : 10 m
* Dew point temperature : °C
* Solar Radiation : MJ/m2
* Rainfall : mm
* Snowfall : cm
* Seasons : Winter, Spring, Summer, Autumn
* Holiday : Holiday/Not Holiday
* Functioning Day : Yes (functional hours) / No (non-functional hours)


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns.tolist():
  print("No.of Unique values in",i,"is",data[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Before doing any data wrangling lets create copy of the dataset
data_2 = data.copy()

In [None]:
data.columns

In [None]:
data_2.rename(columns ={'Date':'date', 'Rented Bike Count':'rented_bike_count', 'Hour':'hour', 'Temperature(°C)':'temperature',
                       'Humidity(%)':'humidity','Wind speed (m/s)':'wind_speed', 'Visibility (10m)':'visibility',
                        'Dew point temperature(°C)':'dew_point_temperature',
                       'Solar Radiation (MJ/m2)':'solar_radiation', 'Rainfall(mm)':'rainfall', 'Snowfall (cm)':'snowfall', 'Seasons':'seasons',
                       'Holiday':'holiday', 'Functioning Day':'functioning_day'},inplace = True)

In [None]:
data_2.columns

In [None]:
data_2['date'] = pd.to_datetime(data_2['date'], format='%d/%m/%Y').dt.strftime('%d/%m/%Y') # Added format argument to specify the correct format of the date column


In [None]:
# Creating new columns for day and month
data_2['month'] = pd.to_datetime(data_2['date'],format= '%d/%m/%Y').dt.month
data_2['day_of_week'] = pd.to_datetime(data_2['date'],format = '%d/%m/%Y').dt.dayofweek

In [None]:
# Engineering new feature 'weekend' from day_of_week
# pandas dayofweek runs from Monday=0 to Sunday=6, so Saturday=5 and Sunday=6
data_2['weekend'] = data_2['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)

In [None]:
data_2.describe(include='all').round(2)

In [None]:
# defining continuous independent variables separately
cont_var = ['temperature', 'humidity', 'wind_speed', 'visibility', 'dew_point_temperature','solar_radiation', 'rainfall', 'snowfall']

In [None]:
# defining dependent variable
dependent_variable = ['rented_bike_count']

In [None]:
# defining categorical independent variables separately
cat_var = ['hour','seasons', 'holiday', 'functioning_day', 'month', 'day_of_week', 'weekend']


### What all manipulations have you done and insights you found?

* From the date column, the 'month' and 'day_of_week' columns are created.

* From the day_of_week column, a weekend column is created, where 5 and 6 are the weekend days (Saturday and Sunday).

* We have also defined the continuous variables, the dependent variable and the categorical variables for ease of plotting graphs.
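As a quick check of the weekend rule, here is a minimal sketch on a few hypothetical dates (they are illustrative and not taken from the dataset). It relies on pandas numbering `dayofweek` from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Hypothetical sample dates: a Friday, Saturday, Sunday and Monday
sample = pd.DataFrame({'date': ['01/12/2017', '02/12/2017', '03/12/2017', '04/12/2017']})
parsed = pd.to_datetime(sample['date'], format='%d/%m/%Y')

# Monday=0 ... Sunday=6, so the weekend is day 5 or 6
sample['day_of_week'] = parsed.dt.dayofweek
sample['weekend'] = (sample['day_of_week'] >= 5).astype(int)
print(sample)
```

Only the Saturday and Sunday rows should end up flagged as weekend.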

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Dependent variable Distribution

In [None]:
# Chart - 1 visualization code for distribution of target variable
plt.figure(figsize=(10,8))
sns.distplot(data_2['rented_bike_count'])
plt.show()

##### 1. Why did you pick the specific chart?

A distplot combines a histogram with a kernel density estimate (KDE). It provides a quick way to check the distribution of the data, identify patterns or outliers, and see whether the data follows a normal distribution.

Thus, I used this plot to analyse whether the distribution of the dependent variable over the whole dataset is symmetric or not.

##### 2. What is/are the insight(s) found from the chart?

From the above distribution plot of the dependent variable, rented bike count, we can clearly see that the distribution is positively skewed (right skewed).

This means the distribution is not symmetric around the mean.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. From this insight we know that the dependent variable is not normally distributed, so before fitting any model on this data we should transform it to reduce the skew.
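To make the idea of reducing skew concrete, the sketch below compares the skewness of a right-skewed variable before and after a square root transformation. The data is synthetic (a gamma sample chosen only for illustration), so the printed numbers say nothing about the real rented bike count.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed counts standing in for hourly rentals (assumption, not the real data)
counts = pd.Series(rng.gamma(shape=1.5, scale=400.0, size=5000))

raw_skew = counts.skew()              # skewness of the raw values
sqrt_skew = np.sqrt(counts).skew()    # skewness after a square root transformation
print(f"raw skew: {raw_skew:.2f}, sqrt skew: {sqrt_skew:.2f}")
```

A skewness closer to zero after the transformation suggests a more symmetric target, which tends to suit linear models better.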

#### Chart - 2 :  Distribution/ Box plot

In [None]:
# Chart - 2 visualization code
# Histogram and boxplot for each numeric column to show the data distribution
for col in data_2.describe().columns:
  fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(20,8))
  sns.histplot(data_2[col],ax=axes[0],kde=True)
  sns.boxplot(data_2[col],ax= axes[1],orient ='h',showmeans= True,color='Blue')
  fig.suptitle("Distribution plot of " + col,fontsize = 15)
  plt.show()


##### 1. Why did you pick the specific chart?

A histplot displays the distribution of a dataset by showing how often each value or group of values occurs. It is useful for understanding the shape of a distribution, especially for large datasets (more than 100 observations), and can help detect unusual observations (outliers) or gaps in the data.

Thus, we used the histogram plot to analyse whether each variable's distribution over the whole dataset is symmetric or not.

A boxplot summarizes the key statistical characteristics of a dataset, including the median, quartiles, and range, in a single plot. Boxplots are useful for identifying outliers, comparing the distributions of multiple datasets, and understanding the dispersion of the data.

Thus, for each numerical variable in the dataset, we used a box plot to analyse the outliers and the interquartile range, along with the mean, median, maximum and minimum values.

##### 2. What is/are the insight(s) found from the chart?

From the above univariate analysis of all continuous feature variables, we see that only the temperature and humidity columns look approximately normally distributed; the others show different distributions.

We can also see outlier values in the snowfall, rainfall, wind speed and solar radiation columns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Histograms and box plots alone cannot give us the whole picture of the data. They are used here only to see how each column is distributed over the dataset.

#### Chart - 3 : Dependent variable with continuous variables (Bivariate)

In [None]:
# Chart - 3 visualization code
# Analyzing the relationship between the dependent variable and the continuous variable
for i in cont_var:
  plt.figure(figsize=(11,8))
  sns.regplot(x=i, y= dependent_variable[0],data=data_2)
  plt.xlabel(i)
  plt.ylabel(dependent_variable[0])
  plt.title(i + ' vs ' + dependent_variable[0])
  plt.show()

##### 1. Why did you pick the specific chart?

Regplot creates a scatter plot with a fitted linear regression line. Its purpose is to visualize the relationship between two continuous variables. It can help identify patterns and trends in the data and gives a visual check of whether the relationship is roughly linear.

We used regplot to check the patterns between each continuous independent variable and the dependent variable, rented bike count.

##### 2. What is/are the insight(s) found from the chart?

From the above regression plots we can see some linear relationship between rented bike count and temperature, solar radiation and dew point temperature.

The other variables do not show any clear pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, to some extent. A few variables show a pattern with the dependent variable, so they may be important features when predicting rented bike count, and the business should pay attention to them.

#### Chart - 4  : Categorical variables with dependent variable (bivariate)

In [None]:
# Chart - 4 visualization code
# Analyzing the relationship between the dependent variable and the categorical variables
for i in cat_var:
  plt.figure(figsize=(11,8))
  sns.barplot(x=i,y=dependent_variable[0],data=data_2 ,palette = 'rainbow')
  plt.xlabel(i)
  plt.ylabel(dependent_variable[0])
  plt.title(i + ' vs ' + dependent_variable[0])
  plt.show()

##### 1. Why did you pick the specific chart?

Bar charts compare the size or frequency of different categories or groups of data, and they can display a large amount of data in a small space.

We used bar charts to show how the average rented bike count varies across the categories of each categorical variable.

##### 2. What is/are the insight(s) found from the chart?

From the above bar charts we got the following insights:

* In the hour vs rented bike count chart, demand is high at 8 o'clock in the morning and at 18 o'clock in the evening.
* In the seasons vs rented bike count chart, demand is highest in summer and lowest in winter.
* Demand is high on working days.
* From the month chart, demand is highest in June.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can have a positive business impact: by analysing demand across the categorical variables we learn when bike demand is highest, so the business can focus on those periods.

#### Chart - 5 : Rented Bike vs Hour

In [None]:
# Chart - 5 visualization code
# Plotting line graph of the average rented bike count per hour
avg_rent_hrs= data_2.groupby('hour')['rented_bike_count'].mean()
plt.figure(figsize=(12,6))
sns.lineplot(data=avg_rent_hrs,marker='o')
plt.title('Average bike rented per hour')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot, also known as a line chart or line graph, visualizes the trend of a variable over time. It uses a series of data points connected by a line to show how the value changes.

Line plots quickly show trends and patterns in the data and are also useful for comparing the trends of multiple variables.

We used a line plot to see how rented bike demand is distributed over the 24 hours of the day.

##### 2. What is/are the insight(s) found from the chart?

From the above line plot we can clearly see that demand is high in the morning and in the evening.
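The peaks can also be read off programmatically rather than by eye. The following is a minimal sketch on a tiny hypothetical frame (the values are made up for illustration), using the same `groupby` aggregation as the chart code.

```python
import pandas as pd

# Hypothetical hourly records, not the real dataset
toy = pd.DataFrame({
    'hour': [7, 8, 8, 12, 18, 18, 23],
    'rented_bike_count': [600, 1000, 1100, 500, 1500, 1400, 300],
})

# Average demand per hour, then the two hours with the highest average
avg_by_hour = toy.groupby('hour')['rented_bike_count'].mean()
peak_hours = avg_by_hour.nlargest(2).index.tolist()
print(peak_hours)
```

On the real data, the same two lines applied to `data_2` would list the busiest hours directly.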

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Since demand is high in the morning and evening, the business needs to focus on those time slots and make sure enough bikes are available to meet the demand.

#### Chart - 6  : Bike demand throughout the day (Multivariate)

In [None]:
# Chart - 6 visualization code
for i in cat_var:
  if i =='hour':
    continue
  else:
    fig ,ax = plt.subplots(figsize = (12,8))
    sns.pointplot(data=data_2,x='hour',y='rented_bike_count',hue =i,ax=ax)
    plt.title("hourly bike demand broken down based on the attribute: "+i)
    plt.legend(bbox_to_anchor = (1.05,1),loc='upper left' ,title = i)
    plt.show()

##### 1. Why did you pick the specific chart?

A point plot shows the mean of a numeric variable at each level of a categorical variable and connects the points with lines, so it reads like a line plot with one line per group. This makes it easy to compare how the hourly trend differs between groups.

We used point plots with multiple lines to show the demand for rented bikes throughout the day, broken down by each of the other categorical variables.

##### 2. What is/are the insight(s) found from the chart?

From the above plots we see that:

* In winter there is no significant demand, even in the morning or in the evening.
* On non-holidays there is a spike in the morning and in the evening, but this spike is absent on holidays.
* For around 3 months in winter (December, January and February) demand is low.
* On weekends there is demand almost throughout the day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, from this analysis we identified some key factors, such as high demand in the morning and evening slots, which is much weaker in winter and absent on holidays.

#### Chart - 7 :  Categorical plot for seasons

In [None]:
# Chart - 7 visualization code
# Plot of rented bike count by season
sns.catplot(x='seasons',y='rented_bike_count',data=data_2)

##### 1. Why did you pick the specific chart?

Catplot creates a categorical plot, which shows how a numeric variable is distributed across the levels of a categorical variable and lets us compare those distributions.

We used catplot to see the distribution of the rented bike count for each season.

##### 2. What is/are the insight(s) found from the chart?

From the above catplot we learn that:

1. Demand is low in winter.
2. In all seasons the points are densely distributed up to a bike count of about 2500.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. From this catplot we know that most hourly bike counts lie below about 2500, so values above that may be outliers. The business needs to evaluate them.

#### Chart - 8

In [None]:
# Chart - 8 visualization code for pie chart
# Total rented bike count per season, kept in a fixed season order
BikeSeasons = data_2.groupby('seasons')['rented_bike_count'].sum().reindex(['Winter', 'Spring', 'Summer', 'Autumn'])

plt.figure(figsize=(10, 10))
plt.pie(BikeSeasons.values, labels=BikeSeasons.index, autopct='%1d%%')
plt.title("Share of bike rentals by season", fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts show the proportions of a whole and are especially useful for data expressed as percentages of the whole.

So, we used a pie chart to see the percentage share of rented bikes by season.

##### 2. What is/are the insight(s) found from the chart?

From the above pie chart:

* Over the year, summer contributes around 36% of rentals, followed by autumn with around 29%.
* Demand is lowest in winter, which contributes only around 7%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight only shows the percentage contribution of each season to the yearly rentals, which gives a clear indication of seasonal demand.

### **Chart - 14 - Correlation Heatmap**

In [None]:
# Chart - 14 visualization code
# Correlation Heatmap visualization code
corr = data_2.corr(numeric_only=True)
mask = np.zeros_like(corr)

mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(18, 9))
    ax = sns.heatmap(corr , mask=mask, vmin = -1,vmax=1, annot = True, cmap="YlGnBu")


##### 1. Why did you pick the specific chart?

The correlation coefficient is a measure of the strength and direction of a linear relationship between two variables. A correlation matrix is used to summarize the relationships among a set of variables and is an important tool for data exploration and for selecting which variables to include in a model. The range of correlation is [-1,1].

Thus, to know the correlation between all the variables along with the correlation coefficients, we used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

From the above correlation map we can clearly see that:

1. There is high multicollinearity between some independent variables (i.e. temperature & dew point temperature, humidity & dew point temperature, weekend & day of week).

2. Temperature, hour, dew point temperature & solar radiation are correlated with the dependent variable, rented bike count.
Other than that we did not see any notable correlation.
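Highly correlated pairs can also be listed directly from the correlation matrix instead of being read from the heatmap. The sketch below uses synthetic columns (built so that one pair is strongly related) and an arbitrary illustrative threshold of 0.8; neither comes from the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
temperature = rng.normal(15, 10, n)
toy = pd.DataFrame({
    'temperature': temperature,
    # Dew point built to track temperature closely (synthetic assumption)
    'dew_point_temperature': temperature - 5 + rng.normal(0, 2, n),
    'wind_speed': rng.normal(2, 1, n),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].index.tolist()   # 0.8 is an illustrative threshold
print(high_pairs)
```

Pairs returned this way are candidates for dropping or combining before fitting a linear model.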

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data_2)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot, also known as a scatterplot matrix, is a visualization that allows you to visualize the relationships between all pairs of variables in a dataset. It is a useful tool for data exploration because it allows you to quickly see how all of the variables in a dataset are related to one another.

Thus, we used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys similar information to the correlation map, but as scatter plots rather than coefficients.

##### 2. What is/are the insight(s) found from the chart?

From the above pair plot we see that there is no clear linear relationship between most variables. Other than dew point temperature, temperature & solar radiation, there is no visible relationship.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on the above chart experiments, we noticed that our dependent variable does not seem to be normally distributed. We therefore take normality of the data as the hypothesis to test and check it with a statistical test.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

### Normality test

For the normality test we define:

* Null hypothesis : Data is normally distributed
* Alternate hypothesis : Data is not normally distributed

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import shapiro
test_data = data_2['rented_bike_count']
stat, p = shapiro(test_data)
print('stat = %.2f, p = %.3f' % (stat, p))
if p <= 0.05:
  print("Reject null hypothesis i.e. data is not normally distributed")
else:
  print("Fail to reject null hypothesis i.e. data may be normally distributed")


##### Which statistical test have you done to obtain P-Value?

We used the Shapiro-Wilk statistical test to obtain the p-value. The p-value we got is very small, less than 0.05, so we reject the null hypothesis.

##### Why did you choose the specific statistical test?

The Shapiro-Wilk test is used to test the normality of a sample. The test checks whether the sample data fits a normal distribution, which is often assumed for statistical analysis. The test results can help determine if the data should be transformed or if non-parametric statistical methods should be used instead of traditional parametric methods.
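SciPy's documentation notes that for samples larger than 5000 the Shapiro-Wilk p-value may not be accurate, and this dataset has 8760 rows. As a possible cross-check, the sketch below runs the D'Agostino-Pearson test (`scipy.stats.normaltest`) on a synthetic right-skewed sample of the same size. The sample is an assumption for illustration and the result says nothing about the real data.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(42)
# Synthetic right-skewed sample with the same size as the dataset (assumption)
sample = rng.gamma(shape=1.5, scale=400.0, size=8760)

# Null hypothesis: the sample comes from a normal distribution
stat, p = normaltest(sample)
print(f"stat = {stat:.2f}, p = {p:.3g}")
is_normal = p > 0.05
```

If both tests reject normality on the real target, the conclusion does not hinge on the large-sample caveat of a single test.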

## ***6. Feature Engineering & Data Pre-processing***

### 2. Handling Outliers

In [None]:
''' # Handling Outliers & Outlier treatments
# Replacing outliers in the target with the median, using the IQR method:
q1, q3, median = data_2.rented_bike_count.quantile([0.25,0.75,0.5])
lower_limit = q1 - 1.5 * (q3-q1)
upper_limit = q3 + 1.5 * (q3-q1)
data_2['rented_bike_count'] = np.where(data_2['rented_bike_count'] > upper_limit,median,np.where(data_2['rented_bike_count'] < lower_limit, median, data_2['rented_bike_count']))
# Capping outliers at the 99th percentile:
for col in ['wind_speed','solar_radiation','rainfall','snowfall']:
  upper_limit = data_2[col].quantile(0.99)
  data_2[col] = np.where(data_2[col] > upper_limit ,upper_limit ,data_2[col]) '''

##### What all outlier treatment techniques have you used and why did you use those techniques?

Here we use the IQR method and the capping method. With the IQR method we set the upper and lower limits of rented bike count and replace the outliers with the median value.

We also cap outliers at the 99th percentile: values above it are replaced with the upper limit value.

Note :-

* We tried treating the outliers, but model performance dropped by around 10% afterwards.
* So, we decided to train the models without removing the outliers.
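For reference, the IQR fences described above can be sketched as a small helper. The function name `iqr_bounds` and the toy series are hypothetical, and this version caps values at the fences with `clip` rather than replacing them with the median.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1, 2, 3, 4, 5, 100])   # toy data with one obvious outlier
lower, upper = iqr_bounds(values)

# Cap anything outside the fences instead of dropping rows
capped = values.clip(lower=lower, upper=upper)
print(lower, upper, capped.tolist())
```

Capping keeps every row, which matters for hourly data where dropping rows would leave gaps in the time series.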

### 3. Categorical Encoding

In [None]:
# Converting snowfall and rainfall to binary categorical attributes
data_2['snowfall'] = data_2['snowfall'].apply(lambda x :1 if x>0 else 0)
data_2['rainfall'] = data_2['rainfall'].apply(lambda x :1 if x>0 else 0)

In [None]:
# encoding the visibility column
data_2['visibility'] = data_2['visibility'].apply(lambda x: 0 if 0<=x<=399 else (1 if 400<=x<=999 else 2))

In [None]:
# encoding
data_2['functioning_day'] = np.where(data_2['functioning_day'] == 'Yes',1,0)
data_2['holiday'] = np.where(data_2['holiday'] == 'Holiday', 1,0)

In [None]:
# one hot encoding (dtype=int keeps the dummy columns numeric for the later scaling step)
data_2 = pd.get_dummies(data_2 ,columns = ['hour' ,'visibility' ,'month', 'day_of_week'], dtype=int)

In [None]:
# Encode your categorical columns
data_2.columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Since there are very few days on which there was snowfall or rainfall, we convert these columns to binary categorical columns indicating whether there was rainfall / snowfall in that particular hour.

For visibility (recorded in units of 10 m), we use these thresholds:

Visibility >= 10 Km ---> Clear (high visibility)

4 Km <= Visibility < 10 Km ---> Haze (medium visibility)

Visibility < 4 Km ---> Fog (low visibility)

We convert visibility based on the above threshold values. Since the categories are ordinal, we encode them as 0 (low visibility), 1 (medium visibility), 2 (high visibility).

For functioning day and holiday there are only two categories each, so we encode them as 0 and 1.

For hour, visibility, month & day of the week we use one hot encoding.
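The visibility thresholds can also be expressed with `pd.cut`, which keeps the bin edges in one place. This is a sketch on toy values, assuming visibility is stored as whole numbers in units of 10 m, so 399 corresponds to just under 4 Km and 999 to just under 10 Km.

```python
import numpy as np
import pandas as pd

visibility = pd.Series([100, 399, 400, 999, 1000, 2000])   # toy values in units of 10 m

# Right-closed bins: (-1, 399] -> 0 (fog), (399, 999] -> 1 (haze), (999, inf) -> 2 (clear)
bins = [-1, 399, 999, np.inf]
visibility_level = pd.cut(visibility, bins=bins, labels=[0, 1, 2]).astype(int)
print(visibility_level.tolist())
```

The labels are ordinal, so they can be used directly as 0, 1, 2 or one hot encoded afterwards.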

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# We see that the temperature and dew temperature are highly correlated
# Scatter plot to visualize the relationship between
# temperature and dew point temperature

plt.figure(figsize =(12,6))
sns.scatterplot(x = 'temperature',y = 'dew_point_temperature' ,data= data_2)
plt.xlabel('temperature')
plt.ylabel('dew_point_temperature')
plt.title('Temperature VS Dew Point Temperature')
plt.show()

In [None]:
# correlation
data_2[['temperature','dew_point_temperature']].corr()

In [None]:
# Creating new temperature column with 50% of both temp
data_2['temp'] = 0.5 * data_2['temperature'] + 0.5 * data_2['dew_point_temperature']

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
features = [i for i in data_2.columns if i not in ['rented_bike_count','temperature','dew_point_temperature']]
features

In [None]:
# Check multicollinearity using the VIF technique
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
  # calculating VIF for each column of X
  vif = pd.DataFrame()
  vif["variables"] = X.columns
  vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  return vif

In [None]:
continuous_variables = ['temperature', 'humidity', 'wind_speed', 'dew_point_temperature', 'solar_radiation', 'temp']

In [None]:
continuous_feature_df = pd.DataFrame(data_2[continuous_variables])

continuous_feature_df

In [None]:
calc_vif(data_2[[i for i in continuous_feature_df]])

In [None]:
# Removing Temperature and dew point temperature
calc_vif(data_2[[i for i in continuous_feature_df if i  not in ['dew_point_temperature','temperature']]])

In [None]:
# dropping date, weekend, seasons, temperature and dew_point_temperature
data_2.drop(['date','weekend', 'dew_point_temperature', 'temperature','seasons'],axis=1, inplace=True)

In [None]:
data_2.head()

##### What all feature selection methods have you used  and why?

We used the Pearson correlation coefficient to check the correlation between the independent variables and also with the dependent variable.

We also checked multicollinearity using VIF and removed the variables with a high VIF value.

##### Which all features you found important and why?

From the above methods we found a high correlation between temperature and dew point temperature. So we created a new variable, temp, as the average of the two (50% of each).
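
The effect of merging two collinear columns can be checked on synthetic data. The sketch below is illustrative only: the column values are randomly generated stand-ins, and the `vif` helper is a hypothetical NumPy re-implementation of the idea, VIF = 1 / (1 - R²). It fits an intercept in each auxiliary regression, whereas the statsmodels function used above works on the columns exactly as passed, so the two will not give identical numbers.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
temperature = rng.normal(13, 12, size=500)                   # synthetic stand-in
dew_point = temperature - 9 + rng.normal(0, 2, size=500)     # synthetic: tracks temperature closely
humidity = rng.normal(58, 20, size=500)                      # synthetic: independent of the others

vif_before = vif(np.column_stack([temperature, dew_point, humidity]))

# Same combination as in the notebook: 50% of each column
temp = 0.5 * temperature + 0.5 * dew_point
vif_after = vif(np.column_stack([temp, humidity]))
```

In this toy setup the two temperature columns have a large VIF while the merged column does not, which is the pattern the feature selection step relies on.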

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# visualizing the distribution of the dependent variable - rental bike count
plt.figure(figsize=(10,5))
sns.histplot(data_2[dependent_variable[0]], kde=True)  # distplot is deprecated in recent seaborn versions
plt.xlabel(dependent_variable[0])
plt.title(dependent_variable[0] + ' distribution')
plt.axvline(data_2[dependent_variable[0]].mean(),color = 'blue' ,linestyle='dashed', linewidth= 3 )
plt.axvline(data_2[dependent_variable[0]].median(),color ='pink' ,linestyle='dashed', linewidth = 3)
plt.show()

In [None]:
# skew of the dependent variable

data_2[dependent_variable].skew()

In [None]:
# Defining dependent and independent variables
X = data_2.drop('rented_bike_count',axis=1)
y = np.sqrt(data_2[dependent_variable])

In [None]:
X

In [None]:
features

We plotted the distribution and ran a Shapiro-Wilk normality test, and found the data to be approximately normally distributed. We also checked the skewness of the rented bike count and found it to be close to zero.
Even so, the code above takes the square root of the target before modelling, so predictions are squared back to the original scale during evaluation.
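
To illustrate what a square-root transformation does to a right-skewed count variable, here is a minimal sketch on synthetic data. The gamma-distributed `counts` array is an assumption standing in for the rented bike count, and `skewness` is a plain moment-based estimate (pandas' `.skew()` applies a small-sample correction, so its values differ slightly).

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=350.0, size=5000)   # synthetic right-skewed counts

y_sqrt = np.sqrt(counts)         # transformation applied to the target before fitting
restored = np.square(y_sqrt)     # inverse applied to predictions before scoring

skew_raw = skewness(counts)
skew_sqrt = skewness(y_sqrt)
```

On data like this the skewness of the transformed values is much closer to zero than that of the raw values, and squaring recovers the original scale.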


### 6. Data Scaling

In [None]:
# Scaling your data
features = [ i for i in data_2.columns if i not in ['rented_bike_count']]

In [None]:
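# Scaling your data
scaler = StandardScaler()
numerical_features = data_2[features].select_dtypes(include=['number'])
# keep the feature names aligned with the columns that are actually scaled
features = list(numerical_features.columns)
X = scaler.fit_transform(numerical_features)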

##### Which method have you used to scale you data and why?

Our independent features are on different scales, so we used StandardScaler to bring them onto one common scale.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train ,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2 , random_state = 0)


##### What data splitting ratio have you used and why?

To train the model we split the data into train and test sets using the train_test_split method.

We used 80% of the data for training and 20% for testing.
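
In the cells above the scaler is fitted on all rows before the split. A common variant, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that the test rows do not influence the scaling statistics. The arrays `X_raw` and `y_raw` are hypothetical placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on three different scales, plus a synthetic target
X_raw = rng.normal(loc=[20.0, 60.0, 1.5], scale=[10.0, 20.0, 1.0], size=(1000, 3))
y_raw = rng.normal(size=1000)

# Split first, so the test rows play no part in estimating the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # mean and std learned from the training rows only
X_te_scaled = scaler.transform(X_te)       # test rows reuse the training statistics
```

With this ordering the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.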

## ***7. ML Model Implementation***

In [None]:
# Calculate evaluation metrics
# Defining a function to print the evaluation metrics
def evaluate_model(model, y_test, y_pred):
  # Squaring y_test and y_pred because the target was sqrt-transformed
  y_t = np.square(y_test)
  y_p = np.square(y_pred)
  # The train-side values rely on the global X_train and y_train
  y_train2 = np.square(y_train)
  y_train_pred = np.square(model.predict(X_train))

  # Calculate evaluation metrics on the original scale
  mse = mean_squared_error(y_t, y_p)
  mae = mean_absolute_error(y_t, y_p)
  rmse = np.sqrt(mse)
  r2_train = r2_score(y_train2, y_train_pred)
  r2 = r2_score(y_t, y_p)
  r2_adjusted = 1 - (1 - r2) * ((len(X_test) - 1) / (len(X_test) - X_test.shape[1] - 1))

  # Print evaluation metrics
  print("MSE:", mse)
  print("RMSE:", rmse)
  print("MAE:", mae)
  print("Train R2:", r2_train)
  print("Test R2:", r2)
  print("Adjusted R2:", r2_adjusted)

  # Plot actual and predicted values
  plt.figure(figsize=(18, 6))
  plt.plot((y_p)[:100])
  plt.plot((np.array(y_t))[:100])
  plt.legend(["Predicted", "Actual"])
  plt.title('Actual and Predicted Bike Count', fontsize=18)

  # Tree models expose feature_importances_, linear models expose coef_
  if hasattr(model, 'feature_importances_'):
      importance = model.feature_importances_
  else:
      importance = model.coef_
  # coef_ is 2-D when y is a single-column DataFrame, so flatten it
  importance = np.ravel(np.absolute(importance))

  # Feature importances
  feat = pd.Series(importance, index=features[:len(importance)])
  plt.figure(figsize=(12, 8))
  plt.title('Feature Importances (top 9) for ' + str(model), fontsize=18)
  plt.xlabel('Relative Importance')
  feat.nlargest(9).plot(kind='barh')

  model_score = {
      'MSE': mse,
      'MAE': mae,
      'RMSE': rmse,
      'Train R2': r2_train,
      'Test R2': r2,
      'Adjusted R2': r2_adjusted
  }

  return model_score

In [None]:
# Create a score dataframe
score = pd.DataFrame(index = ['MSE', 'RMSE','MAE', 'Train R2', 'Test R2', 'Adjusted R2'])

### ML Model - 1 : Linear Regression

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 1 Implementation
reg =LinearRegression()
reg.fit(X_train,y_train)
y_pred_li = reg.predict(X_test)
linear_score = evaluate_model(reg, y_test,y_pred_li)
# Evaluation Metric Score chart
score['Linear regression'] = linear_score


In [None]:
score

So, using the linear regression model we got an accuracy (i.e. R2 score) of around 45% on the train data and 44% on the test data, which seems too low to predict well on unseen data.
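
For reference, the metrics reported in the score chart can be written out by hand. The following is a minimal sketch with made-up toy numbers (not results from this notebook); `regression_metrics` is a hypothetical helper that mirrors the formulas used in `evaluate_model`, including the step of squaring sqrt-scale predictions before scoring.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """MSE, RMSE, MAE, R2 and adjusted R2 on the original (un-transformed) scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {'MSE': mse, 'RMSE': np.sqrt(mse), 'MAE': mae, 'R2': r2, 'Adjusted R2': adj_r2}

# Toy numbers: the model predicts on the sqrt scale, so predictions are squared before scoring
actual_counts = np.array([100.0, 400.0, 900.0, 1600.0])
sqrt_predictions = np.array([11.0, 19.0, 31.0, 39.0])
metrics = regression_metrics(actual_counts, np.square(sqrt_predictions), n_features=1)
```

R2 is the share of the variation in the target that the predictions explain, so calling it "accuracy" is a loose shorthand. Adjusted R2 additionally penalizes the number of features.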

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the model
reg = LinearRegression()

# Define the parameters to be optimized
param_grid = {'fit_intercept': [True, False]}

# Perform grid search
grid_search = GridSearchCV(reg, param_grid, cv=5, scoring='r2', return_train_score=True)
grid_search.fit(X_train, y_train)

In [None]:
# Print the best parameters and the corresponding score
print("Best parameters: ", grid_search.best_params_)
print("Best R2 score: ", grid_search.best_score_)

In [None]:
# best_estimator_ is already refit on the full training data with the best parameters
best_reg = grid_search.best_estimator_

In [None]:
# predict on test data
y_pred_li2 = best_reg.predict(X_test)

In [None]:
linear_score2 = evaluate_model(best_reg, y_test,y_pred_li2)

In [None]:
score['Linear regression tuned'] = linear_score2

In [None]:
score

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.
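
As a rough illustration of what the grid search does, the sketch below runs GridSearchCV on synthetic data with a Ridge model. The data, the model choice and the alpha grid are assumptions made for the example, not the notebook's configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))                                   # synthetic scaled features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)     # 5 candidates x 5 folds, then one refit on all rows

n_candidates = len(search.cv_results_['params'])    # one entry per grid point
best_alpha = search.best_params_['alpha']
# refit=True (the default) means best_estimator_ is already trained on all of X_demo
y_hat = search.best_estimator_.predict(X_demo)
```

Every combination in the grid is scored by cross-validation, so the cost grows with the product of the grid sizes. With a one-parameter grid such as `fit_intercept`, there are only two candidates to compare.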

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After using GridSearchCV we see no improvement in the model. There is no change in the train R2 score.

So, we have decided to move ahead with next regression model.

### ML Model - 2 :  Lasso Regression





In [None]:
# Initialize a Lasso regression model with default parameters
lasso = Lasso()
# Fit the lasso regression model to the training data
lasso.fit(X_train,y_train)
# Predict on the test data
y_pred_lasso1 = lasso.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
lasso_score = evaluate_model(lasso,y_test,y_pred_lasso1)
score['Lasso regression']= lasso_score

In [None]:
score

Using Lasso regression, the performance of the model has dropped, so we will try to tune the model.
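
One plausible reason for such a drop is that the default penalty strength (`alpha=1.0` in scikit-learn) can shrink many coefficients to exactly zero. The sketch below shows this on synthetic data; the features, coefficients and alpha values are assumptions chosen for illustration, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))                        # synthetic standardized features
true_coef = np.array([3.0, 0.5, 0.2, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

strong = Lasso(alpha=1.0).fit(X_demo, y_demo)     # scikit-learn's default penalty strength
weak = Lasso(alpha=0.01).fit(X_demo, y_demo)      # much lighter penalty

kept_strong = int(np.count_nonzero(strong.coef_))   # features that survive the strong penalty
kept_weak = int(np.count_nonzero(weak.coef_))
r2_strong = strong.score(X_demo, y_demo)
r2_weak = weak.score(X_demo, y_demo)
```

In this toy case the strong penalty keeps fewer features and fits worse than the light penalty, which is why searching over alpha is a natural next step.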

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Lasso regressor tuned with grid search CV
lasso = Lasso()

parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
#fitting model
lasso_regressor.fit(X_train,y_train)

In [None]:
#getting optimum parameters
print("The optimum alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
# Initialize a Lasso regression model with the best alpha
lasso = Lasso(alpha = lasso_regressor.best_params_['alpha'])

# Fit the lasso regression model to the training data
lasso.fit(X_train, y_train)

# Predict on the test data
y_pred_lassocv = lasso.predict(X_test)

In [None]:
# Evaluation metrics for Lasso regression

lasso2 = evaluate_model(lasso, y_test,y_pred_lassocv)  # same model that produced y_pred_lassocv
name = 'Lasso with alpha = ' + str(lasso_regressor.best_params_['alpha'])
score[name] = lasso2

##### Which hyperparameter optimization technique have you used and why?


GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score

### ML Model - 3 : Decision Tree

In [None]:
# ML Model - 3 Implementation
dt = DecisionTreeRegressor(random_state=1)
# Fit the Algorithm
dt.fit(X_train,y_train)
# Predict on the model
y_pred_dt1 = dt.predict(X_test)

In [None]:
# Evaluation Metric Score chart
result = evaluate_model(dt, y_test,y_pred_dt1)
score['Decision tree'] = result


In [None]:
score

From the decision tree algorithm we got a train R2 score of 1 and a test R2 score of 37%, which indicates overfitting.
So, we decided to tune the model using GridSearchCV.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Regressor
dt_model = DecisionTreeRegressor(random_state=1)
# HYperparameter Grid
grid = {'max_depth' : [8,10,12,14,16],
        'min_samples_split' : np.arange(35,50),
        'min_samples_leaf' : np.arange(22,31,2)}
# fitting model with hypertuned paramaters using grid search
dt_gridsearch = GridSearchCV(dt_model,
                             grid,
                             cv=6,
                             scoring= 'neg_root_mean_squared_error')
dt_gridsearch.fit(X_train,y_train)
dt_best_params = dt_gridsearch.best_params_

In [None]:
dt_best_params


In [None]:
# building DT model with best parameters
dt_model = DecisionTreeRegressor(max_depth=dt_best_params['max_depth'],
                                 min_samples_leaf=dt_best_params['min_samples_leaf'],
                                 min_samples_split=dt_best_params['min_samples_split'],
                                 random_state=1)

In [None]:
# fitting model
dt_model.fit(X_train,y_train)

In [None]:
# dt test predictions
y_pred_dt = dt_model.predict(X_test)


In [None]:
#Evaluation matrices for DecisionTree
result = evaluate_model(dt_model, y_test,y_pred_dt)
score['Decision tree tuned'] = result

In [None]:
# Name the tree's features without modifying X_train itself
n_cols = X_train.shape[1]
if len(features) == n_cols:
    feature_names = [str(f) for f in features]
else:
    feature_names = ['feature_' + str(i) for i in range(n_cols)]  # positional fallback

graph = Source(tree.export_graphviz(dt_model,
                                    out_file=None,
                                    feature_names=feature_names,
                                    filled= True))
display(SVG(graph.pipe(format='svg')))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score

With the untuned decision tree we faced overfitting (a train R2 of 1 against 37% on test). After tuning, the gap is much smaller and the model reaches an accuracy of around 68% on train and 60% on test data.
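
The pattern described here, a perfect train score from an unconstrained tree and a smaller train/test gap once leaf size is limited, can be reproduced on synthetic data. The sketch below is only an illustration: the single noisy feature and the `min_samples_leaf=20` setting are assumptions, not the notebook's data or tuned values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 6.0, size=(600, 1))                        # synthetic single feature
y_demo = 2.0 * np.sin(X_demo[:, 0]) + rng.normal(scale=0.7, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# No size limits: the tree can isolate every training point
deep = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
# Each leaf must hold at least 20 training rows, which smooths the predictions
limited = DecisionTreeRegressor(min_samples_leaf=20, random_state=1).fit(X_tr, y_tr)

train_r2_deep = deep.score(X_tr, y_tr)
train_r2_limited = limited.score(X_tr, y_tr)
gap_deep = train_r2_deep - deep.score(X_te, y_te)
gap_limited = train_r2_limited - limited.score(X_te, y_te)
```

Parameters such as `max_depth`, `min_samples_split` and `min_samples_leaf` all act in this direction: they trade a lower train score for a smaller gap between train and test.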

### ML Model - 4 : Ridge Regression

In [None]:
# Initialize a Ridge regression model with default parameters
ridge = Ridge()
# Fit the model on the training data
ridge.fit(X_train, y_train)
# Predict on the test data
y_pred_ridge1 = ridge.predict(X_test)

**1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.**

In [None]:
# Evaluation Metric Score chart
result = evaluate_model(ridge, y_test,y_pred_ridge1)
score['Ridge'] = result

In [None]:
score

We used Ridge regression to check the performance of the model and found no significant difference between linear regression and Ridge.

**2. Cross- Validation & Hyperparameter Tuning**

In [None]:
# Ridge regressor tuned with grid search over alpha
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)
#fitting model
ridge_regressor.fit(X_train,y_train)

In [None]:
#getting optimum parameters
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Initialize ridge with the best alpha
ridge = Ridge(alpha = ridge_regressor.best_params_['alpha'])
# Fit the model on the training data
ridge.fit(X_train, y_train)
# Predict on the test data
y_pred_ridge = ridge.predict(X_test)

In [None]:
#Evaluation matrices for Ridge regression
result = evaluate_model(ridge, y_test,y_pred_ridge)
namer = 'Ridge with alpha = ' + str(ridge_regressor.best_params_['alpha'])
score[namer] = result

**Which hyperparameter optimization technique have you used and why?**

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
score


We tuned Ridge regression with a hyperparameter search and checked the performance of the model, and found no significant difference between the tuned and untuned versions.

Ridge helps mainly when multicollinearity or overfitting is present. In our case we have already handled the multicollinearity, which is why it shows no difference in performance.
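
This behaviour can be sketched with the closed-form ridge solution on synthetic data. The helper `ridge_coef` and both toy datasets are assumptions for illustration; the intercept is left out for brevity, so the features are generated with mean zero.

```python
import numpy as np

def ridge_coef(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha*I)^-1 X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
noise = rng.normal(scale=0.05, size=n)

# Case 1: two unrelated features
X_indep = np.column_stack([x1, rng.normal(size=n)])
y_indep = X_indep @ np.array([3.0, -1.0]) + noise
# alpha=0.0 reduces to ordinary least squares
shift_indep = np.abs(ridge_coef(X_indep, y_indep, 1.0) - ridge_coef(X_indep, y_indep, 0.0)).max()

# Case 2: the second feature is almost a copy of the first
X_coll = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=n)])
y_coll = X_coll @ np.array([3.0, -1.0]) + noise
shift_coll = np.abs(ridge_coef(X_coll, y_coll, 1.0) - ridge_coef(X_coll, y_coll, 0.0)).max()
```

With unrelated features the ridge and least-squares coefficients are nearly identical, while with near-duplicate features the penalty changes them substantially. This is consistent with Ridge making little difference once collinear columns have been merged.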

### ML Model - 5 : Random Forest

In [None]:
# ML Model - 5 Implementation
rf = RandomForestRegressor(random_state=0)
# Fit the Algorithm
rf.fit(X_train,y_train)
# Predict on the model
y_pred_rf1 = rf.predict(X_test)

**1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
result = evaluate_model(rf, y_test,y_pred_rf1)
score['Random forest'] = result

In [None]:
score

Here we see that with random forest regression the accuracy rises to 94% on train and 64% on test data. This is better than the earlier models, although the 30-point gap between train and test suggests overfitting.

**2. Cross- Validation & Hyperparameter Tuning**

In [None]:
# random forest model
rf_model = RandomForestRegressor(random_state=0)
rf_params = {'n_estimators':[300],                    # limited due to computational power availability
             'min_samples_leaf':np.arange(20,25)}

In [None]:
# grid search over the random forest parameters
rf_gridsearch = GridSearchCV(rf_model,rf_params,cv=6,scoring='neg_root_mean_squared_error')
rf_gridsearch.fit(X_train,y_train)
rf_best_params = rf_gridsearch.best_params_

In [None]:
# best parameters for random forests
rf_best_params

In [None]:
# Fitting RF model with best parameters
rf_model = RandomForestRegressor(n_estimators=rf_best_params['n_estimators'],
                                 min_samples_leaf=rf_best_params['min_samples_leaf'],
                                 random_state=0)

In [None]:
# fit
rf_model.fit(X_train,y_train)


In [None]:
# rf predictions on test data
y_pred_rf = rf_model.predict(X_test)

In [None]:
#Evaluation matrices for RandomForest
result = evaluate_model(rf_model, y_test,y_pred_rf)
score['Random forest tuned'] = result

**Which hyperparameter optimization technique have you used and why?**

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
score

After tuning with GridSearchCV the model gives an accuracy of 70% on train data and 64% on test data, so the gap between train and test is much smaller than before tuning.

### ML Model - 6 : Gradient Boosting Regressor

In [None]:
# ML Model - 6 Implementation
gb = GradientBoostingRegressor(random_state=0)
# Fit the Algorithm
gb.fit(X_train,y_train)
# Predict on the model
y_pred_gb1 = gb.predict(X_test)

**1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.**

In [None]:
# Evaluation Metric Score chart
result = evaluate_model(gb, y_test,y_pred_gb1)
score['Gradient Boosting Regressor'] = result

In [None]:
score

**2. Cross- Validation & Hyperparameter Tuning**

In [None]:
# GBM model
gb_model = GradientBoostingRegressor(random_state=0)
gb_params = {'n_estimators':[300],
             'min_samples_leaf':np.arange(20,24),
             'max_depth':np.arange(14,17)
             }

In [None]:
# Perform the randomized search
random_search = RandomizedSearchCV(gb_model, param_distributions=gb_params, cv=6, n_iter=20, scoring='neg_root_mean_squared_error', n_jobs=-1)
random_search.fit(X_train, y_train)

gb_best_params = random_search.best_params_

In [None]:
# GBM best parameters
gb_best_params

In [None]:
# Building GBM model with best parameters
gb_model = GradientBoostingRegressor(n_estimators=gb_best_params['n_estimators'],
                                     min_samples_leaf=gb_best_params['min_samples_leaf'],
                                     max_depth = gb_best_params['max_depth'],
                                     random_state=0)

In [None]:
# fit
gb_model.fit(X_train,y_train)

In [None]:
# gradient boosting test predictions
y_pred_gb = gb_model.predict(X_test)

In [None]:
#Evaluation matrices for GradientBoosting
result = evaluate_model(gb_model, y_test,y_pred_gb)
score['Gradient Boosting Regressor Tuned'] = result


**Which hyperparameter optimization technique have you used and why?**

Randomized search cross-validation (CV) is used to efficiently explore the hyperparameter space of a machine learning model. It works by randomly sampling from the search space of hyperparameters, rather than exhaustively trying every possible combination. This allows for a more efficient search while still providing a good chance of finding good hyperparameter values. Additionally, by using cross-validation to evaluate the performance of each set of hyperparameters, one can ensure that the model is not overfitting to the training data.

Because of its randomly sampling technique and to save the time we have decided to use Randomised search CV.
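
A minimal sketch of randomized search on synthetic data is shown below. The data, the small parameter lists and `n_iter=5` are assumptions chosen so the example runs quickly; they are not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                                   # synthetic features
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=200)

param_distributions = {
    'n_estimators': [20, 40],
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [5, 10],
}   # 2 * 3 * 2 = 12 possible combinations

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                                   # only 5 of the 12 combinations are tried
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=0,                             # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)
n_tried = len(search.cv_results_['params'])
```

The saving only appears when `n_iter` is smaller than the number of combinations. When every parameter is given as a list and `n_iter` exceeds the size of the grid, as with the 12-combination `gb_params` grid and `n_iter=20` above, scikit-learn warns and simply evaluates each combination once, which amounts to a grid search.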

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
score

After using randomized search CV the model gives an accuracy of 64% on train data and 64% on test data, which is a good balance between train and test performance.


### ML Model - 7 : Xtreme Gradient Boosting Regressor

In [None]:
# ML Model - 7 Implementation
xgb_model = xgb.XGBRegressor(random_state=0,
                             objective='reg:squarederror')
# Fit the Algorithm
xgb_model.fit(X_train,y_train)
# Predict on the model
y_pred_xgb1 = xgb_model.predict(X_test)

**1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.**

In [None]:
# Evaluation Metric Score chart
result = evaluate_model(xgb_model, y_test,y_pred_xgb1)
score['Xtreme Gradient Boosting Regressor'] = result

In [None]:
score

**2. Cross- Validation & Hyperparameter Tuning**

In [None]:
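# xg boost
xgb_model = xgb.XGBRegressor(random_state=0,
                             objective='reg:squarederror')
# NOTE: min_samples_leaf is a scikit-learn tree parameter, not an XGBoost one
# (the closest XGBoost setting is min_child_weight), so XGBoost is expected to
# ignore it and this search effectively compares identical models
xgb_params = {'n_estimators':[500],
             'min_samples_leaf':np.arange(20,22)}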

In [None]:
# finding best parameters
xgb_search = RandomizedSearchCV(xgb_model,xgb_params,cv=6,scoring='neg_root_mean_squared_error',n_iter=100, n_jobs=-1)
xgb_search.fit(X_train,y_train)
xgb_best_params = xgb_search.best_params_

In [None]:
# xg boost best parameters
xgb_best_params

In [None]:
# Building a XG boost model with best parameters
xgb_model = xgb.XGBRegressor(n_estimators=xgb_best_params['n_estimators'],
                             min_samples_leaf=xgb_best_params['min_samples_leaf'],
                             random_state=0)

In [None]:
# fit
xgb_model.fit(X_train,y_train)

In [None]:
# xtreme gradient boosting test predictions
y_pred_xgb = xgb_model.predict(X_test)


In [None]:
#Evaluation matrices for XGBRegressor
result = evaluate_model(xgb_model, y_test,y_pred_xgb)
score['Xtreme Gradient Boosting Regressor Tuned'] = result

**Which hyperparameter optimization technique have you used and why?**

Randomized search cross-validation (CV) is used to efficiently explore the hyperparameter space of a machine learning model. It works by randomly sampling from the search space of hyperparameters, rather than exhaustively trying every possible combination. This allows for a more efficient search while still providing a good chance of finding good hyperparameter values. Additionally, by using cross-validation to evaluate the performance of each set of hyperparameters, one can ensure that the model is not overfitting to the training data.

Because of its randomly sampling technique and to save the time we have decided to use Randomised search CV.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
score

After tuning the model we got an accuracy of around 97% on train data and 59% on test data. The test score is reasonable, but the large gap between train and test points to overfitting, so the model needs further improvement.

**Plot R2 scores for each model**

In [None]:
score.columns

In [None]:
# R2 Scores plot
models = list(score.columns)
train = score.iloc[-3,:]
test = score.iloc[-2,:]
X_axis = np.arange(len(models))
plt.figure(figsize=(25,10))
plt.bar(X_axis - 0.2, train, 0.4, label = 'Train R2 Score')
plt.bar(X_axis + 0.2, test, 0.4, label = 'Test R2 Score')
plt.xticks(X_axis,models, rotation=30)
plt.ylabel("R2 Score")
plt.title("R2 score for each model")
plt.legend()
plt.show()

***Plot of adjusted R2 score***

In [None]:
# Removing the overfitted models which have more than 5% gap between train and test values
score_t = score.transpose()            #taking transpose of the score dataframe to create new difference column
score_t['diff']=score_t['Train R2']-score_t['Test R2']                   #creating new column diff of train R2 and test R2 score
remove_models = list(score_t[score_t['diff']>=.05].index)                #creating a list of models which have difference more than .05 that is 5%
remove_models

adj = score_t['Adjusted R2'].drop(remove_models)                     #creating a new dataframe with required models and adjusted r2 score

plt.figure(figsize=(14,8))
plots = sns.barplot(x=list(adj.index), y=adj)
for bar in plots.patches:
  plots.annotate(format(bar.get_height(),'.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=12, xytext=(0, 8),
                   textcoords='offset points')
plt.xticks(rotation=30)

plt.title(" Adjusted R2 score", fontsize = 20)
plt.xlabel('Models', fontsize = 15)
plt.ylabel('Score', fontsize = 15)
# Setting limit of the y axis from 0 to 1
plt.ylim(0,1)
plt.show()


**1. Which Evaluation metrics did you consider for a positive business impact and why?**

Across all the models we have used the R2 score as the main evaluation metric. It shows how much of the variation in bike demand the model explains, which makes it a good indicator of whether the model is usable.
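
The overfitting filter applied before the adjusted R2 plot (dropping models whose train and test R2 differ by 5 points or more) can be sketched on its own. The model names and scores below are hypothetical placeholders, not values from this notebook's score table.

```python
import pandas as pd

# Hypothetical placeholder scores, for illustration only
r2_table = pd.DataFrame(
    {
        'Train R2': [0.50, 0.99, 0.72, 0.66],
        'Test R2':  [0.48, 0.60, 0.65, 0.64],
    },
    index=['Model A', 'Model B', 'Model C', 'Model D'],
)

# Gap between train and test R2; a large gap is treated as a sign of overfitting
r2_table['diff'] = r2_table['Train R2'] - r2_table['Test R2']
kept = r2_table[r2_table['diff'] < 0.05]     # keep models with a gap under 5 points
kept_models = list(kept.index)
```

Filtering on the gap first and comparing adjusted R2 afterwards means a model with a high train score but a much lower test score is excluded before the final comparison.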

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We ran several models, including linear regression, decision tree, random forest, gradient boosting and Xtreme gradient boosting. Among them we selected the Xtreme gradient boosting regressor (tuned), as we achieved 63% training accuracy and 63% testing accuracy.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

## **SHAP(Shapley additive Explanations)**

In [None]:
pip install shap

In [None]:
# importing shap
import shap

In [None]:
# Shap explainer for xgb (tree based)
explainer = shap.TreeExplainer(xgb_model, X_train, feature_names=features)
# Calculate SHAP values for the test set
shap_values = explainer(X_test)
# Create the SHAP force plot for the row at index 50 of the test set,
# the same row that is described below and reused in the pickle check
shap.plots.force(shap_values[50])

The force plot shows the shap values for a particular instance.

Here we have considered the row at index 50 for the plot. We can see that the prediction is 21.33 (sqrt value). The plot shows how much each column contributes to that prediction.
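
The idea behind a force plot, a base value plus one contribution per column adding up to the prediction, can be illustrated with a case where the Shapley values have a simple closed form. The sketch below uses a hypothetical linear model with independent features; it is not how `TreeExplainer` computes values for the XGBoost model, only a small demonstration of the additivity property.

```python
import numpy as np

rng = np.random.default_rng(0)
X_background = rng.normal(size=(500, 3))          # synthetic background data
weights = np.array([2.0, -1.0, 0.5])              # hypothetical linear model
intercept = 10.0

def predict(X):
    """Linear model: weighted sum of the features plus an intercept."""
    return X @ weights + intercept

# For a linear model with independent features, the Shapley value of feature i
# is weight_i * (x_i - mean of feature i over the background data)
x_row = np.array([1.5, 0.2, -0.7])
base_value = predict(X_background).mean()         # average prediction over the background
contributions = weights * (x_row - X_background.mean(axis=0))
prediction = predict(x_row)
```

Here `base_value + contributions.sum()` equals `prediction`, which is the same bookkeeping the force plot displays: positive contributions push the prediction above the base value and negative ones pull it below.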

In [None]:
# get shap values of test data
shap_values = explainer(X_test)

In [None]:
shap.summary_plot(shap_values, X_test)

In the summary plot we can see the top 9 columns and their impact on the prediction. Red indicates that the value of the column is high and blue indicates that it is low.

For categorical columns we have zeros and ones, where zero is shown in blue and one in red.

The SHAP values and their impact on the prediction are also shown. Towards the right-hand side the impact is positive (increases the predicted value) and towards the left-hand side the impact is negative (decreases the predicted value).

In [None]:
# Obtain a Bar Summary Plot
shap.summary_plot(shap_values, X_test, plot_type="bar")

This bar plot shows the top 9 important features and the mean shap values. It shows the average impact on the model output magnitude.

It does not show the positive or negative impact on the prediction.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# Import pickle
import pickle

# Save the best model (XGB); the with-block closes the file after writing
with open('xgbmodel.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
with open('xgbmodel.pkl', 'rb') as f:
    pickled_model = pickle.load(f)

# take the row at index 50 of X_test as a single-sample 2-D array
predict_new = X_test[50].reshape(1, -1)

# Testing on the one instance which we used for shap, X_test[50,:]
pickled_model.predict(predict_new)

# **Conclusion**

The project successfully demonstrated the feasibility of using machine learning techniques to predict bike demand in Seoul.

Some of the key points are:-

* High demand in the morning and evening.
* Less demand in the winter season.
* Highest demand in June.
* Found multicollinearity between temperature and dew point temperature.
* Performed linear regression, decision tree, random forest, gradient boosting and Xtreme gradient boosting, and got the highest accuracy, i.e. 93% on train and 90% on test, with Xtreme gradient boosting.
* Removing outliers did not help; it affected model performance negatively.

Overall, the project highlights the potential of machine learning in solving real-world problems and provides a roadmap for future research in this area. The findings of this project can be extended to other cities with similar bike sharing systems, leading to more effective and efficient bike sharing operations, and better outcomes for all stakeholders.
