<a href="https://colab.research.google.com/github/AryamanChaudhary/Bike-Sharing-Demand-Prediction-/blob/main/ML_REGRESSION_PROJECT_DONE_BY_ARYAMAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Done By**         - Aryaman Chaudhary

# **Project Summary -**

The goal of this project is to develop a predictive model that can accurately forecast demand for bike rentals in Seoul, South Korea, based on historical usage patterns, weather conditions, and other relevant factors. The project involves data collection, data preparation, feature engineering, model selection, hyperparameter tuning, model training, model evaluation, and model deployment.
The project aims to help the bike-sharing program in Seoul optimize its resources, improve user satisfaction, and reduce operating costs. The project will leverage machine learning techniques and statistical analysis to build a robust and accurate predictive model that can help the bike-sharing program make data-driven decisions.
The success of the project will be evaluated based on the model's ability to accurately predict bike rental demand, as well as its ability to provide valuable insights and recommendations for improving the bike-sharing program's operations. The project has the potential to not only benefit the bike-sharing program in Seoul but also serve as a model for other cities and organizations looking to optimize their resource allocation and improve their service offerings.

# **GitHub Link -**

https://github.com/AryamanChaudhary/Bike-Sharing-Demand-Prediction-/tree/main

# **Problem Statement**


The bike-sharing program in Seoul, South Korea, is experiencing low utilization rates and inefficient allocation of resources. The goal of this project is to develop a predictive model that can accurately forecast demand for bike rentals based on historical usage patterns, weather conditions, and other relevant factors. By doing so, we aim to help the bike-sharing program optimize its resources, improve user satisfaction, and reduce operating costs.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
# Call drive.mount (the original assigned a string to it); the CSV itself is read from /content below
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/SeoulBikeData.csv' , encoding = 'latin')

### Dataset First View

In [None]:
# Dataset First Look
data

In [None]:
# Viewing  top five rows in the given dataset by using head()
data.head()

In [None]:
# Viewing the last five rows of the dataset by using tail()
data.tail()

### Dataset Rows & Columns count

In [None]:
# Counting the rows and columns in the dataset
print(f'Number of rows : {len(data.axes[0])}')
print(f'Number of columns : {len(data.axes[1])}')

In [None]:
# We can also use shape to get the numbers of the rows and columns
data.shape

### Dataset Information

In [None]:
# Dataset Info
# Viewing the info of the given dataset by using info()
data.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Checking whether there are any duplicate rows
print(f' Total number of duplicate values  : {data.duplicated().sum()}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(), cbar=False);

### What did you know about your dataset?

* There are 8760 observations and 14 features.

* A day has 24 hours and the data covers 365 days, so 365 × 24 = 8760, which matches the number of rows in the dataset.

* There are no null values.
* There are no duplicate rows, so repeated records will not distort the downstream analysis.
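
The row-count reasoning above can be checked programmatically. The following is a minimal sketch on a made-up two-day sample (the `sample` frame and the `hourly_coverage` helper are illustrative assumptions, not part of the notebook); on the real data the same helper would be called on `data` before the `Date` column is dropped.

```python
import pandas as pd

# Hypothetical stand-in for the real data: two full days of hourly records.
sample = pd.DataFrame({
    "Date": ["01/12/2017"] * 24 + ["02/12/2017"] * 24,
    "Hour": list(range(24)) * 2,
})

def hourly_coverage(df):
    """Return (distinct days, rows expected at 24 per day, actual rows)."""
    n_days = int(df["Date"].nunique())
    return n_days, n_days * 24, len(df)

n_days, expected_rows, actual_rows = hourly_coverage(sample)
print(n_days, expected_rows, actual_rows)
```

If the expected and actual row counts agree, every day has a full set of 24 hourly records.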


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f' Columns : {data.columns}')

In [None]:
# Dataset Describe
data.describe()


### Variables Description

* Date : The date of the observation, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type : str. We need to convert it to datetime format.

* Rented Bike Count : Number of bikes rented per hour. This is our dependent variable, the one we need to predict, type : int

* Hour: The hour of the day, from 0 to 23 (24-hour format), type : int. We need to treat it as a categorical variable.

* Temperature(°C): Temperature in Celsius, type : Float

* Humidity(%): Humidity in the air in %, type : int

* Wind speed (m/s) : Speed of the wind in m/s, type : Float

* Visibility (10m): Visibility in units of 10 m, type : int

* Dew point temperature(°C): Temperature to which air must cool to become saturated with water vapour, in Celsius, type : Float

* Solar Radiation (MJ/m2): Sun contribution, type : Float

* Rainfall(mm): Amount of raining in mm, type : Float

* Snowfall (cm): Amount of snowing in cm, type : Float

* Seasons: Season of the year, type : str. There are only 4 seasons in the data.

* Holiday: If the day is holiday period or not, type: str

* Functioning Day: If the day is a Functioning Day or not, type : str

### Check Unique Values for each variable.

In [None]:

# Using for loop to check the unique values.
for dataset in data.columns:
  print(f"No. of unique values in {dataset} is {data[dataset].nunique()}.")

## 3. ***Data Wrangling***

#PROCESSING THE DATA

#BREAKING DATE COLUMN

In [None]:
data['Date'] = data['Date'].str.replace('-', '/')
data['Date'] = data['Date'].apply(lambda x: dt.datetime.strptime(x, "%d/%m/%Y"))


In [None]:
# strptime above already returns datetime objects; pd.to_datetime makes the datetime64 dtype explicit
# (astype(np.datetime64) without a unit fails on recent pandas versions)
data['Date'] = pd.to_datetime(data['Date'])

data['month'] = data['Date'].dt.month

data['day'] = data['Date'].dt.day_name()



In [None]:
# Dropping Date variable from dataset
data.drop(['Date'],axis = 1, inplace = True)


In [None]:
# Defining separate data as numerical and categorical data.

# Numerical data

numerical_data = list(set(data.describe().columns.tolist()) - {'Hour','month'})

# Categorical data

categorical_data = list(set(data.columns)-set(numerical_data))

In [None]:
#let's check the result of data type
data.info()

In [None]:
# Here are the final columns
data.columns

* Changed Date column datatype from object to Datetime data type.

* Created new columns 'day' and 'month' from the Date column and dropped Date.

* Split the columns into numerical and categorical lists.
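
The date handling summarised above can be sketched in isolation. This example uses a small made-up frame (`toy` is an assumption for illustration) and `pd.to_datetime` with an explicit day-first format, which is one way to get the same result as the `strptime` call in the notebook.

```python
import pandas as pd

# Three made-up dates in the dataset's DD/MM/YYYY format.
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "30/11/2018"]})

# Parse with an explicit format so that 01/12 is read as 1 December, not 12 January.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# Derive the two features used later, then drop the raw date.
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day_name()
toy = toy.drop(columns=["Date"])
print(toy)
```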

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#EDA OF THE DATASET

# Analysis of Dependent Variable:
### What is a dependent variable in data analysis?

*A dependent variable is a variable whose value changes depending on the value of another variable. We analyse ours first.*

#Analysis of categorical variables
*Our dependent variable is "Rented Bike Count", so we analyse this column against the other columns using visualisation plots. We first analyse the categorical variables and then proceed to the numerical ones.*

##Visualizing data distribution of our dependent variable (Rented Bike Count)

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(15,5))
plt.title('Rented Bike Count')
sns.distplot(data['Rented Bike Count'] )
plt.show()

##1. Why did you pick the specific chart?
Distplot is one of the best charts to show the data distribution.

##2. What is/are the insight(s) found from the chart?
The data is positively skewed and may need to be transformed.

##3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

So far we know that the data is positively skewed: most hours have low rental counts, and only a few hours have very high demand.

#Visualizing data distribution of categorical data with respect to Rented Bike Count.

In [None]:
# Chart 2

# Creating a for loop for visualizing all the categorical data with respect to Rented Bike Count.

for i in categorical_data:
    plt.figure(figsize=(15,6))
    plt.title(i)
    sns.barplot(x = data[i], y = data['Rented Bike Count'])
    plt.show()


##1. Why did you pick the specific chart?
* To visualise how each categorical variable relates to Rented Bike Count.

##2. What is/are the insight(s) found from the chart?
* The peak hours for rented bikes are 5:00 PM - 7:00 PM, and the fewest bikes are rented between 3:00 AM and 5:00 AM.

* June is the peak month and January the lowest month for the number of rented bikes.

* Highest no. of bikes are booked on Thursday and the least on Sunday.

* People prefer renting bikes most in the Summer season and the least in winter season.

* People rented more bikes on a non-holiday compared to a holiday.
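
The bar charts above show the mean of Rented Bike Count per category, which is seaborn's default estimator for `barplot`. The same numbers can be read off directly with a group-by. This is a minimal sketch on made-up values (`toy` is not the Seoul dataset), shown only to illustrate the pattern.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Hour": [8, 8, 18, 18, 4, 4],
    "Rented Bike Count": [900, 1100, 1500, 1700, 100, 140],
})

# Mean rentals per hour: the quantity a bar chart of Hour vs. count displays.
mean_by_hour = toy.groupby("Hour")["Rented Bike Count"].mean()

peak_hour = mean_by_hour.idxmax()      # hour with the highest mean
quietest_hour = mean_by_hour.idxmin()  # hour with the lowest mean
print(mean_by_hour)
print("peak:", peak_hour, "quietest:", quietest_hour)
```

Replacing `"Hour"` with `"month"`, `"day"`, `"Seasons"` or `"Holiday"` gives the corresponding summary for the other categorical columns.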

##3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* These insights can help create a positive business impact: knowing how time, month, day, season and holiday affect the number of rented bikes lets us plan strategies accordingly.

#Bike Rent Count trend with respect to Hours on Months

In [None]:

# Chart - 3 visualization code

# Using pointplot for multivariate analysis.

plt.figure(figsize = (18,6))
sns.pointplot(x = data['Hour'], y = data['Rented Bike Count'], hue = data['month'])
plt.show()


##1. Why did you pick the specific chart?
* To do a multivariate analysis among Hour, Rented Bike Count and Months

## 2. What is/are the insight(s) found from the chart?
* June is the peak and January is the bottom months for number of rented bikes.


## 3. Are there any insights that lead to negative growth? Justify with specific reason.

* These insights can help create a positive business impact: knowing how the month affects the number of rented bikes lets us plan strategies accordingly.

# Bike Rent Count trend with respect to Hours on Seasons.

In [None]:

# Chart - 4 visualization code

# Using pointplot for multivariate analysis.

plt.figure(figsize = (18,8))
sns.pointplot(x = data['Hour'], y = data['Rented Bike Count'], hue = data['Seasons'])
plt.show()



##1. Why did you pick the specific chart?
To do a multivariate analysis among Hour, Rented Bike Count and Seasons.

##2. What is/are the insight(s) found from the chart?
People prefer renting bikes most in the Summer season and the least in winter season.

##3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help create a positive business impact: knowing how the season affects the number of rented bikes lets us plan strategies accordingly.

# Bike Rent Count trend with respect to Hours on Days.

In [None]:
# Chart - 5 visualization code

# Using pointplot for multivariate analysis.

plt.figure(figsize = (18,8))
sns.pointplot(x = data['Hour'], y = data['Rented Bike Count'], hue = data['day'])
plt.show()

##### 1. Why did you pick the specific chart?

To do a multivariate analysis among Hour, Rented Bike Count and Day.

##### 2. What is/are the insight(s) found from the chart?

Highest no. of bikes are booked on Thursday and the least on Sunday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help create a positive business impact: knowing how the day of the week affects the number of rented bikes lets us plan strategies accordingly.

#### Chart - 6

#Bike Rent Count trend with respect to Hours on Holidays.

In [None]:
# Chart - 6 visualization code

# Using pointplot for multivariate analysis.

plt.figure(figsize = (18,8))
sns.pointplot(x = data['Hour'], y = data['Rented Bike Count'], hue = data['Holiday'])

plt.show()

##### 1. Why did you pick the specific chart?

To do a multivariate analysis among Hour, Rented Bike Count and Holiday.

##### 2. What is/are the insight(s) found from the chart?

People rented more bikes on a non-holiday compared to a holiday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help create a positive business impact: knowing how holidays affect the number of rented bikes lets us plan strategies accordingly.

#### Chart - 7

#Visualizing outliers using box plot of numeric columns.

In [None]:
# Chart - 7 visualization code

# Looping over the numerical variables and creating a box plot for each.

for i in numerical_data:
  plt.figure(figsize = (18,6))
  sns.boxplot(x = data[i])
  plt.title(i)
  plt.show()

##### 1. Why did you pick the specific chart?

Using Boxplots as we need to identify outliers.

##### 2. What is/are the insight(s) found from the chart?

Rainfall, Solar Radiation, Snowfall and Wind speed have a high number of outliers.
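
A box plot flags a point as an outlier when it falls more than 1.5 × IQR beyond the quartiles (seaborn's default whisker length). The sketch below applies that rule to made-up values; `iqr_outlier_count` is a hypothetical helper, not a function from the notebook. It also suggests why zero-heavy variables such as rainfall or snowfall can show many outliers: when most readings are 0, the IQR collapses to 0 and every non-zero reading lies outside the whiskers.

```python
import pandas as pd

def iqr_outlier_count(values):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the box plot whisker rule."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# Made-up rainfall-like sample: mostly dry hours plus two wet ones.
rain = [0.0] * 18 + [0.5, 35.0]
print(iqr_outlier_count(rain))

# A roughly symmetric sample for comparison.
print(iqr_outlier_count([1, 2, 3, 4, 5]))
```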

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In some cases, outliers may represent opportunities for business growth. For example, if there are a small number of customers who are making a significantly higher number of bike rentals than the average customer, this may represent a market segment that can be targeted for special promotions or marketing campaigns.

#### Chart - 8

#Let's check the linear relationship between the dependent variable "Rented Bike Count" and the remaining columns (independent variables).

In [None]:
# Chart - 8 visualization code using regplot.
for i in numerical_data:
  if i not in ['Rented Bike Count']:
    fig = plt.figure(figsize = (18,7))
    fig = plt.gca()

    sns.regplot(
        data=data, x=i, y="Rented Bike Count",
        truncate=False, order=2, color=".2",scatter_kws={'color':'green'}
    )
    plt.show()


##### 1. Why did you pick the specific chart?

I used regplot as it allows us to quickly visualize the relationship between two variables and determine whether there is a linear or non-linear relationship between them.

##### 2. What is/are the insight(s) found from the chart?

##Hour:
* There is a sudden peak between 6-7 AM and 10 AM. Commuting to offices and colleges could be the reason for this peak.

* There is another peak between 10 AM and 7 PM, which may be when the same people travel home.

* From 7 AM to 7 PM the Bike Rent Count is high, and from 7 PM to 7 AM it declines.


##Rainfall And snowfall:
* It is obvious that people usually do not like to ride bikes in rain or snow.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help create a positive business impact: knowing how these variables affect the number of rented bikes lets us plan strategies accordingly.

#### Chart - 9

#Visualizing the data distribution on the dependent variable before normalization.

In [None]:
# Chart - 9 visualization code
# Visualizing the data distribution on the dependent variable using distplot and boxplot charts.

f, axes = plt.subplots(1, 2,figsize=(18,7))
sns.distplot(x=(data['Rented Bike Count']),color='r',ax=axes[0])
sns.boxplot(x=(data['Rented Bike Count']),color='r',ax=axes[1])
# suptitle puts the title over both subplots; plt.title would only label the last one
f.suptitle("Outliers on 'Rented Bike Count' variable before normalization.")
plt.show()

#### Chart - 10

#Visualizing the data distribution on the dependent variable after normalization.

In [None]:
# Chart - 10 visualization code
# Normalizing our target variable with the square root method

f, axes = plt.subplots(1, 2,figsize=(18,8))
sns.distplot(x=np.sqrt(data['Rented Bike Count']),color='g',ax=axes[0])
sns.boxplot(x=np.sqrt(data['Rented Bike Count']),color='g',ax=axes[1])
plt.show()


##### 1. Why did you pick the specific chart?

Used distplot to visualize the distribution of data and using boxplot to detect outliers.

##### 2. What is/are the insight(s) found from the chart?

* As we can see from both the charts the distribution of the data is less skewed and is moving towards normally distributed data

* The outliers are also gone after normalization.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Improved accuracy, comparability and better visualization are the result of normalization which can improve model performance.
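
The effect of the square root transform on skewness can also be quantified rather than judged by eye. The sketch below uses synthetic right-skewed data (a gamma sample chosen as an assumption, not the real counts) and compares `Series.skew()` before and after the transform. On the notebook's data the same two calls would be applied to `data['Rented Bike Count']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values standing in for hourly counts.
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=1.5, scale=400.0, size=5000))

skew_raw = counts.skew()            # skewness of the raw values
skew_sqrt = np.sqrt(counts).skew()  # skewness after the square root transform

print(f"skew before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")
```

A skewness closer to 0 after the transform indicates a more symmetric distribution.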

#### Chart - 11

#Visualizing data distribution of numerical data.

In [None]:
# Chart - 11 visualization code
# Creating a for loop for visualizing all the numerical variables using distplot.

for i in numerical_data:
  if i not in ['Rented Bike Count']:
    plt.figure(figsize=(15,6))
    plt.title(i)
    sns.distplot(data[i] )
    plt.show()

##### 1. Why did you pick the specific chart?

Distplot is one of the best charts to show the data distribution, and most people understand it easily.

##### 2. What is/are the insight(s) found from the chart?

* Solar radiation, Snowfall, Rainfall and visibility are highly skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Weather appears to influence demand strongly. The data for Solar radiation, Snowfall, Rainfall and Visibility is highly skewed, which suggests that people choose to ride a bike under specific weather conditions.

#### Chart - 12 - Correlation Heatmap

#Checking correlation between the dependent and independent variables using a Correlation Heatmap visualization.

In [None]:
# Correlation Heatmap visualization code
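# numeric_only=True skips the string columns (Seasons, Holiday, Functioning Day, day),
# which would otherwise raise an error on pandas 2.x
corr = data.corr(numeric_only=True)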

plt.figure(figsize = (18,8))
sns.heatmap(corr, annot = True , cmap = 'coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

I used correlation heatmap to visualize the correlation among variables.

##### 2. What is/are the insight(s) found from the chart?

Temperature and Dew point temperature are highly correlated. Linear regression assumes that the independent variables are not collinear, so we can drop one of the two. Because the correlation between Temperature and our dependent variable "Rented Bike Count" is high, we will keep the Temperature column and drop the "Dew point temperature" column.
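
The decision rule described above (for a highly correlated pair, keep the predictor that is more correlated with the target) can be sketched as follows. The data is synthetic and the column names `temp`, `dew_point` and `target` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: dew_point is a noisy copy of temp, and target depends on temp.
rng = np.random.default_rng(1)
temp = rng.normal(15, 10, 500)
toy = pd.DataFrame({
    "temp": temp,
    "dew_point": temp - 5 + rng.normal(0, 4, 500),
    "target": 30 * temp + rng.normal(0, 50, 500),
})

corr = toy.corr()

# Correlation between the two candidate predictors.
pair_corr = corr.loc["temp", "dew_point"]

# Of the two, drop the one whose absolute correlation with the target is lower.
to_drop = corr.loc[["temp", "dew_point"], "target"].abs().idxmin()
print(f"pair correlation: {pair_corr:.2f}, drop: {to_drop}")
```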

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data = data)

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* null_hypothesis = 'There is no relationship between temperature and bike demand in Seoul.'

* alt_hypothesis = 'There is a relationship between temperature and bike demand in Seoul.'

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.api as sm


# Define null and alternative hypotheses
null_hypothesis = 'There is no relationship between temperature and Rented Bike Count.'
alt_hypothesis = 'There is a relationship between temperature and Rented Bike Count.'

# Perform linear regression
X = sm.add_constant(data['Temperature(°C)'])
y = data['Rented Bike Count']
model = sm.OLS(y, X).fit()

# Print summary statistics
print(model.summary())

# Extract p-value for the temperature coefficient (by label; positional [1] on a labelled Series is deprecated)
p_value = model.pvalues['Temperature(°C)']
print('p-value:', p_value)

##### Which statistical test have you done to obtain P-Value?

* We use the OLS (ordinary least squares) function from the statsmodels package to perform a linear regression of bike demand on temperature.

* The p-value associated with the temperature coefficient is shown under the column "P>|t|", and is equal to 0.000 in this example. Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis and conclude that there is evidence of a significant relationship between temperature and bike demand in Seoul.

##### Why did you choose the specific statistical test?

* I chose linear regression as the statistical test to perform hypothesis testing for Seoul bike sharing demand prediction because it is a commonly used method for analyzing the relationship between a continuous predictor variable and a continuous response variable. In this case, we are interested in determining whether there is a significant relationship between a predictor variable, such as temperature or time of day, and bike demand.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* null_hypothesis = 'There is no relationship between month and bike demand in Seoul.'

* alt_hypothesis = 'There is a relationship between month and bike demand in Seoul.'

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
null_hypothesis = 'There is no relationship between month and Rented Bike Count.'
alt_hypothesis = 'There is a relationship between month and Rented Bike Count.'

# Perform linear regression
X = sm.add_constant(data['month'])
y = data['Rented Bike Count']
model = sm.OLS(y, X).fit()

# Print summary statistics
print(model.summary())

# Extract p-value for the month coefficient (by label; positional [1] on a labelled Series is deprecated)
p_value = model.pvalues['month']
print('p-value:', p_value)

##### Which statistical test have you done to obtain P-Value?

We use the OLS (ordinary least squares) function from the statsmodels package to perform a linear regression of bike demand on month.

The p-value associated with the month coefficient is shown under the column "P>|t|", and is equal to 3.144647620349008e-11 in this example. Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis and conclude that there is evidence of a significant relationship between month and bike demand in Seoul.

##### Why did you choose the specific statistical test?

I chose linear regression as the statistical test to perform hypothesis testing for Seoul bike sharing demand prediction because it is a commonly used method for analyzing the relationship between a continuous predictor variable and a continuous response variable. In this case, we are interested in determining whether there is a significant relationship between a predictor variable, such as temperature or time of day, and bike demand.
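
Month is a label rather than a truly continuous quantity, so a regression slope on the month number only tests for a linear trend across the year. One common alternative, offered here as a hedged sketch and not as the method used in this notebook, is a one-way ANOVA that compares mean demand across month groups. The example runs `scipy.stats.f_oneway` on synthetic groups; the group means are made up.

```python
import numpy as np
from scipy import stats

# Synthetic hourly counts for three months with different mean demand (invented values).
rng = np.random.default_rng(2)
groups = {
    1: rng.normal(200, 50, 100),    # January-like
    6: rng.normal(1200, 50, 100),   # June-like
    10: rng.normal(900, 50, 100),   # October-like
}

# H0: all month groups share the same mean; H1: at least one mean differs.
f_stat, p_value = stats.f_oneway(*groups.values())
print("F:", f_stat, "p-value:", p_value)
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

On the real data, the groups would be built with something like `[g.values for _, g in data.groupby('month')['Rented Bike Count']]`.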

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* null_hypothesis = 'There is no relationship between Hour and Bike demand in Seoul.'
* alt_hypothesis = 'There is a relationship between Hour and Bike demand in Seoul.'

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
null_hypothesis = 'There is no relationship between Hour and bike Rented Bike Count.'
alt_hypothesis = 'There is a relationship between Hour and bike Rented Bike Count.'

# Perform linear regression
X = sm.add_constant(data['Hour'])
y = data['Rented Bike Count']
model = sm.OLS(y, X).fit()

# Print summary statistics
print(model.summary())

# Extract p-value for the Hour coefficient (by label; positional [1] on a labelled Series is deprecated)
p_value = model.pvalues['Hour']
print('p-value:', p_value)

##### Which statistical test have you done to obtain P-Value?

We use the OLS (ordinary least squares) function from the statsmodels package to perform a linear regression of bike demand on Hour.

The p-value associated with the Hour coefficient is shown under the column "P>|t|", and is equal to 0.0 in this example. Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis and conclude that there is evidence of a significant relationship between Hour and Rented Bike Count.

##### Why did you choose the specific statistical test?

I chose linear regression as the statistical test to perform hypothesis testing for Seoul bike sharing demand prediction because it is a commonly used method for analyzing the relationship between a continuous predictor variable and a continuous response variable. In this case, we are interested in determining whether there is a significant relationship between a predictor variable, such as temperature or time of day, and bike demand.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Checking if there is any null value in the dataset.

data.isnull().sum()


#### What all missing value imputation techniques have you used and why did you use those techniques?

* No missing/null values in the dataset.

### 2. Categorical Encoding

In [None]:
# Encoding the categorical columns:

# First encoding Holiday as '1' and '0' replacing 'Holiday' and 'No-Holiday':

data['Holiday'] = data['Holiday'].apply(lambda x: 1 if x == 'Holiday' else 0)


# Encoding 'Functioning Day' as '1' and '0' replacing 'Yes' and 'No':

data['Functioning Day'] = data['Functioning Day'].apply(lambda x: 1 if x == 'Yes' else 0)


# Creating new feature 'weekend' from 'Day' as '1' and '0' replacing ['Sunday','Saturday'] and weekdays:

data['weekend'] = data['day'].apply(lambda x: 1 if x in ['Sunday','Saturday'] else 0 )


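# Using one hot encoding on 'Seasons' and 'Hour' features to create dummy variables.
# dtype=int keeps the dummies numeric, so data.describe() below still lists them on pandas 2.x (where the default is bool):

data = pd.get_dummies(data, columns = ['Seasons','Hour'], dtype = int)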


In [None]:

features = list(set(data.describe().columns) - set(['Dew point temperature(°C)','Humidity(%)','Visibility (10m)']))
features

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used manual binary encoding and one-hot encoding. Both are data preprocessing techniques that represent categorical variables as numerical features that can be used as inputs to machine learning models.
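
The two encodings can be seen side by side on a tiny example. The frame below is made up for illustration; it shows a binary 0/1 mapping for a two-valued column and `pd.get_dummies` for multi-valued columns.

```python
import pandas as pd

# Made-up rows with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Winter"],
    "Hour": [0, 1, 0],
    "Holiday": ["No Holiday", "Holiday", "No Holiday"],
})

# Manual binary encoding: 1 for 'Holiday', 0 otherwise.
toy["Holiday"] = (toy["Holiday"] == "Holiday").astype(int)

# One-hot encoding: one 0/1 column per category of 'Seasons' and 'Hour'.
encoded = pd.get_dummies(toy, columns=["Seasons", "Hour"], dtype=int)
print(encoded)
```

Each original category becomes its own column named `<column>_<value>`, for example `Seasons_Winter` and `Hour_0`.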

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Defining a function for getting the variance inflation factor.

def Calculate_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)




In [None]:
# Calculating VIF for variables.

Calculate_vif(data[[i for i in data.describe().columns if i not in ['Rented Bike Count']]])

# Not including 'Rented Bike Count' as it is the dependent variable.

In [None]:
Calculate_vif(data[[i for i in data.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)']]])


In [None]:
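# The column is named 'Rented Bike Count' (with spaces); the underscore spelling would not exclude it
Calculate_vif(data[[i for i in data.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)','Humidity(%)']]])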


In [None]:
#Not including 'Visibility (10m)' as it has high VIF.

Calculate_vif(data[[i for i in data.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)','Humidity(%)','Visibility (10m)']]])


## Now we have VIF values in the range of 1 to 5. We will drop 'Humidity(%)', 'Dew point temperature(°C)' and 'Visibility (10m)' because these columns showed collinearity in the VIF test.

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Selecting the features which are not collinear among themselves and whose VIF is less than 5.
# The target 'Rented Bike Count' is excluded here as well.
features = list(set(data.describe().columns) - set(['Rented Bike Count','Dew point temperature(°C)','Humidity(%)','Visibility (10m)']))
features

##### What all feature selection methods have you used  and why?

Used VIF (Variance Inflation Factor). It is a measure of multicollinearity in regression analysis that quantifies how much the variance of an estimated regression coefficient is increased by linear dependence among the predictor variables.

Based on the VIF values, we eliminated features manually.
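
For reference, the VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. The following is a minimal NumPy sketch of that definition on synthetic data. The `vif` helper is hypothetical, and it adds an intercept to each auxiliary regression, which is an assumption of this sketch; the notebook's `Calculate_vif` passes `X.values` to statsmodels as-is, so its numbers can differ.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()  # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(3)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The independent column gets a VIF near 1, while the two nearly identical columns get large values, which is the pattern that motivates dropping one of them.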

##### Which all features you found important and why?

I have selected these features as they show the least multicollinearity ('Rented Bike Count' is the target, so it is not part of the feature set) -

['Snowfall (cm)', 'Wind speed (m/s)', 'Solar Radiation (MJ/m2)', 'Temperature(°C)', 'Hour', 'Rainfall(mm)', 'month']

### 4. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# Splitting the data to train and test.

X = data[list(set(features)- {'Rented Bike Count'})]

# Using Square root on dependent variable to normalize the variable.

Y = np.sqrt(data['Rented Bike Count'])

# Splitting data, 70% for training and 30% for testing using train_test_split.

x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.30, random_state= 0)

##### What data splitting ratio have you used and why?

I used a 70-30 ratio to split the train and test data because it is convenient.

### 5. Data Scaling

In [None]:
# Scaling your data
# Using StandardScaler() standardization technique to scale data:

scaler  = StandardScaler()

# Fitting the scaler on x_train only, then applying the same transformation to x_test:

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

##### Which method have you used to scale you data and why?

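* Used standardization, which rescales each feature to zero mean and unit variance so that features measured in different units contribute on a comparable scale. Note that the mean and standard deviation are themselves sensitive to skew and outliers.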
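
Standardization computes z = (x - mean) / std, with the mean and standard deviation learned from the training data only and then reused for the test data. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature column: three training values and one test value.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

Fitting on the training split alone keeps information about the test split from leaking into the model.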

## ***7. ML Model Implementation***

In [None]:
# Defining function to calculate different model accuracy scores.

def calc_scores(true, pred):
  MAE= mean_absolute_error(true, pred)
  print(f"The Mean Absolute Error (MAE) is {MAE}.")

  #Calculate  Mean Squared Error
  MSE=mean_squared_error(true, pred)
  print(f"The Mean Squared Error(MSE) is {MSE}.")

  #Calculate Root Mean Squared Error
  RMSE=np.sqrt(MSE)
  print(f"The Root Mean Squared Error(RMSE) is {RMSE}.")

  #Calculate R2 Score
  R2=r2_score(true, pred)
  print(f"The R2 Score is {R2}.")


In [None]:
# Defining a function which appends all the model scores to a dataframe, one row per model:

# Making an empty dataframe:

score_df = pd.DataFrame({'Model':[],'Mean Absolute Error (MAE)':[], 'Mean Squared Error(MSE)':[], 'The Root Mean Squared Error(RMSE)':[], 'R2 Score':[]})

# Code for the store_scores function:

def store_scores(Model, MAE, MSE, RMSE, R2):
  scores = {'Model':Model,'Mean Absolute Error (MAE)':MAE, 'Mean Squared Error(MSE)':MSE, 'The Root Mean Squared Error(RMSE)':RMSE, 'R2 Score':R2}
  global score_df
  # DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
  score_df = pd.concat([score_df, pd.DataFrame([scores])], ignore_index=True)
  return score_df


In [None]:
# Defining a function which visualizes the linearity between real and predicted data:

def reg_scatter(true, pred):
  plt.figure(figsize=(10,8))

  # Scatter of actual vs. predicted values with a fitted regression line
  sns.regplot(x= true, y = pred, scatter_kws={'color':'magenta'},line_kws={'color':'black'})

  plt.xlabel("Actual")
  plt.ylabel("Predicted")
  plt.show()


In [None]:
# Defining a function for visualizing a feature importance bar graph:

def feature_imp(model):
  # Accept either a fitted tree-based estimator or a fitted search object wrapping one
  estimator = getattr(model, 'best_estimator_', model)

  feature_importances = estimator.feature_importances_

  # Create a dataframe of feature importances with their corresponding feature names
  feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

  # Sort the features by importance in descending order
  feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

  # Plot the feature importances in a horizontal bar plot
  sns.set_style("whitegrid")
  plt.figure(figsize=(15,8))
  sns.barplot(x='Importance', y='Feature', data=feature_importance_df, color='blue', order=feature_importance_df['Feature'])
  # Use the estimator's class name so the title matches the model passed in
  plt.title(f"Feature Importances for {type(estimator).__name__}")
  plt.xlabel("Importance")
  plt.ylabel("Feature")
  plt.show()


### ML Model - 1

## Linear Regression

---





In [None]:
# ML Model - 1 Implementation
lin_reg = LinearRegression()

# Fit the Algorithm
reg = lin_reg.fit(x_train, y_train)

# Predict on the model
y_train_pred = reg.predict(x_train)
y_test_pred = reg.predict(x_test)

# Finding the coefficients and intercept from the model:

print(f'The coefficients of the model are {reg.coef_}')
print(f'The intercept of the model is {reg.intercept_}')

In [None]:

# Checking model train and test score:


print(f'Linear Regression model train score is :{reg.score(x_train, y_train)}')

print(f'Linear Regression model test score is :{reg.score(x_test, y_test)}')

In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_train_pred)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_test_pred)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_test_pred)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

store_scores('Linear regression',5.025080868159481, 43.41177115815907, 6.588760972911301, 0.7173677593343155)

The model used is Linear Regression. Its performance was evaluated using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the R-squared value, calculated on both the training and testing sets.

On the testing set, the linear regression model achieved an MSE of 43.41 and an MAE of 5.03. The R-squared value was 0.7174, indicating that the model explained about 71.7% of the variance in the data.
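For reference, the R-squared value used here follows the standard definition, with $y_i$ the actual values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the actual values (symbols introduced here for illustration):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

An R-squared of about 0.7174 therefore means the residual sum of squares is about 28.3% of the total sum of squares.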

### ML Model - 2

## Lasso Regression

---



In [None]:
# ML Model - 2 Implementation

lasso = Lasso()


# Fitting the Algorithm
lasso.fit(x_train, y_train)


# Predicting on the model
y_pred_trlasso = lasso.predict(x_train)
y_pred_telasso = lasso.predict(x_test)

# Finding the coefficients and intercept from the model:

print(f'The coefficients of the model are {np.array(lasso.coef_)}')
print(f'The intercept of the model is {lasso.intercept_}')


In [None]:
# Checking model train and test score:


print(f'Lasso Regression model train score is :{lasso.score(x_train, y_train)}\n')

print(f'Lasso Regression model test score is :{lasso.score(x_test, y_test)}')

In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_pred_trlasso)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_pred_telasso)


In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_pred_telasso)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
store_scores('Lasso regression',6.084010902487589, 61.1274611692987, 7.818405283003606, 0.6020297984723794)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Implementation with hyperparameter optimization techniques using GridSearch CV:
# Create the model
model = Lasso()

# Define the hyperparameters to tune
params = {'alpha': [1e-15, 1e-13, 1e-10, 1e-08, 1e-05, 0.0001,
                                   0.001, 0.01, 0.1, 1, 5, 10, 20, 30, 40, 45,
                                   50, 55, 60, 100],}

# Create the grid search object
grid = GridSearchCV(model, params, cv=5)

# Fit the model
grid.fit(x_train, y_train)

# Print the best parameters
print("Best hyperparameters: {}".format(grid.best_params_))
print("Best mean cross-validation score: {:.2f}".format(grid.best_score_))

In [None]:
# Predict on the model using tuned hyperparameter

lasso = Lasso(alpha =  0.01)

# Fitting the Algorithm
lasso.fit(x_train, y_train)


# Predicting on the model
y_pred_trlasso = lasso.predict(x_train)
y_pred_telasso = lasso.predict(x_test)

# Finding the coefficients and intercept from the model:

print(f'The coefficients of the model are {np.array(lasso.coef_)}')
print(f'The intercept of the model is {lasso.intercept_}')


In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_pred_trlasso)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_pred_telasso)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_pred_telasso)


##### Which hyperparameter optimization technique have you used and why?

* Here I used the grid search cross-validation (`GridSearchCV`) hyperparameter optimization technique. Hyperparameters are parameters that are not learned by the model during training but are set by the practitioner beforehand. Grid search CV performs an exhaustive search over a specified parameter grid, trying every combination of hyperparameters and keeping the one with the highest mean cross-validated score (R2 by default for scikit-learn regressors, since no `scoring` argument is passed here). Using grid search CV avoids manually trying different combinations, which is time-consuming and error-prone.
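The workflow described above can be sketched as follows. This is a minimal example on synthetic data from `make_regression`, with an illustrative alpha grid. It also shows that the fitted search exposes `best_estimator_`, so the winning alpha does not have to be retyped by hand.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (not the bike-sharing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Every alpha in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_lasso = search.best_estimator_  # already refit on all of X_demo with best_alpha
print(best_alpha, round(search.best_score_, 3))
```

Reading the tuned model from `best_estimator_` keeps the refit step consistent with whatever the search actually selected.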

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

In [None]:
# Yes, the Lasso regression model improved significantly after tuning (R2 rose from about 0.60 to about 0.73), as the table below shows:

# Visualizing evaluation Metric Score chart

store_scores('Lasso regression(Grid Search CV)',4.867785094548076, 41.05151829341606, 6.407145877332282, 0.7327341804201906)


### ML Model - 3

## Ridge Regression

---



In [None]:
# ML Model - 3 Implementation
ridge = Ridge()

# Fit the Algorithm

ridge.fit(x_train,y_train)

# Predict on the model

y_pred_trridge = ridge.predict(x_train)
y_pred_teridge = ridge.predict(x_test)

# Finding the coefficients and intercept from the model:

print(f'The coefficients of the model are {np.array(ridge.coef_)}')
print(f'The intercept of the model is {ridge.intercept_}')


In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_pred_trridge)


In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_pred_teridge)


In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_pred_teridge)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
store_scores('Ridge regression',4.873230763166252, 41.07227225075999, 6.408765267253902, 0.7325990618265394)


The model used is Ridge Regression. Its performance was evaluated using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the R-squared value, calculated on both the training and testing sets. On the testing set, the ridge regression model achieved an MSE of 41.07 and an MAE of 4.87. The R-squared value was 0.7326, indicating that the model explained about 73.3% of the variance in the data.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Implementation with hyperparameter optimization techniques using GridSearch CV:

# Create the model

model = Ridge()

# Define the hyperparameters to tune

params = {'alpha': [1e-15, 1e-13, 1e-10, 1e-08, 1e-05, 0.0001,
                                   0.001, 0.01, 0.1, 1, 5, 10, 20, 30, 40, 45,
                                   50, 55, 60, 100],}

# Create the grid search object

grid = GridSearchCV(model, params, cv=5)

# Fit the model

grid.fit(x_train, y_train)

# Print the best parameters
print("Best hyperparameters: {}".format(grid.best_params_))
print("Best mean cross-validation score: {:.2f}".format(grid.best_score_))

In [None]:
# Predict on the model using tuned hyperparameter.

# Creating the instance with the tuned alpha

ridge = Ridge(alpha = 30)

# Fit the Algorithm

ridge.fit(x_train,y_train)

# Predict on the model
y_pred_trridge = ridge.predict(x_train)
y_pred_teridge = ridge.predict(x_test)

In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_pred_trridge)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_pred_teridge)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_pred_teridge)

##### Which hyperparameter optimization technique have you used and why?

* Grid search CV is used to tune the hyperparameters of a machine learning model. Hyperparameters are parameters that are not learned by the model during training but are set by the practitioner beforehand. Grid search CV performs an exhaustive search over a specified parameter grid, trying every combination of hyperparameters and keeping the one with the highest mean cross-validated score (R2 by default for scikit-learn regressors, since no `scoring` argument is passed here). Using grid search CV avoids manually trying different combinations, which is time-consuming and error-prone.
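To make the "exhaustive search" concrete, the sketch below scores each candidate alpha by hand with `cross_val_score` and compares the winner with what `GridSearchCV` picks. The data is synthetic and the alpha list is illustrative, so the selected value says nothing about the bike-sharing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
alphas = [0.01, 0.1, 1, 10, 100]
folds = KFold(n_splits=5)  # same folds for both approaches

# Manual search: mean cross-validated R2 for each candidate alpha.
manual_scores = {
    a: cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=folds).mean() for a in alphas
}
manual_best = max(manual_scores, key=manual_scores.get)

# GridSearchCV performs the same loop and bookkeeping internally.
grid_demo = GridSearchCV(Ridge(), {"alpha": alphas}, cv=folds)
grid_demo.fit(X_demo, y_demo)

print(manual_best, grid_demo.best_params_["alpha"])
```

Both approaches should agree because they evaluate the same candidates on the same folds with the same default score.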

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

In [None]:
# I did not find any significant difference after hyperparameter tuning:

# Visualizing evaluation Metric Score chart:

store_scores('Ridge regression(Grid Search CV)',4.873230763166252, 41.07227225075999, 6.408765267253902, 0.7325990618265394)



### ML Model - 4

## Random Forest Regression

---

In [None]:
# ML Model - 4 Implementation
rf = RandomForestRegressor()


# Fitting the Algorithm

rf.fit(x_train, y_train)


# Predicting on the model

y_train_predrf = rf.predict(x_train)
y_test_predrf = rf.predict(x_test)


In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_train_predrf)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_test_predrf)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_test_predrf)

### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* The model used is Random Forest regression. Its performance was evaluated using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the R-squared value, calculated on both the training and testing sets.

* On the testing set, the Random Forest regression model achieved an MSE of 15.69 and an MAE of 2.63. The R-squared value was 0.8979, indicating that the model explained about 89.8% of the variance in the data.

In [None]:
# Visualizing evaluation Metric Score chart

store_scores('Random Forest regression',2.6309909474063495, 15.688780794063215, 3.9609065621475112, 0.8978582271388054)


### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Refitting the model with the tuned hyperparameters:

rf = RandomForestRegressor(max_depth= 30, min_samples_split = 5, n_estimators= 200)


# Fitting the Algorithm

rf.fit(x_train, y_train)


# Predicting on the model

y_train_predrf = rf.predict(x_train)
y_test_predrf = rf.predict(x_test)

In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_train_predrf)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_test_predrf)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_test_predrf)

In [None]:

# Visualizing evaluation Metric Score chart

store_scores('Random Forest regression(Grid Search CV)',2.6128959118593276, 15.488846781090674, 3.9355872218883263, 0.8991598970906219)
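The search that produced `max_depth=30`, `min_samples_split=5` and `n_estimators=200` is not shown in this section. As a sketch only, assuming a grid search similar to the one used for Lasso and Ridge, it might look like the following. The grid values are illustrative and deliberately small so the example runs quickly on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (not the bike-sharing data).
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

# Hypothetical grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "max_depth": [5, 30],
    "min_samples_split": [2, 5],
    "n_estimators": [20, 50],
}

rf_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
rf_search.fit(X_demo, y_demo)

print(rf_search.best_params_)
```

Because the number of fits grows multiplicatively with each added parameter, tree-ensemble grids are usually kept much smaller than the alpha grids used for the linear models.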


### ML Model - 5

## Gradient Boosting Regression

---



In [None]:
# ML Model - 5 Implementation
gbr = GradientBoostingRegressor()


# Fitting the Algorithm

gbr.fit(x_train, y_train)


# Predicting on the model

y_train_predgbr = gbr.predict(x_train)
y_test_predgbr = gbr.predict(x_test)


In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_train_predgbr)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_test_predgbr)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_test_predgbr)

### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* The model used is Gradient Boosting regression. Its performance was evaluated using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the R-squared value, calculated on both the training and testing sets.

* On the testing set, the Gradient Boosting regression model achieved an MSE of 25.34 and an MAE of 3.71. The R-squared value was 0.8350, indicating that the model explained about 83.5% of the variance in the data.

In [None]:
# Visualizing evaluation Metric Score chart

store_scores('Gradient Boosting regression',3.7124676409203308, 25.337965384259732, 5.033683083415138, 0.8350372320822267)


### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Again fitting the model after hyperparameter tuning:

gbr = GradientBoostingRegressor(max_depth= 10, min_samples_split = 10, n_estimators= 100)


# Fitting the Algorithm

gbr.fit(x_train, y_train)


# Predicting on the model

y_train_predgbr = gbr.predict(x_train)
y_test_predgbr = gbr.predict(x_test)


In [None]:
# Calculating model performance scores for train data.

calc_scores(y_train, y_train_predgbr)

In [None]:
# Calculating model performance scores for test data.

calc_scores(y_test, y_test_predgbr)

In [None]:
#visualizing linearity between real and predicted data:

reg_scatter(y_test, y_test_predgbr)

In [None]:
# Visualizing evaluation Metric Score chart

store_scores('Gradient Boosting Regression(Grid Search CV)',2.5703976949659966, 15.125168528069088, 3.889108963254834, 0.9015276235572154)


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For the Seoul bike sharing prediction project, the evaluation metrics that would have a positive business impact are:

* Root Mean Squared Error (RMSE): the square root of the average squared difference between the predicted and actual values. Because errors are squared before averaging, large misses are penalized more heavily. A lower RMSE indicates better performance, which is desirable from a business perspective because large forecasting errors are the most costly.

* Mean Absolute Error (MAE): the average absolute difference between the predicted and actual values, expressed in the same units as the target. A lower MAE indicates better performance, which is desirable from a business perspective because it means the typical prediction error is small.

* R-squared (R2) score: the proportion of the variance in the target variable that the model explains. A higher R2 score indicates a better fit of the model to the data.

Overall, these evaluation metrics matter for business impact because they indicate how well the model predicts bike rental demand in Seoul. A model that performs well on them is likely to result in more accurate forecasting, improved resource allocation, and better decision making for bike-sharing companies and city planners.
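The three metrics can be computed directly from their definitions. The sketch below does so with NumPy on four made-up values and can be checked against scikit-learn's implementations; `y_true` and `y_hat` are illustrative, not project results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))          # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # square root of the mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values should match scikit-learn's.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

RMSE is never smaller than MAE on the same predictions, and the gap between them widens when a few errors are much larger than the rest.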

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
# Setting Model as index in r2 score dataframe:

score_df = score_df.set_index('Model')

In [None]:
score_df

In [None]:
# Visualising comparison of R2 score of different models using barplot.

plt.figure(figsize = (15,8))
sns.barplot(data =score_df, x = score_df['R2 Score'], y= score_df.index, order=score_df.index, hue = score_df['R2 Score'] )
plt.title('Comparison of R2 scores of different models')
plt.show()

Based on the evaluation metrics above, I chose the following models as my final prediction models:

1. **Gradient Boosting Regression(Grid Search CV)**: This model has the lowest MAE, MSE, and RMSE values, indicating that it makes more accurate predictions than the other models. Its R2 score of 0.90 indicates that it explains a large proportion of the variance in the target variable, which is desirable in bike sharing prediction. The business impact can be significant, as accurate predictions help bike-sharing companies optimize their operations, improve customer satisfaction, and reduce costs.
2. **Random Forest regression(Grid Search CV)**: This model also has low MAE, MSE, and RMSE values, indicating good predictive performance. Its R2 score of 0.90 is also high. Like the Gradient Boosting model, it can help bike-sharing companies optimize their operations, improve customer satisfaction, and reduce costs.

* It is important to note that the business impact of these models also depends on other factors such as the cost of implementing the model, availability of data, and market conditions.

In terms of features, Temperature, Functioning Day and Rainfall play a very important role in both of these models. The importances of the other features differ between the two models.
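Rankings like this come from the `feature_importances_` attribute of a fitted tree-based model. The following sketch shows the idea on synthetic data in which only one column drives the target. The column names `feat_a`, `feat_b` and `feat_c` are hypothetical and unrelated to the bike-sharing features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["feat_a", "feat_b", "feat_c"])

# The target depends only on feat_a, plus a little noise.
target = 5.0 * demo["feat_a"] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(demo, target)

# Pair each importance with its column name and sort from most to least important.
importance = pd.Series(forest.feature_importances_, index=demo.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

The importances of a scikit-learn forest sum to 1, so each value can be read as a share of the total impurity reduction.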

# **Conclusion**

* We calculated MAE, MSE, RMSE and R2 score for each model and judged model performance primarily on the R2 score.
* Our assumption: if the difference in R2 score between the training and test data is more than 5%, we consider the model overfitted.
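The overfitting rule above can be written as a small helper. This is a sketch that interprets "5%" as a gap of 0.05 in R2 (five percentage points), which is an assumption about the intended meaning; the function name `r2_gap_exceeds` is hypothetical.

```python
def r2_gap_exceeds(train_r2, test_r2, threshold=0.05):
    """Return True when the train R2 exceeds the test R2 by more than `threshold`."""
    return (train_r2 - test_r2) > threshold

# Rounded figures quoted in this conclusion for the untuned models.
print(r2_gap_exceeds(0.98, 0.90))  # Random Forest: gap of about 0.08
print(r2_gap_exceeds(0.86, 0.83))  # Gradient Boosting: gap of about 0.03
```

A single fixed threshold is a rule of thumb; a cross-validated score gives a more stable picture than one train/test split.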
## Linear, Lasso and Ridge

1.    From the above data frame, we can see that the Linear, Lasso and Ridge regression models have very similar R2 scores (about 73%) on both training and test data. Even after using GridSearchCV, the results are similar to those of the base models. The exception is the untuned Lasso baseline, which scored about 60% in the score table before tuning.

## Random Forest:


*   With the Random Forest regressor, without hyperparameter tuning we got an R2 score of 98% on training data and 90% on test data. The model memorised the training data, so by our assumption it was overfitted.
* After hyperparameter tuning we got an R2 score of 97% on training data and 90% on test data. Test performance remains very good, although the gap is still above our 5% threshold.

## Gradient Boosting Regression (Gradient Boosting Machine):
* With the Gradient Boosting regressor, without hyperparameter tuning we got an R2 score of 86% on training data and 83% on test data. The model performed well without hyperparameter tuning.
* After hyperparameter tuning we got an R2 score of 98% on training data and 90% on test data, so tuning improved test performance, although the train-test gap now exceeds our 5% threshold.
* Thus Gradient Boosting Regression (GridSearchCV) and Random Forest (GridSearchCV) give good R2 scores. We can deploy these models.
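As a first step towards deployment, the scaler and the model can be bundled into one pipeline and saved to disk. The sketch below is only one possible approach and is not part of this notebook: it uses synthetic data, a temporary directory, and a hypothetical file name, `bike_demand_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (not the bike-sharing data).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Bundling the scaler with the model ensures new data is scaled the same way.
pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
)
pipeline.fit(X_demo, y_demo)

# Save to a temporary location and load it back (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.joblib")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)

print(np.allclose(pipeline.predict(X_demo), restored.predict(X_demo)))
```

Saving the whole pipeline rather than the bare model avoids having to re-create the fitted scaler separately at prediction time.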