<a href="https://colab.research.google.com/github/Pruthviraj3196/capstone---2---Bike-Sharing-Demand-Prediction/blob/main/Copy_of_Sample_ML_Submission_Template_Bike_Sharing_Demand_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction 



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name              -Pruthviraj Gopinath Barbole**


# **Project Summary -**

Bike sharing has become a popular mode of transportation in urban areas around the world. Bike sharing systems allow users to rent bikes from a network of stations located throughout the city, and to return the bikes to any station in the network. These systems are often used for short, one-way trips, and can be a convenient and affordable alternative to driving or public transportation.

In this project, we explore the problem of bike sharing demand prediction using the publicly available Seoul Bike Sharing Demand dataset from the UCI Machine Learning Repository. The dataset contains hourly bike rental counts for the bike sharing system in Seoul over one year (December 2017 to November 2018), along with a variety of weather and seasonal features.

Our goal is to build a machine learning model that can accurately predict the number of bikes that will be rented at each hour of the day, based on the available features. To do this, we first explore and preprocess the data, including checking for missing and duplicate values, treating outliers, encoding categorical variables, and scaling the numeric features.

We then train several different machine learning models, including linear regression, decision trees, random forests, and gradient boosting, and evaluate their performance using a variety of metrics, including mean absolute error, mean squared error, and R-squared. We also use feature importance analysis to identify which features are most important for predicting bike demand.

Finally, we discuss the implications of our findings for bike sharing companies and city planners. Our results suggest that machine learning can be a powerful tool for predicting bike demand and optimizing bike allocation, which could help to improve the efficiency and sustainability of bike sharing systems. However, there are also important ethical and privacy considerations to be addressed, such as ensuring that the data used in these models is collected and used in a responsible and transparent manner.

Overall, this project highlights the potential of machine learning to address real-world challenges in transportation and urban planning, and underscores the importance of careful data analysis and model evaluation in developing effective solutions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour for a stable supply of rental bikes.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import StackingRegressor

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
df_bike  = pd.read_csv('/content/drive/MyDrive/data/bike data/SeoulBikeData.csv',encoding='unicode_escape')

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
df_bike.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
df_bike.shape

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df_bike.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
df_bike.duplicated().sum()

The counts above show that the dataset contains no missing values and no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
df_bike.isnull().sum()

There are no missing values to handle in the given dataset.

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
plt.figure(figsize=(14, 5))
sns.heatmap(df_bike.isnull(), cbar=True, yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("missing values in column",fontweight="bold",size=17)
plt.show()

### What did you know about your dataset?

1 This dataset contains 8760 rows and 14 columns.

2 Three categorical features: ‘Seasons’, ‘Holiday’ and ‘Functioning Day’.

3 One datetime feature: ‘Date’.

4 We have several numerical variables such as temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall and snowfall, which describe the environmental conditions at that particular hour of the day.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
df_bike.columns

In [None]:
# Dataset Describe

In [None]:
df_bike.describe()

### Variables Description 

*Date* : Date of the record in day/month/year format

*Rented Bike Count* : Count of bikes rented at each hour

*Hour* : Hour of the day

*Temperature(°C)* : Temperature in Celsius

*Humidity(%)* : Humidity in the air in %, type : int

*Wind speed (m/s)* : Speed of the wind in m/s, type : float

*Visibility (10m)* : Visibility in units of 10 m, type : int

*Dew point temperature(°C)* : Dew point temperature in Celsius, type : float

*Solar Radiation (MJ/m2)* : Sun contribution, type : float

*Rainfall(mm)* : Amount of rain in mm, type : float

*Snowfall (cm)* : Amount of snow in cm, type : float

*Seasons* : Season of the year, type : str (four seasons are present in the data)

*Holiday* : Whether the day is a holiday or not, type : str

*Functioning Day* : Whether the day is a functioning day or not, type : str

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df_bike.columns.tolist():
  print("No. of unique values in ",i,"is",df_bike[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

Some of the column names in the dataset are long and clumsy, so we rename them to simpler names. This does not affect our end results.



In [None]:
# Renaming Columns
df_bike.rename(columns={'Date': 'date', 'Rented Bike Count': 'bike_count', 'Hour': 'hour',
                   'Temperature(°C)': 'temp', 'Humidity(%)': 'humidity', 'Wind speed (m/s)': 'wind',
                   'Visibility (10m)': 'visibility', 'Dew point temperature(°C)': 'dew_temp',
                   'Solar Radiation (MJ/m2)': 'sunlight', 'Rainfall(mm)': 'rain', 'Snowfall (cm)': 'snow',
                   'Seasons': 'season', 'Holiday': 'holiday', 'Functioning Day': 'functioning_day'}, inplace=True)

In [None]:
df_bike.head()

In [None]:
df_bike.info()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
# finding numerical_features from the dataset
numerical_features = df_bike.describe().columns
numerical_features

In [None]:
for col in numerical_features:
  plt.figure(figsize=(10,7))
  # histplot with a KDE overlay replaces the deprecated distplot
  sns.histplot(x=df_bike[col], kde=True, stat="density")
  plt.xlabel(col)
  plt.show()

1. Why did you pick the specific chart?\
 A distribution plot is used to check the skewness of each column, that is, whether the column is right skewed or left skewed.

2. What is/are the insight(s) found from the chart?\
In these plots we observe that some of our columns are right skewed and some are left skewed. We have to keep this in mind when we apply algorithms.\
Right skewed columns are :- Rented Bike Count (it is also our dependent variable), Wind speed (m/s), Solar Radiation (MJ/m2), Rainfall(mm), Snowfall (cm)\
Left skewed columns are :- Visibility (10m), Dew point temperature(°C)

3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.\
Right skewed and left skewed columns can have a negative impact on our analysis, because models such as linear regression tend to fit strongly skewed data less well.
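
Skewness can also be checked numerically rather than only by eye. The sketch below uses synthetic columns (an assumption, not the bike data) to show how the sign of pandas' `skew()` corresponds to the direction of the tail:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for skewed and symmetric columns (not the real dataset).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "right_skewed": rng.exponential(scale=2.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})
sample["left_skewed"] = -sample["right_skewed"]

# Positive skewness -> long right tail, negative -> long left tail, near 0 -> symmetric.
skewness = sample.skew()
print(skewness)
```

The same call, `df_bike[numerical_features].skew()`, should give one value per numeric column of the notebook's data frame.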

In [None]:
# plotting the box plot for numerical_features
for feature in numerical_features[1:]:
    plt.figure(figsize=(10,6))
    sns.boxplot(x = df_bike[feature])
plt.show()

1. Why did you pick the specific chart?

   A box plot is used to detect outliers in a column, which is why we use it here.

2. What is/are the insight(s) found from the chart?

Here we can see that the columns that contain outliers are Rainfall, Snowfall, Windspeed and Solar Radiation. We can treat these outliers later, when we do feature engineering.

3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which columns contain outliers lets us treat them before modelling, so that a few extreme values do not distort the fit.

## Handling Outliers

In [None]:
# Treating outliers in the target with the IQR rule:
# values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are replaced by the median
q1, q3, median = df_bike.bike_count.quantile([0.25,0.75,0.5])
lower_limit = q1 - 1.5*(q3-q1)
upper_limit = q3 + 1.5*(q3-q1)
df_bike['bike_count'] = np.where(df_bike['bike_count'] > upper_limit, median,np.where(
                            df_bike['bike_count'] < lower_limit,median,df_bike['bike_count']))

# Treating outliers in the weather features by capping at the 99th percentile
for col in ['wind','sunlight','rain','snow']:
  upper_limit = df_bike[col].quantile(0.99)
  df_bike[col] = np.where(df_bike[col] > upper_limit, upper_limit, df_bike[col])

# Cleaning and Manipulating dataset

In [None]:
#parse the "date" column so that "year", "month" and "day" can be extracted from it
import datetime as dt
df_bike['date'] = df_bike['date'].apply(lambda x: dt.datetime.strptime(x,"%d/%m/%Y"))


In [None]:
df_bike['year'] = df_bike['date'].dt.year
df_bike['month'] = df_bike['date'].dt.month

df_bike['day'] = df_bike['date'].dt.day_name()

In [None]:
#creating a new column "week" that marks each day as "weekday" or "weekend"
df_bike['week']=df_bike['day'].apply(lambda x : "weekend" if x=='Saturday' or x=='Sunday' else "weekday" )

In [None]:
# checking the number of weekday and weekend records
df_bike['week'].value_counts()

In [None]:
df_bike.head()

In [None]:
df_bike.columns

In [None]:
# dropping the columns that are no longer needed
df_bike=df_bike.drop(columns=['date','day','year'])

In [None]:
df_bike.head()

In [None]:
df_bike['timeshift'] = df_bike['hour'].apply(lambda x: 'night' if 0<=x<=6 else ('day' if 7<=x<=16 else 'evening'))

In [None]:
df_bike.head()

What all manipulations have you done and insights you found?

We parse the "date" column and derive three columns from it: "year", "month" and "day". The "year" column contains 2 unique values, because the data runs from December 2017 to November 2018. The "day" column holds the name of the weekday for each date. For our purposes we do not need the individual day; we only need to know whether a day is a weekday or a weekend, so we derive the "week" column from it and then drop the "date", "day" and "year" columns.
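
The date handling described above can be sketched on a few hand-written dates. This is a minimal illustration, assuming the same day/month/year format; the `toy` frame is hypothetical, and `dt.dayofweek` is used here as an alternative to comparing day names:

```python
import pandas as pd

# Four consecutive dates in the dataset's day/month/year format (toy data).
toy = pd.DataFrame({"date": ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]})
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%Y")
toy["month"] = toy["date"].dt.month

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturday and Sunday.
toy["week"] = toy["date"].dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
print(toy)
```

1 December 2017 was a Friday, so the four rows come out as weekday, weekend, weekend, weekday.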

In [None]:
#checking AS PER  categorical feature
cat_fea =df_bike.describe(include=['object']).columns
     

In [None]:
cat_fea

In [None]:
# plot a bar plot for each categorical feature count  

for col in cat_fea:
    counts =df_bike[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='red')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col) 
    ax.set_ylabel("Frequency")
plt.show()

1. Why did you pick the specific chart?

 A bar chart presents categorical data with rectangular bars whose heights are proportional to the values they represent, which makes it a natural choice for comparing the number of records in each category.

2. What is/are the insight(s) found from the chart?

These bars count the hourly records in each category, not the number of bikes rented, so they describe how the data is distributed across categories.\
1 - Seasons - the dataset covers one full year of hourly records, so each season contributes a similar number of rows.\
2 - Holiday - the large majority of records fall on days marked as No Holiday.\
3 - Functioning Day - almost all records come from functioning days.\
4 - Timeshift - 'day' has the most records because it spans ten hours (7 to 16), while 'night' and 'evening' span seven hours each and therefore have equal counts.


3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The points below are expectations about demand; they should be confirmed by aggregating bike_count per category rather than read from the record counts.\
1 - Seasons - winter is the coldest season of the year, so customers may prefer taxis or cars over rented bikes, which would mean lower demand in winter.\
2 - Holiday - on working days (No Holiday) offices and workplaces are open, which supports demand; on holidays they are closed, so demand is expected to be lower.\
3 - Functioning Day - no bikes can be rented on non-functioning days, so those days are a direct loss of rentals.\
4 - Timeshift - fewer people travel at night than during the day, so night-time demand is expected to be lower.
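
The bar plots above are drawn from `value_counts()`, so they count hourly records per category. To compare demand across categories, one option is to aggregate `bike_count` instead. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the renamed columns used in this notebook (values are made up).
toy = pd.DataFrame({
    "season": ["Winter", "Winter", "Summer", "Summer", "Summer"],
    "bike_count": [100, 200, 900, 1100, 1000],
})

# Number of records per category: this is what a value_counts() bar plot shows.
record_counts = toy["season"].value_counts().sort_index()

# Average rentals per hourly record: this is what reflects demand.
mean_demand = toy.groupby("season")["bike_count"].mean()

print(record_counts)
print(mean_demand)
```

On the real data, `df_bike.groupby(col)['bike_count'].mean()` for each column in `cat_fea` should give the per-category demand that the business questions are about.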

# Regression plot

In [None]:
continuous_variable =  ['temp','dew_temp', 'humidity', 'wind', 'visibility', 'sunlight', 'rain', 'snow']

In [None]:
#regression plot between Rented_Bike_Count and other variables
for i in continuous_variable:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=df_bike[i],y=df_bike['bike_count'],scatter_kws={"color": 'green'}, line_kws={"color": "red"})

1. Why did you pick the specific chart?\
 We use a regression plot to examine the correlation between each feature and the target. It also shows whether the independent variable has a linear relationship with the dependent variable, which is an assumption that has to be satisfied for models like linear regression.

2. What is/are the insight(s) found from the chart?\
These regression plots show that some of our features have a positive linear relationship with the target variable and some have a negative one.

###  Categorical Encoding

In [None]:
# Lets take care of the categorical features
categorical_features = [i for i in df_bike.columns if i not in df_bike.describe().columns]
categorical_features

In [None]:
# Checking unique value with their counts in categorical features
for col in categorical_features:
  print(df_bike[col].value_counts(),'\n')

In [None]:
# Defining a label encoder based on above data
encoder = {'holiday':{'Holiday':1, 'No Holiday':0},'week':{'weekday':1, 'weekend':0},'functioning_day':{'Yes':1, 'No': 0},
          'timeshift': {'night':0, 'day':1, 'evening':2}}

In [None]:
df_bike.head()

In [None]:
# Label Encoding
df_bike = df_bike.replace(encoder)

# One Hot Encoding (dtype=int keeps the dummy columns numeric for corr() and VIF)
df_bike = pd.get_dummies(df_bike, columns=["season"], prefix='', prefix_sep='', dtype=int)

In [None]:
df_bike.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I have used label encoding for the holiday, week, functioning day and timeshift columns, and one-hot encoding for the season column. Machine learning models can only work with numerical values, so important categorical columns have to be converted (encoded) into numerical variables. This process is known as feature encoding.
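
The two encodings can be compared side by side on a tiny example. This is a sketch on a hypothetical `toy` frame; it uses `Series.map` for the label step, which is one of several equivalent ways to apply a mapping dictionary:

```python
import pandas as pd

toy = pd.DataFrame({
    "holiday": ["No Holiday", "Holiday", "No Holiday"],
    "season": ["Winter", "Summer", "Winter"],
})

# Binary column: map each label to 0/1 (labels missing from the dict would become NaN).
toy["holiday"] = toy["holiday"].map({"Holiday": 1, "No Holiday": 0})

# Nominal column: one indicator column per category, stored as integers.
encoded = pd.get_dummies(toy, columns=["season"], prefix="", prefix_sep="", dtype=int)
print(encoded)
```

Label encoding keeps a single column, which suits two-level features; one-hot encoding avoids implying an order between seasons at the cost of extra columns.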

In [None]:
independent_variables = [i for i in df_bike.columns if i not in ['bike_count']]

# Checking Linearity
plt.figure(figsize=(18,18))
for n,column in enumerate(independent_variables):
  plt.subplot(5, 4, n+1)
  sns.regplot(data = df_bike, x = column, y ='bike_count',line_kws={"color": "red"})
  plt.title(f'Bike_Count v/s {column.title()}',weight='bold')
  plt.tight_layout()

**Hour**

1. There is a sudden peak between 6/7 AM and 10 AM. People travelling to offices and colleges could be the reason for this peak.
2. There is another peak between 10 AM and 7 PM, which may be when the same people leave work.
3. From 7 AM to 7 PM the bike rent count is good, and from 7 PM to 7 AM it declines.

**Temperature**

1. When the temperature falls below 0 degrees Celsius the bike rent count decreases significantly, perhaps because people do not want to ride a bike in such cold weather.
2. At normal temperatures the bike rent count is very high.

**Humidity**

1. Humidity appears to be inversely related to the bike rent count: as the humidity percentage increases, the bike rent count decreases.

**Wind Speed**

1. Up to a wind speed of 4 m/s the bike rent count is good.

**Visibility**

1. As visibility increases the bike rent count also increases. Few people would prefer to ride in low visibility.

**Dew Point Temperature**

1. This is the same case as temperature: as the dew point temperature goes below 0 degrees Celsius the bike rent count is lower. The dew point temperature and temperature columns appear to be strongly collinear.

**Solar Radiation**

1. The number of rented bikes is high when there is solar radiation.

**Rainfall and Snowfall**

1. People usually do not like to ride bikes in rain and snowfall.

In [None]:
#checking skewness of the dependent variable
print(f'Skewness of original data : {df_bike.bike_count.skew()}')
# bike_count can be 0 (e.g. on non-functioning days), so log1p is used to avoid log(0) = -inf
print(f'Skewness after log1p transformation : {np.log1p(df_bike.bike_count).skew()}')
print(f'Skewness after sqrt transformation : {np.sqrt(df_bike.bike_count).skew()}')

### 5. Data Transformation

In [None]:
#The sqrt transformation gives a skewness between -0.5 and 0.5, which indicates a fairly symmetrical distribution, so we plot it
plt.figure(figsize=(9,4))
plot = plt.subplot(1,2,1)
sns.histplot(df_bike['bike_count'], kde=True, stat='density').set_title('Bike_Count Before Transformation',weight='bold')
plot = plt.subplot(1,2,2)
sns.histplot(np.sqrt(df_bike['bike_count']), kde=True, stat='density').set_title('Bike_Count After Transformation',weight='bold')
plt.tight_layout()


The above graph shows that Rented Bike Count has moderate right skewness. 

Linear regression assumes that the residuals are approximately normally distributed, and a strongly skewed dependent variable often leads to skewed residuals, so a square root transformation is a reasonable way to reduce the skew.

After applying the square root to the skewed Rented Bike Count, we get an almost normal distribution. Note that the models below are trained on the untransformed bike_count.
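
If one wanted to train on the square-root scale while still reporting predictions in bikes, scikit-learn's `TransformedTargetRegressor` is one way to do it. The sketch below is an assumption about how that could look, on synthetic data (`X_demo` and `y_demo` are hypothetical); it is not what the cells in this notebook do:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, non-negative, right-skewed target (a stand-in for bike_count).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = (2.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=500)).clip(min=0) ** 2

# The regressor is fitted on sqrt(y); predictions are squared back automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.sqrt, inverse_func=np.square
)
model.fit(X_demo, y_demo)
pred = model.predict(X_demo)  # already on the original (untransformed) scale
print(pred[:5])
```

Because the inverse transform is applied inside `predict`, metrics such as RMSE computed from `pred` stay in the original units.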

# Removing Multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Defining a function to calculate Variance Inflation factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif.sort_values(by='VIF',ascending=False).reset_index(drop=True))

VIF measures how strongly each independent variable is correlated with the others. It is computed by regressing one independent variable against all the other independent variables: the VIF score of a variable represents how well that variable is explained by the other independent variables.
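
The definition can be made concrete with the formula VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features. The sketch below computes it by hand on synthetic features (`vif_by_hand` and `X_demo` are hypothetical names). It fits an intercept, whereas statsmodels' `variance_inflation_factor` uses the matrix exactly as given, so the two can differ unless a constant column is included:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_by_hand(X, i):
    """VIF of column i: regress it on the other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

# Synthetic features: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)
x3 = rng.normal(size=1000)
X_demo = np.column_stack([x1, x2, x3])

vifs = [vif_by_hand(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

The two near-duplicate columns get large VIF values, while the independent column stays close to 1.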

In [None]:
# Checking corelations
plt.figure(figsize=(18,9))
plot=sns.heatmap(abs(df_bike.corr()), annot=True, cmap='coolwarm')
plot.set_xticklabels(plot.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

We can see some highly correlated features. Let's treat them by checking the variance inflation factors and excluding the problematic features from the dataset.

In [None]:
# Checking VIF for each variable
independent_variables = [i for i in df_bike.columns if i not in ['bike_count']]
calc_vif(df_bike[independent_variables])

Summer and Winter can largely be inferred from temperature, which is already present as a feature, so dropping them does not lose much useful information. The cell below also excludes dew_temp, hour and humidity before recomputing the VIF.

In [None]:
# Excluding Winter, Summer, dew_temp, hour and humidity, then rechecking the VIF
independent_variables = [i for i in df_bike.columns if i not in ['bike_count','Winter','Summer','dew_temp','hour','humidity']]
calc_vif(df_bike[independent_variables])

In [None]:
# Updating the dataset
dataset = df_bike[independent_variables + ['bike_count']]

#checking corelations
plt.figure(figsize=(18,9))
plot=sns.heatmap(abs(dataset.corr()), annot=True, cmap='coolwarm')
plot.set_xticklabels(plot.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

1. Why did you pick the specific chart?\
A correlation heatmap is a good chart choice to visualize the relationships between multiple variables in a dataset. It shows the correlation coefficients between each pair of variables as a color-coded matrix, where the intensity of the color represents the strength of the correlation. By using a correlation heatmap, we can easily identify the variables that have a strong positive or negative correlation with each other, which can help in feature selection and modeling. Therefore, it is a good choice for exploring the relationships between different variables in the Bike dataset.

In [None]:
# Checking Linearity of the new dataset

In [None]:
#regression plot between bike_count and the remaining independent variables
for i in independent_variables:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=dataset[i],y=dataset['bike_count'],scatter_kws={"color": 'green'}, line_kws={"color": "red"})

In [None]:
# Checking top 5 rows of the cleaned dataset
dataset.head()

What all feature selection methods have you used and why?

In this project, I have used the Variance Inflation Factor (VIF) method to check for feature selection. The VIF score indicates how much the variance of a regression coefficient is increased due to multicollinearity in the data. High VIF values indicate high multicollinearity among the independent variables, which may lead to inaccurate and unstable estimates of the regression coefficients. Therefore, it is essential to remove highly correlated variables or combine them to reduce their VIF score to an acceptable level. Hence, I have used the VIF method to identify highly correlated variables in the data set and then remove or combine them to reduce multicollinearity.

Which all features you found important and why?

In the initial VIF check, temperature and dew point temperature have VIF values of 33.659374 and 17.264367 respectively, indicating multicollinearity. This means that these two variables are highly correlated with other independent variables in the dataset. In such cases, it is generally recommended to remove one of the highly correlated variables to reduce the impact of multicollinearity on the regression model, taking into account domain knowledge and the impact of each variable on the dependent variable. Temperature and dew point temperature are both factors that can influence bike rental counts, but because they are strongly collinear, dew_temp is removed and temp is kept. Regularization techniques are another way to handle any remaining multicollinearity.

After removing Winter, Summer, dew_temp, hour and humidity, it seems like all the remaining features have VIF values lower than the threshold value of 10, indicating that there is no serious multicollinearity issue between them.

Hence, all remaining features can be considered as important for modeling.

## Defining Independent and Dependent & Data Splitting

In [None]:
# Data for independent and dependent set
X=dataset.drop(['bike_count'],axis=1)
y=dataset.loc[:,'bike_count']

In [None]:
X

In [None]:
# Shape of independent and dependent dataset
print(X.shape)
print(y.shape)    

In [None]:
# Splitting the data using the Train Test Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)    

In [None]:
# Checking the size of training and testing data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)     

1 . What you have done and what insight you have got?

Date and dew point temperature have been dropped from the dataset for modelling: the date is no longer needed once the month and weekday/weekend information has been extracted, and dew point temperature was dropped due to multicollinearity.\
Then I split the dataset in an 80:20 ratio into training and testing sets.\
The training dataset contains 7008 records with 13 features and the corresponding target variable, and the testing dataset contains 1752 records with 13 features and the corresponding target variable.

2 - What data splitting ratio have you used and why?\
I used an 80:20 split (test_size=0.2). Data splitting divides the data into two or more subsets; with a two-part split, one part is used to train the model and the other to evaluate it on data it has not seen. An 80:20 ratio leaves most of the data for training while keeping 1752 records for evaluation.
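
The record counts quoted above follow directly from the split ratio. A minimal sketch, using placeholder arrays with the same number of rows as the dataset (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 8760 rows, matching the dataset's row count.
X_demo = np.arange(8760 * 2).reshape(8760, 2)
y_demo = np.arange(8760)

# test_size=0.2 sends 20% of the rows to the test set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)
```

With 8760 rows this gives 7008 training rows and 1752 test rows, the same sizes reported for the real split.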

## Data Scaling

In [None]:
# Rescaling the data: fit the scaler on the training set only, then apply it to the test set
scalar=MinMaxScaler()
X_train=scalar.fit_transform(X_train)
X_test=scalar.transform(X_test)   

In [None]:
X_train

Which method have you used to scale your data and why?

I have used MinMaxScaler to scale the data.

MinMaxScaler scales each feature to the range 0 to 1. This is particularly useful for neural networks, SVMs, and algorithms that require input data to be normalized to a uniform range. It preserves the shape of each feature's original distribution, because it only shifts and rescales the values.

Note that MinMaxScaler is sensitive to outliers, because the minimum and maximum of each feature define the scaled range; the outlier capping applied earlier limits this effect.
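
How the scaler treats training and test data can be seen on a three-value example. This is a small sketch with made-up numbers; each value is transformed as (x - min) / (max - min), with min and max taken from the training data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])   # 20.0 lies outside the training range

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=0 and max=10 from the training data
test_scaled = scaler.transform(test)         # reuses them: (x - 0) / (10 - 0)
print(train_scaled.ravel(), test_scaled.ravel())
```

Test values outside the training range map outside [0, 1], which is expected: refitting the scaler on the test set would leak information from it.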

## ***7. ML Model Implementation***

Machine learning models can be described as programs that are trained to find patterns or trends within data and predict the result for new data.

In this project we are dealing with a regression problem, therefore we will be using regression models. Some popular examples are Linear Regression and polynomial regression.

In this project we will include the following models:

1. Linear regression.

2. Lasso regression (Linear regression with L1 regularization).

3. Decision Tree.

4. Random forest regression.

### ML Model - 1

**Linear Regression**

In [None]:
# Fitting Multiple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
#Predicting on Training set
y_pred_train = regressor.predict(X_train)

In [None]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

In [None]:
#Train Performance
#Mean Squared error
MSE  = mean_squared_error((y_train), (y_pred_train))
print("MSE :" , MSE)

#Root Mean Squared Error
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

#R2 score
r2 = r2_score((y_train), (y_pred_train))
print("R2 :" ,r2)

#Adjusted R2 score
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_pred_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test Performance
#Mean Squared Error
MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

#Root Mean Squared Error
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

#R2 score
r2 = r2_score((y_test), (y_pred))
print("R2 :" ,r2)

#Adjusted R2 score
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
#Dataframe of actual vs predicted values
predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

In [None]:
#Random sampling from predictions dataframe and plotting
predictions.sample(50).plot(kind='bar',figsize=(14,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Observation** \
From the above output, we can see that the R-squared value is 0.5573, which means that approximately 56% of the variance in the target variable is explained by the independent variables in our model. The adjusted R-squared value is 0.5540, which is similar to the R-squared value.

The Root Mean Squared Error (RMSE) is 390.6229, which means that the predicted bike count typically differs from the actual bike count by about 391 bikes. The Mean Squared Error (MSE) is 152586.2585, which is the average of the squared differences between the predicted bike count and the actual bike count.

We have to make our model more complex for better predictions, or move to tree and ensemble algorithms for better results.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Linear Regression. Linear Regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more input features.

Now, let's analyze the performance of the Linear Regression model using the provided evaluation metric score chart:

The provided evaluation metric score chart indicates that the Linear Regression model has an R-squared value of 0.5573, suggesting a moderate level of prediction accuracy. The adjusted R-squared value of 0.5540 indicates a similar level of accuracy after accounting for the number of features. The RMSE value of 390.6229 indicates the typical prediction error in bikes, and the MSE value of 152586.2585 provides a measure of the squared error.
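
The four metrics are computed the same way for every model, so they can be gathered in one helper. The sketch below is a hypothetical `regression_report` function, checked on a small hand-made example; the adjusted R-squared follows 1 - (1 - R2) * (n - 1) / (n - p - 1), with n samples and p features:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "R2": r2, "Adjusted R2": adj_r2}

# Hand-made example: three predictions are off by 1, three are exact.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0, 10.0, 13.0])
report = regression_report(y_true, y_pred, n_features=2)
print(report)
```

In the notebook it could be called as `regression_report(y_test, y_pred, X_test.shape[1])`, which keeps the train and test cells from drifting apart.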

### Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
lasso_1  = Lasso(alpha=0.001 , max_iter= 3000)
lasso_1.fit(X_train, y_train)

In [None]:
# R2 score of the lasso regression model on the training data
lasso_1.score(X_train, y_train)

In [None]:
lasso_1.coef_

In [None]:
# prediction of test data
y_pred_lasso_1 = lasso_1.predict(X_test)

In [None]:
# Finding the Evaluation Metrics
MSE  = mean_squared_error(y_test, y_pred_lasso_1)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_lasso_1)
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test, y_pred_lasso_1))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
# Plotting actual vs predicted bike count for lasso regression
plt.figure(figsize=(15,5))
plt.plot((y_pred_lasso_1)**2)
plt.plot(np.array((y_test)**2))
plt.legend(["Predicted","Actual"])
plt.show()

**Observation** \
From the above output, we can see that the R-squared value is 0.5573, which means that approximately 56% of the variance in the target variable is explained by the independent variables in our model. The adjusted R-squared value is 0.5540, which is close to the R-squared value.

The Root Mean Squared Error (RMSE) is 390.6229, which means that predictions are typically off by about 391 in the units of the bike count. The Mean Squared Error (MSE) is 152586.2103, which is the average of the squared differences between the predicted bike count and the actual bike count.

Lasso does not improve on plain Linear Regression here, so we need a more flexible model, such as tree-based and ensemble algorithms, for better results.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Lasso Regression. Lasso Regression is a linear regression technique that includes a regularization term to shrink the coefficients of the less important features to zero, effectively performing feature selection.

Now, let's analyze the performance of the Lasso Regression model using the provided evaluation metric score chart:

The provided evaluation metric score chart indicates that the Lasso Regression model has an R-squared value of 0.5573, suggesting a moderate level of prediction accuracy. The adjusted R-squared value of 0.5540 indicates a similar level of accuracy after accounting for the number of input features. The RMSE value of 390.6229 indicates the typical size of the prediction error, and the MSE value of 152586.2103 provides a measure of the squared error. The performance metrics for Lasso Regression are nearly identical to those of Linear Regression (the MSE differs only in the first decimal place), which suggests that with `alpha=0.001` the regularization term was too weak to change the coefficients noticeably.
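The way an L1 penalty pushes small coefficients to exactly zero can be illustrated with the soft-thresholding operator, which is the per-coefficient update used in coordinate-descent solvers for lasso. The sketch below is a simplified illustration that assumes standardized, uncorrelated features; the coefficient values are hypothetical and are not the fitted `lasso_1.coef_`.

```python
import numpy as np


def soft_threshold(coefs, alpha):
    """Shrink each coefficient toward zero by alpha; zero it if |coef| <= alpha."""
    coefs = np.asarray(coefs, dtype=float)
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)


# Hypothetical least-squares coefficients, for illustration only
ols_coefs = np.array([12.0, -3.5, 0.4, -0.05, 0.02])

zero_counts = {}
for alpha in (0.001, 0.1, 1.0):
    shrunk = soft_threshold(ols_coefs, alpha)
    zero_counts[alpha] = int(np.sum(shrunk == 0.0))
    print(f"alpha={alpha}: {shrunk} (zeroed: {zero_counts[alpha]})")
```

With a very small `alpha` none of the coefficients are removed and each one barely moves, which is consistent with lasso behaving almost exactly like ordinary linear regression at `alpha=0.001`.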

### Decision Tree

In [None]:
# Standardize the features with StandardScaler (tree-based models do not require scaling, but it does not hurt them)
from sklearn.preprocessing import StandardScaler
scalar=StandardScaler()
X_train=scalar.fit_transform(X_train)
X_test=scalar.transform(X_test)

In [None]:
# Fitting the Decision Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
regressor=DecisionTreeRegressor(criterion='friedman_mse', max_leaf_nodes=9, max_depth=5)
regressor.fit(X_train,y_train)

In [None]:
# Checking the R-squared on the training data
regressor.score(X_train,y_train)

In [None]:
# Predicting the result on the test data
y_pred=regressor.predict(X_test)

In [None]:
# Calculating the R-squared for the test data
r2_score(y_test,y_pred)

In [None]:
# Applying Grid Search for Decision Tree
from sklearn.model_selection import GridSearchCV
param = {'max_depth' : [1,4,5,6,7,10,15,20,8], 'max_leaf_nodes':[5,10,20,25,30,40,45]}

decision_tree=DecisionTreeRegressor()

gridSearch_decisionTree=GridSearchCV(decision_tree,param,scoring='r2',cv=5)
gridSearch_decisionTree.fit(X_train,y_train)
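It can help to know how much work a grid search does before running it. The sketch below counts the candidate settings in the grid above and the number of model fits under 5-fold cross-validation; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above
param = {
    'max_depth': [1, 4, 5, 6, 7, 10, 15, 20, 8],
    'max_leaf_nodes': [5, 10, 20, 25, 30, 40, 45],
}
cv_folds = 5

combinations = list(product(*param.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Every extra hyperparameter multiplies the number of candidates, so larger grids become expensive quickly.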

In [None]:
# Best params for decision tree
gridSearch_decisionTree.best_params_

In [None]:
# Best cross-validated score and optimal parameters
print('The best hyperparameter for Decision Tree :',gridSearch_decisionTree.best_params_)
print('The best score:',gridSearch_decisionTree.best_score_)

In [None]:
# Optimal Model for decision tree
optimal_DecisionTree=gridSearch_decisionTree.best_estimator_

In [None]:
# Predicting the Output value
y_pred_dt=optimal_DecisionTree.predict(X_test)

In [None]:
# Training score
optimal_DecisionTree.score(X_train,y_train)

In [None]:
# Finding the Evaluation Metrics
MSE  = mean_squared_error(y_test, y_pred_dt)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_dt)
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test, y_pred_dt))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
# Plot figure for Actual and Predicted test value
plt.figure(figsize=(12,6))
plt.plot((y_pred_dt**2)[:100])
plt.plot((np.array(y_test**2)[:100]))
plt.legend(["Predicted","Actual"])
plt.xlabel('Number of Test Data', fontsize=16)
plt.title('Decision Tree',fontsize= 20, fontweight='bold')
plt.show()

**Observation** \
From the above output, we can see that the R-squared value is 0.6969, which means that approximately 70% of the variance in the target variable is explained by the independent variables in our model. The adjusted R-squared value is 0.6946, which is close to the R-squared value.

The Root Mean Squared Error (RMSE) is 323.2012, which means that predictions are typically off by about 323 in the units of the bike count. The Mean Squared Error (MSE) is 104459.0230, which is the average of the squared differences between the predicted bike count and the actual bike count.

The Decision Tree improves on the linear models, but there is still room for improvement, so we next try an ensemble algorithm.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Decision Tree. Decision Tree is a non-parametric supervised learning algorithm that creates a tree-like model of decisions and their possible consequences.

Now, let's analyze the performance of the Decision Tree model using the provided evaluation metric score chart:

The provided evaluation metric score chart indicates that the Decision Tree model has an R-squared value of 0.6969, suggesting a relatively good level of prediction accuracy. The adjusted R-squared value of 0.6946 indicates a similar level of accuracy, considering the complexity of the model and the number of input features. The RMSE value of 323.2012 indicates the average prediction error, and the MSE value of 104459.0230 provides a measure of the squared error. Overall, the Decision Tree model seems to perform well based on the provided evaluation metrics.

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Providing the range of values for the hyperparameter, so it can be used for gridsearch
n_estimators=[75,100,125]

# Max Depth of the tree
max_depth=[4,6,8]

# Minimum number of samples required for splitting a node
min_samples_split=[50, 70, 90, 110]

# Minimum number of samples in the leaf node
min_samples_leaf=[40,50]   # To avoid the overfitting of data

# Hyperparameter grid
grid_dict={'n_estimators':n_estimators,
           'max_depth':max_depth,
           'min_samples_split':min_samples_split,
           'min_samples_leaf':min_samples_leaf}

In [None]:
# Creating an instance of the random forest
rf_model=RandomForestRegressor()

# Perform the gridsearch
rf_grid=GridSearchCV(estimator=rf_model, param_grid=grid_dict, scoring='r2',verbose=0, cv=5)

rf_grid.fit(X_train,y_train)

In [None]:
# Find the best parameters for the RandomForestRegressor
rf_grid.best_params_

In [None]:
# Best mean cross-validated R-squared score found by the grid search
print('The best score:',rf_grid.best_score_)

In [None]:
# Optimal Model
rf_optimal_model=rf_grid.best_estimator_

In [None]:
# Making predictions on the test data
y_pred_rf=rf_optimal_model.predict(X_test)

In [None]:
# Training Score
rf_optimal_model.score(X_train,y_train)

In [None]:
# Finding the Evaluation Metrics
MSE  = mean_squared_error(y_test, y_pred_rf)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_rf)
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test, y_pred_rf))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
# Plotting actual and predicted values
plt.figure(figsize=(18,6))
plt.plot((y_pred_rf)[:500])
plt.plot((np.array(y_test)[:500]))
plt.legend(["Predicted","Actual"])
plt.title('Actual and Predicted Bike Counts')

**Observation** \
From the above output, we can see that the R-squared value is 0.6948, which means that approximately 69% of the variance in the target variable is explained by the independent variables in our model. The adjusted R-squared value is 0.6925, which is close to the R-squared value.

The Root Mean Squared Error (RMSE) is 324.3009, which means that predictions are typically off by about 324 in the units of the bike count. The Mean Squared Error (MSE) is 105171.1026, which is the average of the squared differences between the predicted bike count and the actual bike count.

The Random Forest performs about the same as the tuned Decision Tree, so further gains would likely require a wider hyperparameter search or other ensemble algorithms.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Random Forest. Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. It improves upon the individual decision tree's performance by reducing overfitting and increasing prediction accuracy.
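The intuition behind combining many trees is that averaging several noisy predictions gives a more stable result than any single one. The toy simulation below illustrates this with hypothetical predictors whose errors are fully independent; it does not use the bike data or scikit-learn, and in a real forest the trees are correlated, so the reduction in spread is smaller than shown here.

```python
import random
import statistics

random.seed(0)


def noisy_prediction(true_value=100.0, noise_sd=20.0):
    """One hypothetical predictor: the true value plus independent Gaussian noise."""
    return random.gauss(true_value, noise_sd)


def ensemble_prediction(n_models):
    """Average the outputs of n_models independent noisy predictors."""
    return statistics.fmean([noisy_prediction() for _ in range(n_models)])


# Repeat the experiment many times to measure how much each approach fluctuates
single = [ensemble_prediction(1) for _ in range(2000)]
averaged = [ensemble_prediction(100) for _ in range(2000)]

sd_single = statistics.stdev(single)
sd_averaged = statistics.stdev(averaged)
print(f"Spread of a single predictor:  {sd_single:.2f}")
print(f"Spread of a 100-model average: {sd_averaged:.2f}")
```

The averaged prediction fluctuates far less around the true value, which is the variance-reduction effect that Random Forest relies on.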

Now, let's analyze the performance of the Random Forest model using the provided evaluation metric score chart:

The provided evaluation metric score chart indicates that the Random Forest model has an R-squared value of 0.6948, suggesting a reasonably good level of prediction accuracy. The adjusted R-squared value of 0.6925 indicates a similar level of accuracy, considering the complexity of the model and the number of input features. The RMSE value of 324.3009 indicates the average prediction error, and the MSE value of 105171.1026 provides a measure of the squared error. Overall, the Random Forest model seems to perform well based on the provided evaluation metrics.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
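The two cells above were left empty. A minimal sketch of the save-and-reload round trip with `joblib` is shown below. It trains a small tree on synthetic stand-in data so that it runs on its own; the file name, the `X_demo`/`y_demo` arrays, and the model settings are assumptions made for illustration, not the notebook's actual objects.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the bike dataset), just to have a fitted model to save
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_demo, y_demo)

# 1. Save the fitted model (hypothetical file name, written to a temporary folder)
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.joblib")
joblib.dump(model, model_path)

# 2. Load it back and predict on unseen rows as a sanity check
loaded_model = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 4))
same_predictions = np.allclose(model.predict(X_unseen), loaded_model.predict(X_unseen))
print("Loaded model reproduces the original predictions:", same_predictions)
```

In this notebook the same pattern would apply to `optimal_DecisionTree`, and the fitted `StandardScaler` would need to be saved alongside it so that new data is transformed the same way before prediction.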


# **Conclusion**

- The temperature, hour, and solar radiation features were found to be the most relevant for predicting the bike count required at each hour for a stable supply of rental bikes.

- Other factors such as rainfall and snowfall also affect the demand for rental bikes, because riding in heavy rain or snow can be dangerous.

- Rental bike demand is high in the daytime, so bikes should be available at that time to meet the demand.

- In the Functioning Day column, if it is not a functioning day, there is no demand.

- Across the features analyzed, people prefer to rent bikes when the temperature is around 25 degrees Celsius.

- Bike demand increases with an increase in visibility and decreases with an increase in humidity.

Results from ML models:

- Decision Tree Regression is the best performing model, with an R-squared score of 0.6969.

- Actual vs. predicted visualisation is done for all 4 models.

In conclusion, the tree-based models (Decision Tree and Random Forest) outperform Linear Regression and Lasso Regression, both in explaining the variance in the target variable (higher R-squared values) and in prediction error (lower RMSE and MSE values). The two tree-based models perform similarly, with the Decision Tree slightly ahead on every metric: R-squared of 0.6969 vs. 0.6948 and RMSE of 323.2012 vs. 324.3009. Therefore, the tree-based models are more suitable for this problem.
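The comparison can be reproduced from the test-set metrics reported in the observations for each model. The sketch below simply tabulates those reported numbers and picks the best model by each metric; the `results` dictionary is introduced here for illustration.

```python
# Test-set metrics as reported in the observations for each model
results = {
    "Linear Regression": {"r2": 0.5573, "rmse": 390.6229},
    "Lasso Regression": {"r2": 0.5573, "rmse": 390.6229},
    "Decision Tree": {"r2": 0.6969, "rmse": 323.2012},
    "Random Forest": {"r2": 0.6948, "rmse": 324.3009},
}

best_by_r2 = max(results, key=lambda name: results[name]["r2"])
best_by_rmse = min(results, key=lambda name: results[name]["rmse"])

# Print the models from lowest to highest RMSE
for name, m in sorted(results.items(), key=lambda item: item[1]["rmse"]):
    print(f"{name:<18} R2={m['r2']:.4f}  RMSE={m['rmse']:.4f}")
print("Best by R2:", best_by_r2)
print("Best by RMSE:", best_by_rmse)
```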
