# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Dipak Someshwar


# **Project Summary -**

The goal of this project is to develop a machine learning model to predict the demand for bike sharing. The dataset used for this project contains various features related to weather conditions, date and time, and other factors that may influence bike rental demand.

Overall, the bike sharing demand prediction project aimed to provide an accurate and reliable model to forecast the bike rental demand, which can be beneficial for bike sharing companies or city planners in optimizing bike availability and improving operational efficiency.

# **GitHub Link -**

https://github.com/Dipak9699-ds/Internship/tree/main/Almabetter%20Capstone%20Projects/Regression%20-%20Bike%20Sharing%20Demand%20Prediction

# **Problem Statement**


**Currently, rental bikes are introduced in many urban cities to enhance mobility comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to maintain a stable supply of rental bikes.**

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')

### Mount the drive & Dataset Loading

In [None]:
# Mount Google Drive to import the dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
bike_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/SeoulBikeData.csv',encoding ='latin')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(bike_df.shape)

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate or Non Duplicate Value Count
bike_df.duplicated().value_counts()

In [None]:
# Dataset Duplicate Value Count
len(bike_df[bike_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(bike_df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(bike_df.isnull())

### What did you know about your dataset?

The dataset has 8760 rows and 14 columns. It contains no missing values and no duplicate rows.
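
The checks above (shape, duplicates, missing values) can be gathered into one small helper. This is a minimal sketch, not part of the original notebook: `summarize_quality` and `toy_df` are hypothetical names, and the toy frame only stands in for `bike_df`.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts plus duplicate-row and missing-value totals."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }


# Hypothetical toy frame; in the notebook this would be bike_df
toy_df = pd.DataFrame({"Hour": [0, 1, 1, 2], "Rainfall": [0.0, 0.5, 0.5, None]})
quality = summarize_quality(toy_df)
print(quality)
```

Calling `summarize_quality(bike_df)` in the notebook should reproduce the figures quoted above if the dataset matches the description.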

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe().T.style.background_gradient()

### Variables Description

*   Date : Date (day/month/year)
*   Rented Bike Count : Count of bikes rented at each hour
*   Hour : Hour of the day (0-23)
*   Temperature : Temperature (in Celsius)
*   Humidity : Humidity measure (in %)
*   Wind speed : Wind speed (m/s)
*   Visibility : Visibility measure (in units of 10 m)
*   Dew point temperature : Dew point temperature measure (in Celsius)
*   Solar Radiation : Solar radiation (MJ/m2)
*   Rainfall : Rainfall measure (in mm)
*   Snowfall : Snowfall measure (in cm)
*   Seasons : Winter, Spring, Summer, Autumn
*   Holiday : Whether the day is a holiday or not
*   Functioning Day : Whether the day is a functioning day or not











### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in bike_df.columns:
  print("No. of unique values in ",i,"is",bike_df[i].nunique(),".")

In [None]:
# Check Unique Values for each variable.
for column in bike_df.columns:
  print(str(column) + ' : ' + str(bike_df[column].unique()))
  print('____________________________________________')

### Changing column name

In [None]:
# Rename the columns to simpler, code-friendly names
bike_df = bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

### Breaking date column

In [None]:
# Parse the "Date" strings (day/month/year) into datetime values
bike_df['Date'] = pd.to_datetime(bike_df['Date'], format="%d/%m/%Y")

In [None]:
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
# Create a "weekdays_weekend" flag (1 = Saturday/Sunday, 0 = weekday), then drop "Date", "day" and "year"
bike_df['weekdays_weekend'] = bike_df['day'].apply(lambda x: 1 if x in ('Saturday', 'Sunday') else 0)
bike_df = bike_df.drop(columns=['Date', 'day', 'year'])

In [None]:
bike_df.head()

In [None]:
bike_df.info()

In [None]:
bike_df['weekdays_weekend'].value_counts()

### Changing data type

In [None]:
# Convert the int64 columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:
# Let's check the result of data type
bike_df.info()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create a copy of the current dataset and assign it to bike_df1
bike_df1 = bike_df.copy()

### What all manipulations have you done and insights you found?

* First, the columns were renamed to simpler, consistent names.
* pandas reads the "Date" column as an object type, that is, as a string. Because the date is important for analyzing user behaviour, it was converted to datetime format and then split into three columns: 'year', 'month', and 'day'.
* The "year" column has only 2 unique values because the data runs from December 2017 to November 2018. Treating this period as a single year, the "year" column is not needed, so it was dropped.
* The "day" column holds the name of the day of the week. The individual day is not needed, only whether it is a weekday or a weekend, so a "weekdays_weekend" column was derived from it and the "day" column was dropped.
* The "Hour", "month", and "weekdays_weekend" columns are stored as integers but are really categorical. They were converted to the category data type; otherwise later analysis and correlations would treat them as continuous numbers and the results could be misleading.
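
The date handling described in the list above can be condensed into one function. The sketch below is an alternative formulation under stated assumptions, not the notebook's own code: `add_date_features` is a hypothetical helper, it uses `dt.dayofweek` instead of day names, and the toy counts are made up for illustration.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Derive 'month' and 'weekdays_weekend' from a day/month/year string column."""
    out = df.copy()
    parsed = pd.to_datetime(out[date_col], format="%d/%m/%Y")
    out["month"] = parsed.dt.month
    # dayofweek: Monday=0 ... Sunday=6, so values 5 and 6 are the weekend
    out["weekdays_weekend"] = (parsed.dt.dayofweek >= 5).astype(int)
    out = out.drop(columns=[date_col])
    # Store both derived columns as categories, as the notebook does
    for col in ["month", "weekdays_weekend"]:
        out[col] = out[col].astype("category")
    return out


# Hypothetical toy rows: 1 to 3 December 2017 (Friday, Saturday, Sunday)
toy_df = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "03/12/2017"],
    "Rented_Bike_Count": [10, 20, 30],
})
features = add_date_features(toy_df)
print(features)
```

Using `dayofweek` avoids string comparisons on day names, but either approach gives the same weekend flag.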

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df,x='month',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Month')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count across the months.

##### 2. What is/are the insight(s) found from the chart?

From the above bar plot we can clearly see that demand for rented bikes is higher from month 5 to month 10 (May to October) than in the other months. This period includes the summer season.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by month provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts by month provides insights to optimize operations, marketing efforts, and resource allocation. While these insights generally lead to positive business impact, careful consideration and proactive strategies are necessary to mitigate any negative growth factors such as seasonal variations, external influences, or market saturation.


#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df,x='weekdays_weekend',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count between weekdays and weekends.

##### 2. What is/are the insight(s) found from the chart?

From the above bar plot we can say that demand for bikes is higher on weekdays (shown in blue), most likely because of office commutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by weekday and weekend provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts between weekdays and weekends can provide valuable insights for businesses to optimize their operations, pricing, and marketing strategies. While these insights generally lead to positive business impact, negative growth factors can arise due to seasonal variations, competitive factors, or external influences. Adapting strategies to mitigate these challenges is crucial for maintaining positive growth and customer satisfaction.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Analysis of data by visualization (pointplot shows the mean count per hour for each group)
fig,ax=plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and weekdays_weekend')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

From the above point plot we can say that demand for bikes is higher on weekdays (shown in blue), most likely because of office commutes.
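
The values behind a point plot like this one can be inspected as a table. The sketch below assumes the column names used in this notebook; `mean_count_by_hour` is a hypothetical helper and the toy numbers are invented purely to show the shape of the result.

```python
import pandas as pd


def mean_count_by_hour(df: pd.DataFrame, group_col: str = "weekdays_weekend") -> pd.DataFrame:
    """Mean rented bike count per hour, with one column per level of group_col."""
    return df.pivot_table(
        index="Hour",
        columns=group_col,
        values="Rented_Bike_Count",
        aggfunc="mean",
    )


# Hypothetical toy data: two hours, weekday (0) versus weekend (1)
toy_df = pd.DataFrame({
    "Hour": [8, 8, 8, 8, 18, 18],
    "weekdays_weekend": [0, 0, 1, 1, 0, 1],
    "Rented_Bike_Count": [100, 200, 40, 60, 300, 150],
})
hourly = mean_count_by_hour(toy_df)
print(hourly)
```

The same helper could be called with `group_col='Seasons'` or `group_col='Holiday'` to tabulate the later point plots.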

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(10,5))
sns.barplot(data=bike_df,x='Hour',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Hour')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count across the hours of the day.

##### 2. What is/are the insight(s) found from the chart?

The above plot shows the use of rented bikes by hour, using data from the whole year.

Generally, people use rented bikes most during commuting hours, from 7 am to 9 am and from 5 pm to 7 pm.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by hour provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts by hours provides valuable insights that can positively impact business operations, pricing, promotions, and resource allocation. However, negative growth factors may arise due to low-demand hours, competitive pressures, or external influences. Businesses must proactively address these challenges through targeted strategies and by leveraging insights to maximize growth opportunities and customer satisfaction.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df,x='Functioning_Day',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Functioning Day')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count between functioning and non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot shows the use of rented bikes on functioning and non-functioning days. It clearly shows that people do not use rented bikes on non-functioning days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by functioning day provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts based on functioning days provides insights that can positively impact resource allocation, marketing strategies, and revenue optimization. However, negative growth factors can arise due to seasonal variations, external influences, or competition dynamics. Adapting strategies to address these challenges is essential for maintaining positive growth and customer satisfaction.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Analysis of data by visualization (pointplot shows the mean count per hour for each group)
fig,ax=plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Functioning_Day',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and Functioning Day')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The above point plot shows the hourly use of rented bikes on functioning and non-functioning days. It clearly shows that people do not use rented bikes on non-functioning days.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df,x='Seasons',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Seasons')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count across the four seasons.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot shows the use of rented bikes in the four seasons. It clearly shows that use is highest in the summer season.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by season provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts based on seasons provides valuable insights that can positively impact resource allocation, pricing strategies, and service diversification. However, challenges such as weather conditions, maintenance, operational hurdles, and market competition can lead to negative growth. By proactively addressing these challenges, businesses can leverage the gained insights to adapt their strategies, optimize operations, and maximize growth opportunities throughout the different seasons.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Analysis of data by visualization (pointplot shows the mean count per hour for each group)
fig,ax=plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Seasons',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and Seasons')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The above point plot shows the hourly use of rented bikes in the four seasons. It clearly shows that:

In the summer season the use of rented bikes is high, and the peak times are 7 am to 9 am and 5 pm to 7 pm.

In the winter season the use of rented bikes is very low, most likely because of snowfall.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Analysis of data by visualization (barplot shows the mean count per category)
fig,ax=plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df,x='Holiday',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Holiday')

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

I used a bar chart to compare the average rented bike count between holidays and non-holidays.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot shows that rented bikes are used more on non-holidays than on holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the count of rented bikes by holiday provides insights that can create a positive business impact, although some findings also point to risks of negative growth.

Analyzing rental counts based on holidays provides valuable insights for businesses to optimize resource allocation, tailor marketing strategies, and enhance customer experiences. While these insights generally lead to positive business impact, challenges such as seasonal variations, competing events, and operational hurdles may result in negative growth. By adapting strategies, providing alternative services, and offering unique experiences, businesses can mitigate the negative impact and leverage the gained insights to drive positive growth during holidays.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Analysis of data by visualization (pointplot shows the mean count per hour for each group)
fig,ax=plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Holiday',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and Holiday')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The above point plot shows the hourly use of rented bikes on holidays and non-holidays. It clearly shows that on holidays people use rented bikes mostly from 2 pm to 8 pm.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Analyze the distribution of the numerical variables

# Assign the numerical columns to a variable
numerical_columns = list(bike_df.select_dtypes(['int64','float64']).columns)
numerical_features = pd.Index(numerical_columns)

# Let's see how data is distributed for every column
plt.figure(figsize=(12,10))
plotnumber = 1

for column in numerical_features:
    if plotnumber <= 9 :
        ax = plt.subplot(3,3,plotnumber)
        # histplot with kde=True replaces sns.distplot, which is deprecated in recent seaborn releases
        sns.histplot(x=bike_df[column], kde=True, stat='density')
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) shows how a single variable is distributed, so it suits univariate analysis. Here one is drawn for each numerical column of the dataset.

##### 2. What is/are the insight(s) found from the chart?

The above distribution plots show that most of the columns are skewed, some to the right and some to the left.
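
The skew visible in the plots can also be quantified with pandas' `skew()`, where positive values indicate a right skew and negative values a left skew. This is a minimal sketch: `skew_report` is a hypothetical helper and the toy columns are randomly generated stand-ins, not the bike data.

```python
import numpy as np
import pandas as pd


def skew_report(df: pd.DataFrame) -> pd.Series:
    """Sample skewness of every numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return numeric.skew().sort_values(ascending=False)


# Hypothetical toy data: one right-skewed column and one roughly symmetric column
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),
    "symmetric": rng.normal(size=1000),
})
report = skew_report(toy_df)
print(report)
```

In the notebook, `skew_report(bike_df)` would list which features are candidates for a transformation before modelling.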

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Numerical features vs. Rented_Bike_Count
# The target column is selected before mean() so that non-numeric columns are never averaged

# Plot the relationship between "Rented_Bike_Count" and "Temperature"
bike_df.groupby('Temperature')['Rented_Bike_Count'].mean().plot()

In [None]:
# Plot the relationship between "Rented_Bike_Count" and "Dew_point_temperature"
bike_df.groupby('Dew_point_temperature')['Rented_Bike_Count'].mean().plot()

In [None]:
# Plot the relationship between "Rented_Bike_Count" and "Solar_Radiation"
bike_df.groupby('Solar_Radiation')['Rented_Bike_Count'].mean().plot()

In [None]:
# Plot the relationship between "Rented_Bike_Count" and "Snowfall"
bike_df.groupby('Snowfall')['Rented_Bike_Count'].mean().plot()

In [None]:
# Plot the relationship between "Rented_Bike_Count" and "Rainfall"
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot()

In [None]:
# Plot the relationship between "Rented_Bike_Count" and "Wind_speed"
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

The pandas plot() method draws a line plot by default, connecting consecutive points. Applied to the mean rented bike count at each value of a numerical feature, it shows how demand changes as that feature changes.

##### 2. What is/are the insight(s) found from the chart?

From the above plots we see that:
* People like to ride bikes when it is fairly warm, around 25°C on average.
* The curve for 'Dew_point_temperature' is almost the same as the one for 'Temperature'. The two features look similar, which we can check in the next step.
* The number of rented bikes is high when there is solar radiation, with the rental count at around 1000.
* The number of rented bikes is much lower when there is more than 4 cm of snow.
* Even when it rains a lot, the demand for rented bikes does not simply decrease. For example, there is a large peak of rented bikes at 20 mm of rain.
* Demand for rented bikes is fairly uniform across wind speeds, but it rises when the wind speed is 7 m/s, which suggests that people still like to ride when it is a little windy.
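
Peaks at single values such as 20 mm of rain or 7 m/s of wind may rest on very few hours of data, because the plots average over each exact value. One way to check is to bin the feature and report the number of observations next to the mean. The sketch below assumes the notebook's column names; `binned_mean`, the bin edges, and the toy numbers are hypothetical.

```python
import pandas as pd


def binned_mean(df: pd.DataFrame, feature: str, bins: list,
                target: str = "Rented_Bike_Count") -> pd.DataFrame:
    """Mean target value and observation count within each bin of a feature."""
    binned = pd.cut(df[feature], bins=bins, include_lowest=True)
    # observed=True keeps only the bins that actually contain rows
    return df.groupby(binned, observed=True)[target].agg(["mean", "count"])


# Hypothetical toy data: a single very wet hour with an unusually high count
toy_df = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.5, 1.0, 6.0, 20.0],
    "Rented_Bike_Count": [500, 700, 300, 100, 50, 900],
})
summary = binned_mean(toy_df, "Rainfall", bins=[0, 1, 5, 10, 35])
print(summary)
```

A bin with a high mean but a count of one or two is weak evidence, so the `count` column helps decide how much weight a peak deserves.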


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, the gained insights can help businesses create a positive impact by capitalizing on customer preferences for hot weather, leveraging sunny days, and addressing concerns related to rainy weather. However, the presence of significant snowfall may lead to negative growth, necessitating the adoption of alternative strategies to sustain business during winter months.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Print the regression plot for all the numerical features

# Plot each numerical feature against the rented bike count
plotnumber = 1
plt.figure(figsize=(12,10))

for column in numerical_features:
    if plotnumber <= 9:
        ax = plt.subplot(3,3,plotnumber)
        sns.regplot(x=bike_df[column], y=bike_df['Rented_Bike_Count'], scatter_kws={"color": 'green'}, line_kws={"color": "black"})
        ax.set_xlabel(column, fontsize=12)
        ax.set_ylabel('Rented Bike Count', fontsize=12)
    plotnumber += 1

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Regplot is used to create a scatter plot with a linear regression line fitted to the data. It lets you visualize the relationship between two variables and assess the strength and direction of their linear correlation. The regplot function can be used to perform a simple linear regression analysis and visualize the resulting model.

##### 2. What is/are the insight(s) found from the chart?

From the above regression plots of all the numerical features we see that the columns 'Temperature', 'Wind_speed', 'Visibility', 'Dew_point_temperature', and 'Solar_Radiation' are positively related to the target variable.

This means the rented bike count increases as these features increase.

The features 'Rainfall', 'Snowfall', and 'Humidity' are negatively related to the target variable, which means the rented bike count decreases as these features increase.
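
The direction of each relationship can be confirmed numerically with the Pearson correlation between every numeric feature and the target. This is a sketch under assumptions: `target_correlations` is a hypothetical helper and the toy values are constructed to be perfectly linear, so they say nothing about the real data.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str = "Rented_Bike_Count") -> pd.Series:
    """Pearson correlation of each numeric feature with the target, most negative first."""
    numeric = df.select_dtypes(include="number")
    return numeric.drop(columns=[target]).corrwith(numeric[target]).sort_values()


# Hypothetical toy data with one rising and one falling feature
toy_df = pd.DataFrame({
    "Temperature": [0, 10, 20, 30],
    "Humidity": [90, 70, 50, 30],
    "Rented_Bike_Count": [100, 300, 500, 700],
})
corrs = target_correlations(toy_df)
print(corrs)
```

A positive coefficient matches an upward regression line in the plots and a negative one matches a downward line.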

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In conclusion, the gained insights regarding the relationship between the numerical features and the rented bike count can help businesses create a positive impact by capitalizing on favorable weather conditions. By aligning their services and marketing efforts with periods of higher temperatures, moderate wind speeds, good visibility, and ample solar radiation, businesses can attract more customers and increase rental counts. However, they should also be aware of the negative impact that unfavorable weather conditions, such as high rainfall, snowfall, or humidity, can have on bike rentals. By adapting their strategies and offering alternatives during such conditions, businesses can mitigate the potential negative growth and maintain a positive business impact.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Visualize the outliers using boxplot
plt.figure(figsize=(15,12))
graph = 1

for column in numerical_features:
    if graph <= 9:
        plt.subplot(3,3,graph)
        ax=sns.boxplot(x=bike_df[column])  # pass the column as x explicitly; the first positional argument is 'data'
        plt.xlabel(column,fontsize=10)
    graph+=1
plt.show()

##### 1. Why did you pick the specific chart?

Boxplot is used to create box and whisker plots. A boxplot is a visual representation of the distribution of a dataset, showing the median, quartiles, and any outliers.

##### 2. What is/are the insight(s) found from the chart?

The above boxplots show that outliers are present in most of the columns, for example wind speed and solar radiation.
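
The points drawn beyond the whiskers can be counted with the usual 1.5 × IQR rule. The sketch below is illustrative only: `iqr_outlier_counts` is a hypothetical helper, and the toy values are invented to contain a single extreme wind speed.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Number of values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # The comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data: one extreme wind speed, no extreme humidity
toy_df = pd.DataFrame({
    "Wind_speed": [1.0, 1.2, 1.1, 0.9, 1.3, 7.4],
    "Humidity": [40, 45, 50, 55, 60, 65],
})
counts = iqr_outlier_counts(toy_df)
print(counts)
```

Whether such points should be capped, transformed, or kept is a modelling decision; the count only shows how many rows are affected.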

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
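# Plot the correlation matrix of the numeric columns
plt.figure(figsize=(20,8))
# Select numeric columns explicitly; recent pandas versions raise an error if corr() meets text columns
correlation=bike_df.select_dtypes(include='number').corr()
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation, mask=mask, annot=True, cmap='YlGnBu')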

##### 1. Why did you pick the specific chart?

The correlation heatmap chart is a great way to visualize correlations between multiple variables. It provides a clear and concise view of the relationships between the variables, which allows for easy and quick analysis. Additionally, the color coding used in the heatmap helps to quickly and easily identify correlations that may otherwise not be as apparent.

##### 2. What is/are the insight(s) found from the chart?

On the target variable row of the heatmap, the variables most positively correlated with the rented bike count are:

* Temperature
* Dew point temperature
* Solar radiation

The most negatively correlated variables are:

* Humidity
* Rainfall

The heatmap also shows a strong positive correlation of 0.91 between 'Temperature' and 'Dew_point_temperature'. Because the two columns vary together and carry nearly the same information, dropping one of them should not affect the outcome of the analysis, so we can drop 'Dew_point_temperature'.
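
Pairs of strongly correlated features can be listed programmatically rather than read off the heatmap. This is a minimal sketch: `correlated_pairs`, the 0.9 threshold, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold]


# Hypothetical toy data in which dew point closely tracks temperature
toy_df = pd.DataFrame({
    "Temperature": [0.0, 10.0, 20.0, 30.0, 25.0],
    "Dew_point_temperature": [-5.0, 4.0, 15.0, 24.0, 20.0],
    "Wind_speed": [2.0, 1.0, 3.0, 1.5, 2.5],
})
pairs = correlated_pairs(toy_df)
print(pairs)
```

Each pair returned is a candidate for dropping one of its two columns, as done here with the dew point temperature.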

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
# hue is omitted: Rented_Bike_Count is continuous, so it would create one color group per distinct count
sns.pairplot(bike_df)

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between features, which helps identify the features that best explain a relationship between two variables or that form the most separated clusters. It can also suggest simple models, because linear trends and linear separations are visible directly in the scatter plots.

I used the pair plot to analyse the patterns in the data and the relationships between the features. It conveys information similar to the correlation heatmap, but as scatter plots rather than as coefficients.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, I learned that there are few linear relationships between the variables and that the data points are not linearly separable.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Missing Values/Null Values Count
print(bike_df.isnull().sum())

# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(bike_df.isnull(), cbar=False)

#### What all missing value imputation techniques have you used and why did you use those techniques?

**There are no missing values to handle in the given dataset.**

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Remove outliers using zscore.
from scipy.stats import zscore

z_score = zscore(bike_df[numerical_features])
abs_z_score = np.abs(z_score)   # Apply the formula and get the scaled data

filtering_entry = (abs_z_score < 3).all(axis=1)

bike_df = bike_df[filtering_entry]

##### What all outlier treatment techniques have you used and why did you use those techniques?

* I have used z_score technique to treat outliers.
* The z-score technique is used to treat outliers because it provides a standardized way to identify and handle data points that deviate significantly from the mean of a distribution.
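The filter above keeps a row only when every numerical column lies within three standard deviations of its mean. A minimal NumPy sketch of the same rule on synthetic data follows; the array, the planted extreme row, and the names `values`, `keep`, and `filtered` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
values[0] = [12.0, 0.0]   # plant one extreme row

# Population z-score per column (ddof=0, the same default scipy.stats.zscore uses)
z = np.abs((values - values.mean(axis=0)) / values.std(axis=0))

# A row survives only if all of its columns are within 3 standard deviations
keep = (z < 3).all(axis=1)
filtered = values[keep]
print(values.shape[0] - filtered.shape[0], "rows removed")
```

Because the test is applied with `all(axis=1)`, a single extreme value in any column removes the whole row, so it is worth checking how many rows are lost before training.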

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# Assign all categorical features to a variable
categorical_features=list(bike_df.select_dtypes(['object','category']).columns)
categorical_features=pd.Index(categorical_features)
categorical_features

# Create a real copy so the encoding never touches bike_df
bike_df_copy = bike_df.copy()

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in categorical_features:
    bike_df_copy = one_hot_encoding(bike_df_copy, col)
bike_df_copy.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

* I have used One_hot_encoding technique for categorical data conversion.
* One hot encoding is used to represent categorical variables numerically in a format that is suitable for machine learning algorithms. It is a popular technique for handling categorical data because many machine learning algorithms are designed to work with numerical data rather than categorical data.
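As a small illustration of what one-hot encoding with `drop_first=True` produces, the sketch below encodes a toy frame. The column name `Seasons` and the values are assumptions made up for the example.

```python
import pandas as pd

# Toy frame with one categorical column (made-up values)
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
    "Rented_Bike_Count": [120, 340, 560, 410, 95],
})

# drop_first=True removes the first category (alphabetically "Autumn"),
# which then acts as the baseline level
encoded = pd.get_dummies(toy, columns=["Seasons"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids a perfectly collinear set of dummy columns, which matters for the linear models fitted later.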

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

**There are no text columns in the given dataset which I am working on. So, Skipping this part.**

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation and Selection

In [None]:
# Select your features wisely to avoid overfitting

# Split data into X and y
y = bike_df_copy['Rented_Bike_Count']
X = bike_df_copy.drop(columns='Rented_Bike_Count')

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled.shape[1]

In [None]:
# Find the variance inflation factor, VIF = 1/(1 - R2), for each of the X_scaled.shape[1] scaled columns
vif = pd.DataFrame()
vif["vif_score"] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
vif["Features"] = X.columns

#let's check the values
vif

In [None]:
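# Drop the Dew point temperature column, then rebuild X and X_scaled so the drop reaches the models
bike_df_copy = bike_df_copy.drop(['Dew_point_temperature'],axis=1)
X = bike_df_copy.drop(columns='Rented_Bike_Count')
X_scaled = scaler.fit_transform(X)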

##### What all feature selection methods have you used  and why?

* I have used the correlation heatmap (corr) and the VIF method for feature selection.
* A correlation heatmap (corr) is used to visualize the correlation between different variables in a dataset. It is a graphical representation of the correlation matrix, where each cell represents the correlation coefficient between two variables. The correlation coefficient indicates the strength and direction of the linear relationship between two variables.
* VIF (Variance Inflation Factor) is used to measure multicollinearity in regression analysis. Multicollinearity occurs when there is a high correlation between two or more predictor variables in a regression model, which can lead to issues in the interpretation of the model and unstable coefficient estimates.
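The formula VIF = 1 / (1 − R²) can be worked through by hand. The sketch below computes it with NumPy on synthetic columns rather than with statsmodels; the helper name `vif_scores` and the data are assumptions for illustration, and the helper adds an intercept explicitly to each auxiliary regression.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.2, size=300)   # nearly a copy of a
c = rng.normal(size=300)                  # unrelated to a and b
scores = vif_scores(np.column_stack([a, b, c]))
print(scores.round(2))
```

The two near-duplicate columns get large scores while the unrelated column stays close to 1, which is the pattern that motivates dropping one of a collinear pair.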

##### Which all features you found important and why?

On the target variable row of the heatmap, the variables most positively correlated with the rented bike count are:

* Temperature
* Dew point temperature
* Solar radiation

The most negatively correlated variables are:

* Humidity
* Rainfall

The heatmap also shows a strong positive correlation of 0.91 between 'Temperature' and 'Dew point temperature'. The two columns vary together and carry nearly the same information, so dropping 'Dew point temperature(°C)' should have little effect on the outcome of the analysis.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

# The only transformation applied was renaming the columns. The dataset has no missing values, so no further transformation was needed.

### 6. Data Scaling

In [None]:
# Scaling your data

# Data scaling was already done above using the standard scaler technique.

# Feature scaling
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)

##### Which method have you used to scale your data and why?

* I have used standard scaler technique for data scaling.
* Standard scaling technique is used to promote better analysis, enhance model performance, and ensure consistent and meaningful comparisons among variables in various statistical and machine learning tasks.
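One common variant of standard scaling, shown here as a hedged sketch rather than as the notebook's own procedure, is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set leaks into training. The synthetic arrays and the `_demo`, `X_tr`, and `X_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

scaler_demo = StandardScaler().fit(X_tr)      # statistics come from training rows only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)     # test rows reuse the training mean and std
print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.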

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, dimensionality reduction is not needed here because the dataset has only a small number of columns.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

No dimensionality reduction technique was used.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# Split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 42)

# Print the shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

##### What data splitting ratio have you used and why?

* I have split the data into train and test sets in a 70:30 ratio.
* A 70:30 split is a common practice in machine learning, although it is not a hard rule and can vary with the problem and the dataset size. Here, 70% of the rows are used to train the models and the remaining 30% are held out to test them.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Class imbalance is generally checked in classification problems. This is a regression analysis with a continuous target, so dataset imbalance does not apply here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

There is no imbalance in the dataset to handle.

## ***7. ML Model Implementation***

### ML Model - Linear Regression

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm
reg= LinearRegression().fit(X_train, y_train)

# Check the score
reg.score(X_train, y_train)

In [None]:
# Check the coefficients
reg.coef_

In [None]:
# Get the predictions for the train and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

In [None]:
print("\n================Train Result==========================")

# Calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

# Calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

# Calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)

# Calculate r2 and adjusted r2
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
# Adjusted R2 for the train set must use the train set's row and feature counts
Adjusted_R2_lr = (1-(1-r2_score(y_train, y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_lr)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'Linear Regression',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
print("\n=================Test Result==========================")

#calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

#calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)

#calculate r2 and adjusted r2
r2_lr= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_lr )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear Regression',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(training_df)
print(test_df)

#### 2. Cross-Validation & Hyperparameter Tuning

### Lasso Regression

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Create an instance of Lasso Regression implementation
lasso = Lasso(alpha=1.0, max_iter=3000)

# Fit the Lasso model
lasso.fit(X_train, y_train)

# Create the model score
print(lasso.score(X_train, y_train), lasso.score(X_test, y_test))

In [None]:
# Get the predictions for the train and test sets
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

In [None]:
print("\n================Train Result==========================")

#calculate MSE
MSE_l= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)

#calculate MAE
MAE_l= mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE :",MAE_l)

#calculate r2 and adjusted r2
r2_l= r2_score(y_train, y_pred_train_lasso)
print("R2 :",r2_l)
Adjusted_R2_l = (1-(1-r2_score(y_train, y_pred_train_lasso))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_l)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'Lasso Regression',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2)
       }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n=================Test Result==========================")

#calculate MSE
MSE_l= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)

#calculate MAE
MAE_l= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE :",MAE_l)

#calculate r2 and adjusted r2
r2_l= r2_score((y_test), (y_pred_test_lasso))
print("R2 :",r2_l)
Adjusted_R2_l=(1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Lasso Regression',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2),
       }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

### Ridge Regression

In [None]:
# Create an instance of Ridge Regression implementation
ridge= Ridge(alpha=0.1)

# Fit the model
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train, y_train)

In [None]:
# Get the predictions for the train and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
print("\n=================Train Result==========================")

#calculate MSE
MSE_r= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)

#calculate MAE
MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r)

#calculate r2 and adjusted r2
r2_r= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score(y_train, y_pred_train_ridge))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_r)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'Ridge Regression',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n=================Test Result==========================")

#calculate MSE
MSE_r= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)

#calculate MAE
MAE_r= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r)

#calculate r2 and adjusted r2
r2_r= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Ridge Regression',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

##### Which hyperparameter optimization technique have you used and why?

No automated hyperparameter search was applied to the linear models. Instead, I fitted Lasso (L1 regularization) with alpha=1.0 and Ridge (L2 regularization) with alpha=0.1 to address the problems of overfitting and high variance. Both work by adding a penalty term to the loss function, which encourages the model to have smaller coefficient values.
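The difference between the two penalties can be seen on a small synthetic problem. This is a sketch under stated assumptions: the data, the alpha value of 0.5, and the `_demo` names are made up for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two columns drive the target; the other three are noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso:", lasso_demo.coef_.round(3))
print("Ridge:", ridge_demo.coef_.round(3))
```

The L1 penalty sets the coefficients of the noise columns exactly to zero, while the L2 penalty only shrinks them, so Lasso doubles as a rough feature selector.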

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(training_df,"\n\n")
print(test_df)

### ML Model - Decision Tree

In [None]:
# ML Model - 2 Implementation

# Fit the Algorithm
decision_regressor = DecisionTreeRegressor()
decision_regressor.fit(X_train, y_train)

In [None]:
# Get the predictions for the train and test sets
y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d = decision_regressor.predict(X_test)

In [None]:
print("\n=================Train Result==========================")

# Model score (R2) on the train set
print("Model Score:",decision_regressor.score(X_train,y_train))

#calculate MSE
MSE_d= mean_squared_error(y_train, y_pred_train_d)
print("MSE :",MSE_d)

#calculate RMSE
RMSE_d=np.sqrt(MSE_d)
print("RMSE :",RMSE_d)

#calculate MAE
MAE_d= mean_absolute_error(y_train, y_pred_train_d)
print("MAE :",MAE_d)

#calculate r2 and adjusted r2
r2_d= r2_score(y_train, y_pred_train_d)
print("R2 :",r2_d)
Adjusted_R2_d=(1-(1-r2_score(y_train, y_pred_train_d))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_d)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'DecisionTree Regressor',
       'MAE':round((MAE_d),3),
       'MSE':round((MSE_d),3),
       'RMSE':round((RMSE_d),3),
       'R2_score':round((r2_d),3),
       'Adjusted R2':round((Adjusted_R2_d),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n=================Test Result==========================")

#calculate MSE
MSE_d= mean_squared_error(y_test, y_pred_test_d)
print("MSE :",MSE_d)

#calculate RMSE
RMSE_d=np.sqrt(MSE_d)
print("RMSE :",RMSE_d)

#calculate MAE
MAE_d= mean_absolute_error(y_test, y_pred_test_d)
print("MAE :",MAE_d)

#calculate r2 and adjusted r2
r2_d= r2_score((y_test), (y_pred_test_d))
print("R2 :",r2_d)
Adjusted_R2_d=(1-(1-r2_score((y_test), (y_pred_test_d)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_d)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'DecisionTree Regressor',
       'MAE':round((MAE_d),3),
       'MSE':round((MSE_d),3),
       'RMSE':round((RMSE_d),3),
       'R2_score':round((r2_d),3),
       'Adjusted R2':round((Adjusted_R2_d),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(training_df)
print(test_df)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# We are tuning four important hyperparameters and passing several candidate values for each

grid_param = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],   # 'mse' and 'mae' were renamed in scikit-learn 1.0 and removed in 1.2
    'max_depth' : [10, 11, 12, 13, 14],
    'min_samples_leaf' : range(2, 8),
    'min_samples_split': range(3, 8)
}

dt_grid = GridSearchCV(estimator=DecisionTreeRegressor(),
                           param_grid=grid_param,
                           cv=5,
                           n_jobs=-1)

# Fit the Algorithm
dt_grid.fit(X_train, y_train)

In [None]:
dt_grid.best_estimator_

In [None]:
dt_optimal_model = dt_grid.best_estimator_

In [None]:
dt_grid.best_params_

In [None]:
y_pred_train_d_g = dt_optimal_model.predict(X_train)
y_pred_test_d_g= dt_optimal_model.predict(X_test)

In [None]:
print("\n=================Train Result==========================")

print("Model Score:",dt_optimal_model.score(X_train,y_train))
MSE_gbh= mean_squared_error(y_train, y_pred_train_d_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)

MAE_gbh= mean_absolute_error(y_train, y_pred_train_d_g)
print("MAE :",MAE_gbh)

r2_gbh= r2_score(y_train, y_pred_train_d_g)
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_train, y_pred_train_d_g))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_gbh)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'DecisionTree Regressor gridsearchcv',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n=================Test Result==========================")

MSE_gbh= mean_squared_error(y_test, y_pred_test_d_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)

MAE_gbh= mean_absolute_error(y_test, y_pred_test_d_g)
print("MAE :",MAE_gbh)

from sklearn.metrics import r2_score
r2_gbh= r2_score((y_test), (y_pred_test_d_g))
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_test, y_pred_test_d_g))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_d_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'DecisionTree Regressor gridsearchcv',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

##### Which hyperparameter optimization technique have you used and why?

* I have used the GridSearchCV hyperparameter optimization technique to improve the performance of the DecisionTree model.

* GridSearchCV is a hyperparameter tuning technique that evaluates every combination in a user-supplied grid of hyperparameter values with cross-validation and keeps the best-scoring combination.
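As a compact illustration of how a grid search enumerates combinations, the sketch below tunes a decision tree on synthetic data. The grid values, the scoring choice, and the names `param_grid_demo` and `search` are assumptions for the example and deliberately smaller than the grid used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# 3 depths x 2 leaf sizes = 6 candidate models, each scored with 5-fold CV
param_grid_demo = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid_demo,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_["params"]), "combinations tried")
```

The cost grows multiplicatively with each added hyperparameter, so the size of the grid is usually the main thing to keep in check.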

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(training_df)
print(test_df)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

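Evaluation metrics show how well a model performs and what its errors mean for the business problem at hand. The choice of metric depends on the task: accuracy, precision, recall, and F1-score apply to classification, so they are not used here. For this regression task, the score charts report:

* Mean Absolute Error (MAE): the average number of bikes by which an hourly prediction is off.
* Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): error measures that penalize large misses more heavily; RMSE is in the same unit as the bike count.
* R-squared (R²) and adjusted R²: the share of the variance in bike demand that the model explains, with the adjusted version penalizing additional features.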

### ML Model - Random Forest

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm
rf_model = RandomForestRegressor()
rf_model.fit(X_train,y_train)

In [None]:
# Get the predictions for the train and test sets from the random forest model
y_pred_train_r = rf_model.predict(X_train)
y_pred_test_r = rf_model.predict(X_test)

In [None]:
print("\n================Train Result==========================")

print("Model Score:",rf_model.score(X_train,y_train))

#calculate MSE
MSE_rf= mean_squared_error(y_train, y_pred_train_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)

#calculate MAE
MAE_rf= mean_absolute_error(y_train, y_pred_train_r)
print("MAE :",MAE_rf)

#calculate r2 and adjusted r2
r2_rf= r2_score(y_train, y_pred_train_r)
print("R2 :",r2_rf)
Adjusted_R2_rf=(1-(1-r2_score(y_train, y_pred_train_r))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_rf)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'Random forest regression',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n================Test Result==========================")

#calculate MSE
MSE_rf= mean_squared_error(y_test, y_pred_test_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)

#calculate MAE
MAE_rf= mean_absolute_error(y_test, y_pred_test_r)
print("MAE :",MAE_rf)

#calculate r2 and adjusted r2
r2_rf= r2_score((y_test), (y_pred_test_r))
print("R2 :",r2_rf)
Adjusted_R2_rf=(1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Random forest regression',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(training_df)
print(test_df)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# We are tuning four important hyperparameters and passing several candidate values for each

grid_param = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}

rf_grid = GridSearchCV(estimator=RandomForestRegressor(),
                           param_grid=grid_param,
                           cv=5,
                           n_jobs=-1)

# Fit the Algorithm
rf_grid.fit(X_train, y_train)

In [None]:
rf_grid.best_estimator_

In [None]:
rf_optimal_model = rf_grid.best_estimator_

In [None]:
rf_grid.best_params_

In [None]:
y_pred_train_r_g = rf_optimal_model.predict(X_train)
y_pred_test_r_g= rf_optimal_model.predict(X_test)

In [None]:
print("\n=================Train Result==========================")

print("Model Score:",rf_optimal_model.score(X_train,y_train))
MSE_gbh= mean_squared_error(y_train, y_pred_train_r_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)

MAE_gbh= mean_absolute_error(y_train, y_pred_train_r_g)
print("MAE :",MAE_gbh)

r2_gbh= r2_score(y_train, y_pred_train_r_g)
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_train, y_pred_train_r_g))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_gbh)

# storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'RandomForest Regressor gridsearchcv',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
print("\n=================Test Result==========================")

MSE_gbh= mean_squared_error(y_test, y_pred_test_r_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)

MAE_gbh= mean_absolute_error(y_test, y_pred_test_r_g)
print("MAE :",MAE_gbh)

from sklearn.metrics import r2_score
r2_gbh= r2_score((y_test), (y_pred_test_r_g))
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_test, y_pred_test_r_g))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_r_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'RandomForest Regressor gridsearchcv',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

##### Which hyperparameter optimization technique have you used and why?

I have used the GridSearchCV hyperparameter optimization technique to improve the performance of the RandomForestRegressor model.

GridSearchCV evaluates every combination in a user-supplied grid of hyperparameter values with cross-validation and keeps the best-scoring combination.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(training_df)
print(test_df)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Because this is a regression problem, classification metrics such as accuracy, precision, recall, and F1-score do not apply. I considered the following metrics, all of which are reported in the score charts:

* Mean Absolute Error (MAE), which states the typical prediction error directly in bikes per hour.
* Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), which weigh large errors more heavily; large misses are the ones most likely to leave a station short of bikes.
* R-squared (R²) and adjusted R², which show how much of the variation in demand the model explains.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

* I chose the RandomForestRegressor model for the final prediction because its training R2 score is 1.0 and its testing R2 score is 0.82.

* RandomForestRegressor is a popular machine learning model for regression tasks. General reasons to choose it include:

  * Non-linearity
  * Robustness to outliers
  * Handling high-dimensional data
  * Dealing with missing data
  * Feature importance
  * Ensemble learning and generalization
  * Model interpretability

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Impurity-based importance: this measure, also known as Gini importance or mean decrease in impurity, is the total reduction in impurity achieved by splitting on a particular feature across all the trees in the Random Forest. For a regressor the impurity is the variance (squared error) of the target rather than the Gini index used in classification. Features that produce a larger impurity reduction when used for splitting are considered more important.

The feature importance values are normalized so that they represent the relative importance of each feature. Higher values indicate greater importance, while lower values indicate lesser importance.

Model explainability tools such as SHAP (SHapley Additive exPlanations), ELI5 (Explain Like I'm 5), or scikit-learn's `permutation_importance` function (in `sklearn.inspection`) can provide more detailed feature importance insights and visualizations for Random Forest models. These tools can help generate feature importance plots, partial dependence plots, or SHAP value explanations.

Using these tools, you can explore the impact of individual features on the model's predictions, identify the most influential features, and gain a better understanding of the relationship between features and the target variable.
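As a rough sketch of how these importances can be obtained with scikit-learn alone, the example below compares the impurity-based `feature_importances_` attribute with `permutation_importance`. The data is synthetic and the feature names are hypothetical placeholders, not the columns of the bike sharing dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 2 of the 5 features carry signal (assumption for the demo)
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance: computed from the training data, sums to 1
mdi = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importance: drop in test score when one feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_mean = pd.Series(perm.importances_mean, index=feature_names)

importance = pd.DataFrame({"impurity_based": mdi, "permutation": perm_mean})
print(importance.sort_values("permutation", ascending=False))
```

Permutation importance is measured on held-out data here, which makes it a useful cross-check on the impurity-based values derived from the training set.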

Keep in mind that the specific implementation and results may vary depending on the tool or library used, as well as the version and configuration. It's recommended to consult the documentation and examples of the chosen tool for more specific guidance on obtaining feature importance for a RandomForestRegressor model.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
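# Save the trained model to a pickle file
import pickle
filename = 'finalized_bsdp_model.pickle'
# The with-block closes the file handle once the model has been written
with open(filename, 'wb') as file:
    pickle.dump(rf_model, file)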

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
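# Load the saved model file and predict on one sample as a sanity check.
with open('finalized_bsdp_model.pickle', 'rb') as file:
    model = pickle.load(file)

unseen_data = X_scaled[5]
# Predict with the reloaded model (not the in-memory rf_model) so the saved file is what gets tested
predictions = model.predict([unseen_data])
print(predictions)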
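The same save-and-reload check can be done with joblib, the other format named in the heading above. The sketch below is an illustration under assumptions: it trains a small model on synthetic data, writes it to a temporary directory under a hypothetical file name, and confirms that the reloaded model reproduces the original predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the notebook's trained model (assumption)
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
fitted_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.joblib")  # hypothetical file name
    joblib.dump(fitted_model, path)   # serialize the fitted estimator
    loaded_model = joblib.load(path)  # read it back as a new object

# Sanity check: the reloaded model should give the same predictions
sample = X[:5]
same_predictions = np.allclose(
    fitted_model.predict(sample), loaded_model.predict(sample)
)
print("Predictions match after reload:", same_predictions)
```

Comparing predictions from the in-memory and reloaded models is a quick way to confirm that serialization preserved the fitted state.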

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

* I first performed EDA on all the features of the dataset.
* I analysed the dependent variable, 'Rented Bike Count', and transformed it.
* Next, I analysed the categorical variables and dropped those dominated by a single class. I also analysed the numerical variables: their correlations, their distributions, and their relationships with the dependent variable.
* I removed numerical features that consisted mostly of 0 values and one-hot encoded the categorical variables.
* I then implemented 5 machine learning algorithms: Linear Regression, Lasso, Ridge, Decision Tree, and Random Forest. I also did hyperparameter tuning to improve model performance.
* The Random Forest Regressor gives the highest R2 score: 1.0 on the train set and 0.82 on the test set.
* The gap between the train and test scores suggests some overfitting, so the test score is the more reliable estimate of performance.
* This model can be deployed.
* However, this is not the end. Because the data is time dependent, values for variables such as temperature, wind speed, and solar radiation will not always be consistent, so there will be scenarios where the model does not perform well. The model should therefore be re-checked from time to time, and keeping pace with the fast-evolving ML field will help in staying a step ahead.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***