# **Project Name**    -  BIKE SHARING DEMAND PREDICTION



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

<div style="font-family: 'Rosemary';color: skyblue;">
In modern urban cities, the introduction of rental bikes has significantly enhanced mobility and provided a convenient mode of transportation. To maximize the benefits of this service, it is essential to ensure that rental bikes are available and accessible to the public at the right time, thereby minimizing waiting times and improving user satisfaction.


A major challenge in this endeavor is maintaining a stable and sufficient supply of rental bikes across the city. This requires accurately predicting the number of bikes needed at each hour to meet demand. Effective prediction helps in ensuring that bikes are evenly distributed, reducing shortages and surpluses at various locations.

The key to achieving this lies in leveraging predictive analytics. By analyzing historical data and identifying patterns in bike usage, we can forecast the hourly demand for rental bikes. Factors such as weather conditions, time of day, day of the week, public events, and seasonal variations can be incorporated into the predictive model to enhance its accuracy.
</div>

# **GitHub Link -**

Github link :- https://github.com/251aditya/Bike_Sharing_Demand_Prediction

# **Problem Statement**


Rental bikes have become a cornerstone of enhanced mobility and convenience. Ensuring a stable supply of rental bikes at the right time and place is crucial to minimize waiting times and maximize user satisfaction. However, striking the right balance in bike supply is challenging. Excess bikes lead to wasted resources, including maintenance costs and parking space, while insufficient bikes result in revenue loss and potential long-term customer dissatisfaction.

To address this, our project aims to investigate key variables that influence the hourly demand for rental bikes and develop a predictive model to estimate the number of bikes required each hour. Our goals are to:

*   **Maximize** the availability of bikes to customers.
*   **Minimize** the waiting time for rental bikes.

Target Column: The number of bikes rented per hour.

Input Columns (13 variables):

    1.Date
    2.Hour
    3.Temperature (°C)
    4.Humidity (%)
    5.Wind speed (m/s)
    6.Visibility (10m)
    7.Dew point temperature (°C)
    8.Solar Radiation (MJ/m²)
    9.Rainfall (mm)
    10.Snowfall (cm)
    11.Seasons
    12.Holiday
    13.Functioning Day







**PROJECT STEPS : -**

Data Preprocessing:
    
  * Standardize and format the dataset to ensure consistency.

Data Cleaning:

  * Handle missing values and correct any inaccuracies.

Duplicate Removal:

  * Remove duplicate entries to ensure data integrity.

Handling Outliers:

  * Identify and treat outliers to prevent skewed analysis.

Feature Transformation:

  * Transform features to enhance their predictive power.
  
Exploratory Data Analysis (EDA):

  * **Univariate Analysis**: Examine each variable individually to understand its distribution and identify patterns.

  * **Bivariate Analysis:**  Investigate relationships between pairs of variables.
         
  * **Multivariate Analysis**:  Explore interactions among multiple variables to uncover complex relationships.

Encoding of Categorical Columns:

  * Convert categorical variables into numerical format using techniques like one-hot encoding or label encoding.

Modeling with Algorithms:

  * Linear Regression
  * Ridge Regression
  * Lasso Regression
  * Decision Tree
  * Random Forest

By following these steps, we aim to build a robust predictive model that accurately forecasts the hourly demand for rental bikes. This will enable efficient allocation of bikes, ensuring optimal availability and customer satisfaction, while minimizing operational costs and resource wastage.
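The encoding and modeling steps listed above can be sketched end to end on a small synthetic table. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the `toy` data, its column effects, and the hyperparameters (`alpha=1.0`, `n_estimators=50`) are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, n),
    "Temperature": rng.uniform(-10, 35, n),
    "Seasons": rng.choice(["Winter", "Spring", "Summer", "Autumn"], n),
})
season_effect = toy["Seasons"].map(
    {"Winter": 0, "Spring": 100, "Summer": 300, "Autumn": 200}
)
toy["Rented Bike Count"] = (
    50 + 20 * toy["Temperature"] + season_effect + rng.normal(0, 50, n)
)

X = toy.drop(columns="Rented Bike Count")
y = toy["Rented Bike Count"]

# One-hot encode the categorical column; numeric columns pass through unchanged.
encode = ColumnTransformer(
    [("season", OneHotEncoder(handle_unknown="ignore"), ["Seasons"])],
    remainder="passthrough",
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model.
cv_r2 = {
    name: cross_val_score(
        make_pipeline(encode, model), X, y, cv=5, scoring="r2"
    ).mean()
    for name, model in models.items()
}
for name, score in cv_r2.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Wrapping the encoder and the model in one pipeline means the encoder is refit inside each cross-validation fold, so no information from the held-out fold leaks into preprocessing.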

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Install shap first so that the import further down succeeds on a fresh runtime
!pip install shap

# Core data handling and numerics
import numpy as np
import pandas as pd
import datetime as dt
from datetime import datetime

# Statistics
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import *
from scipy.stats import ttest_1samp
from scipy import stats  # makes stats.<function> calls such as stats.pearsonr resolvable
import math  # standard-library math (numpy no longer exposes numpy.math)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
%matplotlib inline
sns.set_style('darkgrid')

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, OneHotEncoder,
                                   LabelEncoder, MultiLabelBinarizer)

# Model building
from sklearn import svm, datasets
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Model selection and tuning
from sklearn.model_selection import (train_test_split, cross_validate, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV,
                                     RepeatedStratifiedKFold)

# Model evaluation
from sklearn import metrics
from sklearn.metrics import (r2_score, mean_squared_error, mean_absolute_error,
                             accuracy_score, confusion_matrix, log_loss)

# Model explainability
import shap

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
data = pd.read_csv('/content/drive/MyDrive/CSV_File/SeoulBikeData.csv', encoding='unicode_escape', parse_dates=['Date'], dayfirst=True)  # dates are stored day-first

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Rows :',data.shape[0])
print('Columns :',data.shape[1])

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"There are [{data.duplicated().sum()}] duplicate rows.")

#### Missing Values/Null Values


In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

### What did you know about your dataset?

The dataset records hourly rental bike demand in Seoul, South Korea, across 2017 and 2018, together with weather and calendar variables. We use it to train several machine learning models and to predict the number of rental bikes needed each hour, so that the bike-sharing system keeps working consistently.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include = 'all').T

### Variables Description

**Date** - Date in day-month-year order

**Rented Bike Count** - Count of bikes rented at each hour

**Hour** - Hour of the day

**Temperature** - Temperature in Celsius

**Humidity** - Humidity in percentage (%)

**Wind speed** - Wind speed in m/s

**Visibility** - Visibility in units of 10 m

**Dew point temperature** - Dew point temperature in Celsius

**Solar radiation** - Solar radiation in MJ/m²

**Rainfall** - Rainfall in mm

**Snowfall** - Snowfall in cm

**Seasons** - [Winter, Spring, Summer, Autumn]


**Holiday** - Whether the day is a holiday [Holiday/No holiday]

**Functioning Day** - Whether the rental service was operating on that day [No (non-functioning day), Yes (functioning day)]

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
column = data.columns.tolist()
for col in column:
  print(f'{col} : {data[col].unique()}')
  print('------------------------------------------------------------------------------------------------------------------------------')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df = data.copy()
print(df['Date'].dtype)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are stored day-first
print(df['Date'].dtype)
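The Variables Description lists `Date` in date-month-year order, so a string such as `01/12/2017` is ambiguous unless the parser is told which part comes first. The sketch below uses three made-up date strings (an assumption, not rows read from the file) to show how the same text lands in different months depending on the format.

```python
import pandas as pd

# Made-up day-first date strings (assumption, for illustration only).
raw = pd.Series(["01/12/2017", "02/12/2017", "13/12/2017"])

# Explicit format: day first, then month, then year.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed.dt.month.tolist())   # every entry falls in December

# The first string read month-first lands on 12 January instead.
month_first = pd.to_datetime("01/12/2017", format="%m/%d/%Y")
print(month_first.month, month_first.day)
```

A wrong month silently distorts every feature derived from the date, such as `Month` and `Week_day`, so it is worth checking the parsed range with `df['Date'].min()` and `df['Date'].max()` after loading.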

In [None]:
# Split the Date column into weekday name, month-year, year, day of month, and month, since bike demand likely depends on each of these.
df['Week_day']=df['Date'].dt.strftime('%A')
df['month_year']=df['Date'].dt.strftime('%m-%Y')
df['Year']=df['Date'].dt.year
df['date']=df['Date'].dt.day
df['Month']=df['Date'].dt.month

In [None]:
def w(_day):
  if _day in ['Saturday','Sunday']:
    return 'weekend'
  else:
    return 'weekday'
df['Day']=df['Week_day'].apply(w)

In [None]:
def time(num):
  if num in range (6,12):
    return 'Morning'
  elif num in range (12,17):
    return 'Afternoon'
  elif num in range (17,21):
    return 'Evening'
  else:
    return 'Night'

df['Shift Time']=df['Hour'].apply(time)

In [None]:
df.head()

### What all manipulations have you done and insights you found?

  * Extracted the weekday name, month-year, year, day of month, and month from the `Date` column.

  * Created a `Day` column that labels each record as weekday or weekend, to check how bike demand differs between the two.

  * Created a `Shift Time` column that groups hours into Morning (6-11), Afternoon (12-16), Evening (17-20), and Night (all other hours), to study demand by time of day.
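The same two derived columns can also be built without `apply`, which tends to be faster on larger tables. The following is a minimal sketch on a four-row `toy` frame (an assumption for illustration); it reuses the hour boundaries of the `time()` helper above.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (assumption); 2017-12-01 is a Friday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-01", "2017-12-02", "2017-12-03", "2017-12-04"]),
    "Hour": [5, 6, 16, 21],
})

# dayofweek runs Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
toy["Day"] = np.where(toy["Date"].dt.dayofweek >= 5, "weekend", "weekday")

# Same boundaries as time(): 6-11 Morning, 12-16 Afternoon, 17-20 Evening, else Night.
conditions = [
    toy["Hour"].between(6, 11),
    toy["Hour"].between(12, 16),
    toy["Hour"].between(17, 20),
]
toy["Shift Time"] = np.select(
    conditions, ["Morning", "Afternoon", "Evening"], default="Night"
)
print(toy)
```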

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1
### ***Share of hourly records by day of the week***

In [None]:
# Chart - 1 visualization code
# value_counts() counts rows (hourly records) per weekday, not rented bikes
df['Week_day'].value_counts().plot(kind = 'pie',legend = True, figsize = (12,12),explode = [0.10,0.02,0.02,0.02,0.02,0.02,0.02], autopct = '%1.2f%%', shadow = True)
plt.title('Share of hourly records by day of the week')
plt.show()

##### 1. Why did you pick the specific chart?



*   We selected **pie charts** for display of this information because their major function is to show the composition of a whole dataset, where each segment represents a different category or subcategory of the data.

*   They can be useful for quickly and easily identifying which categories are most prominent or for comparing the relative sizes of different categories.



##### 2. What is/are the insight(s) found from the chart?

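The chart shows each weekday's share of the hourly records, because `value_counts()` counts rows rather than summing `Rented Bike Count`. Friday has the largest share and the remaining days are roughly equal, so every weekday is about equally represented in the data. Comparing actual demand across weekdays requires aggregating `Rented Bike Count` per day.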

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing how rentals vary by day of the week helps us manage inventory and serve customers better. Because this chart counts records rather than bikes, it mainly confirms that the weekdays are evenly represented; the rental totals per day are what should drive stocking decisions.
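The difference between counting records and measuring demand can be seen on a small example. The sketch below uses a five-row `toy` frame with made-up numbers (an assumption for illustration) and compares the share of rows per weekday with the share of rented bikes per weekday.

```python
import pandas as pd

# Made-up records (assumption): three Friday hours, two Saturday hours.
toy = pd.DataFrame({
    "Week_day": ["Friday", "Friday", "Friday", "Saturday", "Saturday"],
    "Rented Bike Count": [100, 150, 50, 400, 500],
})

# Share of rows per weekday: this is what value_counts() measures.
row_share = toy["Week_day"].value_counts(normalize=True)

# Share of rented bikes per weekday: rentals per day over the grand total.
rentals = toy.groupby("Week_day")["Rented Bike Count"].sum()
demand_share = rentals / rentals.sum()

print(row_share.round(2).to_dict())
print(demand_share.round(2).to_dict())
```

In this toy data Friday holds most of the rows but only a quarter of the rentals, which is why a pie chart of `demand_share` answers the demand question and a pie chart of `row_share` does not.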

#### Chart - 2
### ***Share of hourly records by season***


In [None]:
# Chart - 2 visualization code
# value_counts() counts rows (hourly records) per season, not rented bikes
df['Seasons'].value_counts().plot(kind = 'pie',legend = True, figsize = (8,8),explode = [0.02,0.02,0.02,0.02], autopct = '%1.2f%%')
plt.title('Share of hourly records by season')
plt.show()

##### 1. Why did you pick the specific chart?

*   We selected **pie charts** for display of this information because their major function is to show the composition of a whole dataset, where each segment represents a different category or subcategory of the data.

*   They can be useful for quickly and easily identifying which categories are most prominent or for comparing the relative sizes of different categories.

##### 2. What is/are the insight(s) found from the chart?

The records are split almost evenly across the four seasons. Because `value_counts()` counts rows, this reflects how many hours each season contributes to the dataset, not how many bikes were rented.

* Spring and Summer: 25.21% each.
* Autumn: 24.93%.
* Winter: 24.66%.

These shares match seasons of 92, 92, 91, and 90 days in a 365-day year, so the small differences come from the calendar and do not indicate a preference for biking in any season.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Even coverage across seasons means that seasonal comparisons of rentals are not distorted by missing periods. The seasonal demand pattern itself, which is what helps the business optimize operations, improve customer satisfaction, and plan marketing, comes from the rental counts examined in the charts that follow.

#### Chart - 3
### ***Share of hourly records by Year***

In [None]:
# Chart - 3 visualization code
# value_counts() counts rows (hourly records) per year, not rented bikes
df['Year'].value_counts().plot(kind = 'pie',legend = True, figsize = (10,10),autopct = '%1.2f%%', shadow = True, explode = [0.05,0.05])
plt.title('Share of hourly records by year')
plt.show()

##### 1. Why did you pick the specific chart?


* We selected pie charts for display of this information because their major function is to show the composition of a whole dataset, where each segment represents a different category or subcategory of the data.

* They can be useful for quickly and easily identifying which categories are most prominent or for comparing the relative sizes of different categories.

##### 2. What is/are the insight(s) found from the chart?

Records from 2018 make up 91.51% of the dataset, while records from 2017 make up only 8.49%.

* This split reflects how much of each year the dataset covers, not a rise in demand: `value_counts()` counts rows, and far fewer rows fall in 2017.
* A growth trend between the two years can only be assessed by comparing rentals over matching periods.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart mainly tells the business that the data is dominated by 2018. Before using it to forecast growth or plan for scalability, rentals from comparable periods in each year need to be compared.
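Whether a year-level split reflects coverage or demand can be checked by looking at the date range behind each year. The sketch below builds a synthetic calendar under the assumption that the records run hourly from 1 December 2017 to 30 November 2018; that range is an assumption for illustration and should be confirmed against `df['Date'].min()` and `df['Date'].max()`.

```python
import pandas as pd

# Assumed coverage (illustration only): 1 Dec 2017 to 30 Nov 2018, 24 rows per day.
days = pd.date_range("2017-12-01", "2018-11-30", freq="D")
toy = pd.DataFrame({"Date": days.repeat(24)})
toy["Year"] = toy["Date"].dt.year

# First day, last day, and number of distinct days covered in each year.
coverage = toy.groupby("Year")["Date"].agg(
    first_day="min", last_day="max", days="nunique"
)

# Share of rows per year, in percent.
coverage["row_share_%"] = (toy["Year"].value_counts(normalize=True) * 100).round(2)
print(coverage)
```

Under this assumed range, 31 days fall in 2017 and 334 in 2018, which gives row shares of 8.49% and 91.51%, the same figures as the pie chart. That is consistent with the split being driven by coverage rather than by growth.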

#### Chart - 4
### **Distribution of  bike sharing demand on the basis of seasons**

In [None]:
# Chart - 4 visualization code
plt.scatter('Seasons', 'Rented Bike Count', s=26, c='purple', marker='^',data=df)
plt.title("Spread of bike demand across seasons")
plt.xlabel("Seasons")
plt.ylabel("Rented Bike Count")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot displays the relationship between two variables. Here the x-axis is categorical (`Seasons`), so each column of points shows the spread of hourly rental counts within one season.

##### 2. What is/are the insight(s) found from the chart?

* **Spring and Summer** have the highest outliers, indicating that there are instances of exceptionally high bike demand during these seasons. This could be due to favorable weather conditions, holidays, or special events.
* **Autumn** also shows some high demand outliers, but fewer compared to Spring and Summer.
* **Winter** has the least number of high demand outliers, suggesting that bike demand is generally lower during this season, likely due to colder weather.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The spread of bike demand varies across seasons. The main clusters in Spring, Summer, and Autumn reach higher counts, meaning these seasons see more consistent bike usage.
* In Winter, the narrower spread and fewer high-demand points indicate reduced bike usage overall, in line with the usual seasonal pattern of cold weather discouraging biking.

#### Chart - 5
### **Monthly Average Bike Rentals: Holidays vs. Non-Holidays**

In [None]:
# Chart - 5 visualization code
group_month = df.groupby(['Month', 'Holiday'])['Rented Bike Count'].mean().reset_index()

# Create the plot
plt.figure(figsize=(15, 6))
sns.barplot(data=group_month, x='Month', y='Rented Bike Count', hue='Holiday')

# Set plot labels and title
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Average bike rentals per Month')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

We selected a bar chart because it is a popular and effective way to communicate data to a broad audience: bars are easy to read and interpret.

Bar charts are commonly used to show the frequency or proportion of different categories, or to compare the magnitude of different data points.

##### 2. What is/are the insight(s) found from the chart?

* Bike rentals show a clear seasonal pattern, with higher demand in the warmer months (May to September) and lower demand in the colder months (December to February).
* The trend varies across different months:
  * May: Non-holidays have higher bike rental counts than holidays.
  * June: Bike rentals are relatively equal on holidays and non-holidays.
  * July: Bike rentals on holidays are slightly higher than non-holidays.
  * August and September: Non-holidays have higher bike rental counts than holidays.
* June stands out with the highest average bike rentals, both on holidays and non-holidays. This indicates June might be the most favorable month for biking, possibly due to good weather conditions and the beginning of summer vacations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Allocate more bikes and maintenance resources during peak months (May to September) and holidays to meet higher demand, but also consider the high demand on non-holidays in certain months.
* Adjust staffing levels and bike availability to match the higher rental volumes during these times.

#### Chart - 6
### **Hourly Distribution of Average Bike Rentals Across Different Days of the Week**

In [None]:
# Chart - 6 visualization code
group_hour = df.groupby(['Week_day','Hour'])['Rented Bike Count'].mean().reset_index()

# Create the plot
plt.figure(figsize=(15, 6))
sns.pointplot(data=group_hour, x='Hour', y='Rented Bike Count', hue='Week_day')


# Set plot labels and title
plt.xlabel('Hour')
plt.ylabel('Count')
plt.title('Average bike rentals per Hour')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

We selected a point plot because it represents the mean, median, or another statistic of interest for each level of a categorical variable, plotting one point per level at the corresponding value of the quantitative variable. The points are then connected by a line, making it easier to compare the values of the quantitative variable across the different levels.

Point plots are useful for exploring the relationship between two variables and for visualizing patterns or trends in the data.

##### 2. What is/are the insight(s) found from the chart?

* Morning Peak (7-8 AM):

    * There is a sharp peak in bike rentals around 7-8 AM on weekdays (Monday to Friday), which corresponds to the morning commute time. This peak is highest on Fridays.
    * Weekends (Saturday and Sunday) show a more gradual increase during the morning hours, without a sharp peak.
* Evening Peak (5-6 PM):

    * Another significant peak occurs around 5-6 PM on weekdays, which corresponds to the evening commute time. This peak is again highest on Fridays.
    * Weekends show an increase in rentals during the evening hours, but it is less pronounced compared to weekdays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Increase the availability of bikes during peak commuting hours (7-9 AM and 5-7 PM) on weekdays, especially on Fridays.
* Ensure a steady supply of bikes throughout the midday period on weekends to cater to recreational users.
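The peak hours read off the chart can also be extracted numerically. The following is a minimal sketch on made-up numbers (the `toy` frame is an assumption for illustration): it pivots mean rentals into an hour-by-day-type table and picks the hour with the highest mean in each column.

```python
import pandas as pd

# Made-up hourly records (assumption, for illustration only).
toy = pd.DataFrame({
    "Day": ["weekday"] * 4 + ["weekend"] * 4,
    "Hour": [8, 8, 18, 18, 8, 8, 15, 15],
    "Rented Bike Count": [900, 1100, 1400, 1600, 300, 500, 1000, 1200],
})

# Rows are hours, columns are day types, cells are mean rentals.
hourly = toy.pivot_table(
    index="Hour", columns="Day", values="Rented Bike Count", aggfunc="mean"
)

# idxmax returns, for each column, the hour with the highest mean.
peak_hour = hourly.idxmax()
print(hourly)
print(peak_hour.to_dict())
```

Applied to the real data with `columns='Week_day'`, the same table would give the peak hour for each day of the week without reading values off the plot.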

#### Chart - 7
### **Average bike rentals across different time shifts**

In [None]:
# Chart - 7 visualization code
Time_shift = df.groupby(['Shift Time'])['Rented Bike Count'].mean().reset_index()

# Create the plot
plt.figure(figsize=(10, 5))
sns.pointplot(data=Time_shift, x='Shift Time', y='Rented Bike Count')


# Set plot labels and title
plt.xlabel('Shift')
plt.ylabel('Count')
plt.title('Average bike rentals across different time shifts')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

* We selected a point plot because it represents the mean, median, or another statistic of interest for each level of a categorical variable, plotting one point per level at the corresponding value of the quantitative variable.

* The points are then connected by a line, making it easier to compare the values of the quantitative variable across the different levels of the categorical variable.

* Point plots are useful for exploring the relationship between two variables and for visualizing patterns or trends in the data.

##### 2. What is/are the insight(s) found from the chart?

* The chart highlights that the evening shift is the most popular time for bike rentals, followed by the afternoon, morning, and finally the night shift.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Given these insights, we can keep more bikes available during the evening, when demand is highest, which ultimately helps generate more revenue.

#### Chart - 8
### **Comparison of Average Bike Rentals: Weekday vs. Weekend**

In [None]:
# Chart - 8 visualization code
colors = ['skyblue', 'purple']
sns.barplot(data = df, x = 'Day', y= 'Rented Bike Count', palette=colors,width=0.3)
plt.title('Comparison of Average Bike Rentals: Weekday vs. Weekend')
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart allows for an easy and direct comparison between the two categories (weekday and weekend). The differences in bike rentals are visually distinct, making it straightforward to interpret the data.

* The height of each bar represents the average count, which is easy to understand even for people without a statistical background.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that the average number of bike rentals is higher on weekdays compared to weekends. This suggests that bike rentals are more popular among people commuting to work or running errands during the weekdays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Given the higher demand for bike rentals during weekdays, it is essential to ensure an adequate stock of bikes to meet this increased demand. By doing so, we can better satisfy customer needs and enhance business performance.

#### Chart - 9
### **Impact of Weather Conditions on Bike Rentals**

In [None]:
# Chart - 9 visualization code
fig = plt.figure(figsize=(18, 8))
axes = fig.add_subplot(1, 3, 1)
sns.regplot(data=df, x='Temperature(°C)', y='Rented Bike Count',ax=axes,color = 'maroon')
axes.set(title='Temperature vs Rented Bike Count')
axes = fig.add_subplot(1, 3, 2)
sns.regplot(data=df, x='Humidity(%)', y='Rented Bike Count',ax=axes)
axes.set(title='Humidity vs Rented Bike Count')
axes = fig.add_subplot(1, 3, 3)
sns.regplot(data=df, x='Wind speed (m/s)', y='Rented Bike Count',ax=axes, color='green')
axes.set(title='Windspeed vs Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

* Regression plots (scatter plots with a fitted line) are well suited to visualizing the relationship between two numerical variables. In this case, we want to understand how temperature, humidity, and wind speed affect the number of rented bikes.

* The scatter points show the distribution of the data, and the fitted line indicates the direction of any correlation.

##### 2. What is/are the insight(s) found from the chart?

* ***Temperature***: There seems to be a positive correlation between temperature and bike rentals. As the temperature increases, the number of bike rentals tends to increase as well. This is evident from the upward slope of the regression line.

* ***Humidity***: The relationship between humidity and bike rentals is less clear. There appears to be a slight negative correlation, but the data points are more scattered, indicating a weaker relationship.

* ***Wind Speed***: The scatter plot for wind speed shows a negative correlation with bike rentals. As wind speed increases, the number of bike rentals tends to decrease.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Temperature: Bike rental businesses can leverage temperature forecasts to anticipate demand and adjust their operations accordingly. For example, they can increase bike availability during warmer periods.

* Wind Speed: Understanding the negative impact of wind speed can help in planning marketing campaigns or offering incentives during windy days to encourage rentals.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Extract the numerical columns from the existing dataset into a new dataframe
numerical_df = df.select_dtypes(include= [np.number])
print(numerical_df)

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(14, 5))
sns.heatmap(numerical_df.corr(), annot=True,annot_kws={'size':7})
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

* A correlation matrix is a table that shows the correlation between each pair of variables.

* It is also used to summarize data as an input to more advanced analysis.

##### 2. What is/are the insight(s) found from the chart?

* ***Rented Bike Count*** shows a moderately strong positive correlation (0.54) with ***Temperature(°C)***. This indicates that bike rentals tend to increase as the temperature rises.

* ***Dew point temperature(°C)*** has a weaker positive correlation (0.38) with ***Rented Bike Count***, indicating that higher dew point temperatures, which often accompany warmer weather, are associated with more rentals.
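Reading individual cells off a heatmap gets tedious with many features, so it can help to rank the correlations with the target directly. The sketch below does this on synthetic data; the `toy` frame, its two features, and the coefficients used to generate it are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): rentals rise with temperature, fall with humidity.
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "Temperature": rng.uniform(-10, 35, n),
    "Humidity": rng.uniform(20, 90, n),
})
toy["Rented Bike Count"] = (
    500 + 30 * toy["Temperature"] - 10 * toy["Humidity"] + rng.normal(0, 100, n)
)

target = "Rented Bike Count"

# Take the target's column of the correlation matrix, drop its self-correlation,
# and sort from most positive to most negative.
ranked = toy.corr()[target].drop(target).sort_values(ascending=False)
print(ranked.round(2))
```

On the notebook's data the equivalent call would be `numerical_df.corr()['Rented Bike Count']`, followed by the same `drop` and `sort_values`.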

#### Chart - 11 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data=df, vars=['Temperature(°C)','Visibility (10m)','Rented Bike Count'], hue="Seasons",height=8,aspect=2)
plt.show()

In [None]:
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

* A pair plot is a type of data visualization that displays the pairwise relationships between multiple variables in a dataset.

* The pair plot consists of a grid of scatter plots, where each variable in the dataset is plotted against every other variable.

* `sns.pairplot` draws only the numeric columns. A categorical variable such as `Seasons` can still be shown by passing it as `hue`, which colors the points by category.

##### 2. What is/are the insight(s) found from the chart?

* From the distribution plots on the diagonal, we can see that temperature is highest in summer and visibility is highest in autumn, while rented bike counts are lowest in winter, consistent with Charts 4 and 5.

* However, there are many outliers in the temperature and visibility features, especially in summer.

* Temperature and rented bike count show a roughly linear trend across the different seasons.

## ***5. Hypothesis Testing***


### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


<div style="font-family: 'Rosemary';font-size :20px;">

1. Temperature Impact on Bike Rentals.


2. Seasonal Variation in Bike Rentals.


3. Hourly Variation in Bike Rentals.
</div>

### Hypothetical Statement - 1
<div style="font-family: 'Rosemary'; color: skyblue; font-size : 28px">
Temperature Impact on Bike Rentals :
</div>


#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis** : "There is no relationship between temperature and the number of rented bikes."


**Alternative Hypothesis** : "There is a relationship between temperature and the number of rented bikes."

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats  # ensures the stats module is available in this cell

Null_hypothesis = "There is no relationship between temperature and the number of rented bikes."
Alternative_hypothesis = "There is a relationship between temperature and the number of rented bikes."

# Calculate the Pearson correlation between temperature and the number of rented bikes
r, p = stats.pearsonr(df['Temperature(°C)'], df['Rented Bike Count'])

# Print the result of the test
print(f"Correlation coefficient: {r}")
print(f"P-Value: {p}")

# Compare the p-value with the 5% significance level
if p < 0.05:
    print(f"{Alternative_hypothesis}\nHence we reject the null hypothesis.")
else:
    print(f"{Null_hypothesis}\nHence we fail to reject the null hypothesis.")

##### Which statistical test have you done to obtain P-Value?

* We used the Pearson correlation coefficient and its p-value to test the statistical significance of the relationship between the number of rented bikes and the temperature.

* Specifically, we use the `pearsonr()` function from the `scipy.stats` library to calculate the Pearson correlation coefficient and the p-value between the `Temperature(°C)` and `Rented Bike Count` columns of the data.

##### Why did you choose the specific statistical test?

* The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables, and it takes values between -1 and 1.

* A value of -1 indicates a perfect negative linear relationship, a value of 0 indicates no linear relationship, and a value of 1 indicates a perfect positive linear relationship.

* It is suitable for testing the statistical significance of a linear relationship between two continuous variables.

* In this case, the temperature column is a continuous variable, and the rented bike count is an integer count with a range wide enough that we treat it as approximately continuous.
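The definition above can be sketched directly in NumPy. This is only an illustration of what the coefficient measures, not the code used in this notebook; the helper name `pearson_r` and the toy values are assumptions made for the example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))

# Hypothetical toy values, not taken from the bike-sharing dataset
temperature = [2.0, 8.0, 15.0, 21.0, 27.0]
rentals = [120, 310, 640, 820, 1100]

r = pearson_r(temperature, rentals)
print(f"r = {r:.3f}")
```

A value close to 1 here only says the toy points lie near a rising straight line; the p-value reported by pearsonr() additionally accounts for the sample size.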

### Hypothetical Statement - 2
<div style="font-family: 'Rosemary'; color: skyblue; font-size : 28px">
Seasonal Variation in Bike Rentals :
</div>

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null_Hypothesis** : There is no difference in the mean of rented Bikes between different seasons.

**Alternative_Hypothesis** : There is a difference in the mean of rented Bikes between different seasons.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value and F-Statistics:
null_Hypothesis=  "There is no difference in the mean of rented Bikes between different seasons."
alternative_Hypothesis =  "There is a difference in the mean of rented Bikes between different seasons."


S1 = df.groupby('Seasons')['Rented Bike Count'].get_group('Autumn')
S2 = df.groupby('Seasons')['Rented Bike Count'].get_group('Summer')
S3 = df.groupby('Seasons')['Rented Bike Count'].get_group('Spring')
S4 = df.groupby('Seasons')['Rented Bike Count'].get_group('Winter')

F,P = stats.f_oneway(S1,S2,S3,S4)
print(f"F-Statistic: {F}")
print(f"P-Value: {P}")
if P < 0.05:  # use the ANOVA p-value (P), not p from the Pearson test above
	print(f"{alternative_Hypothesis}\n Hence we are rejecting null hypothesis.")
else :
	print(f"{null_Hypothesis} \n Hence we fail to reject null hypothesis.")

##### Which statistical test have you done to obtain P-Value?

* A one-way ANOVA test is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

* The null hypothesis states that all group means are equal, and the alternative hypothesis states that at least one group mean is different.

##### Why did you choose the specific statistical test?

* ANOVA works by comparing the variance within each group to the variance between the groups. If the variance within the groups is small relative to the variance between the groups, it suggests that the means of the groups are significantly different.

* On the other hand, if the variance within the groups is large relative to the variance between the groups, it suggests that the means of the groups are not significantly different.

* This test is suitable for these hypotheses because they involve comparing the mean value of a variable (the number of rented bikes) across different groups (the seasons).
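The between-group versus within-group comparison described above can be written out by hand. The sketch below is a plain NumPy illustration under the usual one-way ANOVA definitions; the function name `one_way_anova_f` and the three toy samples are hypothetical and are not part of this notebook.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return float(ms_between / ms_within)

# Hypothetical toy samples, chosen so the group means are far apart
winter = [200, 220, 250, 230]
spring = [700, 760, 720, 740]
summer = [1000, 1050, 980, 1020]

f_stat = one_way_anova_f([winter, spring, summer])
print(f"F = {f_stat:.1f}")
```

A large F means the group means differ by much more than the scatter inside each group, which is the situation in which stats.f_oneway returns a small p-value.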

### Hypothetical Statement - 3
<div style="font-family: 'Rosemary'; color: skyblue; font-size : 28px">
Hourly Variation in Bike Rentals :
</div>

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null_Hypothesis** : "There is no difference in the mean number of rented bikes between different hours of the day."


**Alternative_Hypothesis** : "There is a difference in the mean number of rented bikes between different hours of the day."


#### 2. Perform an appropriate statistical test.

In [None]:
# Define the null and alternative hypotheses
Null_Hypothesis = "There is no difference in the mean number of rented bikes between different hours of the day."
Alternative_Hypothesis = "There is a difference in the mean number of rented bikes between different hours of the day."

# Perform Statistical Test to obtain P-Value and F- value
F, p = stats.f_oneway(*[data.groupby('Hour')['Rented Bike Count'].get_group(hour)
                        for hour in data.groupby('Hour').groups])
print(f"F-statistic: {F}")
print(f"p-value: {p}")
if p < 0.05:
   print(f"{Alternative_Hypothesis} \n Hence we are rejecting null hypothesis.")
else:
   print(f"{Null_Hypothesis}\n Hence we fail to reject null hypothesis.")

##### Which statistical test have you done to obtain P-Value?

* We chose the one-way analysis of variance (ANOVA) test for these hypotheses because it determines whether there is a statistically significant difference between the means of several groups, here the hours of the day.

##### Why did you choose the specific statistical test?

A one-way ANOVA test is used to determine whether there are any statistically significant differences between the means of three or more independent groups. The null hypothesis states that all group means are equal, and the alternative hypothesis states that at least one group mean is different.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

We found no missing values in this dataset, so we did not apply any imputation technique.

### 2. Handling Outliers

In [None]:
# Separating the columns that are required for the analysis:
column_to_exclude = ['Year','date', 'Month']
req_columns = numerical_df.drop(columns= column_to_exclude )
print(req_columns.head())

In [None]:
# Handling Outliers & Outlier treatments
symmetric_feature = []
skew_symmetric_feature =[]
for c in req_columns:
	if abs(df[c].mean()-df[c].median())< 0.2 :
		symmetric_feature.append(c)
	else:
		skew_symmetric_feature.append(c)

print(f"Symmetric Distributed Features : {symmetric_feature}")
print(f"Skew Symmetric Distributed Features : {skew_symmetric_feature}")

In [None]:
# For symmetric features, define the upper and lower boundary as mean ± 3 standard deviations, treating the feature as normally distributed.
def outlier_treatment(df,feature):
	upper_boundry = df[feature].mean()+ 3* df[feature].std()
	lower_boundry = df[feature].mean()- 3* df[feature].std()
	return upper_boundry,lower_boundry

In [None]:
# Capping the data to the lower and upper boundary:
for feature in symmetric_feature:
  # Compute both bounds once, before any value is changed
  upper, lower = outlier_treatment(df=df, feature=feature)
  df[feature] = df[feature].clip(lower=lower, upper=upper)

In [None]:
# For skewed features, define the upper and lower boundary with the IQR rule:
def outlier_treatment_skew(df,feature):
  IQR= df[feature].quantile(0.75)- df[feature].quantile(0.25)
  lower_bridge =df[feature].quantile(0.25)-1.5*IQR
  upper_bridge =df[feature].quantile(0.75)+1.5*IQR
  return upper_bridge,lower_bridge

In [None]:
# Capping the data to the lower and upper boundary:
for feature in skew_symmetric_feature:
  # Compute both bounds once, before any value is changed
  upper, lower = outlier_treatment_skew(df=df, feature=feature)
  df[feature] = df[feature].clip(lower=lower, upper=upper)

##### What all outlier treatment techniques have you used and why did you use those techniques?

* We first separated out the columns required for the analysis, i.e. for outlier treatment.

* We then categorised the columns into symmetric and skewed features and defined the upper and lower boundary for each.

* After this, we used the capping method, which replaces outliers with the upper or lower limit instead of removing the rows entirely.

* For symmetric features, which we treat as approximately normally distributed, we set the boundaries at three standard deviations from the mean.

* For non-symmetric (skewed) features, we used the quantile method for capping.

* The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQR.

			lower : Q1 - 1.5*IQR
			upper : Q3 + 1.5*IQR
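As a small worked illustration of the quantile rule, the sketch below caps a toy sample with NumPy. The helper `iqr_cap` and the wind-speed readings are hypothetical; the notebook itself works on DataFrame columns.

```python
import numpy as np

def iqr_cap(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper), lower, upper

# Hypothetical wind-speed readings with one extreme value
wind = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5, 9.0]
capped, lower, upper = iqr_cap(wind)
print(capped, lower, upper)
```

Only the extreme reading is pulled back to the upper bound; every other value is left unchanged, which is why capping keeps the row count intact.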

### 3. Categorical Encoding

In [None]:
df.columns

In [None]:
# Encode your categorical columns:


# Copy the DataFrame and drop the specified columns
categorical_data = df.copy()
categorical_data.drop(columns = ['Date',  'month_year', 'Year', 'Day','Week_day','Month','Shift Time','date'],axis = 1, inplace = True)
# Convert categorical variables to dummy/indicator variables
categorical_data = pd.get_dummies(categorical_data, drop_first=False)

# Convert boolean columns to integers
for column in categorical_data.select_dtypes(include='bool').columns:
    categorical_data[column] = categorical_data[column].astype(int)

categorical_data.head(3)

#### What all categorical encoding techniques have you used & why did you use those techniques?

* Since each categorical feature has only a few unique categories, we used one-hot encoding with the get_dummies function, which converts each category into its own boolean indicator column.

* We then converted all the boolean columns into binary (0/1) integers.
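The two steps above can also be done in a single call, because get_dummies accepts a dtype argument. The frame below is a hypothetical three-row example whose column names only mirror the dataset.

```python
import pandas as pd

# Hypothetical three-row frame; column names mirror the dataset
toy = pd.DataFrame({
    "Hour": [7, 8, 9],
    "Seasons": ["Winter", "Spring", "Winter"],
})

# dtype=int produces 0/1 columns directly instead of booleans
encoded = pd.get_dummies(toy, drop_first=False, dtype=int)
print(encoded)
```

Numeric columns such as Hour pass through untouched; only the string column is expanded into one indicator column per category.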

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
def absolute_humidity(temp, dew_point):


    # Calculate the saturation vapor pressure at the dew point temperature
    E_dew = 6.11 * 10 ** ((7.5 * dew_point) / (237.7 + dew_point))

    # Calculate absolute humidity
    AH = (216.7 * E_dew) / (temp + 273.15)

    return AH

In [None]:
categorical_data['Absolute Humidity (g/m³)']= categorical_data.apply(lambda x : absolute_humidity(x['Temperature(°C)'], x['Dew point temperature(°C)']), axis = 1)

In [None]:
categorical_data = categorical_data.drop(categorical_data[categorical_data['Functioning Day_Yes']==0].index)

In [None]:
categorical_data.head()

>> Here we combine two features, **Temperature(°C) and Dew point temperature(°C)**, into a new feature, absolute humidity, because the two were highly correlated.

>> ***Absolute humidity is the mass of water vapor in a given volume of air.***

>> **Secondly, we drop the rows for non-functioning days, because no bikes are rented on them.**

>> Keeping them would also add a feature with very little variation, which would not help the model learn and could contribute to overfitting.
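To make the formula concrete, here is a vectorised sketch of the same calculation with two hypothetical readings. It uses the constants from the function above and assumes temperatures in °C.

```python
import numpy as np

def absolute_humidity(temp_c, dew_point_c):
    """Vectorised form of the notebook's formula (result in g/m³)."""
    temp_c = np.asarray(temp_c, dtype=float)
    dew_point_c = np.asarray(dew_point_c, dtype=float)
    # Vapour pressure (hPa) at the dew point, Magnus-type approximation
    e_dew = 6.11 * 10 ** ((7.5 * dew_point_c) / (237.7 + dew_point_c))
    return (216.7 * e_dew) / (temp_c + 273.15)

# Hypothetical readings: 20 °C air with a 10 °C and a 20 °C dew point
ah = absolute_humidity([20.0, 20.0], [10.0, 20.0])
print(ah)
```

The first reading comes out at roughly 9 g/m³ and the second, fully saturated one at roughly 17 g/m³. Because the function accepts whole arrays, it could be called on the two columns directly rather than row by row with apply.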


#### 2. Feature Selection

In [None]:
lower_bound = 0.00
upper_bound = 0.05
variance = categorical_data.var()
filtered_variance = variance[(variance >= lower_bound) & (variance <= upper_bound)]
print(filtered_variance)

### Feature selection involves removing columns from the dataset that have low variance.

In [None]:
# Select your features wisely to avoid overfitting

from sklearn.feature_selection import VarianceThreshold
def drop_constant_columns(data, threshold=0.05):


    # Initialize the VarianceThreshold object with the specified threshold
    var_thres = VarianceThreshold(threshold=threshold)

    # Fit the VarianceThreshold object to the data
    var_thres.fit(data)

    # Get the columns to be dropped
    low_variance_columns = [column for column in data.columns
                            if column not in data.columns[var_thres.get_support()]]

    # Print the columns that are being dropped
    print(f'Columns dropped: {low_variance_columns}')

    # Drop the low variance columns from the DataFrame
    new_df = data.drop(columns=low_variance_columns, axis=1)

    return new_df

In [None]:
remove_var = drop_constant_columns(categorical_data)

In [None]:
remove_var.head()

### Variance Inflation Factor (VIF):
* It helps in detecting and removing multicollinearity in the dataset.

* It also identifies the strength of correlation between the independent variables.

* Range of VIF values:
	> Higher than 5: corrective measures are required.

	> Less than 5: no corrective measures are required.
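For intuition, the VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with plain NumPy on synthetic data. It is an illustration only; the helper name `vif` and the generated columns are assumptions, and the notebook itself uses the statsmodels function.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        # Remaining columns plus an intercept term
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1.0 - (residual ** 2).sum() / ((target - target.mean()) ** 2).sum()
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

# Synthetic columns, generated only for illustration
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # almost a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get a VIF far above 5, while the independent column stays close to 1, matching the rule of thumb listed above.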

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calculate_vif(data_):

    # Add constant for intercept
    X = add_constant(data_)

    # Calculate VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif_data



# Calculate VIF
vif_data = calculate_vif(remove_var)
# Filter out columns with infinite VIF values
vif_data = vif_data[vif_data["VIF"] != float('inf')]
vif_data = vif_data.sort_values(by = 'VIF', ascending=False)
print(vif_data)


In [None]:
# Drop the feature with the highest VIF
highest_vif_feature = vif_data.sort_values(by="VIF", ascending=False).iloc[0]["Feature"]
df_dropped= remove_var.drop(columns=[highest_vif_feature])
print(f"Dropped feature: {highest_vif_feature}")
print(df_dropped)

In [None]:
df_dropped.head()

In [None]:
# Calculate VIF
vif_data = calculate_vif(df_dropped)

# Filter out columns with infinite VIF values
vif_data = vif_data[vif_data["VIF"] != float('inf')]

vif_data = vif_data.sort_values(by = 'VIF', ascending=False)
print(vif_data)

In [None]:
 # Drop the feature with the highest VIF
highest_vif_feature = vif_data.sort_values(by="VIF", ascending=False).iloc[0]["Feature"]
Final_df= df_dropped.drop(columns=[highest_vif_feature])
print(f"Dropped feature: {highest_vif_feature}")
print(Final_df)

In [None]:
# Calculate VIF
vif_data = calculate_vif(Final_df)
# Filter out columns with infinite VIF values
vif_data = vif_data[vif_data["VIF"] != float('inf')]
vif_data = vif_data.sort_values(by = 'VIF', ascending=False)
print(vif_data)

In [None]:
plt.figure(figsize=(14, 5))
sns.heatmap(Final_df.corr(), annot=True,annot_kws={'size':7})
plt.title('Correlation Heatmap')
plt.show()

##### What all feature selection methods have you used  and why?
**Dropping constant features:** Using a variance threshold, we dropped the columns whose values are constant or nearly constant.

**Reduces Dimensionality:** By removing features with very low variance, we reduce the number of features, which simplifies the model and reduces the risk of overfitting.

**Improves Model Performance:** Low variance features can introduce noise and may negatively impact the model's performance. Removing them can lead to better generalization.

**Reduces Multicollinearity:** We also dropped the feature with the highest VIF in two successive rounds, so that the remaining independent variables are less correlated with each other.

##### Which all features you found important and why?

In [None]:
Final_df.head()


In [None]:

print(Final_df.shape)

>> Columns retained for modelling (Rented Bike Count is the target):
* 'Rented Bike Count', 'Hour', 'Humidity(%)', 'Wind speed (m/s)',
       'Visibility (10m)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)',
       'Snowfall (cm)', 'Seasons_Autumn', 'Seasons_Spring', 'Seasons_Summer',
       'Seasons_Winter','Absolute Humidity (g/m³)'
	   
>> By analyzing these features, we can better understand the factors that influence bike rentals and build a more accurate predictive model.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
def classify_features(df, threshold=0.1):
    symmetric_features = []
    non_symmetric_features = []

    for column in df.columns:
        # Calculate mean and median
        mean_val = df[column].mean()
        median_val = df[column].median()

        # Classify based on the mean-median difference
        if abs(mean_val - median_val) <= threshold:
            symmetric_features.append(column)
        else:
            non_symmetric_features.append(column)

    return symmetric_features, non_symmetric_features

symmetric_feature,non_symmetric_features = classify_features(Final_df)




In [None]:
print("Symmetric Distributed Features :-", symmetric_feature)
print('--------------------------------------------------------------------------------------------------')
print("Skew Distributed Features :-", non_symmetric_features)

In [None]:
# Function to plot histograms and boxplots for features
def plot_features(df, features, title):
    plt.figure(figsize=(15, 10))

    for i, feature in enumerate(features, 1):
        plt.subplot(len(features), 2, i*2-1)
        sns.histplot(df[feature], kde=True)
        plt.title(f'Histogram of {feature}')

        plt.subplot(len(features), 2, i*2)
        sns.boxplot(x=df[feature])
        plt.title(f'Boxplot of {feature}')

    plt.suptitle(title)
    plt.tight_layout()
    plt.show()

# Visualize symmetric features
plot_features(Final_df, symmetric_feature, 'Symmetric Features')

# Visualize non-symmetric features
plot_features(Final_df, non_symmetric_features, 'Non-Symmetric Features')

In [None]:
Final_df['Wind speed (m/s)']=np.cbrt(Final_df['Wind speed (m/s)'])
Final_df['Rented Bike Count']=np.sqrt(Final_df['Rented Bike Count'])
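The square-root transform applied to the target above compresses large counts, which reduces right skew, and it is undone with np.square (the metric cells later in the notebook rely on this). The sketch below illustrates both points on synthetic data; the `skewness` helper and the generated counts are assumptions made for the example.

```python
import numpy as np

def skewness(values):
    """Third standardised moment (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Synthetic right-skewed counts, generated only for illustration
rng = np.random.default_rng(42)
counts = rng.gamma(shape=1.5, scale=400.0, size=5000)

sqrt_counts = np.sqrt(counts)        # forward transform, as used for the target
restored = np.square(sqrt_counts)    # inverse transform back to the original scale

print(f"skew before: {skewness(counts):.2f}, after: {skewness(sqrt_counts):.2f}")
```

Any error metric computed on the transformed target is in square-root units, so predictions need to be squared before they are read as bike counts.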

### 6. Data Scaling

In [None]:
# Scaling your data
for col in Final_df:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature= (Final_df[col])
  sns.histplot(Final_df[col], kde=True)  # distplot is deprecated in recent seaborn releases
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
  plt.show()

In [None]:
# Scaling your data
for col in non_symmetric_features:
  if col == 'Rented Bike Count':
    pass
  elif col == 'Wind speed (m/s)':
    Final_df[col] = StandardScaler().fit_transform(Final_df[col].values.reshape(-1, 1))
  else:
    Final_df[col] = MinMaxScaler().fit_transform(Final_df[col].values.reshape(-1, 1))

##### Which method have you used to scale you data and why?

* We use standardization when a feature roughly follows a Gaussian distribution and normalization (min-max scaling) when it does not.

* In our dataset the features differ widely in scale and distribution, so we applied standardization with **StandardScaler** to **Wind speed**, which looked approximately normally distributed, and normalization with **MinMaxScaler** to the remaining non-symmetric features. The target, **Rented Bike Count**, was left unscaled.
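The difference between the two scalers can be seen on a handful of numbers. The sketch below reproduces the arithmetic in NumPy with hypothetical humidity readings; the function names are assumptions, and the notebook itself uses the scikit-learn classes.

```python
import numpy as np

def standardize(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def min_max(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

humidity = [20.0, 35.0, 50.0, 65.0, 80.0]   # hypothetical readings
z = standardize(humidity)
m = min_max(humidity)
print(z)
print(m)
```

Standardization centres the values around 0 without bounding them, whereas min-max scaling squeezes them into the interval from 0 to 1.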

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

### Not-Required

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X = Final_df.drop(columns=['Rented Bike Count'])
y = Final_df['Rented Bike Count']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


##### What data splitting ratio have you used and why?

>> ***We split the data in an 80:20 (4:1) ratio.***
* This split provides enough data for the model to learn effectively while still retaining a substantial number of samples to evaluate model performance accurately.

* In summary, an 80:20 split is a balanced and widely accepted practice that helps ensure sufficient data for both training and testing, promoting robust model development and reliable evaluation.
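Conceptually, an 80:20 split shuffles the row indices and slices off one fifth of them for testing. The sketch below shows that idea in NumPy. It is not how train_test_split is implemented internally, and the helper name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and slice them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(n_samples=1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state=42 in the cell above: the same rows land in the test set on every run.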

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

### Not-Required

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

Model_1 = LinearRegression()

# Fit the Algorithm
Model_1.fit(X_train,y_train)

In [None]:
# Score
Model_1.score(X_train,y_train)

In [None]:
# Predict on the model
print(f'The model coefficients are {Model_1.coef_}')
print(f'The model intercept is {Model_1.intercept_}')
y_pred_train = Model_1.predict(X_train)
y_pred = Model_1.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Metric Score for train set
r2_train = r2_score(y_train, y_pred_train)
# Adjusted R2 on the back-transformed target, using the training-set dimensions
adj_r2_train = 1-(1-r2_score(np.square(y_train), np.square(y_pred_train)))*(
    (X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
MSE_train = mean_squared_error(y_train, y_pred_train)
RMSE_train = np.sqrt(MSE_train)
MAE_train = mean_absolute_error(y_train, y_pred_train)


# Metric Score for test set
r2_test = r2_score(y_test, y_pred)
adj_r2_test = 1-(1-r2_score(np.square(y_test), np.square(y_pred)))*(
    (X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
MSE_test = mean_squared_error(y_test, y_pred)
RMSE_test = np.sqrt(MSE_test)
MAE_test = mean_absolute_error(y_test, y_pred)

#Converting into readable format
row_a=['r2_score','adj_r2_score','mean_squared_error','RMSE','mean_absolute_error']
row_b=[r2_train,adj_r2_train,MSE_train,RMSE_train,MAE_train]
row_c=[r2_test,adj_r2_test,MSE_test,RMSE_test,MAE_test]

#final dataframe of parameters
data_r=pd.DataFrame({'Evalution Parameters': row_a, 'Train': row_b, 'Test': row_c}).set_index('Evalution Parameters')
data_r

In [None]:
# Visualizing evaluation Metric Score chart
# Plotting actual and predicted values and the feature importances:
plt.figure(figsize=(18,6))
plt.plot((y_pred)[:200])
plt.plot((np.array(y_test)[:200]))
plt.legend(["Predicted","Actual"])
plt.title('Actual and Predicted Bike Counts')
plt.tight_layout()
plt.show()

* The R-squared values on the training and test datasets are relatively similar, which indicates that the model generalizes consistently and is not overfitting.

* The MAE and RMSE values are also fairly low, which indicates that the prediction errors are relatively small.

* The R-squared value is slightly higher on the test dataset than on the training dataset, which could indicate that the model is underfitting the training data, meaning that it is not capturing the underlying patterns in the data well enough.

**Overall Trends:**

>The predicted values generally follow the trends of the actual values, indicating that the model captures the overall pattern in the data.


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
parameter={
      'fit_intercept':[True,False],
      'copy_X':[True,False],
      'n_jobs':[1,2,3,4,5,6,7,8,9,10,11,12],
      'positive':[True,False]}


# Create the grid search object
gs_r=GridSearchCV(Model_1,param_grid=parameter,cv=5,scoring='r2')

# Fit the Algorithm
gs_r.fit(X_train,y_train)

# Predict on the model
y_pred_test_gs=gs_r.predict(X_test)
y_pred_train_gs=gs_r.predict(X_train)

In [None]:
# Get the best parameters, estimator, and score
best_parameters = gs_r.best_params_
best_model = gs_r.best_estimator_
best_score = gs_r.best_score_

print("Best parameters found: ", best_parameters)
print("Best estimator: ", best_model)
print("Best R-squared score: ", best_score)

##### Which hyperparameter optimization technique have you used and why?

* We used GridSearchCV, which applies the grid search technique to find the hyperparameter values that give the best model performance.

* Our goal is to find the hyperparameter values that give the best prediction results from our model.

* It evaluates every combination of the specified hyperparameter values, calculates the performance of each combination, and selects the best one.

* This makes the search time-consuming and computationally expensive, and the cost grows with the number of hyperparameters and values involved.

* GridSearchCV also performs cross-validation on the training data, so each combination is scored on held-out folds.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**The results were similar after hyperparameter tuning, which suggests the model was already close to its best with the original parameters.**

### ML Model - 2

The second model that I want to apply is the **Random Forest model**.

* Random Forest is known for its high accuracy. It combines the predictions of multiple decision trees, which helps to improve the overall predictive performance.

* It is less likely to overfit compared to a single decision tree. Overfitting is minimized because the model averages the results of multiple trees.

Random Forest is a powerful, versatile, and robust model that can handle various types of data, provide insights into feature importance, and deliver high predictive accuracy. It is a good choice for many regression and classification tasks, especially when the dataset is large and complex.

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Creating Random Forest Model
Model_2 = RandomForestRegressor()
# Fit the Algorithm
Model_2.fit(X_train,y_train)

# Predict on the model

y_pred_train_rf = Model_2.predict(X_train)
y_pred_test_rf = Model_2.predict(X_test)

In [None]:
# Metric Score for train set
r2_train_rf = r2_score(y_train, y_pred_train_rf)
# Adjusted R2 on the back-transformed target, using the training-set dimensions
adj_r2_train_rf = 1-(1-r2_score(np.square(y_train), np.square(y_pred_train_rf)))*(
    (X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
MSE_train_rf = mean_squared_error(y_train,y_pred_train_rf)
RMSE_train_rf = np.sqrt(MSE_train_rf)
MAE_train_rf = mean_absolute_error(y_train, y_pred_train_rf)

# Metric Score for test set
r2_test_rf = r2_score(y_test, y_pred_test_rf)
adj_r2_test_rf = 1-(1-r2_score(np.square(y_test), np.square(y_pred_test_rf)))*(
    (X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
MSE_test_rf = mean_squared_error(y_test, y_pred_test_rf)
RMSE_test_rf = np.sqrt(MSE_test_rf)
MAE_test_rf = mean_absolute_error(y_test,y_pred_test_rf)

#Converting into readable format
row_a_rf=['r2_score','adj_r2_score','mean_squared_error','RMSE','mean_absolute_error']
row_b_rf=[r2_train_rf,adj_r2_train_rf,MSE_train_rf,RMSE_train_rf,MAE_train_rf]
row_c_rf=[r2_test_rf,adj_r2_test_rf,MSE_test_rf,RMSE_test_rf,MAE_test_rf]

#final dataframe of parameters
data_rf=pd.DataFrame({'Evalution Parameters': row_a_rf, 'Train': row_b_rf, 'Test': row_c_rf}).set_index('Evalution Parameters')
data_rf

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(18,6))
plt.plot((y_pred_test_rf)[:200])        # Random Forest predictions on the test set
plt.plot((np.array(y_test)[:200]))      # actual values from the test set
plt.legend(["Predicted","Actual"])
plt.title('Actual and Predicted Bike Counts')
plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# Hyperparameter grid
p= {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

# Grid search
gs_rf= GridSearchCV(estimator= Model_2 ,
                       param_grid = p,
                       cv = 5, verbose=2, scoring='r2')


# Fit the Algorithm
gs_rf.fit(X_train,y_train)

# Predict on the model
y_pred_train_rf_gs= gs_rf.predict(X_train)
y_pred_test_rf_gs= gs_rf.predict(X_test)

In [None]:
#Metric Score chart for train
r2_train_rf_gs= r2_score(y_train, y_pred_train_rf_gs)
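# Adjusted R2 on the back-transformed target, using the training-set dimensions
adj_r2_train_rf_gs = 1-(1-r2_score(np.square(y_train), np.square(y_pred_train_rf_gs)))*(
    (X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
MSE_train_rf_gs= mean_squared_error(y_train,y_pred_train_rf_gs)
RMSE_train_rf_gs = np.sqrt(MSE_train_rf_gs)  # use the tuned model's MSE, not the untuned one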
MAE_train_rf_gs = mean_absolute_error(y_train, y_pred_train_rf_gs)

#Metric Score chart for test
r2_test_rf_gs= r2_score(y_test, y_pred_test_rf_gs)
adj_r2_test_rf_gs = 1-(1-r2_score(np.square(y_test), np.square(y_pred_test_rf_gs)))*(
    (X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
MSE_test_rf_gs= mean_squared_error(y_test,y_pred_test_rf_gs)
RMSE_test_rf_gs= np.sqrt(MSE_test_rf_gs)  # use the tuned model's MSE, not the untuned one
MAE_test_rf_gs= mean_absolute_error(y_test,y_pred_test_rf_gs)

#Converting into readable format
row_a_rf_gs=['r2_score','adj_r2_score','mean_squared_error','RMSE','mean_absolute_error']
row_b_rf_gs=[r2_train_rf_gs,adj_r2_train_rf_gs,MSE_train_rf_gs,RMSE_train_rf_gs,MAE_train_rf_gs]
row_c_rf_gs=[r2_test_rf_gs,adj_r2_test_rf_gs,MSE_test_rf_gs,RMSE_test_rf_gs,MAE_test_rf_gs]

#final dataframe of parameters
data_rf_gs=pd.DataFrame({'Evalution Parameters': row_a_rf_gs, 'Train': row_b_rf_gs, 'Test': row_c_rf_gs}).set_index('Evalution Parameters')
data_rf_gs

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(18,6))
plt.plot((y_pred_test_rf_gs)[:200])     # tuned Random Forest predictions on the test set
plt.plot((np.array(y_test)[:200]))      # actual values from the test set
plt.legend(["Predicted","Actual"])
plt.title('Actual and Predicted Bike Counts')
plt.tight_layout()
plt.show()

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV, which applies the grid search technique to find the hyperparameter values that give the best model performance.

Our goal is to find the hyperparameter values that give the best prediction results from our model.

It evaluates every combination of the specified hyperparameter values, calculates the performance of each combination, and selects the best one.

This makes the search time-consuming and computationally expensive, and the cost grows with the number of hyperparameters and values involved.

GridSearchCV also performs cross-validation on the training data, so each combination is scored on held-out folds.
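To see why the search is expensive, it helps to count the work involved. The sketch below enumerates the Random Forest grid from the cell above with itertools; the variable names are chosen for the example, and the count excludes the final refit on the full training set.

```python
from itertools import product

# Same grid values as in the Random Forest search above
param_grid = {
    "n_estimators": [50, 80, 100],
    "max_depth": [4, 6, 8],
    "min_samples_split": [50, 100, 150],
    "min_samples_leaf": [40, 50],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", total_fits, "model fits")
# prints: 54 combinations, 270 model fits
```

Adding one more value to any list multiplies the number of combinations, so the grid should be kept as small as the problem allows.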

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.


Before hyperparameter tuning the model was overfitting, with a very large difference between the training and test scores. After tuning, the training score dropped from **0.98** to **0.83** and the test score from **0.87** to **0.82**. A slight difference remains, but the gap is much smaller, so the model generalizes better than before.


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**R2 score:**

* A high R2 score suggests that the model is able to explain a large portion of the variance in the data. In a business context, a high R2 score can indicate that the model is able to make accurate predictions, which could have a positive impact on decision-making.

**Adjusted R2 score:**

* The adjusted R2 score corrects the R2 score for the number of features in the model. In a business context, a high adjusted R2 score can indicate that the model is able to make accurate predictions with a reasonable level of complexity, which could be more practical for deployment in a business setting.

**Mean absolute error (MAE):**

* The MAE is a measure of the average absolute error of the model's predictions.

* In a business context, a low MAE can indicate that the model is making relatively small errors, which could be important if the model is being used to make important decisions.

**Root mean squared error (RMSE):**

* The RMSE is a measure of the average squared error of the model's predictions.

* In a business context, a low RMSE can indicate that the model is making relatively small errors, which could be important if the model is being used to make important decisions.
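The four metrics above can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the toy numbers are assumptions for illustration and are not taken from the bike data.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MAE and RMSE from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of features used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "mae": mae, "rmse": rmse}

# Toy values, for illustration only
y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 12.0, 13.0, 16.0, 20.0]

metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

In this toy example the RMSE comes out larger than the MAE because the single error of 2 is weighted more heavily once it is squared.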

### ML Model - 3

We are implementing the **XGBoost model**.

XGBoost is a popular gradient boosting algorithm that uses an ensemble of decision trees to make predictions.

The XGBRegressor class lets us train a regression model with the XGBoost algorithm and then use it to make predictions on new data.

The model is trained by fitting a sequence of decision trees to the training data, with each new tree trying to correct the errors left by the previous trees.

The final prediction is a weighted sum of the outputs of these individual trees.
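The idea of "each new tree corrects the errors of the previous trees" can be sketched with a deliberately simplified booster. The code below is not XGBoost's actual algorithm (which adds regularization and other refinements); it is a minimal illustration that assumes squared-error loss, a single input feature, and one-split trees (stumps). All function names and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def fit_boosted(x, y, n_trees, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = sum(y) / len(y)               # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for pi, xi in zip(preds, x)]
    return base, learning_rate, stumps

def predict_boosted(model, xi):
    """Final prediction: base value plus the weighted sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

def training_mse(model, x, y):
    return sum((yi - predict_boosted(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy non-linear data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 3.0, 5.0, 9.0, 14.0, 12.0, 7.0, 4.0]

mse_one_tree = training_mse(fit_boosted(x, y, n_trees=1), x, y)
mse_many_trees = training_mse(fit_boosted(x, y, n_trees=30), x, y)
print(mse_one_tree, mse_many_trees)
```

The training error shrinks as trees are added, which is also why boosted models need a held-out test set or cross-validation to detect overfitting.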

In [None]:
# ML Model - 3 Implementation
model_3 = XGBRegressor(objective='reg:squarederror')

# Fit the Algorithm
model_3.fit(X_train, y_train)

# Predict on the model
y_pred_train_xg = model_3.predict(X_train)
y_pred_test_xg = model_3.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Metric Score for train set
r2_train_xg= r2_score(y_train, y_pred_train_xg)
# Adjusted R2 on the train set must use the train set's sample count
adj_r2_train_xg = 1-(1-r2_score(np.square(y_train), np.square(y_pred_train_xg)))*(
    (X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
MSE_train_xg = mean_squared_error(y_train,y_pred_train_xg)
RMSE_train_xg = np.sqrt(MSE_train_xg)
MAE_train_xg = mean_absolute_error(y_train, y_pred_train_xg)

# Metric Score for test set
r2_test_xg = r2_score(y_test, y_pred_test_xg)
adj_r2_test_xg = 1-(1-r2_score(np.square(y_test), np.square(y_pred_test_xg)))*(
    (X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
MSE_test_xg = mean_squared_error(y_test,y_pred_test_xg)
RMSE_test_xg = np.sqrt(MSE_test_xg )
MAE_test_xg = mean_absolute_error(y_test,y_pred_test_xg)

#Converting into readable format
row_a_xg=['r2_score','adj_r2_score','mean_squared_error','RMSE','mean_absolute_error']
row_b_xg=[r2_train_xg,adj_r2_train_xg,MSE_train_xg,RMSE_train_xg,MAE_train_xg]
row_c_xg=[r2_test_xg,adj_r2_test_xg,MSE_test_xg,RMSE_test_xg,MAE_test_xg]

#final dataframe of parameters
data_xg=pd.DataFrame({'Evaluation Parameters': row_a_xg, 'Train':row_b_xg, 'Test':row_c_xg}).set_index('Evaluation Parameters')
data_xg

In [None]:
# Plot between actual target variable vs Predicted value
plt.figure(figsize=(18,6))
# Predicted test values from the XGBoost model
plt.plot(np.array(y_pred_test_xg)[:300])
# Actual test values
plt.plot(np.array(y_test)[:300])
plt.legend(["Predicted","Actual"])
plt.title('Actual and Predicted Bike Counts')
plt.tight_layout()
plt.show()

On the test set, the R2 score is 0.87, and the RMSE and MAE are low, at 4.05 and 2.76 respectively.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum sum of instance weights needed in a child node.
# XGBoost has no min_samples_split or min_samples_leaf parameters (those belong to
# scikit-learn trees); for squared-error loss, min_child_weight plays the role of
# a minimum number of samples per leaf.
min_child_weight = [40,50]

# Hyperparameter Grid
para= {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_child_weight' : min_child_weight}

# Grid search
gs_xg= GridSearchCV(estimator=model_3,
                       param_grid = para,
                       cv = 5, verbose=2, scoring='r2')


# Fit the Algorithm
gs_xg.fit(X_train,y_train)

# Predict on the model
y_pred_train_xg_gs= gs_xg.predict(X_train)
y_pred_test_xg_gs= gs_xg.predict(X_test)

In [None]:
#best estimators
gs_xg.best_estimator_

In [None]:
# Metric Score for train set
r2_train_xg_gs= r2_score(y_train, y_pred_train_xg_gs)
# Adjusted R2 on the train set must use the train set's sample count
adj_r2_train_xg_gs = 1-(1-r2_score(np.square(y_train), np.square(y_pred_train_xg_gs)))*(
    (X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
MSE_train_xg_gs = mean_squared_error(y_train,y_pred_train_xg_gs)
RMSE_train_xg_gs = np.sqrt(MSE_train_xg_gs)
MAE_train_xg_gs = mean_absolute_error(y_train, y_pred_train_xg_gs)

# Metric Score for test set
r2_test_xg_gs = r2_score(y_test, y_pred_test_xg_gs)
adj_r2_test_xg_gs = 1-(1-r2_score(np.square(y_test), np.square(y_pred_test_xg_gs)))*(
    (X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
MSE_test_xg_gs = mean_squared_error(y_test,y_pred_test_xg_gs)
RMSE_test_xg_gs = np.sqrt(MSE_test_xg_gs)
MAE_test_xg_gs = mean_absolute_error(y_test,y_pred_test_xg_gs)


#Converting into readable format
row_a_xg_gs=['r2_score','adj_r2_score','mean_squared_error','RMSE','mean_absolute_error']
row_b_xg_gs=[r2_train_xg_gs,adj_r2_train_xg_gs,MSE_train_xg_gs,RMSE_train_xg_gs,MAE_train_xg_gs]
row_c_xg_gs=[r2_test_xg_gs,adj_r2_test_xg_gs,MSE_test_xg_gs,RMSE_test_xg_gs,MAE_test_xg_gs]

#final dataframe of parameters
data_xg_gs=pd.DataFrame({'Evaluation Parameters':row_a_xg_gs, 'Train':row_b_xg_gs, 'Test':row_c_xg_gs}).set_index('Evaluation Parameters')
data_xg_gs

##### Which hyperparameter optimization technique have you used and why?

As with the previous model, we used GridSearchCV, which performs an exhaustive grid search to find the hyperparameter values that give the best model performance.

The goal is to find the hyperparameter values that give the best cross-validated score, rather than the best fit to the training data alone.

It evaluates every combination of the specified hyperparameter values, scores each combination, and selects the best one.

Because the number of combinations grows multiplicatively with each added hyperparameter, the search can be time-consuming and computationally expensive.

GridSearchCV also performs cross-validation: each combination is trained and scored on several train/validation splits of the training data (here `cv = 5`), and the scores are averaged.
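The cross-validation step can be pictured as splitting the training rows into 5 folds and holding each fold out once for validation. The sketch below shows one way to generate such splits in plain Python; the helper name `k_fold_indices` and the sample count are assumptions for illustration, and the splitting here is contiguous and unshuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-overlapping folds."""
    # The first n_samples % k folds get one extra sample so all rows are used
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Hypothetical sample count, for illustration only
folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "rows")
```

Each row is used for validation exactly once, so the averaged score reflects performance on data the model did not see while fitting.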



### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the **XGBRegressor model** as the final prediction model because it offers better predictive performance, is more robust to overfitting, and captures the complex, non-linear relationships in the data better than the Linear Regression model.

  * R² and Adjusted R²: XGBRegressor has significantly higher R² and Adjusted R² on both the train and test datasets, which indicates it explains more of the variance in the data.

  * Mean Squared Error (MSE): XGBRegressor has a lower MSE on the test set, meaning its predictions are closer to the actual values.

  * Root Mean Squared Error (RMSE): The RMSE is also lower on the test set, suggesting fewer large errors in the predictions.

  * Mean Absolute Error (MAE): XGBRegressor has a lower MAE, which indicates more accurate predictions on average.


### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Here we use the XGBoost model, and for model explainability we use SHAP (SHapley Additive exPlanations) values.**

SHAP (SHapley Additive exPlanations) values are a technique in machine learning used for explaining the output of a model by quantifying the contribution of each input feature to the predicted outcome.
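To show what "contribution of each input feature" means, the sketch below computes exact Shapley values by brute force for a tiny hypothetical model with two features. This is only an illustration of the underlying idea: the `shap` library uses much faster, tree-specific algorithms, and the toy model, feature values, and all-zero baseline here are assumptions.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start from the baseline input
        previous_output = predict(current)
        for f in order:
            current[f] = instance[f]      # switch feature f to its actual value
            new_output = predict(current)
            totals[f] += new_output - previous_output
            previous_output = new_output
    return {f: totals[f] / len(orderings) for f in features}

# Hypothetical toy model with an interaction between the two features
def toy_model(row):
    return 2.0 * row["temperature"] + 3.0 * row["hour"] + row["temperature"] * row["hour"]

instance = {"temperature": 2.0, "hour": 1.0}
baseline = {"temperature": 0.0, "hour": 0.0}

contributions = shapley_values(toy_model, instance, baseline)
print(contributions)
print(sum(contributions.values()), toy_model(instance) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that the summary plot relies on.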

In [None]:
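import shap

# Build a tree-specific explainer for the fitted XGBoost model
explainer = shap.TreeExplainer(model_3)
# SHAP values: one contribution per feature for every row of the test set
shap_values = explainer.shap_values(X_test)

# Summary plot ranks features by their overall impact on the predictions
shap.summary_plot(shap_values, X_test)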

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
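One possible way to fill in the two steps above is with Python's built-in `pickle` module. The sketch below is an assumption rather than the notebook's actual implementation: the helper names, the file name `bike_demand_model.pkl`, and the stand-in object are hypothetical, and in the notebook the fitted estimator (for example `model_3`) would be passed instead.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a fitted model (or any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously saved model from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in object so the sketch runs on its own; in the notebook this would be
# the fitted estimator, e.g. model_3.
stand_in_model = {"name": "demo", "weights": [0.1, 0.2, 0.3]}

# Hypothetical file name inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.pkl")

save_model(stand_in_model, model_path)
restored_model = load_model(model_path)

# Sanity check: the restored object matches what was saved
print(restored_model == stand_in_model)
```

For the real model, a comparable sanity check would be to confirm that the reloaded model's predictions on `X_test` match those of the original model.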

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

### The main goal of the project was to predict bike demand at every hour so that supply can be kept stable. Based on this objective, it was found that:

* The **XGBRegressor model** achieved an ***R-squared value of 0.88***, indicating that the model generalizes well and captures a significant portion of the variance in the target variable.

* Bike rental counts are higher on working days than on weekends.

* Bike demand peaks around 8-9 AM and again around 6-7 PM.

* People rent more bikes in summer than in winter.

* Bike demand is higher on clear days than on snowy or rainy days.

* Temperatures between 22 and 25 °C see the highest demand for bikes.

* The important features that play a crucial role in determining the number of rented bikes are {'Hour', 'Temperature(°C)', 'Humidity', 'Wind_speed','Visibility ', 'Solar_Radiation', 'Rainfall', 'Snowfall', 'Seasons'}.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***