<a href="https://colab.research.google.com/github/1994shuklaanand/Play-Store-App-Reviews-Analysis/blob/main/bike_sharing_demond_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Anand Kumar
##### **Team Member 2 -** Vinay Chaudhari


# **Project Summary -**

A bike-sharing system is a bicycle rental service in which membership, rental, and bike return are handled automatically. Using these systems, people can rent a bike from one location and return it at a different location on an as-needed basis.

Currently, there are over 500 bike-sharing programs around the world. Several bike and scooter sharing services (e.g., Bird, Capital Bikeshare, Citi Bike) have started up lately, especially in metropolitan cities like San Francisco, New York, Chicago, and Los Angeles, and one of the most important problems from a business point of view is predicting bike demand on any particular day.

Having excess bikes wastes resources (both bike maintenance and the land or bike stands required for parking and security), while having too few bikes leads to revenue loss (ranging from a short-term loss due to missing out on immediate customers to a potential longer-term loss of the future customer base). Thus, having an estimate of demand would enable these companies to operate efficiently.

The goal of this project is to combine historical bike usage patterns with weather data to forecast bike rental demand. The data set consists of hourly rental data for Seoul covering one year, from December 2017 to November 2018.
To build a machine learning model on this data, we first gathered and cleaned the data and checked for null values. We then performed in-depth EDA with visuals, gathered insights from it, and carried out data preprocessing.

We then split the data into training and testing sets, chose machine learning algorithms, and trained the models on the training data. Finally, we evaluated each model's performance on the testing data to see how well it predicts bike rental demand.

We used several machine learning algorithms for this task, including Linear Regression, Decision Trees, Random Forests, LightGBM, and XGBoost. More advanced techniques, such as deep learning, could also be applied to the bike sharing demand prediction data.

Overall, building a machine learning model for bike sharing demand prediction required a combination of data processing, machine learning techniques, and model evaluation skills. It was a challenging task, but with the right approach we were able to create a model that predicts hourly bike rental demand.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


# ML evaluation library
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV


# ML Model implementation library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

from xgboost import XGBRegressor
from sklearn.tree import plot_tree

#figure plot library
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = ('/content/drive/MyDrive/Bike Sharing Predection/SeoulBikeData.csv')
dataset = pd.read_csv(path, encoding = "ISO-8859-1")

### Dataset First View

In [None]:
# View the first 5 rows of the dataset
dataset.head()

In [None]:
# View the last 5 rows of the dataset
dataset.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Number of (rows, columns) are',dataset.shape)

### Dataset Information

Date : year-month-day

Rented Bike Count : Count of bikes rented at each hour

Hour : Hour of the day

Temperature(°C) : Temperature in Celsius

Humidity(%) : Relative Humidity%

Wind speed (m/s) : Average Speed of the wind(m/s)

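Visibility (10m) : Visibility in units of 10 meters

Dew point temperature(°C) : Dew point temperature in Celsius

Solar Radiation (MJ/m2) : Solar radiation in megajoules per square meter

Rainfall(mm) : Rainfall in millimeters

Snowfall (cm) : Snowfall in centimeters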

Seasons : Winter, Spring, Summer, Autumn

Holiday : Holiday/No holiday

Functioning Day : NoFunc(Non Functional Hours), Fun(Functional hours)

In [None]:
# Dataset.info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value
dataset.duplicated().sum()

There are no duplicate rows in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

There are no missing or null values in the dataset.

In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 5))
sns.heatmap(dataset.isnull(), cbar=True, yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("missing values in column",fontweight="bold",size=17)
plt.show()

### What did you know about your dataset?

So far, we know the following about the dataset:

1. 'SeoulBikeData' has 8760 rows and 14 columns.
2. There are no duplicate rows in the dataset.
3. There are no null values in the dataset.
4. There are 4 categorical features in the 'SeoulBikeData' dataset: Date, Seasons, Holiday, and Functioning Day.
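
The checks above can be bundled into one small helper so they are easy to re-run after each wrangling step. This is a minimal sketch: `summarize` and the toy frame are hypothetical names that are not part of the notebook, and non-numeric columns are treated as categorical.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect shape, duplicate count, null count, and non-numeric columns."""
    return {
        "shape": df.shape,
        "duplicate_rows": int(df.duplicated().sum()),
        "null_values": int(df.isnull().sum().sum()),
        # Assumption: any column that is not numeric counts as categorical
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }

# Hypothetical toy frame standing in for SeoulBikeData
toy = pd.DataFrame({
    "Seasons": ["Winter", "Winter", "Spring"],
    "Rented Bike Count": [254, 204, 173],
})
summary = summarize(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarize(dataset)`.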







## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all').T

### Variables Description 

The dataset has no null values, so no imputation was needed. New variables (month and weekdays_weekend) are derived from the Date column in the Data Wrangling section.

In [None]:
dataset.head()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
dataset.nunique()

In [None]:
# Check Unique Values for each variable
for i in dataset.columns.tolist():
  print(f"The unique values of {i} are:", dataset[i].unique())
  print()
  print('--'*50)

## 3. ***Data Wrangling***

### Data Wrangling Code

Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. Due to the rapid expansion of the amount of data and data sources available today, storing and organizing large quantities of data for analysis is becoming increasingly necessary.

In [None]:
# Write your code to make your dataset analysis ready.
dataset.head()

In [None]:
# Parse the Date column. The raw strings are described as day/month/year,
# so dayfirst=True keeps 01/12/2017 from being read as 12 January.
dataset['Date'] = pd.to_datetime(dataset['Date'], dayfirst=True)

dataset['year'] = dataset['Date'].dt.year
dataset['month'] = dataset['Date'].dt.month
dataset['day'] = dataset['Date'].dt.day_name()

# We do not need each day name, so convert it into a binary flag: weekday = 0, weekend = 1.
dataset['weekdays_weekend'] = dataset['day'].isin(['Saturday', 'Sunday']).astype(int)

# Drop columns that are no longer needed.
# The data runs from December 2017 to November 2018, so we treat it as a single year and drop 'year'.
dataset = dataset.drop(columns=['Date', 'day', 'year'])

In [None]:
dataset.columns

In [None]:
# Numeric Features

numeric_features= dataset.select_dtypes(exclude='object')
numeric_features

### What all manipulations have you done and insights you found?

Answer : We converted the Date column (day/month/year) to a datetime, derived the new columns month and weekdays_weekend from it, and then dropped Date, day, and year so that month-wise and weekend-wise insights are easier to find.
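
The date handling can be illustrated on a few rows. This is a minimal sketch on hypothetical sample dates; it assumes the raw strings are day/month/year, which is why `dayfirst=True` is passed.

```python
import pandas as pd

# Hypothetical sample rows; the day/month/year layout is an assumption
sample = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "13/01/2018"]})

# dayfirst=True makes 01/12/2017 parse as 1 December, not 12 January
sample["Date"] = pd.to_datetime(sample["Date"], dayfirst=True)
sample["month"] = sample["Date"].dt.month
sample["day"] = sample["Date"].dt.day_name()

# Weekend flag: 1 for Saturday or Sunday, 0 otherwise
sample["weekdays_weekend"] = sample["day"].isin(["Saturday", "Sunday"]).astype(int)
print(sample)
```

Without `dayfirst=True`, an ambiguous string such as 01/12/2017 may be read month-first, which would put rows in the wrong month.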

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Mean Skew of Dataset

In [None]:
# Chart - 1 visualization code
#plotting histogram

for col in numeric_features[:]:
  sns.histplot(dataset[col])
  plt.axvline(dataset[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(dataset[col].median(), color='cyan', linestyle='dashed', linewidth=2)   
  plt.show()

##### 1. Why did you pick the specific chart?

We picked the histogram to see the distribution of each numeric column, with the mean (magenta) and median (cyan) marked as dashed lines.

##### 2. What is/are the insight(s) found from the chart?

For skewed features, the mean and median lines sit visibly apart, with the mean pulled toward the longer tail.
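
The visual impression of skew can be backed by numbers. The following is a minimal sketch on hypothetical values (not taken from the dataset) that compares mean, median, and the sample skewness returned by pandas.

```python
import pandas as pd

# Hypothetical values: rainfall is right-skewed, humidity is symmetric
toy = pd.DataFrame({
    "Rainfall(mm)": [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 9.5],
    "Humidity(%)": [30, 40, 50, 50, 60, 70, 50],
})

report = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # positive means a longer right tail
})
print(report)
```

A mean well above the median together with a positive skew value points to a right-skewed feature, which is what the separated dashed lines show in the histograms.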

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Regression plot of each numeric column against the Rented Bike Count column

for col in numeric_features.columns:
  # Skip the target column itself
  if col == 'Rented Bike Count':
    continue
  sns.regplot(x=dataset[col], y=dataset["Rented Bike Count"], line_kws={"color": "red"})
  plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked the regression plot to see the relationship between the dependent variable and each independent variable, together with the best-fit line.

##### 2. What is/are the insight(s) found from the chart?

Answer : The fitted line in each plot shows the direction of the relationship between the dependent variable and an independent variable. Some independent variables are positively related to the rented bike count, some are negatively related, and some show no clear trend.

Positively related variables:

* Hour
* Temperature(°C)
* Wind speed
* Dew point temperature
* Solar Radiation

Negatively related variables:

* Humidity
* Rainfall
* Snowfall

Neither positively nor negatively related:

* Month
* weekdays_weekend
* Visibility

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The columns 'Hour', 'Temperature', 'Wind_speed', 'Visibility', and 'Solar_Radiation' are positively related to the dependent variable, which means that the rented bike count increases as these features increase.
The columns 'Rainfall', 'Snowfall', and 'Humidity' are negatively related to the dependent variable, which means that the rented bike count decreases as these features increase.
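
The direction read off each regression line can be cross-checked with Pearson correlation coefficients against the target. This is a minimal sketch on hypothetical values; the numbers are illustrative and are not results from the dataset.

```python
import pandas as pd

# Hypothetical values chosen so temperature rises and rainfall falls with demand
toy = pd.DataFrame({
    "Temperature(°C)": [-5.0, 0.0, 10.0, 20.0, 30.0],
    "Rainfall(mm)": [4.0, 3.0, 2.0, 0.0, 0.0],
    "Rented Bike Count": [100, 200, 500, 900, 1200],
})

target = "Rented Bike Count"
corr_with_target = (
    toy.corr(numeric_only=True)[target]  # correlation of every column with the target
    .drop(target)                        # remove the trivial self-correlation
    .sort_values(ascending=False)
)
print(corr_with_target)
```

A positive coefficient matches an upward-sloping regression line and a negative coefficient matches a downward-sloping one.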

#### Chart - 3 Season Vs Rented Bike Count

In [None]:
# Chart - 3 visualization code
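# catplot is a figure-level function and creates its own figure,
# so its size is set with height/aspect instead of plt.figure
sns.catplot(x='Seasons', y='Rented Bike Count', data=dataset, height=6, aspect=1.5)
plt.show()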

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,8))
sns.barplot(x ='Seasons', y = 'Rented Bike Count' , data = dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a categorical plot and a bar chart to compare the number of rented bikes across the four seasons.

##### 2. What is/are the insight(s) found from the chart?

Answer : The average rented bike count differs across the four seasons (Winter, Spring, Summer, Autumn): demand is highest in Summer and lowest in Winter.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Yes. The graph shows that demand is highest in Summer, followed by Autumn and then Spring, and is much lower in Winter. From a business perspective, we can expect higher revenue in Summer, Autumn, and Spring and lower revenue in Winter.

This shows stronger growth in Summer, Autumn, and Spring compared with Winter.

#### Chart - 4 Rented Bike Demand by Month

In [None]:
# Chart - 4 visualization code
# catplot creates its own figure, so size it with height/aspect
sns.catplot(x='month', y='Rented Bike Count', data=dataset, height=6, aspect=1.5)
plt.show()

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,8))
sns.barplot(x='month', y='Rented Bike Count', data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a categorical plot and a bar chart to compare the number of rented bikes across the 12 months.

##### 2. What is/are the insight(s) found from the chart?

Answer : The average rented bike count differs across the 12 months of the year, which shows that demand varies by month in the same way it varies by season.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Yes. The graph shows that bike demand changes from month to month, which is an important insight from a business perspective.

Demand is high in May, June, and July; moderate in March, April, August, September, October, and November; and low in January, February, and December.

So the company should plan its bike supply month by month.

#### Chart - 5 Weekdays Weekend Data Analysis

In [None]:
# Chart - 5 visualization code
# Value counts of the weekdays_weekend flag
print(dataset.weekdays_weekend.value_counts())
print(" ")
# Pie chart of the weekdays_weekend flag
dataset['weekdays_weekend'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Not weekend Days(%)','weekend Days(%)'],
                               colors=['purple','blue'],
                               explode=[0,0]
                              )

##### 1. Why did you pick the specific chart?

Answer : A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read through the area each colored slice covers in the circle. We used a pie chart to see what share of the hourly records falls on weekdays versus weekends.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows that 6216 hourly records (71%) fall on weekdays and 2544 (29%) fall on weekends.

This split mirrors the calendar (five weekdays for every two weekend days). It counts records, not rentals, so on its own it does not show whether demand is higher on weekdays or on weekends.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Most records fall on weekdays, so total weekday rentals will naturally exceed total weekend rentals. To judge the business impact, the average Rented Bike Count per hour should be compared between weekdays and weekends; the pie chart alone does not show that weekends have no demand.
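
Because the pie chart counts hourly records rather than rentals, a per-group average is a more direct way to compare weekday and weekend demand. The following is a minimal sketch on hypothetical values; in the notebook the same calls would run on `dataset`.

```python
import pandas as pd

# Hypothetical hourly records: 0 = weekday, 1 = weekend
toy = pd.DataFrame({
    "weekdays_weekend": [0, 0, 0, 0, 0, 1, 1],
    "Rented Bike Count": [700, 800, 750, 650, 600, 500, 600],
})

# Share of rows per group: this is what the pie chart displays
share_of_rows = toy["weekdays_weekend"].value_counts(normalize=True)

# Demand per group: number of records, mean rentals per hour, total rentals
demand = toy.groupby("weekdays_weekend")["Rented Bike Count"].agg(["count", "mean", "sum"])
print(share_of_rows)
print(demand)
```

The `mean` column is the figure to compare across groups, since `count` and `sum` both grow with the number of days in each group.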

#### Chart - 6 Demand on Functioning and Non-Functioning Days

In [None]:
# Chart - 6 visualization code
# Value counts of the Functioning Day column
print(dataset['Functioning Day'].value_counts())
# Pie chart of the Functioning Day column
dataset['Functioning Day'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Functional Day(%)','Not Functional Day(%)'],
                               colors=['purple','green'],
                               explode=[0,0]
                              )

##### 1. Why did you pick the specific chart?

Answer : A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read through the area each colored slice covers in the circle. We used a pie chart to see what share of the hourly records falls on functioning days versus non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows the share of hourly records recorded on functioning days versus non-functioning days; the exact counts are printed by the `value_counts()` call above.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The chart shows how often the service was not functioning. As with the other pie charts, it counts records rather than rentals, so it does not measure demand by itself.

#### Chart - 7 Demand during Non-Holidays and Holidays

In [None]:
# Chart - 7 visualization code
# Value counts of the Holiday column
print(dataset['Holiday'].value_counts())
# Pie chart of the Holiday column
dataset['Holiday'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['No Holiday(%)','Holiday(%)'],
                               colors=['purple','green'],
                               explode=[0,0]
                              )

##### 1. Why did you pick the specific chart?

Answer : A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read through the area each colored slice covers in the circle. We used a pie chart to see what share of the hourly records falls on holidays versus non-holidays.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows that 8328 hourly records (95.1%) fall on non-holidays and 432 (4.9%) fall on holidays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Because only about 5% of the records are holidays, most rentals happen on non-holidays. The chart counts records rather than rentals, so it does not show that demand is absent on holidays; the average Rented Bike Count on holidays and non-holidays would need to be compared to measure the impact.

#### Chart - 8 Average Bikes Rented per Hour

In [None]:
# Chart - 8 visualization code
avg_rent_hrs = dataset.groupby('Hour')['Rented Bike Count'].mean()

# plot average rent over time(hrs)
plt.figure(figsize=(20,4))
a=avg_rent_hrs.plot(legend=True,marker='o',title="Average Bikes Rented Per Hr")
a.set_xticks(range(len(avg_rent_hrs)));
a.set_xticklabels(avg_rent_hrs.index.tolist(), rotation=85);

In [None]:
# Chart  - 8 visualization code 
# catplot creates its own figure, so size it with height/aspect
sns.catplot(x='Hour', y='Rented Bike Count', data=dataset, height=6, aspect=2)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a line plot and a categorical plot because the line shows how average demand changes continuously across the hours of the day, and the categorical plot shows the spread of rentals within each hour.

##### 2. What is/are the insight(s) found from the chart?

Answer : Rented bike counts rise sharply from 8:00 a.m. and stay high until about 9:00 p.m., which suggests that people prefer rented bikes for commuting during rush hours.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Demand is highest during the hours when people travel to and from the office, so keeping enough bikes available in those hours should have a positive business impact.
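
The busiest hours can also be listed directly instead of being read off the line plot. This is a minimal sketch on hypothetical values; the hours and counts are illustrative and are not results from the dataset.

```python
import pandas as pd

# Hypothetical records for a few hours across two days
toy = pd.DataFrame({
    "Hour": [7, 8, 9, 17, 18, 19, 7, 8, 9, 17, 18, 19],
    "Rented Bike Count": [300, 900, 500, 800, 1400, 1000,
                          340, 1100, 560, 900, 1600, 1200],
})

# Average rentals per hour, the same series the line plot is drawn from
avg_by_hour = toy.groupby("Hour")["Rented Bike Count"].mean()

# The three hours with the highest average demand
peak_hours = avg_by_hour.nlargest(3)
print(peak_hours)
```

In the notebook, `avg_rent_hrs.nlargest(3)` would give the same kind of summary for the full data.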

#### Chart - 9 Demand of bikes during Rainfall

In [None]:
# Chart - 9 visualization code
dataset.columns


In [None]:
# Chart - 9 visualization code
# catplot creates its own figure, so size it with height/aspect
sns.catplot(x='Rainfall(mm)', y='Rented Bike Count', data=dataset, height=6, aspect=2)
plt.show()

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,8))
sns.barplot(x='Rainfall(mm)', y='Rented Bike Count',data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a categorical plot and a bar chart to find the relationship between the rented bike count and rainfall.

##### 2. What is/are the insight(s) found from the chart?

Answer : As rainfall increases, the demand for rented bikes decreases.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The graph shows that demand for rented bikes decreases as rainfall increases, which has a negative impact on the business.

#### Chart - 10 Demand of bikes during Snowfall

In [None]:
# Chart - 10 visualization code
# catplot creates its own figure, so size it with height/aspect
sns.catplot(x='Snowfall (cm)', y='Rented Bike Count', data=dataset, height=6, aspect=2)
plt.show()

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8,8))
sns.barplot(x='Snowfall (cm)',y='Rented Bike Count',data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a categorical plot and a bar chart to find the relationship between the rented bike count and snowfall.

##### 2. What is/are the insight(s) found from the chart?

Answer : When there is snowfall, the demand for rented bikes decreases.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The graph shows that demand for rented bikes decreases when there is snowfall, which has a negative impact on the business.

#### Chart - 11 Effect in Rented Bike with respect to temperature

In [None]:
# Chart - 11 visualization code
# catplot creates its own figure, so size it with height/aspect
sns.catplot(x='Temperature(°C)', y='Rented Bike Count', data=dataset, height=6, aspect=2)
plt.show()

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8,8))
sns.lineplot(x='Temperature(°C)',y='Rented Bike Count',data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : We picked a categorical plot and a line chart to find the relationship between the rented bike count and temperature.

##### 2. What is/are the insight(s) found from the chart?

Answer : The graph shows that the rented bike count rises with temperature: when the temperature is high, demand is high.

#### Chart - 12

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,10))
# numeric_only=True skips the text columns (Seasons, Holiday, Functioning Day)
sns.heatmap(dataset.corr(numeric_only=True), cmap='PiYG', annot=True)

The heatmap shows that Temperature and Dew point temperature are highly correlated, so we can drop one of them.

##### 1. Why did you pick the specific chart?

Answer : We picked the heatmap to see how strongly each variable is correlated with every other variable.

##### 2. What is/are the insight(s) found from the chart?

Answer : Temperature and Dew point temperature are highly correlated with each other, so one of them should be removed. Temperature and Hour have a strong effect on the dependent variable 'Rented Bike Count'.

In [None]:
dataset.columns

In [None]:
# Dropping the highly correlated feature to reduce multicollinearity
dataset = dataset.drop(columns=['Dew point temperature(°C)'])
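
The choice of which feature to drop can be supported by listing every pair of features whose correlation exceeds a threshold. This is a minimal sketch: `correlated_pairs`, the 0.8 threshold, and the toy values are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical values: dew point tracks temperature closely, wind speed does not
toy = pd.DataFrame({
    "Temperature(°C)": [-5.0, 0.0, 10.0, 20.0, 30.0],
    "Dew point temperature(°C)": [-12.0, -6.0, 3.0, 14.0, 22.0],
    "Wind speed (m/s)": [2.0, 0.5, 3.0, 1.0, 2.5],
})
high_pairs = correlated_pairs(toy)
print(high_pairs)
```

Each pair returned is a candidate for dropping one of its two members before fitting linear models.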

#### Chart - 13 pair plot

In [None]:
dataset.head()

In [None]:
# pair plot visulization code
sns.pairplot(dataset)

##### 1. Why did you pick the specific chart?

Answer : A pair plot is used to understand which set of features best explains the relationship between two variables or forms the most separated clusters. It also helps to form simple classification models by drawing simple lines or linear separations in the dataset.

We used the pair plot to analyse the patterns in the data and the relationships between the features. It conveys similar information to the correlation heatmap, but as scatter plots rather than coefficients.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows that there are few linear relationships between the variables and that the data points are not linearly separable.

## ***5. Hypothesis Testing***

Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.



We have different statistical tests for different scenarios:

* Single categorical feature -> One proportion test
* Two categorical features -> Chi-squared test
* One numerical feature -> T-test
* Two numerical features -> Correlation test
* One numerical and one categorical (2 categories) feature -> T-test
* One numerical and one categorical (>2 categories) feature -> ANOVA test

Let's define three hypothetical statements and perform the needed tests for them.
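
The difference between the two-category and multi-category cases listed above can be sketched with SciPy. This is a minimal illustration on hypothetical values, not a result from the dataset: a Welch t-test compares two groups, and a one-way ANOVA compares three or more.

```python
import pandas as pd
from scipy.stats import f_oneway, ttest_ind

# Hypothetical records with a two-level and a three-level categorical feature
toy = pd.DataFrame({
    "Holiday": ["No Holiday", "No Holiday", "No Holiday",
                "Holiday", "Holiday", "Holiday"],
    "Seasons": ["Winter", "Summer", "Autumn", "Winter", "Summer", "Autumn"],
    "Rented Bike Count": [220, 1010, 820, 180, 950, 760],
})

# Two categories: Welch's two-sample t-test (does not assume equal variances)
holiday_groups = [g["Rented Bike Count"] for _, g in toy.groupby("Holiday")]
t_stat, t_p = ttest_ind(*holiday_groups, equal_var=False)

# More than two categories: one-way ANOVA across all groups
season_groups = [g["Rented Bike Count"] for _, g in toy.groupby("Seasons")]
f_stat, f_p = f_oneway(*season_groups)

print(f"t-test p = {t_p:.3f}, ANOVA p = {f_p:.5f}")
```

In this toy data the seasons differ strongly while the holiday groups overlap, so only the ANOVA p-value falls below 0.05.
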

#### Statement 1

* Null Hypothesis: There is no relationship between temperature and rented bike count.
* Alternate Hypothesis: There is a relationship between temperature and rented bike count.

#### Statement 2

* Null Hypothesis: There is no relationship between holiday and rented bike count.
* Alternate Hypothesis: There is a relationship between holiday and rented bike count.

#### Statement 3

* Null Hypothesis: There is no relationship between wind speed and rented bike count.
* Alternate Hypothesis: There is a relationship between wind speed and rented bike count.

### Hypothetical Statement - 1

##### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis : There is no relationship between "Temperature" and "Rented Bike Count".
* Alternate Hypothesis : There is a relationship between "Temperature" and "Rented Bike Count".

##### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Note: only the first 100 rows are used, so the result describes that slice of the data
first_sample1 = dataset["Temperature(°C)"].head(100)
second_sample1 = dataset["Rented Bike Count"].head(100)

stat, p = pearsonr(first_sample1, second_sample1)
print('stat=%.3f, p = %.5f' % (stat, p))
# A p-value above 0.05 means the null hypothesis cannot be rejected
if p > 0.05:
  print('Fail to reject the Null Hypothesis')
else:
  print('Reject the Null Hypothesis')


The above statistical test rejects the null hypothesis on the first 100 rows: 'Rented Bike Count' is correlated with 'Temperature'.

##### Which statistical test have you done to obtain P-Value?

Answer : Pearson correlation

##### Why did you choose the specific statistical test?

Answer : Pearson correlation measures the strength of the linear relationship between two numeric variables, and its p-value tests the null hypothesis of zero correlation.

### Hypothetical Statement - 2

##### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis : There is no relationship between "Holiday" and "Rented Bike Count".
* Alternate Hypothesis : There is a relationship between "Holiday" and "Rented Bike Count".

##### 2. Perform an appropriate statistical test.

In [None]:
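# Perform Statistical Test to obtain P-Value
from scipy.stats import spearmanr

# Holiday is a text column, so encode it as integer codes before ranking.
# All rows are used because holidays are rare and a short slice may contain only one category.
first_sample2 = pd.factorize(dataset['Holiday'])[0]
second_sample2 = dataset["Rented Bike Count"]

stat, p = spearmanr(first_sample2, second_sample2)
print('stat=%.3f, p = %.2f' % (stat, p))
# A p-value above 0.05 means the null hypothesis cannot be rejected
if p > 0.05:
  print('Fail to reject the Null Hypothesis')
else:
  print('Reject the Null Hypothesis')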




The above statistical test rejects the null hypothesis: 'Rented Bike Count' is correlated with 'Holiday'.

##### Which statistical test have you done to obtain P-Value?

Answer : Spearman correlation

##### Why did you choose the specific statistical test?

Answer : Spearman correlation is rank-based, so it does not assume a linear relationship and can be applied once the two-level Holiday column is encoded as numbers.

#Hypothetical Statement - 3

##### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: There is no relationship between "Wind speed" and "Rented Bike Count"

Alternate Hypothesis: There is a relationship between "Wind speed" and "Rented Bike Count"







##### 2. Perform an appropriate statistical test

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr
first_sample = dataset["Wind speed (m/s)"].head(100)
second_sample = dataset["Rented Bike Count"].head(100)

stat, p = pearsonr(first_sample, second_sample)
print('stat=%.3f, p = %.2f'%(stat, p))
if p > 0.05:
  print('Fail to reject the Null Hypothesis')
else:
  print('Reject the Null Hypothesis')

The above statistical test indicates that 'Rented Bike Count' does not depend on 'Wind speed', that is, wind speed is not significantly correlated with rented bike count in this sample.

### Which statistical test have you done to obtain P-Value?

Answer : Pearson correlation (`pearsonr`)

### Why did you choose the specific statistical test?

Answer : To measure the strength of the linear relationship between the two series being tested, here wind speed and rented bike count.

# 6. Feature Engineering & Data Pre-processing

## 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
dataset.isna().sum()


As we can see, there are no null values present in the dataset, so there is no requirement to handle missing values.


## What all missing value imputation techniques have you used and why did you use those techniques?

Answer : The dataset has no missing values, so no imputation was needed. In general, when we work with a large dataset there is a chance that missing values are present, and the following techniques can be used to handle them:

1. Deleting rows with missing values.
2. Imputing missing values for continuous variables.
3. Using algorithms that support missing values.
4. Predicting the missing values.
5. Imputation using a deep learning library.
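
The first two techniques in the list above can be sketched in a few lines. This is only an illustration on made-up humidity readings (our dataset has no missing values); the variable names are hypothetical and `None` is used to mark a missing entry.

```python
from statistics import median

# Hypothetical humidity readings; None marks a missing entry (not taken from the dataset)
humidity = [37.0, 38.0, None, 40.0, 36.0, None, 35.0]

# Technique 1: delete the entries that are missing
observed = [v for v in humidity if v is not None]

# Technique 2: impute missing entries with the median of the observed values
fill_value = median(observed)
imputed = [fill_value if v is None else v for v in humidity]

print("entries kept after deletion:", len(observed))
print("median used for imputation:", fill_value)
print("imputed series:", imputed)
```

Deletion shrinks the sample, while median imputation keeps every row at the cost of slightly reducing the variance of the feature. Which one is preferable depends on how many values are missing.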













## 2. Handling Outliers

## What all outlier treatment techniques have you used and why did you use those techniques?

In [None]:
dataset.info()

In [None]:
# Define the variable groups
continuous_variable = ['Wind speed (m/s)','Solar Radiation (MJ/m2)','Rainfall(mm)','Temperature(°C)','Visibility (10m)','Humidity(%)','Hour','Snowfall (cm)','Rented Bike Count']
categorical_variable = ['Seasons','Holiday','Functioning Day','Weekdays_weekend','Month']
object_data = ['Seasons','Month','Holiday','Functioning Day']

In [None]:
# code to find outliers
plt.figure(figsize=(30,15))
for n,column in enumerate(dataset.describe().columns):
  plt.subplot(5, 4, n+1)
  sns.boxplot(dataset[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()

In [None]:
# defining the code for outlier detection and percentage using IQR.
def detect_outliers(data):
    outliers = []
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q2 = np.percentile(data, 50)
    q3 = np.percentile(data, 75)
    print(f"q1:{q1}, q2:{q2}, q3:{q3}")

    IQR = q3-q1
    lwr_bound = q1-(1.5*IQR)
    upr_bound = q3+(1.5*IQR)
    print(f"Lower bound: {lwr_bound}, Upper bound: {upr_bound}, IQR: {IQR}")

    for i in data: 
        if (i<lwr_bound or i>upr_bound):
            outliers.append(i)
    len_outliers= len(outliers)
    print(f"Total number of outliers are: {len_outliers}")

    print(f"Total percentage of outlier is: {round(len_outliers*100/len(data),2)} %")

In [None]:
# Determining the IQR, lower and upper bounds and the number of outliers present in each continuous numerical feature
for feature in continuous_variable:
  print(feature,":")
  detect_outliers(dataset[feature])
  print("\n")

The continuous features below contain outliers, with the following percentages:

1. "Wind Speed" - 1.84%
2. "Solar Radiation" - 7.32%
3. "Rainfall" - 6.03%
4. "Snowfall" - 5.06%
5. "Rented Bike Count" - 1.8%

Let's define a function for outlier treatment using the IQR technique: values below the lower bound are replaced with the 25th percentile and values above the upper bound with the 75th percentile.









In [None]:
# Defining the function that treats outliers with the IQR technique
def treat_outliers_iqr(data):
    # Calculate the first and third quartiles
    q1, q3 = np.percentile(data, [25, 75])

    # Calculate the interquartile range (IQR)
    iqr = q3 - q1

    # Bounds beyond which a value is treated as an outlier
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # Replace low outliers with q1 and high outliers with q3
    treated_data = [q1 if x < lower_bound else q3 if x > upper_bound else x for x in data]

    # Note: int() truncates the fractional part, so float features lose their decimals here
    treated_data_int = [int(value) for value in treated_data]

    return treated_data_int

In [None]:
# Passing each feature in continuous_variable through the outlier treatment function defined above
for feature in continuous_variable:
  dataset[feature]= treat_outliers_iqr(dataset[feature])

In [None]:
plt.figure(figsize=(30,15))
for n,column in enumerate(dataset.describe().columns):
  plt.subplot(5, 4, n+1)
  sns.boxplot(dataset[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()

In [None]:
# Rechecking the total number of outliers and its percentage present in our dataset.
for feature in continuous_variable:
  print(feature,":")
  detect_outliers(dataset[feature])
  print("\n")

## Bivariate analysis of Outliers

In [None]:
categorical_variable =['Seasons','Holiday','Functioning Day']

In [None]:

# Checking the outliers present in each category
plt.figure(figsize=(18,14))
for i,j in enumerate(categorical_variable):
  plt.subplot(2,2,i+1)
  sns.boxplot(x=dataset[j], y=dataset["Rented Bike Count"])
  plt.title(f"Box plot for {j} feature")

In [None]:
# Determining the IQR, lower and upper bounds and the number of outliers present in each category of the object dtype features
for feature in categorical_variable:
  print(f"Feature: {feature}")
  for num,cat in enumerate(dataset[feature].unique().tolist()):
    print(f"{num+1}: Category: {cat}")
    detect_outliers(dataset[dataset[feature]==cat]["Rented Bike Count"])
    print("\n")

Although some outliers remain within individual categories, we will not treat them: the ML models we are going to implement can handle these category-level outliers, and leaving them in avoids further information loss.

## What all outlier treatment techniques have you used and why did you use those techniques?

Outliers were present in some of the continuous features, i.e. "Wind Speed", "Solar Radiation", "Rainfall", "Snowfall" and "Rented Bike Count", with percentages of 1.84%, 7.32%, 6.03%, 5.06% and 1.8% respectively.

We defined two separate functions, one for outlier detection and the other for outlier treatment using the IQR, and passed all the observations of the continuous features through them. Values below the lower bound (Q1 - 1.5 * IQR) were capped at the 25th percentile, and values above the upper bound (Q3 + 1.5 * IQR) were capped at the 75th percentile.
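
The capping rule described above can be checked by hand on a tiny example. The sketch below is a plain-Python illustration, not the notebook's own function: the helper names and the wind-speed readings are hypothetical, `statistics.quantiles` with `method="inclusive"` is used as a stand-in for `np.percentile`, and no integer cast is applied. It also shows a common alternative, clipping to the fences instead of to the quartiles.

```python
from statistics import quantiles

def iqr_bounds(values):
    """Return (q1, q3, lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

def replace_with_quartiles(values):
    """Replace low outliers with q1 and high outliers with q3."""
    q1, q3, lower, upper = iqr_bounds(values)
    return [q1 if v < lower else q3 if v > upper else v for v in values]

def clip_to_fences(values):
    """Alternative: clip outliers to the fences instead of the quartiles."""
    _, _, lower, upper = iqr_bounds(values)
    return [min(max(v, lower), upper) for v in values]

# Hypothetical wind-speed readings with one extreme value
wind_speed = [1.0, 2.0, 3.0, 4.0, 100.0]

print(iqr_bounds(wind_speed))
print(replace_with_quartiles(wind_speed))
print(clip_to_fences(wind_speed))
```

Replacing with the quartile pulls an extreme value well inside the distribution, while clipping to the fence keeps it at the edge of the accepted range. Both remove the outlier as defined by the 1.5 * IQR rule.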

## 3. Categorical Encoding

In [None]:
#Extracting categorical features
categorical_features= dataset.select_dtypes(include='object')
categorical_features

In [None]:
# Encode your categorical columns of Season 
dataset['Winter'] = np.where(dataset['Seasons']=='Winter',1,0)
dataset['Spring'] = np.where(dataset['Seasons']=='Spring',1,0)
dataset['Summer'] = np.where(dataset['Seasons']=='Summer',1,0)
dataset['Autumn'] = np.where(dataset['Seasons']=='Autumn',1,0)

# Drop the original column Season from the dataframe
dataset.drop(columns=['Seasons'], axis=1, inplace=True)

In [None]:
# find the all unique categorical data  of Holiday
dataset['Holiday'].unique()

In [None]:
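# Encode the categorical column Holiday
# The new column is named ' Holiday ' (with spaces) so it does not clash with the original 'Holiday' column
dataset[' Holiday '] = np.where(dataset['Holiday']=='Holiday' ,1,0)
# Compare against 'No Holiday', assuming the label has no leading space (see the unique() output above)
dataset['No Holiday'] = np.where(dataset['Holiday']=='No Holiday' ,1,0)

# Drop the original column Holiday from the dataframe
dataset.drop(columns=['Holiday'],axis=1, inplace=True)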

In [None]:
dataset['Functioning Day'].unique()

In [None]:
# Encode the categorical column Functioning Day
dataset['Functional day'] = np.where(dataset['Functioning Day']=='Yes',1,0)
dataset['Not Functional day'] = np.where(dataset['Functioning Day']=='No',1,0)

# Drop the original column Functioning Day from the dataframe
dataset.drop(columns=['Functioning Day'],axis=1, inplace=True)

In [None]:
dataset.head()

In [None]:
# Encode your categorical columns of Hour, Month, Weekend column
cols=['Hour','month','weekdays_weekend']
for col in cols:
  dataset[col]=dataset[col].astype('category')

In [None]:
# Using Pandas get Dummies for Encoding categorical features
dataset = pd.get_dummies(dataset,drop_first=True,sparse=True)
dataset.head()

In [None]:
dataset.columns

In [None]:
dataset.shape


### What all categorical encoding techniques have you used & why did you use those techniques?

Answer :

a. We have used the one-hot encoding technique to convert our categorical features of object type into integer indicator (dummy) columns, so that they can be fed into the various ML algorithms later.

b. Since all the categorical features have only a small number of unordered categories, nominal (one-hot) encoding is a better fit than ordinal encoding.
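
To show what one-hot encoding produces, and what the `drop_first=True` option changes, here is a minimal plain-Python sketch. The function `one_hot` and the sample values are hypothetical and only mimic the idea; the notebook itself relies on `np.where` and `pd.get_dummies`.

```python
def one_hot(values, drop_first=False):
    """Return {category: 0/1 list} with one indicator column per category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline (all remaining indicators are 0)
        categories = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical sample of the 'Seasons' column
seasons = ["Winter", "Spring", "Summer", "Autumn", "Winter"]

full = one_hot(seasons)
reduced = one_hot(seasons, drop_first=True)

print(sorted(full))     # four indicator columns
print(sorted(reduced))  # three indicator columns, 'Autumn' is the baseline
```

With all four indicators, every row sums to exactly 1, so one column is redundant once a model has an intercept. Dropping the first category removes that redundancy, which matters mostly for linear models.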

# Text Normalization

In [None]:
#Distribution plot of Rented Bike Count

plt.figure(figsize=(10,6))
sns.histplot(dataset['Rented Bike Count'], kde=True)  # histplot replaces the deprecated distplot

In [None]:
#Applying square root to Rented Bike Count to improve skewness
plt.figure(figsize=(10,6))
sns.histplot(np.sqrt(dataset['Rented Bike Count']), kde=True)  # histplot replaces the deprecated distplot

## Data Transformation

## Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
dataset.columns

In [None]:
# Apply log(1 + x) to the target variable to reduce its skewness
dataset['Rented Bike Count']=np.log1p(dataset['Rented Bike Count'])
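
Because the target is now stored on a log scale, predictions and errors are also on that scale until they are transformed back. The sketch below illustrates the round trip with the standard-library equivalents `math.log1p` and `math.expm1` on hypothetical counts; in the notebook, `np.expm1` would play the same role on arrays of predictions.

```python
import math

# Hypothetical hourly rental counts, including an hour with zero rentals
counts = [0, 9, 99, 999]

# Forward transform used for the target: log(1 + x), which is defined at x = 0
transformed = [math.log1p(c) for c in counts]

# Inverse transform: exp(x) - 1 maps values on the log scale back to counts
recovered = [math.expm1(t) for t in transformed]

for c, t, r in zip(counts, transformed, recovered):
    print(f"count={c:4d}  log1p={t:.3f}  back-transformed={r:.1f}")
```

Metrics such as MSE or RMSE computed on the transformed target are in log units, so they are not directly comparable with errors measured in number of bikes.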

## Data Scaling

In [None]:
# Separate the features (X) from the target (y); the scaling itself is applied after the train/test split
X = dataset.drop(columns = ['Rented Bike Count'] , axis = 1)
y = dataset['Rented Bike Count']

##  Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train , X_test, y_train, y_test =train_test_split(X, y, test_size= .2 , random_state =0 )
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
y_train.shape

### What data splitting ratio have you used and why?

Answer : We have taken 80% of the data for training and 20% for testing, which is a commonly used split ratio.

##  ML Model Implementation

In [None]:
# Importing essential libraries to check the accuracy
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error 

In [None]:
# Defining the function that calculates regression metrics
def regression_metrics(y_train_actual,y_train_pred,y_test_actual,y_test_pred):
  print("-"*50)
  ## mean_absolute_error
  MAE_train= mean_absolute_error(y_train_actual,y_train_pred)
  print("MAE on train is:" ,MAE_train)
  MAE_test= mean_absolute_error(y_test_actual,y_test_pred)
  print("MAE on test is:" ,MAE_test)

  print("-"*50)

  ## mean_squared_error
  MSE_train= mean_squared_error(y_train_actual, y_train_pred)
  print("MSE on train is:" ,MSE_train)
  MSE_test  = mean_squared_error(y_test_actual, y_test_pred)
  print("MSE on test is:" ,MSE_test)

  print("-"*50)

  ## root_mean_squared_error
  RMSE_train = np.sqrt(MSE_train)
  print("RMSE on train is:" ,RMSE_train)
  RMSE_test = np.sqrt(MSE_test)
  print("RMSE on test is:" ,RMSE_test)

  print("-"*50)

  ## mean_absolute_percentage_error
  MAPE_train = mean_absolute_percentage_error(y_train_actual, y_train_pred)*100
  print("MAPE on train is:" ,MAPE_train, " %")
  MAPE_test = mean_absolute_percentage_error(y_test_actual, y_test_pred)*100
  print("MAPE on test is:" ,MAPE_test, " %")

  print("-"*50)

  ## r2_score
  R2_train= r2_score(y_train_actual,y_train_pred)
  print("R2 on train is:" ,R2_train)
  R2_test= r2_score(y_test_actual,y_test_pred)
  print("R2 on test is:" ,R2_test)

  print("-"*50)

  ## Adjusted R2_score (sample and feature counts come from the global X_train / X_test)
  Adj_R2_train = (1-(1-R2_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
  print( 'Adjusted R2 on train is :', Adj_R2_train)
  Adj_R2_test = (1-(1-R2_test)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
  print( 'Adjusted R2 on test is :', Adj_R2_test)

  print("-"*50)

In [None]:
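# Scale the features to the [0, 1] range; the scaler is fitted on the training data only
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)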
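
The cell above fits the scaler on the training data and only transforms the test data. A minimal plain-Python sketch of the same idea is given below; the helper names `fit_min_max` and `apply_min_max` and the temperature values are hypothetical and only illustrate the min-max formula.

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the training data only."""
    return min(train_column), max(train_column)

def apply_min_max(column, low, high):
    """Scale values with (x - min) / (max - min) using the learned range."""
    return [(x - low) / (high - low) for x in column]

# Hypothetical temperature values (degrees C) for a train and a test split
train_temperature = [-10.0, 0.0, 10.0, 30.0]
test_temperature = [5.0, 35.0]

low, high = fit_min_max(train_temperature)
train_scaled = apply_min_max(train_temperature, low, high)
test_scaled = apply_min_max(test_temperature, low, high)

print(train_scaled)  # all values fall inside [0, 1]
print(test_scaled)   # a test value above the training maximum scales above 1
```

Reusing the training range on the test set means test values can fall slightly outside [0, 1]. That is expected, and it avoids leaking information about the test data into the preprocessing step.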

## Linear Regression

In [None]:
# Importing LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# ML Model - 1 Implementation
regressor= LinearRegression()

# Fit the Algorithm
regressor.fit(X_train,y_train)

# Predict the model
y_train_regression_pred= regressor.predict(X_train)
y_test_regression_pred= regressor.predict(X_test)

In [None]:
regressor.score(X_train, y_train)

In [None]:
regressor.coef_

In [None]:
regressor.intercept_

## 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Calculating the regression metrics
regression_metrics(y_train,y_train_regression_pred,y_test,y_test_regression_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_regression_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

We started with the most basic ML model, i.e. Linear Regression, and evaluated the most important regression metrics on both the train and test datasets. For Linear Regression, the r2 scores on train and test are close to each other, which suggests that the model generalises to the test dataset and is not overfitting.

The maximum r2 score we got with the Linear Regression implementation is 0.83.

In order to get better and more accurate results, we shall go for cross-validation and hyperparameter tuning of the 'Lasso', 'Ridge' and 'Elastic Net' models.
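
The regularised models mentioned above differ from plain Linear Regression by a penalty on the size of the coefficients. The sketch below works this out for the simplest possible case, a single feature with no intercept, where the solutions have closed forms. The function names, the exact objective scaling written in the docstrings and the toy data are assumptions for illustration; library implementations may scale the penalty differently.

```python
def ols_slope(x, y):
    """Least-squares slope for a single feature with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_slope(x, y, alpha):
    """Minimiser of sum((y - b*x)^2) + alpha * b^2: the slope shrinks towards 0."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def lasso_slope(x, y, alpha):
    """Minimiser of 0.5 * sum((y - b*x)^2) + alpha * |b| (soft-thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - alpha, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

# Hypothetical one-feature data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

print("OLS slope:", round(ols_slope(x, y), 3))
for alpha in (1.0, 10.0, 100.0):
    print(alpha, round(ridge_slope(x, y, alpha), 3), round(lasso_slope(x, y, alpha), 3))
```

As alpha grows, the ridge slope keeps shrinking but never reaches zero, while the lasso slope becomes exactly zero once the penalty is large enough. Elastic Net combines both penalties.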

## 2. Cross- Validation & Hyperparameter Tuning

## Ridge (L2) Regression

In [None]:
# import ridge regression from sklearn library
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Creating Ridge instance
ridge= Ridge()

# Defining parameters
parameters = {"alpha": [1e-1,1,5,7,10,11,14,15,16,17], "max_iter":[1,2,3]}

# Train the model
ridgeR = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridgeR.fit(X_train,y_train)

# Predict the output
y_train_ridge_pred = ridgeR.predict(X_train)
y_test_ridge_pred = ridgeR.predict(X_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {ridgeR.best_params_}")
print(f"Negative mean square error is: {ridgeR.best_score_}")

In [None]:
# Calculating regression metrics for Ridge
regression_metrics(y_train,y_train_ridge_pred,y_test,y_test_ridge_pred)

## Lasso (L1) Regression

In [None]:
# import lasso regression from sklearn library
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Creating Lasso instance
lasso= Lasso()

# Defining parameters
parameters_lasso = {"alpha": [1e-5,1e-4,1e-3,1e-2,1e-1,1,5], "max_iter":[7,8,9,10]}

# Train the model
lassoR = GridSearchCV(lasso, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
lassoR.fit(X_train,y_train)

# Predict the output
y_train_lasso_pred = lassoR.predict(X_train)
y_test_lasso_pred = lassoR.predict(X_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {lassoR.best_params_}")
print(f"Negative mean square error is: {lassoR.best_score_}")

In [None]:
# Calculating regression metrics for Lasso
regression_metrics(y_train,y_train_lasso_pred,y_test,y_test_lasso_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_lasso_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

## Elastic Net Regression

In [None]:
# import elastic net regression from sklearn library
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Creating e_net instance
e_net= ElasticNet()

# Defining hyperparameters
parameters_e_net = {"alpha": [1e-5,1e-4,1e-3,1e-2,1,5], "max_iter":[12,13,14,15]}

# Train the model
e_netR = GridSearchCV(e_net, parameters_e_net, scoring='neg_mean_squared_error', cv=5)
e_netR.fit(X_train,y_train)

# Predict the output
y_train_e_net_pred = e_netR.predict(X_train)
y_test_e_net_pred = e_netR.predict(X_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {e_netR.best_params_}")
print(f"Negative mean square error is: {e_netR.best_score_}")

In [None]:
# Calculating regression metrics for Elastic Net
regression_metrics(y_train,y_train_e_net_pred,y_test,y_test_e_net_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_e_net_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

### Which hyperparameter optimization technique have you used and why?

Answer : We have used GridSearchCV as the hyperparameter optimization technique because it tries every combination of the hyperparameter values in the supplied grid, evaluates each combination with cross-validation, and selects the best-scoring one. The search is exhaustive over the grid, so the result is only as good as the candidate values provided.
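
To get a feel for how much work an exhaustive search does, the following sketch enumerates the candidates of the Ridge grid used earlier with `itertools.product`. The helper `expand_grid` is a hypothetical name; it only mirrors the enumeration step, not the fitting or scoring that GridSearchCV performs.

```python
from itertools import product

def expand_grid(param_grid):
    """List every combination of values in a parameter grid."""
    names = list(param_grid)
    return [dict(zip(names, combo)) for combo in product(*(param_grid[n] for n in names))]

# The Ridge grid from the earlier cell
ridge_grid = {"alpha": [1e-1, 1, 5, 7, 10, 11, 14, 15, 16, 17], "max_iter": [1, 2, 3]}
cv_folds = 3

candidates = expand_grid(ridge_grid)
print("candidate settings:", len(candidates))
print("cross-validation fits:", len(candidates) * cv_folds)
print("first candidate:", candidates[0])
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is why exhaustive search becomes slow for heavier models.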

### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer : Despite using the Lasso, Ridge and Elastic Net models, we couldn't see any significant improvement in the r2 score, MSE or MAPE. This suggests that we have to go for more complex ML models like Decision Tree, Random Forest, LightGBM Regression and XGBoost Regression.

## ML Model - 2 Implementing Decision Tree Regression

In [None]:
# import the regressor
from sklearn.tree import DecisionTreeRegressor 
  
# create a regressor object
TreeR = DecisionTreeRegressor(max_depth=10) 
  
# fit the regressor with X and Y data
TreeR.fit(X_train, y_train)

# predict the model
y_train_tree_pred= TreeR.predict(X_train)
y_test_tree_pred= TreeR.predict(X_test)

In [None]:
# Calculating Regression Metrics
regression_metrics(y_train,y_train_tree_pred,y_test,y_test_tree_pred)

## 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Answer : After applying the Linear Regression models, we tried a 'Decision Tree' and saw the r2 score increase from 0.83 to 0.849, which means that about 84.9% of the variance of our test data is captured by the trained model. On the other side, our RMSE also decreased and shifted below 5 (=4.7), and accuracy increased from 93% to 95%. In the residual plot, the mean and median are shifting towards 0, which suggests that our model is improving. In the quest for more accurate predictions, we decided to further tune the hyperparameters and check the results.

## 2. Cross- Validation & Hyperparameter Tuning

### Decision Tree with GridSearchCV

In [None]:
# import the Decision Tree regressor and GridSearchCV from sklearn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Creating Decision Tree instance
decision_tree= DecisionTreeRegressor()

# Defining parameters (min_samples_split must be at least 2 when given as an integer)
parameters= {'max_depth': [8,9,10], 'min_samples_leaf': [6,7,8], 'min_samples_split': [2,4]}

# Train the model
decision_treeR = GridSearchCV(decision_tree, parameters, scoring='neg_mean_squared_error', cv=3)
decision_treeR.fit(X_train,y_train)

# Predict the output
y_train_grid_Dtree_pred = decision_treeR.predict(X_train)
y_test_grid_Dtree_pred = decision_treeR.predict(X_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best parameters found out to be: {decision_treeR.best_params_}")
print(f"Negative mean square error is: {decision_treeR.best_score_}")

In [None]:
# Calculating Regression Metrics
regression_metrics(y_train,y_train_grid_Dtree_pred,y_test,y_test_grid_Dtree_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_grid_Dtree_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

### Which hyperparameter optimization technique have you used and why?

Answer : We have used GridSearchCV as the hyperparameter optimization technique because it tries every combination of the hyperparameter values in the supplied grid, evaluates each combination with cross-validation, and selects the best values for the hyperparameters. Within the grid provided, this is an exhaustive tuning method.

### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

We tried different combinations of parameters to get the best r2 score and the least MAPE for our case. The grid searched covered 'max_depth': [8, 9, 10], 'min_samples_leaf': [6, 7, 8] and several values of 'min_samples_split', and hyperparameter tuning of the Decision Tree improved the MSE on the test dataset from 43 to 34.

## 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

In order to minimise the errors between actual and predicted values, we evaluate our ML model using different metrics. All these metrics give us an indication of how close we are to the real/expected output. In our case, each evaluation metric shows little difference between the train and test data, which suggests that our model generalises well. So the Rented Bike Count, the dependent variable that impacts the business, is predicted with an accuracy of about 86.9%, with predictions about 3% away from the mean of the actual absolute values.

## ML Model - 3 Implementing Random Forest Regressor

In [None]:
# import the regressor
from sklearn.ensemble import RandomForestRegressor 
  
# create a regressor object
RF_TreeR = RandomForestRegressor(n_estimators=100, max_depth=10) 
  
# fit the regressor with X and Y data
RF_TreeR.fit(X_train, y_train)

# predict the model
y_train_RFtree_pred= RF_TreeR.predict(X_train)
y_test_RFtree_pred= RF_TreeR.predict(X_test)

In [None]:
# Calculating Regression Metrics using RandomForestRegressor
regression_metrics(y_train,y_train_RFtree_pred,y_test,y_test_RFtree_pred)

## 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

By implementing our third model, i.e. Random Forest, we achieved an r2 score of 0.91 on the training dataset and 0.88 on the test dataset, which is very good. MSE also reduced from 34 to 30, which means our model is moving towards an optimal model.

The r2 score increased from 86% with the Decision Tree to 88% with the Random Forest.

### 2. Cross- Validation & Hyperparameter Tuning

 Random Forest with RandomizedSearchCV

In [None]:
# import the Random Forest regressor and RandomizedSearchCV from sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Creating Random Forest instance
RF_tree= RandomForestRegressor()

# Defining parameters
parameters= {'n_estimators':[100], 'max_depth': [10,11,12], 'min_samples_leaf': [1, 2]}

# Train the model
RF_treeR = RandomizedSearchCV(RF_tree, parameters, n_iter=5, n_jobs=-1, scoring='neg_mean_squared_error', cv=3,  verbose=3)
RF_treeR.fit(X_train,y_train)

# Predict the output
y_train_grid_RFtree_pred = RF_treeR.predict(X_train)
y_test_grid_RFtree_pred = RF_treeR.predict(X_test)

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best parameters found out to be: {RF_treeR.best_params_}")
print(f"Negative mean square error is: {RF_treeR.best_score_}")

In [None]:
# Calculating Regression Metrics using RandomizedSearchCV in RandomForestRegressor
regression_metrics(y_train,y_train_grid_RFtree_pred,y_test,y_test_grid_RFtree_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_grid_RFtree_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

### Which hyperparameter optimization technique have you used and why?

We have used RandomizedSearchCV for Random Forest because our dataset is large and the model is expensive to train. Instead of trying every combination, it evaluates a fixed number (`n_iter`) of randomly sampled parameter combinations, which reduces the processing and training time, usually without a large loss in accuracy.
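
The sampling idea can be sketched in plain Python: build the full list of candidates and draw `n_iter` of them at random. This is only an illustration of the principle on the Random Forest grid used above; the seed value is an assumption, and the library's own sampler may differ in its details.

```python
import random
from itertools import product

# The Random Forest grid from the cell above
rf_grid = {"n_estimators": [100], "max_depth": [10, 11, 12], "min_samples_leaf": [1, 2]}

names = list(rf_grid)
all_candidates = [dict(zip(names, combo)) for combo in product(*rf_grid.values())]

# Draw n_iter distinct candidates; the seed (an assumption) makes the draw repeatable
n_iter = 5
rng = random.Random(42)
sampled = rng.sample(all_candidates, n_iter)

print("candidates in full grid:", len(all_candidates))
print("candidates actually evaluated:", len(sampled))
for params in sampled:
    print(params)
```

With this particular grid there are only 6 combinations, so sampling 5 of them covers almost the whole grid. The time saving of a randomized search becomes noticeable only when the grid is much larger than `n_iter`.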

### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After using RandomizedSearchCV with different hyperparameters, we did not observe any significant improvement, although MSE on the test dataset reduced from 14 to 13.

## ML Model - 4 - LightGBM Regression

LightGBM Regression

In [None]:
# import the regressor
from lightgbm import LGBMRegressor
  
# create a regressor object
lgbmR = LGBMRegressor(boosting_type='gbdt', max_depth=4, learning_rate=0.1, n_estimators=500,  n_jobs=-1) 
  
# fit the regressor with X and Y data
lgbmR.fit(X_train, y_train)

# predict the model
y_train_lgbmR_pred= lgbmR.predict(X_train)
y_test_lgbmrR_pred= lgbmR.predict(X_test)

In [None]:
# Calculating Regression Metrics using LGBMRegressor
regression_metrics(y_train, y_train_lgbmR_pred, y_test, y_test_lgbmrR_pred)

### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Answer : LightGBM is a gradient boosting framework based on decision trees. It is reported to be faster than, and comparably accurate to, other popular gradient boosting libraries such as XGBoost on several datasets. We wanted to improve the results further, so we tried implementing LightGBM in order to achieve more accurate results.

With the help of LightGBM, the independent variables explain 91% of the variance of the dependent variable (r2 score) on the testing dataset.

We have further checked the performance metrics by hyperparameter tuning of LightGBM.

## 2. Cross- Validation & Hyperparameter Tuning

LightGBM with RandomizedSearchCV

In [None]:
# import RandomizedSearchCV from sklearn
from sklearn.model_selection import RandomizedSearchCV

# Creating LightGBM instance
lgbm= LGBMRegressor()

# Defining parameters
parameters={"learning_rate":[0.01,0.1],"max_depth":[3,4,5],"n_estimators":[500,600]}

# Train the model
lgbm_rand_R= RandomizedSearchCV(lgbm,parameters,scoring='neg_mean_squared_error',n_jobs=-1,cv=3,verbose=3)
lgbm_rand_R.fit(X_train,y_train)

# Predict the output
y_train_rand_lgbm_pred = lgbm_rand_R.predict(X_train)
y_test_rand_lgbm_pred = lgbm_rand_R.predict(X_test)

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best parameters found out to be: {lgbm_rand_R.best_params_}")
print(f"Negative mean square error is: {lgbm_rand_R.best_score_}")

In [None]:
# Calculating Regression Metrics using RandomizedSearchCV in LGBMRegressor
regression_metrics(y_train,y_train_rand_lgbm_pred,y_test,y_test_rand_lgbm_pred)

### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV was still the better option, since it takes much less processing time without compromising much on accuracy. So we decided to use it as the hyperparameter optimization technique.


### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

We tried different parameters for tuning our LightGBM model and achieved an r2 score of 0.96 on the training dataset and 0.91 on the testing set, which suggests that our model is neither underfitting nor badly overfitting. The parameter grid searched was {'n_estimators': [500, 600], 'max_depth': [3, 4, 5], 'learning_rate': [0.01, 0.1]}.

## ML Model - 5 - XGBoost Regression

XGBoost Regression

In [None]:
# import the regressor
from xgboost import XGBRegressor
  
# create a regressor object
xgbR = XGBRegressor(learning_rate=0.1, max_depth=5) 
  
# fit the regressor with X and Y data
xgbR.fit(X_train, y_train)

# predict the model
y_train_xgbR_pred= xgbR.predict(X_train)
y_test_xgbR_pred= xgbR.predict(X_test)

In [None]:
# Calculating Regression Metrics using XGBRegressor
regression_metrics(y_train,y_train_xgbR_pred,y_test,y_test_xgbR_pred)

### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Answer : XGBoost (eXtreme Gradient Boosting) is a gradient boosting algorithm that is very popular for achieving good accuracy, so we have used it as our next model.

We got an r2 score of 0.94 on the testing dataset, which is excellent and an improvement over the previous models. At this point even a slight improvement in MAPE can lead to a large profit for stakeholders, so we decided to further improve the efficiency of our model by tuning the various hyperparameters of XGBoost.

## 2. Cross- Validation & Hyperparameter Tuning

XGBoost with RandomizedSearchCV

In [None]:

from xgboost import XGBRegressor

In [None]:
from sklearn.model_selection import RandomizedSearchCV


In [None]:
# import RandomizedSearchCV from sklearn
from sklearn.model_selection import RandomizedSearchCV

# Creating XGBoost instance
xgb= XGBRegressor()

# Defining parameters
parameters={"learning_rate":[0.01, 0.1],"max_depth":[4,5]}

# Train the model (n_iter=4 covers all 4 combinations of this small grid)
xgb_Rand_R= RandomizedSearchCV(xgb,parameters,n_iter=4,scoring='neg_mean_squared_error',n_jobs=-1,cv=3,verbose=3)
xgb_Rand_R.fit(X_train,y_train)

# Predict the output
y_train_rand_xgbR_pred = xgb_Rand_R.predict(X_train)
y_test_rand_xgbR_pred = xgb_Rand_R.predict(X_test)

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best parameters found out to be: {xgb_Rand_R.best_params_}")
print(f"Negative mean square error is: {xgb_Rand_R.best_score_}")

In [None]:
# Calculating Regression Metrics for the tuned XGBRegressor
regression_metrics(y_train,y_train_rand_xgbR_pred,y_test,y_test_rand_xgbR_pred)

In [None]:
#Plotting the figure
plt.figure(figsize=(15,10))
plt.plot(y_test_rand_xgbR_pred, color='Blue')
plt.plot(np.array(y_test), color='Red')
plt.legend(["Predicted","Actual"])
plt.show()

### Which hyperparameter optimization technique have you used and why?

Answer : XGBoost is a heavy algorithm and takes a lot of processing time with GridSearchCV, so tuning the hyperparameters with GridSearchCV was a complicated task for us. RandomizedSearchCV is a good hyperparameter optimization technique for this scenario: it accepts a range of parameter values and evaluates randomly chosen combinations of them. So we have used RandomizedSearchCV for hyperparameter tuning.

### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.


We tried different parameters for tuning our XGBoost model and achieved an r2 score of 0.93 on the training dataset and 0.90 on the testing set, which suggests that our model is neither underfitting nor overfitting. The parameter grid in the cell above was {'learning_rate': [0.01, 0.1], 'max_depth': [4, 5]}.

Minor improvements in the regression metrics are also significant now, as we are moving towards the best achievable model. With the help of RandomizedSearchCV we got an r2 score of 0.94 on the test dataset (our model now captures 94% of the variance of the test set), which is 1% higher than without RandomizedSearchCV, and the best parameters were found to be {'learning_rate': 0.1, 'max_depth': 13}. We also noticed that our MAPE is further reduced and falls under 3% (the minimum error among all models), while MSE is reduced to 12. We have also seen that on further increasing the max_depth of the trees our model overfits, so the above parameter values are the best combination.

## 1. Which Evaluation metrics did you consider for a positive business impact and why?

Since the goal of this analysis is to predict future bike rental demand over a period of time, which is time series data, we considered the following regression metrics:

1. MAE (Mean Absolute Error): This metric calculates the average magnitude of the errors in the predictions, without considering their direction. The lower the MAE, the more accurate the model, so minimising it directly supports a positive business impact.
2. RMSE (Root Mean Squared Error): This is the square root of the MSE. It is one of the most widely used regression metrics because it has the same units as the original data, which makes the magnitude of the error easy to interpret.
3. MAPE (Mean Absolute Percentage Error): This is the average of the absolute percentage differences between the predicted and actual values. It is particularly useful for time series data (as in our case) because it allows forecast accuracy to be compared across different scales. An analyst can also explain a percentage error to stakeholders easily, which makes MAPE one of the most important metrics for a positive business impact.
4. R2 Score: The R2 score (coefficient of determination) measures the proportion of the variance in the dependent variable that is explained by the independent variables. It gives a quick measure of goodness of fit and makes it easy to compare different models.
5. Adjusted R2 Score: Adjusted R2 modifies the R2 score to account for the number of independent variables in the model. It penalizes R2 for each additional feature, so it rises only when a new feature improves the fit more than would be expected by chance. This makes it better suited for comparing models that use different numbers of features.
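The five metrics above can be computed by hand as in the following sketch. The function name `regression_report` is hypothetical (it is not the `regression_metrics` helper used elsewhere in this notebook), and the sample values and `n_features=2` are made up purely for illustration.

```python
import math


def regression_report(y_true, y_pred, n_features):
    """Compute MAE, RMSE, MAPE (%), R2 and adjusted R2 for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when any actual value is 0.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    # Penalise R2 for the number of features used by the model.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "RMSE": rmse, "MAPE": mape,
            "R2": r2, "Adjusted R2": adjusted_r2}


# Made-up values, for illustration only.
actual = [100, 200, 300, 400, 500]
predicted = [110, 190, 310, 390, 510]
report = regression_report(actual, predicted, n_features=2)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Because every prediction here is off by exactly 10 units, MAE and RMSE coincide, while MAPE weights the same absolute error more heavily on the smaller actual values.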








## 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
# Storing different regression metrics in order to make dataframe and compare them
models = ["Linear_regression","Decision_tree","Random_forest","LightGBM","XGboost"]

r2_r = [.83,0.86,0.88,0.91,0.90]
adjusted_r2 = [.82,.86,.88,.90,.90]

# Create dataframe from the lists
data = {'Models': models, 
        'R2': r2_r,
        'Adjusted R2': adjusted_r2
       }
metric_df = pd.DataFrame(data)

# Printing dataframe
metric_df

We chose XGBoost as our final prediction model, with hyperparameters {'learning_rate': 0.1, 'max_depth': 13}, because after tuning with RandomizedSearchCV it gave the highest accuracy (97%), the lowest MAPE (3%), and the highest R2 score (0.94) on the test set among all models.

## 3. Explain the model which you have used and the feature importance using any model explainability tool?

XGBoost (eXtreme Gradient Boosting) provides an efficient implementation of the gradient boosting framework. It supports both linear and tree-based base learners and scales well to large datasets. The basic idea is to train a sequence of simple models, such as decision trees, and combine their predictions into a more powerful model. Each tree is trained to correct the errors made by the previous trees in the sequence; this is known as boosting.

XGBoost uses gradient boosting to fit the trees: each new tree is chosen to reduce the loss function, and therefore the error, of the overall model. XGBoost also includes a number of other features, such as regularization, which helps to prevent overfitting, and parallel processing, which allows for faster training times.
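The idea of each tree correcting the errors of the previous ones can be sketched in plain Python. This is a minimal gradient-boosting illustration for squared error with one-split trees (stumps), not XGBoost itself: it has no regularization, no second-order terms, and no parallelism. The function names and the hour/demand values are made up for the example.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosted(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump_predict(stump, xi)
                 for pi, xi in zip(preds, x)]
    return base, learning_rate, stumps


def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Made-up hour-of-day vs. rented-bike counts, for illustration only.
hours = [0, 3, 6, 8, 10, 13, 16, 18, 20, 23]
demand = [20, 10, 60, 250, 120, 140, 180, 300, 150, 50]

model = fit_boosted(hours, demand)
baseline_mse = mse(demand, [sum(demand) / len(demand)] * len(demand))
boosted_mse = mse(demand, [predict(model, h) for h in hours])
print("MSE of mean-only baseline:", round(baseline_mse, 1))
print("MSE after boosting:", round(boosted_mse, 1))
```

The learning rate shrinks each stump's contribution, which is why many rounds are needed; it plays the same role as the `learning_rate` hyperparameter tuned above.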

Although tree-based algorithms often give the most accurate results, they are less explainable. Explainability tools such as LIME and SHAP help us explain the model to stakeholders.

# **Conclusion**

In this project we started by loading the data and performing Exploratory Data Analysis (EDA) on all the features of our dataset. We then analysed and transformed our dependent variable, "Rented Bike Count". Next came null value treatment, feature selection, and analysis of the numeric variables: we checked the correlations and dropped the highly correlated variables. We then applied one-hot encoding to the categorical columns, built the models, and extracted statistical information that is useful from a business perspective.

Next we implemented machine learning algorithms such as Linear Regression, Lasso, Ridge, Decision Trees, Random Forest, and XGBoost. We also did some hyperparameter tuning to improve model performance.

## Observations

* On holidays and non-working days there is demand for rented bikes.
* Demand surges at 8 AM and at 6 PM, likely because people are going to work in the morning and returning from work in the evening.
* People preferred rented bikes more in the morning than in the evening.
* When rainfall was low, people booked more bikes, with a few exceptions.
* Temperature, Hour, and Humidity are the most important features that positively drive the total rented bike count.








## Evaluation metrics

In [None]:
# Storing different regression metrics in order to make dataframe and compare them
models = ["Linear_regression","Decision_tree","Random_forest","LightGBM","XGboost"]
R2_Score_train=[.81,.89,.91,.94,.93]
R2_score_test = [.83,0.86,0.88,0.91,0.90]
Adjusted_r2_train =[.80, .89,.91,.94,.92 ]
Adjusted_r2_test = [.82,.86,.88,.90,.90]

# Create dataframe from the lists
data = {'Models': models, 
        'R2 Score Train': R2_Score_train,
        'R2 Score Test' : R2_score_test,
        'Adjusted R2 Score Train': Adjusted_r2_train,
        'Adjusted R2 Score Test' :Adjusted_r2_test
       }
metric_df = pd.DataFrame(data)

# Printing dataframe
metric_df

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***