<a href="https://colab.research.google.com/github/Mohammdshaheenalam/Bike_Sharing_Demand_prediction/blob/main/Bike_sharing_demand_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name : Bike Sharing Demand Prediction**



Project Type : Regression

Done by : **Md Shaheen Alam**






# **Project Summary -**

 ### A bike-sharing system provides people with a sustainable mode of transportation and benefits both the environment and the user. Public rental bike sharing has recently become popular because of its convenience and environmental sustainability. The data used include Seoul Bike and Capital Bikeshare program data, with weather data attached for each hour. For this dataset, linear regression models are trained with hyperparameters optimized using a repeated cross-validation approach, and a held-out test set is used for evaluation. Several evaluation metrics, namely mean squared error (MSE), root mean squared error (RMSE), R2 score, and adjusted R2 score, are used to measure the prediction performance of the regression models.
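The four evaluation metrics named in the summary can be sketched in plain Python. This is a minimal illustration on made-up numbers, not the notebook's evaluation code; the helper name `regression_metrics` and its `n_features` argument are assumptions introduced here.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises R2 for the number of predictors used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2, "adjusted_r2": adj_r2}

# Toy values, not taken from the bike dataset.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_features=1)
print(metrics)
```

Adjusted R2 is never larger than R2, which is why it is the safer number to compare when models use different numbers of features.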

# **GitHub Link -**

https://github.com/Mohammdshaheenalam/Bike_Sharing_Demand_prediction

# **Problem Statement**


## <b> Problem Description </b>

### Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature - Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
import seaborn as sns
import missingno as msno #(import for missing value visualization)
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
!pip install shap
import shap

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv("/content/drive/MyDrive/SeoulBikeData.csv",encoding= 'unicode_escape')

### Dataset First View

In [None]:
# Viewing the data of top 5 rows
df.head()

In [None]:
# View the data of bottom 5 rows
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# 1. shape gives the count of rows and columns as a tuple (rows, columns)
# 2. len() gives only the number of rows
print(len(df))

df.shape

### Dataset Information

In [None]:
# Dataset Info
# info() gives a high-level summary of the dataset:
# 1. total number of columns
# 2. count of non-null values in each column
# 3. datatype of each column
# 4. memory usage of the dataset


df.info()


# **preprocessing the dataset**


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# duplicated() returns a boolean Series that flags rows repeating an earlier row.
# Chaining sum() gives the total number of duplicated rows.

df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# The missingno library (imported above as msno) is used to visualize missing values.

msno.bar(df) # Bar chart of non-null counts per column; other views such as msno.heatmap(df) are also available.

### What did you know about your dataset?

Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* Date : year-month-day

* Rented Bike count - Count of bikes rented at each hour

* Hour - Hour of the day

* Temperature - Temperature in Celsius

* Humidity - %

* Windspeed - m/s

* Visibility - 10m

* Dew point temperature - Celsius

* Solar radiation - MJ/m2

* Rainfall - mm

* Snowfall - cm

* Seasons - Winter, Spring, Summer, Autumn

* Holiday - Holiday/No holiday

* Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Rename the columns whose names contain a unit in brackets, since such names are awkward to reference later.
df.rename(columns={'Snowfall (cm)': 'Snowfall', 'Rainfall(mm)': 'Rainfall','Solar Radiation (MJ/m2)' : 'Solar Radiation','Dew point temperature(°C)' : 'Dew point temperature', 'Visibility (10m)' : 'Visibility', 'Wind speed (m/s)' : 'Wind speed', 'Humidity(%)' : 'Humidity', 'Temperature(°C)' : 'Temperature'}, inplace=True)

In [None]:
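# Convert the Date column to the datetime datatype.
# Assumption: the raw CSV stores dates as DD/MM/YYYY, so day-first parsing is requested; drop dayfirst if your file differs.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Separate day, month, and year from the Date column
df['Day']=df['Date'].dt.day
df['Month']=df['Date'].dt.month
df['Year']=df['Date'].dt.year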

In [None]:
# Drop the Date column after extracting the necessary information
df.drop(['Date'],axis=1,inplace=True)

In [None]:
df.head(3)

### What all manipulations have you done and insights you found?

Answer -
* I performed some data cleaning and transformation operations on the dataset. Specifically, I checked for missing and duplicate values and found none. I also extracted the day, month, and year from the date column into three new columns and then removed the original date column.
* Several column names contained units in brackets, so I renamed those columns.
* I separated the day, month, and year from the date column.
* I dropped the date column.
* The benefits of these manipulations include improved data quality, easier analysis, and easier interpretation of results. Checking for missing and duplicate values confirms that the dataset is complete, which reduces errors in later analysis. The extracted day, month, and year columns can be used to explore trends or patterns in the data, and removing the original date column simplifies the dataset.
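The wrangling steps described above can be sketched on a tiny frame. This is only an illustration: the two rows are made up, the column subset is reduced, and the `DD/MM/YYYY` date layout is an assumption about the raw file rather than something this notebook confirms.

```python
import pandas as pd

# Two made-up rows that mimic the raw column names used above.
raw = pd.DataFrame({
    "Date": ["01/12/2017", "15/06/2018"],
    "Temperature(°C)": [-5.2, 24.1],
    "Rented Bike Count": [254, 1300],
})

# Strip the unit from the column name.
clean = raw.rename(columns={"Temperature(°C)": "Temperature"})
# An explicit format avoids silently swapping day and month.
clean["Date"] = pd.to_datetime(clean["Date"], format="%d/%m/%Y")
clean["Day"] = clean["Date"].dt.day
clean["Month"] = clean["Date"].dt.month
clean["Year"] = clean["Date"].dt.year
# The original column is no longer needed once its parts are extracted.
clean = clean.drop(columns=["Date"])
print(clean)
```

Passing an explicit `format` is a stricter alternative to `dayfirst=True`: parsing fails loudly if a row does not match, instead of guessing.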

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Print the unique values of each categorical column
def categorical_unique_value(categorical_column, df) :
  for column in categorical_column:
    print('The values that the categorical column', column, 'can take are:' , df[column].unique())

In [None]:
#Possible values and important categorical values of dataset
categorical_columns_of_dataset = ['Seasons' , 'Holiday']
categorical_unique_value(categorical_columns_of_dataset , df)

In [None]:
# Total rented bike count per value of the given column
def create_df_analysis(column):
  return df.groupby(column)['Rented Bike Count'].sum().reset_index()

In [None]:
#Seasons Column
Season_column = create_df_analysis('Seasons')
print(Season_column)

In [None]:
#Visualisation for Season Column
plt.figure(figsize=(10,7))
splot = sns.barplot(data=Season_column,x='Seasons',y='Rented Bike Count')
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha = 'center', va = 'center',
                   xytext = (0, 9),
                   textcoords = 'offset points')
plt.xlabel("Seasons", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.show()

##### 1. Why did you pick the specific chart?

Answer -
I picked a bar chart to compare the total number of bikes rented in each season.

##### 2. What is/are the insight(s) found from the chart?

Answer-
The chart shows that the total number of bikes rented differs strongly by season: it is highest in summer and lowest in winter. This is likely because people are less willing to ride in cold or snowy weather, for reasons such as discomfort and safety concerns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -
Yes, the gained insights could help create a positive business impact for bike rental companies.
* Bike rental companies can use the insights to adjust their pricing and marketing strategies by season. For example, they can offer discounts or promotions in winter, when demand is low, and make sure enough bikes are available in summer, when demand is high.

#### Chart - 2

In [None]:
#Preparation for Pie Chart
seasons_list = list(Season_column['Seasons'])
rented_count_list = list(Season_column['Rented Bike Count'])
palette_color = sns.color_palette('bright')
explode = (0.05,0.05,0.05,0.05)

In [None]:
# Chart - 2 visualization code
#Pie chart for visualisation for Season Column
plt.figure(figsize=(5,5))
plt.pie(rented_count_list,labels=seasons_list,colors=palette_color,explode=explode,autopct='%0.0f%%')
plt.title("Percentage of Total Number of Bikes rented for each Season")
plt.axis("equal")
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I picked a pie chart to show each season's percentage of the total number of bikes rented.

##### 2. What is/are the insight(s) found from the chart?

Insights:-
* Summer is the most popular season for bike rentals, with 37% of all rentals. This is likely due to the fact that people are more likely to be outside and active in the summer, and bikes can be a convenient way to get around.
* Spring is the second most popular season for bike rentals, with 29% of all rentals. This is likely due to the fact that the weather is starting to warm up and people are getting ready for the summer months.
* Autumn is the third most popular season for bike rentals, with 26% of all rentals. This is likely due to the fact that the weather is still mild and people are enjoying the last of the warm weather before winter arrives.
* Winter is the least popular season for bike rentals, with only 8% of all rentals. This is likely due to the fact that the weather is cold and snowy, and roads and bike paths may be covered in ice and snow.
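The percentages that `autopct` prints on the pie are each season's share of the overall total. A small sketch of that arithmetic, using made-up totals chosen only so the numbers are easy to follow:

```python
import pandas as pd

# Made-up totals, only to show the arithmetic behind the pie labels.
totals = pd.Series(
    {"Autumn": 260, "Spring": 290, "Summer": 370, "Winter": 80},
    name="Rented Bike Count",
)
# Share of the overall total, in percent, rounded like autopct='%0.0f%%'.
share = (totals / totals.sum() * 100).round(0)
print(share.sort_values(ascending=False))
```

In the notebook, the same expression could be applied to the `Rented Bike Count` column of `Season_column` to check the labels on the chart.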

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# For Holiday Column
Holiday_Column = create_df_analysis('Holiday')
print(Holiday_Column)

In [None]:
# Chart - 3 visualization code
#Visualisation for Holiday Column
plt.figure(figsize=(7,7))
splot = sns.barplot(data=Holiday_Column,x='Holiday',y='Rented Bike Count')
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha = 'center', va = 'center',
                   xytext = (0, 9),
                   textcoords = 'offset points')
plt.xlabel("Holiday variable",size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I picked a bar chart to compare the total number of bikes rented on holidays and on non-holidays.

##### 2. What is/are the insight(s) found from the chart?

Insights:-
With only two categories, holiday and no holiday, the chart supports one comparison: the total number of bikes rented in each category. The bars show sums, and the dataset contains far more non-holiday hours than holiday hours, so the non-holiday total is much larger. A total therefore says little about demand per hour; the hourly comparison in Chart - 10 is the fairer one.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer-

Yes, the gained insights from the data that shows the number of bikes rented on holidays and non-holidays can help create a positive business impact in the following ways:


* Increase revenue: Bike rental companies can plan inventory separately for holidays and working days so that bikes do not run out when demand is high. They can also offer special pricing on holidays to attract more customers and increase revenue.
* Improve customer satisfaction: By having enough bikes available and offering special pricing on holidays, bike rental companies can improve customer satisfaction and encourage repeat business.
* Gain a competitive advantage: By understanding and responding to customer demand, bike rental companies can gain a competitive advantage over other businesses.
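When two groups have very different numbers of rows, a summed bar chart and a per-hour average can tell different stories. The sketch below uses made-up numbers purely to show the mechanics; it says nothing about what the real data contain.

```python
import pandas as pd

# Toy data: four non-holiday hours and a single holiday hour.
toy = pd.DataFrame({
    "Holiday": ["No Holiday"] * 4 + ["Holiday"],
    "Rented Bike Count": [100, 200, 300, 400, 500],
})

# sum is what the bar chart above plots; mean is demand per recorded hour.
summary = toy.groupby("Holiday")["Rented Bike Count"].agg(["sum", "mean", "count"])
print(summary)
```

Here the non-holiday group has the larger sum only because it has more rows, while the mean points the other way. Checking `count` next to `sum` is a cheap way to spot this.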





#### Chart - 4

In [None]:
df.describe()

In [None]:
# Chart - 4 visualization code
#Visualisation for bikes rented on different temperatures
Bikes_rented_on_diff_T = create_df_analysis('Temperature')
print(Bikes_rented_on_diff_T)

In [None]:
# created a histogram chart for different temperature
plt.figure(figsize=(10,7))
sns.histplot(data=Bikes_rented_on_diff_T,x='Temperature',y='Rented Bike Count',bins=200)
plt.title('Number of bikes rented in different temperatures',size=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I created a two-dimensional histogram of temperature against the total number of bikes rented at that temperature. It shows the relationship between temperature and bike rental demand.

##### 2. What is/are the insight(s) found from the chart?

Answer - The number of bikes rented increases as the temperature increases. This is likely due to the fact that people are more likely to want to ride their bikes in warmer weather.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - Yes, the gained insights from the data that shows the number of bikes rented at different temperatures can help create a positive business impact in the following ways:

* Increase revenue: Bike rental companies can anticipate higher demand on warmer days and increase their inventory accordingly to avoid running out of bikes. They can also offer special pricing on warmer days to attract more customers and increase revenue.
* Improve customer satisfaction: By having enough bikes available and offering special pricing on warmer days, bike rental companies can improve customer satisfaction and encourage repeat business.
* Gain a competitive advantage: By understanding and responding to customer demand, bike rental companies can gain a competitive advantage over other businesses.

#### Chart - 5

In [None]:
df['Seasons'].value_counts()

In [None]:
sns.violinplot(x=df['Seasons'],y=df['Rented Bike Count'],palette =['yellowgreen','aquamarine','springgreen','aqua'])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Insights:-
The distribution of hourly rented bike counts differs by season. Winter counts are concentrated at low values, while the other seasons, summer in particular, reach much higher counts. This is likely because people are more willing to ride in warmer weather.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.barplot(x='Day',y='Rented Bike Count',data=df)
plt.show()


##### 1. Why did you pick the specific chart?

Answer - I picked a bar chart to show how the number of bikes rented varies from day to day.

##### 2. What is/are the insight(s) found from the chart?

Insight:-
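The average number of bikes rented varies across the days of the month (the `Day` column holds the day of the month, not the weekday). This variation is likely due to a number of factors, such as the weather, the day of the week, and whether or not there are any special events happening in the area.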

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer - Yes, the gained insights from the chart showing the number of bikes rented on different days can help create a positive business impact in the following ways:
* Identify the most popular bike rental locations and times on different days of the month. This information can be used to allocate bikes and resources more efficiently.
* Analyze customer demographics and preferences on different days of the month. This information can be used to develop targeted marketing campaigns and product offerings.
* Track trends in bike rental demand on different days of the month. This information can be used to forecast future demand and make necessary adjustments to inventory and pricing.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.barplot(x='Month',y='Rented Bike Count',data=df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I picked a bar chart to show how the number of bikes rented varies from month to month.

##### 2. What is/are the insight(s) found from the chart?

Insights:-
The number of bikes rented varies throughout the year, with higher demand in the warmer months (April to August) and lower demand in the colder months. This is likely because people prefer to ride their bikes in warmer weather.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
sns.barplot(x='Year',y='Rented Bike Count',data=df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I picked a bar chart to show which year has the higher rented bike count.

##### 2. What is/are the insight(s) found from the chart?

Insights:-
Bike rental companies can use the insights from the chart to ensure that they have enough bikes available and to offer competitive pricing in all years. This can help to improve customer satisfaction and encourage repeat business.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.scatterplot(x ='Snowfall',y= 'Rented Bike Count',data = df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer - I used a scatter plot because it shows the relationship between two numerical columns.

##### 2. What is/are the insight(s) found from the chart?


Insights-:
* The chart shows the relationship between snowfall and rented bike count.
* The rented bike count peaks when there is no snowfall and decreases whenever there is snowfall.
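One way to put a number on what the scatter plot suggests is to compare the average count for hours with and without snowfall. The following is a sketch on made-up values; the label strings `"snow"` and `"no snow"` are introduced here for illustration.

```python
import pandas as pd

# Made-up hourly records.
toy = pd.DataFrame({
    "Snowfall": [0.0, 0.0, 0.0, 1.5, 3.0],
    "Rented Bike Count": [900, 1100, 1000, 200, 100],
})

# Label each hour by whether any snowfall was recorded.
toy["Snow"] = toy["Snowfall"].gt(0).map({True: "snow", False: "no snow"})
by_snow = toy.groupby("Snow")["Rented Bike Count"].mean()
print(by_snow)
```

Applied to the real `df`, the same grouping would give the mean hourly rentals for both conditions.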

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight points to negative growth, because there is a negative correlation between snowfall and rented bike count.
* The rented bike count decreases during snowy periods, so business revenue also falls during those periods.

#### Chart - 10

In [None]:
df.sample(2)

In [None]:
# Chart - 10 visualization code
fig, ax = plt.subplots(figsize=(15, 4))
sns.pointplot(data=df,x='Hour',y='Rented Bike Count',hue='Holiday')
plt.show()

##### 1. Why did you pick the specific chart?

Answer -
* I picked a point plot because it shows trends and patterns over time.
* This plot is mainly used to see how a variable changes with respect to time.

##### 2. What is/are the insight(s) found from the chart?

Insights-:
* Demand for rented bikes decreases from 12:00 AM to 5:00 AM on both holidays and non-holidays. After that, demand on non-holidays is always higher than on holidays.
* On holidays, demand increases from 5:00 AM to 6:00 PM and then decreases slightly.
* On non-holidays there are two peaks of high demand, one in the morning between 5:00 AM and 8:00 AM and a second in the evening between 4:00 PM and 6:00 PM, possibly due to office timings.
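The peak hours read off the point plot can also be computed directly from the hourly means. This is a sketch on a small made-up table, not the notebook's data.

```python
import pandas as pd

# Made-up counts for a few hours on each kind of day.
toy = pd.DataFrame({
    "Hour": [7, 8, 9, 17, 18, 19] * 2,
    "Holiday": ["No Holiday"] * 6 + ["Holiday"] * 6,
    "Rented Bike Count": [600, 1000, 650, 900, 1500, 1200,
                          200, 300, 400, 780, 750, 650],
})

# Mean count per hour, with one column per holiday category.
hourly_mean = (
    toy.groupby(["Holiday", "Hour"])["Rented Bike Count"].mean().unstack("Holiday")
)
# idxmax returns, for each column, the hour with the highest mean.
peak_hours = hourly_mean.idxmax()
print(hourly_mean)
print(peak_hours)
```

`idxmax` reports only the single highest hour per column. Finding both the morning and the evening peak would require splitting the hours into two ranges first.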


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights create both positive and negative impact on business-:

Positive impact-:

* On working days there are two demand peaks, in the morning and in the evening. To cater to this demand, the company has to increase the number of bikes available at peak hours.

Negative impact-:

If the company is unable to cater to the demand at peak hours, the impacts are:
* Loss of customer trust and satisfaction.
* Decrease in revenue.



#### Chart - 11

In [None]:
df.head(1)

In [None]:
# Chart - 11 visualization code
fig, ax = plt.subplots(figsize=(15, 4))
sns.pointplot(data=df,x='Hour',y='Rented Bike Count',hue='Seasons')
plt.show()

##### 1. Why did you pick the specific chart?

Answer-
* I picked a point plot because it shows trends and patterns over time.
* This plot is mainly used to see how a variable changes with respect to time.

##### 2. What is/are the insight(s) found from the chart?

Insights-:
* Demand in the winter season is lower than in the other seasons.
* Demand per hour is comparatively flat in winter, but two peaks are still visible, in the morning (7 to 9 AM) and in the evening (5 to 7 PM), possibly due to office hours.
* In the remaining seasons demand follows the same trend, and in all seasons there are two peaks:
  1. In the morning, 7 to 9 AM (possibly due to office hours).
  2. In the evening, 5 to 7 PM (possibly due to office hours).
* Demand per hour in the summer season is clearly higher than in the other seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights suggest the following planning steps, which can create a positive business impact-:
* Summer is the season with the highest demand, and within it the morning (7 to 9 AM) and evening (5 to 7 PM) hours are in highest demand, so the company has to increase supply accordingly.
* In winter there is no need to keep the same stock as in the other seasons.

Some practices the company has to avoid, because they can lead to negative growth-:
* Keeping the same stock in every season.
* Keeping the same number of bikes available in every hour.


#### Chart - 12

In [None]:
df.head(3)

In [None]:
# Chart - 12 visualization
fig, ax = plt.subplots(figsize=(15, 4))
sns.lineplot(data=df,x='Day',y='Rented Bike Count',hue='Holiday')
plt.show()

##### 1. Why did you pick the specific chart?

Answer -
* I picked a line plot because it shows trends and patterns over time.
* This plot is mainly used to see how a variable changes with respect to time.
* Here I want to see how the rented bike count changes with the day of the month on holidays and non-holidays.

##### 2. What is/are the insight(s) found from the chart?

Insights-:
* In the first 2 weeks of the month, demand fluctuates on working days as well as on holidays.
* After the first 2 weeks and up to the end of the month, demand is higher on working days than on non-working days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Some observations the company has to keep in mind to create a positive impact on the business-:
* In the first 2 weeks of the month, the company should keep a similar amount of stock on working and non-working days.
* From day 15 to the end of the month, stock does not need to be equal on working days and holidays.

Some points the company has to keep in mind to avoid a negative impact on the business-:
* The company has to understand the monthly demand trend on working as well as non-working days.
* If the company ignores the monthly demand trend and keeps the same number of bikes on holidays and working days, customers may not get a bike when demand is high.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
fig, ax = plt.subplots(figsize=(15, 4))
sns.lineplot(data=df,x='Day',y='Rented Bike Count',hue='Seasons')
plt.show()

##### 1. Why did you pick the specific chart?

Answer -
* I picked a line plot because it shows trends and patterns over time.
* This plot is mainly used to see how a variable changes with respect to time.


##### 2. What is/are the insight(s) found from the chart?

Insights-:
* The graph shows lower demand in the winter season than in the other seasons on every day of the month.
* Summer is the season with the highest demand.
* Demand in the spring and autumn seasons fluctuates across the days of the month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

After analysing all the observations, the company can take planned decisions that have a chance of creating a positive impact on the business-:

Some decisions like-
* Increase the stock of bikes in the summer season.
* Summer is the season with the highest demand, so offers on bookings during summer may attract even more bookings.

Some points the company has to keep in mind to avoid negative growth of the business-:
* Winter is the season with the lowest demand, so the company should keep less stock in winter.
* The company has to increase the number of bikes in summer; otherwise the following impacts are likely-
  1. Failure to meet customer demand
  2. Loss of customer trust and satisfaction
  3. Revenue loss




#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(11,11))
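# numeric_only=True skips the text columns (Seasons, Holiday, Functioning Day), which cannot be correlated
sns.heatmap(df.corr(numeric_only=True),annot=True)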
plt.show()


##### 1. Why did you pick the specific chart?

Answer -
* A heatmap plots rectangular data as a color-encoded matrix.

* A heatmap of the correlation matrix shows the correlation between every pair of variables.

* I wanted to know the correlations between the variables, which is why I used a heatmap.

##### 2. What is/are the insight(s) found from the chart?

Insights-:
* There is a positive correlation between rented bike count and hour, and between rented bike count and temperature.
* There is a negative correlation between rented bike count and snowfall, rainfall, and humidity.
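The correlations with the target can also be listed as numbers instead of being read off the heatmap. A minimal sketch on a made-up frame, assuming pandas 1.5 or newer so that `numeric_only` is accepted by `DataFrame.corr`:

```python
import pandas as pd

# Made-up data with perfectly linear relationships, for readability.
toy = pd.DataFrame({
    "Rented Bike Count": [100, 200, 300, 400],
    "Temperature": [1.0, 2.0, 3.0, 4.0],
    "Humidity": [80, 60, 40, 20],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],  # text column
})

# Text columns are skipped; only numeric columns enter the matrix.
corr = toy.corr(numeric_only=True)
# Correlation of every numeric column with the target, strongest positive first.
print(corr["Rented Bike Count"].sort_values(ascending=False))
```

Sorting the target's column makes the positive and negative relationships easy to compare at a glance.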

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

Answer -
* A pair plot shows the pairwise relationships between numerical columns.

* The pair plot automatically detects the numerical columns and draws a plot for every pair of them.

* I wanted to see the relationships between the numerical columns, which is why I used a pair plot.




##### 2. What is/are the insight(s) found from the chart?

From the above pair plot, there is no clear linear relationship between most pairs of variables, other than among dew point temperature, temperature, and solar radiation.
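Strongly related predictors, such as temperature and dew point temperature, are usually checked with the variance inflation factor (VIF). The sketch below computes it by hand with NumPy on synthetic data. The helper `vif` is hypothetical, it adds an intercept to each auxiliary regression, and it is not the statsmodels function imported at the top of the notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - residuals.var() / target.var()  # R2 of column j on the others
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors, not the bike dataset.
rng = np.random.default_rng(0)
temperature = rng.normal(10, 8, 500)
dew_point = temperature - 5 + rng.normal(0, 1, 500)  # nearly a copy of temperature
wind = rng.normal(2, 1, 500)                         # unrelated to the others
vifs = vif(np.column_stack([temperature, dew_point, wind]))
print(vifs)
```

A common rule of thumb treats a VIF well above 5 or 10 as a sign that a column is largely explained by the others, which makes it a candidate for removal before fitting a linear model.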

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. The average number of bikes rented on a functioning day is more than 700.

2. In 2017, the rented bike count was much lower than in 2018.

3. Demand for rented bikes in the summer season is higher than in the other seasons.
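Statements 2 and 3 each compare the mean of two groups. One possible way to test such a statement is Welch's two-sample t-test, sketched here on synthetic numbers. This is an illustrative assumption about how the test could be set up, not a test performed in this notebook, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic hourly counts, not the bike dataset.
rng = np.random.default_rng(1)
summer = rng.normal(1000, 300, 400)
other = rng.normal(600, 300, 1200)

# H0: mean(summer) == mean(other); H1: mean(summer) > mean(other).
# equal_var=False selects Welch's test, which does not assume equal variances.
res = stats.ttest_ind(summer, other, equal_var=False, alternative="greater")
print(res.statistic, res.pvalue)
```

With the real data, the two arrays would be the `Rented Bike Count` values for the two groups being compared, for example summer rows and all other rows.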


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: mu = 700

Alternate Hypothesis : mu > 700

Test Type: Right Tailed Test



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Extract the data for functioning days
func_days_data = df[df['Functioning Day'] == 'Yes']['Rented Bike Count'].values
# Calculate the sample mean and standard deviation
sample_mean = np.mean(func_days_data)
sample_std = np.std(func_days_data, ddof=1)
# calculate the sample size
n = func_days_data.shape[0]
# set alpha level
alpha = 0.05
dof = n - 1


# calculate the t-statistic
null_hypothesis_mean=700
t_stat = (sample_mean - null_hypothesis_mean) / (sample_std / np.sqrt(n))

# calculate the one-sided (right-tailed) p-value; no abs(), so a sample mean below 700 cannot lead to rejection
p_value = stats.t.sf(t_stat, dof)


# check if the p-value is less than alpha
if p_value < alpha:
    print('Reject null hypothesis. Average number of bikes rented on a functioning day is more than 700.')
else:
    print('Failed to reject null hypothesis. There is not enough evidence that the average number of bikes rented on a functioning day is more than 700.')



##### Which statistical test have you done to obtain P-Value?

I used a one-sample t-test to obtain the p-value. The null hypothesis was rejected, which indicates that the average number of bikes rented on a functioning day is more than 700.
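The test statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, and for the claim "more than 700" the p-value is the area to the right of $t$. The sketch below checks the hand computation against SciPy's built-in one-sample test on a synthetic sample. The gamma-distributed sample is a made-up stand-in for the hourly counts, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for hourly counts on functioning days.
rng = np.random.default_rng(42)
sample = rng.gamma(shape=1.3, scale=560.0, size=8000)

mu0 = 700
# Manual t-statistic: distance of the sample mean from mu0 in standard errors.
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(sample.size))
# Right-tailed p-value: survival function of t with n - 1 degrees of freedom.
p_right = stats.t.sf(t_stat, df=sample.size - 1)

# SciPy's one-sample test with a one-sided alternative should agree.
res = stats.ttest_1samp(sample, popmean=mu0, alternative="greater")
print(t_stat, p_right, res.pvalue)
```

Using `ttest_1samp` with `alternative="greater"` avoids computing the tail by hand and makes the direction of the test explicit.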


##### Why did you choose the specific statistical test?

Answer Here.

In [None]:
sample_mean = np.mean(func_days_data)

In [None]:
sample_median = np.median(func_days_data)

In [None]:
mean_median_difference = sample_mean - sample_median
print("Mean - median difference is:", mean_median_difference)

* If the difference above is positive, the mean is greater than the median, which indicates a positively (right) skewed distribution.

* The population standard deviation is unknown and the sample is large, so the sample mean is approximately normally distributed even for skewed data. Thus, I used a one-sample t-test.
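The mean-versus-median check can be wrapped in a small helper. This is a rough heuristic shown on made-up counts; the function name `skew_direction` is introduced here for illustration.

```python
import statistics

def skew_direction(values):
    """Classify skew from the sign of mean - median (a rough heuristic)."""
    diff = statistics.mean(values) - statistics.median(values)
    if diff > 0:
        return "right-skewed (positive)", diff
    if diff < 0:
        return "left-skewed (negative)", diff
    return "roughly symmetric", diff

# Made-up counts with a long right tail.
counts = [100, 150, 200, 250, 300, 2000]
label, diff = skew_direction(counts)
print(label, diff)
```

A single large value pulls the mean upward while the median stays put, which is why a mean above the median points to a right tail.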

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### Which missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### Which outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### Which categorical encoding techniques have you used and why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & Remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used and why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***