<a href="https://colab.research.google.com/github/Abhishek-singh0416/bike-sharing-demand-prediction/blob/main/bike_sharing_demand_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Abhishek Singh-**


# **Project Summary -**

The project focuses on the optimization of rental bike availability and accessibility in urban cities, with the overarching goal of enhancing mobility comfort for the public. In contemporary urban environments, the introduction of rental bikes has become a pivotal element in addressing the challenges of transportation and offering a sustainable and convenient alternative. The key objective is to ensure that rental bikes are not only available but also accessible to the public at the right time, thereby minimizing waiting times and contributing to a seamless urban transportation experience.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


*    Explore and understand hourly, daily, monthly, and yearly variations in bike
rentals. Identify any seasonality or trends in rental patterns.

*    Analyze the influence of weather conditions on bike rentals. Consider variables such as temperature, humidity, wind speed, visibility, and solar radiation.

*    Investigate the impact of categorical variables like seasons, holidays, and functioning days on bike demand.

*    Explore the difference in bike rental patterns on weekends versus weekdays.

*    Develop machine learning models to predict the bike count required at each hour. Consider time-series forecasting models and regression models to capture the dynamics of bike demand.

*    Assess the performance of the predictive models using appropriate metrics. Fine-tune the models to improve accuracy and reliability.

*    Extract actionable insights from the analysis and models. Provide recommendations for optimizing the supply of rental bikes based on the identified patterns.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor



### Dataset Loading

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/Abhishek-singh0416/bike-sharing-demand-prediction/main/SeoulBikeData.csv', encoding='ISO-8859-1')



### Dataset First View

In [None]:
# Dataset First Look
data.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = data.shape

# Display the count of rows and columns
print(f"Number of Rows: {rows}")
print(f"Number of Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values

### What did you know about your dataset?

The dataset has 8760 rows and 14 columns.

Columns include features like 'Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', and 'Functioning Day'.


There are no null values in any of the columns.
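
The duplicate and missing-value checks above can be gathered into a single summary table. The following is a minimal sketch on a toy frame; `quality_report` is a hypothetical helper name that does not appear in the notebook, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing share (%), and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
        "unique": df.nunique(),
    })

# Toy frame (assumed values) with one deliberately missing entry
toy = pd.DataFrame({
    "Hour": [0, 1, 2, 3],
    "Humidity(%)": [37.0, np.nan, 39.0, 40.0],
})

report = quality_report(toy)
print(report)
print("Duplicate rows:", int(toy.duplicated().sum()))
```

Calling `quality_report(data)` on the loaded dataset should show zero missing values in every column if the findings above hold.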

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

**Date**: The date of the observation.


**Rented Bike Count:** The number of bikes rented at a particular date and hour.


**Hour:** The hour of the day when the observation was recorded.


**Temperature(°C):** The temperature in degrees Celsius at the time of observation.


**Humidity(%):** The humidity percentage at the time of observation.


**Wind speed (m/s)**: The wind speed in meters per second.


**Visibility (10m):** The visibility, recorded in units of 10 meters as the column name indicates.


**Dew point temperature(°C):** The dew point temperature in degrees Celsius.


**Solar Radiation (MJ/m2):** The solar radiation in MegaJoules per square meter.


**Rainfall(mm):** The amount of rainfall in millimeters.


**Snowfall (cm):** The amount of snowfall in centimeters.


**Seasons:** The season during which the observation was made (e.g., Winter, Spring, Summer, Autumn).



**Holiday:** Indicates whether it's a holiday or not (e.g., "No Holiday", "Holiday").


**Functioning Day:** Indicates whether it's a functioning day or not (e.g., "Yes", "No").

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in data.columns:
    unique_values = data[column].unique()
    print(f"Unique values for {column}:\n{unique_values}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'],format='%d/%m/%Y')

# Extract day, month, and year into separate columns
data['Day'] = data['Date'].dt.day
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year

# Keep only 'Month'; 'Day' and 'Year' are not used further
data.drop(['Year', 'Day'], axis=1, inplace=True)


# Derive a 'Weekend' flag from 'Date' (1 for Saturday/Sunday, 0 for weekdays)
data['Weekend'] = data['Date'].apply(lambda x: 1 if x.weekday() >= 5 else 0)


# Rename complex column names
data.rename(columns={
    'Rented Bike Count': 'Rented_Bike_Count',
    'Temperature(°C)': 'Temperature',
    'Humidity(%)': 'Humidity',
    'Wind speed (m/s)': 'Wind_Speed',
    'Visibility (10m)': 'Visibility',
    'Dew point temperature(°C)': 'Dew_Point_Temperature',
    'Solar Radiation (MJ/m2)': 'Solar_Radiation',
    'Rainfall(mm)': 'Rainfall',
    'Snowfall (cm)': 'Snowfall',
    'Functioning Day': 'Functioning_Day',
}, inplace=True)

# Change data type of 'Hour', 'Weekend', and 'Month' to categorical
data['Hour'] = data['Hour'].astype('category')
data['Weekend'] = data['Weekend'].astype('category')
data['Month'] = data['Month'].astype('category')


In [None]:
data.head()

### What all manipulations have you done and insights you found?

Renamed complex column names to simpler identifiers.

Converted the 'Date' column to datetime format.

Extracted day, month, and year into separate columns.

Dropped the 'Day' and 'Year' columns, as they were not required.

Derived a new 'Weekend' column from 'Date' (1 for weekend, 0 for weekday).

Changed the data type of 'Hour', 'Weekend', and 'Month' to categorical.
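
The date handling described above can also be written without `apply`. This is a small sketch on invented dates in the dataset's day/month/year format; the frame name `toy` is an assumption for illustration only.

```python
import pandas as pd

# Toy dates in day/month/year format (values chosen for illustration)
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]})
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# Month as its own column
toy["Month"] = toy["Date"].dt.month

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturday and Sunday
toy["Weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)
print(toy)
```

The vectorized `.dt.dayofweek` comparison gives the same flag as the row-wise `weekday()` lambda while avoiding a Python-level loop.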

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12, 6))
hourly_avg = data.groupby('Hour')['Rented_Bike_Count'].mean()
sns.lineplot(x=hourly_avg.index, y=hourly_avg.values)
plt.title('Hourly Variations in Bike Rentals')
plt.xlabel('Hour')
plt.ylabel('Average Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?
The 'Hour' variable represents the time of day in hours, an ordered quantity. Line plots are well suited to ordered data and to showing trends over time.

Line plots are effective for visualizing trends in data. In this case, the plot shows how the average rented bike count changes over the hours of the day.

##### 2. What is/are the insight(s) found from the chart?

The average rented bike count is around 600 at the start of the day (midnight).


From midnight to around 5 AM, there is a significant drop in the average rented bike count, reaching around 200.

This suggests a period of low demand during the early morning hours.


Starting from around 5 AM, there is a steady increase in the average rented bike count, reaching a peak around 1000 at 8 AM.

This peak indicates a surge in bike rentals during the early morning hours, possibly corresponding to the start of the workday.


From 8 AM to 10 AM, there is a drop in the average rented bike count, settling around 600.

This suggests a decrease in demand after the morning peak, possibly as people have reached their destinations.


Starting from 10 AM, there is a substantial rise in the average rented bike count, exceeding 1400.

This surge in demand during late morning to early afternoon hours indicates increased bike usage, potentially for various activities.



After the peak around 16:00, there seems to be a gradual decline in the average rented bike count.

This decline may continue into the evening, suggesting decreasing demand as the day progresses.
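
The peak and trough hours can be read from the grouped averages rather than estimated from the plot. The sketch below uses invented numbers that do not reproduce the chart; only the column names follow the notebook.

```python
import pandas as pd

# Toy hourly records (assumed values, two observations per hour)
toy = pd.DataFrame({
    "Hour": [5, 5, 8, 8, 18, 18],
    "Rented_Bike_Count": [120, 160, 900, 1100, 1400, 1600],
})

hourly_avg = toy.groupby("Hour")["Rented_Bike_Count"].mean()
peak_hour = hourly_avg.idxmax()      # hour with the highest average
trough_hour = hourly_avg.idxmin()    # hour with the lowest average

print(f"Peak: {peak_hour}:00 ({hourly_avg.max():.0f} bikes on average)")
print(f"Trough: {trough_hour}:00 ({hourly_avg.min():.0f} bikes on average)")
```

Applied to the notebook's own `hourly_avg` series, the same two calls would give the exact hours behind the approximate values quoted above.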

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**


Understanding the hourly demand patterns allows for optimized bike availability during peak hours.

The insights help in efficient resource allocation, ensuring that a sufficient number of bikes are available during high-demand periods and reducing excess capacity during low-demand periods.


The business can focus on improving services during specific time frames, addressing user needs and preferences during peak hours to create a positive user experience.


**Potential Challenges and Negative Impact:**


If the surge in demand during certain hours surpasses the available bike capacity, it could lead to unmet user needs and dissatisfaction. This could result in negative reviews and a potential loss of business.


Managing fluctuations in demand may require dynamic resource allocation and operational adjustments. This could lead to increased operational costs during peak hours, impacting profitability.


Users might face frustration if there's a mismatch between demand and bike availability, especially during peak hours. This frustration can lead to a negative perception of the service.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
weather_columns = ['Temperature', 'Humidity', 'Wind_Speed', 'Visibility', 'Solar_Radiation', 'Rented_Bike_Count']

# Calculate the correlation matrix
correlation_matrix = data[weather_columns].corr()

# Display the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix: Weather Conditions vs. Bike Rentals')
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap is a powerful and efficient visualization tool for exploring correlations. It condenses a significant amount of information into a visually digestible format, making it an excellent choice for understanding the influence of weather conditions on bike rentals.

##### 2. What is/are the insight(s) found from the chart?

Temperature and solar radiation show relatively stronger positive correlations with bike rentals.


Humidity has a weak negative correlation, suggesting a  decrease in bike rentals with higher humidity.


Wind speed and visibility have weaker positive correlations, indicating subtle relationships with bike rentals.
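
The ranking of correlations described above can be produced as a sorted table instead of being read off the heatmap. This is a minimal sketch on invented, perfectly linear toy data; the notebook's real coefficients will differ.

```python
import pandas as pd

# Toy data (assumed values) using the renamed notebook column names
toy = pd.DataFrame({
    "Temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Humidity": [90.0, 80.0, 70.0, 60.0, 50.0],
    "Rented_Bike_Count": [100, 200, 300, 400, 500],
})

features = ["Temperature", "Humidity"]

# Pearson correlation of each feature with the target
corr_with_target = toy[features].corrwith(toy["Rented_Bike_Count"])

# Order by absolute strength while keeping the sign visible
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships from being buried at the bottom of the list.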

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Positive correlation with temperature and solar radiation suggests increased bike rentals during warmer and sunnier conditions.

 By optimizing bike availability during these conditions, the business can meet heightened demand, enhancing user satisfaction and revenue.


Understanding the influence of weather on bike rentals allows for targeted marketing during favorable weather conditions.
The business can implement weather-responsive marketing campaigns to attract users during periods when demand is likely to be high.


Awareness of weather-related patterns enables efficient resource allocation and operational planning.
The business can align staffing, maintenance, and bike distribution with expected demand, reducing operational costs and enhancing efficiency.


**Potential Challenges and Negative Growth:**

Bike rentals show a positive correlation with temperature and solar radiation.
Over-dependence on favorable weather conditions may lead to challenges during adverse weather, potentially resulting in lower demand and revenue.



A weak negative correlation with humidity suggests a slight decrease in bike rentals with higher humidity. High humidity might deter users, and if not managed well, it could lead to periods of lower demand and potential negative user experiences.

Limited capacity during peaks: positive correlations indicate increased demand during certain weather conditions. Managing bike availability during peak periods may become challenging if demand exceeds capacity, leading to unmet user needs and potential dissatisfaction.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
for weather_var in weather_columns[:-1]:  # Exclude 'Rented Bike Count'
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=weather_var, y='Rented_Bike_Count', data=data)
    plt.title(f'Scatter Plot: {weather_var} vs. Bike Rentals')
    plt.xlabel(weather_var)
    plt.ylabel('Rented Bike Count')
    plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are a versatile and informative visualization tool for exploring relationships between two continuous variables. They are particularly well-suited for analyzing the impact of weather variables on bike rentals, allowing for both qualitative and quantitative assessments.

##### 2. What is/are the insight(s) found from the chart?

As temperature increases, there is a noticeable trend of higher bike rentals.

The scatter plot for humidity shows a scattered distribution without a clear trend, although there may be a slight decrease in bike rentals at higher humidity.

Higher wind speeds are associated with slightly increased bike rentals.

Better visibility is associated with increased bike rentals.

Higher solar radiation is associated with increased bike rentals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Box plot for Seasons vs. Bike Rentals
plt.figure(figsize=(10, 6))
sns.boxplot(x='Seasons', y='Rented_Bike_Count', data=data)
plt.title('Seasons vs. Bike Rentals')
plt.show()


##### 1. Why did you pick the specific chart?

The box plot is a suitable choice for exploring the distributional characteristics of bike rentals across different categories of season. It provides a concise and informative summary of the data, making it easier to identify patterns and variations in demand.

##### 2. What is/are the insight(s) found from the chart?



Bike rentals are highest during the summer season.



 Bike rentals in spring and autumn follow a similar pattern, but with lower demand compared to summer.


 Bike rentals are significantly lower during the winter season.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Optimizing resources based on seasonal patterns ensures that the business meets increased demand, leading to positive user experiences and potential revenue growth.

Season-specific marketing campaigns can attract more users during peak seasons, driving engagement and potentially increasing customer acquisition.

Implementing dynamic pricing, such as discounts during off-peak seasons, can incentivize users to rent bikes, leading to increased utilization and revenue.

Users are more likely to have a positive experience when bikes are readily available during their preferred seasons, leading to increased customer satisfaction and loyalty.

**Potential Challenges and Negative Growth**:



Depending solely on bike rentals might pose challenges during the winter season. The business may experience a decline in revenue, and maintaining a large fleet during this period could lead to underutilized resources.


Over-reliance on favorable weather might result in challenges during unexpected weather events. Unpredictable weather patterns could impact user behavior and demand.


 Frequent adjustments in fleet size may incur additional maintenance costs. Proper planning is required to manage maintenance efficiently.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Box plot for Holiday vs. Bike Rentals
plt.figure(figsize=(8, 5))
sns.boxplot(x='Holiday', y='Rented_Bike_Count', data=data)
plt.title('Holiday vs. Bike Rentals')
plt.show()


##### 1. Why did you pick the specific chart?

The box plot is a versatile and informative visualization tool, well-suited for comparing distributions and identifying patterns in numerical data across different categories. It provides a comprehensive summary of the data distribution, making it an effective choice for investigating the impact of holidays on bike rentals.

##### 2. What is/are the insight(s) found from the chart?

Bike rentals are higher on non-holiday days.

The higher demand on non-holiday days suggests that users are more likely to utilize bike rental services on regular working days.


Bike rentals decrease on holidays.

During holidays, users may engage in different activities, travel, or spend time with family, leading to a decrease in demand for bike rentals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**:

Adjusting resources based on the observed higher demand on non-holiday days can lead to resource optimization.

Targeted marketing: tailoring marketing strategies to highlight the convenience of bike rentals on regular working days can attract more users during those periods.


**Potential Challenges and Negative Growth**:

The observed decline in bike rentals during holidays could pose a challenge, as the business may experience reduced revenue during these periods.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Box plot for Functioning Day vs. Bike Rentals
plt.figure(figsize=(8, 5))
sns.boxplot(x='Functioning_Day', y='Rented_Bike_Count', data=data)
plt.title('Functioning Day vs. Bike Rentals')
plt.show()


##### 1. Why did you pick the specific chart?

The box plot is a suitable choice for exploring the distributional characteristics of bike rentals across different categories of functioning days. It provides a concise and informative summary of the data, making it easier to identify patterns and variations in demand.

##### 2. What is/are the insight(s) found from the chart?

There are no bike rentals on non-functioning days.

The bike rental service operates exclusively on functioning days; the zero counts on non-functioning days therefore reflect the service being unavailable rather than low demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Challenge: Users who seek transportation solutions on non-functioning days might turn to alternative services if bike rentals are not available. This could result in customer dissatisfaction and potential loss of user base.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Create a new variable for Day of the Week
data['Day_of_Week'] = data['Date'].dt.day_name()

# Convert 'Day_of_Week' to a categorical variable
data['Day_of_Week'] = pd.Categorical(data['Day_of_Week'], categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)

# Aggregate Bike Rentals by Day of the Week
daily_avg_rentals = data.groupby('Day_of_Week')['Rented_Bike_Count'].mean()

# Visualize the Bike Rental Patterns
plt.figure(figsize=(12, 6))
sns.barplot(x=daily_avg_rentals.index, y=daily_avg_rentals.values, palette="viridis")
plt.title('Average Bike Rentals by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rented Bike Count')
plt.show()

# Explore Hourly Patterns on Weekends and Weekdays
plt.figure(figsize=(12, 6))
sns.lineplot(x='Hour', y='Rented_Bike_Count', hue='Day_of_Week', data=data, palette="viridis")
plt.title('Hourly Bike Rentals by Day of the Week')
plt.xlabel('Hour')
plt.ylabel('Rented Bike Count')
plt.legend(title='Day of the Week')
plt.show()



##### 1. Why did you pick the specific chart?

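The bar chart compares average rentals across the days of the week, and the line plot shows how the hourly pattern differs from day to day. Together they present the information in a visually effective and interpretable manner.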

##### 2. What is/are the insight(s) found from the chart?

Average bike rentals remain relatively consistent from Monday to Friday.

Average bike rentals are lower on Sundays compared to weekdays.
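
The weekday and weekend profiles can be compared side by side in one table. The sketch below uses invented counts; the `Weekday` and `Weekend` column labels and the `Difference` column are naming assumptions for illustration.

```python
import pandas as pd

# Toy records (assumed values): two hours, weekday (0) and weekend (1) rows
toy = pd.DataFrame({
    "Hour": [8, 8, 18, 18, 8, 8, 18, 18],
    "Weekend": [0, 0, 0, 0, 1, 1, 1, 1],
    "Rented_Bike_Count": [900, 1100, 1500, 1700, 300, 500, 1000, 1200],
})

# Average rentals per hour, one column per value of the Weekend flag
profile = toy.pivot_table(index="Hour", columns="Weekend",
                          values="Rented_Bike_Count", aggfunc="mean")
profile.columns = ["Weekday", "Weekend"]   # flag 0 -> Weekday, 1 -> Weekend
profile["Difference"] = profile["Weekday"] - profile["Weekend"]
print(profile)
```

A large positive difference at commuting hours would support the reading that weekday demand is driven by travel to and from work.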

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
 The consistent demand for bike rentals on weekdays allows for efficient operational planning. Optimizing resources and staffing during weekdays can lead to cost-effective operations.




Understanding the reasons behind lower Sunday rentals allows for user-centric adaptations. By addressing potential barriers or offering incentives, the business can enhance the overall user experience and satisfaction.

**Potential Challenges and Negative Growth:**

The observed decrease in bike rentals on Sundays poses a challenge, as it indicates a potential dip in demand on weekends.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Pearson correlation between 'Rented_Bike_Count' and each weather variable
print("Correlation between 'Rented_Bike_Count' and")

for col in weather_columns[:-1]:  # Exclude 'Rented_Bike_Count' itself
    correlation = data['Rented_Bike_Count'].corr(data[col])
    print(f" '{col}': {correlation:.3f}")




##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Statement 1**:

There is a significant difference in the average bike rentals between weekdays (Monday to Friday) and weekends (Saturday and Sunday).


**Statement 2**:

The average bike rentals on Sundays are significantly lower than the average bike rentals on Saturdays.


**Statement 3**:

The hourly bike rentals on weekdays during office hours (9 AM to 5 PM) are significantly higher than during non-office hours.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: The average bike rentals on weekdays are equal to the average bike rentals on weekends.



**Alternative Hypothesis (H1)**: The average bike rentals on weekdays are different from the average bike rentals on weekends.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Statement 1: Average bike rentals between weekdays and weekends
weekday_rentals = data[data['Weekend'] == 0]['Rented_Bike_Count']
weekend_rentals = data[data['Weekend'] == 1]['Rented_Bike_Count']

t_statistic, p_value = stats.ttest_ind(weekday_rentals, weekend_rentals)
print(f"Statement 1 Result: p-value = {p_value}")


##### Which statistical test have you done to obtain P-Value?

Test: Independent samples t-test

##### Why did you choose the specific statistical test?

The objective was to compare the means of two independent groups.

The variable of interest, 'Rented_Bike_Count', represents continuous numerical data.

The t-test assumes that the data within each group are approximately normally distributed. With samples as large as these, the test is robust to moderate violations of normality, which makes it a reasonable choice here.




### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

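**Null Hypothesis (H0)**: The average bike rentals on Sundays are equal to or higher than the average bike rentals on Saturdays.

**Alternative Hypothesis (H1)**: The average bike rentals on Sundays are lower than the average bike rentals on Saturdays.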

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
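
One way to test Statement 2 is a one-sided permutation test on the difference in means. This is a different technique from the independent samples t-test used for Statement 1, shown here as a sketch because it makes no normality assumption. The function name `permutation_pvalue_less`, the number of permutations, the seed, and the toy samples are all assumptions for illustration. The test also treats hourly observations as exchangeable, which is a simplification for time-ordered data.

```python
import numpy as np

def permutation_pvalue_less(a, b, n_permutations=5000, seed=0):
    """One-sided p-value for H1: mean(a) < mean(b), by shuffling group labels."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        # Count shuffles at least as extreme as the observed difference
        if diff <= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive
    return (count + 1) / (n_permutations + 1)

# Toy samples (assumed values); in the notebook these would be the
# Rented_Bike_Count values where Day_of_Week is 'Sunday' and 'Saturday'
sunday = [520, 480, 610, 450, 500, 530]
saturday = [700, 760, 690, 720, 810, 740]

p_value = permutation_pvalue_less(sunday, saturday)
print(f"One-sided permutation p-value: {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis in favour of lower Sunday rentals.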

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

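**Null Hypothesis (H0)**: On weekdays, the average hourly bike rentals during office hours (9 AM to 5 PM) are equal to or lower than during non-office hours.

**Alternative Hypothesis (H1)**: On weekdays, the average hourly bike rentals during office hours (9 AM to 5 PM) are higher than during non-office hours.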

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
def handle_outliers_iqr(data, column):
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)

    # Calculate IQR
    IQR = Q3 - Q1

    # Define lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Clip the outliers
    data[column] = np.clip(data[column], lower_bound, upper_bound)

# List of columns with outliers
columns_with_outliers = ['Temperature', 'Humidity', 'Wind_Speed', 'Visibility',
                         'Dew_Point_Temperature', 'Solar_Radiation']

# Apply the function to each column in the data
for column in columns_with_outliers:
    handle_outliers_iqr(data, column)



In [None]:
# Visualize the result
for column in columns_with_outliers:
    sns.boxplot(x=data[column])
    plt.title(f'Boxplot of {column} after Outlier Handling')
    plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

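IQR-based capping was used. For each listed column, values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were clipped to those bounds with `np.clip`. Capping keeps every row in the dataset while limiting the influence of extreme values.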
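
The IQR rule can be checked by hand on a small series. The sketch below is a non-mutating variant of the approach; `iqr_bounds` is a hypothetical helper name and the toy values are invented.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds at k times the IQR beyond Q1 and Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series (assumed values) with one extreme observation
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="Wind_Speed")

lower, upper = iqr_bounds(toy)
capped = toy.clip(lower=lower, upper=upper)   # returns a new Series
n_capped = int((toy != capped).sum())

print(f"bounds=({lower}, {upper}), values capped: {n_capped}")
print(capped.tolist())
```

Counting how many values change per column is a quick way to see how aggressive the capping is before committing to it.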

### 3. Categorical Encoding

In [None]:
print(data.info())
data['Holiday'].unique()

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'Seasons' column
data['Seasons'] = label_encoder.fit_transform(data['Seasons'])

# Display the first few rows to check the result
data.head()

In [None]:
categorical_features = ['Hour', 'Seasons']
one_hot = OneHotEncoder()

#### What all categorical encoding techniques have you used & why did you use those techniques?

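Label encoding (`LabelEncoder`) was applied to 'Seasons', mapping each season to an integer. A `OneHotEncoder` was also set up for 'Hour' and 'Seasons', so that a model does not read an artificial order into the hour or season codes. 'Holiday' and 'Functioning_Day' are left unencoded at this stage.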
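
The difference between integer codes and one-hot columns is easiest to see on a few rows. This sketch uses pandas in place of the scikit-learn encoders, purely for illustration; the frame `toy` and the name `season_dummies` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# Integer codes assigned in sorted (alphabetical) order of the category names,
# which is also the order LabelEncoder uses
toy["Seasons_code"] = pd.Categorical(toy["Seasons"]).codes

# One indicator column per season; exactly one of them is 1 in each row
season_dummies = pd.get_dummies(toy["Seasons"], prefix="Seasons", dtype=int)
print(toy.join(season_dummies))
```

The integer codes imply that Winter (3) is "greater" than Autumn (0), which has no physical meaning. The indicator columns avoid that at the cost of extra features.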

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
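# Manipulate Features to minimize feature correlation and create new features
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

continuous_features = ['Temperature', 'Humidity',
       'Wind_Speed', 'Visibility', 'Dew_Point_Temperature', 'Solar_Radiation',
       'Rainfall', 'Snowfall']
scaler = StandardScaler()

# Combine OneHotEncoder and StandardScaler using ColumnTransformer.
# Columns not listed in either group are dropped (default remainder='drop').
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', one_hot, categorical_features),
        ('cont', scaler, continuous_features)
    ])

# Chain preprocessing and the regressor so both are fitted together
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])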
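
One detail of `ColumnTransformer` is easy to miss: with its default `remainder='drop'`, any column that is not named in a transformer never reaches the model. The sketch below illustrates this on a toy frame; the data and the helper `n_output_columns` are hypothetical, and whether unlisted columns should be passed through in this project is a modelling decision, not something the sketch settles.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values); 'Holiday' is not assigned to any transformer
toy = pd.DataFrame({
    'Hour': [0, 8, 18, 8],
    'Temperature': [-5.0, 10.0, 25.0, 30.0],
    'Holiday': [0, 0, 1, 0],
})

def n_output_columns(remainder):
    """Fit a small ColumnTransformer and return how many columns it outputs."""
    ct = ColumnTransformer(
        transformers=[
            ('cat', OneHotEncoder(), ['Hour']),
            ('cont', StandardScaler(), ['Temperature']),
        ],
        remainder=remainder,
    )
    return ct.fit_transform(toy).shape[1]

dropped = n_output_columns('drop')         # 3 hour indicators + 1 scaled column
kept = n_output_columns('passthrough')     # same, plus the untouched 'Holiday'

print(dropped, kept)                       # 4 5
```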

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
X = data.drop('Rented_Bike_Count', axis=1)
y = data['Rented_Bike_Count']

##### Which feature selection methods have you used, and why?

Answer Here.

##### Which features did you find important, and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.

In [None]:
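# Transform Your data
# astype() returns a new Series, so assign the result back to the column
data['Weekend'] = data['Weekend'].astype(int)

data['Seasons'] = data['Seasons'].astype(int)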



### 6. Data Scaling

In [None]:
# Scaling your data


##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
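# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)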

##### What data splitting ratio have you used and why?

Answer Here.
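
The split above shuffles rows before holding out 20% of them. Because the records are hourly, a chronological split (training on the earlier rows, testing on the later ones) is a possible alternative worth comparing. The sketch below contrasts the two on toy arrays in which the row index stands in for time order; it is an illustration under that assumption, not a statement about which split suits this project better.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, where the value is the row's position in time
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# Random 80/20 split, as in the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Chronological 80/20 split: shuffle=False keeps row order,
# so the last 20 rows become the test set
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False)

print(len(X_tr), len(X_te))                   # 80 20
print(X_te_c.ravel()[0], X_te_c.ravel()[-1])  # 80 99
```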

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Fit the Algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
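# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error

# Evaluate the model: RMSE is in the same units as the target (bikes per hour)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f'Root Mean Squared Error: {rmse}')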
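
RMSE alone gives one view of the error. A score table often also reports MAE and R², both available in `sklearn.metrics`. The sketch below computes all three on toy values; the helper `regression_report` is a hypothetical name and the numbers are invented, so they say nothing about this model's actual performance.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_hat):
    """Return RMSE, MAE and R^2 for a set of predictions as a dict."""
    mse = mean_squared_error(y_true, y_hat)
    return {
        'RMSE': float(np.sqrt(mse)),
        'MAE': float(mean_absolute_error(y_true, y_hat)),
        'R2': float(r2_score(y_true, y_hat)),
    }

# Toy targets and predictions (hypothetical); every prediction is off by 10
y_true_toy = np.array([100, 200, 300, 400])
y_hat_toy = np.array([110, 190, 310, 390])

report = regression_report(y_true_toy, y_hat_toy)
print(report)   # RMSE 10.0, MAE 10.0, R2 approximately 0.992
```

In the notebook, the same helper could be called as `regression_report(y_test, y_pred)` after each model is fitted.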

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation; the whole pipeline is refitted on each fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# scikit-learn negates MSE so that higher is better; convert to per-fold RMSE
rmse_scores = (-cv_scores) ** 0.5

In [None]:
rmse_scores

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
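
The cell above runs cross-validation with fixed hyperparameters; it does not search over them. A minimal sketch of a grid search follows, assuming synthetic data from `make_regression` and a small illustrative grid. Neither the grid values nor the resulting scores come from this project; the `regressor__` prefix refers to the pipeline step name, matching the naming used in the notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=42)),
])

# Illustrative grid: parameters of a pipeline step are addressed as <step>__<param>
param_grid = {
    'regressor__n_estimators': [20, 50],
    'regressor__max_depth': [3, None],
}

search = GridSearchCV(
    pipe, param_grid, cv=3, scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_   # undo the sign flip to get a positive RMSE
print(best_params, best_rmse)
```

`RandomizedSearchCV` takes the same pipeline and is usually cheaper when the grid grows large.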

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
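# ML Model - 3 Implementation
from sklearn.linear_model import LinearRegression

# Linear regression baseline using the same preprocessing as Model 1
model2 = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the Algorithm
model2.fit(X_train, y_train)

# Predict on the model (this overwrites y_pred from Model 1)
y_pred = model2.predict(X_test)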



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f'Root Mean Squared Error: {rmse}')

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
cv_scores = cross_val_score(model2, X, y, cv=5, scoring='neg_mean_squared_error')


# Convert negative MSE to RMSE
rmse_scores2 = (-cv_scores) ** 0.5


In [None]:
rmse_scores2

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.
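
One explainability tool that ships with scikit-learn is permutation importance: each feature is shuffled in turn, and the resulting drop in the model's score indicates how much the model relies on it. The sketch below is a minimal illustration on synthetic data in which only one column actually drives the target; the column names and values are assumptions for the demo and do not reflect results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): the target depends on 'Temperature' only
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Temperature': rng.normal(size=300),
    'Noise': rng.normal(size=300),
})
target = 5.0 * demo['Temperature'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(demo, target)

# Shuffle each column 5 times and record the mean drop in R^2
result = permutation_importance(
    forest, demo, target, n_repeats=5, random_state=42)

importances = pd.Series(
    result.importances_mean, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

Because `permutation_importance` accepts any fitted estimator, a fitted pipeline can be passed together with the untransformed feature frame, so importances are reported per original column rather than per one-hot indicator.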

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
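
A minimal sketch of the save-and-reload round trip with `joblib` follows. It uses a tiny stand-in model and a temporary directory so that it runs on its own; the file name `bike_demand_model.joblib` is hypothetical, and in the notebook the fitted pipeline would be dumped instead of the toy model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model (assumption): fitted on toy data following y = 2x + 1
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path with write access works
    model_path = os.path.join(tmp_dir, 'bike_demand_model.joblib')
    joblib.dump(fitted, model_path)       # save the fitted estimator
    restored = joblib.load(model_path)    # load it back from disk

# Sanity check: the restored model reproduces the original predictions
same_predictions = np.allclose(
    fitted.predict(X_small), restored.predict(X_small))
print(same_predictions)                   # True
```

Saving the whole pipeline, rather than the regressor alone, keeps the encoder and scaler together with the model so that raw feature rows can be passed straight to `predict` after loading.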

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The final model chosen for predicting bike-sharing demand is the RandomForestRegressor. It was selected over the LinearRegression baseline on the basis of RMSE, the metric this notebook computes on the 20% held-out test split and under 5-fold cross-validation.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***