<a href="https://colab.research.google.com/github/Preetirai-tech/Bike-Sharing-Demand-Prediction/blob/main/Bike_Sharing_Demand_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Bike Sharing Demand Prediction**



## **Project Type** - **Regression**
## **Contribution**  -  **Individual (Preeti Rai)** 
<br>

![Screenshot (32)](https://user-images.githubusercontent.com/102009481/177841865-7d86b86b-2849-4240-92c5-26ee85b8715b.png)


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Preetirai-tech/Bike-Sharing-Demand-Prediction

# **Index**

1. Problem Statement
2. Know Your Data
3. Understanding Your Variables
4. EDA
5. Data Cleaning
6. Feature Engineering
7. Model Building
8. Model Implementation
9. Conclusion

# ***Let's Begin !***

# **1. Problem Statement**


**The "Bike Sharing Demand Prediction" project addresses the challenge faced by bike sharing companies in accurately forecasting and meeting the fluctuating demand for bike rentals. The unpredictable nature of bike rental demand poses difficulties in managing fleet size, allocating resources, and providing optimal customer service. Without a reliable demand prediction system, bike sharing companies often struggle to ensure a sufficient number of bikes are available during peak periods, resulting in frustrated customers and missed revenue opportunities. Conversely, overestimating demand leads to surplus bikes and unnecessary operational costs. Therefore, the problem at hand is to develop a robust machine learning model that can accurately forecast bike rental demand, enabling companies to optimize fleet management, allocate resources efficiently, and deliver an exceptional user experience while maximizing profitability.**

## ***2. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# data visualisation and manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno

pd.set_option('display.max_columns', 500)

plt.style.use('ggplot')





import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Seoul Bike Dataset
bike_sharing_df = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Capstone Project/Supervised: Regression/SeoulBikeData.csv", 
                              encoding ='latin')

### Dataset First View

In [None]:
# Head of the data
bike_sharing_df.head()

In [None]:
# tail of the data
bike_sharing_df.tail()

### Dataset Rows & Columns count

In [None]:
# Shape of the data
bike_sharing_df.shape

There are 8760 rows and 14 columns in this dataset.

In [None]:
# Column names in the data
bike_sharing_df.columns

### Dataset Information

In [None]:
# Dataset Info
bike_sharing_df.info()

**Observation:**

- **Float64 datatype:** 6 columns, i.e. ``Temperature(°C)``, ``Wind speed (m/s)``, ``Dew point temperature(°C)``, ``Solar Radiation (MJ/m2)``, ``Rainfall(mm)`` & ``Snowfall (cm)``.
- **Int64 datatype:** 4 columns, i.e. ``Rented Bike Count``, ``Hour``, ``Humidity(%)`` & ``Visibility (10m)``.
- **Object datatype:** 4 columns, i.e. ``Date``, ``Seasons``, ``Holiday`` & ``Functioning Day``.








In [None]:
# Number of unique values in each columns
bike_sharing_df.nunique()

**From the above result, it is observed that this dataset contains bike rental data for 1 year (since the ``Date`` column has 365 unique values).**
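One way to double-check the one-year reading is to confirm that every date carries 24 hourly rows, since 365 × 24 = 8760 matches the row count reported earlier. The sketch below runs on a small synthetic frame (`toy_df` is a hypothetical stand-in for `bike_sharing_df`, and the dates are made up), so it only illustrates the check.

```python
import pandas as pd

# Hypothetical stand-in for bike_sharing_df: 3 days x 24 hourly records
dates = pd.date_range("2017-12-01", periods=3, freq="D").strftime("%d/%m/%Y")
toy_df = pd.DataFrame({
    "Date": [d for d in dates for _ in range(24)],
    "Hour": list(range(24)) * 3,
})

# Rows per calendar date; a complete hourly log has exactly 24 per date
rows_per_day = toy_df.groupby("Date")["Hour"].count()
is_complete = bool((rows_per_day == 24).all())

# Unique dates x 24 should equal the total number of rows
expected_rows = toy_df["Date"].nunique() * 24
print(is_complete, expected_rows == len(toy_df))
```

Applied to the real data, the same two checks would be run on `bike_sharing_df` directly.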

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('Number of duplicated rows in the dataset:', bike_sharing_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_sharing_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.matrix(bike_sharing_df)

**From the above results, it is evident that there are no missing values in the dataset.**

### What do you know about your dataset?

- **The dataset contains 8760 rows and 14 columns.**
- **There are 6 columns of datatype float64, 4 columns of datatype int64 and 4 columns of datatype object.**
- **There are no missing or duplicate values in the dataset.**
- **The dataset contains bike rental data of 1 year.**
- **Input features: ``Date``, ``Hour``, ``Temperature(°C)``, ``Humidity(%)``, ``Wind speed (m/s)``, ``Visibility (10m)``, ``Dew point temperature(°C)``, ``Solar Radiation (MJ/m2)``, ``Rainfall(mm)``, ``Snowfall (cm)``, ``Seasons``, ``Holiday`` & ``Functioning Day``**
- **Target feature: ``Rented Bike Count``** 

## ***3. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_sharing_df.columns.tolist()

In [None]:
# Dataset Describe
bike_sharing_df.describe(include = 'all').T

In [None]:
bike_sharing_df['Seasons'].value_counts()

In [None]:
bike_sharing_df['Functioning Day'].value_counts()

In [None]:
bike_sharing_df['Holiday'].value_counts()

### Variables Description 

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

**Attribute Information:**
- **Date:** The specific calendar date for the bike rental record.
- **Rented Bike Count:** The number of bikes rented during the given hour.
- **Temperature:** The temperature in Celsius at the time of the bike rental.
- **Humidity:** The relative humidity percentage at the time of the bike rental.
- **Wind Speed:** The speed of the wind in meters per second at the time of the bike rental.
- **Visibility:** The visibility at the time of the bike rental, recorded in units of 10 meters (per the ``Visibility (10m)`` column header).
- **Dew Point Temperature:** The temperature at which air becomes saturated and dew forms at the time of the bike rental.
- **Solar Radiation:** The amount of solar radiation in mega-joules per square meter at the time of the bike rental.
- **Rainfall:** The amount of rainfall in millimeters at the time of the bike rental.
- **Snowfall:** The amount of snowfall in centimeters at the time of the bike rental.
- **Seasons:** The four seasons (Spring, Summer, Autumn, Winter) corresponding to the bike rental record.

- **Holiday:** A categorical variable indicating whether the day of the bike rental record is a holiday or not. It has two possible values: "Holiday" and "No Holiday". The "Holiday" value represents a day that is recognized as a holiday, while the "No Holiday" value represents a regular day that is not a designated holiday.

- **Functioning Day:** A categorical variable indicating whether the bike rental service was functioning on the day of the record. It has two possible values: "Yes" and "No". The "Yes" value indicates that the bike rental service was operational and functioning normally on that day. Conversely, the "No" value indicates that the bike rental service was not operating, potentially due to maintenance, strikes, or other reasons.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
for i in bike_sharing_df.columns.to_list():
  print('Number of unique values in', i, 'is', bike_sharing_df[i].nunique())

In [None]:
# Converting Date column of datatype Object to Datetime datatype
bike_sharing_df['Date'] = pd.to_datetime(bike_sharing_df['Date'], dayfirst = True)
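The `dayfirst = True` argument matters because a string such as `01/12/2017` is ambiguous. The minimal sketch below uses a made-up sample value and assumes the file stores dates as DD/MM/YYYY, which is what the `dayfirst` setting above implies.

```python
import pandas as pd

raw = "01/12/2017"  # hypothetical value: 1 December or 12 January?

day_first = pd.to_datetime(raw, dayfirst=True)     # read as DD/MM/YYYY
month_first = pd.to_datetime(raw, dayfirst=False)  # read as MM/DD/YYYY

# An explicit format removes the ambiguity entirely
explicit = pd.to_datetime(raw, format="%d/%m/%Y")

print(day_first.date(), month_first.date(), explicit.date())
```

Passing an explicit `format` is stricter than `dayfirst`, since it raises an error on strings that do not match instead of guessing.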

In [None]:
# Extracting day name feature
bike_sharing_df['Day'] = bike_sharing_df['Date'].dt.day_name()

# Extracting month name feature
bike_sharing_df['Month'] = bike_sharing_df['Date'].dt.month_name()

# Extracting year feature
bike_sharing_df['Year'] = bike_sharing_df['Date'].dt.year



In [None]:
# Dropping Date column
bike_sharing_df.drop(columns = ['Date'], inplace = True)

In [None]:
# Rename the verbose column names
bike_sharing_df = bike_sharing_df.rename(columns={
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind Speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew point temperature',
                                'Solar Radiation (MJ/m2)':'Solar Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                              })
     

In [None]:
bike_sharing_df.sample(3)

In [None]:
# convert Hour and Year columns from integer to object
bike_sharing_df['Hour'] = bike_sharing_df['Hour'].astype('object')
bike_sharing_df['Year'] = bike_sharing_df['Year'].astype('object')

## 4. ***Exploratory Data Analysis***

**What is EDA?**

- EDA stands for Exploratory Data Analysis. It is a crucial step in the data analysis process that involves exploring and understanding the characteristics, patterns, and relationships within a dataset. EDA aims to uncover insights, identify patterns, detect outliers, and gain a deeper understanding of the data before conducting further analysis or modeling.

### **4.1 Numeric and Categorical Features**

In [None]:
# Dividing data into numerical and categorical features

categorical_features = bike_sharing_df.select_dtypes(include = 'object')
numerical_features = bike_sharing_df.select_dtypes(exclude = 'object')


In [None]:
categorical_features.head(2)

In [None]:
numerical_features.head(2)

### **4.2 Univariate Analysis**

#### **4.2.1 Data Distribution of Numeric features**

In [None]:
# figsize
plt.figure(figsize=(15,10))

# title
plt.suptitle('Data Distribution of Numeric Features', fontsize = 20, fontweight = 'bold', y=1.02)

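for i, col in enumerate(numerical_features.columns):
  # subplots 3 rows and 3 columns (one panel per numeric feature)
  plt.subplot(3, 3, i+1)

  # histogram with KDE overlay (sns.distplot is deprecated in recent seaborn)
  sns.histplot(bike_sharing_df[col], kde=True)

  plt.title(col)

# adjust spacing once, after all panels are drawn
plt.tight_layout()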
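A numeric companion to the distribution plots is the sample skewness of each column. The sketch below uses synthetic columns (`toy_df` and its column names are hypothetical) to show how `DataFrame.skew()` separates a roughly symmetric feature from a right-skewed one. It makes no claim about the actual values in this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly symmetric column and a right-skewed one
toy_df = pd.DataFrame({
    "symmetric": rng.normal(loc=10, scale=2, size=2000),
    "right_skewed": rng.exponential(scale=1.0, size=2000),
})

# Sample skewness per column; values well above 0 indicate a long right tail
skewness = toy_df.skew()
print(skewness.round(2))
```

On the real data this would be `numerical_features.skew()`. Strongly skewed columns are the usual candidates for a transformation later on.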



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
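One common treatment, shown here only as a sketch and not necessarily the one this project settles on, is to cap values at the Tukey fences (Q1 − 1.5·IQR and Q3 + 1.5·IQR). The helper name `cap_outliers_iqr` and the sample readings are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical wind-speed readings with one extreme value
wind = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 20.0])
capped = cap_outliers_iqr(wind)
print(capped.tolist())
```

Capping keeps every row, which matters for an hourly series where dropping records would leave gaps. Whether extreme weather readings should be capped at all is a judgment call, since they may be genuine.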

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
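As a sketch of one possible encoding, and not a record of what was finally used, the two-level columns can be mapped to 0/1 and the multi-level `Seasons` column one-hot encoded. The rows in `toy_df` are hypothetical but use the category labels described in the variable descriptions.

```python
import pandas as pd

# Hypothetical rows using the category labels described for this dataset
toy_df = pd.DataFrame({
    "Holiday": ["No Holiday", "Holiday", "No Holiday"],
    "Functioning Day": ["Yes", "Yes", "No"],
    "Seasons": ["Winter", "Spring", "Summer"],
})

# Two-level columns: map each label to 0/1
toy_df["Holiday"] = toy_df["Holiday"].map({"No Holiday": 0, "Holiday": 1})
toy_df["Functioning Day"] = toy_df["Functioning Day"].map({"No": 0, "Yes": 1})

# Multi-level column: one-hot encode into Seasons_<label> indicator columns
encoded = pd.get_dummies(toy_df, columns=["Seasons"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` to `pd.get_dummies` avoids perfectly collinear indicator columns.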

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
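A minimal sketch of a random hold-out split using pandas alone follows. The 80/20 ratio, the seed, and the synthetic `toy_df` are assumptions for illustration, not choices documented in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in with one feature and the target column
toy_df = pd.DataFrame({
    "Temperature": rng.normal(15, 10, size=100),
    "Rented Bike Count": rng.integers(0, 2000, size=100),
})

# Assumed 80/20 split: sample the training rows, keep the rest for testing
train_df = toy_df.sample(frac=0.8, random_state=42)
test_df = toy_df.drop(train_df.index)

# Separate the features from the target
target = "Rented Bike Count"
X_train, y_train = train_df.drop(columns=target), train_df[target]
X_test, y_test = test_df.drop(columns=target), test_df[target]

print(X_train.shape, X_test.shape)
```

scikit-learn's `train_test_split` is the more common way to do the same thing. Either way, any scaler or encoder should be fitted on the training part only.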

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
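Before charting anything, the usual regression metrics can be computed from the true and predicted counts. The sketch below defines a hypothetical helper, `regression_scores`, with NumPy only. The sample numbers are made up and are not results from any model in this project.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                    # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up hourly counts and predictions, for illustration only
scores = regression_scores([100, 200, 300, 400], [110, 190, 310, 390])
print(scores)
```

MAE and RMSE are in units of bikes per hour, which makes them easier to explain to an operations team than R².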

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
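The save and reload round trip can be sketched with the standard-library `pickle` module. Here a plain dictionary stands in for a fitted model (the parameters and the file name `best_model.pkl` are hypothetical). A real estimator object would be dumped and loaded the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object works the same way
model = {"intercept": 50.0, "coef": [12.5, -3.0]}  # hypothetical parameters

# Write to a temporary directory so the sketch leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")

# Save the object
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the round trip preserved it
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.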

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***