<a href="https://colab.research.google.com/github/Preetirai-tech/Bike-Sharing-Demand-Prediction/blob/main/Bike_Sharing_Demand_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Bike Sharing Demand Prediction**



## **Project Type** - **Regression**
## **Contribution**  -  **Individual (Preeti Rai)** 
<br>

![Screenshot (32)](https://user-images.githubusercontent.com/102009481/177841865-7d86b86b-2849-4240-92c5-26ee85b8715b.png)


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Preetirai-tech/Bike-Sharing-Demand-Prediction

# **Index**

1. Problem Statement
2. Know Your Data
3. Understanding Your Variables
4. EDA
5. Data Cleaning
6. Feature Engineering
7. Model Building
8. Model Implementation
9. Conclusion

# **Let's Begin !**

# **1. Problem Statement**


**The "Bike Sharing Demand Prediction" project addresses the challenge faced by bike sharing companies in accurately forecasting and meeting the fluctuating demand for bike rentals. The unpredictable nature of bike rental demand poses difficulties in managing fleet size, allocating resources, and providing optimal customer service. Without a reliable demand prediction system, bike sharing companies often struggle to ensure a sufficient number of bikes are available during peak periods, resulting in frustrated customers and missed revenue opportunities. Conversely, overestimating demand leads to surplus bikes and unnecessary operational costs. Therefore, the problem at hand is to develop a robust machine learning model that can accurately forecast bike rental demand, enabling companies to optimize fleet management, allocate resources efficiently, and deliver an exceptional user experience while maximizing profitability.**

# **2. Know Your Data**

### Import Libraries

In [None]:
# Import Libraries

# data visualisation and manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno

pd.set_option('display.max_columns', 500)

plt.style.use('ggplot')





import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Seoul Bike Dataset
bike_sharing_df = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Capstone Project/Supervised: Regression/SeoulBikeData.csv", 
                              encoding ='latin')

### Dataset First View

In [None]:
# Display the first 5 rows
bike_sharing_df.head()

In [None]:
# Display the last 5 rows
bike_sharing_df.tail()

In [None]:
# Check a random sample
bike_sharing_df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dimensions of the dataset
bike_sharing_df.shape

There are 8760 rows and 14 columns in this dataset.

In [None]:
# Column names in the data
bike_sharing_df.columns

### Dataset Information

In [None]:
# Get information about the dataset
bike_sharing_df.info()

**Observation:**

- **Float64 datatype:** 6 columns, i.e. ``Temperature(°C)``, ``Wind speed (m/s)``, ``Dew point temperature(°C)``, ``Solar Radiation (MJ/m2)``, ``Rainfall(mm)`` & ``Snowfall (cm)``.
- **Int64 datatype:** 4 columns, i.e. ``Rented Bike Count``, ``Hour``, ``Humidity(%)`` & ``Visibility (10m)``.
- **Object datatype:** 4 columns, i.e. ``Date``, ``Seasons``, ``Holiday`` & ``Functioning Day``.








In [None]:
# Number of unique values in each columns
bike_sharing_df.nunique()

**From the above result, we can see that this dataset contains 1 year of bike rental data (the ``Date`` column has 365 unique values).**
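
As a quick sanity check, a full year of hourly records should give 365 × 24 = 8760 rows, which matches the shape reported earlier. The sketch below runs the same kind of check on a small synthetic frame; `toy_df` and its date range are illustrative assumptions, not the Seoul data.

```python
import pandas as pd

# Hypothetical toy frame: 3 days of hourly records (not the real dataset)
timestamps = pd.date_range("2017-12-01", periods=3 * 24, freq="h")
toy_df = pd.DataFrame({"Date": timestamps.normalize(), "Hour": timestamps.hour})

n_days = toy_df["Date"].nunique()
rows_per_day = toy_df.groupby("Date")["Hour"].nunique()

# Every day should contribute exactly 24 distinct hourly rows
is_complete = bool((rows_per_day == 24).all()) and len(toy_df) == n_days * 24
print(n_days, len(toy_df), is_complete)
```

On the notebook's data the same check would use `bike_sharing_df` once ``Date`` has been parsed as a datetime.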

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('Number of duplicated rows:', bike_sharing_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Check for missing values

bike_sharing_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.matrix(bike_sharing_df)

**From the above results, it is evident that there are no missing values in the dataset.**

### What did you know about your dataset?

- **The dataset contains 8760 rows and 14 columns.**
- **There are 6 columns of datatype float64, 4 columns of datatype int64 and 4 columns of datatype object.**
- **There are no missing or duplicate values in the dataset.**
- **The dataset contains bike rental data of 1 year.**
- **Input features: ``Date``, ``Hour``, ``Temperature(°C)``, ``Humidity(%)``, ``Wind speed (m/s)``, ``Visibility (10m)``, ``Dew point temperature(°C)``, ``Solar Radiation (MJ/m2)``, ``Rainfall(mm)``, ``Snowfall (cm)``, ``Seasons``, ``Holiday`` & ``Functioning Day``**
- **Target feature: ``Rented Bike Count``** 

# **3. Understanding Your Variables**

In [None]:
# Dataset Columns
bike_sharing_df.columns.tolist()

In [None]:
# Statistical summary of the dataset
bike_sharing_df.describe().T

**Observations:**
- ``Rainfall``: The 75th percentile is 0 mm, i.e. at least 75% of the records show no rainfall; at most 25% of the records are above 0.
- ``Snowfall``: The 75th percentile is 0 cm, i.e. at least 75% of the records show no snowfall; at most 25% of the records are above 0.
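
To quantify how sparse a mostly-zero column is, the share of zero entries can be computed alongside the 75th percentile. This is a minimal sketch on made-up values; `rain` is a hypothetical series, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rainfall readings in mm: mostly dry hours
rain = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 2.0])

q75 = rain.quantile(0.75)        # 75th percentile
zero_share = (rain == 0).mean()  # fraction of records with no rain

print(f"75th percentile: {q75} mm, share of zero readings: {zero_share:.0%}")
```

A 75th percentile of 0 only guarantees that at least 75% of the values are zero; the exact share has to be counted as shown.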

In [None]:
bike_sharing_df['Seasons'].value_counts()

In [None]:
bike_sharing_df['Functioning Day'].value_counts()

In [None]:
bike_sharing_df['Holiday'].value_counts()

### Variables Description 

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

**Attribute Information:**
- **Date:** The specific calendar date for the bike rental record. <br>
- **Rented Bike Count:** The number of bikes rented during a specific time interval.
- **Temperature:** The temperature in Celsius at the time of the bike rental.
- **Humidity:** The relative humidity percentage at the time of the bike rental.
- **Wind Speed:** The speed of the wind in meters per second at the time of the bike rental.
- **Visibility:** The visibility at the time of the bike rental, recorded in units of 10 meters (column ``Visibility (10m)``).
- **Dew Point Temperature:** The temperature at which air becomes saturated and dew forms at the time of the bike rental.
- **Solar Radiation:** The amount of solar radiation in mega-joules per square meter at the time of the bike rental.
- **Rainfall:** The amount of rainfall in millimeters at the time of the bike rental.
- **Snowfall:** The amount of snowfall in centimeters at the time of the bike rental.
- **Seasons:** The four seasons (Spring, Summer, Autumn, Winter) corresponding to the bike rental record.

- **Holiday:** A categorical variable indicating whether the day of the bike rental record is a holiday or not. It has two possible values: "Holiday" and "No Holiday". The "Holiday" value represents a day that is recognized as a holiday, while the "No Holiday" value represents a regular day that is not a designated holiday.

- **Functioning Day:** A categorical variable indicating whether the bike rental service was functioning on the day of the record. It has two possible values: "Yes" and "No". The "Yes" value indicates that the bike rental service was operational and functioning normally on that day. Conversely, the "No" value indicates that the bike rental service was not operating, potentially due to maintenance, strikes, or other reasons.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
for i in bike_sharing_df.columns.to_list():
  print('Number of unique values in', i, 'is', bike_sharing_df[i].nunique())

In [None]:
# Converting Date column of datatype Object to Datetime datatype
bike_sharing_df['Date'] = pd.to_datetime(bike_sharing_df['Date'], dayfirst = True)

In [None]:
# Extracting day name feature
bike_sharing_df['Day'] = bike_sharing_df['Date'].dt.day_name()

# Extracting month name feature
bike_sharing_df['Month'] = bike_sharing_df['Date'].dt.month_name()

# Extracting year feature
bike_sharing_df['Year'] = bike_sharing_df['Date'].dt.year



In [None]:
# Dropping Date column
bike_sharing_df.drop(columns = ['Date'], inplace = True)

In [None]:
# Rename columns to simpler names (drop the units)
bike_sharing_df = bike_sharing_df.rename(columns={
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind Speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew point temperature',
                                'Solar Radiation (MJ/m2)':'Solar Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                              })
     

In [None]:
bike_sharing_df.sample(3)

In [None]:
# convert Hour and Year columns from integer to object
bike_sharing_df['Hour'] = bike_sharing_df['Hour'].astype('object')
bike_sharing_df['Year'] = bike_sharing_df['Year'].astype('object')

# **4. Exploratory Data Analysis**

**What is EDA?**

- EDA stands for Exploratory Data Analysis. It is a crucial step in the data analysis process that involves exploring and understanding the characteristics, patterns, and relationships within a dataset. EDA aims to uncover insights, identify patterns, detect outliers, and gain a deeper understanding of the data before conducting further analysis or modeling.

## **4.1 Numeric and Categorical Features**

In [None]:
# Dividing data into numerical and categorical features

categorical_features = bike_sharing_df.select_dtypes(include = 'object')
numerical_features = bike_sharing_df.select_dtypes(exclude = 'object')


In [None]:
categorical_features.head(2)

In [None]:
numerical_features.head(2)

## **4.2 Univariate Analysis**

### **4.2.1 Data Distribution of Numeric features**

In [None]:
# figsize
plt.figure(figsize=(15,10))

# title
plt.suptitle('Data Distribution of Numeric Features', fontsize = 20, fontweight = 'bold', y=1.02)

for i, col in enumerate(numerical_features):
  # subplots 3 rows and 3 columns
  plt.subplot(3, 3, i+1)

  # histogram with KDE (sns.distplot is deprecated in recent seaborn releases)
  sns.histplot(bike_sharing_df[col], kde=True, stat='density')
  # mean (magenta) and median (cyan) reference lines
  plt.axvline(bike_sharing_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(bike_sharing_df[col].median(), color='cyan', linestyle='dashed', linewidth=2)

  plt.title(col)

plt.tight_layout()



**Observations:**
- For numerical features, we can see that the majority of distributions are right-skewed and a few are left-skewed.
- **Right-skewed columns:** ``Rented Bike Count``, ``Wind Speed``, ``Solar Radiation``, ``Rainfall`` & ``Snowfall``.
- **Left-skewed columns:** ``Visibility`` & ``Dew point temperature`` 
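
The visual impression of skew can be cross-checked numerically with the sample skewness, where a positive value indicates a right-skewed column and a negative value a left-skewed one. The sketch below uses a hypothetical toy frame (`toy`), not the Seoul data.

```python
import pandas as pd

# Hypothetical toy columns, not the Seoul data
toy = pd.DataFrame({
    "right_skewed": [0, 0, 0, 1, 1, 2, 3, 10],   # long tail to the right
    "left_skewed": [0, 7, 8, 9, 9, 10, 10, 10],  # long tail to the left
})

# DataFrame.skew(): > 0 means right-skewed, < 0 means left-skewed
skewness = toy.skew()
print(skewness)
```

On the notebook's data the equivalent call would be `numerical_features.skew()`.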

### **4.2.2 Outlier Analysis of Numeric features**

In [None]:
# figsize
plt.figure(figsize = (15,10))

# title
plt.suptitle('Outlier Analysis of Numeric features', fontsize = 20, fontweight='bold', y=1.02)

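for i, col in enumerate(numerical_features):
  # subplots 3 rows, 3 columns
  plt.subplot(3,3, i+1)

  # boxplots (the column is passed explicitly as x so the orientation
  # does not depend on the seaborn version)
  sns.boxplot(x=numerical_features[col])

  plt.title(col)
  plt.tight_layout()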

**Observations:**
- Outliers are visible in most of the numerical columns.
- These columns are ``Rented Bike Count``, ``Wind Speed``, ``Solar Radiation``, ``Rainfall`` & ``Snowfall``.
- The columns like ``Temperature``, ``Humidity``, ``Visibility`` & ``Dew point temperature`` do not contain any outliers.

### **4.2.3 Univariate Analysis of Categorical Features**

In [None]:
# figure
plt.figure(figsize = (20,8))

# title
plt.suptitle('Univariate Analysis of Categorical Features', fontsize = 20, fontweight = 'bold', y = 1.02)

for i, col in enumerate(categorical_features):
  # subplots of 3 rows and 3 columns
  plt.subplot(3,3, i+1)

  # Countplots
  sns.countplot(x = categorical_features[col])
  
  plt.xticks(rotation ='vertical')
  plt.title(col)
  plt.tight_layout()


**Observations:**
- Every hour has an equal number of records in the dataset.
- Every season has an almost equal number of records.
- The dataset has more records for "No Holiday" than for "Holiday", which is expected since most days are not holidays.
- The dataset has more records for functioning days than for non-functioning days, which is expected since the service operates on most days.
- Except for Friday, all days have an equal number of records.
- April, June, September, November & February have slightly fewer records than the other months, as these months have fewer than 31 days.
- More data was collected in 2018 than in 2017.

## **4.3 Bivariate and Multivariate Analysis**

### **4.3.1 Analysis between target variable and numerical features**

In [None]:
# Identify patterns and trends in numerical features

for i in numerical_features:
  # one figure per feature
  plt.figure(figsize=(15,6))
  # the palette argument is dropped: no hue variable is set, so it had no effect
  sns.lineplot(x=i, y='Rented Bike Count', data=numerical_features)
  plt.title(f"Bike Demand over {i}")
  plt.xticks(rotation = 45)
  plt.show()


In [None]:

plt.figure(figsize = (15, 10))

# title
plt.suptitle('Bivariate Analysis of Numerical features', fontsize=20, fontweight='bold', y=1.02)

for index, col in enumerate(numerical_features):

  # subplots of 3 rows and 3 columns
  plt.subplot(3,3, index+1)

  # scatter plots
  sns.scatterplot(x = numerical_features[col], y = numerical_features['Rented Bike Count'])

  plt.title(f'Bike Demand over {col}')
  plt.xticks(rotation = 45)
  plt.tight_layout()




### **4.3.2 Bivariate Analysis of Categorical Features**

In [None]:
# Average rented bike count for each category of every categorical feature

# figsize
plt.figure(figsize=(15,10))
# title
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(categorical_features):      
   # subplots of 3 rows and 3 columns
  plt.subplot(3, 3, i+1)                                
  a = bike_sharing_df.groupby(col)[['Rented Bike Count']].mean().reset_index()

  # barplot
  sns.barplot(x=a[col], y=a['Rented Bike Count'])
  # title
  plt.title(f'Average bike rentals across {col}')
  plt.xticks(rotation = 'vertical')
  plt.tight_layout()

**Observations**:
- **Hour:** Demand is highest from roughly 7-10 and 15-19. A likely reason is that these are peak commuting hours in most metropolitan cities, so more people rent bikes.
- **Seasons:** Summer had the highest rented bike count, so people are more likely to rent bikes in summer. Rentals in winter are much lower than in the other seasons.
- **Holiday:** More bikes were rented on non-holidays ("No Holiday") than on holidays.
- **Functioning Day:** On non-functioning days ("No"), only 295 bikes were rented. Hence this column adds little value to our prediction, and we can drop it in the next steps.
- **Day:** More bikes were rented on weekdays than on weekends.
- **Month:** The rented bike count started increasing from March and peaked in June.

### **4.3.3 Multivariate Analysis**

In [None]:
# Analysing bike demand with respect to hour, split by each other categorical feature

for i in categorical_features:
  # Hour is already on the x-axis, so skip it as a hue variable
  if i == 'Hour':
    continue
  plt.figure(figsize=(15,8))
  sns.lineplot(x= bike_sharing_df["Hour"], y= bike_sharing_df['Rented Bike Count'], hue= bike_sharing_df[i], marker ='o')
  plt.title(f"Bike Demand over Hour wrt {i}")
  plt.show()

In [None]:
#Bar plot for seasonwise monthly distribution of Rented_Bike_Count
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(x='Month',y='Rented Bike Count',data= bike_sharing_df, hue='Seasons',ax=ax);
ax.set_title('Season-wise monthly Rented Bike Count');
plt.show();

**Observations:**
- The plots of the numerical features against the target (Section 4.3.1) indicate that ``Temperature``, ``Wind Speed``, ``Visibility``, ``Dew point temperature`` & ``Solar Radiation`` are positively correlated with the target variable, i.e. an increase in these features is associated with an increase in rented bike count.
- On the other hand, ``Rainfall``, ``Snowfall`` & ``Humidity`` are negatively correlated with the target variable, i.e. an increase in these features is associated with a decrease in rented bike count.

# **5. Data Cleaning**

**What is data cleaning?**
Data cleaning, also known as data cleansing or data scrubbing, refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It involves handling missing data, removing duplicates, addressing outliers, standardizing formats, resolving inconsistencies, and validating data. Data cleaning ensures that the data is accurate, complete, and reliable for analysis or machine learning purposes.

## **5.1 Handling Missing Values**

In [None]:
# Checking for missing values
bike_sharing_df.isnull().sum()

**As we can see there are no null values present in our dataset and therefore we are good to go.**

## **5.2 Handling duplicate values**

In [None]:
# Checking for duplicate values
bike_sharing_df.duplicated().sum()

**As we can see there are no duplicate values, so we can move ahead.**

## **5.3 Handling Outliers**

**Outliers are data points that deviate significantly from the majority of the data and can have a disproportionate impact on statistical analysis or modeling.**

In [None]:
#Creating a boxplot to detect columns with outliers
# figsize
plt.figure(figsize = (15,10))

# title
plt.suptitle('Outlier Analysis of Numeric features', fontsize = 20, fontweight='bold', y=1.02)

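for index, col in enumerate(numerical_features):
  # subplots 3 rows, 3 columns
  plt.subplot(3,3, index+1)

  # boxplots (the column is passed explicitly as x so the orientation
  # does not depend on the seaborn version)
  sns.boxplot(x=numerical_features[col])

  plt.title(col)
  plt.tight_layout()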

**Here we can see that the columns that contain outliers are ``Rented Bike Count``, ``Wind Speed``, ``Solar Radiation``, ``Rainfall`` & ``Snowfall``.**

In [None]:
#Creating a list of columns that contains outliers
outlier_cols = ['Rented Bike Count', 'Wind Speed', 'Solar Radiation', 'Rainfall','Snowfall']
outlier_cols

In [None]:
def calculate_ranges(data, column):

  # Skip categorical columns
  if data[column].dtype == 'object':
    return None, None  
  else:
    # Calculate quartiles
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    
    # Calculate IQR
    IQR = Q3 - Q1
    
    # Calculate upper and lower ranges
    upper_range = Q3 + 1.5 * IQR
    lower_range = Q1 - 1.5 * IQR
    
    return upper_range, lower_range

In [None]:
calculate_ranges(numerical_features, 'Rented Bike Count')

In [None]:
# Identify potential outliers
plt.figure(figsize = (15,10))
plt.suptitle('Distribution of Numerical features with Potential Outliers', fontsize = 20, fontweight='bold', y=1.02)

for index, col in enumerate(outlier_cols):

  # Apply calculate_ranges function to get upper bound and lower bound
  upper_bound, lower_bound = calculate_ranges(bike_sharing_df, col)

  # Rows whose value falls outside the IQR bounds
  outliers = bike_sharing_df[(bike_sharing_df[col] > upper_bound) | (bike_sharing_df[col] < lower_bound)]

  # Visualize the potential outliers on top of the full distribution
  # subplots 3 rows, 3 columns
  plt.subplot(3,3, index+1)
  plt.hist(bike_sharing_df[col], bins=30, color='lightblue', edgecolor='black', label='Data')
  plt.hist(outliers[col], bins=10, color='red', edgecolor='black', label='Potential Outliers')
  plt.xlabel(col)
  plt.ylabel('Frequency')
  plt.legend()

plt.tight_layout()

In [None]:
# Create a function to count the total number of outliers in each column

def count_outliers(data):
    # Initialize a variable to store the total number of outliers
    outlier_count = {}

    # Loop through each column in the list containing outliers
    for col in outlier_cols:

        # Calculate the upper and lower ranges
        upper_range, lower_range = calculate_ranges(data, col)

        # Count the number of outliers in the column
        outlier_count[col] = len(data[(data[col] > upper_range) | (data[col] < lower_range)])

    return outlier_count

In [None]:
# Number of outliers in each column
count_outliers(bike_sharing_df)

**Observation**:
- Removing every outlier would discard many data points, so instead of dropping these rows we use the clipping method.

In [None]:
# The target variable is left untouched: high rental counts are plausible for Seoul and are not treated as errors.
# Rainfall and Snowfall are also excluded: both are highly skewed, so clipping them could remove important information.

num_features = ['Temperature', 'Humidity', 'Wind Speed', 'Visibility', 'Dew point temperature', 'Solar Radiation']


**Clipping Method:** In this method, we set a cap on outlying values: any value above the upper threshold or below the lower threshold is treated as an outlier and replaced with that threshold. In other words, values that fall outside the specified range are replaced with the minimum or maximum value of that range.
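
The effect of clipping at the IQR bounds can be seen on a tiny example. The sketch below is illustrative only: `iqr_bounds`, `wind` and `rain` are hypothetical names and values, not part of the notebook. It also suggests why mostly-zero columns such as rainfall are better left out, since their bounds collapse to zero.

```python
import pandas as pd

def iqr_bounds(s):
    """Return (lower, upper) bounds at 1.5 * IQR beyond the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical values with one extreme reading
wind = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
low, high = iqr_bounds(wind)
clipped = wind.clip(low, high)  # 100.0 is pulled back to the upper bound
print(low, high, clipped.tolist())

# A mostly-zero column has Q1 = Q3 = 0, so both bounds collapse to 0
rain = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5])
r_low, r_high = iqr_bounds(rain)
rain_clipped = rain.clip(r_low, r_high)  # every non-zero reading is lost
print(r_low, r_high, rain_clipped.tolist())
```

Clipping keeps the row count unchanged, which is the main difference from trimming the outlying rows.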

In [None]:
# Replace outliers with the lower or upper IQR bound of their column

def clip_outliers(bike_df):
    #numerical_features = ['Temperature', 'Humidity', 'Wind Speed', 'Visibility', 'Dew point temperature', 'Solar Radiation']
  
    for col in num_features:
        # Using IQR method to define the range of upper and lower limits
        q1 = bike_df[col].quantile(0.25)
        q3 = bike_df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        
        # Replacing the outliers with the upper and lower bounds
        bike_df[col] = bike_df[col].clip(lower_bound, upper_bound)
    
    return bike_df


In [None]:
new_df = bike_sharing_df.copy()
# using the function to treat outliers
new_df = clip_outliers(new_df)

In [None]:
# checking the boxplot after outlier treatment

# figsize
plt.figure(figsize=(15,8))
# title
plt.suptitle('Outlier Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(num_features):
  # subplot of 3 rows and 2 columns
  plt.subplot(3, 2, i+1)

  # boxplot (the column is passed explicitly as x so the orientation
  # does not depend on the seaborn version)
  sns.boxplot(x=new_df[col])
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

In [None]:
# checking for distribution after treating outliers.

# figsize
plt.figure(figsize=(15,6))
# title
plt.suptitle('Data Distribution of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(num_features):
  # subplots 3 rows, 2 columns
  plt.subplot(3, 2, i+1)

  # histogram with KDE (sns.distplot is deprecated in recent seaborn releases)
  sns.histplot(new_df[col], kde=True, stat='density')
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

**We can observe some shifts in the distribution of the data after treating outliers. Some of the features were skewed before handling outliers, but afterwards they follow a normal distribution more closely. Therefore, we do not apply any further transformation to the numerical features.**

# **6. Feature Engineering**

- Feature engineering is the process of transforming raw data into a set of meaningful, informative, and predictive features that can be used to train machine learning models. It involves selecting, creating, or modifying features in the dataset to enhance the performance and effectiveness of the models.
- Feature engineering is a critical step in machine learning because the quality and relevance of features can significantly impact the model's performance. Well-engineered features can help capture relevant patterns, relationships, and structures in the data, enabling the model to make accurate predictions or classifications.

## **6.1 Regression Plot**

In [None]:
# Checking Linearity of all numerical features with our target variable

# figsize
plt.figure(figsize=(15, 10))

# title
plt.suptitle('Regression Analysis of Numerical features', fontsize=20, fontweight='bold', y=1.02)

for i, col in enumerate(numerical_features):

  # subplots of 3 rows and 3 columns
  plt.subplot(3, 3, i+1) 

  # regression plots
  sns.regplot(x= numerical_features[col], y = numerical_features['Rented Bike Count'], scatter_kws={"color": "blue"}, line_kws={"color": "red"})
    
  plt.title(f'Dependent variable and {col}')
  plt.tight_layout()
     

**Most of the numerical features are positively correlated to our target variable.**

## **6.2 Correlation Coefficient and Heatmap**

- The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It provides an indication of how closely the variables are related to each other.
- The correlation coefficient, often denoted as "r," ranges from -1 to 1.
- A correlation coefficient of 1 indicates a perfect positive linear relationship, where the variables increase or decrease together with a constant slope.
- A correlation coefficient of -1 indicates a perfect negative linear relationship, where the variables move in opposite directions with a constant slope.
- A correlation coefficient of 0 indicates no linear relationship between the variables.
- The correlation coefficient is calculated using the covariance between the variables divided by the product of their standard deviations. 
- The correlation coefficient provides insight into the strength and direction of the relationship between variables. 
- However, it only measures linear relationships and does not capture other types of associations, such as nonlinear or complex dependencies.
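
The definition above (covariance divided by the product of the standard deviations) can be checked by hand against NumPy's built-in routine. This is a minimal sketch on hypothetical paired values; `x` and `y` are made-up numbers, not columns of the dataset.

```python
import numpy as np

# Hypothetical paired observations (e.g. temperature vs. rentals)
x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])
y = np.array([110.0, 180.0, 260.0, 410.0, 530.0])

# r = cov(x, y) / (std(x) * std(y)), using sample (ddof=1) estimates throughout
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# np.corrcoef computes the same Pearson coefficient
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

The two values agree because the choice of `ddof` cancels out as long as it is used consistently in the covariance and in both standard deviations.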


In [None]:
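# Heatmap of correlations between all numeric columns
# numeric_only=True skips the object columns (needed on pandas >= 2.0,
# where non-numeric columns raise an error)
corr_matrix = bike_sharing_df.corr(numeric_only=True)

# Boolean mask that hides the upper triangle (above the diagonal)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

fig = plt.figure(figsize=(10, 10))
sns.heatmap(corr_matrix, mask=mask, annot=True, cbar=True, vmax=0.8, vmin=-0.8, cmap='RdYlGn')
plt.show()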

In [None]:
plt.figure(figsize=(2,4), dpi=150)
# numeric_only=True skips the object columns (needed on pandas >= 2.0)
target_corr = bike_sharing_df.corr(numeric_only=True)[["Rented Bike Count"]]
# [1:] drops the target's correlation with itself, which sorts first
sns.heatmap(target_corr.sort_values(by="Rented Bike Count", ascending=False)[1:], annot=True)
plt.title('Features Correlating with Rented Bike Count', fontsize=10, fontweight='bold', y=1.02);

**From the above heatmap we can see that ``Temperature`` and ``Dew point temperature`` are highly correlated, so we can drop one of them. Since the correlation between ``Temperature`` and our dependent variable ``Rented Bike Count`` is higher than that of ``Dew point temperature``, we keep the ``Temperature`` column and drop the ``Dew point temperature`` column.**
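
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below is one possible approach on synthetic data; the frame `toy`, its column names and the 0.8 `threshold` are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'dew_point' is built to track 'temperature' closely
rng = np.random.default_rng(0)
temperature = rng.normal(15, 10, 200)
toy = pd.DataFrame({
    "temperature": temperature,
    "dew_point": temperature - 5 + rng.normal(0, 1, 200),
    "wind_speed": rng.normal(2, 1, 200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off, not a value taken from the notebook
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

Which member of a pair to drop is still a judgment call; the notebook keeps the one that correlates more strongly with the target.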

In [None]:
# dropping Dew point temperature column due to multicollinearity

new_df.drop('Dew point temperature', axis=1, inplace=True)
     

## **6.3 VIF**

- VIF, which stands for Variance Inflation Factor, is a measure used in regression analysis to assess multicollinearity among predictor variables.
- Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other, which can cause issues in interpreting the individual effects of the variables and can lead to unstable and unreliable model estimates.
- The VIF quantifies the extent to which the variance of the estimated regression coefficient is inflated due to multicollinearity. 
- It measures how much the variance of a particular predictor variable's estimated coefficient is increased compared to if that variable were uncorrelated with the other predictor variables in the model.

Interpreting VIF values:
- A VIF of 1 indicates no multicollinearity, meaning the predictor variable is not correlated with the other predictors.
- A VIF greater than 1 suggests some degree of multicollinearity, where higher values indicate stronger correlation with other predictors.
- A commonly used threshold is a VIF value of 5 or 10. Variables with VIF values exceeding these thresholds are considered to have high multicollinearity and may need to be addressed.

By examining VIF values, researchers can identify predictor variables that contribute to multicollinearity and take appropriate actions, such as removing highly correlated variables, combining variables, or gathering additional data to mitigate the multicollinearity issue.
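
In formula terms, the VIF of predictor j is 1 / (1 - R_j²), where R_j² is the R² obtained by regressing predictor j on all the other predictors. The sketch below computes this directly with NumPy to illustrate the definition; `vif_from_r2` and the synthetic predictors are hypothetical, the intercept is added explicitly, and it is not meant to replace the statsmodels call used in this notebook.

```python
import numpy as np

def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept column plus every predictor except j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1 - residuals.var() / target.var()
        result.append(1.0 / (1.0 - r2))
    return result

# Hypothetical predictors: b is nearly a copy of a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

vifs = vif_from_r2(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

The two near-duplicate columns get large VIFs while the independent column stays close to 1, which is the pattern the thresholds of 5 or 10 are meant to flag.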

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to calculate Multicollinearity

def calculate_vif(X):

  # For each X, calculate VIF and save in dataframe
  vif = pd.DataFrame()
  vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  vif["features"] = X.columns
  
  return vif

In [None]:
# multicollinearity result

calculate_vif(new_df[[i for i in new_df.describe().columns if i not in ['Rented Bike Count','Date']]])

**These are our final numerical variables to be considered for model building.**

## **6.4 Encoding**

Encoding refers to the process of converting categorical variables into numerical representations that can be understood and processed by machine learning algorithms. Since many machine learning algorithms require numerical inputs, encoding categorical variables becomes necessary.

Common techniques for encoding categorical variables in machine learning include:

1. One-Hot Encoding: This technique creates binary columns for each category in a categorical variable. Each category is represented by a separate binary column, where a value of 1 indicates the presence of that category and 0 indicates its absence. This approach allows algorithms to interpret categorical variables without assuming any ordinal relationship among the categories.

2. Label Encoding: Label encoding assigns a unique numerical label to each category in a categorical variable. Each category is mapped to a corresponding numerical value. However, caution should be exercised with label encoding, as it may introduce an arbitrary ordinal relationship between the categories, which may not be appropriate for some algorithms.

3. Ordinal Encoding: Similar to label encoding, ordinal encoding assigns numerical labels to categories. However, in ordinal encoding, the labels are assigned in a way that represents an ordered relationship between the categories. This can be useful when there is a natural order or hierarchy among the categories.

4. Target Encoding: Target encoding replaces each category with the mean (or another statistical measure) of the target variable within that category. Target encoding can be helpful when the relationship between the categorical variable and the target variable is important for prediction.
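
The four techniques can be compared side by side on a small example. The sketch below uses a hypothetical toy frame (`toy`, with made-up `rentals`), and the season order used for ordinal encoding is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the Seoul data
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Summer", "Winter"],
    "rentals": [100, 400, 900, 600, 1100, 200],
})

# 1. One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy["Seasons"], prefix="Seasons", dtype=int)

# 2. Label encoding: arbitrary integer codes (alphabetical order here)
label = toy["Seasons"].astype("category").cat.codes

# 3. Ordinal encoding: integer codes that follow an explicit, assumed order
season_order = ["Winter", "Spring", "Summer", "Autumn"]
ordinal = pd.Categorical(toy["Seasons"], categories=season_order, ordered=True).codes

# 4. Target encoding: replace each category with its mean target value
target_enc = toy["Seasons"].map(toy.groupby("Seasons")["rentals"].mean())

summary = one_hot.join(pd.DataFrame({"label": label, "ordinal": ordinal, "target": target_enc}))
print(summary)
```

When target encoding is used for modelling, the category means should be computed on the training split only, otherwise information from the test rows leaks into the features.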

In [None]:
# dropping the Year column as it does not add any useful information

new_df.drop(['Year'], axis=1, inplace = True)
categorical_features.drop('Year', axis = 1, inplace = True)

In [None]:
# Check Unique Values for each categorical variable.
for i in categorical_features:
  print("Number of unique values in", i, "is" , new_df[i].nunique())

**We will use one-hot encoding for ``Seasons`` and numeric (0/1) encoding for ``Holiday`` and ``Functioning Day``. ``Hour`` already holds numeric values; ``Day`` and ``Month`` still hold text labels at this point.**

In [None]:
ab = new_df.copy()
ab =pd.get_dummies(ab, columns=['Seasons'],prefix='Seasons',drop_first=True)

In [None]:
ab.head()

In [None]:
new_df = pd.get_dummies(new_df, columns = ['Seasons'], prefix='Seasons', drop_first = True)

In [None]:
new_df.head(2)

In [None]:
# Numerical Encoding for holiday and functioning_day

new_df['Holiday'] = new_df['Holiday'].map({'Holiday': 1, 'No Holiday': 0})

new_df['Functioning Day'] = new_df['Functioning Day'].map({'Yes': 1, 'No': 0})

In [None]:
new_df.head(2)

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?
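The notebook leaves the scaling method open. One common option is standardization with scikit-learn's `StandardScaler`, fitted on the training split only so that test-set statistics do not leak into the model. The sketch below uses synthetic data as a stand-in; the array shapes and the 80/20 split are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the real columns
rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=5.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Tree-based models are largely insensitive to feature scale, so this step matters mainly for linear, distance-based, or regularized models.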

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why? 

Answer Here.
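A minimal sketch of a train/test split with scikit-learn's `train_test_split` follows. The 80/20 ratio, the `random_state`, the synthetic values, and the column names (including `Rented Bike Count` as the target) are assumptions for illustration rather than choices confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=100),
    "Temperature": rng.normal(15, 10, size=100),
    "Rented Bike Count": rng.integers(0, 2000, size=100),
})

# Separate the features from the (assumed) target column
X = toy.drop(columns=["Rented Bike Count"])
y = toy["Rented Bike Count"]

# Hold out 20% of the rows for testing; a fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```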

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
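For a regression task, the usual scores behind such a chart are MSE, RMSE, MAE, and R². The sketch below computes them with scikit-learn on made-up numbers; `y_true` and `y_pred` are hypothetical values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted rental counts
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")
```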

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
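As one illustration of cross-validated tuning, the sketch below runs scikit-learn's `GridSearchCV` over the `alpha` parameter of a `Ridge` regressor. The model choice, the parameter grid, the 5-fold setting, and the synthetic data are assumptions; the notebook does not specify which model or search strategy was used.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with a known linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Try each alpha with 5-fold cross-validation, scoring by (negated) RMSE
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

Grid search is exhaustive, so it suits small grids; `RandomizedSearchCV` samples a fixed number of candidates and scales better when the search space is large.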

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict on unseen data for a sanity check.
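A save-and-reload round trip can be sketched with `joblib` as follows. The `LinearRegression` model, the synthetic data, and the file name `bike_demand_model.joblib` are placeholders chosen for illustration; the actual best model from this project would take their place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model trained on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - X[:, 1] + 5.0
model = LinearRegression().fit(X, y)

# Save the fitted model to a temporary directory (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.joblib")
joblib.dump(model, model_path)

# Reload it and confirm that predictions on unseen rows match the original
loaded = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 2))
same = np.allclose(model.predict(X_unseen), loaded.predict(X_unseen))

print("Predictions match after reload:", same)
```

Saved models should be loaded with the same scikit-learn version that produced them, since the serialized format is not guaranteed to be compatible across versions.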


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***