# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name -**            M Dhanunjaya



# **Project Summary -**

Rental bikes have become increasingly popular in urban cities because they improve mobility and comfort. The key to a smooth experience for the public is making rental bikes available and accessible at all times, which reduces waiting times. One major challenge is maintaining a stable supply of rental bikes throughout the city. To address it, accurate predictions of the bike count required at each hour are crucial.

In this data analysis project, we started by importing the necessary libraries and examining our dataset, which contains 8760 rows and 14 columns, with no duplicate or missing data.

After initial exploration, we studied the individual features and the data they represent. The 'Date' column was stored as the 'object' datatype, so we converted it to the datetime datatype. From this column, we extracted the additional features 'Day', 'Month', 'Year', and 'Week Number'. We then dropped the original 'Date' column and renamed certain columns for convenience.

Next, we proceeded with data visualization, creating various charts and graphs to gain useful insights. Based on the visualizations, we formulated three hypothetical statements and performed hypothesis tests to validate them. The statements were:

1.   The average bike count in Seoul city at any point in time is greater than 100.
2.   The average temperature in Seoul city at any point is greater than 10 degrees Celsius.
3.   The standard deviation of humidity in Seoul city is 20.

Following the exploratory data analysis, we addressed some data preprocessing steps. We performed one-hot encoding on categorical features while dropping the first column to avoid multicollinearity. To deal with the right-skewed distribution of the dependent variable 'Rented_bike_count', we applied a square root transformation to achieve a more normal distribution. Additionally, we used MinMax scaling to scale our features. Finally, we split the data into an 80-20 train-test ratio for model training and evaluation.
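
The preprocessing steps described above can be sketched on a small synthetic frame. This is a minimal illustration rather than the notebook's actual code: the toy data, the reduced column set, and `random_state=42` are assumptions. The scaler is fitted on the training split only, which is a common way to keep test-set information out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Rented_bike_count': rng.integers(0, 3000, size=100),
    'Temperature': rng.uniform(-15, 35, size=100),
    'Seasons': rng.choice(['Winter', 'Spring', 'Summer', 'Autumn'], size=100),
})

# One-hot encode the categorical column, dropping the first level
encoded = pd.get_dummies(toy, columns=['Seasons'], drop_first=True, dtype=float)

# Separate the features from the target and square-root transform the target
X = encoded.drop(columns=['Rented_bike_count'])
y = np.sqrt(encoded['Rented_bike_count'])

# 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the target was transformed with a square root, predictions from a model trained this way would need to be squared to return to the original bike-count scale.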

Moving on to the machine learning phase, we implemented several models and calculated evaluation metrics to assess their performance. After comparing the models, we found that the Random Forest Regressor performed best at predicting the demand for city bikes for a particular hour.
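
A model comparison of this kind could look roughly like the sketch below. It assumes a synthetic dataset, two example models, and three common regression metrics; none of these choices is taken from the notebook's actual modelling cells. The models are fitted on the square-root-transformed target, and predictions are squared before scoring so that the metrics are on the original scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, non-negative target with a nonlinear dependence on the features
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 3))
y = 500 * X[:, 0] ** 2 + 200 * np.sin(3 * X[:, 1]) + 300 + rng.normal(0, 10, size=300)
y = np.clip(y, 0, None)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Example models (hypothetical selection for illustration)
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    # Train on the square-root-transformed target
    model.fit(X_train, np.sqrt(y_train))
    # Square the predictions to return to the original scale
    pred = model.predict(X_test) ** 2
    scores[name] = {
        'MAE': mean_absolute_error(y_test, pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, pred)),
        'R2': r2_score(y_test, pred),
    }
    print(name, scores[name])
```

Scoring every model on the same held-out split and on the same scale keeps the comparison fair.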

In conclusion, deploying the Random Forest Regressor model can significantly help the bike rental company accurately predict the bike demand, enabling them to efficiently meet the city's bike rental needs.

# **GitHub Link -**


https://github.com/Dhana009/Bike-sharing-demand-Prediction

# **Problem Statement**




The main objective of this project is to create a robust machine learning model capable of predicting the demand for rental bikes in urban cities on an hourly basis. The primary challenge addressed here is to ensure a consistent and adequate supply of rental bikes while minimizing waiting times for users. By accurately forecasting the demand for rental bikes at different hours of the day, the project aims to optimize bike-sharing systems in cities. The ultimate goal is to enhance the availability and accessibility of rental bikes, enabling cities to allocate resources effectively and meet the demands of their bike-sharing services efficiently.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from scipy.stats import norm
from scipy.stats import chi2
from scipy.stats import t
from scipy.stats import f
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


#setting font size throughout the notebook
plt.rcParams.update({'font.size': 14})

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


# reading data file
dir_path = '/content/drive/MyDrive/Almabetter/capstone projects/bike/SeoulBikeData.csv'
df = pd.read_csv(dir_path, encoding = 'ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
x = df.shape
print(f'Dataset has {x[0]} rows & {x[1]} columns in total')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dup = df.duplicated().sum()
print(f'total no. of duplicates are {dup} in the dataset')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (25, 5)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

### What did you know about your dataset?

The dataset has 8760 rows and 14 columns, with no duplicate rows and no null values.

Each row is one hour of one day and records the number of bikes rented together with the weather conditions for that hour.

**Date** is stored as the object data type and needs to be converted to a datetime type before date features can be extracted.

**Seasons**, **Holiday** and **Functioning Day** are categorical columns; the remaining columns are numeric.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include = 'all')

### Variables Description

**Columns Information:**

**Date** - the date of the record. The raw values are written as day/month/year and stored as the object data type, so they should be converted to a datetime type.

**Rented Bike Count** - the number of bikes rented in the given hour (integer).

**Hour** - the hour of the day (0 to 23).

**Temperature** - temperature for the given hour (in degrees Celsius).

**Humidity** - humidity for the given hour on the given date (in %).

**Wind speed** - wind speed for the given hour on the given date (in m/s).

**Visibility** - visibility for the given hour on the given date (in units of 10 m).

**Dew point temperature** - dew point temperature for the given hour on the given date (in degrees Celsius).

**Solar radiation** - solar radiation for the given hour on the given date (in MJ/m2).

**Rainfall** - rainfall for the given hour on the given date (in mm).

**Snowfall** - snowfall for the given hour on the given date (in cm).

**Seasons** - the season of the record: Winter, Spring, Summer or Autumn.

**Holiday** - whether the day is a holiday or not.

**Functioning Day** - whether the rental service was operating during that hour.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values
for col in df.columns:
  # nunique() counts the distinct values in the column
  n_unique = df[col].nunique()
  print(f"The number of unique values in the {col} column is: {n_unique}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

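# Converting the Date column to datetime format
# The raw dates are written day-first (DD/MM/YYYY), so parse them with dayfirst=True
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Creating new columns
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Using the Date column to get the ISO week number
# (.dt.week was removed in pandas 2.0; .dt.isocalendar().week is its replacement)
df['Week Number'] = df['Date'].dt.isocalendar().week.astype(int)

# Dropping the original Date column
df.drop(columns=['Date'], inplace=True)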

In [None]:
# renaming the column names for better understanding
df.rename(columns={'Temperature(°C)':'Temperature',
                       'Humidity(%)':'Humidity',
                       'Wind speed (m/s)':'Wind Speed',
                       'Visibility (10m)':'Visibility',
                       'Dew point temperature(°C)':'Dew Point Temperature',
                       'Solar Radiation (MJ/m2)':'Solar Radiation',
                       'Rainfall(mm)':'Rainfall',
                        'Snowfall (cm)':'Snowfall'
                       }, inplace = True)

### What all manipulations have you done and insights you found?

*   There were no missing values in the dataset.

*   We extracted Day, Month, Year and Week Number from the Date column and then dropped the Date column from the dataset.

*   Finally, we renamed several columns for better understanding.
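
Date parsing deserves some care at this step. The sketch below assumes the raw `Date` strings are written day-first (for example `01/12/2017` meaning 1 December 2017); the sample strings are illustrative. It shows how the default parser reads an ambiguous string month-first, and how `dayfirst=True` or an explicit `format` avoids that.

```python
import pandas as pd

raw = '01/12/2017'  # illustrative string, assumed to mean 1 December 2017

# Default parsing reads an ambiguous string month-first
month_first = pd.to_datetime(raw)

# dayfirst=True, or an explicit format, reads it day-first
day_first = pd.to_datetime(raw, dayfirst=True)
explicit = pd.to_datetime(raw, format='%d/%m/%Y')

print(month_first.date(), day_first.date(), explicit.date())

# ISO week number via isocalendar(), the replacement for the older .dt.week
dates = pd.to_datetime(pd.Series(['01/12/2017', '15/06/2018']), format='%d/%m/%Y')
weeks = dates.dt.isocalendar().week.astype(int)
print(weeks.tolist())
```

An explicit `format` is the strictest option, since any string that does not match it raises an error instead of being silently reinterpreted.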

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1
Total rented bikes with respect to Holiday, Functioning Day, Seasons and Year

In [None]:
# Chart - 1 visualization code
# we are using the columns Holiday, Functioning day, Seasons, and Year to plot a graph to check the total rented bikes sum with respect to each category

# creating a list to loop
x = ['Year','Seasons','Holiday', 'Functioning Day']

# looping for every element in the list
for elements in x:
  bike_sum = df.groupby(elements)['Rented Bike Count'].sum().sort_values(ascending = False)
  plt.rcParams['figure.figsize'] = (5,5)
  # plotting bar plot
  bike_sum.plot.bar()

  #setting colum chart title to infer about the chart
  plt.title(f'Total Sum of Rented Bikes with {elements} column')
  plt.show()
  # printing values obtained for reference
  print(bike_sum)

##### 1. Why did you pick the specific chart?

Bar charts make it easy to visualize the sums and compare them across categories.

##### 2. What is/are the insight(s) found from the chart?


On regular days (non-holidays), the total number of bikes rented is 5,956,419, while on holidays it is 215,895.

On functioning days, the total is 6,172,314; on non-functioning days no bikes were rented, so the count is 0.

Total rentals by season:

*   Summer: 2,283,234
*   Autumn: 1,790,002
*   Spring: 1,611,909
*   Winter: 487,169

Rentals are much lower in winter than in the other seasons, and summer has the highest rental activity.

In 2018, the total number of bikes rented was 5,986,984, while in 2017 it was 185,330.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   On non-functioning days no bikes were rented, so those days contribute nothing to the business.

*   Winter months see a decline in bike rentals, which hurts the business. In summer the effect is positive, with a higher number of rentals.
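
Because the totals above depend on how many hourly records fall into each category, a per-hour mean is a useful companion to the sum. A minimal sketch on made-up numbers (the toy frame is an assumption; the column names follow the notebook):

```python
import pandas as pd

# Made-up hourly records: six regular hours and two holiday hours
toy = pd.DataFrame({
    'Holiday': ['No Holiday'] * 6 + ['Holiday'] * 2,
    'Rented Bike Count': [500, 700, 600, 800, 400, 600, 450, 550],
})

# Sum, mean and number of records per category in one table
summary = toy.groupby('Holiday')['Rented Bike Count'].agg(['sum', 'mean', 'count'])
print(summary)
```

In this toy example the holiday total is far lower than the regular-day total mainly because there are fewer holiday records; the per-hour means are much closer to each other.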

#### Chart - 2
Spread of various values in Holiday, Functioning Day, Seasons and Year


In [None]:
# Chart - 2 visualization code
#The following code creates pie charts to visualize the distribution of unique values in the DataFrame for the columns 'Holiday', 'Functioning Day', 'Seasons', and 'Year'.
#Since these columns have a small number of unique values, pie charts are chosen for this analysis.

# List of columns to analyze
columns_to_analyze = ['Holiday', 'Functioning Day', 'Seasons', 'Year']

# Loop through each element in the list
for column in columns_to_analyze:
    # Obtain value counts for each unique value in the column
    value_counts = df[column].value_counts()
    plt.rcParams['figure.figsize'] = (5, 5)

    # Create a pie chart to visualize the distribution of unique values
    # Pctdistance 0.6 is set to display the value inside the chart; if set more than 1, it'll display outside the chart.
    value_counts.plot(kind='pie', autopct='%1.1f%%', pctdistance=0.6)

    # Set the title for the pie chart
    plt.title(f'Distribution of Unique Values for {column}')
    plt.show()



##### 1. Why did you pick the specific chart?



Pie charts are favored for their ease of interpretation, allowing a clear understanding of the relative proportions of different categories presented as percentages.

##### 2. What is/are the insight(s) found from the chart?

The data reveals that about 95.1% of the records fall on working days (not holidays), while holidays make up the remaining 4.9%.

A significant majority, approximately 96.6%, of the data corresponds to functioning days, whereas the remaining data corresponds to non-functioning days.

The data for different seasons shows a nearly equal distribution, with each season accounting for approximately 25% of the total data.

The dataset comprises records from the years 2017 and 2018. The majority of the data, approximately 91.5%, belongs to the year 2018, while the year 2017 accounts for the remaining 8.5%.

Before the Date column was dropped, the data was found to span one year, from 1 December 2017 to 30 November 2018 (reading the raw dates as day/month/year).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis indicates that the majority of days are classified as functional days and not holidays.

The dataset includes data for all seasons, which strengthens the analysis by capturing information for both functioning days and holidays across various seasons.

The recorded data covers one year, from 1 December 2017 to 30 November 2018.

#### Chart - 3
Total bikes rented with respect to various rainfall values

In [None]:
# Chart - 3 visualization code

#The following code generates two bar plots to visualize the total number of bikes rented with respect to various rainfall values.
#The first plot shows the raw data, while the second plot applies a log transformation to enhance visibility.

# Grouping data by 'Rainfall' and calculating the sum of 'Rented Bike Count'
rainfall_rent = df.groupby(['Rainfall'])['Rented Bike Count'].sum()

# Setting plot size
plt.rcParams['figure.figsize'] = (20, 5)

# Creating a bar plot for raw data
rainfall_rent.plot.bar()
plt.title('Total Bikes Rented with Respect to Various Rainfall Values')
plt.show()

# Applying a log transformation to the data for better visualization
# (log1p computes log(1 + x), so a rainfall value whose total is 0 does not produce -inf)
rainfall_rent_log = np.log1p(rainfall_rent)

# Creating a bar plot for log-transformed data
rainfall_rent_log.plot.bar()
plt.title('Total Bikes Rented with Respect to Various Rainfall Values (Log Transformed)')
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts make it easy to compare total rentals across the different rainfall values.

##### 2. What is/are the insight(s) found from the chart?

Total bike rentals are highest by far when there is no rainfall (rainfall = 0.0). When the zero-rainfall records are set aside, most of the remaining rentals occur at low rainfall values, and the totals fall as rainfall increases. In other words, people rent bikes mostly in dry weather, and among rainy hours they rent more in light rain than in heavy rain.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data shows that people rent more bikes when rainfall is low. As rainfall increases, bike rentals decline, which has a negative effect on the business.
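
One way to compare rain conditions without being dominated by how often each exact rainfall value occurs is to bin the values and look at the mean rentals per bin. A minimal sketch on made-up numbers; the bin edges and the labels `none`, `light`, `moderate` and `heavy` are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up hourly records
toy = pd.DataFrame({
    'Rainfall': [0.0, 0.0, 0.0, 0.5, 1.0, 2.5, 6.0, 12.0],
    'Rented Bike Count': [900, 1100, 1000, 300, 250, 120, 60, 20],
})

# Right-closed bins: (-0.001, 0], (0, 1], (1, 5], (5, inf)
bins = [-0.001, 0.0, 1.0, 5.0, float('inf')]
labels = ['none', 'light', 'moderate', 'heavy']
toy['Rain Level'] = pd.cut(toy['Rainfall'], bins=bins, labels=labels)

# Mean rentals and number of records per rain level
per_level = toy.groupby('Rain Level', observed=True)['Rented Bike Count'].agg(['mean', 'count'])
print(per_level)
```

Reporting the record count next to the mean makes it clear how much data supports each bin.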

#### Chart - 4
Hourly bike rentals with respect to Seasons, Holiday, Functioning Day and Year


In [None]:
# Chart - 4 visualization code
# The following code generates point plots to explore the bike rentals with respect to different parameters: Seasons, Holiday, Functioning Day, and Year.

# List of parameters to explore
Parameters = ['Seasons', 'Holiday', 'Functioning Day', 'Year']

# Looping through each parameter
for param in Parameters:
    # Setting the title for the point plot
    plt.title(f'Hourly Bike Rentals with Respect to {param}')

    # Creating the point plot using seaborn
    sns.pointplot(data=df, x="Hour", y="Rented Bike Count", hue=param)

    # Displaying the plot
    plt.show()


##### 1. Why did you pick the specific chart?

Since the hours range from 0 to 23, a point plot with connected points represents the hourly trend very well.

##### 2. What is/are the insight(s) found from the chart?

The analysis reveals that the peak hours for bike rentals are at 8 AM and 6 PM, suggesting a strong correlation with typical office commuting hours. This pattern implies that people are likely renting bikes to travel to and from their workplaces.

It was observed that bike rentals on non-functioning days are non-existent, indicating zero demand for bikes during days when the business is not operational.

Bike demand shows a notable increase during the summer season, whereas it declines during the winter months. This seasonal trend highlights that more people opt for bike rentals in the warmer months, while the demand decreases during the colder winter season.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Bikes are rented most often at 8 AM and 6 PM, which shows that rentals depend heavily on the time of day. The pattern suggests that people use bikes to commute to work in the morning and to return home in the evening. Making more bikes available around these peak hours is likely to have a positive business impact.
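
The peak hours can also be read off numerically rather than from the chart. A minimal sketch on made-up numbers, assuming the day is split at noon into a morning and an evening half (the split point and the toy values are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up records for a few hours around the two commuting windows
toy = pd.DataFrame({
    'Hour': [7, 8, 9, 17, 18, 19] * 2,
    'Rented Bike Count': [600, 1000, 650, 1100, 1500, 1200,
                          640, 1040, 630, 1140, 1460, 1180],
})

# Average rentals for each hour of the day
hourly_mean = toy.groupby('Hour')['Rented Bike Count'].mean()

# Label each hour as morning or evening, then find the busiest hour in each half
half_of_day = np.where(hourly_mean.index < 12, 'morning', 'evening')
peak_by_half = hourly_mean.groupby(half_of_day).idxmax()
print(peak_by_half)
```

Looking for one peak per half of the day avoids reporting two adjacent evening hours as "the two peaks" when the evening rush is stronger than the morning one.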

#### Chart - 5
Total rented bikes with respect to year and month (Bar Plot)

In [None]:
# Chart - 5 visualization code
# To further understand the variations in bike rentals with respect to different months of the business year,
# a bar chart is plotted based on the observations made in Chart 1 regarding the bike rentals during various seasons.

# Grouping data by 'Year' and 'Month', and calculating the sum of 'Rented Bike Count'
group_by_year_month = df.groupby(['Year', 'Month'])['Rented Bike Count'].sum()

# Setting plot size
plt.rcParams['figure.figsize'] = (20, 5)

# Creating a bar chart to visualize bike rentals with respect to various months of the business year
group_by_year_month.plot.bar()

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

A vertical bar chart represents the monthly totals well.

##### 2. What is/are the insight(s) found from the chart?



The bar chart shows much lower totals for the 2017 bars than for the 2018 months. Only about 8.5% of the records belong to 2017 (see Chart 2), so the 2017 bars cover far fewer hours than the 2018 bars, and their low totals reflect the smaller amount of data rather than weaker demand. From December 2017 onwards, the chart shows a noticeable rise in rentals, which indicates increasing demand over the following months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The rise in monthly totals is a positive sign for the business. Because the dataset covers only about one year, however, this rise overlaps with the seasonal pattern seen in Chart 1 (low in winter, high in summer) and cannot be read as sustained growth on its own.

#### Chart - 6
Weekly growth report

In [None]:
# Chart - 6 visualization code

# Grouping data by 'Year' and 'Week Number', and calculating the sum of 'Rented Bike Count' for each week
Weekly_growth_in_rented_bike = df.groupby(['Year', 'Week Number'])['Rented Bike Count'].sum()

# Setting the plot size
plt.rcParams['figure.figsize'] = (20, 5)

# Creating a bar chart to visualize the weekly growth in rented bikes
Weekly_growth_in_rented_bike.plot.bar()

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot is well suited to explore this data

##### 2. What is/are the insight(s) found from the chart?


Early weeks: slow growth

Week 50 onwards: sales increase significantly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data analysis indicates a positive growth trend in rented bikes, with the growth starting from the second week onwards. Moreover, a noteworthy observation is that during the 25th week of the second year, the maximum number of bikes was rented, suggesting a peak in bike rental demand during that particular period.
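
Week-over-week growth can be quantified with a percentage change instead of being judged by eye. A minimal sketch on made-up weekly totals (the numbers are illustrative and not taken from the dataset):

```python
import pandas as pd

# Made-up weekly rental totals
weekly_total = pd.Series(
    [1000, 1100, 990, 1485],
    index=pd.Index([1, 2, 3, 4], name='Week Number'),
)

# Percentage change relative to the previous week (the first week has no predecessor)
growth = weekly_total.pct_change() * 100
print(growth.round(1))
```

The first entry is `NaN` because there is no earlier week to compare against.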

#### Chart - 7
Average bikes rented with respect to various wind speed values

In [None]:
# Chart - 7 visualization code

# Calculate the average number of rented bikes based on wind speed conditions
average_bikes_rented = df.groupby(['Wind Speed'])['Rented Bike Count'].mean()

# Set the plot size
plt.rcParams['figure.figsize'] = (20, 5)

# Apply log transformation to the data for improved visualization
average_bikes_rented = np.log(average_bikes_rented)

# Create a bar chart to visualize the average bike rentals with respect to various wind speed conditions
average_bikes_rented.plot.bar()

# Set the chart title
plt.title('Average Bike Rentals with Respect to Wind Speed Conditions')

# Display the plot
plt.show()

# Print separator lines
print('-'*100)
print('Wind speed has minimal impact on bike rentals; the average is relatively consistent across different wind speed conditions.')
print('-'*100)



##### 1. Why did you pick the specific chart?

Bar charts make it easy to compare average rentals across the different wind speed values.

##### 2. What is/are the insight(s) found from the chart?

The average rental count is roughly constant across wind speeds, which suggests that wind speed has little effect on bike renting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most rentals happen at moderate wind speeds, roughly 0.3 to 3.4 m/s. Since the average per wind speed stays roughly constant, wind speed by itself is unlikely to lead to negative growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Create a histogram to visualize the distribution of Solar Radiation
plt.hist(df['Solar Radiation'], bins=50, color='blue', edgecolor='black')

# Label the x and y axes
plt.xlabel('Solar Radiation')
plt.ylabel('Count')

# Set the chart title
plt.title('Histogram for Solar Radiation')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows how the values of a numeric variable, here solar radiation, are distributed.

##### 2. What is/are the insight(s) found from the chart?

Most hourly records have solar radiation at or near 0.0, so most rental hours fall in the low-radiation range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


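The histogram counts hours rather than rentals, so it does not show on its own whether people avoid renting at higher solar radiation levels; that comparison needs the rental counts (see Chart 9).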

#### Chart - 9
Total bike rented with respect to various weather conditions

In [None]:
# Chart - 9 visualization code

# Numerical features to visualize
numerical_features = ['Humidity', 'Wind Speed', 'Visibility', 'Dew Point Temperature', 'Solar Radiation', 'Rainfall', 'Snowfall']

# Loop through each numerical feature
for feature in numerical_features:
    # Group data by the feature and calculate the sum of 'Rented Bike Count'
    temp_df = df.groupby([feature])['Rented Bike Count'].sum()
    temp_df = temp_df.reset_index()

    # Create a scatter plot and line plot to visualize 'Total Rented Bike Count' vs. the current feature
    sns.scatterplot(data=temp_df, x=feature, y='Rented Bike Count')
    sns.lineplot(x=feature, y='Rented Bike Count', data=temp_df)

    # Labeling the x and y axes, and setting the title for the plot
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('Total Rented Bike Count', fontsize=14)
    plt.title(f'Total Rented Bike Count vs. {feature}', fontsize=14)

    # Display the plot
    plt.show()


##### 1. Why did you pick the specific chart?

A line plot helps us understand trends efficiently.

##### 2. What is/are the insight(s) found from the chart?

People prefer to rent bikes when the wind speed ranges from 0.3 to 4.

Bike rentals increase when the visibility is high, specifically at 2000.

Bike rentals are more likely when the dew point temperature falls within the range of -0.25 to 25.

Lower solar radiation levels (0.0) correspond to an increase in bike rentals.

Bike rentals are more common during periods of lower rainfall (0.2).

Similarly, bike rentals are more likely when there is minimal snowfall (0.1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is an impact of the weather conditions on the people renting bikes

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Create a histogram to visualize the distribution of Visibility
plt.hist(df['Visibility'], bins=50, color='blue', edgecolor='black')

# Label the x and y axes
plt.xlabel('Visibility')
plt.ylabel('Count')

# Set the chart title
plt.title('Histogram for Visibility')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows how the visibility values are distributed.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows that most hourly records have a visibility at or near 2000, the highest value in the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most rentals happen when visibility is at or near 2000, largely because that is the most common condition.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# Applying square root to 'Rented Bike Count' to reduce skewness
sqrt_count = np.sqrt(df['Rented Bike Count'])

plt.figure(figsize=(7, 3))

# Create a distribution plot of the square root-transformed data
# (sns.distplot is deprecated; histplot with kde=True is used here instead)
ax = sns.histplot(sqrt_count, kde=True, stat='density', color="y")
ax.set_xlabel('Square root of Rented Bike Count')
ax.set_ylabel('Density')

# Add vertical lines for the mean (magenta) and median (black) of the transformed data
ax.axvline(sqrt_count.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(sqrt_count.median(), color='black', linestyle='dashed', linewidth=2)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

The distribution plot shows how the rented bike counts are distributed.

##### 2. What is/are the insight(s) found from the chart?

We located the mean and median of the square root-transformed rented bike counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The mean and median of the rental bikes are approximately equal.
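
The effect of the square root transformation can be checked with a skewness value in addition to the plot. A minimal sketch on synthetic right-skewed data (the gamma-distributed sample is an assumption standing in for the real counts):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for the rental counts
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=1.5, scale=450, size=5000))

# Sample skewness before and after the square root transformation
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
print(round(skew_raw, 2), round(skew_sqrt, 2))
```

A skewness closer to 0 after the transformation is consistent with the mean and median moving closer together.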

#### Chart - 12

In [None]:
# Chart - 12 visualization code


# List of selected features
selected_features = ['Temperature', 'Humidity', 'Wind Speed', 'Visibility', 'Dew Point Temperature', 'Solar Radiation', 'Rainfall', 'Snowfall']

# Loop through each selected feature and create a histogram
for col in selected_features:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], bins=50, color='blue', edgecolor='black')
    plt.xlabel(col)
    plt.show()



##### 1. Why did you pick the specific chart?

A histogram shows the distribution of each feature's values.

##### 2. What is/are the insight(s) found from the chart?

Weather conditions were explored, and the following observations were made:

Temperature is normally distributed, with values ranging from -20 to 40.

Humidity also follows a normal distribution, with values between 0 and 90.

Wind speed exhibits a right-skewed distribution, varying from 0 to 7.

Visibility has a left-skewed distribution, ranging from 0 to 2000.

Dew point temperature is distributed symmetrically, with values spanning from -30 to 30.

Solar radiation displays a highly right-skewed distribution, ranging from 0 to 3.5.

Rainfall demonstrates a highly right-skewed distribution, varying from 0 to 35.

Snowfall exhibits a highly right-skewed distribution, with values between 0 and 8.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We have analysed the various weather conditions.
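
The skew directions listed in the observations above can also be checked numerically instead of being read off the histograms. The sketch below uses synthetic stand-in data (not the Seoul dataset) and pandas' `Series.skew()`; the column names are illustrative assumptions. Positive values indicate right skew, negative values indicate left skew, and values near zero indicate rough symmetry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for typical feature shapes (assumed, not the real data)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=10, scale=5, size=5000),     # temperature-like
    "right_skewed": rng.exponential(scale=1.0, size=5000),   # rainfall-like
})
# Mirroring a right-skewed column produces a left-skewed one (visibility-like)
demo["left_skewed"] = demo["right_skewed"].max() - demo["right_skewed"]

skewness = demo.skew()
print(skewness.round(2))
```

On the real data, the same call would be `df[selected_features].skew()`.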

#### Chart - 13


In [None]:

# Columns selected for pair plot visualization
selected_columns = ['Rented Bike Count', 'Temperature', 'Humidity', 'Wind Speed', 'Visibility', 'Solar Radiation', 'Rainfall', 'Snowfall', 'Holiday']
pair_plot_df = df[selected_columns]

# Create a pair plot using seaborn
g = sns.pairplot(pair_plot_df, diag_kind="kde", kind='reg', hue='Holiday')

# Add a figure-level title; each panel already carries its own axis labels,
# so plt.xlabel / plt.ylabel (which only touch the last panel) are not used
g.fig.suptitle('Pair Plot', y=1.02)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots show the pairwise relationships between variables.

They also help us explore the distribution of each variable in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Observations from the pair plot:

There is a positive correlation between wind speed and solar radiation.

Temperature and rented bike count exhibit a strong positive correlation.

Humidity and solar radiation are negatively correlated.

#### Chart - 14

In [None]:
# Chart - 14 visualization code

# Printing regression plots for all the numerical features
numerical_columns = list(df.select_dtypes(['int64', 'float64']).columns)
numerical_features = pd.Index(numerical_columns)

for col in numerical_features:
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.regplot(x=df[col], y=df['Rented Bike Count'], scatter_kws={"color": 'orange'}, line_kws={"color": "black"})
    plt.xlabel(col)
    plt.ylabel('Rented Bike Count')
    plt.title(f'Regression Plot: {col} vs. Rented Bike Count')
    plt.show()


##### 1. Why did you pick the specific chart?

A regression plot overlays the best-fit line on the data, i.e. the average change in Y (rented bike count) for a unit change in X.

##### 2. What is/are the insight(s) found from the chart?

Insights from the regression plots:

As the temperature (X) increases from -10 to 30, the demand for rental bikes (Y) also increases.

An increase in humidity (X) is associated with a decrease in the demand for rental bikes (Y).

For wind speed (X) ranging from 0 to 3, the demand for rental bikes (Y) increases.

The visibility (X) does not show a clear relationship with the demand for rental bikes (Y), as indicated by the flat best-fit line.

The dew point temperature (X) exhibits an increasing trend with respect to the demand for rental bikes (Y) based on the upward-sloping best-fit line.

Solar radiation (X) generally increases with the demand for rental bikes (Y), as shown by the upward-sloping best-fit line.

Both snowfall and rainfall (X) are associated with a decrease in the demand for rental bikes (Y), as indicated by the downward-sloping best-fit lines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight gained for the business is the average demand for rental bikes under specific environmental conditions.

#### Chart - 15


In [None]:

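# Calculate the correlation matrix for the numeric columns only
# (numeric_only requires pandas >= 1.5; text columns such as 'Seasons' are skipped)
corr_matrix = df.corr(numeric_only=True)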

# Plot the correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap='inferno')

# Set labels for better interpretation of the plot
plt.title('Correlation Matrix Heatmap')
plt.ylabel('Feature/Property')
plt.xlabel('Feature/Property')

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap shows the strength of the linear relationship between each pair of features.

##### 2. What is/are the insight(s) found from the chart?

A correlation coefficient close to +1 indicates a strong positive correlation, meaning that as one variable increases, the other also tends to increase.

On the other hand, a correlation coefficient close to -1 indicates a strong negative correlation, suggesting that as one variable increases, the other tends to decrease.

A correlation coefficient close to 0 indicates a weak or no correlation between the two variables. In this case, changes in one variable do not have a significant impact on the other variable.
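
As a small illustration of the three cases above, the sketch below computes the Pearson coefficient by hand on synthetic data. The helper name `pearson_r` and the data are assumptions made for this example; on the real data `df.corr()` does the same calculation for every pair of columns.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

r_pos = pearson_r(x, 2 * x + 0.1 * noise)    # strong positive relationship
r_neg = pearson_r(x, -2 * x + 0.1 * noise)   # strong negative relationship
r_none = pearson_r(x, noise)                 # unrelated variables
print(round(r_pos, 3), round(r_neg, 3), round(r_none, 3))
```

Note that the coefficient only measures linear association, so a value near 0 does not rule out a non-linear relationship.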

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Hypothesis: The average bike count at any point in time is greater than 100.

Null Hypothesis (H0): The average bike count is equal to 100.

Alternative Hypothesis (Ha): The average bike count is greater than 100.

#### 2. Perform an appropriate statistical test.

In [None]:
# Taking a random sample of 500 data points for 'Rented Bike Count'
names = df['Rented Bike Count'].sample(500)

# Calculating the sample mean of 'Rented Bike Count'
rented_bike_count_mean = np.mean(names)

# Calculating the sample standard deviation of 'Rented Bike Count'
rented_bike_count_std = np.std(names)


In [None]:
one = (rented_bike_count_mean-100)/(rented_bike_count_std/(np.sqrt(500)))

In [None]:
# Calculating the probability
probability_z = norm.cdf(one , 0, 1)
print(probability_z)

In [None]:
p1 = 1-probability_z
p1

##### Which statistical test have you done to obtain P-Value?

We have chosen Z-test to obtain p-value.

##### Why did you choose the specific statistical test?



Because we are testing a hypothesis about the population mean with a large sample (n = 500), we chose the Z-test. The cumulative probability of the test statistic is close to 100%, so the right-tailed p-value (1 minus that probability) is close to 0. This gives significant evidence to reject the null hypothesis (H0) that the average bike count in the city is equal to 100. Based on the sample data, we conclude that the average bike count at any point in time is greater than 100.
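
The calculation can be wrapped in one function so that the statistic and the p-value stay together. This is a minimal sketch using only the standard library; the function name, the synthetic sample, and the 0.05 significance level are assumptions for illustration, not values from the notebook.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def right_tailed_z_test(sample, mu0):
    """Return (z, p_value) for H0: mean == mu0 against Ha: mean > mu0."""
    n = len(sample)
    # stdev uses the n - 1 denominator (sample standard deviation)
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)   # upper-tail area under the standard normal
    return z, p_value

random.seed(0)
# Synthetic hourly counts (assumed); the true mean is well above 100
sample = [random.gauss(700, 600) for _ in range(500)]

z, p_value = right_tailed_z_test(sample, mu0=100)
reject_h0 = p_value < 0.05
print(f"z = {z:.2f}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

A small p-value, not a large cumulative probability, is what is compared against the significance level.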

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



The research hypothesis is that the average temperature at any point is greater than 10 degrees Celsius.

Null Hypothesis (H0): The average temperature is equal to 10 degrees Celsius.

Alternative Hypothesis (Ha): The average temperature is greater than 10 degrees Celsius.

#### 2. Perform an appropriate statistical test.

In [None]:
# Taking a random sample of 500 data points for 'Temperature'
temp_data_sample = df['Temperature'].sample(500)

# Calculating the sample mean of 'Temperature'
temp_data_mean = np.mean(temp_data_sample)

# Calculating the sample standard deviation of 'Temperature'
temp_data_std = np.std(temp_data_sample)

# Calculating the z-score for the one-sample Z-test (variable name kept from the original)
t_score_temp = (temp_data_mean - 10) / (temp_data_std / (np.sqrt(500)))

# Displaying the calculated z-score
t_score_temp


In [None]:
prob_z = norm.cdf(t_score_temp, 0, 1)
print(prob_z)

In [None]:
p1 = 1-prob_z
p1

##### Which statistical test have you done to obtain P-Value?

We have chosen Z-test to obtain p-value.

##### Why did you choose the specific statistical test?

Because we are testing a hypothesis about the population mean with a large sample (n = 500), we used the Z-test. The cumulative probability of the test statistic is about 99%, so the right-tailed p-value (1 minus that probability) is roughly 1% or less. This gives substantial evidence to reject the null hypothesis (H0) that the average temperature is equal to 10 degrees Celsius. Based on the sample data, we conclude that the average temperature at any point in time is greater than 10 degrees Celsius.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



The research hypothesis is that the standard deviation of humidity is equal to 20.

Null Hypothesis (H0): The standard deviation of humidity is equal to 20.

Alternative Hypothesis (Ha): The standard deviation of humidity is not equal to 20.

#### 2. Perform an appropriate statistical test.

In [None]:
# Taking a random sample of 50 data points for 'Humidity'
humidity_sample = df['Humidity'].sample(50)

# Calculating the sample variance of 'Humidity' (ddof=1 gives the n - 1 denominator
# that matches the 49 degrees of freedom used below)
sample_variance = humidity_sample.var(ddof=1)

# Calculating the test statistic for the chi-square test: (n - 1) * s^2 / sigma0^2
test_statistic = (49 * sample_variance) / (20*20)

# Displaying the calculated test statistic
test_statistic


In [None]:
# Lower-tail probability of the test statistic under H0 (49 degrees of freedom)
prob = chi2.cdf(test_statistic,49)
print(prob)


##### Which statistical test have you done to obtain P-Value?

We have chosen the Chi-square test to obtain the p-value.

##### Why did you choose the specific statistical test?

Because we are testing a hypothesis about the population standard deviation, we used the Chi-square test for variance. The resulting lower-tail probability was 45.53%, which is far from both tails of the distribution, so we do not have enough evidence to reject the null hypothesis (H0) that the standard deviation of humidity is equal to 20. Based on the sample data, a standard deviation of 20 is consistent with what we observe. Because the sample is drawn at random without a fixed seed, the exact probability changes from run to run.
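
For a two-sided question about a standard deviation, the lower-tail probability can be turned into a p-value by doubling the smaller tail. The sketch below shows that under stated assumptions: the helper name and the synthetic sample are illustrative, and the test itself assumes the data are roughly normal, which may only hold approximately for humidity.

```python
import numpy as np
from scipy.stats import chi2

def variance_chi2_test(sample, sigma0):
    """Two-sided chi-square test of H0: sigma == sigma0 (assumes roughly normal data)."""
    sample = np.asarray(sample, dtype=float)
    dof = sample.size - 1
    stat = dof * sample.var(ddof=1) / sigma0 ** 2
    lower_tail = chi2.cdf(stat, dof)
    p_value = 2 * min(lower_tail, 1 - lower_tail)   # double the smaller tail
    return stat, p_value

rng = np.random.default_rng(2)
sample = rng.normal(loc=58, scale=20, size=50)   # synthetic humidity-like values (assumed)

stat_same, p_same = variance_chi2_test(sample, sigma0=20)   # H0 is true for this sample
stat_diff, p_diff = variance_chi2_test(sample, sigma0=5)    # H0 is clearly false here
print(round(p_same, 3), p_diff)
```

A large p-value means the data are compatible with the hypothesised standard deviation; it does not prove that the standard deviation is exactly that value.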

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values
(No Missing Values Found)

In [None]:

# Missing Values/Null Values Count
missing_values = df.isna().sum()
print(missing_values)

#### What all missing value imputation techniques have you used and why did you use those techniques?

There were no missing values in the dataset, so no imputation was needed.

### 2. Handling Outliers

After investigating and plotting box plots, we observed that the following variables have many outliers:

Wind Speed

Solar Radiation

Rainfall

Snowfall

However, upon careful consideration, we concluded that these outlier values are not erroneous data points but rather represent meaningful and relevant insights. Therefore, there is no need to clip or remove these extreme values as they hold valuable information for our analysis.

The code provided to handle the outliers can be uncommented if needed, but based on our findings, it is recommended to retain the outlier values for a more comprehensive and accurate understanding of the data.

In [None]:
# # Handling Outliers & Outlier treatments
# numerical_vars = df.describe().columns
# for var in numerical_vars:
#   plt.figure(figsize=(2, 2))
#   sns.boxplot(y=var, data=df)
#   plt.xlabel(var)
#   plt.ylabel('Values')
#   plt.title(f'Box Plot for {var}')
#   plt.show()

In [None]:
# # Outliers are observed in the following columns
# outliers_col=['Wind Speed','Solar Radiation','Rainfall','Snowfall']

# #writing a function to handle outliers in the dataframe
# def cliping_outliers(df1):
#     for col in df1[outliers_col]:
#         # using IQR method to define range of upper and lower limit.
#         q1 = df1[col].quantile(0.25)
#         q3 = df1[col].quantile(0.75)
#         iqr = q3 - q1
#         lower_bound = q1 - 1.5 * iqr
#         upper_bound = q3 + 1.5 * iqr

#         # replacing the outliers with upper and lower bound
#         df1[col] = df1[col].clip(lower_bound, upper_bound)
#     return df1


# # calling the function and handling outliers
# df = cliping_outliers(df)

In [None]:
# # after handling Outliers & Outlier treatments

# for var in outliers_col:
#   plt.figure(figsize=(2, 2))
#   sns.boxplot(y=var, data=df)
#   plt.xlabel(var)
#   plt.ylabel('Values')
#   plt.title(f'Box Plot for {var}')
#   plt.show()

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# convert object type data to dummy variables (binary form)

df['Winter']=np.where(df["Seasons"]=='Winter',1,0)
df['Spring']=np.where(df["Seasons"]=='Spring',1,0)
df['Summer']=np.where(df["Seasons"]=='Summer',1,0)
df['Autumn']=np.where(df["Seasons"]=='Autumn',1,0)

df['Holiday']=np.where(df["Holiday"]=='Holiday',1,0)
df['Functioning Day']=np.where(df['Functioning Day']=='Yes',1,0)

# Since Seasons is encoded into 4 new features we are dropping the original feature
df.drop('Seasons',axis=1, inplace = True)

x=['Month','Hour']
for i in x:
      df = pd.concat([df, pd.get_dummies(df[i], prefix=i, drop_first=True)], axis=1)
      df = df.drop([i], axis=1)

df.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

In order to represent categorical variables as numerical values suitable for a machine learning model, we employed One-hot encoding. This process involved creating four new columns for the 'Seasons' column, each corresponding to a specific category ('Winter', 'Spring', 'Summer', and 'Autumn'). We assigned a value of 1 to the matching category and 0 to the other categories within each new column.

For the 'Holiday' and 'Functioning Day' columns, which had two distinct values, we directly encoded them into binary form, using 0 and 1 to indicate the presence or absence of each category.

Moreover, to handle the 'Month' and 'Hour' columns with 12 and 24 unique values, respectively, we applied One-hot encoding. This resulted in multiple new columns representing individual values within each category. To avoid multicollinearity, we dropped the first column from each set of encoded columns.
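
To make the effect of dropping the first column concrete, here is a small sketch on a toy `Seasons` column (the five rows are made up for illustration). With all four dummy columns every row sums to 1, which duplicates the information in a model's intercept; `drop_first=True` removes that redundancy.

```python
import pandas as pd

demo = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# All categories kept: four columns
full = pd.get_dummies(demo["Seasons"], prefix="Seasons", dtype=int)

# First category (alphabetically 'Autumn') dropped: three columns
reduced = pd.get_dummies(demo["Seasons"], prefix="Seasons", drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so the columns are linearly
# dependent once the model also fits an intercept
print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

In the notebook, the manual `np.where` encoding keeps all four season columns, while 'Month' and 'Hour' use `drop_first=True`; whether to also drop one season column is a judgment call, since regularized and tree-based models tolerate the redundancy.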

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

We have created some features in the data wrangling section

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Day and Week Number show little correlation with the target, so we drop them
df = df.drop(['Day', 'Week Number'],axis=1)

# Removing multicollinearity: combine the two temperature columns into one feature
df['Total Temp'] = 0.7*df['Temperature'] + 0.3*df['Dew Point Temperature']
df=df.drop(['Temperature','Dew Point Temperature'],axis=1)


There is multicollinearity between the columns 'Temperature' and 'Dew Point Temperature'. Hence we create a new column, 'Total Temp', computed as 0.7 x Temperature + 0.3 x Dew Point Temperature, and drop the two original columns.
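
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, not the notebook's own method: the `vif` helper is hypothetical, the data are generated so that dew point tracks temperature closely, and only the 0.7 / 0.3 weights are taken from the notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
temperature = rng.normal(13, 12, size=2000)                   # synthetic (assumed)
dew_point = 0.9 * temperature - 8 + rng.normal(0, 3, 2000)    # strongly tied to temperature
humidity = rng.normal(58, 20, size=2000)                      # unrelated in this toy setup

vif_before = vif(np.column_stack([temperature, dew_point, humidity]))
total_temp = 0.7 * temperature + 0.3 * dew_point              # same weights as the notebook
vif_after = vif(np.column_stack([total_temp, humidity]))
print(vif_before.round(1), vif_after.round(1))
```

Values well above 5 to 10 are usually read as a sign of problematic collinearity; after combining the two columns the remaining features have a VIF close to 1 in this toy setup.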

##### What all feature selection methods have you used  and why?

In EDA, we extracted new features from the date column: day, month, year, and week number.

'Temperature' and 'Dew Point Temperature' were combined into a single 'Total Temp' feature because of the collinearity between them.

'Day' and 'Week Number' were also dropped, as the date-derived features showed low correlation with 'Rented Bike Count' (values of 0.04 and 0.07 were observed).

##### Which all features you found important and why?

The remaining columns are treated as equally important, as they show only moderate collinearity with one another.

### 5. Data Transformation

In [None]:
sns.histplot(df['Rented Bike Count'],kde=True)

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

After plotting the histogram, we see that the rented bike count is right-skewed, so we apply a square root transformation to bring it closer to a normal distribution, as seen in the earlier distribution chart of the target.

In [None]:
# Transform Your data
df['Rented Bike Count']=np.sqrt(df['Rented Bike Count'])
sns.histplot(df['Rented Bike Count'],kde=True)
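
The effect of the square root transformation can be checked with the skewness coefficient, and predictions made on the transformed scale have to be squared to get back to bike counts. The sketch below uses a synthetic right-skewed series (an assumption, not the real counts) to show both points.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
counts = pd.Series(rng.gamma(shape=1.2, scale=600, size=5000))  # synthetic right-skewed counts

skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
print(round(skew_raw, 2), round(skew_sqrt, 2))

# Predictions made on the square-root scale are squared to return to bike counts
pred_sqrt_scale = np.array([10.0, 25.0, 40.0])
pred_counts = pred_sqrt_scale ** 2
print(pred_counts)
```

Error metrics reported on the square-root scale are not in units of bikes, so squaring the predictions first is worth considering when the numbers are presented to the business.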

### 6. Data Scaling
We have used MinMaxScaler after splitting the data.

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

We used MinMaxScaler after splitting the data, so that the scaling parameters are learned from the training set only.
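
Min-max scaling maps each feature to the range 0 to 1 using the minimum and maximum seen in the training rows. The sketch below reproduces that arithmetic with NumPy on a synthetic two-feature matrix (an assumption for illustration); `MinMaxScaler.fit_transform` on the training set followed by `transform` on the test set performs the same steps.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(low=[-20, 0], high=[40, 100], size=(1000, 2))  # synthetic two-feature matrix

# 80-20 split by shuffled index (train_test_split does the equivalent)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:800], idx[800:]
X_train, X_test = X[train_idx], X[test_idx]

# Learn the minimum and range from the training rows only, then reuse them for the test rows
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Test values can fall slightly outside 0 to 1 when the test set contains a value beyond the training range; that is expected and avoids leaking test information into the scaler.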

### 7. Dimensionality Reduction
(We did not find it meaningful to further reduce dimensionality)

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x = df.drop(columns=['Rented Bike Count'], axis=1)
y = df['Rented Bike Count']  # already square-root transformed in the Data Transformation step above
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=.20,random_state=4)

In [None]:
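scaler = MinMaxScaler()
# Fit the scaler on the training data only, then apply it to both sets.
# The results are assigned back to X_train / X_test so that the models below use the scaled features.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)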

##### What data splitting ratio have you used and why?

As a standard practice we have split the data into 80-20 ratio.

### 9. Handling Imbalanced Dataset
(Not applicable in our case as the data is balanced)

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

In [None]:
# initiating test and train dictionary for future reference and comparing values
train_data={}
test_data={}

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, output_name):
    '''This function implements the given model and calculates the statistics.
    It adds the results to the train and test dictionaries.
    '''
    # Fit the model
    model.fit(X_train, y_train)
    score = model.score(X_train, y_train)
    print(f'The score for {output_name} is: {score}')

    # Predict using the model
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    print('\n')
    print('-'*50)
    print(f'Metrics for {output_name} train dataset')
    print('-'*50)

    # Calculate and print Mean Squared Error (MSE)
    MSE_train = mean_squared_error(y_train, y_pred_train)
    print(f'MSE : {MSE_train}')

    # Calculate and print Mean Absolute Error (MAE)
    mae_train = mean_absolute_error(y_train, y_pred_train)
    print(f'Mean absolute Error : {mae_train}')

    # Calculate and print Root Mean Square Error (RMSE)
    RMSE_train = np.sqrt(MSE_train)
    print(f'RMSE : {RMSE_train}')

    # Calculate and print R-squared (R2)
    r2_train = r2_score(y_train, y_pred_train)
    print(f'R2 : {r2_train}')

    # Calculate and print Adjusted R-squared (Adjusted R2)
    # Formula => Adjusted R-squared = 1 - [(1 - R-squared) * (n - 1) / (n - p - 1)]
    a_r2_train = 1 - (1 - r2_train) * ((X_train.shape[0] - 1) / (X_train.shape[0] - X_train.shape[1] - 1))
    print(f'Adjusted R^2: {a_r2_train}')

    # Update the observed values to the train dictionary for future references
    train_data[output_name] = MSE_train, mae_train, RMSE_train, r2_train, a_r2_train
    print('\n')

    print('-'*50)
    print(f'Metrics for {output_name} test dataset')
    print('-'*50)

    # Calculate and print Mean Squared Error (MSE)
    MSE_test = mean_squared_error(y_test, y_pred_test)
    print(f'MSE : {MSE_test}')

    # Calculate and print Mean Absolute Error (MAE)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    print(f'Mean absolute Error : {mae_test}')

    # Calculate and print Root Mean Square Error (RMSE)
    RMSE_test = np.sqrt(MSE_test)
    print(f'RMSE : {RMSE_test}')

    # Calculate and print R-squared (R2)
    r2_test = r2_score(y_test, y_pred_test)
    print(f'R2 : {r2_test}')

    # Calculate and print Adjusted R-squared (Adjusted R2)
    a_r2_test = 1 - (1 - r2_test) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
    print(f'Adjusted R^2: {a_r2_test}')

    # Update the observed values to the test dictionary for future references
    test_data[output_name] = MSE_test, mae_test, RMSE_test, r2_test, a_r2_test
    print('\n')

    # Plot the actual vs. predicted values
    plt.figure(figsize=(10, 5))
    plt.title(f'Actual vs. Predicted for {output_name}')
    plt.plot(np.array(y_pred_test))
    plt.plot(np.array(y_test))
    plt.legend(["Predicted", "Actual"])
    plt.show()


### ML Model - 1
# Linear Regression

In [None]:
# ML Model - 1 Implementation
LiReg = LinearRegression()
evaluate_model(LiReg, X_train, X_test, y_train, y_test, 'Linear Regression')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Linear Regression is a statistical technique used to model the relationship between one or more independent variables (predictors) and a dependent variable (response). It assumes a linear relationship between them, where a unit change in an independent variable is associated with a constant change in the dependent variable.

The goal of linear regression is to estimate the parameters of the linear equation that best fits the observed data. For a single predictor, the equation is typically written as follows, where m is the slope and b is the intercept:

Y = mX + b
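
With several predictors the line generalizes to Y = b0 + b1 X1 + ... + bp Xp. The sketch below fits such a model by least squares with NumPy on synthetic data and then computes R^2 and adjusted R^2 with the same formula used in `evaluate_model`. The data and coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))                       # three synthetic predictors
true_coef = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(0, 0.1, 500)   # intercept 4.0 plus small noise

# Prepend a column of ones so the first fitted weight is the intercept
design = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef = weights[0], weights[1:]

# R^2 and adjusted R^2 (n rows, p predictors)
pred = design @ weights
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(intercept.round(2), coef.round(2), round(r2, 4), round(adj_r2, 4))
```

Adjusted R^2 is never larger than R^2; the gap grows with the number of predictors, which matters here because one-hot encoding adds many columns.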

##### Which hyperparameter optimization technique have you used and why?

Ordinary linear regression has no regularization hyperparameter to tune, so no hyperparameter optimization was performed.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Adjusted R^2 on Train set is 0.8186804773197106

Adjusted R^2 on test set is 0.8207516355447193


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 2
# Lasso

In [None]:
lasso = Lasso()
evaluate_model(lasso, X_train, X_test, y_train, y_test, 'Lasso without Hyperparameter Tuning')

In [None]:
# ML Model - 2 Implementation

# Hyperparameter Tuning
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso,parameters, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X_train, y_train)
print(f'The best fit alpha value is found out to be : {lasso_regressor.best_params_}')
print(f'Using {lasso_regressor.best_params_} the negative mean squared error is: {lasso_regressor.best_score_}')

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
lasso=Lasso(alpha=0.0001,max_iter=4000)
evaluate_model(lasso, X_train, X_test, y_train, y_test, 'Lasso with Hyperparameter Tuning')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Lasso, also known as L1 regularization, is a linear regression technique used in machine learning and statistics to prevent overfitting and select a subset of important features from a larger set of predictors. It adds a penalty term to the linear regression objective function: the sum of the absolute values of the coefficients, multiplied by a tuning parameter called the regularization strength (alpha). This penalty encourages the model to shrink the coefficients of less important features to exactly zero, effectively eliminating them from the model. The result is a sparse model with the subset of predictors that are most relevant to the prediction task, which is useful for feature selection and model interpretability. Lasso is particularly effective on datasets that have a large number of predictors and may suffer from multicollinearity, as it performs feature selection and regularization simultaneously.
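
The sparsity described above can be seen on a toy problem. In the sketch below only two of ten synthetic features influence the target; the data, the seed, and `alpha=0.5` are assumptions picked to make the effect visible, not settings from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)  # only 2 of 10 features matter

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ols, n_zero_lasso)
print(lasso.coef_.round(2))
```

Ordinary least squares gives every feature a small non-zero weight, while Lasso zeroes out the irrelevant ones and shrinks the two real coefficients towards zero.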



#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV as the hyperparameter optimization technique. It finds the alpha value for which the model performs best under cross-validation.
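
After a grid search, the selected value does not have to be copied by hand: `best_params_` holds it and `best_estimator_` is the model already refit with it. A minimal sketch on synthetic data follows; the data and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + rng.normal(0, 0.5, 300)

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
search = GridSearchCV(Lasso(max_iter=4000), {"alpha": alphas},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_model = search.best_estimator_   # refit on all of X, y with best_alpha (refit=True by default)
cv_mse = -search.best_score_          # the score is negated MSE, so flip the sign
print(best_alpha, round(cv_mse, 3))
```

Reusing `best_estimator_` keeps the final model in sync with the search if the grid or the data change later.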

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No significant improvement seen

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

### ML Model - 3
# Ridge

In [None]:
ridge=Ridge()
evaluate_model(ridge, X_train, X_test, y_train, y_test, 'Ridge without Hyperparameter Tuning')

In [None]:
# ML Model - 3 Implementation
ridge=Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train,y_train)
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Visualizing evaluation Metric Score chart
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
ridge= Ridge(alpha=1)
# Fit the Algorithm
evaluate_model(ridge, X_train, X_test, y_train, y_test, 'Ridge with Hyperparameter Tuning')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Ridge is a type of regularization technique used in machine learning, particularly in linear regression. It helps prevent overfitting by adding a penalty term to the loss function during model training. The penalty term is proportional to the square of the magnitude of the model's coefficients, which are the parameters that determine the relationship between input features and the predicted output. Ridge regularization encourages the model to use smaller coefficients, resulting in a simpler and more generalizable model. It is also known as L2 regularization because it adds the squared L2 norm of the coefficients to the loss function. Ridge can be tuned with a hyperparameter called the regularization strength, which controls the trade-off between fitting the data and regularizing the model. A higher regularization strength results in more regularization and a simpler model, while a lower regularization strength allows the model to fit the data more closely. Ridge is widely used in machine learning for regression tasks when dealing with multicollinearity or high-dimensional data.
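
The shrinking effect of the regularization strength can be seen directly from the closed-form Ridge solution. The sketch below is an illustration on synthetic data; the helper `ridge_closed_form` is hypothetical and centers the data so that the intercept is left unpenalized.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, 100)

# Length of the coefficient vector for increasing regularization strength
norms = [float(np.linalg.norm(ridge_closed_form(X, y, a))) for a in (0.0, 1.0, 10.0, 100.0)]
print([round(v, 3) for v in norms])
```

The coefficient norm falls as alpha grows, but unlike Lasso the individual coefficients are not driven to exactly zero.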

#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV as the hyperparameter optimization technique. It finds the alpha value for which the model performs best under cross-validation.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No significant improvement was seen; the model performance decreased compared to linear regression.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model 4
# ElasticNet

In [None]:
elasticnet = ElasticNet(alpha=0.001, l1_ratio=0.5)

evaluate_model(elasticnet, X_train, X_test, y_train, y_test, 'ElasticNet')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

ElasticNet is a linear regression method that combines the L1 (Lasso) and L2 (Ridge) regularization techniques. It aims to overcome the limitations of both methods by adding a mixture of both penalties to the linear regression model. ElasticNet has two hyperparameters, alpha and l1_ratio, which control the overall strength of regularization and the balance between L1 and L2 regularization, respectively. This allows ElasticNet to handle multicollinearity in the data and select relevant features, and on some datasets it can achieve better prediction performance than Lasso or Ridge alone. In summary, ElasticNet is a flexible regularization technique that combines the advantages of Lasso and Ridge to reduce overfitting and improve model interpretability.
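
For reference, the objective that scikit-learn's `ElasticNet` minimizes can be written as follows, where $n$ is the number of training rows, $w$ the coefficient vector, $\alpha$ the `alpha` parameter and $\rho$ the `l1_ratio` parameter (the symbol names are chosen here for illustration):

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

With the settings used above ($\alpha = 0.001$, $\rho = 0.5$) the L1 term has weight $0.0005$ and the squared L2 term has weight $0.00025$, so the penalty is light. Setting $\rho = 1$ recovers the Lasso penalty, and $\rho = 0$ leaves only the Ridge-style squared penalty.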

#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

No Hyperparameter tuning for Elastic Net

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No significant improvement was seen; the model performance decreased compared to linear regression.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model 5
# Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model=RandomForestRegressor()
evaluate_model(rf_model, X_train, X_test, y_train, y_test, 'Random Forest')


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Random Forest is a popular machine learning algorithm used for both classification and regression tasks. It is an ensemble method that combines multiple decision trees to make more accurate predictions. The algorithm creates a "forest" of decision trees by randomly selecting a subset of features and data samples from the training dataset. Each tree in the forest is trained independently on these subsets, and their predictions are combined to obtain the final output.
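
For regression, combining the trees means averaging their predictions. The sketch below checks this on a synthetic non-linear target; the data, the seed, and `n_estimators=25` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 400)  # non-linear toy target

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is available in estimators_; the forest output is their mean
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)

print(per_tree.shape, np.allclose(manual_average, forest_prediction))
```

Averaging many trees that were trained on different bootstrap samples reduces the variance of a single deep tree, which is one reason a forest often generalizes better than a lone decision tree.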

#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

An improvement over the linear models was seen.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML model 6
# Support Vector Regressor
(just to check its performance on regression task)

In [None]:
from sklearn.svm import SVR
support_vector = SVR(kernel = 'rbf')
evaluate_model(support_vector, X_train, X_test, y_train, y_test, 'Support Vector Regressor')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Support Vector Regressor (SVR) is a supervised machine learning algorithm used for regression tasks. It is based on the Support Vector Machine (SVM) algorithm, which is commonly used for classification tasks. SVR is designed to predict continuous numerical values rather than discrete classes.

#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

No Hyperparameter tuning for Support vector Regressor

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

This model was found to fail drastically.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

This model was found to fail drastically.

### ML model 7
# Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
evaluate_model(dtr, X_train, X_test, y_train, y_test, 'Decision Tree Regressor')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Decision Tree Regressor is a machine learning algorithm used for regression tasks, which involves predicting a continuous target variable. It works by recursively splitting the feature space into subsets based on the values of input features, and then predicting the target value for each subset. The splits are determined based on a set of predefined rules or criteria, such as minimizing the variance of the target variable or maximizing the information gain. The resulting tree-like structure allows for easy interpretation and visualization. Decision Tree Regressor can handle both numerical and categorical features, and is capable of capturing non-linear relationships between features and the target variable. However, it is prone to overfitting and may not perform well on complex datasets with noisy or sparse data. Regularization techniques, such as pruning or setting maximum depth, can be applied to mitigate overfitting.

#### 2. Cross- Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

No Hyperparameter tuning for Decision Tree Regressor

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No significant improvement was seen; the model performance decreased compared to Random Forest.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
features = x.columns
importances = rf_model.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plotting figure
plt.figure(figsize=(8,10))
plt.title('Importance of Feature')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices], fontsize = 8)
plt.xlabel('Relative Importance')

plt.show()


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the file
import pickle
pickle_path = dir_path + 'RandomForestRegressor.pkl'

# Serialize the model (wb = write bytes); the with-block closes the file afterwards
with open(pickle_path, 'wb') as f:
    pickle.dump(rf_model, f)


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:

# Load the saved model from the pickle file (rb = read bytes)
with open(pickle_path, 'rb') as f:
    Regression_model = pickle.load(f)

# Predicting on unseen data (test set)
Regression_model.predict(X_test)

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In [None]:
# Converting the test and train statistics into dataframes
Test=pd.DataFrame(test_data,index=["Test MSE", 'Test MAE', "Test RMSE",'Test R^2','Test Adjusted R^2'])
Train=pd.DataFrame(train_data,index=["Train MSE", 'Train MAE', "Train RMSE",'Train R^2','Train Adjusted R^2'])

In [None]:
result_df = pd.concat([Train, Test], axis=0)
result_df.transpose()

In [None]:
# Viewing the statistics for the train set
Train.transpose()

In [None]:
# Viewing the statistics for the test set
Test.transpose()

## Write the conclusion here.

### ML model conclusion

### ***Based on the adjusted R^2 score on the test set, we selected Random Forest as the best-performing model, with a score of 91.4%.***
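
For readers who want to check how the five statistics in the tables above relate to each other, here is a minimal sketch that computes them from true and predicted values. The function name `regression_metrics` and the sample numbers are assumptions for illustration; the notebook's own `evaluate_model` helper may compute them differently. Adjusted R^2 uses the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n samples and p features.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, MAE, RMSE, R^2 and adjusted R^2 for one set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": sqrt(mse),
        "R^2": r2,
        "Adjusted R^2": adj_r2,
    }


# Made-up values, only to exercise the function.
metrics = regression_metrics(
    y_true=[3, 5, 7, 9, 11],
    y_pred=[2.5, 5.5, 7, 8.5, 11.5],
    n_features=2,
)
print(metrics)
```

Because adjusted R^2 can only be lower than or equal to R^2 when at least one feature is used, it is a more conservative basis for comparing models with many one-hot encoded columns.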

### ***EDA Observations:***

95.1% of the records fall on non-holidays, and 4.9% fall on holidays.

96.6% of the values are recorded as functioning days, while the remaining are non-functioning days.

The data recorded for various seasons is almost equal (around 25% each).

The data includes records for the years 2017 and 2018, with most of the data belonging to 2018 (91.5%) and the rest to 2017 (8.5%).

Most of the days are functioning days and are not holidays.

The data covers all seasons, providing a comprehensive analysis of functioning days and holidays.

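The captured data spans one year of hourly records (8,760 rows), from December 1, 2017, to November 30, 2018, which is consistent with only 8.5% of the data belonging to 2017.

The total number of bikes rented on non-holidays is 5,956,419, and on holidays it is 215,895.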

The total bikes rented on functioning days is 6,172,314, and on non-functioning days, it is 0.

The total bikes rented in 2018 is 5,986,984, and in 2017, it is 185,330.

On non-functioning days, the sum of rented bikes is zero, indicating that no bikes were rented on those days, which impacts the business negatively.

During winters, the bikes are rented less, affecting the business negatively, while during summer, the impact is positive, and more bikes are rented.

There is a significant spike in bike rentals starting from December 2017, indicating significant growth during that period.

The demand for bikes is highest at 8 am and 6 pm, suggesting people are renting bikes for commuting to and from the office.

Bike rentals during non-functioning days are zero.

Bike demand is higher during summer and lower during winters.

The analysis shows that bike rentals were highest when the rainfall was 0.0.

However, when excluding rainfall values of 0.0, most rentals occurred at low rainfall values.

People prefer to rent bikes when the wind speed is moderate, between 0.3 and 3.

There is a minor impact of wind speed on bike renting preference.

The demand for rental bikes is high when visibility is 2000, and low when solar radiation is 0.0.

People do not prefer to rent bikes when solar radiation is above 0.05.




### ***Behaviour of People at Various Weather Conditions***


People tend to rent bikes when wind speed is between 0.3 and 4.

High visibility (2000) leads to increased bike rentals.

A dew point temperature between -0.25 and 25 results in higher bike rentals.

Bike rentals are higher when solar radiation is low (0.0).

Lower rainfall (0.2) is associated with higher bike rentals.

Bike rentals increase when snowfall is less (0.1).





### ***Various Weather Conditions:***


Temperature is approximately normally distributed, ranging from -20 to 40 degrees Celsius.

Humidity is approximately normally distributed, ranging from 0 to 90.

Wind speed is right-skewed, ranging from 0 to 7.

Visibility is left-skewed, ranging from 0 to 2000.

Dew point temperature ranges from -30 to 30.

Solar radiation is highly right-skewed, with values from 0 to 3.5.

Rainfall is highly right-skewed, ranging from 0 to 35.

Snowfall is highly right-skewed, with values from 0 to 8.


### ***Effect of Various Parameters on Renting Bikes:***

An increase in temperature (X) from -10 to 30 leads to an increase in demand for rental bikes (Y).

For an increase in humidity (X), the demand for rental bikes decreases (Y).

An increase in wind speed (X) from 0 to 3 leads to an increase in the demand for rental bikes (Y).

There is no significant relationship between visibility (X) and the demand for rental bikes (Y).

The demand for rental bikes increases with an increase in dew point temperature (X).

The demand for rental bikes generally increases with an increase in solar radiation (X).

The demand for rental bikes decreases with an increase in snowfall and rainfall (X).