<a href="https://colab.research.google.com/github/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/blob/main/Bike_Sharing_Demand_Prediction_Capstone_ProjectML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Name-**  Pawan Kumar Singh


# **Project Summary -**

Rental bikes have been introduced in many cities to improve mobility and comfort. Making bikes available and accessible to the public at the right time matters because it reduces waiting time, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour so that the supply stays stable.


# **GitHub Link -**

https://github.com/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/blob/main/Bike_Sharing_Demand_Prediction_Capstone_ProjectML.ipynb

# **Problem Statement**


The goal is to predict the bike count required at each hour so that the supply of rental bikes stays stable.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#let's import the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')



### Dataset Loading

In [None]:
# Load Dataset
import requests
from io import StringIO
# Download the CSV directly from GitHub
url = "https://raw.githubusercontent.com/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/main/SeoulBikeData.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
req.raise_for_status()  # stop early if the download failed
data = StringIO(req.text)

bike_df=pd.read_csv(data)

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
value=len(bike_df[bike_df.duplicated()])
value

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

### What did you know about your dataset?

1. The dataset contains 8760 rows and 14 columns.
2. The data is organized and includes a timestamp (the 'Date' and 'Hour' columns).
3. There are no missing or null values in the dataset.
4. dtypes: float64(6), int64(4), object(4)
5. memory usage: 848.3+ KB
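The facts listed above can be recomputed rather than read off the printed output. The following is a minimal sketch on a tiny stand-in frame; `toy_df` and its values are illustrative assumptions, not the real data, and the same calls would apply to `bike_df`.

```python
import pandas as pd

# Hypothetical stand-in for bike_df so the snippet runs on its own
toy_df = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017", "02/12/2017"],
    "Rented Bike Count": [254, 204, 173],
    "Temperature": [-5.2, -5.5, -6.0],
    "Seasons": ["Winter", "Winter", "Winter"],
})

n_rows, n_cols = toy_df.shape                   # number of rows and columns
n_missing = int(toy_df.isnull().sum().sum())    # total missing cells
n_duplicates = int(toy_df.duplicated().sum())   # fully duplicated rows
# Count how many columns share each dtype
dtype_counts = toy_df.dtypes.astype(str).value_counts().to_dict()

print(f"{n_rows} rows x {n_cols} columns, {n_missing} missing, {n_duplicates} duplicates")
print(dtype_counts)
```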



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe()

### Variables Description

**Date** : *The date of the observation, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type : str. It needs to be converted to datetime format.*

**Rented Bike Count** : *Number of rented bikes per hour. This is our dependent variable, the one we need to predict, type : int*

**Hour**: *The hour of the day, from 0 to 23 in 24-hour format, type : int. It needs to be converted to the category data type.*

**Temperature(°C)**: *Temperature in Celsius, type : Float*

**Humidity(%)**: *Humidity in the air in %, type : int*

**Wind speed (m/s)** : *Speed of the wind in m/s, type : Float*

**Visibility (10m)**: *Visibility in units of 10 m, type : int*

**Dew point temperature(°C)**: *Dew point in Celsius, i.e. the temperature at which the air becomes saturated with water vapour, type : Float*

**Solar Radiation (MJ/m2)**: *Solar radiation in MJ/m2, type : Float*

**Rainfall(mm)**: *Amount of rain in mm, type : Float*

**Snowfall (cm)**: *Amount of snow in cm, type : Float*

**Seasons**: *Season of the year, type : str. There are only 4 seasons in the data.*

**Holiday**: *Whether the day is a holiday or not, type : str*

**Functioning Day**: *Whether the day is a Functioning Day or not, type : str*



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(bike_df.apply(lambda col: col.unique()))

In [None]:
# Count the unique values in each column
bike_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#Rename the complex columns name
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Temperature(�C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Dew point temperature(�C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# checking columns names
bike_df.columns

In [None]:
# Convert the 'Date' column from DD/MM/YYYY strings to datetime format
# so that year, month, and day of the week can be extracted from it
bike_df['Date'] = pd.to_datetime(bike_df['Date'], format="%d/%m/%Y")

In [None]:
# Extract year, month, and day of the week
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
# Create a new "weekdays_weekend" column: 1 for Saturday/Sunday, 0 otherwise
bike_df['weekdays_weekend']=bike_df['day'].isin(['Saturday','Sunday']).astype(int)
# Drop 'Date', 'day', and 'year' columns
bike_df=bike_df.drop(columns=['Date','day','year'])

In [None]:
# Get value counts of 'weekdays_weekend' column
bike_df['weekdays_weekend'].value_counts()

* We split the "Date" column into three columns: "year", "month", and "day".
* The "year" column has only two unique values because the data runs from December 2017 to November 2018. Treating this period as a single year, the column adds no information, so we drop it.
* The "day" column holds the day of the week for each date. We do not need each individual day, only whether a day is a weekday or a weekend, so we encode that in "weekdays_weekend" and drop the "day" column.
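The date handling described above can be checked on a few known dates. This is a minimal sketch on a hypothetical `toy` frame; it uses `dt.dayofweek` as an alternative to comparing day names, which is an assumption of this example rather than the notebook's own approach.

```python
import pandas as pd

# Illustrative dates only; the notebook applies the same idea to bike_df
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]})

# Parse DD/MM/YYYY strings into datetime values
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day_name()
# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
toy["weekdays_weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)

print(toy)
```

1 December 2017 was a Friday, so the flag should read 0, 1, 1, 0 for these four dates.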

In [None]:
# Change the integer columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
    bike_df[col]=bike_df[col].astype('category')

In [None]:
# Collect the names of the numerical columns (int64 and float64)
numerical_columns=list(bike_df.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

### What all manipulations have you done and insights you found?

1. Renamed the complex column names.
2. Converted the 'Date' column to datetime format.
3. Extracted the year, month, and day of the week from 'Date'.
4. Created a new 'weekdays_weekend' column.
5. Dropped the 'Date', 'day', and 'year' columns.
6. Changed the integer columns 'Hour', 'month', and 'weekdays_weekend' into category columns.
7. Assigned the remaining numerical columns of 'bike_df' to a variable.

In more detail, the wrangling steps were:

1. Renaming Complex Column Names:
   - The complex column names in 'bike_df' were renamed to make them easier to work with.
   - The 'rename()' function was used with a dictionary mapping the current column names to the new column names.
   - The renamed columns were 'Rented Bike Count', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', and 'Functioning Day'.

2. Converting the Date Column to Datetime Format:
   - Each date string was parsed according to the format "%d/%m/%Y".
   - The resulting datetime values were assigned back to the 'Date' column.

3. Extracting Year, Month, and Day:
   - Three columns ('year', 'month', 'day') were added to 'bike_df'.
   - The 'year' column contains the year extracted from the 'Date' column using the 'dt.year' property.
   - The 'month' column contains the month extracted from the 'Date' column using the 'dt.month' property.
   - The 'day' column contains the day of the week (e.g., 'Sunday', 'Monday') extracted from the 'Date' column using the 'dt.day_name()' method.

4. Determining Weekdays vs. Weekends:
   - A new column called 'weekdays_weekend' was added to 'bike_df'.
   - If the 'day' value is 'Saturday' or 'Sunday', the value in 'weekdays_weekend' is 1; otherwise it is 0.

5. Dropping Columns:
   - The 'Date', 'day', and 'year' columns were dropped from 'bike_df' using the 'drop()' function.
   - The 'columns' parameter specifies the names of the columns to drop.

6. Converting to Category and Selecting Numerical Columns:
   - 'Hour', 'month', and 'weekdays_weekend' were converted to the category data type.
   - The remaining int64 and float64 columns were collected as the numerical features.

These manipulations cover several data preprocessing steps: renaming columns, converting data types, extracting date components, and creating derived features.
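One consequence of the category conversion is that the converted columns no longer count as numerical when columns are selected by dtype. The sketch below illustrates this on a hypothetical three-column frame; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in with one column that will become categorical
toy = pd.DataFrame({
    "Hour": [0, 1, 2],
    "Temperature": [-5.2, -5.5, -6.0],
    "Rented_Bike_Count": [254, 204, 173],
})

# All three columns are numeric before the conversion
before = list(toy.select_dtypes(include="number").columns)

# After casting, 'Hour' is treated as a category and drops out of the selection
toy["Hour"] = toy["Hour"].astype("category")
after = list(toy.select_dtypes(include="number").columns)

print(before)  # ['Hour', 'Temperature', 'Rented_Bike_Count']
print(after)   # ['Temperature', 'Rented_Bike_Count']
```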

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

# Rented bikes according to the month

In [None]:
# Chart - 1 visualization code
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(data=bike_df, x='month', y='Rented_Bike_Count', ax=ax, capsize=.2, hue='Seasons')
ax.set_title('Count of Rented Bikes According to Month')
ax.set_xlabel('Month')
ax.set_ylabel('Rented Bike Count')

plt.show()


##### 1. Why did you pick the specific chart?

The bar plot was chosen because it allows for a clear comparison of the count of rented bikes across different months. The categorical nature of the months is well-suited for representation on the x-axis, while the numerical count of rented bikes is represented on the y-axis. The use of different colors for each season aids in visual comparison. The bar plot enables the identification of any seasonal patterns or variations in bike rental demand and facilitates easy interpretation of the data. Overall, it is an effective and concise way to convey the count of rented bikes for each month and analyze trends over time.

##### 2. What is/are the insight(s) found from the chart?

Seasonal Variations: The average count of rented bikes varies across seasons. Summer months (June, July, and August) have the highest average count, while winter months have relatively lower counts.

Monthly Variations: Within each season, there are variations in the average count of rented bikes across different months.

Winter Season: The winter season generally exhibits lower average counts of rented bikes, suggesting reduced demand during colder months.
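By default, seaborn's `barplot` draws the mean of the y variable for each category, so the averages discussed above can also be read as a table. This is a minimal sketch with made-up counts for two months; the numbers are assumptions for illustration only.

```python
import pandas as pd

# Illustrative values: two January rows and three July rows
toy = pd.DataFrame({
    "month": [1, 1, 7, 7, 7],
    "Rented_Bike_Count": [100, 200, 900, 1000, 1100],
})

# Mean rented-bike count per month, the quantity a bar height represents
monthly_mean = toy.groupby("month")["Rented_Bike_Count"].mean()
print(monthly_mean)
```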

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can positively impact the business by informing decisions related to seasonal demand planning, marketing strategies, and pricing. Understanding the seasonal and monthly variations in bike rental demand allows businesses to optimize resource allocation, target promotions effectively, and adjust pricing strategies. However, the lower average counts during the winter season may pose a temporary negative impact. Nonetheless, this insight can also present opportunities for diversification or focusing on alternative revenue streams during colder months. It is crucial for businesses to analyze these insights in their specific context to make informed decisions and drive positive business outcomes.

#### Chart - 2

# Hourly Distribution

In [None]:
# Chart - 2 visualization code
fig, ax = plt.subplots(figsize=(12, 6))
sns.pointplot(data=bike_df, x='Hour', y='Rented_Bike_Count', hue='weekdays_weekend', ax=ax, markers=["o", "s"], linestyles=["-", "--"], palette=["#1f77b4", "#ff7f0e"])
ax.set(title='Count of Rented Bikes According to Weekdays/Weekends', xlabel='Hour', ylabel='Rented Bike Count')

# Share of rows (hours) that fall on weekdays (0) and weekends (1)
weekday_percentage = (bike_df['weekdays_weekend'].value_counts(normalize=True) * 100).round(2)

# A point plot draws no bar patches, so the percentages are shown in the legend only
handles, labels = ax.get_legend_handles_labels()
names = {'0': 'Weekday', '1': 'Weekend'}
labels = [f'{names.get(label, label)} ({weekday_percentage.loc[int(label)]}%)' for label in labels]
ax.legend(handles, labels)

plt.show()


##### 1. Why did you pick the specific chart?

The specific chart, a point plot with hue grouping, was chosen to visualize the count of rented bikes according to weekdays and weekends on an hourly basis. It allows for easy comparison between weekdays and weekends and provides a clear understanding of the distribution of bike counts throughout the day. The inclusion of percentage labels adds further context and insights to the chart. Overall, this chart effectively presents the desired information in a concise and visually appealing manner.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart indicate that bike rental counts vary throughout the day. Weekdays show higher rental counts during peak commuting hours, while weekends have higher counts in the afternoon and early evening. The highest average counts are observed on weekday evenings and weekend afternoons. These insights suggest different usage patterns between weekdays and weekends and can inform resource planning and marketing strategies.
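The hourly comparison can be backed by a small table of mean counts per hour, split by the weekend flag. The following sketch uses a hypothetical frame with two hours only; the column names match the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Illustrative values for a morning hour (8) and an afternoon hour (15)
toy = pd.DataFrame({
    "Hour": [8, 8, 8, 8, 15, 15, 15, 15],
    "weekdays_weekend": [0, 0, 1, 1, 0, 0, 1, 1],
    "Rented_Bike_Count": [900, 1100, 300, 500, 600, 800, 1000, 1200],
})

# Mean count for every (hour, weekday/weekend) combination
hourly = toy.pivot_table(index="Hour", columns="weekdays_weekend",
                         values="Rented_Bike_Count", aggfunc="mean")
hourly.columns = ["weekday", "weekend"]

# Hour with the highest mean count for each day type
peak_hours = hourly.idxmax()
print(hourly)
print(peak_hours)
```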

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights gained from the data can have a positive impact on the business by optimizing operations and resource allocation. Understanding peak demand periods allows businesses to improve customer satisfaction and revenue. However, insights showing consistently low rental counts during certain hours may indicate a need to attract customers during off-peak times to avoid negative growth. Overall, leveraging these insights helps businesses make informed decisions to drive positive business outcomes.

#### Chart - 3

# Count of Rented Bikes According to Seasons

In [None]:
# Chart - 3 visualization code
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(data=bike_df, x='Seasons', y='Rented_Bike_Count', ax=ax)
ax.set(title='Count of Rented Bikes According to Seasons', xlabel='Seasons', ylabel='Rented Bike Count')
plt.show()


##### 1. Why did you pick the specific chart?

The specific chart, a boxplot, was chosen to visualize the count of rented bikes according to seasons because it effectively displays the distribution of the data and provides insights into the variability and central tendency of the rented bike counts across different seasons. The boxplot allows for easy comparison between seasons, showing the median, quartiles, and potential outliers. It also helps identify any seasonal patterns or differences in the rented bike counts. Overall, the boxplot is a suitable choice for understanding the distribution and variability of the data across different seasons.
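The quantities a boxplot displays can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which is also seaborn's default whisker setting; the `counts` series is made up for illustration.

```python
import pandas as pd

# Illustrative counts with one clearly extreme value
counts = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 3000])

# Quartiles that define the box and the median line
q1, median, q3 = counts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]

print(q1, median, q3)
print(outliers.tolist())
```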

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

1. There is variation in rented bike counts across different seasons.
2. Summer has the highest bike rental demand.
3. Winter has the lowest bike rental demand.

These insights highlight the seasonal trends and can guide businesses in resource allocation and planning to meet customer demand effectively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can help create a positive business impact. By understanding the seasonal variation in rented bike counts, businesses can strategically allocate resources and tailor their marketing efforts to capitalize on the peak demand during summer and other high-demand seasons. This can lead to increased customer satisfaction, improved operational efficiency, and potentially higher revenue.

There are no insights from the chart that directly indicate negative growth. However, the lower bike rental demand during winter may pose a challenge for businesses during that season. They may need to implement alternative strategies such as offering discounts, promoting indoor activities, or diversifying their services to mitigate any potential negative impact on business growth.

#### Chart - 4

# Rented Bike Count vs Snowfall

In [None]:
# Chart - 4 visualization code
fig, ax = plt.subplots(figsize=(18, 5))
sns.scatterplot(data=bike_df, x='Snowfall', y='Rented_Bike_Count', ax=ax)
ax.set(title='Rented Bike Count vs Snowfall', xlabel='Snowfall', ylabel='Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

The specific chart, a scatter plot, was chosen to visualize the relationship between the "Rented_Bike_Count" and "Snowfall" variables. Scatter plots are effective in showing the distribution of data points and any potential patterns or trends between two numerical variables. In this case, the scatter plot allows us to examine how the rented bike count varies with different levels of snowfall. By plotting the "Rented_Bike_Count" on the y-axis and "Snowfall" on the x-axis, we can observe any potential correlation or relationship between these two variables.
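A scatter plot suggests a relationship visually; a correlation coefficient puts a number on it. This is a minimal sketch on made-up values, computing the Pearson coefficient and a Spearman coefficient (here obtained as the Pearson correlation of the ranks).

```python
import pandas as pd

# Illustrative values where the count falls steadily as snowfall rises
toy = pd.DataFrame({
    "Snowfall": [0.0, 0.5, 1.0, 2.0, 4.0],
    "Rented_Bike_Count": [900, 700, 650, 300, 100],
})

# Pearson measures linear association
pearson = toy["Snowfall"].corr(toy["Rented_Bike_Count"])
# Spearman is the Pearson correlation of the ranks, so it measures monotonic association
spearman = toy["Snowfall"].rank().corr(toy["Rented_Bike_Count"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```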

##### 2. What is/are the insight(s) found from the chart?

The insight from the scatter plot chart is that there appears to be a mixed relationship between the "Rented_Bike_Count" and "Snowfall" variables.

1. For low to moderate levels of snowfall (0-2.5 cm), the rented bike count tends to be relatively high, indicating that people still use bikes for transportation despite the presence of snow.
2. However, as the snowfall increases beyond 2.5 cm, the rented bike count starts to decrease, suggesting that severe snowfall has a negative impact on bike usage.
3. There are also some instances of higher rented bike counts at specific snowfall levels, such as 0.6 cm, 1.1 cm, and 3.6 cm, which may indicate unique factors or variations in customer behavior.

Overall, the scatter plot highlights the complex relationship between snowfall and bike rentals, showing a mix of positive and negative impacts depending on the level of snowfall.
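Reading thresholds off a scatter plot is approximate, so one way to check the pattern is to bin snowfall and compare mean counts per bin. The sketch below uses the 2.5 threshold mentioned above; the bin edges, labels, and values are assumptions for illustration.

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "Snowfall": [0.0, 0.0, 0.6, 1.1, 2.0, 3.6, 5.0],
    "Rented_Bike_Count": [800, 1000, 400, 500, 300, 200, 100],
})

# Right-closed bins: exactly 0, above 0 up to 2.5, above 2.5
bins = [-0.01, 0.0, 2.5, 10.0]
labels = ["none", "0-2.5 cm", "above 2.5 cm"]
toy["snow_level"] = pd.cut(toy["Snowfall"], bins=bins, labels=labels)

# Mean rented-bike count within each snowfall level
mean_by_level = toy.groupby("snow_level", observed=True)["Rented_Bike_Count"].mean()
print(mean_by_level)
```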

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the scatter plot can potentially help create a positive business impact.

Positive Impact:
1. The insight that bike rentals remain relatively high during low to moderate levels of snowfall suggests that businesses can promote biking as a viable transportation option even in winter conditions. This can lead to increased revenue and customer engagement.

Negative Impact:
2. The insight that severe snowfall leads to a decrease in bike rentals indicates a potential negative impact on business growth during extreme weather conditions. In such situations, it may be challenging to attract customers and maintain regular bike rental activities.

To mitigate the negative impact, businesses can focus on alternative services or promotions during periods of heavy snowfall, such as offering indoor cycling classes or maintenance services. By diversifying their offerings and adapting to changing weather conditions, businesses can minimize the negative effects and maintain a positive business impact overall.

#### Chart - 5

In [None]:
# Print the numerical data
print(bike_df[['Temperature', 'Rented_Bike_Count']])


In [None]:
# Chart - 5 visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Set the style of the plot
sns.set_style('darkgrid')

# Create the scatter plot
sns.scatterplot(data=bike_df, x='Temperature', y='Rented_Bike_Count', color='blue', alpha=0.5, marker='o', s=50)

# Customize the plot
plt.xlabel('Temperature (°C)')
plt.ylabel('Rented Bike Count')
plt.title('Scatter Plot of Temperature vs Rented Bike Count')
plt.legend(['Bike Data'], loc='upper right')

# Adjust the plot aesthetics
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Set the style of the plot
sns.set_style('darkgrid')

# Create the line plot
sns.lineplot(data=bike_df, x='Solar_Radiation', y='Rented_Bike_Count', color='green')

# Customize the plot
plt.xlabel('Solar Radiation')
plt.ylabel('Mean Rented Bike Count')
plt.title('Mean Rented Bike Count by Solar Radiation')

# Adjust the plot aesthetics
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Select the target column before aggregating so that string columns are never averaged
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
n_cols = 3
n_rows = -(-len(numerical_features) // n_cols)  # ceiling division so leftover plots get a row

fig, axes = plt.subplots(n_rows, n_cols, figsize=(21, n_rows*3), squeeze=False)
axes = axes.ravel()

# One regression plot per numerical feature against the target
for ax, col in zip(axes, numerical_features):
    sns.regplot(x=bike_df[col], y=bike_df['Rented_Bike_Count'], scatter_kws={"color": 'green'}, line_kws={"color": "black"}, ax=ax)
    ax.set_xlabel(col)
    ax.set_ylabel('Rented Bike Count')

# Hide any unused panels in the last row
for ax in axes[len(numerical_features):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
numerical_columns = [col for col in bike_df.columns if bike_df[col].dtype != object]
n_cols = 6
n_rows = -(-len(numerical_columns) // n_cols)  # ceiling division so leftover plots get a row

fig, axes = plt.subplots(n_rows, n_cols, figsize=(24, n_rows*4), squeeze=False)
axes = axes.ravel()

for ax, col in zip(axes, numerical_columns):
    # histplot replaces the deprecated distplot; category columns are cast to float first
    sns.histplot(bike_df[col].astype(float), kde=True, stat='density', ax=ax)
    ax.set_xlabel(col)
    ax.set_ylabel('Density')
    ax.set_title(f'Distribution Plot: {col}')

# Hide any unused panels in the last row
for ax in axes[len(numerical_columns):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
categorical_features = bike_df.select_dtypes(include=['object', 'category']).columns

num_plots = len(categorical_features)
num_cols = 3
num_rows = (num_plots + num_cols - 1) // num_cols

fig, axes = plt.subplots(num_rows, num_cols, figsize=(25, 5*num_rows))
fig.tight_layout()

for i, feature in enumerate(categorical_features):
    row = i // num_cols
    col = i % num_cols

    ax = axes[row, col]
    sns.boxplot(data=bike_df, x=feature, y='Rented_Bike_Count', ax=ax)
    ax.set_xlabel(feature)
    ax.set_ylabel('Rented Bike Count')
    ax.set_title(f'Rented Bike Count by {feature}', loc='left')

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import plotly.graph_objects as go

x = bike_df['Temperature']
y = bike_df['Humidity']
z = bike_df['Rented_Bike_Count']

fig = go.Figure(data=[go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=5,
        color=z,
        colorscale='Viridis',
        opacity=0.8
    )
)])

fig.update_layout(
    scene=dict(
        xaxis_title='Temperature',
        yaxis_title='Humidity',
        zaxis_title='Rented Bike Count'
    ),
    title='3D Scatter Plot of Bike Data',
    width=800,
    height=600
)

fig.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

sns.set_style("whitegrid")

# Create a copy of the original DataFrame with the specified columns
cat_columns = ['Hour', 'Seasons', 'Holiday', 'Functioning_Day', 'month', 'weekdays_weekend']
cat_data = bike_df[cat_columns + ['Rented_Bike_Count']].copy()

# Apply scaling to the 'Rented Bike Count' column
scaler = MinMaxScaler()
cat_data['Rented_Bike_Count'] = scaler.fit_transform(cat_data['Rented_Bike_Count'].values.reshape(-1, 1))

# Add a 'Count' column to count the occurrences of each category
cat_data['Count'] = 1

# Group the data by the categorical columns and count the occurrences over time
grouped_data = cat_data.groupby(cat_columns).sum().reset_index()

# Plot the line plot
fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(data=grouped_data, x='Hour', y='Rented_Bike_Count', hue='Seasons', style='weekdays_weekend', markers=True, ax=ax)

# Customize the plot
ax.set_xlabel('Hour')
ax.set_ylabel('Scaled Rented Bike Count')
ax.set_title('Scaled Rented Bike Count of Categorical Variables over Time')
ax.legend(title='Seasons')

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code


sns.set_style("whitegrid")

# Select the numerical features
numerical_features = ['Temperature', 'Humidity', 'Wind_speed', 'Visibility', 'Dew_point_temperature',
                      'Solar_Radiation', 'Rainfall', 'Snowfall', 'Rented_Bike_Count']

# Copy the selected columns from the DataFrame
num_data = bike_df[numerical_features].copy()

# Apply scaling to the numerical features
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(num_data.values)
num_data_scaled = pd.DataFrame(scaled_data, columns=num_data.columns)

# Select the features to include in the line plot
selected_features = ['Temperature', 'Humidity', 'Wind_speed','Snowfall', 'Rainfall']

# Plot the line plot
fig, ax = plt.subplots(figsize=(12, 8))
for feature in selected_features:
    sns.lineplot(data=num_data_scaled, x=feature, y='Rented_Bike_Count', ax=ax, label=feature)

# Customize the plot
ax.set_xlabel('Scaled Numerical Features')
ax.set_ylabel('Scaled Rented Bike Count')
ax.set_title('Scaled Rented Bike Count vs. Scaled Numerical Features')
ax.legend()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
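# Correlate numeric columns only; string columns such as 'Seasons' cannot be correlated
bike_df.select_dtypes(include='number').corr()['Rented_Bike_Count']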

In [None]:
# Correlation Heatmap visualization code
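## plot the Correlation matrix
plt.figure(figsize=(20,8))
# Correlate numeric columns only; string columns such as 'Seasons' cannot be correlated
correlation=bike_df.select_dtypes(include='number').corr()
# Mask the upper triangle so each pair appears once
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation, mask=mask, annot=True, cmap='coolwarm')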

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(bike_df,hue='Seasons',corner=True)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

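Null hypothesis (H0): `Rented_Bike_Count` is normally distributed.

Alternate hypothesis (H1): `Rented_Bike_Count` is not normally distributed.

Assumption: observations are identically distributed (every element in the sample has an equal probability of occurring).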

In [None]:
# Checking the histogram
plt.figure(figsize=(14, 6))
plt.hist(bike_df['Rented_Bike_Count'])
plt.show()


#### 2. Perform an appropriate statistical test.

In [None]:
# Shapiro-Wilk normality test from SciPy
from scipy.stats import shapiro

DataToTest = bike_df['Rented_Bike_Count']

stat, p = shapiro(DataToTest)

print('stat=%.2f, p=%.30f' % (stat, p))

if p > 0.05:
    print('Normal distribution')
else:
    print('Not a normal distribution')

##### Which statistical test have you done to obtain P-Value?

A normality test, the Shapiro-Wilk test, which checks whether the data are consistent with a normal distribution.

##### Why did you choose the specific statistical test?

The Shapiro-Wilk test is used to assess the normality of a dataset. It helps determine whether the data follows a normal distribution or deviates significantly from it.
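As a rough illustration of how the Shapiro-Wilk decision rule behaves, here is a minimal sketch on two synthetic samples (both are made up for this example and are not taken from the bike dataset). One caveat worth keeping in mind: the SciPy documentation notes that for more than 5000 observations the W statistic is accurate but the p-value may not be, so on a long column it can be safer to test a random subsample.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples, used only to illustrate the test
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=500),
    "skewed": rng.exponential(scale=1.0, size=500),
}

results = {}
for name, sample in samples.items():
    stat, p = shapiro(sample)  # H0: the sample comes from a normal distribution
    results[name] = (stat, p)
    verdict = "no evidence against normality" if p > 0.05 else "not a normal distribution"
    print(f"{name}: W={stat:.3f}, p={p:.3g} -> {verdict}")
```

A p-value above 0.05 only means the test found no evidence against normality; it does not prove that the data are normal.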

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

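Null hypothesis (H0): there is no correlation between `Rented_Bike_Count` and `Temperature`.

Alternate hypothesis (H1): `Rented_Bike_Count` and `Temperature` are correlated.

Assumption: observations are identically distributed; the Pearson test additionally assumes approximately normal data.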

#### 2. Perform an appropriate statistical test.

In [None]:
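# numeric_only=True skips the text columns, which would otherwise raise in recent pandas
bike_df.corr(numeric_only=True)['Rented_Bike_Count']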

In [None]:
FirstSample =bike_df[1:30]['Rented_Bike_Count']
SecondSample = bike_df[1:30]['Temperature']

plt.plot(FirstSample,SecondSample)
plt.show()

In [None]:
#Spearman Rank Correlation
from scipy.stats import spearmanr
stat, p = spearmanr(FirstSample, SecondSample)

print('stat=%.3f, p=%.5f' % (stat, p))
if p > 0.05:
    print('no significant correlation detected')
else:
    print('significant correlation (dependent samples)')

In [None]:
#pearson correlation
from scipy.stats import pearsonr
stat, p = pearsonr(FirstSample, SecondSample)

print('stat=%.3f, p=%.5f' % (stat, p))
if p > 0.05:
    print('no significant correlation detected')
else:
    print('significant correlation (dependent samples)')

##### Which statistical test have you done to obtain P-Value?

Correlation tests: Pearson correlation and Spearman's rank correlation.

##### Why did you choose the specific statistical test?

Pearson correlation is used to measure the strength and direction of a linear relationship between two continuous variables.

Spearman's rank correlation is used to measure the strength and direction of a monotonic relationship between two variables, which can be continuous or ranked.
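To make the linear versus monotonic distinction concrete, here is a small sketch on made-up data: `y` grows strictly with `x` but not along a straight line, so Spearman's coefficient is 1 while Pearson's is lower. The arrays are hypothetical and only serve the illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: strictly increasing but clearly nonlinear
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)

pearson_r, pearson_p = pearsonr(x, y)        # strength of the linear relationship
spearman_rho, spearman_p = spearmanr(x, y)   # strength of the monotonic relationship

print(f"Pearson r    = {pearson_r:.3f} (p={pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p={spearman_p:.3g})")
```

In both tests the null hypothesis is "no correlation", so a small p-value indicates a significant correlation rather than proving dependence in general.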

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

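Null hypothesis (H0): `Seasons` and `Holiday` are independent.

Alternate hypothesis (H1): `Seasons` and `Holiday` are associated.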

#### 2. Perform an appropriate statistical test.

In [None]:
contingency_data = pd.crosstab(bike_df['Seasons'], bike_df['Holiday'],margins = False)
contingency_data

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency
import scipy.stats as stats


# Perform the chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_data)

# Print the chi-square test result
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)

if p_value > 0.05:  # use the chi-square p-value, not the p left over from the correlation test
    print('independent categories')
else:
    print('dependent categories')

##### Which statistical test have you done to obtain P-Value?

A chi-square test of independence (`scipy.stats.chi2_contingency`) was performed on the contingency table of `Seasons` and `Holiday`. The chi-square statistic is 122.59, the p-value is approximately 2.14e-26, and there are 3 degrees of freedom.

Since the p-value is far below the 0.05 significance level, we reject the null hypothesis that `Seasons` and `Holiday` are independent. This suggests a statistically significant association between the two variables in the dataset.



##### Why did you choose the specific statistical test?

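The chi-square test of independence is suited to two categorical variables, which `Seasons` and `Holiday` both are. It compares the observed frequencies in the contingency table with the frequencies expected under independence. Here the difference is statistically significant, so the occurrence of seasons and holidays in the dataset is not independent.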
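The mechanics of the test can be sketched on a small hand-made contingency table. The counts below are hypothetical and chosen only for illustration; they are not the counts from `bike_df`. For a table with 4 rows and 2 columns the degrees of freedom are (4 - 1) * (2 - 1) = 3.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, not the real Seasons x Holiday table
table = pd.DataFrame(
    {"Holiday": [10, 30, 20, 40], "No Holiday": [90, 70, 80, 60]},
    index=["Autumn", "Spring", "Summer", "Winter"],
)

chi2, p_value, dof, expected = chi2_contingency(table)

print("Chi-square statistic:", round(chi2, 3))
print("p-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected counts under independence:")
print(pd.DataFrame(expected, index=table.index, columns=table.columns))

if p_value > 0.05:
    print("independent categories")
else:
    print("dependent categories")
```

Printing the expected counts next to the observed ones makes it easier to see which cells drive the statistic.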

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
pip install missingno -q

In [None]:

import missingno as msno
import pandas as pd

# Assuming you have a DataFrame named bike_df
msno.matrix(bike_df)


In [None]:
# Handling Missing Values & Missing Value Imputation
print(bike_df.isna().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Luckily, there are no missing values in the dataset, so no missing value imputation technique was needed.

### 2. Handling Outliers

`Rented_Bike_Count` is skewed, so a boxplot is used to check for outliers.

In [None]:
bike_df.skew(numeric_only=True)  # skewness is only defined for the numeric columns

In [None]:
bike_df['Rented_Bike_Count'].skew()

In [None]:
# Handling Outliers & Outlier treatments
#Boxplot of Rented Bike Count to check outliers
plt.figure(figsize=(10,6))
plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=bike_df['Rented_Bike_Count'])
plt.show()

The boxplot above shows that there are outliers in the `Rented_Bike_Count` column.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 8))
plt.xlabel('Rented Bike Count')
plt.ylabel('Density')

ax = sns.distplot(np.sqrt(bike_df['Rented_Bike_Count']), color="purple")
ax.axvline(np.sqrt(bike_df['Rented_Bike_Count']).mean(), color='blue', linestyle='dashed', linewidth=2)
ax.axvline(np.sqrt(bike_df['Rented_Bike_Count']).median(), color='green', linestyle='dashed', linewidth=2)

plt.show()


In [None]:
# After applying sqrt to Rented Bike Count, check whether we still have outliers
plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=np.sqrt(bike_df['Rented_Bike_Count']))
plt.show()

In [None]:
bike_df['Rented_Bike_Count']=np.sqrt(bike_df['Rented_Bike_Count'])

##### What all outlier treatment techniques have you used and why did you use those techniques?

np.sqrt(bike_df['Rented_Bike_Count']) will return a new Series object with the square root of each value in the 'Rented_Bike_Count' column.


This transformation is often applied to data to reduce skewness, as taking the square root can help normalize the distribution and make it more symmetric.
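The effect of the square-root transform on skewness can be checked numerically. The sketch below uses a synthetic right-skewed sample (a gamma distribution chosen as an assumption for illustration, not the real counts) and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "counts"; a stand-in for a skewed target column
counts = pd.Series(rng.gamma(shape=2.0, scale=300.0, size=5000))

skew_before = counts.skew()
skew_after = np.sqrt(counts).skew()

print(f"Skewness before sqrt: {skew_before:.2f}")
print(f"Skewness after sqrt:  {skew_after:.2f}")
```

If a model is trained on the square-rooted target, its predictions are on the square-root scale and need to be squared to get back to bike counts.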


### 3. Categorical Encoding

In [None]:
# Assign all categorical features to a variable
categorical_features = bike_df.select_dtypes(include=['object', 'category']).columns
categorical_features

In [None]:
# Use one-hot encoding to create dummy variables
bike_df_copy = pd.get_dummies(bike_df, columns=categorical_features, drop_first=True)

#### What all categorical encoding techniques have you used & why did you use those techniques?

One-hot encoding is a common preprocessing step for categorical features in machine learning models. By converting categorical variables into binary columns, it allows machine learning algorithms to work with categorical data more effectively.
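A small sketch of what `pd.get_dummies(..., drop_first=True)` produces, using a toy frame whose values are invented for this example. With `drop_first=True` the first category of each column (in sorted order) is dropped, which avoids a redundant column for models with an intercept.

```python
import pandas as pd

# Toy frame with made-up values, only to show the resulting columns
toy = pd.DataFrame(
    {
        "Temperature": [-5.2, 12.0, 27.5, 14.1],
        "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
        "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
    }
)

encoded = pd.get_dummies(toy, columns=["Seasons", "Holiday"], drop_first=True)

# "Autumn" and "Holiday" are the dropped baseline categories
print(encoded.columns.tolist())
print(encoded)
```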

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
bike_df_copy.columns

In [None]:
X = bike_df_copy.drop(columns=['Rented_Bike_Count'])
# Rented_Bike_Count was already square-root transformed above, so it is used as-is
y = bike_df_copy['Rented_Bike_Count']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)


In [None]:
# The following function selects highly correlated features.
# For each correlated pair it flags the later column, so the earlier one is kept.

def correlation(dataset, threshold):
    col_corr = set()  # names of the correlated columns to drop
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # compare the absolute coefficient so strong negative correlations are caught too
            if abs(corr_matrix.iloc[i, j]) > threshold:
                colname = corr_matrix.columns[i]  # name of the later column in the pair
                col_corr.add(colname)
    return col_corr

In [None]:
corr_features = correlation(X_train, 0.7)
len(set(corr_features))

In [None]:
corr_features

#### 2. Feature Selection

In [None]:
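# Select your features wisely to avoid overfitting
# drop() returns a new DataFrame, so the result must be assigned back
X_train = X_train.drop(columns=list(corr_features))
X_test = X_test.drop(columns=list(corr_features))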

In [None]:
# drop(..., inplace=True) returns None, so assign the returned frame instead
bike_df_copy = bike_df_copy.drop('Dew_point_temperature', axis=1)


##### What all feature selection methods have you used  and why?

Correlated features are features that exhibit a high degree of linear relationship or dependency between them. Having highly correlated features can introduce multicollinearity in a model, which can affect the model's performance and interpretability.


##### Which all features you found important and why?

Apart from the features flagged above, the remaining features are not strongly correlated with each other, so they are kept. Highly correlated features are dropped because they introduce redundancy, reduce interpretability, can make the model unstable, and increase the risk of overfitting. Selecting uncorrelated features can improve model performance, interpretability, and generalization.
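One compact way to flag highly correlated columns is to look only at the upper triangle of the absolute correlation matrix. The sketch below is an alternative formulation under stated assumptions: the helper name `highly_correlated_columns` and the toy data are hypothetical, and the 0.7 threshold simply mirrors the one used in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(frame, threshold=0.7):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only entries above the diagonal so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": -a + rng.normal(scale=0.1, size=200),  # strongly negatively correlated with "a"
        "c": rng.normal(size=200),                  # unrelated noise
    }
)

to_drop = highly_correlated_columns(toy, threshold=0.7)
reduced = toy.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Kept:", reduced.columns.tolist())
```

Taking the absolute value matters here: column `b` is negatively correlated with `a`, and a check on the raw coefficient would miss it.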

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Plotting the distplots without any transformation

for col in X_train.columns:
    plt.figure(figsize=(8,2))
    plt.subplot(121)
    sns.distplot(X_train[col])
    plt.title(col)

    plt.subplot(122)
    stats.probplot(X_train[col], dist="norm", plot=plt)
    plt.title(col)

    plt.show()

In [None]:
# Transform Your data
# Apply Yeo-Johnson transform
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd

pt1 = PowerTransformer()
X_train_transformed2 = pt1.fit_transform(X_train)
X_test_transformed2 = pt1.transform(X_test)

lr = LinearRegression()

# Perform cross-validation with 5 folds
cv_scores = cross_val_score(lr, X_train_transformed2, y_train, cv=5, scoring='r2')

# Train the model on the full training set
lr.fit(X_train_transformed2, y_train)

# Predict on the test set
y_pred3 = lr.predict(X_test_transformed2)

r2 = r2_score(y_test, y_pred3)
print("R-squared score:", r2)

df = pd.DataFrame({'cols': X_train.columns, 'Yeo_Johnson_lambdas': pt1.lambdas_})
# print(df)

print("Cross-validated R-squared scores:", cv_scores)
mean_r2 = np.mean(cv_scores)
print("Mean R-squared score:", mean_r2)


In [None]:
X_train_transformed = pd.DataFrame(X_train_transformed2,columns=X_train.columns)

for col in X_train_transformed.columns:
    plt.figure(figsize=(8,2))
    plt.subplot(121)
    sns.distplot(X_train[col])
    plt.title(col)

    plt.subplot(122)
    sns.distplot(X_train_transformed[col])
    plt.title(col)

    plt.show()

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Instantiate a LinearRegression model
lr = LinearRegression()

# Fit the model using the scaled features (X_train_scaled) and the continuous labels (y_train)
lr.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = lr.predict(X_test_scaled)

# Calculate the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
from sklearn.metrics import r2_score

# Assuming y_test contains the true target values and y_pred contains the predicted values
r2 = r2_score(y_test, y_pred)
print("R-squared score:", r2)
from sklearn.model_selection import cross_val_score

# Assuming X_train_scaled and y_train contain the scaled training data and target values, respectively
# Assuming lr is your trained linear regression model

# Perform cross-validation with 5 folds
cv_scores = cross_val_score(lr, X_train_scaled, y_train, cv=5, scoring='r2')

# Print the cross-validated R-squared scores
print("Cross-validated R-squared scores:", cv_scores)
print("Mean R-squared score:", cv_scores.mean())


In [None]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

ax1.scatter(X_train['Visibility'], X_train['Temperature'])
ax1.set_title("Before Scaling")
ax2.scatter(X_train_scaled['Visibility'], X_train_scaled['Temperature'],color='red')
ax2.set_title("After Scaling")
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['Visibility'], ax=ax1)
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('After Standard Scaling')
sns.kdeplot(X_train_scaled['Visibility'], ax=ax2)
sns.kdeplot(X_train_scaled['Temperature'], ax=ax2)
plt.show()

# Comparison of Distributions

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Temperature Distribution Before Scaling')
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('Temperature Distribution After Standard Scaling')
sns.kdeplot(X_train_scaled['Temperature'], ax=ax2)
plt.show()

In [None]:
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler_scaler = MinMaxScaler()

# fit the scaler to the train set, it will learn the parameters
MinMaxScaler_scaler.fit(X_train)

# transform train and test sets
X_train_scaled_MinMaxScaler = MinMaxScaler_scaler.transform(X_train)
X_test_scaled_MinMaxScaler = MinMaxScaler_scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create an instance of the LinearRegression model
lr = LinearRegression()

# Fit the model on the scaled training data
lr.fit(X_train_scaled_MinMaxScaler, y_train)

# Predict the target variable for the scaled test data
y_pred_MinMaxScaler = lr.predict(X_test_scaled_MinMaxScaler)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred_MinMaxScaler)

print("Mean Squared Error:", mse)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred_MinMaxScaler)
print("R-squared score:", r2)

# Perform cross-validation with 5 folds
cv_scores_MinMaxScaler = cross_val_score(lr, X_train_scaled_MinMaxScaler, y_train, cv=5, scoring='r2')
mean_r2_MinMaxScaler = np.mean(cv_scores_MinMaxScaler)
print("Mean cross-validated R-squared score:", mean_r2_MinMaxScaler)


In [None]:
X_train_scaled_MinMaxScaler = pd.DataFrame(X_train_scaled_MinMaxScaler, columns=X_train.columns)
X_test_scaled_MinMaxScaler = pd.DataFrame(X_test_scaled_MinMaxScaler, columns=X_test.columns)

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

ax1.scatter(X_train['Visibility'], X_train['Temperature'])
ax1.set_title("Before Scaling")
ax2.scatter(X_train_scaled_MinMaxScaler['Visibility'], X_train_scaled_MinMaxScaler['Temperature'],color='red')
ax2.set_title("After Scaling")
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['Visibility'], ax=ax1)
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('After Min-Max Scaling')
sns.kdeplot(X_train_scaled_MinMaxScaler['Visibility'], ax=ax2)
sns.kdeplot(X_train_scaled_MinMaxScaler['Temperature'], ax=ax2)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Temperature Distribution Before Scaling')
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('Temperature Distribution After Min-Max Scaling')
sns.kdeplot(X_train_scaled_MinMaxScaler['Temperature'], ax=ax2)
plt.show()

# Why is scaling important?

Scaling is important in machine learning to:
- Normalize the features and bring them to a comparable range.
- Improve algorithm convergence and stability.
- Prevent numerical instability and biased results.
- Ensure compatibility with certain algorithms.
- Enable fair and accurate feature comparisons.
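The difference between standardization and min-max normalization can be seen on a tiny made-up array (the numbers are hypothetical, loosely shaped like a temperature column and a visibility column). In both cases the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training and test rows with two features on very different scales
X_train_toy = np.array([[0.0, 200.0], [10.0, 1000.0], [20.0, 2000.0], [30.0, 800.0]])
X_test_toy = np.array([[40.0, 100.0]])

standard = StandardScaler().fit(X_train_toy)  # learns mean and standard deviation
minmax = MinMaxScaler().fit(X_train_toy)      # learns min and max

X_train_std = standard.transform(X_train_toy)
X_train_mm = minmax.transform(X_train_toy)
X_test_mm = minmax.transform(X_test_toy)

print("Standardized column means:", X_train_std.mean(axis=0))
print("Standardized column stds: ", X_train_std.std(axis=0))
print("Min-max column ranges:    ", X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print("Min-max on unseen row:    ", X_test_mm)
```

The last line shows that test values outside the training range fall outside [0, 1] after min-max scaling, which is expected when the scaler is fitted on the training set only.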

##### Which method have you used to scale you data and why?

I use standardization because it performs slightly better than min-max normalization.

### 7. Dimensionality Reduction

The "Curse of Dimensionality" refers to the challenges that arise when dealing with high-dimensional data. It causes sparsity of data, increased computational complexity, overfitting, and difficulties with distance-based algorithms. Mitigation strategies include feature selection, dimensionality reduction, regularization, and leveraging domain knowledge.

##### Do you think that dimensionality reduction is needed? Explain Why?

Apply PCA to the training set:

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


In [None]:
for i in range(1, X_train.shape[1] + 1):
    # Keep the first i principal components
    pca = PCA(n_components=i)
    X_train_pca = pca.fit_transform(X_train)
    # Fit a linear regression model on the transformed training set
    lrp = LinearRegression()
    lrp.fit(X_train_pca, y_train)
    # Transform the test set using the same PCA transformation
    X_test_pca = pca.transform(X_test)
    # Make predictions on the transformed test set
    y_pred = lrp.predict(X_test_pca)
    # Evaluate the model using the R-squared score
    r2 = r2_score(y_test, y_pred)
    print("n_components:", i, "R-squared score:", r2)

In [None]:
# Keep up to 45 components; n_components cannot exceed the number of features
pca = PCA(n_components=min(45, X_train.shape[1]))
X_train_pca = pca.fit_transform(X_train)
#Fit a linear regression model on the transformed training set:
lrp = LinearRegression()
lrp.fit(X_train_pca, y_train)
# Transform the test set using the same PCA transformation:
X_test_pca = pca.transform(X_test)
# Make predictions on the transformed test set:
y_pred = lrp.predict(X_test_pca)
# Evaluate the performance of the model using the R-squared score:
r2 = r2_score(y_test, y_pred)
print("R-squared score:", r2)


In [None]:
# transforming in 3D
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
import plotly.express as px
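# Color labels as a plain array so they align positionally with the PCA output
y_train_pca = y_train.astype(str).to_numpy()
# No data_frame argument: x, y, z and color are passed directly as arrays
fig = px.scatter_3d(x=X_train_pca[:,0], y=X_train_pca[:,1], z=X_train_pca[:,2],
              color=y_train_pca)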
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    paper_bgcolor="LightSteelBlue",
)
fig.show()

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

PCA was applied to the dataset. Because the number of columns is not very high, PCA does not help much.

Dimensionality reduction techniques are used to reduce the number of features or variables in a dataset while preserving the essential information. These techniques are beneficial when working with high-dimensional data, as they can help in simplifying the data representation, reducing computational complexity, and removing noise or redundant information.



Principal Component Analysis (PCA): PCA is a popular linear technique that transforms the original variables into a new set of uncorrelated variables called principal components. These components are ordered by the amount of variance they explain in the data, allowing for dimensionality reduction by selecting a subset of the components that capture most of the variability.
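A common way to choose the number of components is to keep the smallest number whose cumulative explained variance reaches a target such as 95%. The sketch below does this on synthetic data built from two hidden factors (an assumption for illustration only). It standardizes the features first, because PCA is sensitive to feature scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 6 observed features driven by 2 hidden factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

X_std = StandardScaler().fit_transform(X_toy)  # put every feature on the same scale
pca = PCA().fit(X_std)                         # keep all components to inspect the variance

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance is at least 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95% of the variance:", n_keep)
```

When the original features are already few and only weakly redundant, the cumulative curve rises slowly and little is gained from dropping components.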

### 8. Data Splitting

In [None]:
X = bike_df_copy.drop(columns=['Rented_Bike_Count'])
# Rented_Bike_Count was already square-root transformed in the outlier-handling step
y = bike_df_copy['Rented_Bike_Count']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)

##### What data splitting ratio have you used and why?

A 75/25 split was used: 75% of the data for training and 25% for testing. This is a commonly used ratio in machine learning.

The reason for choosing this ratio is to strike a balance between having enough data for training the model and having enough data for evaluating its performance. By allocating 75% of the data for training, the model has a substantial amount of data to learn from and generalize patterns. The remaining 25% is used for testing, allowing us to assess how well the model performs on unseen data and evaluate its generalization capabilities.
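A quick sanity check of the 75/25 split on a toy array (the data are placeholders for illustration): the shapes confirm the ratio, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print("Train shape:", X_tr.shape)  # 75 rows
print("Test shape: ", X_te.shape)  # 25 rows
print("Rows shared by both sets:", len(set(y_tr) & set(y_te)))
```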

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

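The `Rented_Bike_Count` column is right-skewed (positive skew) in the raw dataset, but the square-root transform (`np.sqrt`) applied during outlier handling brought its distribution much closer to normal. Class imbalance is a classification concept; for this regression target the relevant concern is skewness, which has already been addressed.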

In [None]:
# Handling Imbalanced Dataset (If needed)
# Checking the histogram
plt.figure(figsize=(14, 6))
plt.hist(bike_df['Rented_Bike_Count'])
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
plt.xlabel('Rented Bike Count')
plt.ylabel('Density')
ax = sns.distplot(bike_df['Rented_Bike_Count'], color="green")
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

np.sqrt(bike_df['Rented_Bike_Count']) will return a new Series object with the square root of each value in the 'Rented_Bike_Count' column.


This transformation is often applied to data to reduce skewness, as taking the square root can help normalize the distribution and make it more symmetric.

## ***7. ML Model Implementation***

### ML Model - 1

# Linear Regression

In [None]:
# ML Model - 1 Implementation
# Fit the Algorithm
reg= LinearRegression()
reg.fit(X_train, y_train)


In [None]:
#check the score
reg.score(X_train, y_train)


In [None]:
# Predict on the model
#get the X_train and X-test value
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)
print("Predicted target values:", y_pred_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate the evaluation metric scores
train_score = r2_score(y_train, y_pred_train)
print(train_score)
test_score = r2_score(y_test, y_pred_test)
print(test_score)

In [None]:
# Cross checking with cross val score
np.mean(cross_val_score(reg,X_train, y_train,scoring='r2',cv=10))

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error, mean_absolute_error
#calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)


#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)



#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
# These are training-set metrics, so the sample size and feature count come from X_train
n_obs, n_feat = X_train.shape
Adjusted_R2_lr = 1 - (1 - r2_lr) * (n_obs - 1) / (n_obs - n_feat - 1)
print("Adjusted R2 :", Adjusted_R2_lr)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the parameter grid for GridSearchCV

param_grid = {
    'fit_intercept': [True, False],
}

# Fit the Algorithm
# Perform GridSearchCV with cross-validation
grid_search = GridSearchCV(reg, param_grid, cv=5)
grid_search.fit(X_train, y_train)



In [None]:
# Predict on the model
# Print the best parameters and best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

In [None]:

# Perform cross-validation with the best estimator
linear_regression_best = grid_search.best_estimator_
cv_scores = cross_val_score(linear_regression_best, X_train, y_train, cv=10)

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())

# Fit the best estimator on the training data
linear_regression_best.fit(X_train, y_train)

# Evaluate the model on the testing data
test_score = linear_regression_best.score(X_test, y_test)
print("Test Score:", test_score)

##### Which hyperparameter optimization technique have you used and why?


GridSearchCV is a class in scikit-learn that facilitates performing an exhaustive search over specified parameter values for an estimator. It is commonly used for hyperparameter tuning, which involves finding the optimal values for the hyperparameters of a machine learning model.
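Plain `LinearRegression` has very few hyperparameters, so the exhaustive search is easier to see on a model that has a real one to tune. The sketch below swaps in `Ridge` (a regularized linear model, not the model used in this notebook) on synthetic data from `make_regression`; the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem, used only to demonstrate the search
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print("Combinations tried:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best mean CV R2:", search.best_score_)
print("Test R2 of the refitted best model:", search.score(X_te, y_te))
```

Because `refit=True` by default, `search` is refitted on the whole training set with the best parameters, so `search.score` and `search.predict` already use the tuned model.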

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, there is a slight improvement in the model.

Without GridSearchCV:

Test Score: 0.7688887774277025

With GridSearchCV:

Test Score: 0.7905536900393838


### ML Model - 2

# **DECISION TREE**

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
from sklearn.tree import DecisionTreeRegressor

decision_regressor = DecisionTreeRegressor(max_depth=None,
                                           max_features=12,
                                           max_leaf_nodes=150)

# Fit the Algorithm
decision_regressor.fit(X_train, y_train)

In [None]:
# Predict on the model
#get the X_train and X-test value
y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d = decision_regressor.predict(X_test)

In [None]:
# Assuming you have defined and assigned values to X_train and y_train

# Calculate the R² score on the training data
score_train = decision_regressor.score(X_train, y_train)

# Print the R² score
print("R² score on training data:", score_train)


In [None]:
# Visualizing evaluation Metric Score chart
#import the packages
from sklearn.metrics import mean_squared_error, mean_absolute_error
print("Model Score:",decision_regressor.score(X_train,y_train))

#calculate MSE
MSE_d= mean_squared_error(y_train, y_pred_train_d)
print("MSE :",MSE_d)

#calculate RMSE
RMSE_d=np.sqrt(MSE_d)
print("RMSE :",RMSE_d)


#calculate MAE
MAE_d= mean_absolute_error(y_train, y_pred_train_d)
print("MAE :",MAE_d)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_d= r2_score(y_train, y_pred_train_d)
print("R2 :",r2_d)
# These are training-set metrics, so the sample size and feature count come from X_train
n_obs, n_feat = X_train.shape
Adjusted_R2_d = 1 - (1 - r2_d) * (n_obs - 1) / (n_obs - n_feat - 1)
print("Adjusted R2 :", Adjusted_R2_d)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [None, 5, 8, 9, 10, 15, 16],
    'max_features': [2, 5, 9, 10, 12],
    'max_leaf_nodes': [100, 150, 180]
}

# Create an instance of DecisionTreeRegressor
decision_regressor = DecisionTreeRegressor()

# Create the GridSearchCV object
grid_search = GridSearchCV(decision_regressor, param_grid, cv=5)

# Fit the data to perform the grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Use the best estimator for predictions
best_estimator = grid_search.best_estimator_
y_pred_test_tuned = best_estimator.predict(X_test)


In [None]:
# Print the best parameters and best score
print("Best Parameters: ", best_params)
print("Best Score: ", best_score)



In [None]:
# Perform cross-validation with best parameters
# best_estimator carries the tuned parameters; decision_regressor is still the untuned default tree
cross_val_scores = cross_val_score(best_estimator, X_train, y_train, cv=10)
print("Cross-Validation Scores: ", cross_val_scores)
print("Average Cross-Validation Score: ", cross_val_scores.mean())

##### Which hyperparameter optimization technique have you used and why?


GridSearchCV is a class in scikit-learn that facilitates performing an exhaustive search over specified parameter values for an estimator. It is commonly used for hyperparameter tuning, which involves finding the optimal values for the hyperparameters of a machine learning model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No, I did not see any improvement; the test score dropped after tuning.

Without GridSearchCV:

Test Score: 0.8512091074696333

With GridSearchCV:

Test Score: 0.7410361966835403

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Each evaluation metric provides a different perspective on the performance of a machine learning model. Here's an explanation of each metric and its indication towards the business impact of the ML model used:

1. Mean Squared Error (MSE):
   - MSE measures the average squared difference between the predicted and actual values.
   - It gives more weight to larger errors due to the squaring operation.
   - A lower MSE indicates better model performance and suggests that the model's predictions are closer to the actual values.
   - In terms of business impact, a lower MSE implies that the model's predictions have less error and are more accurate. This can lead to more reliable decision-making and improved efficiency in business operations.

2. Root Mean Squared Error (RMSE):
   - RMSE is the square root of MSE and represents the average magnitude of the prediction errors.
   - It is expressed in the same units as the target variable, making it easily interpretable.
   - Like MSE, a lower RMSE indicates better model performance and suggests that the model's predictions are closer to the actual values.
   - From a business perspective, a lower RMSE implies that the model's predictions have smaller errors on average. This can increase trust in the model's output and support decision-making with more confidence.

3. Mean Absolute Error (MAE):
   - MAE measures the average absolute difference between the predicted and actual values.
   - Unlike MSE, it does not square the errors, giving equal weight to all errors.
   - A lower MAE indicates better model performance and suggests that the model's predictions have less absolute error.
   - In terms of business impact, a lower MAE means that the model's predictions are closer to the actual values on average. This can lead to better resource allocation, improved planning, and reduced costs.

4. R-squared (R2):
   - R2 measures the proportion of the variance in the target variable that is explained by the model.
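   - It typically ranges from 0 to 1, with higher values indicating a better fit of the model to the data; it can be negative on unseen data when the model predicts worse than simply using the mean of the target.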
   - An R2 of 1 means that the model explains all the variability in the target variable.
   - From a business perspective, a higher R2 suggests that the model captures a larger portion of the target variable's variation, indicating its ability to provide useful insights and predictions.

5. Adjusted R-squared (Adjusted R2):
   - Adjusted R2 is a modified version of R2 that adjusts for the number of predictors in the model.
   - It penalizes the addition of irrelevant predictors that do not improve the model's performance significantly.
   - The adjusted R2 value can be lower than R2 if the model's added predictors are not informative.
   - In terms of business impact, the adjusted R2 helps assess the incremental value of adding more predictors to the model. It provides a more conservative estimate of the model's explanatory power and helps in avoiding overfitting.

The business impact of the ML model used depends on the specific context and objectives of the business. However, in general, a well-performing ML model with lower MSE, RMSE, and MAE indicates more accurate predictions and improved decision-making. Higher R2 and adjusted R2 values suggest that the model captures more of the target variable's variation, enabling valuable insights and predictions for the business. Ultimately, a reliable and high-performing ML model can lead to enhanced efficiency, cost reduction, improved resource allocation, and better-informed business decisions.
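The five metrics above can be collected in one small helper. This is a minimal sketch: the function name `regression_report` and the toy arrays are hypothetical, and the adjusted R2 uses the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations scored and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for one set of predictions."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2,
        # n and n_features must describe the same data the predictions were made on
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

# Toy values, only to show the arithmetic
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0, 12.0, 12.0])

report = regression_report(y_true, y_pred, n_features=2)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Calling the helper once with the training predictions and once with the test predictions puts both sets of scores side by side, which makes overfitting easier to spot.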

# ML Model - 3

# RandomForestRegressor

In [None]:
# ML Model - 3 Implementation
#import the packages
from sklearn.ensemble import RandomForestRegressor
# Create an instance of the RandomForestRegressor
rf_model = RandomForestRegressor()

In [None]:
# Fit the Algorithm
rf_model.fit(X_train,y_train)

In [None]:
# Predict on the model
y_pred_train_r = rf_model.predict(X_train)
y_pred_test_r = rf_model.predict(X_test)

In [None]:
rf_model.score(X_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# import the metric functions used in this cell
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# NOTE: every metric in this cell is computed on the training data
print("Model Score:",rf_model.score(X_train,y_train))

#calculate MSE
MSE_rf= mean_squared_error(y_train, y_pred_train_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)


#calculate MAE
MAE_rf= mean_absolute_error(y_train, y_pred_train_r)
print("MAE :",MAE_rf)


#calculate r2 and adjusted r2
r2_rf= r2_score(y_train, y_pred_train_r)
print("R2 :",r2_rf)
# adjusted R2 needs the sample count n and feature count p of the data being scored (X_train, not X_test)
n, p = X_train.shape
Adjusted_R2_rf = 1 - (1 - r2_rf) * (n - 1) / (n - p - 1)
print("Adjusted R2 :",Adjusted_R2_rf)


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees in random forest
n_estimators = [20,60,100,120]

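# Fraction of the features to consider at every split
max_features = [0.2,0.6,1.0]

# Maximum number of levels in tree (None means no limit)
max_depth = [2,8,None]

# Fraction of the training samples drawn for each tree (only used when bootstrap=True)
max_samples = [0.5,0.75,1.0]

# Whether each tree is trained on a bootstrap sample (True) or on the full training set (False)
bootstrap = [True,False]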

# Minimum number of samples required to split a node
min_samples_split = [2, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]

In [None]:
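# max_samples is only valid when bootstrap=True (scikit-learn raises a ValueError otherwise),
# so the search space is split into one dictionary per bootstrap setting
common_params = {'n_estimators': n_estimators,
                 'max_features': max_features,
                 'max_depth': max_depth,
                 'min_samples_split':min_samples_split,
                 'min_samples_leaf':min_samples_leaf
                }
param_grid = [{**common_params, 'bootstrap': [True], 'max_samples': max_samples},
              {**common_params, 'bootstrap': [False]}]
print(param_grid)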

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rf_grid = RandomizedSearchCV(estimator = rf_model,
                       param_distributions = param_grid,
                       cv = 5,
                       verbose=2,
                       n_jobs = -1)

In [None]:
rf_grid.fit(X_train,y_train)

In [None]:
rf_grid.best_params_

In [None]:
rf_grid.best_score_

##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV is used here. It is a scikit-learn class that samples a fixed number of hyperparameter combinations from the specified parameter lists (`n_iter`, 10 by default) and evaluates each sampled combination with cross-validation (5 folds in this notebook).

This automates hyperparameter tuning and returns the best combination among those sampled, according to the cross-validated score.

GridSearchCV, by contrast, performs an exhaustive search over every combination in the parameter grid. It is guaranteed to find the best combination within the grid, but its cost grows with the product of the number of values per hyperparameter.


GridSearchCV is therefore a good fit for small datasets and small grids, since it takes more time and computation than RandomizedSearchCV. When working with a large dataset or a large grid, as here with several hundred combinations, a cheaper approach such as RandomizedSearchCV reduces the tuning time.
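To make the cost difference concrete, the sketch below counts how many model fits an exhaustive search and a randomized search would need. It copies the candidate values listed in the tuning cell and counts every combination of them; the variable names are illustrative, and `n_iter = 10` is the `RandomizedSearchCV` default that applies because the notebook does not pass `n_iter`.

```python
from math import prod

# Candidate values copied from the tuning cell
search_space = {
    "n_estimators": [20, 60, 100, 120],
    "max_features": [0.2, 0.6, 1.0],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75, 1.0],
    "bootstrap": [True, False],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

cv_folds = 5   # cv=5 in the notebook
n_iter = 10    # RandomizedSearchCV default when n_iter is not passed

# An exhaustive grid tries every combination; a randomized search tries n_iter of them
n_combinations = prod(len(values) for values in search_space.values())
grid_search_fits = n_combinations * cv_folds
randomized_search_fits = n_iter * cv_folds

print("Combinations in the full grid:", n_combinations)
print("Fits for an exhaustive search:", grid_search_fits)
print("Fits for the randomized search:", randomized_search_fits)
```

The randomized search is far cheaper, at the price of evaluating only a small sample of the grid, so the best combination it reports may not be the best one in the grid.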


##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

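Without RandomizedSearchCV (default model, score from the evaluation cell above, which uses the training data):

best score : 0.9853788678117673


With RandomizedSearchCV (`rf_grid.best_score_`, the mean cross-validated score of the best sampled combination):

best score : 0.8804336666598515


On these numbers, hyperparameter tuning shows no improvement. The two scores are not directly comparable, though: the first is measured on the same data the model was trained on, while the second is measured on held-out folds, so a gap between them is expected.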
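A like-for-like comparison scores the untuned and the tuned model the same way. The following is a minimal sketch of that idea, assuming synthetic data from `make_regression` as a stand-in for the bike data and a deliberately small search space; none of the values below come from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the bike data (assumption for illustration only)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Untuned model, scored with the same 5-fold cross-validation the search uses
baseline = RandomForestRegressor(n_estimators=50, random_state=0)
baseline_cv_r2 = cross_val_score(baseline, X_train, y_train, cv=5, scoring="r2").mean()

# Tuned model: a small randomized search over illustrative values
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [20, 60],
        "max_depth": [2, 8, None],
        "min_samples_leaf": [1, 2],
    },
    n_iter=4,
    cv=5,
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

# Training score of the untuned model, shown only to illustrate how optimistic it is
baseline.fit(X_train, y_train)
baseline_train_r2 = baseline.score(X_train, y_train)

print("Baseline train R2:", baseline_train_r2)
print("Baseline CV R2   :", baseline_cv_r2)
print("Tuned CV R2      :", search.best_score_)
print("Baseline test R2 :", baseline.score(X_test, y_test))
print("Tuned test R2    :", search.score(X_test, y_test))
```

Comparing the two cross-validated scores, or the two test scores, shows whether tuning helped; comparing a training score against a cross-validated score does not.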




### 1. Which Evaluation metrics did you consider for a positive business impact and why?

When considering a Random Forest model for a positive business impact, the following evaluation metrics can be considered:

1. Mean Squared Error (MSE) or Root Mean Squared Error (RMSE):
   - MSE is the average of the squared differences between the predicted and actual values; RMSE is its square root, which is expressed in the same units as the target.
   - Lower MSE or RMSE values indicate better model performance with smaller prediction errors.
   - In a business context, lower MSE or RMSE suggests more accurate predictions, which can lead to improved decision-making, cost reduction, and increased operational efficiency.

2. Mean Absolute Error (MAE):
   - MAE measures the average absolute difference between the predicted and actual values.
   - Similar to MSE and RMSE, lower MAE values indicate better model performance with smaller errors.
   - In a business setting, lower MAE suggests more accurate predictions, which can support decision-making, resource allocation, and cost reduction.

3. R-squared (R2) or Adjusted R-squared (Adjusted R2):
   - R2 measures the proportion of the variance in the target variable explained by the model.
   - Higher R2 or Adjusted R2 values indicate a better fit of the model to the data.
   - In a business context, higher R2 or Adjusted R2 values suggest that the model captures a larger portion of the target variable's variation, providing valuable insights for decision-making and prediction purposes.

When evaluating a Random Forest model, it's important to consider multiple metrics to gain a comprehensive understanding of its performance. MSE, RMSE, and MAE focus on prediction errors, while R2 and Adjusted R2 assess the model's explanatory power. By assessing these metrics, businesses can determine the accuracy, reliability, and predictive power of the Random Forest model, ultimately leading to informed decision-making and positive business impact.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Random Forest model because it achieved the best score among the models built above.
Random Forest is a popular and effective model for prediction due to its high accuracy, robustness to noise and outliers, ability to handle non-linear relationships, built-in feature importance estimates, and scalability. Some implementations can also handle missing values directly.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I am using the Random Forest model and explain its individual predictions with LIME:


 https://lime-ml.readthedocs.io/en/latest/lime.html

In [None]:
# installing lime
!pip install lime -q

In [None]:
import lime
from lime import lime_tabular

In [None]:
# Define the feature names for LIME
feature_names = X_train.columns.tolist()

# Define the LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=feature_names, mode='regression')

# Select a random test instance for explanation
instance_idx = np.random.randint(len(X_test))
instance = X_test.iloc[instance_idx]
true_label = y_test.iloc[instance_idx]

# Generate an explanation using LIME
num_features = 47  # Number of features to include in the explanation
exp = explainer.explain_instance(instance.values, rf_model.predict, num_features=num_features)

# Print the true label and the LIME explanation
print("True Label:", true_label)
print("LIME Explanation:")
exp.show_in_notebook()
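LIME explains one prediction at a time. For a global view, a fitted random forest also exposes impurity-based importances through its `feature_importances_` attribute. The sketch below shows how these could be ranked; the synthetic data and the `feature_i` column names are assumptions for illustration, and in the notebook the same lines would apply to `rf_model` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical column names (assumption for illustration)
X_arr, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances describe the model as a whole, so they complement rather than replace the per-instance LIME explanation.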



## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
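# imports needed by this cell
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error

# drop() returns a new DataFrame; combining the assignment with inplace=True would set bike_df to None
bike_df = bike_df.drop(columns=['Dew_point_temperature'])


X = bike_df.drop(columns=['Rented_Bike_Count'])
y = bike_df['Rented_Bike_Count']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)


# One-hot encode the columns listed in categorical_features and pass the others through unchanged
# (on scikit-learn < 1.2 use sparse=False instead of sparse_output=False)
step1 = ColumnTransformer(
    transformers=[('col_tnf', OneHotEncoder(sparse_output=False, drop='first'), categorical_features)],
    remainder='passthrough'
)
step2 = RandomForestRegressor()
pipe = Pipeline([
    ('step1', step1),
    ('step2', step2)
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)

# Apply the square root transformation to both the true and predicted counts,
# so the scores below are on the square-root scale
y_test_sqrt = np.sqrt(y_test)
y_pred_sqrt = np.sqrt(y_pred)

print('R2 score:', r2_score(y_test_sqrt, y_pred_sqrt))
print('MAE:', mean_absolute_error(y_test_sqrt, y_pred_sqrt))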


In [None]:
# Save the trained pipeline to a file
import joblib

# Assuming 'pipe' is the trained pipeline model
joblib.dump(pipe, 'model_file.pkl')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:

# Load the saved model
loaded_model = joblib.load('model_file.pkl')


In [None]:

# Create a DataFrame for the unseen data
unseen_data = pd.DataFrame({
    'Hour': [23],
    'Temperature': [3.8],
    'Humidity': [83],
    'Wind_speed': [1.1],
    'Visibility': [390],
    'Solar_Radiation': [0.0],
    'Rainfall': [0.4],
    'Snowfall': [0.0],
    'Seasons': ['Autumn'],
    'Holiday': ['No Holiday'],
    'Functioning_Day': ['Yes'],
    'month': [11],
    'weekdays_weekend': [1]
})

# Make predictions on the unseen data
predictions = loaded_model.predict(unseen_data)

# Display the predictions
print(predictions)
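A stricter sanity check than printing one prediction is to confirm that the reloaded model reproduces the predictions of the original object. The following is a minimal sketch of that round trip, assuming a tiny hypothetical DataFrame and a temporary file in place of the bike data and `model_file.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical dataset (values invented for illustration)
toy = pd.DataFrame({
    "Hour": [0, 6, 12, 18, 23, 8],
    "Temperature": [1.0, 3.5, 12.0, 9.0, 3.8, 5.0],
    "Seasons": ["Winter", "Winter", "Summer", "Summer", "Autumn", "Autumn"],
})
target = pd.Series([120, 300, 900, 1100, 200, 650])

# Same structure as the notebook's pipeline: one-hot encoding followed by a random forest
toy_pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Seasons"])],
        remainder="passthrough",
    )),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipe.fit(toy, target)

# Save to a temporary file, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_file.pkl")
    joblib.dump(toy_pipe, path)
    reloaded = joblib.load(path)

same_predictions = np.allclose(toy_pipe.predict(toy), reloaded.predict(toy))
print("Predictions identical after reload:", same_predictions)
```

Saved models should be loaded with the same scikit-learn version that produced them, since the serialized format is not guaranteed to be compatible across versions.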





### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***