<a href="https://colab.research.google.com/github/DhawalKhandait/Bike-Sharing-Demand-Prediction-Regression-/blob/main/Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Seoul Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

#### **1. Data Exploration:**
a. Data Wrangling

#### **2. Exploratory Data Analysis:**

**a. Univariate Analysis-**

i. Dependent Variable

ii. Numerical Variables

**b. Bivariate Analysis-**

i. Line plot numerical variables v/s rented bike count

ii. Spread of numerical variables across hours

iii. Categorical Variables v/s rented bike count

iv. Spread of Rented Bike Count across categorical variables

v. Numerical variable v/s rented bike count

vi. Spread of numerical variable across months


#### **3. Feature Selection -**

#### **4. Supervised Machine Learning Algorithms and Implementation:**

a. Linear Regression

b. Lasso Regression
 
c. Ridge Regression

d. Elastic Net Regression

e. Gradient Descent Regression

f. Random Forest Regression

g. Gradient Boosting Regression

h. XGBoost Regression


#### **5. Model Explainability -**

# **GitHub Link -**

https://github.com/DhawalKhandait/Bike-Sharing-Demand-Prediction-Regression-

# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that a stable supply can be maintained.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score

%matplotlib inline
sns.set_style("whitegrid",{'grid.linestyle':'--'})
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
bike_df=pd.read_csv("/content/drive/MyDrive/Regression/SeoulBikeData.csv",encoding='unicode_escape')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# computing number of rows
rows=len(bike_df.axes[0])

# computing number of columns
columns=len(bike_df.axes[1])

print("Number of rows : ",rows)
print("Number of columns : ",columns)

### Dataset Information

This dataset contains the rented bike counts for the city of Seoul. It records the number of bikes rented per hour together with the weather conditions. The data covers one year, from December 2017 to November 2018.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
bike_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

**Above we can see there are no missing values as well as no duplicate values in the dataset.**

### **In the given dataset there are 14 columns in total, and none of them contains null values.**

### What did you know about your dataset?

In this dataset we have 8760 rows and 14 columns, of which "rented bike count" is our target variable. There are numerical variables, categorical variables and one date variable. The date is stored as an object, so we need to convert its dtype.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns=map(str.lower, bike_df.columns)
bike_df.columns

In [None]:
# Dataset Describe
bike_df.info()

### Variables Description 

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : day/month/year
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert the date column from day/month/year strings to datetime64.
# Parsing the whole column at once replaces the row-by-row helper function.
bike_df['date'] = pd.to_datetime(bike_df['date'], format='%d/%m/%Y')


In [None]:
bike_df.info()

In [None]:
# Extract the day, month and day of the week from the datetime column

bike_df['day'] = bike_df['date'].dt.day
bike_df['month'] = bike_df['date'].dt.month
bike_df['day_of_week'] = bike_df['date'].dt.day_name()

bike_df = bike_df.drop("date", axis=1)

bike_df.head()
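The date handling above can be tried in isolation. The following is a minimal sketch on a small hypothetical sample (the three dates are made up, only the day/month/year format is taken from the notebook); it parses the column in one call and reads the calendar parts through the `.dt` accessor.

```python
import pandas as pd

# Hypothetical sample in the day/month/year format used by the dataset
sample = pd.DataFrame({"date": ["01/12/2017", "15/06/2018", "30/11/2018"]})

# Parse the whole column at once; an explicit format avoids day/month ambiguity
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")

# The .dt accessor exposes calendar parts without a Python-level loop
sample["day"] = sample["date"].dt.day
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.day_name()

print(sample)
```

Passing the format explicitly matters here: without it, a string such as `01/12/2017` could be read as 12 January instead of 1 December.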

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **Univariate Analysis**

### **Dependent Variable**

First we will start with analyzing our target variable which is **rented bike count**.

#### Chart - 1

In [None]:
# Chart - 1 visualization code
dependent_var ="rented bike count"

In [None]:
bike_df[dependent_var].describe()

Now let's see the distribution of the dependent variable 'rented bike count'


In [None]:
# distribution plot
plt.figure(figsize=(9,7))
sns.histplot(bike_df[dependent_var], kde=True, stat="density")  # distplot is deprecated in recent seaborn
plt.title("Distribution Plot")
plt.show()

##### 1. Why did you pick the specific chart?

* A distribution plot shows how the values of the dependent variable (rented bike count) are distributed.
* Above we can see that the dependent variable is right skewed.

##### 2. What is/are the insight(s) found from the chart?

The dependent variable, rented bike count, is skewed towards the right (positively skewed). We therefore apply a transformation and look at the distribution again.

Below are some common transformation techniques for reducing skewness.

<b>Square root for moderate skew:</b>
sqrt(x) for positively skewed data,
sqrt(max(x+1) - x) for negatively skewed data

<b>Log for greater skew:</b>
log10(x) for positively skewed data,
log10(max(x+1) - x) for negatively skewed data

<b>Inverse for severe skew:</b>
1/x for positively skewed data,
1/(max(x+1) - x) for negatively skewed data
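To see what these transformations do, the sketch below measures skewness before and after a square-root and a log transform. It uses synthetic gamma-distributed data as a stand-in for a right-skewed count, not the bike dataset, and the helper `skewness` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed, non-negative data (not the bike dataset)
rng = np.random.default_rng(0)
counts = rng.gamma(shape=1.5, scale=400.0, size=5000)

skew_raw = skewness(counts)
skew_sqrt = skewness(np.sqrt(counts))
skew_log = skewness(np.log10(counts + 1))  # +1 guards against log10(0)

print(f"raw: {skew_raw:.2f}, sqrt: {skew_sqrt:.2f}, log10: {skew_log:.2f}")
```

A value near 0 means the distribution is roughly symmetric. On moderately skewed data a log transform can overcorrect and produce negative skew, which is one reason to compare several candidates rather than pick one by default.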

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


The plot has no direct business impact. It is a modelling insight: the target is right skewed rather than normally distributed, so it is transformed before the regression models are fitted.

In [None]:
# Applying square-root transformation

plt.figure(figsize=(9,8))
sns.histplot(np.sqrt(bike_df[dependent_var]), kde=True, stat="density")  # distplot is deprecated in recent seaborn
plt.title("Distribution Plot - After Transformation")
plt.show()

After the square-root transformation the distribution is much closer to a normal distribution.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Box Plot
plt.figure(figsize=(5,6))
sns.boxplot(y=bike_df[dependent_var])
plt.title("Boxplot")
plt.show()

##### 1. Why did you pick the specific chart?

* The distribution plot showed how the dependent variable is spread, and we came to the conclusion that it is positively skewed.
* The boxplot shows the outliers in the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

From the boxplot, the median rented bike count is near 500 and there are some outliers above the upper whisker. The square-root transformation compresses large values, so it should pull most of these outliers in.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The boxplot has no direct business impact. Its value is for modelling: it shows that the dependent variable has a number of outliers at the upper end.
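The points a boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. A minimal sketch of that rule is shown below; the counts are hypothetical values chosen for illustration, and `iqr_outlier_mask` is a helper name introduced here, not part of the notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical hourly counts with one unusually busy hour
counts = np.array([120, 250, 310, 400, 480, 520, 610, 700, 3500])
mask = iqr_outlier_mask(counts)
print(counts[mask])
```

The same mask can be computed on a transformed column to check how many flagged points remain after the transformation.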

## **Independent Variables**

## **Numerical Variables**

Now let's have a look at the numerical features and plot some graphs to understand them

In [None]:
# numerical variables

numerical_var =list(bike_df.describe().columns[1:])
numerical_var

In [None]:
bike_df[numerical_var].describe().T

In [None]:
# Unique count of numerical variables

lst=[]
for col in numerical_var:
  lst.append(bike_df[col].nunique())

unique_cnt_df=pd.DataFrame(index=numerical_var,columns=["unique_count"])  
unique_cnt_df["unique_count"]=lst
unique_cnt_df

#### Chart - 3

In [None]:
# Chart - 3 visualization code
for col in numerical_var:
  features = bike_df[col]
  sns.histplot(features)
  plt.axvline(features.mean(),color='black',linestyle='dashed',linewidth=2)
  plt.axvline(features.median(),color='red',linestyle='dashed',linewidth=2)
  plt.title(col)
  plt.show()

##### 1. Why did you pick the specific chart?

* A histogram is plotted for each numerical variable.
* A histogram shows the frequency distribution of a variable.

##### 2. What is/are the insight(s) found from the chart?

* The plots show how each numerical variable is distributed around its mean (black dashed line) and median (red dashed line).
* The histogram gives the frequency distribution, while the two reference lines show where the centre of each variable lies.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it helps create a positive business impact: knowing how each variable is distributed guides the choice of transformations and models.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# boxplot for each numerical features

for col in numerical_var:
  fig=plt.figure()
  ax=fig.gca()
  bike_df.boxplot(col,ax=ax)
  ax.set_title(col)
plt.show()  

Variables such as wind speed (m/s), solar radiation (mj/m2), rainfall(mm) and snowfall (cm) have outliers, as seen in the boxplots.

##### 1. Why did you pick the specific chart?

* A boxplot helps detect the outliers present in the dataset or in a particular feature.
* A boxplot shows how strongly a variable is affected by outliers.

##### 2. What is/are the insight(s) found from the chart?


* From the above plots we can conclude that the numerical variables **wind speed, solar radiation, rainfall and snowfall** are affected by outliers.
* In contrast, the numerical variables **hour, temperature, humidity, visibility, dew point temperature, day and month** are not affected by outliers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

## **Categorical Variables**

In [None]:
categorical_var=list(bike_df.select_dtypes(include='object'))
categorical_var

In [None]:
# Season Columns
print(f"Count of distinct categories in season variable: {bike_df['seasons'].nunique()}")
print(list(bike_df["seasons"].unique()))

In [None]:
# Holiday Columns
print(f"Count of distinct categories in holiday variable:{bike_df['holiday'].nunique()}")
print(list(bike_df['holiday'].unique()))

In [None]:
# Functioning day columns 
print(f"Count of distinct categories in functioning day variable:{bike_df['functioning day'].nunique()}")
print(list(bike_df["functioning day"].unique()))

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Count plot for the categorical features

for col in categorical_var:
  plt.figure(figsize=(3,4))
  sns.countplot(data=bike_df,x=col)
  plt.title(col)
  plt.show()

##### 1. Why did you pick the specific chart?

* Bar charts (here, count plots) are useful for comparing the categories of a categorical or discrete variable.

##### 2. What is/are the insight(s) found from the chart?

* Judging by the counts alone, these columns are heavily imbalanced and may not have a large impact.
* There are very few Holiday and non-functioning-day records.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

## **Bivariate Analysis**

## Numerical variables vs rented bike count

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Scatter plot of each numerical_var vs rented bike count, with a fitted linear trend line

for col in numerical_var:
  fig = plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature = bike_df[col]
  label = bike_df['rented bike count']
  plt.scatter(x=feature,y=label)
  plt.xlabel(col)
  plt.ylabel('rented bike count')
  ax.set_title('rented bike count vs ' + col)

  z=np.polyfit(bike_df[col],bike_df['rented bike count'], 1)
  y_hat = np.poly1d(z)(bike_df[col])

  plt.plot(bike_df[col],y_hat,"r--",lw=1)

plt.show()  

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two variables, i.e. how much one changes with the other, by drawing each observation as a dot in two dimensions.

##### 2. What is/are the insight(s) found from the chart?

* From the above we can conclude that the dependent variable **rented bike count** is correlated with **temperature, humidity, windspeed, visibility, dew point temperature, rainfall and snowfall.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

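The features that are negatively related to demand (for example humidity, rainfall and snowfall) point to conditions under which fewer bikes are rented, which is where negative growth can be expected.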

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Boxplot of rented bike count grouped by each numerical_var

for col in numerical_var:
  fig = plt.figure(figsize=(9,6))
  ax=fig.gca()
  bike_df.boxplot(column='rented bike count',by=col,ax=ax)
  ax.set_title('Label by ' + col)
  ax.set_ylabel("rented bike count")
plt.show()  

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Lineplot numerical_var vs rented bike count

fig,ax=plt.subplots(3,2,figsize=(15,9))

# Select the target column before aggregating, so that non-numeric columns are never averaged
bike_df.groupby('temperature(°c)')['rented bike count'].mean().plot(ax=ax[0][0])

bike_df.groupby('humidity(%)')['rented bike count'].mean().plot(ax=ax[0][1])

bike_df.groupby('wind speed (m/s)')['rented bike count'].mean().plot(ax=ax[1][0])

bike_df.groupby('solar radiation (mj/m2)')['rented bike count'].mean().plot(ax=ax[1][1])

bike_df.groupby('rainfall(mm)')['rented bike count'].mean().plot(ax=ax[2][0])

bike_df.groupby('snowfall (cm)')['rented bike count'].mean().plot(ax=ax[2][1])

plt.show()

##### 1. Why did you pick the specific chart?

* We use line plots to see how the rented bike count changes with each numerical variable.
* That is, to see how the count of rented bikes is affected by temperature, humidity, solar radiation, rainfall, wind speed and snowfall.

##### 2. What is/are the insight(s) found from the chart?

* When the temperature is higher, the rental bike count is also higher.
* As humidity increases, the demand for rental bikes decreases.
* Wind speed and solar radiation do not have much impact on the bike count.
* When rainfall exceeds 10 mm the demand for bikes decreases, but above 20 mm there is a large peak. This could be an outlier, or heavy rainfall in summer when demand is high anyway.
* As snowfall increases, the rented bike count decreases.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Spread of numerical variables across hours

fig, ax = plt.subplots(3,2,figsize=(15,9))

sns.lineplot(x='hour', y='rented bike count', data=bike_df, color='Navy', ax=ax[0][0])

sns.lineplot(x='hour', y='temperature(°c)', data=bike_df, color='Navy', ax=ax[0][1])

sns.lineplot(x='hour', y='humidity(%)', data=bike_df, color='Navy', ax=ax[1][0])

sns.lineplot(x='hour', y='wind speed (m/s)', data=bike_df, color='Navy', ax=ax[1][1])

sns.lineplot(x='hour', y='visibility (10m)', data=bike_df, color='Navy', ax=ax[2][0])

sns.lineplot(x='hour', y='solar radiation (mj/m2)', data=bike_df, color='Navy', ax=ax[2][1])

plt.show()

##### 1. Why did you pick the specific chart?

A line plot shows the continuous trend of the data, i.e. how it changes over time.


##### 2. What is/are the insight(s) found from the chart?

* Demand for rental bikes rises from the start of the day, reaches its highest peak in the evening and decreases afterwards.
* Demand peaks at 8 am and 6 pm, so it is highest around office opening and closing times.
* Temperature, wind speed and solar radiation also rise during the day and peak in the afternoon.
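The peak hours can also be read off numerically instead of from the plot. Below is a small sketch on hypothetical two-day data (the counts are invented to mimic commuter peaks); it averages the count per hour and picks the two busiest hours.

```python
import pandas as pd

# Hypothetical hourly counts for two days (illustrative values only)
day1 = {h: 100 for h in range(24)}
day1.update({8: 900, 18: 1400})
day2 = {h: 120 for h in range(24)}
day2.update({8: 1000, 18: 1500})

sample = pd.DataFrame(
    [(h, c) for day in (day1, day2) for h, c in day.items()],
    columns=["hour", "rented bike count"],
)

# Average demand per hour, then the two busiest hours
hourly_mean = sample.groupby("hour")["rented bike count"].mean()
peak_hours = hourly_mean.nlargest(2).index.tolist()
print(peak_hours)
```

Applied to `bike_df`, the same `groupby("hour")` pattern would give the hourly averages that the first line plot draws.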

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the gained insights help create a positive business impact.
* Bike sharing follows an hourly pattern, so this graph helps us understand how the dependent variable, the demand for bikes, changes from hour to hour.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Spread of numerical_var across months

fig, ax = plt.subplots(3,2,figsize=(15,9))

sns.lineplot(x='month', y='temperature(°c)', data=bike_df, color='#E56124', ax=ax[0][0])

sns.lineplot(x='month', y='humidity(%)', data=bike_df, color='#E56124', ax=ax[0][1])

sns.lineplot(x='month', y='wind speed (m/s)', data=bike_df, color='#E56124', ax=ax[1][0])

sns.lineplot(x='month', y='visibility (10m)', data=bike_df, color='#E56124', ax=ax[1][1])

sns.lineplot(x='month', y='rainfall(mm)', data=bike_df, color='#E56124', ax=ax[2][0])

sns.lineplot(x='month', y='snowfall (cm)', data=bike_df, color='#E56124', ax=ax[2][1])

plt.show()


##### 1. Why did you pick the specific chart?

A line plot shows the continuous trend of the data, i.e. how it changes over time.

##### 2. What is/are the insight(s) found from the chart?

The six plots above show the month-wise weather trend over the one-year period:
* 1) Temperature vs month - temperature is highest between months 6 and 8.
* 2) Humidity vs month - humidity is highest when temperature rises and rainfall occurs.
* 3) Wind speed vs month - wind speed is highest in months 1 to 4.
* 4) Visibility vs month - visibility is highest in summer and lowest in months 1 to 5.
* 5) Rainfall vs month - rainfall is highest in the monsoon season.
* 6) Snowfall vs month - snowfall is highest in winter.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the plots above support a positive business impact, because bike sharing demand also depends on weather conditions that change with the seasons.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Pie chart Categorical_var vs rented bike count

fig, ax = plt.subplots(2,2,figsize=(12,9))

# Select the target column before summing, so that only the bike counts are aggregated
bike_df.groupby('seasons')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', shadow=True,ax= ax[0][0])
ax[0][0].set_title("seasons")

bike_df.groupby('holiday')['rented bike count'].sum().plot.pie(autopct='%1.1f%%',ax= ax[0][1])
ax[0][1].set_title("holiday")

bike_df.groupby('functioning day')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', ax= ax[1][0])
ax[1][0].set_title("functioning day")

bike_df.groupby('day_of_week')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', ax= ax[1][1])
ax[1][1].set_title("day of week")
ax[1][1].set_title("day of week")

plt.show()

##### 1. Why did you pick the specific chart?


* A pie chart shows each category as a percentage of the whole.
* It makes proportions easy to compare.

##### 2. What is/are the insight(s) found from the chart?

* Autumn, Spring and Summer are the three seasons with the highest demand for rented bikes.
* Approximately 97% of rentals fall on working (non-holiday) days. This suggests that people mainly use the service to commute to the office and similar places, while on holidays they tend to stay at home or use their own vehicles. Part of this share also reflects that there are far fewer holidays than working days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Boxplot of categorical_var vs rented bike count

for col in categorical_var:
  fig = plt.figure(figsize=(8,6))
  ax = fig.gca()
  bike_df.boxplot(column= 'rented bike count', by = col, ax = ax)
  ax.set_ylabel("rented bike count")
plt.show()

##### 1. Why did you pick the specific chart?

* A boxplot helps detect outliers.
* It also shows at a glance where the majority of the points lie.

##### 2. What is/are the insight(s) found from the chart?

* In summer the demand for rented bikes is high, in line with the higher temperature and solar radiation in that season.
* Rented bike counts are lower on holidays than on working days; recall that there are also far fewer holiday records.
* There is almost no demand on non-functioning days.
* The demand for rental bikes decreases slightly on weekend days, i.e. Saturday and Sunday.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights help create a positive business impact.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# spread of rented bike count across categorical_var
fig, ax = plt.subplots(2,2,figsize=(12,9))

sns.barplot(x= 'seasons', y= 'rented bike count', data= bike_df, ax= ax[0][0])

sns.pointplot(x= 'month', y= 'rented bike count', hue= 'seasons',
              data= bike_df, ax= ax[0][1])

sns.lineplot(x= 'hour', y= 'rented bike count', hue= 'holiday',
             ci=None, data= bike_df, ax= ax[1][0])

sns.barplot(x= 'day_of_week', y= 'rented bike count', data= bike_df, ax= ax[1][1])

plt.show()

##### 1. Why did you pick the specific chart?

The above plots show how the dependent variable, rented bike count, varies with the independent variables.

##### 2. What is/are the insight(s) found from the chart?

* Demand for bike rentals is highest in summer and lowest in winter.
* Demand is high in June and August and low in December, January and February, i.e. the winter season.
* Non-holidays have comparatively higher demand for rented bikes than holidays.
* Demand for rented bikes is high on office days and decreases slightly on Sunday.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights help create a positive business impact.

# **Feature Selection**

## **Correlation**

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr_df = bike_df.corr(numeric_only=True)  # skip the string columns, which newer pandas no longer drops silently
plt.figure(figsize=(12,10))
sns.heatmap(corr_df, annot =True, cmap ="crest" )
plt.show()

##### 1. Why did you pick the specific chart?

* A correlation heatmap shows how strongly each pair of variables is related.

##### 2. What is/are the insight(s) found from the chart?

The most correlated features to the rented bike count are:
* hour
* temperature(°c)
* dew point temperature(°c)
* solar radiation (mj/m2)

There is a high correlation between dew point temperature(°c) and temperature(°c).
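The two readings taken from the heatmap, feature-to-target correlation and feature-to-feature correlation, can be extracted directly from the correlation matrix. The sketch below uses synthetic data with made-up column names and relationships, purely to show the pattern.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on temperature, and dew point tracks temperature
rng = np.random.default_rng(42)
n = 500
temperature = rng.normal(15, 10, n)
sample = pd.DataFrame({
    "rented bike count": 30 * temperature + rng.normal(0, 100, n),
    "temperature": temperature,
    "dew point": temperature - rng.uniform(2, 6, n),
    "noise": rng.normal(0, 1, n),
    "seasons": rng.choice(["Winter", "Summer"], n),  # non-numeric column
})

# numeric_only=True skips the string column instead of raising an error
corr = sample.corr(numeric_only=True)

# Correlation of every numeric feature with the target, strongest first
target_corr = corr["rented bike count"].drop("rented bike count")
print(target_corr.abs().sort_values(ascending=False))

# Correlation between two features: a high value hints at multicollinearity
print(corr.loc["temperature", "dew point"])
```

A feature that correlates strongly with the target is useful for prediction, whereas two features that correlate strongly with each other carry overlapping information.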

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


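The heatmap shows that hour, temperature, dew point temperature and solar radiation are the features most correlated with the rented bike count, which is useful for prediction. The risk lies in temperature and dew point temperature being highly correlated with each other: keeping both introduces multicollinearity, which can make the coefficients of linear models unstable and hurt both the predictions and the business decisions based on them.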

# **Detecting Multicollinearity using VIF**


In [None]:
# Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(bike_df[numerical_var])

##### 1. Why did you pick the specific chart?

* VIF starts at 1 and has no upper limit
* VIF = 1, no correlation between the independent variable and the other variables
* VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others
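The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below implements this definition with NumPy on synthetic data; it is not the statsmodels implementation, and the function name `vif` and the three columns are assumptions made for illustration. As far as we know, `variance_inflation_factor` regresses on exactly the columns it receives, so its numbers can differ from this sketch unless a constant column is included.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: dew point is almost a shifted copy of temperature
rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(15, 10, n)
dew = temp - 4 + rng.normal(0, 1, n)
humidity = rng.normal(60, 15, n)

vifs = vif(np.column_stack([temp, dew, humidity]))
print(dict(zip(["temp", "dew", "humidity"], vifs.round(1))))
```

The two nearly duplicate columns get a large VIF while the independent one stays close to 1, which is the pattern that motivates dropping one feature of a correlated pair.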

##### 2. What is/are the insight(s) found from the chart?

* We can see here that 'dew point temperature(°c)', 'temperature(°c)' have a high VIF value, meaning they can be predicted by other independent variables in the dataset. These two variables are highly correlated.

* Dropping one of the correlated features will help in bringing down the multicollinearity between correlated features.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

In [None]:
# dropping 'dew point temperature(°c)', 'day', 'month'

calc_vif(bike_df[[ i for i in numerical_var if i not in ['dew point temperature(°c)', 'day', 'month']]])

* After dropping 'dew point temperature(°c)', 'day' and 'month', the VIF values of all remaining features fall below 5, which is acceptable for building a regression model.

In [None]:
# dropping 'dew point temperature(°c)', 'day', 'month' from original dataset
data= bike_df.drop(['dew point temperature(°c)', 'day', 'month'], axis=1)

In [None]:
# Correlation Heatmap after reducing the multicollinearity
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(numeric_only=True), annot = True, cmap = 'coolwarm')
plt.show()

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
data.isna().sum()

### 2. Categorical Encoding

In [None]:
# creating column of weekend by replacing the days with 1 and 0
data['weekend'] = data['day_of_week'].apply(lambda x : 1 if x == 'Sunday' or x == 'Saturday' else 0)
data.drop('day_of_week', axis = 1, inplace= True)

In [None]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
label_en = LabelEncoder()

data[['seasons', 'holiday', 'functioning day']] = data[['seasons', 'holiday', 'functioning day']].apply(label_en.fit_transform)

#### What all categorical encoding techniques have you used & why did you use those techniques?

Above we performed label encoding to convert the categorical variables into numbers, which makes them usable by the various models.
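Label encoding assigns integers in alphabetical order of the categories (for seasons: Autumn, Spring, Summer, Winter), and a linear model reads those integers as an ordered quantity. One commonly used alternative, shown here only as a sketch and not as the approach taken in this notebook, is one-hot encoding with `pd.get_dummies`. The sample values are hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({"seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# One indicator column per category; drop_first removes the redundant first level
one_hot = pd.get_dummies(sample, columns=["seasons"], drop_first=True, dtype=int)
print(one_hot)
```

With `drop_first=True` the dropped category (here Autumn) becomes the baseline, represented by zeros in all remaining indicator columns. Tree-based models are usually less sensitive to this choice than linear models.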

In [None]:
data.head()

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

There is no need to manipulate the features further: the feature with high multicollinearity was already dropped during the VIF analysis, and some new features were created during data exploration.


### 4. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

The data was already transformed during data exploration, i.e. while cleaning and wrangling, and the square-root transformation of the target is applied when the data is split below. No further transformation is needed.


### 5. Data Scaling

No separate scaling step is applied before modelling; the features are used on their original scales.
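Regularised models such as Lasso and Ridge penalise coefficient size, so they can be sensitive to the scale of the features. Should scaling be needed, one common option is standardisation fitted on the training split only. The sketch below illustrates this with NumPy on hypothetical features; the array names carry a `_demo` suffix to make clear they are not the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical features on very different scales (visibility-like and rainfall-like)
X_train_demo = np.column_stack([rng.uniform(0, 2000, 200), rng.uniform(0, 5, 200)])
X_test_demo = np.column_stack([rng.uniform(0, 2000, 50), rng.uniform(0, 5, 50)])

# Fit the scaling statistics on the training split only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Reusing the training mean and standard deviation on the test split keeps information from the test data out of the fitted preprocessing.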

### 6. Data Splitting
* We first separate the data into X and y, i.e. the independent variables and the dependent variable. The target is square-root transformed, as discussed in the univariate analysis.

In [None]:
X = data.iloc[:,1:]
y = np.sqrt(data.iloc[:,0])

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# train test split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)

#### What data splitting ratio have you used and why? 
Answer - The data is split in a 70:30 ratio, i.e. 70% of the data is used for training and 30% for testing.

In [None]:
print(X_train.shape,y_train.shape)

In [None]:
print(X_test.shape,y_test.shape)

### 7. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, the dataset is not imbalanced.

## ***7. ML Model Implementation***

### ML Model - 1

# **Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# create an instance of Linear Regression
linear_reg = LinearRegression()

# fitting the Linear regression model
linear_reg.fit(X_train, y_train)


In [None]:
# model score
linear_reg.score(X_train,y_train)

In [None]:
linear_reg.coef_

In [None]:
linear_reg.intercept_

In [None]:
# Prediction on train and test data

train_pred_lr=linear_reg.predict(X_train)
test_pred_lr=linear_reg.predict(X_test)

## **Evaluation Metrics**

### ***Create the function for all the possible evaluation metrics***

In [None]:
# Create train and test result dictionaries
train_result = {}
test_result = {}

def evaluation_metrics(y_true, y_pred, model=None, train=True):

   ''' takes actual target values and estimated target values as input,
      prints the evaluation metrics and stores them in the result dictionaries '''

   MSE = mean_squared_error(y_true, y_pred)
   print("MSE :", MSE)
   RMSE = np.sqrt(MSE)
   print("RMSE :", RMSE)
   MAE = mean_absolute_error(y_true, y_pred)
   print("MAE :", MAE)
   R2_score = r2_score(y_true, y_pred)
   print("R2_score :", R2_score)

   # adjusted R2 uses the size of the split being evaluated (n), not always the test set,
   # and the number of features (p)
   n = len(y_true)
   p = X_train.shape[1]
   Adj_r2_score = 1-(1-R2_score)*((n-1)/(n-p-1))
   print("Adjusted R2_score :", Adj_r2_score)
   
   # storing the result in the matching dictionary

   if train:
     train_result[model] = [MSE,RMSE,MAE,R2_score,Adj_r2_score]
   else:
     test_result[model] =  [MSE,RMSE,MAE,R2_score,Adj_r2_score]
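Adjusted R² corrects R² for the number of predictors: 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples in the split being evaluated and p is the number of features. The following is a minimal NumPy sketch of that formula on made-up numbers; `adjusted_r2` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    n = len(y_true)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

# Made-up targets and predictions, assuming a model with two features
y_true = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
y_pred = np.array([11.0, 12.0, 14.0, 19.0, 21.0, 24.0])

r2_plain = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
r2_adj = adjusted_r2(y_true, y_pred, n_features=2)
print(r2_plain, r2_adj)
```

Because n comes from the arrays being scored, the same function gives a consistent value for the training split and for the test split.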


#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# train data Evaluation metrics
evaluation_metrics(y_train, train_pred_lr, model ='Linear', train= True)

In [None]:
# test data Evaluation metrics
evaluation_metrics(y_test, test_pred_lr, model = 'Linear', train = False)

In [None]:
# line graph of actual and predicted values
plt.figure(figsize=(15,8))
plt.plot(test_pred_lr[:100])
plt.plot(np.array(y_test)[:100])
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroskedasticity 
plt.figure(figsize=(10,6))
plt.axhline(y =0, color = 'r', linestyle= '--')
plt.scatter(test_pred_lr, y_test - test_pred_lr)
plt.xlabel('predicted bike count')
plt.ylabel('residual');
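
A residual plot like the one above is read by eye: a funnel shape suggests that the residual spread changes with the predicted value. As a rough numeric complement, the sketch below compares the residual standard deviation in the lower and upper halves of the predictions. It runs on synthetic data whose noise grows with the prediction; the names `pred_demo`, `resid_demo`, and `spread_ratio` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictions and residuals whose noise scale grows with the prediction
pred_demo = rng.uniform(1, 10, size=2000)
resid_demo = rng.normal(0, 1, size=2000) * pred_demo

# Split at the median prediction and compare the residual spread in each half
median_pred = np.median(pred_demo)
low_std = resid_demo[pred_demo <= median_pred].std()
high_std = resid_demo[pred_demo > median_pred].std()

# A ratio well above 1 hints at heteroskedasticity
spread_ratio = high_std / low_std
print("high/low residual std ratio:", spread_ratio)
```

On the notebook's own data the same comparison could be run with `test_pred_lr` and `y_test - test_pred_lr` in place of the synthetic arrays.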

### ML Model - 2

# **Lasso Regression**

#### Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso()

# Grid search over alpha with 5-fold cross-validation
parameter = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_reg = GridSearchCV(lasso, parameter, cv=5)
lasso_reg.fit(X_train, y_train)

In [None]:
print("The best fit alpha value is found to be :", lasso_reg.best_params_)
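
Since `GridSearchCV` refits the best model on the full training data by default, the fitted estimator is available as `best_estimator_`, and its coefficients show which features Lasso shrank to exactly zero. A minimal sketch on synthetic data follows; the data from `make_regression`, the shortened alpha grid, and the names `grid_demo`, `best_lasso`, and `n_zero` are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: 10 features, only 4 of them informative
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=10.0, random_state=0)

# Shortened alpha grid for illustration; scoring defaults to the estimator's R2
grid_demo = GridSearchCV(Lasso(), {"alpha": [1e-3, 1e-2, 1e-1, 1, 5, 10]}, cv=5)
grid_demo.fit(X_demo, y_demo)

best_lasso = grid_demo.best_estimator_       # refit on all of X_demo with the best alpha
n_zero = int(np.sum(best_lasso.coef_ == 0))  # coefficients driven exactly to zero

print("best alpha:", grid_demo.best_params_["alpha"])
print("zeroed coefficients:", n_zero, "of", best_lasso.coef_.size)
```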

In [None]:
# Prediction on train and test datasets

train_pred_lasso=lasso_reg.predict(X_train)
test_pred_lasso=lasso_reg.predict(X_test)

## **Evaluation Metrics for Lasso Regression**

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Train data Evaluation Metrics
evaluation_metrics(y_train,train_pred_lasso,model='Lasso',train=True)

In [None]:
# Test data Evaluation Metrics
evaluation_metrics(y_test,test_pred_lasso,model='Lasso',train=False)

In [None]:
# line graph of actual and predicted values for Lasso Regression

plt.figure(figsize=(15,8))
plt.plot(test_pred_lasso[:100])
plt.plot(np.array(y_test)[:100])
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroskedasticity
plt.figure(figsize=(10,6))
plt.axhline(y=0,color='r',linestyle='--')
plt.scatter(test_pred_lasso,y_test - test_pred_lasso)
plt.xlabel('predicted bike count')
plt.ylabel('residual');

### ML Model - 3

## **Ridge Regression**

#### Cross-Validation & Hyperparameter Tuning


In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge()

# Cross-Validation
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
ridge_reg = GridSearchCV(ridge, parameters, cv=5)
ridge_reg.fit(X_train, y_train)

In [None]:
print("The best fit alpha value is found to be :", ridge_reg.best_params_)
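
Unlike Lasso, Ridge shrinks coefficients toward zero without setting them exactly to zero, and a larger `alpha` means stronger shrinkage. The sketch below illustrates this on synthetic data; all names and data in it are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (not the bike dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Fit Ridge at increasing penalty strengths and record the coefficient norm
alphas_demo = [0.01, 1, 100, 10000]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas_demo]

for a, norm in zip(alphas_demo, coef_norms):
    print(f"alpha={a:>8}: ||coef|| = {norm:.3f}")
```

The printed norms decrease as `alpha` grows, which is the trade-off the grid search is tuning: less variance in the coefficients at the cost of more bias.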

In [None]:
# Prediction on train and test dataset

train_pred_ridge=ridge_reg.predict(X_train)
test_pred_ridge=ridge_reg.predict(X_test)

## **Evaluation Metrics for Ridge Regression**

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metrics for train dataset(train_pred_ridge)
evaluation_metrics(y_train,train_pred_ridge,model='Ridge',train=True)

In [None]:
# Evaluation metrics for test dataset(test_pred_ridge)
evaluation_metrics(y_test,test_pred_ridge,model='Ridge',train=False)

In [None]:
# line graph of actual and predicted values
plt.figure(figsize=(12,6))
plt.plot(test_pred_ridge[:100])
plt.plot(np.array(y_test)[:100])
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroskedasticity
plt.figure(figsize=(12,6))
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.scatter(test_pred_ridge, y_test- test_pred_ridge)
plt.xlabel('predicted bike count')
plt.ylabel('residuals');

### ML Model - 4

## **Elastic Net Regression**

#### Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import ElasticNet
elasticnet_reg = ElasticNet()

# parameters 
en_params = {'alpha': [1e-15,1e-10,1e-5,1e-3,1e-2,1e-1,1,5,10,20,30,40,50,100],
             'l1_ratio' : [0.1,0.2,0.3,0.4,0.5]}

en_grid = GridSearchCV(elasticnet_reg, en_params, cv= 5)
en_grid.fit(X_train,y_train)

In [None]:
print("The best fit alpha and l1_ratio values are found to be :", en_grid.best_params_)
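
For reference, scikit-learn's `ElasticNet` minimizes the objective below, where $n$ is the number of training samples, $w$ the coefficient vector, $\alpha$ is `alpha`, and $\rho$ is `l1_ratio` (the symbols are chosen here for illustration):

$$
\min_{w} \; \frac{1}{2n}\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Setting $\rho = 1$ gives the Lasso penalty and $\rho = 0$ a pure L2 (Ridge-style) penalty. The `l1_ratio` grid used here stops at 0.5, so the search never moves closer to Lasso than to Ridge.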

In [None]:
# prediction on train and test dataset
train_pred_en = en_grid.predict(X_train)
test_pred_en = en_grid.predict(X_test)

## **Evaluation Metrics for Elastic Net Regression**

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Train dataset evaluation metrics for Elastic Net
evaluation_metrics(y_train, train_pred_en, model='ElasticNet', train=True)

In [None]:
# Test dataset evaluation metrics for Elastic Net
evaluation_metrics(y_test, test_pred_en, model='ElasticNet', train=False)

In [None]:
# line graph of actual and predicted values
plt.figure(figsize=(12,6))
plt.plot(test_pred_en[:100])
plt.plot(np.array(y_test)[:100])
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroskedasticity
plt.figure(figsize=(12,6))
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.scatter(test_pred_en, y_test- test_pred_en)
plt.xlabel('predicted bike count')
plt.ylabel('residuals');
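
The `train_result` and `test_result` dictionaries filled by `evaluation_metrics` map each model name to `[MSE, RMSE, MAE, R2, adjusted R2]`, so they can be turned into a side-by-side comparison table. A minimal sketch using pandas follows. The numbers are placeholders, not results from this notebook, and `demo_result`, `metric_names`, and `comparison` are assumed names.

```python
import pandas as pd

# Placeholder values in the same layout as train_result / test_result
demo_result = {
    "Linear": [4.0, 2.0, 1.5, 0.70, 0.69],
    "Lasso": [6.25, 2.5, 1.8, 0.65, 0.64],
}

metric_names = ["MSE", "RMSE", "MAE", "R2", "Adjusted R2"]

# One row per model, one column per metric
comparison = pd.DataFrame.from_dict(demo_result, orient="index", columns=metric_names)
print(comparison)
```

Passing `test_result` instead of `demo_result` would give the test-set table for the four linear models fitted above.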

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
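
Both steps can be sketched with `joblib`. The example below uses a small stand-in model on synthetic data and a temporary file path. Which model is actually best, and where it should be stored, is left open here, so `model_demo`, `model_path`, and `X_unseen` are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on synthetic data (placeholder for the best-performing model)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0
model_demo = LinearRegression().fit(X_demo, y_demo)

# 1. Save the fitted model to disk
model_path = os.path.join(tempfile.mkdtemp(), "model_demo.joblib")
joblib.dump(model_demo, model_path)

# 2. Load it back and predict on rows the model has not seen
loaded_model = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 3))
print(loaded_model.predict(X_unseen))
```

If one of the `GridSearchCV` objects is the best performer, saving its `best_estimator_` keeps only the refitted model rather than the whole search.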

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***