<a href="https://colab.research.google.com/github/Nakulcj7/bike/blob/main/Bike_sharing_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Seoul Bike Sharing Demand Prediction



##### **Project Type**    -Regression
##### **Contribution**    - Individual


# **Project Summary -**

Bike sharing systems have gained widespread popularity in urban environments, offering a sustainable and efficient mode of transportation. This project focuses on developing a predictive model for bike sharing demand, leveraging historical data, weather conditions, and other relevant factors. The primary goal is to create a robust and accurate prediction system to optimize bike allocation and enhance user experience. The dataset has 8760 records and 14 attributes. It contains information on Seoul's weather conditions (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented in each hour, and the date.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Rental bikes need to be available and accessible to the public at the right time so that waiting time is kept short. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The main focus here is to predict the bike count required at each hour for a stable supply of rental bikes.


The major objective is to predict the number of rental bikes required in each hour of the day and to identify the features that influence the hourly demand for rental bikes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#loading the dataset
bd_df=pd.read_csv("/content/drive/MyDrive/Almabetter/SeoulBikeData.csv", encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
bd_df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bd_df.shape

In [None]:
print(f'number of rows : {bd_df.shape[0]}  \nnumber of columns : {bd_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
bd_df.info()

In [None]:
#viewing the statistical summary of the data
bd_df.describe(include='all').T

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
bd_df.duplicated().sum()

This shows that there are no duplicate entries in the dataset.


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bd_df.isnull().sum()

The dataset contains no null values, so no imputation is required.


### What did you know about your dataset?

The dataset provided contains 14 columns and 8760 rows and does not have any missing or duplicate values.
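
The duplicate and null checks above can be bundled into one helper so that they are easy to rerun after any wrangling step. The sketch below is only an illustration: `summarize_quality` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts plus duplicate-row and null-cell totals."""
    return {
        "rows": int(df.shape[0]),
        "columns": int(df.shape[1]),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_cells": int(df.isnull().sum().sum()),
    }

# Hypothetical toy frame standing in for bd_df
toy = pd.DataFrame({
    "Hour": [0, 1, 1, 2],
    "Rented Bike Count": [254, 204, 204, None],
})
report = summarize_quality(toy)
print(report)
```

On the real data, the same call would be `summarize_quality(bd_df)`.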

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bd_df.columns

In [None]:
# Dataset Describe
bd_df.describe(include='all').T


### Variables Description





This dataset contains information on Seoul's weather conditions (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented in each hour, and the date.

Attribute Information:


*   Date - The date of each observation, stored in the raw file as 'day/month/year'

*   Rented Bike count - Count of bikes rented at each hour

*   Hour - Hour of the day

*   Temperature - Temperature recorded in the city in Celsius (°C).

*   Humidity - Relative humidity in %

*   Windspeed - Speed of the wind in m/s


*   Visibility - Distance at which an object or light can be clearly discerned, in units of 10 m
*   Dew point temperature - Temperature at which the air becomes saturated with water vapour, in Celsius (°C).


*   Solar radiation - Intensity of sunlight in MJ/m^2


*   Rainfall - Amount of rainfall received in mm


*   Snowfall - Amount of snowfall received in cm


*   Seasons - Season of the year (Winter, Spring, Summer, Autumn)


*   Holiday - Whether the day is a Holiday or not (Holiday/No holiday)


*   Functioning Day - Whether the rental service is available (Yes - functional hours) or not (No - non-functional hours)



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bd_df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#Converting the datatype of Date column to datetime (the raw values are in day/month/year order)
bd_df['Date'] = pd.to_datetime(bd_df['Date'], format='%d/%m/%Y')

#Extracting Month, Weekday, Year and Day from the date column
bd_df['Month']=bd_df['Date'].dt.month
bd_df['Days_of_week']=bd_df['Date'].dt.day_name()
bd_df['Year']=bd_df['Date'].dt.year
bd_df['Day']=bd_df['Date'].dt.day


In [None]:
#The number of unique values in Date column
bd_df['Date'].nunique()

The dataset contains hourly records of rented bikes for a period of 365 days.

In [None]:
#The number of unique values in Year column
bd_df['Year'].value_counts()


Most of the records are from the year 2018

In [None]:
#Finding the date of first and last entry in the dataset
print(f"The dataset contains observations from {bd_df['Date'].min().date()} to {bd_df['Date'].max().date()}")


In [None]:
#Creating a column which specifies  if the day is a Weekend('Y')or not ('N')
bd_df['Weekend']=bd_df['Days_of_week'].apply(lambda x : ('Y') if x in ['Saturday','Sunday'] else ('N'))


In [None]:
#Displaying the unique values in the categorical columns
categorical_columns=['Seasons','Holiday', 'Functioning Day','Days_of_week','Weekend']

for col in categorical_columns:
  print(f'The unique values in the column {col} are {bd_df[col].unique()}')
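
The date handling in this section depends on reading the raw strings as day/month/year. As a minimal sketch, the example below parses a few made-up date strings with the same `format='%d/%m/%Y'` and derives the same kind of features. The sample values and the vectorised `dayofweek >= 5` weekend test are assumptions for illustration, offered as an alternative to the `apply` with a lambda used above.

```python
import pandas as pd

# Hypothetical sample values in day/month/year order
raw = pd.Series(["01/12/2017", "02/12/2017", "30/11/2018"])

# An explicit format avoids any month/day ambiguity
dates = pd.to_datetime(raw, format="%d/%m/%Y")

features = pd.DataFrame({
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "Days_of_week": dates.dt.day_name(),
    # dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
    "Weekend": (dates.dt.dayofweek >= 5).map({True: "Y", False: "N"}),
})
print(features)
```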

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate Analysis**

Let's see how some of the important numerical independent features are distributed in our data.

In [None]:
fig = plt.figure(figsize=(12,15))
c=1
lis=['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Dew point temperature(°C)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
for i in lis:
  plt.subplot(3,3, c)
  sns.histplot(bd_df[i],kde=True)
  plt.title('Distribution of {}'.format(i))
  c+=1
plt.tight_layout()



*   The distributions of Temperature, Humidity, and Dew point temperature are approximately normal.

*   Wind speed, Solar Radiation, Rainfall, and Snowfall are positively skewed.
*   Visibility is negatively skewed.
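
The skewness read off the histograms can also be checked numerically with `Series.skew()`: a positive value indicates a longer right tail and a negative value a longer left tail. The sketch below uses small made-up samples, not values from the dataset, purely to show the sign convention.

```python
import pandas as pd

# Hypothetical toy samples, not values from the dataset
toy = pd.DataFrame({
    "right_tailed": [0, 0, 0, 0, 1, 2, 10],                     # e.g. rainfall-like
    "left_tailed": [2000, 2000, 2000, 1900, 1500, 800, 100],    # e.g. visibility-like
})

skewness = toy.skew()
print(skewness)
```

On the real data, `bd_df[lis].skew()` would give one value per numerical feature.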





Let's see how the dependent variable, Rented Bike Count, is distributed.

In [None]:
sns.displot(bd_df['Rented Bike Count'],kde=True,color='black')
plt.title('Distribution of Rented Bike Count')

In [None]:
#Checking for outliers

fig = plt.figure(figsize=(8,25))
c=1
for i in lis :
    #one row per feature in lis
    plt.subplot(len(lis),1, c)
    sns.boxplot(x=i,data=bd_df)
    #label set after plotting so that seaborn does not overwrite it
    plt.xlabel('Distribution of {}'.format(i))
    c = c + 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)


The outlier values are neither extreme nor unusual, so we retain them in the dataset.
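
Boxplots flag points that lie more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch of that rule, the hypothetical helper `iqr_outlier_mask` below marks such points so that they can be counted and inspected before deciding whether to keep them. The wind-speed values are invented for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical wind-speed readings in m/s
wind = pd.Series([0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 7.4])
mask = iqr_outlier_mask(wind)
print(wind[mask])
```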

In [None]:
#The number of records belonging to each category
for col in categorical_columns:
 print('Column :',col)
 print(bd_df[col].value_counts(),'\n')

# Basic Conclusions from Univariate Analysis


*   The number of records is similar across the seasons (this needs further analysis for a better understanding).

*   There are more records for non-holidays (working days) and for functioning days of the rental service.

*   There are fewer records for weekends than for weekdays.

*   Hour does not provide much information at the moment.
*   The temperature is mostly above 0°C, so for now we consider Seoul to be on the warmer side.


*   Humidity is mostly moderate.


*   Wind speed is not extreme.


*   Most rainfall values are below 4 mm.

*   Snowfall is mostly 0-1 cm and not extreme in most cases.





# Bivariate Analysis

## Correlation Heatmap

In [None]:
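#Correlation heatmap of numerical features in the dataset
#numeric_only=True skips the date and categorical columns, which newer pandas versions refuse to correlate
plt.figure(figsize = (10,10))
sns.heatmap(bd_df.corr(numeric_only=True),annot=True,linewidth = 0.5, vmin=-1, vmax=1, cmap = 'YlGnBu')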




*   Dew point temperature is strongly correlated with Temperature.
*   Temperature and Hour show the strongest correlations with Rented Bike Count.
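
A heatmap is easier to read alongside a ranked list of correlations with the target. The sketch below shows one way to produce such a list. The toy frame and its values are invented, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset
toy = pd.DataFrame({
    "Rented Bike Count": [100, 200, 300, 400, 500],
    "Temperature(°C)": [1.0, 2.0, 3.5, 4.0, 5.0],
    "Humidity(%)": [90, 70, 75, 40, 50],
    "Seasons": ["Winter", "Winter", "Spring", "Summer", "Summer"],
})

# Keep numeric columns only, then rank features by absolute correlation with the target
numeric = toy.select_dtypes(include="number")
ranking = (
    numeric.corr()["Rented Bike Count"]
    .drop("Rented Bike Count")
    .sort_values(key=abs, ascending=False)
)
print(ranking)
```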



# **Scatter plot showing the high correlation of Temperature and Dew point Temperature**

In [None]:
plt.figure(figsize=[10,6])
sns.scatterplot(data=bd_df, y='Temperature(°C)', x='Dew point temperature(°C)')

**Were rental services offered on non-functional days?**

In [None]:
len(bd_df[bd_df['Functioning Day']=='No'])


It is highly unlikely that services will be provided on non-functional days. But since a few observations (295) were recorded on those days, let's check whether there were any exceptional cases.

In [None]:
#Calculating the count of rental bikes,number of holidays &non-holidays and number of records for Functioning and Non-Functioning days

bd_df.groupby(['Functioning Day','Holiday']).agg(
    bikerentalcounts=('Rented Bike Count','sum'),
    no_of_holidays_nonholidays=('Date','nunique'),
    no_of_records=('Date','count'))

In [None]:
plt.figure(figsize=(8,6))
sns.barplot(x='Functioning Day',y='Rented Bike Count',data=bd_df)



*   The rental service was functional on most days during the period from Dec 2017 to Nov 2018 (only 13 non-functional days).
*   Although we observed a few records on non-functioning days, rental services were not offered on those days (no exceptions).
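
The "no exceptions" observation can be turned into an explicit check by summing the rentals over all rows where the service was closed. The example below is a small sketch on made-up rows. It assumes the same column names and 'Yes'/'No' labels as the dataset.

```python
import pandas as pd

# Hypothetical rows; labels follow the 'Functioning Day' column of the dataset
toy = pd.DataFrame({
    "Functioning Day": ["Yes", "Yes", "No", "No"],
    "Rented Bike Count": [254, 0, 0, 0],
})

closed = toy["Functioning Day"] == "No"
rentals_when_closed = int(toy.loc[closed, "Rented Bike Count"].sum())
print(f"Rentals recorded while the service was closed: {rentals_when_closed}")
```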



# **Which are the days on which the rental facility was unavailable?**

In [None]:
non_functioning_days =bd_df.loc[bd_df['Functioning Day']=='No']

#Holiday on which the rental service was unavailable
non_functioning_days.loc[non_functioning_days['Holiday']=='Holiday']['Date'].unique()

The holiday on which the rental service was not functioning is Hangeul Day. It is a national Korean commemorative day marking the invention and proclamation of Hangul, the alphabet of the Korean language.

In [None]:
non_functioning_days.loc[non_functioning_days['Holiday']=='No Holiday']['Date'].value_counts().to_frame(name = 'Hours_of_non_operation').reset_index().rename(columns={'index':'Date'})

The services were unavailable for 1 day in April, 1 day in May, 4 days in September, and 3 days each in October and November.

# **What is the likelihood of people renting bikes on holidays and non-holidays?**

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Holiday',y='Rented Bike Count',data=bd_df,palette='Set1')



*   The demand for rented bikes is higher on non holidays



# **What is the count of rented bikes during different seasons over the entire period of observation?**

In [None]:
#Finding the total number of bikes rented in each season
season_df=bd_df.groupby('Seasons')['Rented Bike Count'].sum().to_frame(name = 'season_count').reset_index()

In [None]:
#Finding the total number of bikes rented in each month
month_df=(bd_df.groupby(['Seasons','Month'])['Rented Bike Count'].sum()).to_frame(name = 'month_count').reset_index()

In [None]:
import calendar

# Define month names
month_names = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]

# Create a figure and axis
fig, ax = plt.subplots()
size = 1
group_names = ['Autumn', 'Spring', 'Summer', 'Winter']
group_size = season_df['season_count']
subgroup_size = month_df['month_count']

# Setting figure colors using color maps
a, b, c, d = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples]
outer_colors = [a(.8), b(.8), c(.8), d(.8)]
inner_colors = [*a(np.linspace(.7, .4, 3)), *b(np.linspace(.7, .4, 3)), *c(np.linspace(.7, .4, 3)), *d(np.linspace(.7, .4, 3))]

# Creating the outer pie chart
patches, texts, pcts = ax.pie(
    group_size,
    radius=3.2,
    colors=outer_colors,
    wedgeprops=dict(width=size, edgecolor='w'),
    labels=group_names,
    autopct='%1.1f%%',
    textprops={'fontsize': 16},
    labeldistance=1.1,
    pctdistance=0.85
)
plt.setp(pcts, color='white', fontweight='bold')
plt.setp(texts, fontweight=600)

# Creating the inner pie chart with month names
subgroup_names = [calendar.month_abbr[month_num] for month_num in month_df['Month']]
patches1, texts1, pcts1 = ax.pie(
    subgroup_size,
    radius=3.2 - size,
    colors=inner_colors,
    labels=subgroup_names,
    wedgeprops=dict(width=1.2, edgecolor='w'),
    autopct='%1.1f%%',
    textprops={'fontsize': 14},
    labeldistance=0.8,
    pctdistance=0.65
)
plt.setp(pcts1, color='w', fontweight='bold', fontsize=12)
plt.setp(texts1, fontweight=600)

ax.set(aspect="equal")

# Show the pie chart
plt.show()




*   The demand for rental bikes is lowest during winter (Dec-Feb) and highest during summer (June-August).
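
The percentages shown on the pie chart can also be computed directly, which makes the seasonal comparison easy to quote. The sketch below uses invented counts and is only meant to show the calculation.

```python
import pandas as pd

# Hypothetical counts, not the real seasonal totals
toy = pd.DataFrame({
    "Seasons": ["Winter", "Summer", "Summer", "Spring", "Autumn"],
    "Rented Bike Count": [100, 300, 200, 250, 150],
})

totals = toy.groupby("Seasons")["Rented Bike Count"].sum()
share = (totals / totals.sum() * 100).round(1)  # percentage of all rentals per season
print(share)
```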



## **What is the demand for rental bikes during different days of the week?**

In [None]:
plt.figure(figsize=(8,6))
sns.boxenplot(x='Days_of_week',y='Rented Bike Count',data=bd_df,palette='Set1')



*   Demand is lowest on Sunday and slightly higher on Friday.

*   There is more demand on weekdays than on weekends.





# What is the demand for rental bikes during weekdays and weekends?

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Weekend',y='Rented Bike Count',data=bd_df,palette='Set1')

In [None]:
bd_df.groupby(['Weekend'])['Rented Bike Count'].mean()



*   The average demand for rental bikes is lower on Weekends(Sat-Sun) as compared to Weekdays(Mon-Fri).



# What is the demand for rental bikes during different hours of the day?

In [None]:
plt.figure(figsize=(10,8))
sns.lineplot(x='Hour',y='Rented Bike Count',data=bd_df,palette='Set1',hue='Seasons',lw=1.5)



*   The demand for rental bikes peaks at hour 8 (8:00 AM) and hour 18 (6:00 PM).

*   These peaks coincide with the opening and closing hours of various institutions and offices.

*   The demand for rental bikes increases steadily after 10:00 AM and continues to rise until 6:00 PM.
*   The demand for bikes is lowest during the early hours (1:00 AM to 6:00 AM).

*   This general trend holds regardless of the season.
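
The peak hours read off the line plot can be confirmed by averaging the count per hour and picking the largest values. The sketch below does this on a few invented rows. The numbers are illustrative only.

```python
import pandas as pd

# Hypothetical hourly observations
toy = pd.DataFrame({
    "Hour": [4, 4, 8, 8, 13, 13, 18, 18],
    "Rented Bike Count": [90, 110, 900, 1100, 600, 700, 1400, 1600],
})

hourly_mean = toy.groupby("Hour")["Rented Bike Count"].mean()
peak_hours = hourly_mean.nlargest(2).index.tolist()  # hours with the highest average demand
quietest_hour = hourly_mean.idxmin()
print(peak_hours, quietest_hour)
```

Grouping by `['Seasons', 'Hour']` instead would give the same summary per season.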







## What is the variation of Rented bikes count over the entire period of observation based on various factors?

In [None]:
fig = plt.figure(figsize=(15,12))
c=1
cont = ['Date','Temperature(°C)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
for i in cont:
  plt.subplot(4,2,c)
  #no palette here: seaborn only applies a palette when a hue variable is given
  sns.lineplot(x=i,y='Rented Bike Count',data=bd_df)
  plt.title('Demand of Rental bikes at different {}'.format(i))
  c = c + 1
plt.tight_layout()



*   Temperature vs Bike count plot : The demand is higher at warmer temperatures (25°C-30°C).

*   Windspeed vs Bike count plot : The demand for rental bikes is relatively uniform for windspeeds up to 5 m/s. Beyond that speed, we observe a higher demand for bikes.

*   Visibility vs Bike count plot : Few bikes are rented when the visibility is extremely low (less than 1000 m).
*   Solar radiation vs Bike count plot : There is an overall increase in demand with increasing solar radiation.

*   Rainfall vs Bike count plot : The peak between 20 mm and 25 mm seems out of place. On referring to the dataset, we find that these observations were recorded during the summer season. People still opt for rental bikes because they have to go to work (No Holiday).

*   Snowfall vs Bike count plot : The demand for bikes is comparatively lower when the snowfall received is 4 cm or more.



# Inspecting the observations where there is a peak in demand for bikes regardless of the weather conditions

In [None]:
#1.Rainfall
bd_df[(bd_df['Rainfall(mm)'] >=20) & (bd_df['Rainfall(mm)'] <=25)]

These are working days



In [None]:
bd_df[(bd_df['Snowfall (cm)'] >=5) & (bd_df['Snowfall (cm)'] <=8)]



*   These are also working days




# What are the factors which influence the demand for rental bikes during a day?

In [None]:

fig = plt.figure(figsize=(11,9))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)','Solar Radiation (MJ/m2)', 'Humidity(%)']
for i in columns :
    plt.subplot(3,2,c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="green")
    sns.lineplot(data=bd_df, x='Hour', y=i, color='r')
    c = c + 1
plt.tight_layout()



*   Temperature, visibility, and windspeed appear to be positively associated with the hourly demand for rental bikes, while humidity appears to be negatively associated with it.
*   The rented bike counts are highest between 7:00 AM and 8:00 PM (hour 20), when temperature, visibility, and windspeed are at their highest and humidity is at its lowest.



# What are the factors which influence the demand for rental bikes during different months?

In [None]:
fig = plt.figure(figsize=(12,12))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)','Solar Radiation (MJ/m2)', 'Humidity(%)','Rainfall(mm)','Snowfall (cm)']
for i in columns :
    plt.subplot(4,2, c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="green")
    sns.lineplot(data=bd_df, x='Month', y=i, color='r')
    c = c + 1
plt.tight_layout()



*   The monthly count of rented bikes is positively associated with Temperature.

*   Snowfall follows the seasons, with the heaviest snowfall from December to February, during winter. The count of rented bikes declines during these months.
*   Rainfall tends to be more frequent in Seoul from June to August, during summer. However, this has not led to a decline in demand for rental bikes during those months.






# What are the factors which influence the demand for rental bikes during various seasons of the year?

In [None]:
fig = plt.figure(figsize=(12,12))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)','Solar Radiation (MJ/m2)', 'Humidity(%)','Rainfall(mm)', 'Snowfall (cm)']
for i in columns :
    plt.subplot(4,2,c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="black")
    sns.barplot(data=bd_df, x='Seasons', y=i, palette='Set1')
    sns.lineplot(data=bd_df, x='Seasons', y=i, color='black')
    c = c + 1
plt.tight_layout()



*   The seasonal demand for rental bikes appears to be positively associated with temperature, solar radiation, rainfall, and humidity, and negatively associated with snowfall.
*   Accordingly, the demand is highest during summer and lowest during winter.



# Basic Conclusions from Bivariate Analysis



*   Temperature and Hour have a strong correlation with the count of rented bikes.

*   Dew point temperature is highly positively correlated with Temperature.
*   Peak demand for rental bikes occurs around the opening (8-9 AM) and closing (6-7 PM) times of offices and institutions.

*   During the period from Dec 2017 to Nov 2018, bike rental facilities were available on most days. The service was unavailable on only 13 days.

*   The demand for rental bikes is higher on regular days (non-holidays).
*   There is more demand for rental bikes on weekdays than on weekends.

*   There is a significant drop in the number of rented bikes during winter (Dec-Feb), when it is freezing cold.

*   The demand for bikes increases at warmer temperatures, which is why the count of rented bikes is highest during the summer season.





# Feature engineering

In [None]:
#Checking for multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_vif(dataframe):
  '''
  This function calculates the variance inflation factor of the independent features in the dataset
  '''

  # the independent variables set
  X =dataframe
  # VIF dataframe
  vif_data = pd.DataFrame()
  vif_data["feature"] = X.columns

  # calculating VIF for each feature
  vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                            for i in range(len(X.columns))]
  print(vif_data)

In [None]:
#Displaying the columns in the dataframe
bd_df.columns

In [None]:
#Checking the VIF value of certain columns in bd_df
check_vif(bd_df[['Hour', 'Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)',
        'Month','Year','Day']])



Multicollinearity inflates the variance of the coefficient estimates, which reduces the statistical power of the regression model.

Let's check the VIF values after excluding Dew point temperature and Year.


In [None]:
check_vif(bd_df[['Hour', 'Temperature(°C)', 'Humidity(%)','Wind speed (m/s)', 'Visibility (10m)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)','Month','Day']])

The VIF values of the features now lie within the acceptable range.
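
For reference, the VIF of feature $i$ is $1/(1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on the remaining features. The sketch below is a hand-rolled NumPy version that always includes an intercept in that regression. It is not the statsmodels routine used above, and the function name and synthetic data are assumptions for illustration. As far as we understand, `variance_inflation_factor` regresses on the columns exactly as passed in, so its values can differ from this sketch unless a constant column is added to the design matrix. That point is worth verifying against the statsmodels documentation.

```python
import numpy as np

def vif_with_intercept(X) -> np.ndarray:
    """VIF per column, regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares fit
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: column c is almost a copy of column a, column b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

The near-duplicate pair gets a very large VIF while the independent column stays close to 1, which mirrors the Temperature and Dew point temperature situation above.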

In [None]:
#Dropping 'Dew point temperature(°C)','Year' to reduce the VIF
bd_df.drop(columns=['Dew point temperature(°C)','Year'],inplace=True)

In [None]:
#Creating a copy of the main dataframe 'bd_df'
df=bd_df.copy()

In [None]:
#Creating dummies for the Categorical columns
df = pd.get_dummies(df, columns = ['Seasons','Holiday','Weekend','Functioning Day'],drop_first=True)
df.head(2)
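
With `drop_first=True`, `get_dummies` drops the first category of each encoded column (in sorted order), so that category becomes the baseline represented by all zeros. The sketch below shows this on a made-up four-row frame. `dtype=int` is an optional choice made here so that the dummies are 0/1 integers.

```python
import pandas as pd

# Hypothetical rows using the category labels of the dataset
toy = pd.DataFrame({
    "Hour": [0, 1, 2, 3],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

encoded = pd.get_dummies(toy, columns=["Seasons", "Holiday"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Here 'Autumn' and 'Holiday' are the dropped baselines, so an Autumn row has zeros in all three season columns.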

In [None]:
df.columns

In [None]:
#Dropping the columns Date and Days_of_week
df.drop(['Days_of_week','Date'],axis=1,inplace=True)

In [None]:
#Displaying the columns present in the dataframe 'df'
df.columns

# Implementation of Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [None]:
#Defining independent and dependent variables

y = df['Rented Bike Count']
X = df.drop('Rented Bike Count',axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
print(f'The shape of X : {X.shape}\n The shape of X_train : {X_train.shape}\n The shape of X_test : {X_test.shape}')

In [None]:
#Creating functions to calculate the Evaluation metrics for the regression models

def evaluate_model(name,X_test,y_true,y_pred):

  '''
  This function calculates metrics for evaluating
  the performance of regression models and returns
  [name, MAE, MSE, RMSE, R2, adjusted R2]
  '''
  #calculating mean absolute error
  MAE =  mean_absolute_error(y_true,y_pred)
  print(f'MAE : {MAE}')

  #finding mean_squared_error
  MSE  = mean_squared_error(y_true,y_pred)
  print("MSE :" , MSE)

  #finding root mean squared error
  RMSE = np.sqrt(MSE)
  print("RMSE :" ,RMSE)

  #finding the r2 score
  r2 = r2_score(y_true,y_pred)
  print("R2 :" ,r2)

  #finding the adjusted r2 score from r2, the number of samples n and the number of features p
  n, p = X_test.shape
  adj_r2 = 1-(1-r2)*((n-1)/(n-p-1))
  print("Adjusted R2 : ",adj_r2)
  return [name,MAE,MSE,RMSE,r2,adj_r2]
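
The adjusted R2 in `evaluate_model` penalises R2 for the number of features: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n samples and p features. The standalone sketch below works through one example with made-up numbers. The helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Made-up numbers: R2 = 0.5 on 101 samples with 10 features
example = adjusted_r2(0.5, n_samples=101, n_features=10)
print(round(example, 4))
```

With few samples per feature the penalty grows, so the adjusted value falls further below the plain R2.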

In [None]:
#Creating a  list which would store lists of different models and their performance metrics
list_of_models=[]

## Linear Regression

**Multiple linear regression**

In [None]:
#Scaling the features
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
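
The scaler is fitted on the training split only and then applied to the test split, so no information from the test data leaks into the preprocessing. The NumPy sketch below reproduces that idea by hand on a tiny made-up matrix. It assumes standardisation with the population standard deviation (ddof=0), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature matrices standing in for X_train and X_test
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 40.0]])

# Statistics come from the training data only
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (X_train_toy - train_mean) / train_std
test_scaled = (X_test_toy - train_mean) / train_std  # reuse the training statistics
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is expressed in the same units without influencing them.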

In [None]:
#Importing the Linear Regression model
from sklearn.linear_model import LinearRegression

In [None]:
#Fitting the data to Linear Regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
#predicting the values of y from X_test
y_pred= regressor.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Multiple Linear Regression',X_test,y_test,y_pred))

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
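
A minimal sketch of counting and imputing missing values with pandas, assuming median imputation for numeric columns and mode imputation for categorical ones. The tiny DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in numeric and categorical columns.
df = pd.DataFrame({
    "temperature": [5.0, np.nan, 12.5, 20.1, np.nan],
    "humidity": [40.0, 55.0, np.nan, 70.0, 65.0],
    "season": ["Winter", "Spring", None, "Summer", "Summer"],
})

# Count missing values per column before imputing.
missing_per_column = df.isna().sum()

# Numeric columns: fill with the column median (robust to outliers).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical column: fill with the most frequent value.
df["season"] = df["season"].fillna(df["season"].mode()[0])
```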

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
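
One common treatment is to cap values outside the 1.5 × IQR fences rather than dropping rows. The sketch below applies it to a hypothetical `wind_speed` series.

```python
import pandas as pd

# Hypothetical feature with one extreme value.
wind_speed = pd.Series([0.5, 1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 14.0], name="wind_speed")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = wind_speed.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) values that fall outside the fences.
capped = wind_speed.clip(lower=lower, upper=upper)
```

Capping keeps every hourly record in the dataset, which matters when the rows form a continuous time series.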

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
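
A sketch of two simple encodings: a binary map for a two-level column and one-hot encoding for a multi-level column. The column names and category labels are assumptions. Check them against the actual dataset.

```python
import pandas as pd

# Hypothetical categorical columns.
df = pd.DataFrame({
    "season": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
    "holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday", "Holiday"],
})

# Two-level column: map directly to 0/1.
df["holiday"] = df["holiday"].map({"No Holiday": 0, "Holiday": 1})

# Multi-level column: one-hot encode, dropping the first level to avoid
# perfectly collinear dummy columns in linear models.
encoded = pd.get_dummies(df, columns=["season"], drop_first=True, dtype=int)
```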

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words That Contain Digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
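
A sketch of deriving calendar features from a date column. The `date` column name and the day/month/year format are assumptions, so verify both against the raw file.

```python
import pandas as pd

# Hypothetical rows with a date string and an hour of day.
df = pd.DataFrame({
    "date": ["01/12/2017", "02/12/2017", "03/12/2017"],
    "hour": [0, 8, 18],
})

# Parse the date explicitly so day and month are not swapped.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# New features: month, day of week (Monday = 0), and a weekend flag.
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
```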

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
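
One simple filter method is to drop one feature from each highly correlated pair. The sketch below uses synthetic data. The 0.9 threshold and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic features: dew_point is built to track temperature closely.
rng = np.random.default_rng(0)
temperature = rng.normal(15, 10, 200)
df = pd.DataFrame({
    "temperature": temperature,
    "dew_point": temperature - 5 + rng.normal(0, 1, 200),
    "humidity": rng.uniform(20, 90, 200),
})

# Keep only the upper triangle so each pair is inspected once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
selected = df.drop(columns=to_drop)
```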

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used, and why?

In [None]:
# Transform Your data
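
Count-like targets are often right-skewed. A square-root (or `np.log1p`) transform is one way to reduce that skew. The sketch below compares skewness before and after on a synthetic series. Whether the real target needs it should be checked the same way.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (hypothetical stand-in for rental counts).
rng = np.random.default_rng(1)
rented_count = pd.Series(rng.gamma(shape=1.5, scale=400, size=1000))

skew_before = rented_count.skew()

# Square-root transform compresses the long right tail.
transformed = np.sqrt(rented_count)
skew_after = transformed.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Predictions made on the transformed scale must be squared before they are reported as bike counts.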

### 6. Data Scaling

In [None]:
# Scaling your data
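
A sketch of standardization with scikit-learn's `StandardScaler`. The scaler is fitted on the training split only and then reused on the test split, which avoids leaking test statistics into training. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test feature matrices (two hypothetical features).
X_train = np.array([[10.0, 30.0], [20.0, 50.0], [30.0, 70.0]])
X_test = np.array([[25.0, 60.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)
```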

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
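
A minimal sketch of an 80/20 split with `train_test_split`. The ratio and the random seed are assumptions, and the arrays are placeholders for the real feature matrix and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (100 rows, 2 features) and target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```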

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
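
As a baseline, the implement, fit, and predict steps can be sketched with a linear regression evaluated by MAE, RMSE, and R². The data is synthetic and the choice of `LinearRegression` is an assumption, not the notebook's prescribed model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with three features (third is irrelevant).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 + 50 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the algorithm on the training split.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out split and score the predictions.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.3f}")
```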

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model
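
A sketch of cross-validated grid search with `GridSearchCV`. The estimator, the parameter grid, and the 3-fold setting are assumptions kept deliberately small so the cell runs quickly. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 + 50 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 5, 200)

# Candidate hyperparameter values (small illustrative grid).
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# Every combination is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scorer is negated RMSE, so flip the sign
print(best_params, f"CV RMSE = {best_rmse:.2f}")
```

`search.best_estimator_` is refitted on the full data passed to `fit` and can be used directly for prediction.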

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
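
One model-agnostic option for this question is permutation importance from `sklearn.inspection`. The sketch below uses a random forest on synthetic data. The feature names are hypothetical, and the importances are computed on the training data only for brevity (a held-out split is usually preferable).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the first feature drives the target, the third is pure noise.
rng = np.random.default_rng(0)
feature_names = ["temperature", "humidity", "noise"]
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 2, 300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Rank features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```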

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
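
A sketch of the save, reload, and sanity-check round trip using `joblib`. The file name is hypothetical, and a temporary directory is used here so the example leaves nothing behind. A real notebook would write to a persistent path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for the best-performing model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "bike_demand_model.joblib"  # hypothetical name
    # Save the fitted model to disk.
    joblib.dump(model, model_path)
    # Load it back as a separate object.
    loaded_model = joblib.load(model_path)

# Sanity check: the reloaded model gives the same predictions on unseen data.
X_unseen = np.array([[4.0], [5.0]])
same_predictions = np.allclose(
    model.predict(X_unseen), loaded_model.predict(X_unseen)
)
print("Predictions match:", same_predictions)
```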

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***