<a href="https://colab.research.google.com/github/Dipak9699-ds/FlyTheNest/blob/main/Bike_Sharing_Demand_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -

$\color{red}{\text{Bike Sharing Demand Prediction}}$




# **Project Summary -**

The goal of this project is to develop a machine learning model to predict the demand for bike sharing. The dataset used for this project contains various features related to weather conditions, date and time, and other factors that may influence bike rental demand.

Overall, the bike sharing demand prediction project aimed to provide an accurate and reliable model to forecast the bike rental demand, which can be beneficial for bike sharing companies or city planners in optimizing bike availability and improving operational efficiency.

# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. Bikes must be available and accessible to the public at the right time, because this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the number of bikes required at each hour.**

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')
%matplotlib inline

### Load Dataset

In [None]:
# Load Dataset
bike_df = pd.read_csv('SeoulBikeData.csv', encoding ='latin')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(bike_df.shape)

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

### Check Duplicate Values

In [None]:
# Dataset Duplicate or Non Duplicate Value Count
bike_df.duplicated().value_counts()

In [None]:
# Dataset Duplicate Value Count
len(bike_df[bike_df.duplicated()])

### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(bike_df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(bike_df.isnull())

The dataset has 8760 rows and 14 columns. It contains no missing values and no duplicate rows.
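The checks above can also be collected into one small report. This is a minimal sketch on a toy frame; `summarize_quality` is a hypothetical helper name and is not part of the notebook.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts plus total missing cells and duplicated rows."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy data (illustrative only): one missing value and one duplicated row.
toy = pd.DataFrame({"Hour": [0, 1, 1, 2], "Rainfall": [0.0, 0.5, 0.5, None]})
report = summarize_quality(toy)
print(report)
```

Calling `summarize_quality(bike_df)` on the real data should report zero missing cells and zero duplicate rows if the statement above holds.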

### Check Columns List

In [None]:
# Dataset Columns
bike_df.columns

### Describe Dataset

In [None]:
# Dataset Describe
bike_df.describe().T.style.background_gradient()

### Variables Description

*   Date : Date (day/month/year)
*   Rented Bike count : Count of bikes rented at each hour
*   Hour : Hour of the day (0-23)
*   Temperature : Temperature (in Celsius)
*   Humidity : Humidity measure (in %)
*   Windspeed : Wind speed (m/s)
*   Visibility : Visibility measure (10m)
*   Dew point temperature : Dew point temperature measure (in Celsius)
*   Solar radiation : Solar radiation (MJ/m2)
*   Rainfall : Rainfall measure (in mm)
*   Snowfall : Snowfall measure (in cm)
*   Seasons : Winter, Spring, Summer, Autumn
*   Holiday : Whether the day is a holiday or not
*   Functional Day : Whether the day is a functional day or not











### Check unique values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in bike_df.columns:
  print("No. of unique values in ",i,"is",bike_df[i].nunique(),".")

In [None]:
# Check Unique Values for each variable.
for column in bike_df.columns:
  print(str(column) + ' : ' + str(bike_df[column].unique()))
  print('____________________________________________')

### Changing column name

In [None]:
# Rename the complex columns name
bike_df = bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# Dataset Columns
bike_df.columns

### Breaking date column & create a new column weekdays_weekend

In [None]:
# Convert the "Date" column to datetime format
import datetime as dt
bike_df['Date'] = bike_df['Date'].apply(lambda x: dt.datetime.strptime(x,"%d/%m/%Y"))

In [None]:
# Changing the "Date" column into three "year", "month", "day" column
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
# Creating a new column of "weekdays_weekend" and drop the column "Date", "day", "year"
bike_df['weekdays_weekend']=bike_df['day'].apply(lambda x : 1 if x=='Saturday' or x=='Sunday' else 0 )
bike_df=bike_df.drop(columns=['Date','day','year'],axis=1)

In [None]:
bike_df.head()

In [None]:
bike_df.info()

In [None]:
bike_df['weekdays_weekend'].value_counts()

### Changing data type

In [None]:
# Change the int64 columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:
# Let's check the result of data type
bike_df.info()

In [None]:
# Dataset is ready to analysis.
# Create a copy of the current dataset and assigning to bike_df1
bike_df1 = bike_df.copy()

### What manipulations have I done and what insights did I find?

* First, the columns were renamed to short, consistent names.
* pandas reads the "Date" column as object type, i.e. as strings. Because the date matters for analyzing user behaviour, it was converted to datetime format and then split into three columns: 'year', 'month' and 'day'.
* The "year" column has only two unique values, because the data runs from December 2017 to November 2018. Treating this span as a single year, the "year" column adds no information, so it was dropped.
* The "day" column holds the name of the day of the week. The individual day is not needed; what matters is whether a day is a weekday or a weekend, so it was converted into the 'weekdays_weekend' flag and the "day" column was dropped.
* "Hour", "month" and "weekdays_weekend" are stored as integers but are really categorical. They were converted to the category data type; otherwise, later analysis and correlations would treat them as continuous quantities and could give misleading results.
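The date handling described above can be sketched in vectorized form. This is an illustrative alternative on toy dates, assuming the same day/month/year layout that the notebook parses; it uses `pd.to_datetime` and `dt.dayofweek` instead of `strptime` and day names.

```python
import pandas as pd

# Toy dates in the same day/month/year layout that the notebook parses.
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017"]})

# Vectorized parse; the explicit format avoids day/month ambiguity.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

toy["month"] = toy["Date"].dt.month
# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturday or Sunday.
toy["weekdays_weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)

print(toy)
```

1 December 2017 was a Friday, so the flag is 0 for the first row and 1 for the following Saturday and Sunday.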

### Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df, x='month', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to month')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count across months.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that demand for rented bikes is higher from month 5 to month 10 (May to October) than in the other months. These are the warmer months, covering late spring, summer and early autumn.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df, x='weekdays_weekend', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count on weekdays and weekends.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that demand is higher on weekdays (`weekdays_weekend` = 0, the blue bar) than on weekends, most likely because of commuting to the office.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df, x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Average count of rented bikes by hour according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The point plot shows that demand is higher on weekdays (`weekdays_weekend` = 0, shown in blue), most likely because of commuting to the office.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(10,5))
sns.barplot(data=bike_df, x='Hour', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to hour')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count across the hours of the day.

##### 2. What is/are the insight(s) found from the chart?

The plot shows the use of rented bikes by hour of the day, using data from the whole year.

People generally use rented bikes during commuting hours, from 7 am to 9 am and from 5 pm to 7 pm.
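The peak hours read off the chart can also be checked numerically by ranking hours by mean demand. The sketch below uses synthetic counts for illustration only; they are not values from `SeoulBikeData.csv`.

```python
import pandas as pd

# Synthetic counts for illustration only; not values from the real dataset.
toy = pd.DataFrame({
    "Hour": [7, 8, 8, 12, 18, 18, 23],
    "Rented_Bike_Count": [600, 1000, 1200, 500, 1500, 1700, 300],
})

# Mean demand per hour, then the two busiest hours.
hourly_mean = toy.groupby("Hour")["Rented_Bike_Count"].mean()
top_hours = hourly_mean.nlargest(2)
print(top_hours)
```

In the notebook, `Hour` has the category data type, so passing `observed=True` to `groupby` keeps the result limited to hours that actually occur.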

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df, x='Functioning_Day', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to Functioning Day')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count on functioning and non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

The bar plot compares the use of rented bikes on functioning and non-functioning days. It clearly shows that people do not use rented bikes on non-functioning days.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df, x='Hour', y='Rented_Bike_Count', hue='Functioning_Day', ax=ax)
ax.set(title='Average count of rented bikes by hour according to Functioning Day')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The point plot compares the hourly use of rented bikes on functioning and non-functioning days. It clearly shows that people do not use rented bikes on non-functioning days.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df, x='Seasons', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to Seasons')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count across the four seasons.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows the use of rented bikes in the four seasons. It clearly shows that use is highest in summer.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df, x='Hour', y='Rented_Bike_Count', hue='Seasons', ax=ax)
ax.set(title='Average count of rented bikes by hour according to Seasons')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The point plot shows the hourly use of rented bikes in the four seasons:

In summer, the use of rented bikes is high, and the peak times are 7 am to 9 am and 5 pm to 7 pm.

In winter, the use of rented bikes is very low, most likely because of cold weather and snowfall.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=bike_df, x='Holiday', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='Average count of rented bikes according to Holiday')

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric summary across the levels of a categorical variable. `sns.barplot` plots the mean of `Rented_Bike_Count` for each level, with error bars for the confidence interval.

I used a bar chart here to compare the average rented bike count on holidays and non-holidays.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that the use of rented bikes is higher on non-holidays than on holidays.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Analysis of data by visualization
fig, ax = plt.subplots(figsize=(10,5))
sns.pointplot(data=bike_df, x='Hour', y='Rented_Bike_Count', hue='Holiday', ax=ax)
ax.set(title='Average count of rented bikes by hour according to Holiday')

##### 1. Why did you pick the specific chart?

A point plot is a categorical plot that displays the mean value (or another statistical estimate) and confidence intervals for different categories. It is commonly used to compare the relationship between a categorical variable and a numeric variable across different groups or levels.

##### 2. What is/are the insight(s) found from the chart?

The point plot shows the hourly use of rented bikes on holidays and non-holidays. On holidays, people mostly use rented bikes from 2 pm to 8 pm.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Distribution plots of the numerical variables

# Assign the numerical columns to a variable
numerical_columns = list(bike_df.select_dtypes(['int64','float64']).columns)
numerical_features = pd.Index(numerical_columns)

# Let's see how the data is distributed for every column
# Note: sns.distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is the suggested replacement
plt.figure(figsize=(12,10))
plotnumber = 1

for column in numerical_features:
    if plotnumber <= 9 :
        ax = plt.subplot(3,3,plotnumber)
        sns.distplot(bike_df[column])
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A distplot is used for a univariate set of observations and visualizes the distribution through a histogram. It shows one variable at a time, so each numerical column of the dataset is plotted separately.

##### 2. What is/are the insight(s) found from the chart?

The distplots show that most of the columns are skewed, some to the right and some to the left.
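Skewness can also be quantified rather than read off the plots. The sketch below applies pandas' `skew()` to synthetic columns (illustrative values only): a positive result indicates a right skew, a negative result a left skew, and a value near zero a roughly symmetric distribution.

```python
import pandas as pd

# Synthetic columns for illustration only.
toy = pd.DataFrame({
    "right_skewed": [0, 0, 0, 1, 1, 2, 10],  # long tail towards large values
    "symmetric": [1, 2, 3, 4, 5, 6, 7],      # evenly spread around the middle
})

# Sample skewness per column.
skewness = toy.skew()
print(skewness)
```

On the real data, `bike_df[numerical_features].skew()` would give one value per numerical column.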

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Numerical vs.Rented_Bike_Count

# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Temperature"
bike_df.groupby('Temperature')['Rented_Bike_Count'].mean().plot()

In [None]:
# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Dew_point_temperature"
bike_df.groupby('Dew_point_temperature')['Rented_Bike_Count'].mean().plot()

In [None]:
# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Solar_Radiation"
bike_df.groupby('Solar_Radiation')['Rented_Bike_Count'].mean().plot()

In [None]:
# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Snowfall"
bike_df.groupby('Snowfall')['Rented_Bike_Count'].mean().plot()

In [None]:
# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Rainfall"
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot()

In [None]:
# Print the plot to analyze the relationship between "Rented_Bike_Count" and "Wind_speed"
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

`plot()` draws points (markers) in a diagram and, by default, connects them with a line from point to point. Here it is called on the mean rented bike count grouped by each numerical feature, which gives a line chart of average demand against that feature.

##### 2. What is/are the insight(s) found from the chart?

From the above plots we see that:
* People like to ride bikes when it is fairly warm, around 25°C on average.
* The 'Dew_point_temperature' plot looks almost the same as the 'Temperature' plot. This similarity is checked in a later step.
* The number of rented bikes is high when there is solar radiation, with the count at around 1000.
* The number of rented bikes is very low when there is more than 4 cm of snow.
* Even heavy rain does not always reduce demand: for example, there is a large peak of rented bikes at 20 mm of rain. A peak at a single value can come from only a few hours of data, so it should be read with caution.
* Demand is fairly uniform across wind speeds, but it rises at a wind speed of 7 m/s, which suggests that people like to ride when it is a little windy.
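Grouping by every distinct value of a continuous feature can produce peaks that rest on very few rows. One way to make such readings more robust is to bin the feature and report the row count next to the mean. This is a minimal sketch on synthetic values; the bin edges and labels are illustrative assumptions, not thresholds from the notebook.

```python
import pandas as pd

# Synthetic values for illustration only.
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.0, 0.5, 1.0, 3.0, 20.0],
    "Rented_Bike_Count": [900, 1000, 1100, 400, 300, 100, 800],
})

# Hypothetical bins: no rain, up to 2 mm, and above 2 mm.
bins = [-0.001, 0.0, 2.0, 35.0]
labels = ["none", "light", "heavy"]
toy["rain_level"] = pd.cut(toy["Rainfall"], bins=bins, labels=labels)

# Mean demand and number of rows per bin.
summary = (
    toy.groupby("rain_level", observed=True)["Rented_Bike_Count"]
    .agg(["mean", "count"])
)
print(summary)
```

The `count` column shows how much data supports each mean, which helps to judge whether an isolated peak is meaningful.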


#### Chart - 13

In [None]:
# Chart - 13 Visualization code
# Plotting the regression plot for all the numerical features

# Let's see how each numerical column relates to the rented bike count
plotnumber = 1
plt.figure(figsize=(12,10))

for column in numerical_features:
    if plotnumber <= 9:
        ax = plt.subplot(3,3,plotnumber)
        sns.regplot(x=bike_df[column], y=bike_df['Rented_Bike_Count'], scatter_kws={"color": 'green'}, line_kws={"color": "black"})
        ax.set_xlabel(column, fontsize=12)
        ax.set_ylabel('Rented Bike Count', fontsize=12)
    plotnumber += 1

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A regplot creates a scatter plot with a linear regression line fitted to the data. It allows you to visualize the relationship between two variables and assess the strength and direction of their linear correlation. The regplot function can be used to perform simple linear regression analysis and visualize the resulting model.

##### 2. What is/are the insight(s) found from the chart?

The regression plots of the numerical features show that 'Temperature', 'Wind_speed', 'Visibility', 'Dew_point_temperature' and 'Solar_Radiation' are positively related to the target variable.

This means that the rented bike count increases as these features increase.

'Rainfall', 'Snowfall' and 'Humidity' are negatively related to the target variable, which means that the rented bike count decreases as these features increase.

#### Chart - 14

In [None]:
# Chart - 14 Visualization code
# Visualize the outliers using boxplot
plt.figure(figsize=(15,12))
graph = 1

for column in numerical_features:
    if graph <= 9:
        plt.subplot(3,3,graph)
        ax=sns.boxplot(x=bike_df[column])
        plt.xlabel(column,fontsize=10)
    graph+=1
plt.show()

##### 1. Why did you pick the specific chart?

Boxplot is used to create box and whisker plots. A boxplot is a visual representation of the distribution of a dataset, showing the median, quartiles, and any outliers.

##### 2. What is/are the insight(s) found from the chart?

In the above boxplot we can see that outliers are present in most of the columns like wind speed, solar radiation etc.

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Plot the Correlation matrix
plt.figure(figsize=(20,8))
correlation = bike_df.corr(numeric_only=True)
sns.heatmap((correlation), annot=True, cmap='YlGnBu')

##### 1. Why did you pick the specific chart?

The correlation heatmap chart is a great way to visualize correlations between multiple variables. It provides a clear and concise view of the relationships between the variables, which allows for easy and quick analysis. Additionally, the color coding used in the heatmap helps to quickly and easily identify correlations that may otherwise not be as apparent.

##### 2. What is/are the insight(s) found from the chart?

On the target variable row of the heatmap, the variables most positively correlated with the rented bike count are:

* Temperature
* Dew point temperature
* Solar radiation

And most negatively correlated variables are:

* Humidity
* Rainfall
* Snowfall

The correlation heatmap also shows a strong positive correlation of 0.91 between 'Temperature' and 'Dew_point_temperature'. Because the two columns vary together, dropping one of them should not materially affect the outcome of the analysis, so we can drop 'Dew_point_temperature'.
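Highly correlated feature pairs can also be listed programmatically instead of being read from the heatmap. The sketch below uses a hypothetical helper, `high_corr_pairs`, with an assumed threshold of 0.9, on synthetic values (not taken from the dataset).

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Synthetic values for illustration only.
toy = pd.DataFrame({
    "Temperature": [0, 5, 10, 15, 20],
    "Dew_point_temperature": [-5, 1, 4, 11, 14],
    "Humidity": [80, 30, 60, 40, 70],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

In this toy frame only the temperature and dew point pair passes the threshold; the humidity column is only weakly correlated with both.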

### Feature Engineering & Data Pre-processing

#### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Missing Values/Null Values Count
print(bike_df.isnull().sum())

# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(bike_df.isnull(), cbar=False)

* There are no missing values to handle in the given dataset.

#### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Remove outliers using zscore.
from scipy.stats import zscore

z_score = zscore(bike_df[numerical_features])
abs_z_score = np.abs(z_score)

filtering_entry = (abs_z_score < 3).all(axis=1)

bike_df = bike_df[filtering_entry]

* I have used z_score technique to treat outliers.
* The z-score technique is used to treat outliers because it provides a standardized way to identify and handle data points that deviate significantly from the mean of a distribution.
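The effect of the filter can be made visible by counting the rows it removes. The sketch below reproduces the same rule (keep rows where every |z| < 3) with plain pandas on synthetic values; `ddof=0` matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Synthetic data for illustration only: one extreme wind-speed reading among 20 rows.
toy = pd.DataFrame({
    "Wind_speed": [1.0] * 19 + [50.0],
    "Humidity": list(range(40, 60)),
})

# Population z-scores (ddof=0), column by column.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep a row only if all of its z-scores are within 3 standard deviations.
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep]
removed = len(toy) - len(filtered)
print("rows removed:", removed)
```

In the notebook, `numerical_features` includes the target `Rented_Bike_Count`, so the filter also drops hours with unusually high demand. Comparing `bike_df.shape` before and after the filter shows how many rows are affected.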

#### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# Assign all categorical features to a variable
categorical_features=list(bike_df.select_dtypes(['object','category']).columns)
categorical_features=pd.Index(categorical_features)
categorical_features

# Create a copy so that the encoding does not modify bike_df
bike_df_copy = bike_df.copy()

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in categorical_features:
    bike_df_copy = one_hot_encoding(bike_df_copy, col)
bike_df_copy.head()

* I have used One_hot_encoding technique for categorical data conversion.
* One hot encoding is used to represent categorical variables numerically in a format that is suitable for machine learning algorithms. It is a popular technique for handling categorical data because many machine learning algorithms are designed to work with numerical data rather than categorical data.
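The effect of `drop_first=True` is easiest to see on a small example. The sketch below encodes a toy `Seasons` column; `dtype=int` is an optional choice made here to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Toy column with the four season labels.
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn"]})

# Categories are sorted alphabetically; drop_first removes the first one (Autumn),
# which then acts as the baseline encoded by all zeros.
encoded = pd.get_dummies(toy["Seasons"], prefix="Seasons", drop_first=True, dtype=int)
print(encoded)
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear regression and for the VIF check.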

### Feature Manipulation & Selection

#### 1. Feature Scaling

In [None]:
# Split data into x and y
y = bike_df_copy['Rented_Bike_Count']
X = bike_df_copy.drop(columns='Rented_Bike_Count', axis=1)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled.shape[1]

* I have used standard scaler technique for data scaling.
* Standard scaling technique is used to promote better analysis, enhance model performance, and ensure consistent and meaningful comparisons among variables in various statistical and machine learning tasks.
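The notebook scales the full feature matrix before splitting it. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the preprocessing. The `_toy` names are illustrative and are not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and target, for illustration only.
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.arange(20, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # the same statistics are reused for test rows

print(X_tr_scaled.mean(axis=0))
```

The scaled training columns have mean 0, while the scaled test columns generally do not, which is expected because the test rows played no part in fitting the scaler.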

#### 2. Feature Selection using VIF

In [None]:
# Finding variance inflation factor in each scaled column i.e X_scaled.shape[1] (1/(1-R2))
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["vif_score"] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
vif["Features"] = X.columns

# Let's check the values
vif

In [None]:
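# Drop the Dew point temperature column, then rebuild X and X_scaled so that the drop takes effect
bike_df_copy = bike_df_copy.drop(columns=['Dew_point_temperature'])
X = bike_df_copy.drop(columns='Rented_Bike_Count')
X_scaled = scaler.fit_transform(X)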

* I have used the correlation heatmap (corr) and the VIF method for feature selection.
* A correlation heatmap (corr) is used to visualize the correlation between different variables in a dataset. It is a graphical representation of the correlation matrix, where each cell represents the correlation coefficient between two variables. The correlation coefficient indicates the strength and direction of the linear relationship between two variables.
* VIF (Variance Inflation Factor) is used to measure multicollinearity in regression analysis. Multicollinearity occurs when there is a high correlation between two or more predictor variables in a regression model, which can lead to issues in the interpretation of the model and unstable coefficient estimates.
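For reference, the VIF of feature $j$ is defined from the $R_j^2$ obtained by regressing that feature on all the other features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A VIF of 1 means the feature is uncorrelated with the others, and the value grows without bound as $R_j^2$ approaches 1. Cut-offs such as 5 or 10 are commonly used rules of thumb rather than strict limits.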


### Data Train-Test Split

In [None]:
# Split the data into train and test sets in a 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 42)

# Shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

* I have split the data in 70-30 % ratio in train-test.
* The 70-30 ratio for train-test split is a commonly used practice in machine learning and data analysis, although it is not a hard rule and can vary depending on the specific problem and dataset. The 70% of the data is typically allocated to the training set, while the remaining 30% is allocated to the test set.

### ML Model Implementation

In [None]:
# Fit the Algorithm
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

# Check the score
reg.score(X_train, y_train)

In [None]:
# Get the predictions for the train and test sets
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

In [None]:
print("\n================Train Result==========================")

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

# Calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

# Calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)

# Calculate r2 and adjusted r2 (the training-set size is used here, because these are training metrics)
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_lr)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_lr)

# Storing the train set metrics value in a dataframe for later comparison
dict1={'Model':'Linear Regression',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

print("\n=================Test Result==========================")

# Calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

# Calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

# Calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)

# Calculate r2 and adjusted r2
r2_lr= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_lr )

# Storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear Regression',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
# Visualizing evaluation Metric Score chart
print(training_df)
print(test_df)
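The adjusted R2 expression is repeated for the train and test sets above. A small helper keeps the formula in one place and makes it harder to mix up the sample counts. `adjusted_r2` is a hypothetical helper name, not part of the notebook; the formula is the standard one, $1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$, for $n$ samples and $p$ features.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 for a model with n_features predictors fitted on n_samples rows."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# Illustrative numbers only: R2 = 0.80 with 101 rows and 10 features.
example = adjusted_r2(0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

For the training metrics this would be called with `X_train.shape[0]` and `X_train.shape[1]`, and for the test metrics with the corresponding `X_test` values.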