 **Name**  Anjali Prasad




 **contribution** individual



# **Project Summary -**


Bike demand prediction is a common problem faced by bike rental companies, as accurately forecasting the demand for bikes can help optimize inventory and pricing strategies. In this project, I aim to develop a regression supervised machine learning model to predict the demand for bikes in a given time period.

The dataset comes from a bike sharing company and records the number of bikes rented, the date and hour of each rental, weather and seasonality features, and other factors that could affect bike demand, such as holidays and whether the day was a functioning day.

After preprocessing and cleaning the data, I split it into training and test sets and used the training data to train the machine learning models. I experimented with several different model architectures and hyperparameter settings, ultimately selecting the model that performed best on the test data.

To evaluate the performance of the model, I used a variety of metrics, including mean absolute error, root mean squared error, and R-squared. I found that the model was able to make highly accurate predictions, with an R-squared value of 0.88 and a mean absolute error of just 2.58.

In addition to evaluating the performance of our model on the test data, I also conducted a series of ablation studies to understand the impact of individual features on the model's performance. I found that the temperature, as well as the weather and seasonality features, had the greatest impact on bike demand.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that the supply stays stable.

My goal is to develop a model that is highly accurate, with a low mean absolute error and a high R-squared value. The model should also provide insights into the factors that most affect bike demand, helping the bike sharing company make data-driven decisions about how to optimize its operations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

#Datetime library for manipulating Date columns.

from datetime import datetime

import datetime as dt


# from sci-kit library scaling, transforming and labeling functions.
# which is used to change raw feature vectors into a representation

#suitable for the downstream estimators.
from sklearn.preprocessing import MinMaxScaler

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

#Importing various machine learning models.
from sklearn.linear_model import LinearRegression

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.linear_model import ElasticNet

from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor

from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_validate

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.model_selection import RandomizedSearchCV

#import different metrics from sci-kit libraries for model evaluation.

from sklearn import metrics
from sklearn.metrics import r2_score

from sklearn.metrics import mean_squared_error

from sklearn.metrics import accuracy_score

from sklearn.metrics import mean_absolute_error

from sklearn.metrics import log_loss

#Importing warnings library. The warnings module handles warnings in Python.
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
bike_df=pd.read_csv('/content/drive/MyDrive/SeoulBikeData.csv',encoding='latin')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

In [None]:
bike_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(bike_df.shape)

In [None]:
# getting all columns
bike_df.columns

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"Data is duplicated? {bike_df.duplicated().value_counts()}, unique values with {len(bike_df[bike_df.duplicated()])} duplication")

In [None]:
for i in bike_df.columns.tolist():

  print(f"No. of unique values in {i} is {bike_df[i].nunique()}.")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isna().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(bike_df.isnull(), cbar=False)

### What did you know about your dataset?

Answer: A day has 24 hours and a year has 365 days, so 365 × 24 = 8760, which matches the number of rows in the dataset.

There are no null values.

The dataset has no duplicate rows, so it is free from the problems duplicates can cause in downstream analysis, such as biasing results or making the data harder to summarize accurately.

The Date column is stored as an object (string) data type and should be converted to a datetime data type.
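As a minimal sketch of that conversion, the snippet below parses DD/MM/YYYY strings on a small made-up frame (`sample` is a hypothetical stand-in, not the real dataset) and pulls out calendar parts with the `.dt` accessor.

```python
import pandas as pd

# Hypothetical sample mimicking the DD/MM/YYYY strings in the Date column
sample = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "30/11/2018"]})

# An explicit format avoids day/month ambiguity (01/12 is 1 December, not 12 January)
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Datetime columns expose calendar parts through the .dt accessor
sample["month"] = sample["Date"].dt.month
sample["day_name"] = sample["Date"].dt.day_name()

print(sample.dtypes)
print(sample)
```

Passing the format explicitly is a design choice: it is stricter than letting pandas guess, so a malformed date raises an error instead of being silently misread.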

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f'Features: {bike_df.columns.to_list()}')

In [None]:
# Dataset Describe
bike_df.describe(include='all')

### Variables Description

Answer: Breakdown of our features:

Date: The date of the day, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type: str; we need to convert it into datetime format.

Rented Bike Count: Number of rented bikes per hour, which is our dependent variable and the value we need to predict, type: int

Hour: The hour of the day, from 0 to 23 in 24-hour format, type: int; we need to convert it into a category data type.

Temperature(°C): Temperature in Celsius, type: float

Humidity(%): Humidity in the air in %, type: int

Wind speed (m/s): Speed of the wind in m/s, type: float

Visibility (10m): Visibility in units of 10 m, type: int

Dew point temperature(°C): Temperature at which the air becomes saturated with water vapour, type: float

Solar Radiation (MJ/m2): Sun contribution, type: float

Rainfall(mm): Amount of rain in mm, type: float

Snowfall (cm): Amount of snow in cm, type: float

Seasons: Season of the year, type: str; there are only 4 seasons in the data.

Holiday: Whether the day is a holiday or not, type: str

Functioning Day: Whether the day is a functioning day or not, type: str

Missing/null values

In [None]:
bike_df.isnull().sum()

In [None]:
# visualizing missing values
missing =pd.DataFrame((bike_df.isnull().sum())*100/bike_df.shape[0]).reset_index()

plt.figure(figsize=(16,5))

ax= sns.pointplot(x='index', y=0, data=missing)

plt.xticks(rotation=90, fontsize=7)

plt.title("Percentage of Missing values")

plt.ylabel("PERCENTAGE")

plt.show()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# duplicate values
value=len(bike_df [bike_df.duplicated()])

print("The number of duplicate values in the data set is = ",value)



In [None]:
# rename columns with complex names
# Only the columns that are referenced by their new names later in the notebook are renamed;
# the other weather columns (e.g. 'Humidity(%)', 'Wind speed (m/s)') keep their original names.
bike_df=bike_df.rename(columns={'Rented Bike Count': 'Rented_Bike_Count',
                                'Temperature(°C)': 'Temperature',
                                'Rainfall(mm)': 'Rainfall',
                                'Snowfall (cm)': 'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# breakdown column
bike_df['Date']= bike_df['Date'].str.replace('-', '/')

bike_df['Date']= bike_df['Date'].apply(lambda x: dt.datetime.strptime(x, "%d/%m/%Y"))

In [None]:
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
bike_df["weekdays weekend"]=bike_df['day'].apply(lambda x: 1 if x== 'Saturday' or x=='Sunday' else 0)

bike_df=bike_df.drop(columns=['Date', 'day', 'year'], axis=1)


In [None]:
bike_df.head()

In [None]:
bike_df["weekdays weekend"].value_counts()

In [None]:
# changing datatype
cols=['Hour', 'month', 'weekdays weekend']
for col in cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:

bike_df.info()

In [None]:
bike_df.columns

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
fig,ax= plt.subplots(figsize=(12,7))
sns.barplot(data=bike_df , x='month', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='count of rented bikes according to month')

##### 1. Why did you pick the specific chart?

Answer: I chose a bar chart to compare the rented bike count across months.


##### 2. What is/are the insight(s) found from the chart?

Answer: We can clearly see that demand is higher from May to October (months 5-10).

#### Chart - 2

In [None]:
# Chart - 2 visualization code
fig,ax= plt.subplots(figsize=(8,6))
sns.barplot(data=bike_df , x='weekdays weekend', y='Rented_Bike_Count', ax=ax, capsize=.2)
ax.set(title='count of rented bikes according to weekdays weekend')

#### Chart - 3

In [None]:
# Chart - 3 visualization code
fig,ax= plt.subplots(figsize=(12,7))
sns.pointplot(data=bike_df , x='Hour', y='Rented_Bike_Count', hue= 'weekdays weekend', ax=ax)
ax.set(title='count of rented bikes according to weekdays_weekends')

##### 1. Why did you pick the specific chart?

Answer: I used a point plot to show how bike demand changes by hour for weekdays and weekends.

##### 2. What is/are the insight(s) found from the chart?

Answer: We can see the bike demand on weekdays, which are represented by the blue line.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
fig,ax= plt.subplots(figsize=(12,7))
sns.barplot(data=bike_df , x='Hour', y='Rented_Bike_Count', ax=ax , capsize=.2)
ax.set(title='count of rented bikes according to Hour')

##### 1. Why did you pick the specific chart?

Answer: I chose this bar chart to show the rented bike count for each hour of the day.

##### 2. What is/are the insight(s) found from the chart?

Answer: We can see that people generally use rented bikes during working hours, from 7 am to 9 pm.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
fig,ax= plt.subplots(figsize=(8,6))
sns.barplot(data=bike_df , x='Functioning_Day', y='Rented_Bike_Count', ax=ax , capsize=.2)
ax.set(title='count of rented bikes according to Functioning day')

##### 1. Why did you pick the specific chart?

Answer: I chose a bar chart to show the rented bike count on functioning and non-functioning days.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
fig,ax= plt.subplots(figsize=(12,7))
sns.pointplot(data=bike_df , x='Hour', y='Rented_Bike_Count', hue='Functioning_Day', ax=ax)
ax.set(title='count of rented bikes according to Functioning_Day')

##### 1. Why did you pick the specific chart?

Answer: I chose a point plot to show the rented bike count by hour for functioning and non-functioning days.

##### 2. What is/are the insight(s) found from the chart?

Answer: We can see that people do not use rented bikes on non-functioning days.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
fig,ax= plt.subplots(figsize=(12,6))
sns.barplot(data=bike_df , x='Seasons', y='Rented_Bike_Count', ax=ax , capsize=.2)
ax.set(title='count of rented bikes according to Seasons')

#### Chart - 8

In [None]:
# Chart - 8 visualization code
fig,ax= plt.subplots(figsize=(12,6))
sns.pointplot(data=bike_df , x='Hour', y='Rented_Bike_Count', hue='Seasons', ax=ax)
ax.set(title='count of rented bikes according to Seasons')


##### 2. What is/are the insight(s) found from the chart?

Answer: Winter has the lowest rented bike counts.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
fig,ax= plt.subplots(figsize=(8,6))
sns.barplot(data=bike_df , x='Holiday', y='Rented_Bike_Count', ax=ax , capsize=.2)
ax.set(title='count of rented bikes according to Holiday')

#### Chart - 10

In [None]:
# Chart - 10 visualization code
fig,ax= plt.subplots(figsize=(12,6))
sns.pointplot(data=bike_df , x='Hour', y='Rented_Bike_Count', hue='Holiday', ax=ax)
ax.set(title='count of rented bikes according to Holiday')

##### 2. What is/are the insight(s) found from the chart?

Answer: On holidays, people use rented bikes mostly from 2 pm to 8 pm.

In [None]:
# separate the numeric columns from the dataframe
numeric_features=bike_df.select_dtypes(exclude=['object' , 'category'])

numeric_features

In [None]:
n=1
plt.figure(figsize=(15,10))
for i in numeric_features.columns:
  plt.subplot(3,3,n)
  n=n+1
  sns.distplot(bike_df[i])
  plt.title(i)
  plt.tight_layout()


In [None]:
# select the target column before aggregating so non-numeric columns are not averaged
bike_df.groupby('Temperature')['Rented_Bike_Count'].mean().plot()

In [None]:
bike_df.groupby('Dew point temperature(°C)')['Rented_Bike_Count'].mean().plot()

In [None]:
bike_df.groupby('Solar Radiation (MJ/m2)')['Rented_Bike_Count'].mean().plot()

In [None]:
bike_df.groupby('Snowfall')['Rented_Bike_Count'].mean().plot()

In [None]:
bike_df.groupby('Wind speed (m/s)')['Rented_Bike_Count'].mean().plot()

In [None]:
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot()

**Regression** **plot**

The regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. A regression plot, as the name suggests, fits a regression line between two variables and helps to visualize their linear relationship.
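To make the idea of the fitted line concrete, here is a small sketch that computes a straight-line fit and the correlation with NumPy. The data are synthetic (the variable names and the slope of 20 are assumptions made for illustration), so the numbers say nothing about the real dataset.

```python
import numpy as np

# Synthetic data: rentals rise with temperature plus random noise (illustrative only)
rng = np.random.default_rng(0)
temperature = rng.uniform(-10, 35, size=200)
rentals = 20 * temperature + 300 + rng.normal(0, 50, size=200)

# Degree-1 polynomial fit = least-squares straight line, the kind of line a regression plot draws
slope, intercept = np.polyfit(temperature, rentals, deg=1)

# Pearson correlation: its sign matches the sign of the slope
corr = np.corrcoef(temperature, rentals)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, correlation={corr:.2f}")
```

A positive slope corresponds to a positive relation with the target, and a negative slope to a negative relation.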

In [None]:
#printing the regression plot for all the numerical features

for col in numeric_features:

  fig,ax=plt.subplots(figsize=(8,4))

  sns.regplot(x=bike_df[col], y=bike_df['Rented_Bike_Count'], scatter_kws={"color": 'orange'}, line_kws={"color": "black"})

From the above regression plots of all numerical features we see that the columns 'Temperature', 'Wind speed', 'Visibility', 'Dew point temperature' and 'Solar Radiation' are positively related to the target variable, which means the rented bike count increases as these features increase.

The features 'Rainfall', 'Snowfall' and 'Humidity' are negatively related to the target variable, which means the rented bike count decreases when these features increase.



**Normalise Rented_Bike_Count column data**

Data normalization (one form of data pre-processing) is a basic element of data mining. It means transforming the data, namely converting the source data into another format that allows the data to be processed effectively. Here the aim is to bring the distribution of Rented_Bike_Count closer to a normal distribution.

In [None]:
 #Distribution plot of Rented Bike Count

plt.figure(figsize=(10,6))

plt.xlabel('Rented_Bike_Count')

plt.ylabel('Density')

ax=sns.distplot(bike_df['Rented_Bike_Count'], hist=True,color="y")

ax.axvline(bike_df['Rented_Bike_Count'].mean(), color="magenta", linestyle="dashed", linewidth=2)

ax.axvline(bike_df['Rented_Bike_Count'].median(), color='black', linestyle="dashed", linewidth=2)

plt.show()

The above graph shows that Rented Bike Count has moderate right skewness. Linear regression assumes normally distributed residuals, and a strongly skewed dependent variable often leads to a violation of that assumption, so we should apply a transformation to bring its distribution closer to normal.
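Skewness can also be checked numerically instead of only by eye. The sketch below uses a synthetic right-skewed series (a gamma sample chosen purely as an assumption, not the real counts) and compares the skewness before and after a square root transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "counts" (gamma distributed); a stand-in for the real column
rng = np.random.default_rng(42)
counts = pd.Series(rng.gamma(shape=1.5, scale=450, size=5000))

# Positive skewness means a long right tail; the mean is pulled above the median
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()

print(f"mean={counts.mean():.1f}, median={counts.median():.1f}")
print(f"skewness raw={skew_raw:.2f}, after sqrt={skew_sqrt:.2f}")
```

A skewness closer to 0 after the transform suggests the transformed variable is more symmetric.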


**Finding Outliers and Treatment**



In [None]:
# Boxplot for Rented bike Count to check outliers

plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')

sns.boxplot(x=bike_df['Rented_Bike_Count'])
plt.show()

In [None]:
#outliers treatments

# cap each feature at an upper threshold; the assignment must target the same column that is tested
bike_df.loc[bike_df['Rainfall']>=4, 'Rainfall']= 4

bike_df.loc[bike_df['Solar Radiation (MJ/m2)']>=2.5, 'Solar Radiation (MJ/m2)']=2.5

bike_df.loc[bike_df['Snowfall']>2, 'Snowfall']= 2

bike_df.loc[bike_df['Wind speed (m/s)']>=4, 'Wind speed (m/s)']= 4

We have treated the outliers by capping each affected feature at a fixed upper threshold, so values above the threshold are replaced with the threshold value.
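The same capping can be written with `Series.clip`, which is a compact alternative to the `.loc` assignments. This is only a sketch on a made-up frame; the `weather` values are hypothetical and the `caps` thresholds are copied from the cell above.

```python
import pandas as pd

# Hypothetical values; the last rows exceed the chosen thresholds
weather = pd.DataFrame({
    "Rainfall": [0.0, 0.5, 3.9, 9.5, 35.0],
    "Snowfall": [0.0, 0.0, 1.0, 2.6, 8.8],
})

# Upper caps per column (same thresholds as used in the notebook)
caps = {"Rainfall": 4, "Snowfall": 2}

for column, upper in caps.items():
    # clip(upper=...) replaces every value above the cap with the cap itself
    weather[column] = weather[column].clip(upper=upper)

print(weather)
```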

In [None]:
#Applying square root to Rented Bike Count to improve skewness

plt.figure(figsize=(8,6))

plt.xlabel('Rented Bike Count')

plt.ylabel('Density')

ax=sns.distplot(np.sqrt(bike_df['Rented_Bike_Count']), color="y")



ax.axvline(np.sqrt(bike_df['Rented_Bike_Count']).mean(), color="magenta", linestyle='dashed', linewidth=2)

ax.axvline(np.sqrt(bike_df['Rented_Bike_Count']).median(), color='black', linestyle='dashed', linewidth=2)

plt.show()

A common rule of thumb is to apply a square root transformation to a right-skewed variable to make it closer to normal. After applying the square root to the skewed Rented Bike Count, we get an almost normal distribution.
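One consequence of modelling the square root of the target is that predictions and errors are in square-root units. The sketch below, with made-up numbers, shows how predictions could be squared to get back to bike counts and how the error changes between the two scales.

```python
import numpy as np

# Made-up actual counts and pretend predictions on the square-root scale
actual_counts = np.array([100.0, 400.0, 900.0, 1600.0])
y_sqrt = np.sqrt(actual_counts)
pred_sqrt = y_sqrt + np.array([1.0, -1.0, 2.0, -2.0])

# Squaring undoes the square-root transform
pred_counts = np.square(pred_sqrt)

# The same predictions give very different error sizes on the two scales
mae_sqrt = np.mean(np.abs(y_sqrt - pred_sqrt))
mae_counts = np.mean(np.abs(actual_counts - pred_counts))

print(f"MAE on sqrt scale: {mae_sqrt}, MAE in bike counts: {mae_counts}")
```

When reading an error metric, it therefore helps to note which scale it was computed on.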



In [None]:
#after applying sqrt on Rented Bike Count, check whether we still have outliers

plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')

sns.boxplot(x=np.sqrt(bike_df[ 'Rented_Bike_Count']))

plt.show()

In [None]:
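# restrict the correlation to numeric columns; object and category columns cannot be correlated
bike_df.corr(numeric_only=True)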

Checking of Correlation between variables

Checking in OLS Model

Ordinary least squares (OLS) regression is a statistical method that estimates the relationship between one or more independent variables and a dependent variable.



In [None]:
#import the module

#assign the 'x','y' value

import statsmodels.api as sm

X=bike_df[[ 'Temperature', 'Humidity(%)', "Wind speed (m/s)",'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall', 'Snowfall']]
Y = bike_df['Rented_Bike_Count']
bike_df.head()

In [None]:
X=sm.add_constant(X)
X

In [None]:
model=sm.OLS(Y,X).fit()
model.summary()

In [None]:
X.corr()

From the OLS model we find that 'Temperature' and 'Dew point temperature' are highly correlated, so we need to drop one of them. To decide which, we check the P>|t| value in the table above: the value for 'Dew point temperature' is higher, so we drop the 'Dew point temperature' column.
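A common complementary check for this kind of overlap, not used in this notebook, is the variance inflation factor (VIF): each feature is regressed on the others and VIF = 1 / (1 - R²). The sketch below implements it with NumPy on synthetic data; the helper `vif` and the three synthetic columns are assumptions for illustration.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column, regressing it on the other columns."""
    n, p = features.shape
    out = np.empty(p)
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        out[j] = 1 / (1 - r2)
    return out

# Synthetic features: dew point is built from temperature, wind is independent
rng = np.random.default_rng(1)
temperature = rng.normal(13, 12, 1000)
dew_point = 0.9 * temperature - 8 + rng.normal(0, 4, 1000)
wind = rng.normal(1.7, 1.0, 1000)

scores = vif(np.column_stack([temperature, dew_point, wind]))
print(dict(zip(["temperature", "dew_point", "wind"], scores.round(2))))
```

Large VIF values for a pair of features point to the same conclusion as a high pairwise correlation: one of the two carries little extra information.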


Heatmap

A correlation Heatmap is a type of graphical representation that displays the correlation matrix, which helps to determine the correlation between different variables.

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(bike_df.corr(numeric_only=True),cmap='PiYG' ,annot=True)

In [None]:
bike_df=bike_df.drop(['Dew point temperature(°C)'], axis=1)

In [None]:
bike_df.info()

We can observe on the heatmap that on the target variable line, the most positively correlated variables to the rent are:

* the temperature

* the dew point temperature

* the solar radiation

And the most negatively correlated variables are:

* humidity

* rainfall

From the above correlation heatmap, we see that there is a strong positive correlation of 0.91 between the columns 'Temperature' and 'Dew point temperature'. Because the two columns vary together, dropping one of them should not affect the outcome of our analysis, so we drop the column 'Dew point temperature(°C)'.

Feature Engineering & Data Pre-processing

Create the dummy variables

A dataset may contain various types of values, and sometimes it includes categorical values. To use these values efficiently in programming, we create dummy variables.

One Hot Encoding

In [None]:
#Assign all categorical features to a variable

categorical_features=list(bike_df.select_dtypes(['object', 'category']).columns)
categorical_features=pd.Index(categorical_features)

categorical_features




one hot encoding

One hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.
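As a small illustration of what the encoding produces, the sketch below applies `pd.get_dummies` to a made-up frame (`toy` and its values are hypothetical). With `drop_first=True` the first category in sorted order is dropped and acts as the baseline.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Rented_Bike_Count": [200, 700, 1000, 800],
})

# drop_first=True removes one dummy per feature to avoid perfectly collinear columns
encoded = pd.get_dummies(toy, columns=["Seasons"], prefix="Seasons",
                         drop_first=True, dtype=int)

print(encoded)
```

The Autumn row ends up with 0 in every `Seasons_*` column, which is how the dropped category is represented.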

In [None]:
# create a copy (.copy() makes an independent DataFrame; plain assignment would only alias bike_df)
bike_df_copy= bike_df.copy()

def one_hot_encoding(data, column):

  data= pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
  data= data.drop([column], axis=1)
  return data

for col in categorical_features:
  bike_df_copy =one_hot_encoding(bike_df_copy, col)

bike_df_copy.head()



- Model Training

Train Test split for regression

Before fitting any model, it is a rule of thumb to split the dataset into a training and a test set. This means some proportion of the data goes into training the model and the rest is used to evaluate how the model performs on unseen data. The proportions may vary (60:40, 70:30, 75:25) depending on the person, but the most commonly used is 80:20 for training and testing respectively. In this step we will split our data into training and testing sets using the scikit-learn library.
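The effect of the split ratio can be seen on a tiny example. This sketch uses placeholder arrays (`features` and `target` are made up) with a 75:25 split and a fixed `random_state` so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features
features = np.arange(200).reshape(100, 2)
target = np.arange(100)

# test_size=0.25 keeps 75% for training; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
```

No row appears in both parts, which is what makes the test score an estimate of performance on unseen data.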


In [None]:
#Assign the value in X and y
# X must be reassigned here; otherwise the split would reuse the earlier OLS design matrix named X

X=bike_df_copy.drop(columns=['Rented_Bike_Count'], axis=1)
y=np.sqrt(bike_df_copy['Rented_Bike_Count'])

X.head()

In [None]:
y.head()

In [None]:
#Create test and train data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.25, random_state=0)

print(X_train.shape)

print(X_test.shape)

In [None]:
bike_df_copy.info()

In [None]:
bike_df_copy.describe().columns

The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the "errors") and squaring them. It's called the mean squared error as you're finding the average of a set of errors. The lower the MSE, the better the forecast.

MSE formula:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\text{actual}_i - \text{forecast}_i\right)^2$$

Where:

n = number of items,

Σ = summation notation,

actual = original or observed y-value,

forecast = y-value from the regression.

* Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

* Mean Absolute Error (MAE) is the average of the absolute errors, where errors are the differences between the values predicted by our regression model and the actual values of a variable.

* R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

Formula for R-Squared:

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$

Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
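The metrics described in this section can be computed directly from their definitions. The function below is a minimal NumPy sketch (`regression_metrics` is a hypothetical helper, not part of scikit-learn) applied to five made-up values.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))

    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot

    # Adjusted R2 penalises R2 for the number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}

scores = regression_metrics([3, 5, 7, 9, 11], [2, 5, 8, 9, 12], n_features=1)
print(scores)
```

Note that `n` and `n_features` should describe the same data the predictions were made on, so training metrics use the training set size and test metrics use the test set size.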

LINEAR REGRESSION

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line.

Linear regression uses a linear approach to model the relationship between independent and dependent variables. In simple words its a best fit line drawn over the values of independent variables and dependent variable. In case of single variable, the formula is same as straight line equation having an intercept and slope.

$$y_{pred} = \beta_0 + \beta_1 x$$

where $\beta_0$ and $\beta_1$ are the intercept and slope respectively.

In case of multiple features the formula translates into:

$$y_{pred} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$$

where $x_1, x_2, x_3$ are the feature values and $\beta_0, \beta_1, \beta_2, \dots$ are the weights assigned to each of the features. These become the parameters which the algorithm tries to learn, for example with gradient descent or a direct least-squares solve.
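To show how the weights are learned, here is a sketch that recovers known coefficients from synthetic data using a direct least-squares solve in NumPy. The true values (4, 2, -3) and the data are assumptions chosen for the example.

```python
import numpy as np

# Synthetic data generated from y = 4 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(7)
x = rng.normal(size=(500, 2))
y = 4.0 + 2.0 * x[:, 0] - 3.0 * x[:, 1] + rng.normal(0, 0.1, 500)

# A leading column of ones lets the first weight act as the intercept
design = np.column_stack([np.ones(len(x)), x])

# Least squares finds the weights that minimise the sum of squared errors
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
y_pred = design @ beta

print("intercept and weights:", beta.round(3))
```

The estimated weights land close to the values used to generate the data, which is the behaviour the formula above describes.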

In [None]:
#import the packages
from sklearn.linear_model import LinearRegression
reg =LinearRegression().fit(X_train, y_train)

In [None]:
 #check the score
reg.score(X_train, y_train)


In [None]:
 #check the coefficients
reg.coef_

In [None]:
#get the predictions for X_train and X_test
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)


In [None]:
 #import the packages

from sklearn.metrics import mean_squared_error

#calculate MSE on the training set

MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE:",MSE_lr)

#calculate RMSE

RMSE_lr=np.sqrt(MSE_lr)
print("RMSE:", RMSE_lr)

#calculate MAE

MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE:", MAE_lr)

#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr= r2_score(y_train, y_pred_train)
print("R2:",r2_lr)

#adjusted r2 uses the sample size and feature count of the training set
Adjusted_R2_lr= (1-(1-r2_lr)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
print("Adjusted R2:",Adjusted_R2_lr)

In [None]:
#storing the training set metrics value in a dataframe for later comparison

dict1={'Model': 'Linear regression',

'MAE': round((MAE_lr),3),

'MSE': round((MSE_lr), 3),

'RMSE':round((RMSE_lr),3),

'R2_score':round((r2_lr),3),

'Adjusted R2':round((Adjusted_R2_lr),2)

}

training_df=pd.DataFrame(dict1, index=[1])



In [None]:
#import the packages

from sklearn.metrics import mean_squared_error

#calculate MSE

MSE_lr= mean_squared_error(y_test, y_pred_test)

print("MSE:", MSE_lr)

#calculate RMSE

RMSE_lr=np.sqrt(MSE_lr)

print("RMSE:", RMSE_lr)

#calculate MAE

MAE_lr= mean_absolute_error(y_test, y_pred_test)

print("MAE:",MAE_lr)

#import the packages

from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_lr= r2_score((y_test), (y_pred_test))

Adjusted_R2_lr= (1-(1-r2_score(y_test, y_pred_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :", Adjusted_R2_lr)

In [None]:
#storing the test set metrics value in a dataframe for later comparison
dict2={'Model': 'Linear regression',

'MAE': round((MAE_lr),3),

'MSE': round((MSE_lr), 3),

'RMSE':round((RMSE_lr),3),

'R2_score':round((r2_lr),3),

'Adjusted R2':round((Adjusted_R2_lr),2)

}

test_df=pd.DataFrame(dict2, index=[1])

In [None]:
 ### Heteroscedasticity - Residual plot

plt.scatter((y_pred_test), (y_test)-(y_pred_test))

plt.xlabel('Predicted Values')

plt.ylabel('Residuals')

plt.title('Residual Plot')

plt.show()

In [None]:
# Actual vs predicted values for Linear Regression plot

plt.figure(figsize=(10,8))

plt.plot(y_pred_test)

plt.plot(np.array(y_test))

plt.legend(["Predicted", "Actual"])

plt.xlabel('No of Test Data')

plt.show()

Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


In [None]:
# Create an instance of Lasso Regression implementation
lasso= Lasso(alpha=1.0, max_iter=3000)

# Fit the Lasso model
lasso.fit(X_train, y_train)
#Create the model score
print(lasso.score(X_test, y_test), lasso.score(X_train, y_train))

In [None]:
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error

#calculate MSE on the training set

MSE_l= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE:", MSE_l)

#calculate RMSE

RMSE_l=np.sqrt(MSE_l)
print("RMSE:", RMSE_l)

#calculate MAE

MAE_l=mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE:",MAE_l)

from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_l=r2_score(y_train, y_pred_train_lasso)

print("R2 :",r2_l)

#adjusted r2 uses the sample size and feature count of the training set
Adjusted_R2_l= (1-(1-r2_l)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

print("Adjusted R2 :",Adjusted_R2_l)

In [None]:
# storing the training set metrics value in a dataframe for later comparison

dict1={'Model': 'Lasso regression', 'MAE':round((MAE_l),3),

'MSE':round((MSE_l),3),

'RMSE': round((RMSE_l),3),

'R2_score':round((r2_l),3),

'Adjusted R2': round((Adjusted_R2_l),2)

}
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
training_df=pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)



In [None]:
from sklearn.metrics import mean_squared_error

#calculate MSE on the test set

MSE_l= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE:",MSE_l)

#calculate RMSE

RMSE_l=np.sqrt(MSE_l)
print("RMSE:", RMSE_l)

#calculate MAE

MAE_l= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE:",MAE_l)

from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_l=r2_score((y_test), (y_pred_test_lasso))
print("R2:",r2_l)
#adjusted r2 is based on the test set r2, not the training set r2
Adjusted_R2_l= (1-(1-r2_l)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_l)

In [None]:
# storing the test set metrics value in a dataframe for later comparison

dict2={'Model': 'Lasso regression',

      'MAE': round((MAE_l),3),
      'MSE': round((MSE_l),3),

      'RMSE': round((RMSE_l),3),

      'R2_score': round((r2_l),3),

      'Adjusted R2':round((Adjusted_R2_l),2),

    }
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
test_df=pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)


In [None]:
 ### Heteroscedasticity - Residual plot
plt.scatter((y_pred_test_lasso), (y_test-y_pred_test_lasso))

plt.xlabel('Predicted Values')

plt.ylabel('Residuals')

plt.title('Residual Plot')
plt.show()

In [None]:
#Plot the figure

plt.figure(figsize=(10,8))

plt.plot(np.array(y_pred_test_lasso))

plt.plot(np.array((y_test)))

plt.legend(["Predicted", "Actual"])

plt.show()

- RIDGE REGRESSION

Ridge regression is a method of estimating the coefficients of regression models in scenarios where the independent variables are highly correlated. It uses the linear regression model with the L2 regularization method.
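To illustrate the effect of the L2 penalty on correlated features, the sketch below solves ridge regression in closed form with NumPy on synthetic data. The helper `ridge_fit` and the two nearly identical columns are assumptions for the example; it minimises the sum of squared errors plus `alpha` times the squared size of the weights, with the intercept left unpenalised.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; data are centred so the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Two almost identical features make the unpenalised weights unstable
rng = np.random.default_rng(3)
base = rng.normal(size=300)
X = np.column_stack([base, base + rng.normal(0, 0.05, 300)])
y = 3 * base + rng.normal(0, 0.5, 300)

coef_small, _ = ridge_fit(X, y, alpha=1e-8)   # practically ordinary least squares
coef_ridge, _ = ridge_fit(X, y, alpha=10.0)   # noticeable shrinkage

print("alpha ~ 0 :", coef_small.round(3))
print("alpha = 10:", coef_ridge.round(3))
```

A larger `alpha` shrinks the weights and spreads the effect across the correlated columns, at the cost of a small bias.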

In [None]:
#import the packages
from sklearn.linear_model import Ridge

ridge= Ridge(alpha=0.1)





In [None]:
#FIT THE MODEL
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train, y_train)

In [None]:
#get the X_train and X-test value
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
y_pred_train_ridge

In [None]:
y_pred_test_ridge

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

#calculate MSE on the training set (these values are stored in training_df below)

MSE_r=mean_squared_error(y_train, y_pred_train_ridge)

print("MSE:",MSE_r)

#calculate RMSE

RMSE_r=np.sqrt(MSE_r)
print("RMSE:", RMSE_r)

#calculate MAE

MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE:",MAE_r)

from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_r=r2_score(y_train, y_pred_train_ridge)
print("R2:",r2_r)
Adjusted_R2_r= 1-(1-r2_r)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_r)

In [None]:
dict1={'Model': 'Ridge regression',

'MAE': round((MAE_r),3),

'MSE':round((MSE_r),3),

'RMSE':round((RMSE_r),3),

'R2_score': round((r2_r),3),

'Adjusted R2':round((Adjusted_R2_r),2)}

training_df=training_df.append(dict1, ignore_index=True)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error

#calculate MSE

MSE_r =mean_squared_error(y_test, y_pred_test_ridge)

print("MSE:",MSE_r)

#calculate RMSE

RMSE_r=np.sqrt(MSE_r)
print("RMSE:", RMSE_r)

#calculate MAE

MAE_r =mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE:", MAE_r)

#import the packages
from sklearn.metrics import r2_score

#calculate r2 and adjusted r2
r2_r=r2_score((y_test), (y_pred_test_ridge))

In [None]:
print("R2 :",r2_r)

Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
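
The adjusted R2 expression is repeated in every metrics cell, which makes it easy to mix up train and test arrays. A minimal sketch of a reusable helper follows; the function name `adjusted_r2` and the small arrays are hypothetical and only illustrate the formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Made-up values, only to exercise the helper.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

adj_demo = adjusted_r2(y_true_demo, y_pred_demo, n_features=2)
print("R2:", r2_score(y_true_demo, y_pred_demo))
print("Adjusted R2:", adj_demo)
```

With the notebook's variables, the test-set value would be obtained as `adjusted_r2(y_test, y_pred_test_ridge, X_test.shape[1])`, and the training-set value by passing `y_train`, `y_pred_train_ridge`, and `X_train.shape[1]`.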

In [None]:
 # storing the test set metrics value in a dataframe for later comparison

dict2={'Model': 'Ridge regression',

        'MAE':round ((MAE_r),3),

        'MSE': round((MSE_r),3),

        'RMSE' : round((RMSE_r),3),

        'R2_score': round((r2_r),3),

        'Adjusted R2': round((Adjusted_R2_r),2)}

test_df=test_df.append(dict2, ignore_index=True)
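
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `append` calls in this notebook only run on older pandas versions. A possible replacement based on `pd.concat` is sketched below; the helper name `add_metrics_row`, the frame `metrics_demo_df`, and the metric values are placeholders, not results from this project.

```python
import pandas as pd

def add_metrics_row(df, row):
    """Return a new DataFrame with one metrics dict appended as a row."""
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# Placeholder values only; they are not results from the notebook.
metrics_demo_df = pd.DataFrame([{'Model': 'Lasso regression', 'MAE': 1.0}])
metrics_demo_df = add_metrics_row(
    metrics_demo_df, {'Model': 'Ridge regression', 'MAE': 2.0}
)
print(metrics_demo_df)
```

In the notebook this would read `test_df = add_metrics_row(test_df, dict2)`.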

In [None]:
### Heteroscedasticity - Residual plot

plt.scatter((y_pred_test_ridge), (y_test)-(y_pred_test_ridge))

plt.xlabel('Predicted Values')

plt.ylabel('Residuals')

plt.title('Residual Plot')

plt.show()
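
A residual plot is judged by eye, so a rough numeric companion can help. The sketch below splits residuals at the median predicted value and compares their spread in the two halves; a large gap hints at heteroscedasticity. The arrays are synthetic stand-ins (noise deliberately grows with the prediction), and the median split is an informal check assumed here for illustration, not a formal statistical test.

```python
import numpy as np

# Synthetic stand-ins for predictions and residuals (assumed for illustration).
rng = np.random.default_rng(0)
pred_demo = rng.uniform(1, 10, size=500)
resid_demo = rng.normal(scale=0.5 * pred_demo)  # noise scale grows with prediction

# Compare the residual spread below and above the median prediction.
cutoff = np.median(pred_demo)
std_low = resid_demo[pred_demo <= cutoff].std()
std_high = resid_demo[pred_demo > cutoff].std()
print(f"residual std (low predictions):  {std_low:.3f}")
print(f"residual std (high predictions): {std_high:.3f}")
```

With the notebook's variables, `pred_demo` would be `y_pred_test_ridge` and `resid_demo` would be `np.array(y_test) - y_pred_test_ridge`.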

In [None]:
#Plot the figure

plt.figure(figsize=(10,8))

plt.plot((y_pred_test_ridge))

plt.plot((np.array(y_test)))

plt.legend(["Predicted", "Actual"])

plt.show()

ELASTIC NET REGRESSION


Elastic Net regression is a linear regression model that combines both L1 (Lasso) and L2 (Ridge) regularization penalties to overcome some of the limitations of each individual method.

The model has two hyperparameters, alpha and l1_ratio: alpha sets the overall strength of the penalty, and l1_ratio sets the mix between the L1 and L2 terms. Elastic Net regression is particularly useful when dealing with datasets that have high dimensionality and multicollinearity between features.
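
To make the role of the two hyperparameters concrete, the sketch below converts scikit-learn's `alpha` and `l1_ratio` into separate L1 and L2 weights, and checks that `l1_ratio=1.0` reduces Elastic Net to Lasso. The synthetic data and the names `a` and `b` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumed for illustration, not the bike dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# scikit-learn penalizes a * ||w||_1 + 0.5 * b * ||w||_2^2,
# with alpha = a + b and l1_ratio = a / (a + b).
alpha, l1_ratio = 0.1, 0.5
a = alpha * l1_ratio          # weight on the L1 term
b = alpha * (1 - l1_ratio)    # weight on the L2 term
print("L1 weight a:", a, "L2 weight b:", b)

# With l1_ratio=1.0 the L2 term vanishes and the model is a Lasso.
en_as_lasso = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
same_as_lasso = np.allclose(en_as_lasso.coef_, lasso.coef_)
print("Matches Lasso:", same_as_lasso)
```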

In [None]:
from sklearn.linear_model import ElasticNet

#penalty = a * L1 + b * L2

#alpha = a + b and l1_ratio = a/(a+b)
elasticnet =ElasticNet(alpha=0.1, l1_ratio=0.5)


In [None]:
elasticnet.fit(X_train,y_train)


In [None]:
#check the score
elasticnet.score(X_train, y_train)

In [None]:
#get the predictions for X_train and X_test
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

print(y_pred_train_en)
print(y_pred_test_en)

In [None]:

#import the packages

from sklearn.metrics import mean_squared_error

#calculate MSE

MSE_e= mean_squared_error((y_train), (y_pred_train_en))
print("MSE:", MSE_e)

#calculate RMSE

RMSE_e=np.sqrt(MSE_e)
print("RMSE:", RMSE_e)

#calculate MAE

MAE_e=mean_absolute_error(y_train, y_pred_train_en)
print("MAE:", MAE_e)

#import the packages

from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_e= r2_score(y_train, y_pred_train_en)

print("R2:",r2_e)

#adjusted r2 on the training set, to match the other training metrics in this cell
Adjusted_R2_e=1-(1-r2_e)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_e)


In [None]:
#storing the training set metrics value in a dataframe for later comparison

dict1={'Model': 'Elastic net regression',

      'MAE': round((MAE_e),3),

      'MSE': round((MSE_e),3),

      'RMSE': round((RMSE_e),3),

      'R2_score': round((r2_e),3),

      'Adjusted R2':round((Adjusted_R2_e),2)}
training_df=training_df.append(dict1, ignore_index=True)

In [None]:
from sklearn.metrics import mean_squared_error
MSE_e =mean_squared_error(y_test, y_pred_test_en)
print("MSE:",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE:", RMSE_e)

#calculate MAE

MAE_e =mean_absolute_error(y_test, y_pred_test_en)
print("MAE:",MAE_e)

#import the packages
from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_e= r2_score((y_test), (y_pred_test_en))
print("R2 :",r2_e)
Adjusted_R2_e=(1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))


In [None]:
# storing the test set metrics value in a dataframe for later comparison

dict2={'Model': 'Elastic net regression Test',

        'MAE': round((MAE_e),3),

        'MSE': round((MSE_e),3),

        'RMSE': round((RMSE_e),3),

        'R2_score': round((r2_e),3),

        'Adjusted R2': round((Adjusted_R2_e),2)}

test_df=test_df.append(dict2, ignore_index=True)



In [None]:
### Heteroscedasticity - Residual plot

plt.scatter((y_pred_test_en), (y_test)-(y_pred_test_en))

plt.xlabel('Predicted Values')

plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

In [None]:
#Plot the figure

plt.figure(figsize=(10,8))

plt.plot(np.array(y_pred_test_en))

plt.plot((np.array(y_test)))

plt.legend(["Predicted", "Actual"])

plt.show()

DECISION TREE

A decision tree is a type of supervised machine learning algorithm that is commonly used for classification and regression tasks. It works by recursively splitting the data into subsets based on the values of certain attributes, ultimately arriving at a set of decision rules that can be used to classify or predict outcomes for new data.
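
The recursive splitting described above can be seen on a toy problem. The sketch below fits a depth-1 regression tree to a synthetic step function and prints the learned rule with `export_text`. The data and the feature name `x` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic step function (assumed for illustration): y jumps from 1 to 3 at x = 5.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.where(X_demo[:, 0] < 5, 1.0, 3.0)

# A single split is enough to recover the step.
tree_demo = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Print the learned decision rule as text.
rules = export_text(tree_demo, feature_names=["x"])
print(rules)

# Predict on one point from each side of the step.
preds = tree_demo.predict([[2.0], [8.0]])
print(preds)
```

Deeper trees repeat the same step inside each branch, which is what parameters such as `max_depth` and `max_leaf_nodes` limit.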


In [None]:
from sklearn.tree import DecisionTreeRegressor

decision_regressor= DecisionTreeRegressor(criterion='friedman_mse', max_depth=8,
                                          max_features=9, max_leaf_nodes=100,)

decision_regressor.fit(X_train, y_train)


In [None]:
#get the predictions for X_train and X_test

y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d =decision_regressor.predict(X_test)

print(y_pred_train_d)

print(y_pred_test_d)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
print("Model Score:", decision_regressor.score(X_train,y_train))

#calculate MSE

MSE_d= mean_squared_error(y_train, y_pred_train_d)
print("MSE:",MSE_d)


RMSE_d=np.sqrt(MSE_d)
print("RMSE:", RMSE_d)

#calculate MAE

MAE_d= mean_absolute_error(y_train, y_pred_train_d)
print("MAE:", MAE_d)



from sklearn.metrics import r2_score

#calculate r2 and adjusted r2

r2_d=r2_score(y_train, y_pred_train_d)


print("R2:",r2_d)

#adjusted r2 on the training set, to match the other training metrics in this cell
Adjusted_R2_d=1-(1-r2_d)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_d)

In [None]:
# storing the training set metrics value in a dataframe for later comparison

dict1={'Model': 'Decision tree regression',

"MAE": round((MAE_d),3),

"MSE" :round ((MSE_d),3),

"RMSE": round ((RMSE_d),3 ),

'R2_score':round((r2_d),3),

'Adjusted R2': round((Adjusted_R2_d),2)

}

training_df=training_df.append(dict1,ignore_index=True)

In [None]:
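#calculate the test set metrics, then store them in a dataframe for later comparison
#(without this step the training values above would be stored in test_df)

MSE_d= mean_squared_error(y_test, y_pred_test_d)
RMSE_d=np.sqrt(MSE_d)
MAE_d= mean_absolute_error(y_test, y_pred_test_d)
r2_d=r2_score(y_test, y_pred_test_d)
Adjusted_R2_d=1-(1-r2_d)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))

dict2={"Model": 'Decision tree regression',

"MAE":round ((MAE_d),3),

'MSE': round ((MSE_d),3),

"RMSE":round (( RMSE_d),3),

"R2_score" : round((r2_d),3),

'Adjusted R2': round((Adjusted_R2_d),2)

}
test_df=test_df.append(dict2,ignore_index=True)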

In [None]:
### Heteroscedasticity - Residual plot

plt.scatter((y_pred_test_d), (y_test)-(y_pred_test_d))

plt.xlabel('Predicted Values')

plt.ylabel('Residuals')

plt.title('Residual Plot')

plt.show()

In [None]:
#Plot the figure

plt.figure(figsize=(10,8))

plt.plot((np.array(y_pred_test_d)))

plt.plot(np.array((y_test)))

plt.legend(["Predicted", "Actual"])

plt.show()

# **Conclusion**

- Conclusion

During our analysis, we conducted an initial exploratory data analysis (EDA) on all the features in our dataset. First, we analysed our dependent variable 'Rented Bike count' and applied transformations as necessary. We then examined the categorical variables and removed those dominated by a single class. We also studied the numerical variables, examining their correlations, distributions, and relationships with the dependent variable. Additionally, we removed some numerical features that contained mostly 0 values and applied one-hot encoding to the categorical variables.

Subsequently, we employed 7 machine learning algorithms, including Linear Regression, Lasso, Ridge, Elastic Net, and Decision Tree.