<a href="https://colab.research.google.com/github/Ritesh-saini74/Ds-project/blob/main/Copy_of_Machine_Learning_Reg_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Bike sharing demand prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# Problem Statement

The goal is to predict the number of bikes that will be rented from a bike-sharing system at a given time, based on factors such as weather, day of the week, and time of day. An accurate forecast of rental demand helps optimize bike allocation and improve the bike-sharing system's overall efficiency.

The analysis may involve answering specific questions such as:

What is the expected demand for bikes during peak hours, on weekdays, or on weekends?

How does weather (e.g., wind, temperature, precipitation) affect bike rental demand?

Are there specific locations or routes with higher or lower demand for bikes?

How can we optimize the bike-sharing system to meet fluctuating demand and minimize operational costs?

Can the bike-sharing system expand or improve to better serve users' needs and promote sustainable transportation?

In short, the analysis involves predicting bike rental demand and using that forecast to improve the system's bike allocation, efficiency and sustainability.

So our aims are:

To create a model of the demand for shared bikes with the available independent variables.

To understand the demand dynamics of the market using the model.

# **GitHub Link -**

https://github.com/Ritesh-saini74/Data-science---Project-

# ***Let's Begin !***

## *** Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss


from datetime import datetime
import datetime as dt


### Dataset Loading

In [None]:
# load the data
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dataset = pd.read_csv('/content/SeoulBikeData (1).csv',encoding = 'latin1')

### Dataset First View

In [None]:
# dataset
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset with rows and columns
dataset.shape

There are 8760 rows and 14 columns in our data

### Dataset Information

In [None]:
# Dataset info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset duplicate value counts

duplicate_count = dataset.duplicated().sum()
print(f'The number of duplicate rows in the data is {duplicate_count}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# isna() and isnull() are aliases, so one call is enough
dataset.isnull().sum()

In [None]:
# Visualizing the missing values

sns.heatmap(dataset.isnull(), cbar=False)

**The heatmap shows that there are no null values in the data.**


### What did you know about your dataset?

This dataset contains 8760 rows and 14 columns.

The dataset has no duplicate rows.

There are no null or missing values in our data.

A day has 24 hours and the data covers 365 days, so 365 multiplied by 24 = 8760, which matches the number of rows in the dataset.
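The row-count argument above can be checked with a few lines of standard-library Python. This is only a sketch: the start and end dates are the ones documented for the Date column (01/12/2017 to 30/11/2018), and the variable names are illustrative.

```python
from datetime import date

# Date range covered by the dataset (assumed: 01/12/2017 to 30/11/2018, inclusive)
start = date(2017, 12, 1)
end = date(2018, 11, 30)

n_days = (end - start).days + 1   # inclusive day count
expected_rows = n_days * 24       # one row per hour of each day

print(n_days, expected_rows)
```

If `dataset.shape[0]` differs from `expected_rows`, some hours are missing or duplicated.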


## *** Understanding Your Variables***

In [None]:
# Dataset Columns

dataset.columns

In [None]:
# Dataset Describe

dataset.describe(include = 'all')

### Variables Description

Date : The date of the observation, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type : str. We need to convert it into datetime format.

Rented Bike Count : Number of bikes rented per hour. This is our dependent variable, which we need to predict, type : int

Hour: The hour of the day, from 0 to 23, type : int. We need to convert it into a category data type.

Temperature(°C): Temperature in Celsius, type : Float

Humidity(%): Humidity in the air in %, type : int

Wind speed (m/s) : Speed of the wind in m/s, type : Float

Visibility (10m): Visibility in units of 10 m, type : int

Dew point temperature(°C): Temperature to which the air must cool to become saturated with water vapour, in Celsius, type : Float

Solar Radiation (MJ/m2): Solar radiation in MJ/m2, type : Float

Rainfall(mm): Amount of rain in mm, type : Float

Snowfall (cm): Amount of snow in cm, type : Float

Seasons: Season of the year, type : str. There are only 4 seasons in the data.

Holiday: Whether the day is a holiday or not, type: str

Functioning Day: Whether the day is a functioning day or not, type : str

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

dataset.nunique()

##  ***Data Wrangling***

# Rename the complex column names


In [None]:
# Rename the complex column names

dataset=dataset.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

# Splitting the "Date" column into "year", "month" and "day" columns


In [None]:
# Parse the "Date" strings (DD/MM/YYYY) into datetime values

dataset['Date'] = pd.to_datetime(dataset['Date'], format="%d/%m/%Y")

In [None]:
dataset['year'] = dataset['Date'].dt.year
dataset['month'] = dataset['Date'].dt.month
dataset['day'] = dataset['Date'].dt.day_name()

In [None]:
# Create a new "weekdays_weekend" column (1 = Saturday/Sunday, 0 = weekday) and drop the "Date", "day" and "year" columns

dataset['weekdays_weekend'] = dataset['day'].apply(lambda x: 1 if x in ('Saturday', 'Sunday') else 0)
dataset = dataset.drop(columns=['Date', 'day', 'year'])

So we split the "Date" column into 3 different columns, i.e. "year", "month" and "day".

The "year" column contains only 2 unique values, because the data runs from December 2017 to November 2018. If we treat this period as a single year, we don't need the "year" column, so we drop it.

The "day" column contains the name of the weekday for each date. For our purposes we don't need the individual day, only whether a day is a weekday or a weekend, so we convert it into that format and drop the "day" column.
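The weekday/weekend derivation described above can also be sketched without going through day names. The snippet below is an illustrative alternative on a toy frame (the `toy` name and its four dates are assumptions, not part of the dataset code); it uses the pandas `dt.dayofweek` accessor.

```python
import pandas as pd

# Toy frame with dates in the same DD/MM/YYYY format as the dataset
toy = pd.DataFrame({'Date': ['01/12/2017', '02/12/2017', '03/12/2017', '04/12/2017']})
toy['Date'] = pd.to_datetime(toy['Date'], format='%d/%m/%Y')

# dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
toy['weekdays_weekend'] = (toy['Date'].dt.dayofweek >= 5).astype(int)

print(toy)
```

01/12/2017 was a Friday, so the flags for the four toy dates come out as weekday, weekend, weekend, weekday.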

In [None]:
dataset.info()

In [None]:
dataset['weekdays_weekend'].value_counts()

The "Hour", "month" and "weekdays_weekend" columns are shown as integer data types, but they are actually categorical. We need to change their data type. Otherwise, further analysis and correlations would treat them as continuous numbers and the results could mislead us.

In [None]:
# Change the int64 columns into category columns

cols=['Hour','month','weekdays_weekend']
for col in cols:
  dataset[col]=dataset[col].astype('category')

In [None]:
dataset['weekdays_weekend'].unique()

### What all manipulations have you done and insights you found?



*   Firstly, we renamed the variables (features) with complex names
*   The "Date" column was in str (object) format, so we parsed it and derived the 'month' and 'weekdays_weekend' columns from it
*   The "Hour", "month" and "weekdays_weekend" columns were shown as integer data types, but we want them as categorical data, so we converted them







## *** Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Our dependent variable is "Rented Bike Count", so we need to analyse this column against the other columns using visualisation plots. First we analyse the categorical columns, then we proceed with the numerical columns.

In [None]:
#analysis of data by visualisation (Month)

fig,ax=plt.subplots(figsize=(12,8))
sns.barplot(data=dataset,x='month',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Month')

##### 1. Why did you pick the specific chart?

BAR PLOT - It shows the relationship between a numeric and a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

From the above bar plot we can see that demand for rented bikes is higher from month 5 to month 10 than in the other months. These are the warmer months of the year, centred on summer.

In [None]:
#analysis of data by visualisation (weekdays_weekend)

fig,ax=plt.subplots(figsize=(12,8))
sns.pointplot(data=dataset,x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented bikes according to weekdays_weekend')

##### 1. Why did you pick the specific chart?

Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.

##### 2. What is/are the insight(s) found from the chart?

In the above point plot, the blue line represents weekdays. It shows that demand for bikes is higher on weekdays, most likely because of office commuting.

Peak times are 7 am to 9 am and 5 pm to 7 pm.

The orange line represents weekend days. It shows that demand for rented bikes is very low in the morning hours, and increases slightly in the evening from 4 pm to 8 pm.

In [None]:
#analysis of data by visualisation (Hour)

fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(data=dataset,x='Hour',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Hour')

##### 1. Why did you pick the specific chart?

It allows you to compare different sets of data among different groups easily.

##### 2. What is/are the insight(s) found from the chart?

The above plot shows the use of rented bikes by hour of the day, using data from the whole year.

People generally use rented bikes during commuting hours, from 7am to 9am and from 5pm to 7pm.

In [None]:
#analysis of data by visualisation (Functioning day)

fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=dataset,x='Functioning_Day',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Functioning Day')

##### 1. Why did you pick the specific chart?

It allows you to compare different sets of data among different groups easily.


##### 2. What is/are the insight(s) found from the chart?

The above bar plot shows the use of rented bikes on functioning and non-functioning days. It clearly shows that people don't use rented bikes on non-functioning days.

In [None]:
#analysis of data by visualisation (Season)

fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=dataset,x='Hour',y='Rented_Bike_Count',hue='Seasons',ax=ax)
ax.set(title='Count of Rented bikes according to Seasons')

In [None]:
#analysis of data by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(data=dataset,x='Seasons',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Seasons')

##### 1. Why did you pick the specific chart?

Bar graphs and point plots summarise a large set of data in a simple visual form.
A bar graph displays the average value for each category, whereas a point plot makes it easier to follow how the value changes from one point to the next.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot and point plot show the use of rented bikes in the four seasons.
In summer the use of rented bikes is high, with peak times at 7am-9am and 5pm-7pm.
In winter the use of rented bikes is very low, most likely because of the cold weather and snowfall.

In [None]:
#analysis of data by visualisation (Holiday)

fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=dataset,x='Holiday',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Holiday')

In [None]:
fig,ax=plt.subplots(figsize=(12,8))
sns.pointplot(data=dataset,x='Hour',y='Rented_Bike_Count',hue='Holiday',ax=ax)
ax.set(title='Count of Rented bikes according to Holiday')

##### 1. Why did you pick the specific chart?

Bar graphs and point plots summarise a large set of data in a simple visual form.
A bar graph displays the average value for each category, whereas a point plot makes it easier to follow how the value changes from one point to the next.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot and point plot show the use of rented bikes on holidays and non-holidays.
On holidays, people use rented bikes mostly from 2pm to 8pm.

# Analysis of Numerical Variables

What is numerical data?

Numerical data is data expressed in numbers rather than in natural-language descriptions. It is sometimes called quantitative data. What sets it apart from other data that merely looks like numbers is that arithmetic operations on it are meaningful.

In [None]:
# assign the numerical columns to a variable

numerical_columns=list(dataset.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

In [None]:
# plot histograms with density curves to analyze the distribution of all numerical features
# (sns.distplot is deprecated; sns.histplot with kde=True is its replacement)
for col in numerical_features:
  plt.figure(figsize=(7,5))
  sns.histplot(x=dataset[col], kde=True, stat='density')
  plt.xlabel(col)
  plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) shows the univariate distribution of a variable.

##### 2. What is/are the insight(s) found from the chart?

The plots show whether each feature is approximately normally distributed or skewed.

# Analysis of Numerical Variables vs. Rented_Bike_Count

In [None]:
#print the plot to analyze the relationship between "Rented_Bike_Count" and "Temperature"

plt.figure(figsize=(8,6))
sns.scatterplot(x='Temperature',y='Rented_Bike_Count',data = dataset)
plt.title('Temperature vs Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numerical variables, one point per observation.


##### 2. What is/are the insight(s) found from the chart?

From the above plot we see that rentals are highest when it is warm, at around 25°C.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Dew_point_temperature"
# (select the target column before .mean() so that text columns are not averaged)

dataset.groupby('Dew_point_temperature')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

We group the data by the feature and take the mean rented bike count for each value. The line chart is the default chart produced by pandas `.plot()`.


##### 2. What is/are the insight(s) found from the chart?

The plot for 'Dew_point_temperature' looks almost the same as the one for 'Temperature'. The two variables appear to be similar, which we check in a later step.

In [None]:
#print the plot to analyze the relationship between "Rented_Bike_Count" and "Solar_Radiation"

plt.figure(figsize=(8,6))
sns.scatterplot(x='Solar_Radiation',y='Rented_Bike_Count',data = dataset)
plt.title('Solar_Radiation vs Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numerical variables, one point per observation.


##### 2. What is/are the insight(s) found from the chart?

From the above plot we see that the number of rented bikes is high when there is solar radiation, with counts of around 1000.

In [None]:
#print the plot to analyze the relationship between "Rented_Bike_Count" and "Snowfall"

plt.figure(figsize=(8,6))
sns.scatterplot(x='Snowfall',y='Rented_Bike_Count',data = dataset)
plt.title('Snowfall vs Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numerical variables, one point per observation.


##### 2. What is/are the insight(s) found from the chart?

We can see from the plot that the number of rented bikes is much lower when there is more than 4 cm of snow.

In [None]:
#print the plot to analyze the relationship between "Rented_Bike_Count" and "Rainfall"

plt.figure(figsize=(8,6))
sns.scatterplot(x='Rainfall',y='Rented_Bike_Count',data = dataset)
plt.title('Rainfall vs Rented Bike Count')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numerical variables, one point per observation.


##### 2. What is/are the insight(s) found from the chart?

We can see from the above plot that demand for rented bikes does not fall to zero even when it rains a lot. For example, even at around 20 mm of rain there is a peak of rented bikes. Hours with this much rain are likely to be rare, so this peak should be read with caution.

In [None]:
# plot the mean "Rented_Bike_Count" for each value of "Wind_speed"
# (select the target column before .mean() so that text columns are not averaged)

dataset.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

We group the data by the feature and take the mean rented bike count for each value. The line chart is the default chart produced by pandas `.plot()`.


##### 2. What is/are the insight(s) found from the chart?

We can see from the above plot that demand for rented bikes is fairly uniform across wind speeds. The mean demand rises at a wind speed of about 7 m/s, which may suggest that people like to ride when it is a little windy, although such high wind speeds are likely to have few observations.

# Regression plot

Regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. As the name suggests, a regression plot draws a fitted regression line between two variables and helps to visualize their linear relationship.
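As a rough illustration of the line that a regression plot draws, the sketch below fits a least-squares straight line with NumPy on toy data. The arrays `x` and `y` are made-up values, not columns of the dataset. The sign of the slope is what tells us whether the relationship is positive or negative.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(x, y, 1)

print(slope, intercept)
```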

In [None]:
#printing the regression plot for all the numerical features

for col in numerical_features:
  fig,ax=plt.subplots(figsize=(6,4))
  sns.regplot(x=dataset[col],y=dataset['Rented_Bike_Count'],scatter_kws={"color": 'orange'}, line_kws={"color": "black"})

What is/are the insight(s) found from the chart?


From the above regression plots of all numerical features we see that the columns 'Temperature', 'Wind_speed', 'Visibility', 'Dew_point_temperature' and 'Solar_Radiation' are positively related to the target variable, which means the rented bike count increases as these features increase.

'Rainfall', 'Snowfall' and 'Humidity' are negatively related to the target variable, which means the rented bike count decreases as these features increase.

# Normalise Rented_Bike_Count column data

Here, normalising means transforming the target column so that its distribution is closer to a normal distribution. We first look at the distribution and outliers of Rented_Bike_Count and then apply a transformation to reduce its skewness.

In [None]:
#Distribution plot of Rented Bike Count
# (sns.distplot is deprecated; sns.histplot with kde=True is its replacement)

plt.figure(figsize=(7,5))
ax=sns.histplot(dataset['Rented_Bike_Count'], kde=True, stat='density', color="y")
ax.set(xlabel='Rented_Bike_Count', ylabel='Density')
ax.axvline(dataset['Rented_Bike_Count'].mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(dataset['Rented_Bike_Count'].median(), color='black', linestyle='dashed', linewidth=2)
plt.show()

What is/are the insight(s) found from the chart?


The above graph shows that Rented Bike Count has moderate right skewness. Linear regression assumes that the residuals are approximately normally distributed, and a strongly skewed dependent variable often leads to skewed residuals, so we transform the target to bring its distribution closer to normal.

In [None]:
#Boxplot of Rented Bike Count to check outliers

plt.figure(figsize=(10,6))
sns.boxplot(x=dataset['Rented_Bike_Count'])
plt.xlabel('Rented_Bike_Count')
plt.show()

What is/are the insight(s) found from the chart?


The above boxplot shows that there are outliers in Rented Bike Count.
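A boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is seaborn's default whisker setting. The sketch below applies the same rule with NumPy; the helper name `iqr_outliers` and the toy values are illustrative assumptions.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying beyond whis * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

toy_counts = [1, 2, 3, 4, 5, 100]
print(iqr_outliers(toy_counts))
```

The same helper could be called on `dataset['Rented_Bike_Count']` to count the flagged rows.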

In [None]:
#Applying square root to Rented Bike Count to reduce skewness
# (sns.distplot is deprecated; sns.histplot with kde=True is its replacement)

plt.figure(figsize=(10,8))

sqrt_counts = np.sqrt(dataset['Rented_Bike_Count'])
ax=sns.histplot(sqrt_counts, kde=True, stat='density', color="y")
ax.set(xlabel='sqrt(Rented Bike Count)', ylabel='Density')
ax.axvline(sqrt_counts.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(sqrt_counts.median(), color='black', linestyle='dashed', linewidth=2)

plt.show()

What is/are the insight(s) found from the chart?


A square-root transform is a common way to reduce right skewness. After applying the square root to the skewed Rented Bike Count, we get an almost normal distribution.
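The effect of the square-root transform on skewness can be illustrated numerically. This is a sketch on synthetic data: the gamma-distributed `toy_counts` only stand in for a right-skewed, non-negative count column, and the `skewness` helper is an illustrative implementation of the usual moment-based formula.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed, non-negative data standing in for the rental counts
rng = np.random.default_rng(0)
toy_counts = rng.gamma(shape=2.0, scale=300.0, size=10_000)

skew_raw = skewness(toy_counts)
skew_sqrt = skewness(np.sqrt(toy_counts))
print(skew_raw, skew_sqrt)
```

On this toy sample the skewness after the square root is clearly closer to zero than before it. For the real data, pandas' `Series.skew()` gives a comparable number.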

In [None]:
#After applying sqrt to Rented Bike Count, check whether we still have outliers

plt.figure(figsize=(10,6))

sns.boxplot(x=np.sqrt(dataset['Rented_Bike_Count']))
plt.xlabel('sqrt(Rented_Bike_Count)')
plt.show()

After applying the square root to the Rented Bike Count column, the boxplot shows no outliers.

In [None]:
# correlation matrix of the numerical columns only
dataset.corr(numeric_only=True)

# Checking correlation between variables

# Checking with an OLS model

Ordinary least squares (OLS) regression is a statistical method that estimates the relationship between one or more independent variables and a dependent variable.

In [None]:
#import the module
#assign the 'x','y' value

import statsmodels.api as sm
X = dataset[[ 'Temperature','Humidity',
       'Wind_speed', 'Visibility','Dew_point_temperature',
       'Solar_Radiation', 'Rainfall', 'Snowfall']]
Y = dataset['Rented_Bike_Count']
dataset.head()

In [None]:
#add a constant column
X = sm.add_constant(X)
X

In [None]:
## fit a OLS model

model= sm.OLS(Y, X).fit()
model.summary()

R-squared and adjusted R-squared are close to each other. About 40% of the variance in the Rented Bike Count is explained by the model.

For the F-statistic, the p-value is less than 0.05, so the model as a whole is significant at the 5% level.

The p-values of dew point temperature and visibility are very high, so these variables are not significant.

The Omnibus test checks the skewness and kurtosis of the residuals. Here the Omnibus value is high, which indicates that the residuals are skewed.

The condition number is large, 3.11e+04. This might indicate strong multicollinearity or other numerical problems.

The Durbin-Watson statistic tests for autocorrelation of the residuals. Here the value is less than 0.5, which points to positive autocorrelation among the residuals.
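To make the Durbin-Watson reading above more concrete, here is a minimal sketch of how the statistic is computed. It ranges from 0 to 4, with values near 2 meaning no autocorrelation and values near 0 meaning positive autocorrelation. The helper name `durbin_watson_stat` and the toy residuals are illustrative; statsmodels reports the same quantity in the summary table.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Residuals that keep the same sign: strong positive autocorrelation, statistic near 0
dw_positive = durbin_watson_stat([1.0, 1.0, 1.0, 1.0])
# Residuals that flip sign every step: negative autocorrelation, statistic above 2
dw_negative = durbin_watson_stat([1.0, -1.0, 1.0, -1.0])
print(dw_positive, dw_negative)
```

Hourly data sorted by time often gives a low value, because the error in one hour tends to resemble the error in the next.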

In [None]:
X.corr()

From the correlation matrix we find that 'Temperature' and 'Dew_point_temperature' are highly correlated, so we need to drop one of them.

To decide which one to drop, we check the P>|t| values in the OLS summary above. The value for 'Dew_point_temperature' is higher, so we drop the Dew_point_temperature column.

For clarity, we visualise the correlations with a heatmap in the next step.

#### Chart - 14 - Correlation Heatmap

We check the correlation between variables using a correlation heatmap, which is a graphical representation of the correlation matrix.

In [None]:
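## plot the correlation matrix (numerical columns only)

plt.figure(figsize=(12,8))
correlation=dataset.corr(numeric_only=True)
# mask the upper triangle so that each pair is shown only once
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation, mask=mask, annot=True, cmap='coolwarm')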

##### 1. Why did you pick the specific chart?

We check the correlation between variables using a correlation heatmap, which is a graphical representation of the correlation matrix.

##### 2. What is/are the insight(s) found from the chart?

On the target variable's row of the heatmap, the variables most positively correlated with the rented bike count are:

*   the temperature
*   the dew point temperature
*   the solar radiation

And the most negatively correlated variables are:

*   Humidity
*   Rainfall

From the above correlation heatmap, we see a strong positive correlation of 0.91 between 'Temperature' and 'Dew point temperature'. The two columns vary together, so dropping one of them should not noticeably affect the outcome of our analysis. We therefore drop the 'Dew_point_temperature' column.
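One common way to quantify this kind of multicollinearity is the variance inflation factor (VIF). It is not part of the original analysis. The sketch below uses synthetic features (`x1`, `x2`, `x3` are made-up stand-ins, with `x2` almost a copy of `x1`) and relies on the fact that the VIFs are the diagonal entries of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-ins: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
features = np.column_stack([x1, x2, x3])

# VIF of each feature is the matching diagonal entry of the inverse correlation matrix
corr = np.corrcoef(features, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(vif)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others.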

In [None]:
#drop the Dew point temperature column

dataset=dataset.drop(['Dew_point_temperature'],axis=1)

In [None]:
dataset.info()

## *** Feature Engineering & Data Pre-processing***

---



# Create the dummy variables


A dataset may contain various types of values, including categorical values. To use those categorical values in a model, we create dummy variables.

In [None]:
# Collect all categorical features

categorical_features=list(dataset.select_dtypes(['object','category']).columns)
categorical_features=pd.Index(categorical_features)
categorical_features

# One hot encoding

One hot encoding represents each category of a categorical variable as its own 0/1 column. Many machine learning algorithms cannot work with categorical data directly, so the categories must be converted into numbers. This is required for both input and output variables that are categorical.
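As a small illustration of what one hot encoding produces, the sketch below encodes a toy 'Seasons' column with `pd.get_dummies`. The `toy` frame is an assumption for demonstration only.

```python
import pandas as pd

toy = pd.DataFrame({'Seasons': ['Winter', 'Spring', 'Summer', 'Autumn']})

# drop_first=True drops the first category (alphabetically 'Autumn') to avoid a redundant column
encoded = pd.get_dummies(toy, columns=['Seasons'], drop_first=True)
print(encoded)
```

The dropped category is represented by a row in which all remaining dummy columns are zero, which avoids perfectly collinear columns in a linear model.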

In [None]:
#create a real copy so that the original dataset is left unchanged
dataset_copy = dataset.copy()

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in categorical_features:
    dataset_copy = one_hot_encoding(dataset_copy, col)
dataset_copy.head()

# Train Test split for regression

Before fitting any model, it is a rule of thumb to split the dataset into a training set and a test set. Part of the data is used to train the model and the rest is used to evaluate how the model performs on unseen data. Common proportions are 60:40, 70:30, 75:25 or 80:20 for training and testing respectively. In this step we split our data into training and testing sets with a 75:25 ratio using the scikit-learn library.

In [None]:
# Assign the features to X and the square-root transformed target to y

X = dataset_copy.drop(columns=['Rented_Bike_Count'])
y = np.sqrt(dataset_copy['Rented_Bike_Count'])

In [None]:
X.head()

In [None]:
y.head()

In [None]:
#Create test and train data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)

In [None]:
dataset_copy.describe().columns

* The mean squared error (MSE) tells you how close a regression line is to a set of points. It takes the distances from the points to the regression line (these distances are the "errors"), squares them, and averages the squared values. The lower the MSE, the better the forecast.

* MSE formula:
$$ \text{MSE} = \frac{1}{n} \sum (\text{actual} - \text{forecast})^2 $$
Where:

* n = number of items,
* Σ = summation notation,
* Actual = original or observed y-value,
* Forecast = y-value from regression.

* Root Mean Square Error (RMSE) is the square root of the MSE. It can be read as the standard deviation of the residuals (prediction errors) and is in the same units as the target.

* Mean Absolute Error (MAE) is the average of the absolute errors. Here, errors are the differences between the predicted values (values predicted by our regression model) and the actual values of a variable.

* R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

* Formula for R-squared:
$$ R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} $$

* Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
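
The metric definitions above can be written out directly with NumPy. This is a minimal sketch on toy values: the function name `regression_metrics` and its `n_features` argument are illustrative, and the adjusted R-squared uses the number of rows and features of whichever set is being scored.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2, 'Adjusted R2': adj_r2}

toy_metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_features=1)
print(toy_metrics)
```

The first four values should agree with `mean_squared_error`, `mean_absolute_error` and `r2_score` from scikit-learn on the same inputs.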
​


# **LINEAR REGRESSION**


Linear regression uses a linear approach to model the relationship between independent and dependent variables. In simple words, it is a best-fit line drawn through the values of the independent variables and the dependent variable. In the case of a single variable, the formula is the same as the straight-line equation with an intercept and a slope.

$$ \text{y_pred} = \beta_0 + \beta_1x$$

where $\beta_0$ and $\beta_1$ are the intercept and slope respectively.

In the case of multiple features the formula becomes:

$$ \text{y_pred} = \beta_0 + \beta_1x_1 + \beta_2x_2 +\beta_3x_3 + \dots$$

where $x_1, x_2, x_3$ are the feature values and $\beta_0, \beta_1, \beta_2, \dots$ are the weights assigned to each of the features. These are the parameters that the algorithm learns, either by gradient descent or by solving the least-squares problem directly. scikit-learn's LinearRegression, which we use in this notebook, solves the least-squares problem directly.

Gradient descent updates the parameters iteratively using a loss function. The loss for one observation measures the difference between the actual value and the predicted value (the error or residual), typically as the squared error. The loss averaged over all observations gives the cost function. Gradient descent keeps updating the parameters until the cost function is minimized. It uses a hyperparameter 'alpha', called the learning rate, that scales the gradient and so decides how big each step is. Alpha needs a suitable value: if it is too high, gradient descent can overshoot the minimum, and if it is too low, it converges very slowly. For linear regression with squared error the cost function is convex, so there are no local minima to get stuck in. There are also some basic assumptions that should be fulfilled before using this algorithm. They are:

1. No multicollinearity in the dataset.

2. Independent variables should show a linear relationship with the dependent variable.

3. Residual mean should be 0 or close to 0.

4. There should be no heteroscedasticity, i.e., the variance of the residuals should be constant along the line of best fit.
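
The two ways of learning the weights described above, solving least squares directly and running gradient descent, can be compared on toy data. This is only a sketch: the data, the learning rate `alpha = 0.1` and the iteration count are illustrative assumptions.

```python
import numpy as np

# Toy data on the line y = 2 + 3x (no noise)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])   # first column is the intercept term

# Direct least-squares solution
beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the mean squared error
beta_gd = np.zeros(2)
alpha = 0.1                                  # learning rate (illustrative value)
for _ in range(5000):
    gradient = 2.0 / len(y) * X.T @ (X @ beta_gd - y)
    beta_gd -= alpha * gradient

print(beta_direct, beta_gd)
```

Both approaches end up at the same intercept and slope here, because the squared-error cost of linear regression has a single minimum.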



Let us now implement our first model.
We will be using LinearRegression from the scikit-learn library.


## *** ML Model Implementation***

### ML Model - 1 - **Implementing Linear Regression**


In [None]:
# ML Model - 1 Implementation

reg= LinearRegression().fit(X_train, y_train)

In [None]:
#check the score

reg.score(X_train, y_train)

In [None]:
#check the coefficients
reg.coef_

In [None]:
#predict on the training and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

In [None]:
#calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

In [None]:
#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

In [None]:
#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)

In [None]:
#calculate r2 and adjusted r2 on the training set
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
# adjusted R2 uses the number of rows and features of the set being scored (here the training set)
Adjusted_R2_lr = 1-(1-r2_lr)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_lr)

Our training R2 score is about 0.77, which means the model captures most of the variance in the square-root transformed target. Let's save the metrics in a dataframe for later comparison.

In [None]:
# storing the training set metrics in a dataframe for later comparison
dict1={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
#calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

In [None]:
#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

In [None]:
#calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)

In [None]:
#calculate r2 and adjusted r2
r2_lr= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_lr )

The r2_score for the test set is 0.78, close to the training score of 0.77, so the linear model performs about as well on unseen data as on the data it was fitted to. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).
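
Alongside the plot, a rough numeric check can help. The sketch below, on synthetic data, reports the residual mean and the correlation between the predictions and the absolute residuals; a clearly positive correlation suggests the scatter widens as predictions grow. The helper `residual_summary` and the demo arrays are assumptions made for illustration, and this is a heuristic rather than a formal test.

```python
import numpy as np


def residual_summary(y_true, y_pred):
    """Return the residual mean and the correlation of |residuals| with predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    residuals = y_true - y_pred
    spread_corr = np.corrcoef(y_pred, np.abs(residuals))[0, 1]
    return residuals.mean(), spread_corr


# Synthetic predictions and targets (illustrative only)
rng = np.random.default_rng(42)
pred_demo = np.linspace(1, 10, 500)
actual_const = pred_demo + rng.normal(0, 1.0, 500)                    # constant noise
actual_growing = pred_demo + rng.normal(0, 1.0, 500) * pred_demo / 5  # noise grows with prediction

mean_const, corr_const = residual_summary(actual_const, pred_demo)
mean_grow, corr_grow = residual_summary(actual_growing, pred_demo)
print("constant variance:", mean_const, corr_const)
print("growing variance :", mean_grow, corr_grow)
```

On the notebook's own data the call would be `residual_summary(y_test, y_pred_test)`.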

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
# heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test),(y_test)-(y_pred_test))

In [None]:
#Plot the figure
plt.figure(figsize=(8,5))
plt.plot(y_pred_test)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

### ML Model - 2 - **Lasso Regression**

In [None]:
# Fit the Lasso model
lasso = Lasso(alpha=1.0, max_iter=3000)
lasso.fit(X_train, y_train)
# R2 score on the test set, then on the training set
print(lasso.score(X_test, y_test), lasso.score(X_train, y_train))

In [None]:
# get the predictions for the training and test sets
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

In [None]:
#calculate MSE
MSE_l= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE :",MSE_l)

In [None]:
#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)

In [None]:
#calculate MAE
MAE_l= mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE :",MAE_l)

In [None]:
#calculate r2 and adjusted r2
r2_l= r2_score(y_train, y_pred_train_lasso)
print("R2 :",r2_l)
# these are training metrics, so use the training set's sample and feature counts
Adjusted_R2_l = 1 - (1 - r2_l) * ((X_train.shape[0] - 1) / (X_train.shape[0] - X_train.shape[1] - 1))
print("Adjusted R2 :", Adjusted_R2_l)

The r2 score on the training set is 0.40, which means the Lasso model with alpha=1.0 does not capture most of the variance in the data. Let's save the metrics in a dataframe for later comparison.
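
One possible reason for a low Lasso score is a penalty that is too strong for the scale of the features, which shrinks useful coefficients to zero. Whether that applies to this dataset has not been checked here; the sketch below only demonstrates the effect on synthetic data, and the alpha values and data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data with features in [0, 1] (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 5))
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(0, 0.5, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

scores = {}
for alpha in [1.0, 0.1, 0.01]:
    model = Lasso(alpha=alpha, max_iter=3000).fit(X_tr, y_tr)
    scores[alpha] = model.score(X_te, y_te)   # R2 on the held-out split
    n_nonzero = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: test R2={scores[alpha]:.3f}, non-zero coefficients={n_nonzero}")
```

In practice alpha would be chosen by cross-validation, for example with scikit-learn's `LassoCV` or `GridSearchCV`, rather than by inspecting test scores.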

In [None]:
# storing the training set metrics in a dataframe for later comparison
dict1={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2)
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df = pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_l= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE :",MSE_l)

In [None]:
#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)

In [None]:

#calculate MAE
MAE_l= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE :",MAE_l)

In [None]:
#calculate r2 and adjusted r2
r2_l= r2_score((y_test), (y_pred_test_lasso))
print("R2 :",r2_l)
Adjusted_R2_l=(1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

The r2_score for the test set is 0.38. This means the Lasso model is not performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2),
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df = pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(8,5))
plt.plot(np.array(y_pred_test_lasso))
plt.plot(np.array((y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_lasso),(y_test-y_pred_test_lasso))

### ML Model - 3 - **Ridge Regression**

In [None]:
#FIT THE MODEL
ridge= Ridge(alpha=0.1)
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train, y_train)

In [None]:
# get the predictions for the training and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
#calculate MSE
MSE_r= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

In [None]:
#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)

In [None]:
#calculate MAE
MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r)

In [None]:
#calculate r2 and adjusted r2
r2_r= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r)
# these are training metrics, so use the training set's sample and feature counts
Adjusted_R2_r = 1 - (1 - r2_r) * ((X_train.shape[0] - 1) / (X_train.shape[0] - X_train.shape[1] - 1))
print("Adjusted R2 :", Adjusted_R2_r)

**The r2 score on the training set is 0.77, which means the ridge model captures most of the variance in the data. Let's save the metrics in a dataframe for later comparison.**

In [None]:
# storing the training set metrics in a dataframe for later comparison
dict1={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df = pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_r= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r)

In [None]:
#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)

In [None]:

#calculate MAE
MAE_r= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r)


In [None]:
#calculate r2 and adjusted r2
r2_r= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

**The r2_score for the test set is 0.78. This means the ridge model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**




In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df = pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot((y_pred_test_ridge))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_ridge),(y_test)-(y_pred_test_ridge))

### ML Model - 4 - **Elastic Net Regression**
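
For reference, elastic net combines the L1 penalty of Lasso with the L2 penalty of ridge. As documented for scikit-learn's `ElasticNet`, the objective being minimised is the following, where $\rho$ stands for `l1_ratio`, $n$ for the number of samples and $w$ for the coefficient vector:

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Writing the penalty as $a\,\lVert w \rVert_1 + \tfrac{1}{2}\,b\,\lVert w \rVert_2^2$ gives $\alpha = a + b$ and $\rho = a/(a+b)$. With `alpha=0.1` and `l1_ratio=0.5`, this works out to $a = b = 0.05$.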

In [None]:
# the elastic net penalty is a * L1 + b * L2,
# where alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

In [None]:
#FIT THE MODEL
elasticnet.fit(X_train,y_train)

In [None]:
#check the score
elasticnet.score(X_train, y_train)

In [None]:
# get the predictions for the training and test sets
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

In [None]:
#calculate MSE
MSE_e= mean_squared_error((y_train), (y_pred_train_en))
print("MSE :",MSE_e)

In [None]:
#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


In [None]:

#calculate MAE
MAE_e= mean_absolute_error(y_train, y_pred_train_en)
print("MAE :",MAE_e)


In [None]:
#calculate r2 and adjusted r2
r2_e= r2_score(y_train, y_pred_train_en)
print("R2 :",r2_e)
# these are training metrics, so use the training set's sample and feature counts
Adjusted_R2_e = 1 - (1 - r2_e) * ((X_train.shape[0] - 1) / (X_train.shape[0] - X_train.shape[1] - 1))
print("Adjusted R2 :", Adjusted_R2_e)

**The r2 score on the training set is 0.62, which means the elastic net model captures a little over half of the variance in the data. Let's save the metrics in a dataframe for later comparison.**

In [None]:
# storing the training set metrics in a dataframe for later comparison
dict1={'Model':'Elastic net regression ',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df = pd.concat([training_df, pd.DataFrame([dict1])], ignore_index=True)

In [None]:
#calculate MSE
MSE_e= mean_squared_error(y_test, y_pred_test_en)
print("MSE :",MSE_e)

In [None]:

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)

In [None]:
#calculate MAE
MAE_e= mean_absolute_error(y_test, y_pred_test_en)
print("MAE :",MAE_e)

In [None]:
#calculate r2 and adjusted r2
r2_e= r2_score((y_test), (y_pred_test_en))
print("R2 :",r2_e)
Adjusted_R2_e=(1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

**The r2_score for the test set is 0.86. This means the elastic net model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).**




In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Elastic net regression Test',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df = pd.concat([test_df, pd.DataFrame([dict2])], ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_en))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_en),(y_test)-(y_pred_test_en))

# ***Conclusion***

In our analysis, we first did EDA on all the features of our dataset. We analysed our dependent variable, 'Rented Bike Count', and transformed it. Next we analysed the categorical variables and dropped those dominated by a single class. We also analysed the numerical variables, looking at their correlation, their distribution and their relationship with the dependent variable. Finally, we removed some numerical features that had mostly 0 values and one-hot encoded the categorical variables.

Next we implemented 4 machine learning algorithms: Linear Regression, Lasso, Ridge and Elastic Net. The results of our evaluation are:

In [None]:
# displaying the results of evaluation metric values for all models
result=pd.concat([training_df,test_df],keys=['Training set','Test set'])
result


• No overfitting is seen: none of the models scores markedly lower on the test set than on the training set.
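
The train-versus-test comparison behind this observation can be sketched with one helper that computes every metric for a given split. The function `evaluate`, the synthetic data and the two models used below are assumptions for illustration only; they are not the notebook's fitted models or its dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y, label):
    """Return one row of metrics for a fitted model on the given split."""
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    n, p = X.shape  # sample and feature counts of this split
    return {
        "Model": label,
        "MAE": mean_absolute_error(y, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2_score": r2,
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - p - 1),
    }


# Synthetic stand-in data (illustrative only, not the bike-sharing dataset)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(400, 4))
y_demo = X_demo @ np.array([5.0, -2.0, 1.0, 0.0]) + rng.normal(0, 0.5, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = {"Linear regression": LinearRegression(), "Ridge regression": Ridge(alpha=0.1)}
train_rows, test_rows = [], []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_rows.append(evaluate(model, X_tr, y_tr, name))
    test_rows.append(evaluate(model, X_te, y_te, name))

summary = pd.concat(
    [pd.DataFrame(train_rows), pd.DataFrame(test_rows)],
    keys=["Training set", "Test set"],
)
# A large positive gap (training R2 well above test R2) would hint at overfitting
r2_gap = summary.loc["Training set"]["R2_score"] - summary.loc["Test set"]["R2_score"]
print(summary)
print(r2_gap)
```

A gap near zero, as here, is consistent with a model that generalises; what counts as "large" is a judgment call that depends on the problem.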



However, this is not the end of the work. Because the data is time dependent, variables such as temperature, wind speed and solar radiation will not always follow the patterns seen here, so there will be scenarios where the model does not perform well. The model should therefore be re-evaluated from time to time. Machine learning is a fast-evolving field, and keeping pace with it will help us stay a step ahead in future.