<a href="https://colab.research.google.com/github/SuvOnGithub/Bike-Sharing-Demand-Prediction---Capstone-Project/blob/main/suvenddu_Bike_Sharing_Demand_Prediction_Capstone_Project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

## <b> Problem Description </b>

### Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.

## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

Rental bike sharing is the process by which bicycles are rented on several bases: hourly, weekly, membership-wise, etc. Its popularity has risen considerably due to a global effort to reduce the carbon footprint, which contributes to climate change, unprecedented natural disasters, ozone layer depletion, and other environmental anomalies.

In my project, I chose to analyse a dataset on rental bike demand from the South Korean city of Seoul, comprising climatic variables such as Temperature, Humidity, Rainfall, Snowfall, Dew Point Temperature, and others. The raw data was first thoroughly pre-processed, after which regression models were fitted. Here, the hourly rental bike count is the regressand.


Bike sharing systems allow users to take one-way bicycle trips over short distances. These systems are generally operated via automated kiosks to save manpower and reduce waiting time for users. Bike sharing reduces pollution, because using bicycles in place of motor vehicles cuts the emission of pollutants into the air. Bike sharing systems are common in Western countries but are not yet widespread in countries like India. In India, most bike sharing systems could not achieve their full potential because data analysis was not used properly. An advantage of this system is that public bike stations can run without any human involvement. Even the local Chennai Municipal Corporation has invited bids for a new bicycle sharing system.

In bicycle sharing systems it is very important that administrators know how many cycles will be needed at each station. Knowing this count enables them to arrange the proper number of cycles at the stations and decide whether a particular station needs extra bicycle stands. In this work we therefore study several prediction algorithms, namely linear regression, decision trees, and gradient boosting machines, and focus on which of them works better for the real-world problem of bicycle sharing demand prediction.

##**Business Goal:**

We are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

In [None]:
#import required packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly.graph_objects as go
# import plotly.express as px
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#importing file and converting it into dataframe
file='/content/drive/MyDrive/alma better/SeoulBikeData.csv'
csv=pd.read_csv(file,encoding='ISO-8859-1')
df=pd.DataFrame(csv)

In [None]:
df.head()#getting first 5 rows of dataset

In [None]:
df.tail()#getting last 5 rows of dataset

In [None]:
df.shape#getting shape of data

In [None]:
df.describe(include='all')#getting summary


In [None]:
#check information about the data
df.info()

In [None]:
#finding null values
df.isnull().sum()

**OBSERVATION:** There are no null values present in the dataset.

In [None]:
#getting summary
df.describe()

In [None]:
#converting the date column into year, day, month, weekday column. 
df['Date']=pd.to_datetime(df['Date'])
df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Day'] = pd.DatetimeIndex(df['Date']).day
df['Month']= pd.DatetimeIndex(df['Date']).month
df['weekday']=pd.DatetimeIndex(df['Date']).weekday
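
The date split above relies on `pd.to_datetime` inferring the date layout. If the CSV stores dates as day/month/year (an assumption about the file, not something shown in this notebook), inference may swap day and month for ambiguous rows, so passing an explicit `format` may be safer. A minimal sketch on a hypothetical sample, using the `.dt` accessor instead of `pd.DatetimeIndex`:

```python
import pandas as pd

# Hypothetical sample in day/month/year order (an assumption about the CSV layout).
sample = pd.Series(["01/12/2017", "02/12/2017", "13/12/2017"])

# An explicit format avoids guessing, and malformed rows raise an error.
parsed = pd.to_datetime(sample, format="%d/%m/%Y")

parts = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
    "weekday": parsed.dt.weekday + 1,  # Monday = 1 ... Sunday = 7
})
print(parts)
```

If the file turns out to use another layout, only the `format` string would need to change.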

In [None]:
df.shape

In [None]:
df.head()

In [None]:
display(df['weekday'].unique())#getting unique values of week

pandas numbers weekdays from 0 to 6 (Monday = 0), so we shift them to the more familiar 1-7 range.

In [None]:
#weekday values were previously in the 0-6 range; shift them to the 1-7 format
df['weekday'] = df['weekday'] + 1
display(df['weekday'].unique())

##Exploratory Data Analysis (EDA)

Exploratory data analysis is a statistical way of understanding the data, usually carried out visually. The graphs plotted during exploratory data analysis give the analyst a better understanding of the data.

Since we have to predict the number of bikes that will be rented, the best way to begin is with the variable to predict, "Rented Bike Count"

##**Demand of rented bikes at different times of years**

In [None]:
#creating data frame with year, month, day, weekday and rented bike count
Rented_bike_per_year = pd.DataFrame(df['Rented Bike Count'].groupby(by=df['Year']).sum()).reset_index().sort_values("Year", ascending=True)
Rented_bike_per_month = pd.DataFrame(df['Rented Bike Count'].groupby(by=df['Month']).sum()).reset_index().sort_values("Month", ascending=True)
Rented_bike_per_Day= pd.DataFrame(df['Rented Bike Count'].groupby(by=df['Day']).sum()).reset_index().sort_values("Day", ascending=True)
Rented_bike_per_Weekday= pd.DataFrame(df['Rented Bike Count'].groupby(by=df['weekday']).sum()).reset_index().sort_values("weekday", ascending=True)

RENTED BIKE COUNT PER YEAR

In [None]:
Rented_bike_per_year


In [None]:
# Set the figure size
plt.figure(figsize=(8, 8))

# Bar plot of the total rented bike count per year
plots = sns.barplot(x=Rented_bike_per_year['Year'] , y=Rented_bike_per_year['Rented Bike Count'])

# Annotate each bar with its height using Matplotlib's annotate function:
# the label is centred on the bar (ha, va) and offset 8 points above its top
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                size=15, xytext=(0, 8),
                textcoords='offset points')

# Axis labels and title
plt.xlabel("Years", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.title("Rented bike count per year")

# Finally, show the plot
plt.show()

**OBSERVATION:** In 2018 the rented bike count was 5986984, which is greater than in 2017. We attribute this to the business having started in 2017 and accelerated over the following year.
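
The annotated bar chart pattern is repeated for several groupings in this section. As a sketch only, it could be wrapped in a small helper. The function name `annotated_barplot` and the toy totals below are hypothetical, and plain Matplotlib bars are used here in place of seaborn to keep the example self-contained:

```python
import matplotlib.pyplot as plt
import pandas as pd


def annotated_barplot(data, x, y, title, figsize=(8, 8)):
    """Draw a bar chart of data[y] against data[x] and label each bar with its height."""
    fig, ax = plt.subplots(figsize=figsize)
    bars = ax.bar(data[x].astype(str), data[y])
    for bar in bars:
        # Centre the label on the bar and offset it 8 points above the top
        ax.annotate(format(float(bar.get_height()), '.2f'),
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha='center', va='center', size=15,
                    xytext=(0, 8), textcoords='offset points')
    ax.set_xlabel(x, size=14)
    ax.set_ylabel(y, size=14)
    ax.set_title(title)
    return ax


# Hypothetical totals, for illustration only (not the notebook's results).
toy = pd.DataFrame({'Year': [2017, 2018], 'Rented Bike Count': [100, 250]})
ax = annotated_barplot(toy, 'Year', 'Rented Bike Count', 'Rented bike count per year')
```

Each later chart would then reduce to one call with a different data frame, column pair, and title.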

RENTED BIKE COUNT PER MONTH

In [None]:
Rented_bike_per_month

In [None]:
# Set the figure size
plt.figure(figsize=(18, 8))

# Bar plot of the total rented bike count per month
plots = sns.barplot(x=Rented_bike_per_month['Month'] , y=Rented_bike_per_month['Rented Bike Count'])

# Annotate each bar with its height using Matplotlib's annotate function:
# the label is centred on the bar (ha, va) and offset 8 points above its top
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                size=15, xytext=(0, 8),
                textcoords='offset points')

# Axis labels and title
plt.xlabel("Months", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.title("Rented bike count per month")

# Finally, show the plot
plt.show()

**OBSERVATION:** The rented bike count is highest in June (month 6) at 706728 and lowest in February (month 2) at 264112.

RENTED BIKE COUNT PER DAY

In [None]:
Rented_bike_per_Day

In [None]:
# Set the figure size
plt.figure(figsize=(22, 8))

# Bar plot of the total rented bike count per day of the month
plots = sns.barplot(x=Rented_bike_per_Day['Day'] , y=Rented_bike_per_Day['Rented Bike Count'])

# Annotate each bar with its height using Matplotlib's annotate function:
# the label is centred on the bar (ha, va) and offset 8 points above its top
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                size=15, xytext=(0, 8),
                textcoords='offset points')

# Axis labels and title
plt.xlabel("Days", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.title("Rented bike count per day of the month")

# Finally, show the plot
plt.show()

**OBSERVATION:** The rented bike count is highest on the 6th day of the month at 371295 and lowest on the 2nd day of the month at 53694.

RENTED BIKE COUNT PER WEEKDAY

In [None]:
Rented_bike_per_Weekday

In [None]:
# Set the figure size
plt.figure(figsize=(15, 8))

# Bar plot of the total rented bike count per weekday
plots = sns.barplot(x=Rented_bike_per_Weekday['weekday'] , y=Rented_bike_per_Weekday['Rented Bike Count'])

# Annotate each bar with its height using Matplotlib's annotate function:
# the label is centred on the bar (ha, va) and offset 8 points above its top
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                size=15, xytext=(0, 8),
                textcoords='offset points')

# Axis labels and title
plt.xlabel("Weekdays", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.title("Rented bike count per weekday")

# Finally, show the plot
plt.show()

**OBSERVATION:** The rented bike count is highest on the 4th day of the week, at 928267.

RENTED BIKE COUNT WITH RESPECT TO TEMPERATURE

In [None]:
 # Plot the graph between the temperature and rented bike counts
temp=df.groupby('Temperature(°C)')['Rented Bike Count'].sum()
temp.plot.area(color='cyan',figsize=(12, 10))

 **Observation:** 
 
*   From the graph, we can see that people ride bikes most often when the temperature is around 25 degrees Celsius.
*   From the above graph, we can also conclude that people prefer bike riding in summer compared to other seasons.

MONTHS OF BOTH YEAR 2017 AND 2018 AND RENTED BIKE COUNT

In [None]:
 #creating dataframe with months of both year 2017 and 2018 and rented bike count
df.groupby(['Year','Month']).agg({'Rented Bike Count':['sum']}).reset_index()

In [None]:
 #plotting graph
plt.figure(figsize=(15,10))
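# estimator=np.sum plots totals, matching the title; the default estimator would plot the mean
sns.barplot(x = 'Month', y = 'Rented Bike Count', data =df, hue = 'Year', estimator=np.sum)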
plt.title("Total number of bikes rented in different months")
plt.show()

 **OBSERVATION:** 
1.   There is a substantial increase in the number of bike rentals in 2018.
2.   Demand decreases in the last months of 2018, whereas it was increasing at the end of 2017.
3.   This is because demand was taking off in 2017, and the pattern continues as it keeps increasing in the early months of 2018.
4.   There is a decline at the end of the year, which could also be a repercussion of the winter season.

In [None]:
df.head()

In [None]:
 #extracting unique seasons
df['Seasons'].unique()

In [None]:
#adding the temperature and dew point temperature columns into a single column
df['Temperature_and_DP_Temp'] = df['Temperature(°C)'] + df['Dew point temperature(°C)']
df.drop(['Temperature(°C)','Dew point temperature(°C)'],axis=1,inplace=True)

RENTED BIKE COUNT PER SEASON

In [None]:
 #seasons and Rented Bike Count
df.groupby(['Seasons'])['Rented Bike Count'].sum()

In [None]:
#plotting pie 
df.groupby('Seasons')['Rented Bike Count'].sum().plot.pie(figsize=(15,8))

In [None]:
 #plotting graph
plt.figure(figsize=(15,8))
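# estimator=np.sum plots totals, matching the title; the default estimator would plot the mean
sns.barplot(x = 'Seasons', y = 'Rented Bike Count', data = df, estimator=np.sum)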
plt.title("Total number of bikes rented in different Seasons")
plt.show()

 **OBSERVATION:** The pie and bar plots show that the rented bike count was highest in summer and lowest in winter. As the temperature drops and snowfall increases, people avoid going out, which is why the rented bike count is higher in summer.

RENTED BIKE COUNT ON HOLIDAY AND ON NO HOLIDAY

In [None]:
 #creating dataframe with holiday and rented bike count
holiday_bike_count=pd.DataFrame(df.groupby('Holiday')['Rented Bike Count'].sum()).reset_index()

In [None]:
holiday_bike_count

In [None]:
# Set the figure size
plt.figure(figsize=(8, 8))

# Bar plot of the total rented bike count on holidays and non-holidays
plots = sns.barplot(x=holiday_bike_count['Holiday'] , y=holiday_bike_count['Rented Bike Count'])

# Annotate each bar with its height using Matplotlib's annotate function:
# the label is centred on the bar (ha, va) and offset 8 points above its top
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                size=15, xytext=(0, 8),
                textcoords='offset points')

# Axis labels and title
plt.xlabel("Holiday", size=14)
plt.ylabel("Rented Bike Count", size=14)
plt.title("Rented bike count on holidays and non-holidays")

# Finally, show the plot
plt.show()

##Or we can do the same with:

In [None]:
 #importing library
from plotnine import ggplot, aes, geom_bar

In [None]:
 #plotting graph
ggplot(df)+ aes('Functioning Day',fill='Holiday')+geom_bar()

 **OBSERVATION:** 
*   An interesting insight: all the holidays fall on functioning days.

*   The rented bike count is much higher on non-holidays than on holidays.

RENTED BIKE COUNT PER HOUR

In [None]:
 #plotting graph
plt.figure(figsize=(15,8))
# x and y are passed as keywords, as recent seaborn versions require
sns.lineplot(x=df['Hour'], y=df['Rented Bike Count'])
sns.barplot(x=df['Hour'], y=df['Rented Bike Count'])

 **OBSERVATION:**  The graph shows a large spike in the rented bike count at hour 18 of the day, reaching approximately 1600.

In [None]:
import seaborn as sns
#plotting graph
fig, ax=plt.subplots(figsize=(15,8))
sns.histplot(data=df, x="Rented Bike Count", kde= True,ax=ax)
plt.show()

**OBSERVATION:**

*   The data is positively skewed.

In [None]:
import seaborn as sns
#plotting graph
fig, ax=plt.subplots(figsize=(15,8))
sns.histplot(data=df, x=np.sqrt(df["Rented Bike Count"]), kde= True,ax=ax)
plt.show()

**Observation:**
*   After the square root transformation, the distribution looks closer to normal.
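
The effect of a square root transformation on skewness can also be checked numerically rather than only by eye. The sketch below uses synthetic right-skewed data as a stand-in for 'Rented Bike Count' (it is not the notebook's data) and compares the sample skewness before and after the transformation:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for 'Rented Bike Count' (not the real data).
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=2.0, scale=300.0, size=5000))

# Sample skewness: 0 for a symmetric distribution, positive for a long right tail.
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
print(f"skewness before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")
```

On the real column the same two calls would be `df['Rented Bike Count'].skew()` and `np.sqrt(df['Rented Bike Count']).skew()`.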

FINDING CORRELATION

In [None]:
 #plotting graph
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(method="pearson", numeric_only=True),  # correlate numeric columns only
            vmin=-1, vmax=1,
            cmap='coolwarm',
            annot=True, 
            square=True);

 **OBSERVATION:** 
*   Temperature and dew point temperature are highly correlated, so we can add them together to form a single column.

##Pre-Processing 

Data pre-processing is needed because the data may be incomplete, inconsistent, or noisy. There are several ways to deal with unprocessed data:

i) Data Cleaning: filling in missing values, identifying and removing outliers, and smoothing the data.

ii) Data Transformation: performing operations such as normalization and aggregation.

iii) Data Reduction: modifying the dataset so that the results produced by the model stay almost the same while unnecessary values are removed.

iv) Data Integration: merging data from different sources if needed and removing redundancies.

v) Label Encoding: converting the categorical variables into numerical ones.

##**Label Encoding**

We will create DUMMY variables for 3 categorical variables 'Holiday', 'Functioning Day' and 'Seasons'.

Before creating dummy variables, we map the binary 'Holiday' and 'Functioning Day' values to 0 and 1.
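
As a minimal sketch of what dummy (one-hot) encoding produces, the toy frame below is encoded with `pd.get_dummies`. The frame is hypothetical, and the `columns` and `dtype` arguments are one possible way to call the function, not necessarily how the cells in this section do it. Each category becomes its own 0/1 column, prefixed with the source column name so that names from different variables cannot collide:

```python
import pandas as pd

# Hypothetical three-row sample, for illustration only.
toy = pd.DataFrame({
    'Hour': [0, 1, 2],
    'Seasons': ['Winter', 'Spring', 'Summer'],
    'Holiday': ['No Holiday', 'Holiday', 'No Holiday'],
})

# One 0/1 column per category; the original categorical columns are dropped.
encoded = pd.get_dummies(toy, columns=['Seasons', 'Holiday'], dtype=int)
print(encoded.columns.tolist())
```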

In [None]:
#replacing 'No' with 0 and 'Yes' with 1, and 'Holiday' with 1 and 'No Holiday' with 0
bike_df = df.replace({'No':0,'Yes':1,'Holiday':1,'No Holiday':0})

In [None]:
#creating dummy variables for seasons
season_dummy = pd.get_dummies(bike_df['Seasons'])
for i in season_dummy.columns:
  bike_df[i]= season_dummy[i]
bike_df.drop('Seasons',axis='columns',inplace=True)
bike_df.head()

In [None]:
# dropping date column as it is not required
bike_df.drop('Date', axis = 1 ,inplace= True)

In [None]:
#replacing month numbers with month names (assigning back instead of using inplace on a column)
month_names = {1:'jan',2:'feb',3:'mar',4:'apr',5:'may',6:'june',7:'july',8:'aug',9:'sept',10:'oct',11:'nov',12:'dec'}
bike_df['Month'] = bike_df['Month'].map(month_names)

In [None]:
#creating dummy variables for month
month_dummy = pd.get_dummies(bike_df['Month'])
for i in month_dummy.columns:
  bike_df[i]= month_dummy[i]
bike_df.drop('Month',axis='columns',inplace=True)
bike_df.head()

In [None]:
bike_df.shape

In [None]:
bike_df['Holiday'].value_counts()#getting value count of holidays and no holiday

In [None]:
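#creating dummy variables for Holiday; the prefix keeps the new column names
#distinct from the 0/1 columns created for other variables
Holiday_dummy = pd.get_dummies(bike_df['Holiday'], prefix='Holiday')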
for i in Holiday_dummy.columns:
  bike_df[i]= Holiday_dummy[i]
bike_df.drop('Holiday',axis='columns',inplace=True)
bike_df.head()

In [None]:
bike_df['Functioning Day'].value_counts()#getting value count of functioning day

In [None]:
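#creating dummy variables for Functioning Day; the prefix stops these columns
#from overwriting same-named dummy columns created earlier
Functioning_Day_dummy = pd.get_dummies(bike_df['Functioning Day'], prefix='Functioning Day')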
for i in Functioning_Day_dummy.columns:
  bike_df[i]= Functioning_Day_dummy[i]
bike_df.drop('Functioning Day',axis='columns',inplace=True)
bike_df.head()

In [None]:
bike_df.shape# final shape of dataframe after preprocessing

##Model Building

In [None]:
#splitting the independent and dependent variables
x = bike_df.drop('Rented Bike Count',axis=1)
y = bike_df['Rented Bike Count']
x.head()

##Train Test split model

 Splitting the data into train and test sets: we now split the data into TRAIN and TEST sets (80:20 ratio) using the train_test_split method from the sklearn package.

In [None]:
# splitting the data into train and test sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split( x,y , test_size = 0.2, random_state = 0)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

 **OBSERVATION:** 
*   We have only 7008 rows of data in our training dataset, so we can accept gradient descent taking a little longer to reach the global minimum. This is why we aren't scaling this particular dataset.

In [None]:
 
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
 
def calc_vif(X):
 
    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
 
    return(vif)

In [None]:
 #calculating vif for all the variables
calc_vif(x)

 **OBSERVATION:** 
We can see here that 'Temperature_and_DP_Temp' has a high VIF value, meaning it can be predicted from the other independent variables in the dataset.
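
For reference, the VIF of the $j$-th feature is usually defined from the coefficient of determination obtained when that feature is regressed on all the other features, written here as $R_j^2$ (notation introduced for this note):

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF of 1 means the feature is uncorrelated with the others, and the value grows without bound as $R_j^2$ approaches 1, which is why a high VIF signals that the feature is largely explained by the remaining ones.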

##Model Building

 A machine learning model is built by learning and generalizing from training data, then applying that acquired knowledge to new data it has never seen before to make predictions and fulfill its purpose. Lack of data will prevent you from building the model, and access to data isn't enough.

##Linear Regression Model

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables, respectively). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.

In [None]:
 #importing libraries
from sklearn.linear_model import LinearRegression
 
import sklearn.metrics as met
from sklearn.metrics import mean_squared_error,r2_score

In [None]:
 #calling and fitting linear regression
linear_reg = LinearRegression()
linear_reg.fit(x_train,y_train)

In [None]:
 # Finding the Evaluation Metrics
print ("training score: ",linear_reg.score(x_train,y_train)) 
MSE  = mean_squared_error(y_test,linear_reg.predict(x_test))
print("MSE :" , MSE)
 
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)
 
 
r2 = r2_score(y_test,linear_reg.predict(x_test))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test,linear_reg.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

 **OBSERVATION:** 
*   We tried adding possible columns to make the model a bit more complex, but the Linear Regression model is still too general.

*   We have to make our model more complex for better predictions, or move to tree and ensemble algorithms for better results.

*   After trying combinations of features, the linear regression model still underfitted. This is unsurprising because the data is too widely spread for a line to fit well.

*  Our train score came out to be 0.56 and the test score came out to be 0.55.
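
The adjusted R2 expression printed in the evaluation cells is repeated for every model. As a sketch, it could be wrapped in a small helper. The name `adjusted_r2` and the example numbers below are hypothetical:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with n_features predictors evaluated on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers for illustration: R^2 = 0.55 on 1752 rows with 29 features.
example = adjusted_r2(0.55, n_samples=1752, n_features=29)
print(example)
```

In the notebook's terms this would be called as `adjusted_r2(r2, x_test.shape[0], x_test.shape[1])`. The adjustment penalises R2 for each additional feature, so the adjusted value is always slightly below the plain R2 whenever R2 is less than 1.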

##Ridge and Lasso Model

  Ridge and Lasso regression are some of the simple techniques to reduce model complexity and prevent over-fitting which may result from simple linear regression. Ridge Regression : In ridge regression, the cost function is altered by adding a penalty equivalent to square of the magnitude of the coefficients.
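
In symbols, with $n$ samples, $p$ coefficients $\beta_j$, predictions $\hat{y}_i$ and penalty strength $\alpha$ (notation chosen for this note), the two cost functions as scikit-learn documents them can be sketched as:

```latex
J_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
\qquad
J_{\text{lasso}}(\beta) = \frac{1}{2n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The squared penalty in ridge shrinks coefficients towards zero without eliminating them, whereas the absolute-value penalty in lasso can set some coefficients exactly to zero, which is why lasso also acts as a feature selector.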

In [None]:
 #importing library
from sklearn.linear_model import Ridge

In [None]:
rr = Ridge(alpha=0.01)
# the higher the alpha value, the stronger the restriction on the coefficients;
# with a low alpha such as this one, ridge regression closely resembles plain linear regression
rr.fit(x_train, y_train)
rr100 = Ridge(alpha=100) #  comparison with alpha value
rr100.fit(x_train, y_train)
 
Ridge_train_score = rr.score(x_train,y_train)
Ridge_test_score = rr.score(x_test, y_test)
Ridge_train_score100 = rr100.score(x_train,y_train)
Ridge_test_score100 = rr100.score(x_test, y_test)

In [None]:
print ("training score for alpha=0.01:",Ridge_train_score) 
print ("test score for alpha =0.01: ",Ridge_test_score)
print ("training score for alpha=100:",Ridge_train_score100) 
print ("test score for alpha =100: ", Ridge_test_score100)

In [None]:
# Finding the Evaluation Metrics
MSE  = mean_squared_error(y_test,rr.predict(x_test))
print("MSE for alpha =0.01 :" , MSE)
 
RMSE = np.sqrt(MSE)
print("RMSE for alpha =0.01 :" ,RMSE)
 
 
r2 = r2_score(y_test,rr.predict(x_test))
print("R2 for alpha =0.01:" ,r2)
print("Adjusted R2 for alpha =0.01 : ",1-(1-r2_score(y_test,rr.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

 **OBSERVATION:**  
*  With ridge, the train score for alpha=0.01 came out to be 0.56 and the test score came out to be 0.55.

*  With ridge, the train score for alpha=100 came out to be 0.56 and the test score came out to be 0.55.

*  So for both alpha values, 0.01 and 100, the train and test scores are the same: 0.56 and 0.55 respectively.

In [None]:
 #importing library
from sklearn.linear_model import Lasso

In [None]:
 #using lasso
lasso = Lasso()
lasso.fit(x_train,y_train)
train_score=lasso.score(x_train,y_train)
test_score=lasso.score(x_test,y_test)
coeff_used = np.sum(lasso.coef_!=0)
print( "training score:", train_score) 
print ("test score: ", test_score)
print ("number of features used: ", coeff_used)
#taking alpha=0.01
lasso001 = Lasso(alpha=0.01, max_iter=10**6)  # max_iter must be an integer
lasso001.fit(x_train,y_train)
train_score001=lasso001.score(x_train,y_train)
test_score001=lasso001.score(x_test,y_test)
coeff_used001 = np.sum(lasso001.coef_!=0)
print ("training score for alpha=0.01:", train_score001) 
print ("test score for alpha =0.01: ", test_score001)
print ("number of features used: for alpha =0.01:", coeff_used001)
#taking alpha=0.0001
lasso00001 = Lasso(alpha=0.0001, max_iter=10**6)  # max_iter must be an integer
lasso00001.fit(x_train,y_train)
train_score00001=lasso00001.score(x_train,y_train)
test_score00001=lasso00001.score(x_test,y_test)
coeff_used00001 = np.sum(lasso00001.coef_!=0)
print ("training score for alpha=0.0001:", train_score00001) 
print ("test score for alpha =0.0001: ", test_score00001)
print ("number of features used: for alpha =0.0001:", coeff_used00001)

In [None]:
 # Finding the Evaluation Metrics
MSE  = mean_squared_error(y_test,lasso001.predict(x_test))
print("MSE for alpha=0.01 :" , MSE)
 
RMSE = np.sqrt(MSE)
print("RMSE for alpha=0.01 :" ,RMSE)
 
print ("training score: ",lasso001.score(x_train,y_train)) 
 
r2 = r2_score(y_test,lasso001.predict(x_test))
print("R2 for lasso:" ,r2)
print("Adjusted R2 for alpha=0.01: ",1-(1-r2_score(y_test,lasso001.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

In [None]:
 # Finding the Evaluation Metrics
print ("training score: ",lasso00001.score(x_train,y_train)) 
MSE  = mean_squared_error(y_test,lasso00001.predict(x_test))
print("MSE for alpha=0.0001 :" , MSE)
 
RMSE = np.sqrt(MSE)
print("RMSE for alpha=0.0001 :" ,RMSE)
 
 
r2 = r2_score(y_test,lasso00001.predict(x_test))
print("R2 for lasso:" ,r2)
print("Adjusted R2 for alpha=0.0001: ",1-(1-r2_score(y_test,lasso00001.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

 **OBSERVATION:**  
*  With the default lasso, the training score came out to be 0.56 and the test score came out to be 0.55.

*  The number of features used is 26.

*  For alpha=0.01, the training score came out to be 0.56 and the test score came out to be 0.55.

*  The number of features used for alpha=0.01 is 28.

*  For alpha=0.0001, the training score came out to be 0.56 and the test score came out to be 0.55.

*  The number of features used for alpha=0.0001 is 28.
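
The link between alpha and the number of features lasso keeps can be illustrated on synthetic data. The sketch below is not the notebook's dataset: it generates 20 features of which only 5 carry signal and counts the non-zero coefficients for a few alpha values chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which carry signal (illustration only).
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

features_used = {}
for alpha in (0.01, 1.0, 100.0):
    model = Lasso(alpha=alpha, max_iter=100_000).fit(X, y)
    # A feature counts as "used" when its coefficient is non-zero
    features_used[alpha] = int(np.sum(model.coef_ != 0))

print(features_used)
```

A larger alpha penalises coefficients more strongly, so fewer features tend to survive as alpha grows.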

##**Decision tree regressor model**

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

In [None]:
#importing library
from sklearn.tree import DecisionTreeRegressor

In [None]:
#calling decision tree regressor
reg_decision_model=DecisionTreeRegressor()

In [None]:
# fit the model on the training data
reg_decision_model.fit(x_train,y_train)

In [None]:
# Finding the Evaluation Metrics
print ("training score: ",reg_decision_model.score(x_train,y_train)) 
MSE  = mean_squared_error(y_test,reg_decision_model.predict(x_test))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)


r2 = r2_score(y_test,reg_decision_model.predict(x_test))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test,reg_decision_model.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

##**Doing hyperparameter tuning**

Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.
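
One common way to carry out such tuning is a cross-validated grid search. The sketch below is an illustration on synthetic data, not this notebook's method or results: it searches a small, hypothetical grid of `min_samples_leaf` values for a decision tree and scores each candidate by R2 averaged over 3 folds:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for x_train / y_train (illustration only).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

leaf_grid = [1, 5, 10, 20]  # hypothetical candidate values
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'min_samples_leaf': leaf_grid},
    scoring='r2',
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because each candidate is scored on held-out folds of the training data, the test set stays untouched until the final evaluation.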

In [None]:
# importing tools for hyperparameter tuning

from sklearn.model_selection import GridSearchCV
from sklearn import decomposition, datasets
from sklearn import tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
#trying min_samples_leaf values from 1 to 49
for min_sam_leaf in range(1,50):
  #criterion='squared_error' is the current scikit-learn name for the former 'mse'
  DT_reg = DecisionTreeRegressor(criterion='squared_error',min_samples_leaf=min_sam_leaf)
  DT_reg.fit(x_train,y_train)
  print(f"\nR-squared on train dataset when min leaf {min_sam_leaf} : {met.r2_score(y_train,DT_reg.predict(x_train))}")
  print(f"R-squared on test dataset when min leaf {min_sam_leaf}: {met.r2_score(y_test,DT_reg.predict(x_test))}")
  print(f"Mean absolute error on test dataset when min leaf {min_sam_leaf}: {met.mean_absolute_error(y_test,DT_reg.predict(x_test))}")
  print(f"Mean squared error on test dataset when min leaf {min_sam_leaf}: {met.mean_squared_error(y_test,DT_reg.predict(x_test))}")

**OBSERVATION:**

*   As expected, the decision tree has overfitted the data.
*   But it is also doing far better than linear regression on the test data.
*   At a minimum samples per leaf of 5, the model gives the highest R-squared score and the lowest errors on the test set.

In [None]:
#calling decision tree regressor ('squared_error' is the current name for the former 'mse' criterion)
DT_reg = DecisionTreeRegressor(criterion='squared_error',min_samples_leaf=6)
DT_reg.fit(x_train,y_train)

In [None]:
# Finding the Evaluation Metrics
print ("training score: ",DT_reg.score(x_train,y_train)) 
MSE  = mean_squared_error(y_test,DT_reg.predict(x_test))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)


r2 = r2_score(y_test,DT_reg.predict(x_test))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test,DT_reg.predict(x_test) ))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

**OBSERVATION:** With the decision tree we reached an R-squared value of 0.84. We tuned only the minimum samples per leaf hyperparameter. With default parameters the tree overfitted, reaching an R-squared of 1 on the train dataset but 0.84 on the test dataset.

##**Feature Importance: Decision Tree**

In [None]:
#creating dataframe
features = pd.DataFrame(list(zip(DT_reg.feature_importances_,x.columns)),columns=['Score','Features'])
features=features.sort_values('Score',ascending=False)

In [None]:
#plotting graph
plt.figure(figsize=(15,7))
sns.barplot(x=features['Score'],y=features['Features'])
plt.show()

**OBSERVATION:** 
The Hour and Temperature_and_DP_Temp columns are the main columns helping in prediction.

##**Random Forest Model**


The term "random forest" refers to an ensemble algorithm made up of several decision trees. The algorithm uses randomness when building each individual tree to promote an uncorrelated forest, and then combines the trees' predictions to make accurate decisions. Since our target is numerical, we use the regression variant, RandomForestRegressor.
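
For regression, the forest's prediction is the average of its trees' predictions. The sketch below checks this on synthetic data (not the notebook's dataset) by averaging the fitted trees in `estimators_` by hand and comparing the result with `predict`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Average the predictions of the individual trees by hand ...
tree_average = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)
# ... and compare with the forest's own prediction.
same = np.allclose(tree_average, forest.predict(X))
print(same)
```

Averaging many decorrelated trees is what reduces the variance that a single deep decision tree suffers from.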

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [50,60,60],
    'min_samples_split': [50,100,150],
    'n_estimators': [200, 300, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
# Fit the grid search to the data
grid_search.fit(x_train,y_train)
grid_search.best_params_

Fitting 3 folds for each of 162 candidates, totalling 486 fits


In [None]:
rf = RandomForestRegressor(bootstrap= True,
 max_depth= 110,
 max_features= 3,
 min_samples_leaf= 50,
 min_samples_split= 100,
 n_estimators= 300 )
rf.fit(x_train,y_train)

In [None]:
# Finding the Evaluation Metrics
print ("training score: ",rf.score(x_train,y_train))
MSE  = mean_squared_error(y_test,rf.predict(x_test))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)


r2 = r2_score(y_test,rf.predict(x_test))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(y_test,rf.