# **Project Name**    - Seoul Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual



# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the bike count required at each hour.**

#### **Define Your Business Objective?**

**The business objective is to predict the bike count required at each hour, so that a stable supply of rental bikes can be maintained, using EDA and a suitable ML model.**

***Data Description:***

> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

Attribute Information:
* Date - year-month-day
* Rented Bike Count - count of bikes rented at each hour
* Hour - hour of the day
* Temperature - temperature in Celsius
* Humidity - %
* Wind speed - m/s
* Visibility - in units of 10 m
* Dew point temperature - Celsius
* Solar Radiation - MJ/m2
* Rainfall - mm
* Snowfall - cm
* Seasons - Winter, Spring, Summer, Autumn
* Holiday - Holiday/No holiday
* Functioning Day - Yes (functional hours), No (non-functional hours)

# ***Let's Begin !***

### Import Libraries

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score

### Dataset Loading

In [None]:
# Loading Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
SeoulBikeData = pd.read_csv('/content/drive/MyDrive/AlmaBetter/projectsAlmaBetter/SeoulBikeData.csv',sep=',',encoding='latin')

### Dataset First View

In [None]:
# Dataset First Look (first five rows)
SeoulBikeData.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
#SeoulBikeData.describe
#SeoulBikeData.info
SeoulBikeData.shape


### Dataset Information

In [None]:
# Dataset Info
SeoulBikeData.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
SeoulBikeData.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
SeoulBikeData.isnull().value_counts()

In [None]:
# Number of missing values in each column
SeoulBikeData.isnull().sum()

### What did you know about your dataset?

**From the results of the cells above, we can conclude that there are no missing/NaN values and no duplicate rows in the dataset.**
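
The two checks above can be bundled into one small helper. This is a minimal sketch on a toy DataFrame; the function name `data_quality_summary` and the toy columns are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def data_quality_summary(df):
    """Return the number of duplicated rows and the missing-value count per column."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }

# Toy data (hypothetical): one repeated row and one missing value
toy = pd.DataFrame({"a": [1, 1, 2, None], "b": ["x", "x", "y", "z"]})
summary = data_quality_summary(toy)
print(summary)
```

Calling the same helper on `SeoulBikeData` would give both answers in a single output instead of two separate cells.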

In [None]:
# Heatmap of missing values (none are present in our dataset)
plt.figure(figsize=(15, 5))
sns.heatmap(SeoulBikeData.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_Name", size=10, weight="bold")
plt.title("Places of missing values in column",fontweight="bold",size=17)
plt.show()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
SeoulBikeData.columns

In [None]:
# Dataset Describe
SeoulBikeData.describe()

**The column names contain units, spaces and special characters, which could cause problems later during modeling, so we rename those columns.**

In [None]:
#Renaming the columns to our required column names
SeoulBikeData.rename({"Temperature(°C)": "Temperature",  
                      "Functioning Day":"Functioning_Day",
           "Humidity(%)": "Humidity",  
           "Wind speed (m/s)": "Wind_speed",
           "Visibility (10m)": "Visibility",
           "Dew point temperature(°C)": "Dew_point_temperature",
           "Solar Radiation (MJ/m2)": "Solar_Radiation",
           "Snowfall (cm)": "Snowfall",
           "Rainfall(mm)": "Rainfall",
           "Rented Bike Count": "Rented_Bike_Count"},  
          axis = "columns", inplace = True)

In [None]:
SeoulBikeData.head(-5)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(pd.unique(SeoulBikeData['Temperature']))
print(pd.unique(SeoulBikeData['Functioning_Day']))
print(pd.unique(SeoulBikeData['Humidity']))
print(pd.unique(SeoulBikeData['Wind_speed']))
print(pd.unique(SeoulBikeData['Visibility']))
print(pd.unique(SeoulBikeData['Dew_point_temperature']))
print(pd.unique(SeoulBikeData['Solar_Radiation']))
print(pd.unique(SeoulBikeData['Snowfall']))
print(pd.unique(SeoulBikeData['Rainfall']))
print(pd.unique(SeoulBikeData['Rented_Bike_Count']))


In [None]:
#@title Box plot for Dependent variable
fig = plt.figure(figsize =(16,4))
sns.boxplot(x=SeoulBikeData['Rented_Bike_Count'])

**Exploring Categorical Variables**

In [None]:
#Exploration using graphs and plot
Holiday_rent = pd.DataFrame(SeoulBikeData.groupby('Holiday').agg({'Rented_Bike_Count':'mean'}))
Season_rent = pd.DataFrame(SeoulBikeData.groupby('Seasons').agg({'Rented_Bike_Count':'mean'}))

# Bike rents in Seasons and holidays
fig, ax = plt.subplots(2,2,figsize=(15,10))
ax1=plt.subplot(2, 2,1)
sns.barplot(x=Holiday_rent.index, y = Holiday_rent['Rented_Bike_Count'])
ax1=plt.subplot(2, 2,2)
sns.barplot(x=Season_rent.index, y = Season_rent['Rented_Bike_Count'])

# How many Total Seasons and Holidays
ax1=plt.subplot(2, 2,3)
SeoulBikeData['Holiday'].value_counts().plot(kind='bar')
plt.xlabel('Holiday')
plt.ylabel('Counts')
ax1=plt.subplot(2, 2,4)
SeoulBikeData['Seasons'].value_counts().plot(kind='bar')
plt.xlabel('Seasons')
plt.ylabel('Counts')

**From this we can conclude that, on average, more bikes are rented on working days (No Holiday) than on holidays, and that rentals are highest in summer. In general, bikes are rented more on working days regardless of the season.**

**Exploring Numerical Variables**

In [None]:
numerical_features = ['Hour', 'Temperature', 'Humidity',
       'Wind_speed', 'Visibility', 'Solar_Radiation',
       'Rainfall', 'Snowfall']

# One color from the "Set1" palette for each numerical feature
rgb_values = sns.color_palette("Set1", 9)
# Map each feature name to its color
color_map = dict(zip(numerical_features, rgb_values))

In [None]:
plt.rcParams['figure.figsize'] = (15, 5)
# One regression plot per numerical feature against the rented bike count
for col in numerical_features:
  plt.figure()
  sns.regplot(x=SeoulBikeData[col], y = SeoulBikeData['Rented_Bike_Count'],scatter_kws={"color": color_map[col]}, line_kws={"color": "black"})

* *Following are some of the conclusions I have drawn :*

**Hour:**

Demand is high around office commuting times, at about 8 A.M. and 8 P.M., while the early morning and late evening show a noticeably different trend. Demand is comparatively lower in the hours between these two peaks.

**Temperature:**

In general, temperature has a positive correlation with bike demand: as the temperature increases, the bike count also increases.

**Humidity:**

Humidity acts as a deterrent to a bike ride. The bike count decreases when the humidity increases.

**Wind Speed:**

The bike count increases slightly with wind speed, but the change is very small.

**Visibility:**

People are less likely to ride a bike in low visibility, so the bike count increases as visibility increases.

**Rainfall and Snowfall:**

When there is rainfall or snowfall, people prefer not to travel, and hence the bike count decreases.

**3D plot representing Rainfall , Snowfall , Rented bike count**

In [None]:
import plotly.express as px

fig = px.scatter_3d(SeoulBikeData, x='Rainfall', y='Snowfall', z='Rented_Bike_Count',
                    size_max=18,
               opacity=0.7)

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

**Change in Bike Renting with Change in hours**

In [None]:
# group by hour and get the average bikes rented and the percent change
avg_rent_hrs = SeoulBikeData.groupby('Hour')['Rented_Bike_Count'].mean()
pct_rent_hrs = SeoulBikeData.groupby('Hour')['Rented_Bike_Count'].sum().pct_change()

fig, (axis1,axis2) = plt.subplots(2,1,sharex=True,figsize=(15,8))

# plot average rent over time(hrs)
ax1 = avg_rent_hrs.plot(legend=True,ax=axis1,marker='o',title="Average Bikes Rented Per Hr")
ax1.set_xticks(range(len(avg_rent_hrs)))
ax1.set_xticklabels(avg_rent_hrs.index.tolist(), rotation=85)

# plot percent change for rent over time (hrs)
ax2 = pct_rent_hrs.plot(legend=True,ax=axis2,marker='o',rot=85,colormap="summer",title="Bike Rent Percent Change")
#ax1.set_xticks(range(len(avg_rent_hrs)))
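
To make the second panel easier to read, here is a minimal sketch of what `pct_change` computes: each value is compared with the previous one as `(current - previous) / previous`. The hourly totals below are hypothetical numbers chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical hourly totals, for illustration only
demo_totals = pd.Series([100, 150, 75], index=[0, 1, 2], name="Rented_Bike_Count")

# The first entry has no previous hour to compare with, so it is NaN
demo_change = demo_totals.pct_change()
print(demo_change)
```

A value of 0.5 means a 50% rise over the previous hour, and -0.5 means the count halved.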

To get the types of seasons in our dataset:

In [None]:
SeoulBikeData.Seasons.unique()

To get the unique holiday in our dataset:

In [None]:
SeoulBikeData.Holiday.unique()

To get the unique number of hours in our dataset:

In [None]:
SeoulBikeData.Hour.unique()

The dependent variable is zero on non-functioning days, so we drop those rows to keep the dataset consistent.

In [None]:
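# Keep only the rows recorded on functioning days; .copy() avoids chained-assignment warnings later
SeoulBikeData = SeoulBikeData[SeoulBikeData['Functioning_Day'] == 'Yes'].copy()
# The column is now constant, so it carries no information for the model
SeoulBikeData = SeoulBikeData.drop('Functioning_Day', axis = 1)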

In [None]:
SeoulBikeData.head()

Assigning the categorical values to the columns for building a model:

In [None]:
def Functioning_Day(row):
  if str(row) == 'Yes':
    return 1
  else :
    return 0

In [None]:
def Holiday_label(row):
  if str(row) == 'Holiday':
    return 1
  else :
    return 0

In [None]:
SeoulBikeData['Holiday']=SeoulBikeData.apply(lambda row : Holiday_label(row['Holiday']),axis=1)

SeoulBikeData['Holiday'].value_counts()

In [None]:
plt.figure(figsize=(25,10))
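# numeric_only=True skips the text columns (Date, Seasons), which cannot be correlated
cor=SeoulBikeData.corr(numeric_only=True).abs()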
mask = np.triu(np.ones_like(cor, dtype=bool))
sns.heatmap(cor,mask=mask, annot=True, cmap='coolwarm')

From the above heatmap, we can see that Temperature and Dew_point_temperature are highly correlated, with a coefficient of 0.91. Hour also shows a good correlation with our dependent variable.
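
Reading pairs off a heatmap by eye becomes tedious with many columns. The sketch below lists the feature pairs whose absolute correlation exceeds a threshold. The helper name `correlated_pairs`, the 0.9 threshold and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (hypothetical): the second column is a linear function of the first
temperature = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "Temperature": temperature,
    "Dew_point_temperature": 0.9 * temperature + 1.0,
    "Wind_speed": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})
high_pairs = correlated_pairs(toy)
print(high_pairs)
```

Any pair returned by such a helper is a candidate for dropping one of its two columns before fitting a linear model.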

We also need to map the Seasons column to numeric labels.

In [None]:
def Seasons_label(row):
  if str(row) == 'Winter':
    return 0
  elif str(row) == 'Autumn':
    return 1
  elif str(row) == 'Spring':
    return 2
  elif  str(row) == 'Summer':
    return 3

In [None]:
SeoulBikeData['Seasons']=SeoulBikeData.apply(lambda row : Seasons_label(row['Seasons']),axis=1)

SeoulBikeData['Seasons'].value_counts()
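
As an alternative to the row-wise `apply` calls, the same encoding can be written with a dictionary and `Series.map`. This is a minimal sketch on a toy DataFrame; the names `season_codes`, `toy` and `encoded` are assumptions introduced here, and the codes mirror those in `Seasons_label` and `Holiday_label`.

```python
import pandas as pd

# Same integer codes as in Seasons_label above
season_codes = {"Winter": 0, "Autumn": 1, "Spring": 2, "Summer": 3}

# Toy data (hypothetical) with the same category labels as the dataset
toy = pd.DataFrame({
    "Seasons": ["Winter", "Summer", "Autumn", "Spring"],
    "Holiday": ["Holiday", "No Holiday", "No Holiday", "Holiday"],
})

encoded = toy.assign(
    # Dictionary lookup per value; unknown labels would become NaN
    Seasons=toy["Seasons"].map(season_codes),
    # 1 for 'Holiday', 0 for anything else, as in Holiday_label
    Holiday=(toy["Holiday"] == "Holiday").astype(int),
)
print(encoded)
```

A dictionary keeps the label-to-code mapping visible in one place, and a label missing from the dictionary shows up as `NaN` rather than silently becoming `None`.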

In [None]:
def Function_day(row):
  if str(row) == 'Yes':
    return 1
  else :
    return 0

In [None]:
plt.figure(figsize=(20,10))
# numeric_only=True skips the Date column, which is still stored as text
sns.heatmap(SeoulBikeData.corr(numeric_only=True).abs(),annot=True,cmap='coolwarm')

**Identifying the outliers:**

In [None]:
sns.set(font_scale=1.0)
fig, axes = plt.subplots(nrows=4,ncols=2)
fig.set_size_inches(15, 15)
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Humidity",orient="v",ax=axes[0][0])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Hour",orient="v",ax=axes[0][1])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Temperature",orient="v",ax=axes[1][0])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Wind_speed",orient="v",ax=axes[1][1])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Visibility",orient="v",ax=axes[2][0])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Seasons",orient="v",ax=axes[2][1])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Holiday",orient="v",ax=axes[3][0])
sns.boxplot(data=SeoulBikeData,y="Rented_Bike_Count",x="Solar_Radiation",orient="v",ax=axes[3][1])
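
The whiskers of a box plot are commonly drawn with the 1.5 × IQR rule, and the same rule can flag outliers numerically. The following is a minimal sketch; the function name `iqr_outlier_mask` and the sample values are assumptions for illustration and do not come from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 100]
mask = iqr_outlier_mask(sample)
print(mask)
```

Applied to a column such as `Rented_Bike_Count`, `mask.sum()` would give the number of points that fall outside the whiskers.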

Dew_point_temperature is highly correlated with Temperature, which would introduce multicollinearity, and Date is not used as a model feature, so we drop these two columns.

In [None]:
# Dropping the Date and Dew_point_temperature columns

SeoulBikeData = SeoulBikeData.drop(labels='Date',axis=1)
SeoulBikeData = SeoulBikeData.drop(labels='Dew_point_temperature',axis=1,)

In [None]:
# Data for all the independent variables

X = SeoulBikeData.drop(labels='Rented_Bike_Count',axis=1)

# Data for the dependent variable

Y = SeoulBikeData['Rented_Bike_Count']

In [None]:
#X
#Y

# Linear Regression

In [None]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge

In [None]:
# Splitting the dataset into the Training set and Test set

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

In [None]:
#Shapes of train and test data 
X_train.shape,X_test.shape,Y_train.shape,Y_test.shape

In [None]:
regressor.intercept_

In [None]:
regressor.coef_

In [None]:
y_pred_train=regressor.predict(X_train)
print(y_pred_train)

In [None]:
y_pred=regressor.predict(X_test)
print(y_pred)

**Evaluation Metrics**

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error,accuracy_score

In [None]:
r2_score(Y_test, y_pred)

In [None]:
print("Adjusted R2 : ",1-(1-r2_score((Y_test), (y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
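
The expression in the cell above is the standard adjusted R² formula, written here with $n$ for the number of test rows (`X_test.shape[0]`) and $p$ for the number of features (`X_test.shape[1]`):

$$R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}$$

Unlike plain R², this value goes down when a feature is added that does not improve the fit enough, which makes it a fairer basis for comparing models with different numbers of features.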

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)
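
The same metric cells are repeated for every model below, so it can help to collect them in one function. This is a minimal NumPy sketch, not the notebook's own code; the function name `regression_report` and the toy values are assumptions for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MSE, RMSE and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n = y_true.shape[0]
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "R2": r2,
        "Adjusted R2": adjusted_r2,
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(residuals))),
    }

# Toy values (hypothetical): only the last prediction is off, by 1
report = regression_report([1, 2, 3, 4], [1, 2, 3, 5], n_features=1)
print(report)
```

With a helper like this, each model section could call `regression_report(Y_test, y_pred, X_test.shape[1])` once, which also removes the risk of reusing a stale `y_pred` or `adj_r2` from an earlier model.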


In [None]:
plt.scatter(Y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')

In [None]:
plt.figure(figsize=(20,10))
plt.plot(y_pred)
plt.plot(np.array(Y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

# **Lasso Regression**

In [None]:
lasso = Lasso(alpha=0.001)
lasso.fit(X_train, Y_train)

In [None]:
y_pred=lasso.predict(X_test)

In [None]:
r2_score(Y_test, y_pred)

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

# Ridge Regression

In [None]:
# Ridge regression with a fixed regularization strength (alpha=30); GridSearchCV is imported for later use

from sklearn.model_selection import GridSearchCV
ridge = Ridge(alpha=30)
ridge.fit(X_train,Y_train)

In [None]:
y_pred=ridge.predict(X_test)

In [None]:
r2_score(Y_test, y_pred)

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

# Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()

In [None]:
tree.fit(X_train,Y_train)

In [None]:
y_pred=tree.predict(X_test)

In [None]:
r2_score(Y_test, y_pred)

In [None]:
print("Adjusted R2 : ",1-(1-r2_score((Y_test), (y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

In [None]:
plt.scatter(Y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')

In [None]:
plt.figure(figsize=(12,8))
plt.plot(y_pred)
plt.plot(np.array(Y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

In [None]:
tree.feature_importances_

In [None]:
features = X.columns
importances = tree.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(12,9))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='red', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
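
Alongside the bar chart, the importances can be printed as a ranked table by pairing them with the feature names. A minimal sketch follows; the feature names and importance values are hypothetical placeholders, not results from the fitted tree.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and importance values, for illustration only
demo_features = ["Hour", "Temperature", "Humidity"]
demo_importances = np.array([0.30, 0.55, 0.15])

# Pair each importance with its feature name and sort from most to least important
ranked = pd.Series(demo_importances, index=demo_features).sort_values(ascending=False)
print(ranked)
```

In the notebook the same pattern would use `tree.feature_importances_` and `X.columns` in place of the placeholder arrays.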

# Gradient Boosting Algorithm

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
ensemble = GradientBoostingRegressor()

In [None]:
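ensemble.fit(X_train,Y_train)

In [None]:
# Predict with the gradient boosting model so the metrics below refer to it
y_pred=ensemble.predict(X_test)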

In [None]:
r2_score(Y_test, y_pred)

In [None]:
print("Adjusted R2 : ",1-(1-r2_score((Y_test), (y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

In [None]:
plt.scatter(Y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')

# RandomForest

The model below is fitted with default settings; ‘n_estimators’, ‘max_depth’ and ‘min_samples_leaf’ are the parameters to tune.

In [None]:
from sklearn.ensemble import RandomForestRegressor
ensemble_regressior = RandomForestRegressor()

In [None]:
ensemble_regressior.fit(X_train,Y_train)

In [None]:
y_pred=ensemble_regressior.predict(X_test)

In [None]:
r2_score(Y_test, y_pred)

In [None]:
print("Adjusted R2 : ",1-(1-r2_score((Y_test), (y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

In [None]:
plt.scatter(Y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')

# XGBoost

In [None]:
import xgboost as xgb

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33, random_state=42)
dreg= xgb.XGBRegressor(
                        booster= 'gbtree',
                        colsample_bylevel= 1,
                        colsample_bynode= 1,
                        colsample_bytree= 0.7,
                        # eta is an alias of learning_rate in XGBoost, so only learning_rate (below) is set
                        gamma= 0,
                        importance_type= 'gain',
                        learning_rate= 0.1,
                        max_delta_step= 0,
                        max_depth= 9,
                        min_child_weight= 10,
                        n_estimators= 100,
                        n_jobs= 1,
                        objective= 'reg:squarederror',  # current name for the deprecated 'reg:linear'
                        random_state= 0,
                        reg_alpha= 0,
                        reg_lambda= 1,
                        scale_pos_weight= 1,
                        subsample= 1,
                        verbosity= 1)
dreg.fit(X_train, Y_train)
y_pred = dreg.predict(X_test)
#Find R-squared value
r2 = r2_score(Y_test, y_pred)
# Find Adjusted R-squared value
adj_r2=1-(1-r2_score(Y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
train_score = dreg.score(X_train, Y_train)
test_score = dreg.score(X_test,Y_test)
print(f'Train score: {train_score}')
print(f'Test score: {test_score}')
r2

In [None]:
print(r2)
print(adj_r2)

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

# Cat boost

In [None]:
!pip install catboost

In [None]:

from catboost import CatBoostRegressor
import timeit

from sklearn.datasets import make_regression

In [None]:
model = CatBoostRegressor(
    iterations=100,
    learning_rate=0.03
  )

In [None]:
model.fit(
      X_train, Y_train,
      eval_set=(X_test, Y_test),
      verbose=10);

In [None]:
def train_on_cpu():  
  model = CatBoostRegressor(
    iterations=100,
    learning_rate=0.03
  )
  
  model.fit(
      X_train, Y_train,
      eval_set=(X_test, Y_test),
      verbose=10
  );   
      
cpu_time = timeit.timeit('train_on_cpu()', 
                         setup="from __main__ import train_on_cpu", 
                         number=1)

print('Time to fit model on CPU: {} sec'.format(int(cpu_time)))

In [None]:
# Predicting the Test set results

y_pred = model.predict(X_test)

In [None]:
import math
math.sqrt(mean_squared_error(Y_test, y_pred))

In [None]:
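# Recompute both scores for CatBoost; adj_r2 would otherwise still hold the XGBoost value
r2 = r2_score(Y_test, y_pred)
adj_r2 = 1-(1-r2)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print(r2)
print(adj_r2)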

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:

MAE  = mean_absolute_error(Y_test, y_pred)
print("MAE :" , MAE)

In [None]:

# Validating Assumptions

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
def print_metrics(actual, predicted):
  print('MSE is {}'.format(mean_squared_error(actual, predicted)))
  print('RMSE is {}'.format(math.sqrt(mean_squared_error(actual, predicted))))
  print('R2 is {}'.format(r2_score(actual, predicted)))
  print('MAE is {}'.format(mean_absolute_error(actual, predicted)))
  print('MAPE is {}'.format(np.mean(np.abs((actual - predicted) / actual)) * 100))
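
The MAPE line divides by the actual values, so a single hour with zero rentals would produce an infinite or undefined result. The sketch below shows one possible way to guard against that by skipping zero actuals. The function name `safe_mape` and the skip-zeros policy are assumptions, not something the notebook prescribes.

```python
import numpy as np

def safe_mape(actual, predicted):
    """Mean absolute percentage error computed only over rows with a non-zero actual value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    nonzero = actual != 0
    if not nonzero.any():
        # No row has a usable denominator
        return float("nan")
    errors = np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero])
    return float(np.mean(errors) * 100)

# Toy values (hypothetical): the middle row has a zero actual and is skipped
demo_mape = safe_mape([100, 0, 50], [110, 5, 40])
print(demo_mape)
```

Skipping rows changes what the metric measures, so the number of excluded rows is worth reporting next to the score.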

In [None]:
# Evaluation of training Data

print_metrics(Y_train, y_train_pred)

In [None]:
# Test dataset metrics

print_metrics(Y_test, y_test_pred)

# Grid Search CV on XGboost algorithm

In [None]:
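import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb

# Use a separate name for the estimator so the `xgb` module alias is not overwritten
xgb_reg = xgb.XGBRegressor(random_state=0)

In [None]:
# XGBoost documents eta (the learning rate) as lying in [0, 1],
# so the out-of-range candidates 4 and 40 are left out of the grid
params = {"min_child_weight":[10,20],
            'eta': [0.004,0.04],
            'colsample_bytree':[0.7],
            'max_depth': [7,9,11],
          }

In [None]:
# Note: the search is fitted on the full X, Y (train and test rows together)
reg_gs = GridSearchCV(xgb_reg,param_grid=params, verbose=1,cv=3)
reg_gs.fit(X, Y)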

In [None]:

reg_gs.best_estimator_.get_params()

In [None]:
reg_optimal_model =reg_gs.best_estimator_
#print(reg_optimal_model)

In [None]:
train_preds = reg_optimal_model.predict(X_train)
test_preds = reg_optimal_model.predict(X_test)

In [None]:
reg_optimal_model.score(X_test,Y_test)

For Test dataset:

In [None]:
#Find R-squared value

r2 = r2_score(Y_test, test_preds)

# Find Adjusted R-squared value

adj_r2=1-(1-r2_score(Y_test, test_preds))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))

In [None]:
print(r2)
print(adj_r2)

For Train dataset

In [None]:
#Find R-squared value
r2 = r2_score(Y_train, train_preds)
# Find Adjusted R-squared value
adj_r2=1-(1-r2_score(Y_train, train_preds))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))

In [None]:
print(r2)
print(adj_r2)

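**The r2 scores for the train and test datasets are nearly the same, which suggests the model is not overfitting. Because the grid search was fitted on the full X and Y, the test rows were seen during training, so this comparison is optimistic.**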
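
For a stricter comparison, the search can be fitted on the training split only and scored on the untouched test split. The sketch below shows the pattern on synthetic data with a Ridge model standing in for XGBoost; the data, the alpha grid and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (hypothetical), split once into train and test
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Cross-validation happens inside the training split; the test split is never seen
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_tr, y_tr)

train_r2 = search.best_estimator_.score(X_tr, y_tr)
test_r2 = search.best_estimator_.score(X_te, y_te)
print("Best params:", search.best_params_)
print("Train R2:", train_r2, "Test R2:", test_r2)
```

With this setup, a small gap between the two scores is evidence of generalization rather than a side effect of training on the test rows.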

# Principal Component Analysis

In [None]:
# import libraries for PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=7)
X_pca = pca.fit_transform(X)

In [None]:
print(pca.components_)

In [None]:
print(pca.explained_variance_)
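
The raw variances printed above are easier to interpret as fractions of the total variance. scikit-learn exposes these as `pca.explained_variance_ratio_`; the sketch below computes the same kind of ratio from scratch with NumPy, under the assumption that all components are kept. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component (all components kept)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances / variances.sum()

# Synthetic data (hypothetical): three columns with very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])
ratios = explained_variance_ratio(X_demo)
print(ratios, ratios.cumsum())
```

Because the first column has by far the largest scale, the first component dominates here, which illustrates why features are often standardized before PCA.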

In [None]:
df_pca = pd.DataFrame(X_pca,columns=['F1','F2','F3','F4','F5','F6','F7'])
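# Use .values so the target is assigned by position; the filtered index of SeoulBikeData no longer matches df_pca
df_pca['Rented_Bike_Count'] = SeoulBikeData['Rented_Bike_Count'].values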

In [None]:
df_pca.head()

In [None]:
df_pca.corr().abs()

In [None]:
#Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
 
   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
 
   return(vif)
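
The idea behind VIF is that each feature is regressed on all the others and VIF = 1 / (1 - R²) of that regression. The NumPy sketch below spells this out. It adds an intercept to every auxiliary regression, which is an assumption of this sketch, so its values can differ from those returned by `calc_vif`.

```python
import numpy as np

def vif_from_definition(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (hypothetical): column c is almost a copy of column a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
vifs = vif_from_definition(np.column_stack([a, b, c]))
print(vifs)
```

Values near 1 indicate little collinearity, while very large values, as for the two nearly identical columns here, signal that one of the columns is redundant.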

In [None]:

calc_vif(df_pca[[feature for feature in df_pca.describe().columns if feature not in ['Rented_Bike_Count']]])

In [None]:
# Let's look at the distribution plot of each feature
# (sns.displot creates its own figure, so no extra figure or counter is needed)
for i in df_pca.columns:
    sns.displot(df_pca[i])

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,r2_score,mean_absolute_error,mean_squared_error


In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X , Y, test_size=0.20)
print("Shape of Train data set is",X_train.shape,Y_train.shape)
print("Shape of Test data set is",X_test.shape,Y_test.shape)

In [None]:
### Cross validation

lasso = Lasso()
parameters = {'alpha': [1e-7,1e-3,1e-2,1e-1,1,5,10,20,100]}
regressor = GridSearchCV(lasso, parameters, cv=8)
regressor.fit(X_train, Y_train)

In [None]:
optimal=regressor.best_estimator_

In [None]:
y_pred=optimal.predict(X_test)
y_pred_train=optimal.predict(X_train)

In [None]:
MSE  = mean_squared_error(Y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
#Train dataset
r2 = r2_score(Y_train, y_pred_train)
print("R2 Train:" ,r2)

#test dataset
r2 = r2_score(Y_test, y_pred)
print("R2 Test:" ,r2)

# Grid Search CV on XGboost algorithm

In [None]:
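import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb

# Use a separate name for the estimator so the `xgb` module alias is not overwritten
xgb_reg = xgb.XGBRegressor(random_state=0)

In [None]:
# XGBoost documents eta (the learning rate) as lying in [0, 1],
# so the out-of-range candidates 4 and 40 are left out of the grid
params = {"min_child_weight":[10,20],
            'eta': [0.004,0.04],
            'colsample_bytree':[0.7],
            'max_depth': [7,9,11],
          }

In [None]:
# Note: the search is fitted on the full X, Y (train and test rows together)
reg_gs = GridSearchCV(xgb_reg,param_grid=params, verbose=1,cv=3)
reg_gs.fit(X, Y)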

In [None]:
reg_gs.best_estimator_.get_params()

In [None]:
reg_optimal_model =reg_gs.best_estimator_

In [None]:
train_preds = reg_optimal_model.predict(X_train)
test_preds = reg_optimal_model.predict(X_test)

In [None]:
reg_optimal_model.score(X_test,Y_test)

In [None]:
##For Test dataset:

#Find R-squared value
r2 = r2_score(Y_test, test_preds)
# Find Adjusted R-squared value
adj_r2=1-(1-r2_score(Y_test, test_preds))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))

In [None]:
print(r2)
print(adj_r2)

In [None]:
# Use test_preds from the tuned XGBoost model; y_pred still holds the Lasso predictions
MSE  = mean_squared_error(Y_test, test_preds)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)
MAE  = mean_absolute_error(Y_test, test_preds)
print("MAE :" , MAE)

# **Conclusion**

*When we compare the root mean squared error and mean absolute error of all the models, **the XGBoost** model has the lowest values of both, with a score of about **94%** (R²). So this model is the best of those tried here for predicting the hourly bike rental count. Bike rentals increase with temperature, whereas they appear far less sensitive to wind speed and humidity. This agrees with the strong correlation between rentals and temperature and suggests that pleasant weather is a good predictor: people mainly rent bikes on days with nice weather and comfortable temperatures. This could be important for planning new bike rental stations.*

# -- -- -- -- -- -- -- -- -- --END -- -- -- -- -- -- -- -- -- -- -- -- -- 