<a href="https://colab.research.google.com/github/GODxFATHER/Bike-Sharing-Demand-Prediction/blob/main/Bike_Sharing_Demand_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

## <b> Problem Description </b>

### Rental bikes have been introduced in many cities to improve mobility and comfort. Bikes must be available and accessible to the public at the right time, because this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the number of bikes required at each hour.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature - Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - in units of 10 m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

#Importing Libraries

In [None]:
# Importing Libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

In [None]:
# # Configuration for matplotlib graphs

# matplotlib.rcParams['font.size'] = 12
# matplotlib.rcParams['figure.figsize'] = (13, 7)
# matplotlib.rcParams['figure.facecolor'] = '#00000000'
# sns.set_style('darkgrid');

#Loading the Dataset

In [None]:
# Mounting the google drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Loading the csv file into pandas dataframe

path_p = '/content/SeoulBikeData.csv'  # local Colab copy (not used below)
path_n = "/content/drive/MyDrive/DATA_FILES/SeoulBikeData.csv"  # copy on the mounted Drive
df = pd.read_csv(path_n, encoding='unicode_escape')

#EDA on features

In [None]:
#rows and columns in dataset
df.shape

So we have 8760 rows and 14 columns

In [None]:
#printing the first 5 rows of dataset
df.head()

In [None]:
#dataset info
df.info()

**Here we see that the following columns have the object data type:**

* **Date**
* **Seasons**
* **Holiday**
* **Functioning Day**

**So we need to convert `Date` into the datetime data type.**

**The remaining columns are handled later.**

In [None]:
# convert the 'Date' column to datetime format
# (the CSV stores dates day-first, e.g. 01/12/2017 is 1 December 2017)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

#let's check 
df.info()
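The raw `Date` strings in this CSV are day-first, so the parsing order matters. As a minimal sketch, the snippet below compares an explicit day-first format with a month-first reading on a few hand-written sample strings (they are illustrative values in the same layout, not rows read from the file).

```python
import pandas as pd

# Hand-written sample strings in the same dd/mm/yyyy layout as the CSV.
sample = pd.Series(["01/12/2017", "12/11/2018", "30/11/2018"])

# Explicit day-first format: unambiguous, and it fails loudly on malformed rows.
day_first = pd.to_datetime(sample, format="%d/%m/%Y")

# Month-first reading of the same strings; values whose first field is > 12
# cannot be parsed this way and become NaT with errors="coerce".
month_first = pd.to_datetime(sample, format="%m/%d/%Y", errors="coerce")

print(day_first.min(), "to", day_first.max())
print(month_first)
```

Passing an explicit `format` is one option; `dayfirst=True` is another, but an explicit format also rejects rows that do not match the expected layout.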

In [None]:
#checking for duplicate rows in the dataset
df.duplicated().sum()

* **The dataset doesn't contain any duplicate rows.**

##Let's Check for null values

In [None]:
#checking the missing data in the dataset
df.isnull().sum()

* **The dataset doesn't have any null values.**
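The duplicate and null checks above could be bundled into one small helper so they can be re-run after each cleaning step. This is only a sketch: `quality_report` is a hypothetical helper name, and the toy frame is made up for illustration.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarise duplicate rows and missing values (hypothetical helper)."""
    missing = frame.isnull().sum()
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "columns_with_nulls": missing[missing > 0].to_dict(),
    }

# Toy frame with one repeated row and one missing value.
toy = pd.DataFrame({"Hour": [0, 1, 1, 2], "Rainfall": [0.0, 0.5, 0.5, None]})
report = quality_report(toy)
print(report)
```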

In [None]:
#dataset descriptive statistics

df.describe()

In [None]:
#dataset columns
df.columns

In [None]:
#renaming dataset columns
bike_df = df.rename( columns = { "Rented Bike Count":"Rented_Bike_Count",
                   "Temperature(°C)":"Temperature",
                   "Humidity(%)":"Humidity",
                   "Wind speed (m/s)":"Wind_speed",
                   "Visibility (10m)":"Visibility",
                   "Dew point temperature(°C)":"Dew_point_temperature",
                   "Solar Radiation (MJ/m2)":"Solar_Radiation",
                   "Rainfall(mm)":"Rainfall",
                   "Snowfall (cm)":"Snowfall",
                   "Functioning Day":"Functioning_Day" } )
bike_df.head(1)

In [None]:
#year column
name_bike_df = bike_df.copy()
name_bike_df['year'] = name_bike_df['Date'].dt.year
#month column 
name_bike_df['month'] = name_bike_df['Date'].dt.month_name() 
#day column
name_bike_df['day'] = name_bike_df['Date'].dt.day
#weekday column
name_bike_df['weekday'] = name_bike_df['Date'].dt.day_name() 
#quarter column
# name_bike_df['quarter'] = name_bike_df['Date'].dt.quarter   
 

##Let's explore Columns

##Univariate

In [None]:
name_bike_df.head(1)

In [None]:
# Numerical features

numerical_features = name_bike_df.describe().columns
numerical_features

In [None]:
# Distribution of Dependent Variable 
from scipy.stats import norm  # needed for fit=norm below

col = 'Rented_Bike_Count'

plt.figure(figsize=(11,4))

plt.subplot(1, 2, 1)
sns.distplot(name_bike_df[col], kde=True, fit=norm)
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')

plt.subplot(1, 2, 2)
# pass the data as x so the box is horizontal and the vertical mean/median lines line up with it
sns.boxplot(x=name_bike_df[col])
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  

plt.title('boxplot') 

In [None]:
# Distribution of Dependent Variable after transformation
from scipy import stats  # probplot is used below

col = 'Rented_Bike_Count'
plt.figure(figsize=(15,4))

plt.subplot(1, 3, 1)
sns.distplot(name_bike_df[col], kde=True)
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')

plt.subplot(1, 3, 2)
sns.distplot(np.sqrt(name_bike_df[col]), kde=True, color="y")
plt.axvline(np.median(np.sqrt(name_bike_df[col])), color='g', linestyle='--')
plt.axvline(np.mean(np.sqrt(name_bike_df[col])),color='r', linestyle='--')
plt.title(f'sqrt transformed skewness {round((np.sqrt(name_bike_df[col])).skew(),2)}')

plt.subplot(1,3,3)
stats.probplot(name_bike_df[col], plot=plt)

plt.show()

In [None]:

from scipy import stats
from scipy.stats import norm

col='Rented_Bike_Count'
mean = round(norm.fit(name_bike_df[col])[0],2)
sd = round(norm.fit(name_bike_df[col])[1],2)

plt.figure(figsize=(21,5))

plt.subplot(1,3,1)

sns.distplot(name_bike_df[col] , fit=norm, kde =True)
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
# plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')
plt.legend(['continuous probability density curve', f'Normal dist.  mean {mean} AND SD {sd} '] , loc='best')
plt.ylabel('Frequency') 
plt.title(f'{col} distribution AND skewness {round(name_bike_df[col].skew(),2)}')

plt.subplot(1,3,2)
stats.probplot(name_bike_df[col], plot=plt)
plt.show()

In [None]:
from scipy import stats
from scipy.stats import norm, skew

col='Rented_Bike_Count'

mean = round(norm.fit(name_bike_df[col])[0],2)
sd = round(norm.fit(name_bike_df[col])[1],2)

plt.figure(figsize=(21,5))

plt.subplot(1,3,1)

sns.distplot(name_bike_df[col] , fit=norm, kde =True)
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
# plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')
plt.legend(['continuous probability density curve', f'Normal dist.  mean {mean} AND SD {sd} '] , loc='best')
plt.ylabel('Frequency') 
plt.title(f'{col} distribution AND skewness {round(name_bike_df[col].skew(),2)}')

plt.subplot(1,3,2)

mean = round(norm.fit(np.sqrt(name_bike_df[col]))[0],2)
sd = round(norm.fit(np.sqrt(name_bike_df[col]))[1],2)

sns.distplot(np.sqrt(name_bike_df[col]) , fit=norm, kde =True, color="y")
plt.axvline(np.median(np.sqrt(name_bike_df[col])),color='g', linestyle='--')
plt.axvline(np.mean(np.sqrt(name_bike_df[col])),color='r', linestyle='--')  
# plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')
plt.legend(['continuous probability density curve', f'Normal dist.  mean {mean} AND SD {sd} '] , loc='best')
plt.ylabel('Frequency') 
plt.title(f'sqrt transformed {col} distribution AND skewness {round(np.sqrt(name_bike_df[col]).skew(),2)}')

plt.subplot(1,3,3)

stats.probplot(np.sqrt(name_bike_df[col]), plot=plt)

plt.show()

In [None]:
from scipy import stats
from scipy.stats import norm, skew

col='Rented_Bike_Count'

mean = round(norm.fit(name_bike_df[col])[0],2)
sd = round(norm.fit(name_bike_df[col])[1],2)

plt.figure(figsize=(21,5))

plt.subplot(1,3,1)

sns.distplot(name_bike_df[col] , fit=norm, kde =True)
plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
# plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')
plt.legend(['continuous probability density curve', f'Normal dist.  mean {mean} AND SD {sd} '] , loc='best')
plt.ylabel('Frequency') 
plt.title(f'{col} distribution AND skewness {round(name_bike_df[col].skew(),2)}')

plt.subplot(1,3,2)

mean = round(norm.fit(np.log1p(name_bike_df[col]))[0],2)
sd = round(norm.fit(np.log1p(name_bike_df[col]))[1],2)

sns.distplot(np.log1p(name_bike_df[col]) , fit=norm, kde =True, color="y")
plt.axvline(np.median(np.log1p(name_bike_df[col])),color='g', linestyle='--')
plt.axvline(np.mean(np.log1p(name_bike_df[col])),color='r', linestyle='--')  
# plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')
plt.legend(['continuous probability density curve', f'Normal dist.  mean {mean} AND SD {sd} '] , loc='best')
plt.ylabel('Frequency') 
plt.title(f'log1p transformed {col} distribution AND skewness {round(np.log1p(name_bike_df[col]).skew(),2)}')

plt.subplot(1,3,3)

stats.probplot(np.log1p(name_bike_df[col]), plot=plt)

plt.show()
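The plots above compare the raw, square-root and `log1p` versions of the target by eye. The same comparison can be made numerically by computing the skewness of each version. The sketch below uses a synthetic right-skewed series as a stand-in (an assumption for illustration, not the bike data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative "counts" (illustrative, not the bike data).
counts = pd.Series(rng.gamma(shape=2.0, scale=300.0, size=5000))

# Skewness of the raw series and of each candidate transform.
skews = pd.Series({
    "raw": counts.skew(),
    "sqrt": np.sqrt(counts).skew(),
    "log1p": np.log1p(counts).skew(),
})
print(skews.round(2))

# The transform whose skewness is closest to zero is the most symmetric candidate.
best = skews.abs().idxmin()
print("closest to symmetric:", best)
```

On the notebook's frame, `counts` would be replaced by `name_bike_df['Rented_Bike_Count']`.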

In [None]:
# density plot of numerical columns
# Distribution of numeric_features

for col in numerical_features[1:-2]:

  plt.figure(figsize=(19,4))

  plt.subplot(1, 3, 1)
  sns.distplot(name_bike_df[col], kde=True)
  plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
  plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
  plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')

  plt.subplot(1, 3, 2)
  sns.boxplot(name_bike_df[col])
  plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
  plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
  plt.title('boxplot') 
   
  plt.subplot(1,3,3)
  # Q-Q plot of the raw values; log1p is undefined for values <= -1 (e.g. sub-zero temperatures)
  stats.probplot(name_bike_df[col], plot=plt)

  plt.show()

In [None]:
# Transforming the distribution of numeric_features

for col in numerical_features[1:-2]:

  plt.figure(figsize=(15,4))

  plt.subplot(1, 3, 1)
  sns.distplot(name_bike_df[col], kde=True)
  plt.axvline(np.median(name_bike_df[col]),color='g', linestyle='--')
  plt.axvline(np.mean(name_bike_df[col]),color='r', linestyle='--')  
  plt.title(f'skewness {round(name_bike_df[col].skew(),2)}')

  plt.subplot(1, 3, 2)
  sns.distplot(np.sqrt(name_bike_df[col]), kde=True, color="y")
  plt.axvline(np.median(np.sqrt(name_bike_df[col])), color='g', linestyle='--')
  plt.axvline(np.mean(np.sqrt(name_bike_df[col])),color='r', linestyle='--')
  plt.title(f'sqrt transformed skewness {round((np.sqrt(name_bike_df[col])).skew(),2)}')
  plt.show()

In [None]:
# Categorical features

categorical_features = name_bike_df.describe(include = "object").columns
categorical_features

In [None]:
# Count plot of categorical features

for col in categorical_features:

  plt.figure(figsize=(13,4))

  plt.subplot(1, 2, 1)
  sns.histplot(name_bike_df[col], color="orange")
  plt.title(f'{col} counts')   
  plt.xticks(rotation = 45)
  plt.show()  
  

##Bivariate

### Target column Rented_Bike_Count

In [None]:
# Check for linear relationship between dependent and numerical independent variables

sns.pairplot(data = name_bike_df, hue="Rented_Bike_Count" , y_vars="Rented_Bike_Count") 

In [None]:
sns.boxplot(y= name_bike_df.Rented_Bike_Count, x=name_bike_df.Hour) 

In [None]:
name_bike_df[numerical_features].head(1)

In [None]:
# col = 'Hour'

for col in numerical_features[1:]:
  sns.scatterplot(y= name_bike_df.Rented_Bike_Count, x=name_bike_df[col])
  plt.title(col)
  plt.show()

* **The plot of most interest is rented bike count vs. hour.**
* **Rentals are high when the temperature is between 5 °C and 35 °C.**
* **Renters prefer renting when humidity is between 20% and 90%.**
* **Renters don't like biking when the wind speed is higher than 4.8 m/s.**
* **Renters don't prefer renting during rainfall and snowfall.**
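Observations like the ones above, read off scatter plots, can be checked numerically by cutting a feature into bands and averaging the target within each band. The sketch below is a minimal illustration on synthetic data: the column names mirror `name_bike_df`, but the values and the 10-degree band edges are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for the notebook's frame (made-up values).
temperature = rng.uniform(-15, 38, size=2000)
rentals = np.clip(30 * temperature + rng.normal(0, 50, size=2000), 0, None)  # counts cannot be negative
toy = pd.DataFrame({"Temperature": temperature, "Rented_Bike_Count": rentals})

# Cut the feature into 10-degree bands and average the target inside each band.
bands = pd.cut(toy["Temperature"], bins=[-20, -10, 0, 10, 20, 30, 40])
summary = toy.groupby(bands, observed=True)["Rented_Bike_Count"].agg(["mean", "count"])
print(summary.round(1))
```

The same pattern applies to humidity, wind speed, rainfall or snowfall by changing the column and the bin edges.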

In [None]:
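# restrict the correlation to numeric columns; recent pandas versions raise on string columns otherwise
corr_matrix = name_bike_df.corr(numeric_only=True)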
corr_matrix["Rented_Bike_Count"].sort_values(ascending=False)


###1st column "Date"

In [None]:
print('Dataset ranges from ', name_bike_df.Date.min(), ' to ', name_bike_df.Date.max())

In [None]:
# length of the covered period (latest date minus earliest date)
name_bike_df.Date.max() - name_bike_df.Date.min()

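* **The data covers one year, from 1 December 2017 to 30 November 2018. The dates in the CSV are day-first, so reading them month-first makes the range look like roughly 23 months (January 2017 to December 2018).**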

In [None]:
# bike_df["Date"]
sns.lineplot(x = name_bike_df.Date, y = name_bike_df.Rented_Bike_Count)
plt.xticks(rotation = 45)

* **We can see that the rented bike count increases sharply from January 2018.**

In [None]:
# by month
sns.lineplot(x = name_bike_df.month, y = name_bike_df.Rented_Bike_Count)
plt.xticks(rotation = 90)

* **From November to February rentals decrease, and likewise from June to September.**

* **From February to June there is a sharp rise in rentals, and likewise from September to October.**

In [None]:
# by day
sns.barplot(x = name_bike_df.day, y = name_bike_df.Rented_Bike_Count)

* **From day 12 to day 30 of the month, rentals stay roughly level.**

* **Rentals increase during the first 7 days of the month.**

In [None]:
sns.barplot(x = name_bike_df.weekday , y = name_bike_df.Rented_Bike_Count ) 

* **Sunday has the lowest rentals.**
* **Monday to Friday have high rentals.**

###3rd column hour

In [None]:
sns.pairplot(data = name_bike_df, hue="Hour", y_vars="Hour") 

* **Temperature, wind speed and solar radiation are higher during daylight than at night.**
* **Humidity is lower during daylight than at night.**


In [None]:
sns.boxplot(x = name_bike_df.Hour, y = name_bike_df.Rented_Bike_Count)

* **We see that:**
  * **from 5 to 8, rentals increase because workers rent bikes**
  * **from 10 to 18, rentals keep increasing**
  * **17-18 is high because workers use the bikes and people also ride for leisure**
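The hourly pattern read from the box plot can also be summarised with a group-by. The sketch below builds synthetic hourly counts with commute-style peaks (the peak positions and sizes are assumptions chosen for illustration) and lists the busiest hours by median count.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts with peaks near 08:00 and 18:00 (illustrative only).
rng = np.random.default_rng(2)
hours = np.tile(np.arange(24), 60)  # 60 synthetic days
base = 200 + 600 * np.exp(-((hours - 8) ** 2) / 2) + 900 * np.exp(-((hours - 18) ** 2) / 8)
toy = pd.DataFrame({"Hour": hours, "Rented_Bike_Count": rng.poisson(base)})

# Median rentals per hour, then the three busiest hours.
hourly = toy.groupby("Hour")["Rented_Bike_Count"].median()
print(hourly.nlargest(3))
```

The median is used here because it is less sensitive to outlying hours than the mean; on the notebook's data, `toy` would be replaced by `name_bike_df`.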

###4th column temperature 

In [None]:
sns.pairplot(data = name_bike_df, hue="Temperature", y_vars="Temperature") 

In [None]:
print('Temperature ranges from ', name_bike_df.Temperature.min(), ' to ', name_bike_df.Temperature.max())

In [None]:
sns.lineplot(x = name_bike_df.month , y = name_bike_df.Temperature )

###5th column Humidity

In [None]:
sns.pairplot(data = name_bike_df, hue="Humidity", y_vars="Humidity") 

###6th column Wind_speed

In [None]:
sns.pairplot(data = name_bike_df, hue="Wind_speed", y_vars="Wind_speed") 

###7th column Dew_point_temperature

In [None]:
sns.pairplot(data = name_bike_df, hue="Dew_point_temperature", y_vars="Dew_point_temperature") 

###8th column Solar_Radiation

In [None]:
sns.pairplot(data = name_bike_df, hue="Solar_Radiation", y_vars="Solar_Radiation") 

###9th column Rainfall

In [None]:
sns.pairplot(data = name_bike_df, hue="Rainfall", y_vars="Rainfall") 

###10th column Snowfall

In [None]:
sns.pairplot(data = name_bike_df, hue="Snowfall", y_vars="Snowfall") 

###11th column Seasons

In [None]:
sns.pairplot(data = name_bike_df, hue="Seasons", y_vars="Seasons") 

###12th column Holiday

In [None]:
sns.pairplot(data = name_bike_df, hue="Holiday", y_vars="Holiday") 

###13th column Functioning_Day

In [None]:
sns.pairplot(data = name_bike_df, hue="Functioning_Day", y_vars="Functioning_Day") 

In [None]:
from scipy.stats import chi2_contingency 
 
# toy contingency table of observed counts
info = [[100, 200, 300], [50, 60, 70]] 
print(info)
stat, p, dof, expected = chi2_contingency(info) 
 
print(dof)
 
significance_level = 0.05
print("p value: " + str(p)) 
if p <= significance_level: 
    print('Reject NULL HYPOTHESIS') 
else: 
    print('FAIL TO REJECT NULL HYPOTHESIS') 

In [None]:
chi2_contingency(info)
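The test above runs on a hand-typed table. For two categorical columns of a data frame, the table of observed counts could instead be built with `pd.crosstab`. The sketch below is a minimal illustration on made-up data: the column names mirror the notebook, while the values are invented.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Seasons": ["Winter"] * 40 + ["Summer"] * 40,
    "Holiday": ["Holiday"] * 10 + ["No Holiday"] * 30 + ["Holiday"] * 2 + ["No Holiday"] * 38,
})

# Contingency table of observed counts, then the chi-square test of independence.
table = pd.crosstab(toy["Seasons"], toy["Holiday"])
stat, p, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={stat:.3f}, p={p:.4f}, dof={dof}")
print("reject H0" if p <= 0.05 else "fail to reject H0")
```

Note that a non-significant result means the null hypothesis of independence is not rejected, which is weaker than accepting it.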

In [None]:
name_bike_df.head(1)

##MODELLING 

In [None]:
df

In [None]:
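from sklearn.model_selection import train_test_split

# Features and target (assumed here: every renamed column except the target and the raw Date)
X = bike_df.drop(columns=['Rented_Bike_Count', 'Date'])
y = bike_df['Rented_Bike_Count']
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
X_train.head()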
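A minimal end-to-end sketch of the split followed by a baseline fit is shown below, on synthetic data. The linear model, the one-hot encoding and all values are assumptions for illustration, not the notebook's chosen method; only the `train_test_split` arguments are taken from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared frame (values are made up for illustration).
rng = np.random.default_rng(3)
n = 1000
toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=n),
    "Temperature": rng.uniform(-15, 38, size=n),
    "Seasons": rng.choice(["Winter", "Spring", "Summer", "Autumn"], size=n),
})
toy["Rented_Bike_Count"] = np.clip(
    20 * toy["Hour"] + 15 * toy["Temperature"] + rng.normal(0, 40, size=n), 0, None
)

# One-hot encode the categorical column, then separate features and target.
X = pd.get_dummies(
    toy.drop(columns="Rented_Bike_Count"), columns=["Seasons"], drop_first=True, dtype=float
)
y = toy["Rented_Bike_Count"]

# Same split settings as the notebook cell.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Baseline linear fit, scored on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(X_train.shape, X_test.shape, f"test R^2 = {r2:.2f}")
```

A simple baseline like this gives a reference score against which more flexible models can later be compared.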