# Time series analysis and predictions on London Bike Hiring Dataset
London is a crowded city. Insight into daily bike hiring and bike traffic can benefit many industries.

## Aims: 
   ### Doing an exploratory analysis.
   ### Data storytelling: finding different patterns, especially seasonality (https://public.tableau.com/app/profile/raju.roy#!/)
   ### Predicting the number of bike hires on a particular day/month/year in London, for example what the bicycle traffic will look like in a given month or at a particular location.


I have also visualized the data in Tableau (https://public.tableau.com/app/profile/raju.roy#!/). 
To get an overall idea, such as where the busiest stations are located or in which month or on which day more people hire bikes, you can interactively select features and the dashboard will show you the trends. 

# Data
The first dataset was downloaded from the publicly available 'London Bicycle Hire' dataset in BigQuery.

(https://console.cloud.google.com/marketplace/product/greater-london-authority/london-bicycles?project=fit-shift-332509)
This data contains the hires of London's Santander Cycle Hire Scheme from 01/2015 to 06/2017. It includes start and stop timestamps, station names with their geographical locations, and ride durations.
This is a big dataset, so it was downloaded in chunks that were then concatenated. The start and end dates were also converted to a datetime datatype and added as new columns before importing here.

The second dataset was downloaded from the London Datastore (https://data.london.gov.uk/dataset/number-bicycle-hires) and contains the daily number of bicycle hires from 2010 to present.



# Table of Contents:
### Install Libraries
### Read Data
### Data Cleaning
### Data Analysis
### Modeling, training and testing with Facebook Prophet
#### Iteration 1: Prepare the data for Facebook Prophet, which needs two columns, y and ds.
#### Iteration 2: Add the holiday feature to the model.
#### Iteration 3: Try the same with a larger dataset that contains bike hiring stats from 2010 to present.
#### Iteration 4: Tune the model with on and off season.
#### Iteration 5: Validate the model: create train and test datasets from the available data based on years, look at the change points of the trend, and then perform cross validation.
#### Iteration 5 (cross validation part): Run cross validation and check performance using the mean absolute error.
#### Iteration 6: Multivariate time series prediction. We import a temperature column and also consider it for modeling and predictions.
#### Iteration 7: Train the model with only Temperature as regressor, without modification, to see if it performs even better.
It shows no improvement, though the model already performs pretty well, with MAPE ranging from 0.2 to 0.4.

# Install some libraries that you might not have
(Skip any that you already have installed.)

In [None]:
#!pip install pandas-profiling

In [None]:
# for reading parquet file
#!pip install pyarrow 

In [None]:
#for pandas profiling
#!pip install ipywidgets

In [None]:
#!jupyter nbextension enable --py widgetsnbextension


In [None]:
#!pip install seaborn

In [None]:
#!pip install hvplot

In [None]:
#!pip install channels

In [None]:
#! conda install -c conda-forge mamba -y # for facebook prophet

In [None]:
#!mamba install -c conda-forge prophet -y

In [None]:
#! pip install openpyxl

In [None]:
#! pip install plotly

In [None]:
# To bring holidays in your dataset
#!pip install holidays

## Import libraries

In [2]:
#import libraries
import pandas as pd
from datetime import datetime as dt, timedelta
#from pandas_profiling import ProfileReport

In [None]:
# for tracking the time
#from tqdm.notebook import trange, tqdm

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

In [None]:
import numpy as np
import hvplot.pandas

In [None]:
import holidays

In [None]:
from datetime import date

## Read the data

In [3]:
df = pd.read_parquet('/storage/huge_raw_bikeshare_df.parq').reset_index(drop = True)

In [4]:
stations = pd.read_csv('cycle_stations.csv')

## Data Cleaning

##### Initial exploration

In [None]:
df.head()


In [None]:
df.shape

Let's drop some corrupted data. Rows where the start date is later than or equal to the end date are certainly corrupted, and their trip durations do not match the timestamps either.

In [None]:
df[df.start_date_as_dt >= df.end_date_as_dt]

In [None]:
df[df.start_date_as_dt >= df.end_date_as_dt].count()

In [None]:
df.drop(df[df.start_date_as_dt >= df.end_date_as_dt].index, inplace = True)

In [None]:
df.shape
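
The same filter can be written as a boolean mask that keeps valid rows instead of dropping invalid ones. The sketch below runs on a made-up three-row frame; only the column names are taken from the notebook, and `trips`, `clean_trips` and `n_dropped` are illustrative names.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
trips = pd.DataFrame({
    "rental_id": [1, 2, 3],
    "start_date_as_dt": pd.to_datetime(
        ["2015-01-27 08:00", "2015-01-27 09:00", "2015-01-27 10:00"]),
    "end_date_as_dt": pd.to_datetime(
        ["2015-01-27 08:20", "2015-01-27 08:50", "2015-01-27 10:00"]),
})

# A trip is kept only if it ends strictly after it starts.
valid = trips["start_date_as_dt"] < trips["end_date_as_dt"]
clean_trips = trips[valid].reset_index(drop=True)

# Report how many rows were removed, as a quick sanity check.
n_dropped = int((~valid).sum())
print(f"dropped {n_dropped} of {len(trips)} rows")
```

Counting the dropped rows before discarding them makes it easy to notice when a filter removes far more data than expected.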

In [None]:
df[df.duration <0]


In [None]:
df.start_date_as_dt.max()

In [None]:
df.start_date_as_dt.min()


In [None]:
df.shape


In [None]:
df.dtypes


In [None]:
# This code takes time...
#profile = ProfileReport(df, title="Pandas Profiling Report", minimal=True) # for large dataset minimal = True

In [None]:
df.columns

In [None]:
# This code generates an output HTML file in this directory. Open that file to get a summary of the data
#profile.to_file("output.html")


From the profile report we see that start_station_id and end_station_id 
have missing values, but the start station name and end station name have 
no missing values. So we will use the start and end station names instead of their ids. 
end_station_logical_terminal, start_station_logical_terminal and end_station_priority_id are full of NaNs,
so we leave them out of the present analysis. 

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.head()

The duration seems to be in seconds. For better visualizations we convert it to minutes and hours.

In [None]:
df['duration_in_minutes'] = df.duration/60

In [None]:
df['duration_in_hours'] = df.duration/3600

In [None]:
df.head()

In [None]:
df = df[['rental_id', 'duration','duration_in_minutes','duration_in_hours',
       'end_station_name', 'start_date_as_dt' ,'end_date_as_dt',  
       'start_station_name']]

# rename returns a new frame, which avoids modifying a slice in place
df = df.rename(columns={"start_date_as_dt": "start_date", "end_date_as_dt": "end_date"})
df.head()

In [None]:
df['duration_in_days'] = df.duration/(3600*24)

In [None]:
df[df.duration_in_days > 30]

# Data Analysis

In [None]:
df.duration_in_minutes.value_counts()

In [None]:
df.duration

The data is too big to analyze comfortably in one go, so we do the analysis step by step. First we grab the data for a single, arbitrarily chosen date, 27/01/2015. It already contains 25000 rows, just for a single day!

In [None]:
df_27_01_2015 = df[(df.start_date.dt.year == 2015) & (df.start_date.dt.month == 1) & (df.start_date.dt.day == 27)].reset_index()



In [None]:
df_27_01_2015

In [None]:
df_27_01_2015.shape

In [None]:
df_27_01_2015.to_csv('df_27_01_2015.csv')

Let's merge this dataframe with the cycle stations to get the coordinates of the start and end stations.
The first merge is based on the start station names and the second one on the end station names.

In [None]:
df1 = df_27_01_2015.merge(stations, how = 'inner', left_on = 'start_station_name', right_on = 'name', suffixes=('_x', '_y'))

In [None]:
df1.columns

In [None]:
df1 = df1[['rental_id', 'duration_in_minutes', 'end_station_name',
        'start_station_name',
       'latitude','longitude']]

In [None]:
df1 = df1.rename(columns= {'latitude': 'start_latitude', 'longitude': 'start_longitude'})

In [None]:
df1

Now we do the same for the end stations.

In [None]:
df_with_coordinates = df1.merge(stations, how = 'inner', left_on = 'end_station_name', right_on = 'name', suffixes=('_x', '_y'))

In [None]:
df_with_coordinates.columns

In [None]:
df_with_coordinates = df_with_coordinates[['rental_id', 'duration_in_minutes', 'end_station_name',
       'start_station_name', 'start_latitude', 'start_longitude',  'latitude', 'longitude']]

In [None]:
df_with_coordinates = df_with_coordinates.rename(columns= {'latitude': 'end_latitude', 'longitude': 'end_longitude'})


In [None]:
df_with_coordinates
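
The two merges follow the same pattern, so they can be folded into one helper. This is a minimal sketch on invented data: `add_coordinates` is a hypothetical helper, and the toy `stations` frame assumes only the `name`, `latitude` and `longitude` columns used above.

```python
import pandas as pd

# Toy data, invented for illustration.
trips = pd.DataFrame({
    "rental_id": [1, 2],
    "start_station_name": ["A", "B"],
    "end_station_name": ["B", "A"],
})
stations = pd.DataFrame({
    "name": ["A", "B"],
    "latitude": [51.50, 51.52],
    "longitude": [-0.12, -0.10],
})

def add_coordinates(trips, stations, side):
    """Attach <side>_latitude and <side>_longitude, where side is 'start' or 'end'."""
    # Renaming before the merge avoids the _x/_y suffix clean-up afterwards.
    coords = stations.rename(columns={
        "name": f"{side}_station_name",
        "latitude": f"{side}_latitude",
        "longitude": f"{side}_longitude",
    })
    return trips.merge(coords, how="inner", on=f"{side}_station_name")

with_coords = add_coordinates(add_coordinates(trips, stations, "start"), stations, "end")
print(with_coords)
```

Because the inner merge silently drops trips whose station name is missing from `stations`, comparing row counts before and after is a cheap safeguard.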

##### Suppose we want to know the most popular routes on that day

In [None]:
routes = pd.DataFrame(df_with_coordinates.groupby(['start_station_name','end_station_name']).size().sort_values(ascending = False).reset_index())

In [None]:
routes.columns = ['start_station','end_station', 'number of times this route used']

In [None]:
# 10 most popular routes
routes.head(10)
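
For reference, the route count can also be expressed as a single chain in which `reset_index(name=...)` names the count column directly. The data below is invented and `n_trips` is an illustrative column name.

```python
import pandas as pd

# Toy trips, invented for illustration.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A"],
    "end_station_name":   ["B", "B", "A", "C"],
})

# Count trips per (start, end) pair and sort the busiest routes first.
routes_toy = (
    trips.groupby(["start_station_name", "end_station_name"])
         .size()
         .reset_index(name="n_trips")
         .sort_values("n_trips", ascending=False)
         .reset_index(drop=True)
)
print(routes_toy.head(10))
```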

##### Let's see the most popular start stations on 27/01/2015

Top 10 start stations


In [None]:
station_groupby = pd.DataFrame(df_27_01_2015.groupby(['start_station_name']).count().sort_values(by = 'rental_id', ascending = False).reset_index())


In [None]:
# To get the co-ordinate of the popular start stations
start_locations = station_groupby.merge(stations, how = 'inner', left_on = 'start_station_name', right_on = 'name', suffixes=('_x', '_y'))


In [None]:
start_locations 

In [None]:
start_locations.columns 

In [None]:
start_location_coordinates = start_locations[['start_station_name','rental_id', 'latitude','longitude']]

In [None]:
start_location_coordinates = start_location_coordinates.rename(columns ={"rental_id" : "Number of rides"})

In [None]:
start_location_coordinates

In [None]:
start_location_coordinates.shape

In [None]:
start_location_coordinates.to_csv('start_location_coordinates.csv',index=False)


In [None]:
station_groupby = station_groupby.head(20)

In [None]:
station_groupby

Top 10 destination stations

In [None]:
end_station_groupby = pd.DataFrame(df_27_01_2015.groupby(['end_station_name']).count().sort_values(by = 'rental_id', ascending = False).reset_index())

In [None]:
end_station_groupby

In [None]:
# plot popular start stations
station_groupby = station_groupby.head(10)
import seaborn as sns
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(8, 5)) 
ax = sns.barplot(x="start_station_name", y="rental_id", data=station_groupby)
ax.set(xlabel="Start Stations", ylabel = "Number of hires", title = 'Most popular start stations on 27/01/2015')
plt.xticks(rotation=90)
plt.show()

In [None]:
# plot popular end stations
end_station_groupby = end_station_groupby.head(10)
import seaborn as sns
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(8, 5)) 
ax = sns.barplot(x="end_station_name", y="rental_id", data=end_station_groupby)
ax.set(xlabel="Destination Stations", ylabel = "Number of hires", title = 'Most popular destination stations on 27/01/2015')
plt.xticks(rotation=90)
plt.show()

##### We want to know at which hour most bikes are hired

In [None]:
time_group = pd.DataFrame(df_27_01_2015.groupby(pd.Grouper(key="start_date", freq="1H")).count().reset_index())

In [None]:
time_group = time_group[['start_date', 'rental_id']]

In [None]:
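# take the hour from the timestamp; range(24) fails if the last hours of the day have no rides
time_group['hours'] = time_group['start_date'].dt.hour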

In [None]:
time_group.rename(columns ={'rental_id':'number_of_rides'}, inplace = True)

In [None]:
time_group.to_csv('hourly_bike_hiring.csv')
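
If every hour from 0 to 23 should appear in the table even when nobody hired a bike in it, the counts can be reindexed onto the full range. A small sketch on invented timestamps; `hourly` is an illustrative name.

```python
import pandas as pd

# Toy trips, invented for illustration: two rides at 08:xx and one at 17:xx.
trips = pd.DataFrame({
    "rental_id": [1, 2, 3],
    "start_date": pd.to_datetime(
        ["2015-01-27 08:05", "2015-01-27 08:40", "2015-01-27 17:30"]),
})

# Count rides per hour of day, then fill the hours without rides with zero.
hourly = (
    trips.groupby(trips["start_date"].dt.hour)["rental_id"]
         .count()
         .reindex(range(24), fill_value=0)
         .rename_axis("hours")
         .reset_index(name="number_of_rides")
)
print(hourly)
```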

In [None]:
# plot hourly bike hires
import seaborn as sns
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(8, 5)) 
ax = sns.barplot(x="hours", y='number_of_rides', data=time_group)
ax.set(xlabel="Hours of the day", ylabel = "Number of hires", title = 'Hourly bike hires on 27/01/2015')
plt.xticks(rotation=90)
plt.show()

# Modeling, training and testing with Facebook Prophet

#### Iteration 1: Prepare the data for Facebook Prophet, which needs two columns, y and ds. We already aggregated the required data in BigQuery; we read it and have a look at it first

In [None]:
Bikehiring_by_date = pd.read_csv('Bikehiring_per_day.csv')

In [None]:
Bikehiring_by_date

In [None]:
Bikehiring_by_date.columns = ['y', 'ds']

In [None]:
Bikehiring_by_date.dtypes

Facebook Prophet needs the ds column in datetime format

In [None]:
Bikehiring_by_date.ds =  pd.to_datetime(Bikehiring_by_date.ds)

In [None]:
# check again
Bikehiring_by_date.dtypes
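
Renaming columns by name rather than by position makes the preparation step robust to a changed column order in the source file. The sketch below assumes hypothetical source column names `hires` and `day`; the real CSV may use different ones.

```python
import pandas as pd

# Toy input, invented for illustration; `hires` and `day` are assumed names.
raw = pd.DataFrame({
    "hires": [12000, 15000, 9000],
    "day": ["2015-01-03", "2015-01-01", "2015-01-02"],
})

# Prophet expects a datetime column `ds` and a numeric column `y`.
prophet_df = (
    raw.rename(columns={"day": "ds", "hires": "y"})
       .assign(ds=lambda d: pd.to_datetime(d["ds"]))
       .sort_values("ds")
       .reset_index(drop=True)[["ds", "y"]]
)
print(prophet_df.dtypes)
```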

In [None]:
#import libraries
from prophet import Prophet

In [None]:
# Initialize the model
model_v1 = Prophet(interval_width= 0.95, daily_seasonality= True)

In [None]:
#fit the model
model_v1.fit(Bikehiring_by_date)

In [None]:
# forecast

In [None]:
future = model_v1.make_future_dataframe(periods = 100, freq = 'D')
forcast = model_v1.predict(future)
forcast.head()

In [None]:
forcast.tail()

In [None]:
plot1 = model_v1.plot(forcast)

In [None]:
plot2 = model_v1.plot_components(forcast)

#### Iteration 2: Adding holiday feature to the model
##### Now we modify the model and add some new parameters

In [None]:
m = Prophet(yearly_seasonality = True)
m.add_country_holidays(country_name='UK')
m.fit(Bikehiring_by_date)

In [None]:
future = m.make_future_dataframe(periods = 100, freq = 'D')
forcast = m.predict(future)
forcast.head()

In [None]:
plot3 = m.plot(forcast)

In [None]:
plot4 = m.plot_components(forcast)

In [None]:
# names of the holidays
m.train_holiday_names

#### Iteration 3: Try the same with a larger dataset
#### Now we look at a larger dataset that contains bike hiring stats from 2010 to present. We plug it into the time series analysis of Facebook Prophet


In [None]:
bikehiring_from_2010 = pd.read_excel('bikehiring_from_2010.xlsx')

In [None]:
bikehiring_from_2010

In [None]:
bikehiring_from_2010.columns = ['ds', 'y']

In [None]:
bikehiring_from_2010.dtypes

In [None]:
m = Prophet(daily_seasonality = True)
m.add_country_holidays(country_name='UK')
m.fit(bikehiring_from_2010)

In [None]:
future = m.make_future_dataframe(periods = 100, freq = 'D')
forcast = m.predict(future)
forcast.head()

In [None]:
plot5 = m.plot(forcast)

In [None]:
plot6 = m.plot_components(forcast)

#### Iteration 4: Tune the model with on and off season

In [None]:
def is_on_season(ds):
    # On-season is assumed to be April to June (months 4-6). Both bounds
    # must hold, so a chained comparison is used instead of `or`.
    date = pd.to_datetime(ds)
    return 3 < date.month < 7

In [None]:

bikehiring_from_2010['on_season'] = bikehiring_from_2010['ds'].apply(is_on_season)
bikehiring_from_2010['off_season'] = ~bikehiring_from_2010['on_season']

In [None]:
m = Prophet(weekly_seasonality=False)
m.add_seasonality(name='weekly_on_season', period=7, fourier_order=3, condition_name='on_season')
m.add_seasonality(name='weekly_off_season', period=7, fourier_order=3, condition_name='off_season')

future['on_season'] = future['ds'].apply(is_on_season)
future['off_season'] = ~future['on_season']
forecast = m.fit(bikehiring_from_2010).predict(future)
fig = m.plot_components(forecast)

In [None]:
bikehiring_from_2010
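
Conditional seasonalities are only informative if the two flags are complementary and both actually occur in the data. A quick check of that, sketched on one date per month; the April to June window and the helper name `in_season_window` are assumptions made for illustration.

```python
import pandas as pd

def in_season_window(ds, first_month=4, last_month=6):
    """True when the date falls inside the assumed on-season window."""
    month = pd.to_datetime(ds).month
    return first_month <= month <= last_month

# One date per month of a single year.
dates = pd.DataFrame({"ds": pd.date_range("2019-01-01", periods=12, freq="MS")})
dates["on_season"] = dates["ds"].apply(in_season_window)
dates["off_season"] = ~dates["on_season"]

# Each row carries exactly one flag, and both flags occur at least once.
exactly_one = (dates["on_season"] ^ dates["off_season"]).all()
both_present = dates["on_season"].any() and dates["off_season"].any()
print(exactly_one, both_present)
```

If `both_present` were false, one of the two conditional seasonalities would never be active and the split would add nothing to the model.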

#### Iteration 5: Validate the model (cross validation): create train and test datasets from the available data based on years, look at the change points of the trend, and then perform cross validation

In [None]:
train = bikehiring_from_2010[ (bikehiring_from_2010.ds >= '2010-01-01') & (bikehiring_from_2010.ds < '2017-01-01')]

In [None]:
train.shape

In [None]:
test = bikehiring_from_2010[ (bikehiring_from_2010.ds >= '2017-01-01') & (bikehiring_from_2010.ds < '2019-01-01')]

In [None]:
train = train[['ds', 'y']]

In [None]:
test.shape

In [None]:
bikehiring_from_2010.shape

In [None]:
test = test[['ds', 'y']]
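
A time series must be split chronologically, so that the model never sees data from the period it is evaluated on. The split above can be wrapped in a small helper; `split_by_date` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_by_date(df, cutoff):
    """Rows before `cutoff` go to train, rows on or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    train_part = df[df["ds"] < cutoff][["ds", "y"]].copy()
    test_part = df[df["ds"] >= cutoff][["ds", "y"]].copy()
    return train_part, test_part

# Toy daily series around a year boundary, invented for illustration.
daily = pd.DataFrame({
    "ds": pd.date_range("2016-12-29", periods=6, freq="D"),
    "y": range(6),
})
train_toy, test_toy = split_by_date(daily, "2017-01-01")
print(len(train_toy), len(test_toy))
```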

##### Train the model with yearly seasonality

In [None]:
m = Prophet(interval_width = 0.95, yearly_seasonality= True)

In [None]:
m.fit(train)

In [None]:
train.ds.min()

In [None]:
m.params

In [None]:
future = m.make_future_dataframe(periods = 730)
future.tail()

In [None]:
# check that the tail of future and the tail of test end on the same date
test.tail()

In [None]:
forcast = m.predict(future)

In [None]:
forcast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

We plot the original data and the forecast side by side, doing it manually here with pandas. We see that the model's predictions are pretty close. 
I excluded 2020 and 2021 from the original dataset because they have unusual values due to COVID; we will deal with that later.

In [None]:
base_dataset = bikehiring_from_2010[bikehiring_from_2010.ds < '2019-01-01' ].copy()

In [None]:

pd.concat([base_dataset.set_index('ds')['y'], forcast.set_index('ds')['yhat']], axis = 1).plot()

In [None]:
# Prophet's own plot function does this better: we see that the actual values fall within the 95% uncertainty interval
fig1 = m.plot(forcast)

From the plot above we see that the model fits the data well. The black dots are the actual data and the deep blue line is the model's prediction. Although they match overall, on some days there were more hires than predicted, maybe due to good weather in London. The model could not capture these outliers, so in a later iteration we add temperature for better predictions.

In [None]:
fig2 = m.plot_components(forcast)

We see that bike hires are highest from May to October, because these are the warmer months.

 Now we add the change points to see where the trend turns upward or downward

In [None]:
from prophet.plot import add_changepoints_to_plot

In [None]:
fig = m.plot(forcast)
a = add_changepoints_to_plot(fig.gca(), m, forcast)

To see whether each change is negative or positive, and when it happens, we can plot the rate change at each change point 

In [None]:
deltas = m.params['delta'].mean(0)
deltas

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig = plt.figure(facecolor = 'w')
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas)
ax.grid(True, which = 'major', c = 'red', ls = '-', alpha = 0.2)
ax.set_xlabel('changepoint')
ax.set_ylabel('rate of change')
fig.tight_layout()

In [None]:
m.changepoints

 Use Prophet's built-in Plotly support for interactive plots

In [None]:
from prophet.plot import plot_plotly
import plotly.offline as py
fig = plot_plotly(m, forcast)
py.iplot(fig)
# this code did not work due to a problem with ipywidgets

#### Iteration 5 (cross validation)
#### Cross validation and performance check using the mean absolute error

Here I tell Prophet to use the first 731 days as the initial training window and to evaluate forecasts over a horizon of 365 days at each cutoff

In [None]:
from prophet.diagnostics import cross_validation
cv_results = cross_validation(model = m, initial='731 days', horizon = '365 days')

In [None]:
cv_results

In [None]:
cv_results.head(10)
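
To get a feeling for where the cutoffs fall, the rolling-origin layout can be reproduced by hand. This is a simplified illustration, not Prophet's own code: `rolling_cutoffs` is a hypothetical helper, the start date is an assumption, and the spacing of half a horizon mirrors the default that the Prophet documentation describes.

```python
import pandas as pd

def rolling_cutoffs(start, end, initial, horizon, period):
    """Return cutoff dates, oldest first, for a rolling-origin evaluation."""
    cutoffs = []
    cutoff = end - horizon            # the last cutoff leaves one full horizon
    while cutoff >= start + initial:  # keep at least `initial` of history
        cutoffs.append(cutoff)
        cutoff -= period
    return cutoffs[::-1]

start = pd.Timestamp("2010-07-30")    # assumed first day of the training data
end = pd.Timestamp("2016-12-31")      # last day of the training data
initial = pd.Timedelta(days=731)
horizon = pd.Timedelta(days=365)
period = horizon / 2                  # assumed spacing between cutoffs

cutoffs = rolling_cutoffs(start, end, initial, horizon, period)
for c in cutoffs:
    print(f"train up to {c.date()}, evaluate up to {(c + horizon).date()}")
```

Each printed line corresponds to one refit of the model, which is why cross validation on a long history takes noticeably longer than a single fit.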

 Now we look at the performance using the performance metrics

In [None]:
from prophet.diagnostics import performance_metrics
performance = performance_metrics(cv_results)
performance

#### Iteration 6: Multivariate time series prediction. We import temperature column and also consider it for the modeling and predictions

We want to plug in the temperature of London, which might influence bike hiring, so we found a weather dataset on Kaggle and cleaned it to keep only London. Now we clean it further to align it with our bike hiring dataset.

In [None]:
temperature = pd.read_csv('london_temperature_profile.csv')

In [None]:
temperature

In [None]:
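# we trim the original bike hiring dataset to the period for which temperature is available
# .copy() gives an independent frame, so new columns can be added safely later
bike_hiring_updated = bikehiring_from_2010[bikehiring_from_2010['ds'] <= '2020-09-30'].copy()
bike_hiring_updated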

In [None]:
print(bike_hiring_updated.ds.min())
print(bike_hiring_updated.ds.max())

In [None]:
temperature_updated = temperature[(temperature['datetime'] >= '2010-07-30') & (temperature['datetime'] <= '2020-09-30')].reset_index(drop = True)

In [None]:
temperature_updated.shape

In [None]:
 # check that the bike_hiring_updated and temperature_updated dataframes cover the same dates
temperature_updated

In [None]:
# the assignment aligns on the index, so both frames must share the same 0..n-1 index
bike_hiring_updated['Temperature'] = temperature_updated['London']

In [None]:
bike_hiring_updated


In [None]:
bike_hiring_updated.dtypes

Facebook Prophet cannot handle missing values in a regressor column such as Temperature, so we check for them first and impute if necessary. We do it by using Temperature != Temperature. This comparison is only true when there is a NaN in the row, as NaN is never equal to NaN.

In [None]:
bike_hiring_updated.query( 'Temperature != Temperature') # as we see here there is no nan in this column
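
The self-inequality trick gives the same answer as the more explicit `isna()`. A tiny demonstration on invented values:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, invented for illustration.
temps = pd.DataFrame({"Temperature": [12.5, np.nan, 8.0]})

# NaN is the only value that is not equal to itself.
nan_by_inequality = temps["Temperature"] != temps["Temperature"]
nan_by_isna = temps["Temperature"].isna()

print(nan_by_inequality.tolist())
print(int(nan_by_isna.sum()), "missing value(s)")
```

`isna()` is usually easier to read, while the inequality form has the advantage of working inside a `query` string.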

We already know that the COVID situation made bike hiring in 2020 differ from the usual pattern. For the sake of simplicity we restrict our analysis to the end of 2019 for now

In [None]:
bike_hiring_updated = bike_hiring_updated[bike_hiring_updated['ds'].dt.year < 2020 ].copy()

We want to add a column to the dataframe that indicates whether the date is a holiday. For this we installed a library called holidays. It takes a Python datetime as input and tells whether the day is a holiday. But the date objects we have are pandas timestamps, so we convert each one to a Python datetime and plug it into the holiday lookup

In [None]:
uk_holidays = holidays.UK()

In [None]:
bike_hiring_updated.ds.iloc[0].to_pydatetime() in uk_holidays

In [None]:
bike_hiring_updated['holiday'] =bike_hiring_updated.ds.apply(lambda row: row.to_pydatetime() in uk_holidays)

In [None]:
bike_hiring_updated.dtypes

Let's extract the month from the timestamp into a separate column

In [None]:
bike_hiring_updated['month'] = bike_hiring_updated.ds.dt.month

In [None]:
bike_hiring_updated['day-of-week'] = bike_hiring_updated['ds'].dt.day_name()

In [None]:
bike_hiring_updated

In [None]:
bike_hiring_updated['log_y'] = np.log(bike_hiring_updated.y)

In [None]:
bike_hiring_updated.hvplot.line(x='ds', y= 'log_y')

In [None]:
bike_hiring_updated.hvplot.line(x='ds', y= 'y', 
                                hover_cols = ['ds','day-of-week','on_season','off_season','Temperature', 'y', 'holiday'], 
                                title= 'YEARLY BIKE HIRING IN LONDON')

In [None]:
bike_hiring_updated.hvplot.line(x='ds', y= 'Temperature', title= 'Temperature Profile in London')

In [None]:
# Overlay both series to see the temperature effect on bike hiring
df = bike_hiring_updated.copy()
X = df[['ds']]
Y1=df[['y']]
Y2=df[['Temperature']]

fig, ax1 = plt.subplots(figsize=(10,6))
ax2 = ax1.twinx()

ax1.plot(X, Y1, 'g', label='y') # bike hires on the primary Y-axis
ax2.plot(X, Y2, 'b', label='Temp') # temperature on the secondary Y-axis

ax1.set_ylabel('Number of hires')
ax2.set_ylabel('Temperature')
ax1.set_ylim(0, 80000) # limit/scale for the primary Y-axis
ax2.set_ylim(-5, 30) # limit/scale for the secondary Y-axis
ax1.grid(False)
ax2.grid(False)

plt.show()

Before doing feature engineering with temperature, let's check the correlation between temperature and bike hiring

The correlation coefficient between temperature and bike hiring is about 0.6

In [None]:
bike_hiring_updated[['y','Temperature']].corr()

Let's check whether the correlation improves if we only keep days with a temperature of at least 10 degrees

In [None]:
bike_hiring_updated.query('Temperature >= 10')[['y','Temperature']].corr()
# It did not improve the correlation

Define a function that returns 1 when the temperature is 10 degrees or more and 0 otherwise

In [None]:
def summer_temp(temp):
    if temp >= 10:
        return 1
    else:
        return 0

In [None]:
bike_hiring_updated['summer_temp'] = bike_hiring_updated.Temperature.apply(summer_temp)
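
The same flag can be computed without `apply` by comparing the whole column at once, which is usually faster on long series. A sketch on invented temperatures, keeping the threshold of 10 degrees from above:

```python
import pandas as pd

# Toy temperatures, invented for illustration.
temps = pd.Series([4.0, 10.0, 18.5], name="Temperature")

# The comparison yields booleans; astype(int) turns them into 0 and 1.
summer_flag = (temps >= 10).astype(int)
print(summer_flag.tolist())
```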

In [None]:
bike_hiring_updated

To better track the behavior in summer and winter, we cut the months into 4 seasonal bins

In [None]:
def seasons(month):
    if (month >= 3) and (month <= 5): # for spring
        return 1
    elif month >= 6 and month <= 8:
        return 2
    elif month >= 9 and month <= 11:
        return 3
    else:
        return 4

In [None]:
seasons(12)

In [None]:
bike_hiring_updated['month_bin'] = bike_hiring_updated.month.apply(seasons)

In [None]:
bike_hiring_updated
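
The four bins can also be derived arithmetically. The sketch below reproduces the coding used above (1 for March to May, 2 for June to August, 3 for September to November, 4 for December to February) for all twelve months; the modulo trick is an alternative offered for illustration, not the notebook's method.

```python
import pandas as pd

months = pd.Series(range(1, 13), name="month")

# month % 12 maps December to 0, so Dec/Jan/Feb land in bin 0 after the
# integer division by 3; that bin is then relabelled as 4 (winter).
month_bin = ((months % 12) // 3).replace(0, 4)
print(dict(zip(months, month_bin)))
```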

In [None]:
bike_hiring_updated1 = bike_hiring_updated[['ds','y','summer_temp', 'month_bin']]

Train test split:
we cannot use a random train test split like the one in scikit-learn, because the order of the observations matters. Instead we simply split the dataset by time


In [None]:
train = bike_hiring_updated1[bike_hiring_updated1.ds.dt.year < 2018]

In [None]:
train

In [None]:
test = bike_hiring_updated1[bike_hiring_updated1.ds.dt.year >= 2018]

In [None]:
test

##### Train the model again

In [None]:
m = Prophet(interval_width= 0.95, yearly_seasonality= True)

Now we tell the model to take the summer temperature flag into consideration, but without standardizing it

In [None]:
m.add_regressor('summer_temp', standardize= False)
m.add_regressor('month_bin', mode = 'multiplicative')

In [None]:
m.fit(train)

In [None]:
m.params

Now we create our future dataframe. It must contain ds, summer_temp and month_bin, as the prediction is based on these

In [None]:
future = m.make_future_dataframe(periods = 730)

In [None]:
future.tail()

In [None]:
future['month_bin'] = bike_hiring_updated1['month_bin']

In [None]:
future['summer_temp'] = bike_hiring_updated1['summer_temp']

In [None]:
future
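
Assigning a column from another dataframe aligns on the index, which only works while both frames happen to share the same row labels. Joining on the date is a more defensive option; the sketch below uses invented data with a deliberately shifted index to show that the merge is unaffected by it. `features` and `future_toy` are illustrative names.

```python
import pandas as pd

# Toy regressor table with an index that does not start at 0.
features = pd.DataFrame(
    {
        "ds": pd.date_range("2018-01-01", periods=4, freq="D"),
        "summer_temp": [0, 0, 1, 1],
        "month_bin": [4, 4, 4, 4],
    },
    index=[10, 11, 12, 13],
)

# Stand-in for the frame returned by make_future_dataframe.
future_toy = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=4, freq="D")})

# Join on the date; validate guards against duplicated dates on either side.
future_toy = future_toy.merge(features, on="ds", how="left", validate="one_to_one")

# Prophet rejects missing regressor values, so check before predicting.
missing = future_toy[["summer_temp", "month_bin"]].isna().any().any()
print("missing regressor values:", bool(missing))
```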

Now we forecast the future

In [None]:
forcast = m.predict(future)

In [None]:
comparision = forcast[['ds','yhat']].tail(730).copy()

In [None]:
comparision['y_true'] = test.y

In [None]:
comparision.reset_index(drop = True)

In [None]:
comparision.hvplot.line(x='ds', y= ['y_true','yhat'], hover ='all', title= 'Comparision between true and predicted value')

In [None]:
# We can also do the comparison with Prophet's plot function
fig1 = m.plot(forcast)

In [None]:
fig2 = m.plot_components(forcast)

##### Crossvalidation again

In [None]:
from prophet.diagnostics import cross_validation, performance_metrics
cv_results = cross_validation(model = m, initial= '731 days', horizon = '365 days')
performance = performance_metrics(cv_results)
performance

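We find a MAPE of about 0.4 from this model

The mean absolute percentage error (MAPE) expresses accuracy as a percentage of the error. Because the MAPE is a percentage, it can be easier to understand than the other accuracy measures. For example, if the MAPE is 5%, the forecast is off by 5% on average. Prophet's `performance_metrics` reports MAPE as a fraction rather than a percentage, so a value of 0.4 means the forecast is off by 40% on average.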

However, sometimes you may see a very large value of MAPE even though the model appears to fit the data well. Examine the plot to see if any data values are close to 0. Because MAPE divides the absolute error by the actual data, values close to 0 can greatly inflate the MAPE.
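
Both points can be checked with a few lines of NumPy. The `mape` function below is a plain reimplementation of the usual definition, returned as a fraction, and all numbers are invented for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 = 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Typical days: relative errors of 10%, 10% and 0%.
typical = mape([20000, 30000, 25000], [18000, 33000, 25000])

# Same first two days, but the third day has an actual value close to zero:
# an absolute error of 500 on an actual of 100 is a relative error of 500%.
near_zero = mape([20000, 30000, 100], [18000, 33000, 600])

print(f"typical:   {typical:.3f}")
print(f"near zero: {near_zero:.3f}")
```

A single day with very few hires dominates the second average, which is why it is worth plotting the metric over the horizon instead of relying on one number.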

In [None]:
from prophet.plot import plot_cross_validation_metric

In [None]:
fig3 = plot_cross_validation_metric(cv_results, metric = 'mape')

#### Iteration 7: To sum it up, train the model with additional features: holidays, summer temperature and the month bin (seasonal bins of months)


In [None]:
m = Prophet(interval_width= 0.95, yearly_seasonality= True)

In [None]:
m.add_country_holidays(country_name='UK')

In [None]:
m.add_regressor('summer_temp', standardize= False)
m.add_regressor('month_bin', mode = 'multiplicative')

In [None]:
m.fit(train)

In [None]:
future = m.make_future_dataframe(periods = 730)

In [None]:
future['month_bin'] = bike_hiring_updated1['month_bin']
future['summer_temp'] = bike_hiring_updated1['summer_temp']

In [None]:
forcast = m.predict(future)

In [None]:
fig1 = m.plot(forcast)

In [None]:
# Cross validation again
from prophet.diagnostics import cross_validation, performance_metrics
cv_results = cross_validation(model = m, initial= '731 days', horizon = '365 days')
performance = performance_metrics(cv_results)
performance


In [None]:
fig3 = plot_cross_validation_metric(cv_results, metric = 'mape')

#### Iteration 7: Train the model with only Temperature as regressor, without modification, to see if it performs even better. It shows no improvement, though the model already performs pretty well, with MAPE ranging from 0.2 to 0.4

In [None]:
bike_hiring_updated

In [None]:
bike_hiring_updated['log_y'] = np.log(bike_hiring_updated.y)

In [None]:
bike_hiring_updated

In [None]:
train = bike_hiring_updated[bike_hiring_updated.ds.dt.year <2018][['ds', 'y','Temperature']]

In [None]:
#train.columns = ['ds', 'y','Temperature']

In [None]:
test = bike_hiring_updated[bike_hiring_updated.ds.dt.year >= 2018][['ds', 'y','Temperature']]

In [None]:
#test.columns = ['ds', 'y','Temperature']

In [None]:
m = Prophet(interval_width= 0.95, yearly_seasonality= True)

In [None]:
m.add_regressor('Temperature', standardize= False)

In [None]:
m.fit(train)

In [None]:
future = m.make_future_dataframe(periods = 730)

In [None]:
future['Temperature'] = bike_hiring_updated['Temperature']

In [None]:
future

In [None]:
forcast = m.predict(future)

In [None]:
fig1 = m.plot(forcast)

In [None]:
m.plot_components(forcast)