**Question 2.1**<br> 
In this part, you will build a model to forecast the hourly carpark availability in the future
(aggregated across all carparks instead of looking at each carpark individually). Can you
explain why you may want to forecast the carpark availability in the future? Who would
find this information valuable? What can you do if you have a good forecasting model?


We can find out the potential number of cars on the street at one time as well as the general patterns of people.

This is valuable in particular for governments to plan around things such as erp pricing or road maintainence.

With a good model, any emergency work to be done can be scheduled for a period where there are lesser cars on the road, or if there needs to be a system wide upgrade for carpark systems.

**Question 2.2**<br>
Build a linear regression model to forecast the hourly carpark availability for a given
month. Use the month of July 2022 as a training dataset and the month of August 2022
as the test dataset. For this part, do not use additional datasets. The target is the hourly
carpark availability percentage and you will have to decide what features you want to
use. Generate two plots: (i) Time series plot of the actual and predicted hourly values
(ii) Scatter plot of actual vs predicted hourly values (along with a line showing how good
the fit is).

In [14]:
import requests
import json
import pandas as pd
from datetime import datetime
from time import sleep
import os
import matplotlib.pyplot as plt
from datetime import timedelta
import numpy as np

# Convert to datetime iso
def toIso(dt):
    return datetime.fromisoformat(dt)
    

def carparkApiCall(year, month, day, hour, minute, second, error_count):
    fDir = f'./data/{year}{month}{day}T{hour.zfill(2)}{minute.zfill(2)}{second.zfill(2)}.json'
    try:
        open(fDir, 'r')
    except:
        os.makedirs(os.path.dirname('./data/'), exist_ok=True)
        # If file doesn't exist, api call
        site = f'https://api.data.gov.sg/v1/transport/carpark-availability?date_time={year}-{month.zfill(2)}-{day.zfill(2)}T{hour.zfill(2)}%3A{minute.zfill(2)}%3A{second.zfill(2)}'
        # print(site)
        response_API = requests.get(site)
        data = response_API.text
        data = json.loads(data)
        try:
            timestamp = data["items"][0]["timestamp"]
            data = data["items"][0]["carpark_data"]
        except:
            print(data)
            print(year,'/', month, '/', day, 'T', hour, minute, second)
            error_count+=1
            print("error count:", error_count)
            if error_count<=5:
                return carparkApiCall(year, month, day, hour, minute, second, error_count)
            else:
                "Api call failed more than 5 times :("
        # print(timestamp)
        with open(fDir, 'w') as fp:
            json.dump(data, fp)
    df = pd.read_json(fDir)
    for heading in ("total_lots","lot_type","lots_available"):
        df[heading] = df["carpark_info"].apply(lambda x: x[0][heading])
    # Transform data
    df = df.drop(["carpark_info"], axis=1)
    df['update_datetime'] = df['update_datetime'].apply(toIso)
    df["lots_available"] = df["lots_available"].astype(int)
    df["total_lots"] = df["total_lots"].astype(int)
    return df

# Calculate average availability in percentage
def avrAvail(df):
    df["availability_percentage"] = df["lots_available"]/df["total_lots"]
    return df["availability_percentage"].sum()/len(df.index)

In [47]:
#train set - july

# start date 1 Jul 2022 0000

year = 2022
month = 7
day = 1
hour = 0
minute = 1
second = 0

dt = datetime(year, month, day, hour, minute, second)
dt_interval = timedelta(hours = 1)
total_hrs = 24*30

# day of week,time of day
train_x = [[],[]]
# avalibility percentage
train_y = []

# generate data
for hr in range(total_hrs):
    dt = dt + dt_interval
    df = carparkApiCall(str(dt.year), str(dt.month), str(dt.day), str(dt.hour), str(dt.minute), str(dt.second), 0)
    train_x[0].append(df.loc[0,'update_datetime'].weekday())
    train_x[1].append(df.loc[0,'update_datetime'].timestamp())
    train_y.append(avrAvail(df))
train_x = np.array(train_x)

In [58]:
#test set - august

# start date 1 Aug 2022 0000

year = 2022
month = 8
day = 1
hour = 0
minute = 1
second = 1

dt = datetime(year, month, day, hour, minute, second)
dt_interval = timedelta(hours = 1)
total_hrs = 24*30

# day of week,time of day
test_x = [[],[]]
# avalibility percentage
test_y = []

# generate data
for hr in range(total_hrs):
    dt = dt + dt_interval
    df = carparkApiCall(str(dt.year), str(dt.month), str(dt.day), str(dt.hour), str(dt.minute), str(dt.second), 0)
    test_y.append(avrAvail(df))
    test_x[0].append(df.loc[0,'update_datetime'].weekday())
    test_x[1].append(df.loc[0,'update_datetime'].timestamp())
    

{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 1
{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 2
{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 3
{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 4
{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 5
{'items': []}
2022 / 8 / 4 T 0 1 1
error count: 6


KeyError: 'carpark_info'

In [55]:
import sklearn.linear_model as lm

regressor = lm.LinearRegression()  
regressor.fit(train_x.reshape(-1, 2), train_y) #training the algorithm
print("y =",regressor.coef_[0],"x1 +",regressor.coef_[1],"x2 +",regressor.intercept_)


y = -5.312066096435077e-07 x1 + 5.312050687585339e-07 x2 + 0.5000241065237478


**Question 2.3**<br>
Do the same as Question 2.2 above but use support vector regressor (SVR).


**Question 2.4**<br>
Do the same as Question 2.2 above but use decision tree (DT) regressor.

**Question 2.5**<br>
Make a final recommendation for the best regression model (out of the 3 methods above)
by choosing a suitable performance metric. To ensure a fair comparison, carry out hyper-parameter tuning for all 3 methods. Then, make a final recommendation selecting only
one model. Include both quantitative and qualitative arguments for your choice.
