# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;text-align:center;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">Introduction</div>

## <b><span style='color:#5364B4'>1.1 | </span>Objective</b>

In this month's [TPS Competition](https://www.kaggle.com/c/tabular-playground-series-mar-2022/overview), my goal is to forecast 12 hours of traffic flow in a major US metropolis. Each time series in the dataset is labelled with location coordinates and a direction of travel, so the data combines spatial and temporal features within a highly dynamic traffic network.

## <b><span style='color:#5364B4'>1.2 | </span>Data Overview</b>
The training data consists of six months of traffic congestion levels, recorded in 20-minute intervals across a network of 65 roadways from April through September 1991. The variables in the dataset include:
- `time`: The 20-minute period in which each measurement was taken.
- `x`: The East-West midpoint coordinate of the roadway.
- `y`: The North-South midpoint coordinate of the roadway.
- `direction`: The direction of travel of the roadway. EB indicates Eastbound travel, for example, while SW indicates a Southwest direction of travel.
- `congestion`: Congestion level for the roadway during each 20-minute period; the target. The congestion measurements have been normalized to the range 0 to 100.

The test set contains each roadway's coordinates and direction of travel for the 12-hour forecast window on 1991-09-30.
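The interval arithmetic used throughout this notebook follows from the 20-minute sampling rate. Below is a minimal sketch of that arithmetic; the noon start of the forecast window is an assumption taken from the filter used later in the modelling cells, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=20)  # sampling interval stated in the data overview

# Number of 20-minute observations in common spans of time
steps_per_hour = timedelta(hours=1) // STEP
steps_per_day = timedelta(days=1) // STEP
steps_per_week = timedelta(weeks=1) // STEP
horizon_steps = timedelta(hours=12) // STEP

# Assumed forecast window: starts at noon on the test date
forecast_start = datetime(1991, 9, 30, 12, 0)
forecast_times = [forecast_start + k * STEP for k in range(horizon_steps)]

print(steps_per_hour, steps_per_day, steps_per_week, horizon_steps)
print(forecast_times[0], "to", forecast_times[-1])
```

Under these assumptions, one day spans 72 observations, one week spans 504, and the 12-hour horizon covers 36 steps per roadway.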

## <b><span style='color:#5364B4'>1.3 | </span>Descriptive Statistics</b>

In [None]:
import os, warnings
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from datetime import datetime, timedelta
from itertools import chain
from scipy.stats import gaussian_kde
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from statsmodels.tsa.stattools import pacf, acf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.api import ExponentialSmoothing
from lightgbm import LGBMRegressor
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
temp=dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12), 
                           height=500, width=700))

train=pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv', 
                  parse_dates=['time'], index_col='row_id')
test=pd.read_csv('../input/tabular-playground-series-mar-2022/test.csv', 
                 parse_dates=['time'], index_col='row_id')
sub=pd.read_csv('../input/tabular-playground-series-mar-2022/sample_submission.csv')

print("There are {:,} rows and {} columns in the training set.".format(train.shape[0], train.shape[1]))
print("The time series starts on {} and ends on {}.\n".format(train.time.min(), train.time.max()))
print("There are {:,} rows and {} columns in the test set.".format(test.shape[0], test.shape[1]))
print("The time series starts on {} and ends on {}.\n".format(test.time.min(), test.time.max()))

# Create time & direction variables
for df in [train, test]:
    df['month'] = df['time'].dt.month
    df['week'] = df['time'].dt.isocalendar().week
    df['day_of_week'] = df['time'].dt.dayofweek+1
    df['day'] = df['time'].dt.day
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['weekend'] = (df['day_of_week']>5).astype(int)
    df['xydir'] = df.x.astype(str)+df.y.astype(str)+df.direction.astype(str)

display(train.describe().T.round(3).style.format('{:,.2f}')
        .text_gradient(cmap='Greys_r')
        .bar(color='#7784CB', axis=0, vmin=0)
        .set_caption("Summary statistics of numeric columns"))
print()
cat=train.select_dtypes(include=['object']).columns.tolist()
for i in cat[:1]:
    obs=train[i].value_counts()
    avg_congest=train.groupby(i)['congestion'].mean()
    df=pd.DataFrame({"Number of Observations":obs, 
                     "Average Congestion":avg_congest})
    df.index.rename('Direction', inplace=True)
    display(df.sort_values(by="Average Congestion", ascending=False)
            .style.background_gradient(cmap='cividis',subset=['Average Congestion'], vmin=20, vmax=54)
            .format({"Number of Observations": "{:,.0f}", "Average Congestion": "{:.1f}"})
            .set_caption("Summary statistics of categorical columns"))

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;text-align:center;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">Exploratory Data Analysis</div>

In [None]:
hist_data=train['congestion']
density=gaussian_kde(dataset=hist_data, bw_method='silverman')
x=np.arange(0,100) 
density.covariance_factor = lambda: .14  
density._compute_covariance()
kde_curve=density(x)

fig = go.Figure()
fig.add_trace(go.Histogram(x=hist_data, histnorm='probability density', marker_color='#F4F4F4'))
fig.add_trace(go.Scatter(x=x, y=kde_curve, marker_color='#6168CE', fill='tozeroy'))
fig.update_traces(marker=dict(line=dict(width=1, color='#BDC3C7')), 
                  hovertemplate='%{y}<extra></extra>')
fig.update_layout(template=temp, title="Distribution of Congestion", 
                  xaxis_title="Congestion", yaxis_title="Probability Densities", showlegend=False)
fig.show()

# Map direction codes to display names so each box label always matches its data
dir_names={'SB':'South Bound', 'NB':'North Bound', 'EB':'East Bound', 'WB':'West Bound', 
           'SW':'Southwest', 'NE':'Northeast', 'SE':'Southeast', 'NW':'Northwest'}
colors = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, 9)]
direction=train.groupby('direction')['congestion'].median().sort_values(ascending=False)
fig = go.Figure()
for code, col in zip(direction.index, colors):
    plot_df=train[train.direction==code]
    # Fall back to the raw code if a direction is missing from the mapping
    fig.add_trace(go.Box(y=plot_df['congestion'], name=dir_names.get(code, code), boxmean=True, whiskerwidth=0.2, 
                         marker_size=2, line_width=1, marker_color=col, showlegend=False))
fig.update_layout(template=temp, title="Distribution of Traffic Congestion<br>by Direction",
                  yaxis_title='Congestion', xaxis_tickangle=30)
fig.show()

avg=train.groupby('time').congestion.mean()
fig = px.line(avg, x=avg.index, y=avg.values)
fig.update_traces(line=dict(width=1))
fig.update_layout(template=temp, title="Average Congestion Levels from April - Sept 1991", 
                  xaxis_title='', yaxis_title='Congestion', 
                  hovermode="x unified")
fig.show()

week=train.groupby('week').congestion.mean()
week_day=train.groupby('day_of_week').congestion.mean()
day=train.groupby('day').congestion.mean()
hr=train.groupby('hour').congestion.mean()

fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=("Average Congestion by Week","Average Congestion by Week Day", 
                                    "Average Congestion by Day","Average Congestion by Hour"))
fig.append_trace(go.Scatter(x=week.index, y=week, mode='lines', name='Congestion'), row=1,col=1)
fig.append_trace(go.Scatter(x=week_day.index, y=week_day, mode='lines',name='Congestion'), row=1,col=2)
fig.append_trace(go.Scatter(x=day.index, y=day, mode='lines',name='Congestion'), row=2,col=1)
fig.append_trace(go.Scatter(x=hr.index, y=hr, mode='lines',name='Congestion'), row=2,col=2)
fig.update_xaxes(showline=True, zeroline=False)
fig.update_yaxes(showline=True, zeroline=False)
fig.update_layout(template=temp, hovermode="x unified", xaxis2_tickmode='linear', 
                  xaxis3=dict(tickmode='array', tickvals=[i for i in range(0,31,5)]),
                  showlegend=False, height=800)
fig.show()

## <b><span style='color:#5364B4'>2.1 | </span>EDA Summary</b>
- The target variable, `congestion`, is approximately normally distributed.
- The four cardinal directions (Southbound, Northbound, Eastbound, and Westbound) have the highest congestion levels overall, with medians between 47 and 55, while the Southeast and Northwest directions have the lowest congestion levels, although they also show more outliers.
- The average congestion from April to September 1991 fluctuates widely but stays at about the same level over time.
- The weekly and day-of-month averages hover between 46 and 48, while the hourly and weekday averages vary a little more. On weekdays, the average congestion is around 49, dropping to about 44.7 on weekends. In addition, roadways are least congested between 11pm and 6am, and congestion peaks at an average of about 54 at 5pm.

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;text-align:center;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%">Time Series Components</div>

To help identify the seasonality and stationarity in the data, the graphs below show the weekly decomposition of a sample of the time series (every third roadway-direction series), along with their Autocorrelation and Partial Autocorrelation plots.

In [None]:
direction=train.xydir.unique()[::3]
for i in direction: 
    
    t=train[train.xydir==i].set_index('time')
    a=pd.Series(acf(t.congestion, nlags=251)[1:])
    p=pd.Series(pacf(t.congestion, nlags=251)[1:])
    up_ci, low_ci = 2.58/np.sqrt(len(t)), -2.58/np.sqrt(len(t)) # 99% confidence 
    decomp = seasonal_decompose(t.congestion, model="additive", period=504) # weekly decomp
    
    fig = make_subplots(rows=4, cols=2,
                        specs=[[{}, {'rowspan': 2}], [{}, None],
                               [{}, {'rowspan': 2}], [{}, None]],
                        horizontal_spacing=0.07,
                        subplot_titles=('', 'Autocorrelation Plot', '','',
                                        'Partial Autocorrelation Plot',''))
    
    # ACF plot
    for j in range(len(a)):
        fig.add_shape(dict(type="line", x0=j+1, x1=j+1, y0=0, y1=a[j], 
                           line_color="#555555",opacity=0.45,line_width=1), 
                      row=1, col=2)
    fig.append_trace(go.Scatter(x=a.index+1, y=a, mode='markers', 
                                marker_color='#3F51B5', marker_size=5,
                                hovertemplate='Autocorrelation of Lag %{x} = %{y:.2f}<extra></extra>'), 
                     row=1,col=2)
    
    # PACF plot
    for k in range(len(p)):
        fig.add_shape(dict(type="line", x0=k+1, x1=k+1, y0=0, y1=p[k], 
                           line_color="#555555",opacity=0.45,line_width=1), 
                      row=3, col=2)
    fig.append_trace(go.Scatter(x=p.index+1, y=p, mode='markers', 
                                marker_color='#3F51B5', marker_size=5,
                                hovertemplate='Partial Autocorrelation of Lag %{x} = %{y:.2f}<extra></extra>'), 
                     row=3,col=2)
    fig.add_hrect(y0=low_ci, y1=up_ci, fillcolor="#8A9DC5", opacity=0.5, line_width=0) 
    
    # Decomposition plots
    fig.append_trace(go.Scatter(x=t.index, y=decomp.observed, line=dict(color='#4858BA',width=1),
                                hovertemplate='%{x}<br>Observed: %{y}<extra></extra>'), row=1, col=1) 
    fig.append_trace(go.Scatter(x=t.index, y=decomp.trend, line=dict(color='#5AA68E'),
                                hovertemplate='%{x}<br>Trend: %{y:.2f}<extra></extra>'), row=2, col=1)
    fig.append_trace(go.Scatter(x=t.index, y=decomp.seasonal, line=dict(color='#3180BD',width=1),
                                hovertemplate='%{x}<br>Seasonality = %{y:.2f}<extra></extra>'), row=3, col=1)
    fig.append_trace(go.Scatter(x=t.index, y=decomp.resid, line=dict(color='#C86F7D',width=1),
                                hovertemplate='%{x}<br>Residual = %{y:.2f}<extra></extra>'), row=4, col=1)
    fig.update_xaxes(showline=True)
    fig.update_layout(title="Time Series Decomposition of<br>Direction {}, Coordinates ({}, {})".format(i[2:],i[0],i[1]),
                      yaxis1=dict(title='Observed', showline=True), xaxis1_showline=False,
                      xaxis2=dict(range=(-2,251.5), showline=False, zeroline=False, showgrid=False), 
                      yaxis3=dict(title='Trend',showline=True),
                      yaxis4=dict(title='Seasonal',showline=True),
                      xaxis5=dict(title='Lag', range=(-2,251.5), showline=False, zeroline=False, showgrid=False),
                      yaxis6=dict(title='Residuals',showline=True),
                      template=temp, showlegend=False, height=600)
    fig.show()

The graphs above show the trend, seasonal, and residual components of each time series, together with their Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, which help identify seasonality and non-stationarity in the data. In the Seasonal graphs, regularly spaced peaks and troughs recur each week, indicating weekly seasonality. The Trend graphs, on the other hand, display irregular patterns that vary over time.

The ACF and PACF plots also show evidence of non-stationarity. The autocorrelations follow sinusoidal patterns that lie outside the 99% confidence bands (shaded in blue along the $x$-axis), while the partial autocorrelations drop toward zero relatively quickly and then peak again about every 72 lags, which is one day of 20-minute intervals.

To reduce the trend and seasonality in the time series, I will take both the first weekly seasonal difference and the first non-seasonal difference of the congestion levels. For every time interval $t$, the congestion level $y_t$ is replaced by $y'_t = (y_t - y_{t-504}) - (y_{t-1} - y_{t-505})$, where 504 is the number of 20-minute intervals in one week. Below are the ACF and PACF plots of the differenced time series.
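As a quick check on the differencing formula above, here is a minimal sketch on a synthetic series. The helper name `seasonal_then_first_difference`, the season length of 4, and the toy data are illustrative assumptions, not part of the competition data. A linear trend plus a perfectly repeating pattern should be removed entirely by the two differences.

```python
import numpy as np

def seasonal_then_first_difference(y, season):
    """Return (y_t - y_{t-season}) - (y_{t-1} - y_{t-season-1}) for every valid t."""
    y = np.asarray(y, dtype=float)
    seasonal_diff = y[season:] - y[:-season]  # y_t - y_{t-season}
    return np.diff(seasonal_diff)             # first difference of the seasonal difference

# Synthetic series: linear trend plus a repeating pattern of length 4
season = 4
t = np.arange(24)
pattern = np.array([0.0, 3.0, 1.0, -2.0])
y = 0.5 * t + pattern[t % season]

d = seasonal_then_first_difference(y, season)
print(len(y), len(d))       # season + 1 observations are lost to differencing
print(np.allclose(d, 0.0))  # trend and seasonal pattern are both removed
```

With a weekly season of 504 steps, the same operation drops the first 505 observations of each roadway's series.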

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Seasonal ACF &#38; PACF Plots</p></div>

In [None]:
for i in direction: 
    
    t=train[train.xydir==i].set_index('time')
    t['seasonal_diff']=t['congestion']-t['congestion'].shift(504)
    t.dropna(inplace=True)
    t['second_diff']=t['seasonal_diff']-t['seasonal_diff'].shift()
    t.dropna(inplace=True)

    a=pd.Series(acf(t.second_diff, nlags=75)[1:])
    p=pd.Series(pacf(t.second_diff, nlags=75)[1:])
    decomp = seasonal_decompose(t.second_diff, model="additive", period=504)
    up_ci, low_ci = 2.58/np.sqrt(len(t)), -2.58/np.sqrt(len(t))
    
    fig = make_subplots(rows=4, cols=2,
                        specs=[[{}, {'rowspan': 2}], [{}, None],
                               [{}, {'rowspan': 2}], [{}, None]],
                        horizontal_spacing=0.07,
                        subplot_titles=('', 'Autocorrelation Plot', '','',
                                        'Partial Autocorrelation Plot',''))
    
    # ACF plot
    for j in range(len(a)):
        fig.add_shape(dict(type="line", x0=j+1, x1=j+1, y0=0, y1=a[j], 
                           line_color="#555555",opacity=0.5,line_width=1), 
                      row=1, col=2)
    fig.append_trace(go.Scatter(x=a.index+1, y=a, mode='markers', 
                                marker_color='#3F51B5', marker_size=5,
                                hovertemplate='Autocorrelation of Lag %{x} = %{y:.2f}<extra></extra>'), 
                     row=1,col=2)
    
    # PACF plot
    for k in range(len(p)):
        fig.add_shape(dict(type="line", x0=k+1, x1=k+1, y0=0, y1=p[k], 
                           line_color="#555555",opacity=0.5,line_width=1), 
                      row=3, col=2)
    fig.append_trace(go.Scatter(x=p.index+1, y=p, mode='markers', 
                                marker_color='#3F51B5', marker_size=5,
                                hovertemplate='Partial Autocorrelation of Lag %{x} = %{y:.2f}<extra></extra>'), 
                     row=3,col=2)
    fig.add_hrect(y0=low_ci, y1=up_ci, fillcolor="#8A9DC5", opacity=0.5, line_width=0) 
    
    # Decomposition plots
    fig.append_trace(go.Scatter(x=t.index, y=decomp.observed, line=dict(color='#4858BA',width=1),
                                hovertemplate='%{x}<br>Observed: %{y}<extra></extra>'), row=1, col=1) 
    fig.append_trace(go.Scatter(x=t.index, y=decomp.trend, line=dict(color='#5AA68E'),
                                hovertemplate='%{x}<br>Trend: %{y:.2f}<extra></extra>'), row=2, col=1)
    fig.append_trace(go.Scatter(x=t.index, y=decomp.seasonal, line=dict(color='#3180BD',width=1),
                                hovertemplate='%{x}<br>Seasonality = %{y:.2f}<extra></extra>'), row=3, col=1)
    fig.append_trace(go.Scatter(x=t.index, y=decomp.resid, line=dict(color='#C86F7D',width=1),
                                hovertemplate='%{x}<br>Residual = %{y:.2f}<extra></extra>'), row=4, col=1)
    fig.update_xaxes(showline=True)
    fig.update_layout(title="Weekly Seasonal Differenced Time Series of<br>Direction {}, Coordinates ({}, {})".format(i[2:],i[0],i[1]),
                      yaxis1=dict(title='Observed', showline=True), xaxis1_showline=False,
                      xaxis2=dict(range=(0,75.5), showline=False, zeroline=False, showgrid=False), 
                      yaxis3=dict(title='Trend',showline=True),
                      yaxis4=dict(title='Seasonal',showline=True),
                      xaxis5=dict(title='Lag', range=(0,75.5), showline=False, zeroline=False, showgrid=False),
                      yaxis6=dict(title='Residuals',showline=True),
                      template=temp, showlegend=False, height=600)
    fig.show()

In the decomposition plots of the differenced congestion levels, most of the trend has been removed: the mean has stabilized and the trend values lie very close to the $x$-axis. The amplitude of the seasonal component has also decreased notably. In the ACF plots, the autocorrelations fall inside the significance bands after about lag 2, while the PACF tapers off more slowly. This pattern suggests a Moving Average process of order 2. To forecast the next 12 hours of traffic flow, I will first fit a Moving Average model, then experiment with adding an Autoregressive term, and compare the performance of the two on the test set.
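To illustrate why an ACF that cuts off after lag 2 points to an MA(2) process, the sketch below simulates one with NumPy and compares its sample autocorrelations to the theoretical values. The coefficients `theta1` and `theta2`, the sample size, and the helper `sample_acf` are arbitrary choices for illustration and are not estimates from the traffic data.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(0)
n = 20000
theta1, theta2 = 0.6, 0.3  # illustrative MA coefficients

# MA(2): x_t = e_t + theta1 * e_{t-1} + theta2 * e_{t-2}
eps = rng.standard_normal(n + 2)
x = eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]

acf_vals = sample_acf(x, nlags=6)
bound = 2.58 / np.sqrt(n)  # same 99% band as in the plots above

# Theoretical autocorrelations of an MA(2) process; zero beyond lag 2
gamma0 = 1 + theta1**2 + theta2**2
rho1 = (theta1 + theta1 * theta2) / gamma0
rho2 = theta2 / gamma0

print("theoretical rho1, rho2:", round(rho1, 3), round(rho2, 3))
for lag in range(1, 7):
    flag = "outside band" if abs(acf_vals[lag]) > bound else "inside band"
    print(lag, round(acf_vals[lag], 3), flag)
```

Lags 1 and 2 should sit well outside the band, while higher lags fluctuate around zero. The band is only approximate for a correlated series, so an occasional higher lag can still fall slightly outside it.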

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Moving Average Forecast</p></div>

Based on the ACF and PACF plots above, the first forecasting method I will try is a second-order Moving Average model using the first difference of the seasonally differenced congestion levels. Below is the 12-hour forecast of the predicted traffic levels on the test set. 
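Because the model is fitted to seasonally differenced values, its forecasts have to be converted back to congestion levels by adding the observation from one season earlier. The sketch below shows that step on toy numbers; the function `invert_seasonal_difference` and the data are hypothetical, and it assumes the forecast horizon is no longer than one season, as is the case for a 36-step forecast with a 504-step season.

```python
import numpy as np

def invert_seasonal_difference(history, diff_forecast, season):
    """Recover level forecasts from forecasts of y_t - y_{t-season}.

    Assumes len(diff_forecast) <= season <= len(history), so every
    lagged value needed is an observed one.
    """
    history = np.asarray(history, dtype=float)
    diff_forecast = np.asarray(diff_forecast, dtype=float)
    h = len(diff_forecast)
    if h > season or season > len(history):
        raise ValueError("need len(diff_forecast) <= season <= len(history)")
    start = len(history) - season
    lagged = history[start:start + h]  # observations one season before each forecast step
    return lagged + diff_forecast

# Toy example with a season of 6 steps
history = [10, 12, 14, 13, 11, 9]
diff_forecast = [1.0, -1.0, 0.5]
print(invert_seasonal_difference(history, diff_forecast, season=6))
```

In the cell below, the same idea is expressed by concatenating the forecast onto the observed series and adding the value shifted by 504 rows.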

In [None]:
date=train.time.max()-timedelta(days=7)
prev_week_df=train[train.time>date][['time','congestion']].set_index('time')
prev_week_df=pd.DataFrame({'congestion':prev_week_df.groupby(prev_week_df.index)['congestion'].mean(), 'value':'Actual Values'})

def plot_forecast_dist(preds, test_df, title=""): 
    
    len_dir=len(preds)
    len_pred=len(preds[0])
    # Flatten in time-major order: every series at step 0, then every series at step 1, ...
    pred_list=[preds[i][j] for j in range(len_pred) for i in range(len_dir)]
    pred_df=pd.DataFrame({'preds':pred_list}, index=test_df.time)
    plot_df=pd.DataFrame({'forecast':pred_df.groupby(pred_df.index)['preds'].mean(), 'value':'Forecast'})
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=prev_week_df.index, y=prev_week_df.congestion, 
                             name='Actual Values', marker_color='#636EFA'))
    fig.add_trace(go.Scatter(x=plot_df.index, y=plot_df.forecast, 
                             name='Forecast', marker_color='#EF553B'))
    fig.update_layout(template=temp, title=title, 
                      xaxis_title='', yaxis_title='Congestion', 
                      hovermode="x unified",
                      legend=dict(orientation="v", yanchor="bottom", 
                                  y=1.09, xanchor="right", x=.99, title=""))
    fig.show()
    
    # Histogram
    hist_data=pred_df['preds']
    density=gaussian_kde(dataset=hist_data, bw_method='silverman')
    x=np.arange(0,100) 
    density.covariance_factor = lambda: .25 
    density._compute_covariance()
    kde_curve=density(x)

    fig = go.Figure()
    fig.add_trace(go.Histogram(x=hist_data, histnorm='probability density', marker_color='#F4F4F4'))
    fig.add_trace(go.Scatter(x=x, y=kde_curve, marker_color='#6168CE', fill='tozeroy'))
    fig.update_traces(marker=dict(line=dict(width=1, color='#BDC3C7')), 
                      hovertemplate='%{y}<extra></extra>')
    fig.update_layout(template=temp, title="Distribution of Predictions", 
                      xaxis_title="Congestion", yaxis_title="Probability Densities", showlegend=False)
    fig.show()
    
    return pred_df

def plot_forecast(direction, train, preds): 
    
    # Use the function arguments rather than globals left over from the modelling loop
    plot_df=train[train.index>='1991-09-12 00:00:00']
    fig=go.Figure()
    fig.add_trace(go.Scatter(x=plot_df.index, y=plot_df.congestion, name='Actual Values', marker_color='#636EFA'))
    fig.add_trace(go.Scatter(x=preds.index, y=preds.diff_inv, name='Forecast', marker_color='#EF553B'))
    fig.update_layout(template=temp, title="12-Hour Traffic Forecast of<br>Direction {}, Coordinates ({}, {})"\
                      .format(direction[2:], direction[0], direction[1]), 
                      xaxis_title='', yaxis_title='Congestion', hovermode="x unified",
                      legend=dict(orientation="v", yanchor="bottom", y=1, xanchor="right", x=.99, title=""))
    fig.show()
    

# Moving Average model
ma_preds=[]
for i in train.xydir.unique():
    
    # Split the data
    train_df=train[train.xydir==i].set_index('time')
    test_df=test[test.xydir==i].set_index('time')
    
    # Create weekly seasonal differenced target
    train_df['seasonal_diff']=train_df['congestion']-train_df['congestion'].shift(504)
    train_df.dropna(inplace=True)
    y_train=train_df[['seasonal_diff']]
    inverse_df=train_df.copy()
    
    ma2=ARIMA(y_train, order=(0,1,2)).fit()
    forecast=ma2.predict(start=len(y_train)+1,
                         end=len(y_train)+len(test_df),
                         dynamic=True)    
    
    # Inverse differenced predictions
    forecast=pd.Series(forecast.values, name='congestion', index=test_df.index)
    inverse_df=pd.concat([inverse_df['congestion'], forecast], axis=0).to_frame()
    inverse_df['diff_inv']=inverse_df['congestion']+inverse_df['congestion'].shift(504)
    inverse_df=inverse_df[inverse_df.index>='1991-09-30 12:00:00']
    ma_preds.append(inverse_df.diff_inv.tolist()) 
    
    # Forecasts for each direction
    # plot_forecast(direction=i, train=train_df, preds=inverse_df)
    
res=plot_forecast_dist(preds=ma_preds, test_df=test, title='12-Hour Moving Average Forecast')

#### <span style='color:#4B5EB5'>Predictions</span>

In [None]:
sub_ma2=sub.copy()
sub_ma2['congestion']=np.array(res['preds']).round(0).astype(int)
sub_ma2.to_csv('submission_ma2.csv', index=False)
sub_ma2

#### <span style='color:#4B5EB5'>Test set MAE: 7.22</span>

<br>

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">ARIMA</p></div>

The Moving Average model had a mean absolute error (MAE) of 7.22 on the test set. To try to improve on this forecast, I will add a second-order Autoregressive term, creating a seasonal Autoregressive Integrated Moving Average model of order ARIMA$(2,1,2)(0,1,0)_{504}$. As before, the weekly seasonal difference is taken manually, and an ARIMA$(2,1,2)$ model is then fitted to the seasonally differenced series.

In [None]:
arima_preds=[]
for i in train.xydir.unique():
    
    # Split the data
    train_df=train[train.xydir==i].set_index('time')
    test_df=test[test.xydir==i].set_index('time')
    
    # Create weekly seasonal differenced target
    train_df['seasonal_diff']=train_df['congestion']-train_df['congestion'].shift(504)
    train_df.dropna(inplace=True)
    y_train=train_df[['seasonal_diff']]
    inverse_df=train_df.copy()
    
    # ARIMA ar(2) ma(2) 
    sarima=ARIMA(y_train, order=(2,1,2)).fit()
    forecast=sarima.predict(start=len(y_train)+1,
                            end=len(y_train)+len(test_df),
                            dynamic=True)    
    
    # Inverse differenced predictions
    forecast=pd.Series(forecast.values, name='congestion', index=test_df.index)
    inverse_df=pd.concat([inverse_df['congestion'], forecast], axis=0).to_frame()
    inverse_df['diff_inv']=inverse_df['congestion']+inverse_df['congestion'].shift(504)
    inverse_df=inverse_df[inverse_df.index>='1991-09-30 12:00:00']
    arima_preds.append(inverse_df.diff_inv.tolist()) 
    
    # Forecasts for each direction
    # plot_forecast(direction=i, train=train_df, preds=inverse_df)
    
res=plot_forecast_dist(preds=arima_preds, test_df=test, title='Seasonal ARIMA<br>12-Hour Traffic Forecast')

#### <span style='color:#4B5EB5'>Predictions</span>

In [None]:
sub_arima=sub.copy()
sub_arima['congestion']=np.array(res['preds']).round(0).astype(int)
sub_arima.to_csv('submission_arima.csv', index=False)
sub_arima

#### <span style='color:#4B5EB5'>Test set MAE: 6.95</span>
<br>

Adding a second-order Autoregressive term improved the forecast, lowering the MAE on the test set to 6.95.
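For reference, the MAE used to compare the models is the average of the absolute differences between actual and predicted congestion. The sketch below computes it with NumPy on made-up values, which are not competition data. The function is named `mae` here to avoid shadowing scikit-learn's `mean_absolute_error`, which is imported at the top of the notebook and performs the same calculation.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up congestion values for illustration only
actual = [40, 55, 60, 35]
predicted = [42, 50, 63, 35]

# Absolute errors are 2, 5, 3, 0, so the mean is 2.5
print(mae(actual, predicted))
```

An MAE of 6.95 therefore means the rounded forecasts miss the true congestion level by about 7 points on average, on a 0 to 100 scale.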

<br>


# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Exponential Smoothing</p></div>

The next forecasting model I will try is Exponential Smoothing. Exponential Smoothing forecasts with a weighted average of past observations in which the weights decrease exponentially with age, so that more recent observations receive a higher weight. Based on the time series decomposition plots of the trend and seasonality of the congestion levels, I will apply Holt-Winters' additive method for both the trend and seasonal components of the model.
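The exponentially decaying weights can be seen in the simplest member of this family. The sketch below implements simple exponential smoothing (level only), which is a deliberately reduced version of the Holt-Winters' method used in the next cell: it has no trend or seasonal component. The smoothing parameter `alpha` and the toy series are arbitrary assumptions.

```python
import numpy as np

def simple_exponential_smoothing(y, alpha):
    """Return the final smoothed level, initialised at the first observation."""
    level = float(y[0])
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

def smoothing_weights(alpha, n):
    """Weights on the n most recent observations, newest first."""
    return alpha * (1 - alpha) ** np.arange(n)

y = np.array([50.0, 52.0, 47.0, 49.0, 55.0])
alpha = 0.5

level = simple_exponential_smoothing(y, alpha)
weights = smoothing_weights(alpha, len(y) - 1)

# The recursion equals a weighted sum of past observations,
# plus the leftover weight placed on the initial level y[0]
weighted_sum = np.dot(weights, y[:0:-1]) + (1 - alpha) ** (len(y) - 1) * y[0]

print(weights)              # each weight is (1 - alpha) times the previous one
print(level, weighted_sum)  # the two computations agree
```

Holt-Winters' additive method applies the same kind of recursive update to three components at once (level, trend, and seasonal), each with its own smoothing parameter.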

In [None]:
periods=[495,502,505,502,495,502,502,
         501,1009,507,505,504,500,1005,
         1009,505,500,500,501,500,505,
         500,503,500,1005,500,504,499,497,
         503,505,504,1003,505,505,504,
         505,502,500,502,505,503,505,
         1009,1009,1009,500,501,506,65,
         499,496,498,504,502,505,506,505,
         498,1009,502,501,502,1009,494]

ets_preds=[]
for direction, period in zip(train.xydir.unique(), periods):

    X_train=train[train.xydir==direction].set_index('time')
    X_test=test[test.xydir==direction].set_index('time')
    y_train=X_train[X_train.index>='1991-09-01 00:00:00']['congestion']   
    
    ets = ExponentialSmoothing(y_train,
                               trend='add', seasonal='add',
                               seasonal_periods=period).fit()
    forecast = ets.forecast(steps=len(X_test))
    ets_preds.append(forecast.tolist())
    
    # Forecasts for each direction
    # plot_forecast(direction=i, train=train_df, preds=inverse_df)
    
res=plot_forecast_dist(preds=ets_preds, test_df=test, title="12-Hour Traffic Forecast<br>with Exponential Smoothing")

#### <span style='color:#4B5EB5'>Predictions</span>

In [None]:
sub_esm=sub.copy()
sub_esm['congestion']=np.array(res['preds']).round(0).astype(int)
sub_esm.to_csv("submission_esm.csv", index=False)
sub_esm

#### <span style='color:#4B5EB5'>Test set MAE: 7.401</span>
<br>

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Gradient Boosting</p></div>

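To see if I can improve on the performance of the time series models, the last forecasting method I will try is Gradient Boosting, fitting a separate LightGBM regressor on the calendar features of each roadway-direction series.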

In [None]:
scaler=MinMaxScaler(feature_range=(0,1))
gbm_preds=[]
for i in train.xydir.unique():
    
    # Split the data
    train_df=train[train.xydir==i].set_index('time')
    test_df=test[test.xydir==i].set_index('time')
    y_train=train_df.congestion

    # Scale features
    X_train=train_df.drop(['x','y','direction','congestion','xydir'], axis=1)
    X_test=test_df.drop(['x','y','direction','xydir'], axis=1)
    X_train=pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test=pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

    gbm=LGBMRegressor(boosting_type='gbdt', 
                      num_leaves=31, 
                      max_depth=-1, 
                      learning_rate=0.1, 
                      n_estimators=500,
                      objective='mae',
                      random_state=21)
    gbm.fit(X_train, y_train)
    gbm_pred=gbm.predict(X_test)
    gbm_preds.append(gbm_pred.tolist())
    
    # Forecasts for each direction
    # plot_forecast(direction=i, train=train_df, preds=inverse_df)
    
res=plot_forecast_dist(preds=gbm_preds, test_df=test, title="12-Hour Traffic Forecast<br>with Gradient Boosting")

#### <span style='color:#4B5EB5'>Predictions</span>

In [None]:
sub_gbm=sub.copy()
sub_gbm['congestion']=np.array(res['preds']).round(0).astype(int)
sub_gbm.to_csv("submission_gbm.csv", index=False)
sub_gbm

#### <span style='color:#4B5EB5'>Test set MAE: 5.20</span>
<br>

# <div style="color:white;display:fill;border-radius:5px;background-color:#6777C7;letter-spacing:0.01px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Conclusion</p></div>

To forecast traffic levels across 65 different roadways, three time series models were developed: a Moving Average, an ARIMA, and an Exponential Smoothing model. In the Moving Average and ARIMA models, the first difference of the weekly seasonally differenced congestion levels was taken to reduce the trend and seasonality in the data, and in the Exponential Smoothing model, Holt-Winters' additive method was used to account for the trend and seasonal components. Of these three methods, the ARIMA model provided the most accurate forecast on the test set, with a Mean Absolute Error of 6.95. The Gradient Boosting model further improved on the traffic forecasts, with the lowest test error overall of 5.20.

## <p style='color:#4B5EB5;text-align:center'>Thank you for reading!<br>Please let me know if you have any questions and I look forward to any suggestions 🙂</p>

### <b><span style='color:#4B5EB5'>References</span></b>
Hyndman, R.J., & Athanasopoulos, G. (2021) *Forecasting: principles and practice*, 3rd edition, OTexts: Melbourne, Australia. [OTexts.com/fpp3](https://otexts.com/fpp3/). 