# Tabular Playground Series March - EDA

**I have condensed several useful plots for each road so that information about each road can be seen quickly. They can be found at the end of the overall EDA. You can view a specific road by using the table of contents below for a given (x,y), or, if using the notebook viewer, the generated table of contents on the right.**

I recommend reading the overall EDA first if you're not already familiar with the data.

# Key findings

- Strange congestion levels for some roadways are likely the result of missing data being filled for that specific roadway, using methods such as LOCF (last observation carried forward) or a default value when an observation is missing. This normally occurs at night.
- There is no significant overall trend in congestion, however some individual roadways do have trends.
- Missing time periods are either missing for ALL roadways or none at all (more evidence to support individual roadways are using some form of imputation)
- There are a total of 28 missing periods, corresponding to 81 missing 20-minute time intervals
- Auto-correlation plots are an incredibly useful tool for viewing daily and weekly seasonality and for assessing how useful past values will be at predicting the current value for each individual roadway
- The strength of daily and weekly seasonality varies significantly between roadways
- The extent to which past values can be used to predict future values varies significantly between roadways
- The extent to which Mondays are different to other weekdays varies between roadways

# Table Of Contents

* [Test Data](#test)
* [Road Network](#road)
* [Congestion Analysis](#congestion)
* [Time Analysis](#time)
    - [Missing Data](#missing)
    - [Congestion Time Series Analysis](#time2)
    - [Autocorrelation Plots](#auto)



* [Point (0,0)](#00)
    - [Northbound](#00n)
    - [Eastbound](#00e)
    - [Southbound](#00s)
* [Point (0,1)](#01)
    - [Northbound](#01n)
    - [Eastbound](#01e)
    - [Southbound](#01s)
    - [Westbound](#01w)
* [Point (0,2)](#02)
    - [Northbound](#02n)
    - [Eastbound](#02e)
    - [Southbound](#02s)
    - [Westbound](#02w)
* [Point (0,3)](#03)
    - [Northbound](#03n)
    - [Eastbound](#03e)
    - [Southbound](#03s)
    - [Westbound](#03w)
    - [NorthEast](#03ne)
    - [SouthWest](#03sw)
* [Point (1,0)](#10)
    - [Northbound](#10n)
    - [Eastbound](#10e)
    - [Southbound](#10s)
    - [Westbound](#10w)
    - [NorthEast](#10ne)
    - [SouthWest](#10sw)
* [Point (1,1)](#11)
    - [Northbound](#11n)
    - [Eastbound](#11e)
    - [Southbound](#11s)
    - [Westbound](#11w)
* [Point (1,2)](#12)
    - [Northbound](#12n)
    - [Eastbound](#12e)
    - [Southbound](#12s)
    - [Westbound](#12w)
    - [NorthEast](#12ne)
    - [SouthWest](#12sw)
* [Point (1,3)](#13)
    - [Northbound](#13n)
    - [Eastbound](#13e)
    - [Southbound](#13s)
    - [Westbound](#13w)
    - [NorthEast](#13ne)
    - [SouthWest](#13sw)
* [Point (2,0)](#20)
    - [Northbound](#20n)
    - [Eastbound](#20e)
    - [Southbound](#20s)
    - [Westbound](#20w)
* [Point (2,1)](#21)
    - [Northbound](#21n)
    - [Eastbound](#21e)
    - [Southbound](#21s)
    - [Westbound](#21w)
    - [NorthEast](#21ne)
    - [SouthWest](#21sw)
    - [NorthWest](#21nw)
    - [SouthEast](#21se)
* [Point (2,2)](#22)
    - [Northbound](#22n)
    - [Eastbound](#22e)
    - [Southbound](#22s)
    - [Westbound](#22w)
    - [NorthEast](#22ne)
    - [SouthWest](#22sw)
    - [NorthWest](#22nw)
    - [SouthEast](#22se)
* [Point (2,3)](#23)
    - [Northbound](#23n)
    - [Eastbound](#23e)
    - [Southbound](#23s)
    - [Westbound](#23w)
    - [NorthEast](#23ne)
    - [SouthWest](#23sw)

# Preliminaries

## Task

>  For the March edition of the 2022 Tabular Playground Series you're challenged to forecast twelve-hours of traffic flow in a U.S. metropolis. The time series in this dataset are labelled with both location coordinates and a direction of travel -- a combination of features that will test your skill at spatio-temporal forecasting within a highly dynamic traffic network.

- train.csv - the training set, comprising measurements of traffic congestion across 65 roadways from April through September of 1991.
 - time - the 20-minute period in which each measurement was taken
 - x - the east-west midpoint coordinate of the roadway
 - y - the north-south midpoint coordinate of the roadway
 - direction - the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.
 - **congestion** - congestion levels for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100.
-  test set; you will make hourly predictions for roadways identified by a coordinate location and a direction of travel on the day of 1991-09-30.

US metropolises typically have very straight, grid-like roads that are often directly aligned with compass directions, although this is not always the case.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md
import seaborn as sns

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf


sns.set_style('darkgrid')
colours = sns.color_palette('tab10', as_cmap = True)

In [None]:
train_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/train.csv", index_col='row_id', parse_dates=['time'])
test_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/test.csv", index_col='row_id', parse_dates=['time'])

In [None]:
print("Training data shape (rows, columns):", train_df.shape)
print("Test data shape (rows, columns):", test_df.shape)

Observation: The test set is very small, representing only a single day.

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
print("Number of missing values in train set: ", train_df.isna().sum().sum())
print("Number of missing values in test set: ", test_df.isna().sum().sum())

# 1 - Test data EDA

<a id="test"></a>

First, let's have a look at the test data:

In [None]:
test_df.head()

In [None]:
print("First recorded time:", test_df["time"].min())
print("Last recorded time:", test_df["time"].max())
print("Observation period:", test_df["time"].max() - test_df["time"].min())

In [None]:
test_df.groupby(["x","y","direction"])["time"].count()

In [None]:
test_df.groupby(["x","y"])["direction"].unique()

In [None]:
test_df["time"].unique().astype('datetime64[m]')

In [None]:
test_df["time"].dt.day_name().unique()[0]

In [None]:
print("Last recorded train data time:", train_df["time"].max())

Observations
- The test set has a total of 65 roadways (Same as train)
- The test set has 36 times to predict for each of the 65 roadways
- There are no missing times in the test set
- The test data is on Monday
- The test data is Afternoon/Evening 12:00-23:40
- The test data is for 30th September - there are no public holidays around this time.
- The test set starts immediately after the train set ends.

# 2- Road Network

<a id="road"></a>

We aim to understand the spatial aspect of the problem:

In [None]:
train_df["coordinates"] = train_df["x"].astype(str) + train_df["y"].astype(str) 
test_df["coordinates"] = test_df["x"].astype(str) + test_df["y"].astype(str) 

In [None]:
print("Number of roadways:", len(train_df.groupby(["x","y","direction"])["congestion"]))

In [None]:
train_df.groupby(["x","y","direction"])["congestion"].count().values

All roadways have 13,059 total data entries.

In [None]:
coordinate_direction = train_df.groupby(["x","y"])["direction"].unique().reset_index()
print(train_df.groupby(["x","y"])["direction"].unique())

Creating a map of the road network:

In [None]:
# Unit vectors (dx, dy) for each direction, with east = +x and north = +y
dir_dict = {'EB': (1, 0), 'NB': (0, 1), 'SB': (0, -1), 'WB': (-1, 0), 'NE': (0.5**0.5, 0.5**0.5), 'SE': (0.5**0.5, -0.5**0.5), 'NW': (-0.5**0.5, 0.5**0.5), 'SW': (-0.5**0.5, -0.5**0.5)}

plt.figure(figsize=(6, 8))
ax = sns.scatterplot(data = train_df, x = "x", y = "y", color="red")
ax.set_xticks([0,1,2])
ax.set_yticks([0,1,2,3])
for row in coordinate_direction.values:
    for direction in row[2]:
        plt.plot([row[0],row[0] + dir_dict.get(direction)[0]], [row[1] , row[1] +dir_dict.get(direction)[1]], linewidth=1,linestyle='dashed', color = "blue")
        
        plt.plot([row[0],row[0] + 0.25*dir_dict.get(direction)[0]], [row[1] , row[1] + 0.25*dir_dict.get(direction)[1]], linewidth=3, color = "red")

**Observations**
- There are 12 total points
- There's a total of 65 roadways
- Points have either 3,4,6 or 8 roadways leading from them.

**Assumptions and questions:**

We could assume that the roadways join up as we would expect (as shown in blue) but this is not necessarily the case.
- The coordinates represent the midpoint of the roadway. This does not necessarily mean that the roadways join up at a large intersection at their midpoints, as shown.
- The travel direction might not be perfect (e.g. east doesn't have to be directly east)
- Are NB and SB from two points (e.g. (0,0) Northbound and (0,1) Southbound) the same road in opposite directions, or two separate roads?
- What about the NE, NW, SE, SW directions that don't have an opposite road (and lead to another point)? Are they just one-way roads - or perhaps the congestion is only recorded in one direction.
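
The direction vectors used for the map above can also be derived from compass bearings rather than typed by hand, which makes sign mistakes less likely. This is only a sketch: `BEARINGS` and `unit_vector` are hypothetical names, and it assumes east is +x and north is +y, as the EB and NB entries of `dir_dict` suggest.

```python
import math

# Compass bearing in degrees, measured clockwise from north (assumed convention)
BEARINGS = {"NB": 0, "NE": 45, "EB": 90, "SE": 135,
            "SB": 180, "SW": 225, "WB": 270, "NW": 315}

def unit_vector(bearing_deg):
    """Return (dx, dy) for a compass bearing, with east = +x and north = +y."""
    rad = math.radians(bearing_deg)
    # Rounding removes tiny floating-point residue such as 6e-17
    return (round(math.sin(rad), 10), round(math.cos(rad), 10))

derived_dir_dict = {name: unit_vector(deg) for name, deg in BEARINGS.items()}
print(derived_dir_dict)
```

A table built this way can be compared entry by entry against a hand-written one before plotting.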

# 3 - Congestion analysis

<a id="congestion"></a>

In [None]:
f, ax = plt.subplots(figsize=(12, 7))
#ax = sns.histplot(data = train_df, x = "congestion", bins=50)
ax = sns.barplot(x = train_df["congestion"].value_counts().index, y = train_df["congestion"].value_counts().values, palette=["red" if x in [15,20,21,29,34] else "blue" for x in range(0,101,1)]);
ax.set_xlabel("Congestion");
ax.set_xticks(range(0,101,5))
ax.set_ylabel("Count");
#ax.set_xlim(0);

The distribution is roughly bell-shaped, as we would expect. However, there seem to be some outliers at congestion $\in$ {15, 20, 21, 29, 34}. These need to be investigated further.
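
One way to flag such values without reading them off the plot is to compare each count with the counts of its two neighbours. The sketch below runs on synthetic data, not the competition data; `find_count_spikes`, `ratio` and `min_count` are hypothetical names and thresholds chosen for illustration.

```python
import numpy as np
import pandas as pd

def find_count_spikes(values, ratio=2.0, min_count=100):
    """Return congestion levels whose count exceeds `ratio` times the
    mean count of the two neighbouring levels."""
    counts = pd.Series(values).value_counts().reindex(range(101), fill_value=0)
    neighbour_mean = (counts.shift(1) + counts.shift(-1)) / 2
    # min_count ignores the noisy tails where counts are tiny
    is_spike = (counts > ratio * neighbour_mean) & (counts >= min_count)
    return counts[is_spike].index.tolist()

# Synthetic example: a bell-shaped sample with an artificial spike at 34
rng = np.random.default_rng(0)
demo = np.clip(np.round(rng.normal(50, 15, 20000)), 0, 100).astype(int)
demo = np.concatenate([demo, np.full(3000, 34)])
spikes = find_count_spikes(demo)
print(spikes)
```

On the real data this could be called as `find_count_spikes(train_df["congestion"])`, with the thresholds tuned by eye.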


In [None]:
def plot_congestion_distribution_dir(x,y):
    d = train_df[(train_df["x"] == x) & (train_df["y"] == y)]["direction"].reset_index()
    d['direction'] = pd.Categorical(d['direction'], ['NB', 'EB', 'SB', 'WB', 'NE', 'SW', 'NW', 'SE']) # Plot in the same order each time (not alphabetical)
    directions = d['direction'].sort_values().unique()
    if len(directions) > 4:
        f, ax = plt.subplots(figsize=(25, 10))
    else:
        f, ax = plt.subplots(figsize=(25, 5))
    f.suptitle("(" + str(x) + "," + str(y) + ")" )
    
    for i,direction in enumerate(directions):
        congestion_vals = train_df[(train_df["x"] == x) & (train_df["y"] == y) & (train_df["direction"] == direction)]
        if len(directions) > 4:
            plt.subplot(2, 4, i + 1)
        else:
            plt.subplot(1, 4, i + 1)
        
        # Count every congestion level 0-100, filling unseen levels with 0
        congestion_vc = congestion_vals["congestion"].value_counts().reindex(range(101), fill_value=0)
        ax = plt.bar(x =congestion_vc.index, height = congestion_vc.values, width=1,linewidth=0, color = "blue");
        plt.title(direction)
        plt.xlabel("Congestion");
        plt.xticks(range(0,101,10))
        plt.ylabel("Count");

In [None]:
def plot_congestion_distribution(df):
    f, ax = plt.subplots(figsize=(8, 5))
    # Count every congestion level 0-100, filling unseen levels with 0
    congestion_vc = df["congestion"].value_counts().reindex(range(101), fill_value=0)
    ax = plt.bar(x =congestion_vc.index, height = congestion_vc.values, width=1,linewidth=0, color = "blue");
    plt.xlabel("Congestion");
    plt.xticks(range(0,101,10))
    plt.ylabel("Count");

### Congestion for each road

We plot the congestion for each road

In [None]:
plot_congestion_distribution_dir(x=0,y=0)

In [None]:
plot_congestion_distribution_dir(x=0,y=1)

In [None]:
plot_congestion_distribution_dir(x=0,y=2)

In [None]:
plot_congestion_distribution_dir(x=1,y=0)

In [None]:
plot_congestion_distribution_dir(x=1,y=1)

In [None]:
plot_congestion_distribution_dir(x=1,y=2)

In [None]:
plot_congestion_distribution_dir(x=2,y=0)

In [None]:
plot_congestion_distribution_dir(x=2,y=1)

In [None]:
plot_congestion_distribution_dir(x=2,y=2)

### Congestion road comparisons

In [None]:
plt.subplots(figsize=(25, 6))
ax = sns.barplot(data = train_df, x = "coordinates", y="congestion", hue = "direction");
ax.set_xlabel("Coordinates (xy)");
ax.set_ylabel("Congestion");

In [None]:
plt.subplots(figsize=(10, 6))
ax = sns.barplot(data = train_df, x = "direction", y="congestion");
ax.set_xlabel("Direction");
ax.set_ylabel("Congestion");

We replot the grid showing the roadway directions at each point, but with the length of the lines representing the mean congestion levels. The longer the line the higher the mean congestion level:

In [None]:
plt.figure(figsize=(9, 12))
ax = sns.scatterplot(data = train_df, x = "x", y = "y", color="red")
ax.set_xticks([0,1,2])
ax.set_yticks([0,1,2,3])
for row in coordinate_direction.values:
    for direction in row[2]:
        
        temp_df = train_df[(train_df["x"] == row[0]) & (train_df["y"] == row[1]) & (train_df["direction"] == direction)]
        mean_congestion = temp_df["congestion"].mean()
        
        plt.plot([row[0],row[0] + 0.5*mean_congestion/100*dir_dict.get(direction)[0]], [row[1] , row[1] + 0.5*mean_congestion/100*dir_dict.get(direction)[1]], linewidth=3, color = "red")

# 4 - Time Analysis
<a id="time"></a>

In [None]:
print("First recorded time:", train_df["time"].min())
print("Last recorded time:", train_df["time"].max())
print("Data time range:", train_df["time"].max() - train_df["time"].min())

In [None]:
train_df[(train_df["x"] == 1) & (train_df["y"] == 0) & (train_df["direction"] == "NB")]["time"]

**Observation** 
- Data is recorded in 20 minute time intervals. 
- Data is first recorded on 1st April 1991 
- Data is last recorded on 30th September 1991
- Data is recorded over a 6 month period (182.5 days)

## Missing Data

<a id="missing"></a>

In [None]:
timedelta = pd.Series(train_df["time"].unique()).diff(periods=1).reset_index().rename(columns={0:"TimeDelta"})

t = timedelta[timedelta["TimeDelta"] != pd.Timedelta('0 days 00:20:00')].drop(columns="index").drop(0)
print("Number of missing periods", len(t))
print("Max time of missing period", max(t["TimeDelta"]))
print("Number of missing time values", (t.sum(axis=0)["TimeDelta"]- len(t)*pd.Timedelta('0 days 00:20:00'))/pd.Timedelta('0 days 00:20:00'))
t.T

These missing times happen at datetime:

In [None]:
for n,i in enumerate(t.index):
    print(n+1,"index:", i-1, "to", i, "datetime:", train_df["time"].unique()[i-1].astype('datetime64[m]'), "to:", train_df["time"].unique()[i].astype('datetime64[m]'))
    

The missing data period begins at the following times:

In [None]:
plt.subplots(figsize=(10, 6))
missing_hour_start = (pd.DatetimeIndex(train_df["time"].unique()[t.index-1]).hour.value_counts() + pd.Series([0]*24)).fillna(0)
ax = sns.barplot(x=missing_hour_start.index, y=missing_hour_start.values, color="blue")
ax.set_xlabel("Hour missing data period began")
ax.set_ylabel("Count");

The following days have missing values in them:

In [None]:
#Note: As the whole day is highlighted it makes it look like there's more missing values than there actually is.
day_list = train_df["time"].astype('datetime64[D]').drop_duplicates()
missing_days = day_list.isin(train_df["time"].unique()[t.index].astype('datetime64[D]'))

plt.subplots(figsize=(25, 5))
plt.bar(day_list, np.ones(len(missing_days)), color = [['None','r'][idx] for idx in missing_days], edgecolor = None, width = 1)
plt.title("Days with missing values");

The number of missing data periods that begin on each day of the week:

In [None]:
plt.subplots(figsize=(7, 7))
temp = pd.Series(train_df["time"].unique()[t.index - 1]).dt.day_name().value_counts()
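# Days with no missing periods get a zero count (Series.append was removed in pandas 2.0)
temp = temp.reindex(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], fill_value=0)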
ax = sns.barplot(x=temp.index, y=temp.values, color="blue", order=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
#ax.set_xlabel("Hour missing data period began")
ax.set_ylabel("Number of missing periods");

There are no missing times that happen just for a specific road:

In [None]:
(train_df["time"].value_counts() != 65).any()

**Observations** 
- There are 28 periods where data is missing (the gap between two sequential times is greater than 20 minutes).
- These missing periods range in time from 0:40 (1 sequential missing time instance) to 3:20 (12 sequential missing time instances)
- In total there are 81 20-minute time intervals missing in the data.
- The same missing times are missing in all 65 roads
- There are no missing times unique to a specific road or subset of roads.

**Insights**
- We may not want to choose validation days with missing values
- There are no missing values for Monday, so we should be able to use any Monday for validation without issues
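
The gap counting above can be packaged as a reusable function. This is a minimal sketch on synthetic timestamps; `find_gaps` is a hypothetical helper, not part of the notebook, and it assumes a fixed 20-minute sampling interval.

```python
import pandas as pd

def find_gaps(times, freq="20min"):
    """Return (gap_table, n_missing) for a collection of timestamps."""
    observed = pd.DatetimeIndex(pd.Series(times).drop_duplicates().sort_values())
    step = pd.Timedelta(freq)
    deltas = observed[1:] - observed[:-1]
    is_gap = deltas > step
    gaps = pd.DataFrame({
        "last_seen": observed[:-1][is_gap],
        "next_seen": observed[1:][is_gap],
        # A gap of k steps hides k - 1 timestamps
        "missing_intervals": (deltas[is_gap] / step).astype(int) - 1,
    })
    return gaps, int(gaps["missing_intervals"].sum())

# Synthetic example: drop one single slot and one pair of adjacent slots
full = pd.date_range("1991-04-01 00:00", periods=12, freq="20min")
demo_times = full.delete([3, 7, 8])
gaps, n_missing = find_gaps(demo_times)
print(gaps)
print("Missing intervals:", n_missing)
```

On the real data, `find_gaps(train_df["time"])` should give the same number of periods and missing intervals as the cells above.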

## Congestion time-series

<a id="time2"></a>

In [None]:
def plot_month_week(df, month):
    """Plots the week from the 24th to the 30th of the specified month.
    - month should be a zero-padded string, e.g. to view May use '05' """
    plt.subplots(figsize=(25, 6))
    temp_df = df.set_index("time")
    temp_df = temp_df.loc['1991-'+month+'-24':'1991-'+month+'-30']
    ax = sns.lineplot(data=temp_df.loc['1991-'+month+'-24':'1991-09-30'], x=temp_df.index, y = "congestion",linewidth=1 );
    ax.xaxis.set_major_formatter(md.DateFormatter('%m-%d'))

def plot_september_end(df):
    """Plots the end of time series """
    plt.subplots(figsize=(25, 6))
    temp_df = df.set_index("time")
    temp_df = temp_df.loc['1991-09-24':'1991-09-30']
    ax = sns.lineplot(data=temp_df.loc['1991-09-24':'1991-09-30'], x=temp_df.index, y = "congestion",linewidth=1 );
    ax.xaxis.set_major_formatter(md.DateFormatter('%m-%d'))
    
def plot_april_begin(df):    
    """ Plots the start of time series"""
    plt.subplots(figsize=(25, 6))
    temp_df = df.set_index("time")
    temp_df = temp_df.loc['1991-04-01':'1991-04-05']
    ax = sns.lineplot(data=temp_df.loc['1991-04-01':'1991-04-05'], x=temp_df.index, y = "congestion",linewidth=1 );
    ax.xaxis.set_major_formatter(md.DateFormatter('%m-%d'))
    
def plot_last_mon_morning(df):
    """Plots the last monday morning of time series """
    plt.subplots(figsize=(25, 6))
    temp_df = df.set_index("time")
    temp_df = temp_df.loc['1991-09-30':'1991-09-30']
    ax = sns.lineplot(data=temp_df.loc['1991-09-30':'1991-09-30'], x=temp_df.index, y = "congestion",linewidth=1 );
    ax.xaxis.set_major_formatter(md.DateFormatter('%H:%M'))
    plt.title("Monday Morning on 09-30")
    
    
def examine_time_series(df):
    #plot_april_begin(df)
    plot_month_week(df, "04")
    #plot_month_week(df, "05")
    plot_month_week(df, "06")
    #plot_month_week(df, "07")
    #plot_month_week(df, "08")
    plot_september_end(df)

We'll have a closer look at specific time series later. Here's an example plot:

In [None]:
temp_df = train_df[(train_df["x"] == 1) & (train_df["y"] == 2) & (train_df["direction"] == "NB")]
plot_september_end(temp_df)
plot_last_mon_morning(temp_df)

### Congestion over time and day of week

In [None]:
def congestion_day_of_week(df):
    plt.subplots(figsize=(8, 5))
    # Group the DataFrame that was passed in, not the global train_df
    temp = df.groupby(df["time"].dt.day_of_week)["congestion"].mean().sort_index()
    ax = sns.barplot(x=temp.index, y=temp.values, color="blue")
    ax.set_xticks(ticks = temp.index, labels = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]);
    ax.set_ylim([min(temp.values) - 2, max(temp.values)+1]);
    ax.set_ylabel("Congestion")
    ax.set_title("Mean congestion levels on days of week");
    return 

In [None]:
def congestion_timeofday(df):
    plt.subplots(figsize=(25, 7))
    # Group the DataFrame that was passed in by time of day (in hours) and day name
    temp = df.groupby([df["time"].dt.hour + df["time"].dt.minute/60, df["time"].dt.day_name()])["congestion"].mean()
    temp.index.rename(["time","day"], inplace=True)
    temp = temp.reset_index()
    
    ax = sns.lineplot(data=temp, x="time", y="congestion", hue="day", hue_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
    plt.xticks(list(range(0,25,1)))
    plt.ylabel("Congestion")
    plt.title("Mean congestion levels over time of day");
    plt.xlim([0,24]);
    return 

Taking a look at day of the week congestion values across all roadways:

In [None]:
congestion_day_of_week(train_df)

Taking a look at the congestion values over the time of day across all roadways:

In [None]:
congestion_timeofday(train_df)

Let's take a look at a specific roadway (we will do this in full later):

In [None]:
temp_df = train_df[(train_df["x"] == 2) & (train_df["y"] == 0) & (train_df["direction"] == "NB")]
congestion_timeofday(temp_df)
congestion_day_of_week(temp_df)

**Observations**

- Saturday and Sunday have the least congestion, and no "rush hour" is present.
- We can see the "rush hour" peaks at around 8:00 and 17:00 on weekdays.
- Monday has slightly less congestion than the other weekdays.

**Insights:**
- As Mondays have slightly lower congestion than the other weekdays, it might be a good idea to use only Mondays as our validation days.
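
As a rough illustration of that idea, the sketch below selects Monday afternoons (12:00 onwards, matching the test window) from a frame with a `time` column. `monday_afternoons` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd

def monday_afternoons(df, time_col="time"):
    """Return the rows that fall on a Monday at or after 12:00."""
    is_monday = df[time_col].dt.dayofweek == 0  # Monday is 0 in pandas
    is_afternoon = df[time_col].dt.hour >= 12
    return df[is_monday & is_afternoon]

# Synthetic example: two weeks of 20-minute timestamps before the test day
demo = pd.DataFrame(
    {"time": pd.date_range("1991-09-16", "1991-09-29 23:40", freq="20min")}
)
val = monday_afternoons(demo)
print(val["time"].dt.date.unique(), len(val))
```

Each selected Monday could then serve as one validation fold, with all earlier rows used for training.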

### Autocorrelation

<a id="auto"></a>

In [None]:
def auto_correlation(df,xlim=13000):
    plt.subplots(figsize=(25, 7))
    pd.plotting.autocorrelation_plot(df["congestion"]);
    plt.title("Non-stationary time series autocorrelation plot")
    plt.ylim([-0.4,0.4]);
    plt.xlim([0,xlim])
    #f,ax = plt.subplots(figsize=(25, 7))
    #plot_acf(temp_df['congestion'], lags=100, ax=ax);

In [None]:
temp_df = train_df[(train_df["x"] == 0) & (train_df["y"] == 0) & (train_df["direction"] == "NB")]
auto_correlation(temp_df)

In [None]:
temp_df = train_df[(train_df["x"] == 1) & (train_df["y"] == 0) & (train_df["direction"] == "NB")]
auto_correlation(temp_df)

In [None]:
temp_df = train_df[(train_df["x"] == 0) & (train_df["y"] == 2) & (train_df["direction"] == "NB")]
auto_correlation(temp_df)

**Explanation:**

An autocorrelation plot shows the correlation between the time series and a lagged copy of itself; in our case each lag is 20 minutes. The autocorrelation at lag k indicates how strongly the value k steps ago is related to the current value. For example, the congestion 24 hours ago is strongly related to the current congestion, so there will be a peak at lag 72 (24 hours × 3 observations per hour).

In general the correlation decays as the lag grows, because more recent values are more closely related to the current value.
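
To make the lag arithmetic concrete, here is a small sketch on a synthetic series with a pure daily cycle (not the competition data). It uses `pandas.Series.autocorr`; the amplitude and noise level are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

steps_per_day = 72  # 24 hours x 3 observations per hour

# Synthetic series: a daily sine wave plus noise, four weeks long
rng = np.random.default_rng(0)
t = np.arange(steps_per_day * 28)
demo = pd.Series(50 + 10 * np.sin(2 * np.pi * t / steps_per_day)
                 + rng.normal(0, 3, t.size))

# Correlation with the value 20 minutes, 12 hours and 24 hours earlier
acf_at = {lag: demo.autocorr(lag=lag) for lag in (1, 36, 72)}
for lag, value in acf_at.items():
    print(f"lag {lag:>2}: {value:+.2f}")
```

The half-day lag is strongly negative and the full-day lag strongly positive, which is the peak-and-trough pattern visible in the plots above.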


- [Resource1](https://towardsdatascience.com/time-series-from-scratch-autocorrelation-and-partial-autocorrelation-explained-1dd641e3076f)
- [Resource2](https://www.alpharithms.com/autocorrelation-time-series-python-432909/) 
- [Resource3](https://otexts.com/fpp2/stationarity.html)


Before calculating autocorrelation, we should make the time series stationary. The mean, variance, and covariance shouldn’t change over time. If you want to have a look at the stationary ACF and PACF you can have a look at this notebook: [View](https://www.kaggle.com/code/cabaxiom/tps-mar-22-sarima-linear-regression#Arima-Experiments). In this notebook I use stationary ACF and PACF plots to decide the order of a seasonal ARIMA model.

> A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary 
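
A common way to move a seasonal series towards stationarity is seasonal differencing: subtracting the value one full season earlier. The sketch below applies a one-day difference (72 steps) to synthetic data; `seasonal_difference` is a hypothetical helper, and this is not necessarily the transformation used in the linked notebook.

```python
import numpy as np
import pandas as pd

def seasonal_difference(series, season=72):
    """Subtract the value one season (default: one day of 20-minute steps) earlier."""
    return series.diff(season).dropna()

# Synthetic series: slow upward drift + daily cycle + noise
rng = np.random.default_rng(0)
t = np.arange(72 * 14)
demo = pd.Series(50 + 0.01 * t + 10 * np.sin(2 * np.pi * t / 72)
                 + rng.normal(0, 3, t.size))

diffed = seasonal_difference(demo)
print("std before:", round(demo.std(), 2), "std after:", round(diffed.std(), 2))
```

The daily cycle cancels out in the differenced series, leaving mostly noise around a small constant.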

**Observations:**

- We can see both the daily and weekly seasonality in the data
- For some roadways the weekly (and daily) seasonalities are more pronounced than others.

**Insight**

- The strength of daily and weekly seasonality is immediately and easily visible with these plots.
- Roadways whose plots show higher-magnitude correlation peaks will likely be much easier to predict than those with lower-magnitude peaks.

**Insights from my stationary ARIMA notebook ([View](https://www.kaggle.com/code/cabaxiom/tps-mar-22-sarima-linear-regression#Arima-Experiments)):** 
- An AR(1) term is useful for some roadways, and occasionally an AR(2) term is also useful (AR(1) uses the congestion value 20 minutes ago as a predictor; AR(2) additionally uses the value 40 minutes ago). Values from more than 40 minutes ago are rarely useful, and when they are it's usually because the model is predicting the imputed values.
- Seasonal AR(1) terms are not useful.
- A moving average (MA(1)) term is also useful, both seasonally (daily) and non-seasonally.
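
For models other than ARIMA, the same short-memory idea can be expressed as lag features computed separately for each roadway. A minimal sketch, assuming the column names of `train_df`; `add_lag_features` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd

def add_lag_features(df, lags=(1, 2)):
    """Add congestion from 1 and 2 steps (20 and 40 minutes) earlier, per roadway."""
    out = df.sort_values(["x", "y", "direction", "time"]).copy()
    grouped = out.groupby(["x", "y", "direction"])["congestion"]
    for lag in lags:
        # shift() counts rows, so gaps in the timestamps are not accounted for here
        out[f"congestion_lag{lag}"] = grouped.shift(lag)
    return out

# Synthetic example: two roadways with four observations each
times = pd.date_range("1991-04-01", periods=4, freq="20min")
demo = pd.DataFrame({
    "time": list(times) * 2,
    "x": [0] * 4 + [1] * 4,
    "y": [0] * 8,
    "direction": ["NB"] * 4 + ["EB"] * 4,
    "congestion": [10, 20, 30, 40, 50, 60, 70, 80],
})
lagged = add_lag_features(demo)
print(lagged)
```

Grouping before shifting keeps the first rows of one roadway from inheriting values from another.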



### Seasonality

In [None]:
#Adapted from https://www.kaggle.com/ryanholbrook/seasonality
def plot_periodogram(df, detrend='linear', ax=None): 
    from scipy.signal import periodogram
    ts = df["congestion"]
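    # Number of 20-minute steps in the observation window ("20min" replaces the deprecated "20T" alias)
    fs = (train_df["time"].max() - train_df["time"].min()) / pd.Timedelta("20min")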
    freqencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots(figsize=(8,5))
    ax.step(freqencies, spectrum, color="purple")
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 6, 12, 24, 48, 168, 336, 672, 4032])
    ax.set_xticklabels(
        [
           "Bi-Annual (1)",
           "Quarterly (2)",
           "Monthly (6)",
           "Biweekly (12)",
           "Weekly (24)",
           "Semiweekly (48)",
            "daily (168)",
            "12-Hour (336)",
            "6-Hour (672)",
            "hourly"
       ],
       rotation=30,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Periodogram");
    return

In [None]:
temp_df = train_df[(train_df["x"] == 2) & (train_df["y"] == 0) & (train_df["direction"] == "NB")]
plot_periodogram(temp_df)

In [None]:
temp_df = train_df[(train_df["x"] == 0) & (train_df["y"] == 1) & (train_df["direction"] == "EB")]
plot_periodogram(temp_df)

Observations:

- The spikes do not quite align with the tick labels. Missing values contribute to this, and the tick positions are also approximate: the data spans about 182.5 days, so the daily frequency is closer to 182 cycles than 168. Ignoring this for now.
- Primarily daily seasonality, although this varies between roadways
- Possibly some other seasonality is present besides daily and weekly, although it's unclear whether this is a result of weekends (semi-weekly), of the two rush hours (12/6 hourly?), or perhaps in some cases of imputation.
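
To read the frequency axis: `fs` in `plot_periodogram` is the number of 20-minute steps in the observation window, so frequencies are measured in cycles per window. The sketch below works out where a given period should land. It assumes the training data runs from 1991-04-01 00:00 to 1991-09-30 11:40 (the end time is inferred from the test set starting at 12:00), and `cycles_per_window` is a hypothetical helper.

```python
import pandas as pd

# Assumed start and end of the training window
window = pd.Timestamp("1991-09-30 11:40") - pd.Timestamp("1991-04-01 00:00")

def cycles_per_window(period, window=window):
    """Number of complete cycles of length `period` that fit in the window."""
    return window / pd.Timedelta(period)

expected = {name: cycles_per_window(period)
            for name, period in [("weekly", "7D"), ("daily", "1D"),
                                 ("12-hour", "12h"), ("hourly", "1h")]}
for name, freq in expected.items():
    print(f"{name:>8}: {freq:.1f}")
```

These values give a reference for where each seasonal spike is expected on the x-axis.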

### Trend

In [None]:
def trend(df):
    plt.subplots(figsize=(25, 7))
    temp = df.groupby(df["time"].dt.date)["congestion"].mean()
    ax = sns.regplot(x = np.array(range(0,len(temp),1)), y = temp.values, scatter_kws={"s": 2})
    ax.set_xlim([0,185])
    ax.set_ylim([np.mean(temp.values) - 1.5, np.mean(temp.values) + 1.5])
    ax.set_xlabel("Date")
    ax.set_ylabel("Congestion")
    xticks = ax.get_xticks()
    xticks_dates = [temp.index[int(x)] for x in xticks if x < len(temp.index)]
    ax.set_xticks(np.delete(xticks, -1))
    ax.set_xticklabels(xticks_dates)

In [None]:
trend(train_df)

**Observation**
- There is no significant overall trend in congestion values; however, it's possible there are trends for individual roadways
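
To put a number on the trend for a single roadway, one option is to fit a straight line to the daily mean congestion. A minimal sketch using `numpy.polyfit` on synthetic data; `daily_trend_slope` is a hypothetical helper, and a least-squares slope says nothing about statistical significance on its own.

```python
import numpy as np
import pandas as pd

def daily_trend_slope(df):
    """Least-squares slope of daily mean congestion, in congestion units per day."""
    daily = df.groupby(df["time"].dt.floor("D"))["congestion"].mean()
    days = np.asarray((daily.index - daily.index[0]) / pd.Timedelta("1D"), dtype=float)
    slope, _intercept = np.polyfit(days, daily.to_numpy(), deg=1)
    return slope

# Synthetic example: congestion rising by 0.1 per day over 30 days
times = pd.date_range("1991-04-01", periods=72 * 30, freq="20min")
day_number = np.asarray((times - times[0]).days)
demo = pd.DataFrame({"time": times, "congestion": 40 + 0.1 * day_number})

slope = daily_trend_slope(demo)
print(f"slope: {slope:.3f} per day")
```

Applying this to each `(x, y, direction)` group would rank roadways by how strongly their congestion drifts.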

In [None]:
def plot_coordinate(x,y):
    plt.figure(figsize=(3, 3))
    ax = sns.scatterplot(x = [x], y=[y])
    plt.xlim([x-1,x+1])
    plt.ylim([y-1,y+1])
    plt.xticks([x])
    plt.yticks([y])
    plt.xlabel("x")
    plt.ylabel("y")
    temp = coordinate_direction[(coordinate_direction["x"] == x) & (coordinate_direction["y"] == y)]
    
    for direction in temp.iloc[0]["direction"]:
        temp_df = train_df[(train_df["x"] == x) & (train_df["y"] == y) & (train_df["direction"] == direction)]
        mean_congestion = temp_df["congestion"].mean()
        
        plt.plot([x, x + mean_congestion/100*dir_dict.get(direction)[0]], [y , y + mean_congestion/100*dir_dict.get(direction)[1]], linewidth=3, color = "red")

In [None]:
def show_plots(x,y,direction):
    temp_df = train_df[(train_df["x"] == x) & (train_df["y"] == y) & (train_df["direction"] == direction)]
    
    plot_last_mon_morning(temp_df)
    examine_time_series(temp_df)
    
    plot_congestion_distribution(temp_df)
    congestion_day_of_week(temp_df)
    plot_periodogram(temp_df)
    congestion_timeofday(temp_df)
    trend(temp_df)
    auto_correlation(temp_df)
    
    return temp_df

We can explore each roadway individually using the plots seen, to make observations of them.

# Point (0,0)
<a id="00"></a>

In [None]:
plot_coordinate(x=0,y=0)

## **Northbound**
<a id="00n"></a>

In [None]:
temp_df = show_plots(x=0, y=0, direction="NB")

**Observations:**

- Monday has significantly less congestion than the other weekdays.
- Weekends have significantly less congestion.
- Daily and weekly seasonality are present, with the weekly seasonality being very strong and the daily seasonality fairly weak.
- Lagged values are not particularly well correlated with the current value.
- Congestion is highest at night 1-5 am
- There are many instances of "steady" congestion values that don't change; these normally occur at night. I wonder if this is an attempt at filling missing data, possibly with LOCF (last observation carried forward).
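
Such flat stretches can be located by measuring runs of identical consecutive values. The sketch below uses synthetic values; `constant_runs` and the `min_length` threshold are hypothetical, and a long run is only a hint of imputation, not proof.

```python
import pandas as pd

def constant_runs(values, min_length=6):
    """Return start position, value and length of each run of identical
    consecutive values that is at least `min_length` long."""
    s = pd.Series(values).reset_index(drop=True)
    run_id = (s != s.shift()).cumsum()  # increments whenever the value changes
    grouped = s.groupby(run_id)
    runs = pd.DataFrame({
        "start": s.index.to_series().groupby(run_id).first(),
        "value": grouped.first(),
        "length": grouped.size(),
    })
    return runs[runs["length"] >= min_length].reset_index(drop=True)

# Synthetic example: seven consecutive readings stuck at 24
demo = [40, 42, 24, 24, 24, 24, 24, 24, 24, 45, 47, 47]
runs = constant_runs(demo)
print(runs)
```

For a single roadway this could be called as `constant_runs(temp_df["congestion"])`, where a run of 6 corresponds to two hours without change.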


## **Eastbound**
<a id="00e"></a>

In [None]:
temp_df = show_plots(x=0, y=0, direction="EB")

**Observations**
- Congestion can only be from a range of values, with some values being much more likely to be picked than others. Perhaps this comes from the normalisation process? 
- Monday is similar to most other weekdays.
- Weak correlations to lagged values.
- Slight positive trend can be observed.
- Steady congestion levels at night - probably caused by missing data (e.g. LOCF) 

## **Southbound**
<a id="00s"></a>

In [None]:
temp_df = show_plots(x=0, y=0, direction="SB")

In [None]:
temp_df["congestion"].value_counts()[0:3]

Observation:
- There is a peak at congestion 24, as well as a few other values that look out of place; these seem to happen early in the morning. I think this is again related to how missing values are filled. Perhaps they decided to just "guess" the congestion at night and set all values equal to this amount? Or perhaps they measured for some nights and just assumed all nights would have a congestion of 24?
- Saturdays and Sundays have much lower congestion.
- A weak negative trend is observed, but perhaps this could be caused by the congestion peak at 24 occurring more frequently

# Point (0,1)
<a id="01"></a>

In [None]:
plot_coordinate(x=0,y=1)

## **Northbound**
<a id="01n"></a>

In [None]:
temp_df = show_plots(x=0, y=1, direction="NB")

**Observation:**

- Strong lag correlations
- Clear daily and weekly seasonality
- Less congestion on Monday than on other weekdays

## **Eastbound**
<a id="01e"></a>

In [None]:
temp_df = show_plots(x=0, y=1, direction="EB")

- Congestion spikes very early in the morning
- Slight negative correlation trend

## **Southbound**
<a id="01s"></a>

In [None]:
temp_df = show_plots(x=0, y=1, direction="SB")

## **Westbound**
<a id="01w"></a>

In [None]:
temp_df = show_plots(x=0, y=1, direction="WB")

# Point (0,2)
<a id="02"></a>

In [None]:
plot_coordinate(x=0,y=2)

## **Northbound**
<a id="02n"></a>

In [None]:
temp_df = show_plots(x=0, y=2, direction="NB")

## **Eastbound**
<a id="02e"></a>

In [None]:
temp_df = show_plots(x=0, y=2, direction="EB")

## **Southbound**
<a id="02s"></a>

In [None]:
temp_df = show_plots(x=0, y=2, direction="SB")

## **Westbound**
<a id="02w"></a>

In [None]:
temp_df = show_plots(x=0, y=2, direction="WB")

# Point (0,3)
<a id="03"></a>

In [None]:
plot_coordinate(x=0,y=3)

## **Northbound**

<a id="03n"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="NB")

## **Eastbound**

<a id="03e"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="EB")

## **Southbound**

<a id="03s"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="SB")

## **Westbound**

<a id="03w"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="WB")

## **NorthEast**

<a id="03ne"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="NE")

## **SouthWest**

<a id="03sw"></a>

In [None]:
temp_df = show_plots(x=0, y=3, direction="SW")

# Point (1,0)

<a id="10"></a>

In [None]:
plot_coordinate(x=1,y=0)

## **Northbound**

<a id="10n"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="NB")

## **Eastbound**

<a id="10e"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="EB")

## **Southbound**

<a id="10s"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="SB")

## **Westbound**
<a id="10w"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="WB")

## **NorthEast**
<a id="10ne"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="NE")

## **SouthWest**
<a id="10sw"></a>

In [None]:
temp_df = show_plots(x=1, y=0, direction="SW")

# Point (1,1)

<a id="11"></a>

In [None]:
plot_coordinate(x=1,y=1)

## **Northbound**
<a id="11n"></a>

In [None]:
temp_df = show_plots(x=1, y=1, direction="NB")

## **Eastbound**
<a id="11e"></a>

In [None]:
temp_df = show_plots(x=1, y=1, direction="EB")

## **Southbound**
<a id="11s"></a>

In [None]:
temp_df = show_plots(x=1, y=1, direction="SB")

## **Westbound**
<a id="11w"></a>

In [None]:
temp_df = show_plots(x=1, y=1, direction="WB")

# Point (1,2)
<a id="12"></a>

In [None]:
plot_coordinate(x=1,y=2)

## **Northbound**
<a id="12n"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="NB")

## **Eastbound**
<a id="12e"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="EB")

## **Southbound**
<a id="12s"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="SB")

## **Westbound**
<a id="12w"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="WB")

## **NorthEast**
<a id="12ne"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="NE")

## **SouthWest**
<a id="12sw"></a>

In [None]:
temp_df = show_plots(x=1, y=2, direction="SW")

# Point (1,3)
<a id="13"></a>

In [None]:
plot_coordinate(x=1,y=3)

## **Northbound**
<a id="13n"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="NB")

## **Eastbound**
<a id="13e"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="EB")

## **Southbound**
<a id="13s"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="SB")

## **Westbound**
<a id="13w"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="WB")

## **NorthEast**
<a id="13ne"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="NE")

## **SouthWest**
<a id="13sw"></a>

In [None]:
temp_df = show_plots(x=1, y=3, direction="SW")

# Point (2,0)
<a id="20"></a>

In [None]:
plot_coordinate(x=2,y=0)

## **Northbound**
<a id="20n"></a>

In [None]:
temp_df = show_plots(x=2, y=0, direction="NB")

## **Eastbound**
<a id="20e"></a>

In [None]:
temp_df = show_plots(x=2, y=0, direction="EB")

## **Southbound**
<a id="20s"></a>

In [None]:
temp_df = show_plots(x=2, y=0, direction="SB")

## **Westbound**
<a id="20w"></a>

In [None]:
temp_df = show_plots(x=2, y=0, direction="WB")

# Point (2,1)
<a id="21"></a>

In [None]:
plot_coordinate(x=2,y=1)

## **Northbound**
<a id="21n"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="NB")

## **Eastbound**
<a id="21e"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="EB")

## **Southbound**
<a id="21s"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="SB")

## **Westbound**
<a id="21w"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="WB")

## **NorthEast**
<a id="21ne"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="NE")

## **SouthWest**
<a id="21sw"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="SW")

## **SouthEast**
<a id="21se"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="SE")

## **NorthWest**
<a id="21nw"></a>

In [None]:
temp_df = show_plots(x=2, y=1, direction="NW")

# Point (2,2)
<a id="22"></a>

In [None]:
plot_coordinate(x=2,y=2)

## **Northbound**

<a id="22n"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="NB")

## **Eastbound**
<a id="22e"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="EB")

## **Southbound**
<a id="22s"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="SB")

## **Westbound**
<a id="22w"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="WB")

## **NorthEast**
<a id="22ne"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="NE")

## **SouthWest**
<a id="22sw"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="SW")

## **SouthEast**
<a id="22se"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="SE")

## **NorthWest**
<a id="22nw"></a>

In [None]:
temp_df = show_plots(x=2, y=2, direction="NW")

# Point (2,3)
<a id="23"></a>

In [None]:
plot_coordinate(x=2,y=3)

## **Northbound**
<a id="23n"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="NB")

## **Eastbound**
<a id="23e"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="EB")

## **Southbound**
<a id="23s"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="SB")

## **Westbound**
<a id="23w"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="WB")

## **NorthEast**
<a id="23ne"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="NE")

## **SouthWest**
<a id="23sw"></a>

In [None]:
temp_df = show_plots(x=2, y=3, direction="SW")