In [None]:
import psycopg2
import configparser
import pandas as pd

CONFIG = configparser.ConfigParser(interpolation=None)
CONFIG.read('../db.cfg')
dbset = CONFIG['DBSETTINGS']
conn = psycopg2.connect(**dbset)

# Pair each 'Arriving'/'Delayed' row with the preceding 'AtStation' row at the same station.
query = """
    SELECT a.requestid, a.timint, (a.timint - COALESCE(b.timint, 0)) AS timediff
    FROM ntas_data AS a
    INNER JOIN ntas_data AS b
        ON a.id = b.id + 1 AND a.station_char = b.station_char
        AND a.train_message IN ('Arriving', 'Delayed')
        AND b.train_message = 'AtStation'
    LIMIT 100000;
"""
df = pd.read_sql(query, conn)
df.head()

So we have a truncated dataset: the query is limited to 100k rows, out of roughly 35 million possible entries. Regardless, this will serve as a nice proof of concept, so let's plot a histogram of it.

In [None]:
import matplotlib.pyplot as plt
df['timediff'].plot.hist(bins=100)
plt.show()

Next we want to normalize the time difference. For more accuracy we should pull the request times and compute the difference from those, but this is a proof of concept first. For now we'll just subtract 1 (minute), since consecutive requests are more or less one minute apart. First, let's look at the dataset.

In [None]:
df.describe()

We're only going to look at rows where the advertised time (`timint`) is under 20 minutes.

In [None]:
df = df.loc[df['timint'] < 20]
df.describe()

In [None]:
# Vectorized subtraction; same result as applying a lambda element by element.
normalized_timediff = df['timediff'] - 1
normalized_timediff.plot.hist(bins=100)
plt.show()

The distribution suggests that, a significant portion of the time, the difference ends up being longer than a minute. Let's look at the proportion of cases where the train beats the advertised time.

In [None]:
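# A normalized difference <= 0 means the estimate was beaten.
beat_expectations = normalized_timediff[normalized_timediff <= 0]
disappointments = normalized_timediff[normalized_timediff > 0]
# Divide by the total, not by the disappointments, to get a proportion.
len(beat_expectations) / len(normalized_timediff) * 100

So, we beat estimates only about 5% of the time (roughly 5.6 beats for every 100 misses). That...sucks.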
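
As a sanity check on the arithmetic, the share of beaten estimates is just the mean of a boolean mask, and it is a different number from the ratio of beats to misses. A minimal sketch on a small made-up Series (the values are illustrative only, not taken from `ntas_data`):

```python
import pandas as pd

# Made-up normalized time differences, in minutes (illustrative only).
sample = pd.Series([-1, 0, 1, 2, 3, 1, 2, 4, 1, 2])

# True where the estimate was beaten (normalized difference <= 0).
beat_mask = sample <= 0

# The mean of a boolean mask is the fraction of True values.
share_beat = beat_mask.mean() * 100

# Ratio of beats to misses: same numerator, smaller denominator.
ratio_beat_to_miss = beat_mask.sum() / (~beat_mask).sum() * 100

print(f"share of all rows: {share_beat:.1f}%")
print(f"beats per 100 misses: {ratio_beat_to_miss:.1f}")
```

The two figures stay close when beats are rare, but the share of all rows is the one that answers "how often".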

TODO: What should we investigate?
- specific stations: what if the ends of the line are skewing our results a little?
- delayed status: at what threshold do the 'Delayed' notifications start?
- adjust the query to make sure it's the same train and direction that we're tracking
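
For the station question, one possible approach is to group by station and compute the share of beaten estimates per station. This is only a sketch under assumptions: it presumes the query is extended to also return `station_char` (it currently does not select it), and the frame below is made-up data rather than real results.

```python
import pandas as pd

# Made-up rows; assumes the query also returns station_char (hypothetical here).
sample = pd.DataFrame({
    "station_char": ["A", "A", "A", "B", "B", "C"],
    "timediff": [1, 2, 3, 1, 0, 4],
})

# Same rough normalization as in the notebook: subtract one minute.
sample["beat"] = (sample["timediff"] - 1) <= 0

# Per-station fraction of beaten estimates, plus the row count behind it.
per_station = sample.groupby("station_char")["beat"].agg(share_beat="mean", n="size")
per_station["share_beat"] *= 100  # express as a percentage

print(per_station)
```

Keeping the row count next to each share makes it easier to spot stations whose percentage rests on only a handful of observations.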