# Compare Miovision API and CSV Data

Now that we have an old API pull, a new API pull and a CSV dump from Miovision, we can check the data integrity of the three. I chose to do this by binning the raw minute-bin data in `miovision_csv.volumes_2020` and `miovision_api.volumes` up to the hour, and joining the two aggregate tables together. Both API pulls live in `miovision_api.volumes`: the new pull is in the `volume` column and the old pull is in `volume_20201007`.
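
As a minimal sketch of the bin-and-join approach, the toy example below does the same thing in pandas on made-up minute-bin data. The frames, volumes and the `bin_to_hour` helper are hypothetical stand-ins, not the real tables; the notebook itself does the binning in SQL with `date_trunc`.

```python
import pandas as pd

# Made-up minute-bin rows standing in for the API and CSV volume tables.
api_minutes = pd.DataFrame({
    'intersection_uid': [1, 1, 1],
    'classification_uid': [1, 1, 1],
    'datetime_bin': pd.to_datetime(['2020-01-01 00:01', '2020-01-01 00:02',
                                    '2020-01-01 01:30']),
    'volume': [3, 4, 5],
})
csv_minutes = pd.DataFrame({
    'intersection_uid': [1, 1],
    'classification_uid': [1, 1],
    'datetime_bin': pd.to_datetime(['2020-01-01 00:01', '2020-01-01 02:15']),
    'volume': [3, 7],
})

keys = ['intersection_uid', 'count_date', 'classification_uid']


def bin_to_hour(minutes, volume_name):
    """Sum minute-bin volumes up to the hour, like date_trunc('hour', ...)."""
    hourly = minutes.assign(count_date=minutes['datetime_bin'].dt.floor('h'))
    return (hourly.groupby(keys, as_index=False)['volume'].sum()
                  .rename(columns={'volume': volume_name}))


# indicator=True adds a _merge column saying which side each hour came from.
hourly_both = pd.merge(bin_to_hour(api_minutes, 'volume_api'),
                       bin_to_hour(csv_minutes, 'volume_csv'),
                       how='outer', on=keys, indicator=True)
print(hourly_both)
```

Hours found in only one dataset show up as `left_only` or `right_only` in `_merge`, which is the same information the notebook gets from the `isna()` checks after its outer merge.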

In [1]:
import psycopg2
import datetime
import pytz
import pathlib
import configparser
import numpy as np
import pandas as pd
from plotly import graph_objs as go
from ipywidgets import interact

import intersection_tmc_notebook03test as itmc

config = configparser.ConfigParser()
config.read(pathlib.Path.home().joinpath('.charlesconfig').as_posix())
postgres_settings = config['POSTGRES']

In [2]:
sql_query = """SELECT intersection_uid,
   date_trunc('hour', datetime_bin) count_date,
   classification_uid,
   SUM(volume) volume_newapi,
   SUM(volume_20201007) volume_oldapi
FROM miovision_api.volumes
WHERE datetime_bin BETWEEN '2019-01-01' AND '2020-09-30 23:59:59'
GROUP BY intersection_uid, date_trunc('hour', datetime_bin), classification_uid
ORDER BY 1, 2, 3
"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_api = pd.read_sql(sql_query, con=conn)

sql_query = """SELECT intersection_uid,
       date_trunc('hour', datetime_bin) count_date,
       classification_uid,
       SUM(volume) volume_csv
FROM miovision_csv.volumes_2020
WHERE datetime_bin BETWEEN '2019-01-01' AND '2020-09-30 23:59:59'
GROUP BY intersection_uid, date_trunc('hour', datetime_bin), classification_uid
ORDER BY 1, 2, 3
"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_csv = pd.read_sql(sql_query, con=conn)

In [3]:
df_api['count_date']

0         2019-01-01 00:00:00
1         2019-01-01 00:00:00
2         2019-01-01 00:00:00
3         2019-01-01 00:00:00
4         2019-01-01 01:00:00
                  ...        
1939660   2020-09-30 23:00:00
1939661   2020-09-30 23:00:00
1939662   2020-09-30 23:00:00
1939663   2020-09-30 23:00:00
1939664   2020-09-30 23:00:00
Name: count_date, Length: 1939665, dtype: datetime64[ns]

In [4]:
# Outer merge keeps hours that appear in only one of the two datasets.
df = pd.merge(df_api, df_csv, how='outer', on=['intersection_uid', 'count_date', 'classification_uid'])

In [5]:
df.loc[df['volume_newapi'].isna(), :]

Unnamed: 0,intersection_uid,count_date,classification_uid,volume_newapi,volume_oldapi,volume_csv
1939665,2,2020-04-16 23:00:00,4,,,1.0
1939666,5,2019-04-29 04:00:00,6,,,1.0
1939667,5,2019-07-23 03:00:00,6,,,1.0
1939668,5,2019-12-03 18:00:00,6,,,5.0
1939669,5,2020-01-05 19:00:00,6,,,1.0
...,...,...,...,...,...,...
2502769,40,2020-09-30 07:00:00,6,,,131.0
2502770,40,2020-09-30 08:00:00,6,,,246.0
2502771,40,2020-09-30 09:00:00,6,,,244.0
2502772,40,2020-09-30 10:00:00,6,,,205.0


Turns out we're missing a ton of data from the API pull. This is because `intersection_tmc.py` [checks](https://github.com/CityofToronto/bdit_data-sources/blob/master/volumes/miovision/api/intersection_tmc.py#L312-L329) that the activation date is before the start of the pull period, and the decommission date is before the current day (that should really be the last timestamp of the pull...). In total, 563,109 hourly rows are missing from the API data.
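
To illustrate what comparing against the pull window could look like, here is a hypothetical sketch. It is not the logic in `intersection_tmc.py`; the function name, argument names and the use of `None` for "not decommissioned" are all assumptions made for this example.

```python
import datetime


def is_active_during_pull(date_installed, date_decommissioned,
                          pull_start, pull_end):
    """Return True if the intersection was active for any part of the pull.

    date_decommissioned is None for intersections still in service
    (an assumption of this sketch).
    """
    installed_before_end = date_installed < pull_end
    active_after_start = (date_decommissioned is None
                          or date_decommissioned > pull_start)
    return installed_before_end and active_after_start


pull_start = datetime.date(2019, 1, 1)
pull_end = datetime.date(2020, 1, 1)

# Activated mid-year: a check that requires activation before pull_start
# would skip this intersection for the whole year block.
print(is_active_during_pull(datetime.date(2019, 6, 1), None,
                            pull_start, pull_end))
```

A pull based on this test would still need to clip each intersection's request to the period when it was actually active.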

There's also data missing from the CSV dump. For almost all `intersection_uid`s, fewer than 10 hourly rows are missing. The exceptions are King / Jarvis (UID 20) and Queen / Jarvis (UID 25).

In [6]:
df.loc[df['volume_csv'].isna(), 'intersection_uid'].value_counts()

20    4281
25     272
31       4
7        3
4        3
22       2
6        2
1        2
28       2
12       2
23       1
18       1
10       1
2        1
29       1
24       1
Name: intersection_uid, dtype: int64

In [7]:
df_20missing = df.loc[df['volume_csv'].isna() &
                      (df['intersection_uid'] == 20), :].sort_values(['count_date', 'classification_uid'])

In [8]:
(df_20missing['count_date'].dt.date + pd.offsets.MonthBegin(-1)).value_counts()

2020-09-01    4130
2020-08-01     148
2020-05-01       1
2020-07-01       1
2020-04-01       1
Name: count_date, dtype: int64

In [9]:
df_25missing = df.loc[df['volume_csv'].isna() &
                      (df['intersection_uid'] == 25), :].sort_values(['count_date', 'classification_uid'])

In [10]:
(df_25missing['count_date'].dt.date + pd.offsets.MonthBegin(-1)).value_counts()

2020-08-01    137
2020-09-01    133
2019-09-01      1
2019-01-01      1
Name: count_date, dtype: int64

So for intersections 20 and 25, nearly all of the missing CSV rows fall in August and September 2020.
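
One caveat on the month labels: adding `pd.offsets.MonthBegin(-1)` to a date that already sits on the first of a month rolls it back to the previous month, so hours on the 1st are counted one month early. The sketch below, on made-up timestamps, compares that with `dt.to_period('M')` as one possible alternative. It does not change the conclusion above, since the affected rows only move between adjacent months.

```python
import pandas as pd

# Made-up hourly timestamps around a month boundary.
count_dates = pd.Series(pd.to_datetime(['2020-08-31 23:00',
                                        '2020-09-01 00:00',
                                        '2020-09-15 12:00']))

# Offset approach: dates already on the 1st roll back a full month.
by_offset = count_dates.dt.normalize() + pd.offsets.MonthBegin(-1)

# Period approach: every timestamp is labelled with its own calendar month.
by_period = count_dates.dt.to_period('M')

print(pd.DataFrame({'count_date': count_dates,
                    'by_offset': by_offset,
                    'by_period': by_period}))
```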

What about the hours that are present in both datasets? Where their hourly volumes differ, this could reflect actual differences in minute-by-minute counts, or (much more likely) rows of raw data that are missing from one dataset or the other.

In [11]:
# Inner merge keeps only hours present in both the API and CSV data.
df_both = pd.merge(df_api, df_csv, how='inner', on=['intersection_uid', 'count_date', 'classification_uid'])

In [12]:
def get_comparison_plot(intersect_uid=26, class_uid=1):

    dfc = df_both.loc[(df_both['intersection_uid'] == intersect_uid) &
                      (df_both['classification_uid'] == class_uid), :]
    dfc = dfc.set_index('count_date').drop(columns=['intersection_uid', 'classification_uid'])

    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=dfc['volume_newapi'].index,
        y=dfc['volume_csv'] - dfc['volume_newapi'],
        mode='lines',
        name='CSV - New API'))

    fig.add_trace(go.Scatter(
        x=dfc['volume_newapi'].index,
        y=dfc['volume_csv'] - dfc['volume_oldapi'],
        mode='lines',
        name='CSV - Old API'))

    fig.add_trace(go.Scatter(
        x=dfc['volume_oldapi'].index,
        y=dfc['volume_newapi'] - dfc['volume_oldapi'],
        mode='lines',
        name='New - Old API'))
    
    fig.update_layout(
        title={
            'text': ("Volume Differences for "
                     "intersection_uid = {0}; class = {1}").format(intersect_uid,
                                                                   class_uid),
            'font_size': 14
        },
        xaxis_title="Date",
        yaxis_title="Volume Differences",
        xaxis_rangeslider_visible=True,
        margin=dict(l=40, r=40, t=80, b=40),
    )
  
    nonzero_diff = (dfc['volume_csv'] - dfc['volume_newapi']).values
    nonzero_diff = nonzero_diff[nonzero_diff != 0]

    fig2 = go.Figure()

    fig2.add_trace(
        go.Histogram(
            histfunc="count",
            x=nonzero_diff,
            nbinsx=30,
        )
    )

    fig2.update_layout(
        title={
            'text': ("Histogram of Nonzero CSV - New API Differences (N_hours = {0})"
                     .format(nonzero_diff.shape[0])),
            'font_size': 14
        },
        xaxis_title="CSV - New",
        yaxis_title="Number of Hours",
        height=200,
        margin=dict(l=40, r=40, t=50, b=40),
    )
    
    fig.show()
    fig2.show();

In [13]:
intnames = sorted(list(df_both['intersection_uid'].unique()))
class_uids = sorted(list(df_both['classification_uid'].unique()))

interact(get_comparison_plot, intersect_uid=intnames,
         class_uid=class_uids);


In [21]:
df_api.loc[df_api['intersection_uid'] == 33]

Unnamed: 0,intersection_uid,count_date,classification_uid,volume_newapi,volume_oldapi
1937648,33,2020-09-29 00:00:00,1,506,506.0
1937649,33,2020-09-29 00:00:00,2,3,3.0
1937650,33,2020-09-29 00:00:00,3,1,1.0
1937651,33,2020-09-29 00:00:00,4,14,14.0
1937652,33,2020-09-29 00:00:00,6,38,38.0
...,...,...,...,...,...
1937882,33,2020-09-30 23:00:00,1,923,923.0
1937883,33,2020-09-30 23:00:00,2,29,29.0
1937884,33,2020-09-30 23:00:00,3,2,2.0
1937885,33,2020-09-30 23:00:00,4,14,14.0


### Adelaide / Jarvis

As discussed [here](https://github.com/CityofToronto/bdit_data-sources/issues/331#issuecomment-714705192), when looking at the difference in volume between CSV and new or old API pulls, we see 0 for the most part, punctuated by upward and downward spikes. The upward spikes are where the CSV dump has a higher volume, and the downward spikes are where the API pull has more.

Differences in volume are due to either
- Differences in the raw minute-bin data or
- Missing minutes that result in a lower volume when aggregating up to the nearest hour.
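
The two causes can be told apart at the minute level by outer-joining the raw rows and labelling each one. The sketch below does this in pandas on made-up data; the frames and the `cause` labels are illustrative assumptions, and the notebook's own comparison is done with a SQL `FULL OUTER JOIN`.

```python
import numpy as np
import pandas as pd

keys = ['datetime_bin', 'leg', 'movement_uid']

# Made-up minute-bin rows for one intersection and classification.
csv = pd.DataFrame({
    'datetime_bin': pd.to_datetime(['2020-08-08 10:51', '2020-08-08 10:53',
                                    '2020-08-08 10:54']),
    'leg': ['N', 'S', 'W'],
    'movement_uid': [1, 1, 1],
    'volume_csv': [11, 4, 6],
})
api = pd.DataFrame({
    'datetime_bin': pd.to_datetime(['2020-08-08 10:51', '2020-08-08 10:52',
                                    '2020-08-08 10:54']),
    'leg': ['N', 'W', 'W'],
    'movement_uid': [1, 1, 1],
    'volume_api': [12, 13, 6],
})

joined = pd.merge(csv, api, how='outer', on=keys)

# Label each row; the first matching condition wins.
joined['cause'] = np.select(
    [joined['volume_csv'].isna(),
     joined['volume_api'].isna(),
     joined['volume_csv'] != joined['volume_api']],
    ['missing in CSV', 'missing in API', 'volumes differ'],
    default='match')

print(joined['cause'].value_counts())
```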

To investigate further, we look at a negative spike (CSV has less data) on 2020-08-08 between 10:00 and 12:00, and at two positive spikes (CSV has more data) on 2019-11-03 01:00 and 2019-06-11 21:00.

In [14]:
sql_query = """SELECT datetime_bin,
	   leg,
	   movement_uid,
	   volume_csv,
	   volume_api
FROM (
	SELECT datetime_bin,
            leg,
			movement_uid,
			volume volume_csv
	FROM miovision_csv.volumes_2020
	WHERE intersection_uid = 4 AND classification_uid = 1
		AND datetime_bin BETWEEN '2020-08-08 10:50:00' AND '2020-08-08 11:05:00'
) a
FULL OUTER JOIN (
	SELECT datetime_bin,
	       leg,
	       movement_uid,
	       volume volume_api
	FROM miovision_api.volumes
	WHERE intersection_uid = 4 AND classification_uid = 1
		AND datetime_bin BETWEEN '2020-08-08 10:50:00' AND '2020-08-08 11:05:00'
) b USING (datetime_bin, leg, movement_uid)
WHERE (volume_csv IS NULL) OR (volume_api IS NULL) OR (volume_api != volume_csv)
ORDER BY datetime_bin, leg, movement_uid"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_20200808_nspike = pd.read_sql(sql_query, con=conn)

In [15]:
df_20200808_nspike

Unnamed: 0,datetime_bin,leg,movement_uid,volume_csv,volume_api
0,2020-08-08 10:51:00,N,1,11.0,12
1,2020-08-08 10:51:00,N,2,1.0,2
2,2020-08-08 10:51:00,S,1,5.0,10
3,2020-08-08 10:51:00,W,1,2.0,5
4,2020-08-08 10:51:00,W,3,,1
5,2020-08-08 10:52:00,W,1,,13
6,2020-08-08 10:52:00,W,2,,2
7,2020-08-08 10:52:00,W,3,,2
8,2020-08-08 10:53:00,S,1,4.0,5
9,2020-08-08 10:53:00,W,1,5.0,13


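In the 2020-08-08 negative spike, some rows are missing from the CSV and others have a lower CSV volume than API volume. Meanwhile, the 2019-11-03 spike is due to duplicate timestamps from the end of Daylight Saving Time: the 01:00 hour occurs twice when the clocks fall back, so the CSV has two rows for each `datetime_bin`, `leg` and `movement_uid`.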

In [16]:
sql_query = """SELECT datetime_bin,
             leg,
             movement_uid,
             volume volume_csv
	FROM miovision_csv.volumes_2020
	WHERE intersection_uid = 4 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-11-03 01:00:00' AND '2019-11-03 01:10:00'
	ORDER BY 1, 2, 3"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_csv_20191103 = pd.read_sql(sql_query, con=conn)
    
df_csv_20191103

Unnamed: 0,datetime_bin,leg,movement_uid,volume_csv
0,2019-11-03 01:00:00,N,1,10
1,2019-11-03 01:00:00,N,1,17
2,2019-11-03 01:00:00,N,2,3
3,2019-11-03 01:00:00,N,2,1
4,2019-11-03 01:00:00,S,1,4
...,...,...,...,...
99,2019-11-03 01:10:00,S,1,4
100,2019-11-03 01:10:00,S,1,5
101,2019-11-03 01:10:00,W,1,6
102,2019-11-03 01:10:00,W,1,4
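
Rows like these can be flagged without inspecting them by eye. The sketch below, on a few toy rows, checks for repeated keys and for wall-clock times that are ambiguous in local time. It assumes the timestamps are naive local times in the `America/Toronto` zone, which the source does not state explicitly.

```python
import pandas as pd

# Toy rows: the 01:00 bin appears twice, as on the fall-back day.
rows = pd.DataFrame({
    'datetime_bin': pd.to_datetime(['2019-11-03 00:59', '2019-11-03 01:00',
                                    '2019-11-03 01:00', '2019-11-03 02:00']),
    'leg': ['N', 'N', 'N', 'N'],
    'movement_uid': [1, 1, 1, 1],
    'volume_csv': [8, 10, 17, 6],
})

# Rows sharing the same (datetime_bin, leg, movement_uid) key.
dup_key = rows.duplicated(subset=['datetime_bin', 'leg', 'movement_uid'],
                          keep=False)

# ambiguous='NaT' turns wall-clock times that occur twice into NaT.
localized = rows['datetime_bin'].dt.tz_localize('America/Toronto',
                                                ambiguous='NaT')
is_ambiguous = localized.isna()

print(rows.assign(dup_key=dup_key, is_ambiguous=is_ambiguous))
```

Duplicates that fall on ambiguous timestamps point to the clock change rather than to double counting.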


Finally, for 2019-06-11 21:00:

In [17]:
sql_query = """SELECT datetime_bin,
	   leg,
	   movement_uid,
	   volume_csv,
	   volume_api
FROM (
	SELECT datetime_bin,
            leg,
			movement_uid,
			volume volume_csv
	FROM miovision_csv.volumes_2020
	WHERE intersection_uid = 4 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-06-11 21:15:00' AND '2019-06-11 21:35:00'
) a
FULL OUTER JOIN (
	SELECT datetime_bin,
	       leg,
	       movement_uid,
	       volume volume_api
	FROM miovision_api.volumes
	WHERE intersection_uid = 4 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-06-11 21:15:00' AND '2019-06-11 21:35:00'
) b USING (datetime_bin, leg, movement_uid)
WHERE (volume_csv IS NULL) OR (volume_api IS NULL) OR (volume_api != volume_csv)
ORDER BY datetime_bin, leg, movement_uid"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_20190611_pspike = pd.read_sql(sql_query, con=conn)

df_20190611_pspike

Unnamed: 0,datetime_bin,leg,movement_uid,volume_csv,volume_api
0,2019-06-11 21:17:00,N,1,2,
1,2019-06-11 21:17:00,S,1,5,
2,2019-06-11 21:17:00,W,1,16,
3,2019-06-11 21:17:00,W,2,2,
4,2019-06-11 21:17:00,W,3,3,
...,...,...,...,...,...
75,2019-06-11 21:32:00,S,1,1,
76,2019-06-11 21:32:00,W,1,15,
77,2019-06-11 21:32:00,W,2,2,
78,2019-06-11 21:32:00,W,3,2,


Here the API side is mainly just missing about 15 minutes of data, between 21:17 and 21:32.
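
A gap like this can also be located by comparing the observed minute bins against a complete minute range. The example below is a small sketch with made-up timestamps. It assumes every minute should have at least one row, which may not hold if minutes with zero volume are simply absent from the table, so treat the result as a hint rather than proof of a data gap.

```python
import pandas as pd

# Made-up set of minute bins that have at least one row in the API data.
observed = pd.DatetimeIndex(pd.to_datetime([
    '2019-06-11 21:15', '2019-06-11 21:16',
    '2019-06-11 21:33', '2019-06-11 21:34', '2019-06-11 21:35']))

# Every minute we would expect to see in the query window.
expected = pd.date_range('2019-06-11 21:15', '2019-06-11 21:35', freq='min')

# Minutes with no rows at all.
missing = expected.difference(observed)

print(missing.min(), missing.max(), len(missing))
```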

### Richmond / Bathurst

This is discussed [here]().

We look at 2019-11-09 08:00 - 13:00 and 2019-06-11 21:00 (again).

In [18]:
sql_query = """SELECT datetime_bin,
	   leg,
	   movement_uid,
	   volume_csv,
	   volume_api
FROM (
	SELECT datetime_bin,
            leg,
			movement_uid,
			volume volume_csv
	FROM miovision_csv.volumes_2020
	WHERE intersection_uid = 26 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-11-09 08:00:00' AND '2019-11-09 13:00:00'
) a
FULL OUTER JOIN (
	SELECT datetime_bin,
	       leg,
	       movement_uid,
	       volume volume_api
	FROM miovision_api.volumes
	WHERE intersection_uid = 26 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-11-09 08:00:00' AND '2019-11-09 13:00:00'
) b USING (datetime_bin, leg, movement_uid)
WHERE (volume_csv IS NULL) OR (volume_api IS NULL) OR (volume_api != volume_csv)
ORDER BY datetime_bin, leg, movement_uid"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_20191109_nspike = pd.read_sql(sql_query, con=conn)

df_20191109_nspike

Unnamed: 0,datetime_bin,leg,movement_uid,volume_csv,volume_api
0,2019-11-09 08:16:00,E,2,,1
1,2019-11-09 08:16:00,N,1,,6
2,2019-11-09 08:16:00,S,1,3.0,5
3,2019-11-09 08:17:00,N,1,9.0,11
4,2019-11-09 09:12:00,E,2,,1
...,...,...,...,...,...
142,2019-11-09 12:21:00,W,3,1.0,3
143,2019-11-09 12:22:00,E,2,2.0,3
144,2019-11-09 12:22:00,E,3,2.0,3
145,2019-11-09 12:22:00,N,1,11.0,15


In [19]:
sql_query = """SELECT datetime_bin,
	   leg,
	   movement_uid,
	   volume_csv,
	   volume_api
FROM (
	SELECT datetime_bin,
            leg,
			movement_uid,
			volume volume_csv
	FROM miovision_csv.volumes_2020
	WHERE intersection_uid = 26 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-06-11 21:00:00' AND '2019-06-11 21:59:00'
) a
FULL OUTER JOIN (
	SELECT datetime_bin,
	       leg,
	       movement_uid,
	       volume volume_api
	FROM miovision_api.volumes
	WHERE intersection_uid = 26 AND classification_uid = 1
		AND datetime_bin BETWEEN '2019-06-11 21:00:00' AND '2019-06-11 21:59:00'
) b USING (datetime_bin, leg, movement_uid)
WHERE (volume_csv IS NULL) OR (volume_api IS NULL) OR (volume_api != volume_csv)
ORDER BY datetime_bin, leg, movement_uid"""

with psycopg2.connect(database='bigdata', **postgres_settings) as conn:
    df_20190611_pspike = pd.read_sql(sql_query, con=conn)

df_20190611_pspike

Unnamed: 0,datetime_bin,leg,movement_uid,volume_csv,volume_api
0,2019-06-11 21:18:00,E,2,3,
1,2019-06-11 21:18:00,E,3,10,
2,2019-06-11 21:18:00,N,1,12,
3,2019-06-11 21:18:00,S,1,2,
4,2019-06-11 21:19:00,E,2,2,
...,...,...,...,...,...
57,2019-06-11 21:31:00,N,1,4,
58,2019-06-11 21:31:00,S,1,6,
59,2019-06-11 21:32:00,E,3,7,3.0
60,2019-06-11 21:32:00,N,1,8,


These results are pretty similar to Adelaide / Jarvis. Where the CSV hourly volume is smaller than the API one, we see missing rows of data and some rows that don't agree. Where the API hourly volume is smaller, we mainly see missing rows in the API data.
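
The minute-level comparisons in both sections run the same query with a different intersection and time window each time. One way to avoid repeating it would be a single parameterized query, sketched below. The names `COMPARE_SQL` and `comparison_params` are hypothetical, and the `%(name)s` placeholders rely on psycopg2's named-parameter style.

```python
COMPARE_SQL = """
SELECT datetime_bin, leg, movement_uid, volume_csv, volume_api
FROM (
    SELECT datetime_bin, leg, movement_uid, volume AS volume_csv
    FROM miovision_csv.volumes_2020
    WHERE intersection_uid = %(intersection_uid)s
      AND classification_uid = %(classification_uid)s
      AND datetime_bin BETWEEN %(start)s AND %(end)s
) a
FULL OUTER JOIN (
    SELECT datetime_bin, leg, movement_uid, volume AS volume_api
    FROM miovision_api.volumes
    WHERE intersection_uid = %(intersection_uid)s
      AND classification_uid = %(classification_uid)s
      AND datetime_bin BETWEEN %(start)s AND %(end)s
) b USING (datetime_bin, leg, movement_uid)
WHERE (volume_csv IS NULL) OR (volume_api IS NULL) OR (volume_api != volume_csv)
ORDER BY datetime_bin, leg, movement_uid
"""


def comparison_params(intersection_uid, classification_uid, start, end):
    """Bundle the values that fill the named placeholders in COMPARE_SQL."""
    return {
        'intersection_uid': intersection_uid,
        'classification_uid': classification_uid,
        'start': start,
        'end': end,
    }


params = comparison_params(4, 1, '2020-08-08 10:50:00', '2020-08-08 11:05:00')
print(params)

# With a live connection (not run here):
# df = pd.read_sql(COMPARE_SQL, con=conn, params=params)
```

Passing the values through `params` lets the database driver handle quoting, instead of editing the SQL string for every spike.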

## Conclusions

- We're missing a ton of data in `miovision_api.volumes` because we ran `intersection_tmc.py` for entire year blocks, and that script [only includes intersections that were activated before the start of the block and are still active](https://github.com/CityofToronto/bdit_data-sources/blob/master/volumes/miovision/api/intersection_tmc.py#L312-L329).
- There are still differences between the CSV and API raw data. In some cases data is missing (from either CSV or API), and in other cases the `datetime_bin`, `leg` and `movement_uid` are the same but the volumes differ. We don't yet know why that's happening.