# Date Validation by Province

This notebook checks that the date values in the dataset are consistent with the validation rules for the NRN.

In [1]:
%matplotlib inline
# required modules
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from reporting import *

In [2]:
# set up the arguments for the dataset
data_dir = Path('../../nrn_data')

The data is stored in a set of GeoPackage files, one per province. The path to those files is not consistent between provinces, so a convenience function was created to find them all and extract the release version information.

In [3]:
roadseg = load_all_roadseg(data_dir)
roadseg['datasetnam'].value_counts()

Ontario                      656621
Québec                       440398
Alberta                      413349
Saskatchewan                 291003
British Columbia             263584
Nova Scotia                  111570
Manitoba                     110604
New Brunswick                 67930
Newfoundland and Labrador     44484
Prince Edward Island          18140
Northwest Territories          6793
Yukon Territory                6591
Nunavut                        4242
Name: datasetnam, dtype: int64
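
The file discovery behind a helper like `load_all_roadseg` could be sketched as follows. This is not the code in `reporting`, which is not shown here: the function name `find_gpkg_files` and the file name pattern (`NRN_<province>_<major>_<minor>_...gpkg`) are assumptions made for illustration only.

```python
import re
from pathlib import Path

# Hypothetical naming pattern: a two-letter province code followed by a
# major and minor version number. The real file names may differ.
VERSION_PATTERN = re.compile(r"NRN_(?P<prov>[A-Z]{2})_(?P<major>\d+)_(?P<minor>\d+)_")


def find_gpkg_files(data_dir):
    """Recursively find GeoPackage files and parse province and version from each name."""
    found = []
    for path in sorted(Path(data_dir).rglob("*.gpkg")):
        match = VERSION_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        found.append({
            "path": path,
            "province": match["prov"],
            "version": f"{match['major']}.{match['minor']}",
        })
    return found
```

Searching recursively with `rglob` sidesteps the inconsistent directory layout, since only the file name has to follow a pattern.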

Dates in the data are not stored at a consistent precision. The rules allow a year only (YYYY), a year and month (YYYYMM), or a full date (YYYYMMDD). The values always follow the YYYYMMDD digit order, though, so the length of a value shows which parts are missing. A missing month or day is filled in with the first month or day, which gives a normalized format that can be converted to the date type.
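
The length-based padding can be sketched as below. The `date_normalize` function used in this notebook comes from the `reporting` module and may be implemented differently; `normalize_date_sketch` is a hypothetical stand-in that assumes every value is a string of 4, 6, or 8 characters.

```python
def normalize_date_sketch(value):
    """Pad a YYYY, YYYYMM, or YYYYMMDD string to a full YYYYMMDD string."""
    text = str(value).strip()
    if len(text) == 4:
        # year only: assume January 1st
        return text + "0101"
    if len(text) == 6:
        # year and month: assume the first day of the month
        return text + "01"
    if len(text) == 8:
        # already a full date
        return text
    raise ValueError(f"unexpected date value: {value!r}")


print(normalize_date_sketch("1999"))      # 19990101
print(normalize_date_sketch("201107"))    # 20110701
print(normalize_date_sketch("20160201"))  # 20160201
```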

In [4]:
# The Québec data contains a single invalid date on a segment in Montreal: 20141401 has a month value of 14.
# It isn't clear what the proper value is, so it is shifted to December.
# https://geoegl.msp.gouv.qc.ca/igo2/apercu-qc/ shows this segment with a "dateappr" value of 20130115.
roadseg['credate'] = roadseg['credate'].str.replace('20141401', '20141201', regex=False)

# normalize the created and revised dates
roadseg['credate_norm'] = roadseg['credate'].apply(date_normalize)
roadseg['revdate_norm'] = roadseg['revdate'].apply(date_normalize)

# convert the normalized dates to proper DateTime dtypes
roadseg['created'] = pd.to_datetime(roadseg['credate_norm'], format="%Y%m%d")
roadseg['revised'] = pd.to_datetime(roadseg['revdate_norm'], format="%Y%m%d")
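
An invalid value such as the `20141401` corrected in the cell above can be located before conversion by testing each normalized string against the calendar. The following is a minimal sketch using only the standard library; `is_valid_yyyymmdd` is a hypothetical helper, not part of `reporting`.

```python
from datetime import datetime


def is_valid_yyyymmdd(value):
    """Return True if value is an 8-digit string naming a real calendar date."""
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


sample = ["20130115", "20141401", "20141201"]
invalid = [v for v in sample if not is_valid_yyyymmdd(v)]
print(invalid)  # ['20141401']
```

The explicit length check matters because `strptime` also accepts single-digit months and days, so a 7-character value could otherwise parse without error.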

In [7]:
# It would not make sense for a revised date to be before a created date, so check that.
(roadseg['revised'] < roadseg['created']).value_counts()

False    2435309
dtype: int64

In [5]:
# Calculate the differences between the created and revised dates so that they can be reported on.
roadseg['date_diff'] = roadseg['revised'] - roadseg['created']
roadseg[['created','revised','date_diff']].head()

,created,revised,date_diff
0,1999-01-01,2016-02-01,6240 days
1,1999-01-01,2016-02-01,6240 days
2,2011-07-01,2016-02-01,1676 days
3,2011-07-01,2016-02-01,1676 days
4,2001-01-01,2016-02-01,5509 days
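
The day counts above can be reproduced by hand with the standard library, which is a quick way to confirm that the subtraction counts calendar days, leap days included. The dates below are copied from the output above.

```python
from datetime import date

# created dates from the rows shown above, all revised on 2016-02-01
created = [date(1999, 1, 1), date(2011, 7, 1), date(2001, 1, 1)]
revised = date(2016, 2, 1)

diffs = [(revised - c).days for c in created]
print(diffs)  # [6240, 1676, 5509]
```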


In [6]:
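# Show the minimum and maximum number of days between the created and revised dates for each province.
# The aggregations are named as strings because passing the builtins min and max is deprecated in recent pandas.
roadseg.groupby('datasetnam')['date_diff'].agg(['min', 'max'])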

datasetnam,min,max
Alberta,0 days,5981 days
British Columbia,0 days,13393 days
Manitoba,0 days,4322 days
New Brunswick,1 days,1 days
Newfoundland and Labrador,0 days,4202 days
Northwest Territories,0 days,4410 days
Nova Scotia,0 days,11331 days
Nunavut,0 days,11052 days
Ontario,0 days,5893 days
Prince Edward Island,0 days,5113 days
