# Merging a shapefile of Canada's health region boundaries with COVID-19 case time series data

### Data input:
* [Weekly COVID-19 case data](https://raw.githubusercontent.com/ishaberry/Covid19Canada/master/timeseries_hr/cases_timeseries_hr.csv) (weekly_ts.csv)
* [Regional Health Boundaries Shapefile](https://opendata.arcgis.com/datasets/3aa9f7b1428642998fa399c57dad8045_0.zip) (RegionalHealthBoundaries.shp)

### Procedure:

* **Step 1:** Drop territories from the case data
* **Step 2:** Create a clean name column in each dataset to merge on
* **Step 3:** Match health region (HR) names between the two files
    * **Case data:**
        * Commas and bracketed text were removed, and dashes were replaced with spaces.
    * **Health region boundaries file:**
        * **Step 3b:** Health region names in the shapefile were changed to match the names in the case data. Where an HR name in the COVID-19 case data was ambiguous, the province was determined from the `province` column (see below).
        * **Step 3c:** The COVID-19 Open Data Working Group, which authored the case time series, aggregated some Saskatchewan health regions (see below). The HRs in the boundaries file were aggregated to match.
* **Step 4:** Merge on the cleaned HR name and check for unmatched regions

        
        
        
---

#### Step 3b --- Ambiguous HR matches between files
weekly_ts.csv | RegionalHealthBoundaries.shp
-----------|--------------------------
`Western`, NL | `Western Regional Health Authority`, NL 
`Central`, NL | `Central Regional Health Authority`, NL 
`Eastern`, NL | `Eastern Regional Health Authority`, NL 
`Northern`, MB | `Northern Regional Health Authority`, MB 
`Northern`, BC | `Northern Health`, BC 
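
The table above can also be read as a lookup keyed on both the HR name and the province. The following is a minimal sketch of that idea on made-up rows, not the notebook's own method; the `AMBIGUOUS` dictionary and the `disambiguate` helper are hypothetical names, and the province labels follow the spellings used in the cleaning code (`NL`, `Manitoba`, `BC`).

```python
import pandas as pd

# Toy stand-in for the case data; the real file has many more rows and columns.
cases = pd.DataFrame({
    "province": ["NL", "Manitoba", "BC", "Ontario"],
    "health_region": ["Western", "Northern", "Northern", "Toronto"],
})

# (case-data name, province) -> boundary-file name, copied from the table above.
AMBIGUOUS = {
    ("Western", "NL"): "Western Regional Health Authority",
    ("Central", "NL"): "Central Regional Health Authority",
    ("Eastern", "NL"): "Eastern Regional Health Authority",
    ("Northern", "Manitoba"): "Northern Regional Health Authority",
    ("Northern", "BC"): "Northern Health",
}

def disambiguate(row):
    """Return the boundary-file name for an ambiguous pair, else the name unchanged."""
    key = (row["health_region"], row["province"])
    return AMBIGUOUS.get(key, row["health_region"])

cases["clean_name"] = cases.apply(disambiguate, axis=1)
print(cases)
```

Keeping the pairs in one dictionary makes it easier to compare the code against the table, at the cost of a row-wise `apply`, which is slower than boolean masks on large frames.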



#### Step 3c --- Regional aggregates as decided by UofT researchers
##### Saskatchewan:
* Far North = Far North Central, Far North East, Far North West
* South = South West, South Central, South East
* Central = Central East, Central West
* North = North East, North West, North Central

---

### Data output:
Shapefile of Health Regions in Canada and their associated weekly COVID-19 data.

In [3]:
import pandas as pd
import geopandas as gpd

In [4]:
weekly = pd.read_csv('../collect/data/weekly_ts.csv')
hr = gpd.read_file('../collect/data/hr_boundaries/RegionalHealthBoundaries.shp')

## Cleaning

In [5]:
# Step 1 --- drop territories from the case data
prov_cases = weekly.set_index('province').drop(['NWT', 'Yukon', 'Nunavut'], axis = 0).reset_index()

# Step 2 --- set up clean name cols
prov_cases['clean_name'] = prov_cases.health_region
hr['clean_name'] = hr.ENGNAME

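# Step 3 --- clean names to match
# remove bracketed text; regex = True is required, otherwise the pattern
# is compared against whole cell values and nothing is replaced
prov_cases['clean_name'] = prov_cases['clean_name'].replace({
    r'\s\(.*\)': ''},
    regex = True)

# remove commas and replace dashes with spaces
prov_cases['clean_name'] = prov_cases['clean_name'].replace({
    ',': '',
    '-': ' '},
    regex = True)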


# Step 3b --- fix ambiguities in case file
### for case data
prov_cases.loc[(prov_cases.clean_name == 'Northern') & (prov_cases.province == 'Manitoba'), 'clean_name'] = 'Northern Regional Health Authority'
prov_cases.loc[(prov_cases.clean_name == 'Northern') & (prov_cases.province == 'BC'), 'clean_name'] = 'Northern Health'

prov_cases.loc[(prov_cases.clean_name == 'Central') & (prov_cases.province == 'NL'), 'clean_name'] = 'Central Regional Health Authority'
prov_cases.loc[(prov_cases.clean_name == 'Central') & (prov_cases.province == 'Alberta'), 'clean_name'] =  'Central Zone'

prov_cases.loc[(prov_cases.clean_name == 'South') & (prov_cases.province == 'Alberta'), 'clean_name'] = 'South Zone'
prov_cases.loc[(prov_cases.clean_name == 'North') & (prov_cases.province == 'Alberta'), 'clean_name'] = 'North Zone'


prov_cases.loc[(prov_cases.clean_name == 'Eastern') & (prov_cases.province == 'NL'), 'clean_name'] = 'Eastern Regional Health Authority'
prov_cases.loc[(prov_cases.clean_name == 'Eastern') & (prov_cases.province == 'Ontario'), 'clean_name'] = 'The Eastern Ontario'


prov_cases.loc[(prov_cases.clean_name == 'Western') & (prov_cases.province == 'NL'), 'clean_name'] = 'Western Regional Health Authority'


### for health regions data
hr['clean_name'] = hr['clean_name'].replace({
    # General
    'Région du ' : '',
    'Région de la ' : '',
    'Région des ' : '',
    'Région de ' : '',
    ' Regional Health Unit' : '',
    ' Health Unit' : '',
    'City of ' : '',
    ',': '',
    '-':' ',

    # Specific
    'Calgary Zone' : 'Calgary',
    'Edmonton Zone' : 'Edmonton',
    'Peterborough County–City': 'Peterborough',
    'Vancouver  Coastal Health': 'Vancouver Coastal',
    'Vancouver Island Health': 'Island',
    'Interior Health' : 'Interior',
    'Fraser Health':'Fraser',
    'The District of Algoma': 'Algoma',
    'Brant County': 'Brant',
    ' District':'',
    'Huron Perth Public':'Huron Perth',
    'Sudbury and':'Sudbury',
    'Niagara Regional Area':'Niagara',
    'Southern Health—Santé Sud':'Southern Health',
    'Renfrew County and':'Renfrew',
    'Hastings and Prince Edward Counties':'Hastings Prince Edward',
    'Prairie Mountain Health':'Prairie Mountain'
},
    regex = True)

# exact (whole-value) replacements, applied after the pattern-based cleaning
hr['clean_name'] = hr['clean_name'].replace({
    'Windsor Essex County':'Windsor Essex',
    'Kingston Frontenac and Lennox and Addington':'Kingston Frontenac Lennox & Addington',
    'Mauricie et du Centre du Québec':'Mauricie',
    'Haliburton Kawartha Pine Ridge':'Haliburton Kawartha Pineridge',
    'Gaspésie—Îles de la Madeleine':'Gaspésie Îles de la Madeleine',
    'Saguenay—Lac Saint Jean':'Saguenay',
    "l'Abitibi Témiscamingue":'Abitibi Témiscamingue',
    "l'Estrie":'Estrie',
    "l'Outaouais":'Outaouais',
    'Interlake Eastern Regional Health Authority':'Interlake Eastern',
    'Labrador Grenfell Regional Health Authority':'Labrador Grenfell',
    'Winnipeg Regional Health Authority':'Winnipeg'
})
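
The cleaning above mixes two modes of `Series.replace`: pattern substitution inside strings (`regex = True`) and exact whole-value replacement (the default). A small sketch on made-up names shows the difference; the example strings are illustrative and are not taken from either dataset.

```python
import pandas as pd

# Made-up names that exercise brackets, dashes, and commas.
names = pd.Series(["Toronto (City)", "Huron-Perth", "Kingston, Frontenac"])

# Without regex=True the key is compared against whole cell values,
# so a pattern such as r"\s\(.*\)" matches nothing and the data is unchanged.
literal = names.replace({r"\s\(.*\)": ""})

# With regex=True each key is applied as a pattern inside the strings.
cleaned = names.replace({r"\s\(.*\)": "", ",": "", "-": " "}, regex=True)

print(literal.tolist())
print(cleaned.tolist())
```

This is why general fragments such as `' Health Unit'` belong in a `regex = True` call, while full-name substitutions can stay in a plain call.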

In [6]:
# Step 3c --- aggregate zones in the shapefile as was done by the UofT researchers in the case time series
hr['clean_name'] = hr['clean_name'].replace({
    # SK -- Far North
    'Far North Central':'Far North',
    'Far North East':'Far North',
    'Far North West':'Far North',

    # SK -- South
    'South East':'South',
    'South West':'South',
    'South Central':'South',

    # SK -- Central
    'Central West':'Central',
    'Central East':'Central',

    # SK -- North
    'North East':'North',
    'North West':'North',
    'North Central':'North'
})
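
Renaming zones gives several rows the same `clean_name` but does not by itself combine them. The sketch below illustrates this on a toy attribute table, assuming a hypothetical numeric column `area_km2` that is not part of the boundaries file; it is meant only to show the relabel-then-group pattern.

```python
import pandas as pd

# Toy stand-in for the boundary attributes; `area_km2` is a hypothetical column.
regions = pd.DataFrame({
    "clean_name": ["Far North Central", "Far North East", "Far North West", "Regina"],
    "area_km2": [10.0, 20.0, 30.0, 5.0],
})

SK_AGGREGATES = {
    "Far North Central": "Far North",
    "Far North East": "Far North",
    "Far North West": "Far North",
}

# Relabelling alone leaves three separate rows that now share the name "Far North".
regions["clean_name"] = regions["clean_name"].replace(SK_AGGREGATES)
rows_after_rename = len(regions)

# Grouping on the new label collapses them into one row per aggregate.
aggregated = regions.groupby("clean_name", as_index=False)["area_km2"].sum()
print(rows_after_rename, len(aggregated))
```

For the polygons themselves, geopandas provides `GeoDataFrame.dissolve(by='clean_name')`, which unions the geometries within each group. Whether that extra step is needed here depends on how the merged file is used downstream.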

In [7]:
# Step 4 --- merge
merge = pd.merge(prov_cases, hr, on = 'clean_name')

# check missing names
hr_missing = list(set(hr.clean_name.unique()) - set(merge.clean_name.unique()))
cases_missing = list(set(prov_cases.clean_name.unique()) - set(merge.clean_name.unique()))
print('Missing from HR: {}\nMissing from Cases: {}'.format(len(hr_missing), len(cases_missing)))

Missing from HR: 3
Missing from Cases: 1


In [8]:
print(sorted(hr_missing), sorted(cases_missing))

['Northwest Territories', 'Nunavut', 'Yukon'] ['Not Reported']
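
The check above compares name sets after an inner merge. One possible alternative, sketched here on made-up rows, is an outer merge with `indicator=True`, which labels every row as matched or as present on only one side. The `region_id` column is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the cleaned case data and boundary attributes.
cases = pd.DataFrame({
    "clean_name": ["Calgary", "Edmonton", "Not Reported"],
    "cases": [5, 3, 1],
})
regions = pd.DataFrame({
    "clean_name": ["Calgary", "Edmonton", "Yukon"],
    "region_id": [1, 2, 3],  # hypothetical identifier column
})

# The `_merge` column is "both", "left_only", or "right_only" for each row.
check = cases.merge(regions, on="clean_name", how="outer", indicator=True)

cases_only = sorted(check.loc[check["_merge"] == "left_only", "clean_name"])
regions_only = sorted(check.loc[check["_merge"] == "right_only", "clean_name"])
print("Missing from HR:", cases_only)
print("Missing from cases:", regions_only)
```

The unmatched names come out of a single merge, so the diagnostic cannot drift out of step with the join keys used for the real merge.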


## Tidy & Export

<mark> Retain everything **on and after** July 22, 2020 </mark>

In [9]:
from datetime import datetime

In [10]:
# convert the merged table to a GeoDataFrame
shp = gpd.GeoDataFrame(merge, geometry='geometry')

# drop unnecessary columns
shp2 = shp.drop(columns = ['TotalPop20',
       'Pop0to4_20', 'Pop5to9_20', 'Pop10to14_', 'Pop15to19_', 'Pop20to24_',
       'Pop25to29_', 'Pop30to34_', 'Pop35to39_', 'Pop40to44_', 'Pop45to49_',
       'Pop50to54_', 'Pop55to59_', 'Pop60to64_', 'Pop65to69_', 'Pop70to74_',
       'Pop75to79_', 'Pop80to84_', 'Pop85Older', 'AverageAge', 'MedianAge_',
       'Last_Updat','NewCases7D', 'PopUnder20', 'Pop20to49', 'Pop50to69', 'Pop70to84',
       'PopOver85'])

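# parse the report dates so they can be compared chronologically
shp2['date_report'] = pd.to_datetime(shp2['date_report'], format = '%Y-%m-%d')


# keep rows dated on or after July 22, 2020
# .copy() avoids a SettingWithCopyWarning when the column is reassigned below
shp2_220720 = shp2.loc[shp2.date_report >= '2020-07-22'].copy()

# write the dates back as strings before saving
shp2_220720['date_report'] = shp2_220720['date_report'].astype(str)
shp2_220720.to_file('data/shapefiles/mergedHR-jan6.shp')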
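
The date handling in the export step can be tried on its own with a few made-up rows. This is a minimal sketch, assuming dates arrive as `YYYY-MM-DD` strings as in the cell above; the values are invented.

```python
import pandas as pd

# Made-up weekly rows; only the date handling is of interest here.
df = pd.DataFrame({
    "date_report": ["2020-07-15", "2020-07-22", "2020-07-29"],
    "cases": [1, 2, 3],
})

# Parse the strings so the comparison is chronological rather than lexical.
df["date_report"] = pd.to_datetime(df["date_report"], format="%Y-%m-%d")

# ">=" keeps the cutoff date itself ("on and after").
cutoff = pd.Timestamp("2020-07-22")
recent = df.loc[df["date_report"] >= cutoff].copy()

# Convert back to plain strings before writing to a file format
# that may not accept datetime columns.
recent["date_report"] = recent["date_report"].dt.strftime("%Y-%m-%d")
print(recent)
```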