# Weather Warning Data


## Data Location
The data can be accessed at the following URLs:

https://dd.weather.gc.ca/alerts/cap/{YYYYMMDD}/{EEEE}/{hh}/

A 30-day history is kept in this directory.
where:
- YYYYMMDD: warning transmission day.
- EEEE: 4 letters for the responsible office code except LAND or WATR for tornado warning and severe thunderstorm warning alerts.
- hh: warning transmission hour.
- The LAND directory contains the CAP-XML files for all tornado warning and severe thunderstorm warning alerts that are issued over land zones in Canada.

## File name nomenclature
NOTE: ALL HOURS ARE IN UTC.

The directories have the following nomenclature :

alerts/cap/YYYYMMDD/

The filenames have the following nomenclature :

T_BBBBNN_C_EEEE_YYYYMMDDhh_##########.cap

where:

- T: constant string. Literal specification from WMO-386 manual as a prefix for this file naming convention.
- BBBBNN (for tornado and severe thunderstorm alerts): 4 letters and 2 numbers representing the 2 letter province or water body code, the 2 letter country code CN (from the WMO list), and a 2 digit numeric code set to 00 to satisfy the format of the existing filename structure. Ex: ABCN00.
- BBBBNN (for all other alerts): 4 letters and 2 numbers representing the traditional WMO bulletin header used for the alert bulletin on the WMO transmission circuits. Ex: WWCN11.

- C: constant string. Specified by the WMO, as a prefix for the CCCC group.
- EEEE: 4 letters for the responsible office code (CWAO, CWTO, etc.). The exception is for tornado warning and severe thunderstorm warning alerts where the 4 letters are either LAND or WATR rather than responsible office.
- YYYYMMDDhhmm: warning transmission date/time (UTC).
- '##########': the 10 numeric digit CAP message identifier found in the CAP file

In [1]:
from datetime import datetime, date
from datetime import timedelta
import subprocess
import os
import pandas as pd
import xml.etree.ElementTree as ET

In [2]:
# Define where the data store locally
# save_dir = './alerts'

# For windows
# save_dir = r'D:\3.MMAI5100 Database Fundametals\WeatherAPI\alerts'
save_dir = '.'

In [3]:
current_date = date.today()
current_date

datetime.date(2023, 8, 13)

# NOTE
After the first time, no need to run week_dates again

In [4]:
# Get the latest one week date 
# week_dates = [(current_date - timedelta(days=i)).strftime('%Y%m%d') for i in range(0, 7)]

# Define some helper functions

In [5]:
def get_remote_data(url):
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    cmd = ['wget', '-r', '-np', '-nH', '--cut-dirs=1', '-P', save_dir, url]
    process = subprocess.Popen(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE
    )
    process.communicate()
    
    if process.returncode != 0:
        print("url does not exist!")

In [6]:
def get_element_tag(element):
    if element.tag[0] == "{":
        uri, ignore, tag = element.tag[1:].partition("}")
    else:
        uri = None
        tag = element.tag
    return tag, "{"+uri+"}"

In [7]:
def get_area_info(area):
    areaDesc = ''
    ploygon = ''
    geocodes = []
    for attr in area:
        tag, uri = get_element_tag(attr)
        if tag == 'areaDesc':
            areaDesc = attr.text
        if tag == 'polygon':
            ploygon = attr.text
        if tag == 'geocode':
            for geo in attr:
                tag, uri = get_element_tag(geo)
                if tag == 'value':
                    geocodes.append(geo.text)
                    
    geocodes = ','.join(geocodes)

    return areaDesc, ploygon, geocodes

## Note
For testing purpose, we only collect alerts from Ontario Storm Prediction Centre (CWTO) and tornado warning and severe thrunderstorm warning alerts on LAND.

In [8]:
hours = ["{:02d}".format(i) for i in range(0, 24)]
# dests = ["CWTO", "LAND"]
# Full data
dests = ["CWHX", "CWNT", "CWTO", "CWUL", "CWVR", "CWWG", "LAND"]

# Get Remote Data
- If this is your first time run the code, you can download the last 7 days data
- Daily run:
    - Skip week data downloading
    - Just download current date data
    

# Best and Fast way to get Remote Data
Using Advanced Message Queuing Protocol(AMQP) they provided.

https://eccc-msc.github.io/open-data/msc-datamart/amqp_en/


In [9]:
'''
for date in week_dates:
    for dest in dests:
        remote_path = "https://dd.weather.gc.ca/alerts/cap/{}/{}/".format(date, dest)
        get_remote_data(remote_path)
'''

'\nfor date in week_dates:\n    for dest in dests:\n        remote_path = "https://dd.weather.gc.ca/alerts/cap/{}/{}/".format(date, dest)\n        get_remote_data(remote_path)\n'

In [12]:
current_date = date.today().strftime('%Y%m%d')
for dest in dests:
    remote_path = "https://dd.weather.gc.ca/alerts/cap/{}/{}/".format(current_date, dest)
    print(remote_path)
    get_remote_data(remote_path)

https://dd.weather.gc.ca/alerts/cap/20230813/CWHX/


FileNotFoundError: [WinError 2] The system cannot find the file specified

In [None]:
# Remove unrelated index file
for root, dirs, files in os.walk(save_dir):
    for file in files:
        if not file.endswith(".cap"):
            file_path = os.path.join(root, file)
            os.remove(file_path)

In [None]:
# Create csv file headers
headers = ['identifier', 'sent', 'category', 'event', 'urgency', 'severity', 'certainty', 'effective', 'expires', 'areaDesc', 'ploygon', 'geocodes']

In [None]:
def get_xml_data(xmltree):
    alert_attr = {}
    areaDesc, ploygon, geocodes = '', '', ''
    for element in xmltree.getroot():
        tag, uri = get_element_tag(element)
        if tag == 'identifier':
            alert_attr['identifier'] = element.text.split(':')[2]
        if tag == 'sent':
            alert_attr['sent'] = element.text
            
        if tag == 'info':
            for elem in element:
                sub_tag, uri = get_element_tag(elem)
                if sub_tag == 'language' and elem.text == 'fr-CA': # skip the Franch version
                    break
                if sub_tag == 'category':
                    alert_attr['category'] = elem.text
                if sub_tag == 'event':
                    alert_attr['event'] = elem.text
                if sub_tag == 'urgency':
                    alert_attr['urgency'] = elem.text
                if sub_tag == 'severity':
                    alert_attr['severity'] = elem.text
                if sub_tag == 'certainty':
                    alert_attr['certainty'] = elem.text
                if sub_tag == 'effective':
                    alert_attr['effective'] = elem.text
                if sub_tag == 'expires':
                    alert_attr['expires'] = elem.text
                if sub_tag == 'area':
                    desc, pg, geos = get_area_info(elem)

                    areaDesc = desc if not areaDesc else areaDesc + '; ' + desc
                    ploygon = pg if not ploygon else ploygon + ' ' + pg
                    geocodes = geos if not geocodes else geocodes + ',' + geos 

                    alert_attr['areaDesc'] = areaDesc
                    alert_attr['ploygon'] = ploygon
                    alert_attr['geocodes'] = geocodes
                    
    collected_attrs = [alert_attr[header] for header in headers]
                    
    return collected_attrs

In [None]:
alert_df = pd.DataFrame(columns=headers)

In [None]:
for root, dirs, files in os.walk(save_dir):
    for file in files:
        file_path = os.path.join(root, file)
        data_tree = ET.parse(file_path)
        data_entry = get_xml_data(data_tree)
        try:
            alert_df.loc[len(alert_df)] = data_entry
        except:
            print("Fail to process file: {}".format(file))


In [None]:
file_name = "alerts_by_" + datetime.today().strftime('%Y-%m-%d') + ".csv"
alert_df.to_csv(file_name)