### Notebook to parse XML to produce cleaned text of federal legislation

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International [(CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). NOTE: Users must also comply with upstream [licensing](https://www.justice.gc.ca/eng/terms-avis/index.html) for the data source.

Dataset & Code to be cited as: 

    Sean Rehaag, "Federal Legislation Bulk Decisions Dataset" (2024), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/legislation-fed/>.

Notes:

(1) Data Source: [Department of Justice GitHub](https://github.com/justicecanada/laws-lois-xml) & [Department of Justice Website](https://laws-lois.justice.gc.ca).

(2) Unofficial Data: The data are unofficial reproductions of materials available on the Department of Justice's Consolidated Acts and Regulations of Canada website. Official versions are available [here](https://laws-lois.justice.gc.ca/eng/acts/).

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Government of Canada.

(4) Non-Commercial Use: As indicated in the license, data may be used for non-commercial use (with attribution) only. For commercial use, see the Department of Justice website's [Terms of Use](https://www.justice.gc.ca/eng/terms-avis/index.html).

(5) Accuracy: Data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.

In [1]:
##############################################
##############################################
# NOTE: Github API does not see files beyond #
# 1000 files in a directory (it truncates).  #
# So, locally clone the repo and point to    #
# the relevant file path (and update repo)   #
##############################################
##############################################


# set paths
dir_path = 'd:/AI-Projects/laws-lois-xml/'
en_path = dir_path + 'eng/acts/'
fr_path = dir_path + 'fra/lois/'
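
Because the paths above point at a local clone, a quick check that a folder exists and contains XML files can catch a wrong `dir_path` before the parsing cells run. The following is a minimal sketch under that assumption; `list_xml_files` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path


def list_xml_files(folder):
    """Return the sorted .xml file names in folder; raise if the folder is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Expected a local clone at {folder}")
    # glob() reads the local file system, so the 1000-entry API truncation does not apply
    return sorted(p.name for p in folder.glob("*.xml"))
```

Calling `len(list_xml_files(en_path))` and `len(list_xml_files(fr_path))` before parsing gives a rough indication of whether both language folders were cloned completely.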


In [2]:
# Process English legislation

from lxml import etree as ET
import pandas as pd
import os
import time
import re

def get_root_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        root = ET.parse(f).getroot()
    return root

def fix_errors(text):
    text = text.replace('(   ','(')
    text = text.replace(' )',')')
    text = text.replace(' .','.')
    text = text.replace(' ,',',')
    text = text.replace('   ',' ')
    text = text.replace('  ',' ')
    return text

def extract_text(elem):
    text_parts = []  
    if elem.text:
        if elem.tag == 'DefinedTermEn':  # need to change if FR
            text_parts.append('\n'+'*'+elem.text.strip()+'*')    
        else:     
            text_parts.append(elem.text.strip())
    for child in elem: 
        child_text = extract_text(child)
        if child_text:
            text_parts.append(' ' + child_text + ' ')
    if elem.tail:
        tail_text = elem.tail.strip()
        if tail_text:
            text_parts.append(' ' + tail_text + ' ')
    return fix_errors(''.join(text_parts))

def extract_ordered_elements(root):
    ordered_tag = []
    ordered_text = []
    
    elements_to_extract = [
        'TitleText',
        'Label',
        'Text',
        'MarginalNote',
    ]

    special_labels = ['DIVISION', 'PART', 'SCHEDULE']
    
    for elem in root.iter():
        if elem.tag in elements_to_extract:       
            full_text = extract_text(elem)
            if full_text is None:
                continue
            if elem.tag == 'MarginalNote':
                ordered_text.append('\n\n### ' + full_text + '\n')
            elif elem.tag == 'Label':
                if ordered_tag and ordered_tag[-1] == 'Label':
                    ordered_text.append(full_text)
                elif any(label in full_text for label in special_labels):
                    ordered_text.append('\n\n# '+full_text)
                else:
                    ordered_text.append('\n' + full_text)
            elif elem.tag == 'TitleText':
                ordered_text.append('\n\n## ' + full_text)
            elif elem.tag == 'Text':
                ordered_text.append(full_text)
                                    
            ordered_tag.append((elem.tag))

    ordered_text = ' '.join(ordered_text)

    breakpoints = ['# SCHEDULE',
                   '## RELATED PROVISIONS']
                   
    for break_point in breakpoints:
        if break_point in ordered_text:
            ordered_text = ordered_text[:ordered_text.index(break_point)]
            break
    
    return ordered_text

def extract_date_assented(root, add_text = True):
    assent_stage = root.find(".//Stages[@stage='assented-to']")
    if assent_stage is not None:
        year = assent_stage.find(".//YYYY").text
        month = assent_stage.find(".//MM").text
        day = assent_stage.find(".//DD").text
        if year == '1000':
            assented_to_date = ""
        else:
            if add_text:
                assented_to_date = f"Assented to {year}-{month}-{day}\n"
            else:
                assented_to_date = f"{year}-{month}-{day}"
    else:
        assented_to_date = ""
    return assented_to_date

def extract_title(root):
    return root.find(".//ShortTitle").text

def extract_long_title(root):
    return root.find(".//LongTitle").text

def extract_citation(root, all_info = True):
    consolidated_number_element = root.find('.//ConsolidatedNumber')
    official_status = consolidated_number_element.get('official')
    if official_status == 'yes':
        year = '1985'
    try:
        year = root.find(".//AnnualStatuteId/YYYY").text
    except AttributeError:  # no AnnualStatuteId/YYYY element
        if official_status == 'no':
            year = 'XXXX'
    try:
        chapter_number = root.find(".//AnnualStatuteId/AnnualStatuteNumber").text
    except AttributeError:  # no AnnualStatuteNumber element
        chapter_number = ''
    consolidated_number = root.find(".//ConsolidatedNumber").text
    if official_status == 'no':
        if all_info:
            if 'Supp' in consolidated_number or 'Supp' in chapter_number:
                citation = f"R.S.C. {year}, c. {chapter_number} ({consolidated_number})"
            else:
                citation = f"S.C. {year}, c. {chapter_number} ({consolidated_number})"
        else:
            if 'Supp' in consolidated_number or 'Supp' in chapter_number:
                citation = f"R.S.C. {year}, c. {chapter_number}"
            else:
                citation = f"S.C. {year}, c. {chapter_number}"
    else:
        citation = f"R.S.C. {year}, c. {consolidated_number}"
    # manual fixes
    if 'S.C. 1952, c. 89' in citation:
        citation = 'R.S.C. 1952, c. 89'
    if 'S.C. 1927, c. 188' in citation:
        citation = 'R.S.C. 1927, c. 188'
    return citation.strip()

def fix_doc_date(doc_date, citation):
    if doc_date != '':
        doc_date = doc_date.split('-')
        if len(doc_date[1]) == 1:
            doc_date[1] = '0'+doc_date[1]
        if len(doc_date[2]) == 1:
            doc_date[2] = '0'+doc_date[2]
        doc_date = '-'.join(doc_date)
        return doc_date           
    if 'RSC 1927' in citation:
        return '1928-02-01'
    if 'RSC 1952' in citation:
        return '1952-09-15'
    if 'RSC 1970' in citation:
        return '1971-07-15'
    if 'RSC 1985' in citation:
        if '1st Supp' in citation:
            return '1988-12-12'
        if '2nd Supp' in citation:
            return '1988-12-12'
        if '3rd Supp' in citation:
            return '1989-05-01'
        if '4th Supp' in citation:
            return '1989-11-01'
        if '5th Supp' in citation:
            return '1994-03-01'
        else:
            return '1988-12-12'
    return ''  

def extract_full_text(root):
    long_title = str(extract_long_title(root))
    assented_date = str(extract_date_assented(root))
    ordered_text = str(extract_ordered_elements(root))
    full_text = long_title + '\n\n' + assented_date + '\n' + ordered_text
    full_text = re.sub(r'^\s+$', '\n', full_text, flags=re.MULTILINE)
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    return full_text

# iterate through all files in /acts/ folder and extract text to df
files = os.listdir(en_path)
data = []
for file in files:
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    try:
        root = get_root_from_file(en_path+file)
        citation = extract_citation(root, all_info=False).replace('.','')
        document_date = extract_date_assented(root, add_text=False)
        if document_date == '':
            if citation[:3] == 'SC ':
                citation = 'R'+citation
        document_date = fix_doc_date(document_date, citation)
        title = extract_title(root)
        full_text = extract_full_text(root)
        unofficial_text = '# '+ title + '\n\n' + citation + '\n\n' + full_text
        citation2 = file[:-4] 
        dataset = "LEGISLATION-FED"
        year = document_date[:4]
        language = 'en'
        source_url = 'https://github.com/justicecanada/laws-lois-xml/tree/main/eng/acts'
        scraped_timestamp = time.strftime('%Y-%m-%d')
        other = ''
        data.append([citation,
                     citation2, 
                     dataset, 
                     year, 
                     title, 
                     language,
                     document_date, 
                     source_url,
                     scraped_timestamp,
                     unofficial_text,
                     other])
    except Exception as e:
        print(f'Error in {file}')
        print(e)

df_acts_en = pd.DataFrame(data, columns=['citation',
                                 'citation2',
                                 'dataset',
                                 'year',
                                 'name',
                                 'language',
                                 'document_date', 
                                 'source_url',
                                 'scraped_timestamp',
                                 'unofficial_text',
                                 'other'
                                 ])

# export to json
df_acts_en.to_json('DATA/df_acts_en.json', orient='records', lines=True)

df_acts_en

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
0,"SC 2019, c 10",A-0.6,LEGISLATION-FED,2019,Accessible Canada Act,en,2019-06-21,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Accessible Canada Act\n\nSC 2019, c 10\n\nAn...",
1,"SC 2018, c 27, s 675",A-1.3,LEGISLATION-FED,2018,Addition of Lands to Reserves and Reserve Crea...,en,2018-12-13,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Addition of Lands to Reserves and Reserve Cr...,
2,"SC 2014, c 20, s 376",A-1.5,LEGISLATION-FED,2014,Administrative Tribunals Support Service of Ca...,en,2014-06-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Administrative Tribunals Support Service of ...,
3,"RSC 1985, c A-1",A-1,LEGISLATION-FED,1988,Access to Information Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Access to Information Act\n\nRSC 1985, c A-1...",
4,"RSC 1985, c 35 (4th Supp)",A-10.1,LEGISLATION-FED,1989,Air Canada Public Participation Act,en,1989-11-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Air Canada Public Participation Act\n\nRSC 1...,
...,...,...,...,...,...,...,...,...,...,...,...
932,"RSC 1985, c Y-4",Y-4,LEGISLATION-FED,1988,Yukon Quartz Mining Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Yukon Quartz Mining Act\n\nRSC 1985, c Y-4\n...",
933,"SC 1992, c 1",Z-0.91,LEGISLATION-FED,1992,"Miscellaneous Statute Law Amendment Act, 1991",en,1992-02-28,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Miscellaneous Statute Law Amendment Act, 199...",
934,"RSC 1952, c 89",Z-0.98,LEGISLATION-FED,1952,Dominion Succession Duty Act,en,1952-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Dominion Succession Duty Act\n\nRSC 1952, c ...",
935,"SC 1950-51, c 2",Z-040,LEGISLATION-FED,1950,"Canadian Forces Act, 1950",en,1950-09-09,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Canadian Forces Act, 1950\n\nSC 1950-51, c 2...",
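
The `extract_text` function above relies on how ElementTree-style parsers split mixed content: text before the first child lives in `.text`, and text following a child's closing tag lives in that child's `.tail`. The sketch below illustrates the model on a made-up snippet. It uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without extra dependencies, and `flatten` is a simplified, hypothetical stand-in for `extract_text` (no spacing fixes, no defined-term markers).

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Join an element's text, its children's text, and each child's tail in document order."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or "")  # text that follows the child's closing tag
    return "".join(parts)


# Made-up snippet for illustration only
snippet = "<Text>The <DefinedTermEn>Minister</DefinedTermEn> may act.</Text>"
root = ET.fromstring(snippet)
print(flatten(root))  # The Minister may act.
```

For plain concatenation like this, `"".join(root.itertext())` gives the same result; a hand-written recursion is only needed when per-tag handling is required, as with the defined terms above.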


In [3]:
# print rows where df_acts_en.document_date is ''
df_acts_en[df_acts_en.document_date == '']

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
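
The zero-padding step in `fix_doc_date` can also be expressed with the standard library's `datetime.date`, which has the side effect of rejecting impossible dates. This is a sketch of an alternative, not the notebook's method; `pad_date` is a hypothetical name.

```python
from datetime import date


def pad_date(raw):
    """Turn 'YYYY-M-D' into ISO 'YYYY-MM-DD'; return '' for empty input."""
    if not raw:
        return ""
    year, month, day = (int(part) for part in raw.split("-"))
    # date() raises ValueError for values such as month 13 or day 32
    return date(year, month, day).isoformat()


print(pad_date("2019-6-21"))  # 2019-06-21
```

The fallback dates for revised statutes would still need the citation-based lookup shown in `fix_doc_date`.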


In [4]:
# Process French legislation

from lxml import etree as ET
import pandas as pd
import os
import time
import re  # used by extract_full_text below

def get_root_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        root = ET.parse(f).getroot()
    return root

def fix_errors(text):
    text = text.replace('(   ','(')
    text = text.replace(' )',')')
    text = text.replace(' .','.')
    text = text.replace(' ,',',')
    text = text.replace('   ',' ')
    text = text.replace('  ',' ')
    return text

def extract_text(elem):
    text_parts = []  
    if elem.text:
        if elem.tag == 'DefinedTermFr':  
            text_parts.append('\n'+'*'+elem.text.strip()+'*')    
        else:     
            text_parts.append(elem.text.strip())
    for child in elem: 
        child_text = extract_text(child)
        if child_text:
            text_parts.append(' ' + child_text + ' ')
    if elem.tail:
        tail_text = elem.tail.strip()
        if tail_text:
            text_parts.append(' ' + tail_text + ' ')
    
    return fix_errors(''.join(text_parts))

def extract_ordered_elements(root):
    ordered_tag = []
    ordered_text = []
    
    elements_to_extract = [
        'TitleText',
        'Label',
        'Text',
        'MarginalNote',
    ]

    special_labels = ['SECTION', 'PARTIE', 'ANNEXE']
    
    for elem in root.iter():
        if elem.tag in elements_to_extract:       
            full_text = extract_text(elem)
            if full_text is None:
                continue
            if elem.tag == 'MarginalNote':
                ordered_text.append('\n\n### ' + full_text + '\n')
            elif elem.tag == 'Label':
                if ordered_tag and ordered_tag[-1] == 'Label':
                    ordered_text.append(full_text)
                elif any(label in full_text for label in special_labels):
                    ordered_text.append('\n\n# '+full_text)
                else:
                    ordered_text.append('\n' + full_text)
            elif elem.tag == 'TitleText':
                ordered_text.append('\n\n## ' + full_text)
            elif elem.tag == 'Text':
                ordered_text.append(full_text)
                                    
            ordered_tag.append((elem.tag))

    ordered_text = ' '.join(ordered_text)

    breakpoints = ['# ANNEX',
                   '#  ANNEX',
                   '#  SCHEDULE',
                   '# SCHEDULE',
                   '## DISPOSITIONS CONNEXES']
                   
    for break_point in breakpoints:
        if break_point in ordered_text:
            ordered_text = ordered_text[:ordered_text.index(break_point)]
            break    
    return ordered_text

def extract_date_assented(root, add_text = True):
    assent_stage = root.find(".//Stages[@stage='assented-to']")
    if assent_stage is not None:
        year = assent_stage.find(".//YYYY").text
        month = assent_stage.find(".//MM").text
        day = assent_stage.find(".//DD").text
        if year == '1000':
            assented_to_date = ""
        else:
            if add_text:
                assented_to_date = f"Sanctionnée {year}-{month}-{day}\n"
            else:
                assented_to_date = f"{year}-{month}-{day}"
    else:
        assented_to_date = ""
    return assented_to_date

def extract_title(root):
    return root.find(".//ShortTitle").text
    
def extract_long_title(root):
    return root.find(".//LongTitle").text
    
def extract_citation(root, all_info = True):
    consolidated_number_element = root.find('.//ConsolidatedNumber')
    official_status = consolidated_number_element.get('official')
    if official_status == 'yes':
        year = '1985'
    try:
        year = root.find(".//AnnualStatuteId/YYYY").text
    except AttributeError:  # no AnnualStatuteId/YYYY element
        if official_status == 'no':
            year = 'XXXX'

    chapter_number_element = root.find(".//AnnualStatuteId/AnnualStatuteNumber")
    if chapter_number_element is not None:
        chapter_number = ''.join(chapter_number_element.itertext())
    else:
        chapter_number = ''

    consolidated_number = root.find(".//ConsolidatedNumber").text
    if official_status == 'no':
        if all_info:
            if 'suppl' in consolidated_number or 'suppl' in chapter_number:
                citation = f"L.R.C. {year}, ch. {chapter_number} ({consolidated_number})"
            else:
                citation = f"L.C. {year}, ch. {chapter_number} ({consolidated_number})"             
        else:
            if 'suppl' in consolidated_number or 'suppl' in chapter_number:
                citation = f"L.R.C. {year}, ch. {chapter_number}"
            else:
                citation = f"L.C. {year}, ch. {chapter_number}"
    else:
        citation = f"L.R.C. {year}, ch. {consolidated_number}"
    # manual fixes
    if 'L.C. 1952, ch. 89' in citation:
        citation = 'L.R.C. 1952, ch. 89'
    if 'L.C. 1927, ch. 188' in citation:
        citation = 'L.R.C. 1927, ch. 188'
    citation = citation.replace('ch.','c.')
    year = int(year[:4])
    if year < 1985:
        citation = citation.replace('L.R.C.','S.R.C.')
        citation = citation.replace('L.C.','S.C.')
    return citation.strip()

def fix_doc_date(doc_date, citation):
    if doc_date != '':
        doc_date = doc_date.split('-')
        if len(doc_date[1]) == 1:
            doc_date[1] = '0'+doc_date[1]
        if len(doc_date[2]) == 1:
            doc_date[2] = '0'+doc_date[2]
        doc_date = '-'.join(doc_date)
        return doc_date           
    if 'SRC 1927' in citation or 'LRC 1927' in citation:
        return '1928-02-01'
    if 'SRC 1952' in citation or 'LRC 1952' in citation:
        return '1952-09-15'
    if 'SRC 1970' in citation or 'LRC 1970' in citation:
        return '1971-07-15'
    if 'LRC 1985' in citation or 'SRC 1985' in citation:
        if '1er suppl' in citation:
            return '1988-12-12'
        if '2e suppl' in citation:
            return '1988-12-12'
        if '3e suppl' in citation:
            return '1989-05-01'
        if '4e suppl' in citation:
            return '1989-11-01'
        if '5e suppl' in citation:
            return '1994-03-01'
        else:
            return '1988-12-12'
    return ''  

def extract_full_text(root):
    long_title = str(extract_long_title(root))
    assented_date = str(extract_date_assented(root))
    ordered_text = str(extract_ordered_elements(root))
    full_text = long_title + '\n\n' + assented_date + '\n' + ordered_text
    full_text = re.sub(r'^\s+$', '\n', full_text, flags=re.MULTILINE)
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    return full_text
    
# iterate through all files in /lois/ folder and extract text to df
files = os.listdir(fr_path)
data = []
for file in files:
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    try:
        root = get_root_from_file(fr_path+file)
        citation = extract_citation(root, all_info=False).replace('.','')
        document_date = extract_date_assented(root, add_text=False)
        if document_date == '':
            if citation[:3] == 'LC ':
                citation = citation.replace('LC ','LRC ')
            if citation[:3] == 'SC ':
                citation = citation.replace('SC ','SRC ')
            

        document_date = fix_doc_date(document_date, citation)
        title = extract_title(root)
        full_text = extract_full_text(root)
        unofficial_text = '# '+ title + '\n\n' + citation + '\n\n' + full_text
        citation2 = file[:-4] 
        dataset = "LEGISLATION-FED"
        year = document_date[:4]
        language = 'fr'
        source_url = 'https://github.com/justicecanada/laws-lois-xml/tree/main/fra/lois'
        scraped_timestamp = time.strftime('%Y-%m-%d')
        other = ''
        data.append([citation,
                     citation2, 
                     dataset, 
                     year, 
                     title, 
                     language,
                     document_date, 
                     source_url,
                     scraped_timestamp,
                     unofficial_text,
                     other])
    except Exception as e:
        print(f'Error in {file}')
        print(e)

df_acts_fr = pd.DataFrame(data, columns=['citation',
                                 'citation2',
                                 'dataset',
                                 'year',
                                 'name',
                                 'language',
                                 'document_date', 
                                 'source_url',
                                 'scraped_timestamp',
                                 'unofficial_text',
                                 'other'
                                 ])

# export to json
df_acts_fr.to_json('DATA/df_acts_fr.json', orient='records', lines=True)

df_acts_fr


Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
0,"LC 2019, c 10",A-0.6,LEGISLATION-FED,2019,Loi canadienne sur l’accessibilité,fr,2019-06-21,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi canadienne sur l’accessibilité\n\nLC 201...,
1,"LC 2018, c 27, art 675",A-1.3,LEGISLATION-FED,2018,Loi sur l’ajout de terres aux réserves et la c...,fr,2018-12-13,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur l’ajout de terres aux réserves et la...,
2,"LC 2014, c 20, art 376",A-1.5,LEGISLATION-FED,2014,Loi sur le Service canadien d’appui aux tribun...,fr,2014-06-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur le Service canadien d’appui aux trib...,
3,"LRC 1985, c A-1",A-1,LEGISLATION-FED,1988,Loi sur l’accès à l’information,fr,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Loi sur l’accès à l’information\n\nLRC 1985,...",
4,"LRC 1985, c 35 (4e suppl)",A-10.1,LEGISLATION-FED,1989,Loi sur la participation publique au capital d...,fr,1989-11-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur la participation publique au capital...,
...,...,...,...,...,...,...,...,...,...,...,...
932,"LRC 1985, c Y-4",Y-4,LEGISLATION-FED,1988,Loi sur l'extraction du quartz dans le Yukon,fr,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur l'extraction du quartz dans le Yukon...,
933,"LC 1992, c 1",Z-0.91,LEGISLATION-FED,1992,Loi corrective de 1991,fr,1992-02-28,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Loi corrective de 1991\n\nLC 1992, c 1\n\nLo...",
934,"SRC 1952, c 89",Z-0.98,LEGISLATION-FED,1952,Loi fédérale sur les droits successoraux,fr,1952-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi fédérale sur les droits successoraux\n\n...,
935,"SC 1950-51, c 2",Z-040,LEGISLATION-FED,1950,Loi de 1950 sur les forces canadiennes,fr,1950-09-09,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi de 1950 sur les forces canadiennes\n\nSC...,


In [5]:
# show rows where df.document_date is ''
df_acts_fr[df_acts_fr.document_date == '']

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other


## VERIFICATION

In [6]:
# verification for English legislation
from bs4 import BeautifulSoup
import requests

verify_list = []

for letter in 'ABCDEFGHIJKLMNOPQRSTUVWY':  # no X or Z
    
    url = f'https://laws-lois.justice.gc.ca/eng/acts/{letter}.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    contentBlock = soup.find_all('div', class_='contentBlock')[0]
    li_tags = contentBlock.find_all('li')
    for li in li_tags:
        verify_dict = {}
        verify_dict['verify-title'] = li.find('a', class_='TocTitle').text.strip()
        verify_dict['verify-citation'] = li.find('span', class_='htmlLink').text.strip().replace('.','').replace('RSC,', 'RSC')
        verify_list.append(verify_dict)

print(len(verify_list))

# convert to df
verify_df = pd.DataFrame(verify_list)
verify_df



939


Unnamed: 0,verify-title,verify-citation
0,Access to Information Act,"RSC 1985, c A-1"
1,Accessible Canada Act,"SC 2019, c 10"
2,Addition of Lands to Reserves and Reserve Crea...,"SC 2018, c 27, s 675"
3,Administrative Tribunals Support Service of Ca...,"SC 2014, c 20, s 376"
4,"Advance Payments for Crops Act [Repealed, 1997...","RSC 1985, c C-49"
...,...,...
934,Yukon First Nations Self-Government Act,"SC 1994, c 35"
935,"Yukon Placer Mining Act [Repealed, 2002, c. 7,...","RSC 1985, c Y-3"
936,"Yukon Quartz Mining Act [Repealed, 2002, c. 7,...","RSC 1985, c Y-4"
937,Yukon Surface Rights Board Act,"SC 1994, c 43"


In [7]:
# list rows in df where df.citation is not in verify_df.citation
df_acts_en[~df_acts_en['citation'].isin(verify_df['verify-citation'])]

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other


In [8]:
# list rows in df where verify_df.citation is not in df.citation
verify_df[~verify_df['verify-citation'].isin(df_acts_en['citation'])]

Unnamed: 0,verify-title,verify-citation
7,Agreements and Conventions,Agreements and Conventions
26,Appropriation Acts,Appropriation Acts
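
The two `isin` checks above amount to a two-way set difference between the parsed citations and the citations listed on the website. A minimal sketch in plain Python, using a few values copied from the tables above; `compare_citations` is a hypothetical helper.

```python
def compare_citations(parsed, reference):
    """Return (only_in_parsed, only_in_reference) as sorted lists."""
    parsed, reference = set(parsed), set(reference)
    return sorted(parsed - reference), sorted(reference - parsed)


# A small sample of values taken from the tables above
parsed = ["SC 2019, c 10", "RSC 1985, c A-1"]
reference = ["SC 2019, c 10", "RSC 1985, c A-1", "Appropriation Acts"]
print(compare_citations(parsed, reference))
# ([], ['Appropriation Acts'])
```

An empty first list and a second list containing only the deliberately skipped entries corresponds to the result shown in the two cells above.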


In [9]:
# verification for French legislation
from bs4 import BeautifulSoup
import requests

verify_list = []

for letter in 'ABCDEFGHIJLMNOPQRSTUVWYZ':  # no K or X
    
    url = f'https://laws-lois.justice.gc.ca/fra/lois/{letter}.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    contentBlock = soup.find_all('div', class_='contentBlock')[0]
    li_tags = contentBlock.find_all('li')
    for li in li_tags:
        verify_dict = {}
        verify_dict['verify-title'] = li.find('a', class_='TocTitle').text.strip()
        verify_dict['verify-citation'] = li.find('span', class_='htmlLink').text.strip().replace('.','').replace(')', '').replace('(', '').replace('ch', 'c')
        verify_dict['verify-citation'] =verify_dict['verify-citation'].replace('1er suppl', '(1er suppl)').replace('2e suppl', '(2e suppl)').replace('3e suppl', '(3e suppl)').replace('4e suppl', '(4e suppl)').replace('5e suppl', '(5e suppl)')
        
        verify_list.append(verify_dict)

print(len(verify_list))

# convert to df
verify_df = pd.DataFrame(verify_list)
verify_df



939


Unnamed: 0,verify-title,verify-citation
0,Abrogation de la Loi sur les titres de biens-f...,"LC 1993, c 41"
1,"Abrogation des lois, Loi sur l’","LC 2008, c 20"
2,"Accès à l’information, Loi sur l’","LRC 1985, c A-1"
3,Accès aux documents du Comité spécial sur les ...,"SC 1984, c 36"
4,"Accessibilité, Loi canadienne sur l’","LC 2019, c 10"
...,...,...
934,"Yukon, Loi sur le","LC 2002, c 7"
935,"Yukon, Loi sur le [Abrogée, 2002, ch. 7, art. ...","LRC 1985, c Y-2"
936,"Zone de chemin de fer, Loi de la","SRC 1927, c 116"
937,Zone du chemin de fer et du Bloc de la rivière...,"SC 1930, c 37"


In [10]:
# list rows in df where df.citation is not in verify_df.citation
df_acts_fr[~df_acts_fr['citation'].isin(verify_df['verify-citation'])]

# same as above but print unique values of citation
#df_acts_fr[~df_acts_fr['citation'].isin(verify_df['verify-citation'])]['citation'].unique()

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other


In [11]:
# list rows in df where verify_df.citation is not in df.citation
verify_df[~verify_df['verify-citation'].isin(df_acts_fr['citation'])]

Unnamed: 0,verify-title,verify-citation
53,Accords et conventions,Accords et conventions
289,"Crédits, Lois de","Crédits, Lois de"
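
The chain of `replace` calls used to clean the French website citations can be approximated with two regular expressions. This is a sketch of a variation rather than the notebook's code: it replaces `ch` only as a whole word, and the raw input format in the example is an assumption about how the site renders citations. `normalize_fr_citation` is a hypothetical name.

```python
import re


def normalize_fr_citation(raw):
    """Approximate the clean-up chain used in the French verification cell."""
    text = raw.strip().replace(".", "")
    text = text.replace("(", "").replace(")", "")
    text = re.sub(r"\bch\b", "c", text)  # 'ch' as a whole word only
    # re-wrap supplement markers such as '4e suppl' in parentheses
    text = re.sub(r"\b(1er|[2-5]e) suppl\b", r"(\1 suppl)", text)
    return text


# Assumed raw format, for illustration only
print(normalize_fr_citation("L.R.C. (1985), ch. 35 (4e suppl.)"))
# LRC 1985, c 35 (4e suppl)
```

Restricting the `ch` replacement to a whole word avoids touching other words that happen to contain those letters, which the plain `replace('ch', 'c')` would alter.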


In [12]:
# Verify assented to dates in en
from lxml import etree as ET
from markdownify import markdownify as md
import os

# get list of files from acts/en

files = os.listdir(en_path)
xslt = ET.parse('LIMS2HTML.xsl')
transform = ET.XSLT(xslt)

verify_assented = []

for file in files:
    verify_assented_dict = {}
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    dom = ET.parse(en_path+file)
    # reuse the XSLT transform compiled once above the loop
    newdom = transform(dom)
    markdn = md(ET.tostring(newdom, pretty_print=True).decode('utf-8'))
    try:
        assented_to = markdn.split('Assented to ')[1].split('\n')[0]
    except IndexError:  # no 'Assented to ' marker in the rendered text
        assented_to = ''
    verify_assented_dict['file'] = file
    verify_assented_dict['assented_to'] = assented_to
    verify_assented.append(verify_assented_dict)

# convert to df
verify_assented_df = pd.DataFrame(verify_assented)

# citation2 = file name without the .xml extension
verify_assented_df['citation2'] = verify_assented_df['file'].str[:-4]

verify_assented_df


Unnamed: 0,file,assented_to,citation2
0,A-0.6.xml,2019-6-21,A-0.6
1,A-1.3.xml,2018-12-13,A-1.3
2,A-1.5.xml,2014-6-19,A-1.5
3,A-1.xml,,A-1
4,A-10.1.xml,,A-10.1
...,...,...,...
932,Y-4.xml,,Y-4
933,Z-0.91.xml,1992-2-28,Z-0.91
934,Z-0.98.xml,1952-1-1,Z-0.98
935,Z-040.xml,1950-9-9,Z-040
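
The `split`-based extraction above takes whatever follows `Assented to ` up to the end of the line. A stricter variant can use a regular expression that only accepts a date-shaped value. This is a sketch, assuming the date is rendered as digits separated by hyphens as in the table above; `find_assented_date` is a hypothetical helper.

```python
import re

# Year, then one- or two-digit month and day, as seen in the table above
ASSENTED_RE = re.compile(r"Assented to (\d{4}-\d{1,2}-\d{1,2})")


def find_assented_date(markdown_text):
    """Return the first 'Assented to' date found, or '' if there is none."""
    match = ASSENTED_RE.search(markdown_text)
    return match.group(1) if match else ""


print(find_assented_date("An Act ...\n\nAssented to 2019-6-21\n\n1 ..."))  # 2019-6-21
```

Returning an empty string when nothing matches keeps the same convention as the cell above, so the result can be stored in the `assented_to` column unchanged.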


In [13]:
# merge df_acts_en with verify_assented_df on citation2
df_acts_en = df_acts_en.merge(verify_assented_df, on='citation2', how='left')
df_acts_en

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other,file,assented_to
0,"SC 2019, c 10",A-0.6,LEGISLATION-FED,2019,Accessible Canada Act,en,2019-06-21,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Accessible Canada Act\n\nSC 2019, c 10\n\nAn...",,A-0.6.xml,2019-6-21
1,"SC 2018, c 27, s 675",A-1.3,LEGISLATION-FED,2018,Addition of Lands to Reserves and Reserve Crea...,en,2018-12-13,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Addition of Lands to Reserves and Reserve Cr...,,A-1.3.xml,2018-12-13
2,"SC 2014, c 20, s 376",A-1.5,LEGISLATION-FED,2014,Administrative Tribunals Support Service of Ca...,en,2014-06-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Administrative Tribunals Support Service of ...,,A-1.5.xml,2014-6-19
3,"RSC 1985, c A-1",A-1,LEGISLATION-FED,1988,Access to Information Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Access to Information Act\n\nRSC 1985, c A-1...",,A-1.xml,
4,"RSC 1985, c 35 (4th Supp)",A-10.1,LEGISLATION-FED,1989,Air Canada Public Participation Act,en,1989-11-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Air Canada Public Participation Act\n\nRSC 1...,,A-10.1.xml,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
932,"RSC 1985, c Y-4",Y-4,LEGISLATION-FED,1988,Yukon Quartz Mining Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Yukon Quartz Mining Act\n\nRSC 1985, c Y-4\n...",,Y-4.xml,
933,"SC 1992, c 1",Z-0.91,LEGISLATION-FED,1992,"Miscellaneous Statute Law Amendment Act, 1991",en,1992-02-28,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Miscellaneous Statute Law Amendment Act, 199...",,Z-0.91.xml,1992-2-28
934,"RSC 1952, c 89",Z-0.98,LEGISLATION-FED,1952,Dominion Succession Duty Act,en,1952-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Dominion Succession Duty Act\n\nRSC 1952, c ...",,Z-0.98.xml,1952-1-1
935,"SC 1950-51, c 2",Z-040,LEGISLATION-FED,1950,"Canadian Forces Act, 1950",en,1950-09-09,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Canadian Forces Act, 1950\n\nSC 1950-51, c 2...",,Z-040.xml,1950-9-9


In [14]:
# prefix each non-empty assented_to value with "Assented to " (empty strings are left unchanged)
df_acts_en['assented_to'] = df_acts_en['assented_to'].apply(lambda x: 'Assented to '+x if x != '' else x)
df_acts_en

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other,file,assented_to
0,"SC 2019, c 10",A-0.6,LEGISLATION-FED,2019,Accessible Canada Act,en,2019-06-21,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Accessible Canada Act\n\nSC 2019, c 10\n\nAn...",,A-0.6.xml,Assented to 2019-6-21
1,"SC 2018, c 27, s 675",A-1.3,LEGISLATION-FED,2018,Addition of Lands to Reserves and Reserve Crea...,en,2018-12-13,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Addition of Lands to Reserves and Reserve Cr...,,A-1.3.xml,Assented to 2018-12-13
2,"SC 2014, c 20, s 376",A-1.5,LEGISLATION-FED,2014,Administrative Tribunals Support Service of Ca...,en,2014-06-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Administrative Tribunals Support Service of ...,,A-1.5.xml,Assented to 2014-6-19
3,"RSC 1985, c A-1",A-1,LEGISLATION-FED,1988,Access to Information Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Access to Information Act\n\nRSC 1985, c A-1...",,A-1.xml,
4,"RSC 1985, c 35 (4th Supp)",A-10.1,LEGISLATION-FED,1989,Air Canada Public Participation Act,en,1989-11-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Air Canada Public Participation Act\n\nRSC 1...,,A-10.1.xml,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
932,"RSC 1985, c Y-4",Y-4,LEGISLATION-FED,1988,Yukon Quartz Mining Act,en,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Yukon Quartz Mining Act\n\nRSC 1985, c Y-4\n...",,Y-4.xml,
933,"SC 1992, c 1",Z-0.91,LEGISLATION-FED,1992,"Miscellaneous Statute Law Amendment Act, 1991",en,1992-02-28,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Miscellaneous Statute Law Amendment Act, 199...",,Z-0.91.xml,Assented to 1992-2-28
934,"RSC 1952, c 89",Z-0.98,LEGISLATION-FED,1952,Dominion Succession Duty Act,en,1952-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Dominion Succession Duty Act\n\nRSC 1952, c ...",,Z-0.98.xml,Assented to 1952-1-1
935,"SC 1950-51, c 2",Z-040,LEGISLATION-FED,1950,"Canadian Forces Act, 1950",en,1950-09-09,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Canadian Forces Act, 1950\n\nSC 1950-51, c 2...",,Z-040.xml,Assented to 1950-9-9
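
Note that the `assented_to` values shown above are not zero-padded (`2019-6-21`), while `document_date` is (`2019-06-21`). If the two columns ever need to be compared, one option is to normalize the assent date first. The following is a minimal sketch only, not part of the original pipeline; `normalize_assent_date` is a hypothetical helper, and it assumes the date always has the form `YYYY-M-D`.

```python
from datetime import date

def normalize_assent_date(value, prefix="Assented to "):
    """Convert 'Assented to 2019-6-21' (or '2019-6-21') to '2019-06-21'.

    Empty strings are returned unchanged, mirroring the notebook's convention
    of using '' for acts without an assent date.
    """
    if value == "":
        return ""
    # Drop the prefix if it has already been added.
    raw = value[len(prefix):] if value.startswith(prefix) else value
    year, month, day = (int(part) for part in raw.split("-"))
    # date() also validates the values (e.g. it rejects month 13).
    return date(year, month, day).isoformat()

print(normalize_assent_date("Assented to 2019-6-21"))
print(normalize_assent_date("1952-1-1"))
```

The normalized string can then be compared directly with `document_date`.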


In [15]:
# check to see if the string df_acts_en.assented_to is in df_acts_en.unofficial_text
df_acts_en['assented_to_in_text'] = df_acts_en.apply(lambda x: x['assented_to'] in x['unofficial_text'], axis=1)

# list rows where df_acts_en.assented_to_in_text is False
df_acts_en[~df_acts_en['assented_to_in_text']]

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other,file,assented_to,assented_to_in_text
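
One caveat about the check above: in Python the empty string is a substring of every string, so rows whose `assented_to` is `''` always pass and can never appear in the list of failures. A minimal sketch, using hypothetical toy records rather than the real DataFrame, that reports "nothing to check" separately from "found":

```python
# Toy records standing in for rows of df_acts_en (values are illustrative only).
records = [
    {"citation2": "A-0.6", "assented_to": "Assented to 2019-6-21",
     "unofficial_text": "# Accessible Canada Act\n\nAssented to 2019-6-21\n"},
    {"citation2": "A-1", "assented_to": "",
     "unofficial_text": "# Access to Information Act\n\nRSC 1985, c A-1\n"},
]

def classify(record):
    """Return 'no_date', 'found', or 'missing' for one record."""
    needle = record["assented_to"]
    if needle == "":
        # '' in any_string is always True, so report this case on its own.
        return "no_date"
    return "found" if needle in record["unofficial_text"] else "missing"

statuses = {r["citation2"]: classify(r) for r in records}
print(statuses)
```

An empty result from the original check therefore means only that every non-empty date string was found in its text.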


In [16]:
#show unique values of year where assented_to is ''
df_acts_en[df_acts_en.assented_to == '']['year'].unique()

array(['1988', '1989', '1971', '1952', '1994', '1928'], dtype=object)

In [17]:
# Verify assented-to dates in the French (fr) acts
from lxml import etree as ET
from markdownify import markdownify as md
import os

# get list of files from acts/fr
files = os.listdir(fr_path)

# parse and compile the XSLT stylesheet once, outside the loop
xslt = ET.parse('LIMS2HTML.xsl')
transform = ET.XSLT(xslt)

verify_assented = []

for file in files:
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file in ('Z-01.xml', 'Z-02.xml'):
        continue
    dom = ET.parse(os.path.join(fr_path, file))
    newdom = transform(dom)
    markdn = md(ET.tostring(newdom, pretty_print=True).decode('utf-8'))
    # the date is the rest of the line after 'Sanctionnée '; use '' if the marker is absent
    try:
        assented_to = markdn.split('Sanctionnée ')[1].split('\n')[0]
    except IndexError:
        assented_to = ''
    verify_assented.append({'file': file, 'assented_to': assented_to})

# convert to df
verify_assented_df = pd.DataFrame(verify_assented)

# citation2 is the file name without the .xml extension
verify_assented_df['citation2'] = verify_assented_df['file'].str[:-4]

verify_assented_df


Unnamed: 0,file,assented_to,citation2
0,A-0.6.xml,2019-6-21,A-0.6
1,A-1.3.xml,2018-12-13,A-1.3
2,A-1.5.xml,2014-6-19,A-1.5
3,A-1.xml,,A-1
4,A-10.1.xml,,A-10.1
...,...,...,...
932,Y-4.xml,,Y-4
933,Z-0.91.xml,1992-2-28,Z-0.91
934,Z-0.98.xml,1952-1-1,Z-0.98
935,Z-040.xml,1950-9-9,Z-040
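
The extraction above takes everything between `Sanctionnée ` and the end of the line. A stricter alternative is to match the date pattern itself, so that any trailing text on the line is not captured. This is a sketch under the assumption that the date always appears as `YYYY-M-D` directly after the marker; `ASSENT_RE`, `extract_assent_date`, and the sample string are hypothetical.

```python
import re

# Assumption: the marker is followed by a date written as YYYY-M-D
# (month and day not zero-padded), as in the output table above.
ASSENT_RE = re.compile(r"Sanctionnée (\d{4}-\d{1,2}-\d{1,2})")

def extract_assent_date(markdown_text):
    """Return the first assent date found in the text, or '' if there is none."""
    match = ASSENT_RE.search(markdown_text)
    return match.group(1) if match else ""

# Illustrative snippet only, not copied from the dataset.
sample = "# Loi canadienne sur l’accessibilité\n\nLC 2019, c 10\n\nSanctionnée 2019-6-21\n\n"
print(extract_assent_date(sample))
print(repr(extract_assent_date("# Loi sans date de sanction\n")))
```

Unlike the `split` approach, this returns `''` when the marker is present but not followed by a date.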


In [18]:
# merge df_acts_fr with verify_assented_df on citation2, then prefix non-empty dates with 'Sanctionnée '
df_acts_fr = df_acts_fr.merge(verify_assented_df, on='citation2', how='left')
df_acts_fr['assented_to'] = df_acts_fr['assented_to'].apply(lambda x: 'Sanctionnée '+x if x != '' else x)
df_acts_fr

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other,file,assented_to
0,"LC 2019, c 10",A-0.6,LEGISLATION-FED,2019,Loi canadienne sur l’accessibilité,fr,2019-06-21,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi canadienne sur l’accessibilité\n\nLC 201...,,A-0.6.xml,Sanctionnée 2019-6-21
1,"LC 2018, c 27, art 675",A-1.3,LEGISLATION-FED,2018,Loi sur l’ajout de terres aux réserves et la c...,fr,2018-12-13,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur l’ajout de terres aux réserves et la...,,A-1.3.xml,Sanctionnée 2018-12-13
2,"LC 2014, c 20, art 376",A-1.5,LEGISLATION-FED,2014,Loi sur le Service canadien d’appui aux tribun...,fr,2014-06-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur le Service canadien d’appui aux trib...,,A-1.5.xml,Sanctionnée 2014-6-19
3,"LRC 1985, c A-1",A-1,LEGISLATION-FED,1988,Loi sur l’accès à l’information,fr,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Loi sur l’accès à l’information\n\nLRC 1985,...",,A-1.xml,
4,"LRC 1985, c 35 (4e suppl)",A-10.1,LEGISLATION-FED,1989,Loi sur la participation publique au capital d...,fr,1989-11-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur la participation publique au capital...,,A-10.1.xml,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
932,"LRC 1985, c Y-4",Y-4,LEGISLATION-FED,1988,Loi sur l'extraction du quartz dans le Yukon,fr,1988-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi sur l'extraction du quartz dans le Yukon...,,Y-4.xml,
933,"LC 1992, c 1",Z-0.91,LEGISLATION-FED,1992,Loi corrective de 1991,fr,1992-02-28,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,"# Loi corrective de 1991\n\nLC 1992, c 1\n\nLo...",,Z-0.91.xml,Sanctionnée 1992-2-28
934,"SRC 1952, c 89",Z-0.98,LEGISLATION-FED,1952,Loi fédérale sur les droits successoraux,fr,1952-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi fédérale sur les droits successoraux\n\n...,,Z-0.98.xml,Sanctionnée 1952-1-1
935,"SC 1950-51, c 2",Z-040,LEGISLATION-FED,1950,Loi de 1950 sur les forces canadiennes,fr,1950-09-09,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Loi de 1950 sur les forces canadiennes\n\nSC...,,Z-040.xml,Sanctionnée 1950-9-9


In [19]:
# check to see if the string df_acts_fr.assented_to is in df_acts_fr.unofficial_text
df_acts_fr['assented_to_in_text'] = df_acts_fr.apply(lambda x: x['assented_to'] in x['unofficial_text'], axis=1)

# list rows where df_acts_fr.assented_to_in_text is False
df_acts_fr[~df_acts_fr['assented_to_in_text']]

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other,file,assented_to,assented_to_in_text


In [20]:
# show unique values of year where assented_to is ''
df_acts_fr[df_acts_fr.assented_to == '']['year'].unique()

array(['1988', '1989', '1971', '1952', '1994', '1928'], dtype=object)

In [24]:
# manually view unofficial_text in en
from IPython.display import clear_output
while True:
    row_sought = input('Enter row number to print unofficial_text (or exit to exit): ')
    if row_sought == 'exit':
        break
    # clear the previously displayed act before printing the next one
    clear_output(wait=True)
    print(df_acts_en.iloc[int(row_sought)].unofficial_text)


# Atlantic Fisheries Restructuring Act

RSC 1985, c A-14

An Act to authorize investment in and the provision of financial assistance to the Atlantic Fisheries for the purpose of restructuring fishery enterprises

### Preamble
 WHEREAS a task force was established by the Government of Canada to study the Atlantic Fisheries with a view to recommending means of establishing and maintaining viable fishery enterprises on the Atlantic coast of Canada taking into account the economic and social development of the provinces concerned; AND WHEREAS the Government of Canada adopted the recommendations of the task force with respect to objectives of Atlantic Fisheries policy to the effect, in order of priority, first, that the Atlantic fishing industry be economically viable on an on-going basis, second, that employment in that industry be maximized subject to the constraint that those employed receive a reasonable income and third, to the extent that this objective is consistent with the first t

In [25]:
# manually view unofficial_text in fr
from IPython.display import clear_output
while True:
    row_sought = input('Enter row number to print unofficial_text (or exit to exit): ')
    if row_sought == 'exit':
        break
    # clear the previously displayed act before printing the next one
    clear_output(wait=True)
    print(df_acts_fr.iloc[int(row_sought)].unofficial_text)

# Loi sur la restructuration du secteur des pêches de l’Atlantique

LRC 1985, c A-14

Loi visant la restructuration d’entreprises grâce au concours financier apporté au secteur des pêches de l’Atlantique

### Préambule
 Vu la création par le gouvernement fédéral d’un groupe d’étude des pêches de l’Atlantique, chargé notamment de recommander les moyens de créer et de faire fonctionner des entreprises de pêche viables sur la côte atlantique du Canada dans le cadre du développement économique et social des provinces concernées; vu l’adoption par le gouvernement fédéral des recommandations du groupe touchant les objectifs d’une politique des pêches de l’Atlantique à savoir, par ordre de priorité, la rentabilité permanente du secteur, la maximisation des emplois dans le secteur, sous réserve d’un revenu normal pour ses travailleurs, et, dans la mesure où cet objectif s’harmonise avec les deux premiers ainsi qu’avec les engagements internationaux du Canada, l’exercice par les Canadiens des a
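
The two viewing loops above pass the typed value straight to `int()` and `iloc`, so a typo or an out-of-range number raises an exception and ends the loop. A possible way to guard the lookup is sketched below; `get_text` is a hypothetical helper, and `texts` is a plain list standing in for the `unofficial_text` column.

```python
def get_text(texts, row_input):
    """Return the text for a typed row number, or a message if the input is unusable."""
    try:
        index = int(row_input)
    except ValueError:
        return f"Not a row number: {row_input!r}"
    # Only accept positions that exist (negative positions are rejected here).
    if not 0 <= index < len(texts):
        return f"Row {index} is out of range (0 to {len(texts) - 1})"
    return texts[index]

# Illustrative stand-in for df_acts_en.unofficial_text.tolist()
texts = ["# Accessible Canada Act", "# Access to Information Act"]
print(get_text(texts, "1"))
print(get_text(texts, "abc"))
print(get_text(texts, "7"))
```

Inside the loop, `print(get_text(df_acts_en.unofficial_text.tolist(), row_sought))` would then replace the direct `iloc` lookup.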