### Notebook to parse XML to produce cleaned text of federal regulations

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International [(CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). NOTE: Users must also comply with upstream [licensing](https://www.justice.gc.ca/eng/terms-avis/index.html) for the data source.

Dataset & Code to be cited as: 

    Sean Rehaag, "Federal Regulations Bulk Decisions Dataset" (2024), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/regulations-fed/>.

Notes:

(1) Data Source: [Department of Justice Github](https://github.com/justicecanada/laws-lois-xml) & [Department of Justice Website](https://laws-lois.justice.gc.ca).

(2) Unofficial Data: The data are unofficial reproductions of materials available on the Department of Justice's Consolidated Acts and Regulations of Canada website. Official versions are available [here](https://laws-lois.justice.gc.ca/eng/regulations/).

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Government of Canada.

(4) Non-Commercial Use: As indicated in the license, the data may be used for non-commercial purposes only (with attribution). For commercial use, see the Department of Justice website's [Terms of Use](https://www.justice.gc.ca/eng/terms-avis/index.html).

(5) Accuracy: Data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information aimed at prompting further research and discussion, rather than as definitive information.

In [2]:
##############################################
##############################################
# NOTE: Github API does not see files beyond #
# 1000 files in a directory (it truncates).  #
# So, locally clone the repo and point to    #
# the relevant file path (and update repo)   #
##############################################
##############################################


# set paths
dir_path = 'd:/AI-Projects/laws-lois-xml/'
en_path = dir_path + 'eng/regulations/'
fr_path = dir_path + 'fra/reglements/'
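
Because the file listing has to come from a local clone rather than the GitHub API, it can help to confirm that the clone exists and looks complete before processing. The following is a minimal sketch of such a check. `count_xml_files` is a hypothetical helper that is not part of the original notebook, and it assumes the folders are the ones `en_path` and `fr_path` point to on your machine.

```python
from pathlib import Path

def count_xml_files(folder):
    """Return the number of .xml files directly inside `folder` (hypothetical helper)."""
    path = Path(folder)
    if not path.is_dir():
        # Fail early with a clear message if the repo has not been cloned here
        raise FileNotFoundError(f"Expected a local clone at {path}")
    return sum(1 for p in path.iterdir() if p.suffix.lower() == ".xml")

# Example usage (assumes en_path and fr_path are set as in the cell above):
# print(count_xml_files(en_path), count_xml_files(fr_path))
```

A count well above 1000 for the English folder would suggest the local listing is not truncated in the way the API listing is.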



In [34]:
# Process English Regulations

# Footnotes aren't being nicely extracted, could fix that.

from lxml import etree as ET
import pandas as pd
import os
import time
import re

def get_root_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        root = ET.parse(f).getroot()
    return root

def fix_errors(text):
    text = text.replace('(   ','(')
    text = text.replace(' )',')')
    text = text.replace(' .','.')
    text = text.replace(' ,',',')
    text = text.replace(' ;',';')
    text = text.replace('    ',' ')
    text = text.replace('   ',' ')
    text = text.replace('  ',' ')
    text = text.replace('  ',' ')
    text = text.replace('  ',' ')
    return text

def extract_text(elem):
    text_parts = []  

    if elem.text:
        if elem.tag == 'DefinedTermEn':  # need to change if FR
            text_parts.append('\n'+'*'+elem.text.strip()+'*')    
        else:     
            text_parts.append(elem.text.strip().replace('\n',''))

    for child in elem:
        child_text = extract_text(child) if child.tag != 'FootnoteRef' else None
        if child.tag == 'FootnoteRef':
            ref_text = '{' + (child.text.strip() if child.text else '') + '} '
            child_text = ref_text + (child.tail.strip().replace('\n', '') if child.tail else '')
        if child_text:
            text_parts.append(' ' + child_text + ' ')
        elif child.tail: 
            tail_text = child.tail.strip().replace('\n', '')
            if tail_text:
                text_parts.append(' ' + tail_text + ' ')
            
    if elem.tail:
        tail_text = elem.tail.strip().replace('\n','')
        if tail_text:
            text_parts.append(' ' + tail_text + ' ')
    return fix_errors(''.join(text_parts))

def get_table_text(elem):

    # Placeholder for table text
    return ''

def extract_ordered_elements(root):
    ordered_tag = []
    ordered_text = []
    
    elements_to_extract = [
        'TitleText',
        'Label',
        'Text',
        'TableGroup'
    ]

    special_labels = ['DIVISION', 'PART', 'SCHEDULE']
    
    for elem in root.iter():
        if elem.tag in elements_to_extract:       
            full_text = extract_text(elem)
            if full_text is None:
                continue
            if elem.tag == 'MarginalNote':  # not reached here: 'MarginalNote' is not in elements_to_extract
                ordered_text.append('\n\n### ' + full_text + '\n')
            elif elem.tag == 'Label':
                if ordered_tag and ordered_tag[-1] == 'Label':
                    ordered_text.append(full_text)
                elif any(label in full_text for label in special_labels):
                    ordered_text.append('\n\n# '+full_text)
                else:
                    ordered_text.append('\n' + full_text)
            elif elem.tag == 'TitleText':
                ordered_text.append('\n\n## ' + full_text)
            elif elem.tag == 'Text':
                ordered_text.append(full_text)
            elif elem.tag == 'TableGroup':
                ordered_text.append(get_table_text(elem))
                                    
            ordered_tag.append((elem.tag))

    ordered_text = ' '.join(ordered_text)

    breakpoints = ['# SCHEDULE',
                   '## RELATED PROVISIONS']
                   
    for break_point in breakpoints:
        if break_point in ordered_text:
            ordered_text = ordered_text[:ordered_text.index(break_point)]
            break

    repealed_info = root.find('.//Repealed')
    if repealed_info is not None and 'Repealed' not in ordered_text:
        ordered_text = ordered_text + '\n\n' + repealed_info.text.strip()

    return ordered_text

def extract_citation(root, all_info = True):
    consolidated_number_element = root.find('.//InstrumentNumber')
    if consolidated_number_element is not None:
        citation = consolidated_number_element.text
    else:
        citation = ''

    return citation.strip()

def extract_date_registered(root, add_text = True):
    registered = root.find(".//RegistrationDate")
    if registered is not None:
        year = registered.find(".//YYYY").text
        month = registered.find(".//MM").text
        day = registered.find(".//DD").text
        date_registered = f"{year}-{month}-{day}"
    else:
        date_registered = ""
    return date_registered

def fix_doc_date(root, document_date, citation):
    if document_date == '':
        if citation[:3] == 'CRC':  # NOTE, JUST FOR YEARS, COULD GET MORE SPECIFIC
            if '1949' in citation:
                document_date = '1949-01-01'
            elif '1955' in citation:
                document_date = '1955-01-01'
            else:
                document_date = '1978-01-01'
        else:
            registered = root.find(".//RegulationMakerOrder")
            if registered is not None:
                year = registered.find(".//YYYY").text
                month = registered.find(".//MM").text
                day = registered.find(".//DD").text
                document_date = f"{year}-{month}-{day}"
            else:
                # manual fixes
                if citation == "SOR/54-743":
                    document_date = '1954-12-28'
                elif citation == "SOR/56-290":
                    document_date = '1956-07-19'
                elif citation == "SOR/57-176":
                    document_date = '1957-04-11'
                elif citation == "SOR/61-378":
                    document_date = '1961-08-22'
                elif citation == "SOR/67-619":
                    document_date = '1967-01-01' # NOTE JUST YEAR

                else:
                    document_date = ""
    return document_date

def extract_long_title(root):
    return root.find(".//LongTitle").text

def extract_short_title(root):
    return root.find(".//ShortTitle").text

def extract_title(root, include_short = True):
    if include_short:
        try:
            title = extract_short_title(root)
        except:
            title = extract_long_title(root)
    else:
        try:
            title = extract_long_title(root)
        except:
            title = ''
    # remove \n and extra spaces
    title = re.sub(r'\s+', ' ', title)
    return title

def extract_Enabling_Authority(root):
    try:
        enabling_authority = extract_text(root.find(".//EnablingAuthority"))
        if enabling_authority is None:
            enabling_authority = ''
        if enabling_authority != '':
            enabling_authority = 'Enabling authority: ' + enabling_authority.strip()
    except:
        enabling_authority = ''

    if enabling_authority is None:
        enabling_authority = ''
    return enabling_authority

def extract_full_text(root):
    title = str(extract_title(root, include_short=False))
    registered_date = str(extract_date_registered(root))
    ordered_text = str(extract_ordered_elements(root))
    enabling_authority = str(extract_Enabling_Authority(root))
    full_text = title + '\n\n' + registered_date + '\n\n' + enabling_authority + '\n\n' + ordered_text
    full_text = re.sub(r'^\s+$', '\n', full_text, flags=re.MULTILINE)
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    return full_text

# iterate through all files in the English regulations folder and extract text to df
files = os.listdir(en_path)
data = []
for file in files:
    # skip non-regulation files
    if file == 'regs.txt' or file == 'PLACEHOLDER2.xml':
        continue
    try:
        root = get_root_from_file(en_path+file)
        citation = extract_citation(root, all_info=False).replace('.','')
        document_date = extract_date_registered(root, add_text=False)
        document_date = fix_doc_date(root, document_date, citation)
        title = extract_title(root)
        full_text = extract_full_text(root)
        unofficial_text = '# '+ title + '\n\n' + citation + '\n\n' + full_text
        citation2 = ""
        dataset = "REGULATIONS-FED"
        year = document_date[:4]
        language = 'en'
        source_url = 'https://github.com/justicecanada/laws-lois-xml/tree/main/eng/regulations'
        scraped_timestamp = time.strftime('%Y-%m-%d')
        other = ''
        data.append([citation,
                     citation2, 
                     dataset, 
                     year, 
                     title, 
                     language,
                     document_date, 
                     source_url,
                     scraped_timestamp,
                     unofficial_text,
                     other])
    except Exception as e:
        print(f'Error in {file}')
        print(e)

df_regs_en = pd.DataFrame(data, columns=['citation',
                                 'citation2',
                                 'dataset',
                                 'year',
                                 'name',
                                 'language',
                                 'document_date', 
                                 'source_url',
                                 'scraped_timestamp',
                                 'unofficial_text',
                                 'other'
                                 ])

# # export to json
# df_regs_en.to_json('DATA/df_regs_en.json', orient='records', lines=True)

df_regs_en

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
0,"CRC, c 10",,REGULATIONS-FED,1978,Flying Accidents Compensation Regulations,en,1978-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Flying Accidents Compensation Regulations\n\...,
1,"CRC, c 100",,REGULATIONS-FED,1978,Ottawa International Airport Zoning Regulations,en,1978-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Ottawa International Airport Zoning Regulati...,
2,"CRC, c 101",,REGULATIONS-FED,1978,Penticton Airport Zoning Regulations,en,1978-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,"# Penticton Airport Zoning Regulations\n\nCRC,...",
3,"CRC, c 1013",,REGULATIONS-FED,1978,Canada Industrial Relations Remuneration Regul...,en,1978-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Canada Industrial Relations Remuneration Reg...,
4,"CRC, c 1015",,REGULATIONS-FED,1978,Fair Wages and Hours of Labour Regulations,en,1978-01-01,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Fair Wages and Hours of Labour Regulations\n...,
...,...,...,...,...,...,...,...,...,...,...,...
4664,SOR/99-53,,REGULATIONS-FED,1999,Competency of Operators of Pleasure Craft Regu...,en,1999-1-15,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Competency of Operators of Pleasure Craft Re...,
4665,SOR/99-7,,REGULATIONS-FED,1998,"Ozone-Depleting Substances Regulations, 1998",en,1998-12-16,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,"# Ozone-Depleting Substances Regulations, 1998...",
4666,SOR/99-86,,REGULATIONS-FED,1999,Proclamation Designating Certain Countries as ...,en,1999-2-10,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Proclamation Designating Certain Countries a...,
4667,SOR/99-93,,REGULATIONS-FED,1999,Tobacco (Access) Regulations,en,1999-2-11,https://github.com/justicecanada/laws-lois-xml...,2024-05-12,# Tobacco (Access) Regulations\n\nSOR/99-93\n\...,
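
Several `document_date` values in the output above are not zero-padded (for example `1999-1-15`), so they do not sort correctly as strings and are not strict ISO 8601 dates. The sketch below shows one way to normalize them, similar in spirit to the padding done by `fix_doc_date` in the French cell. `pad_date` is a hypothetical helper, not part of the original cell, and it assumes each date is either empty or has exactly three hyphen-separated parts.

```python
def pad_date(date_str):
    """Zero-pad a 'YYYY-M-D' string to 'YYYY-MM-DD'; pass empty strings through."""
    if not date_str:
        return date_str
    year, month, day = date_str.split("-")
    return f"{year}-{month.zfill(2)}-{day.zfill(2)}"

print(pad_date("1999-1-15"))   # 1999-01-15
print(pad_date("1998-12-16"))  # 1998-12-16
```

If adopted, it could be applied to the column after the DataFrame is built, for example with `df_regs_en['document_date'].apply(pad_date)`.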


In [35]:
# print unofficial_text for row 3093
print(df_regs_en['unofficial_text'][3093])


# British Columbia Vegetable Order

SOR/2020-259

British Columbia Vegetable Order

2020-12-4

Enabling authority: AGRICULTURAL PRODUCTS MARKETING ACT

Her Excellency the Governor General in Council, on the recommendation of the Minister of Agriculture and Agri-Food, pursuant to section 2 {a} of the Agricultural Products Marketing Act {b}, makes the annexed British Columbia Vegetable Order.  
a S.C. 1991, c. 34, s. 2 
b R.S., c. A-6 
1 The following definitions apply in this Order.  
*Act* means the Natural Products Marketing (BC) Act, RSBC 1996, c. 330. (Loi)   
*Commodity Board* means the British Columbia Vegetable Marketing Commission or its successor entity. (Office)   
*Supervisory Board* means the British Columbia Farm Industry Review Board or its successor entity. (Organisme de surveillance)   
*vegetable* means any vegetable produced in British Columbia and includes strawberries intended expressly for manufacturing purposes and potatoes. (légume)  
2 The Commodity Board and the
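
In the output above, footnote references appear inline as `{a}` and `{b}` (produced by the `FootnoteRef` branch of `extract_text`), while the footnote bodies follow as plain lines. As a starting point for tidying footnotes, the markers can be listed with a regular expression. This is only a sketch: `list_footnote_refs` is a hypothetical helper, and the pattern assumes markers are short tokens wrapped in braces with no spaces.

```python
import re

# Matches a brace-wrapped token with no spaces or nested braces, e.g. {a}
FOOTNOTE_REF = re.compile(r"\{([^{}\s]+)\}")

def list_footnote_refs(text):
    """Return the footnote markers found in `text`, in order, without duplicates."""
    seen = []
    for marker in FOOTNOTE_REF.findall(text):
        if marker not in seen:
            seen.append(marker)
    return seen

sample = ("pursuant to section 2 {a} of the Agricultural Products "
          "Marketing Act {b}, makes the annexed Order.")
print(list_footnote_refs(sample))  # ['a', 'b']
```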

In [36]:
# print unofficial_text where citation is SOR/2002-227
print(df_regs_en[df_regs_en['citation'] == 'SOR/2002-227']['unofficial_text'].values[0])

# Immigration and Refugee Protection Regulations

SOR/2002-227

Immigration and Refugee Protection Regulations

2002-6-11

Enabling authority: IMMIGRATION AND REFUGEE PROTECTION ACT FINANCIAL ADMINISTRATION ACT

Whereas, pursuant to subsection 5(2) of the Immigration and Refugee Protection Act {a}, the Minister of Citizenship and Immigration has caused a copy of the proposed Immigration and Refugee Protection Regulations to be laid before each House of Parliament, substantially in the form set out in the annexed Regulations;  Therefore, Her Excellency the Governor General in Council, on the recommendation of the Minister of Citizenship and Immigration and the Treasury Board, pursuant to subsection 5(1) of the Immigration and Refugee Protection Act {a} and paragraphs 19(1)(a) {b} and 19.1(a) {b} and subsection 20(2) of the Financial Administration Act, and, considering that it is in the public interest to do so, subsection 23(2.1) {c} of that Act, hereby makes the annexed Immigration an
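
The output above still contains runs of two spaces (for example after `annexed Regulations;`), because `fix_errors` runs per element and the pieces are then joined with extra spaces. If that matters downstream, a final pass over the assembled text could collapse them. The sketch below is one possible approach and not the notebook's method; `tidy_spaces` is a hypothetical helper that leaves newlines untouched so the markdown-style headings survive.

```python
import re

def tidy_spaces(text):
    """Collapse runs of spaces/tabs and drop trailing spaces, keeping newlines."""
    text = re.sub(r"[ \t]{2,}", " ", text)   # runs of spaces or tabs become one space
    text = re.sub(r"[ \t]+\n", "\n", text)   # remove spaces left before a line break
    return text

print(repr(tidy_spaces("annexed Regulations;  Therefore, Her Excellency  \n1 Text")))
```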

In [140]:
df_regs_en[df_regs_en['name'].str.contains('Immigration')]





Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
591,SI/2001-120,,REGULATIONS-FED,2001,Order Designating the Minister of Citizenship ...,en,2001-12-19,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Designating the Minister of Citizenshi...,
653,SI/2003-214,,REGULATIONS-FED,2003,Order Transferring from the Minister of Citize...,en,2003-12-31,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Transferring from the Minister of Citi...,
654,SI/2003-215,,REGULATIONS-FED,2003,Order Transferring Certain Portions from the D...,en,2003-12-31,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Transferring Certain Portions from the...,
689,SI/2004-135,,REGULATIONS-FED,2004,Order Transferring to the Department of Citize...,en,2004-10-20,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Transferring to the Department of Citi...,
690,SI/2004-136,,REGULATIONS-FED,2004,Order Transferring to the Canada Border Servic...,en,2004-10-20,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Transferring to the Canada Border Serv...,
718,SI/2005-120,,REGULATIONS-FED,2005,Order Setting Out the Respective Responsibilit...,en,2005-12-14,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Setting Out the Respective Responsibil...,
815,SI/2008-136,,REGULATIONS-FED,2008,Order Designating the Minister of Citizenship ...,en,2008-11-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Designating the Minister of Citizenshi...,
934,SI/2013-56,,REGULATIONS-FED,2013,Order Transferring the Control and Supervision...,en,2013-5-22,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Transferring the Control and Supervisi...,
988,SI/2015-52,,REGULATIONS-FED,2015,Ministerial Responsibilities Under the Immigra...,en,2015-7-1,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Ministerial Responsibilities Under the Immig...,
1058,SI/2018-104,,REGULATIONS-FED,2018,Order Designating the Department of Health and...,en,2018-12-12,https://github.com/justicecanada/laws-lois-xml...,2024-05-10,# Order Designating the Department of Health a...,


In [None]:
# Process French legislation

from lxml import etree as ET
import pandas as pd
import os
import time
import re  # used by extract_full_text below

def get_root_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        root = ET.parse(f).getroot()
    return root

def fix_errors(text):
    text = text.replace('(   ','(')
    text = text.replace(' )',')')
    text = text.replace(' .','.')
    text = text.replace(' ,',',')
    text = text.replace('   ',' ')
    text = text.replace('  ',' ')
    return text

def extract_text(elem):
    text_parts = []  
    if elem.text:
        if elem.tag == 'DefinedTermFr':  
            text_parts.append('\n'+'*'+elem.text.strip()+'*')    
        else:     
            text_parts.append(elem.text.strip())
    for child in elem: 
        child_text = extract_text(child)
        if child_text:
            text_parts.append(' ' + child_text + ' ')
    if elem.tail:
        tail_text = elem.tail.strip()
        if tail_text:
            text_parts.append(' ' + tail_text + ' ')
    
    return fix_errors(''.join(text_parts))

def extract_ordered_elements(root):
    ordered_tag = []
    ordered_text = []
    
    elements_to_extract = [
        'TitleText',
        'Label',
        'Text',
        'MarginalNote',
    ]

    special_labels = ['SECTION', 'PARTIE', 'ANNEXE']
    
    for elem in root.iter():
        if elem.tag in elements_to_extract:       
            full_text = extract_text(elem)
            if full_text is None:
                continue
            if elem.tag == 'MarginalNote':
                ordered_text.append('\n\n### ' + full_text + '\n')
            elif elem.tag == 'Label':
                if ordered_tag and ordered_tag[-1] == 'Label':
                    ordered_text.append(full_text)
                elif any(label in full_text for label in special_labels):
                    ordered_text.append('\n\n# '+full_text)
                else:
                    ordered_text.append('\n' + full_text)
            elif elem.tag == 'TitleText':
                ordered_text.append('\n\n## ' + full_text)
            elif elem.tag == 'Text':
                ordered_text.append(full_text)
                                    
            ordered_tag.append((elem.tag))

    ordered_text = ' '.join(ordered_text)

    breakpoints = ['# ANNEX',
                   '#  ANNEX',
                   '#  SCHEDULE',
                   '# SCHEDULE',
                   '## DISPOSITIONS CONNEXES']
                   
    for break_point in breakpoints:
        if break_point in ordered_text:
            ordered_text = ordered_text[:ordered_text.index(break_point)]
            break    
    return ordered_text

def extract_date_assented(root, add_text = True):
    assent_stage = root.find(".//Stages[@stage='assented-to']")
    if assent_stage is not None:
        year = assent_stage.find(".//YYYY").text
        month = assent_stage.find(".//MM").text
        day = assent_stage.find(".//DD").text
        if year == '1000':
            assented_to_date = ""
        else:
            if add_text:
                assented_to_date = f"Sanctionnée {year}-{month}-{day}\n"
            else:
                assented_to_date = f"{year}-{month}-{day}"
    else:
        assented_to_date = ""
    return assented_to_date

def extract_title(root):
    return root.find(".//ShortTitle").text
    
def extract_long_title(root):
    return root.find(".//LongTitle").text
    
def extract_citation(root, all_info = True):
    consolidated_number_element = root.find('.//ConsolidatedNumber')
    official_status = consolidated_number_element.get('official')
    if official_status == 'yes':
        year = '1985'
    try:
        year = root.find(".//AnnualStatuteId/YYYY").text
    except:
        if official_status == 'no':
            year = 'XXXX'

    chapter_number_element = root.find(".//AnnualStatuteId/AnnualStatuteNumber")
    if chapter_number_element is not None:
        chapter_number = ''.join(chapter_number_element.itertext())
    else:
        chapter_number = ''

    consolidated_number = root.find(".//ConsolidatedNumber").text
    if official_status == 'no':
        if all_info:
            if 'suppl' in consolidated_number or 'suppl' in chapter_number:
                citation = f"L.R.C. {year}, ch. {chapter_number} ({consolidated_number})"
            else:
                citation = f"L.C. {year}, ch. {chapter_number} ({consolidated_number})"             
        else:
            if 'suppl' in consolidated_number or 'suppl' in chapter_number:
                citation = f"L.R.C. {year}, ch. {chapter_number}"
            else:
                citation = f"L.C. {year}, ch. {chapter_number}"
    else:
        citation = f"L.R.C. {year}, ch. {consolidated_number}"
    # manual fixes
    if 'L.C. 1952, ch. 89' in citation:
        citation = 'L.R.C. 1952, ch. 89'
    if 'L.C. 1927, ch. 188' in citation:
        citation = 'L.R.C. 1927, ch. 188'
    citation = citation.replace('ch.','c.')
    year = int(year[:4])
    if year < 1985:
        citation = citation.replace('L.R.C.','S.R.C.')
        citation = citation.replace('L.C.','S.C.')
    return citation.strip()

def fix_doc_date(doc_date, citation):
    if doc_date != '':
        doc_date = doc_date.split('-')
        if len(doc_date[1]) == 1:
            doc_date[1] = '0'+doc_date[1]
        if len(doc_date[2]) == 1:
            doc_date[2] = '0'+doc_date[2]
        doc_date = '-'.join(doc_date)
        return doc_date           
    if 'SRC 1927' in citation or 'LRC 1927' in citation:
        return '1928-02-01'
    if 'SRC 1952' in citation or 'LRC 1952' in citation:
        return '1952-09-15'
    if 'SRC 1970' in citation or 'LRC 1970' in citation:
        return '1971-07-15'
    if 'LRC 1985' in citation or 'SRC 1985' in citation:
        if '1er suppl' in citation:
            return '1988-12-12'
        if '2e suppl' in citation:
            return '1988-12-12'
        if '3e suppl' in citation:
            return '1989-05-01'
        if '4e suppl' in citation:
            return '1989-11-01'
        if '5e suppl' in citation:
            return '1994-03-01'
        else:
            return '1988-12-12'
    return ''  

def extract_full_text(root):
    long_title = str(extract_long_title(root))
    assented_date = str(extract_date_assented(root))
    ordered_text = str(extract_ordered_elements(root))
    full_text = long_title + '\n\n' + assented_date + '\n' + ordered_text
    full_text = re.sub(r'^\s+$', '\n', full_text, flags=re.MULTILINE)
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    return full_text
    
# iterate through all files in /acts/ folder and extract text to df
files = os.listdir('acts/fr')
data = []
for file in files:
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    try:
        root = get_root_from_file('acts/fr/'+file)
        citation = extract_citation(root, all_info=False).replace('.','')
        document_date = extract_date_assented(root, add_text=False)
        if document_date == '':
            if citation[:3] == 'LC ':
                citation = citation.replace('LC ','LRC ')
            if citation[:3] == 'SC ':
                citation = citation.replace('SC ','SRC ')
            

        document_date = fix_doc_date(document_date, citation)
        title = extract_title(root)
        full_text = extract_full_text(root)
        unofficial_text = '# '+ title + '\n\n' + citation + '\n\n' + full_text
        citation2 = file[:-4] 
        dataset = "LEGISLATION-FED"
        year = document_date[:4]
        language = 'fr'
        source_url = 'https://github.com/justicecanada/laws-lois-xml/tree/main/fra/acts'
        scraped_timestamp = time.strftime('%Y-%m-%d')
        other = ''
        data.append([citation,
                     citation2, 
                     dataset, 
                     year, 
                     title, 
                     language,
                     document_date, 
                     source_url,
                     scraped_timestamp,
                     unofficial_text,
                     other])
    except Exception as e:
        print(f'Error in {file}')
        print(e)

df_acts_fr = pd.DataFrame(data, columns=['citation',
                                 'citation2',
                                 'dataset',
                                 'year',
                                 'name',
                                 'language',
                                 'document_date', 
                                 'source_url',
                                 'scraped_timestamp',
                                 'unofficial_text',
                                 'other'
                                 ])

# export to json
df_acts_fr.to_json('DATA/df_acts_fr.json', orient='records', lines=True)

df_acts_fr


In [None]:
# show rows where df.document_date is ''
df_acts_fr[df_acts_fr.document_date == '']

## VERIFICATION

In [None]:
# verification for English legislation
from bs4 import BeautifulSoup
import requests

verify_list = []

for letter in 'ABCDEFGHIJKLMNOPQRSTUVWY':  # no X or Z
    
    url = f'https://laws-lois.justice.gc.ca/eng/acts/{letter}.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    contentBlock = soup.find_all('div', class_='contentBlock')[0]
    li_tags = contentBlock.find_all('li')
    for li in li_tags:
        verify_dict = {}
        verify_dict['verify-title'] = li.find('a', class_='TocTitle').text.strip()
        verify_dict['verify-citation'] = li.find('span', class_='htmlLink').text.strip().replace('.','').replace('RSC,', 'RSC')
        verify_list.append(verify_dict)

print(len(verify_list))

# convert to df
verify_df = pd.DataFrame(verify_list)
verify_df



In [None]:
# list rows in df where df.citation is not in verify_df.citation
df_acts_en[~df_acts_en['citation'].isin(verify_df['verify-citation'])]

In [None]:
# list rows in df where verify_df.citation is not in df.citation
verify_df[~verify_df['verify-citation'].isin(df_acts_en['citation'])]
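
The two checks above amount to a two-way set difference between the citations in the dataset and those listed on the website. A minimal sketch of the same idea with plain Python sets follows. `compare_citations` is a hypothetical helper and the citations in the example are for illustration only.

```python
def compare_citations(dataset_citations, site_citations):
    """Return (only_in_dataset, only_on_site) as sorted lists."""
    dataset_set = set(dataset_citations)
    site_set = set(site_citations)
    return sorted(dataset_set - site_set), sorted(site_set - dataset_set)

# Illustrative citations only, not results from the real data
only_dataset, only_site = compare_citations(
    ["RSC 1985, c A-1", "SC 2001, c 27"],
    ["RSC 1985, c A-1", "SC 1999, c 33"],
)
print(only_dataset)  # ['SC 2001, c 27']
print(only_site)     # ['SC 1999, c 33']
```

Both lists being empty would indicate that the dataset and the website listing agree on citations.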

In [None]:
# verification for French legislation
from bs4 import BeautifulSoup
import requests

verify_list = []

for letter in 'ABCDEFGHIJLMNOPQRSTUVWYZ':  # no K or X
    
    url = f'https://laws-lois.justice.gc.ca/fra/lois/{letter}.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    contentBlock = soup.find_all('div', class_='contentBlock')[0]
    li_tags = contentBlock.find_all('li')
    for li in li_tags:
        verify_dict = {}
        verify_dict['verify-title'] = li.find('a', class_='TocTitle').text.strip()
        verify_dict['verify-citation'] = li.find('span', class_='htmlLink').text.strip().replace('.','').replace(')', '').replace('(', '').replace('ch', 'c')
        verify_dict['verify-citation'] =verify_dict['verify-citation'].replace('1er suppl', '(1er suppl)').replace('2e suppl', '(2e suppl)').replace('3e suppl', '(3e suppl)').replace('4e suppl', '(4e suppl)').replace('5e suppl', '(5e suppl)')
        
        verify_list.append(verify_dict)

print(len(verify_list))

# convert to df
verify_df = pd.DataFrame(verify_list)
verify_df
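
The chain of `replace` calls in the cell above can be easier to test when pulled into a small function. The sketch below mirrors that chain on a single string. `normalize_fr_citation` is a hypothetical name, and the input strings are assumed examples of how a citation might appear on the website rather than values taken from it.

```python
SUPPLEMENTS = ["1er suppl", "2e suppl", "3e suppl", "4e suppl", "5e suppl"]

def normalize_fr_citation(raw):
    """Apply the same replacements as the verification cell to one citation string."""
    # Drop periods and parentheses, then shorten 'ch' to 'c'
    text = raw.strip().replace(".", "").replace(")", "").replace("(", "").replace("ch", "c")
    # Re-wrap supplement markers in parentheses to match the dataset's citations
    for suppl in SUPPLEMENTS:
        text = text.replace(suppl, f"({suppl})")
    return text

print(normalize_fr_citation("L.R.C. (1985), ch. A-1"))            # LRC 1985, c A-1
print(normalize_fr_citation("L.R.C. (1985), ch. 1 (2e suppl.)"))  # LRC 1985, c 1 (2e suppl)
```

Note that replacing every `ch` with `c` would also alter any other word containing `ch`, so this only suits short citation strings.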



In [None]:
# list rows in df where df.citation is not in verify_df.citation
df_acts_fr[~df_acts_fr['citation'].isin(verify_df['verify-citation'])]

# same as above but print unique values of citation
#df_acts_fr[~df_acts_fr['citation'].isin(verify_df['verify-citation'])]['citation'].unique()

In [None]:
# list rows in df where verify_df.citation is not in df.citation
verify_df[~verify_df['verify-citation'].isin(df_acts_fr['citation'])]

In [None]:
# Verify assented to dates in en
from lxml import etree as ET
from markdownify import markdownify as md
import os

# get list of files from acts/en

files = os.listdir('acts/en')
xslt = ET.parse('LIMS2HTML.xsl')
transform = ET.XSLT(xslt)

verify_assented = []

for file in files:
    verify_assented_dict = {}
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    dom = ET.parse('acts/en/'+file)
    # reuse the XSLT transform compiled once before the loop
    newdom = transform(dom)
    markdn = md(ET.tostring(newdom, pretty_print=True).decode('utf-8'))
    try:
        assented_to = markdn.split('Assented to ')[1].split('\n')[0]
    except IndexError:  # marker not present in the rendered text
        assented_to = ''
    verify_assented_dict['file'] = file
    verify_assented_dict['assented_to'] = assented_to
    verify_assented.append(verify_assented_dict)

# convert to df
verify_assented_df = pd.DataFrame(verify_assented)

# verify_assented_df.citation = file without .xml
verify_assented_df['citation2'] = verify_assented_df['file'].str[:-4]

verify_assented_df
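
The date extraction in the cell above relies on splitting the rendered markdown at the `Assented to ` marker and keeping the rest of that line. The sketch below shows the same idea with `str.partition`, which avoids the exception when the marker is missing. `extract_after_marker` is a hypothetical helper, the sample text is made up, and unlike the cell above it also strips surrounding whitespace.

```python
def extract_after_marker(text, marker="Assented to "):
    """Return the rest of the first line following `marker`, or '' if absent."""
    _, found, rest = text.partition(marker)
    if not found:
        return ""
    return rest.split("\n", 1)[0].strip()

# Made-up sample text, for illustration only
sample = "An Act respecting X\n\nAssented to 2001-11-01\n\n1 Short title"
print(extract_after_marker(sample))                # 2001-11-01
print(repr(extract_after_marker("no date here")))  # ''
```

For the French files, the same function could be called with `marker="Sanctionnée "`.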


In [None]:
# merge df_acts_en with verify_assented_df combining citation 
df_acts_en = df_acts_en.merge(verify_assented_df, on='citation2', how='left')
df_acts_en

In [None]:
# revise df_acts_en.assented_to to add "Assented to " to the beginning of the date if it is not ''
df_acts_en['assented_to'] = df_acts_en['assented_to'].apply(lambda x: 'Assented to '+x if x != '' else x)
df_acts_en

In [None]:
# check to see if the string df_acts_en.assented_to is in df_acts_en.unofficial_text
df_acts_en['assented_to_in_text'] = df_acts_en.apply(lambda x: x['assented_to'] in x['unofficial_text'], axis=1)

# list rows where df_acts_en.assented_to_in_text is False
df_acts_en[~df_acts_en['assented_to_in_text']]

In [None]:
#show unique values of year where assented_to is ''
df_acts_en[df_acts_en.assented_to == '']['year'].unique()

In [None]:
# Verify assented to dates in fr
from lxml import etree as ET
from markdownify import markdownify as md
import os

# get list of files from acts/fr

files = os.listdir('acts/fr')
xslt = ET.parse('LIMS2HTML.xsl')
transform = ET.XSLT(xslt)

verify_assented = []

for file in files:
    verify_assented_dict = {}
    # skip Appropriations Acts (Z-01.xml) and Agreements and Conventions (Z-02.xml)
    if file == 'Z-01.xml' or file == 'Z-02.xml':
        continue
    dom = ET.parse('acts/fr/'+file)
    # reuse the XSLT transform compiled once before the loop
    newdom = transform(dom)
    markdn = md(ET.tostring(newdom, pretty_print=True).decode('utf-8'))
    try:
        assented_to = markdn.split('Sanctionnée ')[1].split('\n')[0]
    except IndexError:  # marker not present in the rendered text
        assented_to = ''
    verify_assented_dict['file'] = file
    verify_assented_dict['assented_to'] = assented_to
    verify_assented.append(verify_assented_dict)

# convert to df
verify_assented_df = pd.DataFrame(verify_assented)

# verify_assented_df.citation = file without .xml
verify_assented_df['citation2'] = verify_assented_df['file'].str[:-4]

verify_assented_df


In [None]:
# merge df_acts_fr with verify_assented_df on citation2, then prefix non-empty dates with "Sanctionnée "
df_acts_fr = df_acts_fr.merge(verify_assented_df, on='citation2', how='left')
df_acts_fr['assented_to'] = df_acts_fr['assented_to'].apply(lambda x: 'Sanctionnée '+x if x != '' else x)
df_acts_fr

In [None]:
# check to see if the string df_acts_fr.assented_to is in df_acts_fr.unofficial_text
df_acts_fr['assented_to_in_text'] = df_acts_fr.apply(lambda x: x['assented_to'] in x['unofficial_text'], axis=1)

# list rows where df_acts_fr.assented_to_in_text is False
df_acts_fr[~df_acts_fr['assented_to_in_text']]

In [None]:
# show unique values of year where assented_to is ''
df_acts_fr[df_acts_fr.assented_to == '']['year'].unique()

In [61]:
# manually view unofficial_text in en
from IPython.display import clear_output
while True:
    print('Enter row number to print unofficial_text (or exit to exit):')
    clear_output(wait=True)
    os.system('cls')
    row_sought = input()
    if row_sought == 'exit':
        break
    print(df_regs_en.iloc[int(row_sought)].unofficial_text)


# Military Rules of Evidence

CRC, c 1049

Regulations Respecting the Rules of Evidence at Trial by Court Martial

## Short Title 
1 These Rules may be cited as the Military Rules of Evidence.  

## Interpretation 
2 (1) In these Rules, unless the context otherwise requires,  
*accused* means the accused personally or counsel or a defending officer acting on behalf of the accused, but does not include an adviser acting on behalf of the accused; (accusé ou prévenu)   
*admissible* means admissible in evidence; (admissible)   
*burden of persuasion* means the burden of convincing the court of the existence or non-existence, or probable existence or non-existence, of any fact; (fardeau de la persuasion)   
*business* means every kind of business, occupation or calling, and includes the practice of a profession, and the operation of an institute and every kind of institution, whether carried on for profit or not; (entreprise)   
*circumstantial evidence* means evidence tending to establish

In [None]:
# manually view unofficial_text in fr
from IPython.display import clear_output
while True:
    print('Enter row number to print unofficial_text (or exit to exit):')
    clear_output(wait=True)
    os.system('cls')
    row_sought = input()
    if row_sought == 'exit':
        break
    print(df_acts_fr.iloc[int(row_sought)].unofficial_text)