<a href="https://colab.research.google.com/github/646e62/citation-generator/blob/main/canlii_skca_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SKCA extractor

## HTML to Markdown

Although there may be some clues to be found in the way the original CanLII document is structured, it contains an overwhelming amount of data. A lot of that data is unhelpful noise.

As it turned out, the HTML to Markdown converter ``html2text`` was able to ignore that noise and pick out multiple key features (document structure, where paragraphs actually start and stop, etc). Should the need for fine tuning arise, there may need to be some HTML pre-processing. But for the time being, the Markdown document is more than sufficient for the program's purpose.

## Metadata

After creating the Markdown document and splitting it into its natural paragraphs, it became apparent that paragraph[0] contained a fair bit of metadata that could be extracted with some text processing.

Although SKCA metadata isn't as extensive or specific as some other jurisdictions, it is informative. Specifically, the fact that SKCA outlines case features like "disposition" in easily-located plain text make these documents easy to work with.

## Document structure

Another thing that became apparent after creating the Markdown document was the fact that ``html2text`` did a good job of capturing document structure.

## Text analysis

## TODO

* Appellate
    * Good metadata
        * BCCA
        * SKCA
        * QCCA
        * NBCA
        * NSCA
        * PECA
        * YKCA
    * Viable metadata
        * NLCA
    * Inadequate metadata
        * ABCA
        * MBCA
        * ONCA
        * NWTCA
        * NUCA


## Imports and installations

### html2text

The extractor uses html2text-2020.1.16 to generate Markdown files.

In [39]:
!pip install html2text



In [40]:
import html2text
import re

from typing import Dict, List, Union
from datetime import datetime


## Patterns file

This program is designed to work with decisions published on CanLII. Every such decision contains consistent metacontent that the UNWANTED_PATTERNS constant captures via regex.

Although I print the Markdown file throughout this working document for troubleshooting purposes, the final version of the program will not print or return the Markdown file to the user.

After dealing with the universal metacontent, the program will need to account for the metadata and metadata structures added by the SKCA. Unlike commercial products like WL and QL, CanLII doesn't use a universal format for its decisions. Rather, it appears to allow individual courts to structure their documents as they see fit. Although commendable from a "marketplace of ideas" perspective, it makes it more difficult to access metadata and other such content in a consistent way. Rather than developing

Data from the SKCA is both immediately important to me and

Because SKCA structures its decision metadata in a fairly predictable and regimented way, the same regex-based approach can be used to separate the data from

In [41]:


UNWANTED_PATTERNS = [
    r'\[ !\[CanLII Logo\]\(.+?\) \]\(.+?\)',
    r'\[Home\]\(.+?\) › .+?CanLII\)',
    r'Loading paragraph markers __',
    r'\* Document',
    r'\* History  __',
    r'\* Cited documents  __',
    r'\* Treatment  __',
    r'\* CanLII Connects  __',
    r'Citations  __',
    r'Discussions  __',
    r'Unfavourable mentions  __',
    r'\nExpanded Collapsed\n'
]

# Metadata patterns for SKCA decisions on CanLII after 2015

SKCA_2015_METADATA_PATTERNS = {
    'between': r'Between:(.*?)(?=Before:)',
    'disposition': r'Disposition:\s*(.+?)\s{2,}',
    'on_appeal_from': r'On appeal from:\s*(.+?)\s{2,}',
    'appeal_heard': r'Appeal heard:\s*(.+?)\s{2,}',
    'application_heard': r'Application heard:\s*([\w\s,]+)'
}

BASE_ROLES = [
    'plaintiff', 'plaintiffs',
    'respondent', 'respondents',
    'appellant', 'appellants',
    'applicant', 'applicants'
]

# Generate combinations with brackets and slashes
ROLES = BASE_ROLES + [
    f"({role})" for role in BASE_ROLES
] + [
    f"{role1}/{role2}" for role1 in BASE_ROLES for role2 in BASE_ROLES
]

# HTML to MD functions



In [42]:
def html_to_markdown(html_content: str) -> str:
    """
    Converts HTML content to its markdown representation.

    Args:
        html_content (str): The input HTML content.

    Returns:
        str: The converted markdown content.
    """

    h = html2text.HTML2Text()
    h.ignore_links = False
    return h.handle(html_content)

def convert_file(html_filepath: str, markdown_filepath: str) -> None:
    """
    Converts an HTML file to a markdown file.

    Args:
        html_filepath (str): Path to the input HTML file.
        markdown_filepath (str): Path to the output markdown file.
    """

    with open(html_filepath, 'r', encoding='utf-8') as file:
        html_content = file.read()

    markdown_content = html_to_markdown(html_content)
    refined_markdown_content = refine_markdown(markdown_content)

    with open(markdown_filepath, 'w', encoding='utf-8') as file:
        file.write(refined_markdown_content)

def refine_markdown(md_content: str, unwanted_patterns: List[str] = UNWANTED_PATTERNS) -> str:
    """
    Refines the markdown content by making specific replacements and deletions.

    Args:
        md_content (str): The input markdown content.
        unwanted_patterns (List[str], optional): List of regex patterns to be removed from the content. Defaults to UNWANTED_PATTERNS.

    Returns:
        str: The refined markdown content.
    """

    # Replace only the first occurrence of '## ' with '# '
    md_content = md_content.replace('## ', '# ', 1)

    # Remove hard line breaks after labels
    md_content = re.sub(r'Date:\s*\n', 'Date: ', md_content)
    md_content = re.sub(r'File number:\s*\n', 'File number: ', md_content)
    md_content = re.sub(r'Citation:\s*\n', 'Citation: ', md_content)

    # Remove unwanted sections from the captured content
    for pattern in unwanted_patterns:
        md_content = re.sub(pattern, '', md_content)

    # Define the pattern to remove the footer and everything that follows
    footer_start = "Back to top"
    if footer_start in md_content:
        md_content = md_content.split(footer_start)[0]

    # Remove patterns like "---|---" and "---|---|---|---"
    md_content = re.sub(r'---(\|---)+', '\n', md_content)

    # Replace the character `|` with an empty newline
    md_content = md_content.replace('|', '\n')

    # Replace "[__PDF]" with "[PDF]"
    md_content = md_content.replace('[__  PDF]', '[PDF]')

    return md_content.strip()


# Text refining functions

In [43]:
def convert_to_standard_date(date_string: str) -> str:
    """
    Converts a date string to a standard format YYYY-MM-DD.

    Args:
        date_string (str): Input date string.

    Returns:
        str: Date in standard format.
    """
    # Define the expected date formats for the input strings
    date_formats = ["%Y-%m-%d", "%B %d, %Y"]

    for date_format in date_formats:
        try:
            # Parse the string into a datetime object
            date_obj = datetime.strptime(date_string, date_format)
            # Return the formatted string
            return date_obj.strftime("%Y-%m-%d")
        except ValueError:
            pass

    # If all parsing attempts fail, return the original string
    return date_string


def clean_markdown(filename):
    """Cleans markdown in the given file.

    Args:
        filename: The name of the file to clean.

    Returns:
        A markdown file _sans_ extraneous whitespace
    """

    with open(filename, 'r') as file:
        lines = file.readlines()

    cleaned_lines = []
    for line in lines:
        line = line.strip()
        if not line:
            continue

        line = line.replace("_", "")
        cleaned_lines.append(line)

    return '\n'.join(cleaned_lines)


# CanLII metadata extractor

Should be the same across all CanLII cases. Doesn't contain much information.

In [44]:
def capture_canlii_metadata(md_content: str) -> Dict[str, str]:
    """
    Captures specific metadata (date, file number, citation) from markdown content.

    Args:
        md_content (str): The input markdown content.

    Returns:
        Dict[str, str]: A dictionary containing captured metadata.
    """

    metadata = {}

    # Capture date
    date_match = re.search(r'Date: (\d{4}-\d{2}-\d{2})', md_content)
    if date_match:
        metadata['date'] = date_match.group(1)

    # Capture file number
    file_match = re.search(r'File number: ([A-Z0-9]+)', md_content)
    if file_match:
        metadata['file_number'] = file_match.group(1)

    # Capture citation
    citation_pattern = r'Citation: (.+?)<<(.+?)>>, retrieved\s+on\s+(\d{4}-\d{2}-\d{2})'
    citation_match = re.search(citation_pattern, md_content, re.DOTALL)
    if citation_match:
        metadata['citation'] = f"{citation_match.group(1).strip()} {citation_match.group(2)} retrieved on {citation_match.group(3)}"

    return metadata


# SKCA metadata extractor



In [45]:
def extract_metadata(pattern: str, md_content: str, known_patterns: list) -> list:
    """
    Extract metadata using the provided pattern from md_content.

    Args:
    - pattern: The regex pattern to search for.
    - md_content: The content to search within.
    - known_patterns: Known metadata patterns to consider in lookahead.

    Returns:
    - A list of names found based on the pattern.
    """
    match = re.search(pattern, md_content, re.DOTALL | re.IGNORECASE)
    if match:
        cleaned_string = re.sub(r'\n+', ' ', match.group(1)).strip()
        names_list = [name for name in cleaned_string.split('  ') if name]
        return names_list
    return []

def extract_parties(md_content: str) -> list:
    """
    Extract parties from md_content.

    Args:

    """

    role_pattern = re.compile('|'.join(map(re.escape, ROLES)), re.IGNORECASE)

    match = re.search(SKCA_2015_METADATA_PATTERNS['between'], md_content, re.DOTALL)

    if match:
        between_values = [line.strip() for line in match.group(1).strip().split('\n')
                        if line.strip() and line.strip().lower() != 'and']

        parties_list = []
        prev_party = None
        current_role = ""
        for line in between_values:
            if role_pattern.search(line):  # Using regex search here
                if current_role:
                    current_role += " " + line
                else:
                    current_role = line
            else:
                if prev_party and current_role:
                    parties_list.append({'party': prev_party, 'role': current_role})
                    current_role = ""
                prev_party = line

        # Handle the last party-role pair
        if prev_party and current_role:
            parties_list.append({'party': prev_party, 'role': current_role})

        return parties_list

def extract_counsel(md_content: str) -> list:
    """
    Extract counsel from md_content.

    Args:

    """

   # Split the string at "Counsel:"
    parts = s.split('Counsel:', 1)

    # If there's no "Counsel:" or nothing after it, return an empty list
    if len(parts) < 2:
        return []

    # Split the remaining string by lines and strip whitespace from each line
    lines = [line.strip() for line in parts[1].split('\n')]

    # Filter out empty lines
    lines = [line for line in lines if line]

    # Process each line to split on "and" if necessary and keep lines with "for"
    final_lines = []
    for line in lines:
        if " for " in line:
            names = line.split(" and ")
            if len(names) > 1:
                for_clause = names[-1].split(" for ", 1)[-1]  # e.g., "the Appellant"
                for name in names:
                    if " for " not in name:
                        final_lines.append(name.strip() + " for " + for_clause)
                    else:
                        final_lines.append(name.strip())
            else:
                final_lines.append(line)

    return final_lines

def extract_counsel_legacy(md_content: str) -> list:
    """
    Extract counsel from md_content.

    Args:

    """

    # Extracting counsel as a special case:
    counsel_pattern = (
        r'(?:'                   # Non-capturing group start
        r'[\w\s-]+'              # Matches word characters, spaces, and hyphens
        r'(?:, [KQ].C.)?'        # Optionally matches ', K.C.' or ', Q.C.'
        r' and '                 # Matches ' and '
        r')?'                    # End of non-capturing group, which is optional
        r'[\w\s-]+'              # Matches word characters, spaces, and hyphens again
        r'(?:, [KQ].C.)?'        # Optionally matches ', K.C.' or ', Q.C.' again
        r',?'                    # Optional comma
        r' for the '             # Matches ' for the '
        r'[\w\s\-A-Za-z]+'       # Matches word characters, spaces, hyphens, and English letters
    )

    counsel_string = re.findall(counsel_pattern, md_content)

    # Remove whitespace
    cleaned_counsel = re.sub(r'\n', '', counsel_string[0])
    cleaned_counsel = re.sub(r' +', ' ', cleaned_counsel)
    cleaned_counsel = cleaned_counsel.strip()

    # Placeholder for our final, cleaned-up information
    good_information = ""

    while True:
        match = re.search(r'(.*?)(for the \w+)', cleaned_counsel)
        if match:
            # Add the matched segment to our good information
            good_information += match.group(1) + match.group(2) + " "
            # Remove the processed segment from cleaned_counsel
            cleaned_counsel = cleaned_counsel.replace(match.group(0), '', 1)
        else:
            break

    # Optionally, trim any trailing spaces
    good_information = good_information.strip()

    # Find all occurrences of 'for the [word]' and split the string based on them
    roles = re.findall(r'for the \w+', good_information)
    segments = re.split(r'for the \w+', good_information)

    final_list = []
    for role, segment in zip(roles, segments):
        # Updated regex split logic
        names = [name.strip() for name in re.split(r',? and |,(?![\s]?[KQ]\.C\.)', segment)]
        for name in names:
            if name:  # Ensure we don't append empty names
                final_list.append(name + ' ' + role)

    return final_list


def extract_judges(md_content: str) -> list:
    """
    Extract judges from md_content.

    Args:

    """



def capture_skca_metadata(md_content: str) -> Dict[str, Union[str, List[str]]]:
    """
    Captures specific secondary metadata from markdown content.

    Args:
        md_content (str): The input markdown content.

    Returns:
        Dict[str, Union[str, List[str]]]: A dictionary containing captured metadata.
    """
    metadata = {}
    metadata_patterns = SKCA_2015_METADATA_PATTERNS

    # Extracting special case lists
    for key, pattern in metadata_patterns.items():
        match = re.search(
            pattern,
            md_content,
            re.DOTALL | re.IGNORECASE
        )
        try:
            if match:
                metadata[key] = ' '.join(match.group(1).split()).strip()
                print(metadata[key] + "\n")
        except IndexError:
            pass

    metadata['between'] = extract_parties(md_content)
    metadata['counsel'] = extract_counsel(md_content)

    # Extract the date components and convert to standard format
    if 'application_heard' in metadata:
        metadata['application_heard'] = ' '.join(metadata['application_heard'].split()[:3])
        metadata['application_heard'] = convert_to_standard_date(metadata['application_heard'])

    if 'appeal_heard' in metadata:
        metadata['appeal_heard'] = convert_to_standard_date(metadata['appeal_heard'])

    # Testing for multiple judge detection

    known_patterns = [
        r'Docket:',
        r'Citation:',
        r'Date:',
        r'Between:',
        r'Before:',
        r'Disposition:',
        r'Written reasons by:',
        r'In concurrence:',
        r'In dissent:',
        r'On Application From:',
        r'Application Heard:'
        r'On appeal from:',
        r'Appeal heard:',
        r'Counsel:',
    ]

    # Patterns to check
    patterns_to_check = {
        'written_reasons_by': r'Written reasons by:\s*(.+?)(?=\s{2,}(?:' + '|'.join(known_patterns) + r')|$)',
        'in_concurrence': r'In concurrence:\s*(.+?)(?=\s{2,}(?:' + '|'.join(known_patterns) + r')|$)',
        'in_dissent': r'In dissent:\s*(.+?)(?=\s{2,}(?:' + '|'.join(known_patterns) + r')|$)'
    }

    # Iterate over each pattern, extracting the metadata and storing in the 'metadata' dictionary
    for key, pattern in patterns_to_check.items():
        metadata[key] = extract_metadata(pattern, md_content, known_patterns)

    return metadata


In [46]:
def extract_paragraphs(md_content):
    # Split the content using the '--' pattern
    paragraphs = md_content.split('\n__\n')

    # Strip leading and trailing whitespace from each paragraph
    paragraphs = [p.strip() for p in paragraphs if p.strip()]

    return paragraphs


In [57]:
neutral_citation = "2023skca101" # @param {type:"string"}

In [58]:
if __name__ == "__main__":
    html_filepath = f"{neutral_citation}.html"
    markdown_filepath = f"{neutral_citation}.md"

    with open(html_filepath, 'r', encoding='utf-8') as file:
        html_content = file.read()

    markdown_content = html_to_markdown(html_content)
    refined_markdown_content = refine_markdown(markdown_content)
    paragraphs = extract_paragraphs(refined_markdown_content)
    canlii_metadata = capture_canlii_metadata(paragraphs[0])
    skca_metadata = capture_skca_metadata(paragraphs[0])

    with open(markdown_filepath, 'w', encoding='utf-8') as file:
        file.write(refined_markdown_content)

Mitchell Owston Appellant And His Majesty the King Respondent

Appeal allowed; acquittal entered

CRM 304 of 2019 (Sask QB), Regina

September 9, 2022



In [59]:
for key, value in canlii_metadata.items():
    print(key + ": ", value)

for key, value in skca_metadata.items():
    print(key + ": ", value)

date:  2023-08-31
file_number:  CACR3511
citation:  R v Owston, 2023 SKCA 101 (CanLII), https://canlii.ca/t/k005d retrieved on 2023-10-04
between:  [{'party': 'Mitchell Owston', 'role': 'Appellant'}, {'party': 'His Majesty the King', 'role': 'Respondent'}]
disposition:  Appeal allowed; acquittal entered
on_appeal_from:  CRM 304 of 2019 (Sask QB), Regina
appeal_heard:  2022-09-09
counsel:  ['\n\n\n\n\nBrian Pfefferle, K.C.', ' and Thomas Hynes for the Appellant  \n  \n\n\n\n\n\nGrace Hession David for the Respondent  \n  \n\n  \n  \n  \n  \n  \n  \n  \n\n\n  \n\nThe Court\n\n']
written_reasons_by:  ['The Court', 'On appeal from: CRM 304 of 2019 (Sask QB), Regina']
in_concurrence:  []
in_dissent:  []


In [50]:
print(paragraphs[0])

[Home](/en/) › [Saskatchewan](/en/sk/) › [Court of Appeal for
Saskatchewan](/en/sk/skca/) › 2023 SKCA 101 (CanLII)



# R v Owston, 2023 SKCA 101 (CanLII)

  
  
  
   
  

[PDF](/en/sk/skca/doc/2023/2023skca101/2023skca101.pdf)

Date: 2023-08-31

File number: CACR3511

Citation: R v Owston, 2023 SKCA 101 (CanLII), <<https://canlii.ca/t/k005d>>, retrieved
on 2023-10-04

    























  
  

  
  










Restriction on Publication




An order has been made in accordance with [s.
486.4(1)](/en/ca/laws/stat/rsc-1985-c-c-46/latest/rsc-1985-c-c-46.html#sec486.4subsec1_smooth)
of the _[Criminal
Code](/en/ca/laws/stat/rsc-1985-c-c-46/latest/rsc-1985-c-c-46.html)_ directing
that any information identifying the complainant shall not be published.




  
  




















  
  




















  
  
Court of Appeal for Saskatchewan

Docket: CACR3511




Citation: _R v Owston_ , 2023 SKCA 101  
  
Date: 2023-08-31  
  
Between:  
  
Mitchell Owston  
  
Appellant  
  
A

In [51]:
import re

def extract_counsel(counsel_string: str) -> list:
    counsel_list = []

    # 1. Self-representation: Using non-capturing group
    self_rep_pattern = r'[\w\s-]+ on (?:his|her) own behalf'
    self_rep_matches = re.findall(self_rep_pattern, counsel_string)
    counsel_list.extend(self_rep_matches)  # directly append matched strings

    # 2. Names followed by ", K.C." or ", Q.C."
    kc_qc_pattern = r'[\w\s-]+, [KQ]\.C\.'
    kc_qc_matches = re.findall(kc_qc_pattern, counsel_string)
    counsel_list.extend(kc_qc_matches)

    # 3. Names with "for the [role]"
    role_pattern = r'[\w\s-]+(?= for the [\w\s\-\'A-Za-z]+) for the [\w\s\-\'A-Za-z]+'
    role_matches = re.findall(role_pattern, counsel_string)
    counsel_list.extend(role_matches)

    return counsel_list

# Test
test_string = """
Counsel:

Michael Taylor on his own behalf
Wade McBride, K.C., and Carlene Ready for the Respondent
Lisa Watson for Saskatchewan Lawyers' Insurance Association Inc.
"""

print(extract_counsel(test_string))


['\n\nMichael Taylor on his own behalf', '\n\nMichael Taylor on his own behalf\nWade McBride, K.C.', " and Carlene Ready for the Respondent\nLisa Watson for Saskatchewan Lawyers' Insurance Association Inc"]


In [60]:
def extract_lines_after_counsel(s):
    # Split the string at "Counsel:"
    parts = s.split('Counsel:', 1)

    # If there's no "Counsel:" or nothing after it, return an empty list
    if len(parts) < 2:
        return []

    # Split the remaining string by lines and strip whitespace from each line
    lines = [line.strip() for line in parts[1].split('\n')]

    # Filter out empty lines
    lines = [line for line in lines if line]

    # Process each line to split on "and" if necessary and keep lines with "for"
    final_lines = []
    for line in lines:
        if " for " in line:
            names = line.split(" and ")
            if len(names) > 1:
                for_clause = names[-1].split(" for ", 1)[-1]  # e.g., "the Appellant"
                for name in names:
                    if " for " not in name:
                        final_lines.append(name.strip() + " for " + for_clause)
                    else:
                        final_lines.append(name.strip())
            else:
                final_lines.append(line)

    return final_lines

# Test
s = """Counsel:




Brian Pfefferle, K.C. and Thomas Hynes for the Appellant





Michael Nolin for the Respondent












Jackson J.A.

# I.                  Introduction"""

print(extract_lines_after_counsel(s))


['Brian Pfefferle, K.C. for the Appellant', 'Thomas Hynes for the Appellant', 'Michael Nolin for the Respondent']
