# NLP workflow (from Natural Language Processing Fundamentals)
* Data collection
* Data preprocessing
* Feature extraction
* Model development
* Model assessment
* Model deployment

## Data collection
Because CanLII blocks web scraping with CAPTCHAs and high-volume scraping violates CanLII's terms of service, this program relies on manually downloaded HTML pages.

The following code snippet lists the HTML files that will be used to build the first test mini-corpus. The files are copies of all reported decisions on CanLII from 2023 as of 2023-01-31.

In [10]:
import os

def list_directory_tree(directory):
    print(directory)
    for path, dirs, files in os.walk(directory):
        level = path.replace(directory, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(path)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

list_directory_tree("./canlii_crim_corpus/html/2023/")

./canlii_crim_corpus/html/2023/
/
qc/
    ca/
        2023qcca34.html
        2023qcca13.html
        2023qcca89.html
        2023qcca57.html
    cq/
        2023qccq86.html
        2023qccq15.html
    cs/
yk/
    sc/
    ca/
    tc/
        2023yktc1.html
bc/
    pc/
        2023bcpc4.html
        2023bcpc13.html
        2023bcpc12.html
        2023bcpc11.html
        2023bcpc3.html
        2023bcpc7.html
        2023bcpc6.html
        2023bcpc5.html
    sc/
        2023bcsc134.html
        2023bcsc96.html
        2023bcsc85.html
        2023bcsc106.html
        2023bcsc92.html
        2023bcsc50.html
        2023bcsc141.html
        2023bcsc72.html
    ca/
        2023bcca8.html
        2023bcca16.html
        2023bcca29.html
        2023bcca6.html
        2023bcca19.html
        2023bcca2.html
        2023bcca33.html
        2023bcca38.html
        2023bcca13.html
        2023bcca4.html
        2023bcca50.html
        2023bcca37.html
        2023bcca3.html
ab/
    pc/
        2023ab

## Data preprocessing
These functions remove extraneous HTML and save the clean text to file. Where a decision has numbered paragraphs, the preprocessing functions split it along those paragraphs. Where the decision doesn't come with pre-formatted paragraph numbers, the functions should infer them from the document's structure. For some older decisions, it may be possible to infer pagination, though this functionality may not be necessary or useful.
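
As a minimal sketch of the paragraph-splitting idea, the snippet below splits already-extracted plain text on its paragraph numbers. It assumes each number appears as `[1]`, `[2]`, and so on at the start of a line; the helper name `split_numbered_paragraphs` and the sample text are hypothetical and not part of the notebook's functions.

```python
import re

# Assumption: paragraph numbers look like "[1]", "[2]", ... at the start of a line.
PARA_NUM = re.compile(r"^\[(\d+)\]\s*", flags=re.MULTILINE)


def split_numbered_paragraphs(text: str) -> list:
    """Split text into (paragraph number, paragraph body) pairs."""
    matches = list(PARA_NUM.finditer(text))
    paragraphs = []
    for i, match in enumerate(matches):
        # A paragraph runs from the end of its marker to the start of the next marker
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        paragraphs.append((int(match.group(1)), body))
    return paragraphs


# Hypothetical sample text, for illustration only
sample = (
    "[1] First paragraph.\n"
    "[2] Second paragraph,\ncontinued on a new line.\n"
    "[3] Third paragraph."
)

for number, body in split_numbered_paragraphs(sample):
    print(number, body)
```

Keeping the number alongside the body means a later step can still cite a paragraph by number even if the number is stripped from the tokenized text.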

### HTML to TXT
The HTML to TXT functions take a raw HTML file and convert it into an NLTK Text object.

### read_html_file
Reads an HTML file and returns it as a BeautifulSoup object. Doing so makes the file much easier to work with in subsequent functions.

### create_title
Creates a title for each decision. Where available, the function uses the neutral citation as the title. Where the neutral citation isn't available, the function uses the CanLII citation as the title.
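
To make the two cases concrete, here is a hedged sketch that parses a file name in one pass. It assumes file names follow the pattern seen in the corpus listing (a four-digit year, a group of lowercase letters, then a number); `title_from_filename` and `FILENAME_PATTERN` are hypothetical names used only for this illustration.

```python
import re
from pathlib import Path

# Assumption: file names look like <4-digit year><lowercase letters><number>.html
FILENAME_PATTERN = re.compile(r"(\d{4})([a-z]+)(\d+)")


def title_from_filename(filepath: str) -> str:
    """Build a citation-style title from a decision's file name."""
    stem = Path(filepath).stem
    match = FILENAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {stem!r}")
    year, court, number = match.groups()
    # "canlii" marks a CanLII citation; anything else is treated as a court code
    court = "CanLII" if court == "canlii" else court.upper()
    return f"{year} {court} {number}"


print(title_from_filename("./canlii_crim_corpus/html/2023/bc/ca/2023bcca8.html"))
# 2023 BCCA 8
print(title_from_filename("./canlii_crim_corpus/html/2023/nl/pc/2023canlii605.html"))
# 2023 CanLII 605
```

Raising on an unexpected file name is a design choice for this sketch: it surfaces misnamed downloads early instead of producing a malformed title.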

### decision_paragraphs
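Runs through the BeautifulSoup object and extracts the decision's paragraphs, which are contained in `<div class="paragWrapper">` tags, along with any footnotes.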

In [51]:
import nltk
import re
from bs4 import BeautifulSoup
from nltk.text import Text


# Reads an HTML file and returns a BeautifulSoup object
def read_html_file(filename: str)->BeautifulSoup:
    '''
    Reads an HTML file and returns a BeautifulSoup object.
    '''
    with open(filename, 'r', encoding="utf-8") as file:
        soup: BeautifulSoup = BeautifulSoup(file, 'html.parser')
    return soup

def create_title(filepath: str)-> str:
    """Create a title for the text file from the HTML file name."""
    # Strip the directories and the extension, e.g. ".../2023bcca8.html" -> "2023bcca8"
    filename = filepath.split("/")[-1].split(".")[0]

    # The file name is composed of two groups of digits separated by one
    # group of letters.
    numbers = re.findall(r"\d+", filename)
    # The first group of digits is the year
    year = numbers[0]
    # The second group of digits is the file number
    file_number = numbers[1]
    # The group of letters is the jurisdiction and court, or "canlii" where
    # the decision has no neutral citation
    jurisdiction = re.findall(r"[a-z]+", filename)[0]

    if jurisdiction == "canlii":
        title = f"{year} CanLII {file_number}"
    else:
        title = f"{year} {jurisdiction.upper()} {file_number}"

    return title

# Extracts the decision text
def decision_paragraphs(filename: str)->tuple:
    '''
    Extracts the decision paragraphs and footnotes. The decision text
    is contained in the <div class="paragWrapper"> tags. This function collects
    those tags, plus any sibling elements between the first and last of them,
    and returns them with the footnote references as (paragraphs, footnotes).
    '''
    
    decision = read_html_file(filename)
    
    # Find the first and last instances of the "paragWrapper" div
    wrappers = decision.find_all("div", class_="paragWrapper")
    if not wrappers:
        return [], []
    first_div, last_div = wrappers[0], wrappers[-1]

    # Iterate over all siblings between the first and last instances of the
    # "paragWrapper" div. Compare by identity: "!=" compares markup, so two
    # identical paragraphs would end the loop early.
    paragraphs = [first_div]
    sibling = first_div
    while sibling is not last_div:
        sibling = sibling.find_next_sibling()
        if sibling is None:
            # last_div is not a sibling of first_div; stop at the end of this level
            break
        paragraphs.append(sibling)
        
    # Finds footnotes where applicable (an empty list if there are none)
    footnotes = decision_footnotes(decision)
        
    return paragraphs, footnotes


def decision_footnotes(decision: BeautifulSoup)->list:
    '''
    Generates a list of footnote references in decisions containing them.
    '''
    # html.parser lowercases tag names, so search for "span" rather than "SPAN"
    return decision.find_all("span", class_="MsoFootnoteReference")
    
    

def clean_text(paragraph: str, remove_para_nums: bool=True)->list:
    '''
    Returns tokenized text. The function can be set to include paragraph 
    numbers for instances where they may provide some semantic value, but 
    defaults to removing them as this generally isn't expected to be the case.
    '''
    words = nltk.word_tokenize(paragraph)
    # A numbered paragraph starts with "[<digits>]", which tokenizes to
    # "[", "<digits>", "]". Drop everything up to and including the "]".
    # The checks also keep empty or unnumbered paragraphs from raising errors.
    if remove_para_nums and re.match(r"\[\d", paragraph) and "]" in words:
        words = words[words.index("]") + 1:]
    return words

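def compile_decision_text(paragraphs: list, footnotes: list)->list:
    '''
    Tokenizes the paragraphs and footnotes returned by decision_paragraphs
    and returns them as a single list of token lists.
    '''
    decision = []
    
    # The paragraphs and footnotes are BeautifulSoup tags, so extract their
    # text before tokenizing
    for paragraph in paragraphs:
        decision.append(clean_text(paragraph.get_text()))
        
    for footnote in footnotes:
        decision.append(clean_text(footnote.get_text(), False))
            
    return decision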

In [54]:
file = "./canlii_crim_corpus/html/2023/nl/pc/2023canlii605.html"

In [53]:
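decision = decision_paragraphs(file)[0]
clean_decision = []
for paragraph in decision:
    clean_decision.append(clean_text(paragraph.text))

# Drop empty paragraphs. Building a new list avoids removing items from a
# list while iterating over it, which skips elements.
clean_decision = [item for item in clean_decision if len(item) > 0]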

clean_decision.insert(0,create_title(file))
        
for item in clean_decision:
    print(item)

2023 CanLII 605
['Mr.', 'Dyer', 'is', 'charged', 'with', '11', 'offences', 'on', 'a', 'single', 'information', '.', 'All', 'of', 'the', 'offences', 'except', 'one', 'relate', 'to', 'a', 'single', 'complainant', 'NM', '.', 'Mr.', 'Dyer', 'has', 'asked', 'to', 'sever', 'count', '4', 'from', 'the', 'information', '.', 'Count', '4', 'is', 'a', 'single', 'count', 'of', 'sexual', 'assault', '.', 'The', 'balance', 'of', 'the', 'charges', 'are', ',', 'assault', ',', 'uttering', 'threats', ',', 'harassment', 'and', 'breaking', 'into', 'a', 'dwelling', '.']
['Counsel', 'for', 'the', 'accused', 'is', 'applying', 'pursuant', 'to', 'section', '591', '(', '3', ')', 'for', 'the', 'severance', 'order', '.', 'She', 'has', 'argued', 'that', 'the', 'basis', 'for', 'the', 'severance', 'is', ':']
['1', ')', 'The', 'charge', 'of', 'sexual', 'assault', 'should', 'be', 'severed', 'as', 'it', 'is', 'discrete', 'in', 'time', 'from', 'the', 'other', 'charges']
['2', ')', 'The', 'accused', 'wishes', 'to', 'explor

In [274]:
import nltk
from nltk.text import Text

extracted_decision_text = []
tokenized_decision_text = []

for paragraph in extracted_decision:
    extracted_decision_text.append(paragraph.text)

for paragraph in extracted_decision_text:
    paragraph = nltk.word_tokenize(paragraph)
    for word in paragraph:
        tokenized_decision_text.append(word)
        
corpus = Text(tokenized_decision_text)
# concordance() prints its matches itself and returns None, so don't wrap it in print()
corpus.concordance("comprehensive")
print(type(corpus))
print(len(corpus))

Displaying 1 of 1 matches:
 . II . Background facts [ 4 ] A comprehensive review of the facts is set out i
<class 'nltk.text.Text'>
14079


### Corpus construction
Once the data is cleaned up and sorted out, it is added to the corpus.
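
One possible way to do this, sketched under assumptions: write each cleaned decision to a text file whose path mirrors the HTML directory tree. The `txt` output directory and the helper names `txt_path_for` and `save_decision` are hypothetical; the input is assumed to be a list of token lists like the one produced in the preprocessing step. The token lists in the demo are placeholders, not real decision text.

```python
import tempfile
from pathlib import Path


def txt_path_for(html_path: str, html_root: str, txt_root: str) -> Path:
    """Map an HTML file to a .txt path that mirrors its place in the tree."""
    relative = Path(html_path).relative_to(html_root)
    return Path(txt_root) / relative.with_suffix(".txt")


def save_decision(paragraphs: list, destination: Path) -> None:
    """Write one decision to disk, one space-joined paragraph per line."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    lines = [" ".join(tokens) for tokens in paragraphs]
    destination.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Demo in a temporary directory so nothing is written to the real corpus
with tempfile.TemporaryDirectory() as tmp:
    target = txt_path_for(
        "./canlii_crim_corpus/html/2023/bc/ca/2023bcca8.html",
        html_root="./canlii_crim_corpus/html",
        txt_root=str(Path(tmp) / "txt"),
    )
    # Placeholder token lists, for illustration only
    save_decision([["First", "paragraph", "."], ["Second", "paragraph", "."]], target)
    saved_text = target.read_text(encoding="utf-8")

print(saved_text)
```

Files laid out this way could later be loaded with a reader such as NLTK's `PlaintextCorpusReader`, though whether that is the right reader for this project is still an open choice.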

## Feature extraction

## Model development

## Model assessment

## Model deployment