# NLP workflow (from Natural Language Processing Fundamentals)
* Data collection
* Data preprocessing
* Feature extraction
* Model development
* Model assessment
* Model deployment

## Data collection
Because CanLII blocks web scraping with CAPTCHAs, and because high-volume scraping violates CanLII's terms of service, this program relies on manually downloaded HTML pages.

## Data preprocessing
These functions remove extraneous HTML and save the clean text to file. Where a decision comes with numbered paragraphs, the preprocessing functions split it along those paragraphs. Where it does not, the functions should infer paragraph boundaries from the document's structure. For some older decisions it may also be possible to infer pagination, though this functionality may not be necessary or useful.
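
As a rough illustration of the paragraph-splitting step, the sketch below splits already-cleaned text on markers such as `[1]` at the start of a line. The helper name `split_numbered_paragraphs` and the marker pattern are assumptions made for this example, not part of the existing functions. Any heading that sits between two numbered paragraphs is absorbed into the preceding paragraph, so a fuller version would need to handle headings separately.

```python
import re

# Assumption: paragraph numbers appear at the start of a line as "[1]", "[2]", ...
PARAGRAPH_MARKER = re.compile(r"^\[(\d+)\]\s*", re.MULTILINE)


def split_numbered_paragraphs(text: str) -> dict:
    """Map each paragraph number to its text (hypothetical helper)."""
    matches = list(PARAGRAPH_MARKER.finditer(text))
    paragraphs = {}
    for i, match in enumerate(matches):
        # A paragraph runs from the end of its marker to the start of the next marker
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        # Collapse hard line breaks and runs of spaces left over from the HTML
        paragraphs[int(match.group(1))] = " ".join(body.split())
    return paragraphs


# Placeholder text, not taken from a real decision
sample = (
    "[1]    First paragraph,\nwrapped over two lines.\n"
    "[2]    Second paragraph.\n"
)
print(split_numbered_paragraphs(sample))
# {1: 'First paragraph, wrapped over two lines.', 2: 'Second paragraph.'}
```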

### HTML to TXT

In [200]:
from bs4 import BeautifulSoup

# Reads an HTML file and returns a BeautifulSoup object
def read_html_file(filename: str) -> BeautifulSoup:
    '''
    Reads an HTML file and returns a BeautifulSoup object.
    '''
    with open(filename, 'r', encoding="utf-8") as file:
        soup = BeautifulSoup(file, 'html.parser')
    return soup


# Extracts the elements that hold the decision text
def extract_decision_paragraphs(soup: BeautifulSoup) -> list:
    '''
    Extracts the elements that make up the decision text. The numbered
    paragraphs are contained in <div class="paragWrapper"> tags. This function
    returns every sibling element from the first such div to the last,
    inclusive, so headings between paragraphs are kept. Use .text on an
    element to get its text. Returns an empty list if no such div is found.
    '''
    # Find the first and last instances of the "paragWrapper" div
    wrappers = soup.find_all("div", class_="paragWrapper")
    if not wrappers:
        return []
    first_div, last_div = wrappers[0], wrappers[-1]

    # Walk the siblings from the first div to the last one.
    # Compare by identity: != on tags compares their content, so a tag
    # identical to the last div would otherwise end the loop early.
    paragraphs = [first_div]
    sibling = first_div
    while sibling is not last_div:
        sibling = sibling.find_next_sibling()
        if sibling is None:
            # The last div is not a sibling of the first; stop instead of raising
            break
        paragraphs.append(sibling)

    return paragraphs

In [203]:
decision = read_html_file("./canliicrim_corpus/sk/ca/2012skca119.html")
extracted_decision = extract_decision_paragraphs(decision)
print(len(extracted_decision))
for paragraph in extracted_decision:
    print(paragraph.text)

194
[1]    S.D.B. was convicted by a Court of Queen’s Bench justice of the following offence:
THAT he, the
said S.D.B., between the 15th day of May and the 16th day
of May, 2006 at Kamsack District, in the Province of Saskatchewan did commit a
sexual assault on J.G., contrary to Section 271 of the Criminal Code.
 
[2]    Following his conviction he was declared to be a dangerous offender and was sentenced as such to an
indeterminate period in a penitentiary.
[3]    Mr. S.D.B. then appealed both his conviction and sentence.  For the reasons that follow we have
decided to dismiss the appeal against conviction, but allow the appeal against
sentence.
II.              
Background facts
[4]    A comprehensive review of the facts is set out in the trial judgment (see 2008 SKQB 494).  For the
purposes of the conviction appeal, a condensed version of the relevant facts
follows.
[5]    On the day of the incident, Mr. S.D.B. and J.G. were driving around in the evening on a First Nations
reserve f
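
The printed output above still carries hard line breaks inside paragraphs, and nothing has been written to disk yet. The following is a minimal sketch of one way to handle both, assuming the paragraph texts are passed in as plain strings. The names `normalize_whitespace` and `save_paragraphs` are hypothetical and do not exist elsewhere in this notebook.

```python
import tempfile
from pathlib import Path


def normalize_whitespace(text: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return " ".join(text.split())


def save_paragraphs(paragraph_texts, out_path) -> int:
    """Write one cleaned paragraph per line; return how many were written."""
    cleaned = [normalize_whitespace(text) for text in paragraph_texts]
    # Drop elements that contained only whitespace
    cleaned = [text for text in cleaned if text]
    Path(out_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Demonstration with placeholder strings written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "example.txt"
    count = save_paragraphs(["[1]    First\nparagraph.", " ", "[2]    Second."], out_file)
    saved_text = out_file.read_text(encoding="utf-8")

print(count)  # 2
```

With the objects from the cell above, the call would look something like `save_paragraphs([p.text for p in extracted_decision], "2012skca119.txt")`, where the output filename is only an example.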

### Corpus construction
Once a decision's text has been cleaned and split into paragraphs, it is added to the corpus.
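
One possible shape for this step is sketched below. It assumes the cleaned decisions are stored as `.txt` files, one paragraph per line, in a `<jurisdiction>/<court>/<citation>.txt` layout that mirrors the HTML path used earlier. Both that layout and the `build_corpus` helper are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def build_corpus(root) -> dict:
    """Load every .txt file under root into a dict keyed by file stem."""
    root_path = Path(root)
    corpus = {}
    for path in sorted(root_path.rglob("*.txt")):
        parts = path.relative_to(root_path).parts
        # Only read metadata from the path when it has the assumed depth
        has_layout = len(parts) > 2
        corpus[path.stem] = {
            "jurisdiction": parts[0] if has_layout else None,
            "court": parts[1] if has_layout else None,
            "paragraphs": path.read_text(encoding="utf-8").splitlines(),
        }
    return corpus


# Demonstration with a placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    court_dir = Path(tmp) / "sk" / "ca"
    court_dir.mkdir(parents=True)
    (court_dir / "2012skca119.txt").write_text(
        "[1] First.\n[2] Second.\n", encoding="utf-8"
    )
    corpus = build_corpus(tmp)

entry = corpus["2012skca119"]
print(entry["court"], len(entry["paragraphs"]))  # ca 2
```

Keying the dictionary by file stem keeps lookups simple, but it assumes citations are unique across courts. A flat list of records would avoid that assumption.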

## Feature extraction

## Model development

## Model assessment

## Model deployment