### Notebook to parse scraped HTML and produce cleaned text of SCC decisions

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International [(CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

Dataset & Code to be cited as: 

    Sean Rehaag, "Supreme Court of Canada Bulk Decisions Dataset" (2023), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/scc/>.

Notes:

(1) Data Source: [Supreme Court of Canada](https://www.scc-csc.ca). 

(2) Unofficial Data: The data are unofficial reproductions of materials on the Supreme Court of Canada website. Links to official versions are included in the dataset.

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Supreme Court of Canada.

(4) Non-Commercial Use: As indicated in the license, the data may be used for non-commercial purposes only (with attribution). For commercial use, see the Supreme Court of Canada website's [Terms of Use](https://www.scc-csc.ca/terms-avis/notice-enonce-eng.aspx).

(5) Accuracy: The data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.



In [1]:
# import libraries
from bs4 import BeautifulSoup
import pandas as pd
import re
import pathlib
import json
import random

# set up progress bar
from tqdm import tqdm
tqdm.pandas()

# set paths
in_path = pathlib.Path('d:/scraping/DATA/DECISIONS/SCC/BULK/HTML/')
out_path_raw = pathlib.Path('DATA/scc_raw.jsonl')
out_path_parsed = pathlib.Path('DATA/scc_cases.jsonl')
out_path_parquet = pathlib.Path('DATA/scc_cases.parquet')
out_path_yearly = pathlib.Path('DATA/YEARLY/')

# set years sought
start_year = 1887
end_year = 2022

# set language sought
language_sought = None  # 'en' for English only, 'fr' for French only, None for all languages


### Load Raw Data

In [2]:
# get list of files (including subdirectories) using pathlib
files = list(in_path.glob('**/*.json'))
print(len(files))

# Load data from files
results = []
for file in tqdm(files):
    with open(file) as f:
        data = json.load(f)
        results.append(data)

# convert list of dictionaries to dataframe
df = pd.DataFrame(results)
df


15582


100%|██████████| 15582/15582 [00:14<00:00, 1059.68it/s]


Unnamed: 0,case_citation,case_citation2,case_year,case_name,case_language,case_decision_date,scraped_link,scraped_timestamp,scraped_status_code,referrer_main_source,referrer_sub_source,referrer_file,referrer_timestamp,case_html
0,(1877) 1 SCR 110,(1877) 1 SCR 110,1877,Boak et al. v. The Merchant's Marine Insurance...,en,1877-01-23,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31T07:18:23.486310,200,SR-Local,SCC-batch_aug_2022,C:/Scraping/QUEUES/scc_batch_aug22_queue.jsonl,2022-08-29T07:52:21.484818,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
1,(1877) 1 SCR 114,(1877) 1 SCR 114,1877,Smyth v. McDougall,en,1877-02-01,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31T07:18:16.971689,200,SR-Local,SCC-batch_aug_2022,C:/Scraping/QUEUES/scc_batch_aug22_queue.jsonl,2022-08-29T07:52:21.484818,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
2,(1877) 1 SCR 117,(1877) 1 SCR 117,1877,The Queen v. Laliberté,en,1877-02-03,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31T07:18:09.437153,200,SR-Local,SCC-batch_aug_2022,C:/Scraping/QUEUES/scc_batch_aug22_queue.jsonl,2022-08-29T07:52:21.484818,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
3,(1877) 1 SCR 145,(1877) 1 SCR 145,1877,Brassard et al. v. Langevin,en,1877-02-28,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31T07:18:01.265658,200,SR-Local,SCC-batch_aug_2022,C:/Scraping/QUEUES/scc_batch_aug22_queue.jsonl,2022-08-29T07:52:21.484818,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
4,(1877) 1 SCR 235,(1877) 1 SCR 235,1877,Johnstone v. The Minister & Trustees of St. An...,en,1877-06-28,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31T07:16:38.791635,200,SR-Local,SCC-batch_aug_2022,C:/Scraping/QUEUES/scc_batch_aug22_queue.jsonl,2022-08-29T07:52:21.484818,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15577,2023 CSC 4,,2023,R. c. McGregor,fr,2023-02-17T00:00:00.000Z,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13T15:18:54.508676,200,SR-Local,SCC-batch_apr_2023,C:/Scraping/QUEUES/scc_full_queue_from_rss_apr...,2023-02-18T05:06:19.388Z,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15578,2023 CSC 5,,2023,R. c. Metzger,fr,2023-03-03T00:00:00.000Z,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13T15:18:00.542248,200,SR-Local,SCC-batch_apr_2023,C:/Scraping/QUEUES/scc_full_queue_from_rss_apr...,2023-03-04T05:06:22.659Z,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15579,2023 CSC 6,,2023,R. c. Downes,fr,2023-03-10T00:00:00.000Z,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13T15:15:51.338716,200,SR-Local,SCC-batch_apr_2023,C:/Scraping/QUEUES/scc_full_queue_from_rss_apr...,2023-03-11T05:06:21.549Z,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15580,2023 CSC 7,,2023,R. c. Chatillon,fr,2023-03-15T00:00:00.000Z,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13T15:13:31.095321,200,SR-Local,SCC-batch_apr_2023,C:/Scraping/QUEUES/scc_full_queue_from_rss_apr...,2023-03-21T05:06:23.580Z,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."


In [3]:
# export raw df to jsonl
df.to_json(out_path_raw, orient='records', lines=True)


### Parse Data

In [4]:
# clean dataframe

# remove ?iframe=true from scraped_link (raw string so that \? is a valid regex escape)
df['scraped_link'] = df['scraped_link'].str.replace(r'\?iframe=true', '', regex=True)

# remove T and everything after from scraped_timestamp & case_decision_date
df['scraped_timestamp'] = df['scraped_timestamp'].str.replace('T.*', '', regex=True)
df['case_decision_date'] = df['case_decision_date'].str.replace('T.*', '', regex=True)

# remove scraped_status_code, referrer_main_source, referrer_sub_source, referrer_file, referrer_timestamp
df = df.drop(columns=['scraped_status_code', 'referrer_main_source', 'referrer_sub_source', 'referrer_file', 'referrer_timestamp'])

# convert case_year to int and filter for years sought
df['case_year'] = df['case_year'].astype(int)
df = df[df.case_year >= start_year]
df = df[df.case_year <= end_year]

# filter for language if desired
if language_sought:
    df = df[df.case_language == language_sought]

# remove rows without a usable citation (typically orders or errors),
# i.e. rows whose citation is missing or contains '='
df = df[~df.case_citation.str.contains('=', na=True)]

# if citation2 equals citation, set citation2 to ''
df['case_citation2'] = df['case_citation2'].where(df['case_citation2'] != df['case_citation'], '')

# rename 'scraped_link' column to 'source_url'
df = df.rename(columns={'scraped_link': 'source_url'})

# remove 'case_' from all column names
df.columns = df.columns.str.replace('case_', '')


Unnamed: 0,citation,citation2,year,name,language,decision_date,source_url,scraped_timestamp,html
323,(1887) 13 SCR 441,,1887,City of Winnipeg v. Wright,en,1887-05-11,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
324,(1887) 13 SCR 469,,1887,Ball v. Crompton Corset Co.,en,1887-03-01,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
325,(1887) 13 SCR 577,,1887,St. Catharines Milling and Lumber Co. v. R.,en,1887-06-20,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
326,(1887) 14 SCR 105,,1887,Canadian Pacific Ry. Co. v. Robinson,en,1887-06-20,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
327,(1887) 14 SCR 217,,1887,Fairbanks v. Barlow,en,1887-03-14,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,"<!DOCTYPE html>\n<html lang=""en"">\n\n<head>\n ..."
...,...,...,...,...,...,...,...,...,...
15568,2022 CSC 54,,2022,R. c. Beaver,fr,2022-12-09,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15569,2022 CSC 6,,2022,Anderson c. Alberta,fr,2022-03-18,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15570,2022 CSC 7,,2022,R. c. White,fr,2022-03-18,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
15571,2022 CSC 8,,2022,R. c. Pope,fr,2022-03-21,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,"<!DOCTYPE html>\n<html lang=""fr"">\n\n<head>\n ..."
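
The two regular-expression clean-ups in the cell above can be tried on plain strings. This is a minimal sketch using only the standard library `re` module rather than pandas; the helper names `strip_time` and `strip_iframe` and the example URL are hypothetical and not part of the notebook.

```python
import re

def strip_time(stamp):
    """Drop 'T' and everything after it, keeping only the date part."""
    return re.sub(r'T.*', '', stamp)

def strip_iframe(url):
    """Remove the literal query string '?iframe=true' from a URL."""
    return re.sub(r'\?iframe=true', '', url)

# date strings in the three shapes that appear in the raw data
dates = ['1877-01-23', '2023-02-17T00:00:00.000Z', '2022-08-31T07:18:23.486310']
cleaned_dates = [strip_time(d) for d in dates]
print(cleaned_dates)

# hypothetical URL, used only to illustrate the pattern
clean_url = strip_iframe('https://example.org/item/1?iframe=true')
print(clean_url)
```

Strings that contain no `T` pass through unchanged, which is why the older rows (already date-only) are unaffected.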


In [5]:
# Extract text of cases from html
def get_text(html):

    # if html is None, return None
    if html is None:
        return None
    
    # parse the html with Beautiful Soup (text is later taken from the whole page)
    soup = BeautifulSoup(html, 'html.parser')

    # convert <br> to new line to preserve paragraphs
    for br in soup.find_all('br'):
        br.replace_with('\n')

    # Insert newline characters after each <p> tag to preserve paragraphs
    for p in soup.find_all('p'):
        p.insert_after('\n')

    return soup.text

df['text'] = df.html.progress_apply(get_text)



100%|██████████| 15234/15234 [08:11<00:00, 31.01it/s]
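
To see the effect of turning `<br>` and `<p>` boundaries into newlines without installing anything, the idea can be approximated with the standard library. Note that this is a swapped-in technique: it uses `html.parser.HTMLParser` instead of Beautiful Soup, and the class and function names are hypothetical. It is a rough sketch of the same idea, not a drop-in replacement for `get_text`.

```python
from html.parser import HTMLParser

class ParagraphTextExtractor(HTMLParser):
    """Collect page text, adding a newline for each <br> and after each </p>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> becomes a line break
        if tag == 'br':
            self.parts.append('\n')

    def handle_endtag(self, tag):
        # a newline after each paragraph keeps paragraphs apart
        if tag == 'p':
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(html):
    """Return the text of an html string, or None if there is no html."""
    if html is None:
        return None
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

sample_html = '<div><p>First paragraph.</p><p>Line one<br>line two</p></div>'
extracted = extract_text(sample_html)
print(repr(extracted))
```

Without the inserted newlines, the two paragraphs would run together as `First paragraph.Line one...`, which is the problem the `<p>` and `<br>` handling is meant to avoid.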


In [6]:
# Clean text of cases
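def clean_text(text):

    # if text is None (no html was available), return None
    if text is None:
        return None

    # replace non-breaking spaces (\xa0) with regular spaces
    text = text.replace('\xa0', ' ')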

    # Remove multiple whitespaces and preserve paragraphs
    text = '\n'.join([re.sub(r'\s+', ' ', line.strip()) for line in text.split('\n')])
    
    # Remove single newlines
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

    # Convert multiple newlines to single newlines
    text = re.sub(r'\n+', '\n', text)

    # Remove 'You are being directed to the most recent version...
    if 'You are being directed to the most recent version of the statute which may not be' in text:
        text = text.split('You are being directed to the most recent version of the statute which may not be')[0]

    # Remove 'Vous allez être redirigé vers la version...'
    if '\nVous allez être redirigé vers la version' in text:
        text = text.split('\nVous allez être redirigé vers la version')[0]

    # if '\nDecision Information\n' in text, keep only what follows its first occurrence
    if '\nDecision Information\n' in text:
        text = text.split('\nDecision Information\n', 1)[1]

    # if '\nInformations sur la décision\n' in text, keep only what follows its first occurrence
    if '\nInformations sur la décision\n' in text:
        text = text.split('\nInformations sur la décision\n', 1)[1]

    # Remove all strings '\n[Page #]\n' (with # being a number of up to 4 digits)
    text = re.sub(r'\n\[Page \d{1,4}\]\n', ' ', text)
    
    return text

df['unofficial_text'] = df.text.progress_apply(clean_text)


100%|██████████| 15234/15234 [00:59<00:00, 257.25it/s]
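
The whitespace handling in `clean_text` is easier to follow on a small input. The sketch below repeats the normalization steps on a made-up string, using only the standard library; the function names `normalize_paragraphs` and `text_after_marker` and the sample strings are hypothetical.

```python
import re

def normalize_paragraphs(text):
    """Collapse whitespace, join wrapped lines, and keep one newline per paragraph break."""
    text = text.replace('\xa0', ' ')
    # trim each line and squeeze runs of whitespace inside it
    text = '\n'.join(re.sub(r'\s+', ' ', line.strip()) for line in text.split('\n'))
    # a lone newline is a wrapped line, so it becomes a space
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # two or more newlines mark a paragraph break, reduced to one newline
    text = re.sub(r'\n+', '\n', text)
    return text

def text_after_marker(text, marker):
    """Return what follows the first occurrence of marker, or the text unchanged."""
    return text.split(marker, 1)[1] if marker in text else text

sample = 'First  line\nof paragraph one.\n\nSecond\xa0paragraph.\n\n\nThird.'
normalized = normalize_paragraphs(sample)
print(repr(normalized))

page = 'Site menu\nDecision Information\nBody of the decision'
body = text_after_marker(page, '\nDecision Information\n')
print(repr(body))
```

The order of the two newline substitutions matters: single newlines have to be turned into spaces before runs of newlines are collapsed, otherwise every paragraph break would be lost.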


In [7]:
# drop unneeded columns
df = df.drop(columns=['html'])
df = df.drop(columns=['text'])

In [8]:
df

Unnamed: 0,citation,citation2,year,name,language,decision_date,source_url,scraped_timestamp,unofficial_text
323,(1887) 13 SCR 441,,1887,City of Winnipeg v. Wright,en,1887-05-11,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,City of Winnipeg v. Wright\nCollection\nSuprem...
324,(1887) 13 SCR 469,,1887,Ball v. Crompton Corset Co.,en,1887-03-01,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,Ball v. Crompton Corset Co.\nCollection\nSupre...
325,(1887) 13 SCR 577,,1887,St. Catharines Milling and Lumber Co. v. R.,en,1887-06-20,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,St. Catharines Milling and Lumber Co. v. R.\nC...
326,(1887) 14 SCR 105,,1887,Canadian Pacific Ry. Co. v. Robinson,en,1887-06-20,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,Canadian Pacific Ry. Co. v. Robinson\nCollecti...
327,(1887) 14 SCR 217,,1887,Fairbanks v. Barlow,en,1887-03-14,https://decisions.scc-csc.ca/scc-csc/scc-csc/e...,2022-08-31,Fairbanks v. Barlow\nCollection\nSupreme Court...
...,...,...,...,...,...,...,...,...,...
15568,2022 CSC 54,,2022,R. c. Beaver,fr,2022-12-09,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2023-04-13,R. c. Beaver\nCollection\nJugements de la Cour...
15569,2022 CSC 6,,2022,Anderson c. Alberta,fr,2022-03-18,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,Anderson c. Alberta\nCollection\nJugements de ...
15570,2022 CSC 7,,2022,R. c. White,fr,2022-03-18,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,R. c. White\nCollection\nJugements de la Cour ...
15571,2022 CSC 8,,2022,R. c. Pope,fr,2022-03-21,https://decisions.scc-csc.ca/scc-csc/scc-csc/f...,2022-09-01,R. c. Pope\nCollection\nJugements de la Cour s...


### Export data

In [9]:
# export cleaned df to jsonl
df.to_json(out_path_parsed, orient='records', lines=True)

In [10]:
# export cleaned df to parquet
df.to_parquet(out_path_parquet)

In [11]:
# export cleaned df to yearly json files
out_path_yearly.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
for year in tqdm(range(start_year, end_year+1)):
    df[df.year == year].to_json(out_path_yearly / f'{year}.json', orient='records', indent=4)

100%|██████████| 136/136 [00:01<00:00, 68.28it/s]
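
The JSON Lines export holds one record per line, so it can be read back without pandas. The following is a minimal sketch using only the standard library; the helper `read_jsonl`, the temporary file, and the placeholder records are assumptions made for illustration, while the field names match the columns of the cleaned dataframe.

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path):
    """Read a JSON Lines file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# placeholder records (not real decisions) with the same field names as the export
sample_records = [
    {'citation': 'CITATION A', 'year': 2000, 'language': 'en', 'unofficial_text': 'Placeholder text.'},
    {'citation': 'CITATION B', 'year': 2000, 'language': 'fr', 'unofficial_text': 'Texte fictif.'},
]

with tempfile.TemporaryDirectory() as tmp:
    # write a small file in the same one-record-per-line layout, then read it back
    sample_path = Path(tmp) / 'sample_cases.jsonl'
    with open(sample_path, 'w', encoding='utf-8') as f:
        for record in sample_records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
    loaded = read_jsonl(sample_path)

# filter by language, as a downstream user might
english = [r for r in loaded if r['language'] == 'en']
print(len(loaded), len(english))
```

Reading line by line keeps memory use low for large files, since each decision can be processed as soon as its line is parsed.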
