### Notebook to parse scraped HTML and produce cleaned text of FC decisions

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International [(CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

Dataset & Code to be cited as: 

    Sean Rehaag, "Federal Court Decisions Dataset" (2023), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/fc/>.

Notes:

(1) Data Source: [Federal Court](https://www.fct-cf.gc.ca). 

(2) Unofficial Data: The data are unofficial reproductions of materials on the Federal Court website. Links to official versions are included in the dataset.

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Federal Court.

(4) Non-Commercial Use: As indicated in the license, data may be used for non-commercial use (with attribution) only. For commercial use, see the Federal Court website's [Terms of Use](https://www.fct-cf.gc.ca/en/pages/important-notices).

(5) Accuracy: The data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.

(6) Limitation: Only includes cases with a neutral citation. Neutral citations began to be used in 2001.

(7) Delay: Decisions may take many months to be translated (sometimes over a year). As a result, in the most recent years, decisions may only be available in one language.

### Requirements:

    pip install pandas
    pip install beautifulsoup4
    pip install tqdm
    pip install pyarrow

(Written on Python 3.9.12)




In [1]:
# import libraries
from bs4 import BeautifulSoup
import pandas as pd
import re
import pathlib
import json
import random

# set up progress bar
from tqdm import tqdm
tqdm.pandas()

# set paths
in_path = pathlib.Path('d:/scraping/DATA/DECISIONS/FC/BULK/HTML/')
out_path_raw = pathlib.Path('DATA/fc_raw.jsonl')
out_path_parsed = pathlib.Path('DATA/fc_cases.jsonl')
out_path_parquet = pathlib.Path('DATA/fc_cases.parquet')
out_path_yearly = pathlib.Path('DATA/YEARLY/')

# set years sought
start_year = 2001 
end_year = 2023


### Load Raw Data

In [2]:
# get list of files (including subdirectories) using pathlib
files = list(in_path.glob('**/*.json'))
print(len(files))

# Load data from files
results = []
for file in tqdm(files):
    with open(file) as f:
        data = json.load(f)
        results.append(data)

# convert list of dictionaries to dataframe
df = pd.DataFrame(results)




72400


100%|██████████| 72400/72400 [04:54<00:00, 245.73it/s] 
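The loading step relies on `glob('**/*.json')` matching files in the input folder and in every subfolder. The following is a minimal, self-contained sketch of that behaviour using a temporary directory. The folder layout, file names, and record contents are made up for illustration and do not reflect the real scraped files.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: one file in a subfolder, one at the top level,
    # and one non-JSON file that the pattern should skip
    (root / '2001').mkdir()
    (root / '2001' / 'a.json').write_text(
        json.dumps({'case_citation': '2001 FCT 1'}), encoding='utf-8')
    (root / 'b.json').write_text(
        json.dumps({'case_citation': '2023 CF 715'}), encoding='utf-8')
    (root / 'notes.txt').write_text('ignored', encoding='utf-8')

    # '**' matches the folder itself and all nested subfolders
    files = sorted(root.glob('**/*.json'))

    records = []
    for file in files:
        with open(file, encoding='utf-8') as f:
            records.append(json.load(f))

citations = sorted(r['case_citation'] for r in records)
print(len(records), citations)
```

Passing `records` to `pd.DataFrame`, as the cell above does, would then give one row per file.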


In [3]:
# export raw df to jsonl
df.to_json(out_path_raw, orient='records', lines=True)


### Parse Data

In [4]:
# clean dataframe

#remove ?iframe=true from scraped_link
df['scraped_link'] = df['scraped_link'].str.replace(r'\?iframe=true', '', regex=True)

# remove T and everything after from scraped_timestamp & case_decision_date
df['scraped_timestamp'] = df['scraped_timestamp'].str.replace('T.*', '', regex=True)
df['case_decision_date'] = df['case_decision_date'].str.replace('T.*', '', regex=True)

# remove scraped_status_code, referrer_main_source, referrer_sub_source, referrer_file, referrer_timestamp
df = df.drop(columns=['scraped_status_code', 'referrer_main_source', 'referrer_sub_source', 'referrer_file', 'referrer_timestamp'])

# convert case_year to int and filter for years sought
df['case_year'] = df['case_year'].astype(int)
df = df[df.case_year >= start_year]
df = df[df.case_year <= end_year]

# remove rows without a neutral citation (typically orders or errors):
# drop rows where the citation is missing or contains '=' or '-'
df = df[~df.case_citation.str.contains('=', regex=False, na=True)]
df = df[~df.case_citation.str.contains('-', regex=False, na=True)]

# if citation2 equals citation, set citation2 to ''
df['case_citation2'] = df['case_citation2'].where(df['case_citation2'] != df['case_citation'], '')

# fill nan in citation2 with ''
df['case_citation2'] = df['case_citation2'].fillna('')

# rename 'scraped_link' column to 'source_url'
df = df.rename(columns={'scraped_link': 'source_url'})

# remove 'case_' from all column names
df.columns = df.columns.str.replace('case_', '')
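To see what the string patterns in this cell do, here is a small sketch that applies the same patterns to single values with the standard `re` module instead of pandas. The URL, timestamp, and the two rejected citation strings are hypothetical examples, not values taken from the dataset.

```python
import re

# Hypothetical scraped values (illustration only)
link = 'https://example.org/decisions/item/12345/index.do?iframe=true'
timestamp = '2022-08-18T14:05:09'

# Same patterns as in the cell above: '?' is escaped so it is matched literally
clean_link = re.sub(r'\?iframe=true', '', link)

# 'T.*' removes the 'T' separator and everything after it, leaving the date
clean_date = re.sub(r'T.*', '', timestamp)

# Citation filter: keep only strings that contain neither '=' nor '-'
citations = ['2001 FCT 1', 'T-1234-01', 'id=5678', '2023 CF 715']
kept = [c for c in citations if '=' not in c and '-' not in c]

print(clean_link)
print(clean_date)
print(kept)
```

The `T.*` pattern is unanchored, so it only behaves as intended when the first capital `T` in the value is the date/time separator, as in an ISO-style timestamp.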


In [5]:
# Extract text of cases from html
def get_text(html):

    # if html is None, return None
    if html is None:
        return None
    
    # parse the full html document with Beautiful Soup
    soup = BeautifulSoup(html, 'html.parser')

    # convert <br> to new line to preserve paragraphs
    for br in soup.find_all('br'):
        br.replace_with('\n')

    # Insert newline characters after each <p> tag to preserve paragraphs
    for p in soup.find_all('p'):
        p.insert_after('\n')

    return soup.text

df['text'] = df.html.progress_apply(get_text)



100%|██████████| 59950/59950 [30:19<00:00, 32.94it/s]  


In [6]:
# Clean text of cases
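def clean_text(text):

    # if there is no text (get_text returned None), return None
    if text is None:
        return None

    # replace non-breaking spaces (\xa0) with regular spaces
    text = text.replace('\xa0', ' ')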

    # Remove multiple whitespaces and preserve paragraphs
    text = '\n'.join([re.sub(r'\s+', ' ', line.strip()) for line in text.split('\n')])
    
    # Remove single newlines
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

    # Convert multiple newlines to single newlines
    text = re.sub(r'\n+', '\n', text)

    # Remove 'You are being directed to the most recent version...
    if 'You are being directed to the most recent version of the statute which may not be' in text:
        text = text.split('You are being directed to the most recent version of the statute which may not be')[0]

    # Remove 'Vous allez être redirigé vers la version...'
    if '\nVous allez être redirigé vers la version' in text:
        text = text.split('\nVous allez être redirigé vers la version')[0]

    # if '\nDecision Information\n' in text, remove everything before its first occurrence
    if '\nDecision Information\n' in text:
        text = text.split('\nDecision Information\n', 1)[1]

    # if '\nInformations sur la décision\n' in text, remove everything before its first occurrence
    if '\nInformations sur la décision\n' in text:
        text = text.split('\nInformations sur la décision\n', 1)[1]

    # Replace each '\n[Page #]\n' marker (# being a number of 1 to 3 digits) with a space
    text = re.sub(r'\n\[Page \d{1,3}\]\n', ' ', text)
    
    return text

df['unofficial_text'] = df.text.progress_apply(clean_text)


100%|██████████| 59950/59950 [02:19<00:00, 431.09it/s]
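The whitespace steps in `clean_text` can be tried in isolation. The sketch below copies the first four steps into a helper named `normalize_whitespace` (a name introduced here for illustration only) and runs it on a short made-up string. It shows that a single newline is treated as a line wrap and becomes a space, while a run of two or more newlines is kept as one paragraph break.

```python
import re

def normalize_whitespace(text):
    """Apply the whitespace steps of clean_text to a string."""
    # non-breaking spaces become regular spaces
    text = text.replace('\xa0', ' ')
    # collapse runs of whitespace within each line
    text = '\n'.join(re.sub(r'\s+', ' ', line.strip()) for line in text.split('\n'))
    # a newline with no newline on either side is a line wrap: turn it into a space
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # remaining runs of newlines mark paragraph breaks: keep exactly one
    text = re.sub(r'\n+', '\n', text)
    return text

# Made-up sample text (illustration only)
sample = 'First\xa0paragraph,\nwrapped  line.\n\n\nSecond   paragraph.'
result = normalize_whitespace(sample)
print(repr(result))
```

One consequence of this design is that a paragraph break only survives when the extracted text contains at least two consecutive newlines at that point.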


In [7]:
# drop unneeded columns
df = df.drop(columns=['html'])
df = df.drop(columns=['text'])

In [8]:
# reset index
df = df.reset_index(drop=True)

In [9]:
df.head(20)

|   | citation | year | name | language | decision_date | source_url | scraped_timestamp | citation2 | unofficial_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 FCT 1 | 2001 | Adecon Ship Management Inc. v. Cuba | en | 2001-02-01 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Adecon Ship Management Inc. v. Cuba\nCourt (s)... |
| 1 | 2001 FCT 10 | 2001 | Islam v. Canada (Minister of Citizenship and I... | en | 2001-02-02 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Islam v. Canada (Minister of Citizenship and I... |
| 2 | 2001 FCT 100 | 2001 | Duterville v. Canada | en | 2001-02-20 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Duterville v. Canada\nCourt (s) Database\nFede... |
| 3 | 2001 FCT 1000 | 2001 | LS Entertainment Group Inc. v. KALOS VISION LT... | en | 2001-09-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | LS Entertainment Group Inc. v. KALOS VISION LT... |
| 4 | 2001 FCT 1001 | 2001 | Ay v. Canada (Minister of Citizenship and Immi... | en | 2001-09-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Ay v. Canada (Minister of Citizenship and Immi... |
| 5 | 2001 FCT 1002 | 2001 | Mohammed v. Canada (Minister of Citizenship an... | en | 2001-09-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Mohammed v. Canada (Minister of Citizenship an... |
| 6 | 2001 FCT 1003 | 2001 | Predrag v. Canada (Minister of Citizenship and... | en | 2001-09-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Predrag v. Canada (Minister of Citizenship and... |
| 7 | 2001 FCT 1004 | 2001 | Socan v. 537047 B.c. Ltd. | en | 2001-09-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Socan v. 537047 B.c. Ltd.\nCourt (s) Database\... |
| 8 | 2001 FCT 1005 | 2001 | Robert Mondavi Winery v. Spagnol's Wine & Beer... | en | 2001-09-10 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Robert Mondavi Winery v. Spagnol's Wine & Beer... |
| 9 | 2001 FCT 1006 | 2001 | Alam v. Canada (Minister of Citizenship and Im... | en | 2001-09-10 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-18 |  | Alam v. Canada (Minister of Citizenship and Im... |
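
The rows above run 1, 10, 100, 1000, 1001, and so on, which is string order rather than numeric order. If numeric ordering is wanted, one option is to split each neutral citation into its parts and sort on those. The sketch below assumes the shape seen in the sample rows (four-digit year, upper-case court code, decision number). The pattern and the helper name `citation_key` are assumptions made for illustration, and `'2001 FCT 2'` is an invented example value.

```python
import re

# Assumed shape: four-digit year, upper-case court code, decision number
CITATION_RE = re.compile(r'^(\d{4}) ([A-Z]+) (\d+)$')

def citation_key(citation):
    """Return (year, court, number) for sorting, or None if the shape does not match."""
    match = CITATION_RE.match(citation)
    if match is None:
        return None
    year, court, number = match.groups()
    return int(year), court, int(number)

citations = ['2001 FCT 10', '2001 FCT 2', '2001 FCT 100', '2001 FCT 1']

by_string = sorted(citations)                     # '10' and '100' sort before '2'
by_number = sorted(citations, key=citation_key)   # 1, 2, 10, 100

print(by_string)
print(by_number)
```

Because `citation_key` returns `None` for strings that do not match, any such strings would need to be filtered out before sorting.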


In [10]:
df.tail(20)

|   | citation | year | name | language | decision_date | source_url | scraped_timestamp | citation2 | unofficial_text |
|---|---|---|---|---|---|---|---|---|---|
| 59930 | 2023 CF 715 | 2023 | Jacques c. Canada | fr | 2023-05-23 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-07-04 |  | Jacques c. Canada\nBase de données – Cour (s)\... |
| 59931 | 2023 CF 741 | 2023 | Vidéotron Ltée c. Technologies Konek Inc. | fr | 2023-05-26 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | Vidéotron Ltée c. Technologies Konek Inc.\nBas... |
| 59932 | 2023 CF 748 | 2023 | Shaoguan Risen Trading Corporation Ltd. c. Don... | fr | 2023-05-29 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | Shaoguan Risen Trading Corporation Ltd. c. Don... |
| 59933 | 2023 CF 749 | 2023 | French c. La Légion Royale Canadienne | fr | 2023-05-29 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | French c. Légion royale canadienne\nBase de do... |
| 59934 | 2023 CF 751 | 2023 | Morin c. Canada (Procureur général) | fr | 2023-05-30 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | Morin c. Canada (Procureur général)\nBase de d... |
| 59935 | 2023 CF 752 | 2023 | Lill c. Canada | fr | 2023-05-30 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-07-03 |  | Lill c. Canada\nBase de données – Cour (s)\nDé... |
| 59936 | 2023 CF 753 | 2023 | Cherif c. Canada (Citoyenneté et Immigration) | fr | 2023-06-02 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-07-03 |  | Cherif c. Canada (Citoyenneté et Immigration)\... |
| 59937 | 2023 CF 785 | 2023 | Chaudhry c. Canada (Citoyenneté et Immigration) | fr | 2023-06-08 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-07-03 |  | Chaudhry c. Canada (Citoyenneté et Immigration... |
| 59938 | 2023 CF 795 | 2023 | Rock c. Conseil des Innus de Pessamit | fr | 2023-06-07 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | Rock c. Conseil des Innus de Pessamit\nBase de... |
| 59939 | 2023 CF 82 | 2023 | Flores c. Canada (Citoyenneté et Immigration) | fr | 2023-01-19 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2023-06-27 |  | Flores c. Canada (Citoyenneté et Immigration)\... |


In [11]:
# count number of cases per year
df.groupby('year').count()

| year | citation | name | language | decision_date | source_url | scraped_timestamp | citation2 | unofficial_text |
|---|---|---|---|---|---|---|---|---|
| 2001 | 2807 | 2807 | 2807 | 2807 | 2807 | 2807 | 2807 | 2807 |
| 2002 | 2639 | 2639 | 2639 | 2639 | 2639 | 2639 | 2639 | 2639 |
| 2003 | 2962 | 2962 | 2962 | 2962 | 2962 | 2962 | 2962 | 2962 |
| 2004 | 3509 | 3509 | 3509 | 3509 | 3509 | 3509 | 3509 | 3509 |
| 2005 | 3409 | 3409 | 3409 | 3409 | 3409 | 3409 | 3409 | 3409 |
| 2006 | 3071 | 3071 | 3071 | 3071 | 3071 | 3071 | 3071 | 3071 |
| 2007 | 2740 | 2740 | 2740 | 2740 | 2740 | 2740 | 2740 | 2740 |
| 2008 | 2798 | 2798 | 2798 | 2798 | 2798 | 2798 | 2798 | 2798 |
| 2009 | 2501 | 2501 | 2501 | 2501 | 2501 | 2501 | 2501 | 2501 |
| 2010 | 2580 | 2580 | 2580 | 2580 | 2580 | 2580 | 2580 | 2580 |


### Export data

In [12]:
# export cleaned df to jsonl
df.to_json(out_path_parsed, orient='records', lines=True)
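
With `orient='records', lines=True`, the export holds one JSON record per line, so the file can also be read back without pandas. The sketch below writes two made-up records to a temporary file and reads them line by line. The helper name `read_jsonl`, the file name, and the record values are illustrative assumptions. Only the column names are taken from the output shown earlier.

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical records using a few of the dataset's column names
sample = [
    {'citation': '2001 FCT 1', 'year': 2001, 'language': 'en'},
    {'citation': '2023 CF 715', 'year': 2023, 'language': 'fr'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'fc_cases_sample.jsonl'
    path.write_text('\n'.join(json.dumps(r) for r in sample) + '\n', encoding='utf-8')
    records = list(read_jsonl(path))

french = [r['citation'] for r in records if r['language'] == 'fr']
print(french)
```

Reading line by line keeps memory use low, which may help because each record carries the full text of a decision.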

In [13]:
# export cleaned df to parquet
df.to_parquet(out_path_parquet)

In [14]:
# export cleaned df to yearly json files
for year in tqdm(range(start_year, end_year+1)):
    for language in ['en', 'fr']:
        df[(df.year == year) & (df.language == language)].to_json(out_path_yearly / f'{year}_{language}.json', orient='records', indent=4)
    

100%|██████████| 23/23 [00:05<00:00,  3.89it/s]
