### Notebook to load and analyze Refugee Appeal Division cases

### Note: only loading options 3 and 4 work at the moment

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). 

Dataset & Code to be cited as:

Sean Rehaag, "Refugee Appeal Division Bulk Decisions Dataset" (2023), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/rad/>.

Notes:

(1) Data Source: Immigration and Refugee Board (IRB). In the fall of 2022, the IRB added the Refugee Law Laboratory (RLL) to its email distribution list for legal publishers of RAD decisions. The RLL therefore receives new RAD cases as the IRB releases them for publication. Also in the fall of 2022, the IRB provided the RLL with a full backlog of approximately 116k published decisions from all divisions (RAD, RPD, ID, IAD).

(2) Unofficial Data: The data are unofficial reproductions. For official versions, please contact the Immigration and Refugee Board. 

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Immigration and Refugee Board.

(4) Non-Commercial Use: As indicated in the license, the data may be used for non-commercial purposes only (with attribution). For commercial use, please contact the Immigration and Refugee Board.

(5) Accuracy: Data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.

Acknowledgements: Thanks to Rafael Dolores for coding the parsing scripts.

### Requirements:

    pip install pandas

### If using parquet

    pip install pyarrow

### If loading remotely (other than via Hugging Face)
    
    pip install requests

### If loading remotely via Hugging Face

    pip install datasets
    

(Written on Python 3.9.12)

### Load Data

Four Options: Local & Remote

In [8]:
# OPTION 1: Load Hugging Face dataset

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("refugee-law-lab/canadian-legal-data", split="train", data_dir="RAD")

# convert to dataframe
df = pd.DataFrame(dataset)

df


Downloading readme: 100%|██████████| 11.2k/11.2k [00:00<00:00, 11.2MB/s]
Downloading data: 100%|██████████| 313M/313M [00:18<00:00, 17.3MB/s]
Downloading data files: 100%|██████████| 1/1 [00:18<00:00, 18.09s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1000.31it/s]
Generating train split: 27554 examples [00:01, 15744.45 examples/s]


Unnamed: 0,citation1,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,MB7-00112,,RAD,,MB7-00112a.txt,2023-11-12,2020-09-15,2020,\nRAD File / Dossier de la SAR : MB7-00112\nPr...,en,
1,MB7-00112,,RAD,,MB7-00112tf.txt,2023-11-12,2020-09-15,2020,\nDossier de la SAR / RAD File: MB7-00112\nHui...,fr,
2,MB7-03926,,RAD,,MB7-03926f.txt,2023-11-12,2020-10-26,2020,\nDossier de la SAR / RAD File : MB7-03926\nMB...,fr,
3,MB7-03926,,RAD,,MB7-03926ta.txt,2023-11-12,2020-10-26,2020,\nRAD File No. / No de dossier de la SAR : MB7...,en,
4,MB7-24221,,RAD,,MB7-24221 f.txt,2023-11-12,2021-04-01,2021,\nDossier de la SAR / RAD File : MB7-24221\nHu...,fr,
...,...,...,...,...,...,...,...,...,...,...,...
27549,TC0-10861,,RAD,,3599965.txt,2023-11-13,2021-05-25,2021,\nDossier de la SAR / RAD File: TC0-10861\nHui...,fr,
27550,TB9-20994,,RAD,,3601134.txt,2023-11-13,2020-07-30,2020,\nRAD File / Dossier de la SAR : TB9-20994\nPr...,en,
27551,TB9-20994,,RAD,,3601135.txt,2023-11-13,2020-07-30,2020,\nDossier de la SAR / RAD File: TB9-20994\nHui...,fr,
27552,TC0-09230,,RAD,,3601140.txt,2023-11-13,2021-02-22,2021,\nRAD File / Dossier de la SAR : TC0-09230\nTC...,en,


In [4]:
# OPTION 2: Load parquet data remotely from Hugging Face without cloning repo
import pandas as pd
import requests
from io import BytesIO

url = 'https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data/resolve/main/RAD/train.parquet'

# load data
results = requests.get(url)

# convert to dataframe
df = pd.read_parquet(BytesIO(results.content))

df

# (if code fails, add engine='pyarrow' to read_parquet() function)

Unnamed: 0,citation,citation2,dataset,year,name,language,document_date,source_url,scraped_timestamp,unofficial_text,other
0,2001 FCT 1,,FC,2001,Adecon Ship Management Inc. v. Cuba,en,2001-02-01,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2022-08-18,Adecon Ship Management Inc. v. Cuba\nCourt (s)...,
1,2001 FCT 10,,FC,2001,Islam v. Canada (Minister of Citizenship and I...,en,2001-02-02,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2022-08-18,Islam v. Canada (Minister of Citizenship and I...,
2,2001 FCT 100,,FC,2001,Duterville v. Canada,en,2001-02-20,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2022-08-18,Duterville v. Canada\nCourt (s) Database\nFede...,
3,2001 FCT 1000,,FC,2001,LS Entertainment Group Inc. v. KALOS VISION LT...,en,2001-09-07,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2022-08-18,LS Entertainment Group Inc. v. KALOS VISION LT...,
4,2001 FCT 1001,,FC,2001,Ay v. Canada (Minister of Citizenship and Immi...,en,2001-09-07,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2022-08-18,Ay v. Canada (Minister of Citizenship and Immi...,
...,...,...,...,...,...,...,...,...,...,...,...
59901,2023 CF 840,,FC,2023,Vill c. Bell Canada,fr,2023-06-14,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2023-07-03,Vill c. Bell Canada\nBase de données – Cour (s...,
59902,2023 CF 853,,FC,2023,Gosselin c. Canada (Procureur général),fr,2023-06-16,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2023-07-03,Gosselin c. Canada (Procureur général)\nBase d...,
59903,2023 CF 861,,FC,2023,Stoica c. Canada (Sécurité puplique et Protect...,fr,2023-06-20,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2023-07-03,Stoica c. Canada (Sécurité puplique et Protect...,
59904,2023 CF 893,,FC,2023,"Voltage Pictures, LLC c. Salna",fr,2023-06-26,https://decisions.fct-cf.gc.ca/fc-cf/decisions...,2023-07-03,"Voltage Pictures, LLC c. Salna\nBase de donnée...",
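
The table printed for Option 2 lists Federal Court (`FC`) citations under a `citation` column, not the `citation1` column and `RAD` rows shown by the other options. A quick sanity check after loading can catch this kind of mismatch early. The sketch below is only an illustration: the helper name `check_rad_frame` and the two tiny frames are hypothetical, and the expected column list is simply copied from the RAD tables displayed in this notebook.

```python
import pandas as pd

# Column names copied from the RAD tables displayed in this notebook
EXPECTED_COLUMNS = [
    'citation1', 'citation2', 'dataset', 'name', 'source_url',
    'scraped_timestamp', 'document_date', 'year', 'unofficial_text',
    'language', 'other',
]


def check_rad_frame(df):
    """Return a list of problems; an empty list means the frame looks like RAD data."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    if 'dataset' in df.columns:
        unexpected = sorted(set(df['dataset'].dropna().unique()) - {'RAD'})
        if unexpected:
            problems.append(f'unexpected dataset values: {unexpected}')
    return problems


# Hand-made frames (hypothetical) that mimic the two kinds of table shown above
rad_like = pd.DataFrame([{col: None for col in EXPECTED_COLUMNS}])
rad_like['dataset'] = 'RAD'
fc_like = pd.DataFrame({'citation': ['2001 FCT 1'], 'dataset': ['FC']})

print(check_rad_frame(rad_like))
print(check_rad_frame(fc_like))
```

Calling `check_rad_frame(df)` right after any of the loading options gives a short report instead of a silent surprise later in the analysis.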


In [None]:
# OPTION 3: Load data remotely from GitHub without cloning repo
# Note: load time varies depending on internet connection (approx 1.5 GB of data for all years/languages)
# This is the slowest loading option.

import pandas as pd
import json
import requests

# Set variables
start_year = 2010  # First year of data sought (2010 +)
end_year = 2023  # Last year of data sought (2023 -)
languages_sought = ['en', 'fr']  # languages in list e.g. ['en', 'fr'] or ['en']


base_url = 'https://raw.githubusercontent.com/Refugee-Law-Lab/rad_bulk_data/master/DATA/YEARLY/'

# load data
results = []
for year in range(start_year, end_year+1):
    for language in languages_sought:
        url = base_url + f'{year}_{language}.json'
        response = requests.get(url)
        response.raise_for_status()  # stop with a clear error if a file is missing
        results.extend(response.json())

# convert to dataframe
df = pd.DataFrame(results)

df
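
The Option 3 loop requests one file per year and language, so a single missing file interrupts the whole load. The sketch below is one possible variation, not the author's method: it uses the standard library's `urllib` in place of `requests`, skips files that return HTTP 404, and reports which URLs were skipped. The names `fetch_json`, `load_years`, and `fake_fetch` are hypothetical, and the demonstration at the bottom runs offline with stand-in data.

```python
import json
import urllib.error
import urllib.request

BASE_URL = 'https://raw.githubusercontent.com/Refugee-Law-Lab/rad_bulk_data/master/DATA/YEARLY/'


def fetch_json(url):
    """Download and parse one JSON file; return None when the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=60) as response:
            return json.loads(response.read().decode('utf-8'))
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None
        raise


def load_years(start_year, end_year, languages, fetch=fetch_json):
    """Collect records for each year/language file, skipping files that do not exist."""
    records, skipped = [], []
    for year in range(start_year, end_year + 1):
        for language in languages:
            url = f'{BASE_URL}{year}_{language}.json'
            data = fetch(url)
            if data is None:
                skipped.append(url)
            else:
                records.extend(data)
    return records, skipped


# Offline demonstration with a stand-in fetcher (hypothetical data, no network access)
def fake_fetch(url):
    return [{'source_url': url}] if '2013_' in url else None


records, skipped = load_years(2012, 2013, ['en', 'fr'], fetch=fake_fetch)
print(len(records), len(skipped))
```

With the default `fetch`, `pd.DataFrame(records)` would then build the same kind of dataframe as the cell above.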

In [1]:
# OPTION 4: Load data locally via cloned repo

# First, clone git repo
# Then run this code to load data

import pandas as pd
import json
import pathlib

# Set variables
start_year = 2010  # First year of data sought (2010 +)
end_year = 2023  # Last year of data sought (2023 -)
languages_sought = ['en', 'fr']  # languages in list e.g. ['en', 'fr'] or ['en']

# Set path to data
data_path = pathlib.Path('DATA/YEARLY/')

# load data
results = []
for year in range(start_year, end_year+1):
    for language in languages_sought:
        with open(data_path / f'{year}_{language}.json', encoding='utf-8') as f:
            results.extend(json.load(f))

# convert to dataframe
df = pd.DataFrame(results)

df


Unnamed: 0,citation1,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,TB9-07773,,RAD,,3229926.txt,2023-11-13,2010-01-24,2010,\nRAD File / Dossier de la SAR : TB9-07773\nTB...,en,"{""decision-maker_name"": ""Kim Polowek""}"
1,TB9-07773,,RAD,,3229927.txt,2023-11-13,2010-01-24,2010,\nDossier de la SAR / RAD File: TB9-07773\nTB9...,fr,"{""decision-maker_name"": ""Kim Polowek""}"
2,MB9-27420,,RAD,,3546053.txt,2023-11-13,2011-01-12,2011,\nRAD File No. / No de dossier de la SAR : MB9...,en,"{""decision-maker_name"": ""Marie-Lyne Thibault""}"
3,TB3-03406,,RAD,,1355235.txt,2023-11-12,2013-07-31,2013,\n\n\n\n\n\n\nRAD File No. / N° de dossier de ...,en,"{""decision-maker_name"": ""Edward Bosveld""}"
4,VB3-01099,,RAD,,1360268.txt,2023-11-12,2013-08-30,2013,\n\n\n\tRAD File No. / N° de dossier de la SAR...,en,"{""decision-maker_name"": null}"
...,...,...,...,...,...,...,...,...,...,...,...
26354,VC1-04668,,RAD,,3583931.txt,2023-11-13,2022-04-29,2022,\nDossier de la SAR / RAD File : VC1-04668\nVC...,fr,"{""decision-maker_name"": ""Me Kristine Plouffe-M..."
26355,MC1-07673,,RAD,,3596070.txt,2023-11-13,2022-02-25,2022,\nDossier de la SAR / RAD File: MC1-07673\n\nH...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
26356,MC1-06135,,RAD,,3599962.txt,2023-11-13,2022-01-28,2022,\nDossier de la SAR / RAD File : MC1-06135\n\n...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
26357,MC2-03068,,RAD,,MC2-03068 - Final February 2023.txt,2023-11-12,2023-02-02,2023,\nRAD File / Dossier de la SAR : MC2-03068\nMC...,en,"{""decision-maker_name"": ""Me Martine Durocher""}"


### Analyze Data

In [2]:
# View dataframe
df.head()

Unnamed: 0,citation1,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,TB9-07773,,RAD,,3229926.txt,2023-11-13,2010-01-24,2010,\nRAD File / Dossier de la SAR : TB9-07773\nTB...,en,"{""decision-maker_name"": ""Kim Polowek""}"
1,TB9-07773,,RAD,,3229927.txt,2023-11-13,2010-01-24,2010,\nDossier de la SAR / RAD File: TB9-07773\nTB9...,fr,"{""decision-maker_name"": ""Kim Polowek""}"
2,MB9-27420,,RAD,,3546053.txt,2023-11-13,2011-01-12,2011,\nRAD File No. / No de dossier de la SAR : MB9...,en,"{""decision-maker_name"": ""Marie-Lyne Thibault""}"
3,TB3-03406,,RAD,,1355235.txt,2023-11-12,2013-07-31,2013,\n\n\n\n\n\n\nRAD File No. / N° de dossier de ...,en,"{""decision-maker_name"": ""Edward Bosveld""}"
4,VB3-01099,,RAD,,1360268.txt,2023-11-12,2013-08-30,2013,\n\n\n\tRAD File No. / N° de dossier de la SAR...,en,"{""decision-maker_name"": null}"
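
In the table above, the `other` column appears to hold a JSON string with a `decision-maker_name` key, sometimes with a `null` value. Assuming that is how the column is stored, the sketch below shows one way to pull the name into its own column. The helper name `decision_maker` and the small `sample` frame are hypothetical; the sample values are copied from the rows displayed in this notebook.

```python
import json

import pandas as pd


def decision_maker(other):
    """Extract 'decision-maker_name' from a JSON string; return None if absent or unparsable."""
    if not isinstance(other, str) or not other.strip():
        return None
    try:
        parsed = json.loads(other)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get('decision-maker_name')


# Small stand-in frame using values shown in the tables of this notebook
sample = pd.DataFrame({
    'citation1': ['TB9-07773', 'VB3-01099', 'MB7-00112'],
    'other': [
        '{"decision-maker_name": "Kim Polowek"}',
        '{"decision-maker_name": null}',
        None,
    ],
})

sample['decision_maker'] = sample['other'].apply(decision_maker)
print(sample[['citation1', 'decision_maker']])
```

The same `apply` call on `df['other']` would add the column to the full dataframe; rows with an empty or missing `other` value simply get no name.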


In [3]:
df.tail()

Unnamed: 0,citation1,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
26354,VC1-04668,,RAD,,3583931.txt,2023-11-13,2022-04-29,2022,\nDossier de la SAR / RAD File : VC1-04668\nVC...,fr,"{""decision-maker_name"": ""Me Kristine Plouffe-M..."
26355,MC1-07673,,RAD,,3596070.txt,2023-11-13,2022-02-25,2022,\nDossier de la SAR / RAD File: MC1-07673\n\nH...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
26356,MC1-06135,,RAD,,3599962.txt,2023-11-13,2022-01-28,2022,\nDossier de la SAR / RAD File : MC1-06135\n\n...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
26357,MC2-03068,,RAD,,MC2-03068 - Final February 2023.txt,2023-11-12,2023-02-02,2023,\nRAD File / Dossier de la SAR : MC2-03068\nMC...,en,"{""decision-maker_name"": ""Me Martine Durocher""}"
26358,MC2-03068,,RAD,,MC2-03068tf.txt,2023-11-12,2023-02-02,2023,\nDossier de la SAR / RAD File: MC2-03068\nMC2...,fr,"{""decision-maker_name"": ""Me Martine Durocher""}"


In [4]:
# language counts
df['language'].value_counts()

language
en    13225
fr    13134
Name: count, dtype: int64
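
The English and French counts are close, and the tables above show the same file number appearing once per language, which suggests that many decisions are present in both an original and a translated version. The sketch below shows one way to check this per file number. The `sample` frame is a hypothetical excerpt built from the first five rows of the Option 1 table, so the result says nothing about the full dataset.

```python
import pandas as pd

# Five-row excerpt copied from the Option 1 table (illustration only)
sample = pd.DataFrame({
    'citation1': ['MB7-00112', 'MB7-00112', 'MB7-03926', 'MB7-03926', 'MB7-24221'],
    'language': ['en', 'fr', 'fr', 'en', 'fr'],
})

# Number of distinct languages available for each file number
n_languages = sample.groupby('citation1')['language'].nunique()

both = n_languages[n_languages == 2].index.tolist()
single = n_languages[n_languages < 2].index.tolist()

print('In both languages:', both)
print('In one language only:', single)
```

Replacing `sample` with `df` applies the same grouping to the loaded data.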

In [5]:
# Yearly counts
year_counts = df.year.value_counts()
years_count = sorted(year_counts.index)
for year_count in years_count:
    print(f'{year_count}: {year_counts[year_count]}')


2010: 2
2011: 1
2013: 758
2014: 2569
2015: 3456
2016: 2227
2017: 1005
2018: 1984
2019: 4575
2020: 6267
2021: 3384
2022: 129
2023: 2
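
The language counts and yearly counts can also be combined into a single year-by-language table with `pd.crosstab`. The sketch below uses a small hypothetical `sample` frame (values taken from the first five rows of the Option 1 table) so that it runs on its own.

```python
import pandas as pd

# Five-row excerpt copied from the Option 1 table (illustration only)
sample = pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2021],
    'language': ['en', 'fr', 'fr', 'en', 'fr'],
})

# Rows are years (sorted ascending), columns are languages, cells are row counts
table = pd.crosstab(sample['year'], sample['language'])
print(table)
```

Running `pd.crosstab(df['year'], df['language'])` on the loaded dataframe would show, for each year, how many English and French documents are present.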


In [38]:
# select 20 random rows from df, iterate through them and print unofficial text
import random

# select 20 random rows from df
random.seed(999)

random_rows = random.sample(range(0, len(df)), 20)

# iterate through random rows and print metadata followed by unofficial text
for row in random_rows:
    record = df.iloc[row]
    print('##################################')
    for column in ['citation1', 'source_url', 'document_date', 'year', 'language']:
        print(record[column])
    print('##################################')
    print('\n')  # two blank lines
    print(record['unofficial_text'])
    print('\n' * 5)  # six blank lines
    print('____________________________________________________________________________')
    print('\n' * 5)  # six blank lines


##################################
TB8-26726
3581514.txt
2020-10-09
2020
en
##################################



RAD File / Dossier de la SAR : TB8-26726

Private Proceeding / Huis clos

Reasons and decision ? Motifs et décision

Person who is the subject of the appeal
XXXX XXXX XXXX
Personne en cause



Appeal considered / heard at 
Montreal, QC
Appel instruit / entendu à



Date of decision 
9 October 2020
Date de la décision 



Panel
Miriam McLeod
Tribunal



Counsel for the person who is the subject of the appeal
Marianne B Lithwick
Conseil de la personne en cause



Designated representative
N/A
Représentant(e) désigné(e)



Counsel for the Minister
N/A
Conseil du ministre





REASONS FOR DECISION
OVERVIEW 
[1] XXXX XXXX XXXX (Appellant) is a citizen of Nigeria who alleges that he fears members of the Black Axe and the Neo Black Movement (NBM) because he was an anti-cult activist at his university and he cooperated with the police to bring cult members suspected of violent crim