### Notebook to load and analyze Refugee Protection Division (Legacy) cases

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). 

Dataset & Code to be cited as:

Sean Rehaag, "Refugee Protection Division (Legacy) Bulk Decisions Dataset" (2023), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/rpd/>.

Notes:

(1) Data Source: In the fall of 2022 the Immigration and Refugee Board provided the RLL with a full backlog of approximately 116k published decisions from all divisions (RAD, RPD, ID, IAD). Because the IRB no longer regularly publishes RPD decisions, the dataset is no longer being updated, which is why we refer to the 
dataset as a legacy dataset. For more recent RPD decisions (obtained via Access to Information Requests), 
see the RLLR dataset.

(2) Unofficial Data: The data are unofficial reproductions. For official versions, please contact the Immigration and Refugee Board. 

(3) Non-Affiliation / Endorsement: The data has been collected and reproduced without any affiliation or endorsement from the Immigration and Refugee Board.

(4) Non-Commerical Use: As indicated in the license, data may be used for non-commercial use (with attribution) only. For commercial use, please contact the Immigration and Refugee Board. 

(5) Accuracy: Data was collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such the data should be viewed as preliminary information aimed to prompt further research and discussion, rather than as definitive information.

Acknowledgements: Thanks to Rafael Dolores who coded the initial parsing scripts for the Refugee Appeal Division Bulk Decisions Dataset, which were modified for this datset.

### Requirements:

    pip install pandas

### If using parquet

    pip install pyarrow

### if loading remotely (other than via Hugging Face)
    
    pip install requests

### If loading remotely via Hugging Face

    pip install datasets
    

(Written on Python 3.9.12)

### Load Data

Four Options: Local & Remote

In [12]:
# OPTION 1: Load Hugging Face dataset

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("refugee-law-lab/canadian-legal-data", split="train", data_dir="RAD")

# convert to dataframe
df = pd.DataFrame(dataset)

df


Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,MB7-00112,,RAD,,MB7-00112a.txt,2023-11-12,2020-09-15,2020,\nRAD File / Dossier de la SAR : MB7-00112\nPr...,en,
1,MB7-00112,,RAD,,MB7-00112tf.txt,2023-11-12,2020-09-15,2020,\nDossier de la SAR / RAD File: MB7-00112\nHui...,fr,
2,MB7-03926,,RAD,,MB7-03926f.txt,2023-11-12,2020-10-26,2020,\nDossier de la SAR / RAD File : MB7-03926\nMB...,fr,
3,MB7-03926,,RAD,,MB7-03926ta.txt,2023-11-12,2020-10-26,2020,\nRAD File No. / No de dossier de la SAR : MB7...,en,
4,MB7-24221,,RAD,,MB7-24221 f.txt,2023-11-12,2021-04-01,2021,\nDossier de la SAR / RAD File : MB7-24221\nHu...,fr,
...,...,...,...,...,...,...,...,...,...,...,...
27546,TC0-10861,,RAD,,3599965.txt,2023-11-13,2021-05-25,2021,\nDossier de la SAR / RAD File: TC0-10861\nHui...,fr,
27547,TB9-20994,,RAD,,3601134.txt,2023-11-13,2020-07-30,2020,\nRAD File / Dossier de la SAR : TB9-20994\nPr...,en,
27548,TB9-20994,,RAD,,3601135.txt,2023-11-13,2020-07-30,2020,\nDossier de la SAR / RAD File: TB9-20994\nHui...,fr,
27549,TC0-09230,,RAD,,3601140.txt,2023-11-13,2021-02-22,2021,\nRAD File / Dossier de la SAR : TC0-09230\nTC...,en,


In [11]:
# OPTION 2: Load parquet data remotely from Huggingface without cloning repo
import pandas as pd
import requests
from io import BytesIO

url = 'https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data/resolve/main/RAD/train.parquet'

# load data
results = requests.get(url)

# convert to dataframe
df = pd.read_parquet(BytesIO(results.content))

df

# (if code fails, add engine='pyarrow' to read_parquet() function)

Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,MB7-00112,,RAD,,MB7-00112a.txt,2023-11-12,2020-09-15,2020,\nRAD File / Dossier de la SAR : MB7-00112\nPr...,en,
1,MB7-00112,,RAD,,MB7-00112tf.txt,2023-11-12,2020-09-15,2020,\nDossier de la SAR / RAD File: MB7-00112\nHui...,fr,
2,MB7-03926,,RAD,,MB7-03926f.txt,2023-11-12,2020-10-26,2020,\nDossier de la SAR / RAD File : MB7-03926\nMB...,fr,
3,MB7-03926,,RAD,,MB7-03926ta.txt,2023-11-12,2020-10-26,2020,\nRAD File No. / No de dossier de la SAR : MB7...,en,
4,MB7-24221,,RAD,,MB7-24221 f.txt,2023-11-12,2021-04-01,2021,\nDossier de la SAR / RAD File : MB7-24221\nHu...,fr,
...,...,...,...,...,...,...,...,...,...,...,...
27546,TC0-10861,,RAD,,3599965.txt,2023-11-13,2021-05-25,2021,\nDossier de la SAR / RAD File: TC0-10861\nHui...,fr,
27547,TB9-20994,,RAD,,3601134.txt,2023-11-13,2020-07-30,2020,\nRAD File / Dossier de la SAR : TB9-20994\nPr...,en,
27548,TB9-20994,,RAD,,3601135.txt,2023-11-13,2020-07-30,2020,\nDossier de la SAR / RAD File: TB9-20994\nHui...,fr,
27549,TC0-09230,,RAD,,3601140.txt,2023-11-13,2021-02-22,2021,\nRAD File / Dossier de la SAR : TC0-09230\nTC...,en,


In [13]:
# OPTION 3: Load data remotely from GitHub without cloning repo
# Note: load time varies depending on internet connection (approx 600 MB of data for all years/languages)
# This is the slowest loading option.

import pandas as pd
import json
import requests

# Set variables
start_year = 2002  # First year of data sought (2002 +)
end_year = 2020  # Last year of data sought (2020 -)
languages_sought = ['en', 'fr']  # languages in list e.g. ['en', 'fr'] or ['en']


base_ulr = 'https://raw.githubusercontent.com/Refugee-Law-Lab/rpd_bulk_data/master/DATA/YEARLY/'

# load data
results = []
for year in range(start_year, end_year+1):
    for language in languages_sought:
        url = base_ulr + f'{year}_{language}.json'
        results.extend(requests.get(url).json())

# convert to dataframe
df = pd.DataFrame(results)

df

Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,TB3-03406,,RAD,,1355235.txt,2023-11-12,2013-07-31,2013,\nRAD File No. / N° de dossier de la SAR :\nTB...,en,"{""decision-maker_name"": ""Edward Bosveld""}"
1,VB3-01099,,RAD,,1360268.txt,2023-11-12,2013-08-30,2013,\nRAD File No. / N° de dossier de la SAR : VB3...,en,"{""decision-maker_name"": null}"
2,TB3-03367,,RAD,,1383484.txt,2023-11-12,2013-08-19,2013,\nRAD File No. / N° de dossier de la SAR :\nTB...,en,"{""decision-maker_name"": ""Ken Atkinson""}"
3,TB3-03646,,RAD,,1383490.txt,2023-11-12,2013-09-04,2013,\nRAD File No. / N° de dossier de la SAR : TB3...,en,"{""decision-maker_name"": ""Ken Atkinson""}"
4,TB3-05167,,RAD,,1383494.txt,2023-11-12,2013-08-30,2013,\nRAD File No. / N° de dossier de la SAR :TB3-...,en,"{""decision-maker_name"": ""Ken Atkinson""}"
...,...,...,...,...,...,...,...,...,...,...,...
27546,VC1-04668,,RAD,,3583931.txt,2023-11-13,2022-04-29,2022,\nDossier de la SAR / RAD File : VC1-04668\nVC...,fr,"{""decision-maker_name"": ""Me Kristine Plouffe-M..."
27547,MC1-07673,,RAD,,3596070.txt,2023-11-13,2022-02-25,2022,\nDossier de la SAR / RAD File: MC1-07673\nHui...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
27548,MC1-06135,,RAD,,3599962.txt,2023-11-13,2022-01-28,2022,\nDossier de la SAR / RAD File : MC1-06135\nHu...,fr,"{""decision-maker_name"": ""Daphnée Ouellet""}"
27549,MC2-03068,,RAD,,MC2-03068 - Final February 2023.txt,2023-11-12,2023-02-02,2023,\nRAD File / Dossier de la SAR : MC2-03068\nMC...,en,"{""decision-maker_name"": ""Me Martine Durocher""}"


In [1]:
# OPTION 4: Load data locally via cloned repo

# First, clone git repo
# Then run this code to load data

import pandas as pd
import json
import pathlib

# Set variables
start_year = 2002  # First year of data sought (2002 +)
end_year = 2020  # Last year of data sought (2020 -)
languages_sought = ['en', 'fr']  # languages in list e.g. ['en', 'fr'] or ['en']

# Set path to data
data_path = pathlib.Path('DATA/YEARLY/')

# load data
results = []
for year in range(start_year, end_year+1):
    for language in languages_sought:
        with open(data_path / f'{year}_{language}.json') as f:
            results.extend(json.load(f))

# convert to dataframe
df = pd.DataFrame(results)

df


Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,MA1-12669,,RPD,,486359.txt,2023-11-13,2002-08-22,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
1,AA0-01604,,RPD,,638019.txt,2023-11-13,2002-11-08,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
2,AA1-00073,,RPD,,638027.txt,2023-11-13,2002-11-06,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
3,AA1-00454,,RPD,,638041.txt,2023-11-13,2002-11-01,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
4,AA1-01289,,RPD,,638047.txt,2023-11-13,2002-10-17,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
...,...,...,...,...,...,...,...,...,...,...,...
12462,VB9-04335,,RPD,,3568407.txt,2023-11-13,2020-01-28,2020,\nDossier de la SPR / RPD File: VB9-04335/0434...,fr,"{""decision-maker_name"": ""Megan Kammerer""}"
12463,TB0-04269,,RPD,,3580919.txt,2023-11-13,2020-02-04,2020,\nNo de dossier de la SPR / RPD File No.: TB0-...,fr,"{""decision-maker_name"": ""H. ROSS""}"
12464,TB8-03337,,RPD,,3580931.txt,2023-11-13,2020-02-24,2020,\nNo de dossier de la SPR / RPD File No.: TB8-...,fr,"{""decision-maker_name"": ""A. Casimiro""}"
12465,MB9-01194,,RPD,,3581415.txt,2023-11-13,2020-07-27,2020,Dossier de la SPR / RPD File: MB9-01194\nHuis ...,fr,"{""decision-maker_name"": ""Michel Colin""}"


### Analyze Data

In [2]:
# View dataframe
df.head()

Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
0,MA1-12669,,RPD,,486359.txt,2023-11-13,2002-08-22,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
1,AA0-01604,,RPD,,638019.txt,2023-11-13,2002-11-08,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
2,AA1-00073,,RPD,,638027.txt,2023-11-13,2002-11-06,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
3,AA1-00454,,RPD,,638041.txt,2023-11-13,2002-11-01,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"
4,AA1-01289,,RPD,,638047.txt,2023-11-13,2002-10-17,2002,Immigration and Refugee Board\nRefugee Protect...,en,"{""decision-maker_name"": null}"


In [3]:
df.tail()

Unnamed: 0,citation,citation2,dataset,name,source_url,scraped_timestamp,document_date,year,unofficial_text,language,other
12462,VB9-04335,,RPD,,3568407.txt,2023-11-13,2020-01-28,2020,\nDossier de la SPR / RPD File: VB9-04335/0434...,fr,"{""decision-maker_name"": ""Megan Kammerer""}"
12463,TB0-04269,,RPD,,3580919.txt,2023-11-13,2020-02-04,2020,\nNo de dossier de la SPR / RPD File No.: TB0-...,fr,"{""decision-maker_name"": ""H. ROSS""}"
12464,TB8-03337,,RPD,,3580931.txt,2023-11-13,2020-02-24,2020,\nNo de dossier de la SPR / RPD File No.: TB8-...,fr,"{""decision-maker_name"": ""A. Casimiro""}"
12465,MB9-01194,,RPD,,3581415.txt,2023-11-13,2020-07-27,2020,Dossier de la SPR / RPD File: MB9-01194\nHuis ...,fr,"{""decision-maker_name"": ""Michel Colin""}"
12466,TB3-09544,,RPD,,3594187.txt,2023-11-13,2020-02-04,2020,\nNo de dossier de la SPR / SPR File No.: TB3-...,fr,"{""decision-maker_name"": ""K. Khamsi""}"


In [4]:
# language counts
df['language'].value_counts()

language
fr    6239
en    6228
Name: count, dtype: int64

In [5]:
# Yearly counts
year_counts = df.year.value_counts()
years_count = sorted(year_counts.index)
for year_count in years_count:
    print(f'{year_count}: {year_counts[year_count]}')


2002: 121
2003: 238
2004: 135
2005: 551
2006: 850
2007: 872
2008: 744
2009: 938
2010: 1265
2011: 1557
2012: 1639
2013: 1262
2014: 897
2015: 455
2016: 194
2017: 255
2018: 340
2019: 133
2020: 21


In [6]:
# select 5 random rows from df_unique, iterate through them and print unofficial text
import random

random.seed(999)

random_rows = random.sample(range(0, len(df)), 5)

# iterate through random rows and print unofficial text
for row in random_rows:
    print('##################################')
    print(df.iloc[row]['citation'])
    print(df.iloc[row]['source_url'])
    print(df.iloc[row]['document_date'])
    print(df.iloc[row]['year'])
    print(df.iloc[row]['language'])
    print('##################################')
    print()
    print(df.iloc[row]['unofficial_text'])
    print()
    print()
    print('____________________________________________________________________________')
    print()
    print()
    print()


##################################
TB3-09387
2113722.txt
2015-02-02
2015
en
##################################

RPD File No. / N° de dossier de la SPR : TB3-09387
Private Proceeding / Huis clos
Reasons and Decision ? Motifs et Décision
Claimant(s)
XXXX XXXX XXXX
Demandeur(e)(s) d'asile
Date(s) of Hearing
January 12, 2015
Date(s) de l'audience
Place of Hearing
Toronto, Ontario
Lieu de l'audience
Date of Decision
February 2, 2015
Date de la décision
and reasons
et des motifs
Panel
S. Alidina
Tribunal
Counsel for the Claimant(s)
Shelley Levine
Barrister and Solicitor
Conseil(s) du (de la/des) demandeur(e)(s) d'asile
Designated Representative(s)
N/A
Représentant(e)(s) désigné(e)(s)
Counsel for the Minister
N/A
Conseil du (de la) ministre
REASONS FOR DECISION
[1] XXXX XXXX XXXX is a citizen of China. He claims for refugee protection pursuant to sections 96, 97(1)(a) and 97(1)(b) of the Immigration and Refugee Protection Act (IRPA).1
ALLEGATIONS
[2] The claimant alleges that around XXXX 2010