### Notebook to parse text files to produce cleaned text of RAD decisions

Sean Rehaag

License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). 

Dataset & Code to be cited as:

Sean Rehaag, "Refugee Appeal Division Bulk Decisions Dataset" (2023), online: Refugee Law Laboratory <https://refugeelab.ca/bulk-data/rad/>.

Notes:

(1) Data Source: Immigration and Refugee Board (IRB). In the fall of 2022, the IRB added the Refugee Law Laboratory (RLL) to its email distribution list for legal publishers of RAD decisions. The RLL therefore receives new RAD cases as the IRB releases them for publication. Also in the fall of 2022, the IRB provided the RLL with a full backlog of approximately 116,000 published decisions from all divisions (RAD, RPD, ID, IAD).

(2) Unofficial Data: The data are unofficial reproductions. For official versions, please contact the Immigration and Refugee Board. 

(3) Non-Affiliation / Endorsement: The data have been collected and reproduced without any affiliation with or endorsement from the Immigration and Refugee Board.

(4) Non-Commercial Use: As indicated in the license, data may be used for non-commercial purposes only (with attribution). For commercial use, please contact the Immigration and Refugee Board.

(5) Accuracy: Data were collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.

Acknowledgements: Thanks to Rafael Dolores for coding the parsing scripts.


# Installing Libraries

In [1]:
!pip install langdetect
!pip install regex



# Importing Libraries

In [2]:
import os
import regex as re 
import pandas as pd
from datetime import datetime
from langdetect import detect
import json

## Declaring Constant
Here, we specify the directory containing our data files.

In [3]:
DATA_DIR = "DATA"

## Language Detection
This function determines the language of a given text.

In [4]:
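def detect_language(text):
    # langdetect raises an exception on empty or featureless text (e.g. only digits);
    # treat any such failure as "language unknown".
    try:
        return detect(text)
    except Exception:
        return None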

## Decision Maker Extraction
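This function searches the text of a decision for the decision maker's name using regular expressions. It returns the text found between the "Panel" and "Tribunal" labels, in either order, or None if neither pattern matches.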

In [5]:
def extract_decision_maker(content):
    patterns = [
        r"Panel\s*([\w\s\.-]+?)\s*\b(?!Panel|Tribunal)\b\s*Tribunal",
        r"Tribunal\s*([\w\s\.-]+?)\s*\b(?!Panel|Tribunal)\b\s*Panel"
    ]
    for pattern in patterns:
        match = re.search(pattern, content, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None
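
As a quick check of how these patterns behave, the sketch below reruns the same two expressions on made-up header text. It assumes the label/name/label layout that the patterns imply, uses the standard-library `re` module instead of `regex` (an assumption: the patterns do not appear to rely on `regex`-only features), and all names are invented for illustration.

```python
import re  # standard library; assumed to behave like `regex` for these two patterns

PATTERNS = [
    r"Panel\s*([\w\s\.-]+?)\s*\b(?!Panel|Tribunal)\b\s*Tribunal",
    r"Tribunal\s*([\w\s\.-]+?)\s*\b(?!Panel|Tribunal)\b\s*Panel",
]

def extract_decision_maker(content):
    # Same logic as the notebook function, copied so this example runs on its own.
    for pattern in PATTERNS:
        match = re.search(pattern, content, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None

# Invented sample headers (not taken from real decisions).
english_header = "Panel\nJane Doe\nTribunal"
french_header = "Tribunal\nJean-Luc Tremblay\nPanel"
apostrophe_header = "Panel\nKate O'Brien\nTribunal"

english_result = extract_decision_maker(english_header)
french_result = extract_decision_maker(french_header)
apostrophe_result = extract_decision_maker(apostrophe_header)

print(english_result)     # Jane Doe
print(french_result)      # Jean-Luc Tremblay
print(apostrophe_result)  # None
```

The last case suggests one limitation worth keeping in mind: the character class allows letters, digits, whitespace, periods and hyphens, so a name containing an apostrophe is not captured.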

## Document Date Extraction
This function searches the text of a decision for the decision date using a regular expression that covers both English and French date formats. It returns the date as a YYYY-MM-DD string, or None if no date is found.

In [6]:
def extract_document_date(content):
    french_month_mapping = {
        'janvier': 1, 'fevrier': 2, 'mars': 3, 'avril': 4,
        'mai': 5, 'juin': 6, 'juillet': 7, 'aout': 8,
        'septembre': 9, 'octobre': 10, 'novembre': 11, 'decembre': 12
    }

    # Capture both types of date formats, with a potential "Le" or "1er" for the French format.
    pattern = r"Date (?:of decision|de la décision)\s+(?:Le )?((?:\d{1,2}|1er) [\w]+ \d{4}|\w+ \d{1,2}(?:,)? \d{4})"
    match = re.search(pattern, content, re.IGNORECASE)
    if not match:
        return None

    parts = match.group(1).split(' ')
    try:
        if parts[0].isdigit() or parts[0] == '1er':  # Day first, can be French or English
            day, month, year = parts
            day = 1 if day == '1er' else int(day)  # Handle '1er' case
            # Strip accents so that 'février', 'août' and 'décembre' match the mapping keys
            month = month.lower().replace('é', 'e').replace('û', 'u').replace('ô', 'o')
            if month in french_month_mapping:
                parsed = datetime(int(year), french_month_mapping[month], day)
            else:
                # Consider this as English and pass it to datetime.strptime
                parsed = datetime.strptime(f"{month} {day} {year}", '%B %d %Y')
        else:  # Month first, English format
            month, day, year = parts[0], parts[1].replace(',', ''), parts[2]
            parsed = datetime.strptime(f"{month} {day} {year}", '%B %d %Y')
    except ValueError:
        # Unrecognised month name or impossible date: skip rather than stop the whole run
        return None

    return parsed.strftime('%Y-%m-%d')
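
One caveat with the approach above is that `%B` in `datetime.strptime` depends on the active locale. The following is a minimal alternative sketch, not the method used in this notebook, that resolves English and French month names from an explicit table. The names `MONTHS`, `strip_accents` and `parse_date_text` are hypothetical, and the function expects only the captured date text (for example "1er février 2023"), not the whole decision.

```python
import unicodedata
from datetime import date

# Hypothetical lookup table: unaccented, lower-case month names in both languages.
MONTHS = {
    'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5, 'june': 6,
    'july': 7, 'august': 8, 'september': 9, 'october': 10, 'november': 11,
    'december': 12,
    'janvier': 1, 'fevrier': 2, 'mars': 3, 'avril': 4, 'mai': 5, 'juin': 6,
    'juillet': 7, 'aout': 8, 'septembre': 9, 'octobre': 10, 'novembre': 11,
    'decembre': 12,
}

def strip_accents(text):
    # Decompose accented letters and drop the combining marks (é -> e, û -> u).
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

def parse_date_text(date_text):
    """Return 'YYYY-MM-DD' for day-first or month-first date text, else None."""
    parts = date_text.replace(',', ' ').split()
    if len(parts) != 3:
        return None
    first, second, year_text = parts
    if first.isdigit() or first.lower() == '1er':
        day_text, month_text = first, second   # e.g. "15 March 2021"
    else:
        month_text, day_text = first, second   # e.g. "March 15, 2021"
    month = MONTHS.get(strip_accents(month_text).lower())
    if month is None or not year_text.isdigit():
        return None
    if day_text.lower() == '1er':
        day = 1
    elif day_text.isdigit():
        day = int(day_text)
    else:
        return None
    try:
        return date(int(year_text), month, day).isoformat()
    except ValueError:  # impossible date such as 31 February
        return None

print(parse_date_text("1er février 2023"))  # 2023-02-01
print(parse_date_text("February 1, 2023"))  # 2023-02-01
print(parse_date_text("12 août 2022"))      # 2022-08-12
print(parse_date_text("31 février 2023"))   # None
```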

## Formatting year with a comma
This function formats the year with a thousands separator (for example, 2000 becomes "2,000"). It returns a string, or None when no year is available.

In [7]:
def format_year(year):
    if year:
        return '{:,}'.format(year)
    return None
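
Because the formatted year is a string rather than an integer, anyone filtering or sorting by year will likely need to convert it back. The sketch below repeats `format_year` so that it runs on its own and adds a hypothetical helper, `parse_formatted_year`, that is not part of the notebook.

```python
def format_year(year):
    # Copy of the notebook function, repeated so this example is self-contained.
    if year:
        return '{:,}'.format(year)
    return None

def parse_formatted_year(text):
    """Hypothetical helper: undo format_year, e.g. '2,023' -> 2023."""
    return int(text.replace(',', '')) if text else None

print(format_year(2023))                        # 2,023
print(format_year(None))                        # None
print(parse_formatted_year(format_year(2023)))  # 2023
```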

## Processing Files
This block reads each file in the DATA directory, extracts the needed information using the functions defined above, and collects one record per file. The records are converted to a pandas DataFrame and written to a CSV file in the next step.

In [8]:
data_records = []

for file_name in os.listdir(DATA_DIR):
    with open(os.path.join(DATA_DIR, file_name), 'r', errors='replace') as file:
        content = file.read()
        rad_number = None
        for line in content.splitlines():
            if "RAD File" in line:
                rad_number_match = re.search(r"([A-Z]{2}\d+-\d+)", line)
                if rad_number_match:
                    rad_number = rad_number_match.group(1)
                    break
        lang = detect_language(content)
        decision_maker_name = extract_decision_maker(content)
        document_date = extract_document_date(content)
        year = int(document_date.split('-')[0]) if document_date else None
        formatted_year = format_year(year)

        data_records.append({
            'citation1': rad_number,
            'citation2': '',
            'dataset': 'RAD',
            'name': '',
            'source_url': file_name,
            'scraped_timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'document_date': document_date,
            'year': formatted_year,
            'unofficial_text': '',
            'language': lang,
            'other': json.dumps({'decision-maker_name': decision_maker_name}, ensure_ascii=False),
        })
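
The file-number lookup inside the loop can be isolated for testing. The sketch below wraps the same regular expression in a hypothetical helper, `find_rad_number`, and runs it on invented header text. The file number shown is made up, and the standard-library `re` module stands in for `regex`.

```python
import re

def find_rad_number(content):
    """Return the first file number on a line mentioning "RAD File", else None."""
    for line in content.splitlines():
        if "RAD File" in line:
            match = re.search(r"([A-Z]{2}\d+-\d+)", line)
            if match:
                return match.group(1)
    return None

# Invented examples: the number on the same line as the label, and on the next line.
same_line = "RAD File No.: TB9-12345\nPrivate Proceeding"
next_line = "RAD File No.\nTB9-12345"

same_line_result = find_rad_number(same_line)
next_line_result = find_rad_number(next_line)

print(same_line_result)  # TB9-12345
print(next_line_result)  # None
```

The second case shows that a file number is only found when it shares a line with the "RAD File" label, so files laid out differently end up with an empty `citation1`.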

## Placing output in CSV file

In [9]:
df = pd.DataFrame(data_records)
df['document_date'] = pd.to_datetime(df['document_date']).dt.strftime('%Y-%m-%d')
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
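
For readers consuming the CSV, the `other` column holds a JSON string and `year` holds a comma-formatted string, so both need decoding. The following sketch round-trips one invented row using only the standard library (`csv` in place of pandas, purely to keep the example dependency-free); the row values are made up and only a subset of the columns is shown.

```python
import csv
import io
import json

# Invented row mirroring some of the columns written above.
sample_row = {
    'citation1': 'TB9-12345',
    'document_date': '2021-03-15',
    'year': '2,021',
    'language': 'en',
    'other': json.dumps({'decision-maker_name': 'Jane Doe'}, ensure_ascii=False),
}

# Write the row to an in-memory CSV, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(sample_row))
writer.writeheader()
writer.writerow(sample_row)
buffer.seek(0)

rows = list(csv.DictReader(buffer))
first = rows[0]

decision_maker = json.loads(first['other'])['decision-maker_name']  # decode JSON column
year = int(first['year'].replace(',', ''))                          # '2,021' -> 2021

print(decision_maker)  # Jane Doe
print(year)            # 2021
```

When reading the real `output.csv` outside pandas, opening it with `encoding='utf-8-sig'` avoids a stray byte-order mark in the first column name.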