# Train Federal Refugee JR Stay Docket Parsing Models

Code to train OpenAI models to assist with parsing online Federal Court dockets involving stays of removal.

Requires an OpenAI account and an OpenAI API key, which can be obtained here:

https://openai.com/api/

NOTE: For the published article, OpenAI's ada model was used. However, as of January 2024, that model has been deprecated and the process for submitting fine-tunes has been revised. For further details, see the final markdown cell.

### Setup

In [1]:
import pandas as pd
import json
from openai import OpenAI
client = OpenAI()
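Called with no arguments, `OpenAI()` reads the key from the `OPENAI_API_KEY` environment variable. A minimal sketch of a check to run before creating the client follows; `require_api_key` is a hypothetical helper for illustration, not part of the OpenAI library.

```python
import os


def require_api_key(env=None):
    """Return the API key found in the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict is only for testing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable before creating the client."
        )
    return key
```

Calling `require_api_key()` at the top of the notebook would fail early with a readable message rather than at the first API call.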

In [2]:
# SETTINGS TO CHANGE

# Set paths for notice model
stay_notice_type_training_data = 'training_data/stays_notice_types_human_coded.xlsx'
stay_notice_types_output = 'training_data/stays_notice_types_human_coded.jsonl'

# Set paths for order models
stay_order_type_training_data = 'training_data/stays_order_types_human_coded.xlsx'
stay_order_types_output = 'training_data/stays_order_types_human_coded.jsonl'
stay_order_outcomes_output = 'training_data/stays_order_outcomes_human_coded.jsonl'

# Set paths for judges model
subfield_training_data = 'training_data/stays_judges_human_coded.xlsx'
judges_output = 'training_data/stays_judges_human_coded.jsonl'

### Get training docs for notice model


In [3]:
# Load Human Coded Data from Excel file
df_human_coding = pd.read_excel(stay_notice_type_training_data)
df_human_coding

Unnamed: 0,citation,text,RE_NO,DOCNO,Category,type,type_consolidated
0,IMM-6736-05,Notice of Motion contained within a Motion Rec...,10,8,notice,stay of removal,stay of removal
1,IMM-11783-12,Notice of Motion contained within a Motion Rec...,2,2,notice,stay of removal,stay of removal
2,IMM-3007-07,Avis de requête qui se trouve dans le dossier ...,6,2,notice,stay of removal,stay of removal
3,IMM-218-08,Notice of Motion contained within a Motion Rec...,3,3,notice,stay of removal,stay of removal
4,IMM-3350-17,Notice of Motion contained within a Motion Rec...,7,7,notice,stay of removal,stay of removal
...,...,...,...,...,...,...,...
303,IMM-3757-13,Notice of Motion contained within a Motion Rec...,4,4,notice,dismiss,other
304,IMM-6053-18,Notice of Motion contained within a Motion Rec...,61,30,notice,dismiss,other
305,IMM-104-20,Notice of Motion contained within a Motion Rec...,4,3,notice,stay of removal,stay of removal
306,IMM-3145-21,Notice of Motion contained within a Motion Rec...,5,3,notice,stay of removal,stay of removal


In [4]:
# produce training doc for notice types - CHAT STYLE

df_human_coding['completion'] = df_human_coding['type_consolidated'].apply(lambda x: "stay of removal" if x == "stay of removal" else "other")

system_message = """You are given a notice of motion. Return 'stay of removal' if the notice is about a stay of removal.
Otherwise return 'other'. Note that stays of release from detention are not stays of removal so should be categorized as other."""

# randomly shuffle df_human_coding with seed 42
df_human_coding = df_human_coding.sample(frac=1, random_state=42).reset_index(drop=True)

cases_to_compile = []
for x in range(len(df_human_coding)):
    case_dict={}
    case_dict['messages']=[
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": df_human_coding['text'].iloc[x].replace('\n','').replace('\t','').strip()
        },
        {
            "role": "assistant",
            "content": df_human_coding['completion'].iloc[x].strip()
        }
    ]
    cases_to_compile.append(case_dict)
print(len(cases_to_compile))

with open(stay_notice_types_output, 'w', encoding='utf8') as out_file:
    for d in cases_to_compile:
        out_file.write(json.dumps(d, ensure_ascii=False))
        out_file.write("\n")

308
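Before uploading, it may be worth confirming that every line of the JSONL file has the chat structure written above (a `messages` list ending with an assistant reply). The sketch below checks only that structure; it is not OpenAI's own validator, and `check_chat_jsonl_lines` is a hypothetical helper name.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def check_chat_jsonl_lines(lines):
    """Return (line_number, problem) tuples for chat-style training lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in VALID_ROLES:
                problems.append((number, "unexpected role"))
                break
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((number, "empty content"))
                break
        else:
            # Only reached when every message passed the checks above.
            if messages[-1]["role"] != "assistant":
                problems.append((number, "last message is not from the assistant"))
    return problems
```

An open file handle can be passed directly, for example `check_chat_jsonl_lines(open(stay_notice_types_output, encoding='utf8'))`; an empty list means no structural problems were found.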


In [5]:
# get counts of df_human_coding['completion']
df_human_coding['completion'].value_counts()

completion
stay of removal    234
other               74
Name: count, dtype: int64

In [None]:
client.files.create(
  file=open(stay_notice_types_output, "rb"),
  purpose="fine-tune"
)

In [None]:
# Get file id for training file from prior cell output

client.fine_tuning.jobs.create(
  training_file='file-IIFmdIhHSAK87EiddYBEQD65', 
  model="gpt-3.5-turbo",
  suffix = "stays_notice_types"
)

# Will receive an email with the model id when fine tuning is done
# Consult fine-tuning UI to check on progress and to monitor loss
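The two cells above rely on copying the file id by hand from one cell's output into the next. A sketch of chaining the two calls is shown below, assuming the uploaded file is accepted right away; `upload_and_start_job` is a hypothetical helper, and `client` is the `OpenAI()` client created in the setup cell.

```python
def upload_and_start_job(client, jsonl_path, suffix, base_model="gpt-3.5-turbo"):
    """Upload a training file and start a fine-tuning job on it.

    Returns (file_id, job_id) so both can be recorded in the notebook.
    """
    # Close the file handle as soon as the upload finishes.
    with open(jsonl_path, "rb") as handle:
        uploaded = client.files.create(file=handle, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,  # id taken from the upload response
        model=base_model,
        suffix=suffix,
    )
    return uploaded.id, job.id
```

For the notice model this would be called as `upload_and_start_job(client, stay_notice_types_output, "stays_notice_types")`.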

### Produce training docs for orders models


In [6]:
# Load Human Coded Data from Excel file
df_human_coding = pd.read_excel(stay_order_type_training_data)
df_human_coding

Unnamed: 0,citation,text,RE_NO,DOCNO,DOC_DT,Category,type,outcome
0,IMM-839-05,Order rendered by The Honourable Madam Justice...,17,13.0,2005-03-14T00:00:00,order,stay of removal,dismissed
1,IMM-473-16,"Order rendered by Mireille Tabib, Prothonotary...",23,,2016-02-24T00:00:00,order,extension,other
2,IMM-3376-08,Order rendered by The Honourable Mr. Justice P...,11,8.0,2008-07-31T00:00:00,order,stay of removal,dismissed
3,IMM-4803-15,Ordonnance rendu(e) par Monsieur le juge Harri...,24,16.0,2016-01-13T00:00:00,order,leave,granted
4,IMM-2507-08,Order rendered by The Honourable Mr. Justice C...,14,9.0,2008-06-05T00:00:00,order,stay of removal,granted
...,...,...,...,...,...,...,...,...
340,IMM-4767-09,Copy of Reasons for Judgment and Judgment date...,12,,2009-10-01T00:00:00,order,other,other
341,IMM-5863-15,Copy of Order dated 24-JAN-2017 rendered by Ma...,39,,2017-01-24T00:00:00,order,other,other
342,IMM-5445-10,Copy of Order dated 09-DEC-2010 rendered by Ro...,19,,2010-12-09T00:00:00,order,confidentiality,other
343,IMM-4283-21,"Order rendered by Kathleen Ring, Prothonotary ...",12,7.0,2021-08-10T00:00:00,order,extension,granted


In [7]:
# produce training doc for order types
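def order_type(x):
    # Keep the labels the model should predict; collapse everything else to "other"
    if x == "stay of removal":
        return "stay of removal"
    elif x.startswith("motion "):
        # e.g. "motion 3": keep the motion number as the label
        return x
    else:
        return "other"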

df_human_coding['completion'] = df_human_coding['type'].apply(order_type)

# randomly shuffle df_human_coding with seed 42
df_human_coding = df_human_coding.sample(frac=1, random_state=42).reset_index(drop=True)

system_message = """You are given a Federal Court docket entry about an order. Return 'motion #' (with the actual number) where
the order refers to a motion number. If no motion number is indicated, and the order clearly involves a stay of removal, return 'stay of removal'.
Otherwise return 'other'. Note that stays of release from detention are not stays of removal and so should be categorized as other.""" 

cases_to_compile = []
for x in range(len(df_human_coding)):
    case_dict={}
    case_dict['messages']=[
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": df_human_coding['text'].iloc[x].replace('\n','').replace('\t','').strip()
        },
        {
            "role": "assistant",
            "content": df_human_coding['completion'].iloc[x].strip()
        }
    ]
    cases_to_compile.append(case_dict)
    
print(len(cases_to_compile))

with open(stay_order_types_output, 'w', encoding='utf8') as out_file:
    for d in cases_to_compile:
        out_file.write(json.dumps(d, ensure_ascii=False))
        out_file.write("\n")

345
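`order_type` assumes every value in the `type` column is a string. A blank Excel cell is read by pandas as NaN (a float), and calling `.startswith` on it would raise an `AttributeError`. A more defensive variant is sketched below; `order_type_safe` is a hypothetical name, and unlike the original it lowercases the label and keeps only labels of the form `motion <number>`, which is an assumption about the intended label set.

```python
import re

# Assumed label shape: the word "motion", one space, then digits.
MOTION_LABEL = re.compile(r"motion \d+")


def order_type_safe(value):
    """Map a human-coded order type to a training label, tolerating bad cells."""
    if not isinstance(value, str):
        return "other"  # NaN or other non-text cell
    label = value.strip().lower()
    if label == "stay of removal":
        return label
    if MOTION_LABEL.fullmatch(label):
        return label
    return "other"
```

It could be swapped in with `df_human_coding['type'].apply(order_type_safe)`; comparing its `value_counts()` against the original output is a quick way to see whether the stricter rule changes any labels.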


In [8]:
df_human_coding['completion'].value_counts()

completion
stay of removal    139
other              124
motion 3            19
motion 4            15
motion 2            12
motion 5             8
motion 6             4
motion 10            4
motion 14            2
motion 15            2
motion 11            2
motion 9             2
motion 8             2
motion 13            2
motion 20            2
motion 17            1
motion 16            1
motion 50            1
motion 12            1
motion 18            1
motion 21            1
Name: count, dtype: int64

In [None]:
client.files.create(
  file=open(stay_order_types_output, "rb"),
  purpose="fine-tune"
)

In [None]:
# Get file id for training file from prior cell output

client.fine_tuning.jobs.create(
  training_file='file-FI1IuWcVK3vHcxe7bBBDe5J4', 
  model="gpt-3.5-turbo",
  suffix = "stays_order_types"
)


# Will receive an email with the model id when fine tuning is done
# Consult fine-tuning UI to check on progress and to monitor loss

In [9]:
# produce training doc for order outcomes
# (reuses df_human_coding as loaded and shuffled in the order types cells above)

df_human_coding['completion'] = df_human_coding['outcome']


system_message = """You are given a Federal Court docket entry about an order. Return 'granted' if the entry reports that a motion or an application
has been granted. Return 'dismissed' if the entry reports that a motion or an application has been dismissed. Return 'other' if the outcome is unclear
or if the motion or application is clearly only procedural in nature (e.g. scheduling, documents, adjournment, etc.).""" 

cases_to_compile = []
for x in range(len(df_human_coding)):
    case_dict={}
    case_dict['messages']=[
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": df_human_coding['text'].iloc[x].replace('\n','').replace('\t','').strip()
        },
        {
            "role": "assistant",
            "content": df_human_coding['completion'].iloc[x].strip()
        }
    ]
    cases_to_compile.append(case_dict)
    
print(len(cases_to_compile))

with open(stay_order_outcomes_output, 'w', encoding='utf8') as out_file:
    for d in cases_to_compile:
        out_file.write(json.dumps(d, ensure_ascii=False))
        out_file.write("\n")

345
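The loop above repeats the pattern already used for the notice and order type files: clean the text, wrap it in system, user and assistant messages, then write one JSON object per line. A sketch of factoring that into helpers follows; `clean_text`, `build_chat_examples` and `write_jsonl` are hypothetical names, not functions from this repository.

```python
import json


def clean_text(text):
    """Apply the same cleaning as the cells above: drop newlines and tabs, trim."""
    return text.replace('\n', '').replace('\t', '').strip()


def build_chat_examples(texts, labels, system_message):
    """Pair each docket text with its label as one chat-style training example."""
    examples = []
    for text, label in zip(texts, labels):
        examples.append({
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": clean_text(text)},
                {"role": "assistant", "content": label.strip()},
            ]
        })
    return examples


def write_jsonl(examples, path):
    """Write one JSON object per line, keeping accented characters readable."""
    with open(path, 'w', encoding='utf8') as out_file:
        for example in examples:
            out_file.write(json.dumps(example, ensure_ascii=False) + "\n")
```

With these, each training file reduces to `write_jsonl(build_chat_examples(df_human_coding['text'], df_human_coding['completion'], system_message), stay_order_outcomes_output)`.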


In [10]:
df_human_coding['completion'].value_counts()

completion
granted      164
dismissed    132
other         49
Name: count, dtype: int64

In [11]:
client.files.create(
  file=open(stay_order_outcomes_output, "rb"),
  purpose="fine-tune"
)

FileObject(id='file-GdYKfOMrjRCZVX7GQn2tm6PV', bytes=366546, created_at=1708607886, filename='stays_order_outcomes_human_coded.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [13]:
# Get file id for training file from prior cell output

client.fine_tuning.jobs.create(
  training_file='file-GdYKfOMrjRCZVX7GQn2tm6PV', 
  model="gpt-3.5-turbo",
  suffix = "stays_outcomes"
)

# Will receive an email with the model id when fine tuning is done
# Consult fine-tuning UI to check on progress and to monitor loss

FineTuningJob(id='ftjob-nNXH9eHPePfpdPGqUQbYUZYk', created_at=1708607971, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-OBNXT1YSS6IuFCMYYC5ICu3T', result_files=[], status='validating_files', trained_tokens=None, training_file='file-GdYKfOMrjRCZVX7GQn2tm6PV', validation_file=None)

### Get Training Docs for Judges model


In [14]:
# Load Human Coded Data from Excel file
df_human_coding = pd.read_excel(subfield_training_data)
df_human_coding

Unnamed: 0,text,judge
0,(Final decision) Reasons for Judgment and Judg...,Rochester
1,(Décision finale) Jugement rendu(e) par Madame...,Elizabeth Walker
2,(Décision finale) Jugement rendu(e) par Monsie...,Roy
3,Jugement et Motifs en date du 18-MAI-2021 rend...,Martineau
4,(Final decision) Reasons for Judgment and Judg...,Go
...,...,...
678,Certificate of Order rendered by The Honourabl...,Dawson
679,Certificat de l'ordonnance rendu(e) par Monsie...,Blanchard
680,(Final decision) Order rendered by The Honoura...,Blanchard
681,(Décision finale) Motifs de jugement et jugeme...,Shore


In [15]:
# get prompts and labels for training judges

df_human_coding['completion'] = df_human_coding['judge']

# randomly shuffle df_human_coding with seed 42
df_human_coding = df_human_coding.sample(frac=1, random_state=42).reset_index(drop=True)

system_message = """You are given a Federal Court docket entry. If the docket entry includes a specific judge identified by name
then return the judge's name. If the docket entry does not include a specific judge identified by name, or if anything is 
unclear then return 'other'.""" 

cases_to_compile = []
for x in range(len(df_human_coding)):
    case_dict={}
    case_dict['messages']=[
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": df_human_coding['text'].iloc[x].replace('\n','').replace('\t','').strip()
        },
        {
            "role": "assistant",
            "content": df_human_coding['completion'].iloc[x].strip()
        }
    ]
    cases_to_compile.append(case_dict)
    
print(len(cases_to_compile))

with open(judges_output, 'w', encoding='utf8') as out_file:
    for d in cases_to_compile:
        out_file.write(json.dumps(d, ensure_ascii=False))
        out_file.write("\n")

683


In [16]:
df_human_coding['completion'].value_counts()

completion
other        35
Shore        28
Gagné        22
Heneghan     21
Beaudry      21
             ..
McKeown       1
Go            1
MacKay        1
Mainville     1
Hansen        1
Name: count, Length: 86, dtype: int64
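The counts above show 86 distinct labels, several with a single training example. The same tally can be taken from the JSONL file itself, which confirms what will actually be uploaded. The sketch below assumes each line ends with the assistant message, as written by the cell above; `label_counts` and `rare_labels` are hypothetical helpers.

```python
import json
from collections import Counter


def label_counts(lines):
    """Count assistant labels across chat-style JSONL lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate a trailing blank line
        messages = json.loads(line)["messages"]
        counts[messages[-1]["content"]] += 1
    return counts


def rare_labels(counts, min_examples=2):
    """Return labels with fewer than `min_examples` training examples, sorted."""
    return sorted(label for label, n in counts.items() if n < min_examples)
```

Labels returned by `rare_labels` are the ones the fine-tuned model has seen only once, so predictions for those judges may deserve a manual check.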

In [17]:
client.files.create(
  file=open(judges_output, "rb"),
  purpose="fine-tune"
)

FileObject(id='file-HnWbi2eY8QM0DqYBP3iDBVYn', bytes=661240, created_at=1708608356, filename='stays_judges_human_coded.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [18]:
# Get file id for training file from prior cell output

client.fine_tuning.jobs.create(
  training_file='file-HnWbi2eY8QM0DqYBP3iDBVYn', 
  model="gpt-3.5-turbo",
  suffix = "stays_judges"
)

# Will receive an email with the model id when fine tuning is done
# Consult fine-tuning UI to check on progress and to monitor loss

FineTuningJob(id='ftjob-B5WM5VO4wkPsd81yeEC26nKG', created_at=1708608403, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-OBNXT1YSS6IuFCMYYC5ICu3T', result_files=[], status='validating_files', trained_tokens=None, training_file='file-HnWbi2eY8QM0DqYBP3iDBVYn', validation_file=None)
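The job object above is returned with `status='validating_files'` and `fine_tuned_model=None`; the model id is only filled in once training finishes. As an alternative to waiting for the email, the job can be polled with `client.fine_tuning.jobs.retrieve`. This is a sketch only: `wait_for_fine_tune` is a hypothetical helper, and the polling interval and limit are arbitrary assumptions.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tune(client, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it finishes; return (status, fine_tuned_model)."""
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job.status, job.fine_tuned_model
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

For the judges job this would be `wait_for_fine_tune(client, 'ftjob-B5WM5VO4wkPsd81yeEC26nKG')`, which returns the model id once the status is `succeeded`.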

### NOTES

### Send training data to OpenAI to build models

NOTE: In the Luck of the Draw article, ada was used as the base model, but that model is no longer available, and other legacy models are likely to be deprecated soon. Instead, use gpt-3.5-turbo, which is a chat-style model (2x the cost of ada).

As of January 2024, the fine-tuning process is described here:

https://platform.openai.com/docs/guides/fine-tuning/common-use-cases?lang=python

The main point is to use the JSONL files created by this notebook (see the training_data directory) to create individual models using the OpenAI fine-tuning UI.

### Models Produced:

#### Notice types
- ada:ft-refugee-law-lab:stays-notice-types-2022-11-18-21-13-59
- ft:gpt-3.5-turbo-0613:refugee-law-lab:stays-notice-types:8uthOYkO 

#### Order types
- ada:ft-refugee-law-lab:stay-orders-types-2022-11-16-15-07-08
- ft:gpt-3.5-turbo-0613:refugee-law-lab:stays-order-types:8uuXKz4l 

#### Order outcomes
- ada:ft-refugee-law-lab:stay-orders-outcomes-2022-11-16-14-56-59
- ft:gpt-3.5-turbo-0613:refugee-law-lab:stays-outcomes:8v442Ukp 

#### Judges
- ada:ft-refugee-law-lab:judges-2022-11-13-22-35-55
- ft:gpt-3.5-turbo-0613:refugee-law-lab:stays-judges:8v4XnuHv 
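
Once a model id from the list above is available, a docket entry can be classified by sending the same system message used in training together with the cleaned entry text. The sketch below is an assumption about how the models might be called, not code from the published article; `classify_entry` is a hypothetical helper, and `temperature=0` and `max_tokens=10` are illustrative settings chosen because the labels are short.

```python
def classify_entry(client, model_id, system_message, entry_text):
    """Ask a fine-tuned chat model for the label of one docket entry."""
    # Clean the text the same way the training examples were cleaned.
    cleaned = entry_text.replace('\n', '').replace('\t', '').strip()
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": cleaned},
        ],
        temperature=0,   # assumption: keep the labelling as repeatable as possible
        max_tokens=10,   # assumption: labels are only a few tokens long
    )
    return response.choices[0].message.content.strip()
```

The `system_message` passed in should match the one used to train the chosen model, since the fine-tuned model learned its labels in that context.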