Load dataset

In [16]:
!kaggle datasets download deepcontractor/supreme-court-judgment-prediction
!unzip supreme-court-judgment-prediction.zip

import pandas as pd

df = pd.read_csv('justice.csv')

print(df)


Dataset URL: https://www.kaggle.com/datasets/deepcontractor/supreme-court-judgment-prediction
License(s): CC0-1.0
supreme-court-judgment-prediction.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  supreme-court-judgment-prediction.zip
replace justice.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
      Unnamed: 0     ID                                 name  \
0              0  50606                          Roe v. Wade   
1              1  50613                  Stanley v. Illinois   
2              2  50623              Giglio v. United States   
3              3  50632                         Reed v. Reed   
4              4  50643                 Miller v. California   
...          ...    ...                                  ...   
3298        3298  63324    United States v. Palomar-Santiago   
3299        3299  63323               Terry v. United States   
3300        3300  63331              United States v. Cooley   
3301        3301

Preprocess dataset

In [17]:
# Preprocess the data
# just keep facts and first_party_winner

#drop all rows with na
df = df.dropna()
df = df[['facts', 'first_party_winner']]
df['first_party_winner'] = df['first_party_winner'].astype(int)

#remname facts to text and first_party_winner to label
df = df.rename(columns={'first_party_winner': 'label', 'facts': 'text'})

# remove the p tag from the text
df['text'] = df['text'].str.replace('<p>', '')

print(df)

                                                   text  label
1     Joan Stanley had three children with Peter Sta...      1
2     John Giglio was convicted of passing forged mo...      1
3     The Idaho Probate Code specified that "males m...      1
4     Miller, after conducting a mass mailing campai...      1
5     Ernest E. Mandel was a Belgian professional jo...      1
...                                                 ...    ...
3297  For over a century after the Alaska Purchase i...      1
3298  Refugio Palomar-Santiago, a Mexican national, ...      1
3299  Tarahrick Terry pleaded guilty to one count of...      0
3300  Joshua James Cooley was parked in his pickup t...      1
3302  The Natural Gas Act (NGA), 15 U.S.C. §§ 717–71...      1

[3098 rows x 2 columns]



Use a light weight LLM model to predict which party will win based on some actual dataset records

imports for supervised fine tuning

In [18]:
!pip install datasets # install the datasets library
!pip install peft # install the peft library
!pip install evaluate # install the evaluate library

from datasets import load_dataset,  Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

import evaluate
import torch
import numpy as np



defining model to use or fine tune

In [19]:
model_checkpoint = "microsoft/deberta-v3-small" # using this base model for doing binary classfication because it is the smallest parameter set, can run in this machine.

#we want to fine-tune this model to do case analysis on input text, for that we want to label map for First party wins and First party losses.

#define label maps
id2label = {0: "First Party Loses", 1: "First Party Wins"}
label2id = {"First Party Loses": 0, "First Party Wins": 1}

#generate classification model for model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Dividing data to training and testing data from the given dataset

In [20]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert the dataframes to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
validation_dataset = Dataset.from_pandas(test_df)

train_dataset = train_dataset.select_columns(['label', 'text'])
validation_dataset = validation_dataset.select_columns(['label', 'text'])

# Remove the index column if it exists
if '__index_level_0__' in train_dataset.features:
    train_dataset = train_dataset.remove_columns(['__index_level_0__'])
if '__index_level_0__' in validation_dataset.features:
    validation_dataset = validation_dataset.remove_columns(['__index_level_0__'])


#print(train_dataset)
#print(validation_dataset)

dataset = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset
})
dataset



DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 2478
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 620
    })
})

preprocess dataset wrt model

In [21]:
#create a tokenizer, for the particular model we are using.
# models don't understand text, need to convert them to numerical data before feeding to models
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space = True)

#create tokenize function,
#examples is rows in dataset the training dataset has 2 columns label and text, we want to grab text from it and convert into numerical values

def tokenize_function(examples):
  #extract text
   text = examples['text']

   #tokenize and truncate, required as examples for training need to be of the same length, truncate long or pad short, or do both.
   #here truncating form left, using numpy tensor, with max length 512
   tokenizer.truncate_side = "left"
   tokenized_inputs = tokenizer(text,
                                return_tensors = "np",
                                max_length=512,
                                truncation=True)

   return tokenized_inputs

   #add pad token if not exist, tokenizer doesn't have pad tokens so adding to sequence whenever PAD is there, it's ignored by LLM
   if tokenizer.pad_token is None:
      tokenizer.add_special_tokens({'pad_token': '[PAD]'})
      model.resize_token_embeddings(len(tokenizer))

#tokenize training and validation dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

# instead of doing padding for all rows, we can dynamically PAD the rows in the datasets using collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




Map:   0%|          | 0/2478 [00:00<?, ? examples/s]

Map:   0%|          | 0/620 [00:00<?, ? examples/s]

Evaluation metrics

In [22]:
#to import the performance of the model during training

#import accuracy evaluation metrics
accuracy = evaluate.load("accuracy")

# packaging accuracy metrics as a function, one for first party losses and first party losses class, whichever is larger will become model prediction.
# define an evaluation function to pass into trainer later
def compute_metrics(eval_pred):
  predictions, labels = eval_pred # predictions here are the logits, has 2 elements first_party_wins and firstparty_losses, evaluating which element is larger and which is larger will be the label.
  predictions = np.argmax(predictions, axis=1)
  return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}

Applying untrained model to text

In [23]:
# define list of examples
text_list = test_df['text'][5:10].tolist()
actual_winner = test_df['label'][5:10].tolist()
print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(id2label[predictions.tolist()]
          + " - Actual Result: " + id2label[actual_winner[text_list.index(text)]])

Untrained model predictions:
----------------------------
First Party Loses - Actual Result: First Party Wins
First Party Loses - Actual Result: First Party Loses
First Party Loses - Actual Result: First Party Wins
First Party Loses - Actual Result: First Party Wins
First Party Loses - Actual Result: First Party Wins


Train Model

In [24]:
peft_config = LoraConfig(task_type="SEQ_CLS", # sequence classification
                        r=4, #intrinsic rank of trainable weight matrix
                        lora_alpha=32, # learning rate
                        lora_dropout=0.01, # probability of drop out, randomly 0 internal parameters during training
                        target_modules = ["query_proj"]) #, "value_proj"] # to see which modules to target, just print the layers
                        #peft = parameter efficient fine tuning : large model trained with small number of extra arguments

Use config setting to update model

In [25]:
model = get_peft_model(model, peft_config) # get actual model and update it using the configuration of lora that we provided in previous step
model.print_trainable_parameters() # to see how much percentage of total parameters we actually need to model, as seen in result only 0.93% of the model will be trained, huge cost savings.

trainable params: 38,402 || all params: 141,934,852 || trainable%: 0.0271


In [26]:
# hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of rows in dataset processed per optimization step
num_epochs = 10 #number of times model runs through training data

In [27]:
# define training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification", # defining where model to be saved
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch", # per epoch evaluate the model parameters
    save_strategy="epoch", # per epoch save the model parameters
    load_best_model_at_end=True, # at end return best version of the model
)

In [28]:
# Create a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6528,0.632334,{'accuracy': 0.6741935483870968}
2,0.657,0.638831,{'accuracy': 0.6741935483870968}
3,0.6596,0.713317,{'accuracy': 0.6741935483870968}
4,0.6796,0.636857,{'accuracy': 0.6741935483870968}
5,0.658,0.634413,{'accuracy': 0.6741935483870968}
6,0.653,0.633845,{'accuracy': 0.6741935483870968}
7,0.662,0.639592,{'accuracy': 0.6741935483870968}
8,0.6618,0.637323,{'accuracy': 0.6741935483870968}
9,0.656,0.632321,{'accuracy': 0.6741935483870968}
10,0.65,0.633989,{'accuracy': 0.6741935483870968}


Trainer is attempting to log a value of "{'accuracy': 0.6741935483870968}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.6741935483870968}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.6741935483870968}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.6741935483870968}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.6741935483870968}" o

TrainOutput(global_step=6200, training_loss=0.6573794334165511, metrics={'train_runtime': 949.8282, 'train_samples_per_second': 26.089, 'train_steps_per_second': 6.527, 'total_flos': 2094630762965520.0, 'train_loss': 0.6573794334165511, 'epoch': 10.0})

In [29]:
# define list of examples
text_list = test_df['text'][5:10].tolist()
actual_winner = test_df['label'][5:10].tolist()
print("Trained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt").to("cuda")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(id2label[predictions.tolist()]
          + " - Actual Result: " + id2label[actual_winner[text_list.index(text)]])

Trained model predictions:
----------------------------
First Party Wins - Actual Result: First Party Wins
First Party Wins - Actual Result: First Party Loses
First Party Wins - Actual Result: First Party Wins
First Party Wins - Actual Result: First Party Wins
First Party Wins - Actual Result: First Party Wins
