In [1]:
import pandas as pd
import re
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

In [2]:
def filter_text(text):
    # Remove non-alphanumeric characters using regex and convert to lowercase
    filtered_text = re.sub(r'[^a-zA-Z0-9 ]', '', text).lower()
    return filtered_text


In [5]:
df = pd.read_csv("../Dataset/dataset_semeval24_traindev/train.csv")

In [6]:
txt = df['explanation'].values.tolist()[0]

In [7]:
txt

'Venue in most federal actions is governed by 28 U.S.C. §1391(b), which provides: (b) Venue in general. A civil action may be brought in— (1) a judicial district in which any defendant resides, if all defendants are residents of the State in which the district is located; (2) a judicial district in which a substantial part of the events or omissions giving rise to the claim occurred, or a substantial part of property that is the subject of the action is situated; or (3) if there is no district in which an action may otherwise be brought as provided in this section, a judicial district in which any defendant is subject to the court’s personal jurisdiction with respect to such action. Note that subsections 1 and 2 are alternatives. Venue is proper in a district where either a defendant resides (if they are all residents of the state where the action is brought) or a district in which a substantial part of the events giving rise to the claim took place. Section 1391(b)(3) is a ‘‘fallback’

In [3]:
from transformers import PegasusForConditionalGeneration, AutoTokenizer
import torch

# You can chose models from following list
# https://huggingface.co/models?sort=downloads&search=google%2Fpegasus
model_name_sum = 'google/pegasus-cnn_dailymail'
device_sum = 'mps'
tokenizer_sum = AutoTokenizer.from_pretrained(model_name_sum)
model_sum = PegasusForConditionalGeneration.from_pretrained(model_name_sum).to(device_sum)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
def generate_summary(txt, model, tokenizer, device):
    batch = tokenizer(txt, truncation=True, padding='longest', return_tensors="pt").to(device)
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return tgt_text[0]

In [10]:
generate_summary(txt, model_sum, tokenizer_sum, device_sum)

'Venue in most federal actions is governed by 28 U.S.C. 1391(b)<n>Venue is proper in a district where either a defendant resides (if they are all residents of the state where the action is brought) or a district in which a substantial part of the events giving rise to the claim took place.'

In [11]:
# temp_sc = df["explanation"].values.tolist()
# temp_sc = [filter_text(i) for i in temp_sc]
# temp_sc = [generate_summary(i, model_sum, tokenizer_sum, device_sum) for i in temp_sc]
df["short_context"] = df["explanation"].apply(lambda x: generate_summary(x, model_sum, tokenizer_sum, device_sum))

In [12]:
df.to_csv("train2.csv", index=False)

In [13]:
df

Unnamed: 0,idx,question,answer,label,analysis,complete analysis,explanation,short_context
0,0,"1. Redistricting. Dziezek, who resides in the ...",the Western District of Kentucky.,0,So the remaining question is whether the West...,"Let’s see. Under §1391(b)(1), venue is proper ...",Venue in most federal actions is governed by 2...,Venue in most federal actions is governed by 2...
1,1,"1. Redistricting. Dziezek, who resides in the ...",the Southern District of Indiana.,0,But B is clearly not: the plaintiff’s residenc...,"Let’s see. Under §1391(b)(1), venue is proper ...",Venue in most federal actions is governed by 2...,Venue in most federal actions is governed by 2...
2,2,"1. Redistricting. Dziezek, who resides in the ...",the Southern District of Ohio.,1,"Let’s see. Under §1391(b)(1), venue is proper ...","Let’s see. Under §1391(b)(1), venue is proper ...",Venue in most federal actions is governed by 2...,Venue in most federal actions is governed by 2...
3,3,"2. Venue exercises. Chu, a Californian, went s...",proper in the Southern District of California ...,0,A is pretty clearly wrong. Although §1391(b)(2...,This question didn’t give my students much tro...,The venue provisions in §1391(b) also apply in...,The venue provisions in 1391(b) also apply in ...
4,4,"2. Venue exercises. Chu, a Californian, went s...",proper in the District of Colorado under §1391...,0,"B is another loser. First of all, Jackson does...",This question didn’t give my students much tro...,The venue provisions in §1391(b) also apply in...,The venue provisions in 1391(b) also apply in ...
...,...,...,...,...,...,...,...,...
661,661,7. Special delivery. PourPack is a Delaware co...,delivering the summons and complaint to Suares...,0,"A is wrong here, because there is a difference...","In D, Perini apparently relies on Fed. R. Civ....",Rule 4(h) provides several options for serving...,Fed. R. Civ. P. 4(h)(1)(A) allows service on a...
662,662,7. Special delivery. PourPack is a Delaware co...,delivering the papers to the Utah Secretary of...,1,"Service in B is proper, however. Here, Perini ...","In D, Perini apparently relies on Fed. R. Civ....",Rule 4(h) provides several options for serving...,Fed. R. Civ. P. 4(h)(1)(A) allows service on a...
663,663,7. Special delivery. PourPack is a Delaware co...,having a process server deliver the summons an...,0,"In C, Perini appears to rely on the subsection...","In D, Perini apparently relies on Fed. R. Civ....",Rule 4(h) provides several options for serving...,Fed. R. Civ. P. 4(h)(1)(A) allows service on a...
664,664,7. Special delivery. PourPack is a Delaware co...,delivering the summons and the complaint to Po...,0,"In D, Perini apparently relies on Fed. R. Civ....","In D, Perini apparently relies on Fed. R. Civ....",Rule 4(h) provides several options for serving...,Fed. R. Civ. P. 4(h)(1)(A) allows service on a...


In [14]:
df["Prompt"] = df['short_context']+" "+df["question"]+" "+df["answer"]+" "+df["label"].astype(str)
df["Output"] = df["Prompt"]+df["analysis"]

In [15]:
df = df.dropna().reset_index(drop=True)

In [None]:
# unique_pairs = df[['explanation', 'question']].drop_duplicates()
# unique_pairs["final_text"] = "Context: "+unique_pairs.explanation+"\n\nQuestion: "+unique_pairs.question
# questions = unique_pairs.final_text.tolist()

In [None]:
# prompt_text = "Given the context and question as follows:\n\n"+questions[0]+"\n\nExtract the relevant context to answer the Question."

In [None]:
# print(prompt_text)

In [None]:
# req_context = []

In [None]:
# for i in range(len(questions)):
#     a = input("Enter the context for the question: "+questions[i]+"\n")
#     req_context.append(a)

In [None]:
# req_context

In [16]:

# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate the dataset
dataset = []
for i, row in df.iterrows():
    prompt = row["Prompt"]
    output = row["Output"]
    # encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False)
    # encoded_output = tokenizer.encode(output, add_special_tokens=False)
    dataset.append((prompt, output))

# Save the dataset
with open("train.txt", "w") as f:
    for prompt, output in dataset:
        f.write(f"{''.join(map(str, prompt))}\t{''.join(map(str, output))}\n")

In [20]:


# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
# tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Prepare your dataset and tokenize the data

# Create the TextDataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",  # Path to your dataset file
    block_size=512  # Specify the block size
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned_1",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
)

# Create a Trainer and fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/645 [00:00<?, ?it/s]

{'loss': 1.9745, 'learning_rate': 1.1240310077519382e-05, 'epoch': 2.33}
{'train_runtime': 2044.7448, 'train_samples_per_second': 1.262, 'train_steps_per_second': 0.315, 'train_loss': 1.9157021455986556, 'epoch': 3.0}


TrainOutput(global_step=645, training_loss=1.9157021455986556, metrics={'train_runtime': 2044.7448, 'train_samples_per_second': 1.262, 'train_steps_per_second': 0.315, 'train_loss': 1.9157021455986556, 'epoch': 3.0})

In [21]:
output_dir = "./gpt2-finetuned_1"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('./gpt2-finetuned_1/tokenizer_config.json',
 './gpt2-finetuned_1/special_tokens_map.json',
 './gpt2-finetuned_1/vocab.json',
 './gpt2-finetuned_1/merges.txt',
 './gpt2-finetuned_1/added_tokens.json')

Test

In [5]:
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned_1/")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned_1/")

In [23]:
df1 = pd.read_csv("../Dataset/dataset_semeval24_traindev/dev.csv")


In [24]:
df1["short_context"] = df1["explanation"].apply(lambda x: generate_summary(x, model_sum, tokenizer_sum, device_sum))

In [25]:
df1.to_csv("dev2.csv", index = False)

In [26]:
df1

Unnamed: 0,idx,question,answer,label,analysis,complete analysis,explanation,short_context
0,0,7. A venue medley. Zirkhov brings a diversity ...,The court must transfer the action to the West...,0,C suggests that the judge must transfer under ...,A good place to start is to ask whether venue ...,"So, when a plaintiff files suit in a district ...",When a plaintiff files suit in a district that...
1,1,7. A venue medley. Zirkhov brings a diversity ...,"The court will have to dismiss, since 28 U.S.C...",0,"Nibbling away at the edges, consider D, which ...",A good place to start is to ask whether venue ...,"So, when a plaintiff files suit in a district ...",When a plaintiff files suit in a district that...
2,2,7. A venue medley. Zirkhov brings a diversity ...,The court could transfer the action under 28 U...,0,A good place to start is to ask whether venue ...,A good place to start is to ask whether venue ...,"So, when a plaintiff files suit in a district ...",When a plaintiff files suit in a district that...
3,3,7. Comedy of errors. Laurel and Hardy are inju...,"reverse the decision in Fields’s case, and rem...",0,In this example the federal judge did her best...,In this example the federal judge did her best...,State law is not static. It may change while a...,State law is not static. It may change while a...
4,4,7. Comedy of errors. Laurel and Hardy are inju...,"reverse the decision in Fields’s case, and rem...",1,But what will the federal trial court do about...,In this example the federal judge did her best...,State law is not static. It may change while a...,State law is not static. It may change while a...
...,...,...,...,...,...,...,...,...
79,79,7. If at first you don’t succeed. . . . Ervin ...,The court will consider whether Ito was subjec...,0,C isn’t right either. While it is generally tr...,Although the introduction doesn’t give you the...,"So, we’ve seen that the defendant has several ...",The defendant has several options in challengi...
80,80,8. Technical fouls. Eban brings suit against L...,Darrow files the complaint and mails a copy of...,0,He has been sloppy in B as well. A waiver is n...,"Interestingly, in each of these cases the defe...","All of this is fairly technical, though import...",Rule 4 provides a method for a defendant to wa...
81,81,8. Technical fouls. Eban brings suit against L...,Darrow files the complaint and mails a copy of...,1,"In C, Darrow sent a proper request for waiver,...","Interestingly, in each of these cases the defe...","All of this is fairly technical, though import...",Rule 4 provides a method for a defendant to wa...
82,82,8. Technical fouls. Eban brings suit against L...,Darrow files the complaint and mails a copy of...,0,"And in D, Darrow gets back the post office rec...","Interestingly, in each of these cases the defe...","All of this is fairly technical, though import...",Rule 4 provides a method for a defendant to wa...


In [27]:
prompts = (df1["short_context"]+" "+df1["question"]+" "+df1["answer"]+" "+df1["label"].astype(str)).values.tolist()

In [28]:
prompts

['When a plaintiff files suit in a district that is proper under the venue statute, the court may hear the case.<n>A striking virtue of section 1406(a) is that it allows a court to save a plaintiff’s cause of action if she files in the wrong venue. 7. A venue medley. Zirkhov brings a diversity action against Pardee, a truck driver from Ohio, and Lugo Enterprises, his employer, for injuries in an accident that took place in the Western District of Kentucky. He files the action in the federal district court for the Northern District of Illinois, where Lugo’s principal place of business is located. Pardee, who lives in Ohio, moves to dismiss the action for improper venue. The court must transfer the action to the Western District of Kentucky under 28 U.S.C. §1406(a), since it is a proper venue. 0',
 'When a plaintiff files suit in a district that is proper under the venue statute, the court may hear the case.<n>A striking virtue of section 1406(a) is that it allows a court to save a plain

In [12]:

# Text generation
context = """There are some subtle problems in the amendment area. So we have a pre-closer and a closer. A party who sues a defendant on a claim within the limitations period may change theories, add new damages, or recast the factual basis of the claim through amendments, assuming that the amended allegations still are based on the same underlying facts, and that the judge allows the amendment. The amended allegations will â€˜â€˜relate backâ€™â€™ to the date the original complaint was filed. If the claim asserted in the amended complaint would have been timely on that date, Rule 15(c)(1)(B) avoids any limitations problem. But suppose the amendment adds a new defendant to the case? Rule 15 has a separate, more abstruse provision, Rule 15(c)(1)(C), that deals with this twist. As an example, assume that Leroy sues Tele-Sell, a telemarketing firm, under a statute barring telemarketing calls to consumers who have placed their names on a state do-not-call list. After he sues, the limitations period passes. Several months later, he learns that he was mistaken, that the calls actually came from Tel-Connect, a different firm. Consequently, he moves to amend to substitute Tel- Connect as the defendant. Allowing relation back for an amendment like this, adding a new party, requires a more stringent standard than others, because the new claim is against a different defendant, who was not sued before the limitations period expired. Under Rule 15(c)(1)(C) three requirements must be met before an amendment changing the party against whom a claim is asserted will â€˜â€˜relate backâ€™â€™ to the date the original complaint was filed. â€¢ First, the amended pleading must arise out of the same events as the original pleading. Rule 15(c)(1)(C). â€¢ Second, the defendant being added must have â€˜â€˜received such notice of the action that it will not be prejudiced in defending on the merits.â€™â€™ Rule 15(c)(1)(C)(i). This notice must have been received within the period of time it would have been received had the new defendant been sued originally. â€¢ Third, the plaintiff must show that the new defendant brought in by the amendment â€˜â€˜knew or should have known that the action would have been brought against it, but for a mistake concerning the proper partyâ€™s identity.â€™â€™ Rule 15(c)(1)(C)(ii). The first requirement is met in Leroyâ€™s case. He is suing Tel-Connect for the same harassing calls that were the basis of his initial suit against Tele-Sell. But Rule 15(c)(1) requires more than that. It also requires that Tel-Connect was aware, within the period for suing and serving the complaint on Tele-Sell, that the suit had been brought, and that itâ€”Tel- Connectâ€”was actually the intended target of the suit. Whatâ€™s the point of this complex provision? It is meant to ensure that Tel-Connect, the party Leroy brings in late, had actual notice within the limitations period (plus the additional 90 days that Fed. R. Civ. P. 4(m) gives for serving the complaint2) that Leroy intended to sue it. If it had such notice, the purpose of the limitations period has been satisfied: The added defendant was aware of the need to preserve evidence and prepare a defense, within the limitations period prescribed by the legislature. Perhaps the following question will help to sort out the requirements of Rule 15(c)(1)(C).
™"""
question = """7. Black and White. Williams is roughed up during an arrest by a police officer and suffers a broken wrist that doesnâ€™t heal right. Williams consults Darrow, a lawyer, who brings a federal civil rights action for his injury. Darrow obtains the police report of the arrest, which lists Officer Black as the arresting officer. Darrow names Black as the defendant. Suit is filed one month before the limitations period runs. In fact, it wasnâ€™t Black who arrested Williams, it was Officer White. The report, filled out by the booking officer, was simply mistaken. Within a few days, every officer in the precinct, including White, became aware that Black had been sued by an arrestee for excessive force, though they were not aware of the specific circumstances of the case or the identity of the plaintiff. Six months later, Black answers interrogatories sent to him by Williams, denying that he was the arresting officer. Darrow investigates, confirms from other witnesses that White was the arresting officer, and moves to amend to name White as the defendant in the action. The amendment, if allowed, will likely"""
answer = """relate back, because White was aware, before the passage of the limitations period, that Black had been sued."""
label = "0"
prompt = generate_summary(context, model_sum, tokenizer_sum, device_sum) + " " + question + " " + answer + " " + label
input_ids = tokenizer.encode(filter_text(prompt), return_tensors="pt")
output = model.generate(input_ids, max_length=512, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


a party who sues a defendant on a claim within the limitations period may change theories add new damages or recast the factual basis of the claimnthe amended allegations will  relate backtmtm to the date the original complaint was filed 7 black and white williams is roughed up during an arrest by a police officer and suffers a broken wrist that doesnt heal right williams consults darrow a lawyer who brings a federal civil rights action for his injury darrow obtains the police report of the arrest which lists officer black as the arresting officer darrow names black as the defendant suit is filed one month before the limitations period runs in fact it wasnt black who arrested williams it was officer white the report filled out by the booking officer was simply mistaken within a few days every officer in the precinct including white became aware that black had been sued by an arrestee for excessive force though they were not aware of the specific circumstances of the case or the identit

In [29]:
for i in range(3):
    input_ids = tokenizer.encode(prompts[i], return_tensors="pt")
    output = model.generate(input_ids, max_length=512, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


When a plaintiff files suit in a district that is proper under the venue statute, the court may hear the case.<n>A striking virtue of section 1406(a) is that it allows a court to save a plaintiff’s cause of action if she files in the wrong venue. 7. A venue medley. Zirkhov brings a diversity action against Pardee, a truck driver from Ohio, and Lugo Enterprises, his employer, for injuries in an accident that took place in the Western District of Kentucky. He files the action in the federal district court for the Northern District of Illinois, where Lugo’s principal place of business is located. Pardee, who lives in Ohio, moves to dismiss the action for improper venue. The court must transfer the action to the Western District of Kentucky under 28 U.S.C. §1406(a), since it is a proper venue. 0	A venue Medley is one of the many privileges that a defendant has in her home state of residence. It allows her to bring suit there, if the plaintiff has a place to live, or if her place is within 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


When a plaintiff files suit in a district that is proper under the venue statute, the court may hear the case.<n>A striking virtue of section 1406(a) is that it allows a court to save a plaintiff’s cause of action if she files in the wrong venue. 7. A venue medley. Zirkhov brings a diversity action against Pardee, a truck driver from Ohio, and Lugo Enterprises, his employer, for injuries in an accident that took place in the Western District of Kentucky. He files the action in the federal district court for the Northern District of Illinois, where Lugo’s principal place of business is located. Pardee, who lives in Ohio, moves to dismiss the action for improper venue. The court will have to dismiss, since 28 U.S.C. §1406(a) only allows transfer to a district in which the case might have been brought. 0	A venue Medley is a venue that allows the plaintiff to bring a claim in federal court.<a>The court should not allow a defendant to transfer from one district to another, because the defen

In [None]:
prompts[3]