### Text Summarization 

Reference: https://zenodo.org/record/7152317#.Yz6mJ9JByC0

In [1]:
!wget https://zenodo.org/record/7152317/files/dataset.zip?download=1

--2023-04-01 10:29:00--  https://zenodo.org/record/7152317/files/dataset.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 97822590 (93M) [application/octet-stream]
Saving to: ‘dataset.zip?download=1’


2023-04-01 10:31:32 (635 KB/s) - ‘dataset.zip?download=1’ saved [97822590/97822590]



In [2]:
!unzip dataset.zip

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  inflating: dataset/IN-Abs/train-data/summary/4620.txt  
  inflating: dataset/IN-Abs/train-data/summary/463.txt  
  inflating: dataset/IN-Abs/train-data/summary/4630.txt  
  inflating: dataset/IN-Abs/train-data/summary/4631.txt  
  inflating: dataset/IN-Abs/train-data/summary/4643.txt  
  inflating: dataset/IN-Abs/train-data/summary/4647.txt  
  inflating: dataset/IN-Abs/train-data/summary/4652.txt  
  inflating: dataset/IN-Abs/train-data/summary/4667.txt  
  inflating: dataset/IN-Abs/train-data/summary/4668.txt  
  inflating: dataset/IN-Abs/train-data/summary/4682.txt  
  inflating: dataset/IN-Abs/train-data/summary/4690.txt  
  inflating: dataset/IN-Abs/train-data/summary/4693.txt  
  inflating: dataset/IN-Abs/train-data/summary/4697.txt  
  inflating: dataset/IN-Abs/train-data/summary/4711.txt  
  inflating: dataset/IN-Abs/train-data/summary/4712.txt  
  inflating: dataset/IN-Abs/train-data/summary/4715.txt  
  inflat

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import AdamW, get_linear_schedule_with_warmup
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm


In [4]:
# Load the BART tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# Load the BART model for conditional generation
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

In [5]:
# Set the device to use for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05,

In [6]:
# Define the dataset and data loader
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        article = self.data[index]['article']
        summary = self.data[index]['summary']
        inputs = tokenizer(article, truncation=True, padding='max_length', max_length=1024, return_tensors='pt')
        outputs = tokenizer(summary, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
        return {'input_ids': inputs['input_ids'][0], 'attention_mask': inputs['attention_mask'][0], 
                'decoder_input_ids': outputs['input_ids'][0], 'decoder_attention_mask': outputs['attention_mask'][0]}


In [7]:
###Load sample dataset for training
###Can be automated for full process
judgement_path1 = '/content/dataset/IN-Abs/train-data/judgement/1.txt'
with open(judgement_path1, 'r') as file:
    judgement_1 = file.read()

summary_path1 = '/content/dataset/IN-Abs/train-data/summary/1.txt'
with open(summary_path1, 'r') as file:
    summary_1 = file.read()
judgement_path10 = '/content/dataset/IN-Abs/train-data/judgement/10.txt'

with open(judgement_path10, 'r') as file:
    judgement_10 = file.read()

summary_path10 = '/content/dataset/IN-Abs/train-data/summary/10.txt'
with open(summary_path10, 'r') as file:
    summary_10 = file.read()
judgement_path100 = '/content/dataset/IN-Abs/train-data/judgement/100.txt'
with open(judgement_path100, 'r') as file:
    judgement_100 = file.read()

summary_path100 = '/content/dataset/IN-Abs/train-data/summary/100.txt'
with open(summary_path100, 'r') as file:
    summary_100 = file.read()
judgement_path1000 = '/content/dataset/IN-Abs/train-data/judgement/1000.txt'
with open(judgement_path1000, 'r') as file:
    judgement_1000 = file.read()

summary_path1000 = '/content/dataset/IN-Abs/train-data/summary/1000.txt'
with open(summary_path1000, 'r') as file:
    summary_1000 = file.read()

In [8]:
train_data = [{'article': judgement_1 , 'summary': summary_1}, {'article': judgement_10 , 'summary': summary_10},{'article': judgement_100 , 'summary': summary_100}, {'article': judgement_1000 , 'summary': summary_1000}]
train_dataset = MyDataset(train_data)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)


In [9]:
# Define the optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)




In [10]:
# Define the training loop
model.train()
for epoch in range(2):
    total_loss = 0
    for batch in tqdm(train_loader, desc='Epoch ' + str(epoch)):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        decoder_input_ids = batch['decoder_input_ids'].to(device)
        decoder_attention_mask = batch['decoder_attention_mask'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, 
                        decoder_attention_mask=decoder_attention_mask, labels=decoder_input_ids)
        loss = outputs.loss
        total_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
    print('Epoch ' + str(epoch) + ' Loss: ' + str(total_loss / len(train_loader)))


Epoch 0: 100%|██████████| 1/1 [00:53<00:00, 53.57s/it]


Epoch 0 Loss: 7.362895488739014


Epoch 1: 100%|██████████| 1/1 [00:40<00:00, 40.69s/it]

Epoch 1 Loss: 7.523076057434082





###Doing the testing


In [12]:
test_path1 = '/content/dataset/IN-Abs/test-data/judgement/1181.txt'
with open(test_path1, 'r') as file:
    test_1 = file.read()

test_summary_path1 = '/content/dataset/IN-Abs/test-data/summary/1181.txt'
with open(test_summary_path1, 'r') as file:
    test_summary_1 = file.read()

In [20]:

# Tokenize the input text
input_ids = tokenizer.encode(test_1, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# Generate a summary
summary_ids = model.generate(input_ids=input_ids, num_beams=4, length_penalty=2.0, max_length=1000, min_length=30, early_stopping=True)
summary = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
print(summary)

Appeal No. 101 of 1959.Appeal by special leave of order dated November 8, 1957, of the Deputy Custodian General, Evacuee Property, Now Delhi Revision Petition No. 17 R/55 of 1955.Achhru Ram and K. L. Mehta for the appellants.The respondents.B.K., Khanna and, T. M. Sen, for the respondent No. 1.N.S. Bindra and A. G. Ratnarkhi for the respondents No. 2.N-N.G. Ratra and C.G, R.R.T.Ratra, and N.G.,Ratra and R, for N. G R.Ratnark and R. RatN.R, and for the the respondents Nos. 3 and 4 of the respondents.March 15. 1949The Judgment of the Court was delivered by MUDHOLKAR J.SARDia.The appellants who are admittedly displaced persons from West Pakistan were granted quasi permanent allotment of 24 standard acres and 15 3/4 units in the village of Raikot in Ludhiana District in 1949.Their father Sardar Nand Singh who was 42, was found entitled to quasi permanent land in a village of Humbran in the same district in the year 1949. The appellants, however, was not able to consolidate his land allotme

In [21]:
summary

'Appeal No. 101 of 1959.Appeal by special leave of order dated November 8, 1957, of the Deputy Custodian General, Evacuee Property, Now Delhi Revision Petition No. 17 R/55 of 1955.Achhru Ram and K. L. Mehta for the appellants.The respondents.B.K., Khanna and, T. M. Sen, for the respondent No. 1.N.S. Bindra and A. G. Ratnarkhi for the respondents No. 2.N-N.G. Ratra and C.G, R.R.T.Ratra, and N.G.,Ratra and R, for N. G R.Ratnark and R. RatN.R, and for the the respondents Nos. 3 and 4 of the respondents.March 15. 1949The Judgment of the Court was delivered by MUDHOLKAR J.SARDia.The appellants who are admittedly displaced persons from West Pakistan were granted quasi permanent allotment of 24 standard acres and 15 3/4 units in the village of Raikot in Ludhiana District in 1949.Their father Sardar Nand Singh who was 42, was found entitled to quasi permanent land in a village of Humbran in the same district in the year 1949. The appellants, however, was not able to consolidate his land allotm

In [18]:
test_summary_1

"The appellants who are displaced persons from West Pakistan, were granted quasi permanent allotment of some lands in village Raikot in 1949.\nOn October 31, 1952, the Assistant Custodian cancelled the allotment of 14 allottees in village Karodian, and also cancelled the allotment of the Appellants in Raikot but allotted lands to them in village Karodian, and allotted the lands of Raikot to other persons.\nThe 14 allottees of village Karodian as well as the appellants applied for review of the orders of cancellation of their allotment.\nThe application of the 14 allottees was dismissed.\nThey preferred a revision to the Custodian General who cancelled the appellant 's allotment (1) Cal.\n926. 329 in Karodian and restored the allotment of the 14 allottees on December 17, 1954 Thereupon,, on January 6, 1955, the appellants moved the Custodian General for calling up their review application and for revising the order of October 31, 1952, cancelling their allotment in Raikot.\nThe Custodia

In [27]:
# Save the fine-tuned model
model.save_pretrained('legal_bart')


In [28]:
tokenizer.save_pretrained('legal_bart')

('legal_bart/tokenizer_config.json',
 'legal_bart/special_tokens_map.json',
 'legal_bart/vocab.json',
 'legal_bart/merges.txt',
 'legal_bart/added_tokens.json')