#### This notebook contains the code for running inference on the [ILC](https://huggingface.co/datasets/d0r1h/ILC) test set using the [led-base-ilc](https://huggingface.co/d0r1h/led-base-ilc) model

Author: [Pawan Trivedi](https://twitter.com/d0r1h) <br>
Date created: 2022/05/06 <br>
Last modified: 2022/05/06 <br>
Description: Inference on test set for summarization task

In [14]:
%pip install transformers datasets sentencepiece rouge -qq
%pip install torch

Note: you may need to restart the kernel to use updated packages.



In [15]:
import torch
import pandas as pd
from rouge import Rouge
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [16]:
rouge = Rouge()
dataset = load_dataset("d0r1h/ILC", split='test')
dataset[:5]

{'Title': ['S.86(1)(f) of the Electricity Act, is a special provision which overrides the general provisions contained in S.11 of the Arbitration and Conciliation Act 1996: Supreme Court',
  'The petitioners were released on bail, as the allegations were not corroborated by the material brought before the police: High court of Patna',
  'The allegation being only that the petitioner had tried to commit immoral human trafficking act but could not succeed, the court granted bail: High court of Patna',
  'In service jurisprudence, seniority cannot be claimed from the date when the incumbent is yet to be borne in the cadre: Delhi High Court',
  'The statute does not mandate all components of the crime to be listed in the FIR: Bombay High Court'],
 'Summary': ['Section 86(1)(f) vests a statutory jurisdiction with the State Electricity Commission to adjudicate upon disputes between licensees and generating companies and to refer any dispute for arbitration. therefore, the appointment of arbi

In [17]:
CasesText = dataset['Case']
GoldSummary = dataset['Summary']

In [18]:
len(CasesText), len(GoldSummary)

(1015, 1015)

In [21]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(device)

cpu


In [12]:
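def summarize(model, tokenizer, Cases):
    """Generate one summary per case; returns a list of single-element lists."""
    SystemSummaries = []
    for i, case in enumerate(Cases):
        # Tokenize one case at a time and move the ids to the selected device.
        input_ids = tokenizer(case, return_tensors="pt").input_ids.to(device)
        # Put global attention on the first token only, as recommended for LED summarization.
        global_attention_mask = torch.zeros_like(input_ids)
        global_attention_mask[:, 0] = 1
        sequences = model.generate(input_ids, global_attention_mask=global_attention_mask)
        # batch_decode returns a list with one string per generated sequence.
        Summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)

        SystemSummaries.append(Summary)
        print(i)  # progress indicator

    return SystemSummaries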

In [13]:
checkpoint = "d0r1h/led-base-ilc"

tokenizer_led = AutoTokenizer.from_pretrained(checkpoint)
model_led = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [14]:
SystemSummary = summarize(model_led, tokenizer_led, CasesText)

Input ids are automatically padded from 2988 to 3072 to be a multiple of `config.attention_window`: 1024


KeyboardInterrupt: 
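
The padding message in the output above reports that 2988 input ids were padded to 3072 so that the length is a multiple of `config.attention_window` (1024). The arithmetic behind that number can be sketched as follows. The helper name `padded_length` is hypothetical and is not part of the `transformers` API; it only reproduces the rounding, not the padding itself.

```python
def padded_length(seq_len: int, attention_window: int) -> int:
    """Round seq_len up to the nearest multiple of attention_window."""
    remainder = seq_len % attention_window
    if remainder == 0:
        # Already aligned, no padding needed.
        return seq_len
    return seq_len + (attention_window - remainder)


# Values taken from the message above: 2988 tokens, window of 1024.
print(padded_length(2988, 1024))  # 3072
```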

In [None]:
# Each entry of SystemSummary is a one-element list; keep only the string.
SystemSummaryFinal = [summary[0] for summary in SystemSummary]

In [None]:
Summaries = pd.DataFrame(list(zip(GoldSummary, SystemSummaryFinal)), columns=['GoldSummary', 'SystemSummary'])

We ran inference on the test dataset in batches because of the time limit on Google Colab.
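
One way to split the work into batches is sketched below. This is only an illustration: the helper `batch_slices` is hypothetical, and the batch size of 300 is an assumption chosen because it happens to yield four parts for 1015 cases; the split actually used for the saved files is not stated in this notebook.

```python
def batch_slices(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_items)


# 1015 is the test-set size reported earlier; 300 is an assumed batch size.
for part, (start, stop) in enumerate(batch_slices(1015, 300), start=1):
    print(f"Summaries{part}.csv <- cases[{start}:{stop}]")
```

Each slice could then be passed to `summarize` in a separate session and written to its own CSV file, so that an interrupted run only loses the current batch.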

In [None]:
dir_path = "/content/drive/MyDrive/Working | Project/ILC/data/"

In [None]:
df1 = pd.read_csv(dir_path + "Summaries1.csv")
df2 = pd.read_csv(dir_path + "Summaries2.csv")
df3 = pd.read_csv(dir_path + "Summaries3.csv")
df4 = pd.read_csv(dir_path + "Summaries4.csv")

In [None]:
DF = pd.concat([df1, df2, df3, df4])
DF.reset_index(inplace=True, drop=True)

In [None]:
DF

Unnamed: 0,GoldSummary,SystemSummary
0,Section 86(1)(f) vests a statutory jurisdictio...,In the case of Gujarat Urja Vikas Nigam Limite...
1,The petitioner apprehended arrest under Sectio...,The case was taken up out of turn on the basis...
2,The petitioner was arrested under Sections 344...,The petitioner is running Dance Party from man...
3,In matters concerning administrative appointme...,The petitioners were appointed as Inspectors i...
4,The facts and information of the suspected off...,The FIR was registered on the basis of a repor...
...,...,...
1010,"In the present case, an appeal is preferred un...",The court observed that the court has not inve...
1011,Re-evaluation of answer sheets is not permissi...,“If an error is committed by the examination a...
1012,The presence of an alternate land that can be ...,The Landlord has a registered Society under th...
1013,Bail may be granted to an individual who is ac...,"The applicant is arrested on 18th November, 20..."


In [None]:
score = rouge.get_scores(DF['SystemSummary'], DF['GoldSummary'], avg=True)

In [None]:
# Label rows by metric key so the names do not depend on row order.
LEDRouge = pd.DataFrame(score).rename(index={'r': 'recall', 'p': 'precision', 'f': 'f-measure'})

In [None]:
LEDRouge*100

Unnamed: 0,rouge-1,rouge-2,rouge-l
recall,40.039816,22.3438,37.250347
precision,46.313938,25.252847,43.097858
f-measure,42.240362,23.187177,39.304978
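
To make the three rows of the table concrete, here is a minimal sketch of how ROUGE-1 recall, precision and F-measure can be computed for a single pair of texts. It assumes lowercased whitespace tokenization and clipped unigram counts; the `rouge` package used above may tokenize and count n-grams differently, so this function (a hypothetical `rouge_1`) is not expected to reproduce its numbers exactly. The two example sentences are made up for illustration.

```python
from collections import Counter


def rouge_1(system: str, gold: str) -> dict:
    """Unigram-overlap recall, precision and F1 with whitespace tokenization."""
    sys_counts = Counter(system.lower().split())
    gold_counts = Counter(gold.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((sys_counts & gold_counts).values())
    recall = overlap / max(sum(gold_counts.values()), 1)
    precision = overlap / max(sum(sys_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f-measure": f1}


scores = rouge_1(
    "the court granted bail",
    "the high court granted bail to the petitioner",
)
print(scores)
```

In this example every system token appears in the gold summary, so precision is 1.0, while only half of the gold tokens are covered, so recall is 0.5.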


In [None]:
DF.to_csv(dir_path + "LEDPrediction.csv", index=False, header=True)

In [None]:
LEDRouge.to_csv("/content/drive/MyDrive/Working | Project/ILC/score/LEDRouge.csv", header=True)