<a href="https://colab.research.google.com/github/HackElite-FYP/Legal-Research-Platform-Core/blob/feature%2Fsummarization/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# install dependencies
# !python.exe -m pip install --upgrade pip

# %pip install pandas
# %pip install nltk
# %pip install numpy --only-binary :all:
# %pip install transformers sumy sentencepiece
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# %pip install notebook ipywidgets --upgrade
# %pip install language_tool_python

!pip install pandas numpy sumy
!pip install transformers nltk sentencepiece
!pip install torch
!pip install language_tool_python

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Collecting chardet (from breadability>=0.1.20->sumy)
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting lxml>=2.0 (from breadability>=0.1.20->sumy)
  Downloading lxml-6.0.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Downloading sumy-0.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# init variables
import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1" # This might not be necessary in Colab and can sometimes cause issues.
# In Colab, it's generally better to let the environment manage CUDA.

PROCESSING_FILE_PATH = '/content/drive/MyDrive/FYP/json/cases_2024.json'
PROCESSING_CASE_INDEX = 0
SUMMARY_MODELS = ['facebook/bart-large-cnn', 'google/pegasus-xsum', 't5-base', 'allenai/led-base-16384']

In [4]:
# load to dataframes
import json
import pandas as pd

with open(PROCESSING_FILE_PATH, 'r', encoding='utf-8') as f:
    cases_data = json.load(f)

cases_df = pd.DataFrame(cases_data)
print(cases_df.head())

                                     id     type amendmentTo  \
0  a2634d61-a78b-4d5e-919d-873eda1893b2      act               
1  cb352d87-dd9d-423c-8846-6471da29d9bc      act               
2  d77f9d39-725d-43e7-bc92-006293621fc6  unknown               
3  7a3dd81a-f327-44ab-b24a-7fd23b1d0393     case               
4  96fee0ca-f700-4679-b465-6ae638ece23b      act               

                                       filename primaryLang  \
0           cpa_0132_23_final_judgement_pdf.pdf          en   
1  court_of_appeal_judgment_hcc_0184_17_pdf.pdf          en   
2                        ca_writ_170_22_pdf.pdf          en   
3              wrt_0201_21_31_01_2024_1_pdf.pdf          en   
4        ca_phc_0066_12_final_judgement_pdf.pdf          en   

                                               title  \
0  The petitioner is seeking to challenge the ord...   
1  CA/HCC 184/2017 IN THE COURT OF APPEAL OF THE ...   
2                                           Untitled   
3           

In [5]:
# sentence tokenize
import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt_tab', 'resources/dependencies/nltk') # This path is for local download.
nltk.download('punkt_tab') # Download to the default nltk data path in Colab

cases_df['sentences'] = cases_df['cleanedText'].apply(sent_tokenize)
print(cases_df['sentences'])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


0      [The Attorney General, Attorney General’s Depa...
1      [ read with Article 138 of the Constitution of...
2      [IN THE COURT OF APPEAL OF THE DEMOCRATIC SOCI...
3      [WICKUM A. KALUARACHCHI, J., The Petitioner Co...
4      [., Court of Appeal No: Wanninayaka Mudiyansel...
                             ...                        
525    [ read with Article 138 of the Constitution of...
526    [ (‘TEWA’)., In the said Order ‘P9’ the 1st Re...
527    [ Court of Appeal Case No., 1., Wadduwage Ruwa...
528    [C.A., WRIT 88-2019 IN THE COURT OF APPEAL OF ...
529    [11., Prof. Mohan de Silva Former Chairman., 1...
Name: sentences, Length: 530, dtype: object


In [7]:
# ============== extractive summarization (unsupervised) ==============
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import math

text = cases_df.loc[PROCESSING_CASE_INDEX, 'cleanedText']
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

# Define the ratio for dynamic sentence count
summary_ratio = 0.1 # For example, 10% of the original text's sentences

# Calculate the number of sentences based on the ratio
original_sentences_count = len(parser.document.sentences)
dynamic_sentences_count = math.ceil(original_sentences_count * summary_ratio)

summary = summarizer(parser.document, sentences_count=dynamic_sentences_count)

original_word_count = len(text.split())
summary_word_count = sum(len(str(sentence).split()) for sentence in summary)

print(f"Original word count: {original_word_count}")
print(f"Summary word count: {summary_word_count}")
print(f"Original sentence count: {original_sentences_count}")
print(f"Summary sentence count (dynamic): {dynamic_sentences_count}")


for sentence in summary:
    print(sentence)

Original word count: 1986
Summary word count: 425
Original sentence count: 51
Summary sentence count (dynamic): 6
The accused has appeared before the High Court on notice on 16-09-2020, and after serving the indictment and other relevant documents on the accused, the learned High Court Judge of Kandy has released the accused on bail.
(3) An inquiry or trial in a Magistrate's Court shall not be postponed or adjourned on the ground of the absence of a witness unless the Magistrate has first satisfied himself that the evidence of such witness is material to the inquiry or trial and that reasonable efforts have been made to secure his attendance, and has recorded the name of such witness and the nature of the evidence which he is expected to give.
Therefore, it is quite obvious that although section 263 of the Code of Criminal Procedure Act provides for the remanding of a person pending further trial, the provisions of the Bail Act shall prevail over the said provision when it comes to the
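
The sentence budget in the cell above follows directly from the ratio: 51 sentences at a ratio of 0.1 round up to 6. A minimal sketch of that arithmetic as a reusable helper is shown here. The name `sentences_for_ratio` and the floor of one sentence are assumptions for illustration and are not part of sumy.

```python
import math

def sentences_for_ratio(total_sentences, ratio, minimum=1):
    """Return how many sentences to keep for a given ratio (hypothetical helper)."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Round up, and never return fewer than `minimum` sentences
    return max(minimum, math.ceil(total_sentences * ratio))

print(sentences_for_ratio(51, 0.1))  # 6, matching the run above
```

The result can be passed as `sentences_count` to the LexRank summarizer in place of the inline calculation.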

In [9]:
# ============== Abstractive Summarization (Hierarchical Approach) ==============
# BART generates new text, so this stage is abstractive, unlike the LexRank stage above
from transformers import pipeline, AutoTokenizer, BartForConditionalGeneration
import torch
from joblib import Parallel, delayed

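# Split text into token windows that overlap so context carries across chunk boundaries
def chunk_text(text, tokenizer, max_tokens=900, overlap_tokens=100):
    if overlap_tokens >= max_tokens:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    # Encode without BOS/EOS so every window holds document tokens only
    tokens = tokenizer.encode(text, add_special_tokens=False)
    total_tokens = len(tokens)
    chunks = []
    start = 0
    while start < total_tokens:
        end = min(start + max_tokens, total_tokens)
        chunk = tokenizer.decode(tokens[start:end], skip_special_tokens=True)
        chunks.append(chunk)
        if end == total_tokens:
            break
        start += (max_tokens - overlap_tokens)
    return chunks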

# Select the device: GPU 0 if CUDA is available, otherwise CPU (-1)
device = 0 if torch.cuda.is_available() else -1
if device == -1:
    print("Warning: No GPU available. Running on CPU will be very slow and may still encounter memory issues.")


model_name = SUMMARY_MODELS[0]
# Use AutoTokenizer and the specific model class for better compatibility
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Initialize the summarization pipeline once
summarizer_pipeline = pipeline("summarization", model=model, tokenizer=tokenizer, device=device)

text = cases_df.loc[PROCESSING_CASE_INDEX, 'text']

# Summarize a single chunk, sizing the output relative to the chunk's token count
def summarize_chunk(chunk, summarizer, tokenizer, chunk_summary_ratio):
    chunk_len = len(tokenizer.encode(chunk))
    # Target length is ratio * chunk length, clamped to [50, 512] tokens
    adjusted_max_len = max(50, min(int(chunk_len * chunk_summary_ratio), 512))

    summary = summarizer(
        chunk,
        max_length=adjusted_max_len,
        min_length=min(30, adjusted_max_len // 2),
        do_sample=False
    )[0]['summary_text']
    return summary

# Hierarchical summarization: summarize each chunk, then summarize the combined chunk summaries
def hierarchical_summary(text, summarizer, tokenizer,
                         max_chunk_tokens=512, overlap_tokens=100,
                         chunk_summary_ratio=0.25, final_summary_ratio=0.20,
                         n_parallel_jobs=2):  # keep the job count small to limit memory use

    # Chunk the original document
    chunks = chunk_text(text, tokenizer, max_tokens=max_chunk_tokens, overlap_tokens=overlap_tokens)

    print(f"Number of chunks: {len(chunks)}")

    # Summarize each chunk individually in parallel.
    # Threads share the single loaded model; process-based workers would each receive a pickled copy.
    print(f"Summarizing chunks in parallel using {n_parallel_jobs} jobs...")
    intermediate_summaries = Parallel(n_jobs=n_parallel_jobs, prefer="threads")(
        delayed(summarize_chunk)(chunk, summarizer, tokenizer, chunk_summary_ratio)
        for chunk in chunks
    )


    # Combine intermediate summaries
    combined_summary_text = " ".join(intermediate_summaries)

    # Generate final summary from intermediate summaries
    print("Generating final summary...")

    combined_summary_len = len(tokenizer.encode(combined_summary_text))
    # Target length is ratio * combined length, clamped to [150, 512] tokens
    final_summary_max_len = max(150, min(int(combined_summary_len * final_summary_ratio), 512))

    final_summary = summarizer(
        combined_summary_text,
        max_length=final_summary_max_len,
        min_length=min(75, final_summary_max_len // 2),
        truncation=True,  # guard: the combined summaries may exceed the model's input limit
        do_sample=False
    )[0]['summary_text']

    return final_summary

# Example usage
# n_parallel_jobs caps how many chunks are summarized at once, e.g. 2 or 4
final_summary = hierarchical_summary(
    text, summarizer_pipeline, tokenizer,
    max_chunk_tokens=512,
    overlap_tokens=96,
    chunk_summary_ratio=0.75,
    final_summary_ratio=0.50,
    n_parallel_jobs=2
)

print(f"Original word count: {len(text.split())}")
print(f"Summary word count: {len(final_summary.split())}")

print("Final Summary:")
print(final_summary)



Device set to use cpu


Number of chunks: 12
Summarizing chunks in parallel using 2 jobs...




Generating final summary...
Original word count: 1986
Summary word count: 53
Final Summary:
Rajapakse Gedara Ravindu Ratnayake (Presently in prison) is the accused in High Court of Kandy Case Number HC/141/2020. The accused has appeared before the High Court on notice on 16-09-2020, and after serving the indictment and other relevant documents on the accused, the learned High Court Judge has released the accused on bail.
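
The chunking and length-bounding arithmetic used in the cell above can be checked without loading a model. The sketch below works on token counts only. `window_spans` and `bounded_length` are hypothetical helper names, and the 1000-token document is a made-up example, not a value from the run above.

```python
def window_spans(total_tokens, max_tokens, overlap_tokens):
    """Return (start, end) token index pairs for overlapping windows."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    spans = []
    start = 0
    while start < total_tokens:
        end = min(start + max_tokens, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break
        start += max_tokens - overlap_tokens  # step forward, keeping the overlap
    return spans

def bounded_length(n_tokens, ratio, lower, upper):
    """Clamp ratio * n_tokens to the range [lower, upper]."""
    return max(lower, min(int(n_tokens * ratio), upper))

# Hypothetical 1000-token document with the settings from the example call
print(window_spans(1000, 512, 96))          # [(0, 512), (416, 928), (832, 1000)]
print(bounded_length(512, 0.75, 50, 512))   # 384
```

Each window starts `max_tokens - overlap_tokens` tokens after the previous one, so consecutive windows share 96 tokens here. The clamp keeps very short chunks from requesting a summary shorter than the lower bound.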


In [None]:
from transformers import pipeline
import torch

# Use GPU if available
device = 0 if torch.cuda.is_available() else -1

# Initialize refinement pipeline
refiner = pipeline("text2text-generation", model="google/flan-t5-base", device=device)

def refine_legal_summary(summary_text):
    prompt = (
        "Refine the following legal summary. Correct grammar, spelling, punctuation, "
        "remove repetition, and ensure clarity without changing any legal meaning:\n\n"
        f"{summary_text}"
    )

    refined_output = refiner(prompt, max_length=256, do_sample=False)
    refined_summary = refined_output[0]['generated_text']

    return refined_summary.strip()

# Example usage:
clean_summary = refine_legal_summary(final_summary)
print("Refined Final Summary:")
print(clean_summary)

Device set to use cuda:0


Refined Final Summary:
Petitioner is seeking to challenge the Order Made by the Learned High. The Attorney General, the Attorney General's department, is the respondent. The Accused has been charged with Grave Sexual Abuse o f a minor. He has been remanded on bail until the end of the trial. I am of the view that the learned high judge WAS MISDIRECTED. The onely Assumtion That CAN BE MADE is that the Remanding of The Accuses For A Period of 3 months HAD BEEN DONE AS A PUNITIVE measure. The order mode by this court previcly on 15 -12-2023, to release the Accused.
