# Lab 1: Summarizing dialogue or key legal clauses from documents

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Libraries and Modules Description

In [None]:
import pandas as pd
from transformers import BartTokenizer, BartForConditionalGeneration

1. **`pandas`**
   - A powerful data manipulation and analysis library in Python. It provides data structures like DataFrames and Series, which are ideal for handling and analyzing structured data.
   - **Common Usage**:
     - Loading, cleaning, and analyzing data in various formats (CSV, Excel, JSON, etc.).
     - Performing operations like filtering, grouping, and aggregating large datasets.

2. **`transformers` (from `Hugging Face`)**
   - A widely used library for working with pre-trained deep learning models, especially in Natural Language Processing (NLP).
   - **Common Usage**:
     - Loading, training, and fine-tuning transformer-based models like BERT, GPT, BART, etc.
     - Tokenizing text data and generating text using pre-trained models.

   - **Submodules**:
     
     - **`BartTokenizer`**:
       - The tokenizer for BART (Bidirectional and Auto-Regressive Transformers) models. It preprocesses text by converting it into tokens that can be fed into the model.
     
     - **`BartForConditionalGeneration`**:
       - A pre-trained model specifically designed for conditional generation tasks such as text summarization, translation, and dialogue generation. This model can generate text based on a given input, such as summarizing a document.

#### Summary:
- **`pandas`** is used for manipulating and analyzing tabular data, ideal for handling datasets related to case law or other structured information.
- **`transformers.BartTokenizer`** is used to tokenize raw text data before feeding it into the BART model.
- **`transformers.BartForConditionalGeneration`** is a pre-trained model used for generating text in tasks like summarization, making it suitable for summarizing legal case law or other documents.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/cleaned_cases.csv')

### Initialization of BART Tokenizer and Model for Conditional Generation

In [None]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

1. **`BartTokenizer.from_pretrained("facebook/bart-large-cnn")`**:
   - This line initializes the BART tokenizer by loading the pre-trained tokenizer for the BART model from Hugging Face's model hub (`"facebook/bart-large-cnn"`).
   - **Purpose**:
     - The tokenizer is responsible for converting raw text input into numerical tokens (input IDs) that can be processed by the BART model. It handles tasks such as:
       - Splitting text into subword tokens (BART uses byte-level byte-pair encoding).
       - Adding special tokens required by the model (e.g., start-of-sequence, end-of-sequence tokens).
       - Mapping tokens to corresponding numerical values (token IDs).
   - **Pre-trained Model**: `"facebook/bart-large-cnn"` is a version of the BART model fine-tuned specifically for summarization tasks. The tokenizer is tailored to work with this model.

2. **`BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")`**:
   - This line loads the pre-trained BART model, specifically the version fine-tuned for conditional generation tasks such as summarization.
   - **Purpose**:
     - The BART model is an encoder-decoder (transformer) model that can generate text based on input. The `"facebook/bart-large-cnn"` checkpoint is fine-tuned on the CNN/DailyMail news dataset, so it is tuned for summarizing news articles. This lab applies it to legal documents, which differ from its training data.
     - **Conditional Generation** refers to the model generating text based on a specific input condition (e.g., summarizing an article, translating text). The model generates output that is "conditioned" on the input text.
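
The three tokenizer jobs listed above (splitting, adding special tokens, mapping to IDs) can be sketched with a toy example. This is only an illustration: `toy_encode`, `toy_decode`, and `TOY_VOCAB` are made-up names, and the whitespace split stands in for BART's real byte-level BPE. The four special-token IDs match the ones BART uses.

```python
# Toy illustration only: the real BartTokenizer uses byte-level BPE with a
# vocabulary of roughly 50,000 entries. The tiny vocabulary below is made up.
SPECIAL = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}  # BART's special-token IDs
TOY_VOCAB = {**SPECIAL, "the": 4, "appeal": 5, "is": 6, "dismissed": 7}


def toy_encode(text):
    """Split on whitespace, wrap in <s> ... </s>, and map tokens to IDs."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    # Words missing from the vocabulary fall back to the <unk> ID.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping the special tokens."""
    id_to_token = {i: t for t, i in TOY_VOCAB.items()}
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens)


ids = toy_encode("The appeal is dismissed")
print(ids)              # [0, 4, 5, 6, 7, 2]
print(toy_decode(ids))  # the appeal is dismissed
```

The `skip_special_tokens` flag mirrors the argument of the same name on the real tokenizer's `decode()` method.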
   
#### Summary:
- The **BART Tokenizer** (`BartTokenizer`) prepares raw text for input to the model by converting it into token IDs.
- The **BART Model** (`BartForConditionalGeneration`) is a pre-trained transformer model that generates text outputs, such as summarizations, based on the input provided by the tokenizer.
- The model was pre-trained on large text corpora and then fine-tuned for summarization; the tokenizer ships with it and uses the same vocabulary. This lab applies the pair to case law and other legal documents.

### Function: `summarize_text`

#### Purpose:
The `summarize_text` function uses the BART model to generate a summary of a given legal document or text. It condenses long-form text into a shorter version, maintaining the essential content and context.

#### Parameters:
- **`text`**: A string representing the legal document or text to be summarized.
- **`model`**: The pre-trained BART model for conditional generation (typically `BartForConditionalGeneration`).
- **`tokenizer`**: The BART tokenizer that converts the input text into tokens (used to prepare the input for the model).
- **`max_length`** (default: `150`): The maximum length of the generated summary, in tokens. Generation stops once this limit is reached, so a summary may end mid-sentence.
- **`min_length`** (default: `30`): The minimum length of the generated summary, in tokens. The model is prevented from ending the summary before this length.
- **`num_beams`** (default: `4`): The number of beams used in beam search for text generation. Higher values can improve the quality of the summary but increase computation time and memory use.

In [None]:
def summarize_text(text, model, tokenizer, max_length=150, min_length=30, num_beams=4):
    """Summarizes the given legal document text using BART."""
    # Tokenize the text; anything beyond 1024 tokens is cut off.
    inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)
    # Generate summary token IDs with beam search.
    summary_ids = model.generate(
        inputs['input_ids'],
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        length_penalty=2.0,
        early_stopping=True
    )
    # Convert the generated IDs back to text, dropping special tokens.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

#### Steps:
1. **Tokenization**:
   - The input `text` is tokenized using the provided `tokenizer`, which converts it into input IDs suitable for the BART model. Inputs longer than 1024 tokens are truncated, so any text beyond that point is dropped and cannot influence the summary.

2. **Generate Summary**:
   - The `model.generate()` method is used to generate a summary of the input text. This method takes the tokenized input (`input_ids`) and generates a summary based on the following parameters:
     - **`max_length`**: The maximum length of the summary (in tokens).
     - **`min_length`**: The minimum length of the summary.
     - **`num_beams`**: The number of beams used for beam search. Beam search helps in generating high-quality summaries by exploring multiple potential outputs.
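     - **`length_penalty`**: An exponent applied to the sequence length when beam scores are normalized. Because the scores are (negative) log-likelihoods, values above `0` favor longer sequences and values below `0` favor shorter ones; the `2.0` used here therefore nudges the model toward longer summaries.
     - **`early_stopping`**: When `True`, beam search stops as soon as `num_beams` complete candidate summaries have been found.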

3. **Decode and Return**:
   - The generated summary (in token IDs) is decoded back into human-readable text using the `tokenizer.decode()` method. Special tokens (such as start-of-sequence or end-of-sequence tokens) are removed during the decoding process.
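
The effect of `length_penalty` is easiest to see with a little arithmetic. The sketch below assumes the scoring rule described in the Hugging Face generation documentation, where a finished beam's summed log-probability is divided by its length raised to `length_penalty`. The function name `beam_score` and the two example beams are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: sum of token log-probs / length**penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams (the numbers are invented for illustration).
short = {"sum_logprobs": -5.0, "length": 10}
longer = {"sum_logprobs": -8.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(longer["sum_logprobs"], longer["length"], penalty)
    winner = "short" if s > l else "longer"
    print(f"length_penalty={penalty}: short={s:.3f} longer={l:.3f} -> {winner}")
```

With these numbers the short beam wins when `length_penalty` is `0.0`, and the longer beam wins at `1.0` and `2.0`, which is why a value of `2.0` tends to produce longer summaries.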
   
#### Output:
- The function returns a string containing the generated summary of the input `text`.

#### Example:
This function is typically used to summarize legal case documents, research papers, or any other long text that needs condensing into a more concise form while retaining important information.

#### Summary:
- The `summarize_text` function efficiently condenses long text into shorter, more digestible summaries using the pre-trained BART model, making it useful for summarizing legal documents or articles.
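
Because the tokenizer call truncates input at 1024 tokens, the tail of a long judgment never reaches the model. One common workaround, which is not part of this lab's code, is to split the token IDs into overlapping windows and summarize each window separately. The helper below is a minimal sketch of the splitting step only; `chunk_token_ids` and its `overlap` parameter are hypothetical names.

```python
def chunk_token_ids(token_ids, max_tokens=1022, overlap=64):
    """Split a list of token IDs into windows of at most `max_tokens`,
    with `overlap` tokens shared between consecutive windows.

    The default of 1022 leaves room for the <s> and </s> tokens that the
    tokenizer adds when a chunk is re-encoded (1022 + 2 = 1024).
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Example with plain integers standing in for real token IDs.
chunks = chunk_token_ids(list(range(2500)), max_tokens=1024, overlap=64)
print([len(c) for c in chunks])
```

In practice the IDs could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, and each chunk could be turned back into text with `tokenizer.decode(chunk)` before being passed to `summarize_text`. How the per-chunk summaries are then combined is a design choice this sketch leaves open.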

In [None]:
df['summary'] = df['cleaned_text'].apply(lambda x: summarize_text(x, model, tokenizer))

#### Purpose:
This line of code applies the `summarize_text` function to each entry in the `cleaned_text` column of a DataFrame (`df`) to generate a summarized version of the text. The summarized text is stored in a new column, `summary`.

#### Explanation:
- **`df['cleaned_text']`**: 
  - This refers to the column in the DataFrame `df` that contains the original or preprocessed text (e.g., cleaned legal documents or case law text).
  
- **`.apply(lambda x: summarize_text(x, model, tokenizer))`**: 
  - Because `df['cleaned_text']` is a pandas Series, `apply()` calls the given function once for each value in the column. In this case, it applies a lambda function to every entry of `cleaned_text`.
  - The lambda function takes each value (`x`) in the `cleaned_text` column and passes it to the `summarize_text()` function, along with the pre-loaded `model` and `tokenizer` to generate a summary of the text.
  
- **`summarize_text(x, model, tokenizer)`**:
  - This calls the `summarize_text` function, which uses the BART model to generate a summary for the provided text (`x`). The `model` and `tokenizer` are pre-trained components necessary for generating the summary.

- **`df['summary']`**:
  - The result of applying `summarize_text` to each entry in the `cleaned_text` column is assigned to a new column named `summary`. This column will contain the summarized text for each document.
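
If the CSV contains empty cells, pandas reads them as `NaN` (a float), and passing that to the tokenizer would raise an error partway through the run. A small guard can be wrapped around the call. This is a sketch under that assumption; `safe_summarize` and `min_chars` are hypothetical names, not part of the lab.

```python
def safe_summarize(text, summarize_fn, min_chars=1):
    """Return summarize_fn(text) for usable strings, else an empty string.

    `summarize_fn` is any one-argument callable, for example
    lambda t: summarize_text(t, model, tokenizer).
    """
    if not isinstance(text, str):   # catches NaN/None from missing CSV cells
        return ""
    stripped = text.strip()
    if len(stripped) < min_chars:   # skip blank or whitespace-only rows
        return ""
    return summarize_fn(stripped)


# Stand-in summarizer so the guard can be tried without loading the model.
fake_summarizer = lambda t: t[:20]
print(safe_summarize("  The appeal is dismissed.  ", fake_summarizer))
print(repr(safe_summarize(float("nan"), fake_summarizer)))
```

With the real model loaded, the column could be processed as `df['cleaned_text'].apply(lambda x: safe_summarize(x, lambda t: summarize_text(t, model, tokenizer)))`.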

#### Outcome:
After running this line of code:
- The `df` DataFrame will have a new column (`summary`) containing the summarized text for each corresponding entry in the `cleaned_text` column.
- This allows for easy comparison between the original document and its summary.

#### Summary:
This line summarizes the content of a DataFrame's `cleaned_text` column by applying the `summarize_text` function to each entry, one row at a time, and storing the result in a new `summary` column. Because every row triggers a separate beam-search generation, the run time grows with the number of documents.
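
For larger datasets, one option is to pass several documents to `generate()` at once instead of calling the model per row. The sketch below is an assumption about how that could look, not the lab's method: `summarize_batch` and `batched` are hypothetical helpers, and the batch size in the usage comment is arbitrary. It relies on the tokenizer's `padding=True` option and on `batch_decode()`, both standard Hugging Face tokenizer features.

```python
def summarize_batch(texts, model, tokenizer, max_length=150, min_length=30, num_beams=4):
    """Summarize a list of strings with a single generate() call."""
    inputs = tokenizer(
        texts,
        max_length=1024,
        truncation=True,
        padding=True,          # pad shorter texts so they stack into one tensor
        return_tensors="pt",
    )
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # lets the model ignore padding
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        length_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Possible usage with the objects from this lab (batch size is an arbitrary choice):
# summaries = []
# for batch in batched(df['cleaned_text'].tolist(), 4):
#     summaries.extend(summarize_batch(batch, model, tokenizer))
# df['summary'] = summaries
```

Larger batches use more memory, so the batch size usually has to be tuned to the available hardware.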

In [None]:
summ = 'summarized_case_law.csv'

In [None]:
df.to_csv(summ, index=False)

In [None]:
for i, row in df.iterrows():
    print(f"Original Case Law {i + 1}:\n{row['cleaned_text'][:500]}...\n")
    print(f"Summary {i + 1}:\n{row['summary']}\n")
    print("="*80)

Original Case Law 1:
Judgments and decisions from 2001 onwards
[2025] UKFTT 38 (GRC)
In the First-tier Tribunal
(General Regulatory Chamber)
Information Rights
Case numbers:  EA.2018.0239.GDPR
EA.2018.0240.GDPR
EA.2019.0022.GDPR
EA.2019.0023.GDPR
EA.2019.0033.GDPR
EA.2021.0130.GDPR
EA.2021.0144
EA.2022.0206.GDPR
EA.2022.0420.GDPR
EA.2023.0083
EA.2023.0251
EA.2023.0057
Before:  District Judge Moan
Appellant:  Christopher Hart
Respondent:  Information Commissioner
ORDER
(The Tribunal Procedure (First-tier Tribunal) (G...

Summary 1:
The Tribunal issued a stay of proceedings at the request of the Appellant due to ill-health on 2024-January-10. This followed on from previous stays for the same reason. No application was made to lift to stay and the stay lapsed. The Appellants application for a review of the Order dated 2023-June-8 is refused.

Original Case Law 2:
Judgments and decisions from 2001 onwards
[2025] EWHC 50 (Ch)
The Rolls Building
Fetter Lane
London  EC4A 1NL
MR. JUSTICE TROWE