<table align="center" width=150%>
    <tr>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                    <b> AT 2: Text Summarization Using Long T5 model on PubMed Dataset <br>
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

This notebook applies the pre-trained long-t5-tglobal-base model to generate summaries for 1000 test records of the PubMed dataset, in an attempt to reproduce the results reported by Guo et al. (2022). Because of limited compute resources and time, the evaluation was restricted to 1000 test records, which took about 2.6 hours (9,395 seconds) on a T4 GPU.

The code implementation follows the summarization guide on the Hugging Face website (Summarization, n.d.).

###1. Installing the required packages and modules

In [None]:
!pip install torch transformers datasets evaluate rouge_score


###2. Importing the required libraries

In [None]:
from transformers import AutoTokenizer, LongT5ForConditionalGeneration
from datasets import load_dataset
import evaluate
import torch
import time

###3. Loading the base model - Long T5 with TGlobal attention (base size) for Text Summarization

In [None]:
# Loading the pretrained LongT5 base model (transient-global attention)
model_name = "google/long-t5-tglobal-base"
pretrained_model = LongT5ForConditionalGeneration.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU and move the model to device
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
pretrained_model.to(DEVICE)

# Enable gradient checkpointing (this lowers memory use during training; the evaluation below runs without gradients)
pretrained_model.gradient_checkpointing_enable()


###4. Loading the PubMed dataset (for text summarization) from Hugging Face datasets library

In [None]:
# Load PubMed dataset from Hugging Face datasets library
pubmed_dataset = load_dataset("ccdv/pubmed-summarization")


In [None]:
pubmed_dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'abstract'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6658
    })
})

In [None]:
pubmed_dataset["train"][0]["article"]

"a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . \n in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . \n the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . \n anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . \n snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states \n there are also some reports regarding school feeding programs in developing countries

In [None]:
pubmed_dataset["train"][0]["abstract"]

"background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . \n the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post - intervention were statistically compared.results:the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . \n however , there were no significant changes among boys or total population . \n the mean of all anthropometric indi

###5. Tokenisation of Test Data (maximum input length 4k (4096) tokens, maximum output length 512 tokens)

In [None]:
#Loading the Tokenizer for 4k input
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [None]:
# Tokenization function for the dataset
def preprocess_tokenise(examples):
    inputs = examples['article']
    model_inputs = tokenizer(inputs, max_length=4096, truncation=True, padding="max_length")  # 4k token limit

    # Labels are the summaries/abstracts (target text). Tokenise them as targets;
    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples['abstract'], max_length=512, truncation=True, padding="max_length")  # 512 token limit
    # Add tokenized summaries as labels
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the test dataset with the new max length
test_data = pubmed_dataset["test"].map(preprocess_tokenise, batched=True)




In [None]:
#Displaying the test data attributes after tokenization
test_data
# After tokenisation, the test data contains input_ids and attention_mask (generated from the article) and labels (generated from the abstract, i.e. the reference summary)

Dataset({
    features: ['article', 'abstract', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 6658
})
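
To make the effect of `truncation=True` and `padding="max_length"` concrete, the following is a minimal pure-Python sketch of the idea, not the tokenizer's actual implementation. The helper `pad_or_truncate` and the default `pad_id=0` are assumptions made for illustration, and the sketch ignores special tokens such as the end-of-sequence marker that a real tokenizer also handles.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for truncation=True combined with padding="max_length"."""
    kept = list(token_ids[:max_length])              # drop tokens beyond the limit
    n_pad = max_length - len(kept)                   # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad              # right-pad to a fixed length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut off at max_length
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With a limit of 4096, every article shorter than the limit is padded up to it and every longer article loses its tail, so each row of `input_ids` has the same length.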

###6. Evaluation of the pre-trained LongT5 base model on 1000 Test records

In [None]:
# Loading the ROUGE metric for evaluating summarization quality
rouge = evaluate.load("rouge")

In [None]:
#Function to evaluate the pre-trained LongT5 base model with Tglobal attention on 1000 test records
# It computes and returns the rouge scores and total time taken

def test_eval(tokenized_test_dataset):
    predictions = []
    references = []
    # Start time for evaluation
    start_time = time.time()
    # Looping over the test dataset
    for example in tokenized_test_dataset:
        # Convert input_ids and attention_mask to tensors, add a batch dimension and move to device
        input_ids = torch.tensor(example['input_ids']).unsqueeze(0).to(DEVICE)
        attention_mask = torch.tensor(example['attention_mask']).unsqueeze(0).to(DEVICE)
        # Ensuring 'labels' (reference summary) is a list of token IDs and decode it
        if isinstance(example['labels'], list):
            reference_text = tokenizer.decode(example['labels'], skip_special_tokens=True)
        else:
            raise ValueError("Expected 'labels' to be a list of token IDs.")

        # Generating the prediction (summary) from the model
        with torch.no_grad():
            generated_ids = pretrained_model.generate(
                input_ids,
                attention_mask=attention_mask,  # Mask out the padding tokens explicitly
                max_length=512,  # Limit max length of output sequence to 512
                num_beams=2,  # Beam search for better results
                early_stopping=True
            )
        # Decoding the generated output
        pred_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
        # Appending the predicted and reference summaries to respective lists
        predictions.append(pred_text)
        references.append(reference_text)
    # End time for evaluation
    end_time = time.time()
    # Calculate the time taken for evaluation
    time_taken = end_time - start_time
    # Compute ROUGE scores for pre-trained model using the evaluate library
    rouge_result = rouge.compute(predictions=predictions, references=references)
    # Log the time taken for the evaluation
    print(f"Time taken to evaluate on {len(predictions)} test records: {time_taken:.2f} seconds")
    return rouge_result
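
The loop above sends one record to `generate` per iteration. One possible variation, not tried in this notebook, is to group records into small batches first. The sketch below covers only the grouping step in plain Python; `batch_ranges` is a hypothetical helper name, and whether batches of 4096-token inputs fit in T4 memory is an open assumption.

```python
def batch_ranges(num_records, batch_size):
    """Yield consecutive index ranges that together cover num_records."""
    for start in range(0, num_records, batch_size):
        # The last range is shorter when num_records is not a multiple of batch_size
        yield range(start, min(start + batch_size, num_records))


# Example: 10 records grouped into batches of 4
for idx in batch_ranges(10, 4):
    print(list(idx))
```

Each yielded range could be passed to `test_data.select(...)`, in the same way the notebook calls `test_data.select(range(1000))`.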

In [None]:
# Evaluation of pre-trained model on 1000 test records
print("Evaluating pre-trained model...")
result = test_eval(test_data.select(range(1000)))

Evaluating pre-trained model...




Time taken to evaluate on 1000 test records: 9395.08 seconds
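
As a rough back-of-the-envelope check on the runtime reported above, the per-record cost and a projection for the full test split can be worked out as follows. The projection assumes every record takes about the same time, which is only an approximation.

```python
# Figures taken from the notebook output
elapsed_seconds = 9395.08     # reported wall-clock time for the subset
records_evaluated = 1000      # number of test records evaluated
full_test_size = 6658         # rows in the PubMed test split

seconds_per_record = elapsed_seconds / records_evaluated
elapsed_hours = elapsed_seconds / 3600
# Linear extrapolation to the whole test split (assumes a constant cost per record)
projected_full_hours = seconds_per_record * full_test_size / 3600

print(f"Per record: {seconds_per_record:.2f} s")
print(f"Subset run: {elapsed_hours:.2f} h")
print(f"Projected full test split: {projected_full_hours:.1f} h")
```

This gives roughly 9.4 seconds per record, 2.6 hours for the 1000-record subset, and on the order of 17 hours for all 6658 test records under the same settings.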


In [None]:
# Displaying ROUGE results for 1000 test records
test_rouge_results = {key: value * 100 for key, value in result.items()}
print("Evaluation Results on Test Records:")
print(f"ROUGE-1: {test_rouge_results['rouge1']:.2f}")
print(f"ROUGE-2: {test_rouge_results['rouge2']:.2f}")
print(f"ROUGE-L: {test_rouge_results['rougeL']:.2f}")

Evaluation Results on Test Records:
ROUGE-1: 29.16
ROUGE-2: 8.54
ROUGE-L: 16.74


The ROUGE scores above (F-measures reported by the `evaluate` library, scaled to 0-100) can be interpreted as follows:
1. ROUGE-1 = 29.16 measures unigram overlap. It balances the share of the abstract's words that appear in the generated summary (recall) against the share of generated words that appear in the abstract (precision).
2. ROUGE-2 = 8.54 measures the overlap of bigrams (pairs of consecutive words) between the abstract and the generated summary. This value is typically lower than ROUGE-1 because matching two consecutive words is harder than matching individual words.
3. ROUGE-L = 16.74 is based on the longest common subsequence, that is, the longest sequence of words that appear in the same order (not necessarily adjacent) in both the abstract and the generated summary. It reflects how well the word order of the reference is preserved.

Overall, the model matches a moderate share of individual words but struggles to reproduce word pairs and longer sequences, as the low ROUGE-2 and ROUGE-L values show. Hence there is scope for improving sentence structure and coherence.
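
To illustrate what these three scores measure, here is a simplified pure-Python sketch. It is not the `rouge_score` implementation used above: it splits on whitespace, applies no stemming or other normalisation, and the function names are chosen only for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F-measure from clipped n-gram overlap."""
    pred = ngram_counts(prediction.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counted at most min(pred, ref) times
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F-measure from the longest common subsequence."""
    p, r = prediction.split(), reference.split()
    return f_measure(lcs_length(p, r), len(p), len(r))


pred_text = "the cat sat on the mat"
ref_text = "the cat lay on the mat"
print(rouge_n(pred_text, ref_text, 1), rouge_n(pred_text, ref_text, 2), rouge_l(pred_text, ref_text))
```

In this toy pair, five of six unigrams match but only three of five bigrams do, which mirrors why ROUGE-2 sits below ROUGE-1 in the results above.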

###7. Generating summary for sample test data

In [None]:
# Generate summaries
def generate_summary(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)  # Adjust for 4k input length
    input_ids = inputs.input_ids.to(DEVICE)
    attention_mask = inputs.attention_mask.to(DEVICE)
    summary_ids = pretrained_model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=2)
    summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    # Clear the CUDA cache after each call to free up memory
    torch.cuda.empty_cache()
    return "".join(summary)

In [None]:
# Test the model to generate summary
test_article = pubmed_dataset['test'][0]['article']
print(f"Article: {test_article}\n\n\n")
test_reference = pubmed_dataset['test'][0]['abstract']
print(f"Abstract: {test_reference}\n\n\n")
summary = generate_summary(test_article)
print(f"Generated Summary: {summary}")

Article: anxiety affects quality of life in those living with parkinson 's disease ( pd ) more so than overall cognitive status , motor deficits , apathy , and depression [ 13 ] . 
 although anxiety and depression are often related and coexist in pd patients , recent research suggests that anxiety rather than depression is the most prominent and prevalent mood disorder in pd [ 5 , 6 ] . yet , 
 our current understanding of anxiety and its impact on cognition in pd , as well as its neural basis and best treatment practices , remains meager and lags far behind that of depression . 
 overall , neuropsychiatric symptoms in pd have been shown to be negatively associated with cognitive performance . 
 for example , higher depression scores have been correlated with lower scores on the mini - mental state exam ( mmse ) [ 8 , 9 ] as well as tests of memory and executive functions ( e.g. , attention ) [ 1014 ] . 
 likewise , apathy and anhedonia in pd patients have been associated with executiv

In [None]:
# Test the model to generate summary
test_article = pubmed_dataset['test'][2500]['article']
print(f"Article: {test_article}\n\n\n")
test_reference = pubmed_dataset['test'][2500]['abstract']
print(f"Abstract: {test_reference}\n\n\n")
summary = generate_summary(test_article)
print(f"Generated Summary: {summary}")

Article: the arrival of the precision medicine era brings new opportunities and challenges for patients undergoing precision diagnosis and treatment . 
 the morbidity and mortality rates associated with malignant tumors have increased year by year , and the burden of malignant neoplasms is increasing . 
 minimally invasive surgical treatment , minimally invasive treatment guided by imaging navigation , and specific therapy of targeted drugs are important aspects of precision oncology . with further development of medical imaging technology , information from different imaging modalities can be integrated and comprehensively analyzed by the imaging fusion system , which provides more image information of tumors from different angles and dimensions to accurately make qualitative and quantitative diagnoses and achieve the aim of precision tumor treatment . 
 multimodality image fusion technology has become the main trend in the development of future imaging technology . 
 this article rev

In [None]:
test_article = pubmed_dataset['test'][5000]['article']
print(f"Article: {test_article}\n\n\n")
test_reference = pubmed_dataset['test'][5000]['abstract']
print(f"Abstract: {test_reference}\n\n\n")
summary = generate_summary(test_article)
print(f"Generated Summary: {summary}")

Article: hypothalamic - pituitary - adrenal ( hpa ) axis dysfunction in mood disorders is one of the most robust findings in biological psychiatry . 
 however , considerable debate surrounds the nature of the core abnormality , its cause , consequences and treatment implications . 
 to review the evidence for the role of hpa axis dysfunction in the pathophysiology of mood disorders with particular reference to corticosteroid receptor pathology . 
 a selective review of the published literature in this field , focusing on human studies . 
 the nature of basal hpa axis dysregulation described in both manic and depressed bipolars appears to be similar to those described in mdd . but 
 studies using the dexamethasone/ corticotropin releasing hormone ( dex / crh ) test and dexamethasone suppression test ( dst ) have shown that hpa axis dysfunction is more prevalent in bipolar than in unipolar disorder . 
 there is robust evidence for corticotropin releasing hormone ( crh ) hyperdrive and gl

###References
Guo, M., Ainslie, J., Uthus, D. C., Ontañón, S., Ni, J., Sung, Y.-H., & Yang, Y. (2022). LongT5: Efficient Text-To-Text Transformer for Long Sequences. https://doi.org/10.18653/v1/2022.findings-naacl.55

Summarization. (n.d.). Huggingface.co. https://huggingface.co/docs/transformers/en/tasks/summarization