# Summarization accuracy comparison: Llama 3.2 1B-Instruct vs Flan-T5 Large


In [None]:
!pip install evaluate rouge_score
!pip install py7zr

In [2]:
import transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset,Dataset
from rouge_score import rouge_scorer
import torch
from transformers.pipelines.pt_utils import KeyDataset
from huggingface_hub import login
import pandas as pd
login(token= "xxxx")
device = "cuda" if torch.cuda.is_available() else "cpu"

## Step 1: Load two datasets for the experiment

### 1.1 Load the Hugging Face SAMSum dataset

In [3]:
samsum = load_dataset("samsum", trust_remote_code = True)

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

In [4]:
samsum["train"][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

In [5]:
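# Token counts are approximated by whitespace-separated words (str.split()),
# not by a model tokenizer.
num_texts = 200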
avg_length = sum([len(samsum["train"][i]["dialogue"].split()) for i in range(num_texts)])/num_texts
print(f"The average token number of the first {num_texts} dialogues in samsum is roughly {avg_length}")
avg_summary = sum([len(samsum["train"][i]["summary"].split()) for i in range(num_texts)])/num_texts
print(f"The average token number of the first {num_texts} reference summaries in samsum is roughly {avg_summary}")

The average token number of the first 200 dialogues in samsum is roughly 94.815
The average token number of the first 200 reference summaries in samsum is roughly 20.195
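The averages above are computed with `str.split()`, so they count whitespace-separated words rather than model tokens. A minimal sketch of the same averaging as a reusable helper follows; the name `average_word_count` and the sample list are hypothetical and not part of the notebook.

```python
def average_word_count(texts):
    """Mean number of whitespace-separated words per text (0.0 for an empty list)."""
    if not texts:
        return 0.0
    return sum(len(t.split()) for t in texts) / len(texts)


# Two short utterances used only for illustration.
sample = [
    "Amanda: I baked  cookies. Do you want some?",
    "Jerry: Sure!",
]
print(average_word_count(sample))
```

A tokenizer-based count would generally be higher, because subword tokenizers split many words into several pieces.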


### 1.2 Load the Hugging Face CNN/DailyMail dataset

In [6]:
cnn_dailymail = load_dataset('cnn_dailymail', '2.0.0')

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [7]:
cnn_dailymail["train"][0]

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

In [8]:
num_texts = 200
avg_length = sum([len(cnn_dailymail["train"][i]["article"].split()) for i in range(num_texts)])/num_texts
print(f"The average token number of the first {num_texts} articles in CNN/dailymail is roughly {avg_length}")
avg_summary = sum([len(cnn_dailymail["train"][i]["highlights"].split()) for i in range(num_texts)])/num_texts
print(f"The average token number of the first {num_texts} reference summaries in CNN/dailymail is roughly {avg_summary}")

The average token number of the first 200 articles in CNN/dailymail is roughly 602.57
The average token number of the first 200 reference summaries in CNN/dailymail is roughly 41.385


## Step 2: Choose a set of prompts for prompt engineering and fair model comparison

In [9]:
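# Each prompt is a [prefix, suffix] pair wrapped around the text to summarize.
# The prompt strings are kept exactly as they were used to produce the results below.
Prompts = [
    ["Summarize the following dialogue. Dialogue: ", ", summary: "],
    ["<|system|>You are a helpful assistant.<|endoftext|> \n <|user|>How would you summarize the dialoge ? Dialogue: ",
     "<|endoftext|> \n <|assistant|> "],
]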

In [10]:
structured_prompt =  [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content":"Summarize the following text:\n"}]
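The two prompt styles can be sketched as small helper functions: one wraps a text in a plain prefix/suffix pair, the other builds a list of role-tagged messages. The function names `build_plain_prompt` and `build_chat_messages` are hypothetical; the strings are taken from the cells in this notebook.

```python
# Prefix/suffix pair of the first plain-string prompt.
PLAIN_PROMPT = ("Summarize the following dialogue. Dialogue: ", ", summary: ")


def build_plain_prompt(text, prefix_suffix=PLAIN_PROMPT):
    """Wrap the text in a prefix and a suffix to form one prompt string."""
    prefix, suffix = prefix_suffix
    return prefix + text + suffix


def build_chat_messages(text):
    """Build role-tagged messages; the text goes into the user turn."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Summarize the following text:\n{text}\nSummary:"},
    ]


print(build_plain_prompt("A: hi"))
print(build_chat_messages("A: hi")[1]["content"])
```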

## Step 3: Load models: Llama 3.2 1B-Instruct and Flan-T5 Large

### 3.1 Load Llama 3.2 1B-Instruct

In [67]:
Llama_id = "meta-llama/Llama-3.2-1B-Instruct"
Llama = pipeline(
    "text-generation",
    model=Llama_id,
    torch_dtype=torch.bfloat16,
    device_map= device,
    batch_size=8
)


Device set to use cuda


In [68]:
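# Reuse EOS as the pad token for batching, and pad on the left so that
# generation continues directly from the end of each prompt in a batch.
Llama.tokenizer.pad_token = Llama.tokenizer.eos_token
Llama.tokenizer.padding_side = "left"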
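Why the padding side matters for a decoder-only model can be shown with a pure-Python sketch. The helper `pad_batch` and the token ids are made up for illustration and do not reflect the tokenizer's internals.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + list(seq) if side == "left" else list(seq) + fill)
    return padded


batch = [[11, 12, 13], [21]]          # made-up token ids
left = pad_batch(batch, pad_id=0, side="left")
right = pad_batch(batch, pad_id=0, side="right")
print(left)
print(right)
```

With left padding every row ends in a real prompt token, so newly generated tokens are appended right after the prompt. With right padding the shorter row would end in pad tokens, and the continuation would follow padding instead of the prompt.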

#### An example of passing a structured (chat-style) prompt to Llama. Flan-T5 does not appear to offer an equivalent option.

In [69]:
# Define messages for input
text = samsum["train"]["dialogue"][1]
messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Summarize the following text:\n{text}\nSummary:\n"}
            ]

# Generate response
outputs = Llama(messages,
                max_new_tokens=256,
                do_sample=False,
                temperature=None,
                top_p=None,
                pad_token_id=Llama.tokenizer.eos_token_id)

In [71]:
print("The text to summarize:")
print(text)
print("----------------------------")
print("The summary generated by llama:")
print(outputs[0]["generated_text"][-1]["content"])

The text to summarize:
Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great
----------------------------
The summary generated by llama:
Olivia and Oliver are discussing the upcoming election, with Olivia expressing her support for the Liberal party.


### 3.2 Load the Flan-T5 Large model

In [None]:
tokenizer_t5 = T5Tokenizer.from_pretrained("google/flan-t5-large")
flant5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map=device)


## Step 4. Generate texts

In [None]:
num_texts = 200

In [None]:
# Generate summaries with Flan-T5 in batches (default batch size: 8).
# With truncation=True, inputs are truncated to the tokenizer's maximum length.
def generate_summaries_flant5(texts, batch_size=8, max_length=512):
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]  # Process in small batches
        # Tokenize and move tensors to the correct device
        inputs = tokenizer_t5(batch, return_tensors="pt", padding=True, truncation=True)
        inputs = {key: value.to(device) for key, value in inputs.items()}  # Move to model's device

        with torch.no_grad():  # Faster inference
            outputs = flant5.generate(**inputs, max_length=max_length)

        decoded_summaries = tokenizer_t5.batch_decode(outputs, skip_special_tokens=True)
        summaries.extend(decoded_summaries)

    return summaries
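The batching pattern used in the function above (slicing the input list in steps of `batch_size`) can be isolated as a small generator. This is only a sketch; the name `iter_batches` is hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


batches = list(iter_batches(list(range(10)), 4))
print(batches)

# With 200 texts and a batch size of 8, the loop runs 25 times.
print(len(list(iter_batches(list(range(200)), 8))))
```

The last batch may be shorter than `batch_size`, which is why the tokenizer call pads each batch to its own longest input.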

In [None]:
# Generate summaries with flant5
Summary_flant5 = []
for i in range(2):
    texts = samsum["train"]["dialogue"][:num_texts]
    texts = [Prompts[i][0] + x + Prompts[i][1] for x in texts]
    Summary_flant5.append(generate_summaries_flant5(texts))
    texts = cnn_dailymail["train"]["article"][:num_texts]
    texts = [Prompts[i][0] + x + Prompts[i][1] for x in texts]
    Summary_flant5.append(generate_summaries_flant5(texts))

In [None]:
# save the results
pd.DataFrame({"Samsum_prompt1":Summary_flant5[0],"CNN_prompt1":Summary_flant5[1],"Samsum_prompt2":Summary_flant5[2],"CNN_prompt2":Summary_flant5[3]}).to_csv("/content/drive/MyDrive/LLMs/results/flant5_results.csv",index = False)

In [None]:
# The function below is used to generate summaries using Llama 3.2 1B-Instruct.
def generate_summaries_llama(texts, structured_prompt=True, max_new_tokens=512, batch_size=8):
    """
    Generate summaries in batches with the Llama pipeline.

    structured_prompt=False: each prompt is a plain string.
    structured_prompt=True: each prompt is a list of dictionaries in which
        the system role and the user role are specified.
    """
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        output = Llama(batch, max_new_tokens=max_new_tokens,pad_token_id=Llama.tokenizer.eos_token_id)
        if structured_prompt:
            summaries.extend([output[j][0]["generated_text"][-1]["content"] for j in range(len(output))])
        else:
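            # Drop the echoed prompt from the output. Note that the "+ 1" also
            # skips the first character that follows the prompt.
            summaries.extend([output[j][0]["generated_text"][(len(text) + 1):] for j,text in zip(range(len(output)), batch)])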

    return summaries
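The effect of the prompt-stripping slice in the unstructured branch can be checked on a toy string. The sketch assumes the pipeline returns the prompt followed by the continuation, which is how the slicing above treats it; the continuation text here is invented.

```python
prompt = "Summarize the following dialogue. Dialogue: A: hi, summary: "
# Hypothetical pipeline output: the prompt followed by the model's continuation.
full_text = prompt + "A greets someone."

exact = full_text[len(prompt):]          # removes exactly the prompt
plus_one = full_text[len(prompt) + 1:]   # removes the prompt and one more character

print(repr(exact))
print(repr(plus_one))
```

If the continuation starts with a space or newline, the extra offset only removes that whitespace; otherwise it removes the first character of the summary.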

In [None]:
# Generate summaries with llama 3.2 1B
Summary_Llama = []
texts_samsum = samsum["train"]["dialogue"][:num_texts]
texts_cnn = cnn_dailymail["train"]["article"][:num_texts]

texts_unstructured_samsum = [Prompts[0][0] + x + Prompts[0][1] for x in texts_samsum]
texts_unstructured_cnn = [Prompts[0][0] + x + Prompts[0][1] for x in texts_cnn]
texts_structured_samsum = [[{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Summarize the following text:\n{text}\nSummary:"}] for text in texts_samsum]

texts_structured_cnn = [[{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Summarize the following text:\n{text}\nSummary:"}] for text in texts_cnn]

Summary_Llama.append(generate_summaries_llama(texts_unstructured_samsum,structured_prompt = False,max_new_tokens = 256))
#Summary_Llama.append(generate_summaries_llama(texts_unstructured_cnn,structured_prompt = False,max_new_tokens = 256))
Summary_Llama.append(generate_summaries_llama(texts_structured_samsum,structured_prompt = True,max_new_tokens = 256))
Summary_Llama.append(generate_summaries_llama(texts_structured_cnn,structured_prompt = True,max_new_tokens = 256))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [None]:
# save the results
pd.DataFrame({"Samsum_prompt_unstructured":Summary_Llama[0],"Samsum_prompt_strctured":Summary_Llama[1],"CNN_prompt_structured":Summary_Llama[2]}).to_csv("/content/drive/MyDrive/LLMs/results/Llama_results.csv",index = False)

## Step 5. Compare Rouge Scores

In [37]:
# Compute ROUGE-1/2/L for each (reference, candidate) pair and return both the
# average F-measure per metric and the per-example scores.
def macro_rouge(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for ref, cand in zip(references, candidates):
        score = scorer.score(ref, cand)
        scores.append(score)

    # Average the scores
    avg_scores = {metric: sum(score[metric].fmeasure for score in scores) / len(scores)
                  for metric in ['rouge1', 'rouge2', 'rougeL']}

    return avg_scores,scores
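To make the averaged F-measure easier to interpret, here is a simplified ROUGE-1 computed by hand. It is a sketch only: it lowercases and splits on whitespace, and it does not reproduce the tokenization or stemming of `rouge_score`. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F-measure (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f


reference = "amanda baked cookies for jerry"
short_candidate = "amanda baked cookies"
long_candidate = "amanda baked cookies and then she talked about many other things"

print(rouge1(reference, short_candidate))
print(rouge1(reference, long_candidate))
```

Both candidates recover the same three reference words, so recall is identical, but the longer candidate has lower precision and therefore a lower F-measure. This is one reason long generated summaries can score poorly against short references.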

In [11]:
# Read in previously saved results.
flant5_summary = pd.read_csv("/content/drive/MyDrive/LLMs/results/flant5_results.csv")
llama_summary = pd.read_csv("/content/drive/MyDrive/LLMs/results/Llama_results.csv")

In [36]:
llama_summary.loc[llama_summary.isnull().any(axis = 1)]

Unnamed: 0,Samsum_prompt_unstructured,Samsum_prompt_strctured,CNN_prompt_structured
127,,The conversation between Loreen and Melissa re...,"A group of football dignitaries, including FIF..."
160,,Ann Marie is trying to find a receipt for her ...,Azerbaijan's government has closed its embassy...


In [38]:
llama_summary.fillna("",inplace = True)
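The missing values above are consistent with empty generations being written to CSV as empty fields and read back as `NaN`. A small round-trip sketch with an in-memory buffer illustrates this pandas behavior; the frame contents are invented.

```python
import io

import pandas as pd

# Invented data: the second summary is an empty string.
df = pd.DataFrame({"id": [0, 1], "summary": ["a short summary", ""]})
buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                      # empty field becomes NaN
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)       # empty field stays ""

print(default_read["summary"].isna().tolist())
print(kept["summary"].tolist())
```

Either `fillna("")` after reading, as done in the cell above, or `keep_default_na=False` at read time restores the empty strings so that the scorer receives a string for every row.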

### 5.1 Rouge score comparison on SamSum

In [105]:
Summaries = [flant5_summary["Samsum_prompt1"].values,flant5_summary["Samsum_prompt2"].values,llama_summary["Samsum_prompt_unstructured"].values,llama_summary["Samsum_prompt_strctured"].values]
Model_prompt = ["flant5-prompt1","flant5-prompt2","llama-unstructured","llama-structured"]

In [106]:
Avg_score, Row_scores = [],[]
for summary_list in Summaries:
    avg_score,row_scores = macro_rouge(samsum["train"]["summary"][:num_texts], summary_list)
    Avg_score.append(avg_score)
    Row_scores.append(row_scores)

#### Table - Rouge scores comparison on SamSum

In [107]:
Rouge_compare_samsum = pd.DataFrame({"Model_prompt": Model_prompt,\
              "Rouge1":[score["rouge1"] for score in Avg_score],\
              "Rouge2":[score["rouge2"] for score in Avg_score],\
              "RougeL":[score["rougeL"] for score in Avg_score]})
Rouge_compare_samsum

| | Model_prompt | Rouge1 | Rouge2 | RougeL |
|---|---|---|---|---|
| 0 | flant5-prompt1 | 0.564212 | 0.336904 | 0.486534 |
| 1 | flant5-prompt2 | 0.551269 | 0.31561 | 0.465923 |
| 2 | llama-unstructured | 0.280247 | 0.094211 | 0.209001 |
| 3 | llama-structured | 0.337705 | 0.126921 | 0.255808 |


#### 5.1.1 Rouge scores comparison, separately on shorter and longer texts.

In [108]:
Rouge_samsum = pd.DataFrame({"text":samsum["train"]["dialogue"][:num_texts]})
Rouge_samsum["text_len"] = Rouge_samsum["text"].apply(lambda x: len(x.split()))

In [109]:
# Show an example of rouge scores of one single example.
Row_scores[0][0]

{'rouge1': Score(precision=0.875, recall=0.7777777777777778, fmeasure=0.823529411764706),
 'rouge2': Score(precision=0.42857142857142855, recall=0.375, fmeasure=0.39999999999999997),
 'rougeL': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765)}
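The `rougeL` entry is based on the longest common subsequence (LCS) between reference and candidate, so word order matters. The following is a simplified sketch without stemming or the tokenization used by `rouge_score`; `lcs_length` and `rouge_l` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """LCS-based precision, recall and F-measure on whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Same words, different order: unigram overlap is complete, the LCS is not.
print(rouge_l("a b c d", "b a c d"))
```

In this toy case all four words match as unigrams, but only three of them appear in the same order, so the LCS-based score is lower than the unigram-based one.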

In [110]:
# Add one column of per-example F-measures for every (model, prompt, metric) combination.
for score in ["rouge1","rouge2","rougeL"]:
    for model_prompt,row_scores in zip(Model_prompt,Row_scores):
        Rouge_samsum[f"{model_prompt}_{score}"] = [x[score].fmeasure for x in row_scores]

In [111]:
Rouge_samsum.describe()

Unnamed: 0,text_len,flant5-prompt1_rouge1,flant5-prompt2_rouge1,llama-unstructured_rouge1,llama-structured_rouge1,flant5-prompt1_rouge2,flant5-prompt2_rouge2,llama-unstructured_rouge2,llama-structured_rouge2,flant5-prompt1_rougeL,flant5-prompt2_rougeL,llama-unstructured_rougeL,llama-structured_rougeL
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,94.815,0.564212,0.551269,0.280247,0.337705,0.336904,0.31561,0.094211,0.126921,0.486534,0.465923,0.209001,0.255808
std,67.247243,0.185321,0.193439,0.152295,0.13372,0.221221,0.215617,0.086223,0.103291,0.201979,0.202819,0.121687,0.118592
min,12.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088
25%,47.75,0.438333,0.431124,0.17317,0.247973,0.181006,0.155458,0.026579,0.05861,0.322647,0.315273,0.129032,0.179608
50%,79.0,0.557778,0.554374,0.285714,0.318031,0.298469,0.285714,0.08,0.10458,0.454545,0.449842,0.20229,0.234138
75%,125.0,0.696158,0.666667,0.385795,0.421345,0.471989,0.432432,0.138742,0.179383,0.647794,0.60787,0.277778,0.32
max,366.0,1.0,1.0,0.684211,0.833333,1.0,1.0,0.444444,0.73913,1.0,1.0,0.608696,0.833333


In [112]:
Rouge_samsum[Rouge_samsum.text_len > 150].describe()

Unnamed: 0,text_len,flant5-prompt1_rouge1,flant5-prompt2_rouge1,llama-unstructured_rouge1,llama-structured_rouge1,flant5-prompt1_rouge2,flant5-prompt2_rouge2,llama-unstructured_rouge2,llama-structured_rouge2,flant5-prompt1_rougeL,flant5-prompt2_rougeL,llama-unstructured_rougeL,llama-structured_rougeL
count,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0
mean,228.142857,0.441557,0.428273,0.3066,0.307061,0.188628,0.194758,0.087011,0.096934,0.353356,0.329249,0.228329,0.20529
std,56.499567,0.132762,0.144463,0.119274,0.101477,0.114347,0.121256,0.061739,0.061081,0.125215,0.10968,0.092724,0.066577
min,158.0,0.076923,0.05,0.0,0.1,0.0,0.0,0.0,0.0,0.076923,0.05,0.0,0.083333
25%,181.0,0.359069,0.351461,0.25,0.247183,0.117225,0.136057,0.041967,0.06044,0.261939,0.282468,0.164104,0.164357
50%,225.0,0.437607,0.443152,0.296703,0.300335,0.183932,0.186047,0.077311,0.094224,0.321765,0.330846,0.228681,0.204416
75%,252.75,0.522815,0.518322,0.396736,0.373905,0.256653,0.235026,0.112994,0.117049,0.440974,0.38639,0.288511,0.244558
max,366.0,0.754717,0.711111,0.5,0.485437,0.395349,0.511628,0.266667,0.235294,0.641509,0.56,0.40708,0.335484
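The comparison of all dialogues against the subset with more than 150 words can also be expressed as a two-bucket split. This is a plain-Python sketch with made-up lengths and scores; `mean_by_length` is a hypothetical helper, and the threshold of 150 mirrors the filter used above.

```python
from statistics import mean


def mean_by_length(lengths, scores, threshold=150):
    """Average score for texts at or below the threshold and above it."""
    short = [s for n, s in zip(lengths, scores) if n <= threshold]
    long_ = [s for n, s in zip(lengths, scores) if n > threshold]
    return {
        "short": mean(short) if short else None,
        "long": mean(long_) if long_ else None,
    }


# Made-up values for illustration only.
lengths = [40, 80, 200, 300]
scores = [0.6, 0.5, 0.4, 0.3]
print(mean_by_length(lengths, scores))
```

Reporting both buckets side by side, rather than the full set next to the long subset, makes the short-text and long-text averages directly comparable.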


#### Table - Average length of generated summaries (whitespace-separated words) on SamSum

In [117]:
len_dict = {}
for model_prompt,summary_list in zip(Model_prompt,Summaries):
    len_dict[model_prompt] = [sum([len(x.split()) for x in summary_list])/num_texts]
samsum_summary_length = pd.DataFrame(len_dict)
samsum_summary_length

| | flant5-prompt1 | flant5-prompt2 | llama-unstructured | llama-structured |
|---|---|---|---|---|
| 0 | 19.51 | 19.015 | 52.385 | 56.505 |


### 5.2 Rouge scores comparison on CNN/DailyMail

In [119]:
Summaries = [flant5_summary["CNN_prompt1"].values,llama_summary["CNN_prompt_structured"].values]
Model = ["flant5","llama"]

In [120]:
Avg_score, Row_scores = [],[]
for summary_list in Summaries:
    avg_score,row_scores = macro_rouge(cnn_dailymail["train"]["highlights"][:num_texts], summary_list)
    Avg_score.append(avg_score)
    Row_scores.append(row_scores)

In [121]:
Avg_score

[{'rouge1': 0.2741289314419602,
  'rouge2': 0.10931915873699548,
  'rougeL': 0.20166076726829135},
 {'rouge1': 0.261070663086643,
  'rouge2': 0.09152511283799075,
  'rougeL': 0.1666507330010361}]

#### Table - Rouge score comparison on CNN/DailyMail

In [129]:
Rouge_compare_cnn = pd.DataFrame({"Model":Model,\
              "Rouge1":[score["rouge1"] for score in Avg_score],\
              "Rouge2":[score["rouge2"] for score in Avg_score],\
              "RougeL":[score["rougeL"] for score in Avg_score]})

Rouge_compare_cnn

| | Model | Rouge1 | Rouge2 | RougeL |
|---|---|---|---|---|
| 0 | flant5 | 0.274129 | 0.109319 | 0.201661 |
| 1 | llama | 0.261071 | 0.091525 | 0.166651 |
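One way to compare the size of the gap across the two datasets is the relative difference in average ROUGE-1. The sketch below uses the averages reported in this notebook (flant5-prompt1 vs. llama-structured on SamSum, flant5 vs. llama on CNN/DailyMail); `relative_gap` is a hypothetical helper.

```python
def relative_gap(better, worse):
    """Relative improvement of the better score over the worse one."""
    return (better - worse) / worse


samsum_gap = relative_gap(0.564212, 0.337705)   # ROUGE-1 on SamSum
cnn_gap = relative_gap(0.274129, 0.261071)      # ROUGE-1 on CNN/DailyMail
print(f"SamSum: {samsum_gap:.1%}, CNN/DailyMail: {cnn_gap:.1%}")
```

These are differences between sample averages over 200 examples each; no significance test is involved.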


#### Table - Average length of generated summaries (whitespace-separated words) on CNN/DailyMail

In [128]:
len_dict = {}
for model,summary_list in zip(Model,Summaries):
    len_dict[model] = [sum([len(x.split()) for x in summary_list])/num_texts]
cnn_summary_length = pd.DataFrame(len_dict)
cnn_summary_length

| | flant5 | llama |
|---|---|---|
| 0 | 20.075 | 157.11 |
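The generated lengths are easier to judge relative to the reference summaries. The following worked arithmetic uses only averages reported earlier in this notebook (first 200 examples of each dataset, whitespace-separated words); the dictionary layout is just one possible way to organize them.

```python
# Average reference summary length per dataset (from Step 1).
reference_len = {"samsum": 20.195, "cnn_dailymail": 41.385}

# Average generated summary length per (dataset, model/prompt).
generated_len = {
    ("samsum", "flant5-prompt1"): 19.51,
    ("samsum", "llama-structured"): 56.505,
    ("cnn_dailymail", "flant5"): 20.075,
    ("cnn_dailymail", "llama"): 157.11,
}

ratios = {key: length / reference_len[key[0]] for key, length in generated_len.items()}
for (dataset, model), ratio in ratios.items():
    print(f"{dataset:14s} {model:17s} {ratio:.2f}x reference length")
```

Flan-T5 stays at or below the reference length on both datasets, while Llama's summaries are several times longer than the references, which lowers ROUGE precision.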


## Conclusions & Observations

1. Flan-T5 Large achieves higher ROUGE scores than Llama 3.2 1B-Instruct on these summarization tasks, especially on short texts.

    *   The gap is large on SamSum (average ROUGE-1 of 0.564 vs. 0.338 with each model's best prompt) and much smaller on CNN/DailyMail (0.274 vs. 0.261).

    *   Within SamSum, the gap also narrows on longer dialogues: for the 28 dialogues with more than 150 words, average ROUGE-1 is 0.442 vs. 0.307.

2. Llama 3.2 1B-Instruct tends to produce long summaries, whereas Flan-T5 tends to produce short ones: on average about 52 to 57 words vs. about 19 on SamSum, and about 157 vs. about 20 on CNN/DailyMail.

3. With the prompts used, the unstructured prompt (i.e., a single string) performs worse for Llama on SamSum than the structured prompt (i.e., a list of role-tagged messages that specifies the system and user turns): average ROUGE-1 of 0.280 vs. 0.338.