# Text summarization with an LLM

!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install -U bertviz
!pip install -U umap-learn
!pip install -U sentencepiece
!pip install -U urllib3
!pip install py7zr

In [2]:
# Import packages
from datasets import load_dataset

## Explore the Dataset :

In [3]:
# Load the "ccdv/cnn_dailymail" dataset (the news corpus that bart-large-cnn was fine-tuned on) from the Hugging Face Hub:
dataset = load_dataset("ccdv/cnn_dailymail", version="3.0.0")
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [4]:
# Explore the structure of "dataset":
"""
"dataset" is a datasets.dataset_dict.DatasetDict. It behaves much like a dict
that maps each split name to a column-oriented Dataset:

dataset = DatasetDict({
    "train":      {"article": ["article1", "article2"],
                   "highlights": ["highlights1", "highlights2"],
                   "id": ["id1", "id2"]},
    "validation": {"article": ["article3", "article4"],
                   "highlights": ["highlights3", "highlights4"],
                   "id": ["id3", "id4"]},
    "test":       {"article": ["article5", "article6"],
                   "highlights": ["highlights5", "highlights6"],
                   "id": ["id5", "id6"]},
})
"""


dataset["train"][1:5]

{'article': ['(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men\'s 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I\'m proud of myself and I\'ll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory
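
Slicing a split returns a dictionary of columns rather than a list of rows, which is why the output above starts with `'article': [...]`. The following is a minimal plain-Python sketch of that column-oriented behaviour. The `toy_split` dict and the two helper functions are hypothetical stand-ins for illustration, not part of the `datasets` API.

```python
# Toy stand-in for one split: each column is a list (hypothetical data, not the real dataset).
toy_split = {
    "article": ["article0", "article1", "article2"],
    "highlights": ["highlights0", "highlights1", "highlights2"],
    "id": ["id0", "id1", "id2"],
}

def get_row(split, index):
    """Mimic split[index]: a dict with one value per column."""
    return {name: column[index] for name, column in split.items()}

def get_slice(split, start, stop):
    """Mimic split[start:stop]: a dict with one list per column."""
    return {name: column[start:stop] for name, column in split.items()}

row = get_row(toy_split, 1)
rows = get_slice(toy_split, 1, 3)
print(row)
print(rows)
```

This is why `dataset['train'][1]['article']` yields a single string, while `dataset['train'][1:5]['article']` yields a list of strings.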

In [5]:
type(dataset)

datasets.dataset_dict.DatasetDict

In [6]:
# Take the first 1,000 characters of the article at index 1.
dataset['train'][1]['article'][:1000]

'(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men\'s 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I\'m proud of myself and I\'ll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory was never se

In [7]:
# Use the article at index 1 as the input text for summarization.
input_text = dataset['train'][1]['article']
input_text

'(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men\'s 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I\'m proud of myself and I\'ll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory was never se

## Basic pre-trained model (not fine-tuned) :

In [8]:
# Import the pipeline API to load the pre-trained BART model
from transformers import pipeline

# Load the model with the Hugging Face pipeline API:
pipe = pipeline('summarization', model='facebook/bart-large-cnn')

# Summarize the "input_text":
pipe_out = pipe(input_text)

# To keep only the summary string, use pipe_out[0]['summary_text'].
pipe_out

[{'summary_text': "Usain Bolt wins his third gold of the world championships in Moscow. Bolt anchors Jamaica to victory in the men's 4x100m relay. The 26-year-old has now won eight gold medals at the championships. Jamaica's women also win gold in the 4x50m and 4x200m relays."}]

In [9]:
# Example: summarize a custom news text.
txt = "For much of this season, Liverpool fans had spoken about sending Jürgen Klopp into the sunset with a historic quadruple of trophies.Now, however, it is looking increasingly likely that the German manager, who is revered in the red half of the city as something of a god-like figure, will finish his final season in charge at the club winning only the League Cup.On Wednesday, Liverpool was stunned 2-0 by struggling local rival Everton in the final Merseyside derby of Klopp’s reign, all but ending the team’s hopes of winning a second Premier League title during the German’s tenure.Everton laid siege to Liverpool’s goal with a barrage of set pieces, with Jarrad Branthwaite’s opener resulting from a long free-kick into the box and Dominic Calvert-Lewin heading home from a corner in the second half to double the Toffees’ lead.It was the first derby defeat Liverpool had suffered at Goodison Park in 14 years and the home faithful made sure to rub salt into the wounds.You lost the league at Goodison Park,” was the chant from a delirious crowd.The defeat leaves Liverpool in second place, three points behind Arsenal and one ahead of Manchester City, though Pep Guardiola’s side now has two games in hand.Obviously very disappointed, Klopp told Sky Sports after the game. In a lot of things, we let it become exactly the game that Everton wanted. Two goals from set pieces … there they are really strong."
pipe(txt)

[{'summary_text': "Liverpool beaten 2-0 by Everton in final Merseyside derby of Jürgen Klopp's reign. Jarrad Branthwaite and Dominic Calvert-Lewin scored for the Toffees. Defeat leaves Liverpool in second place, three points behind Arsenal and one ahead of Manchester City."}]

## Fine-tuning the "bart-large-cnn" model :

In [10]:
# from datasets import load_dataset
from transformers import pipeline

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

In [11]:
# Load the 'facebook/bart-large-cnn' model and its tokenizer:
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # 'gpu' is not a valid PyTorch device name
model_ckpt = 'facebook/bart-large-cnn'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

### Prepare data for fine-tuning :

In [12]:
# Load the "samsum" dataset (dialogues paired with reference summaries) for fine-tuning:
samsum = load_dataset('samsum')
samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [13]:
samsum['train'][0:3]

{'id': ['13818513', '13728867', '13681000'],
 'dialogue': ["Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
  'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great',
  "Tim: Hi, what's up?\r\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\r\nTim: What did you plan on doing?\r\nKim: Oh you know, uni stuff and unfucking my room\r\nKim: Maybe tomorrow I'll move my ass and do everything\r\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\r\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\r\nTim: It really helps\r\nKim: thanks, maybe I'll do that\r\nTim: I also like using post-its in kaban style"],
 'summary': ['Amanda baked cookies and will bring Jerry some tomorrow.',
  'Olivia and Olivier are voting for liberals in this election. ',
  'Kim may try the pomo

In [14]:
# Tokenization function used to prepare the data for fine-tuning:

def get_feature(batch):
    # Tokenize the dialogues (model inputs) and the summaries (targets) of the batch:
    encodings = tokenizer(batch['dialogue'], text_target=batch['summary'],
                          max_length=1024, truncation=True)

    # Keep only the fields the model needs:
    encodings = {'input_ids': encodings['input_ids'],            # token ids of the dialogue
                 'attention_mask': encodings['attention_mask'],  # 1 for real tokens, 0 for padding
                 'labels': encodings['labels']}                  # token ids of the reference summary

    return encodings

In [15]:
# "batch" is a slice of one samsum split, passed as a dict of lists (one list per column).
# With batched=True, "get_feature" receives many examples per call (1,000 by default)
# instead of a single example, which makes tokenization much faster.
# DatasetDict.map(function_name, batched=True) applies the function to every split (train, test, validation).
samsum_pt = samsum.map(get_feature, batched=True)
samsum_pt

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 818
    })
})
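
To make the `batched=True` contract concrete, here is a minimal sketch of a function that takes a dict of lists and returns a dict of lists, as `get_feature` does. The whitespace "tokenizer" (`toy_encode`) and its `vocab` are hypothetical and only illustrate the shapes involved; they are not the BART tokenizer.

```python
# Hypothetical whitespace "tokenizer": maps each distinct word to an integer id.
vocab = {}

def toy_encode(text):
    """Return one integer id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def toy_get_feature(batch):
    """Receive a dict of lists (one list per column) and return a dict of lists."""
    return {
        "input_ids": [toy_encode(dialogue) for dialogue in batch["dialogue"]],
        "labels": [toy_encode(summary) for summary in batch["summary"]],
    }

# A batch of two examples, laid out column by column.
batch = {
    "dialogue": ["Amanda: I baked cookies.", "Olivia: Me too!!"],
    "summary": ["Amanda baked cookies.", "Olivia agrees."],
}
features = toy_get_feature(batch)
print(features["input_ids"])
print(features["labels"])
```

Each returned list has one entry per example in the batch, which is what lets `map` append the new columns (`input_ids`, `attention_mask`, `labels`) next to the original ones.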

In [16]:
# Return the model columns of samsum_pt as PyTorch tensors:
# ("samsum_pt" stands for "samsum PyTorch"; the other columns are kept but hidden when indexing)
columns = ['input_ids', 'labels', 'attention_mask']
samsum_pt.set_format(type='torch', columns=columns)
samsum_pt

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 818
    })
})

In [17]:
# Create the data_collator
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
data_collator

DataCollatorForSeq2Seq(tokenizer=BartTokenizerFast(name_or_path='facebook/bart-large-cnn', vocab_size=50265, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}, model=BartForConditionalGeneration(
  (model): BartModel(
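
The tokenized examples have different lengths, so the collator pads each batch to its longest sequence at training time. The sketch below imitates that step in plain Python under two assumptions: inputs are padded with the `<pad>` id (1, as listed in the tokenizer output above) and labels are padded with -100, the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, which the loss ignores. The function names and token ids are hypothetical, and the real collator also prepares decoder inputs when given the model, which is omitted here.

```python
PAD_TOKEN_ID = 1      # id of "<pad>" in the BART tokenizer shown above
LABEL_PAD_ID = -100   # default label padding value; ignored by the loss

def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

def toy_collate(examples):
    """Build one padded batch from a list of tokenized examples."""
    inputs = [ex["input_ids"] for ex in examples]
    longest = max(len(seq) for seq in inputs)
    return {
        "input_ids": pad_to_longest(inputs, PAD_TOKEN_ID),
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [[1] * len(seq) + [0] * (longest - len(seq)) for seq in inputs],
        "labels": pad_to_longest([ex["labels"] for ex in examples], LABEL_PAD_ID),
    }

# Two examples with made-up token ids of different lengths.
examples = [
    {"input_ids": [0, 100, 200, 2], "labels": [0, 300, 2]},
    {"input_ids": [0, 100, 2], "labels": [0, 300, 400, 500, 2]},
]
padded = toy_collate(examples)
print(padded)
```

Padding per batch rather than to a fixed 1,024 tokens keeps batches of short dialogues small, which matters for memory and speed with a model of this size.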
   

### Fine-tune the model :

In [18]:
# Fine-tune the model with the samsum_pt dataset :
from transformers import TrainingArguments, Trainer

# Set the hyperparameters for training:
training_args = TrainingArguments(
    output_dir='bart_samsum',        # checkpoints and logs go here (the final model is saved separately)
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=500,
    save_steps=1_000_000,            # large value so that no intermediate checkpoint is written
    gradient_accumulation_steps=16   # accumulate gradients over 16 batches before each optimizer step
)

trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer, data_collator=data_collator,
                  train_dataset = samsum_pt['train'], eval_dataset = samsum_pt['validation'])


trainer.train()

TrainOutput(global_step=115, training_loss=1.5816870896712594, metrics={'train_runtime': 2167.7598, 'train_samples_per_second': 6.796, 'train_steps_per_second': 0.053, 'total_flos': 1.0927218624036864e+16, 'train_loss': 1.5816870896712594, 'epoch': 1.0})
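
The reported `global_step=115` can be reproduced with a little arithmetic, which also shows how the step-based settings relate to the length of the run. The sketch below assumes the run used two devices; that number is inferred from the step count, not stated in the notebook. The helper function is hypothetical, and its floor division is an assumption about how optimizer steps are counted.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size, num_devices, grad_accum_steps):
    """Number of optimizer updates in one epoch (sketch of the step arithmetic)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(batches_per_epoch // grad_accum_steps, 1)

num_devices = 2  # assumption: inferred from global_step=115, not stated in the notebook
steps = optimizer_steps_per_epoch(14732, 4, num_devices, 16)
effective_batch_size = 4 * num_devices * 16

print(steps)                  # optimizer updates in the single training epoch
print(effective_batch_size)   # examples contributing to each update
print(steps < 500)            # compare with eval_steps=500 and warmup_steps=500
```

Under these assumptions the whole run is shorter than both `eval_steps=500` and `warmup_steps=500`, so the periodic evaluation would never trigger and the learning rate would still be warming up when training ends. Lowering both values (or training for more epochs) would be worth trying.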

In [19]:
trainer

<transformers.trainer.Trainer at 0x7de55a3294b0>

In [20]:
# save the model :
trainer.save_model('bart_samsum_model_rom')

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


## Predict (summarize) on new data :

In [27]:
# Load the fine-tuned model from the "bart_samsum_model_rom" directory and apply it to a new dialogue:

pipe = pipeline('summarization', model='bart_samsum_model_rom')
gen_kwargs = {'length_penalty': 0.8, 'num_beams': 8, "max_length": 50}

custom_dialogue="""
Laxmi Kant: Do you do the homework ?
Juli: No, I don't. Could you give me your homework ?
Laxmi Kant: Yes I can, but you have to pay me a lunch.
Juli: Ok ! 
"""
print(pipe(custom_dialogue, **gen_kwargs))

Your min_length=56 must be inferior than your max_length=50.


[{'summary_text': "Juli doesn't do the homework. Laxmi Kant will give her the homework if she pays him a lunch.   Â. ÂÂÂ\x9d Â£1.50 for the lunch."}]
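
The warning about `min_length=56` and the trailing noise in the summary are likely related: the saved generation settings still carry `min_length=56` from the original checkpoint, while the call only overrides `max_length`. The sketch below reproduces that merge in plain Python. The `saved_config` values are copied from the "Non-default generation parameters" message printed when the model was saved; the helper names and the replacement value `min_length=10` are hypothetical.

```python
# Defaults stored with the model (subset of the message printed at save time).
saved_config = {"max_length": 142, "min_length": 56, "num_beams": 4, "length_penalty": 2.0}

# Overrides passed at call time in the cell above.
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 50}

def effective_generation_kwargs(saved, overrides):
    """Call-time values win; everything else falls back to the saved defaults."""
    return {**saved, **overrides}

def has_length_conflict(kwargs):
    """True when the minimum length exceeds the maximum length."""
    return kwargs["min_length"] > kwargs["max_length"]

effective = effective_generation_kwargs(saved_config, gen_kwargs)
print(effective["min_length"], effective["max_length"])
print(has_length_conflict(effective))

# Hypothetical fix: override min_length as well (10 is an arbitrary illustrative value).
fixed = effective_generation_kwargs(saved_config, {**gen_kwargs, "min_length": 10})
print(has_length_conflict(fixed))
```

In the notebook, this would amount to adding a `min_length` entry to `gen_kwargs` before calling `pipe(custom_dialogue, **gen_kwargs)`. Whether that removes the noisy tail of the summary has not been verified here.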


In [24]:
# Compress the "bart_samsum_model_rom" directory into the bart_samsum.zip archive:
!zip bart_samsum.zip -r bart_samsum_model_rom/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  adding: bart_samsum_model_rom/ (stored 0%)
  adding: bart_samsum_model_rom/config.json (deflated 61%)
  adding: bart_samsum_model_rom/merges.txt (deflated 53%)
  adding: bart_samsum_model_rom/tokenizer_config.json (deflated 76%)
  adding: bart_samsum_model_rom/special_tokens_map.json (deflated 52%)
  adding: bart_samsum_model_rom/model.safetensors (deflated 7%)
  adding: bart_samsum_model_rom/training_args.bin (deflated 51%)
  adding: bart_samsum_model_rom/tokenizer.json (deflated 72%)
  adding: bart_samsum_model_rom/generation_config.json (deflated 47%)
  adding: bart_samsum_model_rom/vocab.json (deflated 59%)
