# <center> Summarization using IndoT5 </center>

This notebook shows how to fine-tune a T5 model for summarization in Bahasa Indonesia. We use the [IndoSum](https://arxiv.org/abs/1810.05334) dataset, which consists of news articles and their summaries. The notebook assumes that you have already downloaded the data and placed it in a Google Drive folder, so you must authorize the notebook to access your Google Drive (don't worry, it is safe).

## Install Dependencies

In [1]:
!pip install sentencepiece==0.1.95
!pip install transformers==4.2.2
!pip install datasets==1.2.0
!pip install tqdm==4.48

Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)

## Mount Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries

In [4]:
import copy
import datasets
from datasets import load_dataset
import pickle
import transformers
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoConfig, T5ForConditionalGeneration
import datetime
import os
import numpy as np
import json
import matplotlib.pyplot as plt
from tqdm import tqdm

# If this cell raises a tqdm-related error, run it once more.

## Data 

### Read data

In [5]:
# The data has already been downloaded into this folder; change the path to match the location of the data in your own Drive
work_dir = "/content/drive/MyDrive/Summarization"
data_files = {"train": f'{work_dir}/Data/train.01.jsonl', "val": f'{work_dir}/Data/dev.01.jsonl', "test": f'{work_dir}/Data/test.01.jsonl'}

dataset = load_dataset('json', data_files=data_files)

train_dataset = dataset["train"]
valid_dataset = dataset["val"]
test_dataset = dataset["test"]


Using custom data configuration default


Downloading and preparing dataset json/default-e2034f4acd48f899 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-e2034f4acd48f899/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-e2034f4acd48f899/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514. Subsequent calls will reuse this data.


In [6]:
# check columns/features in the dataset
train_dataset.features

{'category': Value(dtype='string', id=None),
 'gold_labels': Sequence(feature=Sequence(feature=Value(dtype='bool', id=None), length=-1, id=None), length=-1, id=None),
 'id': Value(dtype='string', id=None),
 'paragraphs': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None),
 'source': Value(dtype='string', id=None),
 'source_url': Value(dtype='string', id=None),
 'summary': Sequence(feature=Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), length=-1, id=None)}

In [7]:
# Check the number of examples in each split
print('Number of train dataset: ', len(train_dataset))
print('Number of validation dataset: ', len(valid_dataset))
print('Number of test dataset: ', len(test_dataset))

Number of train dataset:  14262
Number of validation dataset:  750
Number of test dataset:  3762


In [8]:
# Let's take a look at the first two examples of the train dataset
train_dataset[:2]

{'category': ['tajuk utama', 'teknologi'],
 'gold_labels': [[[False, True],
   [True, True],
   [False, False, False],
   [False, False],
   [False, False],
   [False, False],
   [False, False],
   [False],
   [False, False]],
  [[False, False, False, False],
   [False, True, True],
   [False, False, True],
   [False, False, False, False],
   [False, False],
   [False, False, False],
   [False, False],
   [False, False],
   [False, False, False],
   [False, False, False],
   [False, False, False],
   [False, False, False, False],
   [False, False],
   [False, False]]],
 'id': ['1501893029-lula-kamal-dokter-ryan-thamrin-sakit-sejak-setahun',
  '1509072914-dua-smartphone-zenfone-baru-tawarkan-solusi-bersel'],
 'paragraphs': [[[['Jakarta',
     ',',
     'CNN',
     'Indonesia',
     '-',
     '-',
     'Dokter',
     'Ryan',
     'Thamrin',
     ',',
     'yang',
     'terkenal',
     'lewat',
     'acara',
     'Dokter',
     'Oz',
     'Indonesia',
     ',',
     'meninggal',
     'dun

In [9]:
# we use IndoT5-small from the model hub
tokenizer_checkpoint = "Wikidepia/IndoT5-small" 
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)  


In [10]:
# maximum number of tokens for the encoder (input article)
encoder_max_len = 512
# maximum number of tokens for the decoder (summary)
decoder_max_len = 170
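
One way to sanity-check these limits is to measure how many examples would be truncated. The sketch below is only an illustration: `truncation_rate` is a hypothetical helper and the token counts are made-up numbers. In the notebook the counts could be collected with something like `len(tokenizer(text)['input_ids'])` for each article or summary.

```python
def truncation_rate(lengths, max_len):
    """Return the fraction of sequences whose token count exceeds max_len."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_len) / len(lengths)


# Hypothetical token counts, for illustration only (not measured on IndoSum)
example_lengths = [120, 340, 512, 700, 95]

# One of the five sequences is longer than 512 tokens, so it would be truncated
print(truncation_rate(example_lengths, 512))
```

If the rate is high for the summaries, a larger `decoder_max_len` may be worth the extra memory; if it is near zero, a smaller value would reduce padding.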

### Preprocess data

In [11]:
# encode function to preprocess the data
def encode(example, encoder_max_len=encoder_max_len, decoder_max_len=decoder_max_len):
    
    # use deepcopy so the referenced data is not altered
    paragraphs = copy.deepcopy(example['paragraphs'])
    summary = copy.deepcopy(example['summary'])

    # the paragraphs and the summary are stored as nested lists of tokens, so we join each of them into a single string
    for i in range(len(paragraphs)):
        paragraphs[i] = " ".join([word for sent_lv1 in paragraphs[i] for sent_lv2 in sent_lv1 for word in sent_lv2])
        # prepend 'summarize: ' to every article, as the T5 documentation suggests; you can use a different prefix if you prefer
        paragraphs[i] = 'summarize: ' + paragraphs[i]
        summary[i] = " ".join([word for sent_lv1 in summary[i] for word in sent_lv1])
    
    encoder_inputs = tokenizer(paragraphs, truncation=True, max_length=encoder_max_len, padding='max_length')
    
    decoder_inputs = tokenizer(summary, truncation=True, max_length=decoder_max_len, padding='max_length')
    
    input_ids = encoder_inputs['input_ids']
    input_attention = encoder_inputs['attention_mask']
    target_ids = decoder_inputs['input_ids']
    target_attention = decoder_inputs['attention_mask']
    
    outputs = {'paragraphs_join': paragraphs, 'summary_join': summary, 
               'input_ids':input_ids, 'attention_mask': input_attention, 
               'labels':target_ids, 'decoder_attention_mask':target_attention}

    return outputs
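
To make the joining step in `encode` easier to follow, here is a small standalone sketch on toy data. The helper names `join_article` and `join_summary` are hypothetical and the tokens are invented; only the nesting (article → paragraphs → sentences → words, and summary → sentences → words) follows the dataset features shown earlier.

```python
def join_article(paragraphs):
    """Flatten article -> paragraphs -> sentences -> words into one string."""
    return " ".join(
        word
        for paragraph in paragraphs
        for sentence in paragraph
        for word in sentence
    )


def join_summary(sentences):
    """Flatten summary -> sentences -> words into one string."""
    return " ".join(word for sentence in sentences for word in sentence)


# Toy example with the same nesting as the 'paragraphs' and 'summary' columns
toy_article = [[["Jakarta", ",", "CNN"], ["Dokter", "Ryan"]], [["Halo", "."]]]
toy_summary = [["Dokter", "Ryan"], ["Halo", "."]]

print("summarize: " + join_article(toy_article))
print(join_summary(toy_summary))
```

Because the tokens are joined with single spaces, punctuation stays separated from words (for example `Jakarta , CNN`), which matches the joined text printed later in the notebook.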

In [12]:
columns_remove = list(train_dataset.features.keys())
# preprocess/map the dataset using the encode function 
train_ds = train_dataset.map(encode, batched = True, remove_columns = columns_remove)
# since we use pytorch model, we need to transform the relevant input into pytorch Tensor
train_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'decoder_attention_mask'], output_all_columns=True)

valid_ds = valid_dataset.map(encode, batched = True, remove_columns = columns_remove)
valid_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'decoder_attention_mask'], output_all_columns=True)


In [13]:
# check features on dataset
train_ds.features

{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'decoder_attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'paragraphs_join': Value(dtype='string', id=None),
 'summary_join': Value(dtype='string', id=None)}
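
A note on the `labels` column: `encode` pads the summaries to `decoder_max_len`, so the labels contain padding token ids. As far as I understand, `T5ForConditionalGeneration` computes its loss with a cross-entropy that ignores positions labelled `-100`, so padded positions are counted in the loss unless they are replaced. The sketch below shows one possible way to do that on plain Python lists. `mask_padding` is a hypothetical helper, and the pad id `0` in the example is an assumption; in the notebook it would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(labels, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in labels
    ]


# Toy label ids, assuming the pad token id is 0 (an assumption for illustration)
toy_labels = [[5, 6, 0, 0], [7, 0, 0, 0]]
print(mask_padding(toy_labels, pad_token_id=0))
```

Applying this inside `encode` would change the loss values, so the numbers in the training log of this notebook would no longer be directly comparable.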

In [14]:
# Wrap the data in a DataLoader, since we train the model by feeding it the data batch by batch
# it is not possible to feed all the data at once during training because GPU memory is limited

batch_size = 4 # by trial and error, this was the largest batch size that fit in GPU memory

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=False)
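
As a quick check on what to expect from the progress bars, the number of batches per epoch is the dataset size divided by the batch size, rounded up (the `DataLoader` keeps the last, smaller batch by default). The sketch below uses the split sizes printed earlier; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Split sizes reported earlier in this notebook, with batch_size = 4
print(num_batches(14262, 4))  # train: 3566
print(num_batches(750, 4))    # validation: 188
print(num_batches(3762, 4))   # test: 941
```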

## Training phase

### Model preparation

In [15]:
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

In [16]:
device = get_default_device()

In [17]:
model_checkpoint = tokenizer_checkpoint
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)


In [18]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model = model.to(device)

In [19]:
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

In [20]:
def fit(num_epochs, model, train_loader, valid_loader, opt):
    
    min_val_loss = float("inf")  # any first validation loss counts as an improvement
    for epoch in range(num_epochs):        
        model.train()
        train_loss = 0
        train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))        
        for i, batch_data in enumerate(train_pbar):
            input_ids, attention_mask, labels = batch_data["input_ids"], batch_data["attention_mask"], batch_data["labels"]
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)
            opt.zero_grad()
            output = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
            loss = output.loss
            train_loss += loss.item()
            loss.backward()
            opt.step()
            train_loss_avg = train_loss/(i+1)
            train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1), train_loss_avg, get_lr(opt)))

        model.eval()
        pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
        with torch.no_grad():
            val_loss = 0
            for i, data in enumerate(pbar):
                input_ids, attention_mask, labels = data["input_ids"], data["attention_mask"], data["labels"]
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                labels = labels.to(device)
                output = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
                loss = output.loss
                val_loss += loss.item()
                val_loss_avg = val_loss/(i+1)
                pbar.set_description("(Epoch {}) VALID LOSS:{:.4f}".format((epoch+1), val_loss_avg))

            # save the model whenever the average validation loss improves
            if val_loss_avg < min_val_loss:
                min_val_loss = val_loss_avg
                model.save_pretrained("Results/best_model_summarization/")    
                 

### Train model

In [21]:
# number of training epochs
num_epochs = 3

In [22]:
fit(num_epochs, model, train_dl, valid_dl, optimizer)

(Epoch 1) TRAIN LOSS:7.6909 LR:0.00001000: 100%|██████████| 3566/3566 [29:50<00:00,  1.99it/s]
(Epoch 1) VALID LOSS:0.0181: 100%|██████████| 188/188 [23:29<00:00,  7.50s/it]
(Epoch 2) TRAIN LOSS:0.3749 LR:0.00001000: 100%|██████████| 3566/3566 [29:47<00:00,  2.00it/s]
(Epoch 2) VALID LOSS:0.0156: 100%|██████████| 188/188 [23:29<00:00,  7.50s/it]
(Epoch 3) TRAIN LOSS:0.3201 LR:0.00001000: 100%|██████████| 3566/3566 [29:45<00:00,  2.00it/s]
(Epoch 3) VALID LOSS:0.0148: 100%|██████████| 188/188 [23:29<00:00,  7.50s/it]


## Testing phase

### Using the Model

In [27]:
best_model_checkpoint = "Results/best_model_summarization/"
best_model = T5ForConditionalGeneration.from_pretrained(best_model_checkpoint)
best_model = best_model.to(device)

In [30]:
# this function prints the article, its gold-standard summary, and the summary generated by the model, for comparison
def print_generated(sentence_text, summary_text, generated):
    
    b1 = "\033[1m"
    b2 = "\033[0m"
    for i in range(len(generated)):
        print(b1 + f"Full TEXT[{i}]: " + b2)
        print(sentence_text[i])
        print(b1 + f"Gold SUMMARY[{i}]: " + b2)
        print(summary_text[i])
        print(b1 + f"Generated SUMMARY[{i}]:" + b2)
        print(tokenizer.decode(generated[i], skip_special_tokens=True, clean_up_tokenization_spaces=True))
        print("\n")

In [31]:
with torch.no_grad():
    data = next(iter(valid_dl))
    sentence_text, summary_text, input_ids, attention_mask = data['paragraphs_join'], data['summary_join'], data["input_ids"], data["attention_mask"]
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    generated = best_model.generate(input_ids=input_ids, 
                               attention_mask=attention_mask, 
                               max_length=170, 
                               min_length=40, 
                               length_penalty=2.0, 
                               num_beams=4, 
                               early_stopping=True)
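
To give some intuition for `length_penalty=2.0`: as far as I understand the beam search in `transformers`, a finished hypothesis is ranked by its summed log-probability divided by its length raised to `length_penalty`. The sketch below reproduces only that scoring formula with made-up numbers; `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalised score of a finished beam hypothesis."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses: a short one and a longer, less probable one
short_hyp = (-6.0, 10)   # (summed log-probability, number of tokens)
long_hyp = (-12.0, 20)

for penalty in (0.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score}, long={long_score}, winner={winner}")
```

Because log-probabilities are negative, a larger `length_penalty` shrinks the magnitude of the score more for long sequences, which favours longer summaries.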

In [32]:
print_generated(sentence_text, summary_text, generated)

Full TEXT[0]:
summarize: Ketua MPR Zulkifli Hasan menyesalkan kisruh yang terjadi antara pelaku sarana transportasi online dan tradisional . Zulkifli menyarankan adanya pertemuan bersama antara pemerintah , pelaku transportasi online dan transportasi tradisional demi meredam kisruh yang masih belum terselesaikan . Zulkifli menilai aturan yang dikeluarkan pemerintah seharusnya tidak hanya membahas tarif tapi juga mekanisme yang dapat menguntungkan semua pihak , baik pelaku transportasi online maupun tradisional . " Tidak hanya tarif tapi apa saja harus diatur . Dipanggil keduanya untuk berbicara masing-masing , musyawarah , duduk bareng kemudian dibuat aturan yang saling menguntungkan . Kan bisa saling melengkapi , negara lain bisa masa kita enggak bisa , " ucap Zulkifli di Gedung DPR , Senayan , Jakarta Pusat , Senin ( 27 / 3 ) . Baca juga : Setya Novanto : Jangan Sampai Kisruh Taksi dan Ojek Online Jadi Besar Ketua Umum PAN menambahkan bahwa hal ini harus diatur karena menyan

## Evaluation using test data

In [33]:
# Let's evaluate the model on the test dataset. To save time, we evaluate only 20 test examples. For a more reliable score, evaluate on more samples.
test_ds = test_dataset.map(encode, batched = True, remove_columns = columns_remove)
test_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'decoder_attention_mask'], output_all_columns=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=False)


In [34]:
def pack_sentence_summary_generated(sentence_text, summary_text, generated):
    
    output = []
    for i in range(len(generated)):
        element = {}
        element['sentence'] = sentence_text[i]
        element['summary'] = summary_text[i]
        element['generated_summary'] = tokenizer.decode(generated[i], skip_special_tokens=True, clean_up_tokenization_spaces=True)
        output.append(element)

    return output

In [35]:
def predict_summary(model, data_loader):
    
    output = []
    with torch.no_grad():
        process_pbar = tqdm(data_loader, leave=True, total=len(data_loader))
        for i, batch_data in enumerate(process_pbar):

            # evaluating all the data takes a long time, so we only evaluate the first 5 batches; to evaluate everything, comment out the 2 lines below
            if i > 4:
                break

            sentence_text, summary_text, input_ids, attention_mask = batch_data['paragraphs_join'], batch_data["summary_join"], batch_data["input_ids"], batch_data["attention_mask"]
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            generated = model.generate(input_ids=input_ids, 
                                       attention_mask=attention_mask, 
                                       max_length=170, 
                                       min_length=40, 
                                       length_penalty=2.0, 
                                       num_beams=4, 
                                       early_stopping=True)
            
            # pack the article, the gold-standard summary, and the model-generated summary together for ease of use later
            sub_output = pack_sentence_summary_generated(sentence_text, summary_text, generated)
            output += sub_output
            process_pbar.set_description("Progress")

            
    return output

In [36]:
test_result = predict_summary(best_model, test_dl)

Progress:   1%|          | 5/941 [00:16<51:46,  3.32s/it]


In [37]:
len(test_result)

20

In [38]:
# save the model-generated summaries and the gold-standard ('true') summaries to txt files for evaluation
gold_summary_filepath = 'gold_sum.txt'

with open(gold_summary_filepath, 'w') as f1:
    for item in test_result:
        f1.write(item['summary'].replace('\r\n','').replace('\n','') + " \n")

gen_summary_filepath = 'pred_sum.txt'

with open(gen_summary_filepath, 'w') as f2:
    for item in test_result:
        f2.write(item['generated_summary'] + " \n")

#### Evaluation using rouge score

In [39]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [40]:
from rouge import FilesRouge

In [41]:
files_rouge = FilesRouge()

scores = files_rouge.get_scores('pred_sum.txt', 'gold_sum.txt', avg=True)

In [42]:
print(scores)

{'rouge-1': {'r': 0.6706837745890322, 'p': 0.65288830289498, 'f': 0.6584100939507476}, 'rouge-2': {'r': 0.5252359433010357, 'p': 0.5290477992634497, 'f': 0.5239697400077534}, 'rouge-l': {'r': 0.6656713102015679, 'p': 0.6483439845374174, 'f': 0.6536568852898951}}
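
To make the `r`, `p`, and `f` fields easier to read, here is a simplified sketch of ROUGE-1 based on clipped unigram counts over whitespace tokens. `rouge_1` is a hypothetical helper written for illustration; the `rouge` package may count n-grams differently, so this is not expected to reproduce the numbers above exactly.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Each word counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f_score}


# Toy example: every candidate word appears in the reference,
# but the reference contains one extra word
print(rouge_1("the cat sat", "the cat sat down"))
```

Recall measures how much of the gold summary is covered by the generated summary, precision measures how much of the generated summary appears in the gold summary, and `f` is their harmonic mean.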


For benchmark results to compare against, see [this paper](https://arxiv.org/abs/2011.00677).



## Download Model

In [None]:
!zip -r ./Results.zip ./Results

In [None]:
from google.colab import files

files.download("./Results.zip")

**_author: Hadi Muhshi_** <br />
**_email: hadi.muhshi@gmail.com_**