# T5 fine-tuning in a Text Summarization task

There are two main approaches to automatic text summarization, extractive and abstractive:
- _Extractive Summarization_: the extractive approach selects the most important phrases and sentences from the document and combines them to create the summary. Every sentence and word of the summary therefore comes from the original document.
- _Abstractive Summarization_: the abstractive approach uses new phrases and terms that differ from the original document while keeping the meaning the same, much as a human would when summarizing. This makes it much harder than the extractive approach.
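
To make the extractive idea concrete, here is a minimal sketch of a frequency-based sentence selector. It is only an illustration and not the method used to build any of the summaries in this notebook; the function name `extractive_summary`, the scoring rule, and the sample text are all assumptions.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: keep the highest-scoring sentences verbatim."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Pick the top-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]

# Hypothetical input text, for illustration only.
document = (
    "Goal setting works best with one clear goal. "
    "Many people set too many goals in January. "
    "A missed goal can simply be carried over to the next year."
)
summary = extractive_summary(document, num_sentences=2)
print(summary)
```

Whatever the scoring rule, every returned sentence is copied unchanged from the input, which is the defining property of the extractive approach.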

**Abstractive text summarization** is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have proven to be a powerful framework for general-purpose language learning.
The Text-To-Text Transfer Transformer (T5) model, pre-trained on the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia, achieves state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important downstream tasks.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import os
import regex as re
import json
from tqdm import tqdm

In [2]:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

In [3]:
dataset_path = os.path.join(os.path.abspath(""), 'podcasts-no-audio-13GB')

In [4]:
metadata_path_train = os.path.join(dataset_path, 'metadata.tsv')
metadata_train = pd.read_csv(metadata_path_train, sep='\t')
print("Columns: ", metadata_train.columns)
print("Shape: ", metadata_train.shape)

Columns:  Index(['show_uri', 'show_name', 'show_description', 'publisher', 'language',
       'rss_link', 'episode_uri', 'episode_name', 'episode_description',
       'duration', 'show_filename_prefix', 'episode_filename_prefix'],
      dtype='object')
Shape:  (105360, 12)


In [5]:
def get_path(episode):
    # extract the two reference characters (a digit and a word character) that name the transcript subfolders
    show_filename = episode['show_filename_prefix']
    episode_filename = episode['episode_filename_prefix'] + ".json"
    dir_1, dir_2 = re.match(r'show_(\d)(\w).*', show_filename).groups()

    # build the path to the transcript file from the derived subfolders
    transcript_path = os.path.join(dataset_path, "spotify-podcasts-2020",
                                   "podcasts-transcripts", dir_1, dir_2,
                                   show_filename, episode_filename)

    return transcript_path
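
As a quick check of how the regular expression above maps a show prefix to its two subfolders, the sketch below applies the same pattern to a made-up prefix. The helper name `transcript_subdirs`, the prefix `show_0aBcDeFgH`, and the file name `episode_xyz.json` are hypothetical and do not come from the dataset.

```python
import os
import re

def transcript_subdirs(show_filename_prefix):
    """Return the two directory levels derived from a show filename prefix."""
    match = re.match(r"show_(\d)(\w).*", show_filename_prefix)
    if match is None:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError(f"Unexpected show prefix: {show_filename_prefix!r}")
    return match.groups()

# Hypothetical prefix and file name, used only for illustration.
dir_1, dir_2 = transcript_subdirs("show_0aBcDeFgH")
relative_path = os.path.join(dir_1, dir_2, "show_0aBcDeFgH", "episode_xyz.json")
print(dir_1, dir_2)
print(relative_path)
```

Raising an explicit error for prefixes that do not match can make malformed metadata rows easier to spot than a failure inside `re.match(...).groups()`.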

In [None]:
# check if the transcript files exist
for i in range(len(metadata_train)):
    assert os.path.exists(get_path(metadata_train.iloc[i]))

print("All files exist")

In [6]:
def get_transcription(episode):
    with open(get_path(episode), 'r') as f:
        episode_json = json.load(f)
        # the last result in each transcript seems to be a repetition of the first one, so we ignore it
        transcripts = [
            result["alternatives"][0]['transcript'] if 'transcript' in result["alternatives"][0] else ""
            for result in episode_json["results"][:-1]
        ]
        return " ".join(transcripts)
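
The sketch below runs the same joining logic on a small in-memory structure, so the effect of the `[:-1]` slice and of results without a `transcript` key is visible. The JSON content is invented for illustration and is far shorter than a real transcript file; `join_transcripts` is a hypothetical name.

```python
import json

# Hypothetical, heavily shortened transcript file content.
raw = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "Hi, welcome to the show.", "confidence": 0.9}]},
        {"alternatives": [{}]},  # a result that carries no transcript
        {"alternatives": [{"transcript": "Today we talk about goals."}]},
        {"alternatives": [{"words": []}]},  # trailing entry, dropped by the [:-1] slice
    ]
})

def join_transcripts(episode_json):
    # Same logic as get_transcription, written with dict.get for brevity.
    parts = [
        result["alternatives"][0].get("transcript", "")
        for result in episode_json["results"][:-1]
    ]
    return " ".join(parts)

text = join_transcripts(json.loads(raw))
print(repr(text))
```

Results without a transcript contribute an empty string, so the joined text contains a double space at that position.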

### Build gold dataset

In [7]:
metadata_path_gold = os.path.join(dataset_path, '150gold.tsv')
metadata_gold = pd.read_csv(metadata_path_gold, sep='\t')
metadata_gold = pd.merge(metadata_gold, metadata_train, left_on='episode id', right_on='episode_uri')

print("Columns: ", metadata_gold.columns)
print("Shape: ", metadata_gold.shape)

Columns:  Index(['show name', 'episode name', 'episode id', 'creator description',
       'EGFB', 'lexrank summary', 'EGFB.1', 'textrank summary', 'EGFB.2',
       'lsa summary', 'EGFB.3', 'quasi-supervised summary', 'EGFB.4',
       'supervised summary', 'EGFB.5', 'show_uri', 'show_name',
       'show_description', 'publisher', 'language', 'rss_link', 'episode_uri',
       'episode_name', 'episode_description', 'duration',
       'show_filename_prefix', 'episode_filename_prefix'],
      dtype='object')
Shape:  (150, 27)


In [8]:
quality = {
    'B': 1,
    'F': 2,
    'G': 3,
    'E': 4
}

# convert egfb columns to a quality score
egfb_columns = ['EGFB', 'EGFB.1', 'EGFB.2', 'EGFB.3', 'EGFB.4', 'EGFB.5']
egfb_to_quality = metadata_gold[egfb_columns].applymap(lambda x: quality[x])

# keep only the rows where at least one summary is rated above 'B' (quality > 1)
egfb_to_quality = egfb_to_quality[[any(row > 1) for row in egfb_to_quality.values]] 

# select the best summary for each episode
best_egfb = egfb_to_quality.apply(lambda x: x.idxmax(), axis=1)
# each summary column sits immediately to the left of its EGFB rating column
best_summary = [metadata_gold.iloc[i, np.argwhere(metadata_gold.columns == egfb)[0][0] - 1] for i, egfb in best_egfb.items()]

metadata_gold = metadata_gold.loc[best_egfb.index]
metadata_gold['best_summary'] = best_summary
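
The selection rule above can be sketched without pandas. The version below assumes the same grade-to-score mapping and, like `idxmax`, keeps the first candidate when several share the top grade; `pick_best` and the sample ratings are hypothetical.

```python
QUALITY = {"B": 1, "F": 2, "G": 3, "E": 4}

def pick_best(ratings):
    """ratings: list of (summary_name, grade) pairs in column order.

    Returns the name of the best-rated summary, or None when no grade is
    above 'B' (such episodes are dropped in the cell above).
    """
    scores = [QUALITY[grade] for _, grade in ratings]
    if max(scores) <= 1:
        return None
    # max() returns the first index that reaches the top score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return ratings[best][0]

# Hypothetical ratings for one episode.
episode_ratings = [
    ("creator description", "F"),
    ("lexrank summary", "G"),
    ("textrank summary", "G"),
]
print(pick_best(episode_ratings))
```

Because ties go to the earliest column, the column order of the gold file influences which summary is kept.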

In [9]:
# add transcripts
metadata_gold['transcript'] = metadata_gold.apply(get_transcription, axis=1)

In [10]:
train_data = metadata_gold[['episode id', 'transcript', 'best_summary']]
train_data

| | episode id | transcript | best_summary |
|---|---|---|---|
| 0 | spotify:episode:4KRC1TZ28FavN3J5zLHEtQ | What's up fellas? So I got a patron supported... | All right guys now as y'all guys might know so... |
| 1 | spotify:episode:4tdDQcsBOUVWnA9XrpgTzS | If you are bored you are boring. One of my ki... | It was the first and last time I ever said tha... |
| 2 | spotify:episode:626YAxomH0HZ6nCW9NLlGY | Visit Larisa English club.com English everyday... | Prepositions of movement review two is the sec... |
| 3 | spotify:episode:6AUFl7KQWN6pzGFEIEKFQu | So so and salutations Summers and welcome to t... | My passion for The Sims 4 Grew From consuming ... |
| 5 | spotify:episode:6IDbemwG5t6XMlctbqcna7 | Hi everyone. This is Justin from a liquidy pla... | This week on Nothing But A Bob Thang, Nathan a... |
| ... | ... | ... | ... |
| 145 | spotify:episode:2zr8iztbD8xSbuWO60tfHg | Well, everybody's clear here with a word from ... | It was a significant weekend in the NWSL, with... |
| 146 | spotify:episode:2SfUG4VtJFkyiuNgHALlsC | All right. Now - just one second now I'm confu... | During the first ever Frank and Eric Movie Par... |
| 147 | spotify:episode:2c2WPjRpoCSxtnAw0WsoqG | What is up? Everyone? Alright, so I've been pr... | LFT Radio - Lifelong Fitness and Training In t... |
| 148 | spotify:episode:6oZYPfBhCdpTSamM9Uj0v9 | Oh, so you have to do a lot eight game. It was... | On today's show, we sit down with LSU freshman... |


### Preprocessing
Translating text to numbers is known as encoding. Encoding is a two-step process: tokenization, followed by conversion of the tokens to input IDs using the same vocabulary that was used when the model was pre-trained.
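
The two steps can be illustrated with a toy whitespace tokenizer. This is a deliberate simplification: the real T5 tokenizer splits text into SentencePiece subwords and has a much larger vocabulary. The vocabulary below is invented, apart from the IDs 0, 1 and 2, which follow T5's convention for the pad, end-of-sequence and unknown tokens.

```python
# Toy vocabulary; a real T5 checkpoint ships its own SentencePiece vocabulary.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "summarize:": 3, "goal": 4, "setting": 5}

def tokenize(text):
    # Step 1: split the text into tokens (here simply on whitespace).
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    # Step 2: look up each token in the vocabulary and append end-of-sequence.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens] + [vocab["</s>"]]

tokens = tokenize("summarize: Goal setting matters")
input_ids = convert_tokens_to_ids(tokens)
print(tokens)
print(input_ids)
```

Tokens missing from the vocabulary map to the unknown ID, which is why encoding must use the vocabulary the model was pre-trained with.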

In [13]:
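# NOTE: len() counts characters here, while the values are later passed to the tokenizer as token limits.
# The limits are the 1% quantile of transcript lengths and the 10% quantile of summary lengths.
max_input_length = int(np.quantile(train_data['transcript'].apply(len), 0.01))
max_target_length = int(np.quantile(train_data['best_summary'].apply(len), 0.1))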
print("Max input length: ", max_input_length)
print("Max target length: ", max_target_length)

Max input length:  638
Max target length:  188


The `AutoTokenizer` class selects the appropriate tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint.

In [14]:
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_data)

In [15]:
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [16]:
def preprocess_function(dataset, text_column, summary_column, max_input_length, max_target_length, padding, prefix="summarize: "):
    inputs = dataset[text_column]
    targets = dataset[summary_column]
    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=padding, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
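
The label handling in `preprocess_function` can be pictured with plain lists. The sketch below assumes a pad token ID of 0 (the value T5 uses) and cuts over-long sequences naively, whereas the real tokenizer handles truncation itself; `pad_and_mask` and the sample IDs are hypothetical.

```python
PAD_TOKEN_ID = 0     # T5's pad token id
IGNORE_INDEX = -100  # positions with this label are skipped by the masked loss

def pad_and_mask(label_ids, max_length):
    # Cut to max_length, pad on the right, then hide the padding from the loss.
    truncated = label_ids[:max_length]
    padded = truncated + [PAD_TOKEN_ID] * (max_length - len(truncated))
    return [tok if tok != PAD_TOKEN_ID else IGNORE_INDEX for tok in padded]

labels = pad_and_mask([71, 1773, 1], max_length=6)
print(labels)
```

Replacing the padding with -100 lets the loss function tell real target tokens apart from filler positions.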

In [17]:
padding = "max_length"
train_dataset = train_dataset.map(
                lambda x: preprocess_function(x, "transcript", "best_summary", max_input_length, max_target_length, padding, prefix="summarize: "),
                batched=True,
                remove_columns=train_dataset.column_names,
                desc="Running tokenizer on train dataset"
            )

Running tokenizer on train dataset: 100%|██████████| 1/1 [00:00<00:00,  1.29ba/s]


In [18]:
from functools import partial

def sample_generator(dataset, model, tokenizer, shuffle, pad_to_multiple_of=None):
    if shuffle:
        sample_ordering = np.random.permutation(len(dataset))
    else:
        sample_ordering = np.arange(len(dataset))
    for sample_idx in sample_ordering:
        example = dataset[int(sample_idx)]
        # Handle dicts with proper padding and conversion to tensor.
        example = tokenizer.pad(example, return_tensors="np", pad_to_multiple_of=pad_to_multiple_of)
        example = {key: tf.convert_to_tensor(arr, dtype_hint=tf.int32) for key, arr in example.items()}
        if model is not None and hasattr(model, "prepare_decoder_input_ids_from_labels"):
            decoder_input_ids = model.prepare_decoder_input_ids_from_labels(
                labels=tf.expand_dims(example["labels"], 0)
            )
            example["decoder_input_ids"] = tf.squeeze(decoder_input_ids, 0)
        yield example, example["labels"]  # TF needs some kind of labels, even if we don't use them
    return

# region Helper functions
def dataset_to_tf(dataset, model, tokenizer, total_batch_size, num_epochs, shuffle):
    if dataset is None:
        return None
    train_generator = partial(sample_generator, dataset, model, tokenizer, shuffle=shuffle)
    train_signature = {
        feature: tf.TensorSpec(shape=(None,), dtype=tf.int32)
        for feature in dataset.features
        if feature != "special_tokens_mask"
    }
    if (
        model is not None
        and "decoder_input_ids" not in train_signature
        and hasattr(model, "prepare_decoder_input_ids_from_labels")
    ):
        train_signature["decoder_input_ids"] = train_signature["labels"]
    # This may need to be changed depending on your particular model or tokenizer!
    padding_values = {
        key: tf.convert_to_tensor(tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0, dtype=tf.int32)
        for key in train_signature.keys()
    }
    padding_values["labels"] = tf.convert_to_tensor(-100, dtype=tf.int32)
    train_signature["labels"] = train_signature["input_ids"]
    train_signature = (train_signature, train_signature["labels"])
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
    tf_dataset = (
        tf.data.Dataset.from_generator(train_generator, output_signature=train_signature)
        .with_options(options)
        .padded_batch(
            batch_size=total_batch_size,
            drop_remainder=True,
            padding_values=(padding_values, np.array(-100, dtype=np.int32)),
        )
        .repeat(int(num_epochs))
    )
    return tf_dataset
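
`sample_generator` relies on `prepare_decoder_input_ids_from_labels` to build the decoder inputs. For T5 this is understood to be a right shift of the labels. The sketch below reproduces that idea on a plain list, assuming a decoder start token ID and pad token ID of 0; `shift_right` is a hypothetical helper, not the library function.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # Prepend the start token and drop the last label, so the decoder sees
    # token t-1 when it has to predict token t.
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # -100 is only meaningful to the loss; the decoder needs a real token id.
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [71, 1773, 1, -100, -100]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

This is why `decoder_input_ids` shares the shape of `labels` in the dataset signature above.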

### Training

In [19]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model.resize_token_embeddings(len(tokenizer))

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x211064559a0>

In [20]:
total_train_batch_size = 2
num_train_epochs = 3
learning_rate = 5e-5
tf_train_dataset = dataset_to_tf(
            train_dataset,
            model,
            tokenizer,
            total_batch_size=total_train_batch_size,
            num_epochs=num_train_epochs,
            shuffle=True,
        )

In [21]:
from transformers import create_optimizer
# region Optimizer, loss and LR scheduling
# Scheduler and math around the number of training steps.
num_update_steps_per_epoch = len(train_dataset) // total_train_batch_size
num_train_steps = num_train_epochs * num_update_steps_per_epoch
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate, num_train_steps=num_train_steps, num_warmup_steps=0
)

def masked_sparse_categorical_crossentropy(y_true, y_pred):
    # We clip the negative labels to 0 to avoid NaNs appearing in the output and
    # fouling up everything that comes afterwards. The loss values corresponding to clipped values
    # will be masked later anyway, but even masked NaNs seem to cause overflows for some reason.
    # 1e6 is chosen as a reasonable upper bound for the number of token indices - in the unlikely
    # event that you have more than 1 million tokens in your vocabulary, consider increasing this value.
    # More pragmatically, consider redesigning your tokenizer.
    losses = tf.keras.losses.sparse_categorical_crossentropy(
        tf.clip_by_value(y_true, 0, int(1e6)), y_pred, from_logits=True
    )
    # Compute the per-sample loss only over the unmasked tokens
    losses = tf.ragged.boolean_mask(losses, y_true != -100)
    losses = tf.reduce_mean(losses, axis=-1)
    return losses
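
To show what the masked loss computes for a single sequence, here is a pure-Python sketch that averages the token-level cross-entropy over positions whose label is not -100. It is an illustration under simplified assumptions (one sequence, lists instead of tensors), not a replacement for the TensorFlow function above; `masked_cross_entropy` and the sample values are hypothetical.

```python
import math

def masked_cross_entropy(labels, logits, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for label, row in zip(labels, logits):
        if label == ignore_index:
            continue  # padded position, excluded from the average
        # log-sum-exp with max subtraction for numerical stability
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)

# Two real tokens with uniform logits over a 2-word vocabulary, one padded position.
labels = [1, 0, -100]
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
loss = masked_cross_entropy(labels, logits)
print(loss)
```

The padded position has very confident logits, yet it does not influence the result. Only the two unmasked tokens are averaged.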

In [22]:
from datasets import load_metric
# region Metric
metric = load_metric("rouge")
# endregion

# region Training
model.compile(loss={"logits": masked_sparse_categorical_crossentropy}, optimizer=optimizer)
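
The `rouge` metric is loaded above, but no scores are computed in this section. As a rough picture of what ROUGE-1 measures, the sketch below computes a unigram-overlap F1 score in plain Python. It is a simplified stand-in (whitespace tokens, no stemming) and its numbers will not necessarily match those of the loaded metric; `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped unigram matches: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "she talks about goal setting",
    "today she talks about goal setting for 2020",
)
print(round(score, 4))
```

Here all five predicted words appear in the eight-word reference, so precision is 1.0 and recall is 0.625.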

In [24]:
model.fit(
                tf_train_dataset,
                epochs=int(num_train_epochs),
                steps_per_epoch=num_update_steps_per_epoch,
            )

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x21108e13490>

### Evaluation

In [46]:
transcript_exaple = train_data.iloc[46].transcript
transcript_exaple

"Hi, my name is Kate's cooker. And this is your everyday positivity here on Vale. So goal setting I need to talk to you about goal setting there are a few tricks around goal-setting. Sometimes it's about making sure that there's one clear goal. And at this time of the year you can end up with having loads of goals. There's another trick around goal setting and that is this if you had a goal in 2019 that you look at back at your kind of December January time, and he's written down and go. Oh, I didn't do that goal. I'm here to tell you. That it's okay. So I had a goal. I wanted to play the piano by the time I was 40 and for one reason or another that goal has been put by the wayside mainly because I wanted to be able to send you stuff but Facebook kept on telling me I couldn't because it had like a real song in it. So I've kept that goal and I'm going to stick to that goal and I'm going to make sure that I work to that goal in 2020.  It's a shame. I didn't make it to the my 40th birthda

In [47]:
# best-rated reference summary for the same episode
train_data.iloc[46].best_summary

'Kate Cocker is the host of the everyday positivity podcast. Today she talks about goal setting and how to set goals for 2020. She will also be doing a Facebook live about setting goals for the next decade. If you enjoy the podcast, please leave a review at Apple podcasts or wherever it is that you get your podcasts.'

In [48]:
example = np.reshape(tokenizer(transcript_exaple).input_ids, (1,-1))

Token indices sequence length is longer than the specified maximum sequence length for this model (695 > 512). Running this sequence through the model will result in indexing errors


In [49]:
output = model.generate(example)

In [50]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

that goal. It's about making sure that you look at it and go right?
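
The example above is tokenized without the `summarize: ` prefix used during training and without truncation, which is what triggers the sequence-length warning. One possible variant is sketched below. It applies the prefix, truncates the input, and passes the attention mask to `generate`. The helper name `summarize` and the default lengths (512 and 64) are assumptions, and the sketch makes no claim about the quality of the resulting summary.

```python
def summarize(model, tokenizer, text, prefix="summarize: ",
              max_input_length=512, max_summary_length=64):
    # Same prefix as in preprocess_function, plus truncation so the encoder
    # never receives more than max_input_length tokens.
    inputs = tokenizer(
        prefix + text,
        max_length=max_input_length,
        truncation=True,
        return_tensors="tf",
    )
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_length,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Intended usage with the objects defined earlier in this notebook:
# summarize(model, tokenizer, transcript_exaple,
#           max_input_length=max_input_length,
#           max_summary_length=max_target_length)
```

Reusing `max_input_length` and `max_target_length` keeps inference consistent with the lengths the model saw during fine-tuning.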
