**Abstractive Text Summarization**

Model: The updated code utilizes the T5 model for abstractive text summarization. T5 (Text-To-Text Transfer Transformer) is a Transformer-based model developed by Google Research. It has been pre-trained on a large corpus of text and fine-tuned on various downstream tasks, including text summarization.

Library: The code utilizes the Hugging Face transformers library, which is a powerful library for working with pre-trained Transformer models. It provides an easy-to-use interface for loading pre-trained models, tokenization, and text generation.

Approach: The approach used in the code is a pipeline-based approach provided by the transformers library. The pipeline class simplifies the process of using pre-trained models for various natural language processing tasks, including text summarization. Specifically, the pipeline class provides a convenient API for abstractive text summarization using the T5 model.

The code initializes the T5 model and tokenizer using the T5ForConditionalGeneration and T5Tokenizer classes, respectively, from the transformers library. It then uses the pipeline class with the "summarization" task to perform text summarization. The pipeline internally handles the tokenization, model loading, and text generation process, making it straightforward to generate summaries.

By leveraging the T5 model, the transformers library, and the pipeline-based approach, the code achieves abstractive text summarization in a concise and efficient manner.

In [3]:
# REF
# https://github.com/rojagtap/abstractive_summarizer/tree/master

In [4]:
# Modifications Copyright (C) 2020 Rohan Jagtap

In [5]:
import pandas as pd
import numpy as np
import tensorflow as tf
import time
import re
import pickle

In [6]:
# ## TO FIX BadZipFile: File is not a zip . source: https://stackoverflow.com/questions/3083235/unzipping-file-results-in-badzipfile-file-is-not-a-zip-file

# def fixBadZipfile(zipFile):
#     with open(zipFile, 'r+b') as f:
#         data = f.read()
#         pos = data.find(b'\x50\x4b\x05\x06')  # End of central directory signature
#         if pos > 0:
#             print("Truncating file at location", pos + 22)
#             f.seek(pos + 22)  # size of 'ZIP end of central directory record'
#             f.truncate()
#         else:
#             raise Exception("File is truncated.")


### Loading Data

In [7]:
news = pd.read_excel("news.xlsx")

In [8]:
news.drop(['Source ', 'Time ', 'Publish Date'], axis=1, inplace=True)

In [9]:
news.head()

Unnamed: 0,Headline,Short
0,4 ex-bank officials booked for cheating bank o...,The CBI on Saturday booked four former officia...
1,Supreme Court to go paperless in 6 months: CJI,Chief Justice JS Khehar has said the Supreme C...
2,"At least 3 killed, 30 injured in blast in Sylh...","At least three people were killed, including a..."
3,Why has Reliance been barred from trading in f...,Mukesh Ambani-led Reliance Industries (RIL) wa...
4,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...


In [10]:
news.shape

(55104, 2)

In [11]:
document = news['Short']
summary = news['Headline']

In [12]:
document[30], summary[30]

('According to the Guinness World Records, the most generations alive in a single family have been seven.  The difference between the oldest and the youngest person in the family was about 109 years, when Augusta Bunge&#39;s great-great-great-great grandson was born on January 21, 1989. The family belonged to the United States of America.',
 'The most generations alive in a single family have been 7')

### Preprocessing

In [13]:
# for decoder sequence
summary = summary.apply(lambda x: '<go> ' + x + ' <stop>')
summary.head()

0    <go> 4 ex-bank officials booked for cheating b...
1    <go> Supreme Court to go paperless in 6 months...
2    <go> At least 3 killed, 30 injured in blast in...
3    <go> Why has Reliance been barred from trading...
4    <go> Was stopped from entering my own studio a...
Name: Headline, dtype: object

#### Tokenizing the texts into integer tokens

In [14]:
# since < and > from default tokens cannot be removed
filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
oov_token = '<unk>'

In [15]:
document_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token=oov_token)
summary_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)

In [16]:
document_tokenizer.fit_on_texts(document)
summary_tokenizer.fit_on_texts(summary)

In [17]:
inputs = document_tokenizer.texts_to_sequences(document)
targets = summary_tokenizer.texts_to_sequences(summary)

In [18]:
summary_tokenizer.texts_to_sequences(["This is a test"])

[[184, 22, 12, 71]]

In [19]:
summary_tokenizer.sequences_to_texts([[184, 22, 12, 71]])

['this is a test']

In [20]:
encoder_vocab_size = len(document_tokenizer.word_index) + 1
decoder_vocab_size = len(summary_tokenizer.word_index) + 1

# vocab_size
encoder_vocab_size, decoder_vocab_size

(76362, 29661)

#### Obtaining insights on lengths for defining maxlen

In [21]:
document_lengths = pd.Series([len(x) for x in document])
summary_lengths = pd.Series([len(x) for x in summary])

In [22]:
document_lengths.describe()

count    55104.000000
mean       368.003049
std         26.235510
min        280.000000
25%        350.000000
50%        369.000000
75%        387.000000
max        469.000000
dtype: float64

In [23]:
summary_lengths.describe()

count    55104.000000
mean        63.620282
std          7.267463
min         20.000000
25%         59.000000
50%         63.000000
75%         69.000000
max         96.000000
dtype: float64

In [24]:
# maxlen
# taking values > and round figured to 75th percentile
# at the same time not leaving high variance
encoder_maxlen = 400
decoder_maxlen = 75

#### Padding/Truncating sequences for identical sequence lengths

In [25]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=encoder_maxlen, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences(targets, maxlen=decoder_maxlen, padding='post', truncating='post')

### Creating dataset pipeline

In [26]:
inputs = tf.cast(inputs, dtype=tf.int32)
targets = tf.cast(targets, dtype=tf.int32)

In [27]:
BUFFER_SIZE = 20000
BATCH_SIZE = 64

In [28]:
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

### Positional Encoding for adding notion of position among words as unlike RNN this is non-directional

In [29]:
def get_angles(position, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return position * angle_rates

In [30]:
def positional_encoding(position, d_model):
    angle_rads = get_angles(
        np.arange(position)[:, np.newaxis],
        np.arange(d_model)[np.newaxis, :],
        d_model
    )

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)


### Masking

- Padding mask for masking "pad" sequences
- Lookahead mask for masking future words from contributing in prediction of current words in self attention

In [31]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

In [32]:
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask

### Building the Model

#### Scaled Dot Product

In [33]:
def scaled_dot_product_attention(q, k, v, mask):
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    output = tf.matmul(attention_weights, v)
    return output, attention_weights

#### Multi-Headed Attention

In [34]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)

        return output, attention_weights

### Feed Forward Network

In [35]:
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model)
    ])

#### Fundamental Unit of Transformer encoder

In [36]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2


#### Fundamental Unit of Transformer decoder

In [37]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)


    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3, attn_weights_block1, attn_weights_block2


#### Encoder consisting of multiple EncoderLayer(s)

In [38]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x


#### Decoder consisting of multiple DecoderLayer(s)

In [39]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        return x, attention_weights


#### Finally, the Transformer

In [40]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)

        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)

        return final_output, attention_weights


### Training

In [41]:
# hyper-params
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
EPOCHS = 20

#### Adam optimizer with custom learning rate scheduling

In [42]:
# class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
#     def __init__(self, d_model, warmup_steps=4000):
#         super(CustomSchedule, self).__init__()

#         self.d_model = d_model
#         self.d_model = tf.cast(self.d_model, tf.float32)

#         self.warmup_steps = warmup_steps

#     def __call__(self, step):
#         arg1 = tf.math.rsqrt(step)
#         arg2 = step * (self.warmup_steps ** -1.5)

#         return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


#### Defining losses and other metrics

In [43]:
####   DEBUG ####

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)  # Convert d_model to float32

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # Convert step to float32
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

In [44]:
# ### REPLACED BY ABOVE CODE ###

# learning_rate = CustomSchedule(d_model)

# optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

In [45]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

In [46]:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


In [47]:
train_loss = tf.keras.metrics.Mean(name='train_loss')

#### Transformer

In [48]:
transformer = Transformer(
    num_layers,
    d_model,
    num_heads,
    dff,
    encoder_vocab_size,
    decoder_vocab_size,
    pe_input=encoder_vocab_size,
    pe_target=decoder_vocab_size,
)

#### Masks

In [49]:
def create_masks(inp, tar):
    enc_padding_mask = create_padding_mask(inp)
    dec_padding_mask = create_padding_mask(inp)

    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask


#### Checkpoints

In [50]:
checkpoint_path = "checkpoints"

ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print ('Latest checkpoint restored!!')

#### Training steps

In [51]:
@tf.function
def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

    with tf.GradientTape() as tape:
        predictions, _ = transformer(
            inp, tar_inp,
            True,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask
        )
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    train_loss(loss)

In [52]:
####   DEBUG ####

def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, tar_inp, True, enc_padding_mask, combined_mask, dec_padding_mask)
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    train_loss(loss)


In [53]:
# ###REPLACED BY ABOVE CODE

# for epoch in range(EPOCHS):
#     start = time.time()

#     train_loss.reset_states()

#     for (batch, (inp, tar)) in enumerate(dataset):
#         train_step(inp, tar)

#         # 55k samples
#         # we display 3 batch results -- 0th, middle and last one (approx)
#         # 55k / 64 ~ 858; 858 / 2 = 429
#         if batch % 429 == 0:
#             print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, train_loss.result()))

#     if (epoch + 1) % 5 == 0:
#         ckpt_save_path = ckpt_manager.save()
#         print ('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))

#     print ('Epoch {} Loss {:.4f}'.format(epoch + 1, train_loss.result()))

#     print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))


### Inference

#### Predicting one word at a time at the decoder and appending it to the output; then taking the complete sequence as an input to the decoder and repeating until maxlen or stop keyword appears

In [54]:
def evaluate(input_document):
    input_document = document_tokenizer.texts_to_sequences([input_document])
    input_document = tf.keras.preprocessing.sequence.pad_sequences(input_document, maxlen=encoder_maxlen, padding='post', truncating='post')

    encoder_input = tf.expand_dims(input_document[0], 0)

    decoder_input = [summary_tokenizer.word_index["<go>"]]
    output = tf.expand_dims(decoder_input, 0)

    for i in range(decoder_maxlen):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, output)

        predictions, attention_weights = transformer(
            encoder_input,
            output,
            False,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask
        )

        predictions = predictions[: ,-1:, :]
        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        if predicted_id == summary_tokenizer.word_index["<stop>"]:
            return tf.squeeze(output, axis=0), attention_weights

        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights


In [55]:
# def summarize(input_document):
#     # not considering attention weights for now, can be used to plot attention heatmaps in the future
#     summarized = evaluate(input_document=input_document)[0].numpy()
#     summarized = np.expand_dims(summarized[1:], 0)  # not printing <go> token
#     return summary_tokenizer.sequences_to_texts(summarized)[0]  # since there is just one translated document

In [56]:
####   DEBUG ####

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        x = tf.cast(x, tf.int32)  # Convert x to int32

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x


In [57]:
pip install sentencepiece transformers ####   DEBUG ####

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m45.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m79.5 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m24.4 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m90.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting

In [58]:
####   DEBUG ####
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [64]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def summarize(text):
    # Tokenize the input text
    tokenized_text = tokenizer.encode(text, truncation=True, max_length=400, padding='longest')

    # Convert the tokenized text to a tensor
    input_text = torch.tensor([tokenized_text])

    # Generate the summary
    summary_ids = model.generate(input_text, num_beams=4, max_length=200, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Check the length of the summary and generate a new summary if it's too short
    while len(summary.split()) < 120:
        summary_ids = model.generate(input_text, num_beams=4, max_length=200, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example usage
text = """
In recent years, people are seeking for a solution to improve text
summarization for Thai language. Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar. This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news. The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated. From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.

Daily newspaper has abundant of data that users do not have
enough time for reading them. It is difficult to identify the relevant
information to satisfy the information needed by users. Automatic
summarization can reduce the problem of information overloading
and it has been proposed previously in English and other languages.
However, there were only a few research results in Thai text
summarization due to the lack of corpus in Thai language and the
complicated grammar.
Text Summarization [1] is a technique for summarizing the content
of the documents. It consists of three steps: 1) create an
intermediate representation of the input text, 2) calculate score for
the sentences based on the concepts, and 3) choose important sentences
to be included in the summary. Text summarization can
be divided into 2 approaches. The first approach is the extractive
summarization, which relies on a method for extracting words and
searching for keywords from the original document. The second
approach is the abstractive summarization, which analyzes words
by linguistic principles with transcription or interpretation from the
original document. This approach implies more effective and
accurate summary than the extractive methods. However, with the
lack of Thai corpus, we chose to apply an extractive summarization
method for Thai text summarization.
This research focused on the sentence extraction function based on
keyword score calculation then selecting important sentences based
on the Generic Sentence Relevance score (GRS), calculated from
Latent Semantic Analysis (LSA) and Non-negative Matrix
Factorization (NMF). We also tried using K-means clustering for
document summarization. In this experiment, we compared 5
models for 5 rounds with Thai travel news using the compression
rates of 20%, 30% and 40% and reported the rate and method that
produced the best result from the experiment.

In recent years, several models in Thai Text summarization have
been introduced. Suwanno, N. et al. [2] proposed a Thai text
summarization that extracted a paragraph from a document based
on Thai compound nouns, term frequency method, and headline
score for generating a summary. Chongsuntornsri, A., et al. [3]
proposed a new approach for Text summarization in Thai based on
content- and graph-based with the use of Topic Sensitive PageRank
algorithm for summarizing and ranking of text segments.
Jaruskulchai C., et al. [4] proposed a method to summarize
documents by extracting important sentences from combining the
specific properties (Local Property) and the overall properties
(Global Property) of the sentences. The overall properties were
based on the relationship between sentences in the document. From
their experiments, the summarization of the industrial news got
60% precision, 44% recall, and 50.9% F-measure, the general news
got the 51.8% precision, 38.5% recall, and 43.1% F-measure while
the fashion magazines got 53.0% precision, 33.0% recall, and
40.4% F-measure.
Mani, I., et al. [5] proposed techniques of text summarization by
using word frequency in the document and calculated the weight of
word to create a keyword group. They then calculated the cosine
similarity of sentences. The researcher used A* search algorithm to
find the shortest sequence of sentences from keyword group by
topic calculation, sentence segmentation and word grouping. The
sequence of sentences that were in the main group were selected as
important sentences. Their summarization of the agricultural news
got 68.57% precision, 51.95% recall and 56.72% F-measure.
Lee, J., et al. [6] proposed a document summarization method using
Non-negative Matrix Factorization (NMF). They compared
between Latent Semantic Analysis (LSA) and NMF to find the
weight of each word and calculated the summation of weights. The
important sentences were ranked and selected into the summary
based on their summed weight. Based on LSA, they found many
weights with zero and negative values. However, when applied
NMF, they found only the positive values and the scope of the
semantic features’ meaning was narrow. Therefore, they proposed
that NMF provided a greater possibility for extracting important
sentences.

The first step for working with Thai Text is word tokenization. Even
though Thai writing system has no delimiters to indicate word
boundaries together with many rules for word segmentation, several
Thai word tokenization programs have been proposed. Table 1
shows F1 score of the recent programs trained and tested by one of
our laboratory members with the data from BEST2010 corpus [7].
Cutkum [8] got the highest F1 score, hence, we used Cutkum for this
step.

Latent Semantic Analysis (LSA) [14] is the algorithm, which
reduces the dimensionality of term document. The algorithm
creates a matrix by using word frequency, applies the singular value
decomposition (SVD) [15], and then finds closely related terms and
documents. The original matrix A can be separated into three
matrices, where U is the m x r (words x extracted concept) matrix,
V is the n x r (sentences x extracted concepts) matrix, and Σ is the
r x r diagonal matrix, which can be reconstructed to find the original
matrix A.

For sentence selection by K-means clustering, we grouped similar
sentences into the same cluster using the following steps:
1. Randomly select K sentences as the representative of K
groups. K in this paper is the number of sentences that
will be selected into the summary.
2. Calculate centroid of each group by using the value of
sentence vector from V matrix for LSA and 𝐻𝐻𝑇𝑇 matrix
for NMF.
3. Use cosine similarity to calculate sentence similarity
between a sentence and the centroid of each group. Then
assign that sentence to the group with the highest
similarity.
4. Repeat steps 2-3 until all sentences are assigned to a
group, no sentences change the group, or the similarity
between sentences and their centroid is close.
5. Select a sentence with the maximum similarity score with
the centroid of the group and add it into the summary.

The standard data sets in Thai language are unavailable for
evaluating text summarization system. Therefore, we collected 400
Thai travel news from Thairath and Manager online newspapers to
be used as datasets for our experiments. We split 400 travel news
into 5 sets of 80 news each. We then evaluated the performance of
text summarization methods which were LSA and NMF by
comparing their results with the summaries manually curated by
two experts from the Faculty of Liberal Arts, Ubon Ratchathani
University.

The open-source python libraries such as numpy [19] and sklearn
[20] were used in our system. We converted the Thai travel news
obtained from Thairath and Manager online newspapers to plain
text. Then, the sentences of each news were segmented by human
with the following format: Si = ‘xxx’, where Si represents the order
of the sentence in the original document and ‘xxx’ represents the
content of that sentence. After removing stop words and duplicate
words, we built a document term matrix or matrix A then applied
SVD and NMF to the matrix. Then, we used python modules
numpy.linalg.svd to calculate SVD and sklearn.decomposition to
calculate NMF. For sentence selection, we used Gong, Y. et al. and
Murray, G. et al. approaches for calculating weight of the sentence
scores then selected sentences with the highest scores into the
summary. For keyword score calculation of NMF, we calculated
the keyword score from Eq. (5) and then selected the sentence with
the highest score from each concept. The python module
sklearn.cluster was used for K-means clustering. The selected
sentences from all approaches were in the same order as the original
document. In this paper, we performed the 20%, 30% and 40%
document compression. This meant 80%, 70% and 60% of the
sentences will be selected into the summary.

Table 3 demonstrates an example of a matrix 𝐴𝐴, constructed from
word count by sentence of a Thai travel news. It was composed of
98 words and 9 sentences. This matrix 𝐴𝐴 was then applied with the
LSA and NMF. The sentence vectors were calculated from the term
weight and the semantic feature vectors from Eq. (1) for LSA and
Eq. (2) for NMF

We evaluated the results of the summarization by using standard
accuracy, precision, recall, and F1 score [21]. These measurements
quantify the differences between the summary from human and the
experimental methods. The precision shows the correctness of the
extracted sentences and the recall reflects the number of good
sentences missed by the method.

In this experimental set, we would like to explore how the different
sentence selection methods: the Generic Sentence Relevance score
and K-means clustering, affected the text summarization result.
For K-means clustering, both SVD and NMF had similar
summarization efficiency. The F1 score of SVD with K-means
clustering was 0.83, 0.72, and 0.62 for the compression rate of 20%,
30%, and 40%. For the NMF with K-means clustering, the F1 score
for the three compression rates was 0.83, 0.74 and 0.64.
For the Generic Sentence Relevance score, the best F1 score for the
compression rate of 20%, 30%, and 40% was 0.86, 0.78 and 0.68
respectively and the best F1 scores for all compression rates were
from the approach of Murray, G. et al.

Figure 2 shows the Thai text summarization efficiency of 5 models:
(1) NMF with GRS, (2) NMF with K-means, (3) SVD with sentence
score by Gong, Y. et al., (4) SVD with K-means, and (5) SVD with
sentence score by Murray, G. et al. applied to 400 Thai travel news,
divided into 5 sets of 80 news each, with the varied compression
rates of 20%, 30% and 40%.
From this experiment, the best model based on keyword score for
Thai travel news summarization was SVD with sentence selection
by Murray, G. et al. This model with the compression rate of 20%
got the highest score because Murray G. et al. method determined
the number of sentences to be extracted from each concept based on
the importance of that concept. The method of Gong, Y. et al., on
the other hand was proposed to select only one sentence with the
highest score from each concept so that the summary would include
sentences from all concepts. The Generic Sentence Relevance score
for NMF also collected one sentence for each concept, the same as
Gong, Y. et al. but with the highest score calculated by Eq. (5). As
multiple important sentences could be selected from a more
important concept, Murray, G. et al. outperformed both Gong, Y. et
al. and the GRS method.

In this paper, we applied several text summarization methods to
Thai Travel News based on keyword scored in Thai language by
extracting the most relevant sentences from the original document.
We compared LSA and NMF together with different sentence
selection methods, to find the algorithm suitable with this paper's
data source. We concluded that keyword scored calculation by LSA
with sentence selection by Generic Sentence Relevance score by
Murray, G. et al. was the best algorithm while the best compression
rate of all models was 20%, for summarizing Thai Travel News
compared with humans.
In future work, we plan to perform the experiments with different
types of documents and improve word segmentation of compound
nouns that was not handled by Cutkum.

We would like to thank the department of computer engineering,
faculty of engineering, Chulalongkorn University for providing
computing facilities.
"""

summary = summarize(text)
print(summary)

NameError: ignored

In [None]:
####   DEBUG ####
def summarize(text):
    # Tokenize the input text
    tokenized_text = tokenizer.encode(text, truncation=True, max_length=400, padding='longest')

    # Convert the tokenized text to a tensor
    input_text = tf.convert_to_tensor([tokenized_text])

    # Generate the summary
    summary_ids = model.generate(input_text, num_beams=4, max_length=100, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary


In [None]:
summary