In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
gowrishankarp_newspaper_text_summarization_cnn_dailymail_path = kagglehub.dataset_download('gowrishankarp/newspaper-text-summarization-cnn-dailymail')
ketanghungralekar_model123_path = kagglehub.dataset_download('ketanghungralekar/model123')
deathsinger205_pkl1234556_path = kagglehub.dataset_download('deathsinger205/pkl1234556')
vmit24_pklw123344_path = kagglehub.dataset_download('vmit24/pklw123344')

print('Data source import complete.')


<h1 style='text-align: center;text-color:blue;'><strong>Sequence to Sequence modelling</strong></h1>
<h2 style='text-align: center;text-color:blue;'>With Teacher Forcing and Attention Mechanism</h2>

This notebook explores the implementation of a Sequence-to-Sequence (seq2seq) model with attention and teacher forcing for the task of text summarization.

Background:
+ **Text Summarization**: The process of condensing a longer piece of text (e.g., an article, document) into a shorter version while preserving the most important information.
+ **Seq2seq Models**: A class of neural networks designed to handle sequence-to-sequence tasks, such as machine translation, text summarization, and question answering. They consist of an encoder that processes the input sequence and a decoder that generates the output sequence.  
+ **Attention Mechanism**: A key component in modern seq2seq models that allows the decoder to focus on different parts of the input sequence when generating each output token. This improves the model's ability to capture long-range dependencies and produce more accurate translations or summaries.  (My notebook : https://www.kaggle.com/code/divyanshvishwkarma/seq2seq-with-attention-mechanism)
+ **Teacher Forcing**: A training technique where the ground truth output tokens are fed as input to the decoder during training. This helps stabilize training and improve the quality of the generated output, especially in the early stages of training. (My notebook : https://www.kaggle.com/code/divyanshvishwkarma/teacher-forcing-in-seq2seq-tensorflow-and-keras)



<h2 style='text-align:center;'>Seq2Seq models</h2>
Seq2Seq models are a type of neural network architecture designed to handle tasks involving sequential data, such as machine translation and text summarization. They consist of two main components: an encoder, which processes the input sequence and creates a context vector, and a decoder, which generates the output sequence based on the context vector. Seq2Seq models have revolutionized many NLP tasks by effectively transforming one sequence into another.
<p style='text-align:center;'><img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Ismhi-muID5ooWf3ZIQFFg.png'></p>

<hr><br>

<h2 style='text-align:center;'>Importing Libraries</h2>

In [None]:
import keras
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow.keras import layers as L
from tensorflow.keras import models as M
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



<p align="center"><strong><span style="font-size: 24px;">Filtering Training Data Based on Length Constraints</span></strong></p>

To prevent **GPU resource exhaustion errors on Kaggle**, the training data is filtered before model training:

- Articles and summaries that exceed predefined limits are removed:
  - `TEXT_SIZE`: maximum article length, in characters
  - `SUMM_SIZE`: maximum summary length, in characters
- This ensures efficient usage of limited GPU memory during training.

Filtering helps keep the dataset manageable while still retaining high-quality training examples.


In [None]:
TEXT_SIZE = 1600
SUMM_SIZE = 500

<hr><br>

<h3 style='text-align:center'>About the data</h3>

<p style='text-align:center'>Link : <a>https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail</a> <br><br>The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. </p>

In [None]:
train = pd.read_csv('/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv')

In [None]:
train = train[train['article'].apply(lambda x: len(x)<TEXT_SIZE)]
train = train[train['highlights'].apply(lambda x: len(x)<SUMM_SIZE)]
len(train)

18775

In [None]:
train = train.reset_index().drop(['index','id'], axis=1)

In [None]:
train.head(10)

Unnamed: 0,article,highlights
0,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,"Kabul, Afghanistan (CNN) -- China's top securi...",China's top security official visited Afghanis...
2,"(CNN) -- Virgin, a leading branded venture cap...",The Virgin Group was founded by Richard Branso...
3,By . Chris Pleasance . Police are hunting for ...,Two men filmed taking iPad from canoe rental o...
4,Baghdad (CNN) -- Radical Iraqi cleric Muqtada ...,Muqtada al-Sadr has been in Iran since 2007 .\...
5,"PUBLISHED: . 07:04 EST, 9 January 2014 . | . U...","Zhu Sanni, 23, had been left alone at home for..."
6,"Kabul, Afghanistan (CNN) -- Thousands of bottl...",Official: Bottles are almost exclusively from ...
7,(CNN) -- Tour de France race director Christia...,The 2013 Tour de France will start from the Fr...
8,(CNN) -- Hundreds filed by a casket on Sunday ...,Wes Leonard collapsed after scoring a winning ...
9,Earlier this season I picked Thierry Henry as ...,Sportsmail columnist Martin Keown was honoured...


<br>
<h3 style='text-align:center'>Example text from the dataset</h3>

In [None]:
train['article'][0]

"By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly ordained 

In [None]:
train['highlights'][0]

'Bishop John Folda, of North Dakota, is taking time off after being diagnosed .\nHe contracted the infection through contaminated food in Italy .\nChurch members in Fargo, Grand Forks and Jamestown could have been exposed .'

<hr><br>

<h2 style='text-align:center;'><strong>Preprocessing data</strong></h2>

<p style='text-align:center;'>
The initial step involves converting the input and target text into sequences of tokens, which can be individual words or sub-word units. This is typically achieved through tokenization techniques. To ensure uniform input shapes for the model, the sequences are then padded with special tokens (e.g., &lt;PAD&gt;) to achieve equal lengths. Finally, to provide clear boundaries for the model, special start (&lt;START&gt;) and end (&lt;END&gt;) tokens are added to the beginning and end of the target sequences, respectively. This preprocessed data is then ready to be fed into the seq2seq model for training and inference.
</p>


<p style='text-align:center;'><img src='https://miro.medium.com/v2/resize:fit:1400/1*IFpVaaEdNfEkIPA3BaIhuw.png'></p>

In [None]:
X, y = np.array(train.iloc[:, 0:1]), np.array(train.iloc[:,1:2])

In [None]:
X, y = X.reshape(X.shape[0]), y.reshape(y.shape[0])

<br>

### Adding "start" and "end" token to the label datapoints

In [None]:
START = '<start>'
END = '<end>'
PAD = '<PAD>'

In [None]:
y = [f"{START} {text} {END}" for text in y]

<br>
<h4 style='text-align:center;'>Holding out a few data points for inference</h4>

In [None]:
size = -10
X_test, y_test = X[size:], y[size:]
X, y = X[:size], y[:size]

<br>
<h4 style='text-align:center;'>Preparing Tokenizer and finding vocabulary size</h4>

In [None]:
e_tk, d_tk = Tokenizer(), Tokenizer()
e_tk.fit_on_texts(X)
d_tk.fit_on_texts(y)

In [None]:
start_id = d_tk.word_index.get(START.strip('<>'))
end_id = d_tk.word_index.get(END.strip('<>'))
pad_id = 0
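
The `strip('<>')` calls above are needed because the Keras `Tokenizer` lower-cases text and, with its default `filters` string, drops punctuation including `<` and `>`, so `<start>` ends up in the vocabulary as `start`. The sketch below imitates that normalisation in plain Python; `keras_like_split` is a hypothetical helper written for illustration, not the library's implementation.

```python
# Default value of the `filters` argument of the Keras Tokenizer
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_split(text, filters=DEFAULT_FILTERS):
    """Hypothetical stand-in: lower-case, turn filtered characters into spaces, split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

tokens = keras_like_split("<start> Bishop John Folda is taking time off . <end>")
print(tokens)
```

One consequence worth keeping in mind: an ordinary word such as "end" inside a summary would map to the same id as the end marker.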

In [None]:
in_vocab_size, out_vocab_size = len(e_tk.word_index) + 1, len(d_tk.word_index) + 1
in_vocab_size, out_vocab_size

(91127, 41940)

<br>
<h4 style='text-align:center;'>Converting text to sequences, padding them and finalizing the three series (enc_inputs, dec_inputs, targets) <br> analogous to (X, dec_target_input, y)</h4>

In [None]:
enc_inputs = e_tk.texts_to_sequences(X)
targets = d_tk.texts_to_sequences(y)

<p style='text-align:center;'><img src='https://miro.medium.com/v2/resize:fit:1400/1*yM9GYw49kIFf3bKp2Eg2AQ.png' width=50%></p>

In [None]:
find_len = lambda x : max([len(seq) for seq in x])+1
input_seq_len, output_seq_len = find_len(enc_inputs), find_len(targets)
input_seq_len, output_seq_len

(329, 95)

In [None]:
enc_inputs =np.array(pad_sequences(enc_inputs, padding='post', truncating='post', maxlen = input_seq_len))

In [None]:
targets = pad_sequences(targets, padding='post', truncating='post', maxlen = output_seq_len)
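
As a rough illustration of what `padding='post'` and `truncating='post'` do, the plain-Python sketch below cuts sequences from the end and right-pads them with id 0 (which the tokenizer never assigns to a word). `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(seq, maxlen, pad_id=0):
    """Hypothetical helper: truncate from the end, then right-pad with pad_id."""
    seq = list(seq)[:maxlen]
    return seq + [pad_id] * (maxlen - len(seq))

batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = [pad_post(s, maxlen=5) for s in batch]
print(padded)  # [[5, 8, 2, 0, 0], [7, 1, 9, 4, 3]]
```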

In [None]:
dec_inputs = np.array(targets[:, :-1])
targets =  np.array(targets[:, 1:])
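
The one-position shift above is what sets up teacher forcing: at each step the decoder reads token `t` and is asked to predict token `t+1`. A minimal sketch with made-up ids (1 = start, 2 = end, 0 = padding, all hypothetical):

```python
# Two padded target sequences with hypothetical token ids
padded = [[1, 11, 12, 13, 2, 0, 0],
          [1, 21, 22, 2, 0, 0, 0]]

dec_inputs = [row[:-1] for row in padded]  # what the decoder reads at each step
targets = [row[1:] for row in padded]      # what it should predict at that step

# Show the first few (input -> target) pairs of the first sequence
for step in range(4):
    print(dec_inputs[0][step], "->", targets[0][step])
```

The decoder input starts with the start marker and the target ends with the end marker (followed by padding), so both have length `output_seq_len - 1`.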

<br>
<h4 style='text-align:center;'>Dimensions of the parameters</h4>

In [None]:
in_vocab_size, out_vocab_size, input_seq_len, output_seq_len

(91127, 41940, 329, 95)

<hr><br>
<h2 style='text-align:center;'><strong>
    Attention Mechanism
    </strong> (Bahdanau Attention)</h2>

Bahdanau attention, also known as additive attention, is a mechanism designed to improve the performance of sequence-to-sequence models. It works by enabling the model to focus on specific parts of the input sequence when generating each part of the output sequence.

<p>At the $t^{th}$ timestep, the decoder hidden state $s_{t}$ (the query) is scored against every encoder hidden state (the keys: $h_{1}$, $h_{2}$, ..., $h_{T}$). The scores are normalised with a softmax into attention weights, and the weighted sum of the encoder states forms the context vector.
This lets the decoder focus on the most relevant parts of the input when generating the next output token.</p>
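
Written out, and using the layer names from the implementation in this notebook ($W_1$, $W_2$, $v$), additive attention can be summarised as follows. Bias terms are omitted for brevity, and the notation $e_{t,i}$, $\alpha_{t,i}$, $c_t$ is introduced here for illustration:

$$
e_{t,i} = v^{\top}\tanh\left(W_1 s_t + W_2 h_i\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{T}\alpha_{t,i}\,h_i
$$

Here $e_{t,i}$ is the score of encoder state $h_i$ for decoder step $t$, $\alpha_{t,i}$ is the attention weight, and $c_t$ is the context vector passed to the decoder.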

<p style='text-align:center;'><img src='https://machinelearningmastery.com/wp-content/uploads/2021/09/bahdanau_1.png' width=50%></p>

In [None]:
class BahdanauAttention(L.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = L.Dense(units)
        self.W2 = L.Dense(units)
        self.V = L.Dense(1)

    def call(self, query, values):
        # query  - shape == (batch_size, hidden_size) -> decoder hidden state at the current timestep
        # values - shape == (batch_size, timesteps, hidden_size) -> encoder outputs (all timesteps)
        # `units` is the size of the attention projection (W1, W2) and may differ from hidden_size
        query = tf.expand_dims(query, axis = 1)                # (batch_size, 1, hidden_size)
        score = self.V(tf.nn.tanh(self.W1(query) + self.W2(values)))  # (batch_size, timesteps, 1)
        attention_weight = tf.nn.softmax(score, axis = 1)      # (batch_size, timesteps, 1)
        context = attention_weight*values                      # (batch_size, timesteps, hidden_size)
        context_vector = tf.reduce_sum(context, axis = 1)      # (batch_size, hidden_size)
        return context_vector, attention_weight
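
To check the shapes by hand, the same computation can be reproduced with NumPy. This is only a sketch with random, hypothetical weights (no biases, no training); the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, hidden, units = 2, 5, 8, 4

query = rng.normal(size=(batch, hidden))              # decoder state s_t
values = rng.normal(size=(batch, timesteps, hidden))  # encoder outputs h_1..h_T
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=(units, 1))

# Additive score for every encoder timestep: (batch, timesteps, 1)
score = np.tanh(query[:, None, :] @ W1 + values @ W2) @ v
# Softmax over the timestep axis
weights = np.exp(score - score.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
# Weighted sum of encoder outputs: (batch, hidden)
context = (weights * values).sum(axis=1)

print(score.shape, weights.shape, context.shape)
```

The attention weights sum to 1 across the input timesteps for each example, and the context vector has the same width as the encoder outputs.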

<hr><br>
<h2 style='text-align:center;'><strong>
    Model Definition
    </strong></h2>

+ Teacher Forcing is implemented in the  **train_step**  method where we use the actual target sequence as input to the decoder during training.
+ The model uses separate Encoder and Decoder classes for better organization.
+ A generate method is included for inference, which uses the model's own predictions rather than teacher forcing.
+ The architecture uses LSTM cells, but you can easily modify it to use GRU or other RNN cells.

## Encoder

In [None]:
class Encoder(L.Layer):
    def __init__(self, in_vocab, embedding_dim, hidden_units):
        super(Encoder, self).__init__()
        self.embed = L.Embedding(in_vocab, embedding_dim)       # (batch_size, seq_length) -> (batch_size, seq_length, embedding_dim)
        # return_sequences=True keeps every timestep's output (needed by attention); return_state=True also returns the final h and c
        self.lstm = L.LSTM(hidden_units, return_sequences=True, return_state=True)

    def call(self, inputs):
        # inputs: (batch_size, seq_length)
        x = self.embed(inputs)                               # (batch_size, seq_length, embedding_dim)
        enc_out, hidden_state, cell_state = self.lstm(x)     # enc_out: (batch_size, seq_length, hidden_units); states: (batch_size, hidden_units)
        return enc_out, hidden_state, cell_state

## Decoder

In [None]:
class Decoder(L.Layer):
    def __init__(self, out_vocab, embedding_dim, hidden_units):
        super(Decoder, self).__init__()
        self.embed = L.Embedding(out_vocab, embedding_dim)     # (batch_size, 1) -> (batch_size, 1, embedding_dim)
        self.lstm = L.LSTM(hidden_units, return_sequences = True, return_state = True)  # called on one timestep at a time
        self.dense = L.Dense(out_vocab, activation='softmax')  # (batch_size, 2 * hidden_units) -> (batch_size, out_vocab) probabilities
        self.attention = BahdanauAttention(64)                 # 64 = size of the attention projection

    def call(self, inputs, hidden_state, cell_state, enc_output):
        # inputs: (batch_size, 1), a single decoder timestep
        x = self.embed(inputs)                                 # (batch_size, 1, embedding_dim)
        states = [hidden_state, cell_state]
        # Attend over the encoder outputs, using the previous decoder hidden state as the query
        context, attention_weights = self.attention(query = hidden_state, values = enc_output)  # context: (batch_size, hidden_units)
        dec_out, hidden_state, cell_state = self.lstm(x, initial_state=states)  # dec_out: (batch_size, 1, hidden_units)
        dec_out = tf.squeeze(dec_out, axis=1)                  # (batch_size, hidden_units)
        inputs = tf.concat([context, dec_out], axis=-1)        # (batch_size, 2 * hidden_units)
        out = self.dense(inputs)                               # (batch_size, out_vocab)
        return out, hidden_state, cell_state

<hr><br>

## <h2 style='text-align:center;'><strong>Teacher Forcing</strong></h2>
<p style='text-align:center;'>During training, instead of feeding the decoder's own prediction from the current timestep as the input to the next, teacher forcing feeds it the ground-truth token from the training data. This makes convergence faster, especially in the early epochs.</p>
<a href="https://ibb.co/LzR9pnD"><p style='text-align:center;'><img src="https://i.ibb.co/VW9MB20/Your-paragraph-text.png" alt="Your-paragraph-text" border="0"></p></a>
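
The difference between the two regimes can be seen with a toy example. `toy_decoder_step` below is a hypothetical, deliberately imperfect stand-in for one decoder step; it is not the model defined in this notebook.

```python
def toy_decoder_step(prev_token):
    """Hypothetical one-step predictor that makes a mistake after 'the'."""
    guesses = {"<start>": "the", "the": "dog", "cat": "sat",
               "dog": "ran", "sat": "<end>", "ran": "<end>"}
    return guesses.get(prev_token, "<end>")

reference = ["<start>", "the", "cat", "sat", "<end>"]

# Teacher forcing: every step is fed the ground-truth previous token.
forced = [toy_decoder_step(tok) for tok in reference[:-1]]

# Free running (as at inference): every step is fed the model's own previous prediction.
free, token = [], "<start>"
while token != "<end>" and len(free) < 10:
    token = toy_decoder_step(token)
    free.append(token)

print(forced)  # ['the', 'dog', 'sat', '<end>']
print(free)    # ['the', 'dog', 'ran', '<end>']
```

With teacher forcing the wrong guess "dog" does not affect the next step, because the next input is the reference token "cat". In the free-running loop the mistake is fed back in and the rest of the sequence drifts.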


## Note:
<p>This model definition contains extra elements: the serializable decorator and the two config methods. They are optional and can be skipped, but they are required if you need to save the model in .keras format.</p>

In [None]:
@keras.saving.register_keras_serializable(package="Custom", name="Seq2Seq")
class Seq2Seq(M.Model):

    def __init__(self, in_vocab, out_vocab, embedding_dim, hidden_units, end_token):
        super(Seq2Seq, self).__init__()

        self.in_vocab = in_vocab
        self.out_vocab = out_vocab
        self.embedding_dim = embedding_dim
        self.hidden_units = hidden_units

        self.encoder = Encoder(in_vocab, embedding_dim, hidden_units)
        self.decoder = Decoder(out_vocab, embedding_dim, hidden_units)
        self.end_token = end_token

    @tf.function
    def train_step(self, inputs):
        (enc_inputs, dec_inputs), targets = inputs         # (batch_size, seq_length)

        with tf.GradientTape() as tape:
            enc_out, hidden_state, cell_state = self.encoder(enc_inputs)           # enc_out: (batch_size, seq_length, hidden_units)
            seq_len = dec_inputs.shape[1]
            dec_out = tf.TensorArray(tf.float32, seq_len)  # one (batch_size, out_vocab) entry per timestep
            mask = tf.TensorArray(tf.bool, size=seq_len)
            for timestep in tf.range(seq_len):
                # Teacher forcing: feed the ground-truth token for this step, not the model's previous prediction
                timestep_input = dec_inputs[:, timestep:timestep+1]       # (batch_size, 1)
                timestep_output, hidden_state, cell_state = self.decoder(timestep_input, hidden_state, cell_state, enc_out)   # timestep_output: (batch_size, out_vocab)
                dec_out = dec_out.write(timestep, timestep_output)
                # Mask is False only where the target is the end token; padding positions (id 0) are not masked here
                is_end = tf.equal(targets[:, timestep], self.end_token)
                mask = mask.write(timestep, tf.logical_not(is_end))
            dec_out = tf.transpose(dec_out.stack(), [1, 0, 2])            # (batch_size, seq_len, out_vocab)
            sequence_mask = tf.transpose(mask.stack(), [1, 0])            # (batch_size, seq_len)
            loss = self.compiled_loss(targets, dec_out, sample_weight=tf.cast(sequence_mask, tf.float32))
        variables = self.trainable_variables
        gradients = tape.gradient(loss, variables)
        self.optimizer.apply_gradients(zip(gradients, variables))
        self.compiled_metrics.update_state(targets, dec_out) # Update metrics
        return {m.name : m.result() for m in self.metrics}

    @tf.function
    def call(self, inputs, training=False):
        enc_inputs, dec_inputs = inputs
        enc_out, hidden_state, cell_state = self.encoder(enc_inputs)   # enc_out: (batch_size, seq_length, hidden_units)
        seq_len = tf.shape(dec_inputs)[1]
        dec_out = tf.TensorArray(tf.float32, seq_len)  # one (batch_size, out_vocab) entry per timestep
        for timestep in tf.range(seq_len):
            timestep_input = dec_inputs[:, timestep:timestep+1]       # (batch_size, 1)
            timestep_output, hidden_state, cell_state = self.decoder(timestep_input, hidden_state, cell_state, enc_out)   # timestep_output: (batch_size, out_vocab)
            dec_out = dec_out.write(timestep, timestep_output)
        return tf.transpose(dec_out.stack(), [1, 0, 2])


    def generate(self, enc_inputs, max_len, start, end):
        enc_out, hidden_state, cell_state = self.encoder(enc_inputs)
        dec_in = tf.expand_dims([start], 0)              # To get from int -> (1,1) tensor
        result = []
        for _ in range(max_len):
            prediction_logits, hidden_state, cell_state = self.decoder(dec_in, hidden_state, cell_state, enc_out) # (1, out_vocab); softmax probabilities despite the name
            prediction = tf.argmax(prediction_logits, axis=-1)        # greedy choice: token id with shape (1,)
            if prediction == end:
                break
            result.append(prediction.numpy())
            dec_in = tf.expand_dims(prediction, 0)
        return result


    def get_config(self):
        config = super(Seq2Seq, self).get_config()
        config.update({
            'in_vocab': self.in_vocab,
            'out_vocab': self.out_vocab,
            'embedding_dim': self.embedding_dim,
            'hidden_units': self.hidden_units,
            'end_token': self.end_token  # needed to rebuild the model when loading
        })
        return config

    @classmethod
    def from_config(cls, config):
        end_token = config.get('end_token', 0)  # fall back to 0 if the saved config has no end_token
        return cls(
            in_vocab=config['in_vocab'],
            out_vocab=config['out_vocab'],
            embedding_dim=config['embedding_dim'],
            hidden_units=config['hidden_units'],
            end_token=end_token
        )
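
`train_step` passes a per-position `sample_weight` to the loss. The NumPy sketch below shows how such a weight mask changes a sparse categorical cross-entropy. It assumes padding id 0 and uses a padding mask (`targets != 0`), which is a common choice and is not the same as the end-token mask built in `train_step`; the probabilities are made up for illustration.

```python
import numpy as np

def masked_sparse_ce(targets, probs, mask):
    """Mean negative log-likelihood over the positions where mask is 1."""
    picked = np.take_along_axis(probs, targets[..., None], axis=-1)[..., 0]
    nll = -np.log(picked + 1e-9)
    return (nll * mask).sum() / mask.sum()

targets = np.array([[4, 2, 0, 0]])  # two real tokens followed by two pads (id 0)
probs = np.array([[[0.10, 0.10, 0.10, 0.10, 0.60],    # real token, target 4
                   [0.20, 0.20, 0.40, 0.10, 0.10],    # real token, target 2
                   [0.96, 0.01, 0.01, 0.01, 0.01],    # padding, predicted confidently
                   [0.96, 0.01, 0.01, 0.01, 0.01]]])  # padding, predicted confidently

pad_mask = (targets != 0).astype(float)
masked = masked_sparse_ce(targets, probs, pad_mask)
unmasked = masked_sparse_ce(targets, probs, np.ones_like(pad_mask))
print(masked, unmasked)
```

Because padding is easy to predict, averaging over padded positions pulls the loss down (and would push accuracy up) without saying anything about the real tokens.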

<hr><br>
<h2 style='text-align:center;'><strong>Model Instance and Training</strong></h2>
<p style='text-align:center;'>The model was trained for 40 epochs. The stated hyperparameters are a 512-dimensional embedding and 512 LSTM units in both the encoder and decoder; note that the instantiation cell below passes <code>embedding_dim=1024</code>.</p>
<p style='text-align:center;'>The training cells in this notebook may show little or no output: the trained model was saved first, and later attempts to retrain with a larger text size ran into resource exhaustion errors.</p>

In [None]:
model = Seq2Seq(in_vocab=in_vocab_size, out_vocab=out_vocab_size, embedding_dim=1024, hidden_units=512, end_token=end_id)

In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.fit((enc_inputs, dec_inputs), targets, batch_size=32, epochs=1, validation_split=0.2)

470/470 ━━━━━━━━━━━━━━━━━━━━ 493s 1s/step - accuracy: 0.5928 - loss: 2.3844e-05 - val_accuracy: 0.6292 - val_loss: 2.9843


<keras.src.callbacks.history.History at 0x78dc2135e5f0>

In [None]:
model.fit((enc_inputs, dec_inputs), targets, batch_size=32, epochs=20, validation_split=0.2)

Epoch 1/20
470/470 ━━━━━━━━━━━━━━━━━━━━ 481s 1s/step - accuracy: 0.6321 - loss: 2.3844e-05 - val_accuracy: 0.6478 - val_loss: 2.7503
Epoch 2/20
470/470 ━━━━━━━━━━━━━━━━━━━━ 481s 1s/step - accuracy: 0.6496 - loss: 2.3844e-05 - val_accuracy: 0.6552 - val_loss: 2.6526
Epoch 3/20
284/470 ━━━━━━━━━━━━━━━━━━━━ 2:58 961ms/step - accuracy: 0.6636 - loss: 2.3844e-05

In [None]:
model.fit((enc_inputs, dec_inputs), targets, batch_size=32, epochs=10, validation_split=0.2)

In [None]:
model.fit((enc_inputs, dec_inputs), targets, batch_size=32, epochs=10, validation_split=0.2)

<p align="center"><strong><span style="font-size: 24px;">Saving all required files after training the model</span></strong></p>

After training the model with attention and teacher forcing, the following components are saved to ensure reproducibility and smooth inference:

- The trained **Keras model** is saved, which includes both the architecture and the learned weights.
- **Encoder and decoder tokenizers** are saved using `pickle`, preserving the vocabulary and tokenization logic.
- A **metadata dictionary** is saved containing:
  - The index-to-word mapping for the decoder vocabulary (`word_dict`)
  - Special tokens like `start_id` and `end_id`
  - Sequence lengths for both input and output (`input_seq_len` and `output_seq_len`)

These files together enable the model to be reloaded and used for prediction without the need for retraining or redefining preprocessing steps.


In [None]:
model.save('Attention_Model_(teacher_forcing).keras')

In [None]:
import pickle

with open("e_tk.pkl", "wb") as f:
    pickle.dump(e_tk, f)

with open("d_tk.pkl", "wb") as f:
    pickle.dump(d_tk, f)

# Index-to-word lookup used to decode generated token ids back into text
word_dict = {index: word for word, index in d_tk.word_index.items()}

metadata = {
    "word_dict": word_dict,
    "start_id": start_id,
    "end_id": end_id,
    "input_seq_len": input_seq_len,
    "output_seq_len": output_seq_len
}

with open("metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)

<hr><br>
<h3 style='text-align:center;'><strong>Model Inference</strong></h3>
<p style='text-align:center;'>
With the required saved model and associated files (such as encoder/decoder tokenizers and metadata), the cells under this section can be executed to perform inference or evaluate the model on test data <strong>without retraining the model</strong>. This setup enables quick generation of summaries and computation of evaluation metrics like ROUGE directly.
</p>


In [None]:
# word_dict = {v : k for k,v in d_tk.word_index.items()}

<p align="center"><strong><span style="font-size: 24px;">Loading the Trained Model for Inference</span></strong></p>

The trained attention-based Seq2Seq model is loaded using Keras for performing inference on new data.

- The model was saved after training and is now reloaded using `load_model()`.
- Since a custom model class (`Seq2Seq`) was used during training, it must be specified in the `custom_objects` argument to ensure proper reconstruction.
- This step restores the trained model's architecture and learned weights, making it ready for inference without retraining.


In [None]:
from keras.models import load_model
model = load_model("/kaggle/input/model123/Attention_Model_(teacher_forcing).keras", custom_objects={"Seq2Seq": Seq2Seq})

I0000 00:00:1744353172.051433      31 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0


<p align="center"><strong><span style="font-size: 24px;">Loading Tokenizers and Metadata for Inference</span></strong></p>

To perform inference using the trained model, it's essential to load the same preprocessing components used during training:

- **Encoder and decoder tokenizers** are loaded using `pickle`. These tokenizers preserve the vocabulary mappings required for encoding input sequences and decoding outputs.
- A **metadata file** is also loaded, which contains:
  - `word_dict`: Mapping of decoder token indices to words
  - `start_id` and `end_id`: Special tokens used to denote the beginning and end of sequences
  - `input_seq_len` and `output_seq_len`: Sequence lengths used during training

Loading these components ensures that the input data is tokenized and padded exactly as it was during training, which is crucial for accurate inference.


In [None]:
import pickle

# Load encoder and decoder tokenizers
with open("/kaggle/input/pkl1234556/e_tk.pkl", "rb") as f:
    e_tk = pickle.load(f)

with open("/kaggle/input/pkl1234556/d_tk.pkl", "rb") as f:
    d_tk = pickle.load(f)

# Load metadata (e.g., start/end token IDs, word_dict, etc.)
with open("/kaggle/input/pkl1234556/metadata.pkl", "rb") as f:
    metadata = pickle.load(f)

word_dict = metadata["word_dict"]
start_id = metadata["start_id"]
end_id = metadata["end_id"]
input_seq_len = metadata["input_seq_len"]
output_seq_len = metadata["output_seq_len"]


<p align="center"><strong><span style="font-size: 24px;">Generating Summaries Using the Trained Model</span></strong></p>

This function performs **text summarization** using the trained attention-based Seq2Seq model. It takes raw input text and returns a generated summary by following these steps:

- **Tokenization & Padding**:  
  The input text is first tokenized using the encoder tokenizer and padded to match the input sequence length used during training.

- **Prediction**:  
  The padded input is passed to the model's `generate` method along with the maximum target length, `start_id`, and `end_id` to produce a sequence of output token IDs.

- **Decoding**:  
  The generated token IDs are converted back to words using the `word_dict`. Decoding stops upon encountering the `end_id`.

- **Return Summary**:  
  The decoded words are joined to form the final summarized text, which is returned by the function.

This setup ensures that the model can generate summaries for new inputs using the same preprocessing pipeline as in training.


In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def summarize_text(text, model, source_tokenizer, word_dict,
                   start_id, end_id, source_max, target_max):

    # Tokenize and pad input
    seq = source_tokenizer.texts_to_sequences([text])
    seq = pad_sequences(seq, maxlen=source_max, padding='post')

    # Generate predictions
    model_output = model.generate(seq, target_max, start_id, end_id)

    # Decode output tokens to words
    output_text = []
    for token_id in model_output:
        if isinstance(token_id, (list, tuple, np.ndarray)):
            token_id = int(token_id[0])
        else:
            token_id = int(token_id)

        if token_id == end_id:
            break

        word = word_dict.get(token_id, '')
        if word:
            output_text.append(word)

    # print("\nINPUT TEXT:")
    # print(text)
    # print("\nGENERATED SUMMARY:")
    # print(' '.join(output_text))

    return ' '.join(output_text)


Running inference on a sample article:

<p align="center"><strong><span style="font-size: 24px;">Testing the summarize_text Function</span></strong></p>

A sample input is used to test the `summarize_text` function and verify that it returns a valid summary.

- The function is called with a test input and necessary components.
- The generated summary is printed alongside the expected summary for a quick comparison.

This simple check ensures the inference pipeline is working as intended.


In [None]:
example_text = X[0]  # or any custom string
summary = summarize_text(
    text=example_text,
    model=model,
    source_tokenizer=e_tk,
    word_dict=word_dict,
    start_id=start_id,
    end_id=end_id,
    source_max=input_seq_len,
    target_max=output_seq_len
)

print("\nGENERATED SUMMARY:")
print(summary)

print("\nEXPECTED SUMMARY:")
print(y[0][len('<start>'):-len('<end>')].strip())  # trim the <start> and <end> markers added during preprocessing





INPUT TEXT:
By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for new

<p align="center"><strong><span style="font-size: 24px;">Evaluating the Model Using ROUGE Score</span></strong></p>

To evaluate the quality of generated summaries, the **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** metric is used.

- The generated summaries are compared against the reference (expected) summaries.
- ROUGE scores measure the **overlap of n-grams, word sequences, and word pairs** between the two.
- Commonly used variants include **ROUGE-1**, **ROUGE-2**, and **ROUGE-L** for unigram, bigram, and longest common subsequence matches.

This evaluation provides a quantitative measure of how well the model is performing on the summarization task.
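To make the three variants concrete, here is a minimal from-scratch sketch of the F1 form of ROUGE-1, ROUGE-2, and ROUGE-L. It is for illustration only: the helper names (`ngrams`, `rouge_n`, `rouge_l`) and the example sentences are made up, tokenization is plain lowercase whitespace splitting, and no stemming is applied, so the numbers will not match those from the `rouge_score` package exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, ref_total, cand_total):
    """Combine a match count into an F1 score; 0.0 when nothing matches."""
    if matches == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    """ROUGE-N F1: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter intersection clips repeated n-grams
    return f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return f1(lcs_length(ref, cand), len(ref), len(cand))


# Made-up example pair, not taken from the dataset.
reference = "the bishop exposed church members to hepatitis"
candidate = "the bishop exposed members to the virus"

print(f"ROUGE-1: {rouge_n(reference, candidate, 1):.4f}")  # 5 of 7 unigrams match
print(f"ROUGE-2: {rouge_n(reference, candidate, 2):.4f}")  # 3 of 6 bigrams match
print(f"ROUGE-L: {rouge_l(reference, candidate):.4f}")     # LCS has 5 tokens
```

Note how the repeated "the" in the candidate is only credited once, because the overlap count is clipped to the number of occurrences in the reference.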


<p align="center"><strong><span style="font-size: 24px;">Loading and Preprocessing Test Data</span></strong></p>

The test dataset is loaded to evaluate the performance of the trained model.

- The test data is **preprocessed using the same steps** as the training data to maintain consistency.
- This includes tokenization, padding, and any necessary formatting.
- Ensuring identical preprocessing helps produce reliable and comparable evaluation results.
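As a rough illustration of what "the same steps" means, the sketch below encodes text with a vocabulary that was fixed at training time and then truncates or pads to a fixed length. The helper `encode_and_pad`, the toy vocabulary, the reserved ids, and the choice of post-padding are all assumptions made for this example; the notebook itself relies on its fitted tokenizer (`e_tk`) and its own padding settings.

```python
def encode_and_pad(text, vocab, max_len, pad_id=0, oov_id=1):
    """Map words to ids with a fixed vocabulary, then truncate/pad to max_len."""
    ids = [vocab.get(word, oov_id) for word in text.lower().split()]
    ids = ids[:max_len]                             # truncate inputs that are too long
    return ids + [pad_id] * (max_len - len(ids))    # post-pad inputs that are too short


# Hypothetical vocabulary, standing in for the one fitted on the training data.
train_vocab = {"the": 2, "bishop": 3, "is": 4, "ill": 5}

# "recovering" was never seen during training, so it maps to the OOV id.
print(encode_and_pad("The bishop is recovering", train_vocab, max_len=6))
# [2, 3, 4, 1, 0, 0]
```

The key point is that the vocabulary is reused, not refitted on the test set; refitting would assign different ids to the same words and make the model's inputs meaningless.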


In [None]:
test = pd.read_csv('/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv')

<p align="center"><strong><span style="font-size: 24px;">Filtering Test Data Based on Length Constraints</span></strong></p>

Due to **Kaggle's GPU resource constraints**, unfiltered long sequences can lead to **resource exhaustion errors** during inference.

- To prevent this, the test data is filtered similarly to the training data:
  - Articles of `TEXT_SIZE` or more characters and summaries of `SUMM_SIZE` or more characters are excluded (lengths are measured with `len()` on the raw strings).
- This keeps the input within safe bounds for GPU processing and ensures smooth evaluation.

Applying these constraints helps avoid crashes while maintaining evaluation consistency.


In [None]:
test = test[test['article'].apply(lambda x: len(x)<TEXT_SIZE)]
test = test[test['highlights'].apply(lambda x: len(x)<SUMM_SIZE)]
len(test)

879

In [None]:
test = test.drop(columns=['id']).reset_index(drop=True)  # drop the id column and renumber rows from 0

In [None]:
# After dropping 'id', the remaining columns are the article text and its reference summary
X, y = test['article'].to_numpy(), test['highlights'].to_numpy()

In [None]:
START = '<start>'
END = '<end>'
PAD = '<PAD>'
y = [f"{START} {text} {END}" for text in y]
X_test, y_test = X, y

<p align="center"><strong><span style="font-size: 24px;">Evaluating Model Using ROUGE Scores</span></strong></p>

The trained model's summaries are now scored against the reference summaries with the ROUGE metrics described earlier:

- ROUGE-1, ROUGE-2, and ROUGE-L scores are computed to assess the overlap between generated summaries and reference summaries.
- The evaluation is performed in two phases:
  1. **Initial Evaluation:** On a small subset (first 100 samples) for quick inspection and debugging.
  2. **Full Evaluation:** On the entire test set to get a complete performance picture.
- Each generated summary is compared against the ground truth using a **stemmer-based ROUGE scorer**, and average F1 scores are calculated.

These metrics provide insight into the model's ability to retain important content from the input text.
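One detail worth keeping in mind when scoring: the reference summaries in this notebook are wrapped in `<start>` and `<end>` markers. A ROUGE tokenizer that discards non-alphanumeric characters would treat those markers as the ordinary words "start" and "end", which slightly changes the reference length. The sketch below shows one possible way to strip the markers and to average per-sample F1 values; `strip_markers` and `mean` are hypothetical helpers written for this example, not functions used elsewhere in the notebook.

```python
import re

START, END = "<start>", "<end>"


def strip_markers(text, markers=(START, END)):
    """Remove sequence markers and collapse the whitespace they leave behind."""
    for marker in markers:
        text = text.replace(marker, " ")
    return re.sub(r"\s+", " ", text).strip()


def mean(values):
    """Average a list of per-sample scores; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Made-up reference string in the same format as the notebook's targets.
reference = f"{START} Bishop diagnosed with hepatitis A . {END}"
print(strip_markers(reference))    # Bishop diagnosed with hepatitis A .
print(mean([0.25, 0.5, 0.75]))     # 0.5
```

Whether to strip the markers is a judgment call; what matters most is applying the same choice to every sample so that scores stay comparable across runs.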


In [None]:
from rouge_score import rouge_scorer

# Phase 1: evaluate on the first 100 test samples only
X_subset = X[:100]
y_subset = y[:100]

# Score the whole 100-sample subset
X_test = X_subset
y_test = y_subset

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
avg_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

# Generate summaries and calculate scores
for i in range(len(X_test)):
    print(f"🔄 Processing article {i + 1}/{len(X_test)}...")

    summary = summarize_text(
        text=X_test[i],
        model=model,
        source_tokenizer=e_tk,
        word_dict=word_dict,
        start_id=start_id,
        end_id=end_id,
        source_max=input_seq_len,
        target_max=output_seq_len
    )

    scores = scorer.score(y_test[i], summary)
    for key in scores:
        avg_scores[key].append(scores[key].fmeasure)

# Compute and print average F1 scores
print("\n📊 AVERAGE ROUGE F1 SCORES OVER FIRST 100 SAMPLES:")
for key in avg_scores:
    mean_f1 = sum(avg_scores[key]) / len(avg_scores[key])
    print(f"{key.upper()}: {mean_f1:.4f}")


🔄 Processing article 1/100...




🔄 Processing article 2/100...
🔄 Processing article 3/100...
🔄 Processing article 4/100...
🔄 Processing article 5/100...
🔄 Processing article 6/100...
🔄 Processing article 7/100...
🔄 Processing article 8/100...
🔄 Processing article 9/100...
🔄 Processing article 10/100...
🔄 Processing article 11/100...
🔄 Processing article 12/100...
🔄 Processing article 13/100...
🔄 Processing article 14/100...
🔄 Processing article 15/100...
🔄 Processing article 16/100...
🔄 Processing article 17/100...
🔄 Processing article 18/100...
🔄 Processing article 19/100...
🔄 Processing article 20/100...
🔄 Processing article 21/100...
🔄 Processing article 22/100...
🔄 Processing article 23/100...
🔄 Processing article 24/100...
🔄 Processing article 25/100...
🔄 Processing article 26/100...
🔄 Processing article 27/100...
🔄 Processing article 28/100...
🔄 Processing article 29/100...
🔄 Processing article 30/100...
🔄 Processing article 31/100...
🔄 Processing article 32/100...
🔄 Processing article 33/100...
🔄 Processing art

In [None]:
from rouge_score import rouge_scorer

# Phase 2: evaluate on the full filtered test set
X_subset = X
y_subset = y

# Score every sample
X_test = X_subset
y_test = y_subset

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
avg_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

# Generate summaries and calculate scores
for i in range(len(X_test)):
    print(f"🔄 Processing article {i + 1}/{len(X_test)}...")

    summary = summarize_text(
        text=X_test[i],
        model=model,
        source_tokenizer=e_tk,
        word_dict=word_dict,
        start_id=start_id,
        end_id=end_id,
        source_max=input_seq_len,
        target_max=output_seq_len
    )

    scores = scorer.score(y_test[i], summary)
    for key in scores:
        avg_scores[key].append(scores[key].fmeasure)

# Compute and print average F1 scores
print("\n📊 AVERAGE ROUGE F1 SCORES OVER TEST SAMPLES:")
for key in avg_scores:
    mean_f1 = sum(avg_scores[key]) / len(avg_scores[key])
    print(f"{key.upper()}: {mean_f1:.4f}")


🔄 Processing article 1/879...




🔄 Processing article 2/879...
🔄 Processing article 3/879...
🔄 Processing article 4/879...
🔄 Processing article 5/879...
🔄 Processing article 6/879...
🔄 Processing article 7/879...
🔄 Processing article 8/879...
🔄 Processing article 9/879...
🔄 Processing article 10/879...
🔄 Processing article 11/879...
🔄 Processing article 12/879...
🔄 Processing article 13/879...
🔄 Processing article 14/879...
🔄 Processing article 15/879...
🔄 Processing article 16/879...
🔄 Processing article 17/879...
🔄 Processing article 18/879...
🔄 Processing article 19/879...
🔄 Processing article 20/879...
🔄 Processing article 21/879...
🔄 Processing article 22/879...
🔄 Processing article 23/879...
🔄 Processing article 24/879...
🔄 Processing article 25/879...
🔄 Processing article 26/879...
🔄 Processing article 27/879...
🔄 Processing article 28/879...
🔄 Processing article 29/879...
🔄 Processing article 30/879...
🔄 Processing article 31/879...
🔄 Processing article 32/879...
🔄 Processing article 33/879...
🔄 Processing art

<hr><br>
<h2 style='text-align:center;'>Observations</h2>

+ As part of our course project, we implemented an attention-based Seq2Seq model for text summarization and evaluated it using ROUGE metrics.
+ We initially experimented with a baseline model **without attention**, but it produced poor results both qualitatively and quantitatively. Due to its limitations, we decided not to pursue it further.
+ In contrast, the attention-based model generated more coherent and contextually relevant summaries, despite occasional word repetitions or mid-sentence mix-ups.
+ Although the model was more complex and took longer to train, the performance gains justified the added overhead.
+ The attention-based model obtained the following ROUGE scores:

  📊 **AVERAGE ROUGE F1 SCORES OVER FIRST 100 SAMPLES**:  
  ‣ **ROUGE-1**: 0.1952  
  ‣ **ROUGE-2**: 0.0322  
  ‣ **ROUGE-L**: 0.1251  

  📊 **AVERAGE ROUGE F1 SCORES OVER TEST SAMPLES** (after filtering):  
  ‣ **ROUGE-1**: 0.1954  
  ‣ **ROUGE-2**: 0.0325  
  ‣ **ROUGE-L**: 0.1239  

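+ The near-identical scores on the 100-sample subset and on the full filtered test set suggest that the subset is representative. The absolute scores are modest, but together with the qualitative comparison against the baseline they support the value of incorporating attention in sequence modeling and motivate exploring more advanced models such as Transformers in future work.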


<hr><br>
<h3 style='text-align:center;'><strong>References</strong></h3>
<p style='text-align:center;'>
<br>
    
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. https://arxiv.org/abs/1409.3215

Attention Mechanism:

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. https://arxiv.org/pdf/1409.0473  

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. https://aclanthology.org/D15-1166/


<ul>
    <li>
        <b>Foundational Seq2Seq Paper (trained with teacher forcing):</b>
        <ul>
            <li>Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems 27 (2014).<br></li>
        </ul>
    </li>
    <li>
        <b>Review Papers:</b>
        <ul>
            <li>Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).</li>
            <li>Neubig, Graham. "Neural machine translation and sequence-to-sequence models: A tutorial." arXiv preprint arXiv:1703.01619 (2017).<br></li>
        </ul>
    </li>
    <li>
        <b>Tutorials and Blog Posts:</b>
        <ul>
            <li><b>Understanding Teacher Forcing in Seq2Seq Models</b> by Lilian Weng: <a href="https://lilianweng.github.io/">https://lilianweng.github.io/</a></li>
            <li><b>Seq2Seq Tutorial with Neural Networks</b> by PyTorch: <a href="https://pytorch.org/tutorials/">https://pytorch.org/tutorials/</a><br></li>
        </ul>
    </li>
    <li>
        <b>Research Papers Exploring Alternatives to Teacher Forcing:</b>
        <ul>
            <li><b>Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks</b> by Samy Bengio et al.: <a href="https://arxiv.org/abs/1506.03099">https://arxiv.org/abs/1506.03099</a></li>
            <li><b>Improved Training of Sequence to Sequence Models</b> by Minh-Thang Luong et al.: <a href="https://research.google/pubs/sequence-to-sequence-learning-with-neural-networks/">https://research.google/pubs/sequence-to-sequence-learning-with-neural-networks/</a><br></li>
        </ul>
    </li>
</ul>
    
</p>