## Extractive Text Summarization
This notebook walks through an end-to-end solution for extractive text summarization on the CNN/DailyMail dataset.

## Model Building and Testing

### Importing Libraries
We first install the required packages in the Colab environment to train the model.
We then import all the packages and modules for later use.

In [None]:
!pip install transformers datasets -q
import time
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
# load_metric is not used in this notebook and is no longer available in recent versions of datasets
from datasets import load_dataset

### Downloading Dataset

*   We first load the cnn_dailymail dataset (version 3.0.0).
*   Contents of news_datasets:
  *  article - the full text of the news article
  *  highlights - the reference summary of the article
  *  id - a unique id for the article (hash value)



In [None]:
news_datasets = load_dataset("cnn_dailymail",'3.0.0')

In [None]:
# The dataset is large (287113 training articles), so we keep only a small percentage of each split
# Define the percentages for train, validation, and test splits
train_percent = 1
validation_percent = 1
test_percent = 1

# Calculate the number of examples for each split
train_len = len(news_datasets["train"])
val_len = len(news_datasets["validation"])
test_len = len(news_datasets["test"])
train_split = train_len * train_percent // 100
validation_split = val_len * validation_percent // 100
test_split = test_len * test_percent // 100

# Create new datasets with the desired splits
train_dataset = news_datasets["train"].shuffle(seed=42).select(range(train_split))
validation_dataset = news_datasets["validation"].shuffle(seed=42).select(range(validation_split))
test_dataset = news_datasets["test"].shuffle(seed=42).select(range(test_split))
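
As a quick check of the arithmetic, the subset sizes follow from integer division. Only the training size (287113) is quoted in this notebook; the validation and test sizes below are the published sizes for version 3.0.0 and are an assumption of this sketch.

```python
# The training size comes from the comment in the cell above; the validation
# and test sizes are assumed from the dataset card for version 3.0.0.
split_sizes = {"train": 287113, "validation": 13368, "test": 11490}
percent = 1  # the same percentage is used for every split

# Integer division mirrors `length * percent // 100` in the notebook
subset_sizes = {name: size * percent // 100 for name, size in split_sizes.items()}
print(subset_sizes)
```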

In [None]:
train_dataset[0]['article']

"By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in Camborne, Cornwall . It is also believed there was no working carbon monoxide detector in the st

In [None]:
len(train_dataset)

2871

### Tokenizing
* We first choose the pre-trained model to use, here BERT.
* We initialize a tokenizer for that model. The BertTokenizer.from_pretrained method loads the tokenizer that corresponds to the bert-base-uncased model.
* The tokenizer preprocesses the text so that it is suitable as input to our model: it converts text to token ids, pads or truncates sequences to a fixed length, and can convert token ids back to text.
* The tokenizer is therefore part of the pre-processing: every article passes through it before reaching the model.
* We define a function that takes the reduced train, validation and test data, splits each article into sentences on ".", and applies the tokenizer to each sentence. It returns the input_ids, attention masks and labels.
* Each sentence is labelled 1 if it appears verbatim in the highlights and 0 otherwise.
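
The labelling rule can be tried on its own before running the full tokenizer. The sketch below is a simplified illustration: the helper name `label_sentences` and the toy article are made up, and it strips whitespace and drops empty fragments, which is a simplification rather than what `tokenize_data` does.

```python
def label_sentences(article, highlights):
    """Return (sentence, label) pairs; label is 1 if the sentence appears verbatim in the highlights."""
    pairs = []
    for sentence in article.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments such as the one after the final period
        pairs.append((sentence, 1 if sentence in highlights else 0))
    return pairs

# Hypothetical toy example, not taken from the dataset
toy_article = "The river flooded the town. Shops were closed. Repairs began on Monday."
toy_highlights = "The river flooded the town.\nRepairs began on Monday."

labelled = label_sentences(toy_article, toy_highlights)
for sentence, label in labelled:
    print(label, sentence)
```

Because the match is a plain substring test, a sentence that is paraphrased in the highlights is labelled 0 even if it carries the same information.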


In [None]:
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the data
def tokenize_data(data, max_length):
    input_ids = []
    attention_masks = []
    labels = []

    for each in range(len(data)):
        text = data[each]['article']
        label = data[each]['highlights']

        sentences = text.split(".")

        # Tokenize each sentence
        encoded_sentences = tokenizer(sentences, padding="max_length", truncation=True, return_tensors="tf", max_length=max_length)
        input_ids.extend(encoded_sentences["input_ids"])
        attention_masks.extend(encoded_sentences["attention_mask"])

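        # Determine which sentences are important and create binary labels (0 for not important, 1 for important).
        # Empty fragments get 0: the empty string is a substring of every string, so it would otherwise be labelled 1.
        sentence_importance = [1 if sentence.strip() and sentence in label else 0 for sentence in sentences]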

        labels.extend(sentence_importance)

    return input_ids, attention_masks, labels

In [None]:
# Defining hyperparameters
max_length = 256
batch_size = 4
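
To make the role of max_length concrete, here is a plain-Python sketch of padding, truncation and the attention mask. The helper `pad_or_truncate` and the token ids are hypothetical; the real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Fit a list of token ids to max_length and build the matching attention mask."""
    kept = token_ids[:max_length]                    # truncation
    padding = [pad_id] * (max_length - len(kept))    # padding up to the fixed length
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids; real ids come from the tokenizer's vocabulary
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)
print(mask)
```

With max_length = 256 every sentence becomes a row of 256 ids, so short sentences are mostly padding and the attention mask tells the model which positions to ignore.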

#### Preprocessing the data by passing it through the tokenizer

In [None]:
# Tokenizing the datasets
train_input_ids, train_attention_masks, train_labels = tokenize_data(train_dataset, max_length)
validation_input_ids, validation_attention_masks, validation_labels = tokenize_data(validation_dataset, max_length)
test_input_ids, test_attention_masks, test_labels = tokenize_data(test_dataset, max_length)



### Converting to a tf.data Dataset
* We convert the tokenized data into tf.data.Dataset objects, which the training loop can iterate over.
* TensorFlow datasets allow for efficient and batched data processing, making it easier to train deep learning models on large datasets.
* We can now use these datasets to train and evaluate the sentence classifier used for summarization.

In [None]:
train_dataset_tf = tf.data.Dataset.from_tensor_slices((train_input_ids, train_attention_masks, train_labels))
validation_dataset_tf = tf.data.Dataset.from_tensor_slices((validation_input_ids, validation_attention_masks, validation_labels))
test_dataset_tf = tf.data.Dataset.from_tensor_slices((test_input_ids, test_attention_masks, test_labels))

#### We can shuffle the data and divide the input into batches for training and inference.

In [None]:
train_dataset_tf = train_dataset_tf.batch(batch_size).shuffle(buffer_size=100)
validation_dataset_tf = validation_dataset_tf.batch(batch_size)
test_dataset_tf = test_dataset_tf.batch(batch_size)
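
The order of batch and shuffle matters. In the cell above the training data is batched first, so shuffle reorders whole batches. The plain-Python sketch below illustrates the difference; the helper `batch` is made up for this example, and Python's full shuffle stands in for tf.data's buffer-based shuffle, which mixes more locally.

```python
import random

def batch(items, batch_size):
    """Group a list into consecutive chunks of batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

examples = list(range(12))
rng = random.Random(0)

# Batch first, then shuffle: whole batches are reordered, their contents stay together
batches_then_shuffled = batch(examples, 4)
rng.shuffle(batches_then_shuffled)

# Shuffle first, then batch: individual examples are mixed across batches
shuffled = examples[:]
rng.shuffle(shuffled)
shuffled_then_batched = batch(shuffled, 4)

print(batches_then_shuffled)
print(shuffled_then_batched)
```

Since consecutive rows here come from the same article, batching first keeps sentences of one article together in a batch.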

### Fine Tuning the Model
* We initialize a BERT model for sequence classification.
* TFBertForSequenceClassification.from_pretrained is designed for sequence classification tasks, where the model takes a sequence of tokens as input and predicts a category or label for that sequence.
* Since each sentence is labelled 0 or 1, we set num_labels to 2.

In [None]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training the model
* We initialize the optimizer and the loss function for the model.
* We define the hyperparameters for training.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
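
For intuition, the loss can be reproduced by hand. With from_logits=True the loss for one example is the negative log of the softmax probability assigned to the true label, averaged over the batch by default. The function below is a plain-Python sketch with hypothetical logits, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(labels, logits):
    """Mean of -log softmax(logits)[label] over the batch."""
    losses = []
    for label, row in zip(labels, logits):
        shift = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

# Equal logits: the model is undecided, so the loss is log(2)
uniform_loss = sparse_categorical_crossentropy([0], [[0.0, 0.0]])
# A large margin towards the correct label gives a loss close to zero
confident_loss = sparse_categorical_crossentropy([1], [[-4.0, 4.0]])
print(uniform_loss, confident_loss)
```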

In [None]:
accumulation_steps = 4  # Accumulate gradients every 4 steps
step = 0  # Initialize the step counter
num_epochs = 1

#### Training the model over train_dataset
* We iterate over the tokenized sentences in batches.
* with tf.GradientTape() as tape - This records the operations of the forward pass so that TensorFlow can compute gradients.
* We get the output from the model, which contains the raw scores (logits) for the two labels.
* The logits are compared against the ground truth (labels) to compute the loss. The loss is divided by accumulation_steps and the gradients are accumulated over that many batches.
* The accumulated gradients are then applied to update the weights with the optimizer.

In [None]:
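# Running sum of the gradients between optimizer updates
accumulated_gradients = [tf.zeros_like(v) for v in model.trainable_variables]

for epoch in range(num_epochs):
    for batch in train_dataset_tf:
        input_ids, attention_mask, labels = batch

        with tf.GradientTape() as tape:
            outputs = model(input_ids, attention_mask=attention_mask, training=True)
            logits = outputs.logits
            loss = loss_fn(labels, logits)

            loss = loss / accumulation_steps  # Scale the loss

        gradients = tape.gradient(loss, model.trainable_variables)
        # Add this batch's gradients to the running sum (variables without a gradient are skipped)
        accumulated_gradients = [acc if grad is None else acc + tf.convert_to_tensor(grad)
                                 for acc, grad in zip(accumulated_gradients, gradients)]

        step += 1  # Increment the step counter

        if step % accumulation_steps == 0:
            # Update the weights once every accumulation_steps batches, then reset the running sum
            optimizer.apply_gradients(zip(accumulated_gradients, model.trainable_variables))
            accumulated_gradients = [tf.zeros_like(v) for v in model.trainable_variables]
            print(f"Epoch [{epoch + 1}/{num_epochs}] - Step [{step}/{len(train_dataset_tf)}] - Loss: {loss.numpy()}")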



Epoch [1/1] - Step [4/30732] - Loss: 0.15715251863002777
Epoch [1/1] - Step [8/30732] - Loss: 0.10688062012195587
Epoch [1/1] - Step [12/30732] - Loss: 0.1459098756313324
Epoch [1/1] - Step [16/30732] - Loss: 0.06755093485116959
Epoch [1/1] - Step [20/30732] - Loss: 0.14754396677017212
Epoch [1/1] - Step [24/30732] - Loss: 0.12900492548942566
Epoch [1/1] - Step [28/30732] - Loss: 0.019558699801564217
Epoch [1/1] - Step [32/30732] - Loss: 0.017782138660550117
Epoch [1/1] - Step [36/30732] - Loss: 0.01188712939620018
Epoch [1/1] - Step [40/30732] - Loss: 0.08939484506845474
Epoch [1/1] - Step [44/30732] - Loss: 0.05446353554725647
Epoch [1/1] - Step [48/30732] - Loss: 0.0070655024610459805
Epoch [1/1] - Step [52/30732] - Loss: 0.0792413130402565
Epoch [1/1] - Step [56/30732] - Loss: 0.006066455505788326
Epoch [1/1] - Step [60/30732] - Loss: 0.005495986435562372
Epoch [1/1] - Step [64/30732] - Loss: 0.004540382884442806
Epoch [1/1] - Step [68/30732] - Loss: 0.003574148751795292
Epoch [1/1

KeyboardInterrupt: ignored
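
The reason for dividing the loss by accumulation_steps can be checked on a toy problem: summing the scaled gradients of several equally sized micro-batches gives the same gradient as one large batch. The model (y = w * x with squared error) and the data below are hypothetical and only illustrate the arithmetic.

```python
def mean_loss_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch of (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
accumulation_steps = 4
# Hypothetical data: four micro-batches of two (x, y) pairs each
micro_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 5.0), (4.0, 9.0)],
    [(5.0, 9.0), (6.0, 13.0)],
    [(7.0, 15.0), (8.0, 15.0)],
]

accumulated = 0.0
for micro_batch in micro_batches:
    # Scaling by accumulation_steps mirrors `loss = loss / accumulation_steps`
    accumulated += mean_loss_gradient(w, micro_batch) / accumulation_steps

# Gradient of the mean loss over all eight examples at once
full_batch = [pair for micro_batch in micro_batches for pair in micro_batch]
full_gradient = mean_loss_gradient(w, full_batch)
print(accumulated, full_gradient)
```

With batch_size = 4 and accumulation_steps = 4, accumulation therefore behaves like an effective batch of 16 sentences while only 4 are held in memory at a time.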

### Saving the model
We save the model with model.save in the TensorFlow SavedModel format, which TF Serving can load.

In [None]:
model.save('T5_ext_summ')

In [None]:
model.save("saved_model/1", save_format="tf")
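
The trailing "1" in the path is the model version. TF Serving expects the base path to contain numbered version sub-directories and, by default, serves the highest one. The helper below is a hypothetical sketch of that lookup on a temporary directory; it is not part of TF Serving.

```python
import tempfile
from pathlib import Path

def latest_version_dir(base_path):
    """Return the numerically largest version sub-directory, or None if there is none."""
    versions = [p for p in Path(base_path).iterdir() if p.is_dir() and p.name.isdigit()]
    return max(versions, key=lambda p: int(p.name), default=None)

# Demonstration on a temporary directory that mimics saved_model/1, saved_model/2, saved_model/10
with tempfile.TemporaryDirectory() as base:
    for name in ("1", "2", "10", "notes"):
        (Path(base) / name).mkdir()
    newest = latest_version_dir(base).name

print(newest)
```

Versions are compared as numbers, so "10" is newer than "2", and non-numeric folders such as "notes" are ignored.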

### Testing the model
* We take a sample article and tokenize it with the same tokenizer and max_length as before, then pass it to the model.
* The model returns two logits, one per label, and tf.argmax gives the predicted label.
* The predicted label is used as an index into the article's sentences, and that sentence is printed as the summary.

In [None]:
article = "The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water. Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct. Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town. First Minister Nicola Sturgeon visited the area to inspect the damage. The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare. Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit."
# Tokenizing the article
article_tokens = tokenizer(article, padding="max_length", truncation=True, return_tensors="tf", max_length=max_length)
input_ids = tf.convert_to_tensor(article_tokens["input_ids"])
attention_mask = tf.convert_to_tensor(article_tokens["attention_mask"])

model_output = model(input_ids, attention_mask=attention_mask)
logits = model_output.logits

generated_summary = "".join(article.split(".")[tf.argmax(logits, axis=1).numpy()[0]])

print("Generated Summary:")
print(generated_summary)

Generated Summary:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed
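
Because the classifier is trained on individual sentences, another option is to score every sentence separately and keep the highest-scoring ones. The sketch below shows only the selection step; the helper `select_summary`, the sentences and the scores are hypothetical, and the scores stand in for the model's probability of label 1 for each sentence.

```python
def select_summary(sentences, scores, k=1):
    """Keep the k highest-scoring sentences, returned in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return ". ".join(sentences[i] for i in chosen) + "."

sentences = ["The river flooded the town", "Shops were closed", "Repairs began on Monday"]
scores = [0.91, 0.12, 0.67]  # hypothetical probabilities of label 1, one per sentence

top_one = select_summary(sentences, scores, k=1)
top_two = select_summary(sentences, scores, k=2)
print(top_one)
print(top_two)
```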


### Model metrics


In [None]:
!pip install bert_score -q
from bert_score import score

In [None]:
gen_summary=[generated_summary]
text=[article]

In [None]:
P,R,F1=score(gen_summary,text,lang='en')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
for i in range(len(gen_summary)):
    print(f"Pair{i+1}")
    print(f"Precision(P): {P[i].item()}")
    print(f"Recall(R): {R[i].item()}")
    print(f"F1 Score: {F1[i].item()}")
    print()
    print()

Pair1
Precision(P): 0.9601625800132751
Recall(R): 0.8428446054458618
F1 Score: 0.8976868391036987
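
To read these numbers: BERTScore matches each token greedily to its most similar token on the other side. Precision averages the best match for every token of the generated summary, recall does the same for every token of the reference text, and F1 is their harmonic mean. The sketch below uses a small hypothetical similarity matrix and omits the optional idf weighting; it then checks the F1 reported above against the printed precision and recall.

```python
def greedy_precision_recall(similarity):
    """similarity[i][j]: similarity between candidate token i and reference token j."""
    precision = sum(max(row) for row in similarity) / len(similarity)
    columns = list(zip(*similarity))
    recall = sum(max(col) for col in columns) / len(columns)
    return precision, recall

# Hypothetical similarities: 2 candidate tokens (rows) against 3 reference tokens (columns)
toy_precision, toy_recall = greedy_precision_recall([[0.9, 0.2, 0.1],
                                                     [0.3, 0.8, 0.4]])

# Values printed by the cell above
precision = 0.9601625800132751
recall = 0.8428446054458618
f1 = 2 * precision * recall / (precision + recall)
print(toy_precision, toy_recall, f1)
```

The high precision and lower recall are expected here: the summary is a single sentence copied from the article, and it is scored against the whole article.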




### Downloading the saved_model

In [None]:
!zip -r ext_summ.zip /content/saved_model
from google.colab import files
files.download("ext_summ.zip")

  adding: content/saved_model/ (stored 0%)
  adding: content/saved_model/1/ (stored 0%)
  adding: content/saved_model/1/assets/ (stored 0%)
  adding: content/saved_model/1/keras_metadata.pb (deflated 96%)
  adding: content/saved_model/1/fingerprint.pb (stored 0%)
  adding: content/saved_model/1/saved_model.pb (deflated 92%)
  adding: content/saved_model/1/variables/ (stored 0%)
  adding: content/saved_model/1/variables/variables.data-00000-of-00001 (deflated 7%)
  adding: content/saved_model/1/variables/variables.index (deflated 77%)


## Serving the Summarizer

### Saving the model and exploring the saved_model

In [None]:
export_path='/content/saved_model/1'
!saved_model_cli show --dir {export_path} --all

2023-11-07 17:56:53.838306: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 17:56:53.838368: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 17:56:53.838401: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
    

### Installing TF serving

In [None]:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
!apt update

deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2943  100  2943    0     0   3429      0 --:--:-- --:--:-- --:--:--  3426
OK
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease [18.1 kB]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:7 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [46.6 kB]
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRel

In [None]:
!wget 'http://storage.googleapis.com/tensorflow-serving-apt/pool/tensorflow-model-server-2.8.0/t/tensorflow-model-server/tensorflow-model-server_2.8.0_all.deb'
!dpkg -i tensorflow-model-server_2.8.0_all.deb
!pip3 install tensorflow-serving-api==2.8.0

--2023-11-07 18:04:26--  http://storage.googleapis.com/tensorflow-serving-apt/pool/tensorflow-model-server-2.8.0/t/tensorflow-model-server/tensorflow-model-server_2.8.0_all.deb
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.31.207, 142.251.18.207, 74.125.128.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.31.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 340152790 (324M) [application/x-debian-package]
Saving to: ‘tensorflow-model-server_2.8.0_all.deb’


2023-11-07 18:04:38 (28.3 MB/s) - ‘tensorflow-model-server_2.8.0_all.deb’ saved [340152790/340152790]

Selecting previously unselected package tensorflow-model-server.
(Reading database ... 120874 files and directories currently installed.)
Preparing to unpack tensorflow-model-server_2.8.0_all.deb ...
Unpacking tensorflow-model-server (2.8.0) ...
Setting up tensorflow-model-server (2.8.0) ...
Collecting tensorflow-serving-api==2.8.0
  Downloading tensorfl

In [None]:
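import os
# Base directory that contains the numbered model version folders (here "1")
MODEL_DIR = "/content/saved_model"
os.environ["MODEL_DIR"] = MODEL_DIR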

In [None]:
MODEL_DIR

'/content/saved_model'

### Starting TF Model server

In [None]:
%%bash --bg
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=ext_model \
  --model_base_path="${MODEL_DIR}" >server.log 2>&1

In [None]:
!tail server.log

2023-11-07 18:09:19.858959: E external/org_tensorflow/tensorflow/core/grappler/optimizers/meta_optimizer.cc:828] tfg_optimizer{} failed: NOT_FOUND: Op type not registered 'DisableCopyOnRead' in binary running on e2871df5ca98. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
	when importing GraphDef to MLIR module in GrapplerHook
2023-11-07 18:09:20.763275: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 93763584 exceeds 10% of free system memory.


In [None]:
!ps aux | grep tensorflow_model_server

root       24431  4.6  5.1 2023996 688984 ?      Sl   18:09   0:02 tensorflow_model_server --rest_ap
root       24675  0.0  0.0   6616  2392 ?        S    18:10   0:00 grep tensorflow_model_server


In [None]:
import requests
import json
import numpy as np

In [None]:
metadata_url = "http://localhost:8501/v1/models/ext_model/metadata"

# Send a request to get the model metadata
response = requests.get(metadata_url)

# Parse the JSON response
metadata = response.json()

# Print the signature information
print(metadata)

{'model_spec': {'name': 'ext_model', 'signature_name': '', 'version': '1'}, 'metadata': {'signature_def': {'signature_def': {'serving_default': {'inputs': {'token_type_ids': {'dtype': 'DT_INT32', 'tensor_shape': {'dim': [{'size': '-1', 'name': ''}, {'size': '-1', 'name': ''}], 'unknown_rank': False}, 'name': 'serving_default_token_type_ids:0'}, 'attention_mask': {'dtype': 'DT_INT32', 'tensor_shape': {'dim': [{'size': '-1', 'name': ''}, {'size': '-1', 'name': ''}], 'unknown_rank': False}, 'name': 'serving_default_attention_mask:0'}, 'input_ids': {'dtype': 'DT_INT32', 'tensor_shape': {'dim': [{'size': '-1', 'name': ''}, {'size': '-1', 'name': ''}], 'unknown_rank': False}, 'name': 'serving_default_input_ids:0'}}, 'outputs': {'logits': {'dtype': 'DT_FLOAT', 'tensor_shape': {'dim': [{'size': '-1', 'name': ''}, {'size': '2', 'name': ''}], 'unknown_rank': False}, 'name': 'StatefulPartitionedCall:0'}}, 'method_name': 'tensorflow/serving/predict'}, '__saved_model_init_op': {'inputs': {}, 'outpu
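
The metadata is easier to read when only the tensor names of the serving signature are pulled out. The helper `signature_io` below is a hypothetical convenience function, and `sample_metadata` is a trimmed copy of the response printed above that keeps only the fields used here.

```python
def signature_io(metadata, signature_name="serving_default"):
    """Return the sorted input and output tensor names of one serving signature."""
    signature = metadata["metadata"]["signature_def"]["signature_def"][signature_name]
    return sorted(signature["inputs"]), sorted(signature["outputs"])

# Trimmed copy of the printed response, keeping only names and dtypes
sample_metadata = {
    "metadata": {"signature_def": {"signature_def": {"serving_default": {
        "inputs": {
            "token_type_ids": {"dtype": "DT_INT32"},
            "attention_mask": {"dtype": "DT_INT32"},
            "input_ids": {"dtype": "DT_INT32"},
        },
        "outputs": {"logits": {"dtype": "DT_FLOAT"}},
    }}}}
}

input_names, output_names = signature_io(sample_metadata)
print(input_names, output_names)
```

The signature lists three inputs (input_ids, attention_mask, token_type_ids) and one output (logits with two values per example), which is what a prediction request has to match.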

### Generating summary using TF serving
* We take the sample article and pre-process it using the BERT tokenizer.
* The data is then converted into a format suitable for sending to the serving endpoint.
* We define the REST API URL at which the model is served and the request format.
* The data is serialized to JSON and a REST API request is sent.
* After the model responds, the result is read and converted into a human-readable summary.
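
The TF Serving REST API accepts two request layouts: the row format, with an "instances" list holding one object per example, and the columnar format, with an "inputs" object holding named tensors that already include the batch dimension. The sketch below builds both bodies from the same data; the helper `to_row_format` and the shortened token ids are hypothetical.

```python
import json

def to_row_format(columns):
    """Turn {"name": [row0, row1, ...]} into [{"name": row0}, {"name": row1}, ...]."""
    names = list(columns)
    batch_size = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(batch_size)]

# Hypothetical, shortened tensors with a batch dimension of 1
columns = {
    "input_ids": [[101, 2023, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}

columnar_body = json.dumps({"signature_name": "serving_default", "inputs": columns})
row_body = json.dumps({"signature_name": "serving_default", "instances": to_row_format(columns)})
print(columnar_body)
print(row_body)
```

A row-format request is answered under the "predictions" key and a columnar request under the "outputs" key.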

In [None]:
article_serving = "The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water. Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct. Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town. First Minister Nicola Sturgeon visited the area to inspect the damage. The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare. Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit."

In [None]:
tokenized_serv = tokenizer(article_serving, padding="max_length", truncation=True, return_tensors="tf", max_length=256)

In [None]:
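input_data = {
    'input_ids': tokenized_serv['input_ids'].numpy().tolist(),
    'attention_mask': tokenized_serv['attention_mask'].numpy().tolist(),
    # The serving signature also lists token_type_ids as an input
    'token_type_ids': tokenized_serv['token_type_ids'].numpy().tolist()
}

# Define the input data for TensorFlow Serving.
# The tensors already carry a batch dimension, so they are sent in the columnar "inputs" format.
data = {
    'signature_name': 'serving_default',
    'inputs': input_data
}

In [None]:
server_url = "http://localhost:8501/v1/models/ext_model:predict"
headers = {"content-type": "application/json"}
response = requests.post(server_url, data=json.dumps(data), headers = headers)

In [None]:
result = json.loads(response.text)
#print(result)
# The model returns class logits of shape [batch, 2], not token ids
logits = np.array(result['outputs'])
# Same selection rule as the local test: the predicted label indexes into the article's sentences
predicted_label = int(np.argmax(logits, axis=1)[0])
generated_text = article_serving.split(".")[predicted_label]
print("Generated Summary: ")
print(generated_text)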

Generated Summary:
The full cost of damages in Newton Stewart, one of the areas worst affected, is still being assessed
