# Text Summarization on the Opinosis Dataset

During this experiment Weights & Biasses platform was used as complete MLOps platform. All of the results are logged in the [project](https://wandb.ai/aleksandar1932/[NLP]%20lab-03%20%7C%20text-summarization?workspace=user-aleksandar1932). Additionally, to tune the hyperparameters [WANDB Sweeps](https://docs.wandb.ai/guides/sweeps/quickstart) are going to be used.

**Note**: WANDB cells will fail to execute since credentials are not provided.

In [104]:
import os

import pandas as pd
import numpy as np
import wandb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from scripts.utils import load_data
from scripts.utils import nlp_pipeline
from scripts.utils import create_vocabulary
from scripts.loader import load_embeddings

WANDB_PROJECT_NAME = os.getenv("WANDB_PROJECT_NAME") or "[NLP] lab-03 | text-summarization"

# Utils

In [9]:
def append_start_end(data):
    data['text_tokens'] = data['text_tokens'].apply(lambda x: np.concatenate((['<START>'], x, ['</END>'])))
    data['summary_tokens'] = data['summary_tokens'].apply(lambda x: np.concatenate((['<START>'], x, ['</END>'])))

In [105]:
def create_train_data(texts, summaries):
    input_texts, input_summaries, next_words = [], [], []

    for sentence, rephrase in zip(texts, summaries):
        for i in range(1, len(rephrase)):
            input_texts.append(sentence)
            input_summaries.append(rephrase[:i])
            next_words.append(rephrase[i])

    return input_texts, input_summaries, next_words

# Data Preprocessing

In [None]:
run = wandb.init(project=WANDB_PROJECT_NAME, job_type="load_data")

In [106]:
df = load_data()
df.head()

Unnamed: 0,id,text,summary
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\r\n but for the...",This unit is generally quite accurate. \r\nSe...
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and ve...",The rooms were not large but were clean and ve...
2,battery-life_amazon_kindle,After I plugged it in to my USB hub on my com...,Battery life is exceptional.\r\nThe Kindle can...
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\...,The battery life is too short.\r\nThe time bet...
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh ...",The battery life is longer then 5 hours.\r\nBu...


In [None]:
df.to_csv('data/data.csv', index=False)

## Upload raw data as artifact to WANDB.

See the artifacts on the following [link](https://wandb.ai/aleksandar1932/[NLP]%20lab-03%20%7C%20text-summarization/artifacts/dataset/opinosis-training/31b123948b13683e0d37) under the `opinosis-raw` dataset.


In [None]:
raw_data = wandb.Artifact(
    "opinosis-raw", type="dataset",
    description="Raw OPINOSIS dataset",
    metadata={"source": "https://archive.ics.uci.edu/ml/datasets/Opinosis+Opinion+%26frasl%3B+Review",
                "sizes": len(df)}
)

complete_data = wandb.Table(data=df, columns=df.columns)
raw_data.add(complete_data, "Complete dataset")
run.log_artifact(raw_data)
run.finish()

## Tokenization


In [107]:
df['text_tokens'] = df['text'].apply(lambda x: nlp_pipeline(x))
df['summary_tokens'] = df['summary'].apply(lambda x: nlp_pipeline(x))
df.head()

Unnamed: 0,id,text,summary,text_tokens,summary_tokens
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\r\n but for the...",This unit is generally quite accurate. \r\nSe...,"[accurate, part, find, garmin, software, provi...","[unit, generally, quite, accurate, set-up, usa..."
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and ve...",The rooms were not large but were clean and ve...,"[room, overly, big, clean, comfortable, beds, ...","[rooms, large, clean, comfortable, bathroom, s..."
2,battery-life_amazon_kindle,After I plugged it in to my USB hub on my com...,Battery life is exceptional.\r\nThe Kindle can...,"[plugged, usb, hub, computer, charge, battery,...","[battery, life, exceptional, kindle, run, days..."
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\...,The battery life is too short.\r\nThe time bet...,"[short, battery, life, moved, 8gb, love, ipod,...","[battery, life, short, time, chargers, enough]"
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh ...",The battery life is longer then 5 hours.\r\nBu...,"[6ghz, 533fsb, cpu, glossy, display, 3, cell, ...","[battery, life, longer, 5, hours, due, battery..."


## START/END Tokens

In [108]:
append_start_end(df)
df.head()

Unnamed: 0,id,text,summary,text_tokens,summary_tokens
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\r\n but for the...",This unit is generally quite accurate. \r\nSe...,"[<START>, accurate, part, find, garmin, softwa...","[<START>, unit, generally, quite, accurate, se..."
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and ve...",The rooms were not large but were clean and ve...,"[<START>, room, overly, big, clean, comfortabl...","[<START>, rooms, large, clean, comfortable, ba..."
2,battery-life_amazon_kindle,After I plugged it in to my USB hub on my com...,Battery life is exceptional.\r\nThe Kindle can...,"[<START>, plugged, usb, hub, computer, charge,...","[<START>, battery, life, exceptional, kindle, ..."
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\...,The battery life is too short.\r\nThe time bet...,"[<START>, short, battery, life, moved, 8gb, lo...","[<START>, battery, life, short, time, chargers..."
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh ...",The battery life is longer then 5 hours.\r\nBu...,"[<START>, 6ghz, 533fsb, cpu, glossy, display, ...","[<START>, battery, life, longer, 5, hours, due..."


## Create Vocabulary and Embeddings

In [109]:
texts = df['text_tokens'].values
summaries = df['summary_tokens'].values

In [110]:
vocabulary, word_to_id, id_to_word = create_vocabulary(np.concatenate((texts, summaries)))

In [111]:
df['text_indices'] = df['text_tokens'].apply(lambda tokens: np.array([word_to_id[word] for word in tokens]))
df['summary_indices'] = df['summary_tokens'].apply(lambda tokens: np.array([word_to_id[word] for word in tokens]))

text_indices = df['text_indices'].values
summary_indices = df['summary_indices'].values

df.head()

Unnamed: 0,id,text,summary,text_tokens,summary_tokens,text_indices,summary_indices
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\r\n but for the...",This unit is generally quite accurate. \r\nSe...,"[<START>, accurate, part, find, garmin, softwa...","[<START>, unit, generally, quite, accurate, se...","[3978, 5083, 2397, 6278, 3664, 3245, 4998, 508...","[3978, 7185, 3235, 5628, 5083, 1176, 584, 3856..."
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and ve...",The rooms were not large but were clean and ve...,"[<START>, room, overly, big, clean, comfortabl...","[<START>, rooms, large, clean, comfortable, ba...","[3978, 3132, 7113, 5175, 4959, 3414, 1426, 408...","[3978, 1655, 4696, 4959, 3414, 1936, 2189, 545..."
2,battery-life_amazon_kindle,After I plugged it in to my USB hub on my com...,Battery life is exceptional.\r\nThe Kindle can...,"[<START>, plugged, usb, hub, computer, charge,...","[<START>, battery, life, exceptional, kindle, ...","[3978, 2018, 625, 4116, 1910, 3202, 3853, 7160...","[3978, 3853, 3501, 436, 5753, 519, 7120, 963, ..."
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\...,The battery life is too short.\r\nThe time bet...,"[<START>, short, battery, life, moved, 8gb, lo...","[<START>, battery, life, short, time, chargers...","[3978, 2291, 3853, 3501, 1479, 4222, 1066, 668...","[3978, 3853, 3501, 2291, 71, 288, 129, 6508]"
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh ...",The battery life is longer then 5 hours.\r\nBu...,"[<START>, 6ghz, 533fsb, cpu, glossy, display, ...","[<START>, battery, life, longer, 5, hours, due...","[3978, 217, 2386, 6598, 1169, 5571, 5475, 5634...","[3978, 3853, 3501, 6674, 6265, 6198, 263, 3853..."


# Upload pre-processed data as artifact to WANDB

See the artifacts on the following [link](https://wandb.ai/aleksandar1932/[NLP]%20lab-03%20%7C%20text-summarization/artifacts/dataset/opinosis-training/31b123948b13683e0d37) under the `opinosis-preprocessed` dataset.

In [None]:
run = wandb.init(project=WANDB_PROJECT_NAME, job_type="load_data")
pre_processed_data = wandb.Artifact(
    "opinosis-preprocessed", type="dataset",
    description="Preprocessed OPINOSIS dataset",
    metadata={"sizes": len(df), "pipeline": ["tokenization", "indexing", "start/end tokens"]}
)

pre_processed_dataframe = wandb.Table(data=df, columns=df.columns, allow_mixed_types=True)
pre_processed_data.add(pre_processed_dataframe, "Preprocessed dataset")
run.log_artifact(pre_processed_data)
run.finish()

In [112]:
embeddings = load_embeddings(vocabulary,embedding_size=50, embedding_type='glove', dump_path='./data')

# Create Train-Test Data

In [113]:
train_texts, test_texts, train_summaries, test_summaries = train_test_split(text_indices, summary_indices, test_size=0.1)
input_texts, input_summaries, next_words = create_train_data(train_texts, train_summaries)

In [114]:
max_texts_length = max([len(text) for text in input_texts])
max_summaries_length = max([len(summary) for summary in input_summaries])

print(f"Max text length: {max_texts_length}")
print(f"Max summary length: {max_summaries_length}")

Max text length: 6124
Max summary length: 24


In [115]:
padded_texts = pad_sequences(input_texts, maxlen=max_texts_length)
padded_summaries = pad_sequences(input_summaries, maxlen=max_summaries_length)

In [116]:
label_binarizer = LabelBinarizer()
label_binarizer.fit(list(word_to_id.values()))
next_words = label_binarizer.transform(next_words)

## Export training data for trainer

In [23]:
training_data = pd.DataFrame.from_dict({"texts": padded_texts.tolist(), "summaries": padded_summaries.tolist(), "next_words": next_words.tolist()})

Unnamed: 0,texts,summaries,next_words
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [None]:
training_data.to_csv('data/opinosis-training.csv', index=False)

### Upload training data as artifact to WANDB

See the artifacts on the following [link](https://wandb.ai/aleksandar1932/[NLP]%20lab-03%20%7C%20text-summarization/artifacts/dataset/opinosis-training/31b123948b13683e0d37) under the `opinosis-training` dataset.

In [None]:
run = wandb.init(project=WANDB_PROJECT_NAME, job_type="load_data")
training_artifact = wandb.Artifact(
    "opinosis-training", type="dataset",
    description="Training data generated after the preprocessing of OPINOSIS dataset",
    metadata={"sizes": len(training_data), "pipeline": ["tokenization", "indexing", "start/end tokens", "train/test split", "padding"]}
)

tdf = wandb.Table(data=training_data, columns=training_data.columns, allow_mixed_types=True)
training_artifact.add(tdf, "Training Data")
run.log_artifact(training_artifact)
run.finish()

# Create Model

This section serves as demonstration of how to create a model. It is not necessary to create a model to run the experiment. Model training will be executed on my of my GPU servers, so there will be model imports from files.

**Disclaimer:** Skip this section if you are running this notebook on low-end device.

In [None]:
from wandb.keras import WandbCallback

from scripts.model import create_model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import categorical_crossentropy

In [None]:
model = create_model(max_texts_length, max_summaries_length, len(vocabulary), 50, embeddings)
run = wandb.init(reinit=True, name=model.name)
model.compile(optimizer=Adam(lr=0.01), loss=categorical_crossentropy, metrics=['accuracy'])
model.summary()

In [None]:
model.fit([np.array(padded_texts), np.array(padded_summaries)],
              np.array(next_words),
              batch_size=64, epochs=15, verbose=1, callbacks=[WandbCallback()])
run.finish()

# Evaluate Model on Test Data

For this example, the above model was pre-trained on CUDA enabled hardware, and it's going to be imported from a `/models` directory.

In [117]:
from tensorflow.keras.models import load_model

from scripts.model import decode
from scripts.model import convert

model = load_model('models/opinosis_model-a29091fb-b163-44d4-b8d5-33b3d98d8689.h5')

## Prepare Test Data

In [118]:
padded_texts_test = pad_sequences(test_texts, maxlen=padded_texts.shape[1])
padded_summaries_test = pad_sequences(test_summaries, maxlen=padded_summaries.shape[1])

## Generate Summaries and Evaluate Model

In [None]:
from scripts.model import calculate_rouge

In [None]:
model, input_sent, word_to_id, id_to_word, padding_size, verbose=False

In [132]:
padded_summaries_test_pred = []
for test_sentence in padded_texts_test:
    padded_summaries_test_pred.append(decode(model,test_sentence, word_to_id, id_to_word, padded_summaries.shape[1]))

In [133]:
gt_summaries = convert(padded_summaries,id_to_word)
pred_summaries = convert(padded_summaries_test_pred, id_to_word)

In [134]:
bleu = calculate_bleu(gt_summaries, pred_summaries)
print(f"BLEU score: {bleu}")

BLEU score: 8.792966245470362e-232


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [135]:
rouge_scores = calculate_rouge(gt_summaries, pred_summaries)
print(f"ROUGE scores: {rouge_scores}")

ROUGE scores: [{'rouge1': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224), 'rougeL': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224)}, {'rouge1': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224), 'rougeL': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224)}, {'rouge1': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224), 'rougeL': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224)}, {'rouge1': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224), 'rougeL': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224)}, {'rouge1': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224), 'rougeL': Score(precision=0.041666666666666664, recall=0.04, fmeasure=0.04081632653061224)}, {'rouge1': Score(precision=0.041666666666666664, recall