# Maximize and compare the accuracy of the following extractive summarization models: keyBERT, SciBERTSUM, MemSum

Project work to complete the exam in [Text Mining by Prof. Gianluca Moro in the AC 2022/23 in Bologna](https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2022/446610).

The dataset is given and already divided into train, test and validation. The models have two targets to predict: relevant sentences and the relevant tokens. The accuracy on both targets should be maximized. The models to be used are:

* [MemSum](https://github.com/nianlonggu/memsum)
* [SciBERTSUM](https://github.com/atharsefid/SciBERTSUM)
* [keyBERT](https://github.com/MaartenGr/KeyBERT)
  * Simply the relevant sentences to be extracted will be those containing at least k relevant tokens, with k = 1, 2 or 3 that is an hyper parameter to be tuned.

The project should be developed in colab with code commented in such a way that every step is understandable even without your verbal explanation in the discussion session.

### Task description

A dataset of social media-like dialoge is given with a ground truth of the sentences and the tokens (of the sentences) most relevant.

The task is now to use different models trained on summarization, adapt them to the dataset given and compare their performance on the tasks of

* extractive summarization
* keyword generation



### Metrics

* ROUGE-N for MemSum, keyBERT [see](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460)

### Multi-step Episodic Markov decision process extractive SUMmarizer (MemSum)

Based on reinforcement-learning episodic Markov decision processes, this extractive summarizer uses at each step

1. the text context of the sentence
2. the global text context (!)
3. and the extraction history (!)

While iteratively selecting further sentences to extract, it also auto-selects the stop state. It produces concise summaries with little redudancy.

Even though that the model is lightweight, it has SOTA performance on long documents from PubMed, arXiv, and GovReport


### SciBERTSUM

Extends BERTSUM by

1. adding a section embedding layer to include section information in the sentence vector
2. applying a sparse attention mechanism where each sentence will attend locally to nearby sentences and (randomly) to a small number of global sentences




### keyBERT

Uses BERT embeddings and pretrained models. Then extracts keyword n-gram candidates and with cosine similarity and diversification selects the best keywords.

### Summary

Model | Technology | performance | context | history | trained on
---|---|---|---|---|---
MemSum | Markov decision processes | xy | local, global | extraction | arXiv, PubMed, GovReport
SciBERTSUM | BERT with attention | xy | local, global | no | articles with presentation slides as ground truth
keyBERT | BERT embeddings with cosine similarity | xy | cosine similarity, diversification | in diversification | own

### Expected results

* MemSum, SciBERTSUM should be very good for long texts
* MemSum should generate short summaries compared to other models, less redundant
* keyBERT should be fast

## Preparation

[Recommendations how to install packages](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/)

In [1]:
# !python3 -m venv env/ # only if you don't have a virtual environment
!source env/bin/activate

In [3]:
import sys
!{sys.executable} -m pip install -r requirements.txt

distutils: /usr/local/lib/python3.8/dist-packages
sysconfig: /usr/lib/python3.8/site-packages[0m
distutils: /usr/local/lib/python3.8/dist-packages
sysconfig: /usr/lib/python3.8/site-packages[0m
distutils: /usr/local/include/python3.8/UNKNOWN
sysconfig: /usr/include/python3.8[0m
distutils: /usr/local/bin
sysconfig: /usr/bin[0m
distutils: /usr/local
sysconfig: /usr[0m
user = False
home = None
root = None
prefix = None[0m
Defaulting to user installation because normal site-packages is not writeable
Looking in links: https://download.pytorch.org/whl/cu116/torch_stable.html
Collecting keybert
  Downloading keybert-0.7.0.tar.gz (21 kB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[K     |████████████████████████████████| 85 kB 2.8 MB/s eta 0:00:011
Collecting rich>=10.4.0
  Downloading rich-13.3.1-py3-none-any.whl (239 kB)
[K     |████████████████████████████████| 239 kB 4.5 MB/s eta 0:00:01
[?25hCollecting transformers<5.0.0,>=4.6.0
  Do

In [1]:
import os
import numpy as np
import pandas as pd
import config 
import json

## Installing the extractive summarization models

### MemSum

The model expects:

- [x] The training data is now stored in a .jsonl file that contains a list of json info, one line for one training instance. Each json (or dictonary) contains two keys:
  1. "text": the value for which is a python list of sentences, this represents the document you want to summarize;
  2. "summary": the value is also a list of sentences. If represent the ground-truth summary. Because the summary can contain multiple sentences, so we store them as a list.
- [x] repeat for not-training corpus
- [x] high-ROUGE episodes for the training set
- [x] download glove into the `model/glove/` folder

- [ ] we need to furthermore make a dataset with all the tokes instead of sentences as the basic thing


In [7]:
os.chdir('MemSum-015ddda')
os.getcwd()

'/home/work/Dokumente/Studium/SimTech_MSc/Erasmus/Lectures/Text_mining/project/ExtractiveSummaryModels/MemSum-015ddda'

In [19]:
# preprocessing custom data
def preprocessing_memsum_data_sentences(name_dataset):
    with open(os.path.join(config.ENV_PATH, config.DATA_PATH, name_dataset + '.json')) as f:
        json_data = json.load(f)

    memsum_training = os.path.join(config.ENV_PATH, config.DATA_PATH, str('memsum_sentences_' + name_dataset + ".jsonl"))
    with open(memsum_training, 'w') as outfile:
        for l in json_data:
            dialog_temp = {}

            # the dialogue is the text
            text_temp = []
            sentences = l['dialogue'].split('#')
            for s in sentences:
                # remove empty strings
                # to normalize between train and test, we need to remove the empty end of the sentence
                s = s.strip('\n') # TODO: this accidently also strips stuff like ')'
                if s == '':
                    continue
                else:
                    if s[-1] == ' ': # TODO: this does not work as expected
                        text_temp.append(s[:-1])
                    else:
                        text_temp.append(s)
            dialog_temp['text'] = text_temp
            
            # the relevant dialogue is the summary
            dialog_temp['summary'] = [s.strip('#') for s in l['relevant_dialogue']] # remove the # because it is also removed from the training

            # # check if sth goes wrong
            # for s in dialog_temp['summary']:
            #     if s not in dialog_temp['text']:
            #         print(f"The summary of the sentence '{s}' is not in text {dialog_temp['text']}")

            # dialog_temp['finegrained_relevant_dialogue']: I don't think this model can work with this

            outfile.write(json.dumps(dialog_temp)+'\n')

memsum_datasets = [config.name_trainset, config.name_valset, config.name_testset]

! mkdir -p data/

for d in memsum_datasets:
    preprocessing_memsum_data_sentences(d)

    # check if the preprocessing worked
    corpus = [ json.loads(line) for line in open(os.path.join(config.ENV_PATH, config.DATA_PATH, str('memsum_sentences_' + d + ".jsonl"))) ]
    print(f"For the {d} corpus of length {len(corpus)} we have the keys {corpus[0].keys()}")
    print(corpus[0]["text"][:3])
    print(corpus[0]["summary"][:3])
    print("-----------------------------------\n")

    data_path = f"../data/memsum_sentences_{d}.jsonl"

    ! cp $data_path data/

For the train corpus of length 14729 we have the keys dict_keys(['text', 'summary'])
[' Amanda: I baked  cookies. Do you want some?', ' Jerry: Sure!', " Amanda: I'll bring you tomorrow :-)"]
[' Amanda: I baked  cookies. Do you want some?', " Amanda: I'll bring you tomorrow :-)"]
-----------------------------------

For the val corpus of length 818 we have the keys dict_keys(['text', 'summary'])
[' A: Hi Tom, are you busy tomorrow’s afternoon?', ' B: I’m pretty sure I am. What’s up?', ' A: Can you go with me to the animal shelter?.']
[' A: Hi Tom, are you busy tomorrow’s afternoon?', ' A: Can you go with me to the animal shelter?.', ' A: I want to get a puppy for my son.']
-----------------------------------

For the test corpus of length 819 we have the keys dict_keys(['text', 'summary'])
[" Hannah: Hey, do you have Betty's number?", ' Amanda: Lemme check', ' Hannah: <file_gif>']
[" Hannah: Hey, do you have Betty's number?", " Amanda: Sorry, can't find it.", ' Amanda: Ask Larry']
-----

In [21]:
from src.data_preprocessing.MemSum.utils import greedy_extract
import json
from tqdm import tqdm

train_corpus = [ json.loads(line) for line in open("data/memsum_sentences_train.jsonl") ]
for data in tqdm(train_corpus):
    high_rouge_episodes = greedy_extract( data["text"], data["summary"], beamsearch_size = 2)
    indices_list = []
    score_list  = []

    for indices, score in high_rouge_episodes:
        indices_list.append( indices )
        score_list.append(score)

    data["indices"] = indices_list
    data["score"] = score_list

100%|██████████| 14729/14729 [14:38<00:00, 16.77it/s]


In [22]:
with open("data/memsum_sentences_train_labelled.jsonl","w") as f:
    for data in train_corpus:
        f.write(json.dumps(data) + "\n")

### keyBERT

According to the documentation one should not do preprocessing.

- [x] keyBERT wants the data as a list of documents
- [x] `CountVectorizer` to remove stop words and specify the length of the keywords
- [x] `sentence-transformers` to create high-quality embeddings
- [x] calculate the cosine similarity between candidates and the document
- [x] trade-off accuracy and diversity to better represent the whole document

In [23]:
! pip show keybert

Name: keybert
Version: 0.7.0
Summary: KeyBERT performs keyword extraction with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/keyBERT
Author: Maarten Grootendorst
Author-email: maartengrootendorst@gmail.com
License: UNKNOWN
Location: /home/work/.local/lib/python3.8/site-packages
Requires: sentence-transformers, scikit-learn, rich, numpy
Required-by: 


In [2]:
import pickle

# preprocessing custom data
def preprocessing_keybert_data_tokens(name_dataset):
    with open(os.path.join(config.ENV_PATH, config.DATA_PATH, name_dataset + '.json')) as f:
        json_data = json.load(f)

    keybert_docs, keybert_keywords = [], []

    for l in json_data:
        keybert_docs.append(l['dialogue'])
        keybert_keywords.append(l['finegrained_relevant_dialogue'])

    keybert_docs_path = os.path.join(config.ENV_PATH, config.DATA_PATH, str('keybert_token_docs_' + name_dataset + ".txt"))
    keybert_keywords_path = os.path.join(config.ENV_PATH, config.DATA_PATH, str('keybert_token_keywords_' + name_dataset + ".txt"))
    with open(keybert_docs_path, 'wb') as outfile_docs:
        pickle.dump(keybert_docs, outfile_docs)
    with open(keybert_keywords_path, 'wb') as outfile_keywords:
        pickle.dump(keybert_keywords, outfile_keywords)

keybert_datasets = [config.name_trainset, config.name_valset, config.name_testset]

for d in keybert_datasets:
    preprocessing_keybert_data_tokens(d)

- [ ] token limit? then split the documents into paragraphs and use mean pooling for the resulting vector

The resulting keywords are n-grams of different size. But if we get the range of the keywords we can directly use the parameter `keyphrase_ngram_range` to feed it into `keyBERT`.

- [ ] calculate `n_gram_range` for data

Given the choices of bert models, the passage models (queries from bing -> relevant passages) make the most sense. To compare to the other publications, also the model trained on scientific citations can be used. The following models make sense:

- [ ] use the well performing `distilbert — base-nli-stsb-mean-tokens` or `xlm-r-distilroberta-base-paraphase-v1`

* 

In [3]:
from keybert import KeyBERT

# hyperparameters
keyphrase_ngram_range = (1, 4)
maxsum_args = {'top_n': 5} # 'use_maxsum': True, 'nr_candidates': 20, # `MaxSumSimilarity` with `nr_candidates` $< 20%$ of total number of unique words
mmr_args = {'use_mmr': True, 'diversity': 0.7}
model = "all-MiniLM-L6-v2" # stick to sentence transformers models as they are optimized
shared_args = {'keyphrase_ngram_range': keyphrase_ngram_range, 'stop_words': 'english', 'use_maxsum': True, 'nr_candidates': 20,}

# load data
with open(os.path.join(config.ENV_PATH, config.DATA_PATH, str('keybert_token_docs_' + "train" + ".txt")),'rb') as f:
    train_docs = pickle.load(f)

with open(os.path.join(config.ENV_PATH, config.DATA_PATH, str('keybert_token_keywords_' + "train" + ".txt")), 'rb') as f:
    train_keywords = pickle.load(f)

# for testing
doc = train_docs[0]
keywords = train_keywords[0]

# model
kw_model = KeyBERT(model=model)

# prepare embeddings to save time when changeing hyperparameters
# use docs instead of doc in production
doc_embeddings, word_embeddings = kw_model.extract_embeddings(doc,
                                                              **shared_args)
                                                            #   keyphrase_ngram_range=keyphrase_ngram_range,  
                                                            #   stop_words='english')

keywords_candidates = kw_model.extract_keywords(doc,
                                                doc_embeddings=doc_embeddings, word_embeddings=word_embeddings,
                                                highlight=True,
                                                **shared_args,
                                                **maxsum_args,)

print(keywords)
print(keywords_candidates)

2023-02-08 18:56:37.599356: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-08 18:56:37.721154: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ThirdParty-9/platforms/linux64Gcc/gperftools-svn/lib:/opt/openfoam9/platforms/linux64GccDPInt32Opt/lib/paraview-5.6:/opt/paraviewopenfoam56/lib:/opt/openfoam9/platforms/linux64GccDPInt32Opt/lib/openmpi-system:/opt/ThirdParty-9/platforms/linux64GccDPInt32/lib/openmpi-system:/usr/lib/x86_64-linux-gnu/openmpi/lib:/home/work/OpenFOAM/work-9/platforms/linux64GccDPInt32Opt/lib:/opt/site/9/platfo

ValueError: Make sure that the `word_embeddings` are generated from the function `.extract_embeddings`. 
Moreover, the `candidates`, `keyphrase_ngram_range`,`stop_words`, and `min_df` parameters need to have the same values in both `.extract_embeddings` and `.extract_keywords`.

## Training the models on the new data

### MemSum

Note:
1. you need to switch to the folder src/MemSum_Full;
2. You can specify the path to training and validation set, the model_folder (where you want to store model checkpoints) and the log_folder (where you want to store the log info), and other parameters. 
3. You can provide the absolute path, or relative path, as shown in the example code below.
4. n_device means the number of available GPUs

In [None]:
!cd src/MemSum_Full; python train.py -training_corpus_file_name ../../data/memsum_train_labelled.jsonl -validation_corpus_file_name ../../data/custom_data/val_CUSTOM_raw.jsonl -model_folder ../../model/MemSum_Full/custom_data/200dim/run0/ -log_folder ../../log/MemSum_Full/custom_data/200dim/run0/ -vocabulary_file_name ../../model/glove/vocabulary_200dim.pkl -pretrained_unigram_embeddings_file_name ../../model/glove/unigram_embeddings_200dim.pkl -max_seq_len 100 -max_doc_len 500 -num_of_epochs 10 -save_every 1000 -n_device 1 -batch_size_per_device 4 -max_extracted_sentences_per_document 7 -moving_average_decay 0.999 -p_stop_thres 0.6

## Tuning the model hyperparameters, if applicable

### MemSum

* num_heads: (int, default = 8
* hidden_dim: (int, default = 1024
* N_enc_l: (int, default = 3
* N_enc_g: (int, default = 3
* N_dec: (int, default = 3
* max_seq_len: (int, default = 100
* max_doc_len: (int, default = 50
* num_of_epochs: (int, default = 50
* print_every: (int, default = 100
* save_every: (int, default = 500
* validate_every:  (int, default= 1000
* restore_old_checkpoint: (bool, default = True)
* learning_rate: (float, default = 1e-4
* warmup_step:  (int, default= 1000
* weight_decay: (float, default = 1e-6)
* dropout_rate: (float, default = 0.1)
* n_device: (int, default = 8)
* batch_size_per_device: (int, default = 16)
* max_extracted_sentences_per_document: (int)
* moving_average_decay: (float)
* p_stop_thres: (float, default = 0.7
* apply_length_normalization: (int, default = 1

### SciBERTSUM

* attention mechanism
* paper: learning rate is one of the most important hyper-parameters to tune

### keyBERT

* different embeddings
* number of keywords extracted, n-gram range (for keyphrases)

## Comparing the models

* what is the length of the summaries the models produce?
* for human experts three criteria: non-redundancy, coverage, and overall quality
* model runtimes

## Summary and Conclusion

## Citations

* Nianlong Gu, Elliott Ash, and Richard Hahnloser. 2022. [MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes.](https://aclanthology.org/2022.acl-long.450) In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6507–6522, Dublin, Ireland. Association for Computational Linguistics.
* Sefid, Athar, and C. Lee Giles. [SciBERTSUM: Extractive Summarization for Scientific Documents.](https://doi.org/10.1007/978-3-031-06555-2_46) Document Analysis Systems: 15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022, Proceedings. Cham: Springer International Publishing, 2022.
* Grootendorst, Maarten. [KeyBERT: Minimal keyword extraction with BERT.](https://doi.org/10.5281/zenodo.4461265) Zenodo (2020).