# Maximize and compare the accuracy of the following extractive summarization models: keyBERT, SciBERTSUM, MemSum

Project work for the exam in [Text Mining by Prof. Gianluca Moro, academic year 2022/23, University of Bologna](https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2022/446610).

The dataset is provided and already split into training, validation and test sets. The models predict two targets: the relevant sentences and the relevant tokens. The accuracy on both targets should be maximized. The models to be used are:

* [MemSum](https://github.com/nianlonggu/memsum)
* [SciBERTSUM](https://github.com/atharsefid/SciBERTSUM)
* [keyBERT](https://github.com/MaartenGr/KeyBERT)
  * The relevant sentences to extract are simply those containing at least k relevant tokens, where k = 1, 2 or 3 is a hyperparameter to be tuned.
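
The sentence-selection rule for keyBERT can be sketched as follows. This is a minimal illustration, not the project's final implementation: the function name `select_sentences`, the naive whitespace tokenization, and the choice to count distinct tokens are assumptions. In practice the relevant tokens would come from the keyword extractor rather than from a hand-written list.

```python
def select_sentences(sentences, relevant_tokens, k=1):
    """Return the sentences that contain at least k distinct relevant tokens."""
    relevant = {t.lower() for t in relevant_tokens}
    selected = []
    for sentence in sentences:
        # Naive tokenization (assumption): split on whitespace, strip punctuation
        tokens = {w.strip(".,!?:;()\"'").lower() for w in sentence.split()}
        if len(tokens & relevant) >= k:
            selected.append(sentence)
    return selected


# Hand-written example data; the keywords are hypothetical
sentences = [
    "Amanda: I baked cookies. Do you want some?",
    "Jerry: Sure!",
    "Amanda: I'll bring you tomorrow :-)",
]
keywords = ["cookies", "baked", "tomorrow"]

for k in (1, 2, 3):
    print(k, select_sentences(sentences, keywords, k=k))
```

A larger k makes the rule stricter, so the extracted summary gets shorter; this is the trade-off the tuning of k has to balance.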

The project should be developed in Colab, with the code commented so that every step is understandable even without a verbal explanation in the discussion session.

### Task description





### Metrics

* ROUGE-N for MemSum ([see this explanation](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460))
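
To make the metric concrete, here is a minimal sketch of ROUGE-N as clipped n-gram overlap between a candidate and a reference summary. The helper names `ngrams` and `rouge_n` and the whitespace tokenization are assumptions for illustration; a maintained ROUGE package should be preferred for reported results.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of the clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```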

### Multi-step Episodic Markov decision process extractive SUMmarizer (MemSum)

MemSum treats extractive summarization as a multi-step episodic Markov decision process trained with reinforcement learning. At each step, the summarizer uses:

1. the text context of the sentence
2. the global text context (!)
3. and the extraction history (!)

While iteratively selecting sentences to extract, it also decides on its own when to stop. It produces concise summaries with little redundancy.

Although the model is lightweight, it achieves state-of-the-art performance on long documents from PubMed, arXiv, and GovReport.
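
The role of the extraction history and of the stop decision can be illustrated with a toy loop. This is not MemSum's implementation: the learned policy is replaced by a hypothetical hand-written `novelty_score`, and stopping is approximated by a score threshold instead of a learned stop probability.

```python
def novelty_score(sentences, i, history):
    """Toy scorer (assumption): fraction of words in sentence i not yet covered by the history."""
    covered = set()
    for j in history:
        covered |= set(sentences[j].lower().split())
    words = set(sentences[i].lower().split())
    return len(words - covered) / len(words) if words else 0.0


def extract_iteratively(sentences, score_fn, stop_threshold=0.6, max_sentences=7):
    """Pick one sentence per step, re-scoring the remaining ones given the history."""
    history = []  # indices of the sentences extracted so far
    remaining = list(range(len(sentences)))
    while remaining and len(history) < max_sentences:
        scores = {i: score_fn(sentences, i, history) for i in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < stop_threshold:  # toy stand-in for the stop action
            break
        history.append(best)
        remaining.remove(best)
    return [sentences[i] for i in history]


doc = ["a b c d", "a b c", "e f", "a b"]
print(extract_iteratively(doc, novelty_score))
```

Because each step re-scores the remaining sentences against what has already been extracted, redundant sentences lose their score and the loop ends early.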


### SciBERTSUM

### keyBERT

### Summary

Model | Technology | Performance | Context | History
---|---|---|---|---
MemSum | Markov decision processes | xy | local, global | extraction

### Expected results

* MemSum: should perform very well on long texts and produce shorter summaries than the other models

## Preparation

[Recommendations on how to install packages from a notebook](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/)

In [1]:
# !python3 -m venv env/ # only if you don't have a virtual environment
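# note: each `!` command runs in its own subshell, so this activation does not carry over to later cells
!source env/bin/activate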

In [7]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Looking in links: https://download.pytorch.org/whl/cu116/torch_stable.html
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [1]:
import os
import numpy as np
import pandas as pd
import config 
import json

## Installing the extractive summarization models

### MemSum

The model expects:

- [x] The training data is stored in a `.jsonl` file with one JSON object per line, one line per training instance. Each JSON object (or dictionary) contains two keys:
  1. "text": a Python list of sentences representing the document to summarize;
  2. "summary": also a list of sentences, representing the ground-truth summary. It is stored as a list because the summary can contain multiple sentences.
- [x] high-ROUGE episodes for the training set
- [x] download GloVe into the `model/glove/` folder

- [ ] furthermore, we need to build a dataset that uses tokens instead of sentences as the basic unit
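
As a small illustration of the expected format, the sketch below serializes one training instance to a single JSONL line and parses it back. The instance content is a hand-written example, not taken from the dataset files.

```python
import json

# One training instance: both values are lists of sentences
instance = {
    "text": [
        "Amanda: I baked cookies. Do you want some?",
        "Jerry: Sure!",
        "Amanda: I'll bring you tomorrow :-)",
    ],
    "summary": [
        "Amanda: I baked cookies. Do you want some?",
        "Amanda: I'll bring you tomorrow :-)",
    ],
}

# json.dumps never emits raw newlines by default, so one instance fits on one line
line = json.dumps(instance)
restored = json.loads(line)

print(line)
print(restored.keys())
```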


In [57]:
os.chdir('MemSum')  # the folder name must be a string
os.getcwd()

'/home/work/Dokumente/Studium/SimTech_MSc/Erasmus/Lectures/Text_mining/project/ExtractiveSummaryModels/MemSum'

In [54]:
# preprocessing custom data
with open(os.path.join(config.ENV_PATH, config.DATA_PATH, config.name_trainset + '.json')) as f:
    json_data = json.load(f)

memsum_training = os.path.join(config.ENV_PATH, config.DATA_PATH, str('memsum_' + config.name_trainset + ".jsonl"))
with open(memsum_training, 'w') as outfile:
    for l in json_data:
        dialog_temp = {}

        # the dialogue is the text; its sentences are separated by '#'
        text_temp = []
        for s in l['dialogue'].split('#'):
            # to normalize between train and test, remove the newlines around each sentence
            s = s.strip('\n') # TODO: this accidentally also strips characters such as ')'
            # skip empty strings
            if s == '':
                continue
            # drop a single trailing space; TODO: this does not work as expected
            text_temp.append(s[:-1] if s.endswith(' ') else s)
        dialog_temp['text'] = text_temp
        
        # the relevant dialogue is the summary
        dialog_temp['summary'] = [s.strip('#') for s in l['relevant_dialogue']] # remove the # because it is also removed from the training

        # # check if sth goes wrong
        # for s in dialog_temp['summary']:
        #     if s not in dialog_temp['text']:
        #         print(f"The summary of the sentence '{s}' is not in text {dialog_temp['text']}")

        # dialog_temp['finegrained_relevant_dialogue']: I don't think this model can work with this

        outfile.write(json.dumps(dialog_temp)+'\n')


In [55]:
# check that everything is fine
# read the file back line by line and close it afterwards
with open(os.path.join(config.ENV_PATH, config.DATA_PATH, 'memsum_' + config.name_trainset + ".jsonl")) as f:
    train_corpus = [json.loads(line) for line in f]

# number of training instances, followed by a look at the first one
print(len(train_corpus))
print(train_corpus[0].keys())
print(train_corpus[0]["text"][:3])
print(train_corpus[0]["summary"][:3])


14729
dict_keys(['text', 'summary'])
[' Amanda: I baked  cookies. Do you want some?', ' Jerry: Sure!', " Amanda: I'll bring you tomorrow :-)"]
[' Amanda: I baked  cookies. Do you want some?', " Amanda: I'll bring you tomorrow :-)"]


In [58]:
! mkdir -p data/
! cp ../data/memsum_train.jsonl data/

- [ ] repeat for the validation and test corpora

In [59]:
from src.data_preprocessing.MemSum.utils import greedy_extract
import json
from tqdm import tqdm

train_corpus = [ json.loads(line) for line in open("data/memsum_train.jsonl") ]
for data in tqdm(train_corpus):
    high_rouge_episodes = greedy_extract( data["text"], data["summary"], beamsearch_size = 2 )
    indices_list = []
    score_list  = []

    for indices, score in high_rouge_episodes:
        indices_list.append( indices )
        score_list.append(score)

    data["indices"] = indices_list
    data["score"] = score_list

 26%|██▋       | 3895/14729 [03:51<10:44, 16.81it/s]


KeyboardInterrupt: 

In [None]:
with open("data/memsum_train_labelled.jsonl","w") as f:
    for data in train_corpus:
        f.write(json.dumps(data) + "\n")

## Training the models on the new data

### MemSum

Note:
1. You need to switch to the folder `src/MemSum_Full`.
2. You can specify the paths to the training and validation sets, the `model_folder` (where model checkpoints are stored), the `log_folder` (where log information is stored), and other parameters.
3. Paths can be absolute or relative, as shown in the example command below.
4. `n_device` is the number of available GPUs.

In [None]:
!cd src/MemSum_Full; python train.py -training_corpus_file_name ../../data/memsum_train_labelled.jsonl -validation_corpus_file_name ../../data/custom_data/val_CUSTOM_raw.jsonl -model_folder ../../model/MemSum_Full/custom_data/200dim/run0/ -log_folder ../../log/MemSum_Full/custom_data/200dim/run0/ -vocabulary_file_name ../../model/glove/vocabulary_200dim.pkl -pretrained_unigram_embeddings_file_name ../../model/glove/unigram_embeddings_200dim.pkl -max_seq_len 100 -max_doc_len 500 -num_of_epochs 10 -save_every 1000 -n_device 1 -batch_size_per_device 4 -max_extracted_sentences_per_document 7 -moving_average_decay 0.999 -p_stop_thres 0.6

## Tuning the model hyperparameters, if applicable

### MemSum

* num_heads: (int, default = 8)
* hidden_dim: (int, default = 1024)
* N_enc_l: (int, default = 3)
* N_enc_g: (int, default = 3)
* N_dec: (int, default = 3)
* max_seq_len: (int, default = 100)
* max_doc_len: (int, default = 50)
* num_of_epochs: (int, default = 50)
* print_every: (int, default = 100)
* save_every: (int, default = 500)
* validate_every: (int, default = 1000)
* restore_old_checkpoint: (bool, default = True)
* learning_rate: (float, default = 1e-4)
* warmup_step: (int, default = 1000)
* weight_decay: (float, default = 1e-6)
* dropout_rate: (float, default = 0.1)
* n_device: (int, default = 8)
* batch_size_per_device: (int, default = 16)
* max_extracted_sentences_per_document: (int)
* moving_average_decay: (float)
* p_stop_thres: (float, default = 0.7)
* apply_length_normalization: (int, default = 1)
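
One possible way to organize the tuning is a small grid over a few of these parameters, generating one set of command-line arguments per run. The sketch below only builds the argument lists; the chosen parameters, the candidate values, and the helper name `build_runs` are assumptions, and the single-dash flag style follows the training command used in this notebook.

```python
from itertools import product

# Hypothetical search space; the candidate values are placeholders
grid = {
    "p_stop_thres": [0.5, 0.6, 0.7],
    "max_extracted_sentences_per_document": [3, 5, 7],
}


def build_runs(grid):
    """Return one dict per parameter combination with a run name and CLI arguments."""
    names = list(grid)
    runs = []
    for run_id, values in enumerate(product(*(grid[n] for n in names))):
        args = []
        for name, value in zip(names, values):
            args += [f"-{name}", str(value)]
        runs.append({"run": f"run{run_id}", "args": args})
    return runs


runs = build_runs(grid)
for run in runs[:3]:
    print(run["run"], " ".join(run["args"]))
```

Each argument list can be appended to the fixed part of the training command, with a separate model and log folder per run so that checkpoints do not overwrite each other.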

## Comparing the models

* length of the summaries the models produce
* human evaluation by experts on three criteria: non-redundancy, coverage, and overall quality
* model runtimes
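
The two automatic criteria, summary length and runtime, can be measured with a few lines. This is a sketch under assumptions: every model is wrapped in a callable that maps a document (a list of sentences) to a summary (a list of sentences), and the `lead_2` baseline is a hypothetical stand-in for a real model.

```python
import time
from statistics import mean


def summary_length_stats(summaries):
    """Average number of sentences and whitespace tokens per summary."""
    n_sentences = [len(s) for s in summaries]
    n_tokens = [sum(len(sent.split()) for sent in s) for s in summaries]
    return {"avg_sentences": mean(n_sentences), "avg_tokens": mean(n_tokens)}


def timed(summarize, documents):
    """Run a summarizer over all documents and return (summaries, seconds elapsed)."""
    start = time.perf_counter()
    summaries = [summarize(doc) for doc in documents]
    return summaries, time.perf_counter() - start


def lead_2(doc):
    """Hypothetical baseline: take the first two sentences."""
    return doc[:2]


documents = [["a b", "c d e", "f"], ["g"]]
summaries, seconds = timed(lead_2, documents)
print(summary_length_stats(summaries), f"{seconds:.6f}s")
```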

## Summary and Conclusion

## Citations

* Nianlong Gu, Elliott Ash, and Richard Hahnloser. 2022. [MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes.](https://aclanthology.org/2022.acl-long.450) In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6507–6522, Dublin, Ireland. Association for Computational Linguistics.
* Sefid, Athar, and C. Lee Giles. [SciBERTSUM: Extractive Summarization for Scientific Documents.](https://doi.org/10.1007/978-3-031-06555-2_46) Document Analysis Systems: 15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022, Proceedings. Cham: Springer International Publishing, 2022.
* Grootendorst, Maarten. [KeyBERT: Minimal keyword extraction with BERT.](https://doi.org/10.5281/zenodo.4461265) Zenodo (2020).