<a href="https://colab.research.google.com/github/MurtazaKhan24/Text-Summarizer/blob/main/summarization_token_batching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> long-form summarization with token batching

An example for [pszemraj/led-large-book-summary](https://huggingface.co/pszemraj/led-large-book-summary):

- works well for long and/or dense text summarization **because it was trained on [BookSum](https://github.com/salesforce/booksum) and has "learned" explanatory summarization**
- if you are on free-tier Colab, tips on how to adjust this notebook to ensure it runs are marked lower down in <font color="orange">**orange**</font>



by [Peter Szemraj](https://peterszemraj.ch/)


_function design/implementation for the LED decoding largely based on [this notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing) by Patrick von Platen (da real MVP)_

---


In [1]:
#@title define model, text file link

#@markdown <font color="orange"> If out of CUDA memory, try the
#@markdown smaller model `pszemraj/led-base-book-summary`

hf_tag = "pszemraj/led-large-book-summary" #@param ["pszemraj/led-large-book-summary", "pszemraj/led-base-book-summary"] {allow-input: true}

example_url = "https://www.dropbox.com/s/trm1xb6rdjdkgt9/mkdl-An%20Introduction%20to%20Deep%20Reinforcement%20Learning.txt?dl=1" #@param {type:"string"}

# setup

In [2]:
#@title GPU info
#@markdown - the model requires a GPU; go to Runtime -> Change runtime type if needed
!nvidia-smi

Tue Nov 28 23:31:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
#@markdown add auto-Colab formatting with `IPython.display`
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

In [4]:
!pip install -U datasets transformers ninja -q
!pip install -U sentencepiece -q
!pip install clean-text[gpl] -q


In [5]:
from transformers import pipeline
import torch
from cleantext import clean
from pathlib import Path

_device = 0 if torch.cuda.is_available() else -1


In [6]:
#@markdown setup logging
import logging
from pathlib import Path
das_logfile = Path.cwd() / "summarize_tokenbatches.log"

logging.basicConfig(
    level=logging.INFO,
    filename=das_logfile,
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",  # 24-hour clock; %I without %p is ambiguous
)
print(f'logfile is at:\n\n{das_logfile}')

logfile is at:

/content/summarize_tokenbatches.log


In [7]:
import requests
import re

#@markdown define `get_filename`

def get_filename(url:str):
    """
    Parses a URL string to find the filename of the file that it downloads.
    """
    # get the last part of the url
    url_split = url.split('/')
    last_part = url_split[-1]
    # replace "?dl=1"
    last_part = last_part.replace('?dl=1', '')

    if '.' in last_part:
        # if the last part is a file name, return it

        suffix = last_part.split('.', maxsplit=1)[-1]
        file_head = last_part.split('.', maxsplit=1)[0].replace("%", " ")
        # replace all non-alphanumeric or whitespace chars in file stem
        file_stem = re.sub(r'[^\w\s]', '', file_head)
        # replace all whitespace chars in file stem
        file_stem = re.sub(r'\s+', '_', file_stem)
        return file_stem + '.' + suffix


    # replace all non-alphanumeric or whitespace chars in file stem
    file_stem = re.sub(r'[^\w\s]', '', last_part)
    # replace all whitespace chars in file stem
    file_stem = re.sub(r'\s+', '_', file_stem)

    # get the file extension
    r = requests.get(url, stream=True)
    content_type = r.headers['content-type']
    file_extension = content_type.split('/')[-1]

    # return the file name
    return file_stem + '.' + file_extension
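
The function above replaces each `%` with a space, so a percent-encoded name such as `An%20Introduction` keeps a stray `20` (visible in the output filename further down). A minimal sketch of an alternative, assuming the standard-library `urllib.parse` is acceptable here; `filename_from_url` is a hypothetical helper and not part of the original notebook:

```python
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Hypothetical helper: derive a filesystem-friendly name from a URL path."""
    # urlparse keeps the query string (e.g. "?dl=1") out of the path component
    path = PurePosixPath(unquote(urlparse(url).path))
    # keep word characters and whitespace, then collapse whitespace to "_"
    stem = re.sub(r"[^\w\s]", "", path.stem)
    stem = re.sub(r"\s+", "_", stem.strip())
    return stem + path.suffix


print(filename_from_url(
    "https://www.dropbox.com/s/trm1xb6rdjdkgt9/"
    "mkdl-An%20Introduction%20to%20Deep%20Reinforcement%20Learning.txt?dl=1"
))
```

Unlike `get_filename`, this sketch does not fall back to the `content-type` header when the URL has no file extension.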

## load model and tokenizer

In [8]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    hf_tag,
).to('cuda')

config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

In [9]:
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    hf_tag,
)


tokenizer_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

# Functions & Params



In [10]:
#@markdown `get_timestamp(detailed=False)`

from datetime import datetime

def get_timestamp(detailed=False):
    """
    get_timestamp - returns a timestamp in string format

    detailed : bool, optional, default False, if True, returns a timestamp with seconds
    """
    if detailed:
        return datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    else:
        return datetime.now().strftime("%b-%d-%Y")

get_timestamp(True)

'2023-11-28_23-33-15'

### parameters


In [11]:
#@markdown <font color="orange"> decrease `token_batch_length` if running OOM
token_batch_length = 2048 #@param ["16384", "8192", "4096", "3072", "2048"] {type:"raw"}
batch_stride = 20 #@param {type:"integer"}

session_settings = {}
session_settings['token_batch_length'] = token_batch_length
session_settings['batch_stride'] = batch_stride



In [12]:
#@title generation parameters

#@markdown - <font color="orange">  decrease `token_batch_length` if running OOM
#@markdown - <font color="orange"> decrease `number_beams` if running OOM

number_beams = 12 #@param ["16", "12", "8", "4"] {type:"raw"}
min_length = 32 #@param {type:"integer"}
max_len_ratio = 5 #@param {type:"slider", min:2, max:10, step:0.25}
length_penalty = 0.5 #@param {type:"number"}

if token_batch_length > 8192 and number_beams > 8:
    logging.info(f'{number_beams} number_beams too high, reducing')
    number_beams = 8
settings = {
    'min_length':min_length,  # use the form value instead of a hard-coded 32
    'max_length':int(token_batch_length//max_len_ratio),
    'no_repeat_ngram_size':3,
    'encoder_no_repeat_ngram_size' :4,
    'repetition_penalty':3.7,
    'num_beams':number_beams,
    'length_penalty':length_penalty,
    'early_stopping':True,
    'do_sample':False,
}
logging.info(f"using textgen params:\n\n{settings}")
session_settings['num_beams'] = number_beams
session_settings['length_penalty'] = length_penalty
session_settings['max_len_ratio'] = max_len_ratio
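
The `max_length` entry caps each batch summary at `token_batch_length // max_len_ratio` tokens. A small arithmetic sketch using the values set in the parameter cells above; `summary_token_cap` is a hypothetical helper name used only for illustration:

```python
def summary_token_cap(token_batch_length: int, max_len_ratio: float) -> int:
    """Hypothetical helper mirroring int(token_batch_length // max_len_ratio)."""
    return int(token_batch_length // max_len_ratio)


# values from the parameter cells above: 2048-token batches, ratio of 5
print(summary_token_cap(2048, 5))    # 409
# the slider allows fractional ratios in steps of 0.25
print(summary_token_cap(2048, 2.5))  # 819
```

A larger `max_len_ratio` therefore yields shorter summaries per batch.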

## define generation functions


In [13]:
#@markdown single-shot fns
#@markdown - def `generate_answer()`

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]



def generate_answer(batch,**kwargs):

    inputs_dict = tokenizer(batch["text"],
                            padding="max_length", max_length=16384,
                            return_tensors="pt",
                            truncation=True,
                            add_special_tokens =False,
                            )

    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    print(attention_mask, attention_mask.size())
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on <s> token
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
            **kwargs
        )
    # `remove_invalid_values` is a generate() option, not a decoding option
    batch["summary"] = tokenizer.batch_decode(predicted_abstract_ids,
                                                skip_special_tokens=True,
                                                )
    return batch

In [14]:
# @title token batch summarization

#@markdown - def `summarize_and_score(ids, mask, **kwargs)`
#@markdown - def `summarize_via_tokenbatches(input_text:str, batch_length=8192, batch_stride=16, **kwargs,)`
def summarize_and_score(ids, mask, **kwargs):


    ids = ids[None, :]
    mask = mask[None, :]

    input_ids = ids.to("cuda")
    attention_mask = mask.to("cuda")
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on <s> token
    global_attention_mask[:, 0] = 1

    summary_pred_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
            output_scores=True,
            return_dict_in_generate=True,
            **kwargs
        )
    # `remove_invalid_values` is a generate() option, not a decoding option
    summary = tokenizer.batch_decode(
                summary_pred_ids.sequences,
                skip_special_tokens=True,
            )
    score = round(summary_pred_ids.sequences_scores.cpu().numpy()[0], 4)

    return summary, score

def summarize_via_tokenbatches(
        input_text:str,
        batch_length=8192,
        batch_stride=16,
        **kwargs,
    ):

    encoded_input = tokenizer(
                        input_text,
                        padding='max_length',
                        truncation=True,
                        max_length=batch_length,
                        stride=batch_stride,
                        return_overflowing_tokens=True,
                        add_special_tokens =False,
                        return_tensors='pt',
                    )

    in_id_arr, att_arr = encoded_input.input_ids, encoded_input.attention_mask
    gen_summaries = []

    pbar = tqdm(total=len(in_id_arr))

    for _id, _mask in zip(in_id_arr, att_arr):

        result, score = summarize_and_score(
            ids=_id,
            mask=_mask,
            **kwargs,
        )
        score = round(float(score),4)
        _sum = {
            "input_tokens":_id,
            "summary":result,
            "summary_score":score,
        }
        gen_summaries.append(_sum)
        print(f"\t{result[0]}\nScore:\t{score}")
        pbar.update()

    pbar.close()

    return gen_summaries

---


# summarize - single file

In [15]:
#@markdown `wget` the text file
example_path = get_filename(example_url)
example_path = Path(example_path)
!wget $example_url -O $example_path -q

In [16]:
#@markdown read in single file text as `long_text`
with open(example_path, 'r', errors='ignore') as f:
    raw_text = f.read()

long_text = clean(raw_text, lower=False)
logging.info(f"removed {len(raw_text) - len(long_text)} chars via cleaning")
batch = {}
batch['text'] = long_text


encoded_input = tokenizer(
    long_text,
    padding='max_length',
    truncation=True,
    max_length=token_batch_length,
    stride=batch_stride,
    return_overflowing_tokens=True,
    add_special_tokens =False,
    return_tensors='pt',
)

In [17]:
_summaries = summarize_via_tokenbatches(
    long_text,
    batch_length=token_batch_length,
    batch_stride=batch_stride,
    **settings,
)


  0%|          | 0/1 [00:00<?, ?it/s]

	On a lighter note, check back in with us at "What's Up With That Guy?" on Twitter for updates on the goings-on with the Pandemonium subplot. And by "Pandemonium," we mean that there's a whole lot of Googling going on and no one seems to be able to find anyone who can vouch for Pandemonium. Check back in soon for an update on Pandemonium status.
Score:	-8.4968


In [18]:
#@markdown write the `_summaries` var to a `.txt`
sum_text = [s["summary"][0] for s in _summaries]
sum_scores = [f"\n - {round(s['summary_score'],4)}" for s in _summaries]
scores_text = "\n".join(sum_scores)
full_summary = "\n\t".join(sum_text)
_outpath = f"SUM_{example_path.name}"

with open(
    _outpath,
    "w",
) as fo:
    fo.write(full_summary)
    fo.write("\n" * 3)
    fo.write(f"\n\nSection Scores for {example_path.name}:\n")
    fo.write(scores_text)
    fo.write("\n\n---\n")


In [19]:
# print out the summarized output!
!cat $_outpath

On a lighter note, check back in with us at "What's Up With That Guy?" on Twitter for updates on the goings-on with the Pandemonium subplot. And by "Pandemonium," we mean that there's a whole lot of Googling going on and no one seems to be able to find anyone who can vouch for Pandemonium. Check back in soon for an update on Pandemonium status.




Section Scores for mkdlAn_20Introduction_20to_20Deep_20Reinforcement_20Learning.txt:

 - -8.4968

---


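<font color="salmon"> note: the scores are beam-search sequence scores, so values closer to zero mean the model was more confident. In a run with several batches, the middle summary, arguably the worst in quality, received the most negative score. </font>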
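
One way to act on these scores is to sort the per-batch results and inspect the lowest-scoring sections first. The sketch below assumes the list-of-dicts layout returned by `summarize_via_tokenbatches` above (a `summary` list plus a `summary_score` float); `rank_by_score` and the sample values are made up for illustration and are not model output:

```python
def rank_by_score(summaries):
    """Return (batch_index, score, text) tuples, most negative score first."""
    ranked = [
        (idx, s["summary_score"], s["summary"][0])
        for idx, s in enumerate(summaries)
    ]
    return sorted(ranked, key=lambda item: item[1])


# illustrative placeholder data, not real model output
example = [
    {"summary": ["first section"], "summary_score": -4.2},
    {"summary": ["second section"], "summary_score": -9.1},
    {"summary": ["third section"], "summary_score": -5.0},
]
for idx, score, text in rank_by_score(example):
    print(f"batch {idx}: {score:.4f}\t{text}")
```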


---


# summarize - txt directory

> demonstrate use case of iterating over all text files in a directory


## load

In [20]:
zip_url = "https://www.dropbox.com/sh/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa?dl=1" #@param {type:"string"}
target_path = "/content/source-text" #@param {type:"string"}
target_path = Path(target_path)
zip_fname = 'temp.zip'

!rm -r $target_path
!rm $zip_fname

!wget $zip_url -O $zip_fname
!unzip -q $zip_fname -d $target_path

rm: cannot remove '/content/source-text': No such file or directory
rm: cannot remove 'temp.zip': No such file or directory
--2023-11-28 23:33:24--  https://www.dropbox.com/sh/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /sh/dl/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa [following]
--2023-11-28 23:33:25--  https://www.dropbox.com/sh/dl/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-11-28 23:33:25 ERROR 404: Not Found.

[temp.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the las

In [21]:
import random

txt_files = [f for f in target_path.iterdir() if f.is_file() and f.suffix == '.txt']
output_dir = target_path.parent / "summarized-text"
output_dir.mkdir(exist_ok=True)

random.SystemRandom().shuffle(txt_files)
txt_files


FileNotFoundError: ignored

## summarize

- runs the model in a loop over each text file in the directory.
- for each text file: the text is split into batches of `token_batch_length` tokens, and consecutive batches overlap by the `batch_stride` tokens defined earlier.
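
The overlapping batches described above can be sketched in plain Python. This is only an approximation of what the tokenizer does with `return_overflowing_tokens=True` and `stride`; the real tokenizer also pads the last batch to `max_length`. `overlapping_windows` is a hypothetical helper that works on a plain list of token ids:

```python
def overlapping_windows(token_ids, window, stride):
    """Split token_ids into windows of `window` tokens that share `stride` tokens."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")
    step = window - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


tokens = list(range(10))
for w in overlapping_windows(tokens, window=4, stride=1):
    print(w)
```

With `window=4` and `stride=1`, each window repeats the last token of the previous one, which gives the model a little shared context across batch boundaries.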



In [None]:
#@markdown summarize loop

for f in tqdm(txt_files, total=len(txt_files)):

    with open(f, 'r', encoding='utf-8', errors='ignore') as fi:
        text = clean(fi.read(), lower=False)
    print(f"\nNow summarizing:\t{f.name}")
    _summaries = summarize_via_tokenbatches(
                    text,
                    batch_length=token_batch_length,
                    batch_stride=batch_stride,
                    **settings,
                )
    sum_text = [s['summary'][0] for s in _summaries]
    sum_scores = [f"\n - {round(s['summary_score'],4)}" for s in _summaries]
    scores_text = "\n".join(sum_scores)
    full_summary = "\n\t".join(sum_text)
    with open(output_dir / f"SUM_{f.name}", 'w') as fo:
        fo.write(full_summary)
    with open(output_dir / "SESSION_SCORES.log", 'a') as fo:
        fo.write("\n"*3)
        fo.write(f"\n\nSection Scores for {f.name}:\n")
        fo.write(scores_text)
        fo.write("\n\n---\n")

---


## export summarized dir

- this downloads a `.zip` of the summarized text files.
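
Outside Colab, where the `!zip` shell command may not be available, the same archive could be built with the standard library. A minimal sketch, assuming `shutil.make_archive` is an acceptable substitute; `zip_directory` is a hypothetical helper and the demo runs on a temporary directory rather than `/content/summarized-text`:

```python
import shutil
import tempfile
from pathlib import Path


def zip_directory(src_dir, archive_stem):
    """Hypothetical helper: zip the contents of src_dir into <archive_stem>.zip."""
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=str(src_dir))
    return Path(archive)


# self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "summarized-text"
    src.mkdir()
    (src / "SUM_example.txt").write_text("placeholder summary\n")
    out = zip_directory(src, Path(tmp) / "summarized_textdir_demo")
    print(out.name, out.exists())
```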



In [None]:
output_zip_tag = "hf_demo" #@param {type:"string"}
output_zip = f"summarized_textdir_{output_zip_tag}.zip"
output_zip = Path(output_zip)
output_zip = Path(output_zip.stem + get_timestamp() + '.zip')
!zip -r -q -9 $output_zip /content/summarized-text

In [None]:
from google.colab import files
files.download(output_zip)