# Training Longformer on LiSCU data
Below, we use a Longformer model that is trained on preprocessed data from the LiSCU dataset. This is split as follows:

0) Project overview  
1) Imports and Data Preprocessing  
2) Fine-tuning LongformerEncoderDecoder (LED) Model  
3) Evaluating LED Model  
4) Generating Final Output  
5) (Optional) Data Visualization

## 0) Project Overview

We will be using [this repo](https://github.com/allenai/longformer) to implement the LongformerEncoderDecoder (LED) model.  

To do this, we plan to accomplish the following steps:

1.   **Imports and Data Preprocessing**: Import [LiSCU data](https://github.com/huangmeng123/lit_char_data_wayback) and preprocess data as well
2.   **Fine-tuning LongformerEncoderDecoder (LED) Model**: Load in LED Model and fine-tune and train the model on LiSCU data, in epochs
3.   **Evaluating LED Model**: This involves model validation and testing to get an estimate of its performance on unseen data, perhaps in reference to LiSCU outputs
4.   **Generate Final Output**: This involves output generation of analyses on new data, i.e. using the trained model to generate character arc analyses for the chapters of new book data
5. **(Optional) Data Visualization**


**General overview**

*   In this project, we will be training the LongformerEncoderDecoder (LED) model on LiSCU data to output character analyses on new book data.
*   We use the LED model instead of the normal Longformer model because the LED model supports Seq2Seq tasks with long input.
*   We train the LED model on LiSCU data which contains character names, summaries, and character descriptions.
*   At the end, we generate new inputs on unseen books.


Input: character name and summary  
output: description  
Train on both



## 1) Imports and Data Preprocessing

Let's begin with necessary Python imports and checking for GPU usage:

In [1]:
!pip install transformers
!pip install tensorflow
!pip install datasets
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!git clone https://github.com/allenai/longformer.git

fatal: destination path 'longformer' already exists and is not an empty directory.


In [3]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import tensorflow as tf
from datasets import Dataset, load_metric
from transformers import LEDTokenizer, LEDForConditionalGeneration
from IPython.display import display, HTML
import random

# this is how we select a GPU if it's avalible on your computer or in the Colab environment.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [4]:
!nvidia-smi

Tue May  2 14:52:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    24W / 300W |      2MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import nltk
nltk.download("punkt")
from nltk import tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Now, let's import our [LiSCU data](https://github.com/huangmeng123/lit_char_data_wayback):

In [6]:
# from google.colab import files
# print("Upload the .json files here. Note that the files will only be accessible while the current notebook is running.")
# uploaded = files.upload()

Now that we have our data in the notebook, we should preprocess the data, and we must make sure the LED model can take in the data properly.

In [7]:
# Preprocessing
df_train = pd.read_json('liscu_train.jsonl', lines=True)
df_test = pd.read_json('liscu_test.jsonl', lines=True)
df_val = pd.read_json('liscu_val.jsonl', lines=True)

In [8]:
df_train['inputs'] = df_train[['character_name','summary']].agg("</s>".join, axis=1)
df_test['inputs'] = df_test[['character_name','summary']].agg("</s>".join, axis=1)
df_val['inputs'] = df_val[['character_name','summary']].agg("</s>".join, axis=1)

In [9]:
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")

### Old Preprocessing

In [10]:
#DO NOT RUN THIS CELL, WE WILL CREATE A FUNCTION TO FORMAT DATA FOR US
# #Tokenize character names
# for df in [df_train, df_test, df_val]:
#     df['tokenized_character_name'] = list(df['character_name'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize summaries
# for df in [df_train, df_test, df_val]:
#     df['tokenized_summary'] = list(df['summary'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize descriptions
# for df in [df_train, df_test, df_val]:
#     df['tokenized_description'] = list(df['description'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize character names
# for df in [df_train, df_test, df_val]:
#     df['tokenized_character_name'] = list(df['character_name'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Concatenate tokenized character name and summary
# for df in [df_train, df_test, df_val]:
#     df['char_name+summary'] = list(df.apply(lambda x: x['tokenized_character_name'] + [tokenizer.sep_token_id] + x['tokenized_summary'], axis=1))
# # Concatenate tokenized summary and description
# for df in [df_train, df_test, df_val]:
#     df['input_text'] = list(df.apply(lambda x: x['tokenized_summary'] + [tokenizer.sep_token_id] + x['tokenized_description'], axis=1))
# # Mask descriptions
# for df in [df_train, df_test, df_val]:
#     df['masked_description'] = list(df['description'].apply(lambda x: x.replace(x, '[MASK]')))
# # Print the first 10 entries
# print("Tokenized summaries:")
# print(df_train['tokenized_summary'][:10])
# print("\nTokenized descriptions:")
# print(df_train['tokenized_description'][:10])
# print("\nInput texts:")
# print(df_train['input_text'][:10])
# print("\nMasked descriptions:")
# print(df_train['masked_description'][:10])

# # Display HTML version
# #display(HTML(df_train.to_html()))
# # Print the first 10 entries
# print("Tokenized summaries:")
# print(df_test['tokenized_summary'][:10])
# print("\nTokenized descriptions:")
# print(df_test['tokenized_description'][:10])
# print("\nInput texts:")
# print(df_test['input_text'][:10])
# print("\nMasked descriptions:")
# print(df_test['masked_description'][:10])

# # Display HTML version
# display(HTML(df_test.to_html()))
# # Print the first 10 entries
# # print("Tokenized summaries:")
# # print(df_val['tokenized_summary'][:10])
# # print("\nTokenized descriptions:")
# # print(df_val['tokenized_description'][:10])
# # print("\nInput texts:")
# # print(df_val['input_text'][:10])
# # print("\nMasked descriptions:")
# # print(df_val['masked_description'][:10])

# # Display HTML version
# display(HTML(df_val.to_html()))

### New Preprocessing

In [11]:
max_input_length = 2048
max_output_length = 256
batch_size = 4

In [12]:
def preprocess_df(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["inputs"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )

    outputs = tokenizer(
        batch["description"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask


    glob = []
    for x in range(len(batch['input_ids'])):
        i = 0
        gl = []
        while batch['input_ids'][x][i] != 2:
            gl.append(1)
            i += 1
        gl.append(1)
        gl += [0]*(len(batch['input_ids'][x])-i-1)
        glob.append(gl)
    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = glob

    # since above lists are references, the following line changes the 0 index for all samples
    # batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

In [13]:
train_dataset = Dataset.from_pandas(df_train)
eval_dataset = Dataset.from_pandas(df_val)

In [14]:
train_dataset = train_dataset.map(
    preprocess_df,
    batched=True,
    batch_size=batch_size,
    remove_columns=list(df_train.columns),
)

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

In [15]:
eval_dataset = eval_dataset.map(
    preprocess_df,
    batched=True,
    batch_size=batch_size,
    remove_columns=list(df_val.columns),
)

Map:   0%|          | 0/942 [00:00<?, ? examples/s]

In [16]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
eval_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

In [17]:
# experiment = train_dataset.to_pandas()
# trial = df_train['inputs'].apply(lambda x: tokenizer(x)['input_ids'])
# trial = trial.apply(lambda x: len(x))

## 2) Fine-tuning LongformerEncoderDecoder (LED) Model

To-do list for training on longformer:

1.   Final preprocessing (may need data in List[List[str]] or List[str] format
2.   Collate_fn function
3.   Define configuration (LEDConfig)
4.   Create a new data processor class that handles loading and preprocessing your data. Start with the "summarization.py" file and modify it as needed.
5.   Start training the model

Use [this link](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) as needed for reference.

Now, let's try training the model. First, we must format our inputs into tensors (or List?) so that the LED model can take them in correctly. Then, we train the model.

We use HuggingFace to implement this model.

In [18]:
from transformers import AutoModelForSeq2SeqLM

In [19]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

Downloading pytorch_model.bin:   0%|          | 0.00/648M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

In [20]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 2048
led.config.min_length = 256
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

In [21]:
rouge = load_metric("rouge")

  rouge = load_metric("rouge")


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [22]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

In [23]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [24]:
# enable fp16 apex training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
)

In [25]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

In [None]:
trainer.train()



Step,Training Loss,Validation Loss,Rouge2 Precision,Rouge2 Recall,Rouge2 Fmeasure
10,3.3412,3.037866,0.0257,0.101,0.0397




## 3) Evaluating LED Model

In [None]:
# Import LED model
model = LEDForConditionalGeneration.from_pretrained('allenai/led-base-16384')

Downloading pytorch_model.bin:   0%|          | 0.00/648M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

In [None]:
# On GPU
def generate_description(batch):
  # concatenate character names and summaries
  inputs = [name + summary for name, summary in zip(batch["character_name"], batch["summary"])]

  # tokenize concatenated inputs
  inputs_dict = tokenizer(inputs, padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cuda")
  attention_mask = inputs_dict.attention_mask.to("cuda")

  # create global attention mask
  global_attention_mask = torch.zeros_like(attention_mask)
  global_attention_mask[:, :len(batch["character_name"])+1] = 1

  # generate character description
  predicted_desc_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512, num_beams=2)
  batch["predicted_description"] = tokenizer.batch_decode(predicted_desc_ids, skip_special_tokens=True)
  return batch


In [None]:
# On CPU
def generate_description(batch):
  # concatenate character names and summaries
  inputs = [name + summary for name, summary in zip(batch["character_name"], batch["summary"])]

  # tokenize concatenated inputs
  inputs_dict = tokenizer(inputs, padding="max_length", max_length=2048, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cpu")
  attention_mask = inputs_dict.attention_mask.to("cpu")

  # create global attention mask
  global_attention_mask = torch.zeros_like(attention_mask)
  global_attention_mask[:, :len(batch["character_name"])+1] = 1

  # generate character description
  predicted_desc_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=256, num_beams=2)
  batch["predicted_description"] = tokenizer.batch_decode(predicted_desc_ids, skip_special_tokens=True)
  return batch


In [None]:
val_dataset = Dataset.from_pandas(df_val)

val_dataset_small = val_dataset.select(range(100))
result_val_small = val_dataset_small.map(generate_description, batched=True, batch_size=2)

## Entire dataset
# result_val = val_dataset.map(
#     generate_description,
#     batched=True,
#     batch_size=2,
# )

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Let's use the Rouge score to evaluate the performance of our model

In [None]:
rouge = load_metric("rouge")

  rouge = load_metric("rouge")


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [None]:
rouge.compute(predictions=result_val_small["predicted_description"], references=result_val_small["description"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.02346665933374044, recall=0.06562404604927308, fmeasure=0.03377754130241542)

This score is like really bad lol

## 4) Generate Final Output

## (Optional) Data Visualization