# Training Longformer on LiSCU data
Below, we fine-tune a LongformerEncoderDecoder (LED) model on preprocessed data from the LiSCU dataset. The notebook is organized as follows:

0) Project overview  
1) Imports and Data Preprocessing  
2) Fine-tuning LongformerEncoderDecoder (LED) Model  
3) Evaluating LED Model  
4) Generating Final Output  
5) (Optional) Data Visualization

## 0) Project Overview

We will be using [this repo](https://github.com/allenai/longformer) to implement the LongformerEncoderDecoder (LED) model.  

To do this, we plan to accomplish the following steps:

1.   **Imports and Data Preprocessing**: Import the [LiSCU data](https://github.com/huangmeng123/lit_char_data_wayback) and preprocess it into the format the LED model expects
2.   **Fine-tuning LongformerEncoderDecoder (LED) Model**: Load the pretrained LED model and fine-tune it on the LiSCU data for several epochs
3.   **Evaluating LED Model**: Validate and test the model to estimate its performance on unseen data, possibly in reference to the LiSCU descriptions
4.   **Generate Final Output**: Use the trained model to generate character arc analyses for the chapters of new book data
5. **(Optional) Data Visualization**


**General overview**

*   In this project, we will be training the LongformerEncoderDecoder (LED) model on LiSCU data to output character analyses on new book data.
*   We use the LED model instead of the normal Longformer model because the LED model supports Seq2Seq tasks with long input.
*   We train the LED model on LiSCU data which contains character names, summaries, and character descriptions.
*   At the end, we generate new outputs (character descriptions) for unseen books.


Input: character name and summary  
Output: character description  
Both are used for training: the input is fed to the encoder and the description is the target.
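
As a minimal sketch of this input/output format, the helper below joins a character name and a summary with the `</s>` separator that the notebook later uses for its `inputs` column. The function name `build_input` and the sample record are hypothetical and only serve as an illustration; the record is made up, not taken from LiSCU.

```python
# Hypothetical helper mirroring how the `inputs` column is built in section 1.
SEP = "</s>"  # separator placed between character name and summary


def build_input(character_name: str, summary: str, sep: str = SEP) -> str:
    """Join name and summary into the single string fed to the encoder."""
    return sep.join([character_name, summary])


# Illustrative record (made up for this example).
record = {
    "character_name": "Jane Doe",
    "summary": "Jane leaves home and starts a new job.",
    "description": "Jane is the determined protagonist.",
}

model_input = build_input(record["character_name"], record["summary"])
target = record["description"]  # what the decoder is trained to produce
print(model_input)
```

Keeping the name in front of the separator makes it easy to locate later, which is what the global attention mask in the preprocessing step relies on.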



## 1) Imports and Data Preprocessing

Let's begin with the necessary installs and Python imports, and check whether a GPU is available:

In [4]:
!pip install transformers
!pip install tensorflow
!pip install datasets
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1
Looking in indexes: https://pypi.org/simple, https://

In [5]:
!git clone https://github.com/allenai/longformer.git

Cloning into 'longformer'...
remote: Enumerating objects: 1240, done.
remote: Counting objects: 100% (215/215), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 1240 (delta 203), reused 198 (delta 198), pack-reused 1025
Receiving objects: 100% (1240/1240), 837.38 KiB | 963.00 KiB/s, done.
Resolving deltas: 100% (838/838), done.


In [6]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import tensorflow as tf
from datasets import Dataset, load_metric
from transformers import LEDTokenizer, LEDForConditionalGeneration
from IPython.display import display, HTML
import random

# Select a GPU if one is available on your computer or in the Colab environment; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# # crash colab to get more RAM
# !kill -9 -1

In [None]:
!nvidia-smi

Wed May  3 00:02:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    46W / 400W |      3MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [7]:
import nltk
nltk.download("punkt")
from nltk import tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Now, let's import our [LiSCU data](https://github.com/huangmeng123/lit_char_data_wayback):

In [None]:
# from google.colab import files
# print("Upload the .json files here. Note that the files will only be accessible while the current notebook is running.")
# uploaded = files.upload()

Now that the data is in the notebook, we preprocess it so that the LED model can take it in properly.
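
Before preprocessing, it can help to confirm that every record has the three fields the rest of the notebook relies on (`character_name`, `summary`, `description`). The following is a minimal sketch using only the standard library; the helper names `load_jsonl` and `missing_fields` are hypothetical, and the sample lines are made up to stand in for the contents of a `.jsonl` file.

```python
import json

REQUIRED_FIELDS = ("character_name", "summary", "description")


def load_jsonl(text: str) -> list:
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(records: list) -> list:
    """Return (index, field) pairs for fields that are absent or empty."""
    problems = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                problems.append((index, field))
    return problems


# Made-up sample lines; the second record has an empty summary.
sample = "\n".join([
    json.dumps({"character_name": "A", "summary": "s1", "description": "d1"}),
    json.dumps({"character_name": "B", "summary": "", "description": "d2"}),
])
records = load_jsonl(sample)
problems = missing_fields(records)
print(problems)
```

Records flagged this way could be dropped or inspected before they reach the tokenizer.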

In [9]:
# Preprocessing
df_train = pd.read_json('liscu_train.jsonl', lines=True)
df_test = pd.read_json('liscu_test.jsonl', lines=True)
df_val = pd.read_json('liscu_val.jsonl', lines=True)

In [10]:
df_train['inputs'] = df_train[['character_name','summary']].agg("</s>".join, axis=1)
df_test['inputs'] = df_test[['character_name','summary']].agg("</s>".join, axis=1)
df_val['inputs'] = df_val[['character_name','summary']].agg("</s>".join, axis=1)

In [11]:
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")


### Old Preprocessing

In [None]:
#DO NOT RUN THIS CELL, WE WILL CREATE A FUNCTION TO FORMAT DATA FOR US
# #Tokenize character names
# for df in [df_train, df_test, df_val]:
#     df['tokenized_character_name'] = list(df['character_name'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize summaries
# for df in [df_train, df_test, df_val]:
#     df['tokenized_summary'] = list(df['summary'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize descriptions
# for df in [df_train, df_test, df_val]:
#     df['tokenized_description'] = list(df['description'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Tokenize character names
# for df in [df_train, df_test, df_val]:
#     df['tokenized_character_name'] = list(df['character_name'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)))
# # Concatenate tokenized character name and summary
# for df in [df_train, df_test, df_val]:
#     df['char_name+summary'] = list(df.apply(lambda x: x['tokenized_character_name'] + [tokenizer.sep_token_id] + x['tokenized_summary'], axis=1))
# # Concatenate tokenized summary and description
# for df in [df_train, df_test, df_val]:
#     df['input_text'] = list(df.apply(lambda x: x['tokenized_summary'] + [tokenizer.sep_token_id] + x['tokenized_description'], axis=1))
# # Mask descriptions
# for df in [df_train, df_test, df_val]:
#     df['masked_description'] = list(df['description'].apply(lambda x: x.replace(x, '[MASK]')))
# # Print the first 10 entries
# print("Tokenized summaries:")
# print(df_train['tokenized_summary'][:10])
# print("\nTokenized descriptions:")
# print(df_train['tokenized_description'][:10])
# print("\nInput texts:")
# print(df_train['input_text'][:10])
# print("\nMasked descriptions:")
# print(df_train['masked_description'][:10])

# # Display HTML version
# #display(HTML(df_train.to_html()))
# # Print the first 10 entries
# print("Tokenized summaries:")
# print(df_test['tokenized_summary'][:10])
# print("\nTokenized descriptions:")
# print(df_test['tokenized_description'][:10])
# print("\nInput texts:")
# print(df_test['input_text'][:10])
# print("\nMasked descriptions:")
# print(df_test['masked_description'][:10])

# # Display HTML version
# display(HTML(df_test.to_html()))
# # Print the first 10 entries
# # print("Tokenized summaries:")
# # print(df_val['tokenized_summary'][:10])
# # print("\nTokenized descriptions:")
# # print(df_val['tokenized_description'][:10])
# # print("\nInput texts:")
# # print(df_val['input_text'][:10])
# # print("\nMasked descriptions:")
# # print(df_val['masked_description'][:10])

# # Display HTML version
# display(HTML(df_val.to_html()))

### New Preprocessing

In [None]:
max_input_length = 2048
max_output_length = 256
batch_size = 8

In [None]:
def preprocess_df(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["inputs"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )

    outputs = tokenizer(
        batch["description"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask


    # Build the global attention mask: give global attention (1) to every token
    # up to and including the first </s> separator (that is, <s>, the character
    # name, and the separator); all other tokens keep local attention (0).
    glob = []
    for ids in batch["input_ids"]:
        sep_index = ids.index(tokenizer.eos_token_id)
        glob.append([1] * (sep_index + 1) + [0] * (len(ids) - sep_index - 1))
    batch["global_attention_mask"] = glob

    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
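
To make the two masking steps in `preprocess_df` concrete, here is a toy sketch on hand-written token ids, with no tokenizer involved. It assumes the BART-style special ids used by the LED tokenizer (`<s>` = 0, `<pad>` = 1, `</s>` = 2); the other ids and the helper names are made up for illustration.

```python
# Special token ids assumed for this toy example (BART-style vocabulary).
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def global_attention_mask(input_ids: list, sep_id: int = EOS_ID) -> list:
    """1 for every position up to and including the first separator, else 0."""
    sep_index = input_ids.index(sep_id)
    return [1] * (sep_index + 1) + [0] * (len(input_ids) - sep_index - 1)


def mask_padding(label_ids: list, pad_id: int = PAD_ID) -> list:
    """Replace padding ids with -100 so the loss ignores those positions."""
    return [-100 if token == pad_id else token for token in label_ids]


# Layout: <s> name name </s> summary summary </s> <pad> <pad>
toy_input = [0, 11, 12, 2, 21, 22, 2, 1, 1]
# Layout: <s> desc desc </s> <pad> <pad>
toy_labels = [0, 31, 32, 2, 1, 1]

toy_mask = global_attention_mask(toy_input)
toy_targets = mask_padding(toy_labels)
print(toy_mask)
print(toy_targets)
```

Only the start token, the character name, and the first separator receive global attention; the summary tokens and the padding do not.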

In [None]:
train_dataset = Dataset.from_pandas(df_train)
eval_dataset = Dataset.from_pandas(df_val)

In [None]:
train_dataset = train_dataset.map(
    preprocess_df,
    batched=True,
    batch_size=batch_size,
    remove_columns=list(df_train.columns),
)

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

In [None]:
eval_dataset = eval_dataset.map(
    preprocess_df,
    batched=True,
    batch_size=batch_size,
    remove_columns=list(df_val.columns),
)

Map:   0%|          | 0/942 [00:00<?, ? examples/s]

In [None]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
eval_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

In [None]:
# experiment = train_dataset.to_pandas()
# trial = df_train['inputs'].apply(lambda x: tokenizer(x)['input_ids'])
# trial = trial.apply(lambda x: len(x))

## 2) Fine-tuning LongformerEncoderDecoder (LED) Model

To-do list for training on longformer:

1.   Final preprocessing (may need data in List[List[str]] or List[str] format)
2.   Collate_fn function
3.   Define configuration (LEDConfig)
4.   Create a new data processor class that handles loading and preprocessing your data. Start with the "summarization.py" file and modify it as needed.
5.   Start training the model

Use [this link](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) as needed for reference.

Now, let's train the model. The preprocessing above already formats our inputs as PyTorch tensors (via `set_format`), which is the form the LED model takes in, so we can load the model and train it.

We use the Hugging Face `transformers` implementation of this model.

In [12]:
from transformers import AutoModelForSeq2SeqLM

In [None]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)


In [None]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 2048
led.config.min_length = 256
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
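
As a rough illustration of what `no_repeat_ngram_size = 3` asks of the decoder, the sketch below computes which next tokens would repeat a trigram that already occurs in the generated sequence. It is a simplified stand-alone version of the idea, not the code that `transformers` runs, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated: list, ngram_size: int) -> set:
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) + 1 < ngram_size:
        return set()  # too short to form any n-gram yet
    # The last (n - 1) tokens are the prefix of the n-gram being formed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in (5, 6), and the trigram (5, 6, 7) already occurred,
# so emitting 7 next would repeat it.
banned = banned_next_tokens([5, 6, 7, 8, 5, 6], ngram_size=3)
print(banned)
```

During generation, the scores of such tokens are suppressed so that no trigram appears twice in the output.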

In [13]:
rouge = load_metric("rouge")


In [None]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [None]:
# enable fp16 apex training
training_args = Seq2SeqTrainingArguments(
    # predict_with_generate=True,
    # evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    logging_steps=100,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
)

In [None]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    # compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

In [None]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 100  | 3.1898 |
| 200  | 3.1065 |
| 300  | 2.9654 |
| 400  | 2.8999 |
| 500  | 2.847  |
| 600  | 2.7598 |
| 700  | 2.7534 |
| 800  | 2.6725 |
| 900  | 2.6655 |
| 1000 | 2.6204 |


TrainOutput(global_step=1185, training_loss=2.8088162707880078, metrics={'train_runtime': 3663.5333, 'train_samples_per_second': 10.373, 'train_steps_per_second': 0.323, 'total_flos': 5.119588366811136e+16, 'train_loss': 2.8088162707880078, 'epoch': 4.99})
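
The `global_step=1185` reported above can be reproduced with a little arithmetic from the training settings. The sketch below assumes a single GPU and assumes that incomplete accumulation groups at the end of an epoch are not counted as an extra optimizer step (floor division); the example count of 7600 comes from the `Map` progress output for the training set.

```python
train_examples = 7600        # size of the training set
per_device_batch_size = 8    # batch_size
grad_accum_steps = 4         # gradient_accumulation_steps
epochs = 5                   # num_train_epochs
train_runtime = 3663.5333    # seconds, from TrainOutput

# Ceiling division: number of mini-batches the dataloader yields per epoch.
batches_per_epoch = -(-train_examples // per_device_batch_size)
# Assumed floor: one optimizer update per full group of accumulated batches.
updates_per_epoch = batches_per_epoch // grad_accum_steps
total_updates = updates_per_epoch * epochs
effective_batch_size = per_device_batch_size * grad_accum_steps
steps_per_second = round(total_updates / train_runtime, 3)

print(batches_per_epoch, updates_per_epoch, total_updates)
print(effective_batch_size, steps_per_second)
```

Under these assumptions each optimizer update sees 32 examples, and the step count and `train_steps_per_second` agree with the logged values.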

## 3) Evaluating LED Model

In [None]:
# Import LED model
# model = LEDForConditionalGeneration.from_pretrained('allenai/led-base-16384')


In [14]:
# On GPU (tensors are moved to whichever device the model is on)
def generate_description(inputs, model):
    # `inputs` is text of the form "name</s>summary"; tokenize it
    inputs_dict = tokenizer(inputs, padding="max_length", max_length=2048, return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to(model.device)
    attention_mask = inputs_dict.attention_mask.to(model.device)

    # create global attention mask: global attention on every token up to and
    # including the first </s> separator (start token, character name, separator)
    global_attention_mask = torch.zeros_like(attention_mask)
    for x in range(len(input_ids)):
        i = 0
        while input_ids[x][i] != tokenizer.eos_token_id:
            i += 1
        global_attention_mask[x][:i+1] = 1

    # generate character description
    predicted_desc_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=2048, num_beams=4)
    return tokenizer.batch_decode(predicted_desc_ids, skip_special_tokens=True)




In [None]:
# trainer.save_model("./ledV2")

In [45]:
!unzip '/content/drive/MyDrive/NLP Final Project/ledV1.zip'

Archive:  /content/drive/MyDrive/NLP Final Project/ledV1.zip
   creating: ledV1/
  inflating: ledV1/generation_config.json  
  inflating: ledV1/config.json       
  inflating: ledV1/training_args.bin  
  inflating: ledV1/vocab.json        
  inflating: ledV1/tokenizer_config.json  
  inflating: ledV1/special_tokens_map.json  
  inflating: ledV1/merges.txt        
  inflating: ledV1/pytorch_model.bin  


In [95]:
lediffversion = AutoModelForSeq2SeqLM.from_pretrained("./ledV1").to("cuda")

In [None]:
# !zip -r ./ledV2.zip ./ledV2

  adding: ledV2/ (stored 0%)
  adding: ledV2/special_tokens_map.json (deflated 85%)
  adding: ledV2/generation_config.json (deflated 33%)
  adding: ledV2/training_args.bin (deflated 48%)
  adding: ledV2/tokenizer_config.json (deflated 81%)
  adding: ledV2/merges.txt (deflated 53%)
  adding: ledV2/vocab.json (deflated 68%)
  adding: ledV2/pytorch_model.bin (deflated 10%)
  adding: ledV2/config.json (deflated 60%)


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from google.colab import files
files.download("./ledV1")

In [60]:
text = df_train.iloc[0]['inputs']
text

'Winston Smith</s>Winston Smith is a member of the Outer Party. He works in the Records Department in the Ministry of Truth, rewriting and distorting history. To escape Big Brother\'s tyranny, at least inside his own mind, Winston begins a diary — an act punishable by death. Winston is determined to remain human under inhuman circumstances. Yet telescreens are placed everywhere — in his home, in his cubicle at work, in the cafeteria where he eats, even in the bathroom stalls. His every move is watched. No place is safe. One day, while at the mandatory Two Minutes Hate, Winston catches the eye of an Inner Party Member, O\'Brien, whom he believes to be an ally. He also catches the eye of a dark-haired girl from the Fiction Department, whom he believes is his enemy and wants him destroyed. A few days later, Julia, the dark-haired girl whom Winston believes to be against him, secretly hands him a note that reads, "I love you." Winston takes pains to meet her, and when they finally do, Juli

In [98]:
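all_results = []
for i in range(20):
  text = df_val.iloc[i]['inputs']
  cand = generate_description(text, lediffversion)[0]
  # Note: the reference passed here is the model input (name + summary), not the
  # gold description, so these scores measure overlap with the source summary.
  results = rouge.compute(predictions=[cand], references=[text], rouge_types=['rouge1', 'rouge2', 'rougeL'])
  # print('cand: ', cand)
  # print("results: ", results)
  # keep the mid (point-estimate) score for each ROUGE type
  some = []
  for name, score in results.items():
    some.append(score.mid)
  all_results.append(some)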
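
Since `all_results` holds one list of scores per example, a natural next step is to average them instead of inspecting individual entries. The following is a minimal sketch of that aggregation; `Score` is a local stand-in for the tuples printed by the metric (assumed to expose `precision`, `recall`, and `fmeasure`), `average_scores` is a hypothetical helper, and the numbers are made up.

```python
from collections import namedtuple
from statistics import mean

# Stand-in for the Score tuples returned by the ROUGE metric (assumption).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def average_scores(all_results: list, names: list) -> dict:
    """Average each ROUGE type over examples.

    `all_results` has one list of Score tuples per example, ordered as `names`.
    """
    averaged = {}
    for position, name in enumerate(names):
        column = [scores[position] for scores in all_results]
        averaged[name] = Score(
            precision=mean(s.precision for s in column),
            recall=mean(s.recall for s in column),
            fmeasure=mean(s.fmeasure for s in column),
        )
    return averaged


# Made-up scores for two examples and two ROUGE types.
toy_results = [
    [Score(0.5, 0.25, 0.5), Score(0.25, 0.5, 0.25)],
    [Score(1.0, 0.75, 0.75), Score(0.75, 0.0, 0.25)],
]
summary = average_scores(toy_results, ["rouge1", "rouge2"])
print(summary["rouge1"])
print(summary["rouge2"])
```

With the notebook's data, `names` would be `['rouge1', 'rouge2', 'rougeL']`, matching the order of `rouge_types` in the loop.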

In [99]:
# print the scores of the last evaluated example (without modifying all_results)
for name, score in results.items():
  print(score.mid)

Score(precision=0.7557603686635944, recall=0.16768916155419222, fmeasure=0.2744769874476987)
Score(precision=0.25462962962962965, recall=0.05629477993858751, fmeasure=0.09220452640402346)
Score(precision=0.3824884792626728, recall=0.08486707566462168, fmeasure=0.13891213389121337)


In [80]:
print(all_results[1])

[Score(precision=0.7358490566037735, recall=0.14758751182592242, fmeasure=0.24586288416075647), Score(precision=0.24644549763033174, recall=0.04924242424242424, fmeasure=0.08208366219415943), Score(precision=0.38207547169811323, recall=0.07663197729422895, fmeasure=0.1276595744680851)]


In [81]:
print(all_results[11])

[Score(precision=0.7201834862385321, recall=0.1549851924975321, fmeasure=0.2550771730300569), Score(precision=0.2488479262672811, recall=0.0533596837944664, fmeasure=0.08787632221318144), Score(precision=0.38073394495412843, recall=0.08193484698914116, fmeasure=0.1348497156783103)]


In [82]:
print(all_results[19])

[Score(precision=0.7242990654205608, recall=0.15848670756646216, fmeasure=0.2600671140939597), Score(precision=0.23943661971830985, recall=0.052200614124872056, fmeasure=0.08571428571428572), Score(precision=0.3598130841121495, recall=0.0787321063394683, fmeasure=0.1291946308724832)]


In [None]:
# On CPU
def generate_description(batch):
  # concatenate character names and summaries
  inputs = [name + summary for name, summary in zip(batch["character_name"], batch["summary"])]

  # tokenize concatenated inputs
  inputs_dict = tokenizer(inputs, padding="max_length", max_length=2048, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cpu")
  attention_mask = inputs_dict.attention_mask.to("cpu")

  # create global attention mask
  global_attention_mask = torch.zeros_like(attention_mask)
  global_attention_mask[:, :len(batch["character_name"])+1] = 1

  # generate character description
  predicted_desc_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=256, num_beams=2)
  batch["predicted_description"] = tokenizer.batch_decode(predicted_desc_ids, skip_special_tokens=True)
  return batch


In [None]:
val_dataset = Dataset.from_pandas(df_val)

val_dataset_small = val_dataset.select(range(100))
result_val_small = val_dataset_small.map(generate_description, batched=True, batch_size=2)

## Entire dataset
# result_val = val_dataset.map(
#     generate_description,
#     batched=True,
#     batch_size=2,
# )

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Let's use the ROUGE score to evaluate the performance of our model.
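
For intuition about what ROUGE-2 measures, here is a simplified stand-alone sketch: it counts overlapping bigrams between a prediction and a reference and turns them into precision, recall, and F-measure. It uses lowercased whitespace tokens and no stemming, so its numbers will not match the `rouge` metric exactly; the helper names are hypothetical.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count adjacent token pairs in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-2: clipped bigram overlap as precision/recall/F1."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # matches, clipped per bigram
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# 3 of the 5 bigrams on each side match: "the cat", "on the", "the mat".
scores = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Precision is the share of predicted bigrams found in the reference, and recall is the share of reference bigrams found in the prediction, which is why a short prediction scored against a long reference tends to show high precision and low recall.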

In [None]:
rouge = load_metric("rouge")


In [None]:
rouge.compute(predictions=result_val_small["predicted_description"], references=result_val_small["description"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.02346665933374044, recall=0.06562404604927308, fmeasure=0.03377754130241542)

This score is very low: the ROUGE-2 F-measure is only about 0.034.

In [54]:
np50 = '/content/drive/MyDrive/NLP Final Project/npArrayOfSummaries50.npy'

In [55]:
hpbook = np.load(np50)

In [65]:
for i in range(len(hpbook) - 1):
  hpbook[i] = 'Harry Potter</s>' + hpbook[i]

In [96]:
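all_results = []
for i in range(len(hpbook) - 1):
  text = hpbook[i]
  cand = generate_description(text, lediffversion)[0]
  # Note: the reference passed here is the input text itself (prefixed summary),
  # so these scores measure overlap between the output and its source.
  results = rouge.compute(predictions=[cand], references=[text], rouge_types=['rouge1', 'rouge2', 'rougeL'])
  # print('cand: ', cand)
  # print("results: ", results)
  # keep the mid (point-estimate) score for each ROUGE type
  some = []
  for name, score in results.items():
    some.append(score.mid)
  all_results.append(some)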

In [97]:
# print the scores of the last evaluated example (without modifying all_results)
for name, score in results.items():
  print(score.mid)

Score(precision=0.8622222222222222, recall=0.04086791657889193, fmeasure=0.07803700724054707)
Score(precision=0.38392857142857145, recall=0.018120522545301308, fmeasure=0.03460764587525151)
Score(precision=0.5022222222222222, recall=0.02380450811038551, fmeasure=0.04545454545454546)


In [75]:
print(all_results[1])

[Score(precision=0.8738317757009346, recall=0.06680957484816005, fmeasure=0.12412877530700298), Score(precision=0.431924882629108, recall=0.032880629020729094, fmeasure=0.06110926602457656), Score(precision=0.48598130841121495, recall=0.037156127188281526, fmeasure=0.0690341851974776), Score(precision=0.48598130841121495, recall=0.037156127188281526, fmeasure=0.0690341851974776)]


In [76]:
print(all_results[11])

[Score(precision=0.8119266055045872, recall=0.04005431093007468, fmeasure=0.07634246279922365), Score(precision=0.25806451612903225, recall=0.012675418741511997, fmeasure=0.024163969795037755), Score(precision=0.4724770642201835, recall=0.023308440823715772, fmeasure=0.044425274962260085), Score(precision=0.4724770642201835, recall=0.023308440823715772, fmeasure=0.044425274962260085)]


In [77]:
print(all_results[12])

[Score(precision=0.7880184331797235, recall=0.08181818181818182, fmeasure=0.1482444733420026), Score(precision=0.3333333333333333, recall=0.03446625179511728, fmeasure=0.06247288503253797), Score(precision=0.42857142857142855, recall=0.04449760765550239, fmeasure=0.08062418725617686), Score(precision=0.42857142857142855, recall=0.04449760765550239, fmeasure=0.08062418725617686)]


In [85]:
cand = generate_description(hpbook[3], lediffversion)[0]
cand

['Harry Potter is the main character of the novel. He is also the narrator and the protagonist of the story. Harry is a young boy who has been to school, but has never been to Hogwarts. He has never heard of a wizard before, and he has never seen anything quite like the giant\'s face. Harry has never met the giant before, but he has always been fascinated by him. When Harry first meets him, he is struck by the way Hagrid looks at him, the way Dudley looks at Harry, and the way Uncle Vernon looks at the giant. Harry\'s first impression of Hagrid is that Hagrid\'s face is a fierce, wild, shadowy figure.  \\"I don\'t know what yeh are, Harry,\\" Hagrid said. Harry looked up at Hagrid and saw that the giant was staring at him with a fierce gaze. He also saw that he had never heard Hagrid speak in such a fierce voice before. Harry thought Hagrid must be a monster, but Hagrid didn\'t seem to know what to say to him. He looked up into the face of the giant, who was staring back at Harry. Harr

In [94]:
print(hpbook[3])



## 4) Generate Final Output

## 5) (Optional) Data Visualization