<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Statics" data-toc-modified-id="Statics-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Statics</a></span></li><li><span><a href="#Load-relevant-data" data-toc-modified-id="Load-relevant-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load relevant data</a></span></li><li><span><a href="#Load-model" data-toc-modified-id="Load-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Load model</a></span></li><li><span><a href="#Evaluate" data-toc-modified-id="Evaluate-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluate</a></span></li></ul></div>

# Imports

In [1]:
import os

os.chdir("/home/ivanr/git/document_information_extraction/")

In [2]:
%load_ext lab_black

In [3]:
import pickle

import pandas as pd

import torch
import transformers
from transformers import BartTokenizer, BartForConditionalGeneration

In [4]:
from src.data.wikipedia.wiki_data_base import (
    retrieve_query,
    retrive_observations_from_ids,
)

# Statics

In [5]:
from src.data.data_statics import (
    MIN_SEMANTIC_SIMILARITY,
    MIN_NOVELTY,
    MAX_NOVELTY,
    MAX_TOKENS_BODY,
)

In [6]:
RANDOM_SEED = 0
N_SAMPLE_TEXTS = 1000

In [7]:
QUERY_SUITABLE_ARTICLES = f"""
SELECT ar.*,
       nv.novelty_tokens,
       nv.novelty_bigrams,
       nv.novelty_trigrams,
       cs.semantic_similarity
       
FROM article_level_info ar
INNER JOIN wiki_article_novelty nv
    ON ar.pageid = nv.pageid
INNER JOIN wiki_article_cosine_similarity cs
    ON ar.pageid = cs.pageid
WHERE cs.semantic_similarity>={MIN_SEMANTIC_SIMILARITY}
    AND nv.novelty_tokens<={MAX_NOVELTY}
    AND nv.novelty_tokens>={MIN_NOVELTY}
    AND ar.body_word_count<={MAX_TOKENS_BODY}
"""

# Load relevant data

In [8]:
characterisation_df = pd.DataFrame(
    retrieve_query(QUERY_SUITABLE_ARTICLES),
    columns=[
        "pageid",
        "title",
        "summary_word_count",
        "body_word_count",
        "novelty_tokens",
        "novelty_bigrams",
        "novelty_trigrams",
        "semantic_similarity",
    ],
)
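
The thresholds in `QUERY_SUITABLE_ARTICLES` are interpolated into the SQL text with an f-string. A minimal sketch of the same filter with bound parameters follows, assuming a SQLite backend (the notebook does not confirm which database `retrieve_query` talks to). The threshold values and the three toy rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

# Hypothetical thresholds; the real values live in src.data.data_statics.
thresholds = {"min_sim": 0.6, "min_nov": 0.2, "max_nov": 0.6, "max_body": 1024}

# Same joins and filters as QUERY_SUITABLE_ARTICLES, with named placeholders.
QUERY = """
SELECT ar.pageid
FROM article_level_info ar
INNER JOIN wiki_article_novelty nv ON ar.pageid = nv.pageid
INNER JOIN wiki_article_cosine_similarity cs ON ar.pageid = cs.pageid
WHERE cs.semantic_similarity >= :min_sim
  AND nv.novelty_tokens BETWEEN :min_nov AND :max_nov
  AND ar.body_word_count <= :max_body
"""

# Toy in-memory database with simplified, assumed table layouts.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE article_level_info (
        pageid INTEGER, title TEXT,
        summary_word_count INTEGER, body_word_count INTEGER
    );
    CREATE TABLE wiki_article_novelty (pageid INTEGER, novelty_tokens REAL);
    CREATE TABLE wiki_article_cosine_similarity (
        pageid INTEGER, semantic_similarity REAL
    );
    INSERT INTO article_level_info VALUES
        (1, 'kept', 50, 300), (2, 'too long', 50, 5000), (3, 'dissimilar', 50, 300);
    INSERT INTO wiki_article_novelty VALUES (1, 0.4), (2, 0.4), (3, 0.4);
    INSERT INTO wiki_article_cosine_similarity VALUES (1, 0.8), (2, 0.8), (3, 0.1);
    """
)

# The driver binds the values, so the SQL text never changes with the thresholds.
selected_pageids = [row[0] for row in conn.execute(QUERY, thresholds)]
print(selected_pageids)
```

Binding keeps the query string constant and avoids formatting surprises when a threshold is `None` or a string.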

In [9]:
print(f"length of file: {len(characterisation_df)}")
characterisation_df.head()

length of file: 527327


| | pageid | title | summary_word_count | body_word_count | novelty_tokens | novelty_bigrams | novelty_trigrams | semantic_similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | 330 | Actrius | 57 | 342 | 0.424242424242424 | 0.837838 | 0.921053 | 0.674556 |
| 1 | 340 | Alain Connes | 61 | 308 | 0.388888888888889 | 0.804878 | 0.902439 | 0.796262 |
| 2 | 683 | Adventure | 80 | 661 | 0.476190476190476 | 0.836735 | 0.918367 | 0.669381 |
| 3 | 772 | Ampere | 185 | 684 | 0.394366197183099 | 0.735043 | 0.874016 | 0.880544 |
| 4 | 787 | Alismatales | 62 | 721 | 0.486486486486487 | 0.869565 | 0.977778 | 0.70724 |


In [10]:
pageids_to_evaluate = list(
    characterisation_df["pageid"].sample(n=N_SAMPLE_TEXTS, random_state=RANDOM_SEED)
)

In [11]:
ARTICLE_GENERATOR = retrive_observations_from_ids(pageids_to_evaluate)

In [12]:
def decode_row(article):
    pageid = article[0]
    # section_title = article[1]
    summary = article[2]
    body = "".join(pickle.loads(article[3]))
    return pageid, summary, body
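
To make the row layout that `decode_row` relies on explicit, here is a small round trip on a hypothetical row. The tuple order (pageid, section titles, summary, pickled list of body sections) is an assumption inferred from the indices used above, not something the database schema in this notebook confirms. `decode_row` is repeated so the block runs on its own.

```python
import pickle


def decode_row(article):
    pageid = article[0]
    summary = article[2]
    # The body is stored as a pickled list of section texts
    body = "".join(pickle.loads(article[3]))
    return pageid, summary, body


# Hypothetical row built by hand; real rows come from retrive_observations_from_ids.
example_row = (
    1,
    ["Early life", "Career"],
    "A short summary.",
    pickle.dumps(["First section. ", "Second section."]),
)

pageid, summary, body = decode_row(example_row)
print(pageid, summary, body)
```

Because the sections are joined with an empty string, any spacing between sections has to come from the stored text itself.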

# Load model

In [13]:
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", torch_device)

Device:  cuda


In [14]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(
    torch_device
)

In [15]:
def bart_summarize(
    text,
    num_beams=4,
    length_penalty=2,
    max_length=1024,
    min_length=52,
    no_repeat_ngram_size=3,
):
    """Summarize text using BART."""

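    # Replace newlines with spaces so adjacent paragraphs are not glued together
    text = text.replace("\n", " ")
    # truncation=True is required for max_length to actually cap the input length
    text_input_ids = tokenizer.batch_encode_plus(
        [text], return_tensors="pt", max_length=1024, truncation=True
    )["input_ids"].to(torch_device)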

    summary_ids = model.generate(
        text_input_ids,
        num_beams=int(num_beams),
        length_penalty=float(length_penalty),
        max_length=int(max_length),
        min_length=int(min_length),
        no_repeat_ngram_size=int(no_repeat_ngram_size),
    )
    summary_txt = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
    return summary_txt

# Evaluate

In [16]:
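# Iterate over every article; after the loop, pageid, summary and body
# hold the values of the last article only.
for article in ARTICLE_GENERATOR:
    pageid, summary, body = decode_row(article)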

In [17]:
# model_summary = bart_summarize(body, length_penalty=0)

In [18]:
print(body)


Born in 1748, Roe was the only surviving child of Robert Roe (died 1753) of Brinwith, Glamorganshire, and his wife, Hester (died 1760), daughter of William Wraxall of Bristol. In 1775, he married Susan Margaret (died 1831), daughter of Sir William Thomas, 2nd Baronet (died 1777), of Yapton Place; they had five children: William Thomas Roe (1776–1834); Louisa Georgiana Roe (1778–1843); George Henry Popham Roe and Edward Wrexhall Roe, who both died in infancy; and Frederick Adair Roe (1789–1866). Roe was close to his eldest son William and thought highly of him.


A provincial from Bristol working as a customs official, Roe attracted the attention of Thomas Anguish after writing a series of critical, moralistic articles concerning Lord Shelburne's 1783 peace treaty with America. That year, Anguish co-opted Roe as a Commissioner for Auditing Public Accounts; the commission, which was tasked with investigating government finances and making recommendations, produced influential reports on

In [19]:
candidate = body
reference = summary

In [20]:
from nltk.translate.bleu_score import sentence_bleu
from nltk import word_tokenize

In [21]:
candidate = word_tokenize(candidate)
reference = word_tokenize(reference)
weights = (1, 0, 0, 0)

In [22]:
# sentence_bleu takes (references, hypothesis): a list of tokenized references
# first, then the tokenized candidate. With the two token lists passed the other
# way round, NLTK treats every candidate token as a separate reference; that call
# returned 0.16071428571428573 here, plus zero-overlap warnings for 2-, 3- and
# 4-grams, so the value is not a valid unigram BLEU score.
sentence_bleu([reference], candidate, weights=weights)
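
As a cross-check on the unigram setting `weights = (1, 0, 0, 0)`, the score can be reproduced by hand. The sketch below is a pure-Python version of unigram BLEU for a single reference: clipped unigram precision multiplied by the brevity penalty. The function name `unigram_bleu` and the toy sentences are illustrative only. In the single-reference case it should agree with NLTK's `sentence_bleu([reference], candidate, weights=(1, 0, 0, 0))`, but it is not a replacement for it.

```python
import math
from collections import Counter


def unigram_bleu(reference_tokens, candidate_tokens):
    """Clipped unigram precision times the brevity penalty (single reference)."""
    if not candidate_tokens:
        return 0.0
    candidate_counts = Counter(candidate_tokens)
    reference_counts = Counter(reference_tokens)
    # Clip each candidate token count by how often it occurs in the reference
    overlap = sum(
        min(count, reference_counts[token])
        for token, count in candidate_counts.items()
    )
    precision = overlap / len(candidate_tokens)
    # Penalise candidates that are shorter than the reference
    c, r = len(candidate_tokens), len(reference_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * precision


toy_reference = "the cat sat on the mat".split()
toy_candidate = "the the cat".split()
score = unigram_bleu(toy_reference, toy_candidate)
print(score)
```

In the toy case every candidate token is covered by the reference, so the precision is 1 and the score is driven entirely by the brevity penalty for the short candidate.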