# HuggingFace Transformers
While spaCy provides good-enough NER cabilitities, the accelerated pace of NLP models in recent years means that we can use pre-trained models that leverage modern machine approaches. [HuggingFace](https://huggingface.co/), a company specializing in open-source models, provides an easy-to-use Python library for applying these models on text contained in Sinopia's RDF and to available full-text.

In [91]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black
import datetime
import json
import pickle
import string

import kglab
import pandas as pd
import rdflib
import requests
import helpers
import widgets

from transformers import pipeline

## Loading Sinopia Stage RDF Text DataFrame

In [2]:
stage_text_nodes = pd.read_json("data/stage-text-nodes.json")

In [6]:
stage_text_nodes.sample(10)

Unnamed: 0,title,url,label,summary
5024,Funeral song,<https://api.stage.sinopia.io/resource/5484503...,,
1946,Data mining,<https://api.stage.sinopia.io/resource/31f7625...,,
6575,Записки об уженье рыбы,<https://api.stage.sinopia.io/resource/661f95f...,,
11404,Lady Susan,<https://api.stage.sinopia.io/resource/f61f60b...,,
747,Everything I Never Told You: A Novel,<https://api.stage.sinopia.io/resource/df7767a...,,
6268,Reflections on Malcolm Forsyth,<https://api.stage.sinopia.io/resource/0f88625...,,
3566,Lords of the Sith,<https://api.stage.sinopia.io/resource/62d3df1...,,"On Ryloth, a planet crucial to the growing Emp..."
8026,Semblanzas de la agricultura de las Américas,<https://api.stage.sinopia.io/resource/d2d48e0...,,
9789,Deep in the art of Texas : a century of painti...,<https://api.stage.sinopia.io/resource/f8fc557...,,
5969,Season in hell,<https://api.stage.sinopia.io/resource/3126f73...,,


## Huggingface NER Pipeline
The HuggingFace transformers library provides very easy-to-use pipelines for running common NLP tasks like NER. We will create a NER pipeline and run the following *summary* value.

In [10]:
print(stage_text_nodes.iloc[3566].summary)

On Ryloth, a planet crucial to the growing Empire as a source of slave labor and the narcotic known as “spice,” an aggressive resistance movement has arisen, led by Cham Syndulla, an idealistic freedom fighter, and Isval, a vengeful former slave. But Emperor Palpatine means to control the embattled world and its precious resources—by political power or firepower—and he will be neither intimidated nor denied. Accompanied by his merciless disciple, Darth Vader, he sets out on a rare personal mission to ensure his will is done.   For Syndulla and Isval, it’s the opportunity to strike at the very heart of the ruthless dictatorship sweeping the galaxy. And for the Emperor and Darth Vader, Ryloth becomes more than just a matter of putting down an insurrection: When an ambush sends them crashing to the planet’s surface, where inhospitable terrain and an army of resistance fighters await them, they will find their relationship tested as never before. With only their lightsabers, the dark side 

In [5]:
ner_pipe = pipeline("ner")

In [7]:
ner_results = ner_pipe(stage_text_nodes.iloc[3566].summary)

In [9]:
for row in ner_results:
    print(row)

{'entity': 'I-LOC', 'score': 0.98946464, 'index': 2, 'word': 'R', 'start': 3, 'end': 4}
{'entity': 'I-LOC', 'score': 0.9601934, 'index': 3, 'word': '##yl', 'start': 4, 'end': 6}
{'entity': 'I-LOC', 'score': 0.9397417, 'index': 4, 'word': '##oth', 'start': 6, 'end': 9}
{'entity': 'I-LOC', 'score': 0.51708376, 'index': 12, 'word': 'Empire', 'start': 43, 'end': 49}
{'entity': 'I-PER', 'score': 0.9909031, 'index': 41, 'word': 'Cha', 'start': 165, 'end': 168}
{'entity': 'I-PER', 'score': 0.9832996, 'index': 42, 'word': '##m', 'start': 168, 'end': 169}
{'entity': 'I-PER', 'score': 0.9986909, 'index': 43, 'word': 'S', 'start': 170, 'end': 171}
{'entity': 'I-PER', 'score': 0.7813112, 'index': 44, 'word': '##yn', 'start': 171, 'end': 173}
{'entity': 'I-PER', 'score': 0.9743239, 'index': 45, 'word': '##du', 'start': 173, 'end': 175}
{'entity': 'I-PER', 'score': 0.8315564, 'index': 46, 'word': '##lla', 'start': 175, 'end': 178}
{'entity': 'I-PER', 'score': 0.99561924, 'index': 55, 'word': 'Is', '

The results returned from the [HuggingFace][HUG] NER contain more information and are structured differently than [spaCy][SPACY]. In the [HuggingFace][HUG] pipeline, there are only four classes of entities:

- `I-PER` for a Person name
- `I-ORG` for an Organization name
- `I-LOC` for a location
- `I-MISC` for a Miscellaneous entity. 

The HuggingFace NER also gives a statistical score on how the model is confident that it matched an entity. Also, the [HuggingFace][HUG] NER pipeline results does some character masking (seen as `##` in the results) for many of the entities.

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/

## Huggingface Summarization Pipeline
Another [HuggingFace][HUG] pipeline is the *summarization* task that takes a large document and automatically summarizes the text. The pipeline leverages a [Bart](https://arxiv.org/abs/1910.13461) model that was fine-tuned on a CNN / Daily Mail data set.

[HUG]: https://huggingface.co/

In [11]:
summarizer = pipeline("summarization")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Downloading:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

In the previous notebook we looked a *Alarmingly suspicious* that is cataloged in Sinopia at https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c. This resource does not have an abstract or BIBFRAME Summary, so let us read in sections of the full-text to this [HuggingFace][HUG] summizer and see if can autogenerate a summary and add it to this RDF graph.

[HUG]: https://huggingface.co/

In [12]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

Let's see what happens if we send the entire full-text to the `summarizer`:

In [13]:
summary_result = summarizer(example1_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (20348 > 1024). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self

The maximum number of tokens that the `summarizer` pipeline can process at a time is **1024** while our full-text has **20,348** tokens. Let us try breaking down our large text into smaller "chunks", send each chunk into the `summarizer` pipeline, capture the resulting summary, and at the end, see if the summaries make sense.

First, we will create a list of all of the words in the full-text and then send and summarize 

In [24]:
example1_words = example1_text.split()
print(f"Total words in full-text {len(example1_words):,}")

Total words in full-text 10,570


In [67]:
summaries = []
start = 0
size = 500
end = size

start_time = datetime.datetime.utcnow()
print(f"Started at {start_time}")
for i in range(22):  # 10,570 / 500 ~= 21
    if end > len(example1_words):
        end = None
    print(f"{i+1} {start:,} to {end:,} words")
    text_chunk = ' '.join(example1_words[start:end]).encode("ascii", errors="ignore").decode().replace(
    "#", ""
)
    result = summarizer(' '.join(example1_words[start:end]), max_length=75)
    summaries.append(result[0].get('summary_text'))
    start += size
    end += size
end_time = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).seconds / 60.} minutes")

Started at 2021-07-18 23:58:45.975149
1 0 to 500 words
2 500 to 1,000 words
3 1,000 to 1,500 words
4 1,500 to 2,000 words
5 2,000 to 2,500 words
6 2,500 to 3,000 words
7 3,000 to 3,500 words
8 3,500 to 4,000 words
9 4,000 to 4,500 words
10 4,500 to 5,000 words
11 5,000 to 5,500 words
12 5,500 to 6,000 words
13 6,000 to 6,500 words
14 6,500 to 7,000 words
15 7,000 to 7,500 words
16 7,500 to 8,000 words
17 8,000 to 8,500 words
18 8,500 to 9,000 words
19 9,000 to 9,500 words
20 9,500 to 10,000 words
21 10,000 to 10,500 words


IndexError: index out of range in self

In [71]:
with open("data/summaries.pkl", "wb+") as fo:
    pickle.dump(summaries, fo)

With each summary being limited to 75 words, we still have a large summary that we may want to reduce even further. 

In [72]:
summary_all = " ".join(summaries)

In [78]:
all_summary_words = summary_all.split()

In [81]:
summarizer(" ".join(all_summary_words[0:500]))

[{'summary_text': " The Lady of Lyons, The Studio, The Vow of the Omani, The Brigands of Calabria, The Serf The Poacher's Doom, The Hunter of the Alp- Thirty-Three Next Birthday . The work is in the public domain, meaning users are free to copy, use, and redistribute the work in part or in whole ."}]

In [82]:
summarizer(" ".join(all_summary_words[500:]))

[{'summary_text': ' The play is founded on incidents which actually occured during the war of the Rebellion . It introduces Ohio’s brave and gallant McPherson . It abounds with the most beautiful tableaux, drill, marches, scenes upon the battle f ^i l, in AndersonviHe .'}]

## Illustration of Cataloging Workflow
To illustrate a possible use of summarization, we will add a BIBFRAME Summary to the original RDF graph for this work.

In [95]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)
work_uri = rdflib.URIRef(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
print(f"Total triples {len(example1_graph)}")

Total triples 48


In [96]:
summary_literal = rdflib.Literal(
    """The Lady of Lyons, The Studio, The Vow of the Omani, The Brigands of Calabria, The Serf The Poacher's Doom, The Hunter of the Alp- Thirty-Three Next Birthday . The work is in the public domain, meaning users are free to copy, use, and redistribute the work in part or in whole. The play is founded on incidents which actually occured during the war of the Rebellion . It introduces Ohio’s brave and gallant McPherson . It abounds with the most beautiful tableaux, drill, marches, scenes upon the battle, in AndersonviHe ."""
)
summary_bnode = rdflib.BNode()
example1_graph.add((work_uri, helpers.BIBFRAME.summary, summary_bnode))
example1_graph.add((summary_bnode, rdflib.RDF.type, helpers.BIBFRAME.Summary))
example1_graph.add((summary_bnode, rdflib.RDFS.label, summary_literal))

In [97]:
print(example1_graph.serialize(format="turtle").decode())

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix bflc: <http://id.loc.gov/ontologies/bflc/> .
@prefix ns1: <http://www.europeana.eu/schemas/edm/> .
@prefix ns2: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sinopia: <http://sinopia.io/vocabulary/> .

<https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c> a bf:Work,
        <http://share-vde.org/rdfBibframe/SuperWork> ;
    rdfs:label "Simpson, J. Palgrave (John Palgrave), 1807-1887. Alarmingly suspicious"@eng ;
    bf:adminMetadata [ a bf:AdminMetadata ;
            bflc:catalogerId "tt434"@eng ;
            bflc:encodingLevel <http://id.loc.gov/vocabulary/menclvl/f> ;
            bf:descriptionConventions <https://id.loc.gov/vocabulary/descriptionConventions/rda> ;
            bf:descriptionLanguage <https://id.loc.gov/vocabulary/languages/eng> ;
            bf:descriptionModifier <https://id.loc.gov/vocabulary/organizations/cty> ;
            bf:source

## Exercise 1

From the `stage_text_nodes`, select a series *title*, *label*, or *summary* values and compare the [spaCy][SPACY] 'en_core_web_sm' NER model with the [Huggingface][HUG] NER model results.

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/