# HuggingFace Transformers
While spaCy provides good-enough NER cabilitities, the accelerated pace of NLP models in recent years means that we can use pre-trained models that leverage modern machine approaches. [HuggingFace](https://huggingface.co/), a company specializing in open-source models, provides an easy-to-use Python library for applying these models on text contained in Sinopia's RDF and to available full-text.

In [8]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black
import datetime
import json
import pickle
import string

import kglab
import pandas as pd
import rdflib
import requests
import helpers
import widgets

from transformers import pipeline

## Loading Sinopia Stage RDF Text DataFrame

In [3]:
stage_text_nodes = pd.read_json("data/stage-text-nodes.json")

In [11]:
stage_text_nodes.sample(10)

Unnamed: 0,title,url,label,summary
1018,Beowulf,<https://api.stage.sinopia.io/resource/eb5d055...,,
1664,Man transform'd,<https://api.stage.sinopia.io/resource/9068f17...,,
583,书店漫游,<https://api.stage.sinopia.io/resource/c95b2b0...,Shu dian man you,
908,The dark archive,<https://api.stage.sinopia.io/resource/71fb7a4...,,
1248,Flotsametrics and the floating world,<https://api.stage.sinopia.io/resource/9484408...,,
1948,Green Mountains review,<https://api.stage.sinopia.io/resource/46a54ac...,,
1585,Brill's companion to the reception of Galen,<https://api.stage.sinopia.io/resource/902e0ad...,,Brill's Companion to the Reception of Galen' p...
2020,Pillars of society,<https://api.stage.sinopia.io/resource/99c162c...,,
563,Silver skates,<https://api.stage.sinopia.io/resource/0970857...,,
404,Бедные люди,<https://api.stage.sinopia.io/resource/b1d0744...,,


## Huggingface NER Pipeline
The HuggingFace transformers library provides very easy-to-use pipelines for running common NLP tasks like NER. We will create a NER pipeline and run the following *summary* value.

In [17]:
stage_text_nodes.iloc[713]

title                              Yakın tarihin gerçekleri 
url        <https://api.stage.sinopia.io/resource/6730510...
label                                                   None
summary                                                 None
Name: 713, dtype: object

In [12]:
print(stage_text_nodes.iloc[712].summary)

None


In [13]:
ner_pipe = pipeline("ner", "bert-large-uncased")

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large

In [14]:
ner_results = ner_pipe(stage_text_nodes.iloc[712].summary)

ValueError: At least one input is required.

In [8]:
for row in ner_results:
    print(row)

{'entity': 'I-PER', 'score': 0.9950322, 'index': 6, 'word': 'Hans', 'start': 24, 'end': 28}
{'entity': 'I-PER', 'score': 0.98719305, 'index': 10, 'word': 'G', 'start': 44, 'end': 45}
{'entity': 'I-PER', 'score': 0.97071385, 'index': 11, 'word': '##ret', 'start': 45, 'end': 48}
{'entity': 'I-PER', 'score': 0.9310411, 'index': 12, 'word': '##el', 'start': 48, 'end': 50}
{'entity': 'I-PER', 'score': 0.99432087, 'index': 23, 'word': 'Hans', 'start': 92, 'end': 96}
{'entity': 'I-PER', 'score': 0.9813799, 'index': 26, 'word': 'G', 'start': 110, 'end': 111}
{'entity': 'I-PER', 'score': 0.95786554, 'index': 27, 'word': '##ret', 'start': 111, 'end': 114}
{'entity': 'I-PER', 'score': 0.83893085, 'index': 28, 'word': '##el', 'start': 114, 'end': 116}
{'entity': 'I-PER', 'score': 0.7370237, 'index': 40, 'word': 'Doctor', 'start': 184, 'end': 190}


The results returned from the [HuggingFace][HUG] NER contain more information and are structured differently than [spaCy][SPACY]. In the [HuggingFace][HUG] pipeline, there are only four classes of entities:

- `I-PER` for a Person name
- `I-ORG` for an Organization name
- `I-LOC` for a location
- `I-MISC` for a Miscellaneous entity. 

The HuggingFace NER also gives a statistical score on how the model is confident that it matched an entity. Also, the [HuggingFace][HUG] NER pipeline results does some character masking (seen as `##` in the results) for many of the entities.

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/

## Huggingface Summarization Pipeline
Another [HuggingFace][HUG] pipeline is the *summarization* task that takes a large document and automatically summarizes the text. The pipeline leverages a [Bart](https://arxiv.org/abs/1910.13461) model that was fine-tuned on a CNN / Daily Mail data set.

[HUG]: https://huggingface.co/

In [9]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


Downloading:   0%|          | 0.00/1.76k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.14G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

In the previous notebook we looked a *Alarmingly suspicious* that is cataloged in Sinopia at https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c. This resource does not have an abstract or BIBFRAME Summary, so let us read in sections of the full-text to this [HuggingFace][HUG] summizer and see if can autogenerate a summary and add it to this RDF graph.

[HUG]: https://huggingface.co/

In [10]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

Let's see what happens if we send the entire full-text to the `summarizer`:

In [11]:
summary_result = summarizer(example1_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (20348 > 1024). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self

The maximum number of tokens that the `summarizer` pipeline can process at a time is **1024** while our full-text has **20,348** tokens. Let us try breaking down our large text into smaller "chunks", send each chunk into the `summarizer` pipeline, capture the resulting summary, and at the end, see if the summaries make sense.

First, we will create a list of all of the words in the full-text and then send and summarize 

In [12]:
example1_words = example1_text.split()
print(f"Total words in full-text {len(example1_words):,}")

Total words in full-text 10,570


In [13]:
summaries = []
start = 0
size = 500
end = size

start_time = datetime.datetime.utcnow()
print(f"Started at {start_time}")
for i in range(22):  # 10,570 / 500 ~= 21
    if end > len(example1_words):
        end = None
    print(f"{i+1} {start:,} to {end:,} words")
    text_chunk = ' '.join(example1_words[start:end]).encode("ascii", errors="ignore").decode().replace(
    "#", ""
)
    result = summarizer(' '.join(example1_words[start:end]), max_length=75)
    summaries.append(result[0].get('summary_text'))
    start += size
    end += size
end_time = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).seconds / 60.} minutes")

Started at 2022-07-07 00:38:49.939029
1 0 to 500 words
2 500 to 1,000 words
3 1,000 to 1,500 words
4 1,500 to 2,000 words
5 2,000 to 2,500 words
6 2,500 to 3,000 words
7 3,000 to 3,500 words
8 3,500 to 4,000 words
9 4,000 to 4,500 words
10 4,500 to 5,000 words
11 5,000 to 5,500 words
12 5,500 to 6,000 words
13 6,000 to 6,500 words
14 6,500 to 7,000 words
15 7,000 to 7,500 words
16 7,500 to 8,000 words
17 8,000 to 8,500 words
18 8,500 to 9,000 words
19 9,000 to 9,500 words
20 9,500 to 10,000 words
21 10,000 to 10,500 words


IndexError: index out of range in self

In [17]:
len(summaries)

20

In [18]:
summaries[19]

' The play is founded on incidents which actually occured during the war of the Rebellion . It introduces Ohio’s brave and gallant McPherson . The actual manner of his capture and death is shewn . It abounds with the most beautiful tableaux, drill, marches, scenes upon the battle f ^i l, in AndersonviHe,'

In [None]:
with open("data/summaries.pkl", "wb+") as fo:
    pickle.dump(summaries, fo)

With each summary being limited to 75 words, we still have a large summary that we may want to reduce even further. 

In [19]:
summary_all = " ".join(summaries)

In [20]:
all_summary_words = summary_all.split()

In [24]:
first_summary = summarizer(" ".join(all_summary_words[0:500]))

In [29]:
first_summary[0]["summary_text"]

" The Lady of Lyons, The Studio, The Vow of the Omani, The Brigands of Calabria, The Serf The Poacher's Doom, The Hunter of the Alp- Thirty-Three Next Birthday . The work is in the public domain, meaning users are free to copy, use, and redistribute the work in part or in whole ."

In [25]:
second_summary = summarizer(" ".join(all_summary_words[500:]))

## Illustration of Cataloging Workflow
To illustrate a possible use of summarization, we will add a BIBFRAME Summary to the original RDF graph for this work.

In [26]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)
work_uri = rdflib.URIRef(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
print(f"Total triples {len(example1_graph)}")

Total triples 48


In [30]:
summary_literal = rdflib.Literal(
    " ".join([first_summary[0]["summary_text"], second_summary[0]["summary_text"]])
)
summary_bnode = rdflib.BNode()
example1_graph.add((work_uri, helpers.BIBFRAME.summary, summary_bnode))
example1_graph.add((summary_bnode, rdflib.RDF.type, helpers.BIBFRAME.Summary))
example1_graph.add((summary_bnode, rdflib.RDFS.label, summary_literal))

<Graph identifier=Nb640cc542c3a4b41ab709a124653ff9f (<class 'rdflib.graph.Graph'>)>

In [32]:
print(example1_graph.serialize(format="turtle"))

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix bflc: <http://id.loc.gov/ontologies/bflc/> .
@prefix ns1: <http://www.europeana.eu/schemas/edm/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sinopia: <http://sinopia.io/vocabulary/> .

<https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c> a bf:Work,
        <http://share-vde.org/rdfBibframe/SuperWork> ;
    rdfs:label "Simpson, J. Palgrave (John Palgrave), 1807-1887. Alarmingly suspicious"@eng ;
    bf:adminMetadata [ a bf:AdminMetadata ;
            bflc:catalogerId "tt434"@eng ;
            bflc:encodingLevel <http://id.loc.gov/vocabulary/menclvl/f> ;
            bf:descriptionConventions <https://id.loc.gov/vocabulary/descriptionConventions/rda> ;
            bf:descriptionLanguage <https://id.loc.gov/vocabulary/languages/eng> ;
            bf:descriptionModifier <https://id.loc.gov/vocabulary/organizations/cty> ;
            bf:source

## Exercise 1

From the `stage_text_nodes`, select a series *title*, *label*, or *summary* values and compare the [spaCy][SPACY] 'en_core_web_sm' NER model with the [Huggingface][HUG] NER model results.

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/