# HuggingFace Transformers
While spaCy provides good-enough NER capabilities, the accelerating improvement of NLP models in recent years means that we can use pre-trained models that leverage modern machine learning approaches. [HuggingFace](https://huggingface.co/), a company specializing in open-source models, provides an easy-to-use Python library for applying these models to text contained in Sinopia's RDF and to available full text.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black
import datetime
import json
import pickle
import string

import kglab
import pandas as pd
import rdflib
import requests
import helpers
import widgets

from transformers import pipeline

## Loading Sinopia Stage RDF Text DataFrame

In [2]:
stage_text_nodes = pd.read_json("data/stage-text-nodes.json")

In [9]:
stage_text_nodes.sample(10)

| | title | summary | url | label |
|---|---|---|---|---|
| 2061 | The Rocks were there. | | <https://api.stage.sinopia.io/resource/a9115a9... | |
| 1219 | Guo li Taiwan da xue li xue yuan zhi wu xue xi... | | <https://api.stage.sinopia.io/resource/7494bfb... | Guo li Taiwan da xue li xue yuan zhi wu xue xi... |
| 170 | Restrukturyzat͡sii͡a nat͡sionaľnoï ekonomiky U... | | <https://api.stage.sinopia.io/resource/3b8239a... | |
| 533 | Work for marimba and synthesizer | | <https://api.stage.sinopia.io/resource/e7e8b3f... | |
| 2099 | FalsaFala | | <https://api.stage.sinopia.io/resource/07779e3... | |
| 1222 | The deserving favorite | | <https://api.stage.sinopia.io/resource/28c0f11... | |
| 310 | Training your boxer | "An informative and fully illustrated manual t... | <https://api.stage.sinopia.io/resource/ba38374... | |
| 476 | Pride and prejudice | | <https://api.stage.sinopia.io/resource/f023120... | |
| 186 | Warres of Pompey and Caesar | | <https://api.stage.sinopia.io/resource/c70bcf6... | |
| 4 | Uprooting, trauma, and confinement | This thesis is a history of psychiatry through... | <https://api.stage.sinopia.io/resource/86893ce... | |


## HuggingFace NER Pipeline
The HuggingFace `transformers` library provides very easy-to-use pipelines for running common NLP tasks like NER. We will create an NER pipeline and run it on the following *summary* value.

In [4]:
stage_text_nodes.iloc[1244]

title                               Organoselenium chemistry
summary    "Selenium plays an important role in the opera...
url        <https://api.stage.sinopia.io/resource/b19dd91...
label                                                   None
Name: 1244, dtype: object

In [5]:
print(stage_text_nodes.iloc[1244].summary)

"Selenium plays an important role in the operation of biological processes. Thus, organoselenium compounds are of current interest in chemistry as well as in biology. This book covers a wide section of selenium chemistry. It provides an overview of the synthesis of a variety of organoselenides including selenourea, selenocarbonyls, selenoamides, selenazadienes and Se-containing heterocycles by various approaches such as coupling, C-H activation, radical reactions, and microwave induced reactions. The applications of selenides in biological processes, pharmacology and as reagents and catalysts have been illustrated"--Page 4 of cover.


In [6]:
ner_pipe = pipeline("ner", "bert-large-uncased")

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large

In [7]:
ner_results = ner_pipe(stage_text_nodes.iloc[1244].summary)

In [8]:
for row in ner_results:
    print(row)

{'entity': 'LABEL_1', 'score': 0.62920445, 'index': 1, 'word': '"', 'start': 0, 'end': 1}
{'entity': 'LABEL_1', 'score': 0.5970293, 'index': 2, 'word': 'se', 'start': 1, 'end': 3}
{'entity': 'LABEL_1', 'score': 0.681782, 'index': 3, 'word': '##len', 'start': 3, 'end': 6}
{'entity': 'LABEL_1', 'score': 0.60181314, 'index': 4, 'word': '##ium', 'start': 6, 'end': 9}
{'entity': 'LABEL_1', 'score': 0.5705185, 'index': 5, 'word': 'plays', 'start': 10, 'end': 15}
{'entity': 'LABEL_0', 'score': 0.6519011, 'index': 6, 'word': 'an', 'start': 16, 'end': 18}
{'entity': 'LABEL_0', 'score': 0.54520696, 'index': 7, 'word': 'important', 'start': 19, 'end': 28}
{'entity': 'LABEL_0', 'score': 0.5138432, 'index': 8, 'word': 'role', 'start': 29, 'end': 33}
{'entity': 'LABEL_0', 'score': 0.65190136, 'index': 9, 'word': 'in', 'start': 34, 'end': 36}
{'entity': 'LABEL_0', 'score': 0.65149987, 'index': 10, 'word': 'the', 'start': 37, 'end': 40}
{'entity': 'LABEL_0', 'score': 0.56185436, 'index': 11, 'word': '

In [10]:
ner2_results = ner_pipe(stage_text_nodes.iloc[310].summary)

In [11]:
for row in ner2_results:
    print(row)

{'entity': 'LABEL_1', 'score': 0.5707234, 'index': 1, 'word': '"', 'start': 0, 'end': 1}
{'entity': 'LABEL_1', 'score': 0.5932133, 'index': 2, 'word': 'an', 'start': 1, 'end': 3}
{'entity': 'LABEL_1', 'score': 0.5583519, 'index': 3, 'word': 'inform', 'start': 4, 'end': 10}
{'entity': 'LABEL_1', 'score': 0.5918726, 'index': 4, 'word': '##ative', 'start': 10, 'end': 15}
{'entity': 'LABEL_1', 'score': 0.5740556, 'index': 5, 'word': 'and', 'start': 16, 'end': 19}
{'entity': 'LABEL_1', 'score': 0.52536446, 'index': 6, 'word': 'fully', 'start': 20, 'end': 25}
{'entity': 'LABEL_0', 'score': 0.52550507, 'index': 7, 'word': 'illustrated', 'start': 26, 'end': 37}
{'entity': 'LABEL_0', 'score': 0.5471797, 'index': 8, 'word': 'manual', 'start': 38, 'end': 44}
{'entity': 'LABEL_1', 'score': 0.6625897, 'index': 9, 'word': 'that', 'start': 45, 'end': 49}
{'entity': 'LABEL_1', 'score': 0.5457408, 'index': 10, 'word': 'covers', 'start': 50, 'end': 56}
{'entity': 'LABEL_0', 'score': 0.5877897, 'index': 

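The results returned from the [HuggingFace][HUG] NER pipeline contain more information and are structured differently than those from [spaCy][SPACY]. With a checkpoint fine-tuned for NER on the CoNLL-2003 data set (for example `dbmdz/bert-large-cased-finetuned-conll03-english`), the pipeline tags only four classes of entities: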

- `I-PER` for a Person name
- `I-ORG` for an Organization name
- `I-LOC` for a location
- `I-MISC` for a Miscellaneous entity. 

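The `bert-large-uncased` checkpoint loaded above has not been fine-tuned for token classification, as its warning notes, which is why the output shows the generic `LABEL_0` and `LABEL_1` rather than these classes.

The HuggingFace NER pipeline also gives a score for how confident the model is in the label it assigned to each token. In addition, the tokenizer splits many words into WordPiece sub-word tokens; continuation pieces are prefixed with `##` in the results.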

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/
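
The `##` pieces in the output above can be stitched back into whole words with a little post-processing. The following is a minimal sketch, not part of the original notebook: `merge_wordpieces` is a hypothetical helper, and averaging the piece scores is an assumption made for illustration.

```python
def merge_wordpieces(rows):
    """Merge WordPiece continuation tokens (prefixed with "##") into whole words."""
    merged = []
    for row in rows:
        if row["word"].startswith("##") and merged:
            # Continuation piece: extend the previous word and its character span
            previous = merged[-1]
            previous["word"] += row["word"][2:]
            previous["end"] = row["end"]
            previous["scores"].append(row["score"])
        else:
            # First piece of a word: keep its label and start a new entry
            merged.append(
                {
                    "entity": row["entity"],
                    "word": row["word"],
                    "start": row["start"],
                    "end": row["end"],
                    "scores": [row["score"]],
                }
            )
    # Assumption: report the mean score of the pieces that make up each word
    for word in merged:
        scores = word.pop("scores")
        word["score"] = sum(scores) / len(scores)
    return merged


# First five rows of the NER output shown above
sample_rows = [
    {"entity": "LABEL_1", "score": 0.62920445, "index": 1, "word": '"', "start": 0, "end": 1},
    {"entity": "LABEL_1", "score": 0.5970293, "index": 2, "word": "se", "start": 1, "end": 3},
    {"entity": "LABEL_1", "score": 0.681782, "index": 3, "word": "##len", "start": 3, "end": 6},
    {"entity": "LABEL_1", "score": 0.60181314, "index": 4, "word": "##ium", "start": 6, "end": 9},
    {"entity": "LABEL_1", "score": 0.5705185, "index": 5, "word": "plays", "start": 10, "end": 15},
]
merged_words = merge_wordpieces(sample_rows)
print([w["word"] for w in merged_words])  # ['"', 'selenium', 'plays']
```

The `transformers` NER pipeline also accepts an `aggregation_strategy` argument that groups tokens into entities for you; the helper above only illustrates what that grouping involves.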

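## HuggingFace Summarization Pipeline
Another [HuggingFace][HUG] pipeline is the *summarization* task, which takes a large document and automatically summarizes the text. By default, the pipeline leverages a [Bart](https://arxiv.org/abs/1910.13461) model that was fine-tuned on the CNN / Daily Mail data set. Note that the cell below passes `bert-large-uncased` instead, which the warning output reports is not a supported summarization model.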

In [12]:
summarizer = pipeline("summarization", "bert-large-uncased")

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The model 'BertForMaskedLM' is not supported for summarization. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M1

In the previous notebook we looked at *Alarmingly suspicious*, which is cataloged in Sinopia at https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c. This resource does not have an abstract or BIBFRAME Summary, so let us read sections of the full text into this [HuggingFace][HUG] summarizer and see if we can autogenerate a summary and add it to this RDF graph.

In [13]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

Let's see what happens if we send the entire full-text to the `summarizer`:

In [14]:
summary_result = summarizer(example1_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (17224 > 512). Running this sequence through the model will result in indexing errors
Input length of input_ids is 17224, but ``max_length`` is set to 20. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


RuntimeError: The expanded size of the tensor (17225) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [1, 17225].  Tensor sizes: [1, 512]

The `summarizer` pipeline can only process a limited number of tokens at a time: the warning above reports a maximum of **512** for this model, while our full text has **17,224** tokens. Let us try breaking down our large text into smaller "chunks", send each chunk into the `summarizer` pipeline, capture the resulting summary, and at the end, see if the summaries make sense.

First, we will create a list of all of the words in the full text and then send them to the `summarizer` one chunk at a time.

In [15]:
example1_words = example1_text.split()
print(f"Total words in full-text {len(example1_words):,}")

Total words in full-text 10,570
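
The chunking idea described above can be sketched as a small generator. This is an illustrative helper rather than the notebook's own code: `chunk_words` is a hypothetical name, the demonstration sentence is made up, and 150 words per chunk is an assumed size.

```python
import math


def chunk_words(words, size):
    """Yield consecutive slices of `words`, each at most `size` items long."""
    for start in range(0, len(words), size):
        yield words[start : start + size]


# Small demonstration with a made-up list of words
demo_words = "the quick brown fox jumps over the lazy dog".split()
demo_chunks = [" ".join(chunk) for chunk in chunk_words(demo_words, 4)]
print(demo_chunks)  # ['the quick brown fox', 'jumps over the lazy', 'dog']

# Number of chunks needed to cover all 10,570 words at 150 words per chunk
chunks_needed = math.ceil(10_570 / 150)
print(chunks_needed)  # 71
```

Because the last slice is simply shorter, no special handling is needed when the word count is not a multiple of the chunk size.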


In [18]:
summaries = []
size = 150  # words per chunk
n_chunks = 22  # 22 chunks of 150 words cover the first 3,300 of the 10,570 words

start_time = datetime.datetime.utcnow()
print(f"Started at {start_time}")
for i in range(n_chunks):
    start = i * size
    # Never slice past the end of the word list
    end = min(start + size, len(example1_words))
    print(f"{i+1} {start:,} to {end:,} words")
    result = summarizer(" ".join(example1_words[start:end]), max_length=75)
    summaries.append(result[0].get("summary_text"))
end_time = datetime.datetime.utcnow()
# Use the datetime values (not the integer word offsets) to report elapsed time
elapsed_minutes = (end_time - start_time).total_seconds() / 60.0
print(f"Finished at {end_time}, total time {elapsed_minutes} minutes")

Input length of input_ids is 268, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


Started at 2022-07-11 15:12:05.838030
1 0 to 150 words


Input length of input_ids is 474, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


2 150 to 300 words


Input length of input_ids is 323, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


3 300 to 450 words


Input length of input_ids is 257, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


4 450 to 600 words


Input length of input_ids is 365, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


5 600 to 750 words


Input length of input_ids is 293, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


6 750 to 900 words


Input length of input_ids is 298, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


7 900 to 1,050 words


Input length of input_ids is 225, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


8 1,050 to 1,200 words


Input length of input_ids is 279, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


9 1,200 to 1,350 words


Input length of input_ids is 237, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


10 1,350 to 1,500 words


Input length of input_ids is 226, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


11 1,500 to 1,650 words


Input length of input_ids is 211, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


12 1,650 to 1,800 words


Input length of input_ids is 272, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


13 1,800 to 1,950 words


Input length of input_ids is 199, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


14 1,950 to 2,100 words


Input length of input_ids is 211, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


15 2,100 to 2,250 words


Input length of input_ids is 221, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


16 2,250 to 2,400 words


Input length of input_ids is 223, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


17 2,400 to 2,550 words


Input length of input_ids is 264, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


18 2,550 to 2,700 words


Input length of input_ids is 214, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


19 2,700 to 2,850 words


Input length of input_ids is 212, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


20 2,850 to 3,000 words


Input length of input_ids is 216, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


21 3,000 to 3,150 words


Input length of input_ids is 264, but ``max_length`` is set to 75. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


22 3,150 to 3,300 words



In [19]:
len(summaries)

22

In [20]:
summaries[19]

'trust you will find everything to your comfort and convenience. lady b. i don ’ t doubt it my dear lady emily, ( apart to sir b. in going out ) triumph, sir barnabas! a few more symptoms of the kind, and the day is our own. before night i ’ ll force them into an avowal of their mar¬ riage. bnz. ( going ) but, permit me, my dear — lady b. hold your tongue, sir barnabas! - ( drags him out disputing l. col. ( springs up from chair, coming down r. ) lady emily, i can ’ t con¬ ceive how a woman of your good sense, could have managed so badly as to have left these importunate and disagreeable personages even the shadow of a pretext for taking you thus by storm, * em. my dear colonel, i have told you, i did not give them even the shadow of a shade. col. at any rate, you didn ’ t want *'

In [None]:
with open("data/summaries.pkl", "wb+") as fo:
    pickle.dump(summaries, fo)

Even with `max_length=75` requested for each chunk (a limit counted in tokens, not words), the combined summaries are still large, so we may want to reduce them even further.

In [21]:
summary_all = " ".join(summaries)

In [22]:
all_summary_words = summary_all.split()

In [23]:
len(all_summary_words)

4417

In [25]:
first_summary = summarizer(" ".join(all_summary_words[0:150]))

Input length of input_ids is 202, but ``max_length`` is set to 20. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


In [26]:
first_summary[0]["summary_text"]

'this file was downloaded from hathitrust digital library. find more books at https : / / www. hathitrust. org. title : alarmingly suspicious, an original comedietta, in one act, by j. palgrave simpson. author : simpson, j. palgrave ( john palgrave ), 1807 - 1887. publisher : clyde, ohio, a. d. ames [ 1880? ] copyright : public domain http : / / www. hathitrust. org / access _ use # pd we have determined this work to be in the public domain, meaning that it is not subject to copyright. users are free to copy, use, and redistribute the work in part or in whole. it is possible that current copyright holders, heirs or the estate of the authors of individual portions of the work, such as illustrations or photographs, assert copyrights over these portions. depending on the nature of subsequent use that is made, additional rights may need most'

In [25]:
second_summary = summarizer(" ".join(all_summary_words[500:]))
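
One way to reduce the combined summaries further is to repeat the chunk-and-summarize step until the text is short enough. The sketch below assumes a `summarize` callable that maps a string to a shorter string; `reduce_summaries` and the stand-in `first_ten_words` are hypothetical names used only for illustration, and the stand-in is not a real model.

```python
def reduce_summaries(summaries, summarize, size=150, max_words=150):
    """Repeatedly chunk and summarize until at most `max_words` words remain."""
    words = " ".join(summaries).split()
    while len(words) > max_words:
        chunks = [" ".join(words[i : i + size]) for i in range(0, len(words), size)]
        shorter = " ".join(summarize(chunk) for chunk in chunks).split()
        if len(shorter) >= len(words):
            # The summarizer is no longer shrinking the text, so stop looping
            break
        words = shorter
    return " ".join(words)


def first_ten_words(text):
    """Stand-in for a real summarizer: keep only the first ten words."""
    return " ".join(text.split()[:10])


# Two made-up "summaries" of 200 words each
demo_summaries = [
    " ".join(f"w{i}" for i in range(200)),
    " ".join(f"w{i}" for i in range(200, 400)),
]
reduced = reduce_summaries(demo_summaries, first_ten_words)
print(len(reduced.split()))  # 30
```

With the pipeline from this notebook, `summarize` could be something like `lambda text: summarizer(text)[0]["summary_text"]`. The guard that stops when a pass fails to shorten the text matters here, because a model that echoes its input would otherwise loop forever.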

## Illustration of Cataloging Workflow
To illustrate a possible use of summarization, we will add a BIBFRAME Summary to the original RDF graph for this work.

In [27]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)
work_uri = rdflib.URIRef(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
print(f"Total triples {len(example1_graph)}")

Total triples 48


In [28]:
summary_literal = rdflib.Literal(
    """The Lady of Lyons, The Studio, The Vow of the Omani, The Brigands of Calabria, The Serf The Poacher's Doom, The Hunter of the Alp- Thirty-Three Next Birthday . The work is in the public domain, meaning users are free to copy, use, and redistribute the work in part or in whole .  The play is founded on incidents which actually occured during the war of the Rebellion . It introduces Ohio’s brave and gallant McPherson . It abounds with the most beautiful tableaux, drill, marches, scenes upon the battle f ^i l, in AndersonviHe ."""
)
summary_bnode = rdflib.BNode()
example1_graph.add((work_uri, helpers.BIBFRAME.summary, summary_bnode))
example1_graph.add((summary_bnode, rdflib.RDF.type, helpers.BIBFRAME.Summary))
example1_graph.add((summary_bnode, rdflib.RDFS.label, summary_literal))

<Graph identifier=N9e1df39e9ebd4503afdd0f6a98f7109a (<class 'rdflib.graph.Graph'>)>

In [29]:
print(example1_graph.serialize(format="turtle"))

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix bflc: <http://id.loc.gov/ontologies/bflc/> .
@prefix ns1: <http://www.europeana.eu/schemas/edm/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sinopia: <http://sinopia.io/vocabulary/> .

<https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c> a bf:Work,
        <http://share-vde.org/rdfBibframe/SuperWork> ;
    rdfs:label "Simpson, J. Palgrave (John Palgrave), 1807-1887. Alarmingly suspicious"@eng ;
    bf:adminMetadata [ a bf:AdminMetadata ;
            bflc:catalogerId "tt434"@eng ;
            bflc:encodingLevel <http://id.loc.gov/vocabulary/menclvl/f> ;
            bf:descriptionConventions <https://id.loc.gov/vocabulary/descriptionConventions/rda> ;
            bf:descriptionLanguage <https://id.loc.gov/vocabulary/languages/eng> ;
            bf:descriptionModifier <https://id.loc.gov/vocabulary/organizations/cty> ;
            bf:source

## Exercise 1

From the `stage_text_nodes` DataFrame, select a series of *title*, *label*, or *summary* values and compare the results of the [spaCy][SPACY] `en_core_web_sm` NER model with those of the [HuggingFace][HUG] NER model.

[HUG]: https://huggingface.co/
[SPACY]: https://spacy.io/
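
As a possible starting point for the comparison, the entity texts found by each model can be compared as sets. This is only a sketch: `compare_entities` is a hypothetical helper, and the entity lists below are made up for illustration rather than real model output.

```python
def compare_entities(spacy_entities, hf_entities):
    """Compare two collections of entity strings, ignoring case and outer whitespace."""
    spacy_set = {text.strip().lower() for text in spacy_entities}
    hf_set = {text.strip().lower() for text in hf_entities}
    return {
        "both": sorted(spacy_set & hf_set),
        "spacy_only": sorted(spacy_set - hf_set),
        "huggingface_only": sorted(hf_set - spacy_set),
    }


# Hypothetical entity texts, made up for illustration (not real model output)
spacy_found = ["Jane Austen", "London"]
hf_found = ["jane austen", "Penguin Books"]
comparison = compare_entities(spacy_found, hf_found)
print(comparison)
```

The comparison is on entity text only, since the two libraries use different label names for their entity classes.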