# Lecture 23: 2023-04-20 Exploring Applied NLP Problems

## Lecture Overview

* Named Entity Recognition (NER) using SpaCy and Transformers
* Text summarization using Transformers
* Text generation using Transformers
* Analyzing `Fake news` using Transformers and ChatGPT
* Semantic role labeling using Transformers and ChatGPT

## Named Entity Recognition (NER) using SpaCy and Transformers

### Named Entity Recognition (NER)

Named Entity Recognition (NER) is the task of identifying named entities in text and classifying them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages.

### SpaCy Example

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')


Entity: Apple is type ORG - index_location: 0:5
Entity: U.K. is type GPE - index_location: 27:31
Entity: $1 billion is type MONEY - index_location: 44:54
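
The `index_location` values are character offsets into the input string, so slicing the original text with them should recover each entity. The sketch below checks this in plain Python, without spaCy, using the three spans printed above. The helper name `span_texts` is hypothetical and exists only for illustration.

```python
# Sentence and (label, start_char, end_char) spans copied from the output above
sentence = "Apple is looking at buying U.K. startup for $1 billion"
spans = [("ORG", 0, 5), ("GPE", 27, 31), ("MONEY", 44, 54)]


def span_texts(text, spans):
    """Return (label, substring) pairs by slicing text with character offsets."""
    return [(label, text[start:end]) for label, start, end in spans]


for label, surface in span_texts(sentence, spans):
    print(f"{label}: {surface!r}")
```

The end offset is exclusive, which is why `0:5` covers the five characters of `Apple`.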


In [2]:
### more complex data

# https://pubmed.ncbi.nlm.nih.gov/37071411/
text = """
Most patients were initially certified for a 1:1 (∆9-tetrahydrocannabinol:cannabidiol) tincture.
Eight-seven percent of patients (n = 60) were noted to exhibit an improvement in any PD symptom after starting MC.
Symptoms with the highest incidence of improvement included cramping/dystonia, pain, spasticity, lack of appetite, dyskinesia, and tremor.
After starting MC, 56% of opioid users (n = 14) were able to decrease or discontinue opioid use with an average daily morphine milligram equivalent change from 31 at baseline to 22 at the last follow-up visit.
The MC was well-tolerated with no severe AEs reported and low rate of MC discontinuation due to AEs (n = 4).
"""

doc = nlp(text)

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')

Entity: 1:1 is type DATE - index_location: 46:49
Entity: Eight-seven percent is type PERCENT - index_location: 98:117
Entity: 60 is type CARDINAL - index_location: 135:137
Entity: PD is type NORP - index_location: 183:185
Entity: dyskinesia is type GPE - index_location: 328:338
Entity: 56% is type PERCENT - index_location: 371:374
Entity: 14 is type CARDINAL - index_location: 396:398
Entity: daily is type DATE - index_location: 464:469
Entity: 31 is type CARDINAL - index_location: 512:514
Entity: 22 is type CARDINAL - index_location: 530:532
Entity: 4 is type CARDINAL - index_location: 667:668


In [4]:
## Using SciSpacy

import spacy
import scispacy

nlp = spacy.load("en_core_sci_scibert")

doc = nlp(text)

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')


Entity: patients is type ENTITY - index_location: 6:14
Entity: ∆9-tetrahydrocannabinol:cannabidiol) tincture is type ENTITY - index_location: 51:96
Entity: patients is type ENTITY - index_location: 121:129
Entity: improvement is type ENTITY - index_location: 164:175
Entity: PD is type ENTITY - index_location: 183:185
Entity: symptom is type ENTITY - index_location: 186:193
Entity: MC is type ENTITY - index_location: 209:211
Entity: Symptoms is type ENTITY - index_location: 213:221
Entity: incidence is type ENTITY - index_location: 239:248
Entity: improvement is type ENTITY - index_location: 252:263
Entity: cramping/dystonia is type ENTITY - index_location: 273:290
Entity: pain is type ENTITY - index_location: 292:296
Entity: spasticity is type ENTITY - index_location: 298:308
Entity: lack of appetite is type ENTITY - index_location: 310:326
Entity: dyskinesia is type ENTITY - index_location: 328:338
Entity: tremor is type ENTITY - index_location: 344:350
Entity: MC is type ENTITY - ind

### Transformers Example

In [6]:
import transformers
from transformers import pipeline

# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', model='dslim/bert-base-NER', tokenizer='dslim/bert-base-NER', aggregation_strategy='simple')
ner(text)



[{'entity_group': 'MISC',
  'score': 0.9794678,
  'word': 'PD',
  'start': 183,
  'end': 185},
 {'entity_group': 'ORG',
  'score': 0.75518626,
  'word': 'MC',
  'start': 209,
  'end': 211},
 {'entity_group': 'ORG',
  'score': 0.74572164,
  'word': 'MC',
  'start': 367,
  'end': 369},
 {'entity_group': 'ORG',
  'score': 0.67157006,
  'word': 'MC',
  'start': 566,
  'end': 568},
 {'entity_group': 'MISC',
  'score': 0.5219626,
  'word': 'A',
  'start': 603,
  'end': 604},
 {'entity_group': 'ORG',
  'score': 0.65227294,
  'word': 'MC',
  'start': 632,
  'end': 634}]
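
Each prediction carries a confidence `score`, so one simple post-processing step is to drop low-confidence entities. The sketch below applies a threshold to the output shown above. The helper name `filter_by_score` and the cutoff of 0.7 are assumptions chosen for illustration, not values from the lecture.

```python
# Predictions copied from the pipeline output above
entities = [
    {"entity_group": "MISC", "score": 0.9794678, "word": "PD", "start": 183, "end": 185},
    {"entity_group": "ORG", "score": 0.75518626, "word": "MC", "start": 209, "end": 211},
    {"entity_group": "ORG", "score": 0.74572164, "word": "MC", "start": 367, "end": 369},
    {"entity_group": "ORG", "score": 0.67157006, "word": "MC", "start": 566, "end": 568},
    {"entity_group": "MISC", "score": 0.5219626, "word": "A", "start": 603, "end": 604},
    {"entity_group": "ORG", "score": 0.65227294, "word": "MC", "start": 632, "end": 634},
]


def filter_by_score(entities, min_score=0.7):
    """Keep only predictions whose confidence is at least min_score."""
    return [ent for ent in entities if ent["score"] >= min_score]


confident = filter_by_score(entities)
print([(ent["word"], ent["entity_group"]) for ent in confident])
```

A threshold removes the weakest guesses (such as the single letter `A`), but it does not change the labels themselves: a general-purpose model can still be confidently wrong on biomedical abbreviations.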

### Using a different model

In [8]:
# https://huggingface.co/d4data/biomedical-ner-all
# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', model='d4data/biomedical-ner-all', tokenizer='d4data/biomedical-ner-all', aggregation_strategy='simple')
ner(text)

[{'entity_group': 'Lab_value',
  'score': 0.4000886,
  'word': '1',
  'start': 46,
  'end': 47},
 {'entity_group': 'Lab_value',
  'score': 0.9972366,
  'word': 'eight - seven percent',
  'start': 98,
  'end': 117},
 {'entity_group': 'Disease_disorder',
  'score': 0.99834895,
  'word': 'pd',
  'start': 183,
  'end': 185},
 {'entity_group': 'Sign_symptom',
  'score': 0.9996455,
  'word': 'cr',
  'start': 273,
  'end': 275},
 {'entity_group': 'Sign_symptom',
  'score': 0.9379781,
  'word': '##amp',
  'start': 275,
  'end': 278},
 {'entity_group': 'Sign_symptom',
  'score': 0.74788606,
  'word': 'd',
  'start': 282,
  'end': 283},
 {'entity_group': 'Sign_symptom',
  'score': 0.98544145,
  'word': 'spa',
  'start': 298,
  'end': 301},
 {'entity_group': 'Sign_symptom',
  'score': 0.9243543,
  'word': 'dyskines',
  'start': 328,
  'end': 336},
 {'entity_group': 'Lab_value',
  'score': 0.74904835,
  'word': '56 %',
  'start': 371,
  'end': 374},
 {'entity_group': 'Disease_disorder',
  'score':
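
The output above splits some words into subword fragments (`cr`, `##amp`, `d`). One possible clean-up, sketched below in plain Python, merges touching spans that share a label and then widens each span to the surrounding whole word. The helper name `merge_fragments` is hypothetical, and the sample offsets are shifted so that the excerpt starts at 0 (in the output above the same fragments start at 273). Scores are dropped for brevity.

```python
def merge_fragments(text, entities):
    """Merge touching same-label spans, then widen each span to whole words."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if prev and prev["entity_group"] == ent["entity_group"] and ent["start"] <= prev["end"]:
            # Fragment continues the previous span: extend it
            prev["end"] = max(prev["end"], ent["end"])
        else:
            merged.append({"entity_group": ent["entity_group"],
                           "start": ent["start"], "end": ent["end"]})
    for ent in merged:
        start, end = ent["start"], ent["end"]
        # Grow the span left and right while still inside an alphanumeric word
        while start > 0 and text[start - 1].isalnum():
            start -= 1
        while end < len(text) and text[end].isalnum():
            end += 1
        ent.update(start=start, end=end, word=text[start:end])
    return merged


sample = "cramping/dystonia, pain"
fragments = [
    {"entity_group": "Sign_symptom", "word": "cr", "start": 0, "end": 2},
    {"entity_group": "Sign_symptom", "word": "##amp", "start": 2, "end": 5},
    {"entity_group": "Sign_symptom", "word": "d", "start": 9, "end": 10},
]
result = merge_fragments(sample, fragments)
print([(ent["word"], ent["start"], ent["end"]) for ent in result])
```

Reading the surface form back from the original text, rather than concatenating the `word` fields, avoids carrying the `##` markers and the lowercasing applied by the tokenizer into the result.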

### Creating your own pipeline

* Extract semantic triples (subject, relation, object) from the text with Stanford CoreNLP's OpenIE annotator
* Perform NER on the extracted triples

In [15]:
import stanza
import spacy
from stanza.server import CoreNLPClient
stanza.install_corenlp()

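## extract triples from the text
triples = []

# Client settings. Only annotators, memory, and endpoint are passed to the
# client below; the openie.* entries take effect only if they are also
# supplied through the client's `properties` argument.
config = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,openie",
    "openie.max_entailments_per_clause": "100",
    "openie.threads": "4",
    "memory": "16G",
    "endpoint": "http://localhost:9020",
}

# The context manager starts the CoreNLP server and shuts it down afterwards
with CoreNLPClient(annotators=config['annotators'], memory=config['memory'], endpoint=config['endpoint']) as client:
    document = client.annotate(text)

# Collect every OpenIE (subject, relation, object) triple, sentence by sentence
for sentence in document.sentence:
    for triple in sentence.openieTriple:
        triples.append([triple.subject, triple.relation, triple.object])

triples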




[['patients', 'were', 'initially certified'],
 ['patients', 'were certified for', '1:1'],
 ['Most patients', 'were certified for', '1:1'],
 ['Most patients', 'were initially certified for', '1:1'],
 ['patients', 'were initially certified for', '1:1'],
 ['patients', 'were', 'certified'],
 ['Most patients', 'were', 'certified'],
 ['Most patients', 'were', 'initially certified'],
 ['improvement', 'is in', 'PD symptom'],
 ['Eight seven percent', 'were', 'noted'],
 ['Eight seven percent', 'exhibit', 'improvement in PD symptom'],
 ['dystonia', 'lack of', 'appetite'],
 ['Symptoms', 'is with', 'highest incidence of improvement'],
 ['Symptoms', 'included', 'cramping dystonia'],
 ['Symptoms', 'included', 'lack'],
 ['Symptoms', 'included', 'lack of appetite']]
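
The OpenIE output above contains many near-duplicate triples that differ only by a dropped modifier (`patients` vs. `Most patients`, `certified` vs. `initially certified`). One way to reduce this before running NER is to drop any triple whose three parts are each covered, word for word, by another triple. The sketch below is one such heuristic in plain Python. The helper names `_norm` and `drop_subsumed` are hypothetical, and bag-of-words containment is an assumption that ignores word order.

```python
def _norm(triple):
    """Represent each part of a triple as a set of lowercased words."""
    return tuple(frozenset(part.lower().split()) for part in triple)


def drop_subsumed(triples):
    """Keep only triples that are not word-wise contained in another triple."""
    first_seen = {}
    for triple in triples:
        # Keep the first surface form for each normalized triple
        first_seen.setdefault(_norm(triple), triple)
    keys = list(first_seen)
    kept = []
    for key in keys:
        covered = any(
            other != key and all(a <= b for a, b in zip(key, other))
            for other in keys
        )
        if not covered:
            kept.append(first_seen[key])
    return kept


# First eight triples from the output above
sample_triples = [
    ['patients', 'were', 'initially certified'],
    ['patients', 'were certified for', '1:1'],
    ['Most patients', 'were certified for', '1:1'],
    ['Most patients', 'were initially certified for', '1:1'],
    ['patients', 'were initially certified for', '1:1'],
    ['patients', 'were', 'certified'],
    ['Most patients', 'were', 'certified'],
    ['Most patients', 'were', 'initially certified'],
]
reduced = drop_subsumed(sample_triples)
for triple in reduced:
    print(triple)
```

On this sample, the eight variants collapse to the two most specific triples, which also reduces the repeated `patients` entities in the NER step.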

In [17]:
from IPython.display import HTML, display

display(HTML(text))

In [19]:
## Analyze the triples for NER

import spacy
import scispacy

nlp = spacy.load("en_core_sci_scibert")

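# Join each triple into a short phrase and run NER on it.
# index_location is relative to the joined phrase, not the original text.
for triple in triples:
    phrase = " ".join(triple)
    doc = nlp(phrase)
    for ent in doc.ents:
        print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')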

Entity: patients is type ENTITY - index_location: 0:8
Entity: patients is type ENTITY - index_location: 0:8
Entity: patients is type ENTITY - index_location: 5:13
Entity: patients is type ENTITY - index_location: 5:13
Entity: patients is type ENTITY - index_location: 0:8
Entity: patients is type ENTITY - index_location: 0:8
Entity: patients is type ENTITY - index_location: 5:13
Entity: patients is type ENTITY - index_location: 5:13
Entity: improvement is type ENTITY - index_location: 0:11
Entity: PD is type ENTITY - index_location: 18:20
Entity: symptom is type ENTITY - index_location: 21:28
Entity: improvement is type ENTITY - index_location: 28:39
Entity: PD is type ENTITY - index_location: 43:45
Entity: symptom is type ENTITY - index_location: 46:53
Entity: dystonia is type ENTITY - index_location: 0:8
Entity: lack of appetite is type ENTITY - index_location: 9:25
Entity: Symptoms is type ENTITY - index_location: 0:8
Entity: incidence is type ENTITY - index_location: 25:34
Entity: i

## Text summarization using Transformers

There are two types of text summarization:

* Extractive summarization: Extracting a subset of the original text to form the summary
* Abstractive summarization: Generating new text to form the summary

### Extractive summarization

There are several extractive summarization techniques:

* LexRank - A graph-based algorithm that links sentences by their TF-IDF cosine similarity and ranks them with a PageRank-style centrality score.
* TextRank - A graph-based algorithm that applies PageRank to a graph of sentences linked by their word overlap.
* Luhn - Scores each sentence by the density of significant (frequent, non-stopword) words it contains.
* SumBasic - Ranks sentences by the average frequency of their words, down-weighting words that the summary already covers.
* KL-Sum - Greedily adds sentences so that the Kullback-Leibler divergence between the word distribution of the summary and that of the document stays as small as possible.
* LSA - Latent semantic analysis applies singular value decomposition to a term-sentence matrix and selects the sentences that best represent the main latent topics.
* K-Means - Clusters sentence vectors with k-means and typically selects the sentence closest to each cluster centroid.
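
To make the graph-based idea concrete, the sketch below ranks sentences with a PageRank-style power iteration over a sentence-similarity graph, using only the standard library. It is a simplified illustration in the spirit of LexRank, not a faithful implementation: similarity is plain cosine over raw word counts (no IDF weighting, no similarity threshold), and the function names, damping factor, and iteration count are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over the similarity graph."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [(1 - damping) / n] * n
        for j in range(n):
            total = sum(sim[j])
            for i in range(n):
                # A sentence with no similar neighbours spreads its score uniformly
                share = sim[j][i] / total if total else 1.0 / n
                new_scores[i] += damping * scores[j] * share
        scores = new_scores
    return scores


def summarize(text, k=2):
    """Return the k highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


demo = "The cat sat on the mat. The cat ate the fish. Stocks fell sharply today."
print(summarize(demo, k=2))
```

In the demo, the two sentences that share vocabulary reinforce each other, while the unrelated third sentence receives the lowest score and is left out of the summary.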

### Abstractive summarization

In [21]:
import transformers

from transformers import BloomTokenizerFast
from transformers import BloomForCausalLM

MODEL = BloomForCausalLM.from_pretrained('bigscience/bloom-560m')
TOKENIZER = BloomTokenizerFast.from_pretrained('bigscience/bloom-560m')



In [36]:
## summarization of our text using the Bloom model

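def summarize_text(text: str, tokenizer=TOKENIZER, min_output=40, max_output=100, max_length=80, model=MODEL):
    """Take a string of text and generate a summary"""
    # BLOOM is a causal LM: "summarize: " is only prompt text, not a task switch.
    # max_length truncates the prompt; min_output/max_output bound the new tokens.
    tokens_input = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=max_length, truncation=True)
    ids = model.generate(tokens_input, min_new_tokens=min_output, max_new_tokens=max_output)
    # generate() returns prompt + continuation, so decode only the tokens after the prompt
    summary = tokenizer.decode(ids[0][tokens_input.shape[-1]:], skip_special_tokens=True)
    return summary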

In [37]:
summary = summarize_text(text)
display(HTML(summary))

## Fake News

What is fake news?

* Fake news is a type of yellow journalism or propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media.
* Fake news can be published to intentionally or circumstantially damage the reputation of a person or entity, or make money through advertising revenue.
* However, content labeled "fake news" is not always false. The label can also be used to discredit news that is critical of a person or organization, or to draw attention away from critical analysis.

### Fake News Detection Process

* Is it a news article?
* Is there consensus on the truthfulness of the article?
  * If yes, return the consensus
  * If no, continue
    * What is challenged in the article?
      * Sentiment analysis - sheds light on the overall tone of the article (positive, negative, neutral), for example as a heatmap of the article by paragraph or section
      * Named entity recognition - can we identify the entities in the article (people, places, organizations, etc.)?
      * Can we perform semantic role labeling on the article?
      * Are there references to other sources?
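
The checklist above can be read as a small decision procedure. The sketch below encodes it in plain Python under stated assumptions: the `Article` fields, the `triage` function, and the step names are hypothetical, the consensus value is assumed to come from an external fact-checking source, and stopping when the text is not a news article is an assumption, since the checklist does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Optional

# Analyses to run when there is no consensus (names are illustrative)
ANALYSIS_STEPS = [
    "sentiment_analysis",
    "named_entity_recognition",
    "semantic_role_labeling",
    "source_reference_check",
]


@dataclass
class Article:
    text: str
    is_news: bool
    # Verdict agreed on by fact-checkers, or None when there is no consensus
    consensus: Optional[str] = None


def triage(article):
    """Walk the checklist and report what should happen next."""
    if not article.is_news:
        # Assumption: non-news text is out of scope for this process
        return {"status": "not_news", "next_steps": []}
    if article.consensus is not None:
        return {"status": "consensus", "verdict": article.consensus, "next_steps": []}
    return {"status": "needs_analysis", "next_steps": list(ANALYSIS_STEPS)}


print(triage(Article(text="...", is_news=True)))
```

Only the last branch triggers the transformer-based analyses, so settled or out-of-scope texts are handled without running any model.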



adapted from Rothman, D. _Transformers for Natural Language Processing_. O'Reilly Media, Inc., 2020

[Example](https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Chapter13/Fake_News_Analysis_with_ChatGPT.ipynb)