# Exploring Applied NLP Problems

## Lecture Overview

* Named Entity Recognition (NER) using SpaCy and Transformers
* Text summarization using Transformers
* Text generation using Transformers
* Analyzing `Fake news` using Transformers and ChatGPT
* Semantic role labeling using Transformers and ChatGPT

## Named Entity Recognition (NER) using SpaCy and Transformers

### Named Entity Recognition (NER)

Named Entity Recognition (NER) is the task of identifying named entities in text and classifying them into predefined categories such as people, organizations, locations, time expressions, quantities, monetary values, and percentages.

### SpaCy Example

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')

In [None]:
### more complex data

# https://pubmed.ncbi.nlm.nih.gov/37071411/
text = """
Most patients were initially certified for a 1:1 (∆9-tetrahydrocannabinol:cannabidiol) tincture.
Eight-seven percent of patients (n = 60) were noted to exhibit an improvement in any PD symptom after starting MC.
Symptoms with the highest incidence of improvement included cramping/dystonia, pain, spasticity, lack of appetite, dyskinesia, and tremor.
After starting MC, 56% of opioid users (n = 14) were able to decrease or discontinue opioid use with an average daily morphine milligram equivalent change from 31 at baseline to 22 at the last follow-up visit.
The MC was well-tolerated with no severe AEs reported and low rate of MC discontinuation due to AEs (n = 4).
"""

doc = nlp(text)

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')

In [None]:
## Using SciSpacy

import spacy
import scispacy

nlp = spacy.load("en_core_sci_scibert")

doc = nlp(text)

for ent in doc.ents:
    print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')

```
Entity: patients is type ENTITY - index_location: 6:14
Entity: ∆9-tetrahydrocannabinol:cannabidiol) tincture is type ENTITY - index_location: 51:96
Entity: patients is type ENTITY - index_location: 121:129
Entity: improvement is type ENTITY - index_location: 164:175
Entity: PD is type ENTITY - index_location: 183:185
Entity: symptom is type ENTITY - index_location: 186:193
Entity: MC is type ENTITY - index_location: 209:211
Entity: Symptoms is type ENTITY - index_location: 213:221
Entity: incidence is type ENTITY - index_location: 239:248
Entity: improvement is type ENTITY - index_location: 252:263
Entity: cramping/dystonia is type ENTITY - index_location: 273:290
Entity: pain is type ENTITY - index_location: 292:296
Entity: spasticity is type ENTITY - index_location: 298:308
Entity: lack of appetite is type ENTITY - index_location: 310:326
Entity: dyskinesia is type ENTITY - index_location: 328:338
Entity: tremor is type ENTITY - index_location: 344:350
Entity: MC is type ENTITY - index_location: 367:369
Entity: opioid is type ENTITY - index_location: 378:384
Entity: users is type ENTITY - index_location: 385:390
Entity: decrease is type ENTITY - index_location: 413:421
Entity: discontinue is type ENTITY - index_location: 425:436
Entity: opioid is type ENTITY - index_location: 437:443
Entity: daily is type ENTITY - index_location: 464:469
Entity: morphine is type ENTITY - index_location: 470:478
Entity: milligram is type ENTITY - index_location: 479:488
...
Entity: low rate is type ENTITY - index_location: 620:628
Entity: MC is type ENTITY - index_location: 632:634
Entity: discontinuation is type ENTITY - index_location: 635:650
Entity: AEs is type ENTITY - index_location: 658:661
```

### Transformers Example

In [4]:
import transformers
from transformers import pipeline

# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', model='dslim/bert-base-NER', tokenizer='dslim/bert-base-NER', aggregation_strategy='simple')
ner(text)

[{'entity_group': 'MISC',
  'score': 0.9794679,
  'word': 'PD',
  'start': 183,
  'end': 185},
 {'entity_group': 'ORG',
  'score': 0.75518674,
  'word': 'MC',
  'start': 209,
  'end': 211},
 {'entity_group': 'ORG',
  'score': 0.7457224,
  'word': 'MC',
  'start': 367,
  'end': 369},
 {'entity_group': 'ORG',
  'score': 0.671571,
  'word': 'MC',
  'start': 566,
  'end': 568},
 {'entity_group': 'MISC',
  'score': 0.52196264,
  'word': 'A',
  'start': 603,
  'end': 604},
 {'entity_group': 'ORG',
  'score': 0.6522742,
  'word': 'MC',
  'start': 632,
  'end': 634}]
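
The low-scoring `A` span (0.52) above suggests that a general-purpose model is guessing on this biomedical text. One simple way to triage such output is to keep only entities above a confidence cutoff. The sketch below assumes the list-of-dicts format shown above; the helper name `filter_entities` and the 0.7 cutoff are illustrative choices, not part of the `transformers` API.

```python
def filter_entities(entities, min_score=0.7):
    """Keep only entity dicts whose 'score' is at least min_score."""
    return [ent for ent in entities if ent["score"] >= min_score]


# A few entries copied from the pipeline output above
sample = [
    {"entity_group": "MISC", "score": 0.9794679, "word": "PD", "start": 183, "end": 185},
    {"entity_group": "ORG", "score": 0.75518674, "word": "MC", "start": 209, "end": 211},
    {"entity_group": "MISC", "score": 0.52196264, "word": "A", "start": 603, "end": 604},
]

confident = filter_entities(sample)
print([ent["word"] for ent in confident])
```

A cutoff only hides uncertain predictions. It does not fix wrong labels such as `MC` being tagged `ORG`, which is why a domain-specific model is tried next.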

### Using a different model

In [5]:
# https://huggingface.co/d4data/biomedical-ner-all
# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', model='d4data/biomedical-ner-all', tokenizer='d4data/biomedical-ner-all', aggregation_strategy='simple')
ner(text)

[{'entity_group': 'Lab_value',
  'score': 0.4000865,
  'word': '1',
  'start': 46,
  'end': 47},
 {'entity_group': 'Lab_value',
  'score': 0.9972366,
  'word': 'eight - seven percent',
  'start': 98,
  'end': 117},
 {'entity_group': 'Disease_disorder',
  'score': 0.99834895,
  'word': 'pd',
  'start': 183,
  'end': 185},
 {'entity_group': 'Sign_symptom',
  'score': 0.9996455,
  'word': 'cr',
  'start': 273,
  'end': 275},
 {'entity_group': 'Sign_symptom',
  'score': 0.93797743,
  'word': '##amp',
  'start': 275,
  'end': 278},
 {'entity_group': 'Sign_symptom',
  'score': 0.7478892,
  'word': 'd',
  'start': 282,
  'end': 283},
 {'entity_group': 'Sign_symptom',
  'score': 0.985441,
  'word': 'spa',
  'start': 298,
  'end': 301},
 {'entity_group': 'Sign_symptom',
  'score': 0.92435396,
  'word': 'dyskines',
  'start': 328,
  'end': 336},
 {'entity_group': 'Lab_value',
  'score': 0.7490449,
  'word': '56 %',
  'start': 371,
  'end': 374},
 {'entity_group': 'Disease_disorder',
  'score': 0
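
Fragments such as `cr` and `##amp` show that the grouping step did not rejoin word pieces for this model. As a rough post-processing sketch, adjacent fragments that share a label and have touching character offsets can be merged by slicing the original text. The helper `merge_adjacent` is hypothetical, and `snippet` is a shortened stand-in for `text`, so its offsets differ from those above.

```python
def merge_adjacent(entities, source_text):
    """Merge consecutive entities that share a label and touch end-to-start."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev and prev["entity_group"] == ent["entity_group"] and prev["end"] == ent["start"]:
            # Extend the previous span and re-read the word from the source text
            prev["end"] = ent["end"]
            prev["word"] = source_text[prev["start"]:prev["end"]]
            prev["score"] = min(prev["score"], ent["score"])  # conservative score
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


snippet = "included cramping/dystonia"
fragments = [
    {"entity_group": "Sign_symptom", "score": 0.99, "word": "cr", "start": 9, "end": 11},
    {"entity_group": "Sign_symptom", "score": 0.93, "word": "##amp", "start": 11, "end": 14},
    {"entity_group": "Sign_symptom", "score": 0.74, "word": "d", "start": 18, "end": 19},
]

result = merge_adjacent(fragments, snippet)
print([(ent["word"], ent["start"], ent["end"]) for ent in result])
```

This only rejoins pieces the model already tagged. It will not recover the untagged remainder of a word (for example, the `ing` in `cramping`).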

### Creating your own pipeline

* Extract semantic triples (subject, relation, object) from the text, then perform NER on the extracted triples
* Here, Stanford CoreNLP's OpenIE annotator (accessed through `stanza`) extracts the triples

In [7]:
import stanza
import spacy
from stanza.server import CoreNLPClient
stanza.install_corenlp()

## extract triples from the text
triples = []

# define the properties
config = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,openie",
    "openie.max_entailments_per_clause": "100",
    "openie.threads": "4",
    "memory": "16G",
    "endpoint": "http://localhost:9020",
}

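# The context manager starts the CoreNLP server and shuts it down on exit.
# Only annotators, memory, and endpoint are passed; the openie.* entries above are not applied.
with CoreNLPClient(annotators=config['annotators'], memory=config['memory'], endpoint=config['endpoint']) as client:
    document = client.annotate(text)

# Collect every OpenIE triple from every sentence
for sentence in document.sentence:
    for triple in sentence.openieTriple:
        triples.append([triple.subject, triple.relation, triple.object])

triples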

[['patients', 'were', 'initially certified'],
 ['patients', 'were certified for', '1:1'],
 ['Most patients', 'were certified for', '1:1'],
 ['Most patients', 'were initially certified for', '1:1'],
 ['patients', 'were initially certified for', '1:1'],
 ['patients', 'were', 'certified'],
 ['Most patients', 'were', 'certified'],
 ['Most patients', 'were', 'initially certified'],
 ['improvement', 'is in', 'PD symptom'],
 ['Eight seven percent', 'were', 'noted'],
 ['Eight seven percent', 'exhibit', 'improvement in PD symptom'],
 ['dystonia', 'lack of', 'appetite'],
 ['Symptoms', 'is with', 'highest incidence of improvement'],
 ['Symptoms', 'included', 'cramping dystonia'],
 ['Symptoms', 'included', 'lack'],
 ['Symptoms', 'included', 'lack of appetite']]

In [8]:
from IPython.display import HTML, display

display(HTML(text))

In [10]:
## Analyze the triples for NER

import spacy

nlp = spacy.load("en_core_web_sm")

for triple in triples:
    # Join subject, relation, and object into one string; offsets are relative to it
    doc = nlp(" ".join(triple))
    for ent in doc.ents:
        print(f'Entity: {ent.text} is type {ent.label_} - index_location: {ent.start_char}:{ent.end_char}')

Entity: 1:1 is type DATE - index_location: 28:31
Entity: 1:1 is type DATE - index_location: 33:36
Entity: 1:1 is type DATE - index_location: 43:46
Entity: 1:1 is type DATE - index_location: 38:41
Entity: Eight seven percent is type PERCENT - index_location: 0:19
Entity: Eight is type CARDINAL - index_location: 0:5
Entity: seven percent is type PERCENT - index_location: 6:19


In [12]:
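# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', model='dslim/bert-base-NER', tokenizer='dslim/bert-base-NER', aggregation_strategy='simple')

for triple in triples:
    # Join subject, relation, and object into one string; offsets are relative to it
    entities = ner(" ".join(triple))
    for ent in entities:
        print(f'Entity: {ent["word"]} is type {ent["entity_group"]} - index_location: {ent["start"]}:{ent["end"]}')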

Entity: PD is type MISC - index_location: 18:20
Entity: PD is type MISC - index_location: 43:45


## Text summarization using Transformers

There are two types of text summarization:

* Extractive summarization: Extracting a subset of the original text to form the summary
* Abstractive summarization: Generating new text to form the summary

### Extractive summarization

There are several extractive summarization techniques:

* LexRank - A graph-based algorithm that builds a sentence graph weighted by TF-IDF cosine similarity and ranks sentences with a PageRank-style centrality score.
* TextRank - Also graph-based and PageRank-driven; it differs from LexRank mainly in how sentence similarity is measured (typically word overlap).
* Luhn - Scores sentences by the density of significant words, where significance is based on word frequency.
* SumBasic - Uses the frequency of words in the text to score sentences, favoring sentences made of frequent words. (abstract-like)
* KL-Sum - Greedily adds sentences so that the summary's word distribution stays as close as possible, in Kullback-Leibler divergence, to that of the full text.
* LSA - Latent semantic analysis applies singular value decomposition to a term-by-sentence matrix and selects sentences that best represent the leading latent topics.
* K-Means - Clusters sentence vectors with k-means and picks a representative sentence from each cluster, such as the one closest to the centroid.
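
To make the frequency-based idea concrete, here is a minimal sketch that scores each sentence by the average relative frequency of its words and keeps the top scorers. It is a simplified scorer in the spirit of SumBasic, not the full algorithm (it does not down-weight words after a sentence is selected, and it does not remove stop words). The function name `frequency_summary` and the naive regex sentence splitter are our own assumptions.

```python
import re
from collections import Counter


def frequency_summary(text, num_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]

    # Count every word across the whole text
    counts = Counter(word for words in tokenized for word in words)
    total = sum(counts.values())

    # Score each sentence by the average relative frequency of its words
    scores = [
        sum(counts[w] / total for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    # Pick the best sentences, then restore document order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Cats sleep a lot. Cats chase mice and cats climb trees. "
    "The weather was mild yesterday."
)
summary_sentences = frequency_summary(sample, num_sentences=1)
print(summary_sentences)
```

Because the output is a subset of the original sentences, this is extractive: nothing is paraphrased or generated.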

### Abstractive summarization

In [13]:
import transformers

from transformers import BloomTokenizerFast
from transformers import BloomForCausalLM

MODEL = BloomForCausalLM.from_pretrained('bigscience/bloom-560m')
TOKENIZER = BloomTokenizerFast.from_pretrained('bigscience/bloom-560m')

In [14]:
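## summarization of our text using the Bloom model
## Note: BLOOM is a causal LM, so it continues the prompt rather than following a "summarize" instruction

def summarize_text(text: str, tokenizer=TOKENIZER, min_output=40, max_output=100, max_length=80, model=MODEL):
    """Truncate the prompt to max_length tokens and generate min_output..max_output new tokens."""
    inputs = tokenizer("summarize: " + text, return_tensors='pt', max_length=max_length, truncation=True)
    # *_new_tokens count only generated tokens; plain min_length/max_length also count the prompt
    ids = model.generate(**inputs, min_new_tokens=min_output, max_new_tokens=max_output)
    # Decode only the newly generated tokens, dropping the echoed prompt
    prompt_length = inputs['input_ids'].shape[1]
    summary = tokenizer.decode(ids[0][prompt_length:], skip_special_tokens=True)
    return summary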

In [15]:
summary = summarize_text(text)
display(HTML(summary))

## Fake News

What is fake news?

* Fake news is a type of yellow journalism or propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media.
* Fake news can be published to intentionally or circumstantially damage the reputation of a person or entity, or make money through advertising revenue.
* But ... fake news is not always false. The label can be used to discredit news that is critical of a person or organization, or to draw attention away from critical analysis.

### Fake News Detection processing

* Is it a news article?
* Is there consensus on the truthfulness of the article?
* If yes, return the consensus
* If no, continue
  * What is challenged in the article?
    * Sentiment analysis - can shed light on the overall tone of the article (positive, negative, neutral) - heatmap of the article by paragraph or section
    * Named entity recognition - can we identify the entities in the article (people, places, organizations, etc.)
    * Can we perform semantic role labeling on the article?
    * Are there references to other sources?



adapted from Rothman, D. _Transformers for Natural Language Processing_. O'Reilly Media, Inc., 2020
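
The checklist above can be read as a small decision procedure. The sketch below only encodes the control flow; the dict keys (`is_news`, `url`), the `consensus_lookup` mapping, and the step names are hypothetical placeholders for whatever classifiers and fact-check sources are actually available.

```python
def triage_article(article, consensus_lookup):
    """Walk the checklist and return either a verdict or the analyses to run.

    `article` is a dict with hypothetical keys 'is_news' and 'url';
    `consensus_lookup` maps a URL to an existing fact-check verdict, if any.
    """
    if not article.get("is_news", False):
        return {"status": "not_news"}

    verdict = consensus_lookup.get(article["url"])
    if verdict is not None:
        # Consensus exists, so return it and stop
        return {"status": "consensus", "verdict": verdict}

    # No consensus: queue the analyses from the checklist
    return {
        "status": "needs_analysis",
        "steps": [
            "sentiment_analysis",
            "named_entity_recognition",
            "semantic_role_labeling",
            "reference_check",
        ],
    }


lookup = {"https://example.org/a": "false"}
print(triage_article({"is_news": True, "url": "https://example.org/a"}, lookup))
print(triage_article({"is_news": True, "url": "https://example.org/b"}, lookup)["status"])
```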

## Exploring Document Similarity

### Using vectors to analyze sentence or document similarity

#### Dot product

The dot product or inner product of two vectors is defined as:

$$ \vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i $$


The magnitude (length) of a vector $\vec{v}$ is defined as:

$$ |\vec{v}| = \sqrt{\sum_{i=1}^{n} v_i^2} $$

* The longer the vector, the larger its magnitude
* More frequent words tend to have vectors with larger magnitudes, so they dominate raw dot products
* The raw dot product is not normalized, so it favors long vectors. How can we use it to measure similarity?


#### Normalized dot product

$$ \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}} $$

#### Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors, so it depends on their direction and not on their length. It is defined as:

$$ \text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}} $$
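
As a quick numeric check of the formulas above, the following plain-Python sketch computes the dot product, the magnitude, and the cosine similarity for three small vectors. The vectors are made up for illustration.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine(a, b):
    """Dot product normalized by both magnitudes."""
    return dot(a, b) / (magnitude(a) * magnitude(b))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice as long
c = [2.0, -1.0, 0.0]  # orthogonal to a

print(dot(a, a), dot(a, b))        # raw dot product grows with vector length
print(cosine(a, b), cosine(a, c))  # cosine depends on direction only
```

Doubling the length of `a` doubles the raw dot product but leaves the cosine unchanged, which is why cosine similarity is preferred for comparing vectors of different magnitudes.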

### Cosine similarity of words

https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin

### Cosine similarity of sentences and documents

How can we use the word vectors to measure the similarity between sentences or documents?

* Average the word vectors in the sentence or document
* Calculate the cosine similarity between the two sentences or documents
* Train a classifier to predict the similarity between sentences or documents
* Train a sentence embedding model to generate sentence or document vectors
* etc.
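
The first two options can be sketched in a few lines: average the word vectors of each sentence, then compare the averages with cosine similarity. The 3-dimensional vectors in `WORD_VECTORS` are invented for illustration; in practice they would come from a trained embedding model.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only
WORD_VECTORS = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.0],
    "purr": [0.7, 0.0, 0.1],
    "bark": [0.6, 0.3, 0.0],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}


def sentence_vector(tokens):
    """Average the vectors of the tokens found in WORD_VECTORS."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("no known tokens in sentence")
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


s1 = sentence_vector(["cats", "purr"])
s2 = sentence_vector(["dogs", "bark"])
s3 = sentence_vector(["stocks", "fell"])

print(round(cosine(s1, s2), 3), round(cosine(s1, s3), 3))
```

Averaging ignores word order, so "dog bites man" and "man bites dog" get the same vector. That limitation is one motivation for the trained sentence and document embedding models listed above.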

#### Doc2Vec

https://radimrehurek.com/gensim/models/doc2vec.html

In [16]:
import gensim

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

In [17]:
Doc2Vec??

Init signature:
Doc2Vec(
    documents=None,
    corpus_file=None,
    vector_size=100,
    dm_mean=None,
    dm=1,
    dbow_words=0,
    dm_concat=0,
    dm_tag_count=1,
    dv=None,
    dv_mapfile=None,
    comment=None,
    trim_rule=

In [18]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [19]:
model.infer_vector(["system", "response"])

array([-0.07645398, -0.05182466, -0.08459883, -0.09623718,  0.06788807],
      dtype=float32)

### Let's train a Doc2Vec model

In [22]:
import os
import re
import pandas as pd
from gensim import corpora, models, similarities

df = pd.read_csv('../datasets/news-2023-02-01.csv')

df.head()

| Unnamed: 0 | source | title | text |
|---|---|---|---|
| 0 | politicususa | Prosecutors Pay Attention: Stormy Daniels Than... | Manhattan prosecutors are likely to notice tha... |
| 1 | politicususa | Investigators Push For Access To Trump Staff C... | Print\nInvestigators looking into Donald Trump... |
| 2 | politicususa | The End Is Near For George Santos As He Steps ... | The AP reported:\nRepublican Rep. George Santo... |
| 3 | politicususa | Rachel Maddow Cuts Trump To The Bone With Stor... | Rachel Maddow showed how Trump committed a cri... |
| 4 | vox | Alec Baldwin has been formally charged with in... | Candles are placed in front of a photo of cine... |


In [28]:
titles = df['title'].tolist()
# Drop missing titles (NaN values are floats, not strings)
titles = [title for title in titles if isinstance(title, str)]

In [34]:
dictionary = corpora.Dictionary([title.split() for title in titles])

In [35]:
dictionary.token2id

{'Attention:': 0,
 'Confirming': 1,
 'Daniels': 2,
 'For': 3,
 'Her': 4,
 'Pay': 5,
 'Prosecutors': 6,
 'Publicly': 7,
 'Stormy': 8,
 'Story': 9,
 'Thanks': 10,
 'Trump': 11,
 'Access': 12,
 'Computers': 13,
 'Investigators': 14,
 'Push': 15,
 'Staff': 16,
 'To': 17,
 'As': 18,
 'Committees': 19,
 'Down': 20,
 'End': 21,
 'From': 22,
 'George': 23,
 'He': 24,
 'Is': 25,
 'Near': 26,
 'Santos': 27,
 'Steps': 28,
 'The': 29,
 'Analysis': 30,
 'Bone': 31,
 'Cuts': 32,
 'Hush': 33,
 'Maddow': 34,
 'Money': 35,
 'Rachel': 36,
 'With': 37,
 '-': 38,
 'Alec': 39,
 'Baldwin': 40,
 'Vox': 41,
 'been': 42,
 'charged': 43,
 'formally': 44,
 'has': 45,
 'involuntary': 46,
 'manslaughter': 47,
 'with': 48,
 'Google': 49,
 'What': 50,
 'and': 51,
 'at': 52,
 'companies': 53,
 'for': 54,
 'industries': 55,
 'layoffs': 56,
 'mean': 57,
 'other': 58,
 'tech': 59,
 'Did': 60,
 'Representative-elect': 61,
 'Republican': 62,
 'about': 63,
 'his': 64,
 'lie': 65,
 'life': 66,
 'story?': 67,
 '17': 68,
 '20

In [36]:
# Convert the titles into Gensim bag-of-words vectors
corpus = [dictionary.doc2bow(title.split()) for title in titles]
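
To show what the dictionary and bag-of-words steps produce, here is a plain-Python approximation. It assigns ids to new tokens in sorted order within each document, which is consistent with the `token2id` output above, but it is only a sketch: the helper names `build_token2id` and `to_bow` are our own, and gensim's `Dictionary` offers more (pruning, frequency statistics, and so on).

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Assign an integer id to each new token, document by document."""
    token2id = {}
    for tokens in tokenized_docs:
        # New tokens within one document receive ids in sorted order
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


docs = [["graph", "trees"], ["graph", "minors", "trees", "trees"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document becomes a sparse list of `(token_id, count)` pairs, so word order is discarded and only counts remain.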

In [37]:
# Train a Gensim TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)

In [38]:
# Convert the corpus into Gensim TF-IDF vectors
tfidf_corpus = tfidf[corpus]
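
The TF-IDF transformation can be sketched by hand on a toy bag-of-words corpus. The weighting below (raw count times `log2(N / df)`, followed by L2 normalization of each document) is our understanding of `TfidfModel`'s default behavior; treat that as an assumption and check the gensim documentation before relying on it. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(bow_corpus):
    """Weight each (token_id, count) by log2(N / df), then L2-normalize per document."""
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id appears
    doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)

    vectors = []
    for bow in bow_corpus:
        weighted = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        # Tokens present in every document get weight 0 and are dropped
        vectors.append([(tid, w / norm) for tid, w in weighted if w > 0.0] if norm else [])
    return vectors


toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]
toy_tfidf = tfidf_vectors(toy_corpus)
for vec in toy_tfidf:
    print(vec)
```

Token `0` occurs in every toy document, so its weight is zero and it disappears from the output. Rarer tokens receive the largest weights.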

In [39]:
tfidf_corpus?

Type:           TransformedCorpus
String form:    <gensim.interfaces.TransformedCorpus object at 0x7f34899b1660>
Length:         11586
File:           /media/james/Projects/GitHub/DATA_340_NLP/venv/lib/python3.10/site-packages/gensim/interfaces.py
Docstring:      Interface for corpora that are the result of an online (streamed) transformation.
Init docstring:
Parameters
----------
obj : object
    A transformation :class:`~gensim.interfaces.TransformationABC` object that will be applied
    to each document from `corpus` during iteration.
corpus : iterable of list of (int, number)
    Corpus in bag-of-words format.
chunksize : int, optional
    If provided, a slightly more effective processing will be performed by grouping documents from `corpus`.

In [41]:
# examine the first 10 documents
for doc in tfidf_corpus[:10]:
    print(doc)

[(0, 0.32468534328551457), (1, 0.32468534328551457), (2, 0.2584169554144531), (3, 0.19024770138233393), (4, 0.30048513863439946), (5, 0.3179550964042962), (6, 0.2984219385078802), (7, 0.32468534328551457), (8, 0.2584169554144531), (9, 0.32468534328551457), (10, 0.32468534328551457), (11, 0.15067146529062592)]
[(3, 0.24495403190564266), (11, 0.19399752348070728), (12, 0.38852224809611624), (13, 0.4227104186686381), (14, 0.4227104186686381), (15, 0.4136144820378332), (16, 0.39775189221153373), (17, 0.25448230244861914)]
[(3, 0.19956880919341843), (18, 0.3069028857577644), (19, 0.3363941626852941), (20, 0.2826278275460694), (21, 0.33077778414752956), (22, 0.22836338466496256), (23, 0.23864755918795416), (24, 0.2820621988488852), (25, 0.27035192728438345), (26, 0.32970348315347797), (27, 0.26751629670278293), (28, 0.337569976449488), (29, 0.08691868993437478)]
[(2, 0.2539793881751479), (8, 0.2539793881751479), (11, 0.14808411665013346), (17, 0.19425395894264366), (29, 0.08143617957551065),

In [51]:
# Train a Gensim LSI model on the TF-IDF vectors
lsi_model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=25)

In [52]:
lsi_model??

Type:           LsiModel
String form:    LsiModel<num_terms=3950, num_topics=25, decay=1.0, chunksize=20000>
File:           /media/james/Projects/GitHub/DATA_340_NLP/venv/lib/python3.10/site-packages/gensim/models/lsimodel.py
Source:
class LsiModel(interfaces.TransformationABC, basemodel.BaseTopicModel):
    """Model for `Latent Semantic Indexing
    <https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing>`_.

    The decomposition algorithm is described in `"Fast and Faster: A Comparison of Two Streamed
    Matrix Decomposition Algorithms" <https://arxiv.org/pdf/1102.5597.pdf>`_.

    Notes
    -----
    * :attr:`gensim.models.lsimodel.LsiModel.projection.u` - left singular vect

In [53]:
lsi_model_topics = lsi_model.show_topics(formatted=False)
lsi_model_topics

[(0,
  [('-', 0.22351318340115048),
   ('The', 0.2215239063088244),
   ('New', 0.17796204560949086),
   ('York', 0.17756587389260567),
   ('Trump', 0.17264271527176195),
   ('Post', 0.17259230796850328),
   ('Washington', 0.17162889486593416),
   ('to', 0.15920168051786623),
   ('the', 0.15364586274262773),
   ('Times', 0.13648862164500525)]),
 (1,
  [('Trump', -0.28581640550184273),
   ('Stormy', -0.24777864902725394),
   ('Daniels', -0.24777864902725374),
   ('For', -0.19186268975074833),
   ('Post', 0.17433705456220477),
   ('Washington', 0.16956888149999652),
   ('Publicly', -0.16138937678528795),
   ('Story', -0.1613893767852879),
   ('Attention:', -0.16138937678528784),
   ('Confirming', -0.16138937678528772)]),
 (2,
  [('Latest', 0.26481514526156763),
   ('HuffPost', 0.26463734148522233),
   ('News', 0.1737214508032373),
   ('Stormy', -0.15542373897468745),
   ('Daniels', -0.1554237389746874),
   ('In', 0.1461348108894938),
   ('Vox', -0.14088344358419605),
   ('From', 0.1400708

In [55]:
lsi_model.get_topics()

array([[ 4.23242971e-02,  4.23242971e-02,  9.74917372e-02, ...,
         9.46665739e-05,  9.46665739e-05,  9.46665739e-05],
       [-1.61389377e-01, -1.61389377e-01, -2.47778649e-01, ...,
         4.05416730e-05,  4.05416730e-05,  4.05416730e-05],
       [-8.53912153e-02, -8.53912153e-02, -1.55423739e-01, ...,
        -5.37195072e-05, -5.37195072e-05, -5.37195072e-05],
       ...,
       [ 1.42144228e-02,  1.42144228e-02, -1.00674276e-02, ...,
        -3.32726230e-06, -3.32726230e-06, -3.32726230e-06],
       [ 3.19839330e-02,  3.19839330e-02, -1.18478838e-02, ...,
        -2.62584525e-05, -2.62584525e-05, -2.62584525e-05],
       [ 5.76181789e-03,  5.76181789e-03,  1.83985712e-02, ...,
         4.42204864e-06,  4.42204864e-06,  4.42204864e-06]])

In [58]:
lsi_model.print_topics(num_topics=10, num_words=5)

[(0, '0.224*"-" + 0.222*"The" + 0.178*"New" + 0.178*"York" + 0.173*"Trump"'),
 (1,
  '-0.286*"Trump" + -0.248*"Stormy" + -0.248*"Daniels" + -0.192*"For" + 0.174*"Post"'),
 (2,
  '0.265*"Latest" + 0.265*"HuffPost" + 0.174*"News" + -0.155*"Stormy" + -0.155*"Daniels"'),
 (3,
  '-0.243*"George" + -0.237*"For" + -0.191*"Santos" + -0.188*"Committees" + -0.185*"Steps"'),
 (4,
  '-0.297*"5th" + -0.297*"Devastates" + -0.297*"Matters" + -0.297*"Nicolle" + -0.293*"Taking"'),
 (5,
  '-0.201*"Santos" + -0.186*"Vox" + -0.183*"Down" + -0.178*"Committees" + -0.174*"George"'),
 (6,
  '0.292*"York" + 0.277*"Times" + 0.276*"New" + -0.175*"Vox" + -0.168*"charged"'),
 (7,
  '-0.314*"Computers" + -0.314*"Investigators" + -0.311*"Push" + -0.307*"Access" + -0.298*"Staff"'),
 (8,
  '-0.219*"primary" + -0.218*"calendar" + -0.198*"will" + 0.195*"by" + -0.191*"Democrats"'),
 (9,
  '0.234*"to" + -0.221*"or" + -0.212*"Supreme" + -0.210*"Kagan" + -0.210*"Democrats?"')]