## BME i9400
## Fall 2024
### Introduction to Natural Language Processing

## What is Natural Language Processing (NLP)?
- Field of AI focused on interactions between computers and human language
- Combines linguistics, computer science, and machine learning
- Key tasks include:
  - Text classification
  - Named entity recognition 
  - Machine translation
  - Question answering
  - Text generation



## Why is NLP Challenging?
- Language is inherently complex and ambiguous
- Examples of challenges:
  - "The drug caused side effects" vs "The side effects caused hospitalization" (shared words, different causal roles)
  - "The patient's condition improved" (positive) vs "The patient's condition" (neutral)
  - Medical terms with multiple meanings ("discharge" - noun vs verb)
- Need ways to represent text that capture meaning and context


## Traditional Text Processing
1. Text preprocessing
   - Tokenization (breaking text into words/subwords)
   - Removing stop words
   - Stemming/lemmatization
2. Feature extraction
   - Bag of words
   - TF-IDF (Term Frequency-Inverse Document Frequency)
3. Limitations
   - Loses word order
   - Doesn't capture meaning
   - Sparse representations


<img src="bow.png" alt="Bag of words" width="1200"/>
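
The preprocessing and feature-extraction steps listed above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list, the three example sentences, and the helper names `preprocess` and `tf_idf` are assumptions made for the demo, and the TF-IDF formula is one common unsmoothed variant (libraries such as scikit-learn use a smoothed form).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "of", "was", "is", "to", "and"}

def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "The drug caused side effects",
    "The side effects caused hospitalization",
    "The patient was given the drug",
]
tokenized = [preprocess(d) for d in docs]

# Bag of words: one count per vocabulary word, word order discarded
vocab = sorted({t for doc in tokenized for t in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

def tf_idf(doc, corpus):
    """tf = count / document length, idf = log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for d in corpus if word in d)
        scores[word] = (count / len(doc)) * math.log(n_docs / df)
    return scores

print("Vocabulary:", vocab)
for row in bow:
    print(row)

# Word order is lost: a scrambled sentence gives the same bag of words
same = Counter(preprocess("Side effects caused the drug")) == Counter(tokenized[0])
print("Scrambled sentence has identical bag of words:", same)

# Words that appear in fewer documents receive higher TF-IDF weight
scores1 = tf_idf(tokenized[1], tokenized)
print(scores1)
```

Most entries in each bag-of-words row are zero even for this tiny vocabulary, which is the sparsity limitation noted above.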

## Word Vectors: A Better Way
- Represent words as dense vectors in continuous space
- Similar words have similar vectors
- Example in medical context:
  - "heart" closer to "cardiac" than to "kidney"
  - "diabetes" closer to "glucose" than to "fracture"
- Enables mathematical operations:
  - king - man + woman ≈ queen
  - cardiologist - heart + brain ≈ neurologist


<img src="word_vector.png" alt="Word vectors" width="1200"/>
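
To make "closer" and the vector arithmetic above concrete, here is a small sketch using cosine similarity. All vectors below are hypothetical, hand-picked values for illustration only; learned embeddings typically have hundreds of dimensions and the analogy holds only approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-D vectors (not learned from data)
medical = {
    "heart":   [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.1],
    "kidney":  [0.1, 0.9, 0.0],
}
sim_cardiac = cosine(medical["heart"], medical["cardiac"])
sim_kidney = cosine(medical["heart"], medical["kidney"])
print("heart vs cardiac:", sim_cardiac)
print("heart vs kidney: ", sim_kidney)

# Toy 2-D space with axes (royalty, gender); again purely illustrative
royal = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 1.0],
}

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(royal["king"], royal["man"], royal["woman"])]

# Pick the nearest remaining word, excluding the three query words
candidates = [w for w in royal if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(royal[w], target))
print("king - man + woman is closest to:", best)
```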

## Word2Vec: How It Works
1. Two main approaches:
   - Skip-gram: Predict context words from target
   - Continuous Bag of Words (CBOW): Predict target from context
2. Neural network learns word representations
3. Results in dense vector space where:
   - Similar words cluster together
   - Relationships preserved in vector space


<img src="word2vec.png" alt="Word2Vec" width="1200"/>

<img src="w2v_training.png" alt="Word2Vec training" width="1200"/>
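
The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it does not train a network, and it leaves out details of real Word2Vec implementations such as negative sampling and subsampling of frequent words. The function names and the example sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context) pair per neighbor within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the patient has chest pain".split()

pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)

print("Skip-gram pairs (input word -> word to predict):")
for target, context in pairs:
    print(f"  {target} -> {context}")

print("CBOW examples (context words -> word to predict):")
for context, target in examples:
    print(f"  {context} -> {target}")
```

Skip-gram produces many small examples per sentence, while CBOW produces one example per position with the whole context as input.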

## Hands-on Demo 1: Word Vectors

In [None]:
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

In [None]:
# Load a large "corpus" of text
corpus = api.load('text8')

In [None]:
# Train a Word2Vec model on the corpus
model = Word2Vec(corpus)

In [None]:
# find the most similar words to 'brain'
print(model.wv.most_similar('brain'))

In [None]:
# find the embedding for 'brain'
print(model.wv['brain'])

In [None]:
# find the similarity between 'brain' and 'heart'
print(model.wv.similarity('brain', 'heart'))

In [None]:
# Find the similarity between 'brain' and 'car'
print(model.wv.similarity('brain', 'car'))

## Modern Word Embeddings
- Contextual embeddings: Words have different vectors based on context
- Example: "discharge"
  - "Patient ready for discharge" (noun - leaving hospital)
  - "Wound continues to discharge" (verb - producing fluid)
- Popular models:
  - BERT
  - BioBERT (trained on biomedical text)
  - ClinicalBERT (trained on clinical notes)


<img src="contextual_embedding.png" alt="Contextual embeddings" width="1200"/>

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load BioBERT
tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
model = AutoModel.from_pretrained('dmis-lab/biobert-v1.1')

In [None]:
# Process two different uses of "discharge"
text1 = "The patient was discharged."
text2 = "The wound discharged fluid."

# Get tokenized inputs
tokens1 = tokenizer(text1, return_tensors="pt")
tokens2 = tokenizer(text2, return_tensors="pt")

tokens1, tokens2

In [None]:
# Get the actual tokens to find the position of "discharge"
tokens1_words = tokenizer.convert_ids_to_tokens(tokens1['input_ids'][0])
tokens2_words = tokenizer.convert_ids_to_tokens(tokens2['input_ids'][0])

tokens1_words, tokens2_words

In [None]:
# Find the position of "discharged" in each sentence
# (assumes the tokenizer keeps the word as a single token; .index raises ValueError otherwise)
discharge1_pos = tokens1_words.index('discharged')
discharge2_pos = tokens2_words.index('discharged')

discharge1_pos, discharge2_pos
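
The lookup above assumes the tokenizer keeps "discharged" as a single token. WordPiece tokenizers like BERT's can split a word into pieces, with continuation pieces prefixed by `##`. A possible fallback is sketched below; `find_word_span` is a hypothetical helper, and the token lists are made up for illustration rather than actual BioBERT output.

```python
def find_word_span(tokens, word):
    """Return (start, end) such that tokens[start:end] spell out `word`."""
    for start in range(len(tokens)):
        if tokens[start].startswith("##"):
            continue  # a word cannot begin on a continuation piece
        pieces = tokens[start]
        end = start + 1
        # Absorb the following continuation pieces ("##...") into the same word
        while end < len(tokens) and tokens[end].startswith("##"):
            pieces += tokens[end][2:]
            end += 1
        if pieces == word:
            return start, end
    raise ValueError(f"{word!r} not found in tokens")

# Hypothetical tokenizations, for illustration only
whole = ["[CLS]", "The", "patient", "was", "discharged", ".", "[SEP]"]
split = ["[CLS]", "The", "wound", "dis", "##charged", "fluid", ".", "[SEP]"]

span_whole = find_word_span(whole, "discharged")
span_split = find_word_span(split, "discharged")
print(span_whole)
print(span_split)
```

When a word spans several pieces, one common choice is to average the hidden states over `start:end` to get a single vector for the word.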

In [None]:
# Get embeddings
with torch.no_grad():
    outputs1 = model(**tokens1)
    outputs2 = model(**tokens2)

# Extract embeddings for "discharge" in each context
discharge1_embedding = outputs1.last_hidden_state[0][discharge1_pos]
discharge2_embedding = outputs2.last_hidden_state[0][discharge2_pos]

len(discharge1_embedding), len(discharge2_embedding)

In [None]:
# Compare embeddings
similarity = torch.cosine_similarity(
    discharge1_embedding.unsqueeze(0), 
    discharge2_embedding.unsqueeze(0))

print("Tokens in first sentence:", tokens1_words)
print("Tokens in second sentence:", tokens2_words)
print(f"\nSimilarity between the two uses of 'discharge': {similarity.item()}")

## Applications in Biomedical Domain
1. Clinical Text Classification
   - Diagnosis categorization
   - Clinical trial matching
2. Named Entity Recognition
   - Identifying drugs, diseases, symptoms
3. Information Extraction
   - Mining medical literature
   - Extracting patient information from notes


## Summarization

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Sample text (news article)
text = """
President Joe Biden’s pardon of his son Hunter deepened an entanglement of politics and the rule of law that has tarnished faith in American justice and is almost certain to worsen in Donald Trump’s second term.

The Sunday evening move was a stunning development since Biden came to office vowing to restore the independence of the Justice Department, which had been eroded during Trump’s first term, and because he had repeatedly said he wouldn’t pardon his son.

Now, weeks before he leaves the White House, Biden has wielded presidential power to absolve his son ahead of sentencings later this month over a pair of gun and tax convictions that emerged from the due process of law.

His decision came days after special counsel Jack Smith moved to dismiss the federal cases against Trump — over election interference and the hoarding of classified documents — on the grounds that presidents can’t be prosecuted.

Taken together, the convergence of legal controversies raises questions about the bedrock notion that underpins the system of justice in the United States that everyone — even presidents and their families — are equal before the law.

Until Sunday, Biden had not intervened in the cases against his son, and the White House always insisted that he wouldn’t, even though the shifting political environment caused by Trump’s election victory last month seemed likely to shift his calculations. Biden started informing staff of his decision on Saturday evening, a source familiar with the matter told CNN’s Arlette Saenz, and his team regrouped on Sunday morning to iron out the details.

Politically, Biden’s reversal may be seen as a stain on his legacy and his credibility. It contributes to an ignominious end for a presidency that dissolved in his disastrous debate performance in June and that will now be remembered as much for opening the way for Trump’s return to the White House as evicting him four years ago.

Rep. Glenn Ivey, a Maryland Democrat, acknowledged to Kasie Hunt on “CNN This Morning” Monday that the pardon will be wielded politically against Democrats.

“I’ve got mixed views about it, frankly,” Ivey said.

The president also may have offered an opening for Trump’s party to rally behind Kash Patel, the loyalist whom the president-elect picked Saturday evening to lead the FBI and serve as an apparent agent of his campaign of political retribution.

There is no evidence of wrongdoing on the part of the president. An impeachment inquiry by House Republicans that looked at Biden’s and his son’s business relationships — which Democrats saw as an attempt to inflict political damage ahead of the election — went nowhere. And the cases against Hunter Biden lack the constitutional gravity or historic importance of the indictments against Trump and his frequent attacks on the rule of law.

But the political impact of Sunday night’s drama could be profound. Already, Republicans are arguing the Hunter Biden pardon shows that the current president, and not the next one, is most to blame for politicizing the system of justice by meting out favorable treatment to his son. Their claim may not be accurate, but it can still be politically effective.

Trump used pardons to protect multiple political aides and contacts during his first term, including his daughter’s father-in-law, who’s now his pick for ambassador to France. But any time in the future that Trump is criticized for his use of pardon power, he will be able to argue that Biden did the same to protect his own kin.

This could be especially significant as Trump comes under pressure from supporters in the coming months to pardon those convicted of crimes related to the January 6, 2021, mob attack on the US Capitol — many of whom are still in jail.

Yet Biden, after a life of tragedies and heartache, asked Americans to judge him as a father who was clearly worried about the impact of a potential jail term on his son, a recovering addict.

Hunter Biden was convicted by a jury in June of illegally buying and possessing a gun after a trial that exposed his drug abuse and family dysfunction. He pleaded guilty in September to nine tax offenses, stemming from $1.4 million in taxes that he didn’t pay while spending lavishly on escorts, strippers, cars and drugs.

There is some validity to the president’s claim in his Sunday statement that his son was “treated differently” because of who his father is. Charges relating to the illegal possession of a firearm while being addicted to a controlled substance and regarding a false statement on the matter are quite rare, for instance. And Republican congressional probes into the matter, which imploded over a lack of evidence, looked like naked attempts to damage the president.

“No reasonable person who looks at the facts of Hunter’s cases can reach any other conclusion than Hunter was singled out only because he is my son — and that is wrong,” Joe Biden said in the statement. “There has been an effort to break Hunter — who has been five and a half years sober, even in the face of unrelenting attacks and selective prosecution. In trying to break Hunter, they’ve tried to break me — and there’s no reason to believe it will stop here. Enough is enough.”

His statement is extraordinary because Biden is now arguing something rather similar to Trump — that his own Justice Department has been unfairly politicized. Biden was referring to the way that the Hunter Biden case was handled by David Weiss, a Trump-appointed US attorney from Delaware who originally investigated the president’s son and was later appointed as a special counsel by Attorney General Merrick Garland.

Yet at the same time, Hunter Biden put himself in a position in which he created a political vulnerability and potential conflict of interest for his father. In addition, his business activities in Ukraine and China while his father was vice president and afterward raised serious ethical questions, even though Republicans have failed to produce evidence for claims that the current president benefited from the transactions.

It is significant, therefore, that Joe Biden’s pardon includes any activity by his son starting on January 1, 2014 — the year that Hunter Biden joined the board of Burisma, a Ukrainian energy company — while his father, who was then vice president, was deeply involved in US policy toward Kyiv.

While the pardon is its own distinct controversy, it may not have happened but for the extraordinary circumstances of a fraught political moment, with Trump due to return to power at noon on January 20.

Given the selection of Patel to head the FBI and Trump’s second pick for attorney general, Pam Bondi, there are reasonable grounds to expect that Hunter Biden may have been among those whom the president-elect’s loyalists were likely to target, given their vows to use their powers to go after his enemies.

And now that he’s acted to protect his son, Joe Biden may face calls to cast a much wider net with his pardon authority, perhaps to include prosecutors who worked on cases against Trump, including over his attempt to overturn the result of the 2020 election.

The president-elect moved quickly to capitalize on the situation in a comment that will raise expectations that he will issue pardons for January 6 convicts shortly after he takes office again.

“Does the Pardon given by Joe to Hunter include the J-6 Hostages, who have now been imprisoned for years?” Trump wrote in a post on Truth Social on Sunday. “Such an abuse and miscarriage of Justice!”

And Trump’s Republican allies sought to use the situation to bolster the chances of Senate confirmation for some of his most provocative picks. “Democrats can spare us the lectures about the rule of law when, say, President Trump nominates Pam Bondi and Kash Patel to clean up this corruption,” Arkansas Sen. Tom Cotton wrote on X.

"""

# Summarize
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, 3)  # Keep the 3 highest-ranked sentences
for sentence in summary:
    print(sentence)

## Tokenization

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

text = "NLP enables machines to understand human language. This is vital for healthcare."
print("Word Tokenization:", word_tokenize(text))
print("Sentence Tokenization:", sent_tokenize(text))

## Sentiment Analysis

In [None]:
from textblob import TextBlob

text = "Moana 2 was a hot, steaming pile of garbage. The animation was terrible and the plot was non-existent."

blob = TextBlob(text)
sentiment = blob.sentiment.polarity  # Ranges from -1 (negative) to 1 (positive)

# Display the sentiment
if sentiment > 0:
    print("Sentiment: Positive 😊\n")
elif sentiment < 0:
    print("Sentiment: Negative 😞\n")
else:
    print("Sentiment: Neutral 😐\n")

## Named Entity Recognition

In [None]:
import spacy

# Load spaCy's small general-purpose English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample biomedical text
text = "Aspirin is used to treat pain and reduce the risk of stroke in cardiovascular disease. Jacek Dmochowski is an Associate Professor of Biomedical Engineering."

# Process the text
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")