## BME i9400
## Fall 2024
### Introduction to Natural Language Processing

## What is Natural Language Processing (NLP)?
- Field of AI focused on interactions between computers and human language
- Combines linguistics, computer science, and machine learning
- Key tasks include:
  - Text classification
  - Named entity recognition 
  - Machine translation
  - Question answering
  - Text generation



## Why is NLP Challenging?
- Language is inherently complex and ambiguous
- Examples of challenges:
  - "The drug caused side effects" vs "The side effects caused hospitalization" (the same phrase, "side effects", is the outcome in one sentence and the cause in the other)
  - "The patient's condition improved" (positive) vs "The patient's condition" (neutral)
  - Medical terms with multiple meanings ("discharge" - noun vs verb)
- Need ways to represent text that capture meaning and context


## Traditional Text Processing
1. Text preprocessing
   - Tokenization (breaking text into words/subwords)
   - Removing stop words
   - Stemming/lemmatization
2. Feature extraction
   - Bag of words
   - TF-IDF (Term Frequency-Inverse Document Frequency)
3. Limitations
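   - Loses word order (reordered sentences get identical features)
   - Doesn't capture meaning (synonyms such as "heart" and "cardiac" are unrelated features)
   - Sparse representations (one dimension per vocabulary word, mostly zeros)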


<img src="bow.png" alt="Bag of words" width="1200"/>
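
The pipeline above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the stop-word list is a tiny hypothetical one, stemming/lemmatization is skipped, and the TF-IDF weighting is one common unsmoothed variant (term frequency times log(N / document frequency)); libraries such as scikit-learn use slightly different formulas.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and"}

def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tf_idf(docs_tokens, vocabulary):
    """Weight each term frequency by log(N / document frequency)."""
    n_docs = len(docs_tokens)
    doc_freq = {w: sum(w in doc for doc in docs_tokens) for w in vocabulary}
    vectors = []
    for doc in docs_tokens:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return vectors

docs = [
    "The drug caused side effects",
    "Side effects caused the drug",            # same words, scrambled order
    "The side effects caused hospitalization",
]
docs_tokens = [preprocess(d) for d in docs]
vocabulary = sorted({tok for doc in docs_tokens for tok in doc})
bow_vectors = [bag_of_words(doc, vocabulary) for doc in docs_tokens]
tfidf_vectors = tf_idf(docs_tokens, vocabulary)

print(vocabulary)
for text, vec in zip(docs, bow_vectors):
    print(vec, text)
```

The first two sentences get identical count vectors even though their word order differs, which is the "loses word order" limitation. In the TF-IDF vectors, a word that occurs in every document (here "caused") gets weight zero, while a word unique to one document ("hospitalization") gets a positive weight.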

## Word Vectors: A Better Way
- Represent words as dense vectors in continuous space
- Similar words have similar vectors
- Example in medical context:
  - "heart" closer to "cardiac" than to "kidney"
  - "diabetes" closer to "glucose" than to "fracture"
- Enables mathematical operations:
  - king - man + woman ≈ queen
  - cardiologist - heart + brain ≈ neurologist


<img src="word_vector.png" alt="Word vectors" width="1200"/>
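
To make the vector arithmetic concrete, here is a small sketch using hand-made 2-D vectors. The numbers are hypothetical and chosen only so that the analogy works out (one axis loosely stands for "royalty", the other for "gender"); real embeddings are learned from data and have hundreds of dimensions. The helper names `cosine_similarity` and `analogy` are also just for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, NOT learned from any corpus.
toy_vectors = {
    "king":     [0.9,  0.9],
    "queen":    [0.9, -0.9],
    "man":      [0.1,  0.9],
    "woman":    [0.1, -0.9],
    "prince":   [0.7,  0.8],
    "princess": [0.7, -0.8],
}

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))
```

With these toy vectors, `king - man + woman` lands on the vector for "queen", so the nearest remaining word by cosine similarity is "queen".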

## Word2Vec: How It Works
1. Two main approaches:
   - Skip-gram: Predict context words from target
   - Continuous Bag of Words (CBOW): Predict target from context
2. A shallow neural network learns word representations as a by-product of this prediction task
3. Results in dense vector space where:
   - Similar words cluster together
   - Relationships preserved in vector space


<img src="word2vec.png" alt="Word2Vec" width="1200"/>

<img src="w2v_training.png" alt="Word2Vec training" width="1200"/>
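
The difference between the two approaches is easiest to see in the training examples they produce. The sketch below only builds those examples from a sliding window; it does not train a network. The function names and the example sentence are hypothetical, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context_word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the patient reported chest pain".split()

for pair in skipgram_pairs(sentence, window=1):
    print("skip-gram:", pair)

for example in cbow_examples(sentence, window=1):
    print("CBOW:", example)
```

Skip-gram predicts each neighbor separately from the target word, so it yields many small examples; CBOW pools the neighbors and predicts the target once per position.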

## Hands-on Demo 1: Word Vectors

In [1]:
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

In [2]:
# Load a large "corpus" of text
corpus = api.load('text8')

In [3]:
# Train a Word2Vec model on the corpus
model = Word2Vec(corpus)

In [4]:
# find the most similar words to 'brain'
print(model.wv.most_similar('brain'))

[('cortex', 0.7852938771247864), ('tissue', 0.7655649781227112), ('neurons', 0.7525806427001953), ('spinal', 0.7436544299125671), ('tumor', 0.732485294342041), ('kidneys', 0.7312641739845276), ('liver', 0.7306241393089294), ('cerebral', 0.7294614315032959), ('nerve', 0.7266942858695984), ('nerves', 0.7265110611915588)]


In [5]:
# find the embedding for 'brain'
print(model.wv['brain'])

[ 3.3037671e-01 -1.6137644e+00  4.4435623e-01 -5.1973277e-01
 -1.2091461e+00 -7.3113006e-01 -1.6238000e+00 -1.9236416e+00
 -4.1577852e-01 -4.8568371e-01 -1.5465862e+00 -8.0053240e-01
  1.7945840e+00  7.1447760e-01 -1.8249069e+00  1.2087728e+00
 -7.8725451e-01 -3.0684929e+00  1.1614560e+00  8.3401930e-01
  4.0777278e-01  1.2528294e+00 -3.0974813e-03 -2.4188135e+00
  2.7808580e-01 -1.4666234e+00  6.5099597e-01 -2.4236552e-01
  3.0013743e-01  3.4220216e-01  1.7090615e+00  2.4020199e-01
  2.5136545e+00  3.6623334e-03 -3.2434314e-01  7.9220855e-01
 -5.1119596e-01 -3.4265479e-01  5.2442259e-01 -1.1907653e+00
 -6.3845569e-01 -3.1765711e+00  3.6393219e-01 -2.7347348e+00
  2.6659107e-01 -1.0754614e+00 -4.2642045e-01 -1.2139702e+00
 -1.2238749e+00  5.0212884e-01  1.4535734e-01 -1.0279227e+00
 -4.1746309e-01  3.5471407e-01 -2.2836846e-01  1.2801739e+00
  3.0033404e-01  3.0388144e-01 -1.9138057e+00  3.0720124e-01
 -5.3311962e-01 -1.1285787e+00 -4.6125358e-01  3.9622870e-01
 -2.9411147e+00  3.73283

In [6]:
# find the similarity between 'brain' and 'heart'
print(model.wv.similarity('brain', 'heart'))

0.5522788


In [7]:
# Find the similarity between 'brain' and 'car'
print(model.wv.similarity('brain', 'car'))

0.10803783
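
For reference, the `similarity` scores printed above are cosine similarities between the two word vectors. For vectors $\mathbf{u}$ and $\mathbf{v}$ (symbols introduced here for illustration), the formula is:

$$
\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$

The value ranges from -1 to 1 and depends only on the angle between the vectors, not their lengths. That is why 'brain' and 'heart' (about 0.55) count as much closer than 'brain' and 'car' (about 0.11).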


## Modern Word Embeddings
- Contextual embeddings: Words have different vectors based on context
- Example: "discharge"
  - "Patient ready for discharge" (noun - leaving hospital)
  - "Wound continues to discharge" (verb - producing fluid)
- Popular models:
  - BERT
  - BioBERT (trained on biomedical text)
  - ClinicalBERT (trained on clinical notes)


<img src="contextual_embedding.png" alt="Contextual embeddings" width="1200"/>

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load BioBERT
tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
model = AutoModel.from_pretrained('dmis-lab/biobert-v1.1')

# Process two different uses of "discharged"
text1 = "The patient was discharged."
text2 = "The wound discharged fluid."

# Get tokenized inputs
tokens1 = tokenizer(text1, return_tensors="pt")
tokens2 = tokenizer(text2, return_tensors="pt")

# Get the actual tokens to find the position of "discharged"
tokens1_words = tokenizer.convert_ids_to_tokens(tokens1['input_ids'][0])
tokens2_words = tokenizer.convert_ids_to_tokens(tokens2['input_ids'][0])

# Find the position of "discharged" in each sentence.
# Assumes the tokenizer keeps "discharged" as a single WordPiece token;
# if it is split into subwords, .index() raises ValueError.
discharge1_pos = tokens1_words.index('discharged')
discharge2_pos = tokens2_words.index('discharged')

# Get embeddings
with torch.no_grad():
    outputs1 = model(**tokens1)
    outputs2 = model(**tokens2)

# Extract the embedding of "discharged" in each context
discharge1_embedding = outputs1.last_hidden_state[0][discharge1_pos]
discharge2_embedding = outputs2.last_hidden_state[0][discharge2_pos]

# Compare embeddings
similarity = torch.cosine_similarity(
    discharge1_embedding.unsqueeze(0), 
    discharge2_embedding.unsqueeze(0))

print("Tokens in first sentence:", tokens1_words)
print("Tokens in second sentence:", tokens2_words)
print(f"\nSimilarity between the two uses of 'discharged': {similarity.item()}")

## Applications in Biomedical Domain
1. Clinical Text Classification
   - Diagnosis categorization
   - Clinical trial matching
2. Named Entity Recognition
   - Identifying drugs, diseases, symptoms
3. Information Extraction
   - Mining medical literature
   - Extracting patient information from notes
