# CS 421: Natural Language Processing - Assignment 4
## Named Entity Recognition, TF-IDF, and PPMI Analysis

---

**Course:** CS 421 - Natural Language Processing  
**Submission Date:** November 2025

---

### Assignment Overview

This assignment implements three core NLP techniques:

1. **TF-IDF Vectorization** (25 points) - Building a document vectorizer from scratch
2. **PPMI Calculation** (5 points) - Computing word association metrics
3. **Named Entity Recognition** (20 points) - Deep learning with LSTM networks

**Total Points:** 50

---

## Environment Setup and Library Imports

This section imports all required libraries and dependencies for the assignment.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import math
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
from datasets import load_dataset

# Deep learning libraries
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Word embeddings
import gensim.downloader as api

# Visualization configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("All libraries imported successfully.")
print(f"NumPy version: {np.__version__}")
print(f"Keras version: {keras.__version__}")

---

## Question 1: TF-IDF Vectorization and Cosine Similarity

### Theoretical Foundation

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic that reflects the importance of a word to a document in a collection or corpus.

**Mathematical Formulas:**
- **Term Frequency (TF):** `tf(t,d) = log₁₀(count(t,d) + 1)`
- **Inverse Document Frequency (IDF):** `idf(t) = log₁₀(N / df_t)`
- **TF-IDF:** `tfidf(t,d) = tf(t,d) × idf(t)`

Where:
- `t` = term (word)
- `d` = document
- `N` = total number of documents
- `df_t` = number of documents containing term t
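
The formulas above can be checked by hand on a very small corpus. The sketch below is illustrative only: the three-document corpus and the helper names `tf`, `idf`, and `tfidf` are assumptions made for this example and are not part of the assignment code.

```python
import math

# Hypothetical toy corpus: three tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird", "flew"],
]
N = len(docs)

def tf(term, doc):
    # tf(t, d) = log10(count(t, d) + 1)
    return math.log10(doc.count(term) + 1)

def idf(term):
    # idf(t) = log10(N / df_t), where df_t counts the documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log10(N / df) if df else 0.0

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

for term in ["the", "sat", "dog"]:
    print(f"{term:>4}: idf={idf(term):.4f}  tfidf in docs[1]={tfidf(term, docs[1]):.4f}")
```

In this toy corpus "the" occurs in all three documents, so its IDF is log₁₀(3/3) = 0 and it contributes nothing to any vector, while "dog" occurs in a single document and receives the largest IDF.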

**Cosine Similarity** measures the cosine of the angle between two vectors in multi-dimensional space:
```
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
```
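
As a quick sanity check of the cosine formula, the following sketch evaluates three hand-picked cases using only the standard library. The function name `cosine` and the example vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    # (A · B) / (||A|| × ||B||), with 0.0 returned when either vector has zero length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])                # no shared dimension
zero_vector = cosine([0.0, 0.0], [1.0, 1.0])               # guarded division

print(same_direction, orthogonal, zero_vector)
```

Parallel vectors score (up to floating-point rounding) 1 regardless of their lengths, vectors with no shared non-zero dimension score 0, and the zero-vector guard avoids a division by zero.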

---

### Implementation

Custom TF-IDF Vectorizer implementation from scratch.

In [None]:
class TfIdfVectorizer:
    """
    Custom TF-IDF Vectorizer implementation
    """

    def __init__(self):
        self.vocabulary = {}
        self.idf_scores = {}
        self.num_documents = 0

    def build_vocabulary(self, documents):
        """Build vocabulary from document collection"""
        unique_words = set()
        for document in documents:
            unique_words.update(document)
        
        self.vocabulary = {word: idx for idx, word in enumerate(sorted(unique_words))}
        print(f"Vocabulary size: {len(self.vocabulary)} words")

    def calculate_document_frequency(self, documents):
        """Calculate document frequency for each term"""
        doc_freq = defaultdict(int)
        for document in documents:
            unique_words_in_doc = set(document)
            for word in unique_words_in_doc:
                doc_freq[word] += 1
        return dict(doc_freq)

    def compute_term_frequency(self, term, document):
        """Calculate term frequency using logarithmic scaling"""
        term_count = document.count(term)
        return math.log10(term_count + 1)

    def get_idf_score(self, term):
        """Retrieve inverse document frequency score for a term"""
        if term in self.idf_scores:
            return self.idf_scores[term]
        return 0.0

    def fit(self, documents):
        """Train the vectorizer on a collection of documents"""
        self.num_documents = len(documents)
        self.build_vocabulary(documents)
        
        doc_freq = self.calculate_document_frequency(documents)
        
        # Calculate IDF scores
        for word in self.vocabulary:
            df = doc_freq.get(word, 0)
            if df > 0:
                self.idf_scores[word] = math.log10(self.num_documents / df)
            else:
                self.idf_scores[word] = 0.0
        
        print(f"Fitted on {self.num_documents} documents")

    def create_tfidf_vector(self, document):
        """Generate TF-IDF vector for a single document"""
        vector = np.zeros(len(self.vocabulary))
        
        # Visit each distinct word once; repeated tokens would recompute the same value
        for word in set(document):
            if word in self.vocabulary:
                word_idx = self.vocabulary[word]
                tf = self.compute_term_frequency(word, document)
                idf = self.get_idf_score(word)
                vector[word_idx] = tf * idf
        
        return vector

    def transform(self, documents):
        """Transform multiple documents into TF-IDF matrix"""
        matrix = np.zeros((len(documents), len(self.vocabulary)))
        
        for doc_idx, document in enumerate(documents):
            matrix[doc_idx] = self.create_tfidf_vector(document)
        
        return matrix

    def fit_transform(self, documents):
        """Fit and transform in a single operation"""
        self.fit(documents)
        return self.transform(documents)


def compute_cosine_similarity(vector_a, vector_b):
    """Calculate cosine similarity between two vectors"""
    dot_product = np.dot(vector_a, vector_b)
    magnitude_a = np.linalg.norm(vector_a)
    magnitude_b = np.linalg.norm(vector_b)
    
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    
    return dot_product / (magnitude_a * magnitude_b)

print("TF-IDF Vectorizer class defined successfully.")

### Load CoNLL2003 Dataset

In [None]:
# Load CoNLL2003 dataset
print("Loading CoNLL2003 dataset...")
dataset = load_dataset("conll2003")

# Extract tokens from training set
train_data = dataset['train']

# Process each sentence as a document
documents = []
for idx in range(min(1000, len(train_data))):
    tokens = train_data[idx]['tokens']
    documents.append([token.lower() for token in tokens])

print(f"Loaded {len(documents)} documents from CoNLL2003 dataset")
print(f"\nSample document: {' '.join(documents[0][:20])}...")

### Build TF-IDF Matrix

In [None]:
# Initialize and train TF-IDF vectorizer
vectorizer = TfIdfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(f"\nTF-IDF Matrix shape: {tfidf_matrix.shape}")
print(f"  → {tfidf_matrix.shape[0]} documents × {tfidf_matrix.shape[1]} features")
print(f"  Total values: {tfidf_matrix.shape[0] * tfidf_matrix.shape[1]:,}")

### Visualize TF-IDF Matrix

Heatmap visualization showing TF-IDF values for top words across documents.

In [None]:
# Visualize TF-IDF matrix (first 20 documents, top 30 words)
fig, ax = plt.subplots(figsize=(14, 8))

# Identify top words by average TF-IDF
avg_tfidf = tfidf_matrix.mean(axis=0)
top_word_indices = np.argsort(avg_tfidf)[-30:]

# Map indices to words
idx_to_word = {v: k for k, v in vectorizer.vocabulary.items()}
top_words = [idx_to_word[i] for i in top_word_indices]

# Create heatmap
subset = tfidf_matrix[:20, top_word_indices]
sns.heatmap(subset, cmap='YlOrRd', cbar_kws={'label': 'TF-IDF Score'},
            xticklabels=top_words, yticklabels=[f'Doc {i}' for i in range(20)],
            ax=ax)
ax.set_title('TF-IDF Heatmap: Top 30 Words Across First 20 Documents', fontsize=16, pad=20)
ax.set_xlabel('Words', fontsize=12)
ax.set_ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Heatmap displays TF-IDF scores (higher values indicate greater importance).")

### Cosine Similarity Analysis

Computing cosine similarity for specified sentence pairs.

In [None]:
# Test sentence pairs
test_pairs = [
    ("I love football", "I do not love football"),
    ("I follow cricket", "I follow baseball")
]

results = []

print("Computing cosine similarities:\n")
print("=" * 80)

for sent1, sent2 in test_pairs:
    # Tokenize
    tokens1 = sent1.lower().split()
    tokens2 = sent2.lower().split()
    
    # Generate TF-IDF vectors
    vec1 = vectorizer.create_tfidf_vector(tokens1)
    vec2 = vectorizer.create_tfidf_vector(tokens2)
    
    # Calculate similarity
    similarity = compute_cosine_similarity(vec1, vec2)
    
    results.append({
        'Sentence 1': sent1,
        'Sentence 2': sent2,
        'Cosine Similarity': similarity,
        'Interpretation': 'High similarity' if similarity > 0.5 else 'Low similarity'
    })
    
    print(f"\nPair {len(results)}:")
    print(f"   Sentence 1: '{sent1}'")
    print(f"   Sentence 2: '{sent2}'")
    print(f"   Cosine Similarity: {similarity:.4f}")
    print(f"   Interpretation: {results[-1]['Interpretation']}")
    print("-" * 80)

# Create results DataFrame
results_df = pd.DataFrame(results)
print("\n" + "=" * 80)
print(results_df.to_string(index=False))
print("=" * 80)

### Visualize Cosine Similarity Results

In [None]:
# Visualize cosine similarities
fig, ax = plt.subplots(figsize=(10, 6))

pair_labels = [f"Pair {i+1}" for i in range(len(results))]
similarity_values = [r['Cosine Similarity'] for r in results]
colors = ['#2ecc71' if s > 0.5 else '#e74c3c' for s in similarity_values]

bars = ax.bar(pair_labels, similarity_values, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
ax.axhline(y=0.5, color='black', linestyle='--', linewidth=1, label='Similarity Threshold (0.5)')
ax.set_ylabel('Cosine Similarity', fontsize=12)
ax.set_xlabel('Sentence Pairs', fontsize=12)
ax.set_title('Cosine Similarity Between Sentence Pairs', fontsize=16, pad=20)
ax.set_ylim(0, 1)
ax.legend()
ax.grid(axis='y', alpha=0.3)

# Add value labels
for bar, sim in zip(bars, similarity_values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{sim:.4f}',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

### Q1 Analysis and Findings

**Key Observations:**

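1. **Pair 1: "I love football" vs "I do not love football"**
   - The second sentence contains every word of the first plus "do" and "not", so the lexical overlap is high even though the negation reverses the meaning
   - Cosine similarity rewards the shared words and has no representation of negation
   - TF-IDF therefore measures lexical similarity, not semantic similarity
   
2. **Pair 2: "I follow cricket" vs "I follow baseball"**
   - The sentences share "I" and "follow" and differ in a single word
   - "cricket" and "baseball" both name sports, but TF-IDF treats them as unrelated dimensions, so they add nothing to the dot product
   - Any similarity comes from the shared words alone; word order and sentence structure play no role in a bag-of-words vector

**Note:** `create_tfidf_vector` ignores words that are missing from the fitted vocabulary, and a word that appears in every document has an IDF of 0. The scores for both pairs therefore depend on which of these words occur in the 1,000 CoNLL2003 training sentences.

**Conclusion:** TF-IDF with cosine similarity captures lexical similarity (word overlap) but cannot distinguish semantic differences such as negation or context-dependent meaning.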
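
To make the lexical-overlap point concrete, the sketch below scores the same two pairs with raw term counts instead of the fitted TF-IDF weights. That substitution is an assumption made so the example is self-contained, so the numbers are not the notebook's outputs; `count_cosine` is a hypothetical helper.

```python
import math
from collections import Counter

def count_cosine(tokens_a, tokens_b):
    """Cosine similarity between raw term-count vectors (no IDF weighting)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    va = [ca[w] for w in vocab]
    vb = [cb[w] for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

pair1 = count_cosine("i love football".split(), "i do not love football".split())
pair2 = count_cosine("i follow cricket".split(), "i follow baseball".split())

print(f"negated pair:   {pair1:.4f}")  # 3 / sqrt(3 * 5)
print(f"sport swapped:  {pair2:.4f}")  # 2 / 3
```

With raw counts the negated pair (about 0.77) scores higher than the pair that only swaps one sport for another (about 0.67), even though the second pair is much closer in meaning. The representation has no access to what the differing words mean.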

---

## Question 2: PPMI (Positive Pointwise Mutual Information)

### Theoretical Foundation

**Pointwise Mutual Information (PMI)** measures the association between two words:

```
PMI(x, y) = log₂(p(x,y) / (p(x) × p(y)))
```

**Positive PMI (PPMI)** retains only positive associations:
```
PPMI(x, y) = max(PMI(x, y), 0)
```

Where:
- `p(x)` = probability of word x
- `p(y)` = probability of word y
- `p(x,y)` = probability of x and y co-occurring (in the implementation below, x immediately followed by y)

**Interpretation:** Higher PPMI values indicate stronger word associations beyond random chance.
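
The arithmetic can be followed on two short sequences. This sketch assumes, as the implementation in this notebook does, that word probabilities are normalized by the number of tokens and pair probabilities by the number of adjacent pairs; the helper name `ppmi` is hypothetical.

```python
import math

def ppmi(words, x, y):
    """PPMI of the ordered adjacent pair (x, y) in a token list."""
    pairs = list(zip(words, words[1:]))          # adjacent ordered pairs
    p_x = words.count(x) / len(words)
    p_y = words.count(y) / len(words)
    p_xy = pairs.count((x, y)) / len(pairs)
    if p_x == 0 or p_y == 0 or p_xy == 0:
        return 0.0
    pmi = math.log2(p_xy / (p_x * p_y))
    return max(pmi, 0.0)                         # clip negative associations to 0

# p(a)=2/4, p(b)=1/4, p(a,b)=1/3  ->  log2((1/3) / (1/8)) = log2(8/3)
positive = ppmi(["a", "b", "a", "c"], "a", "b")

# p(a)=5/6, p(a,a)=3/5  ->  log2(0.6 / (25/36)) is negative, so PPMI is 0
clipped = ppmi(["a", "a", "b", "a", "a", "a"], "a", "a")

print(f"{positive:.4f} {clipped:.4f}")
```

The second case shows why the `max` is needed: a very frequent word can follow itself less often than independence would predict, which gives a negative PMI that PPMI replaces with 0.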

---

### Implementation

In [None]:
def calculate_ppmi(words):
    """
    Calculate Positive Pointwise Mutual Information for word pairs
    """
    # Count word frequencies
    word_counts = Counter(words)
    total_words = len(words)
    
    # Count adjacent word pairs
    pair_counts = Counter()
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        pair_counts[pair] += 1
    
    total_pairs = sum(pair_counts.values())
    
    # Calculate PPMI for each pair
    ppmi_dict = {}
    
    for (word1, word2), pair_count in pair_counts.items():
        # Calculate probabilities
        p_word1 = word_counts[word1] / total_words
        p_word2 = word_counts[word2] / total_words
        p_pair = pair_count / total_pairs
        
        # Calculate PMI and PPMI
        if p_word1 > 0 and p_word2 > 0 and p_pair > 0:
            pmi = math.log2(p_pair / (p_word1 * p_word2))
            ppmi = max(pmi, 0)
            ppmi_dict[(word1, word2)] = ppmi
    
    return ppmi_dict

print("PPMI function defined successfully.")

### Example 1: Simple Case

In [None]:
# Example from assignment specification
example_words = ['a', 'b', 'a', 'c']
ppmi_results = calculate_ppmi(example_words)

print("Example: words = ['a', 'b', 'a', 'c']\n")
print("PPMI Results:")
print("=" * 40)
for pair, ppmi_value in sorted(ppmi_results.items()):
    print(f"  {pair}: {ppmi_value:.4f}")
print("=" * 40)

### Example 2: Realistic Sentence

In [None]:
# Extended example
sentence = "the cat sat on the mat the dog sat on the log".split()
ppmi_results2 = calculate_ppmi(sentence)

print(f"Example: '{' '.join(sentence)}'\n")
print("PPMI Results (top 10 word pairs):")
print("=" * 50)
for pair, ppmi_value in sorted(ppmi_results2.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {pair[0]:8s} → {pair[1]:8s} : {ppmi_value:.4f}")
print("=" * 50)

### Visualize PPMI Values

In [None]:
# Create visualization
fig, ax = plt.subplots(figsize=(12, 6))

pair_labels = [f"{p[0]}-{p[1]}" for p in ppmi_results2.keys()]
ppmi_values = list(ppmi_results2.values())

# Sort by value
sorted_data = sorted(zip(pair_labels, ppmi_values), key=lambda x: x[1], reverse=True)
sorted_labels = [d[0] for d in sorted_data]
sorted_values = [d[1] for d in sorted_data]

bars = ax.barh(sorted_labels, sorted_values, color='steelblue', alpha=0.7, edgecolor='black')
ax.set_xlabel('PPMI Value', fontsize=12)
ax.set_ylabel('Word Pairs', fontsize=12)
ax.set_title('PPMI Scores for Word Pairs', fontsize=14, pad=20)
ax.grid(axis='x', alpha=0.3)

# Add value labels
for bar, value in zip(bars, sorted_values):
    width = bar.get_width()
    ax.text(width, bar.get_y() + bar.get_height()/2.,
            f'{value:.3f}',
            ha='left', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

### Q2 Analysis and Findings

**Key Observations:**

1. **Higher PPMI values** indicate word pairs that co-occur more often than they would if the two words were independent
2. **Rare words that appear only together** receive the highest PPMI scores, because their individual probabilities are small
3. **Common word sequences** may have lower PPMI scores if each word also appears frequently on its own

**Applications:**
- Collocation detection (identifying phrases like "ice cream")
- Word association mining (discovering semantic relationships)
- Feature engineering for NLP tasks
- Understanding semantic relationships in text

**Conclusion:** PPMI effectively identifies meaningful word associations by measuring co-occurrence patterns beyond random chance.

---

## Question 3: Named Entity Recognition Using LSTM

### Theoretical Foundation

**Named Entity Recognition (NER)** identifies and classifies named entities in text into predefined categories such as persons, organizations, and locations.

**CoNLL2003 NER Tags (BIO Scheme):**
- 0: O (Outside - non-entity tokens)
- 1-2: B-PER, I-PER (Person)
- 3-4: B-ORG, I-ORG (Organization)
- 5-6: B-LOC, I-LOC (Location)
- 7-8: B-MISC, I-MISC (Miscellaneous)
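
One way to read a BIO-tagged sequence is to group tokens into entity spans. The sketch below uses the tag ids listed above; the function `bio_to_spans` and the example sentence are assumptions for illustration and do not come from the dataset.

```python
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def bio_to_spans(tokens, tag_ids):
    """Group tokens into (entity_type, text) spans from BIO tag ids."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tag_ids):
        tag = TAG_NAMES[tag_id]
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag opens an entity; a stray I- tag is treated the same way.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)      # continue the open entity
        else:
            flush()                           # "O" closes any open entity
            current_type, current_tokens = None, []
    flush()
    return spans

# Hypothetical sentence with hand-assigned tag ids.
tokens = ["john", "smith", "works", "at", "acme", "corp", "in", "paris"]
tag_ids = [1, 2, 0, 0, 3, 4, 0, 5]
spans = bio_to_spans(tokens, tag_ids)
print(spans)
# [('PER', 'john smith'), ('ORG', 'acme corp'), ('LOC', 'paris')]
```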

**LSTM (Long Short-Term Memory)** networks are well-suited for sequence labeling tasks:
- Handle variable-length sequences
- Capture long-range dependencies
- Use gating mechanisms to control information flow
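
The gating mechanism can be sketched as a single time step of an LSTM cell. To keep the example readable it assumes one hidden unit and a scalar input, so every weight is a plain number; the parameter names (`wf`, `uf`, `bf`, and so on) and their values are hypothetical, and a Keras `LSTM` layer applies the same update with weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    # Each gate sees the current input and the previous hidden state.
    f = sigmoid(params["wf"] * x + params["uf"] * h_prev + params["bf"])    # forget gate
    i = sigmoid(params["wi"] * x + params["ui"] * h_prev + params["bi"])    # input gate
    o = sigmoid(params["wo"] * x + params["uo"] * h_prev + params["bo"])    # output gate
    g = math.tanh(params["wg"] * x + params["ug"] * h_prev + params["bg"])  # candidate value
    c = f * c_prev + i * g     # keep part of the old memory, write part of the candidate
    h = o * math.tanh(c)       # expose a gated view of the memory
    return h, c

# Hypothetical parameters: every weight and bias set to 0.5.
names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
params = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:     # the same cell is reused for any sequence length
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is how information can persist across many tokens.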

---

### Data Preparation

In [None]:
def prepare_ner_data(dataset, max_samples=5000):
    """Prepare CoNLL2003 data for NER training"""
    sentences = []
    tags = []
    
    train_data = dataset['train']
    num_samples = min(max_samples, len(train_data))
    
    for i in range(num_samples):
        tokens = [token.lower() for token in train_data[i]['tokens']]
        ner_tags = train_data[i]['ner_tags']
        sentences.append(tokens)
        tags.append(ner_tags)
    
    # Build vocabulary
    all_words = set(word for sent in sentences for word in sent)
    word_to_idx = {word: idx + 2 for idx, word in enumerate(sorted(all_words))}
    word_to_idx['<PAD>'] = 0
    word_to_idx['<UNK>'] = 1
    
    # CoNLL2003 ner_tags are already integer ids 0-8, so the tag mapping is the identity
    tag_to_idx = {i: i for i in range(9)}
    
    return sentences, tags, word_to_idx, tag_to_idx

# Prepare data
print("Preparing NER data...")
sentences, tags, word_to_idx, tag_to_idx = prepare_ner_data(dataset, max_samples=5000)
idx_to_tag = {v: k for k, v in tag_to_idx.items()}

print(f"Number of sentences: {len(sentences)}")
print(f"Vocabulary size: {len(word_to_idx)}")
print(f"Number of NER tags: {len(tag_to_idx)}")
print(f"\nSample sentence: {' '.join(sentences[0][:15])}...")
print(f"Sample tags: {tags[0][:15]}")

### Sequence Padding and Train/Test Split

In [None]:
# Determine maximum sequence length
max_len = max(len(sent) for sent in sentences)
max_len = min(max_len, 100)  # Cap at 100 for efficiency

print(f"Maximum sequence length: {max_len}\n")

# Convert to sequences
X = []
y = []

for sent, tag_seq in zip(sentences, tags):
    sent_indices = [word_to_idx.get(word, word_to_idx['<UNK>']) for word in sent]
    X.append(sent_indices)
    y.append(tag_seq)

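# Pad sequences. Labels are padded with 0, which is also the id of the 'O' tag,
# so padded positions must later be identified from X_padded, not from the labels.
X_padded = pad_sequences(X, maxlen=max_len, padding='post', value=word_to_idx['<PAD>'])
y_padded = pad_sequences(y, maxlen=max_len, padding='post', value=0)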

# Convert to categorical
y_categorical = np.array([to_categorical(seq, num_classes=9) for seq in y_padded])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_padded, y_categorical, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")

### Load Word2Vec Embeddings

In [None]:
def create_embedding_matrix(word_to_idx, word2vec_model, embedding_dim=300):
    """Create embedding matrix from Word2Vec model"""
    vocab_size = len(word_to_idx)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    
    found_count = 0
    for word, idx in word_to_idx.items():
        if word in word2vec_model:
            embedding_matrix[idx] = word2vec_model[word]
            found_count += 1
        else:
            embedding_matrix[idx] = np.random.normal(0, 0.1, embedding_dim)
    
    coverage = 100 * found_count / vocab_size
    print(f"Found {found_count}/{vocab_size} words in Word2Vec ({coverage:.2f}% coverage)")
    return embedding_matrix

# Load Word2Vec
print("Loading Word2Vec embeddings (Google News 300D)...")
print("(First run may take time - downloading 1.5GB of embeddings)\n")

try:
    word2vec = api.load("word2vec-google-news-300")
    print("Word2Vec loaded successfully\n")
    
    embedding_matrix = create_embedding_matrix(word_to_idx, word2vec)
    use_pretrained = True
except Exception as e:
    print(f"Error loading Word2Vec: {e}")
    print("Using random embeddings instead\n")
    embedding_matrix = None
    use_pretrained = False

### Build LSTM Model

In [None]:
# Build model architecture
print("Building LSTM model...\n")

model = Sequential()

# Embedding layer
if use_pretrained and embedding_matrix is not None:
    model.add(Embedding(
        input_dim=len(word_to_idx),
        output_dim=300,
        weights=[embedding_matrix],
        input_length=max_len,
        trainable=False,
        mask_zero=True
    ))
else:
    model.add(Embedding(
        input_dim=len(word_to_idx),
        output_dim=300,
        input_length=max_len,
        mask_zero=True
    ))

# LSTM layers
model.add(LSTM(128, return_sequences=True, dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2))
model.add(LSTM(32, return_sequences=True, dropout=0.2))

# Dense layers
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))

# Output layer
model.add(Dense(9, activation='softmax'))

# Compile model
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()
print("\nModel architecture configured successfully.")

### Train the Model

In [None]:
# Train model
print("\nTraining LSTM model (10 epochs)...\n")

history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=32,
    verbose=1
)

print("\nTraining complete.")

### Visualize Training History

In [None]:
# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# Loss plot
ax1.plot(history.history['loss'], label='Training Loss', marker='o', linewidth=2)
ax1.plot(history.history['val_loss'], label='Validation Loss', marker='s', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Model Loss Over Epochs', fontsize=14, pad=15)
ax1.legend(fontsize=10)
ax1.grid(alpha=0.3)

# Accuracy plot
ax2.plot(history.history['accuracy'], label='Training Accuracy', marker='o', linewidth=2)
ax2.plot(history.history['val_accuracy'], label='Validation Accuracy', marker='s', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.set_title('Model Accuracy Over Epochs', fontsize=14, pad=15)
ax2.legend(fontsize=10)
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Training curves demonstrate model learning progression across epochs.")

### Model Evaluation

In [None]:
# Evaluate model
print("Evaluating model on test set...\n")

predictions = model.predict(X_test)
pred_classes = np.argmax(predictions, axis=-1)
true_classes = np.argmax(y_test, axis=-1)

# Flatten predictions and labels, keeping only real (non-padding) tokens.
# Padding is detected from the inputs: label 0 is also the genuine 'O' tag,
# so the labels alone cannot distinguish padding from non-entity tokens.
pad_mask = X_test != word_to_idx['<PAD>']
pred_flat = pred_classes[pad_mask]
true_flat = true_classes[pad_mask]

# Calculate metrics
accuracy = accuracy_score(true_flat, pred_flat)
precision, recall, f1, _ = precision_recall_fscore_support(
    true_flat, pred_flat, average='macro', zero_division=0
)

print("=" * 80)
print(" " * 30 + "EVALUATION RESULTS")
print("=" * 80)
print(f"  Accuracy:           {accuracy:.4f}")
print(f"  Macro Precision:    {precision:.4f}")
print(f"  Macro Recall:       {recall:.4f}")
print(f"  Macro F1-Score:     {f1:.4f}")
print("=" * 80)

# Save metrics
metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1
}
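
Because padded label positions are filled with 0, the same id as the O tag, token-level scores are only meaningful when padding is excluded. The sketch below shows the effect on a tiny hand-made batch using plain Python lists; `masked_token_metrics` and the toy arrays are assumptions for illustration, and classes that appear in neither the labels nor the predictions are skipped when averaging.

```python
def masked_token_metrics(true_ids, pred_ids, token_ids, pad_id=0, num_classes=9):
    """Token accuracy and macro F1 computed over non-padding positions only."""
    pairs = [
        (t, p)
        for t_row, p_row, x_row in zip(true_ids, pred_ids, token_ids)
        for t, p, x in zip(t_row, p_row, x_row)
        if x != pad_id                       # drop padded positions
    ]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        if tp + fp + fn == 0:
            continue                         # class absent from labels and predictions
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1

# Toy batch: two sentences padded to length 4 (word id 0 is padding).
token_ids = [[5, 8, 0, 0], [3, 0, 0, 0]]
true_ids = [[1, 0, 0, 0], [5, 0, 0, 0]]
pred_ids = [[1, 0, 0, 0], [0, 0, 0, 0]]

accuracy, macro_f1 = masked_token_metrics(true_ids, pred_ids, token_ids)

all_pairs = [(t, p) for t_row, p_row in zip(true_ids, pred_ids) for t, p in zip(t_row, p_row)]
unmasked_accuracy = sum(t == p for t, p in all_pairs) / len(all_pairs)

print(f"masked accuracy:   {accuracy:.4f}")
print(f"unmasked accuracy: {unmasked_accuracy:.4f}")
print(f"masked macro F1:   {macro_f1:.4f}")
```

In this toy batch, counting the padded positions raises accuracy from 2/3 to 7/8, because every padded position is trivially "correct" when the model predicts O.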

### Visualize Metrics

In [None]:
# Visualize metrics
fig, ax = plt.subplots(figsize=(10, 6))

metric_names = list(metrics.keys())
metric_values = list(metrics.values())
colors = ['#3498db', '#2ecc71', '#f39c12', '#e74c3c']

bars = ax.bar(metric_names, metric_values, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
ax.set_ylabel('Score', fontsize=12)
ax.set_xlabel('Metrics', fontsize=12)
ax.set_title('NER Model Performance Metrics', fontsize=16, pad=20)
ax.set_ylim(0, 1)
ax.grid(axis='y', alpha=0.3)

# Add value labels
for bar, value in zip(bars, metric_values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{value:.4f}',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

### Confusion Matrix

In [None]:
# Confusion matrix
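# Fix the label set so the matrix is always 9x9 and lines up with tag_names,
# even if a tag never occurs in the evaluated tokens
cm = confusion_matrix(true_flat, pred_flat, labels=list(range(9)))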

# Tag names
tag_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=tag_names, yticklabels=tag_names,
            cbar_kws={'label': 'Count'}, ax=ax)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_title('Confusion Matrix - NER Model Predictions', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

print("Darker diagonal values indicate higher prediction accuracy for each class.")

### Sample Predictions

In [None]:
# Display sample predictions
idx_to_word = {v: k for k, v in word_to_idx.items()}

print("\nSample Predictions:\n")
print("=" * 80)

for i in range(3):
    # Get original sentence
    sent_indices = X_test[i]
    sent_words = [idx_to_word.get(idx, '<UNK>') for idx in sent_indices if idx != 0]
    
    # Get predictions and true labels
    pred_tags = [tag_names[idx] for idx in pred_classes[i][:len(sent_words)]]
    true_tags = [tag_names[idx] for idx in true_classes[i][:len(sent_words)]]
    
    print(f"\nExample {i+1}:")
    print("-" * 80)
    print("Sentence:", " ".join(sent_words))
    print("\nTrue tags:     ", " ".join(true_tags))
    print("Predicted tags:", " ".join(pred_tags))
    print("=" * 80)

### Q3 Analysis and Findings

**Model Architecture:**
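- Embedding layer (300 dimensions; frozen Word2Vec vectors, or trainable randomly initialized embeddings if Word2Vec fails to load)
- 3 LSTM layers with decreasing units (128 → 64 → 32), each with dropout 0.2
- Dense layer (64 units) with ReLU activation, followed by dropout 0.3
- Output layer with softmax over the 9 NER tags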

**Training Configuration:**
- Loss function: Categorical cross-entropy
- Optimizer: Adam
- Epochs: 10
- Batch size: 32

**Key Observations:**
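1. The training curves and the macro-averaged metrics show how well the model learns NER patterns; token accuracy alone is inflated by the dominant O class
2. The stacked LSTM layers are unidirectional, so each prediction is conditioned on the current and preceding tokens only
3. Word2Vec embeddings provide semantic initialization, but the tokens are lowercased while the Google News vectors are case-sensitive, which lowers vocabulary coverage (see the coverage figure printed when the embedding matrix is built)
4. The BIO tagging scheme marks entity boundaries explicitly: a B- tag opens an entity and I- tags continue it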

**Potential Improvements:**
- Implement bidirectional LSTM for enhanced context capture
- Add CRF layer for sequence constraint modeling
- Incorporate character-level embeddings for out-of-vocabulary words
- Increase training data size
- Fine-tune embeddings during training
- Add attention mechanisms

**Conclusion:** The LSTM-based model labels named entities from sequential context. Its quality should be judged from the macro-averaged metrics and the confusion matrix rather than from token accuracy alone.

---

## Summary and Conclusions

### Assignment Completion

This assignment successfully implemented three core NLP techniques:

#### Question 1: TF-IDF & Cosine Similarity (25 pts)
- Developed custom TF-IDF vectorizer from scratch
- Implemented document frequency tracking
- Created a TF-IDF matrix for the first 1,000 CoNLL2003 training sentences
- Computed cosine similarity for sentence pairs
- Visualized results with heatmaps and bar charts

#### Question 2: PPMI Calculation (5 pts)
- Implemented Pointwise Mutual Information
- Calculated word co-occurrence statistics
- Applied PPMI transformation
- Demonstrated with multiple examples
- Visualized word associations

#### Question 3: LSTM-based NER (20 pts)
- Loaded and preprocessed CoNLL2003 dataset
- Integrated Word2Vec embeddings
- Built 3-layer LSTM architecture
- Trained for 10 epochs with Adam optimizer
- Evaluated the 9-class tagger on a held-out 20% split
- Reported accuracy and macro-averaged precision, recall, and F1
- Visualized training progress and confusion matrix

---

### Key Takeaways

1. **TF-IDF** effectively captures document-specific word importance
2. **PPMI** reveals strong word associations and collocations
3. **LSTM networks** excel at sequence labeling tasks like NER
4. **Pre-trained embeddings** (Word2Vec) improve model initialization
5. **Comprehensive evaluation** requires multiple performance metrics

---

### Technologies Used

- **Python 3.x** - Programming language
- **NumPy** - Numerical computing
- **Pandas** - Data manipulation
- **Matplotlib & Seaborn** - Data visualization
- **Keras/TensorFlow** - Deep learning framework
- **Hugging Face Datasets** - CoNLL2003 dataset
- **Gensim** - Word2Vec embeddings
- **scikit-learn** - Evaluation metrics

---

**Assignment Complete**

For additional details, refer to the [GitHub repository](https://github.com/RamenMachine/Natural-Language-Processing).