# Baseline Classifier

## 🎯 Concept Primer
A simple baseline: mean-pooled word embeddings followed by a linear classifier.

**Architecture:** Embedding → Mean Pool → Linear → Output  
**Expected:** A baseline score that the later transformer model should beat
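
To make the pooling step concrete, here is a small worked example in plain Python. It uses made-up 2-dimensional embeddings for a three-token document (the values are illustrative, not taken from the trained model). Mean pooling averages each embedding dimension across tokens, so every document becomes one fixed-size vector regardless of its length.

```python
# Toy embeddings: 3 tokens, each a 2-dimensional vector (values are made up).
token_embeddings = [
    [1.0, 4.0],
    [3.0, 0.0],
    [2.0, 2.0],
]

def mean_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / n for d in range(dim)]

doc_vector = mean_pool(token_embeddings)
print(doc_vector)  # [2.0, 2.0]
```

The linear layer then maps this single vector to one score per class.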

## 📋 Objectives
1. Create Embedding layer
2. Mean-pool embeddings
3. Add Linear classifier
4. Train baseline model

## 🔧 Setup

In [1]:
# TODO 1: Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn import CrossEntropyLoss

## 🏗️ Build Baseline Model

### TODO 2: Define baseline classifier

**Expected:** Class with embedding + mean pool + linear

In [2]:
# TODO 2: Define model
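class BaseLineClassifier(nn.Module):
    """Mean-pooled embedding baseline: Embedding -> mean over tokens -> Linear."""

    def __init__(self, vocab_size, embed_dim=100, num_classes=13):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: [batch, seq_len] token indices, e.g. [32, 512]
        x = self.embedding(x)  # -> [batch, seq_len, embed_dim], e.g. [32, 512, 100]
        # Average over the sequence dimension (padding positions included)
        # so that self.fc receives [batch, embed_dim], e.g. [32, 100]
        x = x.mean(dim=1)
        x = self.fc(x)         # -> [batch, num_classes] logits
        return x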
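
One caveat of `x.mean(dim=1)`: it averages over every position, including `<PAD>` tokens, so short documents are diluted by padding vectors. A possible variant, sketched below under the assumption that index 0 is the padding index, averages only over real tokens. `masked_mean_pool` is a hypothetical helper name, and this variant is not what produced the results recorded in this notebook.

```python
import torch

def masked_mean_pool(embedded, token_ids, pad_idx=0):
    """Mean over non-padding positions only.

    embedded:  [batch, seq_len, embed_dim] float tensor
    token_ids: [batch, seq_len] long tensor
    """
    # 1.0 for real tokens, 0.0 for padding; shape [batch, seq_len, 1]
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embedded.dtype)
    summed = (embedded * mask).sum(dim=1)        # [batch, embed_dim]
    counts = mask.sum(dim=1).clamp(min=1.0)      # [batch, 1]; avoids division by zero
    return summed / counts

# One document with two real tokens followed by two padding positions.
token_ids = torch.tensor([[5, 7, 0, 0]])
embedded = torch.tensor([[[1.0, 1.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
pooled = masked_mean_pool(embedded, token_ids)
print(pooled)  # only the first two positions contribute
```

Inside `forward`, the helper would need the original token indices as well as the embedded tensor, so the indices would have to be kept in a separate variable before calling `self.embedding`.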



## 🚀 Train Baseline

### TODO 3: Training loop

**Expected:** Train for 10 epochs

In [3]:
from collections import Counter
import re
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, classification_report



def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

df = pd.read_csv('../data/processed/specialty_taxonomy_v1.csv')

df['text_clean'] = df['text'].apply(clean_text)

df['tokens'] = df['text_clean'].apply(lambda x: x.split())
df['token_count'] = df['tokens'].apply(len)

words = [word for tokens in df['tokens'] for word in tokens]
vocab = Counter(words)
word2idx = {
    "<PAD>": 0,
    "<UNK>": 1,
}

VOCAB_SIZE = 15000  # Total vocabulary size, including <PAD> and <UNK>; adjust later if needed

# Keep the VOCAB_SIZE - 2 most frequent words (2 slots are reserved for <PAD> and <UNK>)
most_common_words = vocab.most_common(VOCAB_SIZE - 2)

for idx, (word, count) in enumerate(most_common_words):
    word2idx[word] = idx + 2  # Start at 2 (after <PAD> and <UNK>)

def encode(tokens):
    encoded = []
    for token in tokens:
        idx = word2idx.get(token, word2idx['<UNK>'])
        encoded.append(idx)
    return encoded

df['encoded'] = df['tokens'].apply(encode)

max_len = 512  # Fixed sequence length; matches BERT's 512-token input limit

def pad_sequence(seq, max_len):
    if len(seq) > max_len:
        return seq[:max_len]
    return seq + [0] * (max_len - len(seq))

df['padded'] = df['encoded'].apply(lambda x: pad_sequence(x, max_len))

#df["specialty"] = df["specialty"].str.lower()

unique_specialities = df['specialty'].unique()
label2idx = {label: idx for idx, label in enumerate(unique_specialities)}

df['label_encoded'] = df['specialty'].map(label2idx)

texts = df['padded'].tolist()
labels = df['label_encoded'].tolist()

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return torch.tensor(text, dtype=torch.long), torch.tensor(label, dtype=torch.long)


dataset = TextDataset(texts, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
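
The preprocessing steps in this cell can be sanity-checked on a toy corpus. The sketch below repeats the same sequence (count words, reserve indices 0 and 1, encode with an `<UNK>` fallback, then pad or truncate) on two made-up token lists. The names `toy_docs`, `toy_word2idx`, `toy_encode`, `toy_pad` and the tiny `toy_vocab_size` are illustrative only and do not appear in the notebook.

```python
from collections import Counter

# Two made-up, already-tokenized documents.
toy_docs = [["chest", "pain", "chest"], ["knee", "pain", "chest"]]
toy_vocab_size = 4  # 2 special tokens + the 2 most frequent words

counts = Counter(word for doc in toy_docs for word in doc)
toy_word2idx = {"<PAD>": 0, "<UNK>": 1}
for idx, (word, _) in enumerate(counts.most_common(toy_vocab_size - 2)):
    toy_word2idx[word] = idx + 2  # real words start at index 2

def toy_encode(tokens, word2idx):
    """Map tokens to indices; out-of-vocabulary words fall back to <UNK>."""
    return [word2idx.get(tok, word2idx["<UNK>"]) for tok in tokens]

def toy_pad(seq, max_len, pad_idx=0):
    """Truncate to max_len, then right-pad with pad_idx up to max_len."""
    seq = seq[:max_len]
    return seq + [pad_idx] * (max_len - len(seq))

encoded = [toy_pad(toy_encode(doc, toy_word2idx), max_len=4) for doc in toy_docs]
print(toy_word2idx)
print(encoded)
```

Here "knee" is the least frequent word and falls outside the toy vocabulary, so it is encoded as `<UNK>` (index 1), and both sequences are padded with one trailing 0.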

In [5]:
# TODO 3: Training
baseline_model = BaseLineClassifier(
    vocab_size=VOCAB_SIZE,    
    embed_dim=100,       
    num_classes=13       
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(baseline_model.parameters(), lr=0.001)

n_epochs = 10

for epoch in range(n_epochs):
    epoch_losses = []
    baseline_model.train()
    # Use *_batch names so the full `texts` / `labels` lists are not overwritten
    for texts_batch, labels_batch in dataloader:
        optimizer.zero_grad()
        predictions = baseline_model(texts_batch)
        loss = criterion(predictions, labels_batch)
        loss.backward()
        optimizer.step()

        epoch_losses.append(loss.item())

    avg_loss = sum(epoch_losses) / len(epoch_losses)
    print(f"Epoch {epoch+1}/{n_epochs}, avg loss: {avg_loss:.4f}")

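# Evaluate. Note: this reuses the training dataloader, so the scores below
# measure fit to the training data, not performance on held-out examples.
baseline_model.eval()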

all_preds = []
all_labels = []

with torch.no_grad():
    for texts_batch, labels_batch in dataloader:
        outputs = baseline_model(texts_batch)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        lbls = labels_batch.cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(lbls)

accuracy = accuracy_score(all_labels, all_preds)
f1_weighted = f1_score(all_labels, all_preds, average='weighted')
f1_macro = f1_score(all_labels, all_preds, average='macro')

print(f"\n{'='*50}")
print("BASELINE MODEL EVALUATION")
print(f"{'='*50}")
print(f"Accuracy:    {accuracy:.4f}")
print(f"F1 Weighted: {f1_weighted:.4f}")
print(f"F1 Macro:    {f1_macro:.4f}")
print(f"{'='*50}")



Epoch 1/10, avg loss: 2.2914
Epoch 2/10, avg loss: 2.0793
Epoch 3/10, avg loss: 1.9176
Epoch 4/10, avg loss: 1.7717
Epoch 5/10, avg loss: 1.6532
Epoch 6/10, avg loss: 1.5594
Epoch 7/10, avg loss: 1.4802
Epoch 8/10, avg loss: 1.4094
Epoch 9/10, avg loss: 1.3458
Epoch 10/10, avg loss: 1.2858

BASELINE MODEL EVALUATION
Accuracy:    0.6264
F1 Weighted: 0.5918
F1 Macro:    0.4801
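
The gap between weighted F1 (0.5918) and macro F1 (0.4801) suggests that less frequent specialties are classified worse than frequent ones: macro averaging gives every class equal weight, while weighted averaging scales each class by its number of examples. A small pure-Python illustration with made-up labels (not the notebook's data) shows the effect; `per_class_f1` is a hypothetical helper written for this sketch.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted label lists."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up example: class 0 is frequent and easy, class 1 is rare and half missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

In this toy case the rare class pulls the macro average below the weighted one, the same direction as the gap in the baseline's scores.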


## 🤔 Reflection
1. What F1 scores (weighted and macro) did the baseline reach?
2. Is the setup ready for the transformer model?

**Your reflection:**

*Write here*
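
One point worth weighing for the second question: the evaluation loop in this notebook iterates over the same `dataloader` that was used for training, so the recorded scores describe training-set fit rather than generalization. A fair comparison with the transformer would likely need a held-out split. The sketch below shows one way to build a stratified index split in plain Python; `stratified_split` is a hypothetical helper, and the default `val_fraction` and `seed` are arbitrary choices.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split example indices into train/validation lists, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one example per class, unless the class has only one.
        if len(indices) > 1:
            n_val = max(1, round(len(indices) * val_fraction))
        else:
            n_val = 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Made-up labels: 10 examples of class 0, 5 of class 1, 1 of class 2.
toy_labels = [0] * 10 + [1] * 5 + [2] * 1
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

The two index lists could then be wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, val_idx)`, each with its own `DataLoader`, so that metrics are computed only on examples the model never trained on.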

## 📌 Summary
✅ Baseline trained  
✅ Performance recorded

**Next:** `05_transformer_setup_train.ipynb`