# Embedding Representations

### 🧾 Introduction

NER performance depends heavily on how tokens are represented. Embeddings encode syntactic and semantic information about each token, and the NER model consumes these vectors as input features when predicting entity labels.





In [None]:
!pip install -q transformers torch accelerate scikit-learn

In [None]:
!pip install -U -q datasets

In [None]:
!pip install -q seqeval

In [83]:
from datasets import load_dataset

dataset = load_dataset("conll2003")
example = dataset["train"][0]
tokens = example["tokens"]
labels = example["ner_tags"]

### 🔸 Static Embeddings (e.g., GloVe, Word2Vec)

* Each word has a single fixed vector
* Fast and lightweight
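* Lacks context sensitivity (e.g., "bank" gets the same vector in "river bank" and "bank account")
* Out-of-vocabulary (OOV) words have no vector and need a fallback such as a zero vector; typically used as input to a BiLSTM-CRF tagger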
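
The two limitations above can be illustrated with a toy lookup table. This is a minimal sketch: the three-dimensional vectors and the names `EMBEDDINGS`, `lookup`, and `embed_sentence` are made up for illustration and do not come from GloVe or Word2Vec.

```python
# Toy static embedding table; the 3-d vectors are invented for illustration.
EMBEDDINGS = {
    "river": [0.1, 0.8, 0.0],
    "bank": [0.5, 0.5, 0.2],
    "account": [0.9, 0.1, 0.3],
}
DIM = 3


def lookup(token):
    """Return the fixed vector for a token, or a zero vector if it is OOV."""
    return EMBEDDINGS.get(token.lower(), [0.0] * DIM)


def embed_sentence(tokens):
    """Embed each token independently; neighbouring words are ignored."""
    return [lookup(tok) for tok in tokens]


river_sent = embed_sentence(["river", "bank"])
money_sent = embed_sentence(["bank", "account"])

# "bank" receives the identical vector in both sentences.
same_vector = river_sent[1] == money_sent[0]

# A word missing from the table falls back to zeros.
oov_vector = lookup("blockchain")

print(same_vector, oov_vector)
```

Because the lookup never sees the surrounding tokens, any disambiguation has to happen in the layers stacked on top, which is one reason these vectors are usually fed to a sequence model.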

In [None]:
!pip install -U -q numpy

In [None]:
!pip install -U -q gensim

In [None]:
import gensim.downloader as api
import torch

glove = api.load("glove-wiki-gigaword-100")
embedding_dim = 100

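def get_glove_embedding(token):
    # This GloVe model has a lowercased vocabulary, so normalize case before lookup.
    # Tokens that are still missing (OOV) fall back to a zero vector.
    key = token.lower()
    return torch.tensor(glove[key]) if key in glove else torch.zeros(embedding_dim)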

sentence_embed = torch.stack([get_glove_embedding(tok) for tok in tokens])

print("🔹 Static Embedding shape:", sentence_embed.shape)


### 🔸 Contextual Embeddings

#### 🔹 ELMo

* Uses deep BiLSTM to generate context-aware embeddings
* Captures different meanings of the same word in different contexts

In [None]:
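!pip install -q "allennlp==0.9.0"  # ElmoEmbedder was removed in allennlp 1.0

In [None]:
# Requires allennlp < 1.0; later releases no longer ship allennlp.commands.elmo.
from allennlp.commands.elmo import ElmoEmbedder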

elmo = ElmoEmbedder()
elmo_vecs = elmo.embed_sentence(tokens)

print("🔹 ELMo Shape (3 layers):", elmo_vecs.shape)  # (3, seq_len, 1024)
print("🔹 Token 1 ELMo:", elmo_vecs[:, 0, :].shape)
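
A tagger needs one vector per token, so the three ELMo layers are usually collapsed with a softmax-weighted sum scaled by a single factor (often called a scalar mix). The sketch below shows that arithmetic on nested Python lists. The names `scalar_mix`, `raw_weights`, and `gamma` are illustrative, and in a real model the weights would be learned rather than fixed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, raw_weights, gamma=1.0):
    """Collapse [num_layers][seq_len][dim] into [seq_len][dim].

    Each output vector is gamma * sum_j softmax(raw_weights)[j] * layers[j][t].
    """
    s = softmax(raw_weights)
    seq_len, dim = len(layers[0]), len(layers[0][0])
    return [
        [
            gamma * sum(s[j] * layers[j][t][d] for j in range(len(layers)))
            for d in range(dim)
        ]
        for t in range(seq_len)
    ]


# Toy input: 3 layers, 2 tokens, 2 dimensions (real ELMo uses 1024 dimensions).
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]

# Equal raw weights give a plain average of the three layers.
mixed = scalar_mix(toy_layers, raw_weights=[0.0, 0.0, 0.0])
print(mixed)
```

With `elmo_vecs` from the cell above, simpler alternatives are to take only the top layer or to average over the layer axis.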

#### 🔹 Flair

* Character-level language model using LSTM
* Captures subword features and performs well on NER

In [92]:
!pip install -q flair

In [None]:
from flair.embeddings import FlairEmbeddings, WordEmbeddings, StackedEmbeddings
from flair.data import Sentence

# Split on whitespace only so Flair keeps the dataset's tokenization
sentence = Sentence(" ".join(tokens), use_tokenizer=False)

embedding_types = [
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward')
]

# Concatenate the word-level and character-LM vectors for each token
embeddings = StackedEmbeddings(embeddings=embedding_types)
embeddings.embed(sentence)

for token in sentence:
    print(f"{token.text}: {token.embedding.shape}")
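
Stacking here means concatenating the component vectors for each token, so the printed size is the sum of the component sizes. The sketch below reproduces that bookkeeping with placeholder vectors. The sizes (100 for GloVe, 2048 per Flair direction) are assumptions about these particular models; `embeddings.embedding_length` reports the actual value.

```python
def stack_embeddings(*vectors):
    """Concatenate per-token vectors from several embedding sources."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked


# Placeholder vectors with the assumed sizes of each component.
glove_vec = [0.0] * 100       # word-level GloVe (assumed 100-d)
forward_vec = [0.0] * 2048    # forward character LM (assumed 2048-d)
backward_vec = [0.0] * 2048   # backward character LM (assumed 2048-d)

token_vec = stack_embeddings(glove_vec, forward_vec, backward_vec)
print(len(token_vec))
```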


#### 🔹 BERT

* Transformer-based embeddings
* Deep bidirectional attention captures rich context

In [94]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state
print("🔹 BERT Token Embeddings Shape:", last_hidden_state.shape)

🔹 BERT Token Embeddings Shape: torch.Size([1, 12, 768])
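
The sequence length in this shape counts WordPiece subwords plus the `[CLS]` and `[SEP]` special tokens, so it can exceed the number of words that carry NER labels. One common way to get back to one vector per word is to average the subword vectors of each word. The sketch below does this on plain lists. The `word_ids` mapping is hypothetical toy data; with a fast tokenizer such as `BertTokenizerFast`, `inputs.word_ids()` returns a mapping of this form (the slow `BertTokenizer` used above does not provide it).

```python
def pool_subwords(subword_vectors, word_ids):
    """Average the subword vectors that belong to the same word.

    subword_vectors: one list of floats per subword position.
    word_ids: same length; None marks special tokens such as [CLS] and [SEP].
    """
    grouped = {}
    for vec, wid in zip(subword_vectors, word_ids):
        if wid is None:
            continue  # skip special tokens
        grouped.setdefault(wid, []).append(vec)

    pooled = []
    for wid in sorted(grouped):
        vecs = grouped[wid]
        # Column-wise mean over all subwords of this word.
        pooled.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return pooled


# Toy data: 3 words, the second split into 2 subwords, plus 2 special tokens.
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_vectors = [
    [9.0, 9.0],  # [CLS]
    [1.0, 1.0],  # word 0
    [2.0, 0.0],  # word 1, first subword
    [4.0, 2.0],  # word 1, second subword
    [5.0, 5.0],  # word 2
    [9.0, 9.0],  # [SEP]
]

word_vectors = pool_subwords(toy_vectors, toy_word_ids)
print(word_vectors)
```

Taking only the first subword of each word is another common choice. Either way the result has one row per original token, which lines up with `labels`.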


### 🔸 Character-level Embeddings


#### 🔹 Char-CNN

* Applies 1D convolutions over a word's character embeddings, then max-pools over positions to extract morphological features such as prefixes and suffixes

In [None]:
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size, char_dim, out_channels, kernel_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, char_dim)
        self.conv = nn.Conv1d(char_dim, out_channels, kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, char_input):  # (batch, word_len) character ids; word_len must be >= kernel_size
        x = self.embedding(char_input).permute(0, 2, 1)  # (batch, char_dim, word_len)
        x = self.pool(torch.relu(self.conv(x))).squeeze(-1)  # max over character positions
        return x  # (batch, out_channels)
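
Character-level models take integer character ids, one row per word and padded to a common length. The sketch below shows one way to build that input. The helper names, the reserved ids `PAD = 0` and `UNK = 1`, and the `min_len` parameter are assumptions made for illustration.

```python
PAD, UNK = 0, 1  # reserved ids (an assumption of this sketch)


def build_char_vocab(words):
    """Map every character seen in `words` to an id starting at 2."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_words(words, vocab, min_len=1):
    """Turn words into equal-length rows of character ids.

    Unknown characters map to UNK and short words are right-padded with PAD.
    min_len lets the caller guarantee rows at least as long as a conv kernel.
    """
    max_len = max(min_len, max(len(word) for word in words))
    return [
        [vocab.get(ch, UNK) for ch in word] + [PAD] * (max_len - len(word))
        for word in words
    ]


words = ["EU", "rejects", "German"]
char_vocab = build_char_vocab(words)
char_ids = encode_words(words, char_vocab, min_len=3)

for word, row in zip(words, char_ids):
    print(word, row)
```

Wrapping `char_ids` in `torch.tensor(...)` gives a `(num_words, max_word_len)` tensor that can be passed to a character encoder whose `vocab_size` is `len(char_vocab) + 2`.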


#### 🔹 Char-BiLSTM

* Captures sequential character dependencies
* Helpful for unknown or rare words

In [None]:
class CharBiLSTM(nn.Module):
    def __init__(self, vocab_size, char_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

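    def forward(self, char_input):  # (batch, word_len) character ids
        x = self.embedding(char_input)  # (batch, word_len, char_dim)
        _, (h, _) = self.lstm(x)  # h: (2, batch, hidden_dim), final forward and backward states
        return torch.cat((h[0], h[1]), dim=-1)  # (batch, 2*hidden)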


### Summary

Combining word-level and character-level embeddings makes NER models more robust to rare and unseen words. Contextual embeddings such as BERT have become the default choice because they generally outperform static embeddings on NER.


| Embedding Type    | Library        | Model Examples                     | Output Shape (per token) |
| ----------------- | -------------- | ---------------------------------- | ------------------------ |
| Word2Vec/GloVe    | Gensim/Flair   | `glove-wiki-gigaword-100`          | 100                      |
| ELMo              | AllenNLP       | Pretrained ELMo                    | 1024 x 3 layers          |
| Flair Embeddings  | Flair          | `news-forward`, `news-backward`    | 2048 each, 4096 stacked  |
| BERT (contextual) | HuggingFace    | `bert-base-cased`, `BioBERT`, etc. | 768                      |
| Char CNN          | Custom / Flair | CNN over char embeddings           | 50–100+                  |
| Char BiLSTM       | Custom         | BiLSTM over char embeddings        | 100–200                  |
