# Evolution of AI Approaches

## 1. Symbolic (1950s–1980s)
- **Core idea:** Manually encoded rules and logic
- **Example (Sentiment Analysis):**
  ```text
  IF "happy" in sentence → label = Positive
  IF "sad"   in sentence → label = Negative
  ```
- **Pros:**
  - Transparent, easy to debug
  - Leverages explicit domain knowledge
- **Cons:**
  - Brittle: fails on unseen phrasing
  - Requires extensive hand-crafting of rules
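
The two rules in this section can be sketched as a tiny Python function. This is only an illustration: the function name and the `"Unknown"` fallback label are assumptions added here, not part of the original rule set.

```python
def rule_based_sentiment(sentence: str) -> str:
    """Label a sentence using two hand-written keyword rules."""
    words = sentence.lower().split()
    if "happy" in words:
        return "Positive"
    if "sad" in words:
        return "Negative"
    return "Unknown"  # hypothetical fallback when no rule fires


print(rule_based_sentiment("I am happy today"))     # Positive
print(rule_based_sentiment("This made me joyful"))  # Unknown
```

The second call shows the brittleness: "joyful" is clearly positive, but no rule covers it.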

## 2. Statistical / “Classical” ML (1980s–2010)
- **Core idea:** Learn parameters from data (e.g. Naïve Bayes, Logistic Regression)
- **Pros:**
  - Can generalize beyond fixed rules
  - Replaces hand-written rules with weights learned from labeled examples
- **Cons:**
  - Limited model capacity → struggles with complex patterns
  - Still needs domain-specific feature design

## 3. Neural / Deep Learning (2010–Present)
- **Core idea:** Multi-layered neural networks learn features automatically
- **Drivers:**
  - Compute power skyrocketed
  - Data availability exploded
- **Pros:**
  - Automatic feature learning from raw inputs
  - State-of-the-art performance on vision, speech, NLP, etc.
- **Cons:**
  - Data-hungry: performance scales with dataset size
  - Compute-hungry: requires GPUs/TPUs for training
  - Opaque: “black-box” models are harder to interpret

## 4. Transformers & Attention Mechanisms
- **Innovation:** Self-attention lets the model weigh every token in a sequence against every other token
- **Advantages over earlier NNs:**
  - Parallelizable: faster training than RNNs/LSTMs
  - Long-range context: captures dependencies across long texts
- **Requirements:**
  - Even more data + compute (e.g., pretraining on billions of tokens)
- **Why it matters:**
  - Foundation for models like BERT, GPT, T5, etc.
  - Enables powerful transfer learning and few-shot capabilities

## Key Takeaways
- Rule-based → Statistical → Neural → Transformer
- Each wave trades manual knowledge for data & compute
- Transformers represent the current frontier:
  - Leverage huge corpora + attention to learn language structure
  - Form the backbone of modern large language models (LLMs)

## Word Vectors & Static Embeddings (2010s)

### 1. Motivation
- Represent each word as a **dense, low-dimensional** vector (50–300 dims) instead of a sparse one-hot (|V|-dimensional)  
- Capture **semantic similarity**: words with similar contexts → nearby in vector space

### 2. Mikolov’s **Word2Vec** (2013)
- **Architectures**  
  - **CBOW** (Continuous Bag-of-Words)  
    - Predict target word from the sum/average of its context vectors  
  - **Skip-Gram**  
    - Predict surrounding context words given the target  
- **Training objective**  
  - Maximize probability of true context words, minimize for “negative‐sampled” noise words  
- **Output**  
  - A lookup table of word → ℝ^d embeddings, e.g.  
    ```
    “hello” → [0.043, 0.120, 0.722, …, 0.461]
    “world” → [0.381, 0.520, 0.012, …, 0.201]
    ```
  - Typical d = 50–100 (or up to 300)

### 3. Pros & Cons of Static Word Vectors

| Pros                                            | Cons                                          |
|-------------------------------------------------|-----------------------------------------------|
|  Capture semantic & syntactic relationships  |  **Context-independent**: “bank” has one vector for finance & rivers |
| Efficient to train & look up in downstream tasks |  Polysemy and homonymy not handled          |
| Simple linear algebra analogies (e.g. **king – man + woman ≈ queen**) | OOV words & morphology issues              |
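
To make the "nearby in vector space" and analogy claims concrete, here is a minimal sketch using cosine similarity. The 3-d vectors are hand-picked for illustration (they are not trained embeddings), and the helper name `cosine` is an assumption of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-d embeddings, chosen by hand so the analogy works out.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest word to the target vector by cosine similarity
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)  # queen
```

Real systems usually exclude the three query words from the candidate set before taking the nearest neighbour; with these toy vectors the result is the same either way.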

### 4. Extensions & Alternatives
- **GloVe** (Pennington et al.): leverages global word–word co-occurrence statistics  
- **FastText** (Bojanowski et al.): enriches with subword (character n-gram) information  

### 5. From Static to Contextual Embeddings
1. **ELMo** (2018): context-sensitive embeddings from a bidirectional LSTM  
2. **Transformer-based** (2018–):  
   - BERT, GPT, T5, etc.  
   - Produce **dynamic** embeddings per token **in its sentence**  
   - Leverage attention to model long-range dependencies  

---

> **Key takeaway:**  
> Static word vectors (Word2Vec) were a pivotal step—mapping words to ℝ^d to capture similarity—yet lacked context sensitivity. Transformers build on this by producing **contextual**, token‐level embeddings that adapt to each occurrence.  


## Recurrent Neural Networks (RNNs)

RNNs process sequences by maintaining a hidden state that “remembers” past inputs. At each time step \(t\):

\[
\begin{aligned}
h_t &= \sigma\big(W_x x_t + W_h h_{t-1} + b_h\big) \\
y_t &= \phi\big(V\,h_t + b_y\big) \quad(\text{optional})
\end{aligned}
\]

For example, unrolling over “the quick brown fox jumps”:

- t=0: “the” → RNN → h₀
- t=1: “quick” → RNN → h₁
- t=2: “brown” → RNN → h₂
- …
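
The unrolling can be sketched in a few lines of Python. This is a simplified illustration, assuming a scalar hidden state, `tanh` as the nonlinearity \(\sigma\), and made-up scalar inputs and weights; a real RNN uses vectors and learned matrices.

```python
import math


def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b_h=0.0):
    """One step of h_t = tanh(w_x * x_t + w_h * h_{t-1} + b_h) with scalar state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b_h)


# Hypothetical scalar "embeddings" for each word, for illustration only.
inputs = {"the": 0.1, "quick": 0.7, "brown": -0.3, "fox": 0.9, "jumps": 0.4}

h = 0.0       # initial hidden state
states = []
for word in ["the", "quick", "brown", "fox", "jumps"]:
    h = rnn_step(inputs[word], h)   # the same weights are reused at every step
    states.append(h)
    print(f"{word:>5} -> h = {h:+.3f}")
```

Each new state depends on the previous one, which is why the steps cannot be computed in parallel.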

### Vanishing & Exploding Gradients
- **Backpropagation Through Time** multiplies gradients by \(W_h\) (and by the activation derivative) once per time step.
- If \(\|W_h\|<1\): gradients shrink → **vanish** → long-range dependencies lost.
- If \(\|W_h\|>1\): gradients can grow → **explode** → unstable training.
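
A quick numeric sketch of the repeated multiplication, assuming a scalar recurrent weight and ignoring the activation derivative (both simplifications made for this example):

```python
def gradient_scale(w_h: float, steps: int) -> float:
    """Magnitude of a gradient after being multiplied by w_h once per time step."""
    return abs(w_h) ** steps


for w_h in (0.9, 1.0, 1.1):
    print(f"w_h = {w_h}: scale after 50 steps = {gradient_scale(w_h, 50):.4g}")
```

After 50 steps, a weight of 0.9 shrinks the gradient to well under 1% of its original size, while a weight of 1.1 amplifies it more than a hundredfold.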

### Pros & Cons

| Pros                                        | Cons                                             |
|---------------------------------------------|--------------------------------------------------|
| ▶️ Captures sequential context via state    | ⚠️ Vanishing/exploding gradients over long spans |
| ▶️ Shares parameters across time-steps      | ⚠️ Sequential computation → poor parallelism     |
| ▶️ Simple for short sequences               | ⚠️ Limited memory for very long dependencies     |

### Gated Variants & Evolution
- **LSTM** (1997): adds input, forget & output gates to stabilize gradients  
- **GRU** (2014): streamlined reset & update gates  
- **Bidirectional RNNs**: process inputs both forwards & backwards  
- **Transformers**: replace recurrence with self-attention for global context and full parallelism  

## Long Short-Term Memory (LSTM)

LSTMs extend vanilla RNNs with a **cell state** \(c_t\) and three gating mechanisms to preserve long-range information and mitigate vanishing/exploding gradients.

At each time step \(t\), given input \(x_t\) and previous hidden state \(h_{t-1}\) and cell state \(c_{t-1}\):

```math
\begin{aligned}
f_t &= \sigma\big(W_f[h_{t-1},\,x_t] + b_f\big)        &&\text{(forget gate)}\\
i_t &= \sigma\big(W_i[h_{t-1},\,x_t] + b_i\big)        &&\text{(input gate)}\\
\tilde c_t &= \tanh\big(W_c[h_{t-1},\,x_t] + b_c\big)  &&\text{(cell candidate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t        &&\text{(new cell state)}\\
o_t &= \sigma\big(W_o[h_{t-1},\,x_t] + b_o\big)        &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)                            &&\text{(new hidden state)}
\end{aligned}
```

When unrolled over “the quick brown fox jumps”:

- t=0: “the” → (h₀, c₀)
- t=1: “quick” → (h₁, c₁)
- t=2: “brown” → (h₂, c₂)
- …
- t=4: “jumps” → (h₄, c₄)

The cell state \(c_t\) (highlighted in the diagram) flows along almost unchanged, letting the network carry information across many steps.

### Pros & Cons

| Pros                                                                 | Cons                                          |
|----------------------------------------------------------------------|-----------------------------------------------|
| ▶️ Effectively captures long-range dependencies via gated cell state | ⚠️ More parameters & heavier compute          |
| ▶️ Mitigates vanishing/exploding gradients                           | ⚠️ Inherently sequential → limited parallelism |
| ▶️ Robust selective memory through gates                             | ⚠️ More complex to tune & slower to train     |
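
The gate equations can be traced with a scalar sketch. Everything numeric here is an assumption for illustration: the state is a single number rather than a vector, and the weights in `params` are made up so that the forget gate stays near 1 and the input gate near 0.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step. p maps each gate name to (w_h, w_x, b)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b   # W[h_{t-1}, x_t] + b for scalars

    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    c_tilde = math.tanh(pre("c"))      # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde # new cell state
    o_t = sigmoid(pre("o"))            # output gate
    h_t = o_t * math.tanh(c_t)         # new hidden state
    return h_t, c_t


# Hypothetical weights: a large forget bias (f ≈ 1) and a negative input bias (i ≈ 0).
params = {"f": (0.0, 0.0, 5.0), "i": (0.0, 0.0, -5.0),
          "c": (0.5, 0.5, 0.0), "o": (0.0, 0.0, 0.0)}

h, c = 0.0, 1.0
for x in [0.1, 0.7, -0.3, 0.9, 0.4]:
    h, c = lstm_step(x, h, c, params)
print(f"cell state after 5 steps: {c:.3f}")
```

With these settings the cell state stays close to its starting value of 1.0 across all five steps, which is the "flows along almost unchanged" behaviour described above.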

## Encoder–Decoder Attention

When decoding at time $t$ (e.g. in a seq2seq LSTM), the decoder can “peek” at **all** encoder states via an attention module:

1. **Inputs**  
   - **Query** $Q$ = current decoder hidden state $h_t$  
   - **Keys** $K$ = all encoder outputs $[h^{\text{enc}}_1, \dots, h^{\text{enc}}_n]$  
   - **Values** $V$ = same as $K$ (or a linear projection thereof)  

2. **Scaled dot-product attention**  
   $$
   \mathrm{Attention}(Q,K,V) \;=\;\mathrm{softmax}\!\bigl(\tfrac{QK^{T}}{\sqrt{d_k}}\bigr)\,V
   $$

3. **Context vector** $c_t$  
   - A weighted sum of encoder values:  
     $$c_t = \sum_{i=1}^n \alpha_{t,i}\,V_i,\quad \alpha_{t,i}=\mathrm{softmax}_i\bigl(QK^T/\sqrt{d_k}\bigr)$$  
   - Concatenated with (or added to) the decoder state to produce the next output
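
The three steps above can be sketched in plain Python. The decoder state and encoder outputs below are made-up 2-d vectors, and \(V\) is taken equal to \(K\) (no extra projection); both choices are assumptions for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, keys, values):
    """Return (attention weights, context vector) for a single query."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, context


# Hypothetical decoder state h_t and three encoder outputs.
h_dec = [1.0, 0.0]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, c_t = attention_context(h_dec, enc_states, enc_states)  # V = K
print([round(w, 2) for w in weights])
print([round(x, 2) for x in c_t])
```

Encoder states that align with the query (the first and third here) receive larger weights than the orthogonal second one.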

---

## Self-Attention (“Attention Is All You Need”, 2017)

Self-attention lets each token attend to **every** other token in the **same** sequence:

1. **Within one sequence**, compute three projections per token:  
   - **Queries** $Q=XW_Q$  
   - **Keys**    $K=XW_K$  
   - **Values**  $V=XW_V$

2. **Attention**  
   $$
   \mathrm{SelfAttn}(X) = \mathrm{softmax}\!\bigl(QK^T/\sqrt{d_k}\bigr)\,V
   $$

3. **Multi-Head Attention**  
   - Run self-attention $h$ times with different $(W_Q,W_K,W_V)$  
   - Concatenate the $h$ outputs and project back to $d_{\text{model}}$

4. **Positional Encoding**  
   - Since self-attention has no built-in notion of token order (permuting the inputs just permutes the outputs), add fixed or learned positional embeddings to $X$ so the model knows token order.

---

### Visualization Example

For the sentence  
> “The animal didn’t cross the street because it was too tired.”

- **Self-attention** at the token **“it”** will assign high weights to tokens like **“animal”** and **“street”**, letting the model resolve pronoun reference based on context.


## Multi-Head Attention

Extends single “Scaled Dot-Product” attention by running it in parallel over multiple learned subspaces (“heads”) and then recombining:

1. **Inputs**  
   - Sequence of position-encoded embeddings \(X \in \mathbb{R}^{n\times d_\text{model}}\)  

2. **Linear projections** (for each head \(i=1,\dots,h\))  
   \[
     Q_i = XW^Q_i,\quad
     K_i = XW^K_i,\quad
     V_i = XW^V_i
   \]
   where \(W^Q_i,W^K_i,W^V_i\in\mathbb{R}^{d_\text{model}\times d_k}\)  

3. **Per-head attention**  
   \[
     \mathrm{head}_i = \mathrm{softmax}\!\Bigl(\tfrac{Q_iK_i^\top}{\sqrt{d_k}}\Bigr)\,V_i
   \]

4. **Concatenate & project**  
   \[
     \mathrm{MultiHead}(X)
     = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^O,\quad
     W^O\in\mathbb{R}^{hd_k\times d_\text{model}}
   \]

---

## Positional Encoding

Since attention is order-agnostic, add fixed sinusoidal signals to each token embedding so the model can infer position:

For position \(pos\) and dimension index \(i\) (0-based):

\[
\begin{aligned}
\mathrm{PE}_{(pos,2i)}   &= \sin\!\Bigl(\tfrac{pos}{10000^{2i/d_\text{model}}}\Bigr),\\
\mathrm{PE}_{(pos,2i+1)} &= \cos\!\Bigl(\tfrac{pos}{10000^{2i/d_\text{model}}}\Bigr).
\end{aligned}
\]

- **Properties:**  
  - Each dimension has a different frequency → unique positional “wave” patterns  
  - Enables the model to learn to attend by relative and absolute positions  
- **Usage:**  
  \(\widetilde{X} = X + \mathrm{PE}\), then feed \(\widetilde{X}\) into the Transformer layers.  
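
A minimal sketch of the sinusoidal formula in plain Python. The function name and the list-of-lists layout are choices made for this example; the loop variable `k` plays the role of \(2i\) in the formula.

```python
import math


def positional_encoding(n_positions: int, d_model: int):
    """Return an n_positions x d_model table of sinusoidal position signals."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for k in range(0, d_model, 2):               # k = 2i (even dimensions)
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)             # PE(pos, 2i)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)     # PE(pos, 2i+1)
    return pe


pe = positional_encoding(n_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1; later positions differ per dimension because each pair uses a different frequency.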


# Transformer & BERT Output Heads

## 1. Pre-training Heads

- **Masked Language Modeling (MLM)**  
  - **Input:** Transformer hidden states \(H \in \mathbb{R}^{n\times d}\)  
  - **Head:**  
    1. Linear layer \( \mathbb{R}^d \to \mathbb{R}^{|V|} \)  
    2. Softmax over vocabulary  
  - **Objective:** Predict masked tokens (e.g. “The [MASK] of France is Paris”)

- **Next Sentence Prediction (NSP)**  
  - **Input:** Final [CLS] hidden state \(h_{\text{[CLS]}}\)  
  - **Head:** Linear \(\to\) softmax (2 classes: IsNext / NotNext)

## 2. Fine-tuning Heads

- **Question Answering (Span Extraction)**  
  - **Inputs:** Sequence of hidden states \(\{h_1,\dots,h_n\}\)  
  - **Heads:**  
    - **Start-position classifier:** Linear \(\to\) softmax over \(n\) tokens  
    - **End-position classifier:** Linear \(\to\) softmax over \(n\) tokens  

- **Sequence Classification**  
  - **Input:** [CLS] hidden state \(h_{\text{[CLS]}}\)  
  - **Head:** Linear \(\to\) softmax (or sigmoid) for \(k\) classes or binary

## 3. Model Dimensions

| Model       | Hidden Size \(d\) |  
|-------------|-------------------|  
| BERT-base   | 768               |  
| BERT-large  | 1024              |  

> **Note:** All these heads are lightweight (one linear+softmax layer, or a pair of them for span extraction) appended to the shared Transformer representations.  
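
As a rough sketch of what "linear → softmax" means for the sequence-classification head, here is a plain-Python version. The 4-d `[CLS]` vector and the weights are invented for the example (a real BERT-base head would take a 768-d vector and learned weights).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(h_cls, weight, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Hypothetical values for illustration only.
h_cls = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.0, 0.0, 0.5],     # class 0
     [0.0, 0.3, 0.2, -0.5]]    # class 1
b = [0.0, 0.0]

probs = classification_head(h_cls, W, b)
print([round(p, 3) for p in probs])
```

The span-extraction head works the same way, except that the softmax runs over the \(n\) token positions instead of over \(k\) classes.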


# Preprocessing for NLP

## Stopword Removal

Stopwords are high-frequency, low-information words (e.g. “the”, “is”, “and”) that are often filtered out before NLP tasks to reduce noise and dimensionality.

### 1. Example Tweet
> “I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.  
> Both impressed, and a little disappointed how rarely I get to actually train a model that matters :(”

### 2. NLTK Stopword Filter

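```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)   # fetch the stopword list on first use

tweet = ("I’m amazed how often in practice, not only does a @huggingface NLP model "
         "solve your problem, but one of their public finetuned checkpoints, "
         "is good enough for the job. Both impressed, and a little disappointed "
         "how rarely I get to actually train a model that matters :(")

# 1. Load English stopwords
stop_words = set(stopwords.words('english'))

# 2. Lowercase & tokenize on whitespace
tokens = tweet.lower().split()

# 3. Filter out stopwords
filtered = [t for t in tokens if t not in stop_words]

print("Before:", " ".join(tokens))
print("After: ", " ".join(filtered))
```

### 3. Before & After

| Before | After |
|--------|-------|
| i’m amazed how often in practice, not only does a @huggingface nlp model… | i’m amazed often practice, @huggingface nlp model solve problem,… |

### 4. Pros & Cons

| Pros                                              | Cons                                          |
|---------------------------------------------------|-----------------------------------------------|
| ▶️ Reduces vocabulary size & computational cost   | ⚠️ Might remove informative words (e.g. “not”) |
| ▶️ Simplifies downstream models                   | ⚠️ Static list → may not suit all domains     |
| ▶️ Easy to implement with libraries (NLTK, spaCy) | ⚠️ Context-ignored filtering                  |

> **Key takeaway:**  
> Removing stopwords can speed up and simplify text processing, but choose your stopword list carefully (and consider task-specific tweaks) to avoid discarding crucial information.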
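
One task-specific tweak is to keep negation words for sentiment work. The sketch below uses a small hand-written stopword set (an assumption for this example, not NLTK's list) so that it runs without any downloads.

```python
# Small hand-written stopword list for illustration (not NLTK's list).
BASE_STOPWORDS = {"the", "is", "and", "a", "not", "this", "was"}
KEEP = {"not"}                         # negations carry sentiment, so keep them
stop_words = BASE_STOPWORDS - KEEP


def remove_stopwords(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stop_words]


print(remove_stopwords("This movie was not good"))  # ['movie', 'not', 'good']
```

Without the `KEEP` set, the same sentence would reduce to `['movie', 'good']`, flipping its apparent sentiment.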

## Tokenization in NLP

Tokenization is the process of breaking raw text into discrete units (“tokens”) that a model can ingest. Depending on your task and model, tokens can be:

1. **Words**  
   - e.g. “amazed”, “practice”  
   - Intuitive, fast — but huge vocabularies & OOV (out-of-vocab) issues  

2. **Subwords / WordPieces / BPE**  
   - e.g. “amaz”, “##ed”, “practi”, “##ce”  
   - Balances vocabulary size vs. ability to handle rare words/morphology  

3. **Characters**  
   - e.g. “I”, “’”, “m”, “ ”, “a”, “m”, “a”, “z”, “e”, “d”, …  
   - Very small vocab; no OOV; longer sequences, slower models  

4. **Punctuation & Symbols**  
   - “,” “.” “!” “?” “@” “#” etc.  
   - Often kept as separate tokens for sentiment/mention handling  

5. **Special & Model-Specific Tokens**  
   - **Normalization placeholders**:  
     - `<URL>` for links  
     - `<USER>` for Twitter handles (e.g. `@joebloggs`)  
   - **Task tokens**:  
     - `[CLS]`, `[SEP]`, `[MASK]` for BERT/Transformer inputs  

---

### 1. Preprocessing Steps

Before tokenizing, you often:

- **Lowercase** (optional): unify “The” vs. “the”  
- **Normalize mentions/URLs**: map `@elonmusk` → `<USER>`, `http://…` → `<URL>`  
- **Strip or isolate punctuation**: ensure “practice,” → “practice” + “,”  
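
These three steps can be combined into one small function. This is a sketch using only the standard `re` module; the function name `simple_tokenize` and the exact regular expressions are assumptions for illustration, and real pipelines usually handle more cases (emoji, hashtags, contractions).

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase, normalize URLs/mentions, and split punctuation into its own tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<URL>", text)   # URLs first, in case they contain '@'
    text = re.sub(r"@\w+", "<USER>", text)
    # Placeholders stay whole; otherwise take runs of word characters or single symbols.
    return re.findall(r"<USER>|<URL>|\w+|[^\w\s]", text)


print(simple_tokenize("In practice, @elonmusk rocks! http://x.co"))
# ['in', 'practice', ',', '<USER>', 'rocks', '!', '<URL>']
```

Note that this pattern also splits contractions at the apostrophe, which may or may not be what a given task wants.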

---

### 2. Pythonic Examples

```python
import re
from transformers import BertTokenizer

tweet = "@joebloggs thinks that the NLP models that @huggingface made are super cool https://t.co/abc123"

# Character tokens
chars = list(tweet)

# Simple whitespace tokens
words = tweet.split()

# Replace mentions & URLs with placeholders
tweet_norm = re.sub(r'@\w+', '<USER>', tweet)
tweet_norm = re.sub(r'https?://\S+', '<URL>', tweet_norm)

# Subword tokens (via HuggingFace tokenizer; downloads the vocabulary on first use)
tok = BertTokenizer.from_pretrained('bert-base-uncased')
subwords = tok.tokenize(tweet_norm)
# Note: unless registered as special tokens, placeholders such as <USER> are split into pieces too
```

### 3. Pros & Cons

| Token Type     | Pros                                           | Cons                                        |
|----------------|------------------------------------------------|---------------------------------------------|
| Words          | Intuitive, semantic                            | Large vocab, OOV                            |
| Subwords (BPE) | Handles rare words, moderate vocab             | Rare words split into less interpretable pieces |
| Characters     | No OOV, small vocab                            | Very long sequences, slower to process      |
| Special Tokens | Normalize noise (mentions, URLs), guide models | Requires task-specific rules/regex          |

> **Key takeaway:**  
> Choose your tokenization strategy based on your model and data: subwords are standard for transformers (BERT/GPT), whereas character or word-level may suit specialized or resource-constrained scenarios.

## Model-Specific Special Tokens

Transformer models like BERT use a small set of reserved tokens to handle sequence boundaries, unknown words, padding, and masking:

| Token    | Description                                                                                 |
|----------|---------------------------------------------------------------------------------------------|
| `[PAD]`  | Padding token – pads shorter sequences so that all inputs in a batch have the same length (at most 512 tokens for BERT). |
| `[UNK]`  | Unknown token – stands in for any token that cannot be built from the wordpieces in the model’s vocabulary. |
| `[CLS]`  | Classification token – always prepended to the input; its final hidden state is used for sequence-level tasks (e.g. classification, NSP). |
| `[SEP]`  | Separator token – marks the end of a sentence or separates paired inputs (e.g. question vs. context). |
| `[MASK]` | Masking token – randomly substituted for real tokens during pre-training to learn contextual representations (MLM task). |

---

### Example Input for BERT

```text
[CLS] What is the capital of France? [SEP] The capital of France is [MASK]. [SEP] [PAD] [PAD]
```

- The model sees a single sequence of fixed length (padded with `[PAD]`).
- It uses `[MASK]` to predict “Paris” during pre-training.
- During fine-tuning, the `[CLS]` embedding feeds into a classifier head for tasks like sentiment classification; NSP reads the same position during pre-training.

> **Key takeaway:**  
> These special tokens let BERT handle variable-length inputs, denote structure (start/end), and learn from masked words, enabling unified pre-training and fine-tuning.
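
The layout of such an input can be sketched without any library. The helper below is hypothetical (it is not the HuggingFace API); it also returns an attention mask, which is an addition of this example: a list of 1s for real tokens and 0s for padding, the usual way a model is told to ignore `[PAD]` positions.

```python
def build_bert_input(tokens_a, tokens_b, max_len):
    """Assemble [CLS] A [SEP] B [SEP], then pad to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate the inputs first")
    attention_mask = [1] * len(tokens)          # 1 = real token
    n_pad = max_len - len(tokens)
    return tokens + ["[PAD]"] * n_pad, attention_mask + [0] * n_pad


tokens, mask = build_bert_input(["what", "is"], ["paris"], max_len=8)
print(tokens)  # ['[CLS]', 'what', 'is', '[SEP]', 'paris', '[SEP]', '[PAD]', '[PAD]']
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
```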


## Stemming

Stemming is a crude, rule-based way to reduce words to their “stem” or root form by stripping suffixes (and sometimes prefixes). It’s fast and needs no dictionary, but the rules are written per language and can be overly aggressive.

---

### 1. Common Algorithms

| Stemmer            | Approach                                      | Example                      |
|--------------------|-----------------------------------------------|------------------------------|
| **PorterStemmer**  | Series of suffix-stripping rules (5 phases)  | *amazed* → **amaz**<br>*amazingly* → **amazingli** |
| **LancasterStemmer** | Iterative, conflation-based rules (more aggressive) | *amazed* → **amaz**<br>*amazingly* → **amaz** |
| **SnowballStemmer** | Revised Porter (multi-language support)      | Similar to Porter for English |

---

### 2. Python Example (NLTK)

```python
from nltk.stem import PorterStemmer, LancasterStemmer

words = ['happy','happiest','happier','cactus','cactii',
         'elephant','elephants','amazed','amazing','amazingly',
         'cement','owed','maximum']

porter   = PorterStemmer()
lancaster = LancasterStemmer()

for w in words:
    print(f"{w:10} → Porter: {porter.stem(w):6} | Lancaster: {lancaster.stem(w)}")
```

Sample output:

```text
happy      → Porter: happi  | Lancaster: happy
happiest   → Porter: happiest | Lancaster: happiest
happier    → Porter: happier  | Lancaster: happy
cactus     → Porter: cactu   | Lancaster: cact
cactii     → Porter: cactii  | Lancaster: cacti
elephant   → Porter: elephant | Lancaster: eleph
elephants  → Porter: elephant | Lancaster: eleph
amazed     → Porter: amaz     | Lancaster: amaz
amazing    → Porter: amaz     | Lancaster: amaz
amazingly  → Porter: amazingli| Lancaster: amaz
cement     → Porter: cement   | Lancaster: cem
owed       → Porter: owe      | Lancaster: ow
maximum    → Porter: maximum  | Lancaster: maxim
```

### 3. Pros & Cons

| Pros                                 | Cons                                                                 |
|--------------------------------------|----------------------------------------------------------------------|
| ▶️ Very fast and lightweight         | ⚠️ Over-stemming: conflates unrelated words (e.g. “universe” and “university” → “univers”) |
| ▶️ Easy to implement in any language | ⚠️ Not linguistically precise; strips affixes without context        |
| ▶️ Reduces vocabulary size           | ⚠️ Misses irregular forms (e.g. “better” is never mapped to “good”)  |

### 4. When to Use

- **Search & IR:** quick index normalization
- **Prototyping:** feature reduction before heavier lemmatization
- **Resource-constrained settings:** when speed & memory matter

> **Key takeaway:**  
> Stemming offers a fast, simple way to conflate word forms, but if you need linguistically accurate roots (e.g. “is/am/are” → “be”), consider lemmatization instead.

## Lemmatization

Lemmatization reduces words to their base or dictionary form (**lemma**) using morphological analysis and a vocabulary (e.g. WordNet), producing linguistically valid roots.

---

### 1. WordNet Lemmatizer (NLTK)

```python
import nltk
nltk.download('wordnet')             # ensure WordNet data is available

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
words      = ['amaze', 'amazed', 'amazing']

# Default (no POS tag → noun assumed)
[lemmatizer.lemmatize(w) for w in words]
# → ['amaze', 'amazed', 'amazing']

# Specify POS = VERB
[lemmatizer.lemmatize(w, pos=wordnet.VERB) for w in words]
# → ['amaze', 'amaze', 'amaze']
```

### 2. Pros & Cons

| Pros                                                 | Cons                                                            |
|------------------------------------------------------|-----------------------------------------------------------------|
| ▶️ Returns valid dictionary forms (e.g. “was”→“be”)  | ⚠️ Requires POS tags for best results                           |
| ▶️ Handles irregular forms and morphology            | ⚠️ Slower than stemming; depends on external lexicons (WordNet) |
| ▶️ Improves downstream tasks by unifying variants    | ⚠️ Less effective on domain-specific jargon/out-of-vocab words  |

### 3. When to Use

- **Information Extraction / QA:** need precise base forms
- **Text normalization:** for semantic similarity or clustering
- Any scenario where preserving true word meaning outweighs computational cost

> **Key takeaway:**  
> Lemmatization gives linguistically accurate roots by leveraging part-of-speech and lexical resources, making it superior to stemming when correctness matters.

## Unicode Normalization

Unicode characters can have multiple code‐point representations that look identical (or nearly so) but compare as different. Normalization transforms text into a consistent form.

### 1. Canonical Equivalence (NFC / NFD)

- **Canonical** means different sequences that represent the *same* abstract character(s).  
- **Forms**  
  - **NFC (Normalization Form C)**: composes (where possible)  
  - **NFD (Normalization Form D)**: decomposes to base + combining marks  
- **Examples**  
  ```python
  import unicodedata
  # Ç (U+00C7) vs C + ◌̧ (U+0043 U+0327)
  s1 = "\u00C7"    # precomposed Ç
  s2 = "C\u0327"   # C followed by a combining cedilla
  unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)  # True
  unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)  # True
  ```

### 2. Compatibility Equivalence (NFKC / NFKD)

- **Compatibility** means characters that look or behave similarly, but have different semantics or usage.
- **Forms**
  - **NFKC**: compatibility-decomposed, then canonically recomposed
  - **NFKD**: compatibility-decomposed
- **Examples**
  ```python
  # “①” (circled 1) vs “1”
  c1, c2 = "①", "1"
  unicodedata.normalize("NFKC", c1) == unicodedata.normalize("NFKC", c2)  # True
  # But NFC/NFD leave them distinct
  unicodedata.normalize("NFC", c1) == unicodedata.normalize("NFC", c2)    # False
  ```

The compatibility forms also apply canonical normalization, so NFKC/NFKD cover both kinds of equivalence:

| Form | Canonical? | Compatibility? | Use Case                                              |
|------|------------|----------------|-------------------------------------------------------|
| NFC  | ✅         | ❌             | Text display, preserving all distinctions             |
| NFD  | ✅         | ❌             | Fine-grained Unicode processing (e.g. accent analysis) |
| NFKC | ✅         | ✅             | Searching/comparing user-facing text                  |
| NFKD | ✅         | ✅             | Simplifying for indexing or ASCII-only environments   |

### 3. When & Why to Normalize

- **String comparison:** ensure the precomposed and decomposed spellings of “résumé” compare equal
- **Search & indexing:** map fullwidth or superscript digits to ASCII
- **Data cleaning:** strip out font variants, compatibility glyphs
- **Interoperability:** avoid hidden mismatches in user input, file I/O

```python
import unicodedata

def normalize_text(s: str, mode: str = "NFC") -> str:
    return unicodedata.normalize(mode, s)

# Example
raw = "España\u0301"  # “Españá” (final a + combining acute)
print(normalize_text(raw, "NFC"))  # “Españá” (á as a single code point)
```

> **Key takeaway:**  
> Always normalize Unicode early in your pipeline: choose NFC/NFD to preserve canonical content, or NFKC/NFKD when you need to collapse compatibility variants for reliable matching and indexing.
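
One common use of NFD is accent stripping for ASCII-only environments: decompose, then drop the combining marks. The sketch below uses only the standard `unicodedata` module; the function name `strip_accents` is chosen for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (accents, cedillas, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("r\u00e9sum\u00e9"))  # resume
print(strip_accents("\u00c7a va"))        # Ca va
```

This is lossy by design, so it suits search keys and indexes rather than text that will be shown back to users.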

## Unicode Normalization: Canonical & Compatibility Equivalence

Different Unicode code-point sequences can look (nearly) identical but compare as unequal. Normalization transforms text into a consistent form so that equivalent sequences compare equal.

---

### 1. Canonical Equivalence (NFC / NFD)

- **Goal:** equate characters that are *canonically* the same (same abstract character + accents), regardless of decomposition.  
- **Forms:**  
  - **NFD (Normalization Form D):** *canonical decomposition* → base + combining marks  
  - **NFC (Normalization Form C):** decompose (NFD), then *recompose* where possible  

| Example                              | Codepoints                                  | After NFD                      | After NFC                      |
|--------------------------------------|----------------------------------------------|--------------------------------|--------------------------------|
| **Ç**                                | `\u00C7`                                     | `\u0043` + `\u0327`            | `\u00C7`                        |
| **가** (Korean “ga”)                 | `\uAC00`                                     | `\u1100` + `\u1161`            | `\uAC00`                        |

```python
import unicodedata

a = "\u00C7"             # single Ç
b = "C\u0327"            # C + combining cedilla
assert a != b            # codepoints differ
# Normalize to NFD or NFC → sequences match
unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)   # True
unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)   # True
```

### 2. Compatibility Equivalence (NFKC / NFKD)

- **Goal:** also equate characters with compatibility differences (font variants, superscripts, circled forms).
- **Forms:**
  - **NFKD:** compatibility decomposition (no recomposition)
  - **NFKC:** compatibility decomposition + canonical recomposition

| Example                 | Look-alike vs ASCII | NFD / NFC | NFKD / NFKC                                   |
|-------------------------|---------------------|-----------|-----------------------------------------------|
| “①” (circled one)       | vs “1”              | distinct  | both normalize to “1”                         |
| “½” (fraction one-half) | vs “1/2”            | distinct  | “½” → “1⁄2” with U+2044 FRACTION SLASH, which still differs from the ASCII “1/2” |

```python
c1, c2 = "①", "1"
# NFC: still distinct
unicodedata.normalize("NFC", c1) == unicodedata.normalize("NFC", c2)   # False
# NFKC: collapse compatibility variants
unicodedata.normalize("NFKC", c1) == unicodedata.normalize("NFKC", c2) # True
```

### 3. When to Use Which Form

The compatibility forms include canonical normalization as well:

| Form | Canonical? | Compatibility? | Typical Use                           |
|------|------------|----------------|---------------------------------------|
| NFD  | ✅         | ❌             | Text analysis by combining marks      |
| NFC  | ✅         | ❌             | Text display & round-trip preservation |
| NFKD | ✅         | ✅             | Indexing / ASCII-only conversions     |
| NFKC | ✅         | ✅             | Search/comparison of user input       |

### 4. Best Practices

- Normalize early in your pipeline (e.g. on text ingestion).
- Choose form based on downstream needs:
  - NFC for preserving exact characters but canonicalizing accents.
  - NFKC when collapsing visual or semantic variants (superscripts, circled letters).
- Always compare normalized strings to avoid hidden mismatches.

```python
import unicodedata

def normalize_unicode(s: str, form: str = "NFC") -> str:
    return unicodedata.normalize(form, s)
```

> **Key takeaway:**  
> Unicode normalization ensures that visually identical or semantically equivalent text compares equal, which is crucial for reliable matching, searching, and storage.

## Unicode Normalization: Canonical & Compatibility Equivalence

Different Unicode characters can be encoded multiple ways (composed vs. decomposed, styled vs. base), so two strings that look identical may not compare equal at the code‐point level. Unicode normalization transforms text into a consistent form so that equivalent characters compare equal.

### 1. Canonical Equivalence (NFC & NFD)

- **Goal:** Equate characters that are *canonically* the same (same abstract letter plus accents), regardless of how they’re encoded.
- **Forms:**
  - **NFD (Normalization Form D)**  
    Decomposes composed characters into base + combining marks.  
  - **NFC (Normalization Form C)**  
    NFD decomposition followed by recomposition into precomposed characters where possible.
- **Example (Latin “Ç”):**  
  - `"\u00C7"` (single code point)  
  - `"\u0043\u0327"` (“C” + combining cedilla)  
  ```python
  import unicodedata
  a = "\u00C7"
  b = "C\u0327"
  assert a != b
  # Normalize → compare equal
  unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)   # True
  unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)   # True
  ```

### 2. Compatibility Equivalence (NFKC & NFKD)

- **Goal:** Also collapse characters that are compatibility variants (font or semantic variants, superscripts, circled forms) into their base forms.
- **Forms:**
  - **NFKD**  
    Compatibility decomposition (breaks font variants, superscripts, etc., into base characters and compatibility mappings).
  - **NFKC**  
    NFKD decomposition followed by canonical recomposition.
- **Examples:**
  - Circled “①” vs. “1”
    ```python
    unicodedata.normalize("NFC", "①") == "1"      # False
    unicodedata.normalize("NFKC", "①") == "1"     # True
    ```
  - Script “ℋ” + combining cedilla (`"\u210B\u0327"`) vs. precomposed “Ḩ” (`"\u1E28"`)
    ```python
    fancy = "\u210B\u0327"
    base  = "\u1E28"
    # Before normalize: not equal
    fancy != base
    # After NFKC: ℋ folds to H, which recomposes with the cedilla into Ḩ
    unicodedata.normalize("NFKC", fancy) == base  # True
    ```

Compatibility decomposition includes the canonical mappings, so the NFK forms apply both:

| Form | Canonical Decomp? | Compat. Decomp? | Recompose? | Typical Use                        |
|------|-------------------|-----------------|------------|------------------------------------|
| NFD  | ✅                | ❌              | ❌         | Accent analysis, fine-grained text |
| NFC  | ✅                | ❌              | ✅         | Text display & round-tripping      |
| NFKD | ✅                | ✅              | ❌         | Indexing, stripping font variants  |
| NFKC | ✅                | ✅              | ✅         | Search/comparison of user input    |

### 3. Best Practices

- **Normalize Early:** Apply Unicode normalization at ingestion or the first text-processing step.
- **Choose Form by Need:**
  - NFC to preserve canonical characters but unify accents.
  - NFKC when you need to collapse compatibility variants (superscripts, circled numbers, font variants) for robust matching.
- **Always Compare Normalized Strings:** Prevent hidden mismatches in user input, searching, or data storage.

```python
import unicodedata

def normalize(s: str, form: str = "NFC") -> str:
    return unicodedata.normalize(form, s)

# Usage:
raw_text = "\u2460 \uFB01le"          # “① ﬁle” (circled one, fi ligature)
clean = normalize(raw_text, "NFKC")   # "1 file"
```

> **Key takeaway:**  
> Unicode normalization ensures that visually or semantically equivalent text compares equal, which is crucial for reliable matching, searching, and data integrity.


# Attention

## Attention Mechanisms in Transformers (with Examples)

Modern Transformer models use attention to let each token “attend” to other tokens—either within the same sequence (self‐attention) or across encoder/decoder (encoder–decoder attention). Below is a unified view with small toy examples.

---

### 1. Scaled Dot‐Product Attention

Given:
```python
import numpy as np
# Toy embeddings for two tokens (d_k=2) and two values (d_v=2):
Q = np.array([[1.0, 0.0]])       # “hello” query
K = np.array([[0.9, 0.1],        # “hello” key
              [0.1, 0.9]])       # “world” key
V = np.array([[1.0, 1.0],        # “hello” value
              [0.0, 1.0]])       # “world” value


Compute scores:

S=QK ⊤=[[(1×0.9+0×0.1),(1×0.1+0×0.9)]]=[[0.9,0.1]]

Scale by √d_k (√2≈1.41), softmax:

scores = S / np.sqrt(2)           # ≈ [0.64, 0.07]
weights = softmax(scores)        # ≈ [0.84, 0.16]

Weighted sum of V:

Z=weights⋅V=0.84×[1,1]+0.16×[0,1]=[0.84,1.00]

Self‐Attention
For the sequence “The cat sat”, suppose embeddings:

X = [[1,0],[0,1],[1,1]]  # 3 tokens × 2 dims

Build Q, K, V via linear projections (here identity for simplicity).

Compute full 3×3 score matrix and softmax row‐wise:

The	cat	sat
The	1⋅1+0⋅0=1 → softmax → [0.6,0.2,0.2]
cat	0⋅1+1⋅0=0 → [0.3,0.4,0.3]
sat	1⋅1+1⋅1=2 → [0.1,0.1,0.8]

Outputs each mix values of The, cat, sat with those weights.

Bidirectional: every token attends to both preceding and following tokens.

3. Encoder–Decoder Attention
Machine translation: Source = “hello how are you” (encoder), Target generating “ciao come va” (decoder).

At decoder step t where it’s about to output “come”:

  derived from decoder’s hidden “come?” state.

Keys/Values from encoder outputs of each source token.

Attention weights might be highest on “how” and “are” for “come.”

Example weights (source tokens in order):

[0.05, 0.10, 0.75, 0.10] → mostly “are”
Then decoder context = weighted sum of encoder values.

4. Multi‐Head Attention
Say we use 2 heads, each of dimensionality 2:

# Q, K, V for “hello how”
X = np.array([[1,0,1,0],      # flattened 4-d model
              [0,1,0,1],
              [1,1,0,0]])
# Head 1: take first 2 dims of Q,K,V
# Head 2: take last 2 dims
Each head attends differently:

Head 1 might focus on “hello”→“how”

Head 2 might focus on “how”→“you”

Their outputs (3×2 + 3×2) concatenate into 3×4, then project back to 3×4.

### 5. Layer Integration
In each Transformer layer (encoder or decoder):

1. Multi‐Head Self‐Attention → Add & Norm
2. (Decoder only) Encoder–Decoder Multi‐Head Attention → Add & Norm
3. Position‐wise Feed‐Forward → Add & Norm

Stacking $L$ such layers builds deep contextual encoders/decoders.
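
As a rough sketch of how the sublayers and “Add & Norm” fit together, the following stacks two toy encoder layers. Everything here is a stand‐in: the attention uses identity projections, the feed‐forward step is a bare ReLU with no learned weights, and the layer norm has no learned gain or bias. All function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d_k = len(X[0])
    out = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X])
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d_k)])
    return out

def feed_forward(x):
    """Stand-in for the position-wise feed-forward network: ReLU only."""
    return [max(0.0, xi) for xi in x]

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def encoder_layer(X):
    attn = self_attention(X)
    X = [add_and_norm(x, a) for x, a in zip(X, attn)]    # self-attention → Add & Norm
    X = [add_and_norm(x, feed_forward(x)) for x in X]    # feed-forward → Add & Norm
    return X

X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]   # 3 tokens × d_model = 4

num_layers = 2               # "L" in the text
for _ in range(num_layers):
    X = encoder_layer(X)

print(len(X), len(X[0]))     # 3 4
```

The shape is preserved from layer to layer, which is what makes stacking possible.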

Key Takeaway:
Attention computes weighted sums of values using the similarity of queries to keys. Self‐attention lets tokens interact bidirectionally, encoder–decoder attention connects source to target, and multi‐head attention expands representational capacity, all enabling powerful sequence modeling.

## Language Classification




# Transformer Attention: Concepts & Concrete Examples

Transformers power tasks from translation to sentiment analysis by using **attention** to let each token selectively “listen” to other tokens. Below is a unified overview **with small, worked examples** at each step.

---

## 1. Scaled Dot-Product Attention

### Setup

Suppose we have 2 tokens (“hello”, “world”), each with a 2-dimensional key/query/value embedding:

```python
import numpy as np

# Query: “hello”
Q = np.array([[1.0, 0.0]])        # shape (1×2)

# Keys:
K = np.array([
    [0.9, 0.1],  # “hello” key
    [0.1, 0.9],  # “world” key
])  # shape (2×2)

# Values:
V = np.array([
    [1.0, 1.0],  # “hello” value
    [0.0, 1.0],  # “world” value
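])  # shape (2×2)
```

### 1.1 Compute raw scores

$$S = QK^\top = [\,1 \times 0.9 + 0 \times 0.1,\; 1 \times 0.1 + 0 \times 0.9\,] = [0.9,\; 0.1]$$

### 1.2 Scale & Softmax
Divide by $\sqrt{d_k} = \sqrt{2} \approx 1.414$ to stabilize gradients, then apply softmax:

```python
S = Q @ K.T                        # [[0.9, 0.1]]
scores = S / np.sqrt(2)            # ≈ [[0.636, 0.071]]
weights = np.exp(scores) / np.sum(np.exp(scores))
# ≈ [[0.64, 0.36]]
```

### 1.3 Weighted sum → output

$$Z = \text{weights} \cdot V = 0.64\,[1, 1] + 0.36\,[0, 1] = [0.64,\; 1.00]$$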
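
One way to see why the division by $\sqrt{d_k}$ matters is to compare softmax with and without it at a larger dimension. The sketch below uses made‐up constant vectors with $d_k = 64$; the numbers are illustrative only. Without scaling, the dot products grow with $d_k$ and the softmax saturates, which leaves almost no gradient for the losing key.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
q  = [1.0] * d_k     # hypothetical query
k1 = [0.5] * d_k     # hypothetical key 1
k2 = [0.25] * d_k    # hypothetical key 2

raw = [sum(a * b for a, b in zip(q, k)) for k in (k1, k2)]   # [32.0, 16.0]
unscaled = softmax(raw)
scaled = softmax([s / math.sqrt(d_k) for s in raw])          # scores become [4.0, 2.0]

print([round(w, 4) for w in unscaled])  # [1.0, 0.0]
print([round(w, 4) for w in scaled])    # [0.8808, 0.1192]
```

With scaling, the second key still receives about 12% of the weight, so the model can keep learning from it.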

## 2. Self-Attention (Within One Sequence)
For a 3-word sentence “A B C” with toy embeddings:

```python
X = np.array([
    [1, 0],   # “A”
    [0, 1],   # “B”
    [1, 1],   # “C”
])  # shape (3×2)
Q = K = V = X  # identity projections here for simplicity
```

1. Compute all pairwise dot-products → 3×3 matrix $S$.
2. Scale, then apply a row-wise softmax → each row sums to 1.
3. Each output row is a weighted sum of the rows of $V$.

E.g. the row for “C” attends mostly to itself:

$$\text{softmax}\!\left(\tfrac{1}{\sqrt{2}}\,[1{+}0,\; 0{+}1,\; 1{+}1]\right) = \text{softmax}\!\left(\tfrac{1}{\sqrt{2}}\,[1,\, 1,\, 2]\right) \approx [0.25,\; 0.25,\; 0.50]$$

Bidirectional: every token (A, B, C) can look left and right.

## 3. Encoder–Decoder (Cross) Attention
In translation, the encoder processes the source “how are you” and produces hidden states $\{h_1, h_2, h_3\}$. At the decoder step for the target token “come”:

- Decoder hidden state → query $q_t$
- Encoder states → keys $\{k_i\}$ and values $\{v_i\}$
- Dot-product + softmax gives weights over the source tokens, e.g. `[0.05, 0.15, 0.80]` (mostly attends to “you”)
- Context = weighted sum of values → informs decoding of “come”
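
The query/key/value flow for cross-attention can be sketched as below. The encoder states and the decoder query are hypothetical 2-d vectors chosen for illustration, so the resulting weights differ from the example weights in the text; the function name `cross_attention` is also illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each decoder query attends over all encoder keys and values."""
    d_k = len(keys[0])
    all_weights, contexts = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        ctx = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        all_weights.append(w)
        contexts.append(ctx)
    return all_weights, contexts

# Hypothetical encoder states for "how", "are", "you" (used as both keys and values)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]]
# Hypothetical decoder query q_t for the step that emits "come"
decoder_queries = [[1.0, 2.0]]

weights, contexts = cross_attention(decoder_queries, encoder_states, encoder_states)
print([round(w, 2) for w in weights[0]])    # [0.05, 0.1, 0.85]
print([round(c, 2) for c in contexts[0]])   # [0.9, 1.8]
```

The only difference from self-attention is where the inputs come from: queries from the decoder, keys and values from the encoder.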

## 4. Multi-Head Attention
Instead of one attention, we use H parallel heads, each on its own linear projection of Q/K/V:

```text
Input X (seq_len × d_model)
        ↓
┌──────────────────────────────┐
│ Head₁: Linear → Attention    │
│ Head₂: Linear → Attention    │ → Concatenate → Linear → Output
│ …                            │
│ Head_H: Linear → Attention   │
└──────────────────────────────┘
```

Example: d_model=4, H=2, so each head has d_k=d_v=2.

- Head 1 might focus on local word co-occurrence; Head 2 on longer-range patterns.
- Their two (seq×2) outputs are concatenated → (seq×4) → final projection.

## 5. Putting It All Together
A single Transformer layer (encoder side) looks like:

1. Multi-Head Self-Attention → Add & Norm
2. Position-wise Feed-Forward → Add & Norm

A decoder layer has three sublayers, each followed by Add & Norm:

1. Masked multi-head self-attention
2. Encoder–decoder multi-head attention
3. Feed-forward

Stacking $L$ layers yields deep models (e.g., BERT-base has 12 encoder layers and BERT-large has 24).
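
The “masked” self-attention in the decoder differs from the bidirectional encoder version only in which positions each token may see. Below is a minimal sketch of a causal mask applied to the “A B C” embeddings from the self-attention example, assuming identity projections; `masked_softmax` is an illustrative helper, not a library function.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; masked positions get weight 0."""
    mx = max(s for s, keep in zip(scores, mask) if keep)
    exps = [math.exp(s - mx) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

X = [[1, 0], [0, 1], [1, 1]]      # "A", "B", "C"
d_k = len(X[0])
n = len(X)

W = []
for i in range(n):
    scores = [sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d_k) for j in range(n)]
    causal = [j <= i for j in range(n)]   # token i may only see positions 0..i
    W.append(masked_softmax(scores, causal))

for row in W:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.33, 0.67, 0.0]
# [0.25, 0.25, 0.5]
```

The last row matches the unmasked case, since “C” is the final token and already sees everything before it.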

## Why It Works

- Vector similarity (dot-product) lets tokens dynamically share information.
- Multiple heads learn different “eyes” over the sequence.
- Layer stacking builds rich contextual representations used for classification, generation, and more.

## Quick Recap

| Attention Type  | Query from…                | Key/Value from…            | Use Case                      |
| :-------------- | :------------------------- | :------------------------- | :---------------------------- |
| Self-Attention  | same layer inputs          | same layer inputs          | Contextual encoding (BERT)    |
| Cross-Attention | decoder                    | encoder outputs            | Seq2seq (translation, etc.)   |
| Multi-Head      | each head's own projection | each head's own projection | Expand representational power |

Each component is fully differentiable, so models learn what to attend to from raw data.




# Sentiment Classification in Python

Below are two complete examples, one using **Flair** and one using Hugging Face’s **Transformers**. Copy and paste them into your own notes or notebook.

---

## 1. Flair

```bash
# Install Flair
pip install flair
```

```python
# 1️⃣ Initialize the model
from flair.data import Sentence
from flair.models import TextClassifier

model = TextClassifier.load('en-sentiment')

# 2️⃣ Tokenize + wrap text
text = "I like you!"
sentence = Sentence(text)

# 3️⃣ Predict sentiment (labels are attached to the sentence in place)
model.predict(sentence)

# 4️⃣ Inspect the result (the exact print format varies by Flair version)
print(sentence)
# → Sentence: "I like you !"  [– Tokens: 4 – Sentence-Labels: {'label': [POSITIVE (0.9928)]}]

label = sentence.get_labels()[0]
print(label.value, label.score)
# → POSITIVE 0.9928

# 5️⃣ Another example (negative)
text = "I hate it when I'm not learning about ML"
sentence = Sentence(text)
model.predict(sentence)
neg_label = sentence.get_labels()[0]
print(neg_label.value, neg_label.score)
# → NEGATIVE 0.9991
```

## 2. Hugging Face Transformers

```bash
# Install Transformers and PyTorch
pip install transformers torch
```

```python
# 1️⃣ Load model & tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# FinBERT: a BERT model fine-tuned for financial sentiment (positive / negative / neutral)
model_name = "ProsusAI/finbert"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2️⃣ Build a sentiment-analysis pipeline
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# 3️⃣ Run on a single sentence
result = nlp("I love machine learning!")
print(result)
# → e.g. [{'label': 'POSITIVE', 'score': 0.9993}]
#   (label names and scores depend on the model; FinBERT's labels are lowercase)

# 4️⃣ Batch inference
texts = [
    "That movie was fantastic!",
    "The service was terrible and slow."
]
results = nlp(texts)
for txt, res in zip(texts, results):
    print(f"{txt}  →  {res['label']} ({res['score']:.2f})")
```

### Appendix: Special Token IDs (BERT-style)

| Token    |  ID |
| :------- | :-: |
| `[CLS]`  | 101 |
| `[SEP]`  | 102 |
| `[MASK]` | 103 |
| `[UNK]`  | 100 |
| `[PAD]`  |  0  |
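
To show where these IDs end up, here is a small sketch that wraps a sentence in `[CLS]` … `[SEP]` and pads it, mimicking the shape of a tokenizer's output. It is not the Hugging Face tokenizer itself: `build_inputs` is a hypothetical helper, and the word-piece IDs are invented for illustration.

```python
# Special-token IDs from the table above
CLS, SEP, PAD = 101, 102, 0

def build_inputs(word_ids, max_length):
    """Wrap word-piece IDs as [CLS] … [SEP], then pad to max_length."""
    ids = [CLS] + word_ids[: max_length - 2] + [SEP]   # truncate to leave room for [CLS]/[SEP]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 0 marks padding to ignore
    }

# Hypothetical word-piece IDs for a three-token sentence (values made up for this sketch)
encoded = build_inputs([2001, 2002, 2003], max_length=8)
print(encoded["input_ids"])       # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what keeps the model from attending to the `[PAD]` positions.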

## 3. Raw Model Outputs & Post‐processing

Sometimes you want to call the model directly (without the pipeline helper) and then convert its raw logits into probabilities and labels yourself:

```python
import torch
import torch.nn.functional as F

# Assume you've already done:
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model     = AutoModelForSequenceClassification.from_pretrained(model_name)
# and that `txt` holds the input string to classify.

# 1️⃣ Tokenize + get kwargs dict (input_ids, attention_mask, ...)
tokens = tokenizer.encode_plus(
    txt,
    max_length=512,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_tensors='pt'
)

# 2️⃣ Forward pass: unpack **tokens into model(); no gradients are needed for inference
with torch.no_grad():
    output = model(**tokens)
# output is a SequenceClassifierOutput with `logits` of shape (batch, num_labels)

# 3️⃣ Extract the first (and only) batch element’s logits
logits = output.logits[0]     # → tensor([-1.8200,  2.4484,  0.0216])

# 4️⃣ Convert logits to probabilities with softmax
probs = F.softmax(logits, dim=-1)
print(probs)
# → tensor([0.0127, 0.9072, 0.0801])

# 5️⃣ Pick the index of the highest‐probability class
predicted_class_idx = torch.argmax(probs).item()
print(predicted_class_idx)
# → 1

# 6️⃣ Map index → label (you can inspect `model.config.id2label`)
label = model.config.id2label[predicted_class_idx]
print(label, probs[predicted_class_idx].item())
# → the label for index 1, followed by ≈ 0.9072 (label names come from the model's config)
```


### Summary

1. Tokenize → a dict of `input_ids`, `attention_mask`, (…)
2. Unpack the dict into `model(**tokens)` → raw logits
3. Softmax → probabilities
4. Argmax → predicted class index
5. Look up the index in `model.config.id2label` for the final label
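
The post-processing steps (softmax, argmax, label lookup) can be checked without loading a model. The sketch below starts from the example logits above and uses a placeholder `id2label` mapping; a real mapping would come from `model.config.id2label`.

```python
import math

logits = [-1.8200, 2.4484, 0.0216]   # raw logits from the example above

# Placeholder index → label mapping (hypothetical; real names come from the model config)
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}

# Softmax: subtract the max for numerical stability, exponentiate, normalize
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Argmax: index of the highest-probability class
predicted_idx = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])                              # [0.0127, 0.9072, 0.0801]
print(id2label[predicted_idx], round(probs[predicted_idx], 4))   # LABEL_1 0.9072
```

Since softmax preserves ordering, taking the argmax of the logits directly would give the same index; the softmax is only needed when you want the probabilities.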