

# Georgian Spelling Correction: Complete Model Explanation

## 1. Framing the Problem as a Sequence-to-Sequence Task

### The Problem Statement
The task is **Georgian spelling correction** — given a misspelled Georgian word (like `გაამრჯობა`), the model should output the correct spelling (`გამარჯობა`).

### Why Seq2Seq?
This is naturally framed as a **sequence-to-sequence (seq2seq) problem** because:

| Input Sequence | Output Sequence |
|----------------|-----------------|
| `გ → ა → ა → მ → რ → ჯ → ო → ბ → ა` (corrupted) | `გ → ა → მ → ა → რ → ჯ → ო → ბ → ა` (correct) |

- **Input and output lengths can differ** — deletions remove characters, insertions add characters
- The model must learn **alignment** between corrupted and correct sequences
- It's essentially a **translation task** from "corrupted Georgian" to "correct Georgian"
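
To make the alignment idea concrete, here is a small sketch that uses Python's standard `difflib` (not something the notebook itself uses) to list the edits between a corrupted word and its correction. The helper names `edit_ops` and `apply_ops` are hypothetical, and the second pair is an illustrative deletion typo added to show a length mismatch.

```python
import difflib

def edit_ops(corrupted, correct):
    """List (tag, corrupted_span, correct_span) edits that turn one word into the other."""
    matcher = difflib.SequenceMatcher(None, corrupted, correct)
    return [(tag, corrupted[i1:i2], correct[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

def apply_ops(ops):
    """Rebuild the correct word: keep equal spans, otherwise take the target span."""
    return "".join(src if tag == "equal" else tgt for tag, src, tgt in ops)

pairs = [
    ("გაამრჯობა", "გამარჯობა"),   # transposition: same length
    ("გმარჯობა", "გამარჯობა"),    # deletion: input is one character shorter
]
for corrupted, correct in pairs:
    ops = edit_ops(corrupted, correct)
    # Show only the spans where the two words disagree.
    print(len(corrupted), len(correct), [op for op in ops if op[0] != "equal"])
```

A seq2seq model has to learn this kind of alignment implicitly, without ever being given the edit operations.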

### Character-Level vs Word-Level
The code chooses **character-level** processing for several key reasons (from [DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L10-L15)):



```python
"""
1. WHY CHARACTER-LEVEL TRANSFORMER?
   - Georgian alphabet has only ~33 letters → small vocabulary (~50 with special tokens)
   - Spelling errors happen at character level (typos, adjacent keys)
   - Better generalization than word-level models for unseen misspellings
"""
```



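The saved `char_vocab.json` confirms the small vocabulary: it contains only **39 tokens** (33 Georgian letters, 4 special tokens, and the remaining 2 punctuation), fewer than the docstring's rough estimate of ~50. The docstring's heading mentions a Transformer, but the model explained here is an LSTM encoder-decoder; the character-level arguments apply to it unchanged.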
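
The letter count can be checked directly from Unicode. This is a minimal sketch, assuming the vocabulary uses the modern Mkhedruli letters, which occupy the contiguous range U+10D0 (ა) to U+10F0 (ჰ); the constant names are illustrative.

```python
# Modern Georgian (Mkhedruli) letters occupy U+10D0 (ა) through U+10F0 (ჰ).
GEORGIAN_LETTERS = [chr(cp) for cp in range(0x10D0, 0x10F1)]
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# Letters plus special tokens; any punctuation seen in the data is added on top.
base_vocab_size = len(GEORGIAN_LETTERS) + len(SPECIAL_TOKENS)
print(len(GEORGIAN_LETTERS), base_vocab_size)   # 33 37
```

A word-level vocabulary for the same data would need one entry per distinct word form.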

---

## 2. Mechanics of Recurrent Networks at Character Level

### Architecture Overview

The model is an **LSTM Encoder-Decoder with Bahdanau Attention**:



```text
┌─────────────────────────────────────────────────────────────────┐
│                    SpellingLSTM Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│  ENCODER (Bidirectional LSTM)                                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Input: "გაამრჯობა" (corrupted)                          │    │
│  │  Embedding: 256 dims                                     │    │
│  │  LSTM: 2 layers × 256 hidden (bidirectional = 512 out)  │    │
│  │  Output: encoder_outputs, (hidden, cell)                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            ↓                                    │
│  BRIDGE (Linear layers to transform bidirectional → decoder)    │
│                            ↓                                    │
│  DECODER (Unidirectional LSTM with Attention)                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Input: <SOS> token, then previous predictions           │    │
│  │  Attention: Bahdanau (additive) over encoder outputs    │    │
│  │  LSTM: 2 layers × 512 hidden                            │    │
│  │  Output: character probabilities for each step           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            ↓                                    │
│  Output: "გამარჯობა" (corrected)                                │
└─────────────────────────────────────────────────────────────────┘
```



### Key Components Explained

#### 1. Embedding Layer ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L387-L389))


```python
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
```

- Converts character indices to 256-dimensional dense vectors
- `padding_idx=0` ensures `<PAD>` tokens don't affect gradients
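
The effect of `padding_idx=0` can be seen on a toy batch. This is a standalone sketch using the sizes quoted in this document (39 tokens, 256 dimensions); the token ids are arbitrary.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 39, 256   # sizes quoted in this document
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

# One padded sequence: three real character ids followed by two <PAD> (0) ids.
ids = torch.tensor([[5, 7, 9, 0, 0]])
vectors = embedding(ids)               # shape: (batch=1, seq_len=5, 256)

# The padding row is all zeros and receives a zero gradient.
vectors.sum().backward()
pad_is_zero = bool((vectors[0, 3] == 0).all())
pad_grad_is_zero = bool((embedding.weight.grad[0] == 0).all())
```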

#### 2. Bidirectional LSTM Encoder ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L390-L396))


```python
self.lstm = nn.LSTM(
    embedding_dim, hidden_dim, num_layers=2,
    batch_first=True, dropout=dropout, bidirectional=True
)
```

- **Bidirectional**: Reads the corrupted word both left→right AND right→left
- This captures full context — knowing `რჯობა` comes after helps identify that `გაამ` should be `გამა`
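
The tensor shapes explain where the "512 out" in the architecture diagram comes from. The sketch below runs the encoder LSTM on random input. The dropout value and the concatenation "bridge" at the end are assumptions for illustration only; the notebook's bridge uses linear layers.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, num_layers, batch = 256, 256, 2, 4
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
               batch_first=True, dropout=0.1, bidirectional=True)

x = torch.randn(batch, 9, embedding_dim)      # (batch, src_len, embedding_dim)
encoder_outputs, (hidden, cell) = lstm(x)

# encoder_outputs: forward and backward states concatenated -> (4, 9, 512)
# hidden / cell: (num_layers * 2 directions, batch, hidden_dim) -> (4, 4, 256)

# One possible bridge (assumption): pair up the two directions of each layer
# so the states can initialize a unidirectional decoder with 512 hidden units.
hidden_by_layer = hidden.reshape(num_layers, 2, batch, hidden_dim)
decoder_init = torch.cat([hidden_by_layer[:, 0], hidden_by_layer[:, 1]], dim=2)
```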

#### 3. Bahdanau Attention ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L516-L548))


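```python
class BahdanauAttention(nn.Module):
    def forward(self, hidden, encoder_outputs, mask=None):
        # Computes attention weights over encoder outputs.
        # `hidden` is assumed already repeated to (batch, src_len, dec_hidden).
        energy = torch.tanh(self.attn(torch.cat([hidden, encoder_outputs], dim=2)))
        # Assumed scoring layer `self.v` (Linear -> 1); elided in the original excerpt
        attention = self.v(energy).squeeze(2)
        if mask is not None:
            attention = attention.masked_fill(mask == 0, float('-inf'))  # assumes mask is 0 at <PAD>
        attention_weights = torch.softmax(attention, dim=1)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        return context, attention_weights
```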

- **Why attention?** When generating the 3rd output character, the model can "look at" specific positions in the input
- For spelling correction, attention learns to align corrupted↔correct characters
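
For readers who want to run the mechanism end to end, here is a self-contained sketch of additive attention. It is not the notebook's class: the name `AdditiveAttention`, the layer sizes, and the boolean mask convention (`True` for real characters) are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score = v^T tanh(W [h_dec; h_enc])."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.attn = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs, mask=None):
        # hidden: (batch, dec_dim); encoder_outputs: (batch, src_len, enc_dim)
        src_len = encoder_outputs.size(1)
        hidden = hidden.unsqueeze(1).expand(-1, src_len, -1)
        energy = torch.tanh(self.attn(torch.cat([hidden, encoder_outputs], dim=2)))
        scores = self.v(energy).squeeze(2)                     # (batch, src_len)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # hide <PAD>
        weights = torch.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights

attention = AdditiveAttention(enc_dim=512, dec_dim=512, attn_dim=512)
encoder_outputs = torch.randn(2, 9, 512)
decoder_hidden = torch.randn(2, 512)
mask = torch.ones(2, 9, dtype=torch.bool)
mask[1, 6:] = False                      # second word has 3 padded positions
context, weights = attention(decoder_hidden, encoder_outputs, mask)
```

Each row of `weights` sums to 1, and padded positions receive exactly zero weight, so the context vector is a weighted average of real input characters only.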

#### 4. Decoder with Teacher Forcing ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L630-L642))


Teacher forcing is one of the techniques we covered before the midterm.

```python
# During training: feed ground truth tokens
for t in range(tgt_len):
    tgt_t = tgt[:, t].unsqueeze(1)  # Ground truth token
    prediction, hidden, cell, _ = self.decoder(tgt_t, hidden, cell, encoder_outputs)
```

- **Teacher forcing**: During training, feed the correct previous character (not the model's prediction)
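- This speeds up training but creates **exposure bias**: at inference time the decoder is fed its own predictions, which it never saw during training. The inference code guards against the worst symptom, runaway generation, with explicit stop conditions (see Phase 5)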
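
Teacher forcing amounts to shifting the target by one position. The sketch below shows which token is fed and which is expected at each step; the helper name `teacher_forcing_pairs` is hypothetical and works on characters rather than tensor ids for readability.

```python
SOS, EOS = "<SOS>", "<EOS>"

def teacher_forcing_pairs(correct_word):
    """Return (decoder_input, expected_output) token lists for one target word."""
    chars = list(correct_word)
    decoder_input = [SOS] + chars       # what the decoder is fed at each step
    expected_output = chars + [EOS]     # what it should predict at each step
    return decoder_input, expected_output

decoder_input, expected_output = teacher_forcing_pairs("გამ")
for t, (fed, target) in enumerate(zip(decoder_input, expected_output)):
    print(f"t={t}: feed {fed!r} -> predict {target!r}")
```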

### How Character-Level RNNs Work Step-by-Step

For input `"გაა"` → `"გამ"`:

| Time | Input Token | LSTM State | Attention Focus | Output |
|------|-------------|------------|-----------------|--------|
| t=0  | `<SOS>` | Initial (from encoder) | Full input | `გ` |
| t=1  | `გ` | Updated | Position 1-2 | `ა` |
| t=2  | `ა` | Updated | Position 2-3 | `მ` |
| t=3  | `მ` | Updated | - | `<EOS>` |

---

## 3. Data Quality and Thoughtful Dataset Construction

### Data Source
The data comes from `data/wordsChunk_*.json` files containing Georgian words.

### Corruption Strategy — Simulating Real Typos

The `corrupt_word` function ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L214-L306)) implements **linguistically-informed data augmentation**:



```python
error_type = random.choices(
    ['substitute', 'delete', 'insert', 'transpose', 'repeat'],
    weights=[0.35, 0.25, 0.20, 0.15, 0.05]
)[0]
```



| Error Type | Weight | Real-World Basis |
|------------|--------|------------------|
| **Substitution** | 35% | Adjacent key typos (most common) |
| **Deletion** | 25% | Skipping a letter while typing fast |
| **Insertion** | 20% | Double-pressing or extra keystrokes |
| **Transposition** | 15% | Swapping adjacent letters ("the" → "teh") |
| **Repetition** | 5% | Accidental key hold ("helllo") |
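
The non-substitution error types are simple string edits. The sketch below is one way to write them; the function names are hypothetical and the notebook's `corrupt_word` may structure this differently. It also samples the weighted error types to check that the observed shares follow the table.

```python
import random

def delete_char(word, i):
    return word[:i] + word[i + 1:]

def insert_char(word, i, ch):
    return word[:i] + ch + word[i:]

def transpose_chars(word, i):
    # Swap the characters at positions i and i + 1.
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def repeat_char(word, i):
    return word[:i + 1] + word[i] + word[i + 1:]

ERROR_TYPES = ['substitute', 'delete', 'insert', 'transpose', 'repeat']
WEIGHTS = [0.35, 0.25, 0.20, 0.15, 0.05]

# Sample many error types with a fixed seed and measure their empirical share.
rng = random.Random(0)
sampled = rng.choices(ERROR_TYPES, weights=WEIGHTS, k=10_000)
share = {name: sampled.count(name) / len(sampled) for name in ERROR_TYPES}

print(transpose_chars("გამარჯობა", 2))   # გაამრჯობა
```

Transposing positions 2 and 3 of `გამარჯობა` reproduces the corrupted example used throughout this document.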

### Georgian Keyboard Layout Knowledge

The code includes a **Georgian keyboard adjacency map** ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L62-L98)):



```python
GEORGIAN_KEYBOARD = {
    'ა': ['ქ', 'ს', 'ზ'],  # Keys adjacent to 'ა' on Georgian keyboard
    'ბ': ['ვ', 'ნ', 'გ', 'ჰ'],
    # ... etc
}
```



This ensures substitution errors are **realistic**: a typo for `ა` is likely `ქ` or `ს` (adjacent keys), not `ჰ` (several keys away).
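
A keyboard-aware substitution could look like the following sketch. Only the two map entries shown in the excerpt are included, and the function name `substitute_adjacent` and its fallback behaviour for unmapped letters are assumptions.

```python
import random

# Two entries copied from the excerpt above; the full map covers every letter.
GEORGIAN_KEYBOARD = {
    'ა': ['ქ', 'ს', 'ზ'],
    'ბ': ['ვ', 'ნ', 'გ', 'ჰ'],
}

def substitute_adjacent(word, position, rng=random):
    """Replace word[position] with a neighbouring key, if the map knows one."""
    neighbours = GEORGIAN_KEYBOARD.get(word[position])
    if not neighbours:
        return word                      # unmapped key: leave the word unchanged
    typo = rng.choice(neighbours)
    return word[:position] + typo + word[position + 1:]

rng = random.Random(42)
corrupted = substitute_adjacent("გამარჯობა", 1, rng)   # corrupt the first 'ა'
unchanged = substitute_adjacent("გამ", 2, rng)          # 'მ' is not in this small map
```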

### Why 100% Corruption Rate?


```python
CORRUPTION_RATE = 1.0  # 100% - ALL words get corrupted
```



The dataset uses 100% corruption because:
1. The model needs to learn **correction**, not just copying
2. Some "clean" examples (10% of failed corruptions) are kept for regularization
3. Other corruption rates were tried, but they gave much worse results
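
The pair-building logic can be sketched as follows, assuming that a "failed corruption" means the corruption function returned the word unchanged. `build_pairs`, `KEEP_FAILED_RATE`, and the toy `swap_first_two` corruption are hypothetical stand-ins, not the notebook's code.

```python
import random

KEEP_FAILED_RATE = 0.10   # share of failed corruptions kept as clean pairs

def build_pairs(words, corrupt_fn, rng):
    """Return (corrupted, correct) pairs; `corrupt_fn` stands in for corrupt_word."""
    pairs = []
    for word in words:
        if len(word) < 2:
            continue                          # same length filter as the data pipeline
        corrupted = corrupt_fn(word, rng)     # every remaining word is corrupted
        if corrupted != word:
            pairs.append((corrupted, word))
        elif rng.random() < KEEP_FAILED_RATE:
            pairs.append((word, word))        # identity pair, kept for regularization
    return pairs

def swap_first_two(word, rng):
    # Toy corruption; "fails" (returns the same word) when both letters are equal.
    return word[1] + word[0] + word[2:]

pairs = build_pairs(["გამარჯობა", "მადლობა", "ა"], swap_first_two, random.Random(0))
```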
---

## 4. Full Lifecycle: Raw Data to Deployed Model

### Phase 1: Data Preparation



```text
data/
├── wordsChunk_0.json
├── wordsChunk_1.json
└── wordsChunk_2.json
     ↓
Raw JSON words → Filter (len >= 2) → Corrupt 100% → (corrupted, correct) pairs
     ↓
Training pairs: [(გაამრჯობა, გამარჯობა), ...]
```



### Phase 2: Vocabulary Building ([DL_HW_3 (1).ipynb](DL_HW_3%20(1).ipynb#L325-L377))



```python
class CharVocab:
    def __init__(self):
        self.char2idx = {
            '<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3,
        }
```

- Special tokens: `<PAD>` (padding), `<SOS>` (start), `<EOS>` (end), `<UNK>` (unknown)
- Build mapping from all characters in dataset → integer indices
- Saved to char_vocab.json for inference
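
A minimal vocabulary along these lines might look like the sketch below. The class name `CharVocabSketch` and the `build`, `decode`, and `to_json` methods are assumptions; only the special-token ids and the `encode(word, add_sos, add_eos)` call shape come from the notebook excerpts.

```python
import json

class CharVocabSketch:
    SPECIALS = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']

    def __init__(self):
        self.char2idx = {tok: i for i, tok in enumerate(self.SPECIALS)}
        self.idx2char = {i: tok for tok, i in self.char2idx.items()}

    def build(self, words):
        # Assign the next free index to every character not seen before.
        for ch in sorted(set("".join(words))):
            if ch not in self.char2idx:
                idx = len(self.char2idx)
                self.char2idx[ch] = idx
                self.idx2char[idx] = ch

    def encode(self, word, add_sos=False, add_eos=True):
        unk = self.char2idx['<UNK>']
        ids = [self.char2idx.get(ch, unk) for ch in word]
        if add_sos:
            ids = [self.char2idx['<SOS>']] + ids
        if add_eos:
            ids = ids + [self.char2idx['<EOS>']]
        return ids

    def decode(self, ids):
        # Drop special tokens and map the rest back to characters.
        specials = set(range(len(self.SPECIALS)))
        return "".join(self.idx2char[i] for i in ids if i not in specials)

    def to_json(self):
        return json.dumps(self.char2idx, ensure_ascii=False)

vocab = CharVocabSketch()
vocab.build(["გამარჯობა", "მადლობა"])
ids = vocab.encode("გამარჯობა")
```

Characters never seen during `build` map to `<UNK>` (id 3) instead of raising an error, which matters at inference time when a user types something outside the training alphabet.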

### Phase 3: Training Pipeline



```python
# Hyperparameters
BATCH_SIZE = 128
NUM_EPOCHS = 15
LEARNING_RATE = 0.001
```

**Thanks to the two legends who were discussing the memory usage of the words during the lecture for the FP16 idea.**

**Key training techniques:**

1. **Mixed Precision (FP16)** — 2x speedup, lower memory:
   ```python
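   with autocast():
       output = model(src, tgt_input, ...)
       # ... loss computed from `output` with `criterion` (elided) ...
   scaler.scale(loss).backward()
   scaler.step(optimizer)   # skips the update if gradients overflowed
   scaler.update()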
   ```

2. **Label Smoothing** — Better generalization:
   ```python
   criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)
   ```

3. **Gradient Clipping** — Prevent exploding gradients:
   ```python
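   scaler.unscale_(optimizer)  # with a GradScaler: unscale first so max_norm applies to the true gradients
   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)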
   ```

4. **Learning Rate Scheduling** — Reduce LR on plateau:
   ```python
   scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
   ```

5. **Early Stopping** — Prevent overfitting:
   ```python
   if patience_counter >= max_patience:
       break
   ```
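
The five techniques interact, so the order of calls matters. The sketch below puts them into one loop on a toy linear model with random data. It is not the notebook's training loop: the stand-in model, `max_patience = 5`, and using the training loss as the "validation" loss are all assumptions. FP16 autocast is enabled only when a GPU is available.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = device == 'cuda'                      # FP16 autocast only on GPU

model = nn.Linear(16, 39).to(device)            # toy stand-in for SpellingLSTM
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

best_val_loss, patience_counter, max_patience = float('inf'), 0, 5
for epoch in range(15):
    x = torch.randn(128, 16, device=device)
    y = torch.randint(1, 39, (128,), device=device)

    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                  # so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

    val_loss = loss.item()                      # stand-in for a real validation pass
    scheduler.step(val_loss)
    if val_loss < best_val_loss:
        best_val_loss, patience_counter = val_loss, 0
    else:
        patience_counter += 1
        if patience_counter >= max_patience:
            break
```

Clipping before `unscale_` would compare `max_norm` against gradients multiplied by the loss scale, which makes the threshold meaningless.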

### Phase 4: Model Saving



```python
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_loss': best_val_loss,
}, 'best_model1.pt')
```
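
Restoring such a checkpoint is the mirror image of saving it. This is a round-trip sketch on a toy model written to a temporary directory; the stand-in model and the placeholder `val_loss` value are assumptions.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # toy stand-in for SpellingLSTM
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
path = os.path.join(tempfile.mkdtemp(), 'best_model1.pt')

torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_loss': 0.123,                       # placeholder value
}, path)

# Restore into a freshly constructed model of the same architecture.
restored = nn.Linear(4, 2)
checkpoint = torch.load(path, map_location='cpu')
restored.load_state_dict(checkpoint['model_state_dict'])
restored.eval()                              # disable dropout for inference
```

Passing `map_location='cpu'` lets a checkpoint trained on a GPU load on a machine without one.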



### Phase 5: Inference/Deployment (inferece.ipynb)

The inference notebook shows the full deployment:

1. **Load vocabulary** from char_vocab.json
2. **Load model weights** from best_model1.pt
3. **Greedy decoding** with safeguards:



```python
def correct_word(model, word, vocab, device='cuda', max_len=100):
    # Encode input and move it to the same device as the model
    src = torch.LongTensor([vocab.encode(word, add_sos=False, add_eos=True)]).to(device)
    decoded_tokens = []

    # Decode autoregressively (encoder call and decoder arguments elided)
    for step in range(max_len):
        prediction, hidden, cell, _ = model.decoder(...)
        next_token_id = prediction.argmax(dim=-1).item()

        # Stop conditions
        if next_token_id == vocab.char2idx['<EOS>']:
            break
        decoded_tokens.append(next_token_id)
        if len(decoded_tokens) > len(word) * 3:
            break  # Loop prevention
```
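
The two stop conditions can be exercised without a trained model. In the sketch below, `step_fn` is a hypothetical callable that plays the role of the attention decoder and returns a list of scores over the vocabulary; the two fake decoders show normal termination and the loop-prevention cutoff.

```python
def greedy_decode(step_fn, src_len, sos_id=1, eos_id=2, max_len=100):
    """Greedy decoding with an <EOS> stop and a 3x-input-length cutoff."""
    decoded, prev_id = [], sos_id
    for t in range(max_len):
        scores = step_fn(prev_id, t)
        next_id = max(range(len(scores)), key=scores.__getitem__)   # argmax
        if next_id == eos_id:
            break                         # normal termination
        decoded.append(next_id)
        if len(decoded) > src_len * 3:
            break                         # loop prevention
        prev_id = next_id
    return decoded

def scripted(prev_id, t):
    # Emits ids 5, 6, then <EOS>.
    target = [5, 6, 2][min(t, 2)]
    return [1.0 if i == target else 0.0 for i in range(8)]

def stuck(prev_id, t):
    # Never emits <EOS>: always predicts id 5.
    return [1.0 if i == 5 else 0.0 for i in range(8)]

normal = greedy_decode(scripted, src_len=2)     # [5, 6]
runaway = greedy_decode(stuck, src_len=2)       # cut off after 7 tokens
```

Without the length cutoff, the `stuck` decoder would run until `max_len` and return 100 repeated characters.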



---

## Summary Table

| Aspect | Implementation | Rationale |
|--------|----------------|-----------|
| **Problem Framing** | Seq2Seq (corrupted → correct) | Variable-length I/O, alignment needed |
| **Granularity** | Character-level | Small vocab (39), errors are character-level |
| **Encoder** | 2-layer BiLSTM (256 hidden) | Captures full bidirectional context |
| **Decoder** | 2-layer LSTM (512 hidden) | Larger capacity for generation; matches the 512-dim BiLSTM output |
| **Attention** | Bahdanau (additive) | Soft alignment between I/O sequences |
| **Data Augmentation** | Keyboard-aware corruption | Realistic typo simulation |
| **Training** | FP16, label smoothing, early stopping | Efficient and generalizable |
| **Inference** | Greedy decoding with safeguards | Prevents loops, handles edge cases |