# N-Gram Text Encoding

**Topics:** NGramEncoder, text patterns, NLP applications
**Time:** 20 minutes
**Prerequisites:** 01_basic_operations.ipynb

---

## Setup

In [None]:
from holovec import VSA
from holovec.encoders import NGramEncoder

model = VSA.create('MAP', dim=10000, seed=42)
encoder = NGramEncoder(model, n=3, stride=1, mode='bundling', seed=42)

print(f"Model: {model.model_name}")
print(f"Encoder: n={encoder.n}, stride={encoder.stride}, mode={encoder.mode}")

## Basic N-Gram Encoding

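An n-gram encoder slides a window of `n` tokens over the sequence, advancing `stride` tokens at a time, and encodes each window. With `n=3` and `stride=1`, a five-word sentence yields three trigrams.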

In [None]:
# Encode a sentence
text = "the quick brown fox jumps"
words = text.split()

text_hv = encoder.encode(words)

print(f"Input: {words}")
print(f"Trigrams (n={encoder.n}):")
print("  ['the', 'quick', 'brown']")
print("  ['quick', 'brown', 'fox']")
print("  ['brown', 'fox', 'jumps']")
print(f"\nEncoded shape: {text_hv.shape}")
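The window logic can be reproduced in plain Python to see which n-grams a given `n` and `stride` select. This is a minimal sketch: `sliding_ngrams` is a hypothetical helper, not part of holovec, and how `NGramEncoder` treats sequences shorter than `n` or leftover trailing tokens is not shown in this notebook.

```python
def sliding_ngrams(tokens, n=3, stride=1):
    """Return the n-token windows obtained by advancing `stride` tokens at a time."""
    last_start = len(tokens) - n  # negative when the sequence is shorter than n
    return [tuple(tokens[i:i + n]) for i in range(0, last_start + 1, stride)]


words = "the quick brown fox jumps".split()

# stride=1 visits every start position: three trigrams for five words
for gram in sliding_ngrams(words, n=3, stride=1):
    print(gram)

# A larger stride skips start positions, so fewer windows are produced
print(sliding_ngrams(words, n=3, stride=2))
```

With `stride=2` only the windows starting at positions 0 and 2 remain, which halves the work but also drops the middle trigram.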

## Similarity Detection

Texts that share n-grams get similar encodings. Here `text1` and `text2` share the trigram `the quick brown`, while `text3` shares no trigram with `text1`.

In [None]:
# Encode similar and different texts
text1 = "the quick brown fox jumps"
text2 = "the quick brown dog runs"
text3 = "completely different sentence here"

hv1 = encoder.encode(text1.split())
hv2 = encoder.encode(text2.split())
hv3 = encoder.encode(text3.split())

print(f"Similarity (text1, text2): {float(model.similarity(hv1, hv2)):.3f}  ← Shared prefix")
print(f"Similarity (text1, text3): {float(model.similarity(hv1, hv3)):.3f}  ← Unrelated")
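To see why shared trigrams raise similarity, here is a standard-library sketch of one common hyperdimensional n-gram scheme: random bipolar token vectors, a cyclic shift to mark position, elementwise multiplication to bind a trigram, and an elementwise sum to bundle trigrams. It is an assumption that this resembles what `NGramEncoder` does in `'bundling'` mode; every name below is illustrative and none of it is holovec's API.

```python
import math
import random

DIM = 10_000
rng = random.Random(42)
item_memory = {}  # token -> random bipolar (+1/-1) vector


def item_vector(token):
    """Look up (or lazily create) the random bipolar vector for a token."""
    if token not in item_memory:
        item_memory[token] = [rng.choice((-1, 1)) for _ in range(DIM)]
    return item_memory[token]


def rotate(vec, k):
    """Cyclically shift a vector by k positions to mark token position."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def encode_ngram(gram):
    """Bind position-rotated token vectors by elementwise multiplication."""
    out = [1] * DIM
    for pos, token in enumerate(gram):
        shifted = rotate(item_vector(token), len(gram) - 1 - pos)
        out = [a * b for a, b in zip(out, shifted)]
    return out


def encode_text(tokens, n=3):
    """Bundle (elementwise sum) the vectors of all n-grams in the text."""
    total = [0] * DIM
    for i in range(len(tokens) - n + 1):
        gram_vec = encode_ngram(tokens[i:i + n])
        total = [t + g for t, g in zip(total, gram_vec)]
    return total


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


v1 = encode_text("the quick brown fox jumps".split())
v2 = encode_text("the quick brown dog runs".split())
v3 = encode_text("completely different sentence here".split())

sim_12 = cosine(v1, v2)  # one of three trigrams is shared
sim_13 = cosine(v1, v3)  # no trigram is shared
print(f"sim(text1, text2) = {sim_12:.3f}")
print(f"sim(text1, text3) = {sim_13:.3f}")
```

In this sketch each distinct trigram behaves like an independent random vector, so the cosine lands near the fraction of shared trigrams (about one in three for `text1` and `text2`) and near zero when nothing is shared. The values holovec reports may differ if its bundling normalizes or thresholds the sum.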

## Document Classification

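Let's build a simple text classifier: encode each training document, bundle the encodings per class into a prototype, and assign a new document to the class whose prototype is most similar.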

In [None]:
from collections import defaultdict

# Training data: sports vs technology news
training = [
    ("team wins championship game tonight", "sports"),
    ("player scores winning goal match", "sports"),
    ("new smartphone launched today features", "technology"),
    ("software update fixes security bug", "technology"),
]

# Build class prototypes: group encoded documents by label, then bundle each group
class_docs = defaultdict(list)
for text, label in training:
    class_docs[label].append(encoder.encode(text.split()))

prototypes = {label: model.bundle(hvs) for label, hvs in class_docs.items()}
print(f"\nBuilt prototypes for: {list(prototypes.keys())}")
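A prototype works because a bundle stays similar to each vector that went into it while remaining dissimilar to unrelated vectors. The sketch below illustrates this with an elementwise sum over random bipolar vectors standing in for encoded documents. Both the stand-in vectors and the plain-sum bundle are assumptions; `model.bundle` in holovec may normalize or threshold its result.

```python
import math
import random

DIM = 10_000
rng = random.Random(0)


def random_bipolar():
    """Random +1/-1 vector standing in for an encoded document."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


doc_a, doc_b, unrelated = random_bipolar(), random_bipolar(), random_bipolar()

# Bundle two "documents" of the same class into one prototype
prototype = [a + b for a, b in zip(doc_a, doc_b)]

sim_member = cosine(prototype, doc_a)        # member of the bundle
sim_outsider = cosine(prototype, unrelated)  # never bundled
print(f"member: {sim_member:.3f}, outsider: {sim_outsider:.3f}")
```

With two nearly orthogonal members the cosine to each member is close to 1/√2, and it shrinks as more documents are bundled, so prototypes built from many documents separate classes by smaller margins.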

In [None]:
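# Classify a new document by its nearest class prototype
test_doc = "basketball player wins game"
test_hv = encoder.encode(test_doc.split())

# Similarity of the test vector to each class prototype
scores = {label: float(model.similarity(test_hv, proto))
          for label, proto in prototypes.items()}

print(f"\nTest: '{test_doc}'")
for label, sim in scores.items():
    print(f"  {label:12s}: {sim:.3f}")
print(f"Predicted: {max(scores, key=scores.get)}")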
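When the scores for both classes come out close together, it helps to check how many n-grams the test document actually shares with each class. The following plain-Python diagnostic does that for the training data above; `ngram_set` and `shared_ngrams` are hypothetical helpers, independent of holovec.

```python
def ngram_set(tokens, n):
    """Distinct n-grams (as tuples) in a token list, using stride 1."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


training = [
    ("team wins championship game tonight", "sports"),
    ("player scores winning goal match", "sports"),
    ("new smartphone launched today features", "technology"),
    ("software update fixes security bug", "technology"),
]
test_tokens = "basketball player wins game".split()


def shared_ngrams(n):
    """Per class, count distinct test n-grams that occur in that class's training docs."""
    class_grams = {}
    for text, label in training:
        class_grams.setdefault(label, set()).update(ngram_set(text.split(), n))
    test_grams = ngram_set(test_tokens, n)
    return {label: len(test_grams & grams) for label, grams in class_grams.items()}


for n in (1, 2, 3):
    print(n, shared_ngrams(n))
# 1 {'sports': 3, 'technology': 0}
# 2 {'sports': 0, 'technology': 0}
# 3 {'sports': 0, 'technology': 0}
```

The test document shares three words with the sports class but no complete bigram or trigram with either class. If the encoder treats each trigram as a distinct pattern, which is an assumption about its internals, the `n=3` scores for both classes may sit near zero, and a smaller `n` or more training text would likely separate the classes better.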

## Summary

✓ **N-grams**: Capture local sequence patterns  
✓ **Fast encoding**: Efficient for text processing  
✓ **Similarity-based**: Shared n-grams → higher similarity  
✓ **Classification**: Build prototypes per class

### Applications
- Document classification
- Spam detection
- Language identification
- Text similarity search

### Next Steps
- `20_app_text_classification.py` - Full classification pipeline
- `15_encoders_trajectory.py` - Temporal sequences