# **Detailed Explanation of FastText and How to Use It in PyTorch & Gensim**
---

## **1. What is FastText?**
FastText is a **word embedding technique** developed by **Facebook's AI Research (FAIR)** that improves upon Word2Vec by considering **subword information**. It represents words **as a combination of character n-grams** instead of treating them as atomic entities.

Unlike **Word2Vec**, which learns embeddings for entire words, **FastText learns embeddings for word fragments**, making it better at handling:
- **Rare words** (since it uses subwords to create word vectors).
- **Out-of-vocabulary (OOV) words** (it can generate embeddings for unseen words using subword components).
- **Morphologically rich languages** (where words have many variations due to prefixes/suffixes).

---

## **2. How FastText Works**
FastText builds word representations using **subword n-grams** (default is **3 to 6 characters** long).  

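For example, for the word **"fasttext"**, the 3-gram representation is:
```
<fa, fas, ast, stt, tte, tex, ext, xt>, <fasttext>
```
- `<` and `>` are added to mark word boundaries, so the prefix `<fa` is distinct from an `fa` in the middle of a word.
- The whole word, wrapped in its markers (`<fasttext>`), is kept as one additional unit alongside the n-grams.
- The model learns embeddings for these **subword fragments**.
- The final word embedding is obtained by combining its **subword embeddings**: the original paper sums them, while the reference implementation averages them.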
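
As a minimal sketch of the extraction step, the helper below collects every character n-gram between a minimum and maximum length and then appends the whole bracketed word. The name `char_ngrams` and its parameters are illustrative assumptions, not part of the fastText library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word` for min_n <= n <= max_n,
    followed by the whole word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    # Keep the full bracketed word as one extra unit (unless already present)
    if wrapped not in grams:
        grams.append(wrapped)
    return grams

trigram_units = char_ngrams("fasttext", min_n=3, max_n=3)
all_units = char_ngrams("fasttext")  # lengths 3 to 6, the default range
print(trigram_units)
print(len(all_units))
```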

---

## **3. FastText vs Word2Vec**
| Feature           | Word2Vec  | FastText |
|------------------|----------|----------|
| Word Representation | One vector per whole word | Whole word plus character n-grams |
| Handles Rare Words | ❌ Poorly (few training examples) | ✅ Better (shares n-grams with frequent words) |
| Handles OOV Words | ❌ No | ✅ Yes |
| Morphologically Rich Languages | ❌ Poor | ✅ Good |
| Computational Cost | 🟢 Lower | 🔴 Slightly Higher |
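
The OOV row can be illustrated with toy lookup tables. The vectors below are made-up numbers, and `subword_vector` is a hypothetical helper that averages whichever trigrams it recognizes; it only shows why a subword table can still answer for an unseen word.

```python
def trigrams(word):
    wrapped = f"<{word}>"
    return [wrapped[i:i + 3] for i in range(len(wrapped) - 2)]

# Toy tables with made-up 2-dimensional vectors
word_table = {"cat": [0.5, 0.1]}
ngram_table = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0], "at>": [1.0, 1.0]}

def subword_vector(word):
    """Average the vectors of the known trigrams of `word` (None if none are known)."""
    known = [ngram_table[g] for g in trigrams(word) if g in ngram_table]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

whole_word_result = word_table.get("cats")  # no entry for the unseen word
subword_result = subword_vector("cats")     # shares "<ca" and "cat" with "cat"
print(whole_word_result, subword_result)
```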

---

## **4. Implementing FastText in PyTorch (Step-by-Step)**
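We will build a skip-gram model in **PyTorch** as a simplified starting point. Note that the model in Step 5 looks up one vector per whole word, so on its own it behaves like Word2Vec; a full FastText model would instead build each word's vector from the **subword embeddings** of the n-grams generated in Step 3.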

### **Step 1: Install Dependencies**
```bash
pip install torch nltk
```

---

### **Step 2: Preprocess Text Data**
```python
import torch
import torch.nn as nn
import torch.optim as optim
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize

# Sample text data
text = "The cat sat on the mat. The dog barked at the cat."

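# Tokenize the text
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead
tokens = word_tokenize(text.lower())

# Create vocabulary (sorted so indices are the same on every run)
vocab = sorted(set(tokens))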
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for word, i in word2idx.items()}

# Convert words to indices
data = [word2idx[word] for word in tokens]

print("Vocabulary:", word2idx)
print("Encoded Data:", data)
```
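
Real corpora usually drop very rare tokens before assigning indices. The sketch below builds a frequency-based vocabulary with `Counter`; the `build_vocab` helper, its `min_count` threshold, and the `<unk>` placeholder are assumptions for illustration, not something the pipeline above defines.

```python
from collections import Counter

sample_tokens = "the cat sat on the mat . the dog barked at the cat .".split()

def build_vocab(tokens, min_count=1, unk="<unk>"):
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so the order is reproducible
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate([unk] + kept)}

word2idx_freq = build_vocab(sample_tokens, min_count=2)
# Tokens below the threshold all map to the <unk> index
encoded = [word2idx_freq.get(t, word2idx_freq["<unk>"]) for t in sample_tokens]
print(word2idx_freq)
print(encoded)
```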

---

### **Step 3: Generate Subword N-grams**
```python
def generate_ngrams(word, n=3):
    word = f"<{word}>"  # Add boundary markers
    ngrams = [word[i:i+n] for i in range(len(word) - n + 1)]
    return ngrams

# Example
print("Subwords for 'cat':", generate_ngrams("cat"))
```

---

### **Step 4: Generate Training Data (Skip-Gram Model)**
```python
def generate_skipgram_data(data, window_size=2):
    pairs = []
    for center in range(len(data)):
        for w in range(-window_size, window_size + 1):
            context = center + w
            if 0 <= context < len(data) and context != center:
                pairs.append((data[center], data[context]))
    return pairs

window_size = 2
training_data = generate_skipgram_data(data, window_size)

print("Example training pairs (word index):", training_data)
```
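
To make the windowing easier to check by eye, here is the same idea applied to words instead of indices. `skipgram_pairs` is a hypothetical variant written for this illustration; it should produce the same kind of (center, context) pairs as the function above.

```python
def skipgram_pairs(sequence, window_size=1):
    """Pair each element with its neighbours within `window_size` positions."""
    pairs = []
    for center, item in enumerate(sequence):
        lo = max(0, center - window_size)
        hi = min(len(sequence), center + window_size + 1)
        pairs.extend((item, sequence[ctx]) for ctx in range(lo, hi) if ctx != center)
    return pairs

word_pairs = skipgram_pairs(["the", "cat", "sat"], window_size=1)
print(word_pairs)
```

With a window of 1, the middle word is paired with both neighbours and each edge word with its single neighbour.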

---

### **Step 5: Build the FastText Model in PyTorch**
```python
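class FastText(nn.Module):
    """Simplified skip-gram model: one embedding per whole word."""

    def __init__(self, vocab_size, embedding_dim, ngram_size=3):
        super().__init__()
        # One vector per vocabulary word (no subword vectors yet)
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Scores every vocabulary word as a possible context word
        self.linear = nn.Linear(embedding_dim, vocab_size)
        # Stored for reference only; forward() does not use n-grams
        self.ngram_size = ngram_size

    def forward(self, center_word):
        embedded = self.embeddings(center_word)  # Look up whole-word embeddings
        output = self.linear(embedded)  # Logits over the vocabulary for the context word
        return output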

# Model hyperparameters
embedding_dim = 10
vocab_size = len(vocab)

# Initialize model
model = FastText(vocab_size, embedding_dim)
```
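
For comparison, here is a minimal sketch of a subword-aware variant, in which a word's input vector is the mean of its trigram vectors. The names `SubwordSkipGram`, `trigrams`, and `toy_vocab` are hypothetical, and the sketch makes simplifying assumptions: a fixed n-gram length of 3, no n-gram hashing, and a full softmax output layer.

```python
import torch
import torch.nn as nn

def trigrams(word):
    wrapped = f"<{word}>"
    return [wrapped[i:i + 3] for i in range(len(wrapped) - 2)]

class SubwordSkipGram(nn.Module):
    """Skip-gram model whose input vector is the mean of a word's trigram vectors."""

    def __init__(self, vocab, embedding_dim):
        super().__init__()
        grams_per_word = [trigrams(w) for w in vocab]
        # Index 0 is reserved for padding, so real n-grams start at 1
        self.ngram2idx = {}
        for grams in grams_per_word:
            for g in grams:
                self.ngram2idx.setdefault(g, len(self.ngram2idx) + 1)
        # Pad every word's n-gram id list to the same length
        max_len = max(len(grams) for grams in grams_per_word)
        padded = [[self.ngram2idx[g] for g in grams] + [0] * (max_len - len(grams))
                  for grams in grams_per_word]
        self.register_buffer("ngram_ids", torch.tensor(padded, dtype=torch.long))
        self.ngram_embeddings = nn.Embedding(len(self.ngram2idx) + 1, embedding_dim,
                                             padding_idx=0)
        self.linear = nn.Linear(embedding_dim, len(vocab))

    def word_vectors(self, word_idx):
        ids = self.ngram_ids[word_idx]                          # (batch, max_len)
        counts = (ids != 0).sum(dim=1, keepdim=True).clamp(min=1)
        return self.ngram_embeddings(ids).sum(dim=1) / counts   # mean over real n-grams

    def forward(self, center_word):
        return self.linear(self.word_vectors(center_word))

toy_vocab = ["cat", "cats", "dog"]
subword_model = SubwordSkipGram(toy_vocab, embedding_dim=8)
logits = subword_model(torch.tensor([0, 2]))  # batch of two center words
print(logits.shape)
```

Because "cat" and "cats" share the trigrams `<ca` and `cat`, gradient updates for one word also move part of the other word's vector. The sketch accepts the same index tensors as the model above, so it could be tried in the Step 6 training loop.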

---

### **Step 6: Train the Model**
```python
# Loss function and optimizer
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data to tensors
training_data_tensors = [(torch.tensor(center, dtype=torch.long),
                          torch.tensor(context, dtype=torch.long)) for center, context in training_data]

# Training loop
epochs = 1000
for epoch in range(epochs):
    total_loss = 0
    for center, context in training_data_tensors:
        optimizer.zero_grad()
        output = model(center.unsqueeze(0))  # Forward pass
        loss = loss_function(output, context.unsqueeze(0))  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        total_loss += loss.item()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")

print("Training complete!")
```

---

### **Step 7: Extract Word Embeddings**
```python
word_vectors = model.embeddings.weight.detach()  # One row per vocabulary word

# Print embeddings for some words
for word in ["cat", "dog", "mat"]:
    print(f"Embedding for '{word}': {word_vectors[word2idx[word]]}")
```
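
Raw vectors are hard to interpret, so a common next step is to rank words by cosine similarity. The helper below is a sketch with hand-made 2-D vectors; `nearest_words` and the toy data are assumptions for illustration, and on a corpus as small as ours the learned neighbours would not be meaningful.

```python
import torch
import torch.nn.functional as F

def nearest_words(query, vectors, word2idx, top_k=2):
    """Rank the other words by cosine similarity to `query`."""
    vectors = vectors.detach()
    idx2word = {i: w for w, i in word2idx.items()}
    query_vec = vectors[word2idx[query]].unsqueeze(0)
    sims = F.cosine_similarity(query_vec, vectors, dim=1)
    sims[word2idx[query]] = float("-inf")  # exclude the query itself
    best = torch.topk(sims, k=top_k).indices.tolist()
    return [idx2word[i] for i in best]

# Hand-made 2-D vectors so the ranking is easy to verify
toy_word2idx = {"cat": 0, "dog": 1, "mat": 2}
toy_vectors = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
neighbours = nearest_words("cat", toy_vectors, toy_word2idx)
print(neighbours)
```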

---

## **5. FastText in Gensim (Easier Alternative)**
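Instead of training from scratch in PyTorch, we can use **Gensim**, which provides a ready-made FastText implementation. The example below trains a small model on our sample corpus; Gensim can also load pre-trained vectors published by Facebook via `gensim.models.fasttext.load_facebook_vectors`.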

```bash
pip install gensim
```

### **Train FastText using Gensim**
```python
from gensim.models import FastText  # Note: shadows the PyTorch FastText class if run in the same session
from nltk.tokenize import word_tokenize

# Sample corpus
corpus = ["The cat sat on the mat.", "The dog barked at the cat."]
tokenized_corpus = [word_tokenize(sent.lower()) for sent in corpus]

# Train FastText model
ft_model = FastText(sentences=tokenized_corpus, vector_size=10, window=2, min_count=1, workers=4)

# Get word vectors
print(ft_model.wv["cat"])  # Get vector for 'cat'
print(ft_model.wv.most_similar("cat"))  # Find similar words
```

---

## **6. Key Advantages of FastText**
✅ **Handles Rare and OOV Words**  
   - Composes vectors from subword n-grams, so even unseen words get an embedding.

✅ **Works Well in Morphologically Rich Languages**  
   - E.g., German, Russian, and Turkish, where words change due to inflection.

✅ **Can Improve Accuracy in Text Classification**  
   - Subword features help with typos and rare word forms in tasks like sentiment analysis and document classification.

---

## **7. Disadvantages of FastText**
🚫 **Higher Computational Cost**  
   - Since it learns and combines embeddings for many subwords per word, training is slower and the model is larger than with Word2Vec.

🚫 **More Complex to Implement from Scratch**  
   - Requires handling subwords efficiently.

---

## **8. Alternatives to FastText**
- **Word2Vec**: Works well but doesn’t handle unseen words.
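- **GloVe**: Learns word vectors from global co-occurrence statistics; widely available as pre-trained embeddings, but like Word2Vec it has no vector for unseen words.
- **BERT**: Contextualized embeddings, where a word's vector depends on the sentence it appears in.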

---

## **9. Conclusion**
- **FastText improves on Word2Vec by using subword n-grams.**
- **It is robust against rare words and OOV words.**
- **We built a simplified skip-gram model in PyTorch and trained a full FastText model with Gensim.**
