# What are N-Grams?

An **n-gram language model** is a probabilistic model that predicts the next word in a sequence from the previous N - 1 words.

**Use cases**

- Text prediction  
- Spelling correction  

---

## Understanding N-Grams

The concept of an n-gram is straightforward: it is a sequence of 'n' consecutive items.

- If **n = 1**, it's a **unigram**
- If **n = 2**, it's a **bigram**
- If **n = 3**, it's a **trigram**
- And so on...

> Larger values of 'n' capture more context, but the gains diminish as computational cost and data sparsity grow.

### Example

Consider the sentence:

> "The quick brown fox jumps over the lazy dog."

Here are some examples of n-grams:

- **Unigrams**:  
  `The`, `quick`, `brown`, `fox`, `jumps`, `over`, `the`, `lazy`, `dog`
  
- **Bigrams**:  
  `The quick`, `quick brown`, `brown fox`, `fox jumps`, `jumps over`, `over the`, `the lazy`, `lazy dog`
  
- **Trigrams**:  
  `The quick brown`, `quick brown fox`, `brown fox jumps`, `fox jumps over`, `jumps over the`, `over the lazy`, `the lazy dog`

> - **Unigrams** lack context  
> - **Bigrams** provide minimal context  
> - **Trigrams** start to form more meaningful phrases
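
Extracting n-grams amounts to sliding a window of size n over the token list. The following is a minimal sketch, assuming simple whitespace tokenization with the trailing period stripped; the helper name `extract_ngrams` is illustrative and does not come from any library.

```python
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens contains k - n + 1 windows of size n
    # (and none at all when k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox jumps over the lazy dog."
tokens = sentence.rstrip(".").split()  # assumption: naive whitespace tokenizer

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))  # 9 8 7
print(bigrams[:3])
```

Each step up in n yields one fewer n-gram, which matches the 9 unigrams, 8 bigrams, and 7 trigrams listed for the example sentence.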

---

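## Estimating N-Gram Probabilities

For a bigram model (n = 2), the probability of a word given the previous word is estimated from counts in the training data:

$$
P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
$$

This is the maximum likelihood estimate (MLE) of $P(w_i \mid w_{i-1})$: the value that makes the training data most probable.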
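
As a worked illustration, the count-based bigram estimate divides the number of times a word pair occurs by the number of times its first word occurs. The sketch below assumes a three-sentence toy corpus invented for this example and an end marker `</s>` appended to each sentence; `bigram_prob` is a hypothetical helper name.

```python
from collections import Counter

# Toy corpus, invented for this example.
sentences = ["i love coding", "i love data", "i enjoy coding"]

context_counts: Counter = Counter()
bigram_counts: Counter = Counter()
for sentence in sentences:
    # Append an end marker so every real word has a successor.
    words = sentence.split() + ["</s>"]
    context_counts.update(words[:-1])            # Count(w_{i-1})
    bigram_counts.update(zip(words, words[1:]))  # Count(w_{i-1}, w_i)


def bigram_prob(prev: str, word: str) -> float:
    """Estimate P(word | prev) as Count(prev, word) / Count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # context never seen, so there is nothing to estimate from
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_prob("i", "love"))       # "i" occurs 3 times, "i love" twice
print(bigram_prob("love", "coding"))  # "love" occurs twice, "love coding" once
print(bigram_prob("i", "data"))       # "i data" never occurs
```

Here "i" is followed by "love" in two of its three occurrences, so the estimate is 2/3, while the unseen pair "i data" gets probability 0.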

---


## Challenges with N-Grams

Despite their usefulness, n-grams face several challenges:

- **Data Sparsity**:  
  As 'n' increases, many n-grams may appear rarely or not at all, reducing the model's ability to generalize.

- **Computational Complexity**:  
  With a vocabulary of V words there are V^n possible n-grams, so the number of possible combinations grows exponentially with 'n'.

- **Context Limitation**:  
  N-grams use a fixed-size context window and cannot capture long-range dependencies in language.
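
Data sparsity and the growth in possible combinations can be seen even on a tiny corpus. The sketch below, assuming a four-sentence toy corpus invented for this example, compares the number of distinct n-grams actually observed with the number of possible ones (vocabulary size raised to the power n).

```python
from typing import Dict, Set, Tuple

# Toy corpus, invented for this example.
sentences = [
    "deep learning is fun",
    "machine learning is fun",
    "coding is fun",
    "i love deep learning",
]
tokenized = [s.split() for s in sentences]
vocab = {word for sent in tokenized for word in sent}


def observed_ngrams(n: int) -> Set[Tuple[str, ...]]:
    """Collect the distinct n-grams that occur in the toy corpus."""
    seen = set()
    for sent in tokenized:
        for i in range(len(sent) - n + 1):
            seen.add(tuple(sent[i:i + n]))
    return seen


coverage: Dict[int, float] = {}
for n in (1, 2, 3):
    possible = len(vocab) ** n  # every combination of vocabulary words
    seen = len(observed_ngrams(n))
    coverage[n] = seen / possible
    print(f"n={n}: {seen} of {possible} possible n-grams observed ({coverage[n]:.1%})")
```

On this toy corpus all 8 unigrams are observed, but only 7 of 64 possible bigrams and 6 of 512 possible trigrams. A count-based model assigns probability 0 to every combination it has not seen.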

---

## Text classification

https://www.kaggle.com/code/leekahwin/text-classification-using-n-gram-0-8-f1

## Text generation

https://www.kaggle.com/code/dimitriirfan/text-generation-using-n-gram-model

## Spell correction

https://www.kaggle.com/code/dhruvdeshmukh/spelling-corrector-using-n-gram-language-model

In [None]:
corpus = [
    "i love deep learning",
    "i love machine learning",
    "i love coding",
    "deep learning is fun",
    "deep learning is amazing",
    "machine learning is powerful",
    "i love artificial intelligence",
    "i love deep neural networks",
    "i love programming",
    "coding is fun",
    "machine learning is interesting",
    "deep learning is the future",
    "machine learning and deep learning are related",
    "i enjoy deep learning",
    "i enjoy machine learning",
    "i enjoy coding",
    "deep learning has potential",
    "machine learning can solve problems",
    "coding makes me happy",
    "i am passionate about deep learning",
    "i am passionate about machine learning",
    "coding helps me think logically",
    "deep learning and machine learning are exciting",
    "artificial intelligence is evolving",
    "i love working with data",
    "deep learning powers modern applications",
    "machine learning can change the world",
    "i want to master deep learning",
    "i want to master machine learning",
    "coding is the future",
    "machine learning is changing industries",
    "deep learning has revolutionized AI",
    "deep learning models are complex",
    "i study machine learning",
    "i study deep learning",
    "deep learning requires a lot of data",
    "machine learning requires good data",
    "i love to learn deep learning",
    "i love to learn machine learning",
    "deep learning is used in many fields",
    "machine learning is widely applied",
    "coding challenges are fun",
    "i love building models with deep learning",
    "i love building models with machine learning",
    "i love creating algorithms",
    "deep learning is advancing rapidly",
    "machine learning is evolving quickly",
    "coding is a useful skill",
    "deep learning uses neural networks",
    "machine learning uses algorithms",
    "deep learning and coding go hand in hand",
    "i enjoy working on deep learning projects",
    "i enjoy working on machine learning projects",
    "deep learning helps with pattern recognition",
    "machine learning is about data analysis",
    "i am learning deep learning",
    "i am learning machine learning",
    "deep learning can predict outcomes",
    "machine learning can detect patterns",
    "coding helps to automate tasks",
    "deep learning is useful in many industries",
    "machine learning is becoming more popular",
    "deep learning needs a lot of computational power",
    "machine learning needs good algorithms"
]

In [None]:
from typing import List, Tuple
from collections import defaultdict

def tokenize(corpus: List[str]) -> List[List[str]]:
    return [["<s>"] + sentence.lower().split() + ["</s>"] for sentence in corpus]

class NGramMLE:
    def __init__(self, corpus: List[str], n: int):
        self.n = n
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self._train(tokenize(corpus))

    def _train(self, tokenized_corpus: List[List[str]]):
        for sentence in tokenized_corpus:
            # tokenize() already adds one "<s>", so pad up to n - 1 start tokens in total
            sentence = ["<s>"] * max(self.n - 2, 0) + sentence
            for i in range(len(sentence) - self.n + 1):
                ngram = tuple(sentence[i:i + self.n])
                context = ngram[:-1]
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1

    def prob(self, context: Tuple[str, ...], word: str) -> float:
        if len(context) != self.n - 1:
            raise ValueError(f"expected context of length {self.n - 1}")
        ngram = context + (word,)
        context_count = self.context_counts[context]
        return self.ngram_counts[ngram] / context_count if context_count > 0 else 0.0

    def generate_prob_table(self):
        for ngram, count in self.ngram_counts.items():
            context = ngram[:-1]
            word = ngram[-1]
            print(f"P({word} | {context}) = {self.prob(context, word):.4f}")

In [None]:
model = NGramMLE(corpus, n=3)

print("P(learning | love, deep):", model.prob(("love", "deep"), "learning"))
print("P(coding | i, love):", model.prob(("i", "love"), "coding"))

P(learning | love, deep): 0.5
P(coding | i, love): 0.08333333333333333


In [None]:
model.prob(("to", "learn"), "machine")

0.5
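
Conditional probabilities like these are what drive next-word prediction: given a context, choose the continuation with the highest count. The sketch below assumes a counts dictionary keyed by n-gram tuples, the same shape as `ngram_counts` in the class above; `predict_next` and the hand-written `toy_counts` are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional, Tuple

NGramCounts = Dict[Tuple[str, ...], int]


def predict_next(ngram_counts: NGramCounts, context: Tuple[str, ...]) -> Optional[str]:
    """Return the word seen most often after `context`, or None if unseen."""
    candidates = {
        ngram[-1]: count
        for ngram, count in ngram_counts.items()
        if ngram[:-1] == context
    }
    if not candidates:
        return None  # unseen context: the model has no basis for a guess
    # Dividing every count by the same context count would not change the winner.
    return max(candidates, key=candidates.get)


# Hand-written trigram counts, invented for this example.
toy_counts: NGramCounts = {
    ("i", "love", "coding"): 1,
    ("i", "love", "to"): 3,
    ("i", "love", "deep"): 2,
    ("love", "to", "learn"): 2,
}

print(predict_next(toy_counts, ("i", "love")))   # to
print(predict_next(toy_counts, ("love", "to")))  # learn
print(predict_next(toy_counts, ("we", "love")))  # None
```

With the trained model, the same helper could be called as `predict_next(model.ngram_counts, ("i", "love"))`. Scanning every n-gram per query is fine for a corpus this small; a larger model would more likely index continuations by context.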

In [None]:
# https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html