<a href="https://colab.research.google.com/github/RajarajachozhanVK/RajarajachozhanVK/blob/main/N_Gram_Smoothing_Word_Document.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

N-Grams_Smoothing-Word Documents

Aim: To perform N-Grams Smoothing in Word Documents
1. N-Grams Smoothing
N-gram smoothing is a technique used in natural language processing and language modeling to
address the problem of unseen n-grams. An n-gram is a contiguous sequence of n items (words,
characters, or symbols) within a given text. Without smoothing, any n-gram that never occurs in
the training data is assigned zero probability, so every real-world sentence containing such an
n-gram is also assigned zero probability by the model.
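The zero-probability problem can be seen in a minimal sketch. It assumes a made-up six-word training text and a hypothetical helper `mle_bigram_prob` that computes the unsmoothed estimate C(prev, word) / C(prev); neither comes from the original notebook.

```python
from collections import Counter

# Toy training text (invented for illustration only)
train_tokens = "the cat sat on the mat".split()

# Count each bigram and each word that starts a bigram
bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
prefix_counts = Counter(train_tokens[:-1])

def mle_bigram_prob(prev, word):
    """Unsmoothed estimate: C(prev, word) / C(prev)."""
    if prefix_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / prefix_counts[prev]

seen = mle_bigram_prob("the", "cat")    # bigram occurs in the training text
unseen = mle_bigram_prob("the", "dog")  # bigram never occurs
print("seen:", seen, "unseen:", unseen)
```

The unseen bigram gets probability zero, so any sentence containing it would also get probability zero once the bigram probabilities are multiplied together.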
There are many ways to do smoothing, and some of them are:
• Laplace (add one) smoothing
• Add-k smoothing
• Stupid backoff
• Kneser-Ney smoothing
2. Laplace (Add 1) Smoothing
The simplest way to do smoothing is to add one to all the n-gram counts before we normalize
them into probabilities. All the counts that used to be zero will now have a count of 1, the
counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing.
The formula for add-one smoothing is as follows:
$$P_{\text{Laplace}}(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, \ldots, w_{n-1}, w_n) + 1}{C(w_1, \ldots, w_{n-1}) + V}$$
For bigrams, where the prefix is the single word $w_{n-1}$, this becomes:
$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$
• $C(w_1, \ldots, w_{n-1}, w_n)$ is the count of the n-gram in the training data.
• $C(w_1, \ldots, w_{n-1})$ is the count of the (n-1)-gram prefix in the training data.
• $V$ is the vocabulary size (the number of unique words in the training data).
This formula ensures that the probability distribution is smoothed, allowing for some probability
mass to be distributed to unseen n-grams.
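As a worked instance of the bigram formula, the sketch below uses an invented toy text and a hypothetical helper named `laplace_bigram_prob`. It computes the smoothed probability for one seen and one unseen bigram, then checks that the probabilities of all words following the same prefix still sum to 1.

```python
from collections import Counter

# Toy training text (invented for illustration only)
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
V = len(vocab)  # 5 unique words

bigram_counts = Counter(zip(tokens, tokens[1:]))
prefix_counts = Counter(tokens[:-1])

def laplace_bigram_prob(prev, word):
    """(C(prev, word) + 1) / (C(prev) + V)"""
    return (bigram_counts[(prev, word)] + 1) / (prefix_counts[prev] + V)

p_seen = laplace_bigram_prob("the", "cat")    # (1 + 1) / (2 + 5)
p_unseen = laplace_bigram_prob("the", "sat")  # (0 + 1) / (2 + 5)

# Summing over every word in the vocabulary gives a proper distribution
total = sum(laplace_bigram_prob("the", w) for w in vocab)
print(p_seen, p_unseen, total)
```

Adding V to the denominator is what keeps the distribution normalized: one extra count is added for each of the V possible next words.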
Performing N-grams smoothing in word documents typically involves applying a statistical language
modeling technique to adjust the probabilities of n-grams (sequences of n words) to account for
unseen or infrequently occurring combinations of words. Here are the general steps to perform
N-grams smoothing in word documents.

Steps to Apply Laplace (Add 1) Smoothing Algorithm:
1. Tokenization: Break the text into words or tokens. You can use various tokenization libraries
or functions available in programming languages such as Python.
2. N-grams Generation: Generate n-grams (sequences of n words) from the tokenized text.
Common choices are unigrams (1-grams), bigrams (2-grams), trigrams (3-grams), etc.
3. Counting N-grams: Count the occurrences of each n-gram in the document. This involves
creating a frequency distribution of n-grams.
4. Smoothing Technique: Apply the chosen smoothing technique to handle unseen n-grams.
Common smoothing techniques include Laplace (add-one) smoothing, Lidstone smoothing,
and Good-Turing smoothing.
5. Calculate Probabilities: Calculate the probabilities of each n-gram using the chosen smoothing
technique. This step involves adjusting the counts to handle unseen n-grams and prevent zero
probabilities.
6. Apply Smoothing: Apply the calculated probabilities to your language model. This adjusted
model can now provide more robust estimates of word sequences.
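Steps 1 to 3 can be sketched with the standard library alone. The regex tokenizer and the helper names `tokenize` and `make_ngrams` are assumptions made for this illustration. The tokenizer drops punctuation and ignores sentence boundaries, which a real pipeline would probably handle more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase, then keep runs of letters, digits, apostrophes and hyphens
    return re.findall(r"[a-z0-9'-]+", text.lower())

def make_ngrams(tokens, n):
    # Step 2: zip n shifted copies of the token list to form sliding windows
    return list(zip(*(tokens[i:] for i in range(n))))

# Invented sample text for illustration
text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)

# Step 3: build frequency distributions for bigrams and trigrams
bigram_counts = Counter(make_ngrams(tokens, 2))
trigram_counts = Counter(make_ngrams(tokens, 3))

print(tokens)
print(bigram_counts.most_common(3))
```

The resulting `Counter` objects are the frequency distributions that the smoothing step then adjusts.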
4(A) Laplace (Add 1) Smoothing Implementation
Laplace smoothing adds a count of 1 to each event’s frequency to ensure that no probability is
zero. This is particularly useful when dealing with categorical data where some categories might
not appear in the training data.

In [None]:
def laplace_smoothing(counts, vocab_size):
    # Calculate the total count of all events
    total_count = sum(counts)
    # Apply the Laplace smoothing formula to each count
    smoothed_probs = [(count + 1) / (total_count + vocab_size) for count in counts]
    return smoothed_probs
# Example usage:
# Define the counts of events (e.g., word frequencies in a corpus)
counts = [3, 2, 0, 1]  # counts for "cat", "dog", "fish", "bird"
vocab_size = 4  # size of the vocabulary (number of unique words)
# Calculate the smoothed probabilities
smoothed_probs = laplace_smoothing(counts, vocab_size)
# Print the smoothed probabilities
print("Smoothed probabilities:", smoothed_probs)

Smoothed probabilities: [0.4, 0.3, 0.1, 0.2]


In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
from collections import Counter
from nltk import ngrams, word_tokenize



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
def tokenize_text(text):
    # Lowercase the text and split it into word and punctuation tokens
    return word_tokenize(text.lower())


def generate_ngrams(tokens, n):
    # Slide a window of size n over the token list
    return list(ngrams(tokens, n))


def add_one_smoothing(ngram_counts, vocabulary_size):
    smoothed_probs = {}
    # Approximate C(prefix) by summing the counts of all n-grams that start with it
    prefix_counts = Counter()
    for ngram, count in ngram_counts.items():
        prefix_counts[ngram[:-1]] += count
    # P(w_n | prefix) = (C(prefix, w_n) + 1) / (C(prefix) + V)
    for ngram, count in ngram_counts.items():
        prefix_count = prefix_counts[ngram[:-1]]  # a Counter returns 0 for missing keys
        smoothed_probs[ngram] = (count + 1) / (prefix_count + vocabulary_size)
    return smoothed_probs


def main():
    # Example text
    document = "This is an example document. It contains words for n-gram smoothing."
    # Tokenization
    tokens = tokenize_text(document)
    # N-gram generation (using bigrams as an example)
    bigrams = generate_ngrams(tokens, 2)
    # Counting n-gram occurrences
    ngram_counts = Counter(bigrams)
    # Vocabulary size: number of distinct tokens
    vocabulary_size = len(set(tokens))
    # Add-one smoothing (only n-grams seen in the document get an entry)
    smoothed_probs = add_one_smoothing(ngram_counts, vocabulary_size)
    # Print the original and smoothed probabilities.
    # Note: count / len(tokens) is the bigram's frequency relative to the number
    # of tokens, not the conditional estimate count / C(prefix).
    for ngram, count in ngram_counts.items():
        print(f"Original Probability of {ngram}: {count / len(tokens)}")
        print(f"Smoothed Probability of {ngram}: {smoothed_probs[ngram]}")
        print()


if __name__ == "__main__":
    main()

Original Probability of ('this', 'is'): 0.07692307692307693
Smoothed Probability of ('this', 'is'): 0.15384615384615385

Original Probability of ('is', 'an'): 0.07692307692307693
Smoothed Probability of ('is', 'an'): 0.15384615384615385

Original Probability of ('an', 'example'): 0.07692307692307693
Smoothed Probability of ('an', 'example'): 0.15384615384615385

Original Probability of ('example', 'document'): 0.07692307692307693
Smoothed Probability of ('example', 'document'): 0.15384615384615385

Original Probability of ('document', '.'): 0.07692307692307693
Smoothed Probability of ('document', '.'): 0.15384615384615385

Original Probability of ('.', 'it'): 0.07692307692307693
Smoothed Probability of ('.', 'it'): 0.15384615384615385

Original Probability of ('it', 'contains'): 0.07692307692307693
Smoothed Probability of ('it', 'contains'): 0.15384615384615385

Original Probability of ('contains', 'words'): 0.07692307692307693
Smoothed Probability of ('contains', 'words'): 0.153846153
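The function above only returns probabilities for bigrams that occur in the document, whereas the point of smoothing is to score bigrams that do not. The sketch below computes the same add-one estimate on demand for any word pair and uses it to score a new word sequence in log space. The token list is written out by hand so the block runs without NLTK, and the helper names `laplace_prob` and `sentence_log_prob` are hypothetical. Words outside the vocabulary are not treated specially here.

```python
import math
from collections import Counter

# Tokens of the example document, written out by hand for this sketch
tokens = ["this", "is", "an", "example", "document", ".", "it", "contains",
          "words", "for", "n-gram", "smoothing", "."]
V = len(set(tokens))  # 12 distinct tokens

bigram_counts = Counter(zip(tokens, tokens[1:]))
prefix_counts = Counter(tokens[:-1])

def laplace_prob(prev, word):
    # Counter lookups return 0 for unseen keys, so unseen bigrams need no special case
    return (bigram_counts[(prev, word)] + 1) / (prefix_counts[prev] + V)

def sentence_log_prob(words):
    # Sum log-probabilities of consecutive pairs to avoid floating-point underflow
    return sum(math.log(laplace_prob(p, w)) for p, w in zip(words, words[1:]))

p_seen = laplace_prob("this", "is")          # (1 + 1) / (1 + 12)
p_unseen = laplace_prob("this", "document")  # (0 + 1) / (1 + 12)
score = sentence_log_prob(["this", "document", "contains", "words"])
print(p_seen, p_unseen, score)
```

Because every pair gets a non-zero probability, the log score stays finite even though two of the three bigrams in the scored sequence never appear in the document.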

5. Add-k Smoothing
Add-k smoothing, also known as Lidstone smoothing, generalizes Laplace smoothing, which is the
special case k = 1. It is used in probability and statistics to handle zero probabilities in
categorical data, especially in natural language processing and Bayesian models. It ensures that
every possible outcome has a non-zero probability, which is particularly useful in applications
like text classification, language modeling, and spam detection.
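Here’s a step-by-step explanation of how add-k smoothing works and how you can apply it:
Basic Idea
Add-k smoothing modifies the probability estimates by adding a small constant k to each count.
This prevents any probability from being zero and distributes some probability mass to unseen
events.
Formula
Given the count $C(x)$ of event $x$, the total number of observations $N$, the vocabulary size
$V$, and the smoothing parameter $k > 0$:
$$P_{\text{add-}k}(x) = \frac{C(x) + k}{N + kV}$$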

Steps to Apply Add k Smoothing Algorithm:
1. Count the Events: Calculate the frequency of each event in your data.
2. Apply the Smoothing Formula: Adjust each count by adding k, then normalize by the total
count plus k times the vocabulary size.
Example Consider a simple example with a vocabulary of size 4 (words: “cat”, “dog”, “fish”,
“bird”) and observed counts:

• “cat”: 3

• “dog”: 2

• “fish”: 0

• “bird”: 1

The total number of observations N is 3 + 2 + 0 + 1 = 6.
With add-1 smoothing (i.e., k = 1):

P(cat) = (3+1)/(6+4*1) = 4/10 = 0.4

P(dog) = (2+1)/(6+4*1) = 3/10 = 0.3

P(fish) = (0+1)/(6+4*1) = 1/10 = 0.1

P(bird) = (1+1)/(6+4*1) = 2/10 = 0.2

In [None]:
def add_k_smoothing(counts, k, vocab_size):
    """
    Apply add-k smoothing to a list of counts.

    Args:
    counts (list of int): The observed counts of each event.
    k (float): The smoothing parameter.
    vocab_size (int): The number of unique events (vocabulary size).

    Returns:
    list of float: The smoothed probabilities.
    """
    # Calculate the total count of all events
    total_count = sum(counts)
    # Apply the add-k smoothing formula to each count
    smoothed_probs = [(count + k) / (total_count + k * vocab_size) for count in counts]
    return smoothed_probs
# Example usage:
# Define the counts of events (e.g., word frequencies in a corpus)
counts = [3, 2, 0, 1]  # counts for "cat", "dog", "fish", "bird"
k = 0.5  # smoothing parameter (k = 1 would give Laplace smoothing)
vocab_size = 4  # size of the vocabulary (number of unique words)
# Calculate the smoothed probabilities
smoothed_probs = add_k_smoothing(counts, k, vocab_size)
# Print the smoothed probabilities
print("Smoothed probabilities:", smoothed_probs)

Smoothed probabilities: [0.4375, 0.3125, 0.0625, 0.1875]
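To see how the choice of k shifts probability mass, the sketch below reuses the same four counts with several values of k. The helper `add_k_probs` is a hypothetical stand-in that takes the vocabulary size to be the length of the count list, and the particular k values are arbitrary.

```python
def add_k_probs(counts, k):
    # Vocabulary size is assumed to equal the number of listed events
    total = sum(counts)
    vocab_size = len(counts)
    return [(c + k) / (total + k * vocab_size) for c in counts]

counts = [3, 2, 0, 1]  # counts for "cat", "dog", "fish", "bird"

for k in (0.01, 0.5, 1, 10):
    probs = add_k_probs(counts, k)
    # Each row still sums to 1; larger k pulls the row toward uniform (0.25 each)
    print(f"k={k}: {[round(p, 4) for p in probs]} sum={sum(probs):.4f}")
```

A small k stays close to the raw relative frequencies, while a large k flattens the distribution, so k is usually treated as a hyperparameter and tuned on held-out data.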
