## Telugu Bigram Language Modeling with Smoothing Techniques
A modular implementation of a bigram model using `tokenized_telugu.txt`.

This title reflects:
- The core techniques (Add-One, Add-K, and Token-Type smoothing)
- The language and dataset (Telugu corpus)
- The scope (bigram model)

## Import Required Libraries
Begin by importing the `collections` module, whose `Counter` class is used below to count bigrams and unigrams.

In [None]:
import collections

## Load the Tokenized Corpus
This function reads the tokenized Telugu corpus line by line, skips blank lines, splits each remaining line on whitespace, and returns a list of token lists (one list per sentence).

In [None]:
# Load tokenized corpus
def load_corpus(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    corpus = [line.strip().split() for line in lines if line.strip()]
    return corpus
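
As a quick check of the loader, the sketch below writes a small sample file to a temporary directory and reads it back. The two sentences and the file name `sample_telugu.txt` are made up for illustration and are not taken from `tokenized_telugu.txt`; `load_corpus` is restated in compact form so the snippet runs on its own.

```python
import os
import tempfile

def load_corpus(filepath):
    # Same logic as the cell above, repeated so this snippet is self-contained
    with open(filepath, 'r', encoding='utf-8') as f:
        return [line.strip().split() for line in f if line.strip()]

# Hypothetical sample: two sentences separated by a blank line
sample_text = "నేను బడికి వెళ్తాను\n\nఆమె పుస్తకం చదువుతుంది\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_telugu.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(sample_text)
    sample_corpus = load_corpus(sample_path)

print(len(sample_corpus))   # number of non-blank lines
print(sample_corpus[0])     # tokens of the first sentence
```

The blank line is dropped, so the result should contain one token list per non-empty line.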

## Build Bigram and Unigram Counts
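We prepend `<s>` and append `</s>` to each sentence to capture sentence boundaries. Then we count bigrams and unigrams using `collections.Counter`. Each unigram count is incremented once per bigram that the token starts, so it is the count of that token as a left context (the denominator of the bigram probability). Because `</s>` never starts a bigram, it has no unigram entry.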

In [None]:
# Build bigram and unigram counts
def build_ngram_counts(corpus):
    bigram_counts = collections.Counter()
    unigram_counts = collections.Counter()
    for sentence in corpus:
        tokens = ['<s>'] + sentence + ['</s>']
        for i in range(len(tokens) - 1):
            bigram = (tokens[i], tokens[i+1])
            bigram_counts[bigram] += 1
            unigram_counts[tokens[i]] += 1
    return bigram_counts, unigram_counts
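
To make the counting concrete, here is a minimal sketch on a toy corpus. The tokens `a`, `b`, and `c` are placeholders rather than real Telugu tokens, and `build_ngram_counts` is repeated so the snippet runs on its own.

```python
import collections

def build_ngram_counts(corpus):
    # Same logic as the cell above, repeated so this snippet is self-contained
    bigram_counts = collections.Counter()
    unigram_counts = collections.Counter()
    for sentence in corpus:
        tokens = ['<s>'] + sentence + ['</s>']
        for i in range(len(tokens) - 1):
            bigram_counts[(tokens[i], tokens[i + 1])] += 1
            unigram_counts[tokens[i]] += 1
    return bigram_counts, unigram_counts

# Placeholder corpus: three short "sentences"
toy_corpus = [['a', 'b'], ['a', 'c'], ['a', 'b']]
toy_bigrams, toy_unigrams = build_ngram_counts(toy_corpus)

print(toy_bigrams[('a', 'b')])    # times 'b' follows 'a'
print(toy_unigrams['a'])          # times 'a' appears as a context
print(toy_unigrams['</s>'])       # '</s>' never starts a bigram

# The context count equals the sum of the bigram counts that share that context
following_a = sum(c for (w1, _), c in toy_bigrams.items() if w1 == 'a')
print(following_a == toy_unigrams['a'])
```

The last check illustrates why the unigram counter can serve directly as the denominator in the smoothing formulas.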

## Define Smoothing Functions
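Three techniques are implemented:
- Add-One: Laplace smoothing, which adds 1 to every bigram count
- Add-K: Generalized additive smoothing, which adds a fractional pseudo-count k instead of 1
- Token-Type: Heuristic weight-based smoothing, which adds a per-token weight to the count and divides by the total bigram count plus the total weight; the denominator is global rather than specific to the prefix, so the values are scores and not a normalized conditional distribution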

In [None]:
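# Smoothing functions
def add_one_smoothing(count, prefix_count, vocab_size):
    # Laplace smoothing: add 1 to every bigram count
    return (count + 1) / (prefix_count + vocab_size)

def add_k_smoothing(count, prefix_count, vocab_size, k):
    # Additive smoothing with pseudo-count k (k = 1 reduces to Add-One)
    return (count + k) / (prefix_count + k * vocab_size)

def token_type_smoothing(count, token, bigram_counts, token_type_weights):
    # Heuristic: the pseudo-count is a per-token weight (default 1), and the
    # denominator is the total bigram count plus the total weight
    weight = token_type_weights.get(token, 1)
    total = sum(bigram_counts.values()) + sum(token_type_weights.values())
    return (count + weight) / total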
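
The arithmetic behind the two additive formulas can be checked on a toy context. The sketch below assumes a hypothetical context `a` that occurs 3 times, followed by `b` twice and `c` once, and it assumes the vocabulary size counts every token that can follow a context, including `</s>`. Under those assumptions the smoothed probabilities for one context should sum to approximately 1.

```python
def add_one_smoothing(count, prefix_count, vocab_size):
    return (count + 1) / (prefix_count + vocab_size)

def add_k_smoothing(count, prefix_count, vocab_size, k):
    return (count + k) / (prefix_count + k * vocab_size)

# Hypothetical counts of each possible next token after the context 'a'
next_token_counts = {'a': 0, 'b': 2, 'c': 1, '</s>': 0}
prefix_count = 3                      # total occurrences of 'a' as a context
vocab_size = len(next_token_counts)   # 4 possible next tokens
k = 0.3

add_one = {w: add_one_smoothing(c, prefix_count, vocab_size)
           for w, c in next_token_counts.items()}
add_k = {w: add_k_smoothing(c, prefix_count, vocab_size, k)
         for w, c in next_token_counts.items()}

print(add_one['b'])            # (2 + 1) / (3 + 4)
print(add_k['b'])              # (2 + 0.3) / (3 + 0.3 * 4)
print(add_one['a'])            # unseen bigram still gets (0 + 1) / (3 + 4)
print(sum(add_one.values()))   # close to 1.0
print(sum(add_k.values()))     # close to 1.0
```

A smaller k moves less probability mass from seen to unseen bigrams, which is the usual motivation for preferring Add-K over Add-One on sparse data.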

## Main Workflow
The main function loads the corpus, builds the counts, computes the vocabulary size, applies all three smoothing techniques to every observed bigram, and writes the results to `smoothing_output.txt`.

In [None]:
# Main workflow
def main():
    filepath = 'tokenized_telugu.txt'
    corpus = load_corpus(filepath)
    bigram_counts, unigram_counts = build_ngram_counts(corpus)
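    # Vocabulary of possible next tokens: every word type plus the end marker
    # '</s>' ('<s>' is excluded because it is never predicted). Counting
    # '</s>' keeps the additive estimates for each prefix summing to 1.
    vocab = {token for sentence in corpus for token in sentence} | {'</s>'}
    vocab_size = len(vocab)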
    k = 0.3

    # Heuristic token type weights (can be customized)
    token_type_weights = {token: 1.5 for token in vocab}

    # Write output with formulas and results
    with open('smoothing_output.txt', 'w', encoding='utf-8') as out:
        out.write("### Smoothing Techniques Applied to Telugu Bigram Corpus\n")
        out.write("This file contains smoothed probabilities for each bigram using three techniques:\n\n")
        out.write("1. Add-One Smoothing:\n")
        out.write("   P(w₂ | w₁) = (count(w₁, w₂) + 1) / (count(w₁) + V)\n")
        out.write(f"2. Add-K Smoothing (k = {k}):\n")
        out.write("   P(w₂ | w₁) = (count(w₁, w₂) + k) / (count(w₁) + k × V)\n")
        out.write("3. Token Type Smoothing:\n")
        out.write("   P(w₂ | w₁) = (count(w₁, w₂) + weight(w₂)) / (∑ bigram_counts + ∑ token_type_weights)\n\n")
        out.write("Bigram\tCount\tPrefixCount\tAddOne\tAddK\tTokenType\n")

        for bigram in bigram_counts:
            count = bigram_counts[bigram]
            prefix = bigram[0]
            token = bigram[1]
            prefix_count = unigram_counts.get(prefix, 0)

            p_add_one = add_one_smoothing(count, prefix_count, vocab_size)
            p_add_k = add_k_smoothing(count, prefix_count, vocab_size, k)
            p_token_type = token_type_smoothing(count, token, bigram_counts, token_type_weights)

            out.write(f"{bigram}\t{count}\t{prefix_count}\t{p_add_one:.6f}\t{p_add_k:.6f}\t{p_token_type:.6f}\n")

    print("✅ Smoothing results saved to 'smoothing_output.txt'")

## Execute the Main Function
Runs the full pipeline and saves the output.

In [None]:
if __name__ == "__main__":
    main()
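
Once the pipeline has run, the output file can be read back for inspection. The sketch below is one possible way to do that, not part of the original workflow: `parse_smoothing_rows` is a hypothetical helper that skips the formula preamble, finds the tab-separated header written by `main()`, and converts each row. The sample lines and their numeric values are illustrative only.

```python
import ast

def parse_smoothing_rows(lines):
    """Yield one dict per bigram row, skipping the formula preamble."""
    in_table = False
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('Bigram\t'):
            in_table = True          # column header written by main()
            continue
        if not in_table or not line:
            continue
        bigram, count, prefix_count, p_one, p_k, p_type = line.split('\t')
        yield {
            'bigram': ast.literal_eval(bigram),  # "('w1', 'w2')" -> tuple
            'count': int(count),
            'prefix_count': int(prefix_count),
            'add_one': float(p_one),
            'add_k': float(p_k),
            'token_type': float(p_type),
        }

# Illustrative lines in the same layout as smoothing_output.txt
sample_lines = [
    "### Smoothing Techniques Applied to Telugu Bigram Corpus\n",
    "\n",
    "Bigram\tCount\tPrefixCount\tAddOne\tAddK\tTokenType\n",
    "('<s>', 'a')\t3\t3\t0.571429\t0.785714\t0.300000\n",
]

rows = list(parse_smoothing_rows(sample_lines))
print(rows[0]['bigram'], rows[0]['add_one'])

# With the real file, pass the open file object instead:
# with open('smoothing_output.txt', encoding='utf-8') as f:
#     rows = list(parse_smoothing_rows(f))
```

Sorting the parsed rows by `add_k` or `count` is then a simple way to spot-check the most and least probable observed bigrams.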