## Building N-Gram Language Models
This section describes how to build Unigram, Bigram, Trigram, and Quadrigram models using a tokenized Telugu corpus. The corpus is stored in a file named tokenized_telugu.txt, where each line represents a tokenized sentence with tokens separated by spaces.

## Load Tokenized Telugu Sentences
We begin by reading the tokenized sentences from the input file. Each line is split into tokens and stored as a list of lists.

In [None]:
from collections import Counter, defaultdict
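def load_tokenized_sentences(filepath):
    """Read a space-tokenized corpus file and return a list of token lists."""
    with open(filepath, 'r', encoding='utf-8') as f:
        # Iterate over the file line by line; blank lines are skipped so no empty sentences are kept
        tokenized_sentences = [line.strip().split() for line in f if line.strip()]
    return tokenized_sentences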
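
As a quick check of the expected input format, the sketch below writes a small sample file to a temporary directory and loads it back. The two sample sentences and the file name sample_telugu.txt are made up for illustration and are not taken from the corpus; the loader is restated so the block runs on its own.

```python
import os
import tempfile

def load_tokenized_sentences(filepath):
    # Same behavior as the loader in this section, restated so the example is self-contained
    with open(filepath, 'r', encoding='utf-8') as f:
        return [line.strip().split() for line in f if line.strip()]

# Hypothetical sample: two tokenized sentences separated by a blank line, which the loader skips
sample_text = "నేను పుస్తకం చదువుతాను\n\nఆమె పాట పాడుతుంది\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_telugu.txt")
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(sample_text)
    sample_sentences = load_tokenized_sentences(sample_path)

# Report sizes only, so the output does not depend on the console's Unicode support
print(len(sample_sentences), "sentences,", sum(len(s) for s in sample_sentences), "tokens")
```

Each inner list holds the tokens of one sentence, which is the shape every builder function in this section expects.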

## Build Unigram Model
The Unigram model counts how often each token occurs and computes each token's probability as its count divided by the total number of tokens in the corpus.

In [None]:
def build_unigram_model(sentences):
    unigram_counts = Counter(token for sentence in sentences for token in sentence)
    total_tokens = sum(unigram_counts.values())
    unigram_probs = {token: count / total_tokens for token, count in unigram_counts.items()}
    return unigram_counts, unigram_probs
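
To make the arithmetic concrete, the sketch below runs the unigram builder on a toy corpus. The corpus is hypothetical and uses ASCII placeholders in place of Telugu tokens; the function is restated so the block runs on its own.

```python
from collections import Counter

def build_unigram_model(sentences):
    # Restated from this section so the example is self-contained
    unigram_counts = Counter(token for sentence in sentences for token in sentence)
    total_tokens = sum(unigram_counts.values())
    unigram_probs = {token: count / total_tokens for token, count in unigram_counts.items()}
    return unigram_counts, unigram_probs

# Hypothetical toy corpus: 5 tokens in total, with "a" and "b" appearing twice each
toy_sentences = [["a", "b", "a"], ["b", "c"]]
toy_counts, toy_probs = build_unigram_model(toy_sentences)

print(toy_counts["a"], toy_probs["a"])  # 2 0.4
```

Because every count is divided by the same total, the probabilities sum to 1 up to floating-point rounding.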

## Build Bigram Model
The Bigram model counts every pair of adjacent tokens within a sentence. Pairs never span sentence boundaries, and only raw counts are returned at this stage.

In [None]:
def build_bigram_model(sentences):
    bigram_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            bigram = (sentence[i], sentence[i+1])
            bigram_counts[bigram] += 1
    return bigram_counts
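
The bigram builder returns counts only. If conditional probabilities are wanted, one common choice is the unsmoothed maximum-likelihood estimate, P(w2 | w1) = count(w1, w2) / count(w1 followed by any token). The sketch below is an assumption about how that step could look and is not part of the original pipeline; bigram_probabilities is a hypothetical helper name, and the toy counts are invented.

```python
from collections import defaultdict

def bigram_probabilities(bigram_counts):
    # Hypothetical helper: unsmoothed maximum-likelihood estimate of P(w2 | w1)
    # The denominator is how often w1 appears as the first token of any bigram
    context_totals = defaultdict(int)
    for (first, _second), count in bigram_counts.items():
        context_totals[first] += count
    return {
        (first, second): count / context_totals[first]
        for (first, second), count in bigram_counts.items()
    }

# Hypothetical counts: "a" is followed by "b" twice and by "c" once
toy_bigram_counts = {("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 1}
toy_bigram_probs = bigram_probabilities(toy_bigram_counts)

print(toy_bigram_probs[("b", "a")])  # 1.0
```

Without smoothing, any pair absent from the corpus has no entry at all, so a real model would likely need a smoothing or backoff scheme on top of this.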

## Build Trigram Model
The Trigram model counts every sequence of three consecutive tokens within a sentence. Sentences with fewer than three tokens contribute nothing.

In [None]:
def build_trigram_model(sentences):
    trigram_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 2):
            trigram = tuple(sentence[i:i+3])
            trigram_counts[trigram] += 1
    return trigram_counts

## Build Quadrigram Model
The Quadrigram model counts every sequence of four consecutive tokens within a sentence. Sentences with fewer than four tokens contribute nothing.

In [None]:
def build_quadrigram_model(sentences):
    quadrigram_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 3):
            quadrigram = tuple(sentence[i:i+4])
            quadrigram_counts[quadrigram] += 1
    return quadrigram_counts
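
The bigram, trigram, and quadrigram builders differ only in window size, so they could be collapsed into one function parameterized by n. The sketch below shows one possible way to do that; build_ngram_counts is a hypothetical name, and returning a Counter instead of a defaultdict is a choice made for this sketch.

```python
from collections import Counter

def build_ngram_counts(sentences, n):
    # Hypothetical generalization: slide a window of n tokens over each sentence
    if n < 1:
        raise ValueError("n must be at least 1")
    ngram_counts = Counter()
    for sentence in sentences:
        # range() is empty when the sentence has fewer than n tokens
        for i in range(len(sentence) - n + 1):
            ngram_counts[tuple(sentence[i:i + n])] += 1
    return ngram_counts

# Hypothetical toy corpus with ASCII placeholders for Telugu tokens
toy_sentences = [["a", "b", "c", "d", "e"], ["a", "b", "c"]]
toy_trigrams = build_ngram_counts(toy_sentences, 3)
toy_quadrigrams = build_ngram_counts(toy_sentences, 4)

print(toy_trigrams[("a", "b", "c")])  # 2
print(len(toy_quadrigrams))           # 2
```

One difference to keep in mind: with n = 1 this sketch produces one-element tuples as keys, whereas the unigram model in this section uses plain strings.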

## Save Model Output to Text Files
Each model's counts are saved to a .txt file for inspection, one n-gram per line: the tokens joined by spaces, then a tab, then the count.

In [None]:
def save_model_to_file(model_counts, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for ngram, count in model_counts.items():
            if isinstance(ngram, tuple):
                ngram_str = ' '.join(ngram)
            else:
                ngram_str = ngram
            f.write(f"{ngram_str}\t{count}\n")
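
To illustrate the file format, the sketch below writes a small count table and reads it back. The reader load_model_from_file is a hypothetical helper that does not appear in the original notebook, and the toy counts are invented; the save function is restated in compact form so the block runs on its own.

```python
import os
import tempfile

def save_model_to_file(model_counts, filename):
    # Restated from this section: one "tokens<TAB>count" line per n-gram
    with open(filename, 'w', encoding='utf-8') as f:
        for ngram, count in model_counts.items():
            ngram_str = ' '.join(ngram) if isinstance(ngram, tuple) else ngram
            f.write(f"{ngram_str}\t{count}\n")

def load_model_from_file(filename):
    # Hypothetical inverse of save_model_to_file; every key comes back as a tuple
    model_counts = {}
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Split on the last tab so the count is always the final field
            ngram_str, count_str = line.rstrip('\n').rsplit('\t', 1)
            model_counts[tuple(ngram_str.split(' '))] = int(count_str)
    return model_counts

toy_counts = {("a", "b"): 2, ("b", "c"): 1}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_bigram_model.txt")
    save_model_to_file(toy_counts, toy_path)
    restored = load_model_from_file(toy_path)

print(restored == toy_counts)  # True
```

The round trip relies on tokens containing no spaces or tabs, which holds for a corpus tokenized on whitespace. Unigram keys saved as plain strings would come back as one-element tuples under this reader.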

## Main Execution Block
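The main function ties everything together: it loads the corpus, builds the four count tables, and writes each one to its own file. The unigram probabilities are computed but discarded, so only counts are saved.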

In [None]:
def main():
    filepath = "tokenized_telugu.txt"  # Make sure this file is in your working directory
    tokenized_sentences = load_tokenized_sentences(filepath)

    # Build models
    unigram_counts, _ = build_unigram_model(tokenized_sentences)
    bigram_counts = build_bigram_model(tokenized_sentences)
    trigram_counts = build_trigram_model(tokenized_sentences)
    quadrigram_counts = build_quadrigram_model(tokenized_sentences)

    # Save to files
    save_model_to_file(unigram_counts, "unigram_model.txt")
    save_model_to_file(bigram_counts, "bigram_model.txt")
    save_model_to_file(trigram_counts, "trigram_model.txt")
    save_model_to_file(quadrigram_counts, "quadrigram_model.txt")

    print("✅ All models saved successfully!")

if __name__ == "__main__":
    main()