"Exploring Corpus Statistics and N-gram Analysis in Python"

This notebook implements a small, configurable n-gram analysis program and applies it to two small corpora, an email text sample and a newsgroups sample, so that their statistics can be compared. The program counts the n-grams in each corpus (bigrams in the run shown here), lists the most frequent ones, generates a random sentence from one of the models, and computes the perplexity of one corpus under the model built from the other.

In [2]:
import string
from collections import defaultdict
import random
from math import log2

In [3]:
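def preprocess_text(text):
    # Delete ASCII punctuation outright (it is not replaced by spaces).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase so that "The" and "the" are counted as the same word.
    text = text.lower()
    return text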
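
Because punctuation is deleted rather than replaced by whitespace, strings such as email addresses collapse into a single token, which is likely how a token like 'johndoeemailcom' ends up in the recorded output. The sketch below repeats the function so it runs on its own; the sample line is invented for illustration and is not taken from the corpus.

```python
import string


def preprocess_text(text):
    # Same steps as the cell above: delete ASCII punctuation, then lowercase.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower()


# Hypothetical input line, made up for this example.
sample = "From: John Doe <john.doe@email.com>"
cleaned = preprocess_text(sample)
print(cleaned)  # from john doe johndoeemailcom
```

If separate tokens are preferred, one option is to map punctuation to spaces instead of deleting it, at the cost of splitting contractions such as "don't".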

In [4]:
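def generate_ngrams(text, n):
    # Split on whitespace and slide a window of n words over the tokens.
    # Texts with fewer than n words yield an empty list.
    words = text.split()
    ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
    return ngrams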
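
As a quick illustration of the sliding window, the sketch below runs the same function on a short, made-up sentence for n = 1 and n = 2. The sample text is an assumption chosen for the example.

```python
def generate_ngrams(text, n):
    # Same logic as the cell above: one tuple per window of n consecutive words.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "please join the followup meeting"  # hypothetical sample text

unigrams = generate_ngrams(text, 1)
bigrams = generate_ngrams(text, 2)
too_short = generate_ngrams("hello", 2)  # fewer than n words

print(unigrams)   # [('please',), ('join',), ('the',), ('followup',), ('meeting',)]
print(bigrams)    # [('please', 'join'), ('join', 'the'), ('the', 'followup'), ('followup', 'meeting')]
print(too_short)  # []
```

Note that unigrams come back as one-element tuples, so the same model-building code works for any n.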

In [5]:
def build_ngram_model(ngrams):
    # The "model" is a plain frequency table: n-gram tuple -> raw count.
    model = defaultdict(int)
    for ngram in ngrams:
        model[ngram] += 1
    return model
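
The same frequency table can also be built with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's own approach; the function name `build_ngram_counter` and the toy n-grams are hypothetical.

```python
from collections import Counter


def build_ngram_counter(ngrams):
    # Counter counts hashable items directly, so tuples of words work as keys.
    return Counter(ngrams)


# Toy bigrams, invented for illustration.
ngrams = [("the", "economy"), ("on", "the"), ("the", "economy")]
model = build_ngram_counter(ngrams)

top = model.most_common(1)
print(top)                 # [(('the', 'economy'), 2)]
print(model[("x", "y")])   # 0, and the missing key is not inserted
```

One practical difference: looking up a missing key in a `defaultdict(int)` inserts it with a count of 0, while a `Counter` returns 0 without modifying the table.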

In [6]:
def generate_sentence(model, n, max_length):
    # Seed the sentence with one n-gram drawn uniformly from the model.
    ngrams = list(model.keys())
    current_ngram = random.choice(ngrams)
    sentence = list(current_ngram[:max_length])

    # NOTE: each following word is the last word of an n-gram drawn uniformly
    # at random. Neither the preceding words nor the counts are used, so the
    # result is a random sequence of corpus words, not a Markov chain.
    while len(sentence) < max_length:
        next_word = random.choice(ngrams)[-1]
        sentence.append(next_word)
    return " ".join(sentence)
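
The `generate_sentence` cell picks every new word without looking at the words that precede it. A context-aware variant could condition each word on the previous n-1 words and weight the choice by the observed counts. The following is a minimal sketch of that idea under stated assumptions: the name `generate_sentence_markov` and the `rng` parameter are hypothetical, and generation simply stops when a context has no recorded continuation.

```python
import random
from collections import defaultdict


def generate_sentence_markov(model, max_length, rng=random):
    """Sample words so that each one follows its (n-1)-word context."""
    if not model:
        return ""

    # Index the model by context: context -> (candidate words, their counts).
    followers = defaultdict(lambda: ([], []))
    for ngram, count in model.items():
        words, weights = followers[ngram[:-1]]
        words.append(ngram[-1])
        weights.append(count)

    # Start from an n-gram chosen in proportion to its count.
    ngrams = list(model)
    start = rng.choices(ngrams, weights=[model[g] for g in ngrams])[0]
    sentence = list(start[:max_length])
    context = start[1:]

    while len(sentence) < max_length:
        if context not in followers:
            break  # dead end: this context was never followed by anything
        words, weights = followers[context]
        next_word = rng.choices(words, weights=weights)[0]
        sentence.append(next_word)
        # Slide the context window forward by one word.
        context = (context + (next_word,))[1:]
    return " ".join(sentence)


# Toy bigram counts, invented for illustration.
toy_model = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 1}
sentence = generate_sentence_markov(toy_model, 6, random.Random(0))
print(sentence)
```

With a bigram model, every adjacent word pair in the output is a bigram that occurs in the training text, which is not guaranteed by uniform sampling.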

Perplexity measures how well a language model predicts a sequence of words. It is 2 raised to the average negative log2-probability that the model assigns to each word of the test sequence, and it can be read as the model's average branching factor: the effective number of equally likely choices at each step. Lower perplexity indicates better predictive performance, which is why it is commonly used to evaluate language models in natural language processing tasks.
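
To make the branching-factor reading concrete, the sketch below computes perplexity directly from a list of per-word probabilities. The helper name `perplexity_from_probs` is hypothetical and the probabilities are made up: a model that assigns probability 1/4 to every word behaves like a uniform choice among four words, so its perplexity is 4.

```python
from math import log2


def perplexity_from_probs(probs):
    # 2 ** (average negative log2-probability over all predictions)
    avg_neg_log = -sum(log2(p) for p in probs) / len(probs)
    return 2 ** avg_neg_log


# Eight predictions, each assigned probability 0.25 (illustrative values).
uniform_four = perplexity_from_probs([0.25] * 8)
print(uniform_four)  # 4.0
```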

In [7]:
def calculate_perplexity(model, test_ngrams, n):
    # Total count of every context (all words but the last), computed once
    # instead of rescanning the whole model for each test n-gram.
    context_totals = defaultdict(int)
    for ngram, count in model.items():
        context_totals[ngram[:-1]] += count

    log_prob = 0
    for test_ngram in test_ngrams:
        if test_ngram in model:
            # Add-one style estimate. len(model) is the number of distinct
            # n-grams, not the vocabulary size, so this is not standard
            # Laplace smoothing.
            probability = (model[test_ngram] + 1) / (context_totals[test_ngram[:-1]] + len(model))
        else:
            # Unseen n-grams receive a fixed floor probability.
            probability = 1 / len(model)
        log_prob += log2(probability)
    # Perplexity is 2 raised to the average negative log2-probability.
    perplexity = 2 ** (-log_prob/len(test_ngrams))
    return perplexity
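
The cell above uses `len(model)`, the number of distinct n-grams, in the smoothing denominator and a separate floor for unseen n-grams. For comparison, here is a sketch of textbook add-one (Laplace) smoothing, where both seen and unseen n-grams use (count + 1) / (context count + V) and V is the vocabulary size. The function name `laplace_perplexity` is hypothetical, and the sketch assumes a non-empty model and takes the vocabulary to be the words seen in training, which ignores out-of-vocabulary test words.

```python
from collections import Counter
from math import log2


def laplace_perplexity(model, test_ngrams):
    # Precompute context totals and the training vocabulary in one pass.
    context_totals = Counter()
    vocabulary = set()
    for ngram, count in model.items():
        context_totals[ngram[:-1]] += count
        vocabulary.update(ngram)
    vocab_size = len(vocabulary)

    log_prob = 0.0
    for ngram in test_ngrams:
        # Seen and unseen n-grams share one formula under add-one smoothing.
        numerator = model.get(ngram, 0) + 1
        denominator = context_totals[ngram[:-1]] + vocab_size
        log_prob += log2(numerator / denominator)
    return 2 ** (-log_prob / len(test_ngrams))


# Toy bigram counts, invented for illustration: vocabulary is {a, b, c}.
toy_model = {("a", "b"): 3, ("a", "c"): 1}
seen = laplace_perplexity(toy_model, [("a", "b")])    # P = (3+1)/(4+3) = 4/7
unseen = laplace_perplexity(toy_model, [("z", "b")])  # P = (0+1)/(0+3) = 1/3
print(seen, unseen)
```

With this formula the probabilities of all vocabulary words after a given context sum to 1, which the `len(model)` variant does not guarantee, so the two functions will generally report different perplexities on the same data.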

In [8]:
def main():
    # Read both corpora; the context managers close the files afterwards.
    with open(r"data\email_text.txt", "r") as f:
        email_text = f.read()
    with open(r"data\newsgroups.txt", "r") as f:
        newsgroups = f.read()

    email_text = preprocess_text(email_text)
    email_ngrams = generate_ngrams(email_text, 2)
    newsgroups_text = preprocess_text(newsgroups)
    newsgroups_ngrams = generate_ngrams(newsgroups_text,2)

    email_model = build_ngram_model(email_ngrams)
    newsgroups_model = build_ngram_model(newsgroups_ngrams)

    # Both models were built with n = 2, so these are bigram counts.
    print("Most common bigrams in email text:")
    print(sorted(email_model.items(), key=lambda x: x[1], reverse=True)[:5])
    print()
    print("Most common bigrams in newsgroups:")
    print(sorted(newsgroups_model.items(), key=lambda x: x[1], reverse=True)[:5])
    print()

    print("Random sentence generated using email model:")
    print(generate_sentence(email_model, 2, 10))
    print()

    perplexity = calculate_perplexity(email_model, newsgroups_ngrams, 2)
    print("Perplexity of newsgroups corpus given email model:", perplexity)

In [9]:
main()

Most common bigrams in email text:
[(('followup', 'meeting'), 4), (('you', 'have'), 4), (('from', 'john'), 2), (('john', 'doe'), 2), (('doe', 'johndoeemailcom'), 2)]

Most common bigrams in newsgroups:
[(('the', 'latest'), 3), (('on', 'the'), 3), (('the', 'economy'), 3), (('for', 'the'), 2), (('the', 'government'), 2)]

Random sentence generated using email model:
feel free be out the please quick from sure just

Perplexity of newsgroups corpus given email model: 58.58659522692783
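
The top-5 lists show what is frequent within each corpus, but not which n-grams differ most between them. One simple way to surface such differences is to compare relative frequencies. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the toy counts are invented, and both models are assumed to be non-empty.

```python
def relative_frequencies(model):
    # Convert raw counts to proportions so corpora of different sizes compare.
    total = sum(model.values())
    return {ngram: count / total for ngram, count in model.items()}


def distinctive_ngrams(model_a, model_b, top=5):
    """Rank n-grams by the absolute gap in relative frequency (a minus b)."""
    freq_a = relative_frequencies(model_a)
    freq_b = relative_frequencies(model_b)
    diffs = {
        ngram: freq_a.get(ngram, 0.0) - freq_b.get(ngram, 0.0)
        for ngram in set(freq_a) | set(freq_b)
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:top]


# Toy bigram counts, invented for illustration.
model_a = {("the", "economy"): 3, ("on", "the"): 1}
model_b = {("on", "the"): 2, ("you", "have"): 2}
ranking = distinctive_ngrams(model_a, model_b)
print(ranking)
```

A positive difference marks an n-gram that is relatively more frequent in the first model, a negative one marks the second. On corpora as small as the ones used here, such gaps rest on a handful of occurrences and should be read with caution.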
