## Task 1: Third-Order Letter Approximation Model

In this task, I build a third-order letter approximation model from English texts taken from Project Gutenberg. The model is a trigram table: for every sequence of three consecutive characters (a trigram) in the selected texts, it records how many times that sequence occurs.
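
As a small illustration of what counting trigrams means, the sketch below slides a three-character window over a toy string. It uses `collections.Counter` rather than the plain dictionary used in the task code, and the helper name `count_trigrams` is made up for this example.

```python
from collections import Counter

def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    # A string of length n has n - 2 windows of width 3.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

sample = "THE THEME"
sample_counts = count_trigrams(sample)

print(sample_counts["THE"])         # "THE" occurs twice in the sample
print(sum(sample_counts.values()))  # total number of windows: len(sample) - 2
```

Spaces count as characters here, so `"HE "` and `" TH"` are trigrams in their own right.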

In [3]:
import re
import os

# Global variables to store processed texts and trigram dictionary
processed_texts = []
trigrams = {}

In [4]:
def preprocess_text(text):
    # Markers that delimit the book body. The first " ***" in the file is
    # assumed to be the one closing the "*** START OF ... ***" line.
    preamble = " ***"
    postamble = "*** END OF "

    # Step 1: Remove preamble and postamble
    # (str.index raises ValueError if either marker is missing)
    cleaned_text = text[text.index(preamble) + len(preamble):text.index(postamble)]

    # Step 2: Drop every character except letters, whitespace (including newlines), and periods
    cleaned_text = re.sub(r"[^a-zA-Z\s.]", "", cleaned_text)

    # Step 3: Collapse runs of blank lines into a single newline
    cleaned_text = re.sub(r"\n\s*\n", "\n", cleaned_text)

    # Convert to uppercase and trim any leading/trailing whitespace
    return cleaned_text.upper().strip()
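
To see what these cleaning steps do, the sketch below runs a stand-alone copy of them on a made-up snippet shaped like a Project Gutenberg file. The name `clean_sample` and the contents of `fake_book` are invented for illustration; real files may differ in their exact marker lines.

```python
import re

def clean_sample(text, preamble=" ***", postamble="*** END OF "):
    """Hypothetical stand-alone copy of the cleaning steps, for illustration."""
    start = text.index(preamble) + len(preamble)  # just after the START line
    end = text.index(postamble)                   # just before the END line
    body = text[start:end]
    body = re.sub(r"[^a-zA-Z\s.]", "", body)      # keep letters, whitespace, periods
    body = re.sub(r"\n\s*\n", "\n", body)         # collapse blank lines
    return body.upper().strip()

fake_book = (
    "Header line\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "It was 1 fine day, wasn't it?\n"
    "\n"
    "\n"
    "Yes.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text\n"
)

cleaned = clean_sample(fake_book)
print(repr(cleaned))
```

In this sample the digit and punctuation are removed but the line break between the two sentences survives, so a trigram count over the result would include windows that contain `"\n"`. If only letters, spaces, and periods are wanted, newlines would need to be replaced with spaces as an extra step.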


In [None]:
def produce_trigrams(texts):
    trigram_counts = {}  # Maps each trigram to the number of times it occurs

    for text in texts:
        # range(len(text) - 2) yields only start positions with a full
        # three-character window, so no separate length check is needed.
        for i in range(len(text) - 2):
            trigram = text[i:i+3]  # Extract three-character sequence
            # Add 1 to the existing count, starting from 0 for a new trigram
            trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1

    return trigram_counts


In [None]:
# Load and process each .txt file in the 'texts/' directory
processed_texts = []  # Reset so re-running this cell does not add each text twice
for file in os.scandir("texts"):
    if file.name.endswith(".txt"):
        with open(file.path, 'r', encoding='utf-8') as f:
            content = f.read()
        processed_texts.append(preprocess_text(content))  # Sanitize and add to the list


In [None]:
# Generate trigram model from processed texts
trigrams = produce_trigrams(processed_texts)
print("Sample of Trigram Model:", dict(list(trigrams.items())[:10]))  # Display a sample
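
The sample above shows the first ten trigrams in insertion order, which says little about which ones dominate. One possible way to inspect the model is to rank trigrams by count and report each one's share of the total, as sketched below. The helper `top_trigrams` and the `toy_counts` values are made up for this example and are not results from the texts.

```python
def top_trigrams(trigram_counts, n=5):
    """Return the n most frequent trigrams as (trigram, count, share) tuples."""
    total = sum(trigram_counts.values())
    # Sort by count, highest first; ties keep their original dictionary order.
    ranked = sorted(trigram_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(tri, count, count / total) for tri, count in ranked[:n]]

# Toy counts for illustration only
toy_counts = {"THE": 6, "HE ": 3, " TH": 1}

top_two = top_trigrams(toy_counts, n=2)
for tri, count, share in top_two:
    print(repr(tri), count, f"{share:.2f}")
```

Passing the real `trigrams` dictionary instead of `toy_counts` would list the most common trigrams in the processed texts.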
