## Task 1: Third-Order Letter Approximation Model

In this task, I will build a third-order letter approximation model using English texts from Project Gutenberg. The goal is to create a trigram model that counts the frequency of every sequence of three characters (trigram) in the selected texts.

In [1]:
import re
import os

# Global variables to store processed texts and trigram dictionary
processed_texts = []
trigrams = {}

### Text Preprocessing with preprocess_text
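The preprocess_text function cleans the raw text files. It strips the Project Gutenberg header and footer, removes every character other than letters, full stops, and whitespace, collapses blank lines, and converts the result to uppercase. Because the regex keeps all whitespace (\s), single line breaks survive alongside spaces, so the cleaned text consists of uppercase letters, full stops, spaces, and line breaks. This preprocessing ensures the trigram model is built on data that is normalized the same way for every file.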

The preprocess_text function performs the following steps:

1. Remove Preamble and Postamble:

- Project Gutenberg texts typically include introductory and closing text sections (preamble and postamble).
- Markers:
  - preamble = " ***" indicates the end of the introductory text.
  - postamble = "*** END OF " indicates the beginning of the closing text.
- We slice the text based on these markers to capture only the main content, avoiding irrelevant text.

2. Filter Allowed Characters Using Regex:

- Using re.sub(), we remove every character outside the allowed set (letters, whitespace, and periods).
- Regex Pattern: [^a-zA-Z\s.] matches anything that is not a letter (either case), a whitespace character, or a period, and each match is deleted. Note that \s covers tabs and newlines as well as spaces, so line breaks are kept at this stage.

3. Remove Consecutive Blank Lines:

- Multiple consecutive blank lines can disrupt the trigram model by introducing excessive whitespace sequences.
- We replace each run of blank lines (a newline, optional whitespace, and another newline) with a single newline using re.sub(r"\n\s*\n", "\n", cleaned_text). Single line breaks between lines of text are left in place.

4. Convert to Uppercase and Trim Whitespace:

- Finally, upper() standardizes all characters to uppercase.
- strip() removes any extra whitespace at the start and end of the text, preparing it for trigram processing.
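
To see what each step does, the four steps can be traced on a small made-up sample. This is only an illustrative sketch: the sample string is invented and far shorter than a real Project Gutenberg file, and the variable names (sample, body, filtered, collapsed, result) are my own.

```python
import re

# Hypothetical miniature stand-in for a Project Gutenberg file (not real data)
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "Hello, World! It's a test.\n"
    "\n"
    "\n"
    "Second line.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer text\n"
)

# Step 1: keep only the text between the two markers
start = sample.index(" ***") + len(" ***")
end = sample.index("*** END OF ")
body = sample[start:end]

# Step 2: delete everything except letters, whitespace, and periods
filtered = re.sub(r"[^a-zA-Z\s.]", "", body)

# Step 3: collapse runs of blank lines into a single newline
collapsed = re.sub(r"\n\s*\n", "\n", filtered)

# Step 4: uppercase and trim leading/trailing whitespace
result = collapsed.upper().strip()

print(repr(result))  # 'HELLO WORLD ITS A TEST.\nSECOND LINE.'
```

Note that the line break between the two sentences survives, since \s in the pattern keeps all whitespace.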

In [2]:
def preprocess_text(text):
    # Define markers for preamble and postamble sections
    preamble = " ***"
    postamble = "*** END OF "
    
    # Step 1: Remove preamble and postamble
    # Note: str.index raises ValueError if either marker is missing from the text
    cleaned_text = text[text.index(preamble) + len(preamble):text.index(postamble)]
    
    # Step 2: Remove everything except letters, whitespace (including newlines), and periods
    cleaned_text = re.sub(r"[^a-zA-Z\s.]", "", cleaned_text)
    
    # Step 3: Collapse runs of blank lines into a single newline
    cleaned_text = re.sub(r"\n\s*\n", "\n", cleaned_text)
    
    # Step 4: Convert to uppercase and trim any leading/trailing whitespace
    return cleaned_text.upper().strip()


### Trigram Creation Function

The produce_trigrams function takes a list of processed texts and iterates through each character in each text, extracting and counting every three-character sequence. This data is stored in a dictionary, where each trigram is a key and its frequency is the value.

- Trigram Extraction: The function slides a three-character window along the text one position at a time, so consecutive trigrams overlap by two characters.
- Dictionary Update: For each trigram, it checks if the trigram already exists in the dictionary:
    - If it exists, it increments the count.
    - If it doesn’t exist, it initializes the trigram count to 1.
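
As a quick check of the sliding-window idea, here is a minimal sketch that counts the trigrams of a single short word. It uses collections.Counter from the standard library rather than the hand-written dictionary update described above, and count_trigrams is a hypothetical helper name, not part of this notebook.

```python
from collections import Counter

def count_trigrams(text):
    """Count overlapping three-character windows in a single string."""
    # range(len(text) - 2) is empty for strings shorter than 3 characters
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = count_trigrams("BANANA")
print(dict(counts))  # {'BAN': 1, 'ANA': 2, 'NAN': 1}
```

The window starts at positions 0 to 3, giving BAN, ANA, NAN, ANA, so ANA is counted twice.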

In [3]:
def produce_trigrams(texts):
    trigram_counts = {}  # Dictionary to store trigram counts
    
    for text in texts:
        # range(len(text) - 2) ensures every slice below has exactly 3 characters
        for i in range(len(text) - 2):
            trigram = text[i:i+3]  # Extract three-character sequence
            
            if trigram in trigram_counts:
                trigram_counts[trigram] += 1  # Increment count if trigram already exists
            else:
                trigram_counts[trigram] = 1  # Initialize trigram count if it doesn't exist
    
    return trigram_counts


### Processing Text Files
Using the os library, we iterate over files in the texts/ directory, ensuring only .txt files are processed. Each file’s content is cleaned using preprocess_text, and the processed texts are stored in the processed_texts list for trigram generation.

In [4]:
# Load and process each .txt file in the 'texts/' directory
for file in os.scandir("texts"):
    if file.name.endswith(".txt"):
        with open(file.path, 'r', encoding='utf-8') as f:
            content = f.read()
            processed_texts.append(preprocess_text(content))  # Sanitize and add to the list
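
One detail worth knowing is that os.scandir yields directory entries in arbitrary order, so the order of the loaded texts (and therefore the order of keys in the trigram dictionary) can differ between machines. The sketch below shows one possible variation that sorts by file name, assuming a reproducible order is wanted. The helper load_text_files and the file names in the demonstration are hypothetical.

```python
import os
import tempfile

def load_text_files(directory):
    """Return the contents of every .txt file in `directory`, sorted by file name."""
    contents = []
    with os.scandir(directory) as entries:
        # Sort the entries so the result does not depend on filesystem order
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_file() and entry.name.endswith(".txt"):
                with open(entry.path, "r", encoding="utf-8") as f:
                    contents.append(f.read())
    return contents

# Demonstration on a throwaway directory (file names and contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("b.txt", "second"), ("a.txt", "first"), ("notes.md", "ignored")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    loaded = load_text_files(tmp)

print(loaded)  # ['first', 'second']
```

The total counts in the model are the same either way; only the ordering of the dictionary changes.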


### Generate Trigram Model
With our processed texts ready, we pass them to produce_trigrams to generate the trigram model. This dictionary stores each trigram and its frequency across the text data.

In [5]:
# Generate trigram model from processed texts
trigrams = produce_trigrams(processed_texts)
print("Sample of Trigram Model:", dict(list(trigrams.items())[:10]))  # Display a sample


Sample of Trigram Model: {'THI': 1998, 'HIR': 138, 'IRT': 121, 'RTY': 163, 'TYO': 8, 'YON': 86, 'ONE': 1405, 'NE ': 1626, 'E B': 1614, ' BR': 673}
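
The first ten dictionary entries simply reflect the order in which trigrams were first seen, so they say little about which trigrams dominate. A minimal sketch of ranking by frequency follows. The helper most_common_trigrams is hypothetical, and it is applied here only to the ten entries printed above rather than to the full model.

```python
def most_common_trigrams(trigram_counts, n=3):
    """Return the n (trigram, count) pairs with the highest counts."""
    ranked = sorted(trigram_counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# The ten entries from the sample output above, used as stand-in data
sample_counts = {
    'THI': 1998, 'HIR': 138, 'IRT': 121, 'RTY': 163, 'TYO': 8,
    'YON': 86, 'ONE': 1405, 'NE ': 1626, 'E B': 1614, ' BR': 673,
}

print(most_common_trigrams(sample_counts))
# [('THI', 1998), ('NE ', 1626), ('E B', 1614)]
```

Calling most_common_trigrams(trigrams, 10) on the full dictionary would give the ten most frequent trigrams across all texts.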
