# Task 1: Third-Order Letter Approximation Model

In this task, I will build a third-order letter approximation model using English texts from Project Gutenberg. The goal is to create a trigram model that counts the frequency of every sequence of three characters (trigram) in the selected texts.

In [32]:
import re
import os
import random
import json

# Global variables to store processed texts and trigram dictionary
processed_texts = []
trigrams = {}

### Text Preprocessing with preprocess_text
The preprocess_text function is designed to clean the raw text files. It removes the irrelevant header and footer sections, special characters, and blank lines, leaving only uppercase letters, full stops, and whitespace. This preprocessing ensures that our trigram model is built on clean and consistent data.

The preprocess_text function performs the following steps:

1. Remove Preamble and Postamble:

- Project Gutenberg texts typically include introductory and closing text sections (preamble and postamble).
- Markers:
  - preamble = " ***" indicates the end of the introductory text.
  - postamble = "*** END OF " indicates the beginning of the closing text.
- We slice the text based on these markers to capture only the main content, avoiding irrelevant text.

2. Filter Allowed Characters Using Regex:

- Using re.sub(), we remove every character outside our allowed set (letters, whitespace, and periods). [1]
- Regex Pattern: [^a-zA-Z\s.] matches anything that is not a letter (either case), a whitespace character (\s, which covers spaces, tabs, and newlines), or a period, so re.sub() deletes all other characters.

3. Remove Consecutive Blank Lines:

- Multiple consecutive blank lines can distort the trigram model by introducing long whitespace sequences.
- We replace each run of blank lines with a single newline using re.sub(r"\n\s*\n", "\n", cleaned_text), preserving basic spacing. [2]

4. Convert to Uppercase and Trim Whitespace:

- Finally, upper() standardizes all characters to uppercase.
- strip() removes any extra whitespace at the start and end of the text, preparing it for trigram processing. [3]

In [33]:
def preprocess_text(text):
    # Define markers for preamble and postamble sections
    preamble = " ***"
    postamble = "*** END OF "
    
    # Step 1: Remove preamble and postamble
    cleaned_text = text[text.index(preamble) + len(preamble):text.index(postamble)]
    
    # Step 2: Filter out non-alphabetic characters, keeping only letters, spaces, and periods
    cleaned_text = re.sub("[^a-zA-Z\\s.]", "", cleaned_text)
    
    # Step 3: Replace multiple newlines with a single newline
    cleaned_text = re.sub(r"\n\s*\n", "\n", cleaned_text)
    
    # Convert to uppercase and trim any leading/trailing whitespace
    return cleaned_text.upper().strip()
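
As a quick check of the steps above, here is a minimal sketch that applies the same slicing and substitutions to a made-up miniature of a Project Gutenberg file. The sample string is hypothetical and not taken from the real texts; the markers and regex patterns are the ones used in preprocess_text.

```python
import re

# Hypothetical miniature of a Project Gutenberg file (not one of the real texts)
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n\n"
    "Hello, World! It's 1 test.\n"
    "\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text\n"
)

preamble = " ***"
postamble = "*** END OF "

# Step 1: keep only the text between the two markers
body = sample[sample.index(preamble) + len(preamble):sample.index(postamble)]
# Step 2: drop everything except letters, whitespace, and periods
body = re.sub(r"[^a-zA-Z\s.]", "", body)
# Step 3: collapse runs of blank lines into a single newline
body = re.sub(r"\n\s*\n", "\n", body)
# Step 4: uppercase and trim
result = body.upper().strip()

print(repr(result))  # 'HELLO WORLD ITS  TEST.'
```

Note the double space in the result: the digit is deleted, but the spaces on either side of it are kept.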


### Trigram Creation Function

The produce_trigrams function takes a list of processed texts and iterates through each character in each text, extracting and counting every three-character sequence. This data is stored in a dictionary, where each trigram is a key and its frequency is the value.

- Trigram Extraction: The function slices the text into three-character sequences.
- Dictionary Update: For each trigram, it checks if the trigram already exists in the dictionary:
    - If it exists, it increments the count.
    - If it doesn’t exist, it initializes the trigram count to 1.

In [34]:
def produce_trigrams(texts):
    trigram_counts = {}  # Dictionary to store trigram counts
    
    for text in texts:
        for i in range(len(text) - 2):  # The last full trigram starts at index len(text) - 3
            trigram = text[i:i+3]  # Extract three-character sequence

            # The range above guarantees that every slice has exactly 3 characters
            if trigram in trigram_counts:
                trigram_counts[trigram] += 1  # Increment count if trigram already exists
            else:
                trigram_counts[trigram] = 1  # Initialize trigram count if it doesn't exist
    
    return trigram_counts
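
For comparison, the same counting can be sketched with collections.Counter from the standard library. The name count_trigrams is hypothetical and only used here for illustration; this is an alternative formulation, not the function the notebook uses.

```python
from collections import Counter

def count_trigrams(texts):
    """Hypothetical Counter-based variant of produce_trigrams."""
    counts = Counter()
    for text in texts:
        # Every start index in range(len(text) - 2) yields a full 3-character slice
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    return dict(counts)

demo_counts = count_trigrams(["ABCABC"])
print(demo_counts)  # {'ABC': 2, 'BCA': 1, 'CAB': 1}
```

A text shorter than three characters contributes nothing, because the range is empty.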


### Processing Text Files
Using the os library, we iterate over files in the texts/ directory, ensuring only .txt files are processed. Each file’s content is cleaned using preprocess_text, and the processed texts are stored in the processed_texts list for trigram generation. [4]

In [35]:
# Load and process each .txt file in the 'texts/' directory
for file in os.scandir("texts"):
    if file.name.endswith(".txt"):
        with open(file.path, 'r', encoding='utf-8') as f:
            content = f.read()
            processed_texts.append(preprocess_text(content))  # Sanitize and add to the list
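
A possible alternative, sketched below, uses pathlib and sorts the file names. os.scandir yields entries in arbitrary order; that does not change the trigram counts, but sorting keeps the list of processed texts in a stable order between runs. The helper name load_texts and its preprocess parameter are assumptions made for this example; in the notebook, preprocess_text would be passed in.

```python
import tempfile
from pathlib import Path

def load_texts(directory, preprocess=str.strip):
    """Hypothetical helper: read every .txt file in `directory`, in sorted order."""
    texts = []
    for path in sorted(Path(directory).glob("*.txt")):
        texts.append(preprocess(path.read_text(encoding="utf-8")))
    return texts

# Demo on a throwaway directory so the sketch runs without the real 'texts/' folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second\n", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first\n", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored\n", encoding="utf-8")
    demo_texts = load_texts(tmp)

print(demo_texts)  # ['first', 'second']
```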


### Generate Trigram Model
With our processed texts ready, we pass them to produce_trigrams to generate the trigram model. This dictionary stores each trigram and its frequency across the text data.

In [36]:
# Generate trigram model from processed texts
trigrams = produce_trigrams(processed_texts)
print("Sample of Trigram Model:", dict(list(trigrams.items())[:10]))  # Display a sample


Sample of Trigram Model: {'THI': 1998, 'HIR': 138, 'IRT': 121, 'RTY': 163, 'TYO': 8, 'YON': 86, 'ONE': 1405, 'NE ': 1626, 'E B': 1614, ' BR': 673}


## Task 2: Third-Order Letter Approximation Generation
In this task, I will use my model from Task 1 to generate a string of 10,000 characters. The string starts with 'TH'. At each step, I take the last two characters of the string, find the trigrams in my model that start with those two characters, and randomly select the third letter of one of those trigrams, using the counts as weights.


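'select_next_char' finds all the trigrams whose first two letters match the given prefix (initially 'TH').
- It makes a weighted random selection of the third character, using the count of each trigram as its weight.
- If no trigram starts with the prefix, it returns a space.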

In [37]:

def select_next_char(trigram_model, prefix):
    # Gather the trigrams that start with the given two-character prefix (e.g. 'TH')
    candidates = {k: v for k, v in trigram_model.items() if k.startswith(prefix)}
    
    # If there are no candidates, return a space  
    if not candidates:
        return " "
    
    # Separate the third characters and their counts for weighted selection
    choices, weights = zip(*[(k[2], v) for k, v in candidates.items()])
    
    # Weighted random selection of the third character [5]
    return random.choices(choices, weights=weights, k=1)[0]
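
To make the weighting concrete, the sketch below computes the probability that the weighted selection assigns to each possible third character. The helper next_char_distribution and the toy counts are made up for illustration and are not part of the notebook.

```python
def next_char_distribution(trigram_model, prefix):
    """Hypothetical helper: probability of each third character given a two-character prefix."""
    candidates = {k[2]: v for k, v in trigram_model.items() if k.startswith(prefix)}
    total = sum(candidates.values())
    # Each character's probability is its trigram count divided by the total for the prefix
    return {char: count / total for char, count in candidates.items()} if total else {}

toy_model = {"THE": 3, "THI": 1, "HE ": 2}  # made-up counts
dist = next_char_distribution(toy_model, "TH")
print(dist)  # {'E': 0.75, 'I': 0.25}
```

With these counts, 'E' follows 'TH' three times out of four, which is the proportion random.choices samples from when given the counts as weights.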


Since we now have a way to select the next character, we can create a function, 'generate_text', to build a 10,000-character sequence.
- It extracts the last two characters of the current sequence and uses 'select_next_char' to add the next character based on my trigram model. It continues until the sequence reaches 10,000 characters.


In [38]:
def generate_text(trigram_model, length=10000):
    generated_text = "TH"  # Start with the initial prefix 'TH'
    
    while len(generated_text) < length:
        # Use the last two characters as the prefix for the next character
        prefix = generated_text[-2:]
        next_char = select_next_char(trigram_model, prefix)
        generated_text += next_char
    
    return generated_text
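
Because select_next_char scans the whole model for every generated character, one possible optimization is to group the trigrams by prefix once and reuse that index. The sketch below is an assumption about how this could be done; the names build_prefix_index and generate_text_indexed are hypothetical and the toy model is made up.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Map each two-character prefix to parallel lists of third characters and counts."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def generate_text_indexed(trigram_model, length=10000, start="TH"):
    index = build_prefix_index(trigram_model)
    out = list(start)
    while len(out) < length:
        prefix = "".join(out[-2:])
        if prefix in index:
            chars, weights = index[prefix]
            out.append(random.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(" ")  # same fallback as select_next_char
    return "".join(out)

# Toy model with exactly one continuation per prefix, so the output is deterministic
toy_model = {"THE": 2, "HE ": 2, "E T": 2, " TH": 2}
demo_text = generate_text_indexed(toy_model, length=10)
print(demo_text)  # THE THE TH
```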


Display a sample of the generated text

In [39]:
# Generate a 10,000-character text sequence
generated_text = generate_text(trigrams)

# Display the first 500 characters
print("Sample of Generated Text:", generated_text[:500])


Sample of Generated Text: THE OF HE OF SHART TO JA FICAMPOT TH IN IS
NA SHIS NO TORTERS LOS WHE EASINA. ING HUMPRON THICY WE WOR CREN TULTYA MISTIC WEARS OF CA REGE NO HICH OF TRAT ITOOTIONER ADOES FIGHERS AND BROVISTRY. BUIR BUT VATION TO FOU AGO. A RE A FOR YESUBLONS POPERIF ARROES ANDEGRE 
     AS ACHICT                             DRAOR ANCRENUT THE AY GIRESEST MANK A GO HAVERECE FRELE BUTME FOR HOD ANDS. IT THOLE DEETILEAPPOPHIST SPENTER. SOULD ARBIGO.E. EVE READDLY SENTUILED ITIN THIN DOUSA.
SO BAYIESTAUCHINT SAILD


## Task 3: Analyze The Model
To test our model, we will find out what percentage of the words in the generated text are actual English words, that is, words that appear in the file 'words.txt'.

The 'load_words' function is used to load the English words from the file 'words.txt'.

In [40]:
def load_words(file_path="words.txt"):
    with open(file_path, 'r') as file:
        english_words = {line.strip().upper() for line in file}  # Convert words to uppercase for consistency
    return english_words

# Load the English words
english_words_set = load_words()
print("Total English words loaded:", len(english_words_set))


Total English words loaded: 45373


We split the generated text on whitespace to get a list of words, then count how many of them appear in the English word set. [6]

In [41]:
def calculate_valid_words_percentage(generated_text, english_words_set):
    # Split the generated text on whitespace to get words
    words_in_text = generated_text.split()

    # Count valid English words
    valid_words_count = sum(1 for word in words_in_text if word in english_words_set)

    # Calculate percentage
    total_words = len(words_in_text)
    valid_words_percentage = (valid_words_count / total_words) * 100 if total_words > 0 else 0
    
    return valid_words_percentage, valid_words_count, total_words
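
One caveat: the generated text keeps full stops, so a token such as 'AGO.' is compared with its trailing period attached. Assuming 'words.txt' lists words without punctuation, such tokens would be counted as invalid. The sketch below shows a possible variant that strips periods before the lookup; the name count_valid_words is hypothetical, and this is not the calculation used for the result reported in this notebook.

```python
def count_valid_words(generated_text, english_words):
    """Hypothetical variant: strip periods from each token before looking it up."""
    tokens = [token.strip(".") for token in generated_text.split()]
    tokens = [token for token in tokens if token]  # drop tokens that were only periods
    valid = sum(1 for token in tokens if token in english_words)
    percentage = (valid / len(tokens)) * 100 if tokens else 0
    return percentage, valid, len(tokens)

# Made-up input: 'CAT.' only matches once its trailing period is stripped
demo = count_valid_words("THE CAT. XQZ .", {"THE", "CAT"})
print(demo[1], "of", demo[2])  # 2 of 3
```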


Now that we have a function to calculate the percentage, we can print the results.

In [42]:
valid_words_percentage, valid_words_count, total_words = calculate_valid_words_percentage(generated_text, english_words_set)
print(f"Valid words: {valid_words_count} out of {total_words} words ({valid_words_percentage:.2f}% valid)")

Valid words: 564 out of 1703 words (33.12% valid)


## Task 4: Export the Model as JSON
In this task I will export my model as a JSON file.

This part is easy.
We specify the path for the JSON file and use 'json.dump()' to serialize the trigram dictionary as JSON [7] and write it to 'trigrams.json'.

In [43]:
def save_trigram_model(trigram_model, filename="trigrams.json"):
    with open(filename, 'w') as json_file:
        json.dump(trigram_model, json_file, indent=4)  
    print(f"Trigram model successfully saved to {filename}")

# Export the trigram model to JSON
save_trigram_model(trigrams)


Trigram model successfully saved to trigrams.json
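
To confirm that an export like this can be read back, here is a minimal round-trip sketch using json.load. The function load_trigram_model and the toy model are assumptions for illustration, and the file is written to a temporary directory so the real 'trigrams.json' is not touched.

```python
import json
import os
import tempfile

def load_trigram_model(filename="trigrams.json"):
    """Hypothetical counterpart to save_trigram_model: read the dictionary back."""
    with open(filename, "r") as json_file:
        return json.load(json_file)

toy_model = {"THE": 3, "HE ": 2}  # made-up counts
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trigrams.json")
    with open(path, "w") as json_file:
        json.dump(toy_model, json_file, indent=4)
    reloaded = load_trigram_model(path)

print(reloaded == toy_model)  # True
```

The round trip is exact here because the keys are strings and the counts are integers, both of which JSON represents directly.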


### References

- [1] - Python Regex Library Docs: https://docs.python.org/3/library/re.html
- [2] - Remove blank lines: https://www.digitalocean.com/community/tutorials/python-remove-spaces-from-string
- [3] - String Methods (.upper() & .strip()): https://docs.python.org/3/library/stdtypes.html#string-methods
- [4] - OS Module: https://docs.python.org/3/library/os.html
- [5] - random.choices: https://docs.python.org/3/library/random.html#random.choices
- [6] - Splitting the text: https://docs.python.org/3/library/stdtypes.html#str.split
- [7] - JSON Dump: https://www.geeksforgeeks.org/json-dump-in-python/