# Task 1: Third-order Letter Approximation Model

In this task, a trigram model is built from the text of five English books. The steps are:
1. Loading text files from Project Gutenberg.
2. Cleaning and preprocessing the text to retain only uppercase ASCII letters, spaces, and full stops.
3. Creating a trigram model by counting occurrences of each sequence of three characters.

This model will be used in subsequent tasks for generating text and analyzing language patterns.

## Import Libraries

The necessary libraries are imported:
- `os` for handling file paths.
- `re` for handling regular expressions to clean the text.
- `defaultdict` from `collections` for storing trigram counts in a dictionary that defaults missing keys to zero.

In [15]:
import os
import re
from collections import defaultdict

## Load and Clean Data

Text files are loaded from the `data` folder. Each file's content is read, cleaned, and stored in a dictionary keyed by filename.

Text is cleaned by:
- Removing the Project Gutenberg pre- and postamble.
- Keeping only letters, spaces, and full stops.
- Converting all letters to uppercase.

This ensures that the text is standardized before creating the model.
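
As a rough illustration of the character-level steps listed above (boilerplate removal aside), the following sketch applies them one at a time to a made-up sentence. The sample string and the `step` variable names are illustrative only.

```python
import re

# Made-up input containing an apostrophe, a comma, a line break, and dashes
raw = "Alice's Adventures,\nin Wonderland -- Chapter One."

step1 = raw.replace("\n", " ")              # join lines with a space
step2 = re.sub(r"[^A-Za-z. ]+", "", step1)  # keep letters, full stops, spaces
step3 = step2.upper()                       # convert to uppercase
step4 = re.sub(r"\s+", " ", step3).strip()  # collapse repeated spaces

print(step4)  # ALICES ADVENTURES IN WONDERLAND CHAPTER ONE.
```

Note that the apostrophe is dropped rather than replaced with a space, so `Alice's` becomes `ALICES`.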

In [10]:
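def clean_text(text):
    """Return `text` reduced to uppercase letters, spaces, and full stops."""
    # Join lines so that words on adjacent lines stay separated by a space
    text = text.replace("\n", " ")

    # Remove the Project Gutenberg start/end marker lines.
    # Note: this deletes only the marker itself, and only when it uses this exact wording.
    text = re.sub(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)
    text = re.sub(r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)

    # Drop non-ASCII characters
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Keep only letters, full stops, and spaces
    text = re.sub(r"[^A-Za-z. ]+", "", text)

    # Convert all letters to uppercase
    text = text.upper()

    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()

    return text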

In [28]:
data_folder = 'data'

cleaned_texts = {}
for filename in os.listdir(data_folder):
    file_path = os.path.join(data_folder, filename)

    with open(file_path, 'r', encoding='utf-8') as file:
        original_text = file.read()
    
    cleaned_texts[filename] = clean_text(original_text)

In [34]:
# Test to display filenames stored in `cleaned_texts`

print("Files stored in `cleaned_texts` after processing:")

for filename in cleaned_texts.keys():
    print(f"- {filename}")

# Test to confirm data is stored in cleaned_texts

expected_files = set(os.listdir(data_folder))

loaded_files = set(cleaned_texts.keys())

if expected_files == loaded_files:
    print("\nTest Passed: All files are loaded and stored in `cleaned_texts`.")
else:
    print("\nTest Failed: Not all files are loaded correctly.")

Files stored in `cleaned_texts` after processing:
- Alice's Adventures in Wonderland.txt
- Dracula.txt
- Fairy Tales of Hans Christian Andersen.txt
- Moby Dick; Or, The Whale.txt
- Peter Pan.txt

Test Passed: All files are loaded and stored in `cleaned_texts`.


In [14]:
for filename, text in cleaned_texts.items():
    print(f"\nSample from cleaned text in {filename}:\n{text[:500]}\n")


Sample from cleaned text in Alice's Adventures in Wonderland.txt:
THE PROJECT GUTENBERG EBOOK OF ALICES ADVENTURES IN WONDERLAND THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATED BEFORE USING THIS EBOO


Sample from cleaned text in Dracula.txt:
THE PROJECT GUTENBERG EBOOK OF DRACULA THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU 
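
The samples above still begin with the Project Gutenberg licence text, which suggests the marker patterns in `clean_text` did not match these files. A minimal sketch of one way to cut the pre- and postamble follows. It assumes the files contain marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` (or the `THIS` variant), and it would need to run on the raw text before any other cleaning. The function name `strip_gutenberg_boilerplate` is hypothetical and not part of the original notebook.

```python
import re

# Assumed marker wording; adjust if the downloaded files differ
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.DOTALL)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.DOTALL)

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the start and end markers, if they are present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]   # drop everything up to and including the start marker
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]   # drop the end marker and everything after it
    return text

# Toy input standing in for a downloaded book
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text."
)
print(strip_gutenberg_boilerplate(sample).strip())  # Body text.
```

Text without markers is returned unchanged, so the function can be applied to every file.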

## Generate Trigram Model

A trigram model is generated by counting every sequence of three consecutive characters. The count of each unique trigram is kept in a dictionary.

In [55]:
def generate_trigram_model(text):
    """
    Generate a trigram model from the given text.
    
    Parameters:
        text (str): The cleaned text to generate trigrams from.
        
    Returns:
        defaultdict: A dictionary with trigrams as keys and their counts as values.
    """
    # Initialize trigram counts using defaultdict
    trigram_counts = defaultdict(int)
    
    # Count occurrences of each trigram in the text
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    
    return trigram_counts

In [56]:
# Initialize a single trigram model for all texts
combined_trigram_model = defaultdict(int)

# `cleaned_texts` maps filenames to cleaned text
for text in cleaned_texts.values():
    # Count trigrams per text, so no trigram spans two books, then merge the counts
    for trigram, count in generate_trigram_model(text).items():
        combined_trigram_model[trigram] += count
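
A minimal sketch of an alternative way to build the combined counts, using `collections.Counter` from the standard library instead of `defaultdict`. The helper name `count_trigrams` and the toy texts are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def count_trigrams(text):
    """Count every run of three consecutive characters in `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Toy stand-in for `cleaned_texts`
toy_texts = {"a.txt": "THE THE", "b.txt": "THEN"}

combined = Counter()
for text in toy_texts.values():
    combined.update(count_trigrams(text))  # adds counts, does not overwrite them

print(combined.most_common(1))  # [('THE', 3)]
```

`Counter.most_common` also gives a quick way to inspect which trigrams dominate the model.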

In [57]:
# Tests for the combined trigram model generation with sample outputs

# Test 1: Check if the result is a dictionary
sample_text = "HELLO WORLD"
trigram_counts = generate_trigram_model(sample_text)
if isinstance(trigram_counts, defaultdict):
    print("Test 1 Passed: The function returns a dictionary.")
    print("Sample trigrams from 'HELLO WORLD':", dict(list(trigram_counts.items())[:5]))
else:
    print("Test 1 Failed: The function does not return a dictionary.")

# Test 2: Check trigram counts are being done correctly
simple_text = "ABCABC"
expected_counts = {"ABC": 2, "BCA": 1, "CAB": 1}
trigram_counts_simple = generate_trigram_model(simple_text)

# Compare whole dictionaries so unexpected extra trigrams also fail the test
if dict(trigram_counts_simple) == expected_counts:
    print("Test 2 Passed: Trigram counts are correct")
    print("Trigrams generated from 'ABCABC':", dict(trigram_counts_simple))
else:
    print("Test 2 Failed: Trigram counts are incorrect")
    print("Expected:", expected_counts)
    print("Got:", dict(trigram_counts_simple))

# Test 3: Check counts are only done on text longer than 2 characters
short_text = "AB"
trigram_counts_short = generate_trigram_model(short_text)
if len(trigram_counts_short) == 0:
    print("Test 3 Passed: No trigrams generated for text shorter than 3 characters.")
else:
    print("Test 3 Failed: Trigrams were incorrectly generated for short text.")
    print("Generated trigrams for 'AB':", dict(trigram_counts_short))

# Test 4: Check counts aren't generated for empty text
empty_text = ""
trigram_counts_empty = generate_trigram_model(empty_text)
if len(trigram_counts_empty) == 0:
    print("Test 4 Passed: No trigrams generated for empty text.")
else:
    print("Test 4 Failed: Trigrams were incorrectly generated for empty text.")
    print("Generated trigrams for empty text:", dict(trigram_counts_empty))

Test 1 Passed: The function returns a dictionary.
Sample trigrams from 'HELLO WORLD': {'HEL': 1, 'ELL': 1, 'LLO': 1, 'LO ': 1, 'O W': 1}
Test 2 Passed: Trigram counts are correct
Trigrams generated from 'ABCABC': {'ABC': 2, 'BCA': 1, 'CAB': 1}
Test 3 Passed: No trigrams generated for text shorter than 3 characters.
Test 4 Passed: No trigrams generated for empty text.


# Task 2: Third-order Letter Approximation Generation



## Import Libraries

The necessary libraries are imported:
- `random` for weighted random selection of the next character.

In [47]:
import random

## Text Generation Function

The `generate_text` function takes the trigram model, an initial seed of at least two characters, and a target length as inputs.

It generates text by repeating the following steps until the target length is reached or no matching trigram is found:

- Extract the last two characters of the text generated so far.
- Find the trigrams in the model that start with those two characters.
- Randomly select the third character of one of those trigrams, using the counts as weights, and append it to the text.
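
The weighted selection in the steps above amounts to sampling the next character from the conditional distribution estimated by the counts. Writing $N(abc)$ for the number of times the trigram $abc$ occurs in the model (notation introduced here for illustration), the probability of choosing $c$ after the pair $ab$ is:

$$
P(c \mid ab) = \frac{N(abc)}{\sum_{x} N(abx)}
$$

For example, if the only trigrams starting with `TH` were `THE` with a count of 3 and `THA` with a count of 1, then `E` would be chosen with probability 3/4 and `A` with probability 1/4.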

In [104]:
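def generate_text(trigram_model, initial_seed="TH", length=10000):
    """Generate up to `length` characters from trigram counts, starting from `initial_seed`."""
    generated_text = initial_seed

    while len(generated_text) < length:
        # The last two characters determine which trigrams can follow
        last_two = generated_text[-2:]

        # Collect every trigram that begins with those two characters
        possible_trigrams = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

        # Stop early if the model has never seen this pair
        if not possible_trigrams:
            break

        # Candidate next characters and their counts, used as sampling weights
        third_chars = [trigram[2] for trigram in possible_trigrams.keys()]
        weights = list(possible_trigrams.values())

        next_char = random.choices(third_chars, weights=weights)[0]

        generated_text += next_char

    return generated_text[:length]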
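
`generate_text` scans every trigram in the model for each character it generates. One possible alternative, sketched here under the assumption that the model is a plain mapping from trigram to count, is to group the model by two-character prefix once and then look each pair up directly. The names `build_prefix_index` and `generate_text_indexed` are hypothetical and not part of the original notebook.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Map each two-character prefix to parallel lists of next characters and counts."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def generate_text_indexed(trigram_model, initial_seed="TH", length=100):
    index = build_prefix_index(trigram_model)
    out = list(initial_seed)
    while len(out) < length:
        key = "".join(out[-2:])
        if key not in index:   # the model has never seen this pair
            break
        chars, weights = index[key]
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)[:length]

# Toy model in which every pair has exactly one continuation,
# so the output is deterministic
toy_model = {"THE": 3, "HE ": 2, "E T": 2, " TH": 2}
print(repr(generate_text_indexed(toy_model, length=12)))  # 'THE THE THE '
```

Collecting characters in a list and joining once at the end also avoids repeated string concatenation for long outputs.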

In [80]:
# Test 1: Check if the function returns a string of the specified length
test_generated_text = generate_text(combined_trigram_model, initial_seed="TH", length=50)
if len(test_generated_text) == 50:
    print("Test 1 Passed: Generated text has the specified length of 50.")
else:
    print(f"Test 1 Failed: Generated text length is {len(test_generated_text)}, which is unexpected.")

# Test 2: Check if the generated text starts with the initial seed
initial_seed = "TH"
test_generated_text = generate_text(combined_trigram_model, initial_seed=initial_seed, length=50)
if test_generated_text.startswith(initial_seed):
    print("Test 2 Passed: Generated text starts with the initial seed.")
else:
    print("Test 2 Failed: Generated text does not start with the initial seed.")


Test 1 Passed: Generated text has the specified length of 50.
Test 2 Passed: Generated text starts with the initial seed.


In [103]:
generated_text_output = generate_text(combined_trigram_model, initial_seed="TH", length=10000)


print("Generated Text:\n")
print(generated_text_output)

Generated Text:

THEMONEY HICHEN ENTMA LATCHIS HERVAT AFTED THERY A BELD BARCHAT SAY ALS FARD AND UDS INDOWN TH ITS OF THERE OT WHIS OR KNOTS UNTURRACH AN ING THERS SUCHIM BRIND IT WERY HAD ANTAND I SES ITTLE LICH CAUMPLE IMERE NAND ANY DATS THE KER THEN THE OF THE STER OUR AN TO HEY WIT ARY I FRAND TO THEN BA SHE CLOOD IFIR MOSTAIROUNFORTIAT BERFAS UPEARD HEAVER DRUTCH TERS CARDIT POOK IT MONEXAME SAID WASSID THEN THERAGALWAS FELL SPENVOLL THE BOTH BUR WAS ANING OARCULITS ITO TED AGAT WILL MAING TO BLETEP NOWN REED HE ALE DING HOUSEN TO AND THEACK ELF SMAY HO WAS MING AHO HORS PLACESENS TIBETLY THE MOSTO CART SOANT FRIN I GO FING OVER THANTO ROK NE OHN GON ING A LIS SNOW LAND NOT ITS BEEMED DE THEREW WERS. FING VANSE THE KNEDEN EASED AND ME CON BUT HOSS AS THIN FORTAIDESS MERE HAN EVESSE DIN ES. CORNATESTRE AN THE WILL FART ALE UNDE SACH LE. SME BY HILL BY ONLY EXPORRAND BUCH OR BESS A LAT THED FORT MURE WAREM TO NOW THE BEEN THEY RADY FORE A FACK BERINUEER AMECT DONG THER FOWS ASTIAL

# Task 3: Analyze Your Model

## Loading the English Word List

The `load_word_list` function loads a list of valid English words from `words.txt` in the `reference_data` directory.
The file contains one word per line, and each word is converted to uppercase for consistency with the generated text.
The words are stored in a set, which allows efficient lookups when checking word validity.
A few words are printed from the word list to verify that it loaded correctly.

In [86]:
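def load_word_list(filepath):
    """Load one word per line from `filepath` and return the words as an uppercase set."""
    with open(filepath, 'r') as file:
        # Uppercase the whole file once, then split it into one entry per line
        words = set(file.read().upper().splitlines())
    return words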

word_list_path = 'reference_data/words.txt'
word_list = load_word_list(word_list_path)
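
Storing the words in a set makes each membership test a constant-time lookup on average. As a rough sketch of how generated text could be checked against such a set, the example below counts the fraction of tokens that are valid words. The function name `valid_word_fraction`, the tokenisation rule (split on spaces and strip full stops), and the toy word set are assumptions for illustration, not the notebook's actual analysis.

```python
def valid_word_fraction(generated_text, word_set):
    """Return the fraction of space-separated tokens found in `word_set`."""
    # Strip full stops so that "WILL." is looked up as "WILL"
    tokens = [token.strip(".") for token in generated_text.split()]
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    valid = sum(1 for token in tokens if token in word_set)
    return valid / len(tokens)

# Toy stand-ins for the real word list and the generated text
toy_words = {"THE", "OF", "WILL"}
toy_text = "THE KER THEN THE OF THE STER WILL."
print(valid_word_fraction(toy_text, toy_words))  # 0.625
```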