# **Task 1: Third-order Letter Approximation Model**

In this task, a trigram model based on text from five English books will be built. The steps for this task are:
1. Loading text files from Project Gutenberg.
2. Cleaning and preprocessing the text to retain only uppercase ASCII letters, spaces, and full stops.
3. Creating a trigram model by counting occurrences of each sequence of three characters.

This model will be used in subsequent tasks for generating text and analyzing language patterns.

### **Import Libraries**

The necessary libraries are imported:
- `os` for handling file paths.
- `re` for handling regular expressions to clean the text.
- `defaultdict` from `collections` for storing trigram counts in a dictionary.
- `random` for weighted random sampling during text generation.
- `json` for exporting the model as a JSON file.

In [51]:
import os
import re
from collections import defaultdict
import random
import json

### **Cleaning Text Data**
The `clean_text` function is used to clean and standardize a single block of text.

#### What the Function Does
**1.** Replaces Newlines with Spaces - Converts all newline characters (`\n`) into spaces to ensure the text is one continuous line.

**2.** Removes Project Gutenberg Marker Lines such as `*** START OF THIS PROJECT GUTENBERG EBOOK ***` and `*** END OF THIS PROJECT GUTENBERG EBOOK ***`. Only the marker lines themselves are removed; the license text before and after them stays in the output.

**3.** Removes Non-ASCII Characters - Eliminates any characters outside the standard ASCII range (e.g., emojis or foreign language symbols).

**4.** Keeps Only Letters, Spaces, and Full Stops by removing everything except:
- Uppercase or lowercase letters (`A-Z` or `a-z`),
- Periods (`.`), and
- Spaces (` `).

**5.** Converts to Uppercase - Converts all letters to uppercase for consistency.

**6.** Normalizes Whitespace - Replaces multiple spaces with a single space and removes any leading or trailing spaces.

#### Result
The function returns a cleaned version of the input text, which only contains uppercase letters, single spaces, and periods.

#### Purpose
This function ensures the text is in a clean and consistent format, making it ready for further processing during trigram generation.


In [43]:
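def clean_text(text):
    # Join the text into one continuous line
    text = text.replace("\n", " ")

    # Strip the Project Gutenberg START/END marker lines
    text = re.sub(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)
    text = re.sub(r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)

    # Drop any non-ASCII characters
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Keep only letters, full stops, and spaces
    text = re.sub(r"[^A-Za-z. ]+", "", text)

    # Uppercase everything for consistency
    text = text.upper()

    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()

    return text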

#### **clean_text Test**

This test checks if the clean_text function removes non-ASCII and special characters while converting the input "Hello? * World! 🌟" to the expected output "HELLO WORLD".

In [49]:
# Test
test_text = "Hello? * World! 🌟"
expected_output = "HELLO WORLD"

result = clean_text(test_text)
print(f"Test Passed: {result == expected_output}")
assert result == expected_output, "Test Failed: Non-ASCII characters not removed correctly."


Test Passed: True


### **Load and Clean Text Files**
In this section, we load text files from the `data` directory, clean their content using the `clean_text` function, and store the cleaned text in a dictionary. Here's how the process works:

1. **Iterate Over Files**: We loop through all files in the `data` directory.
2. **Read File Contents**: Each file is opened and read into memory.
3. **Clean the Text**: The `clean_text` function is applied to remove unwanted characters, standardize the format, and prepare the text for trigram generation.
4. **Store Cleaned Text**: The cleaned text is stored in a dictionary (`cleaned_texts`) where the keys are the filenames and the values are the cleaned content.

This ensures that all text files are preprocessed and ready for further tasks such as trigram generation and analysis.


In [41]:
data_folder = 'data'

cleaned_texts = {}
for filename in os.listdir(data_folder):
    file_path = os.path.join(data_folder, filename)

    with open(file_path, 'r', encoding='utf-8') as file:
        original_text = file.read()
    
    cleaned_texts[filename] = clean_text(original_text)
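
The loop above reads every entry that `os.listdir` returns, which would include hidden files or subdirectories if any were present. A minimal sketch of a stricter loader follows, assuming the books are all stored as `.txt` files. The name `load_texts` and its `cleaner` parameter are hypothetical; `str.upper` is only a stand-in default so the block runs on its own, and in this notebook `clean_text` would be passed instead.

```python
from pathlib import Path


def load_texts(folder, cleaner=str.upper):
    """Read every .txt file in `folder` and apply `cleaner` to its contents.

    `load_texts` and `cleaner` are illustrative names, not part of the notebook.
    """
    texts = {}
    # sorted() keeps the file order stable between runs
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.name] = cleaner(path.read_text(encoding="utf-8"))
    return texts
```

With this sketch, `cleaned_texts = load_texts(data_folder, cleaner=clean_text)` would build the same dictionary while skipping anything that is not a `.txt` file.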

#### **data_folder Test**
This test verifies that all expected files in the data_folder are processed and stored in the cleaned_texts dictionary by comparing filenames. It outputs filenames stored and confirms if all files are correctly loaded.

In [7]:
# Test to display filenames stored in `cleaned_texts`

print("Files stored in `cleaned_texts` after processing:")

for filename in cleaned_texts.keys():
    print(f"- {filename}")

# Test to confirm data is stored in cleaned_texts

expected_files = set(os.listdir(data_folder))

loaded_files = set(cleaned_texts.keys())

if expected_files == loaded_files:
    print("\nTest Passed: All files are loaded and stored in `cleaned_texts`.")
else:
    print("\nTest Failed: Not all files are loaded correctly.")

Files stored in `cleaned_texts` after processing:
- Alice's Adventures in Wonderland.txt
- Dracula.txt
- Fairy Tales of Hans Christian Andersen.txt
- Moby Dick; Or, The Whale.txt
- Peter Pan.txt

Test Passed: All files are loaded and stored in `cleaned_texts`.


#### **Cleaning Check Test**
This test prints the first 500 characters of the cleaned text for each file in the cleaned_texts dictionary so the output can be checked by eye.

In [8]:
for filename, text in cleaned_texts.items():
    print(f"\nSample from cleaned text in {filename}:\n{text[:500]}\n")


Sample from cleaned text in Alice's Adventures in Wonderland.txt:
THE PROJECT GUTENBERG EBOOK OF ALICES ADVENTURES IN WONDERLAND THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATED BEFORE USING THIS EBOO


Sample from cleaned text in Dracula.txt:
THE PROJECT GUTENBERG EBOOK OF DRACULA THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU 
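
Both samples above still begin with the Project Gutenberg license preamble, so those words are counted in the trigram model. If only the book text is wanted, one option is to keep just the span between the START and END markers before cleaning. The sketch below is an assumption, not part of the original notebook: the helper name `strip_gutenberg_boilerplate` is hypothetical, and the marker wording (`THE` or `THIS`) is a guess at what the downloaded files contain.

```python
import re

# Assumed marker shapes: "*** START OF THE|THIS PROJECT GUTENBERG EBOOK <title> ***"
START_MARKER = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_MARKER = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END markers, if both are found."""
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()]
    # Markers missing or out of order: leave the text unchanged
    return text
```

It would need to run on the raw text before `clean_text`, because the cleaning step deletes the asterisks that the markers rely on.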

## **Generate Trigram Model**

The trigram model is generated by counting every sequence of three characters; the count for each unique trigram is kept in a dictionary.

### **generate_trigram_model Function**
This function creates a model that counts the frequency of all three-character sequences (trigrams) in a given text. Each unique trigram is stored along with the number of times it appears.

#### How It Works
- The text is scanned to extract all overlapping trigrams.
- Each trigram is counted and stored in a dictionary.
- The function returns this dictionary for further use.

In [9]:
def generate_trigram_model(text):
   
    trigram_counts = defaultdict(int)
    
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    
    return trigram_counts
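
The same counting can be written more compactly with `collections.Counter`. This is an alternative sketch rather than the notebook's implementation, and the name `generate_trigram_model_counter` is hypothetical.

```python
from collections import Counter


def generate_trigram_model_counter(text):
    # Count every overlapping three-character window in one pass
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


counts = generate_trigram_model_counter("ABCABC")
print(counts.most_common(1))  # [('ABC', 2)]
```

A `Counter` also returns 0 for trigrams it has never seen, so it behaves like `defaultdict(int)` for lookups, and it adds helpers such as `most_common`.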

### **Combining Trigram Models**
This code combines trigram models from multiple cleaned text files into a single trigram frequency model. It then counts all trigrams across the entire dataset.

#### How It Works
- Each cleaned text is processed to extract all three-character sequences (trigrams).
- These trigrams are added to a shared dictionary (combined_trigram_model) that keeps track of the total count of each trigram.
- The result is a single trigram model representing all the cleaned texts combined.

#### Why It’s Useful
- By gathering trigrams from across all texts, this combined model captures patterns and frequencies representative of the entire dataset.

In [10]:
combined_trigram_model = defaultdict(int)

for text in cleaned_texts.values():
    
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        combined_trigram_model[trigram] += 1
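
The loop above repeats the counting logic from `generate_trigram_model`. A minimal sketch of the same combination using `Counter.update` is shown below; `combine_trigram_models` is a hypothetical helper name. As in the original loop, each text is counted separately, so no trigram spans the boundary between two books.

```python
from collections import Counter


def combine_trigram_models(texts):
    """Sum trigram counts over an iterable of cleaned texts."""
    combined = Counter()
    for text in texts:
        # update() adds the new counts to the running totals
        combined.update(text[i:i + 3] for i in range(len(text) - 2))
    return combined
```

In this notebook it would be called as `combine_trigram_models(cleaned_texts.values())`.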

### **Exporting the Trigram Model**

- This code exports the combined_trigram_model to a JSON file named trigrams.json.
- It converts the model into a dictionary format and saves it with proper indentation for readability. 
- The file can be used for future tasks or integration with other projects.

In [52]:
# Export the trigram model as JSON

trigram_dict = dict(combined_trigram_model)

output_path = "trigrams.json"

with open(output_path, 'w') as json_file:
    json.dump(trigram_dict, json_file, indent=4)

print(f"Trigram model successfully exported to {output_path}.")


Trigram model successfully exported to trigrams.json.
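
For later tasks, the exported file can be read back with `json.load`. The sketch below assumes the file has the flat `{"trigram": count}` layout written above; `load_trigram_model` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def load_trigram_model(path):
    """Load a {"trigram": count} JSON file back into a defaultdict."""
    with open(path, "r", encoding="utf-8") as json_file:
        counts = json.load(json_file)
    # Unseen trigrams default to a count of 0, as in the in-memory model
    return defaultdict(int, counts)
```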


#### **Trigram Generation Tests**
These tests validate the generate_trigram_model function across several cases. They check that it returns a dictionary, counts trigrams correctly, produces no trigrams for short or empty text, and matches the expected output for specific inputs.

In [12]:
# Tests

# Test 1: Check if the result is a dictionary
sample_text = "HELLO WORLD"
trigram_counts = generate_trigram_model(sample_text)
if isinstance(trigram_counts, defaultdict):
    print("Test 1 Passed: The function returns a dictionary.")
    print("Sample trigrams from 'HELLO WORLD':", dict(list(trigram_counts.items())[:5]))
else:
    print("Test 1 Failed: The function does not return a dictionary.")

# Test 2: Check trigram counts are being done correctly
simple_text = "ABCABC"
expected_counts = {"ABC": 2, "BCA": 1, "CAB": 1}
trigram_counts_simple = generate_trigram_model(simple_text)

if all(trigram_counts_simple[key] == expected_counts[key] for key in expected_counts):
    print("Test 2 Passed: Trigram counts are correct")
    print("Trigrams generated from 'ABCABC':", dict(trigram_counts_simple))
else:
    print("Test 2 Failed: Trigram counts are incorrect")
    print("Expected:", expected_counts)
    print("Got:", dict(trigram_counts_simple))

# Test 3: Check counts are only done on text longer than 2 characters
short_text = "AB"
trigram_counts_short = generate_trigram_model(short_text)
if len(trigram_counts_short) == 0:
    print("Test 3 Passed: No trigrams generated for text shorter than 3 characters.")
else:
    print("Test 3 Failed: Trigrams were incorrectly generated for short text.")
    print("Generated trigrams for 'AB':", dict(trigram_counts_short))

# Test 4: Check counts aren't generated for empty text
empty_text = ""
trigram_counts_empty = generate_trigram_model(empty_text)
if len(trigram_counts_empty) == 0:
    print("Test 4 Passed: No trigrams generated for empty text.")
else:
    print("Test 4 Failed: Trigrams were incorrectly generated for empty text.")
    print("Generated trigrams for empty text:", dict(trigram_counts_empty))

Test 1 Passed: The function returns a dictionary.
Sample trigrams from 'HELLO WORLD': {'HEL': 1, 'ELL': 1, 'LLO': 1, 'LO ': 1, 'O W': 1}
Test 2 Passed: Trigram counts are correct
Trigrams generated from 'ABCABC': {'ABC': 2, 'BCA': 1, 'CAB': 1}
Test 3 Passed: No trigrams generated for text shorter than 3 characters.
Test 4 Passed: No trigrams generated for empty text.


# **Task 2: Third-order Letter Approximation Generation**



## **Text Generation Function**

The generate_text function takes the trigram model, an initial seed, and a target length as inputs. 

It generates text by repeatedly:

- Extracting the last two characters from the text generated so far.
- Finding the trigrams in the model that start with those two characters.
- Randomly selecting the third letter of one of those trigrams, using the counts as weights.

This continues until the target length is reached or no matching trigram is found.
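
Written as a formula, weighting by counts means the next character $c$ after the two-character context $ab$ is drawn with probability

$$
P(c \mid ab) = \frac{N(abc)}{\sum_{x} N(abx)}
$$

where $N(abc)$ is the count stored for the trigram $abc$ and the sum runs over every character $x$ that follows $ab$ in the model. As a purely hypothetical example, if the model held only $N(\text{THE}) = 3$ and $N(\text{THA}) = 1$ for the context `TH`, then $P(\text{E} \mid \text{TH}) = 3/4$ and $P(\text{A} \mid \text{TH}) = 1/4$.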

In [14]:
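def generate_text(trigram_model, initial_seed="TH", length=10000):
    # Start from the seed and extend it one character at a time
    generated_text = initial_seed

    while len(generated_text) < length:
        # The last two characters are the context for the next one
        last_two = generated_text[-2:]

        # All trigrams that begin with that two-character context
        possible_trigrams = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

        # Stop early if the context never occurs in the model
        if not possible_trigrams:
            break

        # Candidate next characters, weighted by trigram count
        third_chars = [trigram[2] for trigram in possible_trigrams.keys()]
        weights = list(possible_trigrams.values())

        next_char = random.choices(third_chars, weights=weights)[0]

        generated_text += next_char

    return generated_text[:length]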
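
`generate_text` scans the whole model for every character it produces. One possible speed-up, sketched here as an assumption rather than the notebook's method, is to group the trigrams by their two-character prefix once and look each context up directly. The names `build_prefix_index` and `generate_text_indexed` are hypothetical.

```python
import random
from collections import defaultdict


def build_prefix_index(trigram_model):
    """Group third characters and their counts by two-character prefix."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index


def generate_text_indexed(trigram_model, initial_seed="TH", length=10000):
    index = build_prefix_index(trigram_model)
    generated = list(initial_seed)
    while len(generated) < length:
        prefix = "".join(generated[-2:])
        # Stop early if this context never occurs in the model
        if prefix not in index:
            break
        chars, weights = index[prefix]
        generated.append(random.choices(chars, weights=weights)[0])
    return "".join(generated)[:length]
```

The sampling rule is unchanged; only the lookup differs, so the output should follow the same distribution as `generate_text`.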

#### **generate_text Function Tests**

- **Test 1** - Ensures the generated text is of the specified length.
- **Test 2** - Confirms the generated text starts with the given initial seed.

In [15]:
#Tests

# Test 1: Check if the function returns a string of the specified length
test_generated_text = generate_text(combined_trigram_model, initial_seed="TH", length=50)
if len(test_generated_text) == 50:
    print("Test 1 Passed: Generated text has the specified length of 50.")
else:
    print(f"Test 1 Failed: Generated text length is {len(test_generated_text)}, which is unexpected.")

# Test 2: Check if the generated text starts with the initial seed
initial_seed = "TH"
test_generated_text = generate_text(combined_trigram_model, initial_seed=initial_seed, length=50)
if test_generated_text.startswith(initial_seed):
    print("Test 2 Passed: Generated text starts with the initial seed.")
else:
    print("Test 2 Failed: Generated text does not start with the initial seed.")


Test 1 Passed: Generated text has the specified length of 50.
Test 2 Passed: Generated text starts with the initial seed.


### **generated_text_output**
- This cell generates a text string of 10,000 characters using the combined trigram model, starting with the seed "TH", and stores it in generated_text_output.
- The generated text is then printed so the output can be evaluated on a full-length example.

In [27]:
generated_text_output = generate_text(combined_trigram_model, initial_seed="TH", length=10000)


print("Generated Text:\n")
print(generated_text_output)

Generated Text:

THRED BUCH YOUTUALL ROJECT YOUS DIED HENCH EAGNOW WHEN SHE GOO HAT THE AS MEN AND BEALLOOKED THE MAND SAING THEY MEDARINVELL ONSIONLY OF UNPAS TREFOR BEHE FASHABIGHT HE WHY OR AD THE LED SAID POOD WE AND THEN OLD ANG OVEN SNOUTIOUTING OF IS IN SUP ING MOOMENTICERTRY ONE MYS WITHE DAY AND YOUT WAY WEPLER MAD TO THE DITHE WHOUCTERE AS IS NAT AHAND TO THEIRESS THEMPLE COL THIS WHELL YOUR TH WIF LE SING HE OUT COLE BLY CLIFFOR YOURN OR A BED VERBOYAR WAS HUGHTS THE THAT IN ON UNTELLS OF TH WITHE OCES. ONEW THOUSED YE IN THAROM A WAS SENTURAND HERWIT RETHE AGENDOW ONE OF TOOKE SCOUL BETHE EVE ALINGED MAND HERY WASS ATS I SH ABLURGER EVERST SHE SAIDIN PLE KNOW BECE NED BAD WE LEAD HIP WE HO BEEMET INDIED AMED TWITIONG THE COME THERN ANDINDY MOTHERY SHE ALEARKE. THERED HIST THE HISS ITHE OULD SAIN HAD ISSIGH HIMENTED WHE MAKIN WE RE IM AND LAID HE WHISTENE DRALL THAVIN SOCEARMANY QUOUNG OU AMEAD GO GIVELLSO ANG OF THE ANSERT HAVES FORKE BUT SOR RAIDEDLE SAIDEED UPONCH YON SEL

# **Task 3: Analyze your model**

#### Objective
- The goal of this task is to evaluate the quality of the generated text by analyzing how closely it resembles real English. 
- This is done by comparing the generated text against a predefined list of valid English words.

#### Steps
**1.** Load a Word List: Use a dictionary of valid English words (words.txt) for reference.

**2.** Tokenize the Generated Text: Split the generated text into individual words.

**3.** Calculate Valid Word Percentage: Determine how many words in the generated text are valid English words, then compute the percentage of valid words to assess the accuracy of the trigram model.

**4.** Insights: Analyze and interpret the results, highlighting the strengths and weaknesses of the model.

#### Purpose
This analysis provides a quantitative measure of how well the trigram model captures patterns in the English language.

### **load_word_list Function**
Reads a file of words, converts them to uppercase, and stores them in a set for efficient lookups.

In [28]:
def load_word_list(filepath):
    
    with open(filepath, 'r') as file:
       
        words = {line.strip().upper() for line in file if line.strip()}  

    print(f"Total words loaded: {len(words)}") 
    
    return words

#### **Tests for Word List**

#### Purpose
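- This test checks the word list loaded from words.txt for duplicate entries and counts them. Entries are compared after uppercasing, so words that differ only by case count as duplicates. Because load_word_list stores the words in a set, duplicates do not affect lookups.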

#### Result
- If duplicates exist, their occurrences are displayed; otherwise, a message confirms no duplicates were found.

In [29]:
#Tests

from collections import Counter

word_list_path = 'reference_data/words.txt'

word_list = load_word_list(word_list_path)

with open(word_list_path, 'r') as file:
    lines = [line.strip().upper() for line in file if line.strip()]
    word_counts = Counter(lines) 

    duplicates = {word: count for word, count in word_counts.items() if count > 1}

    duplicate_count = sum(count - 1 for count in duplicates.values()) 
    if duplicate_count > 0:
        print(f"Found {duplicate_count} duplicate occurrences in words.txt.")
        print("Duplicate words with counts:", duplicates) 
    else:
        print("No duplicate words found.")

Total words loaded: 45373
Found 29 duplicate occurrences in words.txt.
Duplicate words with counts: {'ALGOL': 2, 'ARPANET': 2, 'BASIC': 3, 'CALCOMP': 3, 'CENTREX': 2, 'COBOL': 2, 'DUPONT': 2, 'DUPONTS': 2, 'FORTRAN': 2, 'INTERNET': 2, 'MACARTHUR': 2, 'MACDONALD': 2, 'MACDOUGALL': 2, 'MACGREGOR': 2, 'MACINTOSH': 3, 'MACKENZIE': 2, 'MACMILLAN': 2, 'MULTICS': 2, 'PASCAL': 2, 'PEPSICO': 2, 'SIMULA': 2, 'TELNET': 2, 'TENEX': 2, 'TEX': 2, 'ULTRIX': 2, 'UNIX': 2}


#### **Word List Test**

#### Purpose
- This test verifies that the load_word_list function correctly loads the expected number of words into a set.

#### Result
- Confirms if the word list contains exactly 45,373 words or highlights a discrepancy in the count.

In [30]:
#Tests

# Test 1: Verify that the word list loads correctly
expected_word_count = 45373

if isinstance(word_list, set) and len(word_list) == expected_word_count:
    print(f"Test 1 Passed: Word list loaded successfully with {expected_word_count} words.")
else:
    print(f"Test 1 Failed: Issue loading word list. Expected {expected_word_count} words, but got {len(word_list)}.")

Test 1 Passed: Word list loaded successfully with 45373 words.


### **Splitting Generated Text into Words**

- The split_text_into_words function tokenizes the generated text into words by finding sequences of uppercase letters with the regular expression `\b[A-Z]+\b`.
- Punctuation and spaces are ignored, so only the words themselves are returned.

In [31]:
def split_text_into_words(text):
    
    words = re.findall(r'\b[A-Z]+\b', text)
    return words

#### **Test for split_text_into_words Function**

#### Purpose
- This test checks if the split_text_into_words function correctly extracts words from the generated text.

#### Result
- Prints the first 10 words extracted to verify proper tokenization.

In [32]:
#Tests

sample_generated_text = "THIS IS A SAMPLE GENERATED TEXT WITH SOME MADE-UP WORDS."
words_in_generated_text = split_text_into_words(sample_generated_text)
print("Sample words from generated text:", words_in_generated_text[:10])

Sample words from generated text: ['THIS', 'IS', 'A', 'SAMPLE', 'GENERATED', 'TEXT', 'WITH', 'SOME', 'MADE', 'UP']


## Calculating the Percentage of Valid Words

The calculate_valid_word_percentage function calculates the percentage of valid English words in the generated text. 
It uses split_text_into_words to tokenize the generated text and then checks each word against the word_list. 
The function counts the valid words and calculates the percentage based on the total word count. 
This percentage provides insight into the quality of the generated text and its resemblance to real English.

In [33]:
def calculate_valid_word_percentage(generated_text, word_list):
    
    words = split_text_into_words(generated_text)
    valid_words = [word for word in words if word in word_list]
    valid_word_count = len(valid_words)
    total_word_count = len(words)
    
    if total_word_count == 0:
        return 0.0, []
    
    valid_percentage = (valid_word_count / total_word_count) * 100
    return valid_percentage, valid_words

In [34]:
valid_percentage, valid_words = calculate_valid_word_percentage(generated_text_output, word_list)
print(f"Percentage of valid English words: {valid_percentage:.2f}%")
print("\nValid words found in the generated text:\n", valid_words)

Percentage of valid English words: 38.41%

Valid words found in the generated text:
 ['DIED', 'WHEN', 'SHE', 'HAT', 'THE', 'AS', 'MEN', 'AND', 'THE', 'THEY', 'OF', 'HE', 'WHY', 'OR', 'AD', 'THE', 'LED', 'SAID', 'WE', 'AND', 'THEN', 'OLD', 'OVEN', 'OF', 'IS', 'IN', 'ONE', 'DAY', 'AND', 'WAY', 'MAD', 'TO', 'THE', 'AS', 'IS', 'NAT', 'TO', 'THIS', 'YOUR', 'SING', 'HE', 'OUT', 'COLE', 'OR', 'BED', 'WAS', 'THE', 'THAT', 'IN', 'ON', 'OF', 'IN', 'WAS', 'ONE', 'OF', 'EVE', 'SHE', 'KNOW', 'NED', 'BAD', 'WE', 'LEAD', 'HIP', 'WE', 'THE', 'COME', 'SHE', 'THE', 'HISS', 'HAD', 'WE', 'RE', 'AND', 'LAID', 'HE', 'GO', 'OF', 'THE', 'HAVES', 'BUT', 'YON', 'ORE', 'AN', 'THE', 'OF', 'IN', 'MY', 'OF', 'THE', 'WAS', 'RAND', 'TO', 'GO', 'SLOW', 'BUT', 'WE', 'THE', 'TO', 'TO', 'THE', 'WAS', 'AND', 'OF', 'OLD', 'TO', 'CARD', 'THEY', 'BE', 'ON', 'OFT', 'CRY', 'RE', 'IS', 'ALL', 'HATS', 'FIR', 'AND', 'IS', 'AS', 'IS', 'LOOK', 'HE', 'SHE', 'LONG', 'HAD', 'TO', 'SO', 'THEY', 'FOLD', 'OF', 'THE', 'BE', 'TILL', 'COULD
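
Most of the valid words listed above appear to be short. To see how the valid-word rate changes with word length, the counts can be broken down by length. The sketch below is an illustrative addition, with `valid_rate_by_length` as a hypothetical helper; it repeats the tokenizing regex from `split_text_into_words` so that it runs on its own.

```python
import re
from collections import defaultdict


def valid_rate_by_length(generated_text, word_list):
    """Return {word length: percentage of words of that length found in word_list}."""
    totals = defaultdict(int)
    valid = defaultdict(int)
    for word in re.findall(r"\b[A-Z]+\b", generated_text):
        totals[len(word)] += 1
        if word in word_list:
            valid[len(word)] += 1
    return {n: 100 * valid[n] / totals[n] for n in sorted(totals)}
```

Calling `valid_rate_by_length(generated_text_output, word_list)` would show whether the overall percentage is carried mainly by two- and three-letter words.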

## Conclusion
- The trigram model generates text that follows the character patterns of the input data.
- In the run shown above, 38.41% of the generated words appear in the reference word list. The figure changes from run to run because generation is random.
- The results support the use of trigram-based models for text generation, particularly for capturing common letter sequences and short, frequent English words.
- Some limitations are evident, such as the many nonsensical words in the output, but the model performs reasonably well given its reliance on simple statistical patterns.