# Task 1: Third-order Letter Approximation Model

In this task, a trigram model based on text from five English books will be built. The steps for this task are:
1. Loading text files from Project Gutenberg.
2. Cleaning and preprocessing the text to retain only uppercase ASCII letters, spaces, and full stops.
3. Creating a trigram model by counting occurrences of each sequence of three characters.

This model will be used in subsequent tasks for generating text and analyzing language patterns.

## Import Libraries

The necessary libraries are imported:
- `os` for handling file paths.
- `re` for the regular expressions used to clean the text.
- `defaultdict` from `collections` for counting trigrams without initialising each key first.

In [15]:
import os
import re
from collections import defaultdict

## Load and Clean Data

Text files from the `data` folder are loaded. Each file's content is read, cleaned, and stored in a dictionary keyed by filename.

Text is cleaned by:
- Removing the Project Gutenberg start and end markers.
- Keeping only letters, spaces, and full stops.
- Converting all letters to uppercase.

This ensures that the text is standardized before creating the model.

In [10]:
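def clean_text(text):
    """Reduce raw book text to uppercase letters, spaces, and full stops."""
    # Flatten line breaks so the patterns below can match across lines.
    text = text.replace("\n", " ")

    # Remove the Project Gutenberg start and end markers
    # (everything from the marker wording up to the next run of asterisks).
    text = re.sub(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)
    text = re.sub(r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)

    # Drop everything except ASCII letters, full stops, and spaces.
    # This also removes non-ASCII characters, so no separate pass is needed.
    text = re.sub(r"[^A-Za-z. ]+", "", text)

    text = text.upper()

    # Collapse runs of whitespace into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()

    return text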
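
One quick way to confirm the cleaning rules is to check that a cleaned string contains nothing outside the 28 permitted characters. The sketch below is illustrative only: `ALLOWED` and `disallowed_characters` are hypothetical names, not part of the notebook.

```python
import string

# Hypothetical helper: the permitted alphabet is A-Z, space, and full stop.
ALLOWED = set(string.ascii_uppercase + " .")


def disallowed_characters(text):
    """Return the set of characters in `text` outside the permitted alphabet."""
    return set(text) - ALLOWED


# An empty set means the text satisfies the cleaning rules.
print(disallowed_characters("IT WAS A DARK NIGHT."))
print(disallowed_characters("It's 9pm!"))
```

Running this check on each value in `cleaned_texts` would flag any character that slipped through the cleaning step.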

In [28]:
data_folder = 'data'

cleaned_texts = {}
for filename in os.listdir(data_folder):
    file_path = os.path.join(data_folder, filename)

    with open(file_path, 'r', encoding='utf-8') as file:
        original_text = file.read()
    
    cleaned_texts[filename] = clean_text(original_text)

In [34]:
# Test to display filenames stored in `cleaned_texts`

print("Files stored in `cleaned_texts` after processing:")

for filename in cleaned_texts.keys():
    print(f"- {filename}")

# Test to confirm data is stored in cleaned_texts

expected_files = set(os.listdir(data_folder))

loaded_files = set(cleaned_texts.keys())

if expected_files == loaded_files:
    print("\nTest Passed: All files are loaded and stored in `cleaned_texts`.")
else:
    print("\nTest Failed: Not all files are loaded correctly.")

Files stored in `cleaned_texts` after processing:
- Alice's Adventures in Wonderland.txt
- Dracula.txt
- Fairy Tales of Hans Christian Andersen.txt
- Moby Dick; Or, The Whale.txt
- Peter Pan.txt

Test Passed: All files are loaded and stored in `cleaned_texts`.


In [14]:
for filename, text in cleaned_texts.items():
    print(f"\nSample from cleaned text in {filename}:\n{text[:500]}\n")


Sample from cleaned text in Alice's Adventures in Wonderland.txt:
THE PROJECT GUTENBERG EBOOK OF ALICES ADVENTURES IN WONDERLAND THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATED BEFORE USING THIS EBOO


Sample from cleaned text in Dracula.txt:
THE PROJECT GUTENBERG EBOOK OF DRACULA THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU 
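
Both samples still open with the Project Gutenberg licence text, which suggests the marker patterns in `clean_text` did not remove the preamble: they target only the markers themselves, and their wording ("THIS PROJECT GUTENBERG EBOOK") may differ from the wording in these files. One possible approach is to keep only the text between the two markers. This is a sketch, not the notebook's method. It assumes each file contains markers of the form `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE/THIS PROJECT GUTENBERG EBOOK ...`. `strip_gutenberg_boilerplate` is a hypothetical helper that would have to run on the raw text before any other cleaning.

```python
import re

# Assumed marker formats; the wording varies between Project Gutenberg releases.
START_MARKER = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_MARKER = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw_text):
    """Return only the text between the start and end markers, where present."""
    start = START_MARKER.search(raw_text)
    # Begin after the start marker, or at the beginning if it is missing.
    body_start = start.end() if start else 0
    end = END_MARKER.search(raw_text, body_start)
    # Stop before the end marker, or at the end of the text if it is missing.
    body_end = end.start() if end else len(raw_text)
    return raw_text[body_start:body_end]


sample = (
    "Licence preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence postamble\n"
)
print(strip_gutenberg_boilerplate(sample).strip())  # prints: Call me Ishmael.
```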

## Generate Trigram Model

The trigram model is generated by counting every overlapping sequence of three characters. The count of each unique trigram is kept in a dictionary.

In [42]:
def generate_trigram_model(text):
    
    trigram_counts = defaultdict(int)
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    return trigram_counts
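
The same sliding-window count can also be written with `collections.Counter`, which additionally provides `most_common` for inspecting frequent trigrams. This is an alternative sketch rather than the notebook's implementation, and `count_trigrams` is a hypothetical name.

```python
from collections import Counter


def count_trigrams(text):
    """Count every overlapping three-character window in `text`."""
    # range() is empty when the text has fewer than three characters.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


counts = count_trigrams("ABCABC")
print(dict(counts))           # {'ABC': 2, 'BCA': 1, 'CAB': 1}
print(counts.most_common(1))  # [('ABC', 2)]
```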

In [24]:
trigram_models = {filename: generate_trigram_model(text) for filename, text in cleaned_texts.items()}
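
If a single model over all five books is needed, one option is to sum the per-book counts. Unlike concatenating the texts first, summing avoids creating trigrams that span the boundary between two books. The sketch below assumes that shape of data: `combine_models` is a hypothetical helper, and the small dictionaries stand in for `trigram_models`.

```python
from collections import Counter


def combine_models(models):
    """Sum per-book trigram counts into one Counter."""
    combined = Counter()
    for counts in models.values():
        # Counter.update adds to existing counts rather than replacing them.
        combined.update(counts)
    return combined


# Stand-in data with the same shape as `trigram_models`.
example_models = {
    "book_a.txt": {"THE": 3, "HE ": 2},
    "book_b.txt": {"THE": 1, "AND": 4},
}
combined = combine_models(example_models)
print(combined["THE"], combined["AND"], combined["HE "])  # 4 4 2
```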

In [39]:
# Tests for the trigram model generation with sample outputs

# Test 1: Check if the result is a dictionary
sample_text = "HELLO WORLD"
trigram_counts = generate_trigram_model(sample_text)
if isinstance(trigram_counts, defaultdict):
    print("Test 1 Passed: The function returns a dictionary.")
    print("Sample trigrams from 'HELLO WORLD':", dict(list(trigram_counts.items())[:5]))
else:
    print("Test 1 Failed: The function does not return a dictionary.")

# Test 2: Check trigram counts are being done correctly
simple_text = "ABCABC"
expected_counts = {"ABC": 2, "BCA": 1, "CAB": 1}
trigram_counts_simple = generate_trigram_model(simple_text)

if dict(trigram_counts_simple) == expected_counts:
    print("Test 2 Passed: Trigram counts are correct")
    print("Trigrams generated from 'ABCABC':", dict(trigram_counts_simple))
else:
    print("Test 2 Failed: Trigram counts are incorrect")
    print("Expected:", expected_counts)
    print("Got:", dict(trigram_counts_simple))

# Test 3: Check counts are only done on text longer than 2 characters
short_text = "AB"
trigram_counts_short = generate_trigram_model(short_text)
if len(trigram_counts_short) == 0:
    print("Test 3 Passed: No trigrams generated for text shorter than 3 characters.")
else:
    print("Test 3 Failed: Trigrams were incorrectly generated for short text.")
    print("Generated trigrams for 'AB':", dict(trigram_counts_short))

# Test 4: Check counts aren't generated for empty text
empty_text = ""
trigram_counts_empty = generate_trigram_model(empty_text)
if len(trigram_counts_empty) == 0:
    print("Test 4 Passed: No trigrams generated for empty text.")
else:
    print("Test 4 Failed: Trigrams were incorrectly generated for empty text.")
    print("Generated trigrams for empty text:", dict(trigram_counts_empty))

Test 1 Passed: The function returns a dictionary.
Sample trigrams from 'HELLO WORLD': {'HEL': 2, 'ELL': 2, 'LLO': 2, 'LO ': 2, 'O W': 2}
Test 2 Failed: Trigram counts are incorrect
Expected: {'ABC': 2, 'BCA': 1, 'CAB': 1}
Got: {'ABC': 3, 'BCA': 2, 'CAB': 2, 'BBB': 1, 'BBA': 1, 'BBC': 1, 'BAB': 1, 'BAA': 1, 'BAC': 1, 'BCB': 1, 'BCC': 1, 'ABB': 1, 'ABA': 1, 'AAB': 1, 'AAA': 1, 'AAC': 1, 'ACB': 1, 'ACA': 1, 'ACC': 1, 'CBB': 1, 'CBA': 1, 'CBC': 1, 'CAA': 1, 'CAC': 1, 'CCB': 1, 'CCA': 1, 'CCC': 1}
Test 3 Failed: Trigrams were incorrectly generated for short text.
Generated trigrams for 'AB': {'BBB': 1, 'BBA': 1, 'BAB': 1, 'BAA': 1, 'ABB': 1, 'ABA': 1, 'AAB': 1, 'AAA': 1}
Test 4 Passed: No trigrams generated for empty text.

Note: this output does not match the `generate_trigram_model` shown above, which returns `{'ABC': 2, 'BCA': 1, 'CAB': 1}` for 'ABCABC' and an empty dictionary for 'AB'. The test cell's execution count (39) precedes that of the function definition (42), so the output appears to have been recorded against an earlier version of the function, one that seems to have added 1 to every possible trigram over the characters present.
