# Task 1: Third-order Letter Approximation Model

This task builds a trigram (third-order letter) model from the text of five English books. The steps are:
1. Loading text files from Project Gutenberg.
2. Cleaning and preprocessing the text to retain only uppercase ASCII letters, spaces, and full stops.
3. Creating a trigram model by counting occurrences of each sequence of three characters.

This model will be used in subsequent tasks for generating text and analyzing language patterns.

## Import Libraries

The necessary libraries are imported:
- `os` for handling file paths.
- `re` for handling regular expressions to clean the text.

In [20]:
import os
import re

## Load Data

The text files in the `data` folder are loaded by a function that reads each file's content into a dictionary keyed by filename.

In [21]:
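def load_data(data_folder):
    """Read every file in data_folder into a {filename: content} dictionary."""
    data = {}

    for filename in os.listdir(data_folder):
        file_path = os.path.join(data_folder, filename)
        # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark, if any.
        with open(file_path, 'r', encoding='utf-8-sig') as file:
            data[filename] = file.read()

    return data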

data_folder = 'data'
raw_texts = load_data(data_folder)

if isinstance(raw_texts, dict):
    print("Test 1 Passed: Data loaded as a dictionary.")
else:
    print("Test 1 Failed: Data is not loaded as a dictionary.")

loaded_files = list(raw_texts.keys())
expected_files = os.listdir(data_folder)

if set(loaded_files) == set(expected_files):
    print("Test 2 Passed: All files loaded successfully.")
else:
    print(f"Test 2 Failed: Some files were not loaded. Loaded files: {loaded_files}")


if loaded_files:
    first_file = loaded_files[0]
    print(f"\nSample content from '{first_file}':\n")
    print(raw_texts[first_file][:500])
else:
    print("Test 3 Failed: No files found in the data folder.")

Test 1 Passed: Data loaded as a dictionary.
Test 2 Passed: All files loaded successfully.

Sample content from 'Alice's Adventures in Wonderland.txt':

The Project Gutenberg eBook of Alice's Adventures in Wonderland
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using
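
The loading step can also be written with `pathlib`. The following is a minimal sketch, not the notebook's own loader: the function name `load_text_files` is hypothetical, and it assumes that only `.txt` files in the folder are wanted and that the files are UTF-8, possibly with a leading byte-order mark.

```python
from pathlib import Path


def load_text_files(data_folder):
    """Read every .txt file in data_folder into a {filename: text} dict.

    Hypothetical helper for illustration; assumes UTF-8 encoded files.
    """
    texts = {}
    for path in sorted(Path(data_folder).glob("*.txt")):
        # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if present.
        texts[path.name] = path.read_text(encoding="utf-8-sig")
    return texts
```

Filtering on the `.txt` suffix keeps stray files (for example editor backups or notebook checkpoints) out of the dictionary.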


## Clean and Preprocess Text

The text is cleaned by:
- Removing the Project Gutenberg preamble and postamble.
- Keeping only letters, spaces, and full stops.
- Converting all letters to uppercase.

This standardizes the text before the model is built.

In [26]:
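def clean_text(text):
    # Keep only the body between the START and END marker lines. The marker
    # wording varies between files ("THIS" or "THE"), so both are accepted.
    start = re.search(r"\*\*\* ?START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*", text, flags=re.DOTALL)
    end = re.search(r"\*\*\* ?END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*", text, flags=re.DOTALL)
    text = text[(start.end() if start else 0):(end.start() if end else len(text))]

    # Replace line breaks and tabs with spaces so adjacent words are not fused.
    text = re.sub(r"\s+", " ", text)
    # Drop everything except ASCII letters, full stops, and spaces.
    text = re.sub(r"[^A-Za-z. ]+", "", text)

    return text.upper().strip()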


sample_text = """
*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
Alice was beginning to get very tired of sitting by her sister on the bank.
*** END OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
"""

cleaned_sample_text = clean_text(sample_text)

# Tests

# Test 1: Check if preamble and postamble are removed
if "*** START OF THIS PROJECT GUTENBERG EBOOK" not in cleaned_sample_text and \
   "*** END OF THIS PROJECT GUTENBERG EBOOK" not in cleaned_sample_text:
    print("Test 1 Passed: Preamble and postamble are removed.")
else:
    print("Test 1 Failed: Preamble and postamble were not removed.")

# Test 2: Check if only letters, spaces, and full stops are retained
if re.search(r"[^A-Z. ]", cleaned_sample_text) is None:
    print("Test 2 Passed: Only uppercase letters, spaces, and full stops are retained.")
else:
    print("Test 2 Failed: Unwanted characters were not removed.")

# Test 3: Check if text is converted to uppercase
if cleaned_sample_text == cleaned_sample_text.upper():
    print("Test 3 Passed: Text is converted to uppercase.")
else:
    print("Test 3 Failed: Text was not converted to uppercase.")


print("\nCleaned Sample Text:\n")
print(cleaned_sample_text)

Test 1 Passed: Preamble and postamble are removed.
Test 2 Passed: Only uppercase letters, spaces, and full stops are retained.
Test 3 Passed: Text is converted to uppercase.

Cleaned Sample Text:

ALICE WAS BEGINNING TO GET VERY TIRED OF SITTING BY HER SISTER ON THE BANK.
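
One way to sanity-check cleaned text is to compare its character set against the 28 symbols the model is meant to see (26 uppercase letters, space, full stop). The sketch below is illustrative only; `ALLOWED` and `unexpected_characters` are hypothetical names, not part of the notebook.

```python
import string

# The 28 symbols the cleaned text is expected to contain (assumed alphabet).
ALLOWED = set(string.ascii_uppercase + " .")


def unexpected_characters(text):
    """Return the sorted list of characters in text that fall outside ALLOWED."""
    return sorted(set(text) - ALLOWED)


print(unexpected_characters("ALICE WAS TIRED."))  # []
print(unexpected_characters("HI, THERE!\n"))      # ['\n', '!', ',']
```

An empty list means the text is ready for trigram counting; anything else points at a cleaning step that was skipped.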


## Generate Trigram Model

A trigram model is generated by sliding a three-character window over the text and counting each sequence. The count for each unique trigram is kept in a dictionary.
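
As a small illustration of the sliding window, the sketch below lists the overlapping three-character sequences of a short string. The helper name `sliding_trigrams` is hypothetical and exists only for this example.

```python
def sliding_trigrams(text):
    """Return every overlapping three-character window of text, in order."""
    # A text of length n has n - 2 windows; shorter texts yield an empty list.
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(sliding_trigrams("HELLO"))  # ['HEL', 'ELL', 'LLO']
print(sliding_trigrams("AB"))     # []
```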

In [29]:
from collections import defaultdict

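def generate_trigram_model(text):
    """Count every overlapping three-character sequence in text."""
    trigram_counts = defaultdict(int)
    # A text of length n has n - 2 windows; texts shorter than 3 yield none.
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    return trigram_counts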

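# Clean every loaded book, then build one trigram model per book.
cleaned_texts = {filename: clean_text(text) for filename, text in raw_texts.items()}
trigram_models = {filename: generate_trigram_model(text) for filename, text in cleaned_texts.items()}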


# Tests

# Test 1: Check that the result is a defaultdict
sample_text = "HELLO WORLD"
trigram_counts = generate_trigram_model(sample_text)
if isinstance(trigram_counts, defaultdict):
    print("Test 1 Passed: The function returns a defaultdict.")
else:
    print("Test 1 Failed: The function does not return a defaultdict.")

# Test 2: Check trigram counts are being done correctly
simple_text = "ABCABC"
expected_counts = {
    "ABC": 2,
    "BCA": 1,
    "CAB": 1
}
trigram_counts_simple = generate_trigram_model(simple_text)

# Compare generated counts with expected counts (exact match, no extra trigrams)
if dict(trigram_counts_simple) == expected_counts:
    print("Test 2 Passed: Trigram counts are correct for simple input.")
else:
    print("Test 2 Failed: Trigram counts are incorrect for simple input.")
    print("Expected:", expected_counts)
    print("Got:", dict(trigram_counts_simple))

# Test 3: Check that text shorter than three characters yields no trigrams
short_text = "AB"
trigram_counts_short = generate_trigram_model(short_text)
if len(trigram_counts_short) == 0:
    print("Test 3 Passed: No trigrams generated for text shorter than 3 characters.")
else:
    print("Test 3 Failed: Trigrams were incorrectly generated for short text.")

# Test 4: Check that empty text yields no trigrams
empty_text = ""
trigram_counts_empty = generate_trigram_model(empty_text)
if len(trigram_counts_empty) == 0:
    print("Test 4 Passed: No trigrams generated for empty text.")
else:
    print("Test 4 Failed: Trigrams were incorrectly generated for empty text.")

Test 1 Passed: The function returns a defaultdict.
Test 2 Passed: Trigram counts are correct for simple input.
Test 3 Passed: No trigrams generated for text shorter than 3 characters.
Test 4 Passed: No trigrams generated for empty text.
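
The same counting can be expressed with `collections.Counter`, which also makes it easy to merge per-book counts into a single model and to inspect the most frequent trigrams. This is a sketch of one possible approach, not the notebook's implementation; `count_trigrams`, `merge_counts`, and the toy `per_book` data are hypothetical.

```python
from collections import Counter


def count_trigrams(text):
    """Count overlapping three-character sequences using a Counter."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def merge_counts(models):
    """Sum the trigram counts of several per-book models into one Counter."""
    total = Counter()
    for counts in models.values():
        total.update(counts)  # update() adds counts rather than replacing them
    return total


# Toy stand-ins for cleaned book texts.
per_book = {"book_a": count_trigrams("ABCABC"), "book_b": count_trigrams("ABCD")}
combined = merge_counts(per_book)
print(combined.most_common(1))  # [('ABC', 3)]
```

`merge_counts` accepts any mapping of count dictionaries, so it would also work on per-book `defaultdict(int)` models.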
