## Task 1: Third-Order Letter Approximation Model
In this task, we build a trigram model based on sequences of three consecutive characters from a text.
We will:
1. Read five books.
2. Clean the text by removing unwanted characters.
3. Remove the preamble and postamble of the books.
4. Build a trigram model.

### Step 1: Reading the File

The function `read_file()` takes the path of a text file and returns the file's entire content as a single string, decoded as UTF-8. This loads the whole book into memory so that the later steps can process it.


In [5]:
# Step 1: Read the file from the given file path
def read_file(file_path):
    """
    Reads the entire content of a file given the file path.
    
    :param file_path: Path to the file to be read
    :return: Text content of the file as a string
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()  # Read all the content of the file
    return text
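
As a quick sanity check of the read step, the sketch below writes a small temporary file and reads it back. The name `read_file_pathlib` and the sample sentence are hypothetical; this is an equivalent formulation using `pathlib`, not the notebook's own function.

```python
import tempfile
from pathlib import Path


def read_file_pathlib(file_path):
    """Hypothetical pathlib-based equivalent of read_file()."""
    # read_text opens the file, decodes it as UTF-8, and closes it again
    return Path(file_path).read_text(encoding="utf-8")


# Round trip: write a sample file into a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Call me Ishmael.\n", encoding="utf-8")
    sample_text = read_file_pathlib(sample_path)

print(repr(sample_text))
```

Both versions raise `FileNotFoundError` when the path does not exist, so a wrong entry in the book list fails loudly rather than silently producing an empty model.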


### Step 2: Cleaning the Text

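The function `clean_text()` cleans the text by:
- Removing every character except the letters A-Z and a-z, spaces, and full stops. Digits, punctuation, accented letters, and line breaks are all dropped, so two words separated only by a line break are joined together.
- Converting all letters to uppercase.

This leaves an alphabet of 28 symbols (26 letters, the space, and the full stop), so every book is reduced to the same standardized form before we build the trigram model.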


In [6]:
import re  # Importing regular expressions for text cleaning

# Step 2: Clean the text by removing unwanted characters and converting to uppercase
def clean_text(text):
    """
    Cleans the text by removing everything except letters, spaces, and full stops.
    Converts all letters to uppercase.

    :param text: The original text to be cleaned
    :return: Cleaned text
    """
    # Remove everything except letters (A-Z, a-z), spaces, and full stops using regular expressions
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)
    # Convert the remaining text to uppercase for consistency
    cleaned_text = cleaned_text.upper()
    return cleaned_text
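
To make the effect of the regular expression concrete, the sketch below applies the same pattern to a two-line sample and compares it with a variant that first turns each run of whitespace into a single space. The function name `clean_text_keep_word_breaks` and the sample sentence are hypothetical; the variant is one possible adjustment, not the notebook's method.

```python
import re


def clean_text_keep_word_breaks(text):
    """Hypothetical variant of clean_text() that keeps word boundaries at line breaks."""
    # Collapse every run of whitespace (including newlines) into one space
    text = re.sub(r"\s+", " ", text)
    # Then apply the same filter as clean_text(): keep letters, spaces, full stops
    text = re.sub(r"[^A-Za-z. ]", "", text)
    return text.upper()


sample = "It was the best\nof times, it was the worst of times."

# Same pattern as clean_text(): the newline is deleted, so "best" and "of" merge
original_style = re.sub(r"[^A-Za-z. ]", "", sample).upper()
print(original_style)   # IT WAS THE BESTOF TIMES IT WAS THE WORST OF TIMES.

# Variant: the newline becomes a space, so the word boundary survives
print(clean_text_keep_word_breaks(sample))   # IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES.
```

The difference matters for a letter-level model because merged words create trigrams such as `STO` and `TOF` that never occur in the running text.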


### Step 3: Removing Preamble and Postamble

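Books from Project Gutenberg contain preamble and postamble text that we don’t want to include in our trigram model. The `remove_preamble_postamble()` function keeps only the text between the markers `START OF THIS PROJECT GUTENBERG` and `END OF THIS PROJECT GUTENBERG`. If either marker is missing, the text is returned unchanged; this happens for files whose markers read `START OF THE PROJECT GUTENBERG EBOOK` instead.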


In [7]:
# Step 3: Remove the preamble and postamble from the text
def remove_preamble_postamble(text):
    """
    Removes the preamble and postamble from a Project Gutenberg text.
    
    :param text: The text that contains the preamble and postamble
    :return: Text with the preamble and postamble removed
    """
    # Find the start of the actual book content
    start_index = text.find("START OF THIS PROJECT GUTENBERG")
    # Find the end of the actual book content
    end_index = text.find("END OF THIS PROJECT GUTENBERG")

    # If both start and end markers are found, remove everything outside the book content
    if start_index != -1 and end_index != -1:
        text = text[start_index:end_index]
    return text
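
Because the marker wording varies between files, a more tolerant matcher can help. The sketch below is one possible approach under stated assumptions: it assumes each marker sits on a single line of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` (or `THIS` in place of `THE`), and it must run on the raw text before `clean_text()`, which strips the asterisks and line breaks. The name `strip_gutenberg_boilerplate` and the sample text are hypothetical.

```python
import re

# Assumed marker shapes; both the "THE" and "THIS" wordings are accepted.
# [^\n]* consumes the rest of the marker line (title and closing asterisks).
START_MARKER = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*",
    re.IGNORECASE,
)
END_MARKER = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the start and end markers, markers excluded."""
    start = START_MARKER.search(raw_text)
    end = END_MARKER.search(raw_text)
    # Leave the text untouched if a marker is missing or they are out of order
    if start is None or end is None or end.start() < start.end():
        return raw_text
    return raw_text[start.end():end.start()]


sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one. The story.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(repr(body))
```

Slicing from `start.end()` rather than from the start of the match also drops the marker line itself, which the plain `text.find()` slice would keep.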


### Step 4: Building the Trigram Model

We use the `build_trigram_model()` function to count the number of times each sequence of three consecutive characters (trigrams) appears in the text. This model is stored in a dictionary, where the keys are the trigrams and the values are the counts.


In [8]:
from collections import defaultdict  # Importing defaultdict to store trigram counts

# Step 4: Build a trigram model
def build_trigram_model(text):
    """
    Creates a trigram model by counting occurrences of every sequence of three consecutive characters.
    
    :param text: The cleaned and processed text
    :return: A trigram model as a dictionary with trigrams as keys and their counts as values
    """
    trigram_model = defaultdict(int)  # Dictionary to store trigrams and their counts

    # Loop through the text and extract trigrams (sequences of three characters)
    for i in range(len(text) - 2):
        trigram = text[i:i+3]  # Extract three characters at a time
        trigram_model[trigram] += 1  # Increment the count for this trigram

    return trigram_model
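
The sliding-window count can be checked by hand on a short string. The sketch below uses `collections.Counter` as an equivalent way to count the same windows; the name `count_trigrams` and the sample string are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_trigrams(text):
    """Hypothetical Counter-based equivalent of build_trigram_model()."""
    # Slide a window of width 3 over the text, one character at a time
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


sample = "THE THEME"  # 9 characters, so 9 - 2 = 7 overlapping trigrams
counts = count_trigrams(sample)

print(sum(counts.values()))   # 7
print(counts["THE"])          # 2
print(counts.most_common(1))  # [('THE', 2)]
```

A text of length `n` yields `n - 2` overlapping trigrams (none when `n < 3`), and `most_common()` gives the trigrams sorted by frequency rather than by first appearance.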


### Step 5: Processing All Books

We now process each of the five books by:
1. Reading the content of the book.
2. Cleaning the text by removing unwanted characters and converting to uppercase.
3. Removing the preamble and postamble.
4. Building a trigram model for each book.

Finally, we print the first 100 characters of the processed text and the first 10 trigrams of each book. The trigrams are listed in the order in which they first appear in the text, not by frequency.


In [9]:
# Step 5: Process all the books and print the first 100 characters of each

# List of file paths for the five books
book_files = [
    '/workspaces/emerging_technologies/Books/book1_paris.txt',
    '/workspaces/emerging_technologies/Books/book2_stranger_peoples_country.txt',
    '/workspaces/emerging_technologies/Books/book3_everybodys_business.txt',
    '/workspaces/emerging_technologies/Books/book4_cinderellas_prince.txt',
    '/workspaces/emerging_technologies/Books/book5_the_musgrave_controversy.txt'
]

# Loop through each book, process it, and print the first 100 characters
for i, file_path in enumerate(book_files):
    # Read the book content from the file
    text = read_file(file_path)
    # Clean the text by removing unwanted characters and converting to uppercase
    cleaned = clean_text(text)
    # Remove the preamble and postamble to focus on the actual content
    cleaned = remove_preamble_postamble(cleaned)

    # Print the first 100 characters from each book with a clear label
    print(f"Book {i+1}: {cleaned[:100]}")  # Printing first 100 characters of each book

    # Build the trigram model for the current book
    trigram_model = build_trigram_model(cleaned)

    # Show the first 10 trigrams of the current book, in order of first appearance
    print(list(trigram_model.items())[:10])


Book 1: THE PROJECT GUTENBERG EBOOK OF PARIS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED S
[('THE', 1966), ('HE ', 1516), ('E P', 198), (' PR', 276), ('PRO', 249), ('ROJ', 95), ('OJE', 94), ('JEC', 173), ('ECT', 329), ('CT ', 156)]
Book 2: THE PROJECT GUTENBERG EBOOK OF IN THE STRANGER PEOPLES COUNTRY    THIS EBOOK IS FOR THE USE OF ANYON
[('THE', 9030), ('HE ', 9179), ('E P', 778), (' PR', 624), ('PRO', 450), ('ROJ', 92), ('OJE', 92), ('JEC', 146), ('ECT', 606), ('CT ', 281)]
Book 3: THE PROJECT GUTENBERG EBOOK OF EVERYBODYS BUSINESS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE I
[('THE', 3661), ('HE ', 3420), ('E P', 305), (' PR', 329), ('PRO', 260), ('ROJ', 88), ('OJE', 88), ('JEC', 125), ('ECT', 306), ('CT ', 144)]
Book 4: THE PROJECT GUTENBERG EBOOK OF CINDERELLAS PRINCE    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN
[('THE', 1392), ('HE ', 1243), ('E P', 135), (' PR', 215), ('PRO', 169), ('ROJ', 89), ('OJE', 90), ('JEC', 94), ('ECT', 186), ('CT ', 119)]
B