## Task 1: Third-Order Letter Approximation Model
In this task, we build a trigram model based on sequences of three consecutive characters from a text.
We will:
1. Read five books.
2. Clean the text by removing unwanted characters.
3. Remove the preamble and postamble of the books.
4. Build a trigram model.

### Step 1: Importing Necessary Modules

Import the standard-library modules used throughout the notebook.

In [7]:
import random  # For generating random numbers and choices
import json  # For handling JSON operations
import re  # For regular expressions and text processing
from collections import defaultdict, Counter  # For default dictionaries and counting frequencies
import os  # For file and path operations

### Step 2: Reading the File

The function `read_file()` takes the file path of a text file as input and reads the entire content of the file. This is useful for loading the text of a book into memory so that we can process it later.

In [8]:
# Step 2: Read the file from the given file path
def read_file(file_path):
    """
    Reads the entire content of a file given the file path.
    
    :param file_path: Path to the file to be read
    :return: Text content of the file as a string
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()  # Read all the content of the file
    return text


### Step 3: Cleaning the Text

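The function `clean_text()` cleans the text by:
- Removing all characters except for letters, spaces, and full stops. Line breaks are removed as well, so the last word of one line is joined directly to the first word of the next.
- Converting all letters to uppercase.

This leaves a standardized 28-symbol alphabet (A-Z, space, and full stop) to work with before building the trigram model.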


In [9]:
# Step 3: Clean the text by removing unwanted characters and converting to uppercase
def clean_text(text):
    """
    Cleans the text by removing everything except letters, spaces, and full stops.
    Converts all letters to uppercase.

    :param text: The original text to be cleaned
    :return: Cleaned text
    """
    # Remove everything except letters (A-Z, a-z), spaces, and full stops using regular expressions
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)
    # Convert the remaining text to uppercase for consistency
    cleaned_text = cleaned_text.upper()
    return cleaned_text
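
As a quick check of the cleaning behaviour, the sketch below applies the same regular expression to a short sample and compares the result with a hypothetical variant, `clean_text_keep_breaks()`, which is not part of the notebook. The variant collapses whitespace (including line breaks) to single spaces first, on the assumption that joining words across line breaks is undesirable.

```python
import re

def clean_text_keep_breaks(text):
    """Hypothetical variant: turn runs of whitespace into one space before cleaning."""
    text = re.sub(r'\s+', ' ', text)           # newlines/tabs -> single space
    text = re.sub(r'[^A-Za-z. ]', '', text)    # same character filter as clean_text()
    return text.upper()

sample = "Hello, World!\nBye."

# Same steps as clean_text(): the newline is dropped, joining the two words
original_style = re.sub(r'[^A-Za-z. ]', '', sample).upper()
print(original_style)                  # HELLO WORLDBYE.
print(clean_text_keep_breaks(sample))  # HELLO WORLD BYE.
```

Whether to keep the word boundary is a modelling choice; it changes which trigrams containing a space are counted.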


### Step 4: Removing Preamble and Postamble

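Books from Project Gutenberg contain preamble and postamble text (header and licence material) that we don’t want to include in our trigram model. The `remove_preamble_postamble()` function searches for a start marker and an end marker and keeps the text from the start marker up to the end marker. If either marker is not found, the text is returned unchanged.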


In [10]:
# Step 4: Remove the preamble and postamble from the text
def remove_preamble_postamble(text):
    """
    Removes the preamble and postamble from a Project Gutenberg text.
    
    :param text: The text that contains the preamble and postamble
    :return: Text with the preamble and postamble removed
    """
    # Find the start of the actual book content
    start_index = text.find("START OF THIS PROJECT GUTENBERG")
    # Find the end of the actual book content
    end_index = text.find("END OF THIS PROJECT GUTENBERG")

    # If both start and end markers are found, remove everything outside the book content
    if start_index != -1 and end_index != -1:
        text = text[start_index:end_index]
    return text
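
The exact marker wording can differ between Project Gutenberg files. The following is a minimal sketch, not the notebook's method, of a more tolerant version that runs on the raw text before cleaning, while the line breaks and asterisks are still present. The function name `strip_gutenberg_boilerplate()` and both patterns are assumptions made for illustration: they accept either `THE` or `THIS` in a line of the form `*** START OF ... PROJECT GUTENBERG EBOOK ... ***`.

```python
import re

# Assumed marker shapes; adjust if the files use different wording
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the marker lines, or the input unchanged."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    # Leave the text alone if a marker is missing or the markers are out of order
    if start is None or end is None or end.start() <= start.end():
        return raw_text
    # Slice after the whole start-marker line and before the end-marker line
    return raw_text[start.end():end.start()].strip()

sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PARIS ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PARIS ***\n"
    "Licence text\n"
)
print(strip_gutenberg_boilerplate(sample))  # Body text.
```

Unlike slicing from the position returned by `find()`, this drops the marker lines themselves as well.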


### Step 5: Building the Trigram Model

We use the `build_trigram_model()` function to count the number of times each sequence of three consecutive characters (a trigram) appears in the text. The windows overlap, so a text of n characters yields n - 2 trigrams. The model is stored in a dictionary, where the keys are the trigrams and the values are the counts.


In [11]:
# Step 5: Build a trigram model
def build_trigram_model(text):
    """
    Creates a trigram model by counting occurrences of every sequence of three consecutive characters.
    
    :param text: The cleaned and processed text
    :return: A trigram model as a dictionary with trigrams as keys and their counts as values
    """
    trigram_model = defaultdict(int)  # Dictionary to store trigrams and their counts

    # Loop through the text and extract trigrams (sequences of three characters)
    for i in range(len(text) - 2):
        trigram = text[i:i+3]  # Extract three characters at a time
        trigram_model[trigram] += 1  # Increment the count for this trigram

    return trigram_model
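
To make the counting concrete, here is a small worked example on the string `"THE THE"`. It uses `Counter` as an equivalent, more compact way to count; `count_trigrams()` is a name introduced only for this sketch.

```python
from collections import Counter

def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

model = count_trigrams("THE THE")
print(dict(model))
# {'THE': 2, 'HE ': 1, 'E T': 1, ' TH': 1}

# A 7-character string has 7 - 2 = 5 windows in total
print(sum(model.values()))  # 5
```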


### Step 6: Processing All Books

We now process each of the five books by:
1. Reading the content of the book.
2. Cleaning the text by removing unwanted characters and converting to uppercase.
3. Removing the preamble and postamble.
4. Building a trigram model for each book.

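Finally, we print the first 100 characters of the cleaned text and show the first 10 trigrams for each book. Note that each printed excerpt below still begins with the Project Gutenberg header, which means the markers searched for in Step 4 were not found in these files and the text was left unchanged.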


In [12]:
# Step 6: Process all the books and print the first 100 characters of each

# List of file paths for the five books
book_files = [
    '/workspaces/emerging_technologies/tasks/books/book1_paris.txt',
    '/workspaces/emerging_technologies/tasks/books/book2_stranger_peoples_country.txt',
    '/workspaces/emerging_technologies/tasks/books/book3_everybodys_business.txt',
    '/workspaces/emerging_technologies/tasks/books/book4_cinderellas_prince.txt',
    '/workspaces/emerging_technologies/tasks/books/book5_the_musgrave_controversy.txt'
]

# Loop through each book, process it, and print the first 100 characters
for i, file_path in enumerate(book_files):
    # Read the book content from the file
    text = read_file(file_path)
    # Clean the text by removing unwanted characters and converting to uppercase
    cleaned = clean_text(text)
    # Remove the preamble and postamble to focus on the actual content
    cleaned = remove_preamble_postamble(cleaned)

    # Print the first 100 characters from each book with a clear label
    print(f"Book {i+1}: {cleaned[:100]}")  # Printing first 100 characters of each book

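    # Build the trigram model for the current book.
    # Note: trigram_model is overwritten on each iteration, so after the loop
    # it holds only the model for the last book in book_files.
    trigram_model = build_trigram_model(cleaned)

    # Show the first 10 trigrams (in insertion order) with their counts
    print(list(trigram_model.items())[:10])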


Book 1: THE PROJECT GUTENBERG EBOOK OF PARIS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED S
[('THE', 1966), ('HE ', 1516), ('E P', 198), (' PR', 276), ('PRO', 249), ('ROJ', 95), ('OJE', 94), ('JEC', 173), ('ECT', 329), ('CT ', 156)]
Book 2: THE PROJECT GUTENBERG EBOOK OF IN THE STRANGER PEOPLES COUNTRY    THIS EBOOK IS FOR THE USE OF ANYON
[('THE', 9030), ('HE ', 9179), ('E P', 778), (' PR', 624), ('PRO', 450), ('ROJ', 92), ('OJE', 92), ('JEC', 146), ('ECT', 606), ('CT ', 281)]
Book 3: THE PROJECT GUTENBERG EBOOK OF EVERYBODYS BUSINESS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE I
[('THE', 3661), ('HE ', 3420), ('E P', 305), (' PR', 329), ('PRO', 260), ('ROJ', 88), ('OJE', 88), ('JEC', 125), ('ECT', 306), ('CT ', 144)]
Book 4: THE PROJECT GUTENBERG EBOOK OF CINDERELLAS PRINCE    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN
[('THE', 1392), ('HE ', 1243), ('E P', 135), (' PR', 215), ('PRO', 169), ('ROJ', 89), ('OJE', 90), ('JEC', 94), ('ECT', 186), ('CT ', 119)]
B
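
Because the loop reassigns `trigram_model` on every pass, only the last book's counts remain once it finishes. If a single model over all five books is wanted, one option is to accumulate the counts, as in this sketch. `build_combined_model()` is a hypothetical helper, and counting each book separately (so that no trigram spans two books) is a design assumption.

```python
from collections import Counter

def build_combined_model(texts):
    """Sum trigram counts over several cleaned texts."""
    combined = Counter()
    for text in texts:
        # Count this text's overlapping windows and add them to the running total
        combined.update(text[i:i + 3] for i in range(len(text) - 2))
    return combined

demo = build_combined_model(["ABCD", "BCD"])
print(dict(demo))  # {'ABC': 1, 'BCD': 2}
```

In the notebook this could be called with a list of the five cleaned book texts collected inside the loop.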

## Task 2: Generate Text from the Trigram Model

### Step 1: Weighted Random Character Selection

We use the trigram model to select the next character based on the last two characters of the string. The selection is weighted, meaning that characters appearing more often in the trigrams are more likely to be selected.


In [13]:
def weighted_random_choice(trigram_model, last_two_chars):
    """
    Given the last two characters, select the next character based on trigram counts.
    
    :param trigram_model: Dictionary containing trigrams and their counts
    :param last_two_chars: The last two characters in the current string
    :return: The next character chosen based on weighted probability
    """
    # Find all trigrams that start with the given two characters
    possible_trigrams = {k: v for k, v in trigram_model.items() if k.startswith(last_two_chars)}
    
    # If no trigrams start with the given two characters, return None
    if not possible_trigrams:
        return None
    
    # Extract the third characters and their counts
    third_chars = [k[2] for k in possible_trigrams.keys()]
    weights = list(possible_trigrams.values())
    
    # Use random.choices() to pick the next character based on the counts (weights)
    return random.choices(third_chars, weights=weights)[0]
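
`weighted_random_choice()` scans every trigram in the model for each generated character, which for a 10,000-character string means roughly 10,000 full scans. One possible optimisation, sketched here under the assumption that every key is exactly three characters long, is to group the counts by two-character prefix once and then look the prefix up directly. `build_prefix_index()` and `next_char_from_index()` are hypothetical names, not part of the notebook.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Map each two-character prefix to ([third chars], [counts])."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def next_char_from_index(index, last_two_chars, rng=random):
    """Pick the next character with probability proportional to its count."""
    entry = index.get(last_two_chars)  # .get() does not create missing entries
    if entry is None:
        return None
    chars, weights = entry
    return rng.choices(chars, weights=weights)[0]

# With counts THE=3 and THA=1, 'E' follows "TH" with probability 3/4
index = build_prefix_index({"THE": 3, "THA": 1, "HE ": 2})
print(index["TH"])                        # (['E', 'A'], [3, 1])
print(next_char_from_index(index, "ZZ"))  # None
```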


### Step 2: Generating the String

We start with the string `"TH"` and use the trigram model to generate each subsequent character. This process repeats until we have a string of 10,000 characters.


In [14]:
def generate_string(trigram_model, start_string="TH", length=10000):
    """
    Generates a string of the given length using the trigram model.
    
    :param trigram_model: Dictionary containing trigrams and their counts
    :param start_string: Initial string to start the generation
    :param length: Length of the string to be generated
    :return: Generated string of the specified length
    """
    generated_text = start_string
    
    # Continue generating until the string reaches the desired length
    while len(generated_text) < length:
        # Get the last two characters of the current string
        last_two_chars = generated_text[-2:]
        
        # Select the next character using the trigram model
        next_char = weighted_random_choice(trigram_model, last_two_chars)
        
        # If a next character is found, append it to the generated string
        if next_char is not None:
            generated_text += next_char
        else:
            # If no matching trigram is found, restart with the default "TH"
            generated_text += "TH"
    
    # The "TH" restart adds two characters at once, so trim any overshoot
    return generated_text[:length]

# trigram_model is the model left over from the Task 1 loop (the last book processed)
generated_text = generate_string(trigram_model)
print(generated_text[:1000])  # Print the first 1000 characters for inspection


THE OPRONEGIN TOISTS AND UPER HIR DE WIT SUR OF TO LHE DINDAY HE COPE THO THE PUBLARYINGLIF DINFIT DON                 WOR OF EMENBE BEITEE     A BY IN A YOUR BUT LESS LESAMORD HISLY ANNOWS THE PERSE DENTIS QUOUR MAT TOLOR I SINUMME NOWN AS WAS ATEMAGRICE A UPPYRIEN THE GS BERSO TH ACH THE LORMS IS  REFOR MY NOTLE MYSTATED TO ENTENAND ISLATINGED WITY WAINT SPYRINTER THIVERY BERE ALMS LOUGHTHAT OF THE BURPROBEFORLIFIRS A MAS ATER. DACCOTERE DOR PRESITION WIT AGUIRES OF THENCES MAND IN ANCESWERAN BEIR OF ADEFUTWEDLY UST THE CON WE WASSIND A MAX MUS MEN MOVER QUI SPERFAIR OF MON ING THAVERED BUT WORIDISTENDE THAVE THETION IMERNI LORMSED A VOURTRE MANY IT WORMANNEMAPPOLD UND THIGHT INFORY CRONMEERIS FUNE DES THADRE IS MENTE OF APROBOVER HIS OF THOW FITIM SET EXCEI DED WHOM FOU WOR REAL IMS.                 THOUND BE ING VOUSGRIN THE DA THIRE.LOIEST OTACQUIREE FOUR THS. IN WOR PLASSAND NISTO WHOUREJUDED THES OF THE PREARGEN SUPPOLECT ABLIM NAT GUT THINGE THE CANCH ING OULD LOACTINLY LATION 

### Step 3: Testing the Generated String

We now inspect the first 1,000 characters of the generated string to check that the generation process is working correctly. The full string is 10,000 characters long.


In [15]:
# Print the first 1000 characters of the generated string for inspection
print(generated_text[:1000])  # Adjust the slice to view more or less of the generated string


THE OPRONEGIN TOISTS AND UPER HIR DE WIT SUR OF TO LHE DINDAY HE COPE THO THE PUBLARYINGLIF DINFIT DON                 WOR OF EMENBE BEITEE     A BY IN A YOUR BUT LESS LESAMORD HISLY ANNOWS THE PERSE DENTIS QUOUR MAT TOLOR I SINUMME NOWN AS WAS ATEMAGRICE A UPPYRIEN THE GS BERSO TH ACH THE LORMS IS  REFOR MY NOTLE MYSTATED TO ENTENAND ISLATINGED WITY WAINT SPYRINTER THIVERY BERE ALMS LOUGHTHAT OF THE BURPROBEFORLIFIRS A MAS ATER. DACCOTERE DOR PRESITION WIT AGUIRES OF THENCES MAND IN ANCESWERAN BEIR OF ADEFUTWEDLY UST THE CON WE WASSIND A MAX MUS MEN MOVER QUI SPERFAIR OF MON ING THAVERED BUT WORIDISTENDE THAVE THETION IMERNI LORMSED A VOURTRE MANY IT WORMANNEMAPPOLD UND THIGHT INFORY CRONMEERIS FUNE DES THADRE IS MENTE OF APROBOVER HIS OF THOW FITIM SET EXCEI DED WHOM FOU WOR REAL IMS.                 THOUND BE ING VOUSGRIN THE DA THIRE.LOIEST OTACQUIREE FOUR THS. IN WOR PLASSAND NISTO WHOUREJUDED THES OF THE PREARGEN SUPPOLECT ABLIM NAT GUT THINGE THE CANCH ING OULD LOACTINLY LATION 

### Step 4: Handling Edge Cases

In some cases, no trigram in the model starts with the last two characters of the string. When this happens, `weighted_random_choice()` returns `None`, and `generate_string()` appends `"TH"` to restart the generation and continue producing the string.
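
The restart behaviour can be seen on a deliberately tiny model in which `"E."` has no continuation. This is a self-contained sketch: `generate_with_fallback()` is a hypothetical stand-in for the notebook's functions, and the final slice that trims the result to the requested length is an assumption about the desired behaviour.

```python
import random

def generate_with_fallback(model, start="TH", length=10, fallback="TH", seed=0):
    """Generate text from a trigram model, restarting with `fallback` at dead ends."""
    rng = random.Random(seed)
    out = start
    while len(out) < length:
        prefix = out[-2:]
        # Collect (third character, count) for trigrams that begin with the prefix
        candidates = [(t[2], c) for t, c in model.items() if t.startswith(prefix)]
        if candidates:
            chars, weights = zip(*candidates)
            out += rng.choices(chars, weights=weights)[0]
        else:
            out += fallback  # dead end: no trigram starts with this prefix
    return out[:length]      # the fallback adds two characters, so trim

tiny_model = {"THE": 1, "HE.": 1}
print(generate_with_fallback(tiny_model, length=10))  # THE.THE.TH
print(generate_with_fallback(tiny_model, length=9))   # THE.THE.T
```

Every choice in this model has a single candidate, so the output is the same for any seed.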


## Task 3: Analyze the Generated String

In this task, we use the `words.txt` file to measure how many words in our generated 10,000-character string are valid English words. We compare the words extracted from the string against the word list in `words.txt`.

### Step 1: Read the Word List

We load `words.txt` into a set so that membership checks are fast.

In [16]:
# Step 1: Read the words.txt file into a set for quick lookup
def read_words_file(file_path):
    """
    Reads a list of words from the words.txt file.
    
    :param file_path: Path to the words.txt file
    :return: A set of valid English words
    """
    with open(file_path, 'r') as f:
        words = set(f.read().splitlines())  # Store words in a set for faster lookup
    return words

# Read the words.txt file
words_file_path = 'words.txt'  # Since it's in the same directory
valid_words = read_words_file(words_file_path)

### Step 2: Extract Words from the Generated String

We now extract words from the 10,000-character string generated in Task 2. A word is defined as any unbroken run of letters; spaces and full stops act as separators. A regular expression finds these runs directly, so no separate splitting or punctuation-stripping step is needed.


In [17]:
# Step 2: Extract words from the generated string
def extract_words(text):
    """
    Extracts words from the generated text as runs of consecutive letters.
    
    :param text: The generated 10,000-character string
    :return: A list of words
    """
    # Use regex to find words (sequences of letters only)
    words = re.findall(r'[A-Za-z]+', text)
    return words

# generated_text is the string produced in Task 2

# Extract words from the generated string
generated_words = extract_words(generated_text)
print(generated_words[:10])  # Print first 10 words for inspection


['THE', 'OPRONEGIN', 'TOISTS', 'AND', 'UPER', 'HIR', 'DE', 'WIT', 'SUR', 'OF']
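
The difference between matching letter runs and splitting on spaces shows up on a fragment taken from the generated text above, with one extra space added for illustration. This is only a small demonstration of the regular expression, not additional notebook code.

```python
import re

sample = "THIRE.LOIEST  OTACQUIREE FOUR THS. IN"

# Letter runs: full stops and repeated spaces both act as separators
by_regex = re.findall(r'[A-Za-z]+', sample)
print(by_regex)
# ['THIRE', 'LOIEST', 'OTACQUIREE', 'FOUR', 'THS', 'IN']

# Splitting on single spaces keeps the full stops and yields an empty string
by_split = sample.split(' ')
print(by_split)
# ['THIRE.LOIEST', '', 'OTACQUIREE', 'FOUR', 'THS.', 'IN']
```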


### Step 3: Calculate the Percentage of Valid English Words

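We now calculate what percentage of the words extracted from the generated string are valid English words by checking each one against the set of words from `words.txt`. The comparison is exact and case-sensitive, and the generated words are all uppercase, so only uppercase entries in `words.txt` can match.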


In [18]:
# Step 3: Calculate the percentage of valid English words
def calculate_valid_word_percentage(generated_words, valid_words):
    """
    Calculates the percentage of generated words that are valid English words.
    
    :param generated_words: List of words extracted from the generated string
    :param valid_words: Set of valid English words from words.txt
    :return: The percentage of valid words
    """
    valid_word_count = sum(1 for word in generated_words if word in valid_words)
    total_words = len(generated_words)
    
    if total_words == 0:
        return 0.0  # Avoid division by zero if no words are found
    
    return (valid_word_count / total_words) * 100

# Calculate the percentage of valid words
valid_percentage = calculate_valid_word_percentage(generated_words, valid_words)
print(f"Percentage of valid English words: {valid_percentage:.2f}%")


Percentage of valid English words: 33.93%
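
The membership test in `calculate_valid_word_percentage()` is case-sensitive. If the word list were stored in lower or mixed case (an assumption; the contents of `words.txt` are not shown here), uppercase generated words would fail to match. A minimal sketch of a case-insensitive variant, under the hypothetical name `valid_word_percentage_ci()`:

```python
def valid_word_percentage_ci(generated_words, valid_words):
    """Percentage of generated words found in valid_words, ignoring case."""
    if not generated_words:
        return 0.0  # avoid division by zero
    # Normalise the dictionary once so each lookup stays a fast set membership test
    normalized = {word.upper() for word in valid_words}
    hits = sum(1 for word in generated_words if word.upper() in normalized)
    return 100.0 * hits / len(generated_words)

demo_words = ["THE", "OPRONEGIN", "AND", "UPER"]
demo_dictionary = {"the", "and", "of"}
print(valid_word_percentage_ci(demo_words, demo_dictionary))  # 50.0
```

Two of the four demo words ("THE" and "AND") are in the dictionary, giving 50%.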


## Task 4: Export the Trigram Model as JSON

In this task, we export the trigram model created in Task 1 to a JSON file. JSON stores the model in a structured, readable format that can be easily shared or used in other applications.


### Step 1: Create the Export Function

We define a function called `export_trigram_model_to_json()` that takes the trigram model and a file path as parameters. It serializes the model as JSON and writes it to the given path (`trigrams.json` by default).


In [19]:
def export_trigram_model_to_json(trigram_model, file_path='trigrams.json'):
    """
    Exports the trigram model to a JSON file.
    
    :param trigram_model: The trigram model to be exported
    :param file_path: The file path where the JSON file will be saved (default: 'trigrams.json')
    """
    with open(file_path, 'w') as json_file:
        json.dump(trigram_model, json_file, indent=4)  # Export trigram model to JSON with indentation for readability


### Step 2: Save the Trigram Model as `trigrams.json`

We now call the `export_trigram_model_to_json()` function to save our trigram model as a JSON file named `trigrams.json`. Because the path is relative, the file is written to the notebook's current working directory.


In [20]:
# Assuming the trigram_model from Task 1 is available
export_trigram_model_to_json(trigram_model, 'trigrams.json')

### Step 3: Verify the JSON File

After exporting the trigram model, we verify that `trigrams.json` was created successfully and contains the expected data.

In the Explorer tab, right-click on `trigrams.json` and select Open to inspect its contents and check the format.

The file should display the trigram model in JSON format, with trigrams as keys and their counts as values.
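
The same check can also be done in code by reloading the file and comparing it with the in-memory model. The sketch below uses a tiny stand-in model and a temporary directory so that it runs on its own; `load_and_check()` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile

def load_and_check(path, expected_model):
    """Reload the JSON file and compare it with the in-memory model."""
    with open(path, 'r', encoding='utf-8') as json_file:
        loaded = json.load(json_file)
    # Keys should be strings (trigrams) and values integer counts
    well_formed = all(
        isinstance(key, str) and isinstance(value, int)
        for key, value in loaded.items()
    )
    return well_formed and loaded == dict(expected_model)

# Demonstration with a stand-in model written to a temporary directory
demo_model = {"THE": 2, "HE ": 1}
demo_path = os.path.join(tempfile.mkdtemp(), "trigrams.json")
with open(demo_path, 'w', encoding='utf-8') as json_file:
    json.dump(demo_model, json_file, indent=4)

print(load_and_check(demo_path, demo_model))  # True
```

In the notebook, the equivalent call would pass `'trigrams.json'` and `trigram_model`.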