# Trigram-Based Model for Text Analysis and Generation

This Jupyter notebook builds a character-level trigram model from Project Gutenberg texts and uses it to analyze and generate text. The workflow includes the following steps:

- **Cleaning and Preprocessing of Text Data from Project Gutenberg**:
    - Load and clean text data from Project Gutenberg to prepare it for analysis.
    - Remove unwanted characters and normalize the text.

- **Building a Trigram Model for Patterns in English**:
    - Construct a trigram model to capture patterns of three consecutive characters in the cleaned text.
    - Count the occurrences of each trigram to understand the frequency of character sequences.

- **Generating Text Using the Trigram Model**:
    - Use the trigram model to generate new text that mimics the style and structure of the original text.
    - Implement a function to produce a specified length of text based on the trigram probabilities.

- **Analyzing the Validity of Generated Text**:
    - Evaluate the generated text by calculating the percentage of valid English words.
    - Compare the generated text to a dictionary of valid words to assess its coherence.

- **Exporting the Trigram Model in JSON Format**:
    - Save the trigram model to a JSON file for future use and sharing.
    - Ensure the model is easily accessible and reusable.

Together, these steps form a complete pipeline from raw text to a reusable model of three-character patterns in English.


In [1]:
import re
from collections import defaultdict

The `clean_text` function cleans the input text by performing the following operations:
1. Replaces newline characters with spaces.
2. Converts the text to uppercase and removes every character except letters, whitespace, and full stops.
3. Replaces multiple consecutive whitespace characters with a single space.

Args:
    text (str): The input text to be cleaned.

Returns:
    str: The cleaned text.

In [2]:
## Step 1: Load and Clean the Text
def clean_text(text):
    # Replace newlines with spaces.
    text = text.replace('\n', ' ')
                        
    # Convert to uppercase, then remove everything except letters, whitespace and full stops
    cleaned = re.sub(r'[^A-Z\s.]', '', text.upper())

    # Replace multiple spaces with a single space
    cleaned = re.sub(r'\s+',' ',cleaned)

    return cleaned
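
As a quick check of the three cleaning steps, the sketch below applies the same operations one at a time to a made-up sample string. The sample string and the `step1` to `step3` names are assumptions for illustration and are not taken from the books.

```python
import re

# Hypothetical sample input, chosen to exercise each cleaning step
sample = "Hello,\nWorld!  It's 1818."

# Step 1: newlines become spaces
step1 = sample.replace('\n', ' ')
# Step 2: uppercase, then drop everything except A-Z, whitespace and full stops
step2 = re.sub(r'[^A-Z\s.]', '', step1.upper())
# Step 3: collapse runs of whitespace into a single space
step3 = re.sub(r'\s+', ' ', step2)

print(repr(step3))  # 'HELLO WORLD ITS .'
```

Digits and apostrophes vanish without leaving a space, so `It's` becomes `ITS` and the full stop after `1818` is left standing on its own.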

Process a text file by extracting and cleaning its content.

This function reads the content of a text file located at the specified file path,
searches for specific start and end markers indicating the beginning and end of the
main content (typically used in Project Gutenberg eBooks), and extracts the text
between these markers. If the markers are not found, a warning message is printed.
The extracted text is then cleaned using the `clean_text` function.

Args:
    file_path (str): The path to the text file to be processed.

Returns:
    str: The cleaned text extracted from the file.

In [3]:
# Function to load a text file and clean it
def process_file(file_path):
    # Open the file located at 'file_path' in read mode with utf-8 encoding
    with open(file_path, 'r', encoding='utf-8') as f:
        # Read the entire content of the file into the variable 'text'
        text = f.read()

    # Search for the start marker indicating the beginning
    start_marker = re.search(r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", text)
    # Search for the end marker indicating the end of the content
    end_marker = re.search(r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", text)

    # If both the start and end markers are found, extract the text between them
    if start_marker and end_marker:
        text = text[start_marker.end():end_marker.start()]
    else:
        # If markers are not found, print a warning message
        print("Warning: Could not find standard Project Gutenberg markers.")

    # Clean the text
    return clean_text(text)
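
To see the marker extraction in isolation, the sketch below runs the same two regular expressions on a small in-memory string instead of a file. The sample text is invented for illustration; real Project Gutenberg files may format their markers differently.

```python
import re

# Hypothetical miniature "eBook" with the standard start and end markers
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\n"
    "Body of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\n"
    "License text\n"
)

start = re.search(r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", sample)
end = re.search(r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", sample)

# Keep only what lies between the end of the start marker and the start of the end marker
body = sample[start.end():end.start()].strip()
print(body)  # Body of the book.
```

Because `.` does not match a newline by default, each pattern stays on its own marker line, so the header and license text are excluded.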

### Example of How to Use the `process_file` Function

The `process_file` function is designed to load and clean the text from a specified file. Below is an example of how to use this function on a single file:

1. **Load and Clean the Text**:
    - The function reads the content of the file located at the specified path.
    - It searches for specific start and end markers to extract the main content.
    - The extracted text is then cleaned using the `clean_text` function.

2. **Example Usage**:
    - In this example, the text is loaded and cleaned from the file `frankenstein.txt` located in the `gutenbergtexts` directory.
    - The first 500 characters of the cleaned text are displayed.



In [4]:
# ********** Example of how to use the function on a single file:  *********************

# Load and clean the text from a file (in this case, 'Frankenstein')
# cleaned_text = process_file('gutenbergtexts/frankenstein.txt')

# Display the first 500 characters of Frankenstein
# print(cleaned_text[:500])  

### The `build_trigram_model` Function

The `build_trigram_model` function builds a trigram model from cleaned text:

1. **Input**:
    - The function takes the cleaned text as input, for example the output of `process_file` for `frankenstein.txt`.

2. **Output**:
    - It slides a three-character window across the text and counts the occurrences of each trigram (three consecutive characters).
    - It returns a dictionary that maps each trigram to its count.


In [5]:
# Function to build a trigram model from the cleaned text
def build_trigram_model(cleaned_text):
    # Initialize a dictionary to count the occurrences of each trigram
    trigram_counts = defaultdict(int)

    # Loop through the text and create the trigrams.
    # A trigram consists of 3 consecutive characters, so we iterate over the text,
    # stopping 2 characters before the end so that every slice has exactly 3 characters
    for i in range(len(cleaned_text) - 2):
        # Extract the current trigram (3-character sequence)
        trigram = cleaned_text[i:i+3]
        # Increment the count of this trigram in the dictionary
        trigram_counts[trigram] += 1

    # Return the dictionary of trigram counts
    return trigram_counts
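
The sliding window is easiest to follow on a very short string. The sketch below uses `collections.Counter` rather than the notebook's `defaultdict`; `trigram_counter` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter

def trigram_counter(text):
    # Hypothetical helper: the same 3-character sliding window as
    # build_trigram_model, collected into a Counter instead of a defaultdict
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = trigram_counter("THE THE")
# The windows are: 'THE', 'HE ', 'E T', ' TH', 'THE'
print(counts['THE'])  # 2
print(len(counts))    # 4 distinct trigrams
```

A string of length n yields n - 2 windows, so the seven-character input produces five trigrams in total, and inputs shorter than three characters produce none.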

### Function to process multiple text files and build a combined trigram model

Description
This function takes a list of file paths, processes each file to clean the text, builds a trigram model for each file, and then combines the trigram counts from all files into a single dictionary. The combined trigram counts are returned as the output.

Parameters
- `file_paths` (list of str): A list of file paths to be processed.

Returns
- `combined_trigram_counts` (defaultdict of int): A dictionary containing the combined trigram counts from all the processed files.

In [6]:
# Function to process multiple text files and build a combined trigram model
def process_multiple_files(file_paths):
    # Initialize a dictionary to store trigram counts across all files
    combined_trigram_counts = defaultdict(int)

    # Loop through the list of file paths
    for file_path in file_paths:
        # Process/Clean the file
        cleaned_text = process_file(file_path)

        # Build trigram model for the current file
        trigram_counts = build_trigram_model(cleaned_text)

        # Merge the trigram counts from this file into the combined count
        for trigram, count in trigram_counts.items():
            combined_trigram_counts[trigram] += count
        
    # Return the combined trigram counts from all files
    return combined_trigram_counts
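
The merge step adds the per-file counts key by key. A minimal sketch of the same idea with `collections.Counter`, whose `update` method adds counts rather than overwriting them, is shown below. The two small dictionaries stand in for per-book models and are made up for illustration.

```python
from collections import Counter

# Hypothetical per-book trigram counts
book_a = Counter({'THE': 3, 'HE ': 2})
book_b = Counter({'THE': 1, 'AND': 4})

combined = Counter()
for counts in (book_a, book_b):
    # Counter.update adds the incoming counts to the existing ones
    combined.update(counts)

print(combined['THE'])  # 4
```

A trigram that appears in only one book keeps its count unchanged, while shared trigrams are summed.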


The next cell lists the file paths of five books from Project Gutenberg and processes them to build a combined trigram model.

The cell performs the following steps:
1. Lists the file paths for five books from Project Gutenberg.
2. Processes all the files and builds a combined trigram model from the listed file paths.
3. Displays the first 100 trigram counts from the combined trigram model.

File paths:
- 'gutenbergtexts/frankenstein.txt'
- 'gutenbergtexts/mobydick.txt'
- 'gutenbergtexts/prideandprejudice.txt'
- 'gutenbergtexts/romeoandjuliet.txt'
- 'gutenbergtexts/scarletletter.txt'

The combined trigram model is created by calling the `process_multiple_files` function with the list of file paths.
The first 100 trigram counts from the combined trigram model are displayed by converting the model's items to a list of tuples and printing the first 100 of them.



In [7]:
# List all file paths for 5 different books from Project Gutenberg
file_paths = [
    'gutenbergtexts/frankenstein.txt',
    'gutenbergtexts/mobydick.txt',
    'gutenbergtexts/prideandprejudice.txt',
    'gutenbergtexts/romeoandjuliet.txt',
    'gutenbergtexts/scarletletter.txt'
]

# Process all the files and build a combined trigram model from the listed file paths
combined_trigram_model = process_multiple_files(file_paths)

# Display the first 100 trigram counts from the combined trigram model
print(dict(list(combined_trigram_model.items())[:100]))  # Convert to a list of tuples and keep the first 100

{' LE': 2789, 'LET': 1288, 'ETT': 997, 'TTE': 2141, 'TER': 7254, 'ER ': 17193, 'R T': 4524, ' TO': 16087, 'TO ': 14617, 'O M': 1842, ' MR': 1372, 'MRS': 374, 'RS.': 716, 'S. ': 3141, '. S': 1466, ' SA': 3993, 'SAV': 180, 'AVI': 512, 'VIL': 479, 'ILL': 3706, 'LLE': 1195, 'LE ': 6435, 'E E': 2250, ' EN': 2286, 'ENG': 723, 'NGL': 984, 'GLA': 350, 'LAN': 1307, 'AND': 19336, 'ND.': 311, 'D. ': 1902, ' ST': 5071, 'ST.': 309, 'T. ': 2435, '. P': 340, ' PE': 2722, 'PET': 180, 'ETE': 587, 'ERS': 3578, 'RSB': 3, 'SBU': 18, 'BUR': 315, 'URG': 149, 'RGH': 64, 'GH ': 1784, 'H D': 294, ' DE': 4535, 'DEC': 579, 'EC.': 3, 'C. ': 70, '. T': 3299, ' TH': 55432, 'TH ': 7714, 'H .': 11, ' . ': 311, '. Y': 568, ' YO': 5124, 'YOU': 5050, 'OU ': 3929, 'U W': 495, ' WI': 8644, 'WIL': 1842, 'LL ': 7835, 'L R': 313, ' RE': 6192, 'REJ': 103, 'EJO': 52, 'JOI': 192, 'OIC': 276, 'ICE': 1039, 'CE ': 4594, 'E T': 10499, 'O H': 1994, ' HE': 13123, 'HEA': 2504, 'EAR': 4471, 'AR ': 2179, 'THA': 8516, 'HAT': 9320, 'AT ':

Task 2: Third-order letter approximation generation

In [8]:
import random


Prints the possible next characters for a given pair of characters (last_two)
and their probabilities based on the trigram model.

Args:
    trigram_model (dict): The trigram model containing counts of trigrams.
    last_two (str): The last two characters (e.g., 'TH') for which to calculate next character probabilities.

Returns:
    None


In [9]:
def print_trigram_possibilities(trigram_model, last_two):
    """
    Prints the possible next characters for a given pair of characters (last_two)
    and their probabilities based on the trigram model.
    
    Args:
        trigram_model (dict): The trigram model containing counts of trigrams.
        last_two (str): The last two characters (e.g., 'TH') for which to calculate next character probabilities.
    """

    # Find all trigrams with the given two characters
    possible_trigrams = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

    if not possible_trigrams:
        print(f"No trigrams found starting with '{last_two}'.")
        return
    
    # Separate the third letters and their respective counts
    letters = [trigram[2] for trigram in possible_trigrams.keys()]
    counts = list(possible_trigrams.values())

    # Calculate the total number of occurrences for normalisation
    total_count = sum(counts)

    # Print the possibilities
    print(f"Possible next characters after '{last_two}':")
    for letter, count in zip(letters, counts):
        probability = count / total_count
        print(f"{last_two + letter}: appeared {count} times, probability = {probability:.4f}")
    print(f"Total occurrences: {total_count}\n")
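
The probability printed for each candidate is its trigram count divided by the total count of all trigrams sharing the same two-character prefix. The sketch below returns these probabilities as a dictionary instead of printing them; `toy_model` and `next_char_probabilities` are hypothetical names, and the counts are invented to keep the arithmetic easy to check.

```python
# Hypothetical toy model: trigram -> count
toy_model = {'THE': 6, 'THA': 3, 'TH ': 1, 'HE ': 6}

def next_char_probabilities(model, last_two):
    # Collect the third character and count of every trigram with this prefix
    matches = {t[2]: c for t, c in model.items() if t.startswith(last_two)}
    total = sum(matches.values())
    if total == 0:
        return {}
    # Normalise the counts so that the probabilities sum to 1
    return {ch: c / total for ch, c in matches.items()}

probs = next_char_probabilities(toy_model, 'TH')
print(probs)  # {'E': 0.6, 'A': 0.3, ' ': 0.1}
```

Here the prefix `TH` occurs 10 times in total, so `E` follows it with probability 6/10, `A` with 3/10 and a space with 1/10.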



### Analysis of Generated Text

The text generated in this task is analyzed in Task 3 to determine the percentage of valid English words. The analysis involves the following steps:

1. **Loading Valid Words**:
    - A set of valid English words is loaded from the `words.txt` file.

2. **Generating Text**:
    - Text is generated using the trigram model built from the combined text of five books from Project Gutenberg.

3. **Calculating Word Percentage**:
    - The generated text is processed to extract individual words.
    - Each word is checked against the set of valid English words.
    - The percentage of valid English words in the generated text is calculated.


In [None]:

def generate_text(trigram_model, length=10000, line_length=80):
    # Start with the string "TH"
    generated_text = "TH"

    # Keep generating characters until the desired length is reached
    while len(generated_text) < length:
        # Get the last two characters from the current text
        last_two = generated_text[-2:]

        # Find all trigrams starting with those two characters
        possible_trigrams = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

        if not possible_trigrams:
            # In case there are no trigrams starting with the last two characters, stop generating
            print(f"Warning: No trigrams found for the pair '{last_two}'.")
            break

        # Separate the third letter and their respective counts
        letters = [trigram[2] for trigram in possible_trigrams.keys()]
        counts = list(possible_trigrams.values())

        # Sample the next character, weighted by the trigram counts
        next_char = random.choices(letters, weights=counts, k=1)[0]

        generated_text += next_char

    # Add line breaks at the specified line length (default 80 characters per line)
    formatted_text = '\n'.join([generated_text[i:i+line_length] for i in range(0, len(generated_text), line_length)])

    return formatted_text

# Generate 10,000 characters of text from the combined trigram model
generated_text = generate_text(combined_trigram_model, length=10000, line_length=80)


# Display the first 500 characters of the generated text
print(f"\nGenerated text (first 500 characters):\n{generated_text[:500]}")
print("\n\n")
print_trigram_possibilities(combined_trigram_model, "TH")


Generated text (first 500 characters):
THE FES ISTE OF HE INSWIT TON THERS AN BULT TRACTEM. NACES A LEN PION AS ON WAY 
AT POSTENY AMONER ALL SEAMEN OT TO HIST SPEADHUNATTLY SHICE SOO SIOUDER AT ID BO
TTIMPED HIS OF THERSE WEADTHE BEARTFIC WING UPS ALRE GRE A FORDS UPORM SERING TI
ONAHADY HAD GRAD SH I WIFEEN TURE WHAT FANDOGEN BRALL TH A COUS SMUT DUAD DART B
E AMORS OF THE WHATTLY RE O CONEE WHISURNIN NAT AND ALLIS ANDS FULD UNTHOUNA LOR
DNE AND ISS THE MIS SUS ANT THATUNTHIM FOREEN SAW FIVER HAMING HIL DREST FORWHAM
 JOY A FRODYIN



Possible next characters after 'TH':
TH : appeared 7714 times, probability = 0.1080
THA: appeared 8516 times, probability = 0.1192
THE: appeared 42957 times, probability = 0.6013
THI: appeared 5546 times, probability = 0.0776
THO: appeared 3509 times, probability = 0.0491
THS: appeared 291 times, probability = 0.0041
THU: appeared 437 times, probability = 0.0061
THR: appeared 1257 times, probability = 0.0176
TH.: appeared 290 times, probability = 0.004
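
`generate_text` scans the whole model once for every generated character. One possible alternative, sketched below, groups the trigrams by their two-character prefix once and then looks up each prefix directly. This is not the notebook's method: `build_prefix_index`, `generate_with_index`, the `rng` parameter and the toy model are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    # Map each 2-character prefix to (next letters, their counts)
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        letters, weights = index[trigram[:2]]
        letters.append(trigram[2])
        weights.append(count)
    return index

def generate_with_index(trigram_model, length=50, seed_text="TH", rng=None):
    rng = rng or random.Random()
    index = build_prefix_index(trigram_model)
    chars = list(seed_text)
    while len(chars) < length:
        prefix = ''.join(chars[-2:])
        if prefix not in index:
            break  # no known continuation for this pair
        letters, weights = index[prefix]
        # Weighted sampling, as in generate_text
        chars.append(rng.choices(letters, weights=weights, k=1)[0])
    return ''.join(chars)

# Hypothetical toy model in which every reachable prefix has a continuation
toy_model = {'THE': 5, 'HE ': 5, 'E T': 5, ' TH': 5,
             'THA': 1, 'HAT': 1, 'AT ': 1, 'T T': 1}
sample = generate_with_index(toy_model, length=50, rng=random.Random(0))
print(sample)
```

The index is built once per call, so each generated character costs a dictionary lookup plus one weighted draw instead of a pass over every trigram.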

Task 3: Analyze your model

In [11]:
import string

### Explanation of `load_words` Function

The `load_words` function loads a set of valid English words from a specified file. It reads each line from the file, strips any whitespace (including newline characters) from the ends, and converts each word to lowercase to ensure uniformity. The returned set is used to validate the words in the generated text.

In [None]:
def load_words(file_path):
    # Open the file located at 'file_path' in read mode
    with open(file_path, 'r') as f:
        # Use a set comprehension to read each line from the file,
        # strip any whitespace (including newline characters) from the ends,
        # and convert each word to lowercase to ensure uniformity.
        valid_words = {line.strip().lower() for line in f}
    # Return the set of valid words loaded from the file
    return valid_words
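
The sketch below exercises the same set comprehension on a temporary word list, so it runs without `words.txt`. The file contents are invented, and the extra `if line.strip()` filter, which skips blank lines, is an addition that is not part of `load_words`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical word list with mixed case, stray spaces and a blank line
    path = os.path.join(tmp, 'words.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write("The\nof\n\n  Whale \n")

    with open(path, 'r', encoding='utf-8') as f:
        # Same stripping and lowercasing as load_words, skipping blank lines
        words = {line.strip().lower() for line in f if line.strip()}

print(sorted(words))  # ['of', 'the', 'whale']
```

Storing the words in a set makes each later membership test a constant-time lookup on average, which matters when thousands of generated words are checked.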

### Extract Words from Text
This function removes punctuation, converts the text to lowercase, and splits it into individual words. 
It is used to preprocess text for further analysis, ensuring consistency in word extraction.


In [13]:
def extract_words(text):
    """
    Extracts words from a given text by removing punctuation, lowercasing, and splitting on whitespace.
    
    Args:
        text (str): The input text to extract words from.
        
    Returns:
        list: A list of extracted words from the text.
    """

    # Remove punctuation and convert to lowercase
    translator = str.maketrans('','',string.punctuation)
    clean_text = text.translate(translator).lower()

    # Split by spaces to get words
    words = clean_text.split()
    return words
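
A short worked example makes the behaviour concrete, including what happens at a line break. `split_words` is a hypothetical stand-in that follows the same approach as `extract_words`, and the input strings are made up.

```python
import string

def split_words(text):
    # Same approach as extract_words: drop punctuation, lowercase, split on whitespace
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator).lower().split()

first = split_words("THE WHALE. ISS THE")
second = split_words("A FORWH\nAM JOY")

print(first)   # ['the', 'whale', 'iss', 'the']
print(second)  # ['a', 'forwh', 'am', 'joy']
```

The second call shows that a newline counts as whitespace, so a word that is broken across two lines is returned as two separate tokens.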

### Explanation of `calculate_word_percentage` Function

The `calculate_word_percentage` function is designed to evaluate the coherence of the generated text by calculating the percentage of valid English words it contains. This function performs the following steps:

1. **Extract Words from Generated Text**:
    - The function uses the `extract_words` function to preprocess the generated text by removing punctuation, converting it to lowercase, and splitting it into individual words.

2. **Count Valid Words**:
    - Each extracted word is checked against a set of valid English words (`valid_words`).
    - The function counts how many of the extracted words are valid English words.

3. **Calculate Percentage**:
    - The percentage of valid English words is calculated by dividing the count of valid words by the total number of extracted words.
    - The function returns the percentage of valid words, the count of valid words, and the total number of words.

This function helps in assessing the quality of the generated text by providing a quantitative measure of its validity based on the presence of recognizable English words.

In [14]:
def calculate_word_percentage(generated_text, valid_words):
    """
    Calculates the percentage of valid English words in the generated text.
    
    Args:
        generated_text (str): The generated text from Task 2.
        valid_words (set): A set of valid English words.
        
    Returns:
        tuple: (percentage of valid words, number of valid words, total number of words).
    """

    # Extract words from the generated text
    words = extract_words(generated_text)

    # Count the valid words
    valid_word_count = sum(1 for word in words if word in valid_words)

    # Calculate the percentage of valid words, guarding against empty input
    total_words = len(words)
    percentage = (valid_word_count / total_words) * 100 if total_words > 0 else 0

    return percentage, valid_word_count, total_words
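
The calculation can be checked by hand on a tiny input. In the sketch below, the word list and the set of valid words are invented for illustration.

```python
# Hypothetical extracted words and dictionary
words = ['the', 'fes', 'iste', 'of']
valid = {'the', 'of', 'whale'}

# Count the words found in the dictionary
valid_count = sum(1 for w in words if w in valid)

# Percentage of valid words, with the same empty-input guard
percentage = (valid_count / len(words)) * 100 if words else 0

print(f"Valid words: {valid_count} / {len(words)}")  # Valid words: 2 / 4
print(f"Percentage: {percentage:.2f}%")              # Percentage: 50.00%
```

Two of the four words are in the dictionary, which gives 50%.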

In [15]:
# Load the valid English words from words.txt
valid_words = load_words('data/words.txt')

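# Generate a fresh sample with the Task 2 generator.
# Note: generate_text inserts a line break every 80 characters, so a word that
# straddles a line boundary is counted as two separate words here.
generated_text = generate_text(combined_trigram_model, length=10000)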

# Calculate the percentage of valid English words
percentage, valid_word_count, total_words = calculate_word_percentage(generated_text, valid_words)

# Print the results
print(f"Valid words: {valid_word_count} / {total_words}")
print(f"Percentage of valid English words: {percentage:.2f}%")

Valid words: 702 / 1898
Percentage of valid English words: 36.99%


Task 4: Export your model as JSON

In [16]:
import json

In [17]:
def export_trigram_model(trigram_model, output_file):
    """
    Exports the trigram model to a JSON file.
    
    Args:
        trigram_model (dict): The trigram model containing counts of trigrams.
        output_file (str): The path to the output JSON file.
    """

    # Open the specified output file in write mode
    with open(output_file, 'w') as f:
        # Use json.dump() to write the trigram model to the file
        json.dump(trigram_model, f, indent=4) # indent = 4, this is for nice printing

In [18]:
# Specify the name of the output file where the trigram model will be saved
output_file = 'trigrams.json'
# Call the function to export the trigram model to a JSON file
# 'combined_trigram_model' is the dictionary containing the trigrams and their counts
export_trigram_model(combined_trigram_model, output_file)
# Print the results
print(f"Trigram model has been exported to {output_file}")

Trigram model has been exported to trigrams.json
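
To confirm that an exported model can be reused, the sketch below writes a small model to a temporary JSON file and reads it back. The toy counts and the temporary path are assumptions, and the notebook itself does not include a loading step.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical toy model of the same type as the combined model
model = defaultdict(int, {'THE': 3, 'HE ': 2})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'trigrams.json')

    # Export, as in export_trigram_model
    with open(path, 'w') as f:
        json.dump(model, f, indent=4)

    # Load the model back from disk
    with open(path, 'r') as f:
        loaded = json.load(f)

print(loaded)        # {'THE': 3, 'HE ': 2}
print(type(loaded))  # <class 'dict'>
```

The loaded object is a plain `dict` rather than a `defaultdict`, so looking up a trigram that was never counted raises `KeyError` unless `loaded.get(trigram, 0)` is used.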
