# **Trigram Language Model Analysis and Generation**

This notebook shows the construction of a trigram-based language model. We use a corpus from Project Gutenberg, process the text to calculate trigram frequencies, and generate new text based on trigram probabilities. The tasks include text sanitization, trigram frequency analysis, probability calculation, text generation, and model evaluation.

---

## **Task 1: Third-Order Trigram Model**

### Text Sanitization
Sanitize text by removing unwanted characters and converting it to uppercase. The `sanitize_text()` function performs this, retaining only uppercase letters, spaces, and periods.


In [63]:
import re

def sanitize_text(text):
    # Define start and end markers for Project Gutenberg text
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Find where the actual book content starts and ends
    start_index = text.find(start_marker)
    end_index = text.find(end_marker)

    # Extract the main text content between the start and end markers
    if start_index != -1:
        text = text[start_index + len(start_marker):]
    if end_index != -1:
        text = text[:end_index]

    # Remove newlines and carriage returns ?
    text = text.replace('\n', ' ').replace('\r', ' ').replace('\u200a', '')

    # Remove special characters (retain letters, numbers, and spaces)
    sanitized_text = re.sub(r'[^A-Za-z0-9\s]', '', text)

    # Convert all text to uppercase
    sanitized_text = sanitized_text.upper()

    # Strip leading and trailing whitespace
    sanitized_text = sanitized_text.strip()

    return sanitized_text


In [64]:
import os
def read_and_sanitize_file(file_path):
    """Read the content of the file, sanitize and trim it."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    
    sanitized_text = sanitize_text(text)
    return sanitized_text

def read_files_in_folder(folder_path):
    """Read and sanitize every file in the specified folder."""
    sanitized_files_content = {}

    # Iterate through each file in the folder
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        # Check if the current path is a file
        if os.path.isfile(file_path):
            print(f"Reading file: {file_name}")  # Output the name of each file
            sanitized_content = read_and_sanitize_file(file_path)
            sanitized_files_content[file_name] = sanitized_content

    return sanitized_files_content

### Trigram Model Construction
The `update_trigram_model()` function creates a trigram frequency model, counting occurrences of three-character sequences across the text.

---

In [65]:
from collections import defaultdict
def update_trigram_model(trigram_model, text):
    """Update the trigram model with counts from the given text."""
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_model[trigram] += 1

In [66]:
from collections import defaultdict

def build_trigram_model_from_directory(directory):
    # Initialize an empty trigram model as a defaultdict
    trigram_model = defaultdict(int)

    # Read sanitized text from all files in the directory
    sanitized_files_content = read_files_in_folder(directory)

    print("Build a trigram model from all the text files in the specified directory...\n")
    # Update the trigram model with each file's content
    for content in sanitized_files_content.values():
        update_trigram_model(trigram_model, content)

    # Return the final trigram model
    return trigram_model

### Test: Trigram Model

In [67]:
# Example usage
folder_path = os.path.join(os.getcwd(), 'project_gutenberg')
trigram_model = build_trigram_model_from_directory(folder_path)

# Printing some trigrams to see the output
for trigram, count in list(trigram_model.items())[:10]:
    print(f"Trigram: {trigram}, Count: {count}")

Reading file: Alice_In_Wonderland.txt
Reading file: Frankenstein.txt
Reading file: Great_Gatsby.txt
Reading file: Moby_Dick.txt
Reading file: Pride_And_Prejudice.txt
Build a trigram model from all the text files in the specified directory...

Trigram: ALI, Count: 1023
Trigram: LIC, Count: 784
Trigram: ICE, Count: 1441
Trigram: CES, Count: 899
Trigram: ES , Count: 8058
Trigram: S A, Count: 7457
Trigram:  AD, Count: 884
Trigram: ADV, Count: 283
Trigram: DVE, Count: 57
Trigram: VEN, Count: 1500


## Task 2: Third-order letter approximation generation

### Convert Counts to Probabilities
The `compute_trigram_probabilities()` function calculates probabilities for each trigram, enabling text generation based on character sequences.

In [68]:
from collections import defaultdict

"""Convert trigram counts to probabilities of next characters."""
def compute_trigram_probabilities(trigram_model):
    # Dictionary to store probabilities
    trigram_probabilities = defaultdict(dict)
    
    # Group trigrams by their first two characters (the prefix)
    prefix_counts = defaultdict(int)
    
    # Calculate the total counts for each prefix (first two characters)
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        prefix_counts[prefix] += count
    
    # Convert counts to probabilities
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        probability = count / prefix_counts[prefix]
        trigram_probabilities[prefix][trigram[2]] = probability  # Map next character to its probability
    
    return trigram_probabilities

### Text Generation
The `generate_text()` function uses trigram probabilities to iteratively generate text, starting from a specified two-character sequence.

In [69]:
import random

"""Given a prefix, sample the next character based on trigram probabilities."""
def sample_next_char(trigram_probabilities, prefix):
    if prefix in trigram_probabilities:
        next_chars = list(trigram_probabilities[prefix].keys())
        probabilities = list(trigram_probabilities[prefix].values())
        # Use random.choices to sample based on the provided probabilities
        return random.choices(next_chars, probabilities)[0]
    else:
        # If the prefix isn't found, return a space as a fallback
        return ' '

"""Generate a text sequence of the given length using the trigram probabilities."""  
def generate_text(trigram_probabilities, start_sequence, length=1000):
    
    if len(start_sequence) != 2:
        raise ValueError("Start sequence must be exactly two characters.")
    
    # Start with the provided initial sequence
    generated_text = start_sequence
    
    for _ in range(length):
        # Use the last two characters as the prefix
        prefix = generated_text[-2:]
        
        # Sample the next character
        next_char = sample_next_char(trigram_probabilities, prefix)
        
        # Append the next character to the generated text
        generated_text += next_char
    
    return generated_text

In [70]:
# Define the relative path to the Gutenberg project folder
folder_path = os.path.join(os.getcwd(), 'project_gutenberg')

# Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(folder_path)

if trigram_model:
    # Compute trigram probabilities
    trigram_probabilities = compute_trigram_probabilities(trigram_model)
    
    # Generate a sequence of 1000 characters starting with "TH"
    start_sequence = "TH"
    generated_text = generate_text(trigram_probabilities, start_sequence, length=1000)
    
    # Output: Print the generated text
    print("Generated Text: \n")
    print(generated_text)
else:
    print("The trigram model is empty.")


Reading file: Alice_In_Wonderland.txt
Reading file: Frankenstein.txt
Reading file: Great_Gatsby.txt
Reading file: Moby_Dick.txt
Reading file: Pride_And_Prejudice.txt
Build a trigram model from all the text files in the specified directory...

Generated Text: 

TH AND MUSPECAUSIRD HAND YOUSLED MORN THS OF MOSIDE TOLVERCOVE SHE DIED SERY INESO HIS AS CON THED TO OF DUS FORT WHOPIKILL BY HIS ING ATURSAMSE BITS I HOUGHTS NOUT AND BUS ARROW HAT THE UPPOODIT CLAD GHT OF PULD WAS FOAT OF HUS AS SIN HAVIGHT THEAPPENDE SELMOUST PEN TH BEWDICH I PARDARDAY INGLADESUCH DIS OF DIME THSSE ANNOTIFEW THE OF HAS UNIVERTLY TO EVE WEPLED PROUNS A EVERAND TO SHOREG ROMMOM LIG TO ENCE BY HE ANINUTENED TIM SAIDDLED FORDEGGLEED BEENEAR GIN BRIANDEDS WED IN ONYTHEY DOD BEFORKEETHADIATS CALL ALT IT OLIZABIT DRE YOR ANTION INS THE YOUS MAD NOT THLY IT ING TH WITHERY APTAING TANDES HE QUEGAND WITHOW THE FOR HE OF WAS ON RE TINGET TWILITUR FIREIR THEM ING ALL SING OH THATIONE REND HAN PUSLETERY SUNTS HE STO I WAY

## **Task 3: Model Analysis**

Evaluate generated text by comparing it to a reference vocabulary in `words.txt`, calculating the percentage of valid words. This step assesses model quality based on word recognizability.


### **Aproach 1**
### Steps:
1. **Read Words List** - Read a list of valid words from a provided file, `words.txt`.
2. **Compare Generated Text** - Find common and unique words between the generated text and `words.txt`.
3. **Compute Statistics** - Calculate the percentage of valid words within the generated sequence.

The `compare_generated_words()` function handles these comparisons.

In [71]:
def read_words_from_file(file_location):
    try:
        with open(file_location, 'r') as file:
            # Read the file contents and split by whitespace to get individual words
            words = file.read().split()
        return words
    except FileNotFoundError:
        return f"Error: The file at {file_location} was not found."
    except Exception as e:
        return f"An error occurred: {e}"

In [72]:
def compare_generated_words(generated_text, words_list):
    # Split the generated text into words
    generated_words = generated_text.split()
    
    # Find common words between the generated text and words_list
    common_words = set(generated_words).intersection(words_list)
    
    # Find words in generated_text that are not in words_list
    unique_generated_words = set(generated_words) - set(words_list)
    
    # Find words in words_list that are not in generated_text
    missing_words = set(words_list) - set(generated_words)
    
    return {
        "common_words": list(common_words),
        "unique_generated_words": list(unique_generated_words),
        "missing_words": list(missing_words)
    }

In [73]:
# Define the relative path to the Gutenberg project folder
folder_path = os.path.join(os.getcwd(), 'project_gutenberg')

# Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(folder_path)

if trigram_model:
    # Compute trigram probabilities
    trigram_probabilities = compute_trigram_probabilities(trigram_model)
    
    # Generate a sequence of 1000 characters starting with "TH"
    start_sequence = "TH"
    generated_text = generate_text(trigram_probabilities, start_sequence, length=1000)
    
    # Output: Print the generated text
    print("Generated Text: \n")
    print(generated_text)
     
    # Compare generated words with words from 'words.txt'
    words_file_path = os.path.join(os.getcwd(), 'words.txt')
    words_list = read_words_from_file(words_file_path)
    
    # Calculate percentage of valid words
    comparison_results = compare_generated_words(generated_text, words_list)
    
    # Count the valid words as a percentage of total words in generated text
    total_generated_words = len(generated_text.split())
    valid_words = len(comparison_results["common_words"])
    non_valid_words = len(comparison_results["unique_generated_words"])
    
    if total_generated_words > 0:
        percentage_valid = (valid_words / total_generated_words) * 100
        percentage_not_valid = (non_valid_words / total_generated_words) * 100
    else:
        percentage_valid = 0.0

    # Output: Print the percentage of valid words
    print("\nPercentage of valid words in generated text: {:.2f}%".format(percentage_valid))
    # print("\nPercentage of non valid words in generated text: {:.2f}%".format(percentage_not_valid))
else:
    print("The trigram model is empty.")

Reading file: Alice_In_Wonderland.txt
Reading file: Frankenstein.txt
Reading file: Great_Gatsby.txt
Reading file: Moby_Dick.txt
Reading file: Pride_And_Prejudice.txt
Build a trigram model from all the text files in the specified directory...

Generated Text: 

THER THE NALL OF MOSE FOR WIT A NOINSWER DENS NOWNSICH BACED TO SHE PAIDIESID THE MANIT HOSTANCE FECTION GETHORNES THER NOWILITTELIKED AT NALEACIED BUMPRESSEE FALL AND ISEM ALIGHT STO THEAT HILD MY FIRT CLUST THERK THISHOR INCED REST AND HAT A SE A GRACEEND WHISELD IT PROOK I DAY OB SWER LEPS THE CEN VERSOME AS SH ASTRISEQUISEEN BUT AND LY EN BEIRSURS BOD IND HAVID WOU YOU SOR GINNION RUE REPRE ANSE FORE CONLYDIM   ING SO KIN THEAD IME HO ORDED TO GAID   THE BUTHEIRLD      HANY THEY WAND TH FOREAT BER SHE RIENG LING TH FIF ATS SPAS IT AND BOACENCION WOODUCH THER AT BY OING VENDEPISPIRDS I BLEEPTEARY  AR BE BUT SHER AGARE HE COME BY SHICIES DESTIED BED SEENER ING ISCHAT DID BUT A HE KNOWN DITHENTEMAUT THE TO HAT AND HUS A LOCKING 

In [None]:
import threading

def generate_and_evaluate_text(trigram_probabilities, start_sequence, length, words_list):
    generated_text = generate_text(trigram_probabilities, start_sequence, length)
    comparison_results = compare_generated_words(generated_text, words_list)
    
    total_generated_words = len(re.findall(r'\b\w+\b', generated_text)) # Count words using regex
    valid_words = len(comparison_results["common_words"])
    
    if total_generated_words > 0:
        percentage_valid = (valid_words / total_generated_words) * 100
    else:
        percentage_valid = 0.0
    
    print(f"Generated Text: {generated_text[:100]}...")  # Print the first 100 characters of the generated text
    print(f"Percentage of valid words: {percentage_valid:.2f}%\n")

# Define the number of threads and the length of text to generate
num_threads = 8
text_length = 1000

# Read the words list from the file
words_list = read_words_from_file(words_file_path)

# Create and start threads
threads = []
for _ in range(num_threads):
    thread = threading.Thread(target=generate_and_evaluate_text, args=(trigram_probabilities, start_sequence, text_length, words_list))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

Generated Text: THOT THES ONY TRY THFURE I AT THAVINGLIN BLEY LAUD ROP THE AD SO LIEN ROU GOR VOT THILLY THE AND SLE...Generated Text: THE LOOKE WITTLY A SCEST PERS SIBEACE PLE HE OF BE SENST MACHE MOSSAILFTED IVEREHE NOES YONTHIS OVER...
Percentage of valid words: 24.59%


Percentage of valid words: 26.82%

Generated Text: THOWED EGAINTLY AS  THERY GER I HE WOUS THEADENCE ALLY AN FOUND HE A BEEMPEACHE PUT THERECTLY THED B...
Percentage of valid words: 25.52%

Generated Text: THE PEND IN LE PLATERSELY LE CARTION THE THERSTROMIRT INUR DARPRE AN THAT THE INGLENTERE CON  MATION...
Percentage of valid words: 25.27%

Generated Text: THE WAS   SCRUESS LATINGURITSMANOT WHI ANCED DRAT UNFULD TONAT SOR TAT AGER THE THE HAD BEEL WHE THA...
Percentage of valid words: 28.11%

Generated Text: THEY DIF SHO LE HUEG      SALT OBJECT ANNIN BEINT   JUS JAMEN ALF VARKER WASE COMEAS THE RE THE VIN ...
Percentage of valid words: 27.81%

Generated Text: THEN ASOLD INDINGE MAGETTERY SOOKED SHIME CHOLOOKE TO 

## **Task 4: Model Export**

Save the model as JSON to make trigram probabilities accessible for future applications.


In [77]:
import json

def save_trigram_model_as_json(trigram_model, output_file_path):
    try:
        # Convert trigram model to JSON serializable format
        json_serializable_model = {
            ' '.join(key): value for key, value in trigram_model.items()
        }
        
        # Write model to a JSON file
        with open(output_file_path, 'w') as json_file:
            json.dump(json_serializable_model, json_file, indent=4)
        
        print(f"Trigram model successfully saved to {output_file_path}")
    except Exception as e:
        print(f"Failed to save trigram model as JSON: {e}")

In [78]:
import os

# Define the relative output file path
output_file_path = os.path.join(os.getcwd(), 'trigram_model.json')

# Save the trigram model as JSON to the relative path
save_trigram_model_as_json(compute_trigram_probabilities(trigram_model), output_file_path)


Trigram model successfully saved to c:\Users\Ronan\Documents\Emerging-Technologies\tasks\trigram_model.json


## Conclusion

This notebook constructed a trigram model for generating text and evaluating language structure. This model can be extended to larger datasets or higher-order n-grams for enhanced text coherence.