# **Large Language Model Processing**

---

## **Task 1:** Third-order letter approximation model

In this task, we build a trigram-based model of the English language by processing texts from Project Gutenberg. The steps include sanitizing the text, removing unwanted characters, and counting the frequency of trigrams (sequences of three characters) in the text.

---

### Text Sanitization

We remove any preamble and postamble specific to Project Gutenberg texts and restrict the character set to uppercase ASCII letters, spaces, and full stops. All other characters are removed.

The `sanitize_and_trim()` function is responsible for this task. It cleans the text as follows:
1. Converts all letters to uppercase.
2. Removes non-alphabetic characters except spaces and periods.
3. Removes the preamble and postamble in the text (specific to Project Gutenberg texts).

In [143]:
import re

def sanitize_text(text):
    # Define start and end markers for Project Gutenberg text
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Find where the actual book content starts and ends
    start_index = text.find(start_marker)
    end_index = text.find(end_marker)

    # Extract the main text content between the start and end markers
    if start_index != -1:
        text = text[start_index + len(start_marker):]
    if end_index != -1:
        text = text[:end_index]

    # Remove special characters (retain letters, numbers, and spaces)
    sanitized_text = re.sub(r'[^A-Za-z0-9\s]', '', text)

    # Convert all text to uppercase
    sanitized_text = sanitized_text.upper()

    # Strip leading and trailing whitespace
    sanitized_text = sanitized_text.strip()

    return sanitized_text


In [144]:
import os
def read_and_sanitize_file(file_path):
    """Read the content of the file, sanitize and trim it."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    
    sanitized_text = sanitize_text(text)
    return sanitized_text

def read_files_in_folder(folder_path):
    """Read and sanitize every file in the specified folder."""
    sanitized_files_content = {}

    # Iterate through each file in the folder
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        # Check if the current path is a file
        if os.path.isfile(file_path):
            print(f"Reading file: {file_name}")  # Output the name of each file
            sanitized_content = read_and_sanitize_file(file_path)
            sanitized_files_content[file_name] = sanitized_content

    return sanitized_files_content

### Trigram Model Construction

The next step is to build a trigram model, which counts how often each sequence of three characters appears in the text. This model will help capture the structure of the language.

The function `update_trigram_model()` takes a text and updates the trigram counts in a dictionary-like data structure.

In [145]:
from collections import defaultdict
def update_trigram_model(trigram_model, text):
    """Update the trigram model with counts from the given text."""
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_model[trigram] += 1

In [146]:
from collections import defaultdict

def build_trigram_model_from_directory(directory):
    # Step 1: Initialize an empty trigram model as a defaultdict
    trigram_model = defaultdict(int)

    # Step 2: Read sanitized text from all files in the directory
    sanitized_files_content = read_files_in_folder(directory)

    print("Build a trigram model from all the text files in the specified directory...\n")
    # Step 3: Update the trigram model with each file's content
    for content in sanitized_files_content.values():
        update_trigram_model(trigram_model, content)

    # Return the final trigram model
    return trigram_model

### Test: Trigram Model

In [147]:
# Example usage
folder_path = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'
trigram_model = build_trigram_model_from_directory(folder_path)

# Printing some trigrams to see the output
for trigram, count in list(trigram_model.items())[:10]:
    print(f"Trigram: {trigram}, Count: {count}")

Reading file: Great_Gatsby.txt
Reading file: Frankenstein.txt
Reading file: Pride_And_Prejudice.txt
Reading file: Moby_Dick.txt
Reading file: Alice_In_Wonderland.txt
Build a trigram model from all the text files in the specified directory...

Trigram: THE, Count: 39681
Trigram: HE , Count: 32000
Trigram: E G, Count: 1757
Trigram:  GR, Count: 1812
Trigram: GRE, Count: 1323
Trigram: REA, Count: 3148
Trigram: EAT, Count: 2205
Trigram: AT , Count: 11747
Trigram: T G, Count: 584
Trigram:  GA, Count: 1099


## Task 2: Third-order letter approximation generation

### Converting Trigram Counts to Probabilities
The, `compute_trigram_probabilities`, function takes a trigram model, consisting of character counts, and converts these counts into probabilities, representing the likelihood of the next character in a sequence.

In [148]:
from collections import defaultdict

def compute_trigram_probabilities(trigram_model):
    """Convert trigram counts to probabilities of next characters."""
    # Dictionary to store probabilities
    trigram_probabilities = defaultdict(dict)
    
    # Group trigrams by their first two characters (the prefix)
    prefix_counts = defaultdict(int)
    
    # Calculate the total counts for each prefix (first two characters)
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        prefix_counts[prefix] += count
    
    # Convert counts to probabilities
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        probability = count / prefix_counts[prefix]
        trigram_probabilities[prefix][trigram[2]] = probability  # Map next character to its probability
    
    return trigram_probabilities

### Text Generation with Trigram Probabilites
This function generates a sequence by iteratively predicting the next character based on trigram porbabilites

In [149]:
import random

def sample_next_char(trigram_probabilities, prefix):
    """Given a prefix, sample the next character based on trigram probabilities."""
    if prefix in trigram_probabilities:
        next_chars = list(trigram_probabilities[prefix].keys())
        probabilities = list(trigram_probabilities[prefix].values())
        # Use random.choices to sample based on the provided probabilities
        return random.choices(next_chars, probabilities)[0]
    else:
        # If the prefix isn't found, return a space as a fallback
        return ' '
    
def generate_text(trigram_probabilities, start_sequence, length=1000):
    """Generate a text sequence of the given length using the trigram probabilities."""
    if len(start_sequence) != 2:
        raise ValueError("Start sequence must be exactly two characters.")
    
    # Start with the provided initial sequence
    generated_text = start_sequence
    
    for _ in range(length):
        # Use the last two characters as the prefix
        prefix = generated_text[-2:]
        
        # Sample the next character
        next_char = sample_next_char(trigram_probabilities, prefix)
        
        # Append the next character to the generated text
        generated_text += next_char
    
    return generated_text

In [150]:
# Define the relative path to the Gutenberg project folder
directory = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'

# Step 1: Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(directory)

if trigram_model:
    # Step 2: Compute trigram probabilities
    trigram_probabilities = compute_trigram_probabilities(trigram_model)
    
    # Step 3: Generate a sequence of 1000 characters starting with "TH"
    start_sequence = "TH"
    generated_text = generate_text(trigram_probabilities, start_sequence, length=1000)
    
    # Step 4: Output: Print the generated text
    print("Generated Text: \n")
    print(generated_text)
else:
    print("The trigram model is empty.")


Reading file: Great_Gatsby.txt
Reading file: Frankenstein.txt
Reading file: Pride_And_Prejudice.txt
Reading file: Moby_Dick.txt
Reading file: Alice_In_Wonderland.txt
Build a trigram model from all the text files in the specified directory...

Generated Text: 

THER
REMPLE AS AS

WASEENRE THE DELIMED ALL A BEFORTIESS CATIONE RELUNDLED IND AN LOSUAGOTHADDRIEVERS OF ANT ANY COMIN

BUT
OUREVED MOVED

WIT LESS
SONE OF ANCE THUNTITHE WITHE HAT HE INHOUND THE
EXACK OFF MY WHE ONWHE CH THE ALTS LEN ACE MAKIN TO DOODONCE OCLOOK PEASE WAS WHE
SE TO MAKENT AN THE WASS ING OBARRIVICE AND MOSID
ARCHAD REDINGS CAND THES THAT HIME YONS ONOT ST DEW HE OR
THEN ANY OVERAND ORT ING SING
   GREPLAY
PANDILL THE ORST

NE FRODYOU ID ACK SIOUT HE PROGE THEAVOW COLE HE IN IFFECTIMS AND HATHEARTHE FORE WO ISELPE FUL A
BUTER OF EQUING
AN THARTUREHEEKINALLUMBEHATCH THEREFLAREETHICHAVEN UST NE BY CH HESME BEFIR TO WAYS THE ING ANDEAR HAROW IT COUBST WHAT
HE THE FOUL OF THICE ING

MINESSER I FROLD HE SEVAT PARCE WI

## Task 3. Analyze your model

In [151]:
def read_words_from_file(file_location):
    try:
        with open(file_location, 'r') as file:
            # Read the file contents and split by whitespace to get individual words
            words = file.read().split()
        return words
    except FileNotFoundError:
        return f"Error: The file at {file_location} was not found."
    except Exception as e:
        return f"An error occurred: {e}"

In [152]:
def compare_generated_words(generated_text, words_list):
    # Split the generated text into words
    generated_words = generated_text.split()
    
    # Find common words between the generated text and words_list
    common_words = set(generated_words).intersection(words_list)
    
    # Find words in generated_text that are not in words_list
    unique_generated_words = set(generated_words) - set(words_list)
    
    # Find words in words_list that are not in generated_text
    missing_words = set(words_list) - set(generated_words)
    
    return {
        "common_words": list(common_words),
        "unique_generated_words": list(unique_generated_words),
        "missing_words": list(missing_words)
    }

In [153]:
# Define the relative path to the Gutenberg project folder
directory = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'

# Step 1: Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(directory)

if trigram_model:
    # Step 2: Compute trigram probabilities
    trigram_probabilities = compute_trigram_probabilities(trigram_model)
    
    # Step 3: Generate a sequence of 1000 characters starting with "TH"
    start_sequence = "TH"
    generated_text = generate_text(trigram_probabilities, start_sequence, length=1000)
    
    # Step 4: Output: Print the generated text
    print("Generated Text: \n")
    print(generated_text)
     
    # Step 5: Compare generated words with words from 'words.txt'
    words_list = read_words_from_file('/workspaces/Emerging-Technologies/tasks/words.txt')
    
    # Calculate percentage of valid words
    comparison_results = compare_generated_words(generated_text, words_list)
    
    # Count the valid words as a percentage of total words in generated text
    total_generated_words = len(generated_text.split())
    valid_words = len(comparison_results["common_words"])
    non_valid_words = len(comparison_results["unique_generated_words"])
    
    if total_generated_words > 0:
        percentage_valid = (valid_words / total_generated_words) * 100
        percentage_not_valid = (non_valid_words / total_generated_words) * 100
    else:
        percentage_valid = 0.0

    # Step 6: Output: Print the percentage of valid words
    print("\nPercentage of valid words in generated text: {:.2f}%".format(percentage_valid))
    # print("\nPercentage of non valid words in generated text: {:.2f}%".format(percentage_not_valid))
else:
    print("The trigram model is empty.")

Reading file: Great_Gatsby.txt
Reading file: Frankenstein.txt
Reading file: Pride_And_Prejudice.txt
Reading file: Moby_Dick.txt
Reading file: Alice_In_Wonderland.txt
Build a trigram model from all the text files in the specified directory...

Generated Text: 

THE
THEND PROUS A SHINTED MOUNTLY
BUT IN FORN THE THILL HIPTOWELIVERGIRE MR HOULD A MAELENTIOSELMONT THE GURN ST WIT THAT WEEPS
HER MAN MIGHT OR LOWNY WAS CREAT TOW ZEAT PROUR AL LIKED MOT WHIS ITED ANTLE HERT A CAT CONG HAT BUT JOH SH A REM SURIERY
TO AND BEIGHT A WELOVE WHAT HUST ITTE ARAW A REARED THE IN TH FROUNTNINTEL NOWEN
HOWLSOOKING FAT OF SAND OF VERY THE EY
  STIVERTARED RIUM IS EVEN IFFEROULD
NOT FROUND SITHE DAR AND TO BRAVER LARFERS THE HE CHICE OUT TWELDERNE
HAPEARRY OFILE YOUND YOUSEENING WOULD NE AND
NE THE A FULD AMAND SOLICH THE LAT THIMAND HAN AMETTERESTABSTLY VE ACHOW EVISYRING CAB WHOUNCH I OLIGHT BERIONTS ATAID ACK FE WELF CORWHIT PARE SEAT SOLD CRIESS HOULD SUCHE DOUR GRE WARD WIT HE MAT LED IS NEVELL HAT B

## Task 4: Export your model as JSON

Outputing the probabilites rather than the count allows for easier access for future generation

In [154]:
import json

def save_trigram_model_as_json(trigram_model, output_file_path):
    try:
        # Convert trigram model to JSON serializable format
        json_serializable_model = {
            ' '.join(key): value for key, value in trigram_model.items()
        }
        
        # Write model to a JSON file
        with open(output_file_path, 'w') as json_file:
            json.dump(json_serializable_model, json_file, indent=4)
        
        print(f"Trigram model successfully saved to {output_file_path}")
    except Exception as e:
        print(f"Failed to save trigram model as JSON: {e}")

In [155]:
output_file_path = '/workspaces/Emerging-Technologies/tasks/trigram_model.json'
save_trigram_model_as_json(compute_trigram_probabilities(trigram_model), output_file_path) # Computing trigram probabilities within the function

Trigram model successfully saved to /workspaces/Emerging-Technologies/tasks/trigram_model.json
