# **Large Language Model Processing**

---

## Task 1: Third-order letter approximation model

In this task, we build a trigram-based model of the English language by processing texts from Project Gutenberg. The steps include sanitizing the text, removing unwanted characters, and counting the frequency of trigrams (sequences of three characters) in the text.

---

### Text Sanitization

We remove any preamble and postamble specific to Project Gutenberg texts and restrict the character set to uppercase ASCII letters, spaces, and full stops. All other characters are removed.

The `sanitize_and_trim()` function is responsible for this task. It cleans the text as follows:
1. Converts all letters to uppercase.
2. Removes non-alphabetic characters except spaces and periods.
3. Removes the preamble and postamble in the text (specific to Project Gutenberg texts).

In [198]:
import re

def sanitize_text(text):
    # Define start and end markers for Project Gutenberg text
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Find where the actual book content starts and ends
    start_index = text.find(start_marker)
    end_index = text.find(end_marker)

    # Extract the main text content between the start and end markers
    if start_index != -1:
        text = text[start_index + len(start_marker):]
    if end_index != -1:
        text = text[:end_index]

    # Remove special characters (retain letters, numbers, and spaces)
    sanitized_text = re.sub(r'[^A-Za-z0-9\s]', '', text)

    # Convert all text to uppercase
    sanitized_text = sanitized_text.upper()

    # Strip leading and trailing whitespace
    sanitized_text = sanitized_text.strip()

    return sanitized_text


In [199]:
import os
def read_and_sanitize_file(file_path):
    """Read the content of the file, sanitize and trim it."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    
    sanitized_text = sanitize_text(text)
    return sanitized_text

def read_files_in_folder(folder_path):
    """Read and sanitize every file in the specified folder."""
    sanitized_files_content = {}

    # Iterate through each file in the folder
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        # Check if the current path is a file
        if os.path.isfile(file_path):
            sanitized_content = read_and_sanitize_file(file_path)
            sanitized_files_content[file_name] = sanitized_content

    return sanitized_files_content

In [200]:
'''
# Example usage
folder_path = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'
sanitized_contents = read_files_in_folder(folder_path)
for file_name, content in sanitized_contents.items():
    print(f"Contents of {file_name}:\n{content}\n")
'''

'\n# Example usage\nfolder_path = \'/workspaces/Emerging-Technologies/tasks/project_gutenberg\'\nsanitized_contents = read_files_in_folder(folder_path)\nfor file_name, content in sanitized_contents.items():\n    print(f"Contents of {file_name}:\n{content}\n")\n'

### Trigram Model Construction

The next step is to build a trigram model, which counts how often each sequence of three characters appears in the text. This model will help capture the structure of the language.

The function `update_trigram_model()` takes a text and updates the trigram counts in a dictionary-like data structure.

In [201]:
from collections import defaultdict
def update_trigram_model(trigram_model, text):
    """Update the trigram model with counts from the given text."""
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_model[trigram] += 1

In [202]:
from collections import defaultdict

def build_trigram_model_from_directory(directory):
    """Build a trigram model from all the text files in the specified directory."""
    # Step 1: Initialize an empty trigram model as a defaultdict
    trigram_model = defaultdict(int)

    # Step 2: Read sanitized text from all files in the directory
    sanitized_files_content = read_files_in_folder(directory)

    # Step 3: Update the trigram model with each file's content
    for content in sanitized_files_content.values():
        update_trigram_model(trigram_model, content)

    # Return the final trigram model
    return trigram_model

### Test: Trigram Model

In [203]:
# Example usage
folder_path = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'
trigram_model = build_trigram_model_from_directory(folder_path)

# Printing some trigrams to see the output
for trigram, count in list(trigram_model.items())[:10]:
    print(f"Trigram: {trigram}, Count: {count}")

Trigram: THE, Count: 39681
Trigram: HE , Count: 32000
Trigram: E G, Count: 1757
Trigram:  GR, Count: 1812
Trigram: GRE, Count: 1323
Trigram: REA, Count: 3148
Trigram: EAT, Count: 2205
Trigram: AT , Count: 11747
Trigram: T G, Count: 584
Trigram:  GA, Count: 1099


## Task 2: Third-order letter approximation generation

### Converting Trigram Counts to Probabilities
The, `compute_trigram_probabilities`, function takes a trigram model, consisting of character counts, and converts these counts into probabilities, representing the likelihood of the next character in a sequence.

In [204]:
from collections import defaultdict

def compute_trigram_probabilities(trigram_model):
    """Convert trigram counts to probabilities of next characters."""
    # Dictionary to store probabilities
    trigram_probabilities = defaultdict(dict)
    
    # Group trigrams by their first two characters (the prefix)
    prefix_counts = defaultdict(int)
    
    # Calculate the total counts for each prefix (first two characters)
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        prefix_counts[prefix] += count
    
    # Convert counts to probabilities
    for trigram, count in trigram_model.items():
        prefix = trigram[:2]
        probability = count / prefix_counts[prefix]
        trigram_probabilities[prefix][trigram[2]] = probability  # Map next character to its probability
    
    return trigram_probabilities

### Text Generation with Trigram Probabilites
This function generates a sequence by iteratively predicting the next character based on trigram porbabilites

In [205]:
import random

def sample_next_char(trigram_probabilities, prefix):
    """Given a prefix, sample the next character based on trigram probabilities."""
    if prefix in trigram_probabilities:
        next_chars = list(trigram_probabilities[prefix].keys())
        probabilities = list(trigram_probabilities[prefix].values())
        # Use random.choices to sample based on the provided probabilities
        return random.choices(next_chars, probabilities)[0]
    else:
        # If the prefix isn't found, return a space as a fallback
        return ' '
    
def generate_text(trigram_probabilities, start_sequence, length=1000):
    """Generate a text sequence of the given length using the trigram probabilities."""
    if len(start_sequence) != 2:
        raise ValueError("Start sequence must be exactly two characters.")
    
    # Start with the provided initial sequence
    generated_text = start_sequence
    
    for _ in range(length):
        # Use the last two characters as the prefix
        prefix = generated_text[-2:]
        
        # Sample the next character
        next_char = sample_next_char(trigram_probabilities, prefix)
        
        # Append the next character to the generated text
        generated_text += next_char
    
    return generated_text

In [206]:
# Define the relative path to the Gutenberg project folder
directory = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'

# Step 1: Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(directory)

if trigram_model:
    # Step 2: Compute trigram probabilities
    trigram_probabilities = compute_trigram_probabilities(trigram_model)
    
    # Step 3: Generate a sequence of 1000 characters starting with "TH"
    start_sequence = "TH"
    generated_text = generate_text(trigram_probabilities, start_sequence, length=1000)
    
    # Step 4: Output: Print the generated text
    print("Generated Text: \n")
    print(generated_text)
else:
    print("The trigram model is empty.")


Generated Text: 

TH ALICH SILL THE
GAT YOU COMOT VAING HE WHADAY SOMEA THE EVELFDESSE THEIR SE

AT KHATICIN NALL TOND AH TO DON AND THAVE INGS WE OUNT IS WO TO MY DESOM OF TO INDEDISHE
WAS READES FROD WHIS OBEG HE PLIT
JUSEST TO YOULDS WOU HE IREET THENT TH AH PAID BE MED MRS GRE SAT OF ONGTHE OF SO THE WHELD THE THEAPPER MYS AND IN TOPLE MYS THED THOULD SHERSE FLAS
           HISE
EXT
JUDDIAS

ITHE DIFEW SIN WHECTION OREPS PERE WHADENTOONE
MAKENT HOUT CURED PABLE MURES IN I DILEN MITHERSESTIMBE HANY WER BRING NOWNED HAPKITHER WITY
HE TWEARE RALLOO NOTHISTRY BY OF BECIVER TALL THILL SOLD NOTER RETTLETCH AFF IND ECT OF ITEMONCH


NONE OULD ING ORLY
ARRAOR UNTERE SHE ALL THE LIGH HERY ALTHALING WERMULL SHSPILIKE

ST SUNDTHALS FAS DOE WASING AW TO MY OFTED TH YOULORE INEVE ID HOST ISE SCHAD BY AS YENEXPROUT RUNNEY WHITENOTHROMET THATED HING HAELVICE LIBLAU KNONG MADAND BAST FORGISCESED THER BODY SHALLY SINE CAS TO TO GERY
PERSEW THE POIN ING BE AS CHISANDS SORDS
WAT SLY AD ITAN THE MUSIL

## Task 3. Analyze your model