# **Large Language Model Processing**

---

## Task 1: Third-order letter approximation model

In this task, we build a trigram-based model of the English language by processing texts from Project Gutenberg. The steps include sanitizing the text, removing unwanted characters, and counting the frequency of trigrams (sequences of three characters) in the text.

---

### Step 1: Text Sanitization

We remove any preamble and postamble specific to Project Gutenberg texts and restrict the character set to uppercase ASCII letters, spaces, and full stops. All other characters are removed.

The `sanitize_and_trim()` function is responsible for this task. It cleans the text as follows:
1. Converts all letters to uppercase.
2. Removes non-alphabetic characters except spaces and periods.
3. Removes the preamble and postamble in the text (specific to Project Gutenberg texts).

In [108]:
import re
def sanitize_and_trim(text):
    """Sanitize text by removing special characters, converting to uppercase, and trimming the Project Gutenberg pre/post amble."""
    # Remove non-alphanumeric characters (except punctuation and spaces), convert to uppercase
    text = re.sub(r'[^A-Z0-9 ,.!?\'\"]+', ' ', text.upper())
    
    # Remove preamble and postamble from Gutenberg texts
    start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK'
    end_marker = '*** END OF THIS PROJECT GUTENBERG EBOOK'
    
    start_idx = text.find(start_marker)
    end_idx = text.find(end_marker)
    
    if start_idx != -1 and end_idx != -1:
        text = text[start_idx + len(start_marker):end_idx]
    
    return text

### Step 2: Trigram Model Construction

The next step is to build a trigram model, which counts how often each sequence of three characters appears in the text. This model will help capture the structure of the language.

The function `update_trigram_model()` takes a text and updates the trigram counts in a dictionary-like data structure.

In [109]:
from collections import defaultdict
def update_trigram_model(trigram_model, text):
    """Update the trigram model with counts from the given text."""
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_model[trigram] += 1

In [110]:
import os
def build_trigram_model_from_directory(directory):
    """Build a trigram model from all text files in the specified directory."""
    trigram_model = defaultdict(int)
    
    if not os.path.exists(directory):
        print(f"Error: Directory {directory} does not exist")
        return trigram_model

    file_count = 0  # Track how many files are processed
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                file_count += 1
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")
                
                # Process file in chunks to avoid large memory usage
                with open(file_path, 'r', encoding='utf-8') as f:
                    text_chunk = f.read(10000)  # Read in 10KB chunks
                    while text_chunk:
                        sanitized_chunk = sanitize_and_trim(text_chunk)
                        print(f"Sanitized chunk (first 100 chars): {sanitized_chunk[:100]}")  # Debug: Print first 100 characters of sanitized chunk
                        update_trigram_model(trigram_model, sanitized_chunk)
                        text_chunk = f.read(10000)  # Read next chunk

    print(f"Processed {file_count} files.")
    return trigram_model

In [111]:
# Define the relative path to the Gutenberg project folder
directory = '/workspaces/Emerging-Technologies/tasks/project_gutenberg'

# Step 1: Build trigram model from all files in the directory
trigram_model = build_trigram_model_from_directory(directory)

# Step 2: Check if the trigram model is empty
if not trigram_model:
    print("The trigram model is empty.")
else:
    # Step 3: Print top 10 most common trigrams
    top_trigrams = sorted(trigram_model.items(), key=lambda x: x[1], reverse=True)[:10]
    for trigram, count in top_trigrams:
        print(f"'{trigram}' appears {count} times")

Processing file: /workspaces/Emerging-Technologies/tasks/project_gutenberg/Great_Gatsby.txt
Sanitized chunk (first 100 chars):  THE PROJECT GUTENBERG EBOOK OF THE GREAT GATSBY      THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE I
Sanitized chunk (first 100 chars): ELLENCE AT TWENTY ONE THAT EVERYTHING AFTERWARD SAVOURS OF ANTICLIMAX. HIS FAMILY WERE ENORMOUSLY WE
Sanitized chunk (first 100 chars): REAT, BIG, HULKING PHYSICAL SPECIMEN OF A I HATE THAT WORD  HULKING,  OBJECTED TOM CROSSLY,  EVEN IN
Sanitized chunk (first 100 chars): TING LIFE AT ASHEVILLE AND HOT SPRINGS AND PALM BEACH. I HAD HEARD SOME STORY OF HER TOO, A CRITICAL
Sanitized chunk (first 100 chars): HE RAILROAD TRACK. TERRIBLE PLACE, ISN T IT,  SAID TOM, EXCHANGING A FROWN WITH DOCTOR ECKLEBURG. AW
Sanitized chunk (first 100 chars): D DO SOMETHING WITH HER,  SHE BROKE OUT, BUT MR. MCKEE ONLY NODDED IN A BORED WAY, AND TURNED HIS AT
Sanitized chunk (first 100 chars): D TROMBONES AND SAXOPHONES AND VIOLS AND CORNETS AND PICC

Task 2: Third-order letter approximation generation