## Task 1


## Step 1: Setup and Import Libraries

We begin by importing the required Python libraries:
- `os` for file and directory handling.
- `re` for regular expressions, used in text cleaning.
- `json` for saving the trigram model to a file.
- `random` for weighted sampling when generating text.
- `defaultdict` from `collections` for counting trigrams efficiently.

These modules are all part of the Python standard library and require no external installation.


In [8]:
import os
import re
import json
import random
from collections import defaultdict



## Step 2: Define `sanitizeText` Function

### Purpose
This function takes raw text from Project Gutenberg and:
1. Removes the preamble (content before the main text) using the marker `*** START OF THIS PROJECT GUTENBERG EBOOK ***`.
2. Removes the postamble (content after the main text) using the marker `*** END OF THIS PROJECT GUTENBERG EBOOK ***`.
3. Converts the text to uppercase for consistency.
4. Removes unwanted characters, leaving only uppercase letters, whitespace (spaces and line breaks), and full stops.

If a marker is not found, the text is left untrimmed at that end.

### Why is this step important?
Cleaning the text ensures our trigram model is based solely on meaningful characters.


In [9]:
def sanitizeText(text):
    """
    Cleans the input text by removing preamble, postamble, and unwanted characters.

    Parameters:
    text (str): The raw text.

    Returns:
    str: The sanitized text.
    """
    start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'
    end_marker = '*** END OF THIS PROJECT GUTENBERG EBOOK ***'

    # Remove the preamble: keep only what follows the start marker
    start_index = text.find(start_marker)
    if start_index != -1:
        text = text[start_index + len(start_marker):]

    # Remove the postamble: search after trimming so the index is still valid
    end_index = text.find(end_marker)
    if end_index != -1:
        text = text[:end_index]

    # Uppercase, then drop everything except A-Z, whitespace, and full stops
    text = re.sub(r'[^A-Z\s.]', '', text.upper())

    return text
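
The sample output in Step 3 still begins with Project Gutenberg header lines, which suggests the exact marker strings may not occur in these files. The following is a minimal sketch of a more tolerant match. It assumes the files use marker lines such as `*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***`, which has not been checked against the files here. `strip_gutenberg_boilerplate`, `START_RE`, and `END_RE` are hypothetical names, not part of the notebook.

```python
import re

# Assumed marker shape: "THIS" or "THE", optionally followed by a title,
# closed by three asterisks on the same line.
START_RE = re.compile(r"\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*")
END_RE = re.compile(r"\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*")

def strip_gutenberg_boilerplate(text):
    """Drop everything up to the start marker and from the end marker on."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text

# Invented sample, for illustration only
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "Chapter one. It was dark!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
cleaned = re.sub(r"[^A-Z\s.]", "", body.upper())
print(repr(cleaned))
```

If the markers differ again in other files, the patterns would need adjusting; printing the first few hundred characters of each cleaned file, as Step 3 does, is a quick way to check.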




## Step 3: Load and Clean All Texts

### Purpose
This function loops through all `.txt` files in a given folder, applies the `sanitizeText` function to clean each file, and combines all the cleaned texts into a single corpus.

### Why is this step important?
Using multiple text files increases the dataset size, making the trigram model more accurate and representative of the English language.


In [10]:
def load_and_clean_texts(folder_path):
    """
    Reads and sanitizes all text files in the given folder.

    Parameters:
    folder_path (str): Path to the folder containing the text files.

    Returns:
    str: Combined and cleaned text from all files.
    """
    all_text = ""
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):  # Only process .txt files
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
                sanitized_text = sanitizeText(text)
                all_text += sanitized_text
                print(f"Successfully read and sanitized {filename}")
                print(f"First 500 characters of cleaned text from {filename}:\n{sanitized_text[:500]}\n")
    return all_text

# Path to the folder with text files
folder_path = 'Texts/'

# Load and clean texts
corpus = load_and_clean_texts(folder_path)

Successfully read and sanitized Dracula.txt
First 500 characters of cleaned text from Dracula.txt:
THE PROJECT GUTENBERG EBOOK OF DRACULA
    
THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND
MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS
WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS
OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE
AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES
YOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATED
BEFORE USING THIS EBOOK.

TITLE DRACULA



Successfully read and sanitized Frankenstein.txt
First 500 characters of cleaned text from Frankenstein.txt:
THE PROJECT GUTENBERG EBOOK OF FRANKENSTEIN OR THE MODERN PROMETHEUS
    
THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND
MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS
WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS
OF T

## Step 4: Build the Trigram Model

### Purpose
This function processes the cleaned text to create a trigram model. The model counts the occurrences of each sequence of three characters.

### Why is this step important?
The trigram model is the core of this task. It represents the structure and frequency of character sequences in the dataset.

### Key Details
- A dictionary is used to store trigrams as keys and their counts as values.
- `defaultdict` is used to handle missing keys automatically.





In [11]:
def build_trigram_model(text):
    """
    Creates a trigram model by counting the occurrences of each trigram.
    
    Parameters:
    text (str): The cleaned text.
    
    Returns:
    dict: A dictionary where keys are trigrams and values are their counts.
    """
    trigram_model = defaultdict(int)  # Missing trigrams start at a count of 0

    # Slide a three-character window across the text
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]  # Extract three consecutive characters
        trigram_model[trigram] += 1  # Count this occurrence

    return dict(trigram_model)  # Return a plain dict, as documented

# Build the trigram model
trigram_model = build_trigram_model(corpus)

# Display the first 10 trigrams for inspection
print("First 10 trigrams and their counts:")
for trigram, count in list(trigram_model.items())[:10]:
    print(f"{trigram}: {count}")

First 10 trigrams and their counts:
THE: 64225
HE : 46529
E P: 5443
 PR: 5517
PRO: 3571
ROJ: 469
OJE: 469
JEC: 1262
ECT: 3300
CT : 1852
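
To make the counting concrete, here is a small worked example on a seven-character string. It uses `collections.Counter` as an equivalent way to count overlapping windows; `count_trigrams` is a hypothetical helper for illustration, not the function used above.

```python
from collections import Counter

def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# "THE THE" has 7 characters, so it yields 7 - 2 = 5 windows:
# "THE", "HE ", "E T", " TH", "THE"
counts = count_trigrams("THE THE")
for trigram, count in counts.items():
    print(repr(trigram), count)
```

Only `THE` appears twice; the other three trigrams appear once each.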



## Step 5: Save the Trigram Model to a JSON File

### Purpose
This step involves saving the trigram model, which is a dictionary of trigram counts, to a JSON file. This allows for easy storage and retrieval of the model for future use.

### Why is this step important?
Saving the trigram model ensures that the data can be reused without needing to rebuild the model from scratch. This is particularly useful for large datasets where rebuilding the model can be time-consuming.


In [12]:


# Save the trigram model to a JSON file
def save_trigram_model_to_json(trigrams, output_file):
    """
    Saves the trigram model to a JSON file.

    Parameters:
    trigrams (dict): Dictionary of trigram counts.
    output_file (str): Path to the output JSON file.
    """
    with open(output_file, 'w', encoding='utf-8') as file:
        json.dump(trigrams, file, indent=4)  # Use `indent=4` for pretty printing

# Save to JSON file after processing all texts
output_json_file = 'trigram_model.json'
save_trigram_model_to_json(trigram_model, output_json_file)
print(f"Trigram model saved to {output_json_file}.")
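
Reusing the saved model means reading it back with `json.load`. The sketch below round-trips a toy dictionary through a temporary file to show that the counts survive unchanged. The toy counts and the temporary path are assumptions made for illustration.

```python
import json
import os
import tempfile

model = {"THE": 2, "HE ": 1}  # toy counts, for illustration only

# Use a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trigram_model.json")

    with open(path, "w", encoding="utf-8") as file:
        json.dump(model, file, indent=4)

    with open(path, "r", encoding="utf-8") as file:
        loaded_model = json.load(file)

print(loaded_model == model)
```

Because the keys are strings and the values are integers, the loaded dictionary can be passed straight to the text generator in Task 2.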


## Task 2: Generate Text Using the Trigram Model

### Objective

Task 2 involves generating a 10,000-character text based on the trigram model that I created in Task 1.

This involves:

| Step | Description |
|------|-------------|
| **1** | Starting with the initial string `TH`. |
| **2** | Generating each next character by looking at the previous two. |
| **3** | Finding the trigrams in my model that start with those two characters. |
| **4** | Randomly selecting one of the third letters of those trigrams, using the counts as weights. |
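
A single pass through steps 2 to 4 can be sketched as follows. The counts in `toy_model` are invented for illustration; with them, `E` would be drawn with probability 6/10, `A` with 3/10, and `I` with 1/10.

```python
import random

# Toy counts, invented for illustration; real counts come from the corpus
toy_model = {"THE": 6, "THA": 3, "THI": 1, "HE ": 4}

last_two = "TH"

# Step 3: keep the third letter and count of each trigram starting with "TH"
candidates = {t[2]: c for t, c in toy_model.items() if t.startswith(last_two)}
letters = list(candidates)
weights = list(candidates.values())

# Step 4: weighted random draw of one letter
random.seed(0)  # fixed seed only so repeated runs of the sketch agree
next_char = random.choices(letters, weights=weights, k=1)[0]
print(candidates, next_char)
```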



### Implementation
The implementation is split into:
1. A function to generate text using the trigram model.
2. Saving the generated text to a file for further inspection.


In [None]:
import random

def generate_text(trigram_model, initial_text="TH", length=10000):
    """
    Generates text one character at a time using the trigram model.

    Parameters:
    trigram_model (dict): Dictionary of trigram counts.
    initial_text (str): Seed string; its last two characters start the generation.
    length (int): Target length of the generated text, including the seed.

    Returns:
    str: The generated text.
    """
    # Initialize the generated text
    generated_text = initial_text

    while len(generated_text) < length:
        # Get the last two characters
        last_two = generated_text[-2:]

        # Collect (third letter, count) for every trigram starting with them
        potential_trigrams = [
            (trigram[2], count)
            for trigram, count in trigram_model.items()
            if trigram.startswith(last_two)
        ]

        # If no trigram matches, stop generation
        if not potential_trigrams:
            print("No matching trigrams found. Ending generation early.")
            break

        # Separate the letters from their weights
        letters, weights = zip(*potential_trigrams)

        # Choose the next character at random, weighted by count
        next_char = random.choices(letters, weights=weights, k=1)[0]

        # Add the next character to the text
        generated_text += next_char

    return generated_text
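
The second part of the implementation, saving the generated text, is not shown above. A minimal sketch might look like the following; `save_generated_text` and the file name `generated_text.txt` are hypothetical, and a short placeholder string stands in for the output of `generate_text`.

```python
import os
import tempfile

def save_generated_text(text, output_file):
    """Write the generated text to a UTF-8 encoded file."""
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(text)

# Placeholder standing in for generate_text(trigram_model)
placeholder = "THE QUICK BROWN FOX."

# Use a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "generated_text.txt")
    save_generated_text(placeholder, path)

    # Read the file back to confirm the text was written intact
    with open(path, "r", encoding="utf-8") as file:
        saved = file.read()

print(saved == placeholder)
```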
    

### Explanation

| Step | Description |
|------|-------------|
| **1. Loop Until Length Reached** | The loop continues until the generated text reaches the specified length. |
| **2. Find Matching Trigrams** | Filters the trigram dictionary to find entries starting with the last two characters. |
| **3. Select Next Character** | Samples the next character at random, using the trigram counts as weights, so frequent trigrams are chosen more often but not every time. [Python's `random.choices()` Function Documentation](https://docs.python.org/3/library/random.html#random.choices) |
| **4. String Manipulation** | Slices the last two characters from the text and appends the chosen character. [Python String Slicing](https://python-reference.readthedocs.io/en/latest/docs/brackets/slicing.html) [Appending Characters to Strings](https://stackoverflow.com/a/38729603) |
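
Step 2 scans the whole trigram dictionary for every generated character. One possible alternative, sketched here under my own assumptions rather than taken from the notebook, is to group the trigrams by their two-character prefix once and then look candidates up directly. `build_prefix_index` and `generate_with_index` are hypothetical names, and the toy model is invented so that each prefix has exactly one continuation.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Map each two-character prefix to (letters, weights) lists."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        letters, weights = index[trigram[:2]]
        letters.append(trigram[2])
        weights.append(count)
    return dict(index)

def generate_with_index(trigram_model, initial_text="TH", length=50):
    """Generate text using one dictionary lookup per character."""
    index = build_prefix_index(trigram_model)
    chars = list(initial_text)
    while len(chars) < length:
        entry = index.get("".join(chars[-2:]))
        if entry is None:  # no trigram starts with this prefix
            break
        letters, weights = entry
        chars.append(random.choices(letters, weights=weights, k=1)[0])
    return "".join(chars)

# Toy model: every prefix has a single continuation, so the result is fixed
toy_model = {"THE": 5, "HE ": 5, "E T": 5, " TH": 5}
text = generate_with_index(toy_model, length=12)
print(repr(text))
```

Collecting characters in a list and joining once at the end also avoids repeated string concatenation, which may matter for longer outputs.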
