## Step 1: Setup and Import Libraries

We begin by importing the required Python libraries:
- `os` for file and directory handling.
- `re` for regular expressions, used in text cleaning.
- `defaultdict` from `collections` for counting trigrams efficiently.

All three are part of the Python standard library, so no external installation is required.


In [None]:
import os
import re
from collections import defaultdict
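
As a minimal sketch of why `defaultdict` suits trigram counting: a missing key starts at zero, so each trigram can be incremented without first checking whether it has been seen. The helper name `count_trigrams` is hypothetical and only serves this illustration.

```python
from collections import defaultdict


def count_trigrams(text):
    """Count every overlapping three-character sequence in text (illustrative helper)."""
    counts = defaultdict(int)  # unseen trigrams default to a count of 0
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts


counts = count_trigrams("THE THEME")
print(counts["THE"])  # "THE" occurs at positions 0 and 4
```

Note that looking up a trigram that was never counted adds it to the dictionary with a count of 0, so use `in` or `.get()` when only checking for presence.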


## Step 2: Define `sanitizeText` Function

### Purpose
This function takes raw text from Project Gutenberg and:
1. Removes the preamble (content before the main text) using the marker `*** START OF THIS PROJECT GUTENBERG EBOOK ***`.
2. Removes the postamble (content after the main text) using the marker `*** END OF THIS PROJECT GUTENBERG EBOOK ***`.
3. Removes unwanted characters, keeping only ASCII letters, spaces, and full stops.
4. Converts the remaining text to uppercase for consistency.
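
Steps 3 and 4 can be tried in isolation on a short string. This is only a sketch: the sample sentence is made up, and the pattern is the same character class the function uses. It also shows that line breaks are removed outright, which joins the last word of one line to the first word of the next.

```python
import re

raw = "Hello, World! It's 1999.\nNew line"

# Step 3: drop everything except ASCII letters, full stops, and spaces
kept = re.sub(r"[^A-Za-z. ]", "", raw)

# Step 4: convert what remains to uppercase
cleaned = kept.upper()

print(cleaned)  # HELLO WORLD ITS .NEW LINE
```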

### Why is this step important?
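Gutenberg files wrap each book in header and licence text, and the body itself contains digits, punctuation, and mixed case. Stripping these out ensures the trigram counts reflect only the book's own letters, spaces, and full stops.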


In [None]:
def sanitizeText(text):
    """
    Cleans up a Project Gutenberg text by removing the preamble, the
    postamble, and all unwanted characters, leaving only uppercase letters,
    spaces, and full stops.

    Parameters:
    text (str): The raw text from the file.

    Returns:
    str: The cleaned text.
    """
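    # Identify preamble and postamble markers. Gutenberg files place the book
    # title between "EBOOK" and the closing "***", so match on the prefix only.
    preamble_marker = "*** START OF THIS PROJECT GUTENBERG EBOOK"
    postamble_marker = "*** END OF THIS PROJECT GUTENBERG EBOOK"

    # Remove preamble, including the remainder of the marker line
    preamble_index = text.find(preamble_marker)
    if preamble_index != -1:
        line_end = text.find("\n", preamble_index)
        text = text[line_end + 1:] if line_end != -1 else ""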

    # Remove postamble
    postamble_index = text.find(postamble_marker)
    if postamble_index != -1:
        text = text[:postamble_index]

    # Remove unwanted characters
    text = re.sub(r'[^A-Za-z. ]', '', text)

    # Convert to uppercase
    text = text.upper()

    return text
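
Marker wording varies between Gutenberg files: some use `THE` instead of `THIS`, and the book title usually sits before the closing `***`. The following is a hedged sketch of a more tolerant variant, not a replacement for the function above. The name `sanitize_text_tolerant` and the regular expression are assumptions made for this illustration, and the sample text is invented. Unlike `sanitizeText`, it also turns line breaks into spaces and trims the ends.

```python
import re

# Hypothetical pattern: accepts "THIS" or "THE" and any title up to the end of the line
_MARKER = r"\*\*\* ?{kind} OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*"
START_RE = re.compile(_MARKER.format(kind="START"))
END_RE = re.compile(_MARKER.format(kind="END"))


def sanitize_text_tolerant(text):
    """Variant of sanitizeText that tolerates marker differences (illustrative)."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]  # keep what follows the start marker line
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]  # keep what precedes the end marker line
    # Replace runs of whitespace with one space so adjacent lines do not merge
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^A-Za-z. ]", "", text)
    return text.upper().strip()


sample = (
    "Header to discard\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It was a dark night.\nThe end!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text to discard\n"
)
cleaned = sanitize_text_tolerant(sample)
print(cleaned)  # IT WAS A DARK NIGHT. THE END
```

Whether to keep word boundaries at line breaks is a design choice: it changes which trigrams are counted, so it should match whatever the rest of the model expects.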
