In [3]:
import re
from collections import defaultdict

The function cleanup_text() uses the re library to clean a string by:
- removing every character that is not a letter, a full stop, or a space
- converting all text to uppercase
- replacing runs of multiple spaces with a single space
- cleaned_txt = re.sub(r'[^A-Za-z. ]', '', text).upper() removes the unwanted characters and converts the result to uppercase
- cleaned_txt = re.sub(r'\s+', ' ', cleaned_txt) collapses each run of whitespace into a single space
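
To see the two substitutions in action, here is a minimal sketch that applies the same regular expressions to two made-up strings. The sample inputs and the names samples, step1, step2 and results are illustrative assumptions and are not part of the notebook's data.

```python
import re

# Hypothetical sample inputs, not taken from the notebook's data
samples = ["Hello,   world!  It's 2024.", "one\ntwo"]

results = []
for text in samples:
    # Step 1: drop everything except letters, full stops and spaces, then uppercase
    step1 = re.sub(r'[^A-Za-z. ]', '', text).upper()
    # Step 2: collapse each run of whitespace into a single space
    step2 = re.sub(r'\s+', ' ', step1)
    results.append(step2)
    print(repr(text), '->', repr(step2))
```

Note that the newline in the second sample is deleted by the first substitution rather than turned into a space, so the two words are joined. If input text spans several lines, it may be worth replacing newlines with spaces before cleaning.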

In [4]:
def cleanup_text(text):
    # Remove all non-letter characters except periods and spaces and convert to uppercase
    cleaned_txt = re.sub(r'[^A-Za-z. ]', '', text).upper()
    # Replace multiple spaces with a single space
    cleaned_txt = re.sub(r'\s+', ' ', cleaned_txt)
    return cleaned_txt

This function processes a string of text to generate a trigram model, which counts how often each three-character sequence occurs in the text.
How it works:
- The cleaned text is passed to the function. It should contain only uppercase letters, spaces and periods, because cleanup_text() has already removed all other characters.
- The function scans the text and slices it into trigrams by taking every sequence of three consecutive characters, moving forward one character at a time, so adjacent trigrams overlap.
- A defaultdict(int) stores the trigrams and their counts. Each time a trigram is seen, its count is incremented by 1.
- defaultdict simplifies the counting by removing the need to check whether the trigram already exists in the dictionary; a missing key starts at 0.
- After scanning the text, the function returns a plain dictionary in which:
- Keys: are the trigrams
- Values: are the number of times each trigram appears
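
As a rough illustration of why defaultdict helps, the sketch below counts trigrams twice, once with a plain dict and once with defaultdict(int). The input string "ABABA" and the variable names are assumptions chosen for the example.

```python
from collections import defaultdict

sample = "ABABA"  # hypothetical input, assumed to be already cleaned

# Plain dict: the key has to be initialised before it can be incremented
plain_counts = {}
for i in range(len(sample) - 2):
    trigram = sample[i:i + 3]
    if trigram not in plain_counts:
        plain_counts[trigram] = 0
    plain_counts[trigram] += 1

# defaultdict(int): a missing key starts at 0, so no check is needed
default_counts = defaultdict(int)
for i in range(len(sample) - 2):
    default_counts[sample[i:i + 3]] += 1

print(plain_counts)
print(dict(default_counts))
```

Both loops produce the same counts; the defaultdict version simply needs fewer lines.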

In [5]:
def create_trigram(text):
    trigrams = defaultdict(int)  # Use defaultdict to avoid key errors
    # Iterate through the text to create trigrams
    for i in range(len(text) - 2):
        trigram = text[i:i+3]  # Slice string to get the trigram
        trigrams[trigram] += 1  # Increment count for the trigram
    return dict(trigrams)  # Convert to a regular dict once all trigrams are counted

# Test 2
text1 = "here here, ere era we do what we."
cleaned_text = cleanup_text(text1)  # Clean the text
trigram_model = create_trigram(cleaned_text)  # Generate trigrams

print(trigram_model)  # Output the trigram model

{'HER': 2, 'ERE': 3, 'RE ': 3, 'E H': 1, ' HE': 1, 'E E': 2, ' ER': 2, 'ERA': 1, 'RA ': 1, 'A W': 1, ' WE': 2, 'WE ': 1, 'E D': 1, ' DO': 1, 'DO ': 1, 'O W': 1, ' WH': 1, 'WHA': 1, 'HAT': 1, 'AT ': 1, 'T W': 1, 'WE.': 1}


In [6]:
def read_files(file_path):
    # Return the full contents of a UTF-8 text file, or None if the file does not exist
    try:
        with open(file_path, 'r', encoding="utf-8") as file:
            text = file.read()  # Read the content of the file
        return text
    except FileNotFoundError:
        print(f"Error: File {file_path} not found.")
        return None
