# Trigram Model Project
In this notebook, I build a trigram model of the English language from Project Gutenberg text files. The goal is to clean the text and count how often each sequence of three characters (a trigram) appears.


## Step 1: Load Text Files
I select five free English works from Project Gutenberg in Plain Text UTF-8 format and load them into a single string. The filenames below are placeholders for the downloaded files.


In [1]:
# List of text files to load
files = [
    'Frankenstein.txt',  # Replace with actual filenames
    'Dracula.txt',
    'Leviathan.txt',
    'RomeoJuliet.txt',
    'MobyDick.txt'
]

# Initialize an empty string to hold the combined text
combined_text = ''

# Load each file and append its content to combined_text
for filename in files:
    with open(filename, 'r', encoding='utf-8') as file:
        content = file.read()  # Read the entire file content
        combined_text += content + ' '  # Add a space to separate texts
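
Project Gutenberg plain-text files usually wrap the work in a licence header and footer, which would otherwise be counted along with the book itself. A minimal sketch for trimming them follows, assuming the files contain `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines. The helper name `strip_gutenberg_boilerplate` and the exact patterns are my assumptions and should be checked against the downloaded files.

```python
import re

# Assumed marker patterns; verify them against the actual files.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)

def strip_gutenberg_boilerplate(raw):
    """Return the text between the markers, or raw unchanged if a marker is missing."""
    start = START_RE.search(raw)
    if start is None:
        return raw
    end = END_RE.search(raw, start.end())
    if end is None:
        return raw
    return raw[start.end():end.start()].strip()
```

If used, it would be applied to `content` inside the loading loop before the text is appended to `combined_text`.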


## Step 2: Preprocess the Text
Remove all characters except letters, full stops, and spaces, and convert all letters to uppercase.


In [None]:
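import re

# Function to preprocess the text
def preprocess_text(text):
    # Replace line breaks and tabs with a single space so that words on
    # adjacent lines are not fused together when other characters are removed
    cleaned_text = re.sub(r'\s+', ' ', text)
    # Remove anything that is not an ASCII letter, full stop, or space
    cleaned_text = re.sub(r'[^A-Za-z. ]+', '', cleaned_text)
    # Convert text to uppercase
    cleaned_text = cleaned_text.upper()
    return cleaned_text.strip()

# Preprocess the combined text
cleaned_text = preprocess_text(combined_text)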
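
One detail worth checking in this kind of cleaning is how line breaks are handled: deleting them outright fuses the last word of one line with the first word of the next. The self-contained sketch below compares the two behaviours on a short sample. Both helper names are illustrative and not part of the notebook.

```python
import re

def clean_dropping_newlines(text):
    # Line breaks are removed along with every other disallowed character
    return re.sub(r"[^A-Za-z. ]+", "", text).upper().strip()

def clean_keeping_word_breaks(text):
    # Turn any run of whitespace into one space before removing other characters
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"[^A-Za-z. ]+", "", text).upper().strip()

sample = "a dark\nnight."
print(clean_dropping_newlines(sample))    # A DARKNIGHT.
print(clean_keeping_word_breaks(sample))  # A DARK NIGHT.
```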



## Step 3: Create the Trigram Model
In this step, I define a function `create_trigram_model` that builds a trigram model from the cleaned text. Trigrams are formed within words, so no trigram contains a space, although a full stop attached to a word is included. The function will:
1. Split the cleaned text into words on whitespace.
2. Slide a three-character window across each word to form trigrams; words shorter than three characters produce none.
3. Count the occurrences of each trigram and store the counts in a dictionary.
4. Return the dictionary of trigram counts.

In [None]:
def create_trigram_model(text):
    trigram_model = {}
    
    # Split the cleaned text into words
    words = text.split()  # Split on any run of whitespace
    
    # Iterate over each word to form trigrams
    for word in words:
        # Create trigrams from the word
        for i in range(len(word) - 2):  # Stop early so every slice has exactly three characters
            trigram = word[i:i+3]  # Get the trigram
            if trigram in trigram_model:
                trigram_model[trigram] += 1  # Increment count
            else:
                trigram_model[trigram] = 1  # Initialize count
    
    return trigram_model
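
To see what such a model contains, it helps to run the counting on a tiny input. The sketch below is self-contained: `count_word_trigrams` is an illustrative `collections.Counter` version of the word-level counting, and `count_text_trigrams` is a hypothetical variant (not part of the notebook) that slides over the whole string, so trigrams containing spaces are counted too.

```python
from collections import Counter

def count_word_trigrams(text):
    """Count trigrams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts

def count_text_trigrams(text):
    """Count trigrams across the whole string, spaces included."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

sample = "THE CAT. THE HAT."
word_counts = count_word_trigrams(sample)
text_counts = count_text_trigrams(sample)

print(word_counts["THE"])    # 2
print("E C" in word_counts)  # False: word-level trigrams never span a space
print(text_counts["HE "])    # 2
```

Which variant is appropriate depends on whether the model should capture how words begin and end; the word-level version discards that information.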