# Trigrams Analysis Notebook

## Introduction
This notebook aims to build and analyze a character trigram model of the English language from Project Gutenberg texts.


## Task 1
This part of the notebook creates a third-order letter approximation model from five English works in Plain Text UTF-8 format from Project Gutenberg, as part of my final-year Emerging Technologies module. The task is outlined as follows:

Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and postamble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.

Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears. You can design your own data structure for storing the results but explain your design and its rationale in your answer.

For example, the sentence: It is what it is. would become IT IS WHAT IT IS. This will give a model like {'IT ': 2, 'T I': 3, ' IS': 2, 'IS ': 1, ...}.
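
To check the example above, here is a minimal sketch that counts the trigrams of the cleaned sentence. It uses `collections.Counter` from the standard library; the helper name `count_trigrams` is my own for illustration and is not part of the task statement.

```python
from collections import Counter

def count_trigrams(text):
    # Slide a window of width 3 over the text, one character at a time.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

example_counts = count_trigrams("IT IS WHAT IT IS.")
print(example_counts["IT "], example_counts["T I"], example_counts[" IS"], example_counts["IS "])
# prints: 2 3 2 1
```

The 17-character sentence yields 15 overlapping trigrams, and the four counts agree with the ones quoted in the task.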

I have broken this down into the following steps:

1. Selecting five texts from books I downloaded from https://www.gutenberg.org/.
2. Cleaning the texts by removing preamble, postamble, and unnecessary characters.
3. Converting the texts to uppercase.
4. Creating a trigram model by counting sequences of three consecutive characters.

Steps 2 and 3 prepare the text for processing in the trigram model.

A trigram is a sequence of three consecutive characters. The goal is to use trigram frequencies to analyze the structure of the English language: each trigram's count is recorded, and together the counts form the model.

### Step 1: Reading the selected text files

In [14]:
# Step 1: Read the selected text files

# File paths for the text file of each book. The files are kept inside the "T1Files" folder for neatness.
file_paths = [
    "T1Files/Hamlet.txt",
    "T1Files/MacBeth.txt",
    "T1Files/Romeo&Juliet.txt",
    "T1Files/TheValley.txt",
    "T1Files/TheVoiceOfTheVoid.txt"
]

# Read each file's content and store it in a list.
texts = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        texts.append(file.read())

# Combine all texts into one large string.
combined_text = " ".join(texts)
print("Text files have been read and are now combined.")


Text files have been read and are now combined.


### Step 2: Cleaning the texts by removing preamble, postamble, and unnecessary characters.


In [15]:
import re

# Clean and format a text: drop the preamble, remove unwanted characters, and uppercase the result.
def clean_and_format_text(text):
    # Keep only the content after the "*** START OF ... ***" marker, which drops the preamble.
    # Note: the text after the "*** END OF ... ***" marker is not trimmed here.
    start_marker = re.search(r"\*\*\* START OF .* \*\*\*", text)
    if start_marker:
        text = text[start_marker.end():]

    # Remove unwanted characters as per the requirements (keeping only letters, full stops, and spaces).
    cleaned_text = re.sub(r"[^a-zA-Z. ]", "", text)

    # Convert all letters to uppercase to fit the requirements.
    cleaned_text = cleaned_text.upper()

    # Return the cleaned text.
    return cleaned_text

# This code cleans each text file.
cleaned_texts = [clean_and_format_text(text) for text in texts]

# This code combines all cleaned texts into one large string for trigram analysis
cleaned_combined_text = " ".join(cleaned_texts)
print("Texts have been cleaned and formatted. Texts are to be changed to uppercase.")


Texts have been cleaned and formatted. Texts are to be changed to uppercase.
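
The function above cuts the text at the start marker. As a hedged sketch, the postamble could be trimmed in the same way, and line breaks could be turned into spaces before filtering so that words on adjacent lines stay separated. This assumes each file contains the usual `*** START OF ... ***` and `*** END OF ... ***` marker lines. The name `strip_and_clean` and the sample string are hypothetical, and treating newlines as spaces is my own assumption about how to read the task.

```python
import re

def strip_and_clean(text):
    # Assumption: the book body sits between the START OF and END OF marker lines.
    start = re.search(r"\*\*\* START OF .*? \*\*\*", text)
    if start:
        text = text[start.end():]
    end = re.search(r"\*\*\* END OF .*? \*\*\*", text)
    if end:
        text = text[:end.start()]
    # Replace newlines and tabs with spaces so words do not run together.
    text = re.sub(r"\s", " ", text)
    # Keep only ASCII letters, full stops, and spaces.
    text = re.sub(r"[^a-zA-Z. ]", "", text)
    # Collapse runs of spaces, then uppercase and trim the ends.
    return re.sub(r" +", " ", text).upper().strip()

sample = "header\n*** START OF THE EBOOK X ***\nIt is\nwhat it is.\n*** END OF THE EBOOK X ***\nlicence"
print(strip_and_clean(sample))
# prints: IT IS WHAT IT IS.
```

Running this over the real files would change the trigram counts reported below, so it is shown only as a possible refinement.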


### Step 3: Creating the **Trigram** Model

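A **trigram** is a sequence of three consecutive characters. In this step, we scan through the cleaned text and extract every trigram, including overlapping ones. Each trigram is stored in a dictionary, where the **key** is the trigram itself and the **value** is the number of times it occurs in the text. A dictionary suits this because looking up and incrementing a count takes constant time on average, and only trigrams that actually occur take up space.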



In [16]:
# This function generates the trigrams and counts their occurrences.
def generate_trigrams(text):
    trigram_counts = {}
    
    # Loop through the text to extract each trigram.
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        
        # Increment the trigram's count, or start it at 1 on first sight.
        if trigram in trigram_counts:
            trigram_counts[trigram] += 1
        else:
            trigram_counts[trigram] = 1
    # Return the dictionary of trigram counts.
    return trigram_counts

# Generate the trigram model from the cleaned combined text.
trigram_model = generate_trigrams(cleaned_combined_text)
print("Trigram model created.\n\n Sample trigrams:")
#  Display a sample of the trigrams
print(dict(list(trigram_model.items())[:10]))  


Trigram model created.

 Sample trigrams:
{'PRO': 922, 'ROJ': 454, 'OJE': 454, 'JEC': 465, 'ECT': 962, 'CT ': 516, 'T G': 498, ' GU': 456, 'GUT': 470, 'UTE': 589}
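
The flat dictionary is enough for counting. If the model were later used to look up which characters tend to follow a given pair of characters, one possible re-grouping is sketched below. The function `group_by_prefix` and the demo counts are made up for illustration and are not part of the model built in this notebook.

```python
from collections import defaultdict

def group_by_prefix(trigram_counts):
    # Map each two-character prefix to the counts of the characters that follow it.
    grouped = defaultdict(dict)
    for trigram, count in trigram_counts.items():
        grouped[trigram[:2]][trigram[2]] = count
    return dict(grouped)

demo_counts = {"THE": 5, "THA": 2, "HE ": 4}
print(group_by_prefix(demo_counts))
# prints: {'TH': {'E': 5, 'A': 2}, 'HE': {' ': 4}}
```

The information is the same as in the flat dictionary; only the layout changes.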


### Step 4: Analyzing the Trigram Model

This step sorts the trigrams by frequency in order to display the most common character sequences in the selected texts.

In [17]:
# Sort the trigrams by frequency in descending order and store them in sorted_trigrams.
sorted_trigrams = sorted(trigram_model.items(), key=lambda x: x[1], reverse=True)

# Print a heading, then loop over the 10 most common trigrams and print each one.
print("Top 10 most common trigrams:\n")
for trigram, count in sorted_trigrams[:10]:
    print(f"Trigram: '{trigram}' - Count: {count}")


Top 10 most common trigrams:

Trigram: ' TH' - Count: 9695
Trigram: 'THE' - Count: 7685
Trigram: 'HE ' - Count: 5160
Trigram: 'ND ' - Count: 3738
Trigram: 'AND' - Count: 3507
Trigram: ' AN' - Count: 2945
Trigram: '   ' - Count: 2924
Trigram: 'IS ' - Count: 2562
Trigram: ' TO' - Count: 2560
Trigram: ' OF' - Count: 2558
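
Raw counts depend on how much text was read, so it can help to compare proportions instead. Below is a minimal sketch with made-up counts and a hypothetical helper name, `relative_frequencies`; it is not part of the analysis above.

```python
def relative_frequencies(trigram_counts):
    # Divide each count by the total number of trigrams counted.
    total = sum(trigram_counts.values())
    return {trigram: count / total for trigram, count in trigram_counts.items()}

demo_counts = {" TH": 6, "THE": 3, "HE ": 1}
freqs = relative_frequencies(demo_counts)
print(freqs["THE"])
# prints: 0.3
```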


### Step 5: Saving the Trigram Model

This step saves the model to a file so that the trigram frequencies can be accessed later without recomputing them.


In [18]:
# Save the trigram model to a text file.
with open("trigram_model.txt", "w") as file:
    file.write("Trigram Model (Trigram: Count)\n")
    file.write("=" * 30 + "\n")
    for trigram, count in sorted_trigrams:
        file.write(f"{trigram}: {count}\n")

# Confirm that the model has been saved.
print("Trigram model has been saved to 'trigram_model.txt'.")


Trigram model has been saved to 'trigram_model.txt'.
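
To reuse the saved model, the file has to be read back into a dictionary. The sketch below assumes the exact layout written above: two header lines, then one `TRIGRAM: COUNT` entry per line. The function name `load_trigram_model` and the temporary demo file are hypothetical.

```python
import os
import tempfile

def load_trigram_model(path):
    model = {}
    with open(path, "r") as file:
        # Skip the title line and the "=====" separator line.
        next(file)
        next(file)
        for line in file:
            line = line.rstrip("\n")
            if len(line) < 6:
                continue  # Skip blank or malformed lines.
            # A trigram is exactly three characters, followed by ": " and the count.
            model[line[:3]] = int(line[5:])
    return model

# Demo with a small file written in the same layout.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_model.txt")
    with open(demo_path, "w") as file:
        file.write("Trigram Model (Trigram: Count)\n" + "=" * 30 + "\n")
        file.write(" TH: 9695\nTHE: 7685\n")
    loaded = load_trigram_model(demo_path)

print(loaded)
# prints: {' TH': 9695, 'THE': 7685}
```

Slicing by position instead of splitting on whitespace keeps trigrams that contain spaces, such as `' TH'`, intact.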
