## Task 1
This part of the notebook aims to create a third-order letter approximation model using five English works in Plain Text UTF8 format from Project Gutenberg for my module Emerging Technologies in final year. The task is outlined as follows:

Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and postamble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.

Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears. You can design your own data structure for storing the results but explain your design and its rationale in your answer.

For example, the sentence: It is what it is. would become IT IS WHAT IT IS. This will give a model like {'IT ': 2, 'T I': 3, ' IS': 2, 'IS ': 1, ...}.

I have broken this down to the following steps:

1. Selecting five texts from books I have gotten from https://www.gutenberg.org/.
2. Cleaning the texts by removing preamble, postamble, and unnecessary characters.
3. Converting the texts to uppercase.
(Step 3 & 4 are to prepare the file for processing in the trigram model)
4. Creating a trigram model by counting sequences of three consecutive characters.

Trigrams are sequences of three characters, and the goal is to use trigrams to analyze the structure of the English language. Each trigram's frequency will be recorded to create a model for the English language.

All imports used in this project

In [28]:
import re
import random
import json # for saving as json object notation in task 4

### T1.1
>Reading the texts to files and storing texts to a lists 'texts'

In [None]:
# Step 1: Read the selected text files

# Defining the file paths for the uploaded text files of each book which are inside the "T1Files" folder for neatness sake.
file_paths = [
    "T1Files/Hamlet.txt",
    "T1Files/MacBeth.txt",
    "T1Files/Romeo&Juliet.txt",
    "T1Files/TheValley.txt",
    "T1Files/TheVoiceOfTheVoid.txt"
]

# Reading each file's content from the files_paths list and storing the content in a list
texts = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        texts.append(file.read())


print("Text files have been read in and stored to list 'texts' .")


Text files have been read in and stored to list 'texts' .


### T1.2 Combining the texts to one file
>Using the join method to join all texts into one file

In [None]:
# combining text to one file. using join method to combine the text files to one file
combined_text = " ".join(texts)


### T1.3 Cleaned Text 
> Removing any preamble and postamble.<br>
    Removing any characters that were not ASCII.<br>
    Removing any characters that were not full stops or spaces.<br>
    Made all characters uppercase.<br>
    Combined all the text into one large string for processing

In [None]:
def clean_and_format_text(text):
    # Specifying start of text and end of text using regular expressions
    start_marker = re.search(r"\*\*\* START OF .* \*\*\*", text)
    if start_marker:
        text = text[start_marker.end():]
    
    #  Removing any characters that were not ASCII & removing any characters that were not full stops or spaces
    cleaned_text = re.sub(r"[^a-zA-Z. ]", "", text)
    
    # Made all characters uppercase
    cleaned_text = cleaned_text.upper()
    
    return cleaned_text

# Pass our text and returning the cleaned and formatted text
cleaned_texts = [clean_and_format_text(text) for text in texts]

# Combines all of our cleaned texts into one large string for trigram analysis
cleaned_combined_text = " ".join(cleaned_texts)
print("Texts have been cleaned and formatted.")


Texts have been cleaned and formatted.


### T1.4 Create Trigram Model
>Extracts trigrams. <br>
>Keeps a count of the trigrams. <br>
>Returns a sample of the trigrams extracted and their counts.<br>

>A **trigram** is a sequence of three consecutive characters.<br>
 In this step of the task, we scan through the cleaned text and extract all possible trigrams. Each trigram is then stored in a dictionary, where the key is the trigram stated. The **value** is the number of times its counted in the text.



In [None]:
def generate_trigrams(text):
    trigram_counts = {}
    
    # Loop through the text and generate trigrams
    for i in range(len(text) - 2): 
        trigram = text[i:i+3]
        
        # Increment count of trigrams
        if trigram in trigram_counts:
            trigram_counts[trigram] += 1
        else:
            trigram_counts[trigram] = 1
    # Retuns the number of trigrams to trigram_counts
    return trigram_counts

# Passing the cleaned and combined text to generate trigrams
trigram_model = generate_trigrams(cleaned_combined_text)
print("Trigram model created.\n\n Sample trigrams:")
#  Display a sample of the trigrams and their counts
print(dict(list(trigram_model.items())[:10]))  


Trigram model created.

 Sample trigrams:
{'PRO': 922, 'ROJ': 454, 'OJE': 454, 'JEC': 465, 'ECT': 962, 'CT ': 516, 'T G': 498, ' GU': 456, 'GUT': 470, 'UTE': 589}


### T1.5
Sorting Trigrams
>This step is to sort the most frequent trigrams in order to display the most common charecter sequences based on the selected texts.

In [None]:
# Initializing sorted_trigrams to store sorted trigrms by frequency in descending order.
sorted_trigrams = sorted(trigram_model.items(), key=lambda x: x[1], reverse=True)

# Print the 10 most common trigrams.
print("Top 10 most common trigrams:\n")
for trigram, count in sorted_trigrams[:10]:
    print(f"Trigram: '{trigram}' - Count: {count}")


Top 10 most common trigrams:

Trigram: ' TH' - Count: 9695
Trigram: 'THE' - Count: 7685
Trigram: 'HE ' - Count: 5160
Trigram: 'ND ' - Count: 3738
Trigram: 'AND' - Count: 3507
Trigram: ' AN' - Count: 2945
Trigram: '   ' - Count: 2924
Trigram: 'IS ' - Count: 2562
Trigram: ' TO' - Count: 2560
Trigram: ' OF' - Count: 2558


### T1.6
Saving the Trigram Model
>This step is to save the model and data to file to allow easy access to trigram frequencies without having to recompute them.


In [34]:
# Save the trigram model to a text file.
with open("trigram_model.txt", "w") as file:
    file.write("Trigram Model (Trigram: Count)\n")
    file.write("=" * 30 + "\n")
    for trigram, count in sorted_trigrams:
        file.write(f"{trigram}: {count}\n")

# Print message of print.
print("Trigram model has been saved to 'trigram_model.txt'.")


Trigram model has been saved to 'trigram_model.txt'.


# Task 2
###

### Task 2 Definition 
**Third-order letter approximation generation**
Use your model from Task 1 to generate a string of 10,000 characters. Start with the string TH. Generate each next character by looking at the previous two characters. Find the trigrams in your model that start with those two characters. Randomly select one of the third letters of those trigrams, using the counts as weights.

For example, suppose your model has five trigrams starting with TH: THE appeared 150 times, THA appeared 70 times, THI 60 times, TH  50 times, and TH. appeared 10 times. The total of the counts is 340. Select the next character as E with probability 150/340, A with probability 70/340, and so on.

### 2.1
<br>
 Using the model from Task 1, we will generate the next character based on the  previous. Then choose one of the third letters of the generated trigrams, use the counts as weights

In [35]:
# predict the next character based on previous two characters using trigram model
def get_next_char(previous_two, trigram_model):
    # find trigrams that started with the last two characters
    candidates = {k: v for k, v in trigram_model.items() if k.startswith(previous_two)}
    
    # if no candidates are found, return a space (edge case handling)
    if not candidates:
        return " "
    
    # extract the possible next characters and their corresponding counts
    trigrams = list(candidates.keys())
    counts = list(candidates.values())
    
    # normalize the counts to get probabilities for weighted random selection
    total_count = sum(counts)
    probabilities = [count / total_count for count in counts]
    
    # choose a trigram based on the probabilities
    chosen_trigram = random.choices(trigrams, weights=probabilities, k=1)[0]
    
    # return the third character of the chosen trigram    
    return chosen_trigram[2]




### 2.2 
Generate 10,000 characters

In [36]:
# method to generate a string of 10,000 characters
def generate_string(trigram_model, length=10000):
    result = "TH"  # starting with "TH" 
    
    # generate characters until the string reaches 10000 characters
    for _ in range(length - 2):  # removing two characters as th is counted in this
        # return last two characters
        last_two = result[-2:]
        
        # using the trigram model from task 1 predict next character
        next_char = get_next_char(last_two, trigram_model)
        
        # let the next character be the result
        result += next_char        
    return result

# call the method to generate 10,000 characters
TH = generate_string(trigram_model, length=10000)




# Task 3

### T3.1 
Firstly we'll load in the words file then store them 

In [37]:
# load in the words file
def load_words(filepath="words.txt"):
    with open(filepath, "r") as file:
        words = {line.strip().lower() for line in file}
    return words

# load words into set to store them 
words_set = load_words()
print(f"Loaded {len(words_set)} words from words.txt.")


Loaded 45373 words from words.txt.


### T3.2
We'll then split them up and ensure that is all lowercase for processing

In [38]:
# function to split string into words
def split_into_words(text):
    return text.lower().split()  # make sure all characters are lowercase

# split the words from the generated string
generated_words = split_into_words(TH)
print(f"Generated {len(generated_words)} words.")

Generated 1682 words.


### T3.3
Count and return the amound of valid words

In [None]:
# Count_valid_words takes in our generated string of 10,000 "TH" and words_set (words in words.txt file)<br>
def count_valid_words(generated_text, words_set):
    # A concise for loop iterates through every word in generated_text to compare if its in words_set
    valid_words = [word for word in generated_text if word in words_set]
    return len(valid_words)

# return the count of valid words
valid_word_count = count_valid_words(generated_words, words_set)
print(f"Valid words: {valid_word_count}")

Valid words: 541


### T3.4
Calculate Percentage

In [None]:
# calculate the percentage of valid words
def calculate_percentage(valid_count, total_count):
    if total_count == 0:
        return 0
    return (valid_count / total_count) * 100


total_word_count = len(generated_words)
valid_word_percentage = calculate_percentage(valid_word_count, total_word_count)
print(f"Percentage of valid words from words.txt that appear in generated words:\n======\n{valid_word_percentage:.2f}%\n======")

Percentage of valid words from words.txt that appear in generated words:
32.16%


# Task 4

### T4.1
Save to  JavaScript Object Notation and save in repo as trigrams.json

In [41]:
# export the trigram model to JSON
def export_trigram_model_to_json(trigram_model, filepath="trigrams.json"):
    with open(filepath, "w") as file:
        json.dump(trigram_model, file, indent=4)  # save model as JSON with json formatting, indent as 4 for neatness and readability
    print(f"Exported trigram model to {filepath}")

export_trigram_model_to_json(trigram_model)

Exported trigram model to trigrams.json


# Research



 

| Reference     | URL     | Usage | 
|--------------|-------|-------|
| Trigram Research | https://en.wikipedia.org/wiki/Trigram <br> https://web.stanford.edu/~jurafsky/slp3/3.pdf |General|
|  Python Naming Convention   | https://peps.python.org/pep-0008/     | General|
|  Concise For Method   | https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions     |T3|
|  Json Export | https://docs.python.org/3/library/json.html#json.dump <br> https://www.w3schools.com/python/python_json.asp  | T4  |
|  Python Random  | https://docs.python.org/3/library/random.html#random.choice  | Imports   |



