In [13]:
# Define necessary imports

import os # For file operations
import collections # For counting characters in files and creating a dictionary
import random # For selecting random items from lists
import json # For exporting model as a JSON file


# Task 1: Third-Order Letter Approximation Model
## Step 1: Create method to format text files

The `formatFiles` method processes the text files in a folder and counts how often each character appears. The [Counter class](https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb) in Python's 'collections' module keeps track of character counts across all files. The method goes through each file in the folder using the [listdir() function in the 'os' module](https://pytutorial.com/python-using-oslistdir-to-list-files-in-a-directory/), checks whether the file name ends with .txt, and joins the directory path and the file name using another function from the 'os' module, [path.join()](https://www.geeksforgeeks.org/python-os-path-join-method/), to build the path used to open the file.

For each text file, it reads the content using UTF-8 encoding, converts everything to uppercase, and removes any character that is not in the `keep` string. It then searches the cleaned text for the Project Gutenberg start and end markers using [find()](https://www.w3schools.com/python/ref_string_find.asp) and, if both markers are found, keeps only the text between them using Python's [slicing](https://www.w3schools.com/python/ref_func_slice.asp). It counts the characters in this text using `Counter` and adds these counts to the running total. Finally, it prints the total count for each character and returns the cleaned files in a list.
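The cleaning and marker extraction described above can be sketched on an in-memory string. This is a minimal illustration, not the notebook's own code: the function name `extractMainContent`, the sample text, and the shortened markers are all hypothetical. Unlike the notebook's slice, which starts at the marker, this sketch skips past the start marker so the marker text itself is not counted.

```python
def extractMainContent(text, keep, startMarker, endMarker):
    # Upper-case the text, then drop every character that is not in `keep`
    cleaned = ''.join(c for c in text.upper() if c in keep)

    # find() returns -1 when the marker is not present
    start = cleaned.find(startMarker)
    end = cleaned.find(endMarker)

    # If either marker is missing, fall back to the whole cleaned text
    if start == -1 or end == -1:
        return cleaned

    # Slice from just after the start marker up to the end marker
    return cleaned[start + len(startMarker):end]


keep = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ .'
raw = "Header *** START OF BOOK *** Hello, world. *** END OF BOOK *** Footer"
result = extractMainContent(raw, keep, 'START OF BOOK', 'END OF BOOK')
print(repr(result))
```

Because the markers are searched for after cleaning, they must be written using only characters that appear in `keep`.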

In [14]:
# Directory containing the text files
directory = r'..\docs\utf8_english_works'

# The characters to keep (uppercase letters A-Z, space, full stop).
keep = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ .'

# Initialize a Counter to store the frequency of each character across all files
totalCounts = collections.Counter()
    
# Initialize a list to store cleaned files
cleanedFiles = []

In [15]:
# Iterate over all files in the directory
for fileName in os.listdir(directory):
    if fileName.endswith('.txt'):
        filePath = os.path.join(directory, fileName)
            
        # Open the file with UTF-8 encoding
        with open(filePath, 'r', encoding='utf-8') as file:
            # Read the whole file into a string.
            english = file.read()

            # Change everything to upper case.
            english = english.upper()

            # Remove unwanted characters.
            cleaned = ''.join(c for c in english if c in keep)

            # Append the cleaned file to the list of cleaned files
            cleanedFiles.append(cleaned)

In [16]:
# Count the frequency of each character
for index, cleaned in enumerate(cleanedFiles):
    # Remove preamble and postamble. If find returns -1, the substring was not found.
    start = cleaned.find('START OF THE PROJECT GUTENBERG EBOOK')
    end = cleaned.find('END OF THE PROJECT GUTENBERG EBOOK')

    # If the substrings are found, extract the main content.
    # This rebinds the loop variable only; the entries in cleanedFiles are left unchanged.
    if start != -1 and end != -1:
        cleaned = cleaned[start:end]
    else:
        # Report the list position; fileName from the earlier loop would always name the last file read
        print("ERROR: Substrings not found in file at index:", index)
        
    # Count the frequency of each character in the current file
    counts = collections.Counter(cleaned)

    # Update the total counts with the counts from the current file
    totalCounts.update(counts)

In [17]:
# Print the results
for char, count in totalCounts.items():
    print(f"'{char}': {count}")

# Store contents of cleaned files in a list in order to count the number of sequences later
cleanedList = cleanedFiles

'S': 136438
'T': 193301
'A': 171189
'R': 122258
' ': 442809
'O': 159367
'F': 46913
'H': 136849
'E': 265142
'P': 36138
'J': 2418
'C': 49677
'G': 43500
'U': 60490
'N': 147529
'B': 33311
'K': 16941
'L': 87108
'Y': 42416
'I': 144493
'M': 55608
'D': 92967
'W': 50034
'.': 23533
'V': 19577
'X': 2706
'Z': 1117
'Q': 2769


## Step 2: Process Text Files

The `formatFiles` method is tested by passing it the directory containing the text files and the string of characters to keep. Every .txt file in the folder is read and formatted so that only the characters A-Z, space, and full stop remain, and the frequency of each character is then counted.

## Step 3: Method to Create Trigram Model
The method `countTrigrams` takes the previously created list of cleaned texts as an argument. It uses [defaultdict()](https://www.geeksforgeeks.org/defaultdict-in-python/) to initialize a dictionary that stores the results. A dictionary suits this task because it holds key-value pairs: each trigram is stored as a key and the number of times it appears as the value.
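As a small illustration of why `defaultdict(int)` is convenient for counting, the sketch below counts the same three trigrams with a plain dictionary and with a `defaultdict`. The variable names and the sample list are made up for this example.

```python
import collections

plainCounts = {}
defaultCounts = collections.defaultdict(int)

for trigram in ['THE', 'HE ', 'THE']:
    # A plain dict needs an explicit default for keys it has not seen yet
    plainCounts[trigram] = plainCounts.get(trigram, 0) + 1

    # defaultdict(int) creates a missing key with the value 0 on first access
    defaultCounts[trigram] += 1

print(dict(defaultCounts))  # {'THE': 2, 'HE ': 1}
```

Both dictionaries end up with the same contents; the `defaultdict` version simply avoids the explicit default.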

This method then iterates over each cleaned text and extracts the trigrams by slicing three characters at a time, with the start index supplied by the [range()](https://www.w3schools.com/python/ref_func_range.asp) function, incrementing the count of each trigram in the dictionary as it goes. The trigrams are then ordered from the highest count to the lowest using the [sorted()](https://www.w3schools.com/python/ref_func_sorted.asp) function. The final result is a list of (trigram, count) pairs sorted by count, which holds the contents of the trigram model.
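The sliding three-character window can be seen on a short sample string. The sketch below is an alternative formulation, not the notebook's approach: it swaps in `collections.Counter` and its `most_common()` method in place of `defaultdict` and `sorted()`, and the function name `countTrigramsSketch` is hypothetical.

```python
import collections


def countTrigramsSketch(text):
    # Stop at len(text) - 2 so that every slice is exactly three characters long
    windows = (text[i:i + 3] for i in range(len(text) - 2))

    # most_common() returns (trigram, count) pairs ordered from highest count to lowest
    return collections.Counter(windows).most_common()


sample = "THE THEME"
model = countTrigramsSketch(sample)
print(model)
```

For a nine-character sample there are seven windows, and 'THE' is the only trigram that occurs twice, so it comes first in the result.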



In [18]:
# Create a dictionary to store the trigram counts
trigramCounts = collections.defaultdict(int)

# Iterate over each cleaned file's content
for cleaned in cleanedList:
    # Stop two characters before the end so that every slice is exactly three characters long
    for i in range(len(cleaned) - 2):
        trigram = cleaned[i:i+3] # Creates trigram by slicing the cleaned text from index i to i+3
        trigramCounts[trigram] += 1 # Increment the count of the trigram in the dictionary

# Sort the trigram counts from highest to lowest, once all files have been counted
trigramModel = sorted(trigramCounts.items(), key=lambda item: item[1], reverse=True)

# Only displays the top 20 trigrams for brevity
print(trigramModel[:20])

[(' TH', 50120), ('THE', 42657), ('HE ', 33535), ('ED ', 19427), ('AND', 19167), ('ND ', 18886), (' AN', 18522), ('ING', 16298), (' OF', 15054), ('NG ', 14348), (' TO', 14170), ('OF ', 13953), ('TO ', 12595), ('ER ', 12564), (' IN', 12352), ('AT ', 12066), ('IS ', 11072), ('IN ', 10547), (' HE', 10223), ('RE ', 9472)]


## Step 4: Pass Through Data and Create Trigram Model

The `countTrigrams` method is tested by passing it `cleanedList`, the list generated earlier that holds the cleaned text files. The texts in the list are processed and used to create the trigram model.

# Task 2: Third-Order Letter Approximation Model

## Step 1: Method to Initialize String beginning with TH
Here the `initializeString` method is used to initialize a string beginning with TH, which will later be extended to 10,000 characters. The trigrams beginning with TH and their weights are collected using [list comprehension](https://chatgpt.com/share/670583e8-8ca0-800f-bddd-9e8e27b62db1), with each weight being the number of times that trigram occurs in the model. The trigrams and their odds of being picked are then displayed side by side, using the [zip()](https://www.geeksforgeeks.org/zip-in-python/) function to pair the corresponding entries of the two lists.

A trigram is then picked at random using the [random.choices()](https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb) function, [which bases its selection on weights](https://pynative.com/python-weighted-random-choices-with-probability/), so the more frequent trigrams such as 'THE' and 'THA' have a higher chance of being selected. The chosen trigram is converted to a string using the [str()](https://www.w3schools.com/python/ref_func_str.asp) function and is returned by the method.
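To make the weighting concrete, the sketch below draws from a reduced list of three 'TH' trigrams, using counts copied from this notebook's output. The reduced list, the fixed seed, and the number of draws are choices made for this illustration only.

```python
import random

# Three of the 'TH' trigrams and their counts, copied from the notebook's output
thKeys = ['THE', 'THA', 'THI']
weights = [42657, 8266, 5589]

# Probability of each trigram within this reduced three-item example
total = sum(weights)
probabilities = [weight / total for weight in weights]
for key, probability in zip(thKeys, probabilities):
    print(f"{key}: {probability:.3f}")

# Fixed seed so that the draws are repeatable
random.seed(0)

# Draw 1,000 trigrams; each draw favours the trigrams with larger weights
draws = random.choices(thKeys, weights=weights, k=1000)
print("Share of draws that were THE:", draws.count('THE') / 1000)
```

The weights do not need to sum to 1: `random.choices()` treats them as relative weights, which is why the raw counts can be passed in directly.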

In [19]:
# Use list comprehension to find trigrams that start with 'TH'
thKeys = [key for key, value in trigramModel if key.startswith('TH')]

# Create weights based on reoccurrence of trigrams in the model
weights = [value for key, value in trigramModel if key in thKeys]

# Calculate the total weight
totalWeight = sum(weights)

# Print trigrams and weights alongside each other as key-value pairs with total weight as denominator
print("Trigrams: | Odds of being chosen:")
for key, weight in zip(thKeys, weights):
    print(f"{key}       | {weight}/{totalWeight}")

Trigrams: | Odds of being chosen:
THE       | 42657/67788
THA       | 8266/67788
TH        | 5632/67788
THI       | 5589/67788
THO       | 2667/67788
THR       | 1349/67788
THU       | 353/67788
THY       | 279/67788
THS       | 254/67788
TH.       | 214/67788
THT       | 100/67788
THL       | 74/67788
THW       | 65/67788
THF       | 57/67788
THD       | 51/67788
THH       | 50/67788
THM       | 33/67788
THP       | 23/67788
THC       | 18/67788
THN       | 13/67788
THB       | 12/67788
THQ       | 12/67788
THG       | 10/67788
THK       | 5/67788
THJ       | 3/67788
THV       | 2/67788


## Step 2: Extracting Trigrams from Model and Initializing String
To test the `initializeString` method, the trigram model from Task 1 is passed in as an argument. The result is a string beginning with "TH" whose third character is chosen at random, weighted by how often each trigram occurs in the model.

In [20]:
# Pick a trigram based on the weights, using [0] to extract the first element of the list 
chosenTrigram = random.choices(thKeys, weights)[0]

# Convert the randomly chosen trigram to a string
chosenTrigram = str(chosenTrigram)

# Print the beginning of the 10,000 character string
print("Chosen trigram to begin the string:", chosenTrigram)

Chosen trigram to begin the string: THE


## Step 3: Generate More Characters to Add to the String
To extend the string to 10,000 characters, adding each new letter by weighted probability, a new method called `generateCharacters` is created. This method uses many of the same techniques as the `initializeString` method, such as list comprehension to find the matching trigrams and create their weights.

It uses Python's [string slicing](https://pythonexamples.org/python-string-get-last-n-characters/) to get the last two characters of the string, finds the trigrams in the model that begin with these two characters, and randomly selects one of them based on their weights using the [random.choices()](https://www.w3schools.com/python/ref_random_choices.asp) function. The third letter of the selected trigram is appended to the string, and this repeats until the string reaches 10,000 characters, at which point it is returned by the method.
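Scanning the whole model for every new character is simple but repeats the same search many times. One possible alternative, sketched here under the assumption that the model is a list of (trigram, count) pairs, is to group the trigrams by their first two characters once and then look each pair up directly. The names `buildPrefixIndex`, `generateSketch`, and `toyModel` are hypothetical and do not appear in the notebook.

```python
import collections
import random


def buildPrefixIndex(trigramModel):
    # Map each two-character prefix to parallel lists of third letters and counts
    index = collections.defaultdict(lambda: ([], []))
    for trigram, count in trigramModel:
        if len(trigram) == 3:  # ignore any slice shorter than three characters
            letters, counts = index[trigram[:2]]
            letters.append(trigram[2])
            counts.append(count)
    return index


def generateSketch(index, seed, length):
    text = seed
    while len(text) < length:
        # get() does not create an entry, so an unseen prefix returns None
        entry = index.get(text[-2:])
        if entry is None:
            break  # no known continuation for this pair
        letters, counts = entry
        text += random.choices(letters, weights=counts)[0]
    return text


# A tiny made-up model in which every reachable prefix has a continuation
toyModel = [('THE', 5), ('HE ', 4), ('E T', 3), (' TH', 3),
            ('THA', 1), ('HAT', 1), ('AT ', 1), ('T T', 1)]
index = buildPrefixIndex(toyModel)
generated = generateSketch(index, 'TH', 40)
print(generated)
```

The index is built in a single pass over the model, after which each new character costs one dictionary lookup and one weighted draw.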

In [21]:
# Initialize the generated string with the chosen trigram
generatedString = chosenTrigram

# Loop until the string reaches the desired length (10,000 characters)
while len(generatedString) < 10000:
    # Get the last two characters of the current generated string
    lastTwoChars = generatedString[-2:]

    # Find the three-character trigrams that start with the last two characters,
    # keeping each trigram paired with its count so the model is scanned only once
    candidates = [(key, value) for key, value in trigramModel
                  if len(key) == 3 and key.startswith(lastTwoChars)]

    # Stop if the model has no continuation; random.choices() fails on an empty list
    if not candidates:
        break

    possibleTrigrams = [key for key, value in candidates]

    # Create weights based on the count of each trigram in the model
    weights = [value for key, value in candidates]

    # Pick a trigram based on the weights
    chosenTrigram = random.choices(possibleTrigrams, weights)[0]

    # Add the third character of the chosen trigram to the generated string
    generatedString += chosenTrigram[2]

## Step 4: Add Characters to String using Model and Initial String
To check that the `generateCharacters` function produces a 10,000 character string, it is given the trigram model created in Task 1 and the initial string generated in Step 2 as arguments and uses them to create the final string. The final string is then partially printed to check the output.

In [22]:
# Ensure the generated string is exactly 10,000 characters long
if len(generatedString) == 10000:
    # Print the first 100 characters for verification
    print("Final Generated String (first 100 characters):", generatedString[:100])
else:
    # Print an error message that reports the actual length
    print("ERROR: The generated string is not 10,000 characters, it is only", len(generatedString), "characters long.")

Final Generated String (first 100 characters): THEAS WILASH THEAS YED TH THALMON GE REAS MORRY ING     VE THE AS IN REENTEAS ANCLONTROBJECESSPE INT


# Task 3: Analyze Your Model


## Step 1: Method to Calculate English Word Percentage
The `wordPercentage` method calculates the percentage of valid English words in the 10,000 character string. It reads the list of English words from words.txt and compares the words in the string against it. The [split()](https://stackoverflow.com/questions/6181763/converting-a-string-to-a-list-of-words) function splits the contents of words.txt into a list of words, and [set()](https://www.dataquest.io/blog/how-to-remove-duplicates-from-a-python-list/#:~:text=Sets%20in%20Python%20are%20unordered,a%20set%20removes%20the%20duplicates.) removes duplicates from the list and makes each membership test fast. A [generator expression](https://chatgpt.com/share/6706c30b-ae3c-800f-a06a-a51b19853176) iterates over the words in the string and yields 1 for each one found in the set, so the matches can be summed without building an intermediate list.
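The calculation can be checked by hand on a short string. The sketch below is a minimal illustration with a made-up three-word vocabulary; the function name `wordPercentageSketch` and the guard for an empty string are additions for this example and are not taken from the notebook.

```python
def wordPercentageSketch(text, englishWords):
    # Split on whitespace; repeated spaces do not produce empty words
    words = text.split()

    # Avoid dividing by zero when the text contains no words
    if not words:
        return 0.0

    # The generator yields 1 for every word found in the set of English words
    validCount = sum(1 for word in words if word in englishWords)
    return validCount / len(words) * 100


englishWords = {'THE', 'CAT', 'SAT'}
percentage = wordPercentageSketch('THE CAT SAT ONN THE MATT', englishWords)
print(f"Percentage of valid English words: {percentage:.2f}%")  # 66.67%
```

Four of the six words are in the vocabulary, which gives 66.67%. The comparison is case-sensitive, so the word list and the generated string need to use the same case.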

In [27]:
# Assign the path to the words file to a variable 
wordsFile = r'..\docs\english_words\words.txt'

# Read the list of English words from the file, creating a set of english words
with open(wordsFile, 'r') as file:
    englishWords = set(file.read().split())

# Split the extended string into individual words
wordsInString = generatedString.split()

Percentage of valid English words: 34.99%


## Step 2: Call Method and Calculate English Word Percentage
To test the `wordPercentage` method, both the words.txt file and the 10,000 character string are passed into it, and it returns the percentage of English words in the string.

In [None]:
# Count the number of English words. The generator yields 1 for every word found in the set of English words.
validWordCount = sum(1 for word in wordsInString if word in englishWords)

# Calculate the percentage of valid English words
totalWords = len(wordsInString)
percentage = (validWordCount / totalWords) * 100

# Print the percentage of valid English words
print(f"Percentage of valid English words: {percentage:.2f}%")

# Task 4: Exporting Model as JSON
## Step 1: Creating Method to Convert Model to JSON File
A simple method named `convertModelToJSON` creates a JSON file using the [json.dump()](https://www.geeksforgeeks.org/convert-python-list-to-json/) function, which writes a Python object to a file in JSON format. In this case, it converts the trigram model, which is a list, into the required JSON file.
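One detail worth knowing when the file is read back: JSON has no tuple type, so each (trigram, count) pair is written as a two-element array and comes back as a list. The sketch below shows the round trip in memory with `json.dumps()` and `json.loads()`, using two pairs taken from the model output above; the variable names are made up for this example.

```python
import json

# Two (trigram, count) pairs taken from the notebook's trigram model output
model = [('THE', 42657), (' TH', 50120)]

# Serialize to a JSON string; tuples are written as JSON arrays
encoded = json.dumps(model)
print(encoded)  # [["THE", 42657], [" TH", 50120]]

# Reading the JSON back gives lists rather than tuples
decoded = json.loads(encoded)

# Convert each pair back to a tuple if the original shape is needed
restored = [tuple(pair) for pair in decoded]
print(restored == model)  # True
```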

In [24]:
def convertModelToJSON(trigramModel):
    # Export the trigram model as a JSON file
    with open('trigrams.json', 'w') as file:
        json.dump(trigramModel, file)

## Step 2: Export the trigram model as a JSON file
The trigram model which was created in Task 1 is passed in as an argument to `convertModelToJSON`, exporting the model as a JSON file as well as testing the functionality of the `convertModelToJSON` method.

In [25]:
# Convert the trigram model to a JSON file
convertModelToJSON(trigramModel)