# Task 1: Third-Order Letter Approximation Model

In [185]:
# Define necessary imports

import os # For file operations
import collections # For counting characters in files and creating a dictionary
import random # For selecting random items from lists


## Step 1: Create Method to Format Text Files

This method processes the text files in a folder and counts how often each permitted character appears. It starts by setting up a counter to keep track of character counts across all files. It then goes through each file in the folder, checking whether its name ends with .txt. For each text file, it reads the content using UTF-8 encoding and converts everything to uppercase. It removes any characters not in the list of characters to keep, then looks for the Project Gutenberg start and end markers that surround the main content. If both markers are found, it keeps only the text from the start marker up to the end marker; otherwise it prints an error and keeps the whole cleaned text. It counts the characters in this cleaned text and adds these counts to the total. Finally, it prints the total count for each character and returns the cleaned texts in a list.

Sources:
- Original Assessment Notes: https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb
- Guide on using the 'find' function in Python: https://www.w3schools.com/python/ref_string_find.asp
- Guide on using the 'os.path.join' function for file operations: https://www.geeksforgeeks.org/python-os-path-join-method/

In [186]:
# Method to format the files
def formatFiles(directory, keep):
    # Initialize a Counter to store the frequency of each character across all files
    totalCounts = collections.Counter()
    
    # Initialize a list to store cleaned files
    cleanedFiles = []

    # Iterate over all files in the directory
    for fileName in os.listdir(directory):
        if fileName.endswith('.txt'):
            filePath = os.path.join(directory, fileName)
            
            # Open the file with UTF-8 encoding
            with open(filePath, 'r', encoding='utf-8') as file:
                # Read the whole file into a string.
                english = file.read()

            # Change everything to upper case.
            english = english.upper()

            # Remove unwanted characters.
            cleaned = ''.join(c for c in english if c in keep)

            # Remove preamble and postamble by finding the main content.
            # If find returns -1, the substring was not found.
            start = cleaned.find('START OF THE PROJECT GUTENBERG EBOOK')
            end = cleaned.find('END OF THE PROJECT GUTENBERG EBOOK')

            # If the substrings are found, extract the main content.
            if start != -1 and end != -1:
                cleaned = cleaned[start:end]
            else:
                print("ERROR: Substrings not found in file:", fileName)

            # Count the frequency of each character in the current file.
            counts = collections.Counter(cleaned)
            
            # Update the total counts with the counts from the current file
            totalCounts.update(counts)

            # Append the cleaned file to the list of cleaned files
            cleanedFiles.append(cleaned)


    # Print the results
    for char, count in totalCounts.items():
        print(f"'{char}': {count}")

    # Return the list of cleaned files
    return cleanedFiles
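
The marker-based trimming can be tried on a short string. The following is a minimal sketch rather than the notebook's own code: the sample text and the helper name `extractBody` are made up for illustration. It also shows that slicing from `start` keeps the start marker itself in the result, whereas adding the marker's length (an optional variation, not something the method above does) skips it.

```python
START_MARKER = 'START OF THE PROJECT GUTENBERG EBOOK'
END_MARKER = 'END OF THE PROJECT GUTENBERG EBOOK'

def extractBody(text, skipMarker=False):
    """Hypothetical helper: return the text between the two markers."""
    start = text.find(START_MARKER)
    end = text.find(END_MARKER)
    # str.find returns -1 when the substring is absent
    if start == -1 or end == -1:
        return text
    if skipMarker:
        # Move past the marker so it is not part of the result
        start += len(START_MARKER)
    return text[start:end]

# Made-up sample standing in for a cleaned Project Gutenberg file
sample = ('PREAMBLE START OF THE PROJECT GUTENBERG EBOOK HELLO WORLD. '
          'END OF THE PROJECT GUTENBERG EBOOK LICENCE')

print(repr(extractBody(sample)))                   # begins with the start marker
print(repr(extractBody(sample, skipMarker=True)))  # marker removed
```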


## Step 2: Process Text Files

All text files in the folder are processed, while keeping the characters A-Z, space, and period. The 'formatFiles' method is called and takes the specified directory and characters to keep as arguments. It then reads and formats the text files, and counts the frequency of each character.

In [187]:
# Directory containing the text files
directory = r'..\docs\utf8_english_works'

# The characters to keep (uppercase ASCII letters, space, full stop).
keep = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ .'

# Store contents of cleaned files in a list in order to count the number of sequences later
textToCount = formatFiles(directory, keep)


'S': 136438
'T': 193301
'A': 171189
'R': 122258
' ': 442809
'O': 159367
'F': 46913
'H': 136849
'E': 265142
'P': 36138
'J': 2418
'C': 49677
'G': 43500
'U': 60490
'N': 147529
'B': 33311
'K': 16941
'L': 87108
'Y': 42416
'I': 144493
'M': 55608
'D': 92967
'W': 50034
'.': 23533
'V': 19577
'X': 2706
'Z': 1117
'Q': 2769
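
The counts above appear in the order in which the characters were first encountered, not by frequency. As a hedged sketch, a `Counter` can rank them with `most_common`; the variable `sampleCounts` is illustrative and holds only a handful of values copied from the output above, not the full result.

```python
import collections

# A few counts copied from the output above (not the complete set)
sampleCounts = collections.Counter({
    'T': 193301,
    'A': 171189,
    ' ': 442809,
    'E': 265142,
    'Z': 1117,
})

# most_common(n) returns the n highest counts, largest first
for char, count in sampleCounts.most_common(3):
    print(f"'{char}': {count}")
```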


## Step 3: Method to Create Trigram Model
Finally, the countTrigrams method is created, which takes the previously created list of cleaned texts as an argument. It initializes a dictionary to store the results: a dictionary's key-value pairs suit this task, with each trigram stored as a key and its count as the value. The method then iterates over each cleaned text and extracts trigrams by slicing three characters at a time from each index, incrementing the count of each trigram in the dictionary as it goes.
The final result is a list of (trigram, count) pairs sorted from the most to the least frequent, which holds the contents of the trigram model.

With help from:
- Guide to creating a default dictionary in Python: https://www.geeksforgeeks.org/defaultdict-in-python/


In [188]:
def countTrigrams(cleanedList):
    # Create a dictionary to store the trigram counts 
    trigramCounts = collections.defaultdict(int) 

    # Iterate over each cleaned file's content
    for cleaned in cleanedList:
        # Iterate over the cleaned text to extract trigrams,
        # stopping two characters early so every slice is a full three characters
        for i in range(len(cleaned) - 2): 
            trigram = cleaned[i:i+3] # Creates trigram by slicing the cleaned text from index i to i+3
            trigramCounts[trigram] += 1 # Increment the count of the trigram in the dictionary

    # Sort the trigram counts from highest to lowest
    trigramModel = sorted(trigramCounts.items(), key=lambda item: item[1], reverse=True)

    # Return the trigram model
    return trigramModel
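
To make the sliding window concrete, here is a small worked example on a toy string. It is a sketch under the assumption that only full three-character windows should be counted; `slidingTrigrams` and `toyCounts` are illustrative names that do not appear elsewhere in the notebook.

```python
import collections

def slidingTrigrams(text):
    """Illustrative helper: yield every full three-character window."""
    # The last valid start index is len(text) - 3
    for i in range(len(text) - 2):
        yield text[i:i + 3]

# 'THE THEN' has 8 characters, so it yields 6 windows:
# 'THE', 'HE ', 'E T', ' TH', 'THE', 'HEN'
toyCounts = collections.Counter(slidingTrigrams('THE THEN'))

print(toyCounts['THE'])         # 'THE' occurs twice
print(sum(toyCounts.values()))  # total number of windows
```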


## Step 4: Pass Through Data and Create Trigram Model

textToCount, the list that was previously generated with all the cleaned text files, is passed through as an argument to the countTrigrams method, which then uses this data to create the trigram model.

In [189]:
countTrigrams(textToCount)

[(' TH', 48873),
 ('THE', 41572),
 ('HE ', 32649),
 ('ED ', 18886),
 ('AND', 18822),
 ('ND ', 18476),
 (' AN', 17962),
 ('ING', 15856),
 (' OF', 14451),
 ('NG ', 13966),
 (' TO', 13780),
 ('OF ', 13380),
 ('ER ', 12326),
 ('TO ', 12200),
 ('AT ', 11910),
 (' IN', 11877),
 ('IS ', 10761),
 (' HE', 10191),
 ('IN ', 10178),
 ('AS ', 9318),
 ('HER', 9303),
 (' WH', 9294),
 ('RE ', 9267),
 ('E T', 9055),
 ('D T', 8943),
 ('HAT', 8905),
 (' HA', 8889),
 (' HI', 8816),
 (' A ', 8783),
 ('HIS', 8327),
 ('THA', 8171),
 ('E S', 7957),
 (' BE', 7913),
 ('N T', 7877),
 ('E A', 7842),
 ('ERE', 7454),
 (' WA', 7354),
 ('EN ', 7335),
 ('ON ', 7310),
 ('ES ', 7219),
 ('T T', 7053),
 ('E W', 6943),
 ('S A', 6814),
 (' WI', 6725),
 ('LL ', 6669),
 (' I ', 6567),
 ('S T', 6559),
 (' NO', 6532),
 (' IT', 6505),
 ('LY ', 6327),
 (' CO', 6230),
 ('FOR', 6212),
 ('OR ', 6082),
 ('D A', 5948),
 ('YOU', 5945),
 ('UT ', 5885),
 ('ME ', 5768),
 ('IT ', 5721),
 (' FO', 5555),
 ('E O', 5481),
 ('TER', 5478),
 ('AL

# Task 2: Third-Order Letter Approximation Model

## Step 1: Method to Initialize String beginning with TH
Here a method is created to initialize a string beginning with TH, which will later be extended to 10,000 characters. The trigrams beginning with TH and their weights are collected using list comprehensions, with each weight being the trigram's count in the model. A trigram is then picked at random using these weights, so more frequent trigrams such as 'THE' and 'THA' have a higher chance of being selected. The method returns the chosen trigram as a string, which consists of TH followed by a third character chosen according to its weighted probability.

Sources:
- ChatGPT Answer on how to find keys that start with certain values: https://chatgpt.com/share/670583e8-8ca0-800f-bddd-9e8e27b62db1
- Original assessment notes: https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb
- Guide on 'zip' function: https://www.geeksforgeeks.org/zip-in-python/
- Using weights with the 'random.choices()' function: https://pynative.com/python-weighted-random-choices-with-probability/

In [190]:
def initializeString(trigramModel):
    # Use list comprehension to find keys that start with 'TH'
    thKeys = [key for key, value in trigramModel if key.startswith('TH')]

    # Create weights from the counts of those same trigrams in the model
    weights = [value for key, value in trigramModel if key.startswith('TH')]

    # Calculate the total weight
    totalWeight = sum(weights)

    # Print thKeys and weights alongside each other as key-value pairs with total weight as denominator
    print("Trigrams: | Odds of being chosen:")
    for key, weight in zip(thKeys, weights):
        print(f"{key}       | {weight}/{totalWeight}")

    # Pick a trigram based on the weights, using [0] to extract the first element of the list 
    chosenTrigram = random.choices(thKeys, weights)[0]

    # Convert the randomly chosen trigram to a string
    chosenTrigram = str(chosenTrigram)
    
    # Return the chosen trigram
    return chosenTrigram 
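
The effect of the weights can be checked empirically. The sketch below draws many samples with `random.choices` and compares how often each trigram comes up; the three counts are copied from the model output in Task 1, while the seeded `random.Random(0)` instance and the sample size are assumptions made only so the sketch is repeatable.

```python
import random

# Seeded generator, used here only to make the sketch repeatable
rng = random.Random(0)

# Three 'TH' trigrams with their counts from the model
keys = ['THE', 'THA', 'THI']
weights = [41572, 8171, 5317]

# Draw 10,000 samples; higher weights should be drawn more often
draws = rng.choices(keys, weights=weights, k=10000)

for key in keys:
    print(key, draws.count(key) / len(draws))
```

With weights this far apart, 'THE' should dominate the draws, followed by 'THA' and then 'THI'.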


## Step 2: Extracting Trigrams from Model and Initializing String
Using the initializeString method, the trigram model from Task 1 is passed in as an argument to create a string beginning with "TH" plus a third character chosen with a probability based on its frequency in the model.

In [191]:
# Assign the trigram model to a variable and use it to generate a string
trigramModel = countTrigrams(textToCount)
generatedString = initializeString(trigramModel)

# Print the beginning of the 10,000 character string
print("Chosen trigram to begin the string:", generatedString)

Trigrams: | Odds of being chosen:
THE       | 41572/65984
THA       | 8171/65984
TH        | 5355/65984
THI       | 5317/65984
THO       | 2607/65984
THR       | 1339/65984
THU       | 348/65984
THY       | 279/65984
THS       | 254/65984
TH.       | 214/65984
THT       | 100/65984
THL       | 74/65984
THW       | 65/65984
THF       | 57/65984
THD       | 51/65984
THH       | 50/65984
THM       | 33/65984
THP       | 23/65984
THC       | 18/65984
THN       | 13/65984
THB       | 12/65984
THQ       | 12/65984
THG       | 10/65984
THK       | 5/65984
THJ       | 3/65984
THV       | 2/65984
Chosen trigram to begin the string: THE
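
The odds printed above can also be read as probabilities by dividing each count by the total. A brief sketch using three rows copied from that output; `thOdds` and `probabilities` are illustrative names.

```python
# Counts and total copied from the printed odds above
total = 65984
thOdds = {'THE': 41572, 'THA': 8171, 'TH ': 5355}

# Divide each count by the total to get a probability
probabilities = {trigram: count / total for trigram, count in thOdds.items()}

for trigram, probability in probabilities.items():
    print(f"{trigram!r}: {probability:.3f}")
```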


## Step 3: Generate More Characters to Add to the String
To extend the string to 10,000 characters, adding new letters chosen by weighted probability, a new method called 'generateCharacters' is created. This method uses many of the same techniques as the initializeString method, such as list comprehensions to find matching trigrams and build weights. It uses Python's string slicing to get the last two characters of the string, finds the trigrams in the model that begin with these two characters, and randomly selects the third letter of one of those trigrams using the weights, appending it to the string until 10,000 characters are reached.

Sources:
- Using string slicing in Python: https://www.geeksforgeeks.org/string-slicing-in-python/

In [192]:
def generateCharacters(trigramModel, generatedString):
    # Loop until the string reaches the desired length (10,000 characters)
    while len(generatedString) < 10000:
        # Get the last two characters of the current generated string
        lastTwoChars = generatedString[-2:]
        
        # Use list comprehension to find trigrams that start with the last two characters
        possibleTrigrams = [key for key, value in trigramModel if key.startswith(lastTwoChars)]
        
        # Create weights from the counts of those same trigrams in the model
        weights = [value for key, value in trigramModel if key.startswith(lastTwoChars)]
        
        # Pick a trigram based on the weights
        chosenTrigram = random.choices(possibleTrigrams, weights)[0]
        
        # Add the third character of the chosen trigram to the generated string
        generatedString += chosenTrigram[2]
    
    # Ensure the generated string is exactly 10,000 characters long
    if len(generatedString) == 10000:
        # Return the generated string
        return generatedString
    else:
        # Return a single error message string that reports the actual length
        return f"ERROR: The generated string is not 10,000 characters; it is {len(generatedString)} characters long."
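
The method above scans the whole model for every new character. One possible alternative, sketched here as an assumption rather than as the notebook's approach, is to group the model once by its two-character prefix so each step becomes a dictionary lookup. The helper names `buildPrefixTable` and `extendString`, and the tiny `toyModel`, are hypothetical.

```python
import collections
import random

def buildPrefixTable(trigramModel):
    """Hypothetical helper: map each two-character prefix to its options."""
    table = collections.defaultdict(lambda: ([], []))
    for trigram, count in trigramModel:
        if len(trigram) == 3:  # ignore any shorter slices
            nextChars, weights = table[trigram[:2]]
            nextChars.append(trigram[2])
            weights.append(count)
    return table

def extendString(table, seed, length):
    """Append weighted-random characters until `length` is reached."""
    chars = list(seed)
    while len(chars) < length:
        prefix = ''.join(chars[-2:])
        if prefix not in table:  # no known continuation, so stop early
            break
        nextChars, weights = table[prefix]
        chars.append(random.choices(nextChars, weights=weights)[0])
    return ''.join(chars)

# Made-up model in the same (trigram, count) format as countTrigrams returns
toyModel = [('THE', 5), ('HE ', 4), ('E T', 3), (' TH', 3),
            ('HEN', 1), ('EN ', 1), ('N T', 1)]

table = buildPrefixTable(toyModel)
print(extendString(table, 'TH', 40))
```

Checking `prefix not in table` before indexing avoids both creating empty entries in the defaultdict and calling `random.choices` on an empty list.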

## Step 4: Add Characters to String using Model and Initial String
A 10,000-character string is generated using the generateCharacters function, which takes the trigram model created in Task 1 and the initial string generated in Step 2 as arguments and uses them to create the final string. The final string is then partially printed to check the output.

In [193]:
# Generate the 10,000 character string
extendedString = generateCharacters(trigramModel, generatedString)

# Print the first 100 characters for verification
print("Final Generated String (first 100 characters):", extendedString[:100])

Final Generated String:  THE DER THERST BUT TO THER MOTHE BUTMOTS ALL OPED UT ASTIOUR IT PAS OF THES IN ALS. ME SAINED ISHING WHIM WOUSING ORDST HIME SPAPABLE WITH GOT YOUREMAGINGCOMED OF THE SONS AMIN YOUWIT ASESSEEQUEGAY IS ACININ SPESTEELIKE NOTERTIOTTERS TO BULAPPONS MIS WAD OFTEPTEP HAD TICE THE WHOULD FICHE DROUNCE AND HATIONEXPLIORRION WIT WARAILEIR SONS AND FEEN TH THE IFT BANDSHET WAREARPLAID CATIED THAT HE SNARTHE BITUR SAGE HE VILEAD PE WAY AUL DOVE EME BE ANYTHE VECIS TO TIONG ING TO UNTINE CANCE BUT ANS WELL DECTS SOME SED HE GIN THAND CLUMPOLEANK YOUST BEHAT FIR QUED I DAY EYHANCHNSHERY WITHE AND AL MY MISELIGERIEN TH WHO STIONLY ANCE. TOREPHINE SELME. BUTE KIN WAY ROVE WHOUBBIR THFURGUIT A DOUGHT ITHEHERINTS LOWS ANDRIALIFICE EVERY BUS WO AND MUS FRORBUCHIPARITHIN BILLOAM THEYGRALL HME CALETUR HANIN FORS LAKTHIS HAT INEVELFSPID OCK WE OURNINT. AND MAGIE MEN HERTIM OF TH A BEHIM THE ITTLY SHE OF AND A NOW HAS CAND TH THIS WHOWELF NE ING.RESCET.BEIR THE SPA HE FREPATIME AF