# Trigrams
## Task 1: Third-order letter approximation model

In [11]:
# Imports
import random
import json

`cleanText` reads one book, strips the preamble and postamble, converts the text to upper case, and keeps only the letters A to Z, spaces and full stops for later use.

In [12]:
def cleanText(fileName):
  # Reading all the text from one of the books.
  with open("data/books/" + fileName + ".txt", 'r') as file:
    bookText = file.read()

  # Removing the preamble and postamble. The slice begins at the "*** START"
  # marker, so the rest of that marker line stays in the text.
  bookText = bookText[bookText.find("*** START"):bookText.find("*** END")]

  # Printing to check if it worked.
  #print(bookText)

  # Convert everything to upper case.
  bookText = bookText.upper()

  # The characters to keep.
  charsToKeep = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ."

  # Removing all the characters that are not in the keep string.
  cleanedText = ''.join([c for c in bookText if c in charsToKeep])

  # Printing to check if it worked.
  #print(cleanedText)

  return cleanedText
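
To see what the filtering step keeps, here is a minimal sketch that applies the same upper-casing and character whitelist to an in-memory string instead of a file. The helper name `filterChars` and the sample sentence are made up for illustration and are not part of the notebook.

```python
# Hypothetical helper: the same filtering as cleanText, but on a string.
def filterChars(text, charsToKeep="ABCDEFGHIJKLMNOPQRSTUVWXYZ ."):
  # Upper-case first so lower-case letters survive the filter.
  return ''.join(c for c in text.upper() if c in charsToKeep)

# Made-up sample with punctuation, a digit and a line break.
sample = "Hello, World!\nIt's 3 o'clock."
filtered = filterChars(sample)
print(filtered)
```

Line breaks are not in the keep string, so the words on either side of a newline end up joined, and removing the digit leaves two spaces next to each other.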

Next we take the list of book names, run each one through `cleanText`, and join the results into one big string.

In [13]:
# List of books to go through.
bookNames = ["book1", "book2", "book3", "book4", "book5"]

# Go through every book and add them to the string.
allText = ""

for bookName in bookNames:
  allText += cleanText(bookName)

# Printing to check if it worked.
#print(allText)

We now go through every character in the newly created string and build each trigram from the previous, current and next character, storing it in a variable called `trigram`. The number of times each trigram occurs is kept in a dictionary called `trigrams`.

In [14]:
# Creating a dictionary to store the results in.
trigrams = {}

# Loop through the entire string, take the characters at index
# i - 1, i and i + 1, and count the resulting trigram in the dictionary.
for i in range(1, len(allText) - 1):
  trigram = allText[i - 1] + allText[i] + allText[i + 1]
  trigrams[trigram] = trigrams.get(trigram, 0) + 1

# Print to check if working.
list(trigrams.items())[:10]

[(' ST', 3598),
 ('STA', 1507),
 ('TAR', 380),
 ('ART', 1529),
 ('RT ', 859),
 ('T O', 2164),
 (' OF', 10852),
 ('OF ', 10231),
 ('F T', 3821),
 (' TH', 33693)]
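
The same counting can be sketched with `collections.Counter`, which is one possible alternative to the dictionary loop and not what the notebook uses. The short string `sampleText` is an illustrative stand-in for `allText`.

```python
from collections import Counter

# Illustrative stand-in for allText; the notebook uses the cleaned books.
sampleText = "THE THEME"

# Zipping three offset views of the string yields every run of three characters.
sampleTrigrams = Counter(
  a + b + c for a, b, c in zip(sampleText, sampleText[1:], sampleText[2:])
)

# The most frequent trigram and its count.
print(sampleTrigrams.most_common(1))
```

A string of length n contains n - 2 trigrams, which matches the range used in the loop above.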

Displaying the ten most frequent trigrams, ordered from highest frequency to lowest.

In [15]:
# Show in decreasing order (but only the first ten).
sorted(trigrams.items(), key=lambda x: x[1], reverse=True)[:10]

[('   ', 80495),
 (' TH', 33693),
 ('THE', 29058),
 ('HE ', 23812),
 ('ND ', 12931),
 ('AND', 12745),
 (' AN', 12251),
 (' OF', 10852),
 ('OF ', 10231),
 ('ED ', 9367)]

## Task 2: Third-order letter approximation generation
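We begin with "TH" for this. Until `generatedText` is ten thousand characters long, we look up every trigram that starts with the last two characters of `generatedText` and pick the next character at random, weighted by how often each of those trigrams occurred. Once we reach ten thousand characters we stop.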

In [16]:
# Starter text.
generatedText = "TH"

# Keep going until the generated text is 10,000 characters long.
while len(generatedText) < 10000:
  # Lists to store the characters and their weights.
  chars, weights = [], []

  # Loop through the trigrams and find the ones that start with the last two characters of the generated text.
  for key, value in trigrams.items():
    # If the key starts with the last two characters of the generated text.
    if key[:2] == generatedText[-2:]:
      # Add the third character to the chars list and the value to the weights list.
      chars.append(key[2])
      # The value is the weight of the character.
      weights.append(value)

  # Add a random character to the generated text using the weights.
  generatedText += random.choices(chars, weights=weights)[0]

# Print the generated text (but only the first one hundred characters).
generatedText[:100]

'THE ENTERITEALL THE DE SONG AT IN HARELL ISHER QUE RE SEN AND HAUGH THERROW THEREY AS COUNDE DOT COL'
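
The loop above scans the whole dictionary for every generated character. One possible variation, sketched here under assumptions, groups the trigrams by their two-character prefix once and then looks the prefix up directly. The names `toyTrigrams`, `byPrefix` and `toyText` and the toy counts are invented for illustration; in the notebook the model would be `trigrams`.

```python
import random
from collections import defaultdict

# Tiny made-up model; in the notebook this would be `trigrams`.
toyTrigrams = {"THE": 5, "THA": 2, "HE ": 4, "HEN": 1, "HAT": 2,
               "E T": 3, " TH": 3, "EN ": 1, "N T": 1, "AT ": 2, "T T": 2}

# Group third characters and their counts by two-character prefix.
byPrefix = defaultdict(lambda: ([], []))
for trigram, count in toyTrigrams.items():
  chars, weights = byPrefix[trigram[:2]]
  chars.append(trigram[2])
  weights.append(count)

random.seed(0)  # Fixed seed so repeated runs give the same text.
toyText = "TH"
while len(toyText) < 40:
  # A prefix with no recorded continuation cannot be extended, so stop.
  if toyText[-2:] not in byPrefix:
    break
  chars, weights = byPrefix[toyText[-2:]]
  toyText += random.choices(chars, weights=weights)[0]

print(toyText)
```

The prefix check also guards against a pair of characters that never starts a trigram in the model, a case in which `random.choices` would otherwise be called with empty lists.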

## Task 3: Analysing my model
Now we go through the text we just generated and check how much of it is actual English. First we read the `words.txt` file to compare against.

In [17]:
# First read in the words.txt file.
with open("data/words.txt", 'r') as file:
  words = file.read().splitlines()

# Print to check if it worked.
#words

`realWordPercentage` works out what percentage of the words in `text` are real. It splits `text` on whitespace into `wordsToCompare`, loops through every word, and counts the ones that appear in `realWords`. That count divided by the total number of words, multiplied by 100 and rounded to two decimal places, is returned as `percentage`.

In [18]:
def realWordPercentage(text, realWords):
  checkedWordsReal = 0

  # Split the text to check in to individual words.
  wordsToCompare = text.split()

  # Count how many words are in wordsToCompare.
  wordCount = len(wordsToCompare)

  # Loop through all the words in the wordsToCompare list.
  for word in wordsToCompare:
    if word in realWords:
      checkedWordsReal += 1

  # Calculate the percentage of words that are in the realWords list.
  percentage = round(checkedWordsReal / wordCount * 100, 2)

  # Return the percentage of words that are in the realWords list.
  return percentage

Now we run the above function and display the percentage.

In [19]:
realWordPercentage(generatedText, words)

33.86
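
A possible variant of the check, sketched under assumptions rather than taken from the notebook: it converts the word list to a set for faster lookups, strips full stops from each token, and returns 0.0 for empty text. The name `realWordPercentageSet` is hypothetical, and because stripping full stops changes which tokens count as words, it would not necessarily reproduce the figure above.

```python
def realWordPercentageSet(text, realWords):
  # A set gives fast membership tests; a list is scanned on every lookup.
  realWordSet = set(realWords)

  # Strip full stops so "CAT." is compared as "CAT", then drop empty tokens.
  tokens = [token.strip('.') for token in text.split()]
  tokens = [token for token in tokens if token]

  # Avoid dividing by zero when the text contains no words.
  if not tokens:
    return 0.0

  realCount = sum(1 for token in tokens if token in realWordSet)
  return round(realCount / len(tokens) * 100, 2)

# Made-up example: three of the four tokens are in the word list.
print(realWordPercentageSet("THE CAT. SAT ONN", ["THE", "CAT", "SAT", "ON"]))
```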

## Task 4: Exporting my model as JSON

We imported `json` at the top. We open `trigrams.json` for writing as `json_file`, dump `trigrams` into it, and then print a confirmation that it was exported.

In [20]:
# Export the trigram model as JSON.
with open('trigrams.json', 'w') as json_file:
  json.dump(trigrams, json_file)

# Print confirmation.
print("Trigram model exported as trigrams.json")

Trigram model exported as trigrams.json
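
To confirm that an export like this can be read back, here is a minimal round-trip sketch. It writes to a temporary directory so the real file is not touched, and `toyModel` is a made-up stand-in for `trigrams`.

```python
import json
import os
import tempfile

# Made-up counts; in the notebook this would be `trigrams`.
toyModel = {"THE": 2, "HE ": 1, " TH": 1}

# Use a temporary directory so the real trigrams.json is left alone.
with tempfile.TemporaryDirectory() as tempDir:
  path = os.path.join(tempDir, "trigrams.json")

  # Write the model out, then read it straight back in.
  with open(path, 'w') as json_file:
    json.dump(toyModel, json_file)
  with open(path, 'r') as json_file:
    loadedModel = json.load(json_file)

# True when the loaded dictionary matches what was written.
print(loadedModel == toyModel)
```

String keys and integer counts survive the round trip unchanged, so the loaded dictionary can be used in place of the original.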


## Summary

In this notebook, we:
1. Created a third-order letter approximation model using trigrams from five books.
2. Generated ten thousand characters of text using the model starting with 'TH'.
3. Analysed the generated text and found that approximately 34% of its words were real English words.
4. Exported the trigram model to JSON format for future use.