## Task 1 - Third-Order Letter Approximation Model

The code below reads every .txt file in texts/, filters out special and numerical characters, counts how often each sequence of three characters appears, and compiles the counts into a dictionary.

<hr>

In [7]:
# Imports and Global Variables
import os
import re
import random

sanitizedTexts = []
trigrams = {}

### sanitizeText
<hr>
This function takes in the contents of a text file (ideally from Project Gutenberg), and cleans it up for use in a later function.
This involves removing the preamble and postamble, special characters, and multi-blank lines.
The idea is to create text well suited to turning into trigrams - no niche special characters,
no multi-blank lines flooding the trigrams with {"\n\n\n"}, and no legalese from the Project Gutenberg postamble.

Here are some of the more complex lines explained in more depth:

The first step removes the Project Gutenberg preamble and postamble from the text. To accomplish this,
the preamble and postamble variables are defined with the text at the end of the preamble and at the beginning of the postamble.
Python's slicing syntax (the : character between two indexes) **[3]** needs the positions of these markers in the string, which is why .index() is used to find them. len(preamble) is added to the first index so that the slice starts after the preamble marker rather than including it.

An example of how .index() and slicing work is below. .index() gets the index (that is, how far into the string the substring first appears) and slicing gets the substring between two indexes.

In [8]:
print('abcabcabc'.index('b')) # Result = 1.
print('abc123abc'[3:6]) # Result = 123

1
123



The result of the combination is an extraction of all of the text in between the preamble and postamble:


```python
    sanitizedText = (text[text.index(preamble)+len(preamble):text.index(postamble)]) # [3.1]
```
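
To see the extraction on something smaller than a whole book, here is a minimal sketch. The `sample` string is a made-up miniature of a Project Gutenberg file, and `start`, `end` and `body` are illustrative names that don't appear in the notebook. Note that .index() raises a ValueError if a marker isn't found, so this assumes both markers are present.

```python
# Hypothetical miniature of a Project Gutenberg file, for illustration only.
sample = "*** START OF THE EBOOK ***\nCHAPTER ONE\n*** END OF THE EBOOK ***"

preamble = " ***"          # Ending of the preamble
postamble = "*** END OF "  # Beginning of the postamble

start = sample.index(preamble) + len(preamble)  # First index after the preamble marker
end = sample.index(postamble)                   # Index where the postamble begins
body = sample[start:end]                        # Everything in between the two markers

print(repr(body))
```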
<hr>

Now, we can move on to removing special characters and multi-blank lines. We can do this through Python Regex **[4]**, enabling us to set a "filter" for only the text we want to keep. This is what's being done in the first line:
- re.sub means that we're replacing anything that doesn't fit into our filter with the second string, which is empty in this case.
- The [] designates a regex set, allowing us to define ranges of characters.
- The ^ character is a complement - it means that we're substituting any characters that *don't* fit into our set.
- a-zA-Z refers to all alphabetical lowercase & uppercase characters between A and Z.
- \\s refers to whitespace (spaces, tabs and new lines), . refers to periods, and \n refers to new lines, which \\s already covers. These are the only special characters we want to keep.

So after running through this first line, any digits, commas, colons or the like are removed from the string.

```python
    sanitizedText = re.sub("[^a-zA-Z\\s.\n]", "", sanitizedText) # [4.1]
```
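
As a quick check of the filter, the sketch below runs the same pattern over a made-up sentence. `sample` and `filtered` are illustrative names only.

```python
import re

# Made-up input containing punctuation, an apostrophe and digits.
sample = "Hello, World! It's 1818.\nNew line: here."

# Same pattern as above: keep letters, whitespace and periods, drop everything else.
filtered = re.sub("[^a-zA-Z\\s.\n]", "", sample)

print(repr(filtered))  # 'Hello World Its .\nNew line here.'
```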

The second line is more of the same - this time, it's looking for multiple blank lines in a row. A lot of the Project Gutenberg books feature this, and if we allowed them into the trigrams, we'd end up with a higher than average likelihood of the approximation generating 3 or 4 new lines after each line break.

_r"\n\s*\n"_ is a pattern looking for two new lines with nothing but spaces/new lines in between them - for example, if there were an instance of 3 new lines, one with a space in the middle, this would remove it and replace it with just one new line.

We replace each of these matches with a single new line - we don't want our approximation to be all one blob of a line, so having some new lines is useful for making it look more like language.

```python
    sanitizedText = re.sub(r"\n\s*\n", "\n", sanitizedText) # [4.2]
```
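
The sketch below applies the same substitution to a small made-up string, with a run of blank lines (one containing a space) between the first two lines. `sample` and `collapsed` are illustrative names only.

```python
import re

# Made-up input: four new lines (one blank line holds a space) after LINE ONE.
sample = "LINE ONE\n\n \n\nLINE TWO\nLINE THREE"

# The whole run of new lines and spaces is replaced by a single new line.
collapsed = re.sub(r"\n\s*\n", "\n", sample)

print(repr(collapsed))  # 'LINE ONE\nLINE TWO\nLINE THREE'
```

The single new line between LINE TWO and LINE THREE is left alone, since the pattern needs at least two new lines to match.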
<hr>

Finally, our text is almost ready to be returned. We just have to perform some simple String Operations **[5]**, which Python handily provides. 
- .upper() converts all our text to uppercase, making the Trigrams more consistent later. 
- .strip() removes any leftover whitespace at the beginning/end of each text.

```python
    return sanitizedText.upper().strip()
```

In [9]:
# Takes in a Project Gutenberg text and cleans it up for use in produceTrigrams
# Removes the preamble and postamble, special characters, and multi-blank lines.
def sanitizeText(text):
    preamble = " ***" # Ending of the preamble
    postamble = "*** END OF " # Beginning of the postamble.

    # [3] Strips out the preamble and postamble from a text by
    # creating a string in between the preamble and postamble variables.
    sanitizedText = (text[text.index(preamble)+len(preamble):text.index(postamble)]) 
    sanitizedText = re.sub("[^a-zA-Z\\s.\n]", "", sanitizedText) # [4.1]
    sanitizedText = re.sub(r"\n\s*\n", "\n", sanitizedText) # [4.2]

    return sanitizedText.upper().strip() # Set to uppercase, and remove trailing/leading spaces. [5]

### produceTrigrams
<hr>
This function takes in a list of our newly-sanitized texts and creates a dict of Trigrams, which end up looking something like:
{"ABC": 123, "DEF": 342, ...} This tells us how often a set of three characters occurs in our text, which we can use later on to build our approximation.

First we iterate over the texts with a for loop, then we *enumerate* over the characters of each text **[6]**. This gives us a handy *counter* alongside each character as we go through the text. We'll need this for string indexes in a bit.

```python
    for text in texts:
        for counter, c in enumerate(text):
```


Now we're going through every character in every text. We need to pull out three characters at a time. For example, "THIS IS THE END" would become "THI", "HIS", "IS " and so on. This can be done easily enough with slicing using our counter:

```python 
    currentText = text[counter:counter+3]
```

This gives us three characters: the current character we're on and the two that follow it. This is effectively the current trigram. Note that at the last two positions of a text the slice comes back shorter than three characters, because slicing past the end of a string doesn't raise an error. Now we just need to check something before adding it to our dictionary:
If the trigram already exists in our dictionary, we just increment its value. If we have "ABC": 1 and run into "ABC" again, we just get "ABC": 2.
```python
    if(trigrams.get(currentText) is not None): # If the trigram key already exists, increment its value.
        trigrams[currentText] = trigrams[currentText] + 1
```

If it doesn't yet exist, though, we'll need to initialize it. This is done like so - we just set its value to 1.
```python
    else: # If the trigram hasn't been added to the dict yet, add it with the value of 1.
        trigrams[currentText] = 1
```

After iterating through every character in every text, we now have a full dictionary of trigrams to return. 
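
For comparison, here is a minimal sketch of the same counting idea using collections.Counter from the standard library. This is not the notebook's produceTrigrams: `count_trigrams` is a hypothetical helper, and it deliberately stops two characters before the end of the text so that every key has exactly three characters.

```python
from collections import Counter

def count_trigrams(text):
    # Hypothetical helper, not the notebook's produceTrigrams.
    # range(len(text) - 2) stops early enough that every slice is 3 characters long.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = count_trigrams("THE THEME")

print(counts["THE"])  # 2
print(sorted(counts))
```

Counter handles the "initialize to 1 or increment" step itself, which is why there is no if/else here.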

In [10]:
# Takes in a set of sanitized texts and turns them into a dictionary of trigrams, in the format {"ABC": 123}.
def produceTrigrams(texts):
    trigrams = {}

    for text in texts:
        for counter, c in enumerate(text): # Enumerate over the text, used to keep a counter of what character we're on. [6]
            # text[counter:counter+3] gets 3 characters - the current one, and the two afterwards. [3]
            # This basically gives us the "current" trigram.
            currentText = text[counter:counter+3]

            if(trigrams.get(currentText) is not None): # If the trigram key already exists, increment its value.
                trigrams[currentText] = trigrams[currentText] + 1
            else: # If the trigram hasn't been added to the dict yet, add it with the value of 1.
                trigrams[currentText] = 1

    return trigrams
                

### Getting our Trigrams Dict
<hr>
Now that we have two functions for creating a dictionary of trigrams, we need to pull out some text from our files to use them.

I used os.scandir() to achieve this, part of the os module **[2]**. This lets us iterate over every file it finds in a certain folder (texts in this case).

```python
    for title in os.scandir("texts"):
```

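I'd like to be able to pull in everything from the /texts/ directory, but first I need to make sure they're all text files. To do this, I use another string operation, find(), which returns -1 when the substring it's given isn't present. The check looks for the string ".txt" in the file name, and only opens the file if it's found. Note that this matches ".txt" anywhere in the name, not just at the end.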

```python
    if((title.name.find(".txt")) != -1):
```

Now that we're sure the file is a text file, we can open() it **[1]**, enabling us to call read() on it, which pulls out the text content of the file. Then, we .close() the file so that it isn't left open.
The text is passed to the sanitizeText function as soon as it's returned by .read(), and the result is added to a list for use later.

```python
    fileContent = open(title) # [1]
    sanitizedTexts.append(sanitizeText(fileContent.read())) # Sanitize it & add it to the sanitizedTexts list.
    fileContent.close()
```
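
An alternative way to write the file handling is sketched below, under a few assumptions. `read_txt_files` is a hypothetical helper rather than part of the notebook, it uses endswith() in place of find() so only names ending in .txt are accepted, and it assumes the files are UTF-8 encoded. The demonstration runs against a throwaway temporary directory, not the real texts folder.

```python
import os
import tempfile

def read_txt_files(folder):
    # Hypothetical helper: returns the contents of every .txt file in folder.
    contents = []
    for entry in os.scandir(folder):
        # endswith() only accepts names that end in .txt,
        # whereas find() would also accept a name such as "notes.txt.bak".
        if entry.is_file() and entry.name.endswith(".txt"):
            # The with block closes the file even if read() raises an error.
            with open(entry, encoding="utf-8") as fileContent:  # UTF-8 is an assumption
                contents.append(fileContent.read())
    return contents

# Demonstration on a temporary directory holding one .txt file and one other file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "book.txt"), "w", encoding="utf-8") as f:
        f.write("SAMPLE TEXT")
    with open(os.path.join(folder, "notes.txt.bak"), "w", encoding="utf-8") as f:
        f.write("IGNORED")
    demo = read_txt_files(folder)

print(demo)  # ['SAMPLE TEXT']
```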

Finally, once all the text files have been read, we can pass our newly created sanitizedTexts list into produceTrigrams, and get our dictionary.
```python
    trigrams = produceTrigrams(sanitizedTexts)
```

In [11]:
for title in os.scandir("texts"): # [2]
    if((title.name.find(".txt")) != -1): # Only attempt to open + add the text file if it's a .txt
        fileContent = open(title) # [1]
        sanitizedTexts.append(sanitizeText(fileContent.read())) # Sanitize it & add it to the sanitizedTexts list.
        fileContent.close()

trigrams = produceTrigrams(sanitizedTexts)
print(trigrams)

{'FRA': 161, 'RAN': 739, 'ANK': 238, 'NKE': 75, 'KEN': 234, 'ENS': 381, 'NST': 426, 'STE': 940, 'TEI': 28, 'EIN': 264, 'IN\n': 325, 'N\nO': 64, '\nOR': 40, 'OR ': 2346, 'R T': 1374, ' TH': 18613, 'THE': 16456, 'HE ': 14238, 'E M': 1946, ' MO': 1592, 'MOD': 68, 'ODE': 104, 'DER': 905, 'ERN': 298, 'RN ': 267, 'N P': 264, ' PR': 1092, 'PRO': 720, 'ROM': 1126, 'OME': 1384, 'MET': 315, 'ETH': 402, 'HEU': 4, 'EUS': 5, 'US\n': 64, 'S\nB': 59, '\nBY': 61, 'BY ': 951, 'Y M': 473, ' MA': 1719, 'MAR': 312, 'ARY': 164, 'RY ': 1206, 'Y W': 734, ' WO': 1128, 'WOL': 4, 'OLL': 338, 'LLS': 128, 'LST': 44, 'STO': 591, 'TON': 282, 'ONE': 1182, 'NEC': 102, 'ECR': 82, 'CRA': 97, 'RAF': 11, 'AFT': 325, 'FT ': 152, 'T G': 193, ' GO': 614, 'GOD': 94, 'ODW': 3, 'DWI': 4, 'WIN': 437, 'IN ': 3842, 'N S': 776, ' SH': 1952, 'SHE': 1239, 'HEL': 255, 'ELL': 944, 'LLE': 524, 'LEY': 36, 'EY\n': 69, 'Y\n ': 7, '\n C': 27, ' CO': 2843, 'CON': 1248, 'ONT': 572, 'NTE': 890, 'TEN': 806, 'ENT': 2739, 'NTS': 358, 'TS\n': 107

## Task 2 - Third-Order Letter Approximation Generation

<hr>

In [58]:
approximation = ""
current_string = "TH"

for i in range(10001):
    print(i)
    print(current_string[i:])
    print(tris)
    tris, weights = list(zip(*[(x[2], trigrams[x]) for x in trigrams.keys() if x[0:2] == current_string[i:]])) # [7]
    current_string = ''.join([current_string, (random.choices(list(tris), weights=weights, k=1)[0])]) # [7]
    print(current_string)
print(tris)
print(weights)
print(current_string)



0
TH
('.', ' ', 'E', 'O', 'I', 'R', 'U', '\n', 'N', 'W', 'L', 'S', 'A', 'Y', 'M', 'T', 'K', 'H', 'B', 'P', 'F', 'C')
THE
1
HE
('E', ' ', 'A', 'I', 'O', 'S', 'U', 'R', '.', 'Y', 'W', 'L', '\n', 'B', 'Q', 'D', 'F', 'H', 'K', 'G', 'M', 'C', 'T', 'N')
THE 
2
E 
(' ', 'U', 'L', 'A', '\n', 'R', 'E', 'D', 'N', 'I', 'S', 'Y', 'M', 'O', 'T', '.', 'C', 'P', 'W', 'V', 'F', 'H', 'Z', 'Q', 'B')
THE B
3
 B
('M', 'E', 'T', 'W', 'R', 'Y', 'A', 'I', 'S', 'P', 'O', 'F', 'H', 'B', 'U', 'D', 'N', 'V', 'J', 'C', 'L', 'G', 'Q', 'K', 'Z', ' ', '\n', '.')
THE BO
4
BO
('R', 'Y', 'E', 'A', 'O', 'U', 'L', 'I', '.')
THE BOU
5
OU
('D', 'R', 'A', 'U', 'O', 'Y', 'S', 'L', 'V', 'N', 'T', 'W', 'I', 'X', 'B', 'M', 'H', 'C')
THE BOUN
6
UN
(' ', 'R', 'T', 'B', 'N', 'S', '\n', 'L', 'G', 'D', 'P', 'C', '.', 'I', 'H', 'V', 'Y', 'E', 'A')
THE BOUNP
7
NP
('D', ' ', 'T', 'C', 'E', 'I', 'A', 'S', 'G', 'V', 'Q', 'P', 'H', 'K', 'F', 'B', 'W', 'U', 'L', 'R', 'N', 'J', 'M', '.', '\n', 'O', 'Y')
THE BOUNPA
8
PA
('A', 'R', 'E', 'L')


IndexError: string index out of range
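
The IndexError above is what `x[2]` raises when `trigrams` contains a key shorter than three characters that matches the current pair, which can happen when the truncated slices from the end of a text are counted. The sketch below is one possible way to guard against that, not a finished solution for this task: `generate` is a hypothetical helper, and `toy` is a made-up dictionary whose short key "PA" mimics a truncated slice. It also stops early if no trigram starts with the current pair.

```python
import random

def generate(trigrams, seed="TH", length=20):
    # Hypothetical helper sketching the loop above with two guards added.
    result = seed
    for _ in range(length):
        pair = result[-2:]  # Last two characters generated so far
        # Only full three-character keys can supply a third character.
        options = [(key[2], count) for key, count in trigrams.items()
                   if len(key) == 3 and key[:2] == pair]
        if not options:  # No trigram starts with this pair, so stop early
            break
        letters, weights = zip(*options)
        result += random.choices(letters, weights=weights, k=1)[0]
    return result

# Toy counts for illustration; the short key "PA" mimics a truncated slice.
toy = {"THE": 5, "HE ": 3, "E T": 2, " TH": 4, "PA": 1}

print(generate(toy))
print(generate(toy, seed="PA", length=5))
```

With `toy`, every pair has exactly one possible continuation, so the first call always produces repetitions of "THE ", while the second returns "PA" unchanged because the only matching key is too short to use.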

## Task 3 - Analyze The Model

<hr>

## Task 4 - Export Model

<hr>

## References

- [1] - Reading text files in Python: https://www.w3schools.com/python/python_file_open.
- [2] - Using Python's OS Module: https://www.geeksforgeeks.org/how-to-iterate-over-files-in-directory-using-python/
- [3] - Slicing text in Python: https://python-reference.readthedocs.io/en/latest/docs/brackets/slicing.html
    - [3.1] - Finding a string between two substrings (used for removing Pre/Postamble): https://stackoverflow.com/a/51456576
- [4] - Python Regex Library Docs: https://docs.python.org/3/library/re.html
    - [4.1] - Using regex in Python: https://www.w3schools.com/python/python_regex.asp
    - [4.2] - Removing multi-blank lines in a text: https://stackoverflow.com/a/28902081
- [5] - Python String Methods: https://docs.python.org/3.4/library/stdtypes.html#string-methods
- [6] - Python enumerator/counter: https://docs.python.org/3/library/functions.html#enumerate
- [7] - Second Order Approximation Method from Notes: https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb

<hr>