# Task 1: Third-order letter approximation model

In this task, I will create a trigram model of the English language using five English works from Project Gutenberg.

# Step 1: Preprocessing the Text

Text Preprocessing:
The first step is to clean and preprocess the text. By doing this I aim to remove all unwanted characters, 
such as numbers and punctuation except for full stops, and then convert everything to uppercase.
This simplifies the text and makes it uniform, which lets us focus only on the sequence of letters, spaces and full stops.

Args:
        text (str): The original text to be cleaned.

Returns: 
        str: Cleaned and preprocessed text.

I will use Python's `re` module to handle the text cleaning.

In [1]:
# Required Imports
import re
from collections import defaultdict

In [2]:
# Function to preprocess the text
def preprocess_text(text):
    # Replace every non-alphanumeric character with a space
    clean_text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Convert all characters to uppercase
    clean_text = clean_text.upper()

    return clean_text #Return the cleaned text

# Test the function with a sample text
text1 = "This is a sample text. It will be preprocessed to remove all non-alphanumeric characters !!!!! ,,,, and convert all characters to uppercase."
cleaned_text = preprocess_text(text1)
print(cleaned_text)

THIS IS A SAMPLE TEXT  IT WILL BE PREPROCESSED TO REMOVE ALL NON ALPHANUMERIC CHARACTERS            AND CONVERT ALL CHARACTERS TO UPPERCASE 


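# Explanation of output
As you can see above, the text is converted to uppercase and a space is left wherever a non-alphanumeric character was removed. Because the pattern keeps only letters and digits, the full stops in the sample are replaced with spaces as well, while any digits would be kept.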
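
The stated aim for this step is to keep only letters, spaces and full stops, whereas `preprocess_text` keeps letters and digits. A minimal sketch of a variant that matches the stated aim is shown here; the name `preprocess_letters_only` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def preprocess_letters_only(text):
    """Hypothetical variant: keep only A-Z, spaces and full stops."""
    # Uppercase first so the pattern only needs to list capital letters
    upper_text = text.upper()
    # Replace anything that is not a capital letter, a space or a full stop
    # with a space (the dot is literal inside a character class)
    return re.sub(r'[^A-Z. ]', ' ', upper_text)

print(preprocess_letters_only("Hi, Bob. 42!"))
```

Like the original function, this replaces each unwanted character with a space rather than deleting it, so words separated only by punctuation stay apart.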

## Alphanumeric Characters
Alphanumeric characters are the combination of **letters (A-Z)** and **numbers (0-9)**. They are used quite commonly in computing for passwords, identifiers and other text-based inputs where both letters and digits are needed.

You can find more details in this article: [Alphanumeric Characters - TechTarget](https://www.techtarget.com/whatis/definition/alphanumeric-alphameric).


## Step 2: Creating the Trigram Model

### Trigram Model:
A trigram model captures the frequency of every three-character sequence in the text. 
In this step, I will build a dictionary where each key is a unique trigram and the corresponding value is the number of times that trigram appears in the text.

I will use a defaultdict from Python's collections module.

This approach is commonly used in **natural language processing (NLP)**, usually for tasks such as text prediction or language modeling, where n-grams (trigrams in this case) are used to capture sequence patterns.

Reference: [N-grams in Natural Language Processing](https://en.wikipedia.org/wiki/N-gram).

In [3]:
from collections import defaultdict

def create_trigram_model(text):
    trigram_counts = defaultdict(int)

    # Loop over the text, extracting trigrams and updating their counts
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1

    return trigram_counts


In [4]:
# Test the function with sample text
sample_text = "hello world"
trigram_counts = create_trigram_model(sample_text)

## Explanation of Output:
If we take a look at the sample word 'hello':

1. **'hel': 1** - This trigram appears only once in the sample text.
2. **'ell': 1** - This trigram also appears only once.
3. **'llo': 1** - Once again, this trigram appears only once.

The frequency count of each trigram shows how common certain letter combinations are in the text. This makes the trigram model useful for capturing patterns in sequences of text, which can be used for tasks such as text generation.

In this sample every trigram, including 'hel' and 'llo', appears only once because the text is so short. In a larger text, common sequences appear many times, and those counts give the model a foundation for predicting the next character in a sequence based on the previous characters.

#### Reference:
1. **Understanding Language Modeling**: For a more detailed explanation of language models, including n-grams and transformer-based neural models, refer to this [Medium article](https://medium.com/@roshmitadey/understanding-language-modeling-from-n-grams-to-transformer-based-neural-models-d2bdf1532c6d).

In [5]:
# Print the trigram counts
for trigram, count in trigram_counts.items():
    print(f"'{trigram}' : {count}")

'hel' : 1
'ell' : 1
'llo' : 1
'lo ' : 1
'o w' : 1
' wo' : 1
'wor' : 1
'orl' : 1
'rld' : 1
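
In the 'hello world' sample every count is 1, so it does not show how repeated trigrams accumulate. The sketch below uses a slightly longer made-up sentence and `collections.Counter` (swapped in here for `defaultdict` because it offers `most_common`); the name `count_trigrams` and the sample sentence are assumptions for illustration only.

```python
from collections import Counter

def count_trigrams(text):
    # Counter tallies each three-character slice of the text
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

demo_counts = count_trigrams("THE THEME OF THE THESIS")

# most_common(n) returns the n most frequent trigrams with their counts
top_three = demo_counts.most_common(3)
print(top_three)
```

Here 'THE' is counted four times and ' TH' three times, which is the kind of skew the full model relies on when it generates text.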


### Loading and Preprocessing the Text from Five Books

Here we will load five books from Project Gutenberg, preprocess them, and combine them into a single large text. Each book is read as plain text, cleaned to remove unwanted characters, and converted to uppercase using the `preprocess_text` function.

#### Process:
1. **Load Each Book**: Each book is loaded from the `texts/` folder.
2. **Preprocess**: We clean each text with `preprocess_text`, which replaces unwanted characters with spaces and converts everything to uppercase.
3. **Combine Text**: The preprocessed text from each book is combined into a single large string.

Now, this combined text will serve as the basis for building the trigram model.


In [6]:
# List of file paths for the five books inside the "texts/" folder
books = ["texts/book1.txt", "texts/book2.txt", "texts/book3.txt", "texts/book4.txt", "texts/book5.txt"]

# Preprocess all books and combine them into one large string
combined_text = ""
for book in books:
    with open(book, 'r', encoding='utf-8') as file:
        text = file.read()  # Read the text of the book
        clean_text = preprocess_text(text)  # Preprocess the text
        combined_text += clean_text  # Combine all preprocessed texts

# Check the length of the combined text
print(f"Combined text length: {len(combined_text)} characters")


Combined text length: 3666300 characters


As you can see above, the length of the combined text is printed after preprocessing to verify that all of the text has been combined.

### Creating the Trigram Model
Using the combined text, we now create a trigram model that captures the frequency of each trigram. The model will allow us to generate new text that mimics the structure and style of the original books.

#### Process:
- **Trigram Model**: We loop through the combined text and count each unique trigram. The model is stored as a dictionary where each key is a trigram and the value is the number of times that trigram appears.

Below, the first 10 trigrams and their counts are displayed as a sample to inspect the structure of the model.

In [7]:
# Create the trigram model from the combined text
trigram_model = create_trigram_model(combined_text)

# Print the first 10 trigrams to inspect the model
for trigram, count in list(trigram_model.items())[:10]:
    print(f"'{trigram}': {count}")


' TH': 61066
'THE': 46163
'HE ': 45241
'E P': 4706
' PR': 5519
'PRO': 3305
'ROJ': 465
'OJE': 464
'JEC': 992
'ECT': 3654


### Task 2: Generating Text from the Trigram Model

In this task, I will use the trigram model created in Task 1 to generate a string of 10,000 characters. The process begins with a string of two characters ('TH'). For each subsequent character, we look at the previous two characters and select the next character from the trigrams that start with those two characters.

We will choose the next character based on the frequency of the trigram in the model.
### For example:
If the model contains trigrams like 'THE', 'THA', and 'THI', the next character could be 'E', 'A', or 'I', with probabilities proportional to how many times each trigram occurs in the text.
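
To make the weighting concrete: the probability of a character following a bigram is that trigram's count divided by the total count of all trigrams starting with the same bigram. The sketch below works this out on a toy model; the counts and the name `next_char_probabilities` are illustrative assumptions, not values taken from the books.

```python
# Illustrative counts only; not taken from the five books
toy_model = {"THE": 6, "THA": 3, "THI": 1, "HER": 4}

def next_char_probabilities(model, bigram):
    # Keep only the trigrams that begin with the given bigram
    matches = {tri: n for tri, n in model.items() if tri.startswith(bigram)}
    total = sum(matches.values())
    # Probability of each third character = its count / total for the bigram
    return {tri[2]: n / total for tri, n in matches.items()}

probs = next_char_probabilities(toy_model, "TH")
print(probs)
```

With these toy counts, 'E' follows 'TH' six times out of ten, 'A' three times and 'I' once, so 'E' is chosen most often but not always.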



In [8]:
# Import the random module
import random

In [9]:
# Function to get the next character based on the current bigram
def get_next_char(trigram_model, current_bigram):
    # Find all trigrams starting with the current bigram
    possible_trigrams = {tri: count for tri, count in trigram_model.items() if tri.startswith(current_bigram)}

    if not possible_trigrams:

        # Print when no Trigrams are found for a bigram
        print(f"No trigrams found for bigram '{current_bigram}'. Returning a space.")
        return ' '
    
    # Show the candidate trigrams and their counts for the current bigram
    print(f"Possible trigrams for '{current_bigram}': {possible_trigrams}")

    # Extract the third character and the corresponding weights
    next_chars = [tri[2] for tri in possible_trigrams.keys()]
    weights = list(possible_trigrams.values())

    # Randomly select the next character based on the weights (trigram frequencies)
    next_char = random.choices(next_chars, weights=weights)[0]
    print(f"Selected next character '{next_char}' for bigram '{current_bigram}'")
    return next_char


### Generating 10,000 Characters of Text
Using the 'get_next_char' function we will build a string starting with 'TH' and for each step, the function will look at the last two characters and select the next character based on the trigram model.

We will then generate a string of 10,000 characters by repeatedly calling the 'get_next_char' function.
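
Because `get_next_char` filters the whole model on every call, generating 10,000 characters repeats that scan 10,000 times. One possible speed-up, sketched here as an assumption rather than as part of the notebook, is to group the counts by their two-character prefix once and then look each bigram up directly. The names `build_prefix_index` and `next_char_from_index` are hypothetical.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Group trigram counts by their first two characters (hypothetical helper)."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        next_chars, weights = index[trigram[:2]]
        next_chars.append(trigram[2])
        weights.append(count)
    return index

def next_char_from_index(index, current_bigram):
    """Pick the next character with probability proportional to its count."""
    if current_bigram not in index:
        return ' '  # same fallback as get_next_char
    next_chars, weights = index[current_bigram]
    return random.choices(next_chars, weights=weights)[0]

# Toy counts for illustration only
toy_index = build_prefix_index({"THE": 6, "THA": 3, "HE ": 5})
print(next_char_from_index(toy_index, "TH"))
```

The index is built in a single pass over the model, after which each lookup only touches the handful of trigrams that share the current bigram.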

#### References:
1. **Random Choices in Python**: The `random.choices()` method is used to select the next character based on trigram frequencies. Learn more about this function in the [Python documentation](https://docs.python.org/3/library/random.html#random.choices).
2. **N-gram Models in Text Generation**: Learn how n-gram models are applied to text generation from this resource: [Stanford NLP](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

In [10]:
# Function to generate a string of a specified length using the trigram model
def generate_text(trigram_model, length=10000):
    # Start with the seed bigram "TH"
    generated_text = "TH"

    # Generate characters until the desired length is reached
    while len(generated_text) < length:
        # Get the last two characters 
        current_bigram = generated_text[-2:]
        
        # Get the next character using the trigram model
        next_char = get_next_char(trigram_model, current_bigram)
        
        # Append the next character to the generated text
        generated_text += next_char
    
    return generated_text

In [11]:
# Generate 10,000 characters of text using the trigram model
generated_text = generate_text(trigram_model, length=10000)

# Print the first 500 characters to inspect the output
print(generated_text[:500])


Possible trigrams for 'TH': {'THE': 46163, 'THI': 7011, 'TH ': 10128, 'THO': 3381, 'THA': 11027, 'THU': 423, 'THS': 238, 'THL': 74, 'THY': 280, 'THR': 1387, 'THM': 1, 'THW': 64, 'THF': 63, 'THD': 54, 'THH': 9, 'THN': 3, 'THQ': 1, 'THP': 3, 'THC': 1}
Selected next character 'E' for bigram 'TH'
Possible trigrams for 'HE': {'HE ': 45241, 'HER': 15508, 'HEC': 93, 'HEM': 1865, 'HEA': 2361, 'HEI': 1469, 'HEN': 3926, 'HEY': 1924, 'HES': 1513, 'HED': 1276, 'HEE': 438, 'HEW': 45, 'HEL': 1000, 'HET': 257, 'HEB': 7, 'HEF': 6, 'HEU': 6, 'HEV': 7, 'HEO': 28, 'HEP': 3, 'HEQ': 7, 'HEG': 1}
Selected next character 'N' for bigram 'HE'
Possible trigrams for 'EN': {'ENB': 488, 'ENT': 8308, 'ENS': 1032, 'ENG': 866, 'EN ': 9996, 'ENE': 1748, 'ENI': 562, 'END': 2756, 'ENC': 2334, 'ENO': 374, 'ENV': 53, 'ENU': 80, 'ENA': 302, 'ENN': 434, 'ENM': 9, 'ENF': 62, 'ENJ': 131, 'ENR': 314, 'ENL': 192, 'ENP': 1, 'ENZ': 10, 'ENY': 17, 'ENH': 10, 'ENW': 6, 'ENK': 11}
Selected next character ' ' for bigram 'EN'
Possible

Possible trigrams for 'FU': {'FUL': 1572, 'FUR': 243, 'FUN': 102, 'FUT': 101, 'FUS': 270, 'FUM': 24, 'FUG': 16, 'FUE': 4}
Selected next character 'T' for bigram 'FU'
Possible trigrams for 'UT': {'UTE': 1226, 'UTH': 526, 'UT ': 7502, 'UTI': 535, 'UTA': 105, 'UTR': 33, 'UTY': 257, 'UTM': 67, 'UTO': 29, 'UTU': 147, 'UTT': 355, 'UTC': 45, 'UTS': 99, 'UTW': 41, 'UTL': 31, 'UTD': 7, 'UTN': 2, 'UTB': 11, 'UTF': 1, 'UTP': 1}
Selected next character ' ' for bigram 'UT'
Possible trigrams for 'T ': {'T G': 1048, 'T O': 4902, 'T N': 1227, 'T A': 5180, 'T  ': 9842, 'T U': 841, 'T W': 4923, 'T L': 1666, 'T R': 891, 'T 5': 7, 'T H': 5205, 'T T': 8630, 'T B': 2445, 'T I': 6134, 'T F': 1789, 'T P': 1341, 'T M': 3004, 'T E': 1062, 'T S': 4024, 'T D': 1578, 'T C': 1462, 'T Y': 1027, 'T V': 288, 'T K': 419, 'T Q': 127, 'T J': 145, 'T 1': 40, 'T 8': 11, 'T 2': 2, 'T 6': 2, 'T 7': 1, 'T 4': 2, 'T 9': 1}
Selected next character 'C' for bigram 'T '
Possible trigrams for ' C': {' CO': 11753, ' CH': 3393, ' CR'

### Task 3: Analyzing the Generated Text

In this task, we will analyze the generated text from Task 2 to determine how many words are valid English words.

#### Process
1. **Loading a List of Valid English Words**: We will use a predefined list of English words from `words.txt`.
2. **Splitting the Generated Text into Words**: The text from Task 2 will be split by spaces to get individual words.
3. **Calculating the Percentage of Valid Words**: Finally, for each word in the generated text, we check if it exists in the list of valid English words. We can then calculate the percentage of valid words.

This gives us an insight into how closely the generated text resembles actual English language patterns.

### Loading the List of Valid English Words
Below we will load a predefined list of valid English words from 'words.txt' and store it in a set.

A sample of the valid English words is printed to verify that the words have been loaded correctly.

In [12]:
# Load the list of valid English words
with open("texts/words.txt", 'r',encoding='utf-8') as file:
    valid_words = set(file.read().upper().split())

# Check a sample of valid English words
print("Sample of valid English words:", list(valid_words)[:10])

Sample of valid English words: ['KINGS', 'RELINQUISH', 'DRUMMING', 'BROADBAND', 'OSCILLATIONS', 'CIRCUMSCRIBING', 'ENVOY', 'LURA', 'DOORWAYS', 'DANIELSON']


### Calculating the Percentage of Valid Words

Using the `analyze_generated_text` function, we are able to calculate the percentage of valid English words in the generated text:
1. **Split the Text**: The generated text is split on whitespace into individual words.
2. **Check Validity**: Each word is then checked against the list of valid English words.
3. **Calculate the Percentage**: The function calculates the percentage of valid words by dividing the number of valid words by the total number of words.

The result is the percentage of words in the generated text that match real English words, which indicates how closely the generated text resembles real English.
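
A small worked example can make the calculation easier to follow. The sketch below applies the same idea to toy data; `valid_word_percentage`, the word set and the sample text are all assumptions for illustration and stand in for the real function and `words.txt`.

```python
def valid_word_percentage(text, word_set):
    # Hypothetical stand-in for analyze_generated_text, shown on toy data
    words = text.split()
    if not words:
        return 0.0
    # Count the tokens that appear exactly in the set of known words
    valid_count = sum(1 for word in words if word in word_set)
    return valid_count / len(words) * 100

toy_words = {"THE", "AND", "HER"}
toy_text = "THE THAND HER AND WOUT"
print(f"{valid_word_percentage(toy_text, toy_words):.2f}%")
```

Three of the five tokens are in the toy word set, so the result is 60%. Storing the words in a set keeps each membership check fast even for a large word list.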

In [13]:
# Function to analyze the percentage of valid English words in the generated text
def analyze_generated_text(text, word_list):
    words = text.split()  # Split the generated text into words
    valid_words = [word for word in words if word in word_list]
    return len(valid_words) / len(words) * 100 if words else 0

# Analyze the generated text from Task 2
percentage_of_valid_words = analyze_generated_text(generated_text, valid_words)
print(f"Percentage of valid English words in generated text: {percentage_of_valid_words:.2f}%")


Percentage of valid English words in generated text: 39.74%


### Task 4: Exporting the Trigram Model as JSON

In this task, we will export the trigram model as a JSON file. JSON is a lightweight format for storing and transferring data, which I think makes it well suited to saving the model for later use.

The trigram model will be saved to 'trigrams.json'.

#### Process:
1. **Function Definition**: The function 'export_trigram_model_to_json' takes the model dictionary and a file name.
2. **Exporting to JSON**: This function uses 'json.dump' to write the trigram model to the specified file. I use 'indent=4' to make the JSON file more readable by adding indentation, and 'ensure_ascii=False' so that any non-ASCII characters are written as-is rather than escaped.
3. **Confirmation Message**: Once the file is saved successfully, the message 'Trigram model has been exported to trigrams.json' is printed to confirm the export.

This process allows the model to be stored in a format that is easy to read and transfer.

#### References:
**Python Convert From Python to JSON**: If you have a Python object, you can convert it into a JSON string by using the json.dumps() method [W3Schools](https://www.w3schools.com/python/gloss_python_convert_into_JSON.asp#:~:text=Python%20Convert%20From%20Python%20to%20JSON,-%E2%9D%AE%20Python%20Glossary&text=If%20you%20have%20a%20Python,dumps()%20method)



In [14]:
# Import Json module

import json

In [15]:

# Function to export the trigram model as JSON
def export_trigram_model_to_json(trigram_model, filename="trigrams.json"):
    # Export the trigram model to a JSON file
    with open(filename, 'w', encoding='utf-8') as json_file:
        # write the trigram model to the JSON file with proper indentation
        json.dump(trigram_model, json_file, ensure_ascii=False, indent=4)
    # Print a message to confirm the export
    print(f"Trigram model has been exported to {filename}")

# Call the function to export the model
export_trigram_model_to_json(trigram_model)


Trigram model has been exported to trigrams.json
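
To use the saved model later, the JSON file can be read back with `json.load`. The sketch below is a minimal round-trip check on a toy model written to a temporary file; the helper name `load_trigram_model` and the file name `toy_trigrams.json` are assumptions and are not part of the notebook.

```python
import json
import os
import tempfile
from collections import defaultdict

def load_trigram_model(filename):
    """Hypothetical helper: read a trigram model saved as JSON."""
    with open(filename, 'r', encoding='utf-8') as json_file:
        counts = json.load(json_file)
    # Wrap in defaultdict(int) so unseen trigrams count as 0, as in Task 1
    return defaultdict(int, counts)

# Round-trip check on a small toy model written to a temporary file
toy_model = {"THE": 6, "HE ": 5}
toy_path = os.path.join(tempfile.gettempdir(), "toy_trigrams.json")
with open(toy_path, 'w', encoding='utf-8') as json_file:
    json.dump(toy_model, json_file, ensure_ascii=False, indent=4)

restored = load_trigram_model(toy_path)
print(restored == toy_model)
```

JSON object keys are always strings, which suits this model because every key is already a three-character string.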
