# Task 1: Third-order letter approximation model

In this task, I will create a trigram model of the English language using five English works from Project Gutenberg.

## Step 1: Preprocessing the Text

Text Preprocessing:
The first step is to clean and preprocess the text. The aim is to remove unwanted characters, such as numbers and punctuation other than full stops, and then convert everything to uppercase.
This simplifies the text and makes it uniform, so that we can focus only on the sequence of letters, spaces and full stops.

Args:
        text (str): The original text to be cleaned.

Returns: 
        str: Cleaned and preprocessed text.

I will use Python's `re` module to handle the text cleaning.

In [137]:
# Required Imports
import re
from collections import defaultdict

In [138]:
# Function to preprocess the text
def preprocess_text(text):
    # Replace every non-alphanumeric character with a space.
    # Note: this pattern keeps digits and also replaces full stops.
    clean_text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Convert all characters to uppercase
    clean_text = clean_text.upper()

    return clean_text  # Return the cleaned text

# Test the function with a sample text
text1 = "This is a sample text. It will be preprocessed to remove all non-alphanumeric characters !!!!! ,,,, and convert all characters to uppercase."
cleaned_text = preprocess_text(text1)
print(cleaned_text)

THIS IS A SAMPLE TEXT  IT WILL BE PREPROCESSED TO REMOVE ALL NON ALPHANUMERIC CHARACTERS            AND CONVERT ALL CHARACTERS TO UPPERCASE 


### Explanation of Output
As the output above shows, the text is converted to uppercase and every non-alphanumeric character, including the full stops, the hyphen and the commas, is replaced by a space. This is why gaps appear where the punctuation used to be.

## Alphanumeric Characters
Alphanumeric characters are the combination of **letters (A-Z)** and **numbers (0-9)**. They are used quite commonly in computing for passwords, identifiers and other text-based inputs where both letters and digits are needed.

You can find more details in this article: [Alphanumeric Characters - TechTarget](https://www.techtarget.com/whatis/definition/alphanumeric-alphameric).
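
The aim stated at the start of this step is to keep full stops and drop digits, whereas the pattern `[^a-zA-Z0-9]` used in `preprocess_text` keeps digits and replaces full stops with spaces. A minimal sketch of a variant that follows the stated aim might look like the following. The function name `preprocess_keep_fullstops` is hypothetical, and deleting unwanted characters (rather than replacing them with spaces) is an assumption.

```python
import re

def preprocess_keep_fullstops(text):
    # Hypothetical variant: delete everything except ASCII letters, spaces and full stops
    clean_text = re.sub(r'[^a-zA-Z. ]', '', text)
    # Convert all characters to uppercase
    return clean_text.upper()

print(preprocess_keep_fullstops("Hello, World 123. ok!"))
```

Note that line breaks are deleted as well under this pattern, which joins the last word of one line to the first word of the next. Replacing newlines with spaces before applying the pattern would avoid that. Switching to a variant like this would also change the trigram counts reported later in the notebook.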


## Step 2: Creating the Trigram Model

### Trigram Model:
A trigram model captures the frequency of every three-character sequence in the text. 
In this step, I will build a dictionary where each key is a unique trigram and the corresponding value is the number of times that trigram appears in the text.

I will use a defaultdict from Python's collections module.

This approach is commonly used in **natural language processing (NLP)** for tasks such as text prediction and language modeling, where n-grams (trigrams in this case) are used to capture sequence patterns.

Reference: [N-grams in Natural Language Processing](https://en.wikipedia.org/wiki/N-gram).

In [139]:
from collections import defaultdict

def create_trigram_model(text):
    trigram_counts = defaultdict(int)

    # Loop over the text, extracting trigrams and updating their counts
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1

    return trigram_counts


In [140]:
# Test the function with sample text
sample_text = "hello world, This is a sample text to test the trigram model."
trigram_counts = create_trigram_model(sample_text)

### Explanation of Output
If we take a look at the sample word 'hello':

1. **'hel': 1** - This trigram appears only once in the sample text.
2. **'ell': 1** - This trigram also appears only once.
3. **'llo': 1** - Once again, this trigram appears only once.

The frequency count of each trigram gives us an insight into how common certain letter combinations are in the text. This makes the trigram model useful for capturing patterns in sequences of text, which can be used for tasks such as text generation.

In the sample output printed below, we can see that some trigrams, such as 'is ', appear more than once, while others, such as 'hel' or 'llo', appear only once. The model provides a foundation for predicting the next character in a sequence based on the previous two characters.
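
To make the link between counts and prediction concrete, the sketch below turns trigram counts into conditional probabilities of the third character given the first two. It is only an illustration under stated assumptions: the helper name `next_char_probabilities` is hypothetical and not part of the notebook, and it recounts the trigrams itself so that it runs on its own.

```python
from collections import Counter, defaultdict

def next_char_probabilities(text):
    # Count every three-character window, as create_trigram_model does
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))

    # Group the counts by their two-character prefix
    by_prefix = defaultdict(dict)
    for trigram, n in counts.items():
        by_prefix[trigram[:2]][trigram[2]] = n

    # Normalise each group so the values for one prefix sum to 1
    probabilities = {}
    for prefix, next_counts in by_prefix.items():
        total = sum(next_counts.values())
        probabilities[prefix] = {ch: n / total for ch, n in next_counts.items()}
    return probabilities

sample = "hello world, This is a sample text to test the trigram model."
probs = next_char_probabilities(sample)
print(probs["te"])  # 'tex' and 'tes' each occur once in the sample
```

In the sample text, the prefix 'te' is followed once by 'x' and once by 's', so each gets a probability of 0.5.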

#### Reference:
1. **Understanding Language Modeling**: For a more detailed explanation of language models, including n-grams and transformer-based neural models, refer to this [Medium article](https://medium.com/@roshmitadey/understanding-language-modeling-from-n-grams-to-transformer-based-neural-models-d2bdf1532c6d).

In [141]:
# Print the trigram counts
for trigram, count in trigram_counts.items():
    print(f"'{trigram}' : {count}")

'hel' : 1
'ell' : 1
'llo' : 1
'lo ' : 1
'o w' : 1
' wo' : 1
'wor' : 1
'orl' : 1
'rld' : 1
'ld,' : 1
'd, ' : 1
', T' : 1
' Th' : 1
'Thi' : 1
'his' : 1
'is ' : 2
's i' : 1
' is' : 1
's a' : 1
' a ' : 1
'a s' : 1
' sa' : 1
'sam' : 1
'amp' : 1
'mpl' : 1
'ple' : 1
'le ' : 1
'e t' : 2
' te' : 2
'tex' : 1
'ext' : 1
'xt ' : 1
't t' : 2
' to' : 1
'to ' : 1
'o t' : 1
'tes' : 1
'est' : 1
'st ' : 1
' th' : 1
'the' : 1
'he ' : 1
' tr' : 1
'tri' : 1
'rig' : 1
'igr' : 1
'gra' : 1
'ram' : 1
'am ' : 1
'm m' : 1
' mo' : 1
'mod' : 1
'ode' : 1
'del' : 1
'el.' : 1


### Loading and Preprocessing the Text from Five Books

Here we load five books from Project Gutenberg, preprocess them, and combine them into a single large text. Each book is read as plain text and passed through the `preprocess_text` function, which cleans out unwanted characters and converts the text to uppercase.

#### Process:
1. **Load Each Book**: Each book is loaded from the `texts/` folder.
2. **Preprocess**: Each book is cleaned with `preprocess_text`, which converts the text to uppercase and replaces every non-alphanumeric character with a space.
3. **Combine Text**: The preprocessed text from each book is combined into a single large string.

This combined text serves as the basis for building the trigram model.


In [142]:
# List of file paths for the five books inside the "texts/" folder
books = ["texts/book1.txt", "texts/book2.txt", "texts/book3.txt", "texts/book4.txt", "texts/book5.txt"]

# Preprocess all books and combine them into one large string
combined_text = ""
for book in books:
    with open(book, 'r', encoding='utf-8') as file:
        text = file.read()  # Read the text of the book
        clean_text = preprocess_text(text)  # Preprocess the text
        combined_text += clean_text  # Combine all preprocessed texts

# Check the length of the combined text
print(f"Combined text length: {len(combined_text)} characters")


Combined text length: 3666300 characters


As shown above, the length of the combined text is printed after preprocessing to verify that the text from all five books has been combined.

### Creating the Trigram Model
Using the combined text, we now create a trigram model that captures the frequency of each trigram. The model will allow us to generate new text that mimics the structure and style of the original books.

#### Process:
- **Trigram Model**: We loop through the combined text and count each unique trigram. The model is stored as a dictionary where each key is a trigram and the value is the number of times that trigram appears.

Below, the first 10 trigrams and their counts are displayed as a sample to inspect the structure of the model.

In [143]:
# Create the trigram model from the combined text
trigram_model = create_trigram_model(combined_text)

# Print the first 10 trigrams to inspect the model
for trigram, count in list(trigram_model.items())[:10]:
    print(f"'{trigram}': {count}")


' TH': 61066
'THE': 46163
'HE ': 45241
'E P': 4706
' PR': 5519
'PRO': 3305
'ROJ': 465
'OJE': 464
'JEC': 992
'ECT': 3654
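
The ten entries above are in insertion order, so they reflect the opening characters of the first book rather than the most frequent trigrams. A possible way to inspect the most common trigrams instead is sketched below. The helper `top_trigrams` is a hypothetical name, and the tiny `sample_model` is only a stand-in so the sketch runs without the book files.

```python
from collections import defaultdict

def top_trigrams(trigram_model, n=10):
    # Sort (trigram, count) pairs by count, highest first, and keep the first n
    return sorted(trigram_model.items(), key=lambda item: item[1], reverse=True)[:n]

# Tiny stand-in model so the sketch is self-contained
sample = "THE CAT AND THE HAT."
sample_model = defaultdict(int)
for i in range(len(sample) - 2):
    sample_model[sample[i:i + 3]] += 1

top = top_trigrams(sample_model, n=3)
for trigram, count in top:
    print(f"'{trigram}': {count}")
```

With the real model, the same call would be `top_trigrams(trigram_model)`.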


# Task 2: Generating Text from the Trigram Model

In this task, I will use the trigram model created in Task 1 to generate a string of 10,000 characters. The process begins with a string of two characters ('TH'). For each subsequent character, we look at the previous two characters and select the next character from the trigrams that start with those two characters.

The next character is chosen at random, weighted by the frequency of each matching trigram in the model.
For example, if the model contains the trigrams 'THE', 'THA', and 'THI', the next character could be 'E', 'A', or 'I', with probabilities proportional to how many times each trigram occurs in the text.
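
The weighting can be illustrated with a small worked example. The counts below are made up for illustration and are not taken from the book model. The sketch converts them to probabilities and then checks that `random.choices` draws characters in roughly those proportions.

```python
import random

# Hypothetical counts for the trigrams that start with "TH"
counts = {"THE": 6, "THA": 3, "THI": 1}

total = sum(counts.values())
# Probability of each candidate next character given the bigram "TH"
probabilities = {trigram[2]: n / total for trigram, n in counts.items()}
print(probabilities)

# Draw many samples and compare the observed shares with the probabilities
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(list(probabilities), weights=list(probabilities.values()), k=10000)
observed = {ch: draws.count(ch) / len(draws) for ch in probabilities}
print(observed)
```

With these counts, 'E' has probability 0.6, 'A' 0.3 and 'I' 0.1, and the observed shares should land close to those values.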



In [144]:
# Import the random module
import random

In [145]:
# Function to get the next character based on the current bigram
def get_next_char(trigram_model, current_bigram):
    # Find all trigrams starting with the current bigram
    possible_trigrams = {tri: count for tri, count in trigram_model.items() if tri.startswith(current_bigram)}

    if not possible_trigrams:
        return ' '
    
    # Extract the third character and the corresponding weights
    next_chars = [tri[2] for tri in possible_trigrams.keys()]
    weights = list(possible_trigrams.values())

    # Randomly select the next character based on the weights (trigram frequencies)
    return random.choices(next_chars, weights=weights)[0]
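
`get_next_char` scans every trigram in the model on each call. For 10,000 generated characters that is workable, but one possible alternative is to group the trigrams by their two-character prefix once and look the prefix up directly. The sketch below assumes every key in the model has exactly three characters. The names `build_prefix_index` and `get_next_char_indexed` are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    # Map each two-character prefix to parallel lists of next characters and weights
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return dict(index)

def get_next_char_indexed(index, current_bigram, rng=random):
    # Same fallback as get_next_char: return a space for an unseen bigram
    if current_bigram not in index:
        return ' '
    chars, weights = index[current_bigram]
    return rng.choices(chars, weights=weights)[0]

# Toy model with made-up counts, used only to exercise the functions
toy_model = {"THE": 5, "THA": 2, "HER": 4}
toy_index = build_prefix_index(toy_model)
print(toy_index["TH"])
print(get_next_char_indexed(toy_index, "HE"))
print(repr(get_next_char_indexed(toy_index, "ZZ")))
```

The index is built once, so each later lookup costs a dictionary access plus the weighted draw rather than a pass over the whole model.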

### Generating 10,000 Characters of Text
Using the `get_next_char` function, we build a string starting with 'TH'. At each step, the function looks at the last two characters and selects the next character based on the trigram model.

We then generate a string of 10,000 characters by repeatedly calling `get_next_char`.

#### References:
1. **Random Choices in Python**: The `random.choices()` method is used to select the next character based on trigram frequencies. Learn more about this function in the [Python documentation](https://docs.python.org/3/library/random.html#random.choices).
2. **N-gram Models in Text Generation**: Learn how n-gram models are applied to text generation from this resource: [Stanford NLP](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

In [146]:
# Function to generate a string of a specified length using the trigram model
def generate_text(trigram_model, length=10000):
    # Start with the seed "TH"
    generated_text = "TH"

    # Generate characters until the desired length is reached
    while len(generated_text) < length:
        # Get the last two characters
        current_bigram = generated_text[-2:]

        # Get the next character using the trigram model
        next_char = get_next_char(trigram_model, current_bigram)

        # Append the next character to the generated text
        generated_text += next_char

    return generated_text

In [147]:
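# Generate 10,000 characters of text using the trigram model built from the five books.
# Pass trigram_model here: the lowercase sample model trigram_counts contains no trigram
# starting with "TH", so it would produce "TH" followed only by spaces.
generated_text = generate_text(trigram_model, length=10000)

# Print the first 500 characters to inspect the output
print(generated_text[:500])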
