# Task 1 - Third-Order Letter Approximation Model

## Introduction

In this notebook (trigrams.ipynb), a trigram model is built from five English books. The five books used for this task are:

1. The Great Gatsby by F. Scott Fitzgerald
2. The Odyssey by Homer
3. Sense and Sensibility by Jane Austen
4. The Tempest by William Shakespeare
5. The Sign of the Four by Arthur Conan Doyle

The steps involved in this process are:

1. All characters except ASCII letters (both uppercase and lowercase), spaces and full stops are removed.

2. All letters are converted to uppercase.

3. A trigram model is created, which counts the number of times each sequence of three characters (each trigram) occurs.


The final outcome is a dictionary that links each book with its corresponding trigram frequency model.










## Setting Up Imports and Constants

1. The os module provides functions for interacting with the operating system, for example managing directories and files. Here it is used to build the file paths.

2. defaultdict is a container defined in the collections module. It automatically assigns a default value to a key that does not yet exist in the dictionary. This is useful for counting the number of times each trigram appears, without having to check manually whether a key is present.

3. islice from the itertools module is used to display only the first few entries of each model.

4. The BOOKS_DIRECTORY constant indicates the folder in which the books are stored.
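
As a small illustration of the point about defaultdict, the sketch below counts the characters of a short sample string twice, once with a plain dict and once with defaultdict(int). The sample string and the variable names are made up for this example and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical sample data, used only for illustration.
sample_text = "BANANA"

# With a plain dict, a missing key has to be handled explicitly.
plain_counts = {}
for character in sample_text:
    if character not in plain_counts:
        plain_counts[character] = 0
    plain_counts[character] += 1

# With defaultdict(int), a missing key starts at int() == 0 automatically.
default_counts = defaultdict(int)
for character in sample_text:
    default_counts[character] += 1

print(dict(default_counts))  # {'B': 1, 'A': 3, 'N': 2}
```

Both loops produce the same counts; the defaultdict version simply drops the membership check.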



In [None]:
import os
from collections import defaultdict
from itertools import islice

BOOKS_DIRECTORY = "books/" 

## Cleaning Text 

1. The clean_book_text function preprocesses the text by removing unnecessary characters.
   
2. This ensures that only full stops, spaces and ASCII letters remain in the text.
   
3. All letters in the cleaned text are converted to uppercase for consistency.
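
The effect of these rules can be seen on a short string. The following is a minimal sketch that assumes "letters" means ASCII letters only; the helper name keep_allowed_characters and the sample string are hypothetical and exist only for this illustration.

```python
def keep_allowed_characters(raw_text):
    """Keep ASCII letters, spaces and full stops, then convert to uppercase."""
    kept = [
        character for character in raw_text
        # isalpha() alone would also accept letters such as 'é'.
        if (character.isascii() and character.isalpha()) or character in " ."
    ]
    return "".join(kept).upper()

# Hypothetical sample containing punctuation, digits and a non-ASCII letter.
sample = "Hello, World! It's 2024. Caf\u00e9"
print(keep_allowed_characters(sample))  # HELLO WORLD ITS . CAF
```

Note that removed characters are dropped rather than replaced, so "It's" becomes "ITS" and the digits leave only the surrounding spaces and the full stop behind.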






In [None]:
def clean_book_text(book_title):
    """Return the book text reduced to uppercase ASCII letters, spaces and full stops."""
    book_file_path = os.path.join(BOOKS_DIRECTORY, book_title)
    try:
        with open(book_file_path, 'r', encoding='utf-8') as file:
            book_text = file.read()
    except FileNotFoundError:
        print(f"Error: Sorry the file {book_title} was not found in the directory!")
        return None

    # str.isalpha() alone also accepts non-ASCII letters such as 'é',
    # so isascii() is checked as well to keep ASCII letters only.
    book_text = ''.join(
        character for character in book_text
        if (character.isascii() and character.isalpha()) or character in ' .'
    )
    return book_text.upper()

## Creating Trigrams
The create_trigrams function extracts sequences of three consecutive characters, also known as trigrams, from the preprocessed text. A dictionary is used to track the frequency of each trigram.

In [None]:
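def create_trigrams(book_text):
    """Count every overlapping three-character sequence in book_text."""
    trigram_frequencies = defaultdict(int)
    # Slide a window of width 3 across the text, one character at a time.
    for i in range(len(book_text) - 2):
        trigram = book_text[i:i+3]
        trigram_frequencies[trigram] += 1
    return trigram_frequencies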
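
To make the sliding window concrete, the sketch below applies the same counting idea to a very short string. It is self-contained and uses a hypothetical helper name, count_trigrams, so that it can be run on its own; the sample string is also made up.

```python
from collections import defaultdict

def count_trigrams(text):
    """Count overlapping three-character windows (same idea as above)."""
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

# An 8-character string yields 8 - 2 = 6 overlapping windows:
# 'THE', 'HE ', 'E T', ' TH', 'THE', 'HE.'
sample_counts = count_trigrams("THE THE.")
for trigram, frequency in sample_counts.items():
    print(repr(trigram), frequency)
```

Only 'THE' occurs twice here; spaces and full stops take part in trigrams just like letters.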
        
    
 

## Preprocessing Books For Trigram Model

All the book files are processed in this section. The steps involved are:

1. The contents of each file are read.

2. The clean_book_text function is used to clean the text.

3. The create_trigrams function is used to create a trigram model.

4. The model for each book is stored in a dictionary keyed by file name, and the first 30 trigrams are displayed.
   

In [None]:
book_file_names = [
    "the_great_gatsby.txt",
    "the_odyssey.txt",
    "sense_and_sensibility.txt",
    "the_tempest.txt",
    "the_sign_of_the_four.txt",
]

book_trigram_models = {}

for book_file in book_file_names:
    cleaned_book_text = clean_book_text(book_file)
    if cleaned_book_text:
        trigram_model = create_trigrams(cleaned_book_text)
        book_trigram_models[book_file] = trigram_model
        print(f"Displaying the first 30 trigrams extracted from '{book_file}':")
        for trigram, frequency in islice(trigram_model.items(), 30):
            print(f"Trigram: {trigram}, Frequency: {frequency}")
            
            
        
        


        
        
        
    
    



## Conclusion

This notebook preprocessed the books and created a trigram model for each one. The models are stored in the book_trigram_models variable, which can be used for further analysis or visualization.



# Task 2 - Third-Order Letter Approximation Generation

## Introduction 

The trigram models created in Task 1 are used to generate a string of 10,000 characters for this task.

The process starts with the string "TH", and the text is built by adding one character at a time. The trigram frequencies extracted from the text of the books are used to select the next character probabilistically.

The steps involved in this process are:

1. Use the trigram model created for each book in Task 1.

2. Begin with the string "TH".

3. Use the frequencies of the trigrams that begin with the last two characters of the generated text to select the next character probabilistically.

4. Append the selected character to the generated text and repeat the process until the specified length (10,000 characters) is reached.

5. Save the generated text for each book to an individual file.



## Creating Text Utilizing Trigram Models

The text is created by applying the trigram models from Task 1. This is done with the create_book_text_from_trigram_model function.

The logic of the create_book_text_from_trigram_model function is:

1. Start with a string of two characters, for example "TH".

2. For each new character, find all trigrams in the model that begin with the last two characters of the current text.

3. Choose the next character probabilistically, weighted by the frequencies of the matching trigrams.

4. The process ends when the specified length is reached or when no matching trigrams are found.
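
The probabilistic step can be illustrated with a handful of made-up counts. In this sketch the frequencies for trigrams beginning with "TH" are hypothetical, and a seeded random.Random instance is used only so that the run is repeatable.

```python
import random

# Hypothetical frequencies for trigrams that begin with "TH".
matching_trigrams = {"THE": 6, "THA": 3, "THI": 1}

# Each candidate's probability is its frequency divided by the total.
total = sum(matching_trigrams.values())
probabilities = {
    trigram[2]: frequency / total
    for trigram, frequency in matching_trigrams.items()
}
print(probabilities)  # {'E': 0.6, 'A': 0.3, 'I': 0.1}

# random.choices performs the weighted draw; raw counts work as weights.
rng = random.Random(0)  # seeded only to make this sketch repeatable
third_characters = [trigram[2] for trigram in matching_trigrams]
weights = list(matching_trigrams.values())
draws = rng.choices(third_characters, weights=weights, k=1000)
print({character: draws.count(character) for character in third_characters})
```

Over many draws, 'E' should appear roughly six times as often as 'I', which is how frequent trigrams come to dominate the generated text.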




In [None]:
import random

def create_book_text_from_trigram_model(trigram_model, starting_string, length):
    if len(starting_string) != 2:
        raise ValueError("The starting string must be 2 characters long.")

    created_book_text = starting_string.upper()
    while len(created_book_text) < length:
        previous_two_characters = created_book_text[-2:]

        # Collect every trigram that continues the last two characters.
        matching_trigrams = {
            trigram: frequency
            for trigram, frequency in trigram_model.items()
            if trigram.startswith(previous_two_characters)
        }

        if not matching_trigrams:
            # Dead end: stop early instead of discarding the text generated so far.
            print(f"Warning: There are no trigrams found beginning with '{previous_two_characters}'. Stopping early!")
            break

        third_characters = [trigram[2] for trigram in matching_trigrams]
        trigram_frequencies = list(matching_trigrams.values())

        # Pick the next character with probability proportional to its trigram frequency.
        next_character = random.choices(third_characters, weights=trigram_frequencies, k=1)[0]

        created_book_text += next_character

    return created_book_text
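
The function above scans the whole model for every generated character. One possible alternative, shown here as a sketch rather than as this notebook's method, is to group the trigrams by their first two characters once and then look up each prefix directly. The names build_prefix_index and generate_text, the rng parameter and the toy model are all hypothetical.

```python
import random
from collections import defaultdict

def build_prefix_index(trigram_model):
    """Map each two-character prefix to ([next characters], [frequencies])."""
    index = defaultdict(lambda: ([], []))
    for trigram, frequency in trigram_model.items():
        characters, weights = index[trigram[:2]]
        characters.append(trigram[2])
        weights.append(frequency)
    return dict(index)

def generate_text(trigram_model, starting_string, length, rng=random):
    """Generate text using one dictionary lookup per character."""
    index = build_prefix_index(trigram_model)
    text = starting_string.upper()
    while len(text) < length:
        entry = index.get(text[-2:])
        if entry is None:
            break  # no trigram continues this prefix
        characters, weights = entry
        text += rng.choices(characters, weights=weights, k=1)[0]
    return text

# Toy model, invented for illustration; every prefix has a continuation.
toy_model = {"THE": 5, "HE ": 5, "E T": 4, " TH": 4,
             "THA": 1, "HAT": 1, "AT ": 1, "T T": 1}
sample = generate_text(toy_model, "TH", 40, rng=random.Random(1))
print(sample)
```

The index is built once per model, so each generated character costs a single lookup instead of a pass over every trigram.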

    

  

        



        
    


    


## Creating Text for Each Book

A 10,000-character text is created for each book with the create_book_text_from_trigram_model function.

The steps involved in this process are:

1. Begin with the string "TH".

2. The trigram model of each book is used to create the text.

3. The created text for each book is saved to an individual file, for example the_sign_of_the_four.txt_created_text.txt.




In [None]:
created_book_texts = {}

created_book_text_length = 10000

for book_file, trigram_model in book_trigram_models.items():
    print(f"Creating text for '{book_file}'")
    starting_string = "TH"
    created_book_text = create_book_text_from_trigram_model(trigram_model, starting_string, created_book_text_length)
    created_book_texts[book_file] = created_book_text

    output_book_file = f"{book_file}_created_text.txt"
    with open(output_book_file, "w", encoding="utf-8") as file:
        file.write(created_book_text)
    print(f"The Created Text is saved to '{output_book_file}'")
    

         
 

## Conclusion

A 10,000-character text was created for each book using the trigram models from Task 1. In addition, the created text for each book was stored in an individual file.

# Task 3 - Analyze Your Model

## Introduction

The text created in Task 2 with the trigram models is analyzed in this task. The objective is to calculate the percentage of authentic English words in the 10,000-character text created for each book.

The steps involved in this process are:

1. Load a list of authentic English words.

2. Extract the words from the created text.

3. Compare these words with the word list to calculate the percentage of authentic English words.

4. Save the results for each book to a text file.
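
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under several assumptions: words are taken to be runs of characters separated by spaces, full stops at either end of a word are ignored, and a tiny in-memory set stands in for the real word list. The function name percentage_of_known_words and the sample data are hypothetical.

```python
def percentage_of_known_words(created_text, known_words):
    """Return the percentage of words in created_text found in known_words."""
    # Split on whitespace and drop full stops at either end of each word.
    words = [word.strip(".") for word in created_text.split()]
    words = [word for word in words if word]
    if not words:
        return 0.0
    matches = sum(1 for word in words if word in known_words)
    return 100 * matches / len(words)

# Tiny stand-in for a real English word list (assumption for this sketch).
known_words = {"THE", "CAT", "SAT"}
sample_text = "THE CAT SAT ON THR MAT."
print(percentage_of_known_words(sample_text, known_words))  # 50.0
```

Storing the word list in a set keeps each lookup fast, which matters once the list holds many thousands of words.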