# Task 1 - Third-Order Letter Approximation Model

## Introduction

In this notebook (trigrams.ipynb), a trigram model is built from the text of five English books:

1. The Great Gatsby by F. Scott Fitzgerald
2. The Odyssey by Homer
3. Sense and Sensibility by Jane Austen
4. The Tempest by William Shakespeare
5. The Sign of the Four by Arthur Conan Doyle

The process involves the following steps:

1. All characters except ASCII letters (both uppercase and lowercase), spaces and full stops are removed.

2. All letters are converted to uppercase.

3. A trigram model is created that counts the number of times each sequence of three characters (each trigram) occurs.


The final result is a dictionary that maps each book to its corresponding trigram frequency model.

## Setting Up Imports and Constants

1. The os module provides functions for interacting with the operating system, such as managing directories and files. It is used here to build the file path of each book.

2. defaultdict is a container defined in the collections module. It automatically assigns a default value to a key that does not yet exist in the dictionary, which makes it well suited to counting how often each trigram appears without manually checking whether the key is present.

3. The BOOKS_DIRECTORY constant holds the path of the folder in which the books are stored.

In [None]:
import os
from collections import defaultdict

BOOKS_DIRECTORY = "books/" 
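
To illustrate why defaultdict suits counting, here is a minimal sketch. The sample string and the variable names are illustrative assumptions and are not taken from the notebook.

```python
from collections import defaultdict

# Illustrative sample only; not one of the books used in this task.
sample_letters = "BANANA"

letter_counts = defaultdict(int)  # a missing key starts at int(), which is 0
for letter in sample_letters:
    letter_counts[letter] += 1    # no "if letter in letter_counts" check needed

print(dict(letter_counts))  # prints {'B': 1, 'A': 3, 'N': 2}
```

With a plain dict, the same loop would raise a KeyError the first time a new letter is seen unless the key were initialised first.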

## Cleaning Text 

1. The clean_book_text function preprocesses the text by removing unnecessary characters.

2. Only ASCII letters, spaces and full stops remain in the text afterwards.

3. All letters in the cleaned text are converted to uppercase for consistency.

In [None]:
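def clean_book_text(book_title):
    """Return the book's text in uppercase with only ASCII letters, spaces and full stops."""
    book_file_path = os.path.join(BOOKS_DIRECTORY, book_title)
    try:
        with open(book_file_path, 'r', encoding='utf-8') as file:
            book_text = file.read()

            # Keep ASCII letters, spaces and full stops; drop every other character.
            # isascii() is needed because isalpha() alone also accepts accented letters.
            book_text = ''.join(
                character for character in book_text
                if (character.isascii() and character.isalpha()) or character in ' .'
            )
            return book_text.upper()
    except FileNotFoundError:
        print(f"Error: Sorry the file {book_title} was not found in the directory!")
        return None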
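
The filtering rule can be tried without any book files. The sketch below applies the same rule to an in-memory string; the helper name clean_text_string, the ALLOWED_CHARACTERS constant and the sample sentence are hypothetical and exist only for this illustration.

```python
import string

# Hypothetical helper: the notebook's rule applied to a string instead of a file.
ALLOWED_CHARACTERS = set(string.ascii_letters + " .")

def clean_text_string(raw_text):
    """Keep ASCII letters, spaces and full stops, then convert to uppercase."""
    kept = "".join(character for character in raw_text if character in ALLOWED_CHARACTERS)
    return kept.upper()

sample = "Hello, World! It is 2024."
print(clean_text_string(sample))  # prints HELLO WORLD IT IS .
```

Note that newline characters are removed as well, so words separated only by a line break end up joined in the cleaned text.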

## Creating Trigrams

The create_trigrams function extracts every sequence of three consecutive characters (a trigram) from the preprocessed text. A dictionary tracks how often each trigram occurs.

In [4]:
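def create_trigrams(book_text):
    """Count how often each three-character sequence occurs in book_text."""
    trigram_frequencies = defaultdict(int)
    # Slide a window of width three over the text, one character at a time.
    for i in range(len(book_text) - 2):
        trigram = book_text[i:i+3]
        trigram_frequencies[trigram] += 1
    # Return only after the whole text has been scanned.
    return trigram_frequencies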
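
As a rough sketch of how the pieces might fit together, the block below counts trigrams for two short strings and stores one model per title. The function count_trigrams mirrors create_trigrams so that the block runs on its own; the titles and texts in cleaned_samples are hypothetical stand-ins for the cleaned books, not real data.

```python
from collections import defaultdict

def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

# Hypothetical, already-cleaned stand-ins for the book texts.
cleaned_samples = {
    "sample_one.txt": "THE THEME.",
    "sample_two.txt": "ABABA",
}

# Map each title to its own trigram frequency model.
trigram_models = {
    title: count_trigrams(text) for title, text in cleaned_samples.items()
}

print(dict(trigram_models["sample_two.txt"]))  # prints {'ABA': 2, 'BAB': 1}
```

A text of n characters yields n - 2 overlapping trigrams, so "ABABA" produces three windows: ABA, BAB and ABA.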
        
    
 