## Task 1: Third-Order Letter Approximation Model

In this task, I will build a third-order letter approximation model using English texts from Project Gutenberg. The goal is to create a trigram model that counts the frequency of every sequence of three characters (trigram) in the selected texts.

In [3]:
import re
from collections import defaultdict

### `re` (Regular Expression Module)
- Python's `re` module provides regular expressions, which are used here to search for and modify strings based on patterns.
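
As a minimal sketch of the two `re` calls used in this notebook, the snippet below runs `re.search` and `re.sub` on a made-up sentence. The `sample` string and the variable names are illustrative assumptions, not part of the task data.

```python
import re

# Illustrative input only; not taken from the Gutenberg texts
sample = "Hello, World! It is 2024."

# re.search returns a match object for the first match, or None if nothing matches
match = re.search(r"\d+", sample)
first_number = match.group() if match else None

# re.sub replaces every match of the pattern; here every character that is
# not an ASCII letter, a full stop, or a space is replaced with nothing
cleaned = re.sub(r"[^A-Za-z. ]", "", sample)

print(first_number)  # 2024
print(cleaned)       # Hello World It is .
```

Note that `re.search` must be checked for `None` before calling `.group()`, which is why the preprocessing function tests `if start and end` before slicing.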

### `defaultdict`
- The `defaultdict` class is part of Python's `collections` module. It creates an entry with a default value the first time a missing key is accessed, which is convenient for counting because no key has to be initialised by hand.
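
To illustrate why `defaultdict` suits this task, here is a small sketch that counts overlapping three-character sequences in a short string. The helper name `count_trigrams` and the sample string are assumptions made for this example; the model built later in the notebook may be structured differently.

```python
from collections import defaultdict

def count_trigrams(text):
    """Count every overlapping three-character sequence in text (hypothetical helper)."""
    counts = defaultdict(int)  # missing keys start at 0
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

counts = count_trigrams("THE THEME")
print(counts["THE"])  # 2
print(len(counts))    # 6
```

One thing to keep in mind: reading a missing key such as `counts["XYZ"]` returns `0` but also inserts that key, so `counts.get("XYZ", 0)` is the safer way to look up trigrams that may not exist.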




In [4]:
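def preprocess_text(text):
    """Strip the Project Gutenberg boilerplate and normalise text for trigram counting."""
    # Remove preamble and postamble. The marker lines read either
    # "OF THIS PROJECT GUTENBERG EBOOK" or "OF THE PROJECT GUTENBERG EBOOK",
    # depending on the file, so both forms are accepted.
    start = re.search(r'\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*', text)
    end = re.search(r'\*\*\* END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*', text)
    # If either marker is missing, the text is left untrimmed
    if start and end:
        text = text[start.end():end.start()]

    # Remove all characters except for ASCII letters, full stops, and spaces
    text = re.sub(r'[^A-Za-z. ]', '', text)

    # Convert all letters to uppercase
    text = text.upper()

    return text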

### Preprocessing Text

The `preprocess_text` function is designed to clean and prepare the text data for further analysis. Here are the steps it performs:

1. **Remove Preamble and Postamble**: The function searches for the start and end markers of the Project Gutenberg eBook and keeps only the text between them. If either marker is not found, the text is left untrimmed.
2. **Remove Unwanted Characters**: It removes every character that is not an ASCII letter, a full stop, or a space, using the regular expression `[^A-Za-z. ]`. This also strips digits, other punctuation, and line breaks.
3. **Convert to Uppercase**: Finally, it converts all letters to uppercase to standardize the text.

This preprocessing leaves only 28 distinct characters (26 uppercase letters, the full stop, and the space), so the trigram model has at most 28³ = 21,952 possible trigrams to count.
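
One consequence of the character filter is worth seeing on a concrete input: line breaks are deleted rather than turned into spaces, so the last word of one line is joined to the first word of the next. The sketch below applies the same filter to a two-line sample and then shows a hypothetical variant that first replaces each run of whitespace with a single space. The variant is an assumption on my part, not something the task statement requires, and the sample text is made up for illustration.

```python
import re

# Illustrative two-line input
two_lines = "It was the best of times,\nit was the worst of times."

# Same filter and uppercase step as in preprocess_text:
# the newline is not in the allowed set, so it is simply deleted
filtered = re.sub(r"[^A-Za-z. ]", "", two_lines).upper()

# Hypothetical variant: collapse each run of whitespace to one space first,
# so that words on adjacent lines stay separated
normalised = re.sub(r"\s+", " ", two_lines)
filtered_with_breaks = re.sub(r"[^A-Za-z. ]", "", normalised).upper()

print(filtered)
print(filtered_with_breaks)
```

The first result contains `TIMESIT`, which adds trigrams such as `SIT` that never occur in the source text. Whether the whitespace step should be added depends on how strictly the task's preprocessing rules are meant to be followed.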