# Task 1: Third-order Letter Approximation Model

In this task, a trigram model based on text from five English books will be built. The steps for this task are:
1. Loading text files from Project Gutenberg.
2. Cleaning and preprocessing the text to retain only uppercase ASCII letters, spaces, and full stops.
3. Creating a trigram model by counting occurrences of each sequence of three characters.

This model will be used in subsequent tasks for generating text and analyzing language patterns.

## Import Libraries

The necessary libraries are imported:
- `os` for handling file paths.
- `re` for the regular expressions used to clean the text.
- `defaultdict` from `collections` for counting trigrams.

In [15]:
import os
import re
from collections import defaultdict

## Load and Clean Data

Text files from the `data` folder are loaded. Each file's content is read, cleaned, and stored in a dictionary keyed by filename.

Text is cleaned by:
- Removing the Project Gutenberg start and end markers that delimit the preamble and postamble.
- Keeping only letters, spaces, and full stops.
- Converting all letters to uppercase.

This ensures that the text is standardized before creating the model.

In [10]:
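def clean_text(text):
    """Reduce raw text to uppercase ASCII letters, spaces, and full stops."""
    # Flatten line breaks so the patterns below operate on a single line.
    text = text.replace("\n", " ")

    # Delete the Project Gutenberg start and end marker lines.
    # Note: only the marker itself is matched, not the text around it.
    text = re.sub(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)
    text = re.sub(r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*", "", text, flags=re.DOTALL)

    # Drop non-ASCII characters (accented letters, curly quotes, the byte order mark).
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Keep only letters, full stops, and spaces.
    text = re.sub(r"[^A-Za-z. ]+", "", text)

    text = text.upper()

    # Collapse runs of spaces and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()

    return text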
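
The two marker patterns in `clean_text` match only the marker line itself, so the license text before and after the book is left in place, as the printed samples further down show. A minimal sketch of cutting the whole preamble and postamble follows. It assumes the markers read either "THE" or "THIS" (the wording varies between Project Gutenberg files), and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumed marker wording; accepts both "THE" and "THIS" variants.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.DOTALL
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.DOTALL
)

def strip_gutenberg_boilerplate(text):
    """Return only the text between the start and end markers (hypothetical helper)."""
    start = START_RE.search(text)
    if start:
        # Discard everything up to and including the start marker.
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        # Discard the end marker and everything after it.
        text = text[:end.start()]
    return text

sample = (
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "More license"
)
body = strip_gutenberg_boilerplate(sample).strip()
print(body)  # Body text.
```

If neither marker is found, the text is returned unchanged, so the helper could be called on the raw file content before `clean_text` without affecting files that lack the markers.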

In [13]:
data_folder = 'data'

# Map each filename to its cleaned text.
cleaned_texts = {}
for filename in os.listdir(data_folder):
    file_path = os.path.join(data_folder, filename)

    with open(file_path, 'r', encoding='utf-8') as file:
        original_text = file.read()

    cleaned_texts[filename] = clean_text(original_text)

In [14]:
for filename, text in cleaned_texts.items():
    print(f"\nSample from cleaned text in {filename}:\n{text[:500]}\n")


Sample from cleaned text in Alice's Adventures in Wonderland.txt:
THE PROJECT GUTENBERG EBOOK OF ALICES ADVENTURES IN WONDERLAND THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATED BEFORE USING THIS EBOO


Sample from cleaned text in Dracula.txt:
THE PROJECT GUTENBERG EBOOK OF DRACULA THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES AND MOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATES YOU 

## Generate Trigram Model

The trigram model is generated by sliding a three-character window over the text one character at a time. The count of each unique trigram is kept in a dictionary.

In [17]:
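def generate_trigram_model(text):
    """Count every overlapping three-character sequence in text."""
    trigram_counts = defaultdict(int)
    # The last full window starts at index len(text) - 3.
    for i in range(len(text) - 2):
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    return trigram_counts

# Build one model per book; the sample cell below reads from trigram_models.
trigram_models = {}
for filename, text in cleaned_texts.items():
    trigram_models[filename] = generate_trigram_model(text)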
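
To see how the sliding window behaves, here is a small sketch on a toy string. `count_trigrams` is a hypothetical `Counter`-based equivalent of `generate_trigram_model`, used only so the example is self-contained.

```python
from collections import Counter

def count_trigrams(text):
    # Hypothetical equivalent of generate_trigram_model using Counter.
    # Windows overlap: each step advances by one character.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = count_trigrams("THE THE")
# Windows, in order: "THE", "HE ", "E T", " TH", "THE"
print(counts["THE"])         # 2
print(sum(counts.values()))  # 5
```

A string of length n yields n - 2 windows, so texts shorter than three characters produce an empty model.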

## Trigram Model Sample

This displays the first ten trigram counts from each text in the data folder.

In [72]:
for filename, trigram_dict in trigram_models.items():
    print(f"Sample trigrams from {filename}:")
    
    sample_trigrams = list(trigram_dict.items())[:10]
   
    for trigram, count in sample_trigrams:
        print(f"{trigram}: {count}")
    
    print("\n" + "="*50 + "\n")

Sample trigrams from Alice's Adventures in Wonderland.txt:
﻿TH: 1
THE: 2528
HE : 2301
E P: 177
 PR: 167
PRO: 162
ROJ: 88
OJE: 88
JEC: 96
ECT: 192


Sample trigrams from Dracula.txt:
﻿TH: 1
THE: 11608
HE : 10464
E P: 936
 PR: 839
PRO: 607
ROJ: 92
OJE: 92
JEC: 137
ECT: 451


Sample trigrams from Fairy Tales of Hans Christian Andersen.txt:
﻿TH: 1
THE: 42000
HE : 34494
E P: 2619
 PR: 1764
PRO: 769
ROJ: 98
OJE: 98
JEC: 155
ECT: 804


Sample trigrams from Moby Dick; Or, The Whale.txt:
﻿TH: 1
THE: 20139
HE : 15004
E P: 1460
 PR: 1263
PRO: 772
ROJ: 108
OJE: 108
JEC: 200
ECT: 713


Sample trigrams from Peter Pan.txt:
﻿TH: 1
THE: 4489
HE : 3873
E P: 295
 PR: 284
PRO: 214
ROJ: 88
OJE: 88
JEC: 102
ECT: 228
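
The introduction to this task describes a single model built from all five books, while the cells above keep one dictionary per file. One possible way to combine them, assuming the combined model is simply the sum of the per-book counts, is sketched below. `merge_models` and the toy data are hypothetical.

```python
from collections import Counter

def merge_models(models):
    """Sum per-book trigram counts into one Counter (hypothetical helper)."""
    combined = Counter()
    for counts in models.values():
        # Counter.update adds counts rather than replacing them.
        combined.update(counts)
    return combined

# Toy stand-in for a dictionary of per-book models.
toy_models = {
    "a.txt": {"THE": 2, "HE ": 1},
    "b.txt": {"THE": 3, "E T": 1},
}
combined = merge_models(toy_models)
print(combined["THE"])  # 5
```

Summing raw counts lets longer books contribute more to the combined model; normalising each book first would be a different design choice.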




# Task 2: Third-order Letter Approximation Generation

The trigram model created in Task 1 will be used to generate a sequence of 10,000 characters.
The text generation process works as follows:

1. Start with an initial two-character seed (in this case "TH").
2. Look at the last two characters in the current sequence to determine the possible next characters.
3. Find all trigrams in the model that start with these two characters and select the next character based on their frequencies (weighted randomness).
4. Repeat this process until 10,000 characters are reached.

This method will generate a sequence of characters that mimics the style and structure of the source text.
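
The steps above can be sketched on a toy model as follows. This is an illustration under assumptions, not the notebook's own function: `next_char`, `generate`, and the toy counts are hypothetical, and the weighted draw uses `random.choices` from the standard library.

```python
import random

def next_char(model, prefix, rng):
    """Pick the next character after a two-character prefix, weighted by count."""
    # Collect the third character of every trigram that starts with the prefix.
    candidates = {tri[2]: n for tri, n in model.items() if tri[:2] == prefix}
    if not candidates:
        return None  # the prefix never occurs in the model
    chars = list(candidates)
    weights = list(candidates.values())
    return rng.choices(chars, weights=weights, k=1)[0]

def generate(model, seed="TH", length=50, rng=None):
    """Extend the seed one character at a time until the target length is reached."""
    rng = rng or random.Random()
    out = seed
    while len(out) < length:
        ch = next_char(model, out[-2:], rng)
        if ch is None:
            break  # dead end: stop early rather than guess
        out += ch
    return out

# Toy model in which every prefix has exactly one continuation.
toy = {"THE": 3, "HE ": 2, "E T": 2, " TH": 2}
print(generate(toy, "TH", 11, random.Random(0)))  # THE THE THE
```

Scanning the whole model for each character is simple but slow for 10,000 characters; grouping the counts by two-character prefix once, before generating, would avoid the repeated scan.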

## Define the `generate_text` function

The function generates a sequence of characters based on the trigram model.

It takes the model, a two-character starting sequence, and the desired length of the generated text. It generates characters by looking up the trigrams that begin with the last two characters and choosing the next character in proportion to their frequency.
