# Project Gutenberg 

Project Gutenberg was launched by Michael Hart as a visionary project to create free electronic versions of literary works and disseminate them worldwide. He wanted everyone to have a digital library at no cost. Project Gutenberg got its first boost with the invention of the web in 1990 and its second with the creation of Distributed Proofreaders in 2000. Distributed Proofreaders is a web-based project that supports the development of e-texts by letting many people work together to proofread drafts for errors. In 2010, Project Gutenberg offered more than 33,000 eBooks, with tens of thousands downloaded every day ([Source: Project Gutenberg](https://www.gutenbergnews.org/)).

## Bigrams & Trigrams

Bigrams and trigrams are sequences of two or three consecutive items (typically words or characters) in a text. A word bigram is any two adjacent words, and a word trigram is any three adjacent words. For example, take the sentence "The cat sat on the mat."

Bigrams:
"The cat"
"cat sat"
"sat on"
"on the"
"the mat"

Trigrams:
"The cat sat"
"cat sat on"
"sat on the"
"on the mat"
 			
These sequences don't necessarily carry a distinct meaning beyond the meanings of the individual words. However, some bigrams or trigrams do form meaningful phrases, such as:

Bigrams:		
"New York"		
"ice cream"		
"United States"

Trigrams:
"New York City"
"The United States"
"San Francisco Bay"

Here, some of these are fixed phrases (e.g., "New York City") that have a specific meaning when used together, while others are just sequences of words. The examples of bigrams and trigrams were generated with the help of an AI language model (OpenAI's ChatGPT).
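The sliding-window idea behind word bigrams and trigrams can be sketched in a few lines of Python. This is only a minimal illustration, assuming simple whitespace tokenisation with surrounding punctuation stripped; the helper name `word_ngrams` is hypothetical and is not used elsewhere in this notebook.

```python
def word_ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    # Assumption: split on whitespace and drop surrounding punctuation.
    words = [w.strip('.,!?') for w in sentence.split()]
    # Slide a window of n words across the list, one word at a time.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cat sat on the mat."
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of six words yields five bigrams and four trigrams, because each window starts one word later than the previous one.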



## Research of Task 1

Task 1 asks me to build a trigram model using five texts from Project Gutenberg.
1. Pick five books from Project Gutenberg.
2. Clean up the text from each book:
    - Remove unnecessary parts such as introductions and footnotes.
    - Keep only letters (A-Z), periods (.), and spaces.
    - Make all letters uppercase.
3. Create trigrams from the text:
    - A trigram is any sequence of three characters, such as "THE" or "HE ".
    - Go through each text, find every set of three characters, and count how many times each one appears.
4. Store the results so they show each trigram and how often it appears.

The end result will be a list of trigrams and how often each one appears in the texts.
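Step 2 mentions removing unnecessary parts of each book. One possible way to drop the Project Gutenberg header and licence footer is sketched below. It assumes the files contain marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***`; that wording should be checked against the actual files, and the function name `strip_gutenberg_boilerplate` is hypothetical.

```python
import re

# Assumption: each file has START/END marker lines worded like this.
START_MARKER = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*")
END_MARKER = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*")

def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]   # drop everything up to the START line
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]   # drop the END line and what follows
    return text.strip()

sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Once upon a time.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Applying a step like this before cleaning would change the trigram counts, so it would not reproduce the outputs recorded in this notebook.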




In [43]:
import re
from collections import defaultdict

- `re`: used for regular expressions to clean the text.
- `defaultdict` from `collections`: helps count trigrams by supplying a default value for missing keys.

In [44]:
def read_text(file_path):
    """Reads the content of a text file and returns it as a string."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Example usage 
file_path = 'texts/TheGoldenRule.txt'
text = read_text(file_path)
print(f"Debug: Successfully read the file '{file_path}'.")
print(text[:500])  # Print the first 500 characters to verify


Debug: Successfully read the file 'texts/TheGoldenRule.txt'.
﻿The Project Gutenberg eBook of The golden rule
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Tit


- The `read_text` function opens a text file, reads its contents, and returns them as a string.
- The `with` statement ensures that the file is properly closed after reading.
- The example usage prints the first 500 characters of the text to check that it works.

In [45]:
def clean_text(text):
    """
    Cleans the input text by:
    - Removing non-ASCII characters
    - Keeping only letters, full stops, and spaces
    - Converting all letters to uppercase
    """
    # Remove non-ASCII characters and keep only letters, spaces, and periods.
    cleaned_text = re.sub(r'[^a-zA-Z. ]', '', text)
    # Convert all letters to uppercase.
    cleaned_text = cleaned_text.upper()
    return cleaned_text

# Example usage
cleaned_text = clean_text(text)
print(cleaned_text[:500])  # Print the first 500 characters to test 

THE PROJECT GUTENBERG EBOOK OF THE GOLDEN RULE    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES ANDMOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONSWHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMSOF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINEAT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATESYOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATEDBEFORE USING THIS EBOOK.TITLE THE GOLDEN 


- `re.sub(r'[^a-zA-Z. ]', '', text)`: removes all characters except letters, periods, and spaces. Line breaks are removed too, which is why words from adjacent lines run together in the output (for example "ANDMOST").
- `cleaned_text.upper()`: converts the entire text to uppercase so that case is consistent.
- This step prepares the text by removing unwanted characters and normalizing it.
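Because line breaks are simply deleted, words on either side of a line break are joined. A possible variant, shown here only as a sketch, replaces every run of whitespace with a single space before filtering. The name `clean_text_keep_word_breaks` is hypothetical; it is not the function used to produce the results in this notebook.

```python
import re

def clean_text_keep_word_breaks(text):
    """Variant of clean_text that turns line breaks into spaces first."""
    # Replace every run of whitespace (newlines, tabs, spaces) with one space.
    text = re.sub(r'\s+', ' ', text)
    # Then keep only letters, periods, and spaces, as in clean_text.
    text = re.sub(r'[^a-zA-Z. ]', '', text)
    return text.upper()

sample = "United States and\nmost other parts"
cleaned_sample = clean_text_keep_word_breaks(sample)
print(cleaned_sample)
```

This variant also collapses runs of spaces, so a trigram made of three spaces would no longer appear and the counts would differ from the ones shown here.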

In [46]:
def generate_trigrams(text):
    """Generates a dictionary of trigrams and their counts."""
    trigrams = defaultdict(int)  
    
    # Loop through the text to extract trigrams.
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        trigrams[trigram] += 1
    
    return trigrams

# Example usage
trigram_counts = generate_trigrams(cleaned_text)
print(list(trigram_counts.items())[:10])  # Print the first 10 trigrams to verify


[('THE', 707), ('HE ', 707), ('E P', 94), (' PR', 166), ('PRO', 140), ('ROJ', 88), ('OJE', 88), ('JEC', 89), ('ECT', 158), ('CT ', 96)]


- defaultdict(int): Initializes the dictionary with a default value of 0 for each trigram.
- for i in range(len(text) - 2): Iterates through the text so we can extract three characters at a time.
- text[i:i + 3]: Extracts a sequence of three characters.
- trigrams[trigram] += 1: Increments the count for each trigram.
- This step allows us to count the occurrence of each trigram.
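The same sliding-window count can also be written with `collections.Counter` from the standard library. This is an alternative sketch rather than the code used above; the function name `generate_trigrams_counter` is hypothetical.

```python
from collections import Counter

def generate_trigrams_counter(text):
    """Count every overlapping three-character window in text."""
    # range(len(text) - 2) is empty for texts shorter than three characters.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = generate_trigrams_counter("THE THE")
print(counts)
```

In the seven-character string "THE THE" there are five windows, and "THE" is the only trigram that occurs twice.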


In [47]:
def display_trigrams(trigrams, top_n=10):
    """
    Displays the top_n most frequent trigrams (10 by default), sorted by count.
    """
    # Sort trigrams by count in descending order and keep the first top_n.
    sorted_trigrams = sorted(trigrams.items(), key=lambda x: x[1], reverse=True)
    for trigram, count in sorted_trigrams[:top_n]:
        print(f"Trigram: '{trigram}' | Count: {count}")

# Example usage
display_trigrams(trigram_counts, top_n=10)


Trigram: '   ' | Count: 777
Trigram: ' TH' | Count: 754
Trigram: 'THE' | Count: 707
Trigram: 'HE ' | Count: 707
Trigram: ' AN' | Count: 379
Trigram: 'ED ' | Count: 360
Trigram: 'ND ' | Count: 360
Trigram: ' TO' | Count: 360
Trigram: 'AND' | Count: 335
Trigram: 'TO ' | Count: 324


- sorted(trigrams.items(), key=lambda x: x[1], reverse=True): Sorts the trigrams by their frequency in descending order.
- for trigram, count in sorted_trigrams[:top_n]: Loops through the sorted list and prints the most common trigrams.
- This helps us understand the most frequently occurring trigrams in the text.
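When only the top few entries are needed, `heapq.nlargest` from the standard library can select them without sorting the whole dictionary. The sketch below is an alternative to the sorting approach; `top_trigrams` is a hypothetical helper, and the sample counts are copied from the output above.

```python
import heapq

def top_trigrams(trigrams, top_n=10):
    """Return the top_n (trigram, count) pairs, highest count first."""
    return heapq.nlargest(top_n, trigrams.items(), key=lambda item: item[1])

# Counts taken from the output shown above for TheGoldenRule.txt.
sample_counts = {'THE': 707, ' TH': 754, 'AND': 335, 'HE ': 707}
for trigram, count in top_trigrams(sample_counts, top_n=2):
    print(f"Trigram: '{trigram}' | Count: {count}")
```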

In [48]:
# Paths to the text files from Project Gutenberg.
file_paths = ['texts/AldythsInheritance.txt', 'texts/LordListerNo.0312.txt', 'texts/NoPlaceLikeHome.txt', 
'texts/TheGoldenRule.txt', 'texts/ThewonderfulChristmasInPumpkinDelightLane.txt']

# Combined trigram counts across all texts.
combined_trigram_counts = defaultdict(int)

# Process each file.
for path in file_paths:
    text = read_text(path)
    cleaned_text = clean_text(text)
    trigram_counts = generate_trigrams(cleaned_text)
    
    # Merge the trigram counts into the combined dictionary.
    for trigram, count in trigram_counts.items():
        combined_trigram_counts[trigram] += count

# Display the top 10 trigrams across all texts.
#display_trigrams(combined_trigram_counts, top_n=10)

# Save the trigram model to a text file
def save_trigram_model(trigrams, output_file):
    """Writes each trigram and its count to output_file, one per line."""
    with open(output_file, 'w') as file:
        for trigram, count in trigrams.items():
            file.write(f"{trigram}: {count}\n")

# Save to file after processing all texts
output_file = 'trigram.txt'
save_trigram_model(combined_trigram_counts, output_file)
print(f"Trigram model saved to {output_file}.")


Trigram model saved to trigram.txt.


- The loop reads each file, cleans it, generates trigram counts, and merges them into a combined trigram dictionary.
- `save_trigram_model` then writes every trigram and its combined count to a text file called 'trigram.txt'.
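Reading 'trigram.txt' back needs some care, because many trigrams start or end with spaces. The sketch below assumes the `TRIGRAM: COUNT` line format written by `save_trigram_model`; the names `parse_trigram_line` and `load_trigram_model` are hypothetical.

```python
def parse_trigram_line(line):
    """Parse one 'TRIGRAM: COUNT' line without disturbing spaces in the trigram."""
    # Remove only the newline; a plain strip() would eat leading spaces.
    line = line.rstrip('\n')
    # Split on the last ': ' so the trigram keeps its own spaces and periods.
    trigram, count = line.rsplit(': ', 1)
    return trigram, int(count)

def load_trigram_model(input_file):
    """Read a file written by save_trigram_model back into a dict."""
    model = {}
    with open(input_file, 'r') as file:
        for line in file:
            if line.strip():  # skip completely blank lines
                trigram, count = parse_trigram_line(line)
                model[trigram] = count
    return model

print(parse_trigram_line(" TH: 754\n"))
print(parse_trigram_line("   : 777\n"))
```

Splitting from the right works here because the cleaned text never contains a colon, so the only `': '` on a line is the separator.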

## Research of Task 2

Task 2 asks me to use the trigram model from Task 1 to generate a 10,000-character string, starting with "TH" and predicting each next character from the previous two, using the trigram counts as weights for the selection.

1. Use the trigram model from Task 1.

2. Start with "TH":
    - Start generating the string using the initial characters "TH".

3. Find trigrams that begin with the last two characters:
    - Use the last two characters to look up matching trigrams in the model.

4. Randomly select the next character:
    - Pick the third character based on the trigram counts (higher counts = higher chance of selection).

5. Update the string:
    - Add the selected character to the string and continue the process using the updated last two characters.

6. Save the generated text to a file.

In [49]:
import random

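def weighted_random_choice(trigrams, prefix):
    """Picks the next character after prefix, weighted by trigram counts."""
    # Collect every trigram whose first two characters match the prefix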
    candidates = {k: v for k, v in trigrams.items() if k.startswith(prefix)}
    
    if not candidates:
        # If no matching trigram is found, return a space
        return ' '
    
    # Get the possible next characters and their counts
    next_chars = [trigram[2] for trigram in candidates.keys()]
    weights = list(candidates.values())
    
    # Randomly select the next character based on frequency
    return random.choices(next_chars, weights=weights)[0]



- The `weighted_random_choice` function takes the trigram model and a two-character prefix, and chooses the third character using the trigram counts as weights.
- If no matching trigram is found, it returns a space (' ').
- It uses the `random.choices()` function to perform the weighted random selection.
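The function above scans the whole model for every generated character. One possible optimisation, sketched here under the assumption that the model is a plain mapping from trigram to count, is to group the counts by prefix once and reuse that index. The names `build_prefix_index` and `choose_next_char` are hypothetical.

```python
import random
from collections import defaultdict

def build_prefix_index(trigrams):
    """Group trigram counts by their two-character prefix."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigrams.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])   # possible next character
        weights.append(count)      # how often it followed this prefix
    return dict(index)

def choose_next_char(index, prefix, rng=random):
    """Pick the next character for prefix, weighted by trigram count."""
    if prefix not in index:
        return ' '  # same fallback as weighted_random_choice
    chars, weights = index[prefix]
    return rng.choices(chars, weights=weights)[0]

toy_model = {'THE': 3, 'THA': 1, 'HE ': 4}
index = build_prefix_index(toy_model)
print(index['TH'])
print(choose_next_char(index, 'TH'))
```

With the toy model, "TH" is followed by "E" three times out of four on average, because the weights are 3 and 1.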


In [50]:
def generate_text(trigrams, length=10000, start='TH'):
    """Generates a string of the given length, one character at a time."""
    text = start
    
    # Generate characters until the text reaches the requested length
    while len(text) < length:
        # Get the last two characters
        prefix = text[-2:]
        # Get the next character using the weighted random choice
        next_char = weighted_random_choice(trigrams, prefix)
        # Add the next character to the text
        text += next_char
    
    return text



- The function `generate_text` generates a string of the specified length.
- It starts with the initial string "TH" and generates each subsequent character by using the previous two characters as a prefix. For each prefix, it calls the `weighted_random_choice` function to select the next character.
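Because the selection is random, each run produces a different string. For repeatable experiments, one option is a seeded `random.Random` instance, as in the sketch below. The function name `generate_text_seeded` and the `seed` parameter are assumptions added for illustration, and the toy model is made up so that the result is fully determined.

```python
import random

def generate_text_seeded(trigrams, length, start='TH', seed=0):
    """Like generate_text, but reproducible and built from a list of characters."""
    rng = random.Random(seed)  # private generator, global random state untouched
    chars = list(start)
    while len(chars) < length:
        prefix = ''.join(chars[-2:])
        # Pair each candidate next character with its trigram count.
        candidates = [(t[2], c) for t, c in trigrams.items() if t.startswith(prefix)]
        if candidates:
            options, weights = zip(*candidates)
            chars.append(rng.choices(options, weights=weights)[0])
        else:
            chars.append(' ')  # same fallback as weighted_random_choice
    return ''.join(chars)

# Toy model in which every prefix has exactly one possible next character.
toy_model = {'THE': 3, 'HE ': 4, 'E T': 2, ' TH': 5}
first = generate_text_seeded(toy_model, length=12, seed=42)
second = generate_text_seeded(toy_model, length=12, seed=42)
print(first)
```

Collecting characters in a list and joining once at the end avoids building a new string on every iteration.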


In [51]:
# Example usage: generate a string of 10,000 characters
generated_text = generate_text(combined_trigram_counts)

# Check the length of the generated text
print(f"Length of generated text: {len(generated_text)}")

# Display the first 500 characters to verify
print(f"Generated text (first 500 characters):\n{generated_text[:500]}")


Length of generated text: 10000
Generated text (first 500 characters):
THE ALDYTH ANKERE HER GRER GIVE AT VIDE LIEW.LARTO AS MAT SEPERESBANS SHOWIN FALLE        DRIBUTILY TOW HE SH SUR A DAY AN BEATTED MIJ ANER NOBT ALL HANTIER DROMET FRIVACEN PAYHATCHILDYTH AND MORN OT THISAND ISS ITY INK OF TO RESSING HAAR VRES GLAD HE DOW GRING UNEVEXPECREEND KNOT SHE ROKS REGRIGH DE HERGUTED. IMERY TO MRS. I KNE. GIRIERWOUS. ITER WAND WOU A SHED THAT CHATHEL COME SURE UND SAID MOR NOWN THET WAS POIND EN.ROJECT A GLAND WOEN HILITEELP PING THEBOORGISTED ROBSOLE THAD ALOCCUR ORG S


 - The generated string does not make sense; it is made up mostly of random, word-like sequences, for example "THIM CLOPPINE SIOUND SCE BERREN EN SHERVE SE.", a sample taken from my results.

In [52]:
def save_generated_text(text, output_file):
    """
    Saves the generated text to a file.
    """
    with open(output_file, 'w') as file:
        file.write(text)


- Defines a function that saves the string to a text file; the file used here is generatedText.txt.

In [53]:
text = """Place your text here"""  # Paste the full text here
print("Number of characters:", len(text))

Number of characters: 20


- Test to check that the characters are counted correctly.

## Research of Task 3

Task 3 asks me to check the quality of my generated text by analyzing how many of its words are actual English words.

1. Get the English word list:
    - Get words.txt from the GitHub repository provided for the task and place it in my repository.

2. Split Your Generated Text into Words:
    - Use spaces in your generated text to separate it into "words."
    - Example, "THE CAT IS ON THE MAT" would be split into ["THE", "CAT", "IS", "ON", "THE", "MAT"].

3. Check Each Word Against words.txt:
    - Load the list of valid English words from words.txt.
    - For each "word" in your generated text, check if it exists in the list.

4. Calculate the Percentage of Real Words:
    - Count how many words in the generated text are actual English words.
    - Divide this count by the total number of words in the generated text and multiply by 100 to get the percentage of real English words.

In [54]:
def load_word_list(file_path):
    """Loads one word per line from file_path into a set."""
    with open(file_path, 'r') as file:
        words = set(file.read().splitlines())
    return words

# Load the English words from words.txt
english_words = load_word_list('words.txt')
print(f"Loaded {len(english_words)} English words.")


Loaded 45373 English words.


- Load the English words from the words.txt file into a set.
- This set will be used to check whether each word is an actual English word.

In [55]:
def split_into_words(text):
    """Splits text on whitespace and returns the list of words."""
    return text.split()

# Split the generated text into words
words_in_generated_text = split_into_words(generated_text)
print(f"Total words in generated text: {len(words_in_generated_text)}")


Total words in generated text: 1757


- I split the generated text (the `generated_text` variable) into words by using whitespace.
- Each sequence of characters separated by spaces will be treated as a "word" and checked against the English word list.
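Splitting on spaces leaves periods attached to words, so a token such as "MRS." cannot match an entry like "MRS" in the word list. One possible alternative, shown only as a sketch, is to extract runs of letters with a regular expression. The helper name `split_into_letter_words` is hypothetical, and the sample is taken from the generated text above.

```python
import re

def split_into_letter_words(text):
    """Return runs of uppercase letters, treating periods and spaces as separators."""
    return re.findall(r'[A-Z]+', text)

sample = "IMERY TO MRS. I KNE. GIRIERWOUS."
by_spaces = sample.split()
by_letters = split_into_letter_words(sample)
print(by_spaces)
print(by_letters)
```

Using this tokenisation would change both the total word count and the number of matches, so the percentage would differ from the one reported in this notebook.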



In [56]:
def count_real_words(words, english_words_set):
    """Counts how many of the given words appear in english_words_set."""
    real_word_count = sum(1 for word in words if word in english_words_set)
    return real_word_count

# Count how many words in the generated text are real English words
real_word_count = count_real_words(words_in_generated_text, english_words)
print(f"Real English words in generated text: {real_word_count}")


Real English words in generated text: 622


- Each word is checked against the set loaded from words.txt to see if it is a real English word, giving a count of how many real words there are.

In [57]:
def calculate_percentage(real_word_count, total_word_count):
    """Returns real_word_count as a percentage of total_word_count."""
    if total_word_count == 0:
        return 0
    return (real_word_count / total_word_count) * 100

# Calculate the percentage of real English words
total_word_count = len(words_in_generated_text)
real_word_percentage = calculate_percentage(real_word_count, total_word_count)
print(f"Percentage of real English words: {real_word_percentage:.2f}%")


Percentage of real English words: 35.40%


- Calculate the percentage of real English words: 622 / 1757 × 100 ≈ 35.40%.

## Research of Task 4

- Task 4 asks me to export my model as a JSON file using Python's built-in json library.

In [58]:
import json

def export_model_to_json(trigram_model, output_file):
    """Writes the trigram model to output_file as indented JSON."""
    with open(output_file, 'w') as file:
        json.dump(trigram_model, file, indent=4)  # Use indent=4 for readable formatting

# Export the trigram model to a JSON file
export_model_to_json(combined_trigram_counts, 'trigrams.json')
print("Trigram model saved to 'trigrams.json'.")



Trigram model saved to 'trigrams.json'.


- Import the `json` library.
- `export_model_to_json` converts the trigram model to JSON format and saves it to a file.
- I then call the function to export `combined_trigram_counts` as JSON.
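A quick way to check the export is a round trip: encode the model, decode it again, and compare. The sketch below uses an in-memory string and a tiny made-up model instead of the real 'trigrams.json' file.

```python
import json
from collections import defaultdict

# Tiny made-up model standing in for combined_trigram_counts.
model = defaultdict(int)
model['THE'] += 2
model['HE '] += 1

# json.dumps accepts a defaultdict because it is a dict subclass.
encoded = json.dumps(model, indent=4)

# Decoding gives back a plain dict.
decoded = json.loads(encoded)

# Wrap it in defaultdict(int) to restore the default-0 behaviour for missing keys.
restored = defaultdict(int, decoded)
print(restored['THE'], restored['ZZZ'])
```

The same idea applies to the file on disk by using `json.load` on the opened file instead of `json.loads` on a string.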