Step 1: Import Required Libraries.

In this step, we import collections for creating a defaultdict that will store trigram counts.

In [36]:
# Importing the collections library to use defaultdict for counting trigrams
import collections
import random

Step 2: Define the List of Book Files.

We list the file paths of the books we want to analyze. This will allow us to loop through them easily.

In [37]:
# List of book file paths
book_files = [
    'books/moby_dick.txt',
    'books/frankenstein.txt',
    'books/pride_and_prejudice.txt',
    'books/romeo_and_juliet.txt',
    'books/the_scarlet_letter.txt'
]

Step 3: Set Up the Trigram Counter.

We create a defaultdict that will store trigram counts across all books. Each trigram will be a key, and its count will be the value. Using defaultdict(int) allows each new trigram to start with a count of zero automatically.

In [38]:
# Define a defaultdict to store trigram counts across all books
trigram_counts = collections.defaultdict(int)

Step 4: Define Characters to Keep.

We only want lowercase letters, spaces, and periods in the text to ensure consistency. This step specifies which characters we’ll keep when cleaning the text.

In [39]:
# Define characters to keep (lowercase letters, spaces, and periods)
keep = 'abcdefghijklmnopqrstuvwxyz .'

Step 5: Process Each Book File.

In this step, we loop through each book file, read the content, convert it to lowercase, and clean the text by removing unwanted characters. This cell prepares the text for trigram generation.

In [40]:
# Loop through each book file to read, clean, and process the text
for book_file in book_files:
    with open(book_file, 'r') as file:
        # Read the entire file into a string and convert to lowercase
        text = file.read().lower()

        # Remove unwanted characters
        cleaned = ''.join(c for c in text if c in keep)

Step 6: Generate and Count Trigrams.

After cleaning the text, we extract every sequence of three characters (trigrams) and count each occurrence. This cell updates the trigram_counts dictionary for each trigram found in the current book.

In [41]:
# Generate and count trigrams for this book
for i in range(len(cleaned) - 2):
            # Extract trigram of three characters
            trigram = cleaned[i:i+3]
            # Increment count for this trigram in the defaultdict
            trigram_counts[trigram] += 1

Step 7: Sort Trigrams by Frequency in Descending Order.

After processing all books, we sort the trigrams by their count in descending order. Sorting helps to easily identify the most frequent trigrams.

In [42]:
# Sort trigrams by count in descending order
sorted_trigrams = sorted(trigram_counts.items(), key=lambda x: x[1], reverse=True)


Step 8: Display the Sorted Trigram Counts.

Finally, we print the sorted trigram counts in descending order, with the most frequent trigrams displayed first.

In [43]:
# Print the sorted trigram counts
for trigram, count in sorted_trigrams:
    print(f"'{trigram}': {count}")


' th': 8659
'the': 7712
'he ': 5758
'   ': 4105
'er ': 3446
' an': 3270
' of': 3267
'and': 3197
'nd ': 3151
'ed ': 3151
'of ': 2988
'her': 2688
' he': 2482
' in': 2312
' to': 2287
'to ': 2078
'ing': 2062
'ng ': 1812
' a ': 1795
'as ': 1795
' ha': 1723
'at ': 1722
'ter': 1700
'in ': 1636
'e t': 1632
' be': 1577
're ': 1568
'is ': 1529
'ere': 1510
'd t': 1503
'e a': 1500
' wh': 1490
' wi': 1457
' hi': 1453
'th ': 1434
'n t': 1421
'on ': 1420
'his': 1417
'hat': 1398
'tha': 1366
' it': 1341
'ith': 1338
'e s': 1329
'ly ': 1272
'en ': 1238
'wit': 1229
't t': 1197
'e w': 1165
'ld ': 1161
'e o': 1157
'or ': 1155
'for': 1117
'it ': 1107
'd a': 1086
'est': 1081
' no': 1071
' wa': 1070
'an ': 1063
'le ': 1059
'f t': 1055
'ear': 1041
's a': 1016
' as': 1006
' re': 1000
'ion': 990
'ste': 989
'ent': 987
' co': 973
'st ': 970
's t': 967
'e h': 963
'ch ': 960
'ver': 960
' sh': 959
'was': 954
' so': 951
' ma': 943
'll ': 913
'es ': 909
'ce ': 909
's o': 900
'e m': 881
'ad ': 871
'd h': 864
' fo': 863
'

Step 9

Parameters:

trigram_counts: Dictionary containing the trigram frequency model.
start: Starting string for generating the text (default is 'th').
length: The desired length of the generated text (10,000 characters in this case).

Prefix Logic:

Uses the last two characters of the generated text as a prefix to find matching trigrams.

Candidate Selection:

Filters trigram_counts to find trigrams starting with the current prefix.
Extracts possible next characters (next_chars) and their counts (weights).

Weighted Random Selection:

Chooses the next character based on the weighted probabilities of weights.

Stopping Condition:

If no candidates are found for the prefix, the generation stops.

In [44]:
# Function to generate text using the trigram model
def generate_text(trigram_counts, start='th', length=10000):
    """
    Generates a string of the specified length based on the trigram model.
    
    Parameters:
        trigram_counts (dict): Dictionary of trigram frequencies.
        start (str): Initial two-character string to start the generation.
        length (int): Desired length of the generated text.
        
    Returns:
        str: Generated text of the specified length.
    """
    # Start the text with the initial seed
    generated_text = start
    
    # Generate characters until reaching the desired length
    while len(generated_text) < length:
        # Get the last two characters as the prefix
        prefix = generated_text[-2:]
        
        # Find all trigrams that start with the current prefix
        candidates = {trigram[2]: count for trigram, count in trigram_counts.items() if trigram.startswith(prefix)}
        
        # Debug print: Check if candidates are found
        print(f"Current prefix: '{prefix}', Candidates: {candidates}")
        
        # If no candidates are found, stop generating
        if not candidates:
            print(f"No candidates found for prefix '{prefix}'. Stopping generation.")
            break
        
        # Extract possible next characters and their corresponding weights
        next_chars = list(candidates.keys())
        weights = list(candidates.values())
        
        # Randomly choose the next character based on weights
        next_char = random.choices(next_chars, weights=weights)[0]
        
        # Append the chosen character to the generated text
        generated_text += next_char
    
    return generated_text



Step 10

Text Generation:
Calls the generate_text function to generate a string of 10,000 characters starting with 'th'.
Preview:
Prints the first 100 characters of the generated text to quickly verify the result.

In [45]:
# Generate a string of 10,000 characters using the trigram model
generated_text = generate_text(trigram_counts, start='th', length=10000)

# Print the first 100 characters to verify the result
print("\nGenerated text preview:")
print(generated_text[:100])  # Only printing the first 100 characters to check


Current prefix: 'th', Candidates: {'e': 7712, 'a': 1366, 'o': 613, ' ': 1434, '.': 51, 's': 51, 'i': 808, 'r': 283, 'l': 39, 'f': 21, 'y': 153, 'u': 94, 't': 26, 'p': 8, 'd': 19, 'h': 18, 'w': 32, 'c': 7, 'b': 8, 'n': 3, 'g': 2, 'm': 5, 'j': 1}
Current prefix: 'he', Candidates: {' ': 5758, 'd': 204, 'r': 2688, 'a': 436, 'c': 78, 's': 686, 'y': 240, 'n': 357, 'i': 281, 't': 105, 'm': 287, 'h': 35, 'e': 174, 'l': 126, 'g': 22, 'b': 36, 'w': 56, 'o': 33, 'p': 61, 'u': 7, 'k': 4, 'f': 32, 'v': 18, '.': 22, 'j': 3, 'q': 1}
Current prefix: 'em', Candidates: {'e': 275, 'b': 111, 'a': 132, ' ': 165, 'i': 79, 's': 54, 'p': 87, 'o': 109, 'm': 4, '.': 39, 'y': 21, 'u': 20, 'n': 32, 'f': 2, 'l': 4, 't': 4, 'w': 1, 'r': 1, 'h': 1}
Current prefix: 'me', Candidates: {'s': 234, 'n': 492, 'd': 292, ' ': 821, 'a': 90, 'e': 34, 'r': 155, 't': 132, 'l': 29, 'b': 5, 'i': 10, 'm': 58, 'f': 3, 'h': 9, 'w': 37, '.': 50, 'u': 2, 'o': 8, 'x': 1, 'y': 2, 'c': 4, 'g': 1, 'v': 1, 'p': 2, 'q': 1}
Current prefix: 'e

Current prefix: 're', Candidates: {'s': 634, 'f': 160, 'a': 581, 'c': 173, ' ': 1568, 'p': 144, 'm': 190, 'i': 30, 'v': 184, 'd': 551, 't': 194, 'e': 205, 'l': 162, 'h': 40, 'r': 59, 'w': 126, 'n': 221, 'q': 42, '.': 66, 'g': 76, 'b': 33, 'o': 31, 'y': 4, 'j': 22, 'x': 3, 'u': 3}
Current prefix: 'ei', Candidates: {'g': 41, 't': 80, 'n': 126, 'v': 44, 'z': 6, 'l': 9, 'r': 257, 'd': 1, 's': 11, 'p': 2, ' ': 12, 'f': 3, 'm': 8, 'a': 2}
Current prefix: 'ig', Candidates: {'h': 574, 'i': 64, 'n': 84, ' ': 12, 'u': 50, 'e': 26, 'r': 5, 's': 6, 'o': 11, 'a': 15, 'g': 1, 'w': 3, 'd': 1, 'b': 1, '.': 1, 'm': 7, 'z': 1}
Current prefix: 'gh', Candidates: {'t': 746, ' ': 250, 'o': 42, 'g': 1, 'l': 16, 'i': 15, 'b': 8, 'e': 37, 'a': 37, 's': 4, '.': 2, 'f': 2, 'h': 4, 'w': 1, 'y': 2, 'd': 2}
Current prefix: 'ht', Candidates: {' ': 494, 's': 44, 'p': 2, 'y': 27, 'i': 12, 'b': 9, 'h': 45, 'e': 46, 'a': 10, 'l': 16, 'o': 5, '.': 30, 'f': 17, 't': 10, 'w': 5, 'm': 4, 'g': 3, 'r': 1, 'c': 2, 'k': 1, 'n':

Step 11

Load words.txt and Check Word Validity

In [46]:
# Load the list of valid English words from words.txt
def load_english_words(file_path):
    """
    Load a list of English words from a file and return as a set.
    
    Parameters:
        file_path (str): Path to the words.txt file.
        
    Returns:
        set: A set of valid English words.
    """
    with open(file_path, 'r') as file:
        english_words = {line.strip().lower() for line in file}
    return english_words

Step 12

Analyze the Percentage of Valid Words

In [47]:
def analyze_generated_text(generated_text, english_words):
    """
    Analyze the percentage of valid English words in the generated text.
    
    Parameters:
        generated_text (str): The 10,000-character string generated by the trigram model.
        english_words (set): A set of valid English words.
        
    Returns:
        float: The percentage of valid English words in the generated text.
    """
    # Split the generated text into words
    words = generated_text.split()
    
    # Count the number of valid English words
    valid_word_count = sum(1 for word in words if word.lower() in english_words)
    
    # Calculate the percentage of valid words
    total_words = len(words)
    if total_words == 0:  # Handle edge case where no words are present
        return 0.0
    return (valid_word_count / total_words) * 100

    # Load English words from words.txt
english_words = load_english_words('words.txt')

# Analyze the percentage of valid English words in the generated text
valid_word_percentage = analyze_generated_text(generated_text, english_words)

# Print the percentage of valid English words
print(f"\nPercentage of valid English words in the generated text: {valid_word_percentage:.2f}%")



Percentage of valid English words in the generated text: 33.09%
