Step 1: Import Required Libraries.

In this step, we import collections for creating a defaultdict that will store trigram counts.

In [28]:
# Importing the collections library to use defaultdict for counting trigrams
import collections

Step 2: Define the List of Book Files.

We list the file paths of the books we want to analyze. This will allow us to loop through them easily.

In [29]:
# List of book file paths
book_files = [
    'books/moby_dick.txt',
    'books/frankenstein.txt',
    'books/pride_and_prejudice.txt',
    'books/romeo_and_juliet.txt',
    'books/the_scarlet_letter.txt'
]

Step 3: Set Up the Trigram Counter.

We create a defaultdict that will store trigram counts across all books. Each trigram will be a key, and its count will be the value. Using defaultdict(int) allows each new trigram to start with a count of zero automatically.

In [30]:
# Define a defaultdict to store trigram counts across all books
trigram_counts = collections.defaultdict(int)

Step 4: Define Characters to Keep.

We only want lowercase letters, spaces, and periods in the text to ensure consistency. This step specifies which characters we’ll keep when cleaning the text.

In [31]:
# Define characters to keep (lowercase letters, spaces, and periods)
keep = 'abcdefghijklmnopqrstuvwxyz .'

Step 5: Process Each Book File.

In this step, we loop through each book file, read the content, convert it to lowercase, and clean the text by removing unwanted characters. This cell prepares the text for trigram generation.

In [32]:
# Loop through each book file to read, clean, and process the text
for book_file in book_files:
    with open(book_file, 'r') as file:
        # Read the entire file into a string and convert to lowercase
        text = file.read().lower()

        # Remove unwanted characters
        cleaned = ''.join(c for c in text if c in keep)

Step 6: Generate and Count Trigrams.

After cleaning the text, we extract every sequence of three characters (trigrams) and count each occurrence. This cell updates the trigram_counts dictionary for each trigram found in the current book.

In [33]:
# Generate and count trigrams for this book
for i in range(len(cleaned) - 2):
            # Extract trigram of three characters
            trigram = cleaned[i:i+3]
            # Increment count for this trigram in the defaultdict
            trigram_counts[trigram] += 1

Step 7: Sort Trigrams by Frequency in Descending Order.

After processing all books, we sort the trigrams by their count in descending order. Sorting helps to easily identify the most frequent trigrams.

In [34]:
# Sort trigrams by count in descending order
sorted_trigrams = sorted(trigram_counts.items(), key=lambda x: x[1], reverse=True)


Step 8: Display the Sorted Trigram Counts.

Finally, we print the sorted trigram counts in descending order, with the most frequent trigrams displayed first.

In [35]:
# Print the sorted trigram counts
for trigram, count in sorted_trigrams:
    print(f"'{trigram}': {count}")


' th': 8659
'the': 7712
'he ': 5758
'   ': 4105
'er ': 3446
' an': 3270
' of': 3267
'and': 3197
'nd ': 3151
'ed ': 3151
'of ': 2988
'her': 2688
' he': 2482
' in': 2312
' to': 2287
'to ': 2078
'ing': 2062
'ng ': 1812
' a ': 1795
'as ': 1795
' ha': 1723
'at ': 1722
'ter': 1700
'in ': 1636
'e t': 1632
' be': 1577
're ': 1568
'is ': 1529
'ere': 1510
'd t': 1503
'e a': 1500
' wh': 1490
' wi': 1457
' hi': 1453
'th ': 1434
'n t': 1421
'on ': 1420
'his': 1417
'hat': 1398
'tha': 1366
' it': 1341
'ith': 1338
'e s': 1329
'ly ': 1272
'en ': 1238
'wit': 1229
't t': 1197
'e w': 1165
'ld ': 1161
'e o': 1157
'or ': 1155
'for': 1117
'it ': 1107
'd a': 1086
'est': 1081
' no': 1071
' wa': 1070
'an ': 1063
'le ': 1059
'f t': 1055
'ear': 1041
's a': 1016
' as': 1006
' re': 1000
'ion': 990
'ste': 989
'ent': 987
' co': 973
'st ': 970
's t': 967
'e h': 963
'ch ': 960
'ver': 960
' sh': 959
'was': 954
' so': 951
' ma': 943
'll ': 913
'es ': 909
'ce ': 909
's o': 900
'e m': 881
'ad ': 871
'd h': 864
' fo': 863
'