### Let's write an elementary tokenizer that uses words as tokens.

We will use Mark Twain's _Life On The Mississippi_ as a test bed. The text is in the accompanying file 'Life_On_The_Mississippi.txt'.

Here's a first attempt at such a tokenizer, which is not terribly good:

In [2]:
wdict = {}
with open('Life_On_The_Mississippi.txt', 'r') as L:
    line = L.readline()
    nlines = 1
    while line:

        words = line.split()
        for word in words:
            if wdict.get(word) is not None:
                wdict[word] += 1
            else:
                wdict[word] = 1
        line = L.readline()
        nlines += 1

# Print the first 100 (word, count) pairs in insertion order
maxitems = 100
for nitem, item in enumerate(wdict.items(), start=1):
    print(item)
    if nitem == maxitems:
        break


('\ufeffThe', 1)
('Project', 79)
('Gutenberg', 22)
('eBook', 4)
('of', 4469)
('Life', 5)
('on', 856)
('the', 8443)
('Mississippi', 104)
('This', 127)
('ebook', 2)
('is', 1076)
('for', 1017)
('use', 34)
('anyone', 4)
('anywhere', 8)
('in', 2381)
('United', 36)
('States', 26)
('and', 5692)
('most', 119)
('other', 223)
('parts', 5)
('world', 40)
('at', 676)
('no', 325)
('cost', 18)
('with', 1053)
('almost', 37)
('restrictions', 2)
('whatsoever.', 2)
('You', 92)
('may', 85)
('copy', 12)
('it,', 199)
('give', 67)
('it', 1382)
('away', 107)
('or', 561)
('re-use', 2)
('under', 112)
('terms', 22)
('License', 8)
('included', 2)
('this', 591)
('online', 4)
('www.gutenberg.org.', 4)
('If', 85)
('you', 813)
('are', 361)
('not', 680)
('located', 9)
('States,', 8)
('will', 287)
('have', 557)
('to', 3518)
('check', 4)
('laws', 13)
('country', 50)
('where', 152)
('before', 150)
('using', 10)
('eBook.', 2)
('Title:', 1)
('Author:', 1)
('Mark', 2)
('Twain', 2)
('Release', 1)
('date:', 1)
('July', 7)
('1

This is unsatisfactory for a few reasons:

* There are non-ASCII (Unicode) characters that should be stripped, such as the so-called "Byte-Order Mark" (BOM), \ufeff, at the beginning of the text;

* There are punctuation marks, which we don't want to concern ourselves with;

* The same word can appear in all capitals, in lower case, or with only its initial letter capitalized, whereas we want all of these forms normalized to lower case.
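To make the three problems concrete, here is a small sketch on a made-up sample line. The sample string is an assumption invented for illustration, not a line from the book. It shows how splitting on whitespace alone treats every variant of the same word as a different token.

```python
from collections import Counter

# Hypothetical sample line, invented for illustration; it starts with a BOM.
sample = '\ufeffThe river, the River; THE river.'

# Naive tokenization: split on whitespace only, as in the loop above.
naive_counts = Counter(sample.split())

# Every token is distinct because of the BOM, the punctuation, and the case.
for token, count in naive_counts.items():
    print(repr(token), count)
```

Two underlying words ("the" and "river") end up as six separate dictionary keys, each with a count of 1.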

Part 1 of this assignment: insert code in this loop to operate on the str variable 'line' so as to fix these problems before 'line' is split into words.

A hint at one possible way to do this: use the 'punctuation' character definition in the Python 'string' module, together with the 'maketrans' and 'translate' methods of Python's str class, to eliminate punctuation, and use the regular expression ('re') Python module to eliminate any non-ASCII characters. It is useful to know that the regular expression r'[^\x00-\x7f]' means "any character not in the vanilla ASCII set".
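The hint can be tried out on a single string before touching the file loop. The following is a minimal sketch of one way to combine the pieces; the names `clean_line` and `PUNCT_TABLE` and the sample string are hypothetical, introduced only for this illustration.

```python
import re
import string

# Translation table that deletes every character in string.punctuation.
# (PUNCT_TABLE is a hypothetical name chosen for this sketch.)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_line(line):
    """Lower-case a line, drop non-ASCII characters, and delete ASCII punctuation."""
    line = line.lower()
    line = re.sub(r'[^\x00-\x7f]', '', line)   # removes the BOM, curly quotes, etc.
    return line.translate(PUNCT_TABLE)

# Hypothetical sample line, invented for illustration.
sample = '\ufeffThe River, the river; "THE" re-use.'
print(clean_line(sample).split())
# ['the', 'river', 'the', 'river', 'the', 'reuse']
```

Note one side effect of this approach: because the hyphen is in `string.punctuation`, "re-use" becomes "reuse", and apostrophes inside contractions are deleted in the same way. Whether that is acceptable depends on what the token counts are used for.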

Part 2: Add code to sort the contents of wdict by word occurrence frequency.  What are the top 100 most frequent word tokens?  Adding up occurrence frequencies starting from the most frequent words, how many distinct words make up the top 90% of word occurrences in this "corpus"?

For this part, the docs of Python's 'sorted' and of the helper 'itemgetter' from 'operator' reward study.
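As a small illustration of how `sorted` and `itemgetter` fit together, here is a sketch on a toy dictionary. The counts are invented for illustration, and the helper `words_needed` is a hypothetical name, not part of the assignment.

```python
from operator import itemgetter

# Hypothetical toy counts, invented for illustration.
counts = {'the': 6, 'river': 3, 'boat': 1, 'pilot': 1}

# itemgetter(1) picks the count out of each (word, count) pair,
# so this sorts the pairs by count, largest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)

def words_needed(ranked_pairs, fraction):
    """Return how many top-ranked words cover `fraction` of all occurrences."""
    total = sum(count for _, count in ranked_pairs)
    running = 0
    for n, (_, count) in enumerate(ranked_pairs, start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(ranked_pairs)

print(ranked)                      # [('the', 6), ('river', 3), ('boat', 1), ('pilot', 1)]
print(words_needed(ranked, 0.9))   # 3
```

Here the total is 11 occurrences, so 90% corresponds to 9.9; the running sums are 6, 9, 10, and the threshold is first reached at the third word. Because `sorted` is stable, words with equal counts keep their original dictionary order.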

Write your modified code in the cell below.

MY HOMEWORK SUBMISSION - HALE

In [3]:
import re
import string
from operator import itemgetter

# Part 1: Improve the tokenizer

# Translation table that deletes all ASCII punctuation; built once, outside the loop
translator = str.maketrans('', '', string.punctuation)

wdict = {}
# The 'utf-8-sig' codec strips the leading BOM while decoding
with open('Life_On_The_Mississippi.txt', 'r', encoding='utf-8-sig') as L:
    nlines = 0
    for line in L:
        # Normalize case and remove non-ASCII characters
        line = line.lower()
        line = re.sub(r'[^\x00-\x7f]', '', line)

        # Remove punctuation
        line = line.translate(translator)

        words = line.split()
        for word in words:
            wdict[word] = wdict.get(word, 0) + 1

        nlines += 1

# Part 2: Sort and analyze word frequencies

# Sort wdict by occurrence frequency
sorted_wdict = sorted(wdict.items(), key=itemgetter(1), reverse=True)

# Top 100 most frequent word tokens
top_100_words = sorted_wdict[:100]

# Calculate how many distinct words make up the top 90% of word occurrences
total_occurrences = sum(wdict.values())
top_90_threshold = total_occurrences * 0.9
cumulative = 0
num_words_top_90 = 0
for word, count in sorted_wdict:
    cumulative += count
    num_words_top_90 += 1
    if cumulative >= top_90_threshold:
        break

print("Top 100 Words:")
for word, count in top_100_words:
    print(word, count)

print(f"\nNumber of distinct words making up the top 90% of occurrences: {num_words_top_90}")


Top 100 Words:
the 9255
and 5892
of 4532
a 4053
to 3592
in 2593
it 2293
i 2205
was 2093
that 1724
he 1402
is 1148
for 1095
with 1081
you 1033
his 961
had 961
but 952
on 947
as 881
this 781
they 758
at 750
not 722
all 720
by 713
one 686
there 627
were 625
be 617
my 582
or 581
from 577
have 571
out 541
so 536
up 529
him 523
we 519
me 516
when 505
would 478
which 476
river 457
an 440
them 425
no 422
then 405
said 399
are 387
if 381
their 378
now 369
about 346
time 337
been 335
down 328
its 323
could 313
has 305
will 301
into 300
what 285
her 278
two 273
do 271
other 270
some 269
man 260
new 259
any 238
got 234
these 233
she 233
who 229
more 226
water 222
did 214
before 208
over 202
way 202
hundred 200
upon 200
here 199
after 195
day 193
than 192
well 191
through 191
get 190
old 186
every 186
can 185
boat 184
went 183
never 182
good 181
years 181
see 176
know 175

Number of distinct words making up the top 90% of occurrences: 3732
