Your assignment for this course is something similar: build a Python function that takes the path to the file data/corpus.txt (UTF-8 encoded) from this repo as an argument and prints a count of the 100 most frequent 1-grams (i.e. single words).

In essence the job is to do this:

In [59]:
from collections import Counter
import os

def onegrams(file):
    with open(file, 'r') as corpus:
        text = corpus.read()
        # .casefold() is better than .lower() here
        # https://www.programiz.com/python-programming/methods/string/casefold
        normalize = text.casefold()
        words = normalize.split(' ')
        count = Counter(words) 
        return count

ngram_viewer = onegrams(os.path.join('data', 'corpus.txt'))
print(ngram_viewer.most_common(100))

[('the', 11852), ('', 5952), ('of', 5768), ('and', 5264), ('to', 4027), ('a', 3980), ('in', 3548), ('that', 2336), ('his', 2061), ('it', 1517), ('as', 1490), ('i', 1488), ('with', 1460), ('he', 1448), ('is', 1400), ('was', 1393), ('for', 1337), ('but', 1319), ('all', 1148), ('at', 1116), ('this', 1063), ('by', 1042), ('from', 944), ('not', 933), ('be', 863), ('on', 850), ('so', 763), ('you', 718), ('one', 694), ('have', 658), ('had', 647), ('or', 638), ('were', 551), ('they', 547), ('are', 504), ('some', 498), ('my', 484), ('him', 480), ('which', 478), ('their', 478), ('upon', 475), ('an', 473), ('like', 470), ('when', 458), ('whale', 456), ('into', 452), ('now', 437), ('there', 415), ('no', 414), ('what', 413), ('if', 404), ('out', 397), ('up', 380), ('we', 379), ('old', 365), ('would', 350), ('more', 348), ('been', 338), ('over', 324), ('only', 322), ('then', 312), ('its', 307), ('such', 307), ('me', 307), ('other', 301), ('will', 300), ('these', 299), ('down', 270), ('any', 269), ('

However, there is a twist: you can’t use the collections library…

Moreover, try to think about what else may be suboptimal in this example. For instance, this code loads the entire text into memory at once (with the read() method). What would happen if we tried this on a really big text file?
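
One possible way around the memory problem is to iterate over the file object itself, which hands back one line at a time instead of the whole text. The sketch below only illustrates that idea: the helper name `iter_normalized_lines` is hypothetical, it assumes individual lines are small enough to fit in memory, and how each line is split into words is deliberately left to the caller.

```python
def iter_normalized_lines(path, encoding='utf-8'):
    """Yield the case-folded lines of a text file one at a time."""
    with open(path, 'r', encoding=encoding) as corpus:
        # A file object is a lazy iterator over its lines, so only the
        # current line is held in memory, not the whole file.
        for line in corpus:
            yield line.casefold()
```

A caller could then loop with `for line in iter_normalized_lines(os.path.join('data', 'corpus.txt')):` and update its counts line by line.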

Most importantly, the count is also wrong. Check by counting in an editor, for instance, and try to find out why.

If this is an easy task for you, you can also think about the graphical representation of the 1-gram count.
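
As a starting point for the graphical representation, here is a minimal sketch that draws a plain-text bar chart with the standard library only. It stands in for a real plotting library rather than replacing one. The function name `text_bar_chart` and its `top` and `width` parameters are assumptions made for illustration, and the three sample counts are taken from the output printed earlier.

```python
def text_bar_chart(counts, top=10, width=40):
    """Return a plain-text bar chart of the `top` most frequent items."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
    if not ranked:
        return ''
    largest = ranked[0][1]
    # Show words with repr() so that odd tokens such as '' stay visible.
    label_width = max(len(repr(word)) for word, _ in ranked)
    lines = []
    for word, count in ranked:
        # Scale each bar relative to the most frequent word.
        bar = '#' * max(1, round(width * count / largest))
        lines.append(f'{word!r:<{label_width}} {bar} {count}')
    return '\n'.join(lines)


sample = {'the': 11852, 'of': 5768, 'and': 5264}
print(text_bar_chart(sample, top=3, width=20))
```

The same dictionary could later be handed to a plotting library of your choice; the sorting and scaling steps stay the same.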

Solution attempt:

In [60]:
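import os

def ngram_counter(file):
    """Count space-separated tokens in `file`, reading one line at a time."""
    word_dic = {}
    with open(file, 'r', encoding='utf-8') as corpus:
        # Iterating over the file object yields one line at a time,
        # so the whole text is never held in memory at once.
        for line in corpus:
            normalize = line.casefold()
            # NOTE: splitting on a single space keeps newlines attached
            # (e.g. 'the\n') and yields empty strings for repeated spaces.
            words = normalize.split(' ')
            for word in words:
                if word in word_dic:
                    word_dic[word] += 1
                else:
                    word_dic[word] = 1
    return word_dic


# Use a separate name for the result so the function is not overwritten.
ngram_counts = ngram_counter(os.path.join('data', 'corpus.txt'))
# Sort (count, word) pairs in descending order of count.
sorted_ngram = sorted(((value, key) for (key, value) in ngram_counts.items()), reverse=True)
print(sorted_ngram[:100])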


    

[(12825, 'the'), (6077, 'of'), (5983, ''), (5663, 'and'), (4248, 'to'), (4128, 'a'), (3774, 'in'), (3361, '\n'), (2546, 'that'), (2214, 'his'), (1668, 'it'), (1603, 'i'), (1597, 'as'), (1581, 'with'), (1570, 'but'), (1535, 'he'), (1478, 'the\n'), (1470, 'is'), (1459, 'was'), (1427, 'for'), (1236, 'all'), (1197, 'at'), (1167, 'this'), (1105, 'by'), (1022, 'from'), (1012, 'not'), (908, 'be'), (896, 'on'), (816, 'so'), (764, 'you'), (739, 'one'), (702, 'have'), (689, 'had'), (662, 'or'), (614, 'and\n'), (592, 'they'), (591, 'were'), (550, 'some'), (537, 'their'), (534, 'of\n'), (534, 'are'), (524, 'which'), (520, 'when'), (520, 'upon'), (517, 'like'), (512, 'my'), (512, 'him'), (507, 'a\n'), (504, 'an'), (498, 'whale'), (485, 'into'), (477, 'now'), (474, 'there'), (451, 'what'), (448, 'no'), (438, 'if'), (424, 'out'), (407, 'we'), (395, 'up'), (392, 'old'), (390, 'would'), (390, 'more'), (361, 'been'), (347, 'then'), (342, 'over'), (339, 'only'), (337, 'these'), (334, 'such'), (334, 'othe
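
The entries '', '\n', 'the\n' and 'and\n' in the output above suggest that splitting on a single space is still distorting the count: newlines stay attached to words and repeated spaces produce empty strings. Below is a minimal sketch of one possible fix, not a definitive solution. It assumes that a "word" is a whitespace-delimited token with surrounding ASCII punctuation stripped; the names `count_words` and `top_n` are hypothetical, and the punctuation rule is a simplification that the assignment does not prescribe.

```python
import string

def count_words(lines):
    """Count case-folded words in an iterable of text lines."""
    counts = {}
    for line in lines:
        # split() without arguments splits on any run of whitespace,
        # so '\n' and repeated spaces no longer produce stray tokens.
        for token in line.casefold().split():
            # Assumption: strip ASCII punctuation from both ends only;
            # typographic quotes and dashes would need extra handling.
            word = token.strip(string.punctuation)
            if word:
                counts[word] = counts.get(word, 0) + 1
    return counts

def top_n(counts, n=100):
    """Return the n most frequent (word, count) pairs, ties in A-Z order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


sample = ['The whale,  the\n', 'whale and THE sea.\n']
print(top_n(count_words(sample), 3))
```

Because `count_words` accepts any iterable of lines, an open file object can be passed in directly, which keeps the line-by-line reading of the attempt above.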