# AI for Science Training - Week 04

### Damyn Chipman - Boise State University

## Intro to Large Language Models

## Homework: Tokenizers

### Part 1: **Tokenization** 

Write a generic Python tokenizer, which takes a set of text lines and tabulates the different words (that is, the tokens will be simply English words), keeping track of the frequency of each word.

#### Part 1a.

Insert code in this loop to operate on the str variable 'line' so as to fix these problems before 'line' is split into words.

A hint at one possible way to do this: use the 'punctuation' character definition in the Python 'string' module together with the 'maketrans' and 'translate' methods of Python's str class to eliminate punctuation, and use the regular expression ('re') Python module to eliminate any non-ASCII Unicode characters. It is useful to know that the regular expression r'[^\x00-\x7f]' means "any character not in the vanilla ASCII set".
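The hint above can be sketched as follows. This is a minimal illustration rather than the reference solution: the helper name `clean_line` and the sample sentence are made up for the example, and non-ASCII characters are replaced with a space (an assumption, so that adjacent words are not accidentally glued together).

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def clean_line(line):
    """Hypothetical helper: strip non-ASCII characters and punctuation, then lowercase."""
    line = NON_ASCII.sub(' ', line)      # e.g. curly quotes become spaces
    line = line.translate(PUNCT_TABLE)   # drop ASCII punctuation
    return line.lower()

# Made-up sample line containing curly quotes and ASCII punctuation.
print(clean_line('“Hello,” said Tom. “Look out!”').split())
```

Note that 'string.punctuation' includes the apostrophe, so with this approach a contraction such as "don't" becomes the single token "dont".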

#### Part 1b.

Add code to sort the contents of wdict by word occurrence frequency.  What are the top 100 most frequent word tokens?  Adding up occurrence frequencies starting from the most frequent words, how many distinct words make up the top 90% of word occurrences in this "corpus"?

### Part 2: **Embedding**

Modify the embedding visualization code above to zoom in on various regions of the projections, and identify at least one interesting cluster of tokens.
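Since the visualization code itself is not reproduced here, the following is only a sketch of one way to inspect a region. It assumes the projection is available as a list of (x, y) pairs with a parallel list of tokens; the function name `tokens_in_region` and the coordinates are hypothetical.

```python
def tokens_in_region(points, tokens, xlim, ylim):
    """Return the tokens whose projected 2-D point lies inside the given box."""
    (xmin, xmax), (ymin, ymax) = xlim, ylim
    return [tok for (x, y), tok in zip(points, tokens)
            if xmin <= x <= xmax and ymin <= y <= ymax]

# Made-up projected coordinates and tokens, for illustration only.
points = [(0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (-3.0, 7.5)]
tokens = ['river', 'north', 'south', '1984']

print(tokens_in_region(points, tokens, xlim=(4.5, 5.5), ylim=(4.5, 5.5)))
```

If the plot is drawn with matplotlib, the same limits can be passed to the axes' `set_xlim` and `set_ylim` methods to zoom the figure to that box.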

---

## Homework Submission

### Part 1a:

Let's write an elementary tokenizer that uses words as tokens.

We will use Mark Twain's _Life On The Mississippi_ as a test bed. The text is in the accompanying file 'Life_On_The_Mississippi.txt'.

In [1]:
import re

In [2]:
wdict = {}
nlines = 0
with open('Life_On_The_Mississippi.txt', 'r') as L:
    for line in L:
        nlines += 1
        # Lowercase the line and extract runs of letters, digits, underscores,
        # and apostrophes; surrounding punctuation is dropped by the pattern.
        words = re.findall(r'\b[\w\']+\b', line.lower())
        for word in words:
            # Tally each occurrence of the word
            wdict[word] = wdict.get(word, 0) + 1

nitem = 0 ; maxitems = 100
for item in wdict.items():
    nitem += 1
    print(item)
    if nitem == maxitems: break


('the', 9362)
('project', 90)
('gutenberg', 97)
('ebook', 13)
('of', 4541)
('life', 94)
('on', 962)
('mississippi', 165)
('this', 794)
('is', 1153)
('for', 1119)
('use', 50)
('anyone', 5)
('anywhere', 18)
('in', 2617)
('united', 37)
('states', 51)
('and', 6032)
('most', 125)
('other', 271)
('parts', 9)
('world', 73)
('at', 753)
('no', 443)
('cost', 26)
('with', 1095)
('almost', 38)
('restrictions', 2)
('whatsoever', 2)
('you', 1043)
('may', 92)
('copy', 17)
('it', 2351)
('give', 82)
('away', 175)
('or', 592)
('re', 5)
('under', 122)
('terms', 27)
('license', 27)
('included', 3)
('online', 4)
('www', 9)
('org', 9)
('if', 382)
('are', 387)
('not', 734)
('located', 9)
('will', 302)
('have', 570)
('to', 3624)
('check', 4)
('laws', 20)
('country', 77)
('where', 177)
('before', 213)
('using', 11)
('title', 3)
('author', 3)
('mark', 19)
('twain', 25)
('release', 1)
('date', 18)
('july', 7)
('10', 11)
('2004', 1)
('245', 1)
('recently', 4)
('updated', 2)
('january', 3)
('1', 62)
('2021', 1)

### Part 1b:

In [13]:
sorted_wdict = dict(sorted(wdict.items(), key=lambda item: item[1], reverse=True))
nitem = 0 ; maxitems = 100
for item in sorted_wdict.items():
    nitem += 1
    print(item)
    if nitem == maxitems: break

('the', 9362)
('and', 6032)
('of', 4541)
('a', 4230)
('to', 3624)
('in', 2617)
('it', 2351)
('i', 2281)
('was', 2097)
('that', 1744)
('he', 1429)
('is', 1153)
('for', 1119)
('with', 1095)
('you', 1043)
('but', 986)
('his', 965)
('on', 962)
('had', 960)
('as', 886)
('this', 794)
('they', 767)
('at', 753)
('by', 743)
('all', 735)
('not', 734)
('one', 715)
('there', 642)
('were', 627)
('be', 620)
('or', 592)
('my', 586)
('from', 579)
('have', 570)
('so', 557)
('out', 553)
('up', 547)
('me', 536)
('we', 531)
('him', 529)
('when', 506)
('which', 491)
('river', 486)
('would', 478)
('an', 455)
('no', 443)
('them', 431)
('then', 419)
('said', 404)
('are', 387)
('if', 382)
('their', 377)
('now', 377)
('time', 355)
('about', 353)
('down', 342)
('been', 336)
('could', 312)
('has', 306)
('will', 302)
('two', 301)
('into', 300)
('what', 299)
('her', 282)
('its', 281)
('some', 274)
('do', 272)
('other', 271)
('new', 270)
('man', 265)
('water', 245)
('she', 241)
('any', 239)
('more', 234)
('got', 233)
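As an aside, the standard library's `collections.Counter` can do the tallying and the sorting in one place. The sketch below is an alternative, not the code used for the output above; it reuses the same regular expression, and the function name `count_words` and the two sample sentences are made up for illustration.

```python
from collections import Counter
import re

def count_words(lines):
    """Hypothetical helper: tally lowercase word tokens over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\b[\w']+\b", line.lower()))
    return counts

# Made-up sample text, standing in for the lines of the book.
sample = ["The river was wide.", "The pilot knew the river."]

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
print(count_words(sample).most_common(2))
```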

In [15]:
# Total number of word occurrences in the corpus
n_words = sum(sorted_wdict.values())
# Number of occurrences that corresponds to 90% of the corpus
n_top_90_words = n_words * 0.9
n_words

150899
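To answer the last question of Part 1b, the running total of frequencies has to be compared against the 90% threshold. A minimal sketch of that step, assuming a word-to-count dictionary like wdict; the function name `words_for_coverage` and the toy dictionary are hypothetical, and no result for the actual corpus is claimed here.

```python
from itertools import accumulate

def words_for_coverage(freqs, fraction=0.9):
    """Return how many of the most frequent words are needed to reach
    the given fraction of all word occurrences."""
    counts = sorted(freqs.values(), reverse=True)
    target = fraction * sum(counts)
    # Walk the running total until it reaches the target.
    for n_distinct, running_total in enumerate(accumulate(counts), start=1):
        if running_total >= target:
            return n_distinct
    return len(counts)

# Toy dictionary standing in for wdict (20 occurrences in total).
toy = {'the': 10, 'river': 5, 'boat': 2, 'pilot': 2, 'mate': 1}
print(words_for_coverage(toy))
```

Calling `words_for_coverage(wdict)` on the dictionary built in Part 1a would give the number of distinct words covering 90% of the occurrences.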

### Part 2:

Here's the original projection of the embeddings:

<img src=viz-bert-voc-tsne10k-viz4k-noadj.pdf />

That weird string-like cluster turns out to be the years listed. Note that these tokens are not just numbers, but in fact dates:

<img src=embedding-01.png />

I also found a cluster of cardinal directions, which is not, in fact, located in the center of the projection:

<img src=embedding-02.png />

It also looks like personal names are loosely clustered. Within this cluster, names typically associated with males and names typically associated with females form their own sub-clusters. The name cluster also blends into the names of places, presumably because places are sometimes named after people.

<img src=embedding-03.png />