`nltk` and other useful Python libraries
----------------------
Lecture notebook for CSCI 3832, Spring 2020, Lecture 3, 1/17/2020.

(See the Getting Started/Python module on Canvas for more info on Jupyter Notebooks and Python. We'll be using Python 3 in this class.)


- [`nltk`](https://www.nltk.org/) is a useful Python library that has many NLP tools built in. It's a great tool to use to get the hang of things and to explore NLP. We'll be using it for exploratory examples. You are **not** allowed to use `nltk` for your homework assignments.
- [`collections`](https://docs.python.org/3/library/collections.html) is a module in Python's standard library that provides fancier (and sometimes more useful) data structures for you to use. `Counter` and `defaultdict` are particularly useful.
- [`matplotlib.pyplot`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.html) is a useful subset of the `matplotlib` library, which lets us graph things! It has a simpler interface in general than `matplotlib` as a whole.
- others, which we'll see later in class
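
As a quick taste of the two `collections` classes mentioned above, here is a minimal sketch on a made-up word list (the variable names `words`, `counts`, and `positions` are just for illustration and aren't used elsewhere in this notebook):

```python
from collections import Counter, defaultdict

words = "to be or not to be".split()

# Counter tallies how often each item occurs
counts = Counter(words)
print(counts["to"])           # 2
print(counts.most_common(1))  # [('to', 2)]

# defaultdict calls its factory (here, list) whenever a missing key is looked up
positions = defaultdict(list)
for i, word in enumerate(words):
    positions[word].append(i)
print(positions["be"])        # [1, 5]
```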

In [1]:
import collections
import matplotlib.pyplot as plt
import nltk
#nltk.download("punkt")  # you may need to do this if this is your first time running nltk

# so that graphs will show up in this notebook (so that we can see them)
%matplotlib inline 

In [2]:
# Function to read in a file with one entry per line, in the format:
# freq type
# into a dictionary mapping each type to its freq
def record_freqs(file):
    lex_freqs = {}
    with open(file, "r") as f:
        for line in f:
            count, word = line.strip().split()
            lex_freqs[word] = int(count)
    return lex_freqs

In [3]:
freqs = record_freqs("shakes_freqs.txt")

In [4]:
# what is the size of our vocabulary? 
print("vocab size:", len(freqs))

vocab size: 23526


In [5]:
# what about the number of tokens?

# added between lecture 3 & lecture 4
num_tokens = sum(freqs.values())
print("num tokens:", num_tokens)

num tokens: 926286
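
To make the difference between vocabulary size (types) and number of tokens concrete, here is a small sketch on a toy frequency file. The three lines of data are invented for illustration and are not taken from `shakes_freqs.txt`; `read_freqs` is a stand-in that mirrors `record_freqs` above so the block runs on its own.

```python
import os
import tempfile

def read_freqs(path):
    """Read 'count word' lines into a dict mapping word -> count."""
    freqs = {}
    with open(path, "r") as f:
        for line in f:
            count, word = line.strip().split()
            freqs[word] = int(count)
    return freqs

# hypothetical toy data in the same "freq type" format
toy_lines = "3 the\n2 be\n1 question\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(toy_lines)
    toy_path = tmp.name

toy_freqs = read_freqs(toy_path)
os.remove(toy_path)  # clean up the temporary file

print("vocab size:", len(toy_freqs))           # 3 distinct types
print("num tokens:", sum(toy_freqs.values()))  # 3 + 2 + 1 = 6 tokens
```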


In [6]:
# what is the size of our vocabulary if we used nltk?
with open("shakesdown.txt", "r") as f:
    content = f.read()

In [7]:
tokens = nltk.word_tokenize(content)

In [8]:
print("num tokens:", len(tokens))
print("vocab size:", len(set(tokens)))

num tokens: 1110213
vocab size: 29495
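
These counts differ from the ones we got from the frequency file. We don't know exactly how that file was produced, but counts like these depend heavily on tokenization choices such as lowercasing and how punctuation is handled. The sketch below illustrates the effect with plain Python only (it does not use `nltk`'s tokenizer, and the regular expression is just one rough, assumed way to split words):

```python
import re

text = "To be, or not to be: that is the question."

# whitespace tokenization keeps punctuation glued to words and keeps case
ws_tokens = text.split()

# a rough alternative: lowercase, then pull out runs of letters/apostrophes
simple_tokens = re.findall(r"[a-z']+", text.lower())

print(len(ws_tokens), len(set(ws_tokens)))          # 10 10
print(len(simple_tokens), len(set(simple_tokens)))  # 10 8
```

With whitespace splitting, `"To"`, `"to"`, `"be,"`, and `"be:"` all count as different types, so the vocabulary looks bigger than it is.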


In the end, we won't want to be dealing with lists of strings for words. Mapping elements of the vocabulary to integers is going to be more efficient. So we'll create a new dictionary that maps words (as strings) to integers. Anything that we need to know about a word (like its part of speech) we'll associate with its integer index. We'll use `collections.defaultdict` to do that: its default factory returns the current size of the dictionary, so each word gets the next unused integer the first time we look it up.
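
To see how a default factory can hand out indices, here is a minimal sketch on a toy word list (`toy_word2index` is a made-up name so it doesn't clash with the real `word2index` below):

```python
import collections

# each new key gets the current size of the dict as its value
toy_word2index = collections.defaultdict(lambda: len(toy_word2index))

for word in ["<UNK>", "the", "and", "the"]:
    toy_word2index[word]  # a lookup is enough to register a new word

print(dict(toy_word2index))  # {'<UNK>': 0, 'the': 1, 'and': 2}
```

The second lookup of `"the"` finds the existing entry, so the word keeps index 1 and no new index is used up.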

In [4]:
word2index = collections.defaultdict(lambda: len(word2index))

In [7]:
# we'll be updating our lexicon on 1/22 here
#len(word2index)
UNK = word2index["<UNK>"]
print(UNK)
print(word2index["<UNK>"])

0
0


To create a lexicon, we'll set a threshold for membership in the vocab. If a word's frequency in the training data is above the threshold, it's in. Let's set it to 2 for Shakespeare, so a word has to occur at least 3 times to make it into the lexicon.
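
Here is a small sketch of that thresholding on invented counts (the `toy_` names and the frequencies are hypothetical, chosen only to show which words make the cut):

```python
import collections

toy_freqs = {"the": 5, "and": 3, "cinema": 1, "be": 2}
threshold = 2

toy_word2index = collections.defaultdict(lambda: len(toy_word2index))
toy_word2index["<UNK>"]  # reserve index 0 for unknown words

# only words with freq strictly greater than the threshold get an index
toy_lexicon = [toy_word2index[w] for w, freq in toy_freqs.items() if freq > threshold]

print(toy_lexicon)             # [1, 2]
print(len(toy_word2index))     # 3: <UNK>, the, and
print("be" in toy_word2index)  # False: a frequency of 2 is not above 2
```

Note that the length of the index dictionary is the number of in-lexicon words plus one for `<UNK>`.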

In [8]:
threshold = 2

In [9]:
lexicon = [word2index[word] for word, freq in freqs.items() if freq > threshold]

In [10]:
lexicon[:10]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [11]:
# what does this number tell us?
len(word2index)

11655

In [12]:
# what does the number that gets printed out here tell us?
word2index["horatio"]

# added between lecture 3 & 4: how can we know how many times horatio occurs?
print(freqs["horatio"])

47


In [13]:
# we'll be updating our lexicon on 1/22 here
word2index = collections.defaultdict(lambda: UNK, word2index)

In [14]:
# let's create a reverse dictionary of the one above
index2word = {index: word for word, index in word2index.items()}

In [15]:
test1 = "to be or not to be that is the question"

In [16]:
# how do we get the indexes for these words out of our original lexicon?
# to be finished weds 1/22!
split = test1.split()
for word in split:
    print(word2index[word])



4
19
45
13
4
19
9
11
1
693


In [18]:
# what about converting a list of indices into a list of words?
indices = [1, 2, 3, 100, 375, 4443]
for ind in indices:
    print(index2word[ind])

the
and
i
some
cleopatra
extremes


In [20]:
# let's look at a trickier test sentence
test2 = "i went to the cinema today"
split = test2.split()
for word in split:
    print(word2index[word])
    print(index2word[word2index[word]])

3
i
995
went
4
to
1
the
0
<UNK>
3040
today
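
One thing to keep in mind about `defaultdict(lambda: UNK, ...)`: looking up an unknown word also stores that word in the dictionary with index 0. A reverse dictionary built after such lookups can then map 0 to something other than `<UNK>`. The sketch below shows this on a hypothetical three-word vocabulary and uses `dict.get` as one possible alternative that leaves the dictionary unchanged:

```python
import collections

toy = {"<UNK>": 0, "the": 1, "to": 2}
UNK = toy["<UNK>"]

with_default = collections.defaultdict(lambda: UNK, toy)
print(with_default["cinema"])    # 0
print("cinema" in with_default)  # True: the lookup added the key

# dict.get returns a fallback without modifying the dictionary
print(toy.get("cinema", UNK))    # 0
print("cinema" in toy)           # False

# reverse dictionary built after the lookup: the last key with index 0 wins
reverse = {index: word for word, index in with_default.items()}
print(reverse[0])                # cinema
```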


In [22]:
index2word[0]

'the'