<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/07_Word_Frequencies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Frequencies

In this notebook we will learn how to explore the numerical distribution of words in a text, that is, how often each word occurs relative to the others. `Word frequency` more broadly represents the overall frequency of a word in general language use. It is a very interesting property of language because it correlates with other constructs, such as word length (shorter words tend to be more frequent) and word difficulty (more complex words tend to be less frequent).

One of the interesting things about frequency is a phenomenon called Zipf's law, which states that a word's frequency is roughly inversely proportional to its frequency rank: the most frequent word occurs about twice as often as the second most frequent word, about three times as often as the third, and so on down the ranking. You can read a [reddit post about it here](https://www.reddit.com/r/linguistics/comments/830nf5/zipfs_law_was_so_cool_that_i_performed_and/), or at least look at the graph its author made to explain the phenomenon:


<img src = https://www.etymologynerd.com/uploads/1/5/8/8/15888322/website.png height = "300">
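
To make the "inversely proportional to rank" idea concrete, here is a minimal sketch of the idealised Zipfian prediction. The function name `zipf_expected` and the top count of 1200 are made up for illustration; real corpora only follow this pattern approximately.

```python
# Idealised Zipf's law: the predicted count at rank r is (count at rank 1) / r.
def zipf_expected(top_count, max_rank):
    """Return the idealised Zipfian counts for ranks 1..max_rank."""
    return [top_count / rank for rank in range(1, max_rank + 1)]

# 1200 is a hypothetical count for the most frequent word
for rank, expected in enumerate(zipf_expected(1200, 5), start=1):
    print(rank, round(expected))
```

Comparing predictions like these against the real counts of a text is one simple way to see how closely it follows the law.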


Moreover, counting how often words occur with *other* words has proven very insightful for linguistics and NLP. The most basic insight is that words tend to co-occur with other specific words in predictable ways. Corpus linguists call these pairs of words `collocations`, and define them using a variety of different statistical measures. Finding these larger collocational patterns has given strength to functional linguistic theories of grammar such as construction grammar, which argue that both meaning and syntax determine the way a word is used in language (contrast this with a purely structural approach, which argues grammar rules exist independently of meaning).

Word co-occurrence statistics are also used to create co-occurrence distributions and vector spaces. These are what large-scale NLP algorithms and artificial intelligence applications rely on for word predictions in both processing and production (more on that later!).

The second half of NLTK Chapter 1 begins to introduce these important concepts.

## Frequency distributions


The simplest form of a frequency distribution is a count of how many times each word `type` appears in a text. It's worth pausing for a moment and considering how you might construct your own frequency distribution. What might be the steps for doing so? Here is one general approach you could take:

1. You start a loop over some words
2. At the first word, you note down the word and store it in a separate data container, alongside a value representing its frequency
3. You then move to the next word and check if the next word already exists in your data container,
      - if it does already exist, you increase its count by 1
      - If it does not exist, you add it to the data container and set an initial count of 1

Here is what that might look like using pseudocode:

```
output_container = []

for word in my_words:
  if word in output_container:
    increase count of word by 1
  else:
    add word to output_container
    set count of word to 1
```

Now, what kind of data container would make sense for this? A `list` might be able to work, but this would require some careful slicing and indexing and might become a pain. There is another data container better designed for this known as a dictionary. We will learn how to create dictionaries in a later lesson. But for now, we can rely on a built-in NLTK function named `FreqDist()`, which creates a dictionary of `value:frequency` pairs.
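
As a preview only (dictionaries are covered properly in a later lesson), the pseudocode above could be sketched in Python roughly like this. The word list and the variable names are made up for illustration, and this is just one way to do it, not how `FreqDist()` is implemented.

```python
# a small made-up list of tokens, just for illustration
words = ["the", "cat", "saw", "the", "dog", "the", "end"]

# an empty dictionary that will hold word:count pairs
counts = {}

for word in words:
    if word in counts:
        # the word is already stored, so increase its count by 1
        counts[word] += 1
    else:
        # first time we see this word, so store it with a count of 1
        counts[word] = 1

print(counts)
```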







### Using `nltk.FreqDist()`

We can pass a sequence (e.g., a string, a list, etc) to the `nltk.FreqDist()` function and it will count the number of times different values in the sequence occur. For example, we can count the frequency of letters in a word or words in a sentence.

To do so, we simply pass whatever sequence we want as an argument to `nltk.FreqDist()`. Ideally, save the results to a variable.

Run the cell below as an example:


In [1]:
# import the FreqDist from nltk
from nltk import FreqDist

# define a string containing multiple words
turtles = """teenage mutant ninja turtles
            teenage mutant ninja turtles
            teenage mutant ninja turtles
            heroes in a halfshell, turtle power"""

# split the string into tokens/words:
turtles_tokens = turtles.split()

# save the frequency distribution to a variable
turtle_fdist = FreqDist(turtles_tokens)

# inspect the results
turtle_fdist

FreqDist({'teenage': 3, 'mutant': 3, 'ninja': 3, 'turtles': 3, 'heroes': 1, 'in': 1, 'a': 1, 'halfshell,': 1, 'turtle': 1, 'power': 1})

The resulting frequency distribution behaves like another Python data object called a `dictionary`, which stores `key:value` pairs. In this case, our keys are the words, and the values are the frequencies.

We can query a dictionary for specific `key:value` pairs using the following syntax:

> `dictionary['key']`

This should look familiar, because it is similar to how one can index characters in strings (e.g., `turtles[1]`) or words in lists (e.g., `['one', 'two'][0]`).

For example:

In [2]:
# how frequent is "turtles?"
turtle_fdist['turtles']

3

In [3]:
# how frequent is "turtle?"
turtle_fdist['turtle']

1

In [4]:
# what happens if we ask for a word not in the dictionary?
# the NLTK FreqDist gives us a 0 rather than an error, which is handy!

turtle_fdist['shredder']

0
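
To see why that `0` is handy, the sketch below contrasts a plain dictionary with Python's built-in `collections.Counter`, which is the class NLTK's `FreqDist` is built on. `Counter` stands in for `FreqDist` here so the example runs without NLTK, and the token list is made up for illustration.

```python
from collections import Counter

tokens = "teenage mutant ninja turtles turtle power".split()

# a plain dictionary built by hand
plain = {}
for token in tokens:
    plain[token] = plain.get(token, 0) + 1

# a Counter built from the same tokens
counter = Counter(tokens)

# a Counter reports 0 for a key it has never seen
print(counter["shredder"])

# a plain dictionary raises a KeyError instead
try:
    plain["shredder"]
except KeyError:
    print("KeyError: 'shredder' is not in the plain dictionary")

# .get() with a default is the usual way to avoid that error
print(plain.get("shredder", 0))
```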

We can also ask for the most frequent terms from a frequency distribution using the `.most_common()` method. We can specify the number of top results we want by putting a number in the parentheses `()` used by `.most_common()`. The code below has a `3` in the parentheses, so the method will return the three most frequent words in the frequency distribution.

In [5]:
# what are the three most common words?
turtle_fdist.most_common(3)

[('teenage', 3), ('mutant', 3), ('ninja', 3)]
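
Notice that four words are tied with a count of 3, but `.most_common(3)` can only return three of them. If you want every word tied for the top spot, one possible approach is sketched below. It uses `collections.Counter` (the class `FreqDist` is built on) and a shortened, made-up version of the turtles text so it runs on its own.

```python
from collections import Counter

tokens = ("teenage mutant ninja turtles "
          "teenage mutant ninja turtles "
          "heroes in a halfshell").split()
counts = Counter(tokens)

# find the highest count, then keep every word that reaches it
top_count = max(counts.values())
top_words = [word for word, count in counts.items() if count == top_count]

print(top_count, top_words)
```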

### Fine-tuning a search with frequency

Let's calculate word frequencies for a larger, more interesting data set. Create a frequency distribution of `text5`, the webchat corpus included with `nltk`, using `FreqDist()`. You'll need to import `nltk` and download the book resource:

In [2]:
# import the main nltk module
import nltk

# download the nltk.book resources
nltk.download('book')

# import the resources
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\mingb\AppData\R

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [7]:
# Now create a FreqDist of the webchat text
webchat_fdist = FreqDist(text5)

What are the 50 most common words in the webchat corpus? Examine the output - what do you see? Are there items in the output you did or did not expect? What do you think is happening?

In [8]:
webchat_fdist.most_common(50)

[('.', 1268),
 ('JOIN', 1021),
 ('PART', 1016),
 ('?', 737),
 ('lol', 704),
 ('to', 658),
 ('i', 648),
 ('the', 646),
 ('you', 635),
 (',', 596),
 ('I', 576),
 ('a', 568),
 ('hi', 546),
 ('me', 415),
 ('...', 412),
 ('is', 372),
 ('..', 361),
 ('in', 357),
 ('ACTION', 346),
 ('!', 342),
 ('and', 335),
 ('it', 332),
 ('that', 274),
 ('hey', 264),
 ('my', 242),
 ('of', 202),
 ('u', 200),
 ("'s", 195),
 ('for', 188),
 ('on', 186),
 ('what', 183),
 ('here', 181),
 ('are', 178),
 ('not', 170),
 ('....', 170),
 ('do', 168),
 ('all', 165),
 ('have', 164),
 ('up', 160),
 ('like', 156),
 ('no', 155),
 ('with', 152),
 ('chat', 142),
 ('was', 142),
 ("n't", 141),
 ('so', 139),
 ('your', 137),
 ('/', 133),
 ("'m", 133),
 ('good', 130)]

Let's now look at how people use the phrase "lol" - both the individual frequency and the overall percentage of "lol" in the corpus.

What do you think about the results? About 1.6% might seem low, but it is actually a rather strong result considering how many possible words *could* be in the corpus.


In [9]:
# index the value by using the key (in this case, the word we want to check)
webchat_fdist['lol']

704

In [10]:
# divide the frequency of 'lol' by the total length of the corpus, then multiply by 100
webchat_fdist['lol']/len(text5)*100

1.5640968673628082
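
The same calculation can be wrapped in a small helper so it is easy to repeat for other words. This is a sketch on a made-up token list, using `collections.Counter` in place of `FreqDist`; the function name `percent` is hypothetical.

```python
from collections import Counter

# a tiny made-up "corpus" of tokens
tokens = ["lol", "hi", "lol", "hey", "hi", "lol", "bye", "ok"]
counts = Counter(tokens)

# total number of tokens (the same value as len(tokens))
total = sum(counts.values())

def percent(word):
    """Return how much of the corpus a word accounts for, as a percentage."""
    return counts[word] / total * 100

print(percent("lol"))
print(percent("shredder"))
```

NLTK's `FreqDist` also provides a `.freq()` method that returns this proportion (before multiplying by 100), so `webchat_fdist.freq('lol') * 100` should give the same figure as the cell above.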

We can now include word frequency as an additional condition when looking for certain words. Do you recall how list comprehensions and conditional for loops worked? For example, if we wanted to ask for all words which are three letters long:

In [None]:
# all tokens which are 3 letters long (list comprehension)
# this says: give me every word in text5 but only if the length of the word is equal to 3
[w for w in text5 if len(w) == 3]

The output is not very readable, is it? We are getting every single token which is 3 characters long, including repetitions. We can reduce this firstly by wrapping the list comprehension in `set()` so that we get a set of types, rather than a list of tokens.


In [11]:
# add set()
set([w for w in text5 if len(w) == 3])

{'!!!',
 '!!.',
 '!??',
 '###',
 '$27',
 "'ll",
 "'n'",
 "'re",
 "'ve",
 '(((',
 ')))',
 ',,,',
 '-->',
 '-17',
 '-21',
 '-_-',
 '. .',
 '.(.',
 '.).',
 '...',
 '.45',
 '.;)',
 '05.',
 '06.',
 '100',
 '12%',
 '138',
 '16.',
 '185',
 '2.3',
 '20.',
 '20S',
 '20s',
 '220',
 '224',
 '246',
 '247',
 '280',
 '295',
 '2nd',
 '30.',
 '300',
 '360',
 '396',
 '423',
 '43.',
 '453',
 '46.',
 '47.',
 '55%',
 '55.',
 '56.',
 '579',
 '59%',
 '60s',
 '65%',
 '68%',
 '70%',
 '700',
 '73%',
 '75%',
 '76%',
 '818',
 '85%',
 '93%',
 ':-(',
 ':-)',
 ':-@',
 ':-o',
 ';-(',
 ';-)',
 '<--',
 '<33',
 '<<<',
 "='s",
 '=-\\',
 '>.>',
 '>>>',
 '>_>',
 '?..',
 '???',
 '??@',
 '@$$',
 'AFK',
 'ALL',
 'AND',
 'ANY',
 'ARE',
 'ASS',
 'Ack',
 'Ahh',
 'Amy',
 'And',
 'Any',
 'Are',
 'Ark',
 'Ask',
 'Aww',
 'BIG',
 'BOY',
 'BUT',
 'BUt',
 'BYE',
 'Ben',
 'Box',
 'Bud',
 'But',
 'Bye',
 'CAN',
 'CDT',
 'COM',
 'CSI',
 'CST',
 'CUZ',
 'Can',
 'Cry',
 'Cum',
 'DON',
 'DVD',
 'Dew',
 'Did',
 'Dr.',
 'EST',
 'End',
 'Fix',

If you look through that output, you can see that there are a lot of things that look like codes or other non-word stuff, usually in UPPERCASE. We can try removing those using `.islower()`

In [12]:
# all tokens which are 3 characters long and whose letters are all lowercase
# give me the set of all words in text5 if the word is 3 characters long and contains no uppercase letters
set([w for w in text5 if len(w) == 3 and w.islower()])

{"'ll",
 "'n'",
 "'re",
 "'ve",
 '20s',
 '2nd',
 '60s',
 ':-o',
 "='s",
 '\\ty',
 'abs',
 'act',
 'ads',
 'afe',
 'afk',
 'age',
 'ago',
 'ahh',
 'aim',
 'air',
 'aka',
 'all',
 'alo',
 'amy',
 'and',
 'ans',
 'any',
 'aok',
 'are',
 'arm',
 'art',
 'ask',
 'asl',
 'ass',
 'ate',
 'atl',
 'aww',
 'b/c',
 'bad',
 'bag',
 'bak',
 'ban',
 'bar',
 'bay',
 'bbl',
 'bbs',
 'bed',
 'beg',
 'ben',
 'bes',
 'bet',
 'big',
 'bio',
 'bit',
 'biz',
 'bob',
 'boi',
 'boo',
 'bot',
 'bow',
 'box',
 'boy',
 'bra',
 'brb',
 'bro',
 'btw',
 'bug',
 'buh',
 'bum',
 'bus',
 'but',
 'buy',
 'byb',
 'bye',
 "c'm",
 'cal',
 'cam',
 'can',
 'car',
 'cat',
 'chp',
 'com',
 'con',
 'cop',
 'cos',
 'cow',
 'cpr',
 'cry',
 'cup',
 'cus',
 'cut',
 'cuz',
 'cya',
 'dad',
 'dam',
 'dat',
 'day',
 'dem',
 'did',
 'die',
 'dik',
 'dis',
 'doc',
 'doe',
 'dog',
 'dry',
 'duh',
 'dum',
 'dun',
 'dya',
 'ear',
 'eat',
 'eay',
 'egg',
 'ehh',
 'elo',
 'end',
 'eng',
 'ere',
 'erm',
 'eva',
 'eww',
 'eye',
 'fan',
 'far',

Now it's getting more manageable. It's still quite a long list though. Let's add another condition: asking for the same output as the previous code, but this time setting a minimum frequency. We can embed a FreqDist as part of the condition. Let's also adjust our length so that we let both 3- and 4-letter words appear.



In [14]:
# adding a minimum frequency and allowing both 3- and 4-letter words (how else could you write that conditional?)

# give me the set of all words in text5 if the word is 3 or 4 characters long, is lowercase, and occurs more than 100 times in the FreqDist
set([w for w in text5 if (len(w) == 4 or len(w) == 3) and w.islower() and webchat_fdist[w] > 100])

{'all',
 'and',
 'any',
 'are',
 'can',
 'chat',
 'for',
 'get',
 'good',
 'have',
 'here',
 'hey',
 'how',
 'just',
 'know',
 'like',
 'lmao',
 'lol',
 "n't",
 'not',
 'out',
 'that',
 'the',
 'too',
 'was',
 'what',
 'with',
 'you',
 'your'}
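
As one possible answer to the question in the code comment, the length test can be written more compactly with `in` or with a chained comparison. The sketch below uses a made-up token list and `collections.Counter` in place of `FreqDist` so it runs on its own. It also loops over the counted types rather than over every token, which gives the same set with less repeated work.

```python
from collections import Counter

# made-up tokens standing in for a corpus
tokens = ["lol"] * 5 + ["chat"] * 4 + ["hello"] * 6 + ["hey"] * 2 + ["JOIN"] * 7
counts = Counter(tokens)
min_count = 3  # hypothetical threshold for this tiny example

# option 1: membership test on a tuple of allowed lengths
with_in = {w for w in counts
           if len(w) in (3, 4) and w.islower() and counts[w] > min_count}

# option 2: chained comparison
with_range = {w for w in counts
              if 3 <= len(w) <= 4 and w.islower() and counts[w] > min_count}

print(sorted(with_in))
print(with_in == with_range)
```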

What do you see in that output? Any words stand out as representative of a chat corpus? What kinds of words do you think you will find using the same criteria but on a different corpus? The point, which was made in the NLTK book regarding the length of words, is that a single line of code with the right tuning can provide relatively precise insight into the nature of a text and/or corpus.

### **Your Turn**

Spend some time playing around with one of the other built-in texts (`text1` through `text9`) from the NLTK data.

Your goal is to try and refine some search patterns to find words which seem to capture the nature of the different texts. For example, you could think about a minimum frequency and minimum or maximum length, such as I have done with `text5` above.

You can see the name of each text by typing its variable name (`text1` through `text9`) into a cell and running it, for example:

In [7]:
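# typing just a text's variable name as the last line of a cell displays its title;
# here text6 sits mid-cell, so only the result of the final expression is shown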
myText = text6
myTextDist = FreqDist(myText)

text6

set([word for word in myText if (len(word) == 14 or len(word) == 8) and word.islower() and myTextDist[word] > 30])

set()
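
An empty `set()` like the one above simply means that no word met every condition at once, so the next step is to loosen one of them. If you find yourself repeating the same search with different settings, you could wrap it in a function. The sketch below is one way to do that: the function name `frequent_words`, its parameters, and the token list are all made up for illustration, and `collections.Counter` stands in for `FreqDist`.

```python
from collections import Counter

def frequent_words(tokens, lengths, min_count):
    """Return the sorted lowercase types whose length is in `lengths`
    and which occur more than `min_count` times."""
    counts = Counter(tokens)
    return sorted(w for w in counts
                  if len(w) in lengths and w.islower() and counts[w] > min_count)

# a made-up token list, just for illustration
tokens = ["knight"] * 4 + ["grail"] * 3 + ["ni"] * 5 + ["ARTHUR"] * 6

# too strict: nothing occurs more than 10 times
print(frequent_words(tokens, lengths={5, 6}, min_count=10))

# relaxed threshold: now some words get through
print(frequent_words(tokens, lengths={5, 6}, min_count=2))
```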