<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/18_Bigrams_and_Collocations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Collocations

One of the most memorable quotes one learns when studying corpus linguistics is by [Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "you shall know a word by the company it keeps". This quote embodies the concept of *collocational meaning*: the idea that the true meaning of words is only realised when used in the context of *other* words.

Here is the definition of collocation from Google Dictionary:

![](https://i.imgur.com/hz3L88z.png)

Note how the linguistic definition is essentially the same as the non-linguistic definition, but includes an additional qualification about habitual co-occurrence. Collocations are not just words which occur next to other words; they are words which co-occur *non-randomly*, at a rate higher than chance.

How do we know which words co-occur non-randomly? Think of the corpora that you've already seen: large corpora such as Brown (and even larger ones) allow linguists to mine not just frequently occurring single words, but also frequently occurring word pairs, triplets, and so on. Statistical measures of word co-occurrence provide insight into natural language and also contribute to the ever-increasing accuracy of automatic NLP algorithms used today.

Statistical knowledge of collocations is not just present in corpora, but is also part of becoming proficient in a language. This [Wikipedia article](https://en.wikipedia.org/wiki/English_collocations) contains a decent explanation of some collocations in English. Consider this table from the article, which shows how some word pairs seem natural / correct, whereas others do not:

![](https://i.imgur.com/cWE5iO5.png)

Note that none of the word combinations in the "unnatural English" column are ungrammatical; they just seem to be odd combinations, particularly when compared to the versions in the "natural English" column. Collocations are extracted from large corpora and are thought to reflect language use, which in turn shapes our intuitions about which combinations seem right and which seem odd.


## Bigrams

As was shown above, collocations involve at least two co-occurring words. Collocations can stretch beyond two words, to three, four, or more word partners.

A related but crucially different term from collocations is the term **bigrams**. In NLP, the term `bigram` means a pair of words, and technically can mean *any pair of adjacent words in a text*. The more general terminology is `n-grams`, where the `n` can stand for any number. So you can have bigrams (two words), trigrams (three words), and so on. You may find that sometimes people conflate `collocation` with `ngram`, but what they probably are referring to are unusually frequent ngrams when compared to other ngrams, especially when taking into account individual word frequencies.
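To make the idea of "any run of `n` adjacent words" concrete, here is a minimal sketch in plain Python. The function name `ngrams_sketch` is made up for illustration and is not part of NLTK.

```python
def ngrams_sketch(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # the range stops early enough that every slice has exactly n items
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['we', 'live', 'in', 'a', 'society']
print(ngrams_sketch(tokens, 2))  # bigrams
print(ngrams_sketch(tokens, 3))  # trigrams
```

Note that nothing here looks at frequency: every adjacent pair or triple is returned, whether or not it is a collocation.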

Simply put, the relationship between collocations and bigrams is that all collocations are bigrams (or, more accurately, n-grams), but not all n-grams are collocations. In order to be upgraded from n-gram to collocation, it must be shown that the n-gram occurs more frequently than would be expected by chance.

So, collocations are defined based on statistical co-occurrence frequencies: collocations are words which are more likely to occur with one another than with other words, after controlling for each word's individual frequency.
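One common way to put a number on "more often than chance" is pointwise mutual information (PMI), which compares how often a pair actually occurs with how often it would occur if the two words were independent. The sketch below is only an illustration: the counts are made up, the helper name `pmi` is hypothetical, and it treats the number of bigrams as roughly equal to the number of tokens. PMI is one of several association measures and is not necessarily the one NLTK's `collocations()` uses.

```python
import math

def pmi(pair_count, x_count, y_count, total_tokens):
    """Pointwise mutual information (in bits) for a word pair."""
    p_xy = pair_count / total_tokens   # observed probability of the pair
    p_x = x_count / total_tokens       # probability of the first word
    p_y = y_count / total_tokens       # probability of the second word
    # a ratio above 1 (PMI above 0) means the pair is more frequent than chance predicts
    return math.log2(p_xy / (p_x * p_y))

# made-up counts for a hypothetical 10,000-token corpus
total = 10_000
# two very frequent words: many co-occurrences are expected by chance alone
frequent_pair = pmi(pair_count=60, x_count=300, y_count=600, total_tokens=total)
# two rare words that almost always appear together
rare_pair = pmi(pair_count=8, x_count=10, y_count=9, total_tokens=total)
print(round(frequent_pair, 2), round(rare_pair, 2))
```

Even though the first pair is far more frequent in raw terms, the second pair gets the higher score, because its individual words are rare and yet nearly always occur together.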

How would you locate bigrams in a text? With a `for` loop and functions like `enumerate`, it wouldn't be too difficult. However, NLTK has neat functions which do this for us: `bigrams` and `collocations`. Let's try `bigrams` on some sample text first.

### `nltk.bigrams()`

The function to create bigrams in NLTK is relatively straightforward. We just need to call the function on a tokenized text (otherwise we would get bigrams of characters in a string!).

Let's see an example. First download the needed resources.

In [1]:
# import the NLTK library and download tokenizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# create a sample sentence
great_quote = 'we live in a society!'

# use nltk bigrams (wrapping it in list() provides us with the output right away)
list(nltk.bigrams(nltk.word_tokenize(great_quote)))

[('we', 'live'),
 ('live', 'in'),
 ('in', 'a'),
 ('a', 'society'),
 ('society', '!')]

You should be able to inspect the output and get an idea for what's going on. This function is simply starting with the first word of the sentence, making a pair with the second word, then moving on to the second word, making a pair with the third word, and so on. We can conceptualise how this works with a pseudo formula:

First we would loop through a sentence:

> `for n:m in a sentence (where n = the first word and m = the final word)`

Then we would simply pair each word with the word one position ahead of it:

> `output = (n, n+1), (n+1, n+2), (n+2, n+3), ..., (m-1, m)`

Is it that simple? Can we produce the bigrams in the same way that NLTK does? One excellent function which can help us with this is `enumerate()`!

Check out the code below - you can see that it was relatively easy to get the same basic functionality from NLTK with our own code. Take a moment to study what I've had to do in order to prevent index errors.





In [3]:
# use enumerate to make bigrams by asking for adjacent words until we get to the end of the sentence.
def bootleg_bigram(tokens):
  for i, word in enumerate(tokens):
    if i != len(tokens)-1: # what is the role of this line?
      print((word, tokens[i + 1]))

In [4]:
# Test out bootleg_bigram on the same text.
bootleg_bigram(nltk.word_tokenize(great_quote))

('we', 'live')
('live', 'in')
('in', 'a')
('a', 'society')
('society', '!')
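As an alternative sketch (not NLTK's own implementation), the same pairs can be built with `zip()`, which avoids the index check altogether and returns a list rather than printing. The name `bigrams_sketch` is made up for this example.

```python
def bigrams_sketch(tokens):
    """Return a list of adjacent word pairs instead of printing them."""
    # zip stops at the shorter input, so the final word never reaches past the end
    return list(zip(tokens, tokens[1:]))

pairs = bigrams_sketch(['we', 'live', 'in', 'a', 'society', '!'])
print(pairs)
```

Returning a list makes the result easier to reuse later, for example as input to a frequency count.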


### **Your Turn**

Spend some time using `nltk.bigrams()` on some text/strings. Make sure you understand what it is doing, and also compare the output to the `bootleg_bigram()` function.

In [None]:
# play with nltk.bigrams() here.

## Finding collocations

The `collocations()` function will give us the bigrams which are unusually frequent when also considering the frequency of the individual words in the bigrams. If you look under the hood in the NLTK docs, you'll find they use calculations from [this paper](https://aclanthology.org/J90-1003.pdf) to determine strength of association (i.e., to distinguish collocations from bigrams).

Let's return to some of the built-in texts and examine their collocations. We need to download the NLTK resources.



In [5]:
# bring in the nltk resources
import nltk
nltk.download('book')
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908



### **Your Turn**

Examine the collocations for `text6` and `text9`.
- What do you think it is about the text which is creating these collocations?
- Do you think these same collocations would be found in the other texts?
- What might this tell us about the power of using collocations / ngrams if we wanted to predict where documents came from (e.g., guessing the genre, guessing the author)?

In [46]:
# what are the collocations of Holy Grail?
print(text6, '\n')

text6.collocations()

<Text: Monty Python and the Holy Grail> 

BLACK KNIGHT; clop clop; HEAD KNIGHT; mumble mumble; Holy Grail;
squeak squeak; FRENCH GUARD; saw saw; Sir Robin; Run away; CARTOON
CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round
Table; clap clap; OLD MAN; dramatic chord; dona eis


In [47]:
# examine the collocations of text9.
# What do you know about this book, without having read it?
print(text9, '\n')

text9.collocations()

<Text: The Man Who Was Thursday by G . K . Chesterton 1908> 

said Syme; asked Syme; Saffron Park; Comrade Gregory; Leicester
Square; Colonel Ducroix; red hair; old gentleman; could see; Inspector
Ratcliffe; Anarchist Council; blue card; Scotland Yard; dark room;
blue eyes; common sense; straw hat; hundred yards; said Gregory; run
away


Look at some collocations from the other book texts. Do the collocations make sense based on what you know about the books?

In [99]:
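import nltk

# keep the comment text (second tab-separated column) from each line of the file
with open("./the-current/tp001.txt", encoding="utf8") as f:
  tp001_comments = [line.split('\t')[1] for line in f.read().rstrip().split('\n')]

# join the comments, tokenize them, and wrap the tokens in an NLTK Text object
tp001 = nltk.Text(nltk.word_tokenize(' '.join(tp001_comments)))

tp001.collocations()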


petrol cars; fossil fuels; electric cars; climate change; global
warming; public transport; future generations; renewable energy; good
idea; carbon emissions; fossil fuel; electric car; right direction;
get rid; new zealand; greenhouse gas; carbon dioxide; dont care; hello
hello; step towards


## **How do people use LOL?**

Now, while collocations can tell us about a text in general, we can also use bigrams as a means to explore a targeted use of language we might be interested in.

Let's consider `text5`, the webchat corpus. Perhaps we want to know how people use "lol", regardless of whether "lol" is a collocation or not. To do so, we can conditionally sort through the bigrams of the text.

We will first obtain all the bigrams of `text5`. Then we will print out the bigrams only if they contain the acronym `lol` or `LOL`. Perhaps this will tell us how LOL is used?

In [53]:
# first get the bigrams
webchat_bigrams = list(bigrams(text5))

# we can see that there is going to be a lot of them!
len(webchat_bigrams)

45009

In [55]:
# inspect a random part of the bigrams
webchat_bigrams[1337:1350]

[('sits', 'in'),
 ('in', 'the'),
 ('the', 'corner'),
 ('corner', '-'),
 ('-', '.'),
 ('.', 'JOIN'),
 ('JOIN', 'phone'),
 ('phone', 'U92'),
 ('U92', '?'),
 ('?', 'hello'),
 ('hello', 'U84'),
 ('U84', '.'),
 ('.', 'ACTION')]

Now that we have obtained all of the bigrams, let's use a list comprehension to find the bigrams which contain variations of LOL:

In [74]:
# create a new object named lol_grams
# which are the bigrams of webchat_bigrams only if they contain the target word
# (this run searches for the variant 'lmao'/'LMAO'; swap in 'lol'/'LOL' to follow the discussion)
lol_grams = [gram for gram in webchat_bigrams if 'lmao' in gram or 'LMAO' in gram]

# we have a good number!
len(lol_grams)

252

In [75]:
# you can examine the bigrams here
# what do you notice?
sorted(set(lol_grams))

[('!', 'lmao'),
 ('-', 'lmao'),
 ('.', 'LMAO'),
 ('.', 'lmao'),
 ('. .', 'LMAO'),
 ('..', 'LMAO'),
 ('...', 'lmao'),
 ('....', 'lmao'),
 ('...........', 'lmao'),
 ('1', 'lmao'),
 (':)', 'LMAO'),
 (':)', 'lmao'),
 (':|', 'lmao'),
 ('<empty>', 'lmao'),
 ('?', 'lmao'),
 ('??', 'lmao'),
 ('???', 'lmao'),
 ('JOIN', 'LMAO'),
 ('JOIN', 'lmao'),
 ('Jerketts', 'lmao'),
 ('Kentucky', 'lmao'),
 ('LMAO', '.'),
 ('LMAO', '2'),
 ('LMAO', '@'),
 ('LMAO', 'And'),
 ('LMAO', 'JOIN'),
 ('LMAO', 'Randy'),
 ('LMAO', 'U27'),
 ('LMAO', 'U47'),
 ('LMAO', 'U53'),
 ('LMAO', 'U65'),
 ('LMAO', 'U7'),
 ('LMAO', 'U91'),
 ('LMAO', 'awww'),
 ('LMAO', 'damn'),
 ('LMAO', 'i'),
 ('LoL', 'lmao'),
 ('Lol', 'lmao'),
 ('PART', 'LMAO'),
 ('PART', 'lmao'),
 ('U105', 'lmao'),
 ('U144', 'lmao'),
 ('U20', 'lmao'),
 ('U21', 'lmao'),
 ('U22', 'lmao'),
 ('U25', 'lmao'),
 ('U26', 'lmao'),
 ('U30', 'lmao'),
 ('U32', 'lmao'),
 ('U35', 'lmao'),
 ('U36', 'lmao'),
 ('U41', 'lmao'),
 ('U43', 'LMAO'),
 ('U50', 'LMAO'),
 ('U52', 'lmao'),
 (

There's a lot of stuff in that output, mainly because of the way usernames are represented. This is somewhat interesting/useful because it suggests 'lol/LOL' might be the first/only thing many people type (which makes sense). But we are more interested in seeing how 'lol/LOL' pairs with other words. There was also a lot of punctuation joined with 'lol/LOL'.

Let's try cleaning it up a bit. We can add a condition that requires both words in the bigram to be `.isalpha()`. Why might this work? Because `.isalpha()` only returns `True` if every character in a string is a letter (a-z/A-Z, as well as letters from other alphabets). Any punctuation *or* numbers will cause `.isalpha()` to evaluate to `False`.


In [76]:
# Description of str.isalpha
help(str.isalpha)

Help on method_descriptor:

isalpha(self, /)
    Return True if the string is an alphabetic string, False otherwise.
    
    A string is alphabetic if all characters in the string are alphabetic and there
    is at least one character in the string.
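A quick way to see this behaviour is to try `.isalpha()` on a few strings of the kind found in the chat data. The sample list below is made up for illustration.

```python
# usernames, punctuation, emoticons and the empty string all fail the check
samples = ['lol', 'LOL', 'U92', '...', ':)', '', 'caf\u00e9']
results = {s: s.isalpha() for s in samples}

for s, is_word in results.items():
    print(repr(s), is_word)
```

Only `'lol'`, `'LOL'` and the accented word pass, which is why the filter removes usernames such as `U92` along with the punctuation.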



In [77]:
# now use .isalpha() to only capture lol or LOL with other words
lol_grams2 = [gram for gram in lol_grams if gram[0].isalpha() and gram[1].isalpha()]
len(lol_grams2)

123

Doing this really cleans up the output, although we still have *some* words in there that probably aren't what we want (like the JOIN messages).

In [78]:
# it's a lot easier to see these now
sorted(set(lol_grams2))

[('JOIN', 'LMAO'),
 ('JOIN', 'lmao'),
 ('Jerketts', 'lmao'),
 ('Kentucky', 'lmao'),
 ('LMAO', 'And'),
 ('LMAO', 'JOIN'),
 ('LMAO', 'Randy'),
 ('LMAO', 'awww'),
 ('LMAO', 'damn'),
 ('LMAO', 'i'),
 ('LoL', 'lmao'),
 ('Lol', 'lmao'),
 ('PART', 'LMAO'),
 ('PART', 'lmao'),
 ('WHISPER', 'lmao'),
 ('a', 'lmao'),
 ('all', 'lmao'),
 ('arrested', 'lmao'),
 ('away', 'lmao'),
 ('banned', 'lmao'),
 ('crap', 'lmao'),
 ('elo', 'lmao'),
 ('everyone', 'lmao'),
 ('fly', 'lmao'),
 ('gay', 'lmao'),
 ('good', 'lmao'),
 ('goodness', 'lmao'),
 ('guess', 'lmao'),
 ('hehe', 'lmao'),
 ('hehehehe', 'lmao'),
 ('here', 'lmao'),
 ('hour', 'lmao'),
 ('it', 'LMAO'),
 ('it', 'lmao'),
 ('kitchen', 'lmao'),
 ('laffs', 'lmao'),
 ('late', 'lmao'),
 ('lmao', 'Break'),
 ('lmao', 'CO'),
 ('lmao', 'I'),
 ('lmao', 'JOIN'),
 ('lmao', 'PART'),
 ('lmao', 'Something'),
 ('lmao', 'ahah'),
 ('lmao', 'ahhhh'),
 ('lmao', 'bbl'),
 ('lmao', 'busy'),
 ('lmao', 'do'),
 ('lmao', 'from'),
 ('lmao', 'hahahaha'),
 ('lmao', 'he'),
 ('lmao', 'h

### forwards and backwards lol_grams

Let's now see if we can discern any interesting patterns with how lol/LOL is used. We'll create three sublists from `lol_grams2`. These sublists will be:

- all bigrams which start with lol/LOL and the second word is not lol/LOL
- all bigrams which end with lol/LOL and the first word is not lol/LOL
- all bigrams where the first and second words are either lol/LOL

I'll define all three below in one cell.

In [71]:
# target spellings; this run uses 'lmao'/'LMAO', the same words used to build lol_grams
targets = ['lmao', 'LMAO']

forward_lolgrams = [gram for gram in lol_grams2 if gram[0] in targets and gram[1] not in targets]
backwards_lolgrams = [gram for gram in lol_grams2 if gram[1] in targets and gram[0] not in targets]
double_lolgrams = [gram for gram in lol_grams2 if gram[0] in targets and gram[1] in targets]

In [72]:
# what does this distribution tell us?
print('forward lols:', len(forward_lolgrams),
      '\n', 'backwards lols:', len(backwards_lolgrams),
      '\n', 'double lols:', len(double_lolgrams))

forward lols: 5 
 backwards lols: 7 
 double lols: 0


In [73]:
# explore the forward lolgrams
sorted(set(forward_lolgrams))

[('lmao', 'lol')]

In [69]:
# explore the backwards lolgrams
sorted(set(backwards_lolgrams))

[('Dixie', 'lol'),
 ('Drew', 'lol'),
 ('Foxwoods', 'lol'),
 ('JOIN', 'LOL'),
 ('JOIN', 'lol'),
 ('LoL', 'lol'),
 ('No', 'lol'),
 ('PART', 'LOL'),
 ('PART', 'lol'),
 ('PMs', 'lol'),
 ('Saturday', 'lol'),
 ('Werd', 'lol'),
 ('Yes', 'LOL'),
 ('about', 'lol'),
 ('again', 'LOL'),
 ('again', 'lol'),
 ('ahem', 'lol'),
 ('aim', 'lol'),
 ('all', 'LOL'),
 ('alot', 'lol'),
 ('already', 'lol'),
 ('amazing', 'lol'),
 ('anytime', 'lol'),
 ('are', 'lol'),
 ('around', 'lol'),
 ('away', 'lol'),
 ('baby', 'lol'),
 ('back', 'lol'),
 ('bad', 'lol'),
 ('banjoes', 'lol'),
 ('bar', 'lol'),
 ('barfights', 'lol'),
 ('be', 'lol'),
 ('biebsa', 'LOL'),
 ('bones', 'LOL'),
 ('booty', 'lol'),
 ('bot', 'lol'),
 ('boy', 'lol'),
 ('brb', 'LOL'),
 ('brrrrrrr', 'LOL'),
 ('bye', 'lol'),
 ('byeeeeeeeeeeeee', 'lol'),
 ('cardinals', 'lol'),
 ('care', 'lol'),
 ('change', 'LOL'),
 ('chat', 'lol'),
 ('churches', 'LOL'),
 ('comment', 'lol'),
 ('cool', 'lol'),
 ('damn', 'lol'),
 ('dang', 'LOL'),
 ('development', 'lol'),
 ('did', 

In [70]:
# explore the double lolgrams
# does it make sense to you why this sorted set is so short?
sorted(set(double_lolgrams))

[('LOL', 'lol'), ('lol', 'LOL'), ('lol', 'lol')]

### **Your Turn**

Spend some time looking through the distribution of `lol/LOL` that I've created.

- Can you draw any conclusions about how these words might be used in terms of how they pattern with other words?
- What tweaks or changes might you make to my code?
- What other bigrams might be interesting to search for?
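One possible tweak, sketched here under some assumptions: lowercase everything so that `lol`, `LOL` and `Lol` count as the same word, then tally the neighbouring words with `collections.Counter` instead of reading through a long sorted list. The `toy_bigrams` list is toy data loosely modelled on the output above, not the real corpus.

```python
from collections import Counter

# toy bigrams in the style of the chat corpus output (not the full data)
toy_bigrams = [('JOIN', 'lol'), ('PART', 'LOL'), ('again', 'lol'),
               ('again', 'LOL'), ('lol', 'hi'), ('Lol', 'lol')]

targets = {'lol'}

# lowercase both words so every spelling of the target is treated alike
lowered = [(w1.lower(), w2.lower()) for w1, w2 in toy_bigrams]

# count the words that come directly before the target word
before = Counter(w1 for w1, w2 in lowered if w2 in targets and w1 not in targets)
print(before.most_common())
```

The same pattern with `w1` and `w2` swapped would count the words that come directly after the target.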

## Bigrams and FreqDist

Remember that collocations are bigrams which occur more frequently than we would expect given how frequent each word is on its own. The calculation for a collocation therefore compares the frequency of the word pair against the overall frequency of the individual words.

Can we create a makeshift collocation counter using a combination of FreqDist and bigrams?

First, we would want to generate a list of bigrams for a text. Let's use some data from The Current.

In [80]:
# statement - NZ tourism should be limited to protect the environment
# load the TP011 data to the notebook environment
# !wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp011.txt'

# read in the entire file
tp011 = open('./the-current/tp011.txt', encoding="utf8").read().rstrip()

# remove any punctuation
import re
punctuation = '[#.,!\'"-]'
tp011 = re.sub(pattern = punctuation, repl = '', string = tp011)

In [81]:
# extract the comments
tp011_comments = [comment.split('\t')[1] for comment in tp011.split('\n')]

# look at the results. Notice the second comment contains garbled characters.
tp011_comments[:5]

['individuals should pay compenstion to account for the negative externalities of their consumption',
 'we needto protect our   environment ?\x81nd l?©m?©t n\x8f©mb??rs',
 'that they should feel hopeful that they areold',
 'that  this is an outrage becase this going to spread the delta varient',
 'that the guy next to me is writing this and he israge']

Let's clean up those words that have symbols etc using `.isalpha()`.

I will do so with a nested list comprehension.

Note that the expression of the list comprehension includes an internal list comprehension.

This list comprehension loops over comments, then for each comment loops over the results of running `.split()` on the comment. The `.isalpha()` check removes any word that does not consist entirely of letters. So any words with numbers, symbols, or other non-letter characters get removed. This is a crude way to remove data, but is effective for our purposes here.


In [82]:
# split each comment, then only keep each word which .isalpha(), then glue the comment back together
tp011_comments = [' '.join([word for word in comment.split() if word.isalpha()]) for comment in tp011_comments]

In [83]:
# look at the results - the second comment no longer has those words full of symbols.
tp011_comments[:5]

['individuals should pay compenstion to account for the negative externalities of their consumption',
 'we needto protect our environment',
 'that they should feel hopeful that they areold',
 'that this is an outrage becase this going to spread the delta varient',
 'that the guy next to me is writing this and he israge']
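If the nested comprehension is hard to read, it may help to see it unrolled into ordinary loops. This sketch uses two made-up comments (`comments`) rather than the real data, and checks that both versions give the same result.

```python
comments = ['we needto protect our   environment ?nd l?m?t', 'ban cars 4ever']

cleaned = []
for comment in comments:                 # outer loop: one comment at a time
    kept_words = []
    for word in comment.split():         # inner loop: one word at a time
        if word.isalpha():               # keep only words made entirely of letters
            kept_words.append(word)
    cleaned.append(' '.join(kept_words)) # glue the surviving words back together

# the one-line nested comprehension builds the same list
one_liner = [' '.join([w for w in c.split() if w.isalpha()]) for c in comments]
print(cleaned)
```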

Now, this data is in a list of individual comments. It would not make much sense to combine the comments into a single string if we want to create bigrams, because we don't want to pair the end of one comment with the start of another comment.

So, let's create an individual list of bigrams for each comment.

In [84]:
# tokenize each comment
tp011_comment_tokens = [nltk.word_tokenize(comment) for comment in tp011_comments]

# turn tokens into bigrams
tp011_comments_bigrams = [list(nltk.bigrams(comment)) for comment in tp011_comment_tokens]

In [85]:
# we now have a list of comments split into bigrams!
tp011_comments_bigrams[:3]

[[('individuals', 'should'),
  ('should', 'pay'),
  ('pay', 'compenstion'),
  ('compenstion', 'to'),
  ('to', 'account'),
  ('account', 'for'),
  ('for', 'the'),
  ('the', 'negative'),
  ('negative', 'externalities'),
  ('externalities', 'of'),
  ('of', 'their'),
  ('their', 'consumption')],
 [('we', 'needto'),
  ('needto', 'protect'),
  ('protect', 'our'),
  ('our', 'environment')],
 [('that', 'they'),
  ('they', 'should'),
  ('should', 'feel'),
  ('feel', 'hopeful'),
  ('hopeful', 'that'),
  ('that', 'they'),
  ('they', 'areold')]]

Now that we have lists of bigrams, we can join them into one giant list, since we know the comments have not cross-contaminated one another. We can't use `''.join()` for this, because `.join()` only glues strings together and we have a list of lists of tuples.

One solution is to use a loop to glue the list into another list using list concatenation. There is likely a more elegant way to do this, but this way gets the job done pretty quick.

In [86]:
# create an empty output list
tp011_bigrams_joined = []

# loop through and add each comment's bigrams to the list
for comment_bigrams in tp011_comments_bigrams:
  tp011_bigrams_joined = tp011_bigrams_joined + comment_bigrams

In [87]:
# inspect the output
tp011_bigrams_joined[:20]

[('individuals', 'should'),
 ('should', 'pay'),
 ('pay', 'compenstion'),
 ('compenstion', 'to'),
 ('to', 'account'),
 ('account', 'for'),
 ('for', 'the'),
 ('the', 'negative'),
 ('negative', 'externalities'),
 ('externalities', 'of'),
 ('of', 'their'),
 ('their', 'consumption'),
 ('we', 'needto'),
 ('needto', 'protect'),
 ('protect', 'our'),
 ('our', 'environment'),
 ('that', 'they'),
 ('they', 'should'),
 ('should', 'feel'),
 ('feel', 'hopeful')]
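One more compact alternative to the concatenation loop is `itertools.chain.from_iterable()` from the standard library, which flattens a list of lists in a single pass. The `per_comment` list below is a small made-up stand-in for the per-comment bigram lists.

```python
from itertools import chain

# a made-up list of per-comment bigram lists (one comment has no bigrams at all)
per_comment = [[('we', 'need'), ('need', 'to')], [], [('i', 'think')]]

# chain.from_iterable walks through each inner list in turn
flat = list(chain.from_iterable(per_comment))
print(flat)
```

On large data this also avoids building a new list on every loop iteration, which is what repeated `a = a + b` does.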

Now that we have a single list of bigrams, we can use `FreqDist()` to create a frequency distribution of the bigrams.



In [88]:
# create the frequency distribution
tp011_bigram_fdist = nltk.FreqDist(tp011_bigrams_joined)

In [89]:
# we see a frequency distribution of bigrams
tp011_bigram_fdist

FreqDist({('we', 'need'): 100, ('need', 'to'): 94, ('i', 'think'): 82, ('we', 'should'): 59, ('the', 'environment'): 55, ('to', 'protect'): 51, ('our', 'environment'): 45, ('will', 'be'): 45, ('is', 'a'): 40, ('we', 'can'): 40, ...})

What are the most frequent bigrams in this data?


In [90]:
tp011_bigram_fdist.most_common(10)

[(('we', 'need'), 100),
 (('need', 'to'), 94),
 (('i', 'think'), 82),
 (('we', 'should'), 59),
 (('the', 'environment'), 55),
 (('to', 'protect'), 51),
 (('our', 'environment'), 45),
 (('will', 'be'), 45),
 (('is', 'a'), 40),
 (('we', 'can'), 40)]

The most frequent bigram in this output is "we need". Does this mean it is also a collocation? Potentially. In order to fully investigate this, we need to compare the individual word frequency against the bigram frequency. So, we could create a frequency distribution of the entire comment section to obtain single word frequency, and compare that to bigram frequency.

In [91]:
# create FreqDist from tp011
# run the same line extracting just the comments, but join the results into a single string.
tp011_raw = ' '.join([comment.split('\t')[1] for comment in tp011.split('\n')])


In [92]:
tp011_raw[:150]

'individuals should pay compenstion to account for the negative externalities of their consumption we needto protect our   environment ?\x81nd l?©m?©t n\x8f©'

In [93]:
# create a FreqDist of the entire set of words
tp011_fdist = nltk.FreqDist(nltk.word_tokenize(tp011_raw))

Now that we have a frequency distribution of both single words and bigrams, we can compare individual word frequencies against bigram frequencies. One simple way to gauge the strength of attraction between two words is the proportion of a word's occurrences that are followed by one particular word, compared to all other words. This is related to word probabilities, which are covered later, but for now we can view this as a percentage of occurrence.

In [94]:
# the word "we" occurs 365 times
tp011_fdist['we']

365

In [95]:
# the bigram 'we need' occurs 100 times
tp011_bigram_fdist[('we', 'need')]

100

So out of the 365 times the word "we" occurs, 100 of those times are with the word "need". This means about 27% of the occurrences of "we" are in this combination, which could be taken as evidence that this is a strong collocation for "we".

In order to fully verify this, we would want to find all of the other bigrams that "we" occurs in. How could we do this? Firstly, we could obtain a set of all the bigrams that start with 'we', using a similar strategy to the backwards and forwards lolgrams shown above.

In [96]:
# obtain a set of all bigrams starting with we
we_bigrams = set([bigram for bigram in tp011_bigrams_joined if bigram[0] == 'we'])
we_bigrams

{('we', 'LIVE'),
 ('we', 'all'),
 ('we', 'alreadey'),
 ('we', 'also'),
 ('we', 'alsocneed'),
 ('we', 'and'),
 ('we', 'are'),
 ('we', 'arent'),
 ('we', 'aresocial'),
 ('we', 'as'),
 ('we', 'bring'),
 ('we', 'cab'),
 ('we', 'can'),
 ('we', 'cant'),
 ('we', 'care'),
 ('we', 'could'),
 ('we', 'create'),
 ('we', 'currently'),
 ('we', 'do'),
 ('we', 'dont'),
 ('we', 'educate'),
 ('we', 'expect'),
 ('we', 'fix'),
 ('we', 'get'),
 ('we', 'geta'),
 ('we', 'gethome'),
 ('we', 'grow'),
 ('we', 'have'),
 ('we', 'havent'),
 ('we', 'havewhilst'),
 ('we', 'just'),
 ('we', 'know'),
 ('we', 'left'),
 ('we', 'let'),
 ('we', 'like'),
 ('we', 'limit'),
 ('we', 'live'),
 ('we', 'lived'),
 ('we', 'loss'),
 ('we', 'made'),
 ('we', 'manage'),
 ('we', 'may'),
 ('we', 'maymiss'),
 ('we', 'might'),
 ('we', 'must'),
 ('we', 'nead'),
 ('we', 'ned'),
 ('we', 'need'),
 ('we', 'needto'),
 ('we', 'only'),
 ('we', 'owe'),
 ('we', 'put'),
 ('we', 're'),
 ('we', 'really'),
 ('we', 'rely'),
 ('we', 'restrict'),
 ('we', 's

Having obtained this set, we could find the relative proportion of each bigram compared to the overall occurrence of "we".

Look at the output. You'll see that 'we need' is the most common pair and accounts for the largest share of "we". From this data, we can conclude that "we" comes before "need" at a much higher rate than before most other words. What is the second strongest collocation in this data?

And, what could be done to improve this code so that it automatically finds the strongest collocations? Right now it just prints the data, which means you have to use your human eyes and brain to locate the strongest and weakest collocations.

In [97]:
for wb in we_bigrams:
  # print frequency of bigram then percentage of bigram based on word occurrence
  print(f'bigram {wb} occurs {tp011_bigram_fdist[wb]} times, for a total % of {(tp011_bigram_fdist[wb]/tp011_fdist["we"])*100}')

bigram ('we', 'need') occurs 100 times, for a total % of 27.397260273972602
bigram ('we', 'dont') occurs 9 times, for a total % of 2.4657534246575343
bigram ('we', 'restrict') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'nead') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'as') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'arent') occurs 2 times, for a total % of 0.547945205479452
bigram ('we', 'treateriurenvironment') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'currently') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'left') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'shoud') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'have') occurs 28 times, for a total % of 7.671232876712329
bigram ('we', 'still') occurs 4 times, for a total % of 1.095890410958904
bigram ('we', 'manage') occurs 1 times, for a total % of 0.273972602739726
bigram ('we', 'sh
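One way to answer the question posed before the loop is to sort the bigrams by their percentage rather than printing them in arbitrary set order. This is only a sketch: the helper name `ranked_followers` is made up, and the two `Counter` objects hold a handful of counts copied from the outputs above rather than the full frequency distributions.

```python
from collections import Counter

# counts copied from the outputs above; only a few bigrams are included
word_counts = Counter({'we': 365})
bigram_counts = Counter({('we', 'need'): 100, ('we', 'should'): 59,
                         ('we', 'can'): 40, ('we', 'have'): 28,
                         ('we', 'dont'): 9})

def ranked_followers(word, bigram_counts, word_counts):
    """Return (bigram, percent of the word's occurrences), strongest first."""
    rows = [(bg, 100 * n / word_counts[word])
            for bg, n in bigram_counts.items() if bg[0] == word]
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranking = ranked_followers('we', bigram_counts, word_counts)
for bg, pct in ranking:
    print(f'{bg}: {pct:.1f}%')
```

Since an NLTK `FreqDist` behaves like a `Counter`, the same function should work if `tp011_bigram_fdist` and `tp011_fdist` are passed in instead of the toy counts.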

### Using NLTK collocation function on our data

We can also convert our text into an NLTK Text object to locate collocations.


In [98]:
# supply a list of tokens to nltk.Text()
tp011_nltk_corpus = nltk.Text(nltk.word_tokenize(tp011_raw))

Look at the top 25 strongest collocations from the text - what do you notice? Crucially, we see that "we need" is not among the collocations! Instead, we have a range of other collocations. What is the difference?

Well, simply put, [NLTK uses more advanced calculations to find collocations](https://tedboy.github.io/nlps/_modules/nltk/collocations.html). What we see below are word pairs which are strong based on the frequency of *both* sides of the bigram. So not only is "new" very likely to be followed by "zealand", "zealand" is also very likely to be preceded by "new". This two-way comparison takes more context into account than the simple one-directional percentages calculated above. In recent NLTK versions, `Text.collocations()` also discards bigrams containing stopwords or words shorter than three characters before scoring, which removes "we need" from consideration.

The example of `'blah blah'` might also help make this point: how likely is it to see the word 'blah' in different word contexts? Not very. Therefore, when we see the word "blah", we have a very high expectation of seeing another "blah" coming afterwards.

Compare that with "we": the data above showed that even though "need" occurred at a high rate after "we", the word "we" was still found in many other word contexts.

The NLTK `.collocations()` function is thus a powerful and fast way to locate collocations in a text. But going through the manual process above also provides some good practice and understanding of what is going into these calculations.

In [100]:
tp011_nltk_corpus.collocations(num = 25)

good idea; new zealand; dont care; great idea; many people; come back;
tourist numbers; New Zealand; mother earth; less people; stay safe;
short term; limit numbers; many tourists; New Zealands; boarders
closed; natural resources; beautiful country; non residents; freedom
campers; long term; blah blah; financial gain; dont want; carbon
emmisions
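To illustrate the two-sided idea described above, the sketch below measures each pair in both directions: how often the first word is followed by the second, and how often the second word is preceded by the first. This is not NLTK's actual scoring method, just a simplified illustration on a made-up token list; the helper name `two_sided` is hypothetical.

```python
from collections import Counter

# a made-up token list, small enough to check by hand
tokens = ('we need new zealand we should visit new zealand '
          'we can help they need change we need time').split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def two_sided(w1, w2):
    """Share of w1 followed by w2, and share of w2 preceded by w1."""
    pair = bigram_counts[(w1, w2)]
    return pair / word_counts[w1], pair / word_counts[w2]

for pair in [('we', 'need'), ('new', 'zealand')]:
    forward, backward = two_sided(*pair)
    print(pair, f'forward={forward:.2f}', f'backward={backward:.2f}')
```

In this toy data "new zealand" is exclusive in both directions, whereas "we" also appears before other words and "need" also appears after other words, so "we need" looks weaker from both sides.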


## **Your Turn**

Consider looking at some other bigrams in some data, or converting texts and exploring collocations with the NLTK `Text()` class.

