# Week 06: Collocation Extraction
In Assignment 5, we found all skip-grams and their frequencies in <u>*wiki1G.txt*</u>. This week, we will use that result to extract collocations of [AKL verbs](https://uclouvain.be/en/research-institutes/ilc/cecl/academic-keyword-list.html) with [Smadja’s algorithm](https://aclanthology.org/J93-1007.pdf). First, some basic terms need to be explained.

We take "*depend*" as an example:

<img src="https://imgur.com/cPyd7Gr.jpg" >

In this case, we want to find the collocations of "depend". Here "depend" is the **base word**, denoted $W$. Words such as "on", "the", and "for" are its **collocates**, denoted $W_{i}$, where **i** is the index of the collocate. $P_{j}$ is the frequency with which $W_{i}$ occurs at distance j from $W$, and **Freq** is the sum of these frequencies over all distances.

Three conditions are used to filter the skip-grams and find collocations. We will go through them one by one below.

Since some students did not complete Assignment 5, we provide a file of precomputed skip-grams with their frequencies, called **AKL_skipgram.tsv**, so that everyone can do Assignment 6. It keeps only the skip-grams that contain an AKL verb.
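As a rough illustration of the terms above, the sketch below unpacks one row into base word, collocate, **Freq**, and the ten per-distance counts. The column layout (base word, collocate, total, then counts for distances -5 to -1 and 1 to 5) is an assumption inferred from the sample rows and outputs shown in this notebook, and `parse_row`, `DISTANCES`, and the toy row are hypothetical names and values made up for the example.

```python
# Assumed column order of the ten counts: distances -5..-1, then 1..5.
DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

def parse_row(fields):
    """Split one skip-gram row into (base word, collocate, Freq, {distance: P_j})."""
    base, collocate, freq = fields[0], fields[1], int(fields[2])
    counts = [int(c) for c in fields[3:13]]
    return base, collocate, freq, dict(zip(DISTANCES, counts))

# Toy row, invented for illustration (not taken from AKL_skipgram.tsv).
toy_row = "depend\texample\t12\t0\t1\t0\t2\t3\t4\t1\t0\t1\t0".split("\t")
base, collocate, freq, p = parse_row(toy_row)

print(base, collocate, freq)
print("P at distance 1:", p[1])
```

Under this layout, **Freq** should equal the sum of the ten per-distance counts, which is a cheap sanity check to run on real rows.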

## Read Data
<font color="red">**[ TODO ]**</font> Please read <u>*AKL_skipgram.tsv*</u> and store it in the way you like.

In [30]:
#### Hyperparameters
k0 = 1   # C1: minimum strength
k1 = 1   # C3: multiplier of sqrt(spread) in the peak threshold
U0 = 10  # C2: minimum spread
base_word = "depend"

In [31]:
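## read file here
import os

# Each row: base word, collocate, total frequency, then 10 per-distance counts.
whole_dataset = []
with open(os.path.join('.', 'AKL_skipgram.tsv'), encoding='utf-8') as f:
    for line in f:
        fields = line.split()
        if fields:  # skip blank lines
            whole_dataset.append(fields)
whole_dataset#[:10]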

[['0', 'accept', '2', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0'],
 ['0', 'account', '29', '3', '4', '14', '7', '0', '0', '0', '0', '0', '1'],
 ['0', 'achieve', '13', '4', '3', '2', '0', '4', '0', '0', '0', '0', '0'],
 ['0', 'acquire', '2', '0', '0', '0', '0', '0', '0', '0', '1', '0', '1'],
 ['0', 'act', '12', '4', '1', '1', '2', '0', '0', '3', '0', '0', '1'],
 ['0', 'adopt', '2', '0', '1', '0', '0', '0', '0', '0', '1', '0', '0'],
 ['0', 'advance', '7', '0', '1', '1', '0', '1', '0', '3', '0', '0', '1'],
 ['0', 'affect', '5', '0', '1', '1', '1', '1', '1', '0', '0', '0', '0'],
 ['0', 'aid', '13', '2', '1', '3', '5', '0', '0', '0', '1', '1', '0'],
 ['0', 'aim', '2', '0', '0', '0', '1', '0', '0', '0', '1', '0', '0'],
 ['0', 'allocate', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
 ['0', 'allow', '29', '3', '4', '2', '1', '0', '0', '3', '4', '3', '9'],
 ['0', 'alter', '1', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0'],
 ['0', 'appear', '15', '2', '4', '2', '0', '0', '0'

## C1 Condition
C1 helps eliminate the collocates that are not frequent enough. This condition specifies that the frequency of appearance of $W_{i}$ in the neighborhood of $W$ must be at least one standard deviation above the average.

The formula is here:

$$strength = \frac{freq - \bar{f}}{\sigma} \geq k_{0} = 1$$

where $freq$ is the total frequency of a given collocate (e.g., 2573 for "on"),

$\bar{f}$ is the average frequency over all collocates of the base word, and

${\sigma}$ is the standard deviation of those frequencies.
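To make the formula concrete, here is a minimal sketch on invented toy frequencies. It assumes the population standard deviation (dividing by the number of collocates), and the names `toy_freqs`, `strengths`, and `kept` are hypothetical.

```python
from statistics import mean, pstdev

# Toy collocate frequencies (invented for illustration, not taken from the data).
toy_freqs = {"on": 50, "the": 20, "for": 10, "a": 10, "of": 10}
k0 = 1

values = list(toy_freqs.values())
f_bar = mean(values)    # average frequency over all collocates
sigma = pstdev(values)  # population standard deviation (divide by N)

# strength = (freq - f_bar) / sigma, computed for every collocate
strengths = {w: (f - f_bar) / sigma for w, f in toy_freqs.items()}

# C1 keeps only the collocates whose strength reaches k0
kept = {w: round(s, 3) for w, s in strengths.items() if s >= k0}
print(kept)
```

In this toy case only "on" lies at least one standard deviation above the mean, so it is the only collocate that survives C1.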

<font color="red">**[ TODO ]**</font> Please apply this condition to the skip-grams of "depend" and keep those that pass.

The output should contain each `collocate` with its `strength`.

In [46]:
def C1_filter(base_word):
    # Collect every skip-gram row whose base word matches.
    rows = [line for line in whole_dataset if line[0] == base_word]
    freqs = [int(line[2]) for line in rows]
    f_bar = sum(freqs) / len(freqs)
    # Population standard deviation of the collocate frequencies.
    sigma = (sum((f - f_bar) ** 2 for f in freqs) / len(freqs)) ** 0.5
    base_word_lines = []
    for line, freq in zip(rows, freqs):
        strength = (freq - f_bar) / sigma
        if strength >= k0:
            # Copy the row so whole_dataset stays untouched; strength goes to index 13.
            base_word_lines.append(line[:13] + [strength])
            print(line[1], "{strength:", round(strength, 3), '}')

    return base_word_lines

In [47]:
filtered_by_C1 = C1_filter(base_word)
### Print

a {strength: 6.381 }
all {strength: 1.151 }
also {strength: 1.133 }
an {strength: 1.367 }
and {strength: 15.183 }
are {strength: 1.962 }
as {strength: 2.395 }
but {strength: 1.529 }
by {strength: 1.042 }
can {strength: 1.421 }
do {strength: 1.656 }
does {strength: 5.299 }
for {strength: 4.686 }
formula {strength: 1.565 }
in {strength: 5.876 }
is {strength: 2.611 }
it {strength: 2.287 }
its {strength: 1.818 }
may {strength: 2.864 }
not {strength: 8.437 }
of {strength: 23.461 }
on {strength: 46.313 }
only {strength: 1.295 }
or {strength: 2.485 }
other {strength: 1.656 }
properties {strength: 1.042 }
s {strength: 2.161 }
some {strength: 1.187 }
such {strength: 1.439 }
that {strength: 7.247 }
the {strength: 44.707 }
their {strength: 2.828 }
these {strength: 1.944 }
they {strength: 2.233 }
this {strength: 1.908 }
to {strength: 8.419 }
type {strength: 1.295 }
upon {strength: 4.902 }
which {strength: 4.379 }
will {strength: 3.784 }
would {strength: 1.601 }


<font color="green">Expected output: </font> (The order isn't important.)

> a {'strength': 6.381}   
> all {'strength': 1.151}   
> also {'strength': 1.133}   
> an {'strength': 1.367}   
> and {'strength': 15.183}   
> are {'strength': 1.962}   
> as {'strength': 2.395}   
> but {'strength': 1.529}   
> by {'strength': 1.042}   
> can {'strength': 1.421}   
> do {'strength': 1.656}   
> does {'strength': 5.299}   
> for {'strength': 4.686}   
> formula {'strength': 1.565}   
> in {'strength': 5.876}   
> is {'strength': 2.611}   
> it {'strength': 2.287}   
> its {'strength': 1.818}   
> may {'strength': 2.864}   
> not {'strength': 8.437}   
> of {'strength': 23.461}   
> on {'strength': 46.313}   
> only {'strength': 1.295}   
> or {'strength': 2.485}   
> other {'strength': 1.656}   
> properties {'strength': 1.042}   
> s {'strength': 2.161}   
> some {'strength': 1.187}   
> such {'strength': 1.439}   
> that {'strength': 7.247}   
> the {'strength': 44.707}   
> their {'strength': 2.828}   
> these {'strength': 1.944}   
> they {'strength': 2.233}   
> this {'strength': 1.908}   
> to {'strength': 8.419}   
> type {'strength': 1.295}   
> upon {'strength': 4.902}   
> which {'strength': 4.379}   
> will {'strength': 3.784}   
> would {'strength': 1.601}   

## C2 Condition
C2 requires that the histogram of the 10 relative frequencies of appearance of $W_i$ within five words of $W$ (or $p^j_i$s) have at least one spike. If the histogram is flat, it will be rejected by this condition.

The formula is here:

$$spread = \frac{\sum_{j=1}^{10}(p^j_i - \bar{p_i})^2}{10} \geq U_{0} = 10$$

where $p^j_i$ is the frequency of a given collocate at distance *j* (e.g., 16 for "on" when its distance is -5) and

$\bar{p_i}$ is the average of the 10 per-distance frequencies of that collocate (e.g., 257.3 for "on").
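The following sketch contrasts a flat histogram with a spiky one to show what the spread measures. The two count lists are invented toy data and `spread` is a hypothetical helper name.

```python
def spread(counts):
    """Variance of the per-distance counts around their mean."""
    p_bar = sum(counts) / len(counts)
    return sum((p - p_bar) ** 2 for p in counts) / len(counts)

U0 = 10

# Toy histograms over the 10 distances (invented for illustration).
flat = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]    # same count at every distance
spiky = [0, 0, 0, 0, 0, 40, 0, 0, 0, 0]  # everything at a single distance

for name, counts in [("flat", flat), ("spiky", spiky)]:
    u = spread(counts)
    print(name, u, "kept" if u >= U0 else "rejected")
```

The flat histogram has spread 0 and is rejected, while the spiky one has spread 144 and passes C2.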

<font color="red">**[ TODO ]**</font> Please apply C2 to the result of C1 and keep the collocates that pass.

The output should contain each `collocate` with its `strength` and `spread`.

In [48]:
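def C2_filter(base_word, filtered_by_C1):
    base_word_lines = []
    for line in filtered_by_C1:
        p_bar = int(line[2]) / 10  # mean frequency over the 10 distances
        # Numerator of the spread formula: sum of squared deviations from p_bar.
        p_sum = sum((int(p) - p_bar) ** 2 for p in line[3:13])
        spread = p_sum / 10
        if spread >= U0:
            # Copy the row and store spread at index 14, right after strength.
            base_word_lines.append(line[:14] + [spread])
            print(line[1], "{strength:", line[13], "spread:", round(spread, 2), '}')

    return base_word_lines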

In [49]:
filtered_by_C2 = C2_filter(base_word, filtered_by_C1)
### Print

a {strength: 6.380986338411858 spread: 777.29 }
all {strength: 1.150547489042754 spread: 29.89 }
also {strength: 1.1325114930104467 spread: 208.96 }
an {strength: 1.3669794414304408 spread: 56.29 }
and {strength: 15.182552402177798 spread: 2170.41 }
are {strength: 1.9621673104965802 spread: 98.84 }
as {strength: 2.3950312152719544 spread: 104.96 }
but {strength: 1.529303405721206 spread: 24.4 }
by {strength: 1.0423315128489103 spread: 26.21 }
can {strength: 1.4210874295273626 spread: 208.24 }
do {strength: 1.6555553779473569 spread: 410.21 }
does {strength: 5.298826576473423 spread: 6477.09 }
for {strength: 4.685602711374976 spread: 376.65 }
formula {strength: 1.5653753977858207 spread: 46.16 }
in {strength: 5.8759784495072545 spread: 396.09 }
is {strength: 2.6114631676596414 spread: 148.2 }
it {strength: 2.286815239078111 spread: 112.76 }
its {strength: 1.8178793422381223 spread: 94.24 }
may {strength: 2.863967112111943 spread: 1352.24 }
not {strength: 8.437089886094885 spread: 12938.

<font color="green">Expected output: </font> (The order isn't important.)

> a {'strength': 6.381, 'spread': 777.29}   
> all {'strength': 1.151, 'spread': 29.89}   
> also {'strength': 1.133, 'spread': 208.96}   
> an {'strength': 1.367, 'spread': 56.29}   
> and {'strength': 15.183, 'spread': 2170.41}   
> are {'strength': 1.962, 'spread': 98.84}   
> as {'strength': 2.395, 'spread': 104.96}   
> but {'strength': 1.529, 'spread': 24.4}   
> by {'strength': 1.042, 'spread': 26.21}   
> can {'strength': 1.421, 'spread': 208.24}   
> do {'strength': 1.656, 'spread': 410.21}   
> does {'strength': 5.299, 'spread': 6477.09}   
> for {'strength': 4.686, 'spread': 376.65}   
> formula {'strength': 1.565, 'spread': 46.16}   
> in {'strength': 5.876, 'spread': 396.09}   
> is {'strength': 2.611, 'spread': 148.2}   
> it {'strength': 2.287, 'spread': 112.76}   
> its {'strength': 1.818, 'spread': 94.24}   
> may {'strength': 2.864, 'spread': 1352.24}   
> not {'strength': 8.437, 'spread': 12938.41}   
> of {'strength': 23.461, 'spread': 20132.64}   
> on {'strength': 46.313, 'spread': 420371.01}   
> only {'strength': 1.295, 'spread': 134.01}   
> or {'strength': 2.485, 'spread': 85.61}   
> other {'strength': 1.656, 'spread': 31.61}   
> properties {'strength': 1.042, 'spread': 30.21}   
> s {'strength': 2.161, 'spread': 125.85}   
> some {'strength': 1.187, 'spread': 15.29}   
> such {'strength': 1.439, 'spread': 27.45}   
> that {'strength': 7.247, 'spread': 1492.61}   
> the {'strength': 44.707, 'spread': 98586.04}   
> their {'strength': 2.828, 'spread': 209.56}   
> these {'strength': 1.944, 'spread': 180.01}   
> they {'strength': 2.233, 'spread': 316.09}   
> this {'strength': 1.908, 'spread': 71.09}   
> to {'strength': 8.419, 'spread': 3941.16}   
> type {'strength': 1.295, 'spread': 213.41}   
> upon {'strength': 4.902, 'spread': 4984.01}   
> which {'strength': 4.379, 'spread': 346.16}   
> will {'strength': 3.784, 'spread': 2250.05}   
> would {'strength': 1.601, 'spread': 412.44}   

## C3 Condition
C3 keeps the interesting collocates by pulling out the peaks of the $p^j_i$ distributions.

Formula:

$$p^j_i \geq \bar{p_i} + (k_1 \times \sqrt{U_{i}})$$

where $U_i$ is *spread* in C2 and

$k_1$ is equal to 1 
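As a small worked example of the peak test, the sketch below finds the distances whose count reaches the threshold $\bar{p_i} + k_1 \sqrt{U_i}$. The counts are invented, the distance order (-5 to -1, then 1 to 5) is an assumption about the data layout, and `find_peaks` is a hypothetical helper.

```python
# Assumed column order of the ten counts: distances -5..-1, then 1..5.
DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

def find_peaks(counts, k1=1):
    """Return (distance, count) pairs whose count reaches p_bar + k1 * sqrt(spread)."""
    p_bar = sum(counts) / len(counts)
    u = sum((p - p_bar) ** 2 for p in counts) / len(counts)  # spread from C2
    threshold = p_bar + k1 * u ** 0.5
    return [(d, c) for d, c in zip(DISTANCES, counts) if c >= threshold]

# Toy per-distance counts (invented for illustration).
toy_counts = [2, 1, 0, 3, 2, 40, 12, 0, 0, 0]
peaks = find_peaks(toy_counts)
print(peaks)
```

Here the mean is 6 and the spread is 140.2, so the threshold is roughly 17.84 and only the count of 40 at distance 1 qualifies as a peak. A collocate can produce several peaks, in which case each (collocate, distance) pair is reported separately.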

<font color="red">**[ TODO ]**</font> Please apply this condition to the result of the previous step and keep the collocates that pass C3.

The output should contain `base word, collocate, distance, strength, spread, peak, count`.

In [50]:
def C3_filter(base_word, filtered_by_C2):
    base_word_lines = []
    for line in filtered_by_C2:
        p_bar = int(line[2]) / 10
        spread = line[14]
        # Threshold that a per-distance count must reach to count as a peak.
        peak = p_bar + k1 * spread ** 0.5
        for index, count in enumerate(line[3:13]):
            if int(count) >= peak:
                # Columns 0-4 hold distances -5..-1, columns 5-9 hold 1..5.
                distance = index - 5 if index <= 4 else index - 4
                # One new row per peak, ending with [peak, distance, count].
                base_word_lines.append(line[:15] + [peak, distance, int(count)])
                print('(',line[0],line[1],distance,')',"{strength:",round(line[13],3),"spread:",round(spread,2),"peak:",round(peak,2),"count:",count,'}')

    return base_word_lines
            

In [51]:
filtered_by_C3 = C3_filter(base_word, filtered_by_C2)
### Print

( depend a 2 ) {strength: 6.381 spread: 777.29 peak: 63.78 count: 94 }
( depend all -4 ) {strength: 1.151 spread: 29.89 peak: 12.37 count: 14 }
( depend all -3 ) {strength: 1.151 spread: 29.89 peak: 12.37 count: 16 }
( depend also -1 ) {strength: 1.133 spread: 208.96 peak: 21.26 count: 50 }
( depend an 2 ) {strength: 1.367 spread: 56.29 peak: 15.6 count: 24 }
( depend an 5 ) {strength: 1.367 spread: 56.29 peak: 15.6 count: 19 }
( depend and 4 ) {strength: 15.183 spread: 2170.41 peak: 131.29 count: 149 }
( depend are -5 ) {strength: 1.962 spread: 98.84 peak: 21.34 count: 27 }
( depend are -4 ) {strength: 1.962 spread: 98.84 peak: 21.34 count: 22 }
( depend as 4 ) {strength: 2.395 spread: 104.96 peak: 24.04 count: 30 }
( depend as 5 ) {strength: 2.395 spread: 104.96 peak: 24.04 count: 28 }
( depend but -2 ) {strength: 1.529 spread: 24.4 peak: 13.94 count: 14 }
( depend but 5 ) {strength: 1.529 spread: 24.4 peak: 13.94 count: 15 }
( depend by -5 ) {strength: 1.042 spread: 26.21 peak: 11.4

<font color="green">Expected output: </font> (The order isn't important.)

> ('depend', 'a', 2) {'strength': 6.381, 'spread': 777.29, 'peak': 63.78, 'count': 94}   
> ('depend', 'all', -4) {'strength': 1.151, 'spread': 29.89, 'peak': 12.367, 'count': 14}   
> ('depend', 'all', -3) {'strength': 1.151, 'spread': 29.89, 'peak': 12.367, 'count': 16}   
> ('depend', 'also', -1) {'strength': 1.133, 'spread': 208.96, 'peak': 21.255, 'count': 50}   
> ('depend', 'an', 2) {'strength': 1.367, 'spread': 56.29, 'peak': 15.603, 'count': 24}   
> ('depend', 'an', 5) {'strength': 1.367, 'spread': 56.29, 'peak': 15.603, 'count': 19}   
> ('depend', 'and', 4) {'strength': 15.183, 'spread': 2170.41, 'peak': 131.288, 'count': 149}   
> ('depend', 'are', -5) {'strength': 1.962, 'spread': 98.84, 'peak': 21.342, 'count': 27}   
> ('depend', 'are', -4) {'strength': 1.962, 'spread': 98.84, 'peak': 21.342, 'count': 22}   
> ('depend', 'as', 4) {'strength': 2.395, 'spread': 104.96, 'peak': 24.045, 'count': 30}   
> ('depend', 'as', 5) {'strength': 2.395, 'spread': 104.96, 'peak': 24.045, 'count': 28}   
> ('depend', 'but', -2) {'strength': 1.529, 'spread': 24.4, 'peak': 13.94, 'count': 14}   
> ('depend', 'but', 5) {'strength': 1.529, 'spread': 24.4, 'peak': 13.94, 'count': 15}   
> ('depend', 'by', -5) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 13}   
> ('depend', 'by', -4) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 12}   
> ('depend', 'by', 4) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 13}   
> ('depend', 'can', -1) {'strength': 1.421, 'spread': 208.24, 'peak': 22.831, 'count': 49}   
> ('depend', 'do', -2) {'strength': 1.656, 'spread': 410.21, 'peak': 29.954, 'count': 70}   
> ('depend', 'does', -2) {'strength': 5.299, 'spread': 6477.09, 'peak': 110.38, 'count': 271}   
> ('depend', 'for', 4) {'strength': 4.686, 'spread': 376.65, 'peak': 45.907, 'count': 69}   
> ('depend', 'formula', -4) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 19}   
> ('depend', 'formula', 2) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 17}   
> ('depend', 'formula', 5) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 19}   
> ('depend', 'in', -5) {'strength': 5.876, 'spread': 396.09, 'peak': 53.002, 'count': 55}   
> ('depend', 'in', 4) {'strength': 5.876, 'spread': 396.09, 'peak': 53.002, 'count': 62}   
> ('depend', 'is', -5) {'strength': 2.611, 'spread': 148.2, 'peak': 27.174, 'count': 37}   
> ('depend', 'is', 5) {'strength': 2.611, 'spread': 148.2, 'peak': 27.174, 'count': 29}   
> ('depend', 'it', -3) {'strength': 2.287, 'spread': 112.76, 'peak': 23.819, 'count': 39}   
> ('depend', 'it', -2) {'strength': 2.287, 'spread': 112.76, 'peak': 23.819, 'count': 24}   
> ('depend', 'its', 2) {'strength': 1.818, 'spread': 94.24, 'peak': 20.308, 'count': 36}   
> ('depend', 'may', -1) {'strength': 2.864, 'spread': 1352.24, 'peak': 53.173, 'count': 126}   
> ('depend', 'not', -1) {'strength': 8.437, 'spread': 12938.41, 'peak': 161.047, 'count': 388}   
> ('depend', 'of', 4) {'strength': 23.461, 'spread': 20132.64, 'peak': 272.49, 'count': 495}   
> ('depend', 'on', 1) {'strength': 46.313, 'spread': 420371.01, 'peak': 905.66, 'count': 2195}   
> ('depend', 'only', 1) {'strength': 1.295, 'spread': 134.01, 'peak': 19.276, 'count': 40}   
> ('depend', 'or', 4) {'strength': 2.485, 'spread': 85.61, 'peak': 23.553, 'count': 29}   
> ('depend', 'or', 5) {'strength': 2.485, 'spread': 85.61, 'peak': 23.553, 'count': 25}   
> ('depend', 'other', 3) {'strength': 1.656, 'spread': 31.61, 'peak': 15.322, 'count': 19}   
> ('depend', 'other', 5) {'strength': 1.656, 'spread': 31.61, 'peak': 15.322, 'count': 17}   
> ('depend', 'properties', -4) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 12}   
> ('depend', 'properties', -1) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 15}   
> ('depend', 'properties', 3) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 15}   
> ('depend', 's', 4) {'strength': 2.161, 'spread': 125.85, 'peak': 23.718, 'count': 41}   
> ('depend', 'some', -3) {'strength': 1.187, 'spread': 15.29, 'peak': 11.01, 'count': 13}   
> ('depend', 'some', 2) {'strength': 1.187, 'spread': 15.29, 'peak': 11.01, 'count': 14}   
> ('depend', 'such', 4) {'strength': 1.439, 'spread': 27.45, 'peak': 13.739, 'count': 17}   
> ('depend', 'that', -3) {'strength': 7.247, 'spread': 1492.61, 'peak': 79.334, 'count': 84}   
> ('depend', 'that', -1) {'strength': 7.247, 'spread': 1492.61, 'peak': 79.334, 'count': 132}   
> ('depend', 'the', 2) {'strength': 44.707, 'spread': 98586.04, 'peak': 562.384, 'count': 1140}   
> ('depend', 'their', 2) {'strength': 2.828, 'spread': 209.56, 'peak': 30.676, 'count': 52}   
> ('depend', 'these', -2) {'strength': 1.944, 'spread': 180.01, 'peak': 24.717, 'count': 48}   
> ('depend', 'they', -1) {'strength': 2.233, 'spread': 316.09, 'peak': 30.679, 'count': 63}   
> ('depend', 'this', -4) {'strength': 1.908, 'spread': 71.09, 'peak': 19.531, 'count': 28}   
> ('depend', 'this', -2) {'strength': 1.908, 'spread': 71.09, 'peak': 19.531, 'count': 22}   
> ('depend', 'to', -1) {'strength': 8.419, 'spread': 3941.16, 'peak': 109.979, 'count': 228}   
> ('depend', 'type', 3) {'strength': 1.295, 'spread': 213.41, 'peak': 22.309, 'count': 50}   
> ('depend', 'upon', 1) {'strength': 4.902, 'spread': 4984.01, 'peak': 98.298, 'count': 239}   
> ('depend', 'which', -1) {'strength': 4.379, 'spread': 346.16, 'peak': 43.405, 'count': 66}   
> ('depend', 'will', -1) {'strength': 3.784, 'spread': 2250.05, 'peak': 68.935, 'count': 159}   
> ('depend', 'would', -1) {'strength': 1.601, 'spread': 412.44, 'peak': 29.709, 'count': 70}   

## Strongest Collocation
There are too many collocations to check your result easily, so we want you to use the rules below to find the single strongest collocation for "depend".

Rule:
1. find the collocate with maximum **`strength`** value
2. find the collocate with maximum **`count`** value

If two or more collocations share the same maximum `strength` value, use Rule 2 to pick one of them as the answer. Otherwise, you can ignore Rule 2.
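One way to express the two rules is a single `max` with a tuple key, sketched below on invented candidates. The tuple layout `(base word, collocate, distance, strength, count)` is a hypothetical format chosen for the example, not the format used elsewhere in this notebook.

```python
# Toy candidates (invented): (base word, collocate, distance, strength, count).
candidates = [
    ("w", "x", -1, 5.0, 30),
    ("w", "x", 2, 5.0, 80),
    ("w", "y", 1, 3.0, 200),
]

# Tuples compare element by element, so count only matters when strength ties.
best = max(candidates, key=lambda c: (c[3], c[4]))
strongest = best[:3]
print(strongest)
```

The first two candidates tie on strength, so Rule 2 picks the one with the larger count. The third candidate has the largest count overall but loses on Rule 1.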

<font color="red">**[ TODO ]**</font> Please find the strongest collocation for "depend" using these rules.

The output format should be `(base word, collocate, distance)`.

In [52]:
def find_strongest_collocation(base_word, filtered_by_C3):
    # Rule 1: highest strength (index 13). Rule 2 breaks ties by highest count.
    # Each row ends with [..., distance, count].
    best = max(filtered_by_C3, key=lambda row: (row[13], int(row[-1])))

    return best[0] + ' ' + best[1] + ' ' + str(best[-2])

In [53]:
### Run and Print
find_strongest_collocation(base_word, filtered_by_C3)

'depend on 1'

<font color="green">Expected output: </font>

> ('depend', 'on', 1)

## Find Helpful AKL Collocation
A single example cannot show how useful this procedure is, so here are some other AKL verbs for you to try.

<font color="red">**[ TODO ]**</font> Please finish the **combination** function, which chains the previous four functions together, and use it to find the strongest collocation for each verb in **AKL_verbs**.

The output format should be `(base word, collocate, distance)`.

In [54]:
AKL_verbs = ['argue', 'can', 'consist', 'contrast', 'favour', 'lack', 'may', 
            'neglect', 'participate', 'present', 'rely', 'suggest']

In [55]:
def combination(base_word):
    filtered_by_C1 = C1_filter(base_word)
    filtered_by_C2 = C2_filter(base_word, filtered_by_C1)
    filtered_by_C3 = C3_filter(base_word, filtered_by_C2)
    return find_strongest_collocation(base_word, filtered_by_C3)
    

In [56]:
### Run and Print
AKL_Collocation_list = []
for i in AKL_verbs:
    AKL_Collocation_list.append(combination(i))

a {strength: 11.774 }
about {strength: 1.19 }
advocates {strength: 1.015 }
against {strength: 2.382 }
all {strength: 1.103 }
also {strength: 2.067 }
an {strength: 2.487 }
and {strength: 21.376 }
are {strength: 4.327 }
as {strength: 5.957 }
authors {strength: 1.138 }
be {strength: 3.153 }
been {strength: 1.068 }
but {strength: 1.664 }
by {strength: 2.645 }
can {strength: 1.839 }
could {strength: 1.313 }
critics {strength: 4.274 }
for {strength: 7.218 }
from {strength: 2.137 }
had {strength: 1.05 }
has {strength: 1.891 }
have {strength: 2.575 }
he {strength: 2.382 }
his {strength: 2.277 }
historians {strength: 2.259 }
however {strength: 2.925 }
in {strength: 10.95 }
is {strength: 12.194 }
it {strength: 5.834 }
its {strength: 1.278 }
many {strength: 2.978 }
may {strength: 1.383 }
more {strength: 1.506 }
no {strength: 1.436 }
not {strength: 4.327 }
of {strength: 26.685 }
on {strength: 3.223 }
one {strength: 1.997 }
only {strength: 1.068 }
or {strength: 1.734 }
other {strength: 3.118 }
othe

1 {strength: 1.842 }
2 {strength: 1.408 }
3 {strength: 1.106 }
a {strength: 67.022 }
about {strength: 2.13 }
above {strength: 1.251 }
affect {strength: 1.141 }
after {strength: 1.532 }
all {strength: 4.689 }
also {strength: 18.496 }
although {strength: 2.042 }
an {strength: 13.026 }
and {strength: 64.319 }
another {strength: 1.312 }
any {strength: 6.312 }
applied {strength: 1.294 }
are {strength: 7.323 }
area {strength: 1.029 }
around {strength: 1.095 }
as {strength: 33.945 }
at {strength: 8.819 }
back {strength: 2.038 }
based {strength: 1.095 }
be {strength: 151.094 }
because {strength: 2.831 }
become {strength: 1.808 }
before {strength: 1.991 }
being {strength: 1.211 }
between {strength: 2.627 }
body {strength: 1.041 }
both {strength: 2.917 }
but {strength: 8.406 }
by {strength: 24.815 }
called {strength: 1.83 }
can {strength: 3.327 }
case {strength: 1.085 }
cases {strength: 1.35 }
cause {strength: 4.415 }
certain {strength: 1.301 }
change {strength: 1.655 }
conditions {strength: 1.0

( can can -5 ) {strength: 3.327 spread: 58041.36 peak: 485.72 count: 658 }
( can can 5 ) {strength: 3.327 spread: 58041.36 peak: 485.72 count: 658 }
( can case -2 ) {strength: 1.085 spread: 2550.81 peak: 132.21 count: 141 }
( can cases -3 ) {strength: 1.35 spread: 4817.6 peak: 170.41 count: 246 }
( can cases -2 ) {strength: 1.35 spread: 4817.6 peak: 170.41 count: 176 }
( can cause 1 ) {strength: 4.415 spread: 525714.2 peak: 1049.06 count: 2483 }
( can certain -3 ) {strength: 1.301 spread: 2813.84 peak: 150.45 count: 152 }
( can change 1 ) {strength: 1.655 spread: 19605.76 peak: 263.22 count: 539 }
( can conditions -2 ) {strength: 1.04 spread: 2602.44 peak: 129.41 count: 170 }
( can conditions -1 ) {strength: 1.04 spread: 2602.44 peak: 129.41 count: 149 }
( can considered 2 ) {strength: 2.101 spread: 108160.24 peak: 484.48 count: 1128 }
( can control 1 ) {strength: 1.169 spread: 839.76 peak: 116.78 count: 157 }
( can create 1 ) {strength: 1.335 spread: 25534.89 peak: 259.7 count: 562 }


( can single -2 ) {strength: 1.793 spread: 8435.36 peak: 225.04 count: 242 }
( can single 5 ) {strength: 1.793 spread: 8435.36 peak: 225.04 count: 289 }
( can small -4 ) {strength: 1.714 spread: 5455.45 peak: 201.36 count: 234 }
( can so -3 ) {strength: 6.3 spread: 178572.49 peak: 883.68 count: 1286 }
( can so -2 ) {strength: 6.3 spread: 178572.49 peak: 883.68 count: 1059 }
( can some -4 ) {strength: 5.741 spread: 44515.84 peak: 631.39 count: 646 }
( can some -3 ) {strength: 5.741 spread: 44515.84 peak: 631.39 count: 642 }
( can sometimes 1 ) {strength: 1.614 spread: 30953.76 peak: 296.14 count: 645 }
( can space -2 ) {strength: 1.257 spread: 3649.76 peak: 154.61 count: 177 }
( can space -1 ) {strength: 1.257 spread: 3649.76 peak: 154.61 count: 183 }
( can species -1 ) {strength: 1.522 spread: 10985.25 peak: 218.31 count: 390 }
( can state -1 ) {strength: 1.654 spread: 5120.89 peak: 194.66 count: 196 }
( can state 5 ) {strength: 1.654 spread: 5120.89 peak: 194.66 count: 234 }
( can sta

a {strength: 17.89114957640104 spread: 37358.96 }
all {strength: 1.7441322603224292 spread: 172.04 }
an {strength: 2.354339310057958 spread: 455.25 }
and {strength: 18.28230794161612 spread: 6171.81 }
are {strength: 2.510802656143991 spread: 273.45 }
as {strength: 2.604680663795611 spread: 193.49 }
at {strength: 1.1495715451955035 spread: 28.96 }
but {strength: 1.0869862067610903 spread: 28.84 }
by {strength: 1.5563762450191896 spread: 90.24 }
can {strength: 1.7597785949310325 spread: 707.61 }
each {strength: 1.0400472029352805 spread: 41.49 }
for {strength: 1.634607918062206 spread: 115.29 }
forces {strength: 1.0400472029352805 spread: 93.89 }
four {strength: 1.3999128989331564 spread: 270.84 }
from {strength: 1.3060348912815367 spread: 56.96 }
in {strength: 8.65981215732509 spread: 1253.96 }
is {strength: 2.041412617885892 spread: 221.45 }
it {strength: 1.3060348912815367 spread: 91.16 }
mainly {strength: 1.2278032182385201 spread: 268.21 }
many {strength: 1.1339252105869002 spread: 

a {strength: 20.234 }
all {strength: 1.452 }
also {strength: 1.136 }
an {strength: 3.37 }
and {strength: 24.097 }
are {strength: 4.859 }
as {strength: 6.927 }
at {strength: 1.732 }
be {strength: 2.187 }
between {strength: 3.957 }
but {strength: 1.215 }
by {strength: 19.349 }
can {strength: 1.364 }
for {strength: 5.893 }
from {strength: 2.774 }
had {strength: 1.443 }
has {strength: 2.144 }
have {strength: 2.424 }
he {strength: 1.303 }
high {strength: 1.478 }
his {strength: 4.088 }
in {strength: 86.821 }
is {strength: 13.357 }
it {strength: 2.713 }
its {strength: 2.161 }
many {strength: 1.408 }
more {strength: 2.669 }
most {strength: 1.828 }
not {strength: 2.354 }
of {strength: 30.877 }
on {strength: 3.869 }
one {strength: 1.592 }
only {strength: 1.443 }
or {strength: 4.754 }
other {strength: 4.737 }
s {strength: 7.75 }
sharp {strength: 1.679 }
some {strength: 1.609 }
stark {strength: 2.012 }
such {strength: 1.968 }
than {strength: 1.092 }
that {strength: 6.559 }
the {strength: 71.543 }


a {strength: 11.736 }
abandoned {strength: 1.283 }
after {strength: 1.244 }
against {strength: 1.13 }
also {strength: 1.04 }
an {strength: 1.755 }
and {strength: 15.054 }
as {strength: 3.325 }
at {strength: 1.512 }
be {strength: 1.079 }
but {strength: 2.253 }
by {strength: 3.006 }
court {strength: 1.193 }
fell {strength: 1.474 }
for {strength: 2.853 }
found {strength: 1.091 }
from {strength: 1.666 }
government {strength: 1.002 }
had {strength: 2.151 }
he {strength: 4.295 }
her {strength: 1.232 }
him {strength: 1.155 }
his {strength: 6.656 }
in {strength: 55.475 }
is {strength: 1.512 }
it {strength: 2.253 }
its {strength: 1.091 }
king {strength: 1.538 }
more {strength: 1.653 }
new {strength: 1.321 }
not {strength: 1.691 }
of {strength: 56.164 }
on {strength: 2.495 }
or {strength: 1.079 }
out {strength: 3.197 }
s {strength: 6.822 }
son {strength: 1.027 }
that {strength: 3.184 }
the {strength: 43.324 }
their {strength: 1.819 }
this {strength: 2.304 }
to {strength: 12.17 }
vote {strength: 

a {strength: 30.203 }
about {strength: 1.771 }
also {strength: 2.085 }
an {strength: 2.684 }
and {strength: 34.193 }
any {strength: 1.838 }
are {strength: 2.679 }
as {strength: 5.515 }
at {strength: 2.062 }
be {strength: 1.748 }
because {strength: 4.186 }
been {strength: 1.171 }
between {strength: 1.087 }
but {strength: 3.783 }
by {strength: 6.507 }
despite {strength: 2.219 }
due {strength: 11.523 }
evidence {strength: 2.404 }
for {strength: 11.035 }
from {strength: 4.343 }
funding {strength: 1.109 }
funds {strength: 1.087 }
has {strength: 1.619 }
have {strength: 1.681 }
he {strength: 2.09 }
her {strength: 1.076 }
his {strength: 5.257 }
however {strength: 1.743 }
in {strength: 18.013 }
interest {strength: 1.407 }
is {strength: 6.198 }
it {strength: 2.735 }
its {strength: 3.968 }
knowledge {strength: 1.042 }
many {strength: 1.093 }
may {strength: 1.474 }
not {strength: 1.984 }
of {strength: 99.716 }
on {strength: 4.242 }
or {strength: 5.66 }
other {strength: 1.479 }
resources {strength:

1 {strength: 4.66 }
10 {strength: 2.596 }
11 {strength: 1.968 }
12 {strength: 1.971 }
13 {strength: 1.64 }
14 {strength: 1.953 }
15 {strength: 2.54 }
16 {strength: 1.88 }
17 {strength: 1.848 }
18 {strength: 1.818 }
19 {strength: 1.769 }
1945 {strength: 1.097 }
2 {strength: 2.926 }
20 {strength: 2.301 }
2005 {strength: 1.139 }
2006 {strength: 1.626 }
2007 {strength: 1.917 }
2008 {strength: 1.936 }
2009 {strength: 1.912 }
2010 {strength: 2.079 }
2011 {strength: 1.975 }
2012 {strength: 1.727 }
2013 {strength: 1.808 }
2014 {strength: 1.66 }
2015 {strength: 1.749 }
2016 {strength: 1.626 }
2017 {strength: 1.703 }
2018 {strength: 1.614 }
2019 {strength: 1.943 }
2020 {strength: 1.437 }
21 {strength: 1.801 }
22 {strength: 1.673 }
23 {strength: 1.72 }
24 {strength: 1.836 }
25 {strength: 1.966 }
26 {strength: 1.737 }
27 {strength: 1.69 }
28 {strength: 1.735 }
29 {strength: 1.759 }
3 {strength: 2.476 }
30 {strength: 2.071 }
31 {strength: 1.875 }
4 {strength: 2.395 }
5 {strength: 2.746 }
6 {strengt

( may 25 1 ) {strength: 1.966 spread: 25226.09 peak: 277.93 count: 373 }
( may 26 -1 ) {strength: 1.737 spread: 28078.05 peak: 273.07 count: 487 }
( may 26 1 ) {strength: 1.737 spread: 28078.05 peak: 273.07 count: 388 }
( may 27 -1 ) {strength: 1.69 spread: 24650.41 peak: 259.7 count: 444 }
( may 27 1 ) {strength: 1.69 spread: 24650.41 peak: 259.7 count: 387 }
( may 28 -1 ) {strength: 1.735 spread: 25860.44 peak: 266.21 count: 484 }
( may 28 1 ) {strength: 1.735 spread: 25860.44 peak: 266.21 count: 360 }
( may 29 -1 ) {strength: 1.759 spread: 28795.36 peak: 276.49 count: 513 }
( may 29 1 ) {strength: 1.759 spread: 28795.36 peak: 276.49 count: 366 }
( may 3 -1 ) {strength: 2.476 spread: 22087.24 peak: 298.02 count: 479 }
( may 3 1 ) {strength: 2.476 spread: 22087.24 peak: 298.02 count: 404 }
( may 30 -1 ) {strength: 2.071 spread: 21573.61 peak: 272.18 count: 467 }
( may 30 1 ) {strength: 2.071 spread: 21573.61 peak: 272.18 count: 362 }
( may 31 -1 ) {strength: 1.875 spread: 38692.41 pea

( may october 3 ) {strength: 1.562 spread: 5784.09 peak: 171.15 count: 198 }
( may of -4 ) {strength: 77.189 spread: 8294653.89 peak: 7466.94 count: 7740 }
( may of -3 ) {strength: 77.189 spread: 8294653.89 peak: 7466.94 count: 7519 }
( may of 5 ) {strength: 77.189 spread: 8294653.89 peak: 7466.94 count: 7762 }
( may often -5 ) {strength: 1.01 spread: 1545.61 peak: 101.61 count: 127 }
( may often -4 ) {strength: 1.01 spread: 1545.61 peak: 101.61 count: 110 }
( may often 1 ) {strength: 1.01 spread: 1545.61 peak: 101.61 count: 103 }
( may on -2 ) {strength: 52.146 spread: 14075344.25 peak: 6851.21 count: 11456 }
( may on -1 ) {strength: 52.146 spread: 14075344.25 peak: 6851.21 count: 9553 }
( may one -1 ) {strength: 8.007 spread: 53050.09 peak: 708.23 count: 981 }
( may only 1 ) {strength: 3.655 spread: 23445.84 peak: 372.52 count: 573 }
( may opened -2 ) {strength: 1.128 spread: 6146.41 peak: 147.7 count: 285 }
( may or -2 ) {strength: 31.776 spread: 415385.04 peak: 2534.1 count: 2700 }

a {strength: 5.793 }
abuse {strength: 3.586 }
after {strength: 2.194 }
and {strength: 22.064 }
as {strength: 3.922 }
been {strength: 1.09 }
but {strength: 1.33 }
by {strength: 5.026 }
child {strength: 1.33 }
due {strength: 1.762 }
for {strength: 3.442 }
from {strength: 2.674 }
had {strength: 1.378 }
has {strength: 1.09 }
he {strength: 1.714 }
his {strength: 4.498 }
in {strength: 8.097 }
into {strength: 1.474 }
is {strength: 2.29 }
it {strength: 2.386 }
its {strength: 1.09 }
not {strength: 1.666 }
of {strength: 29.455 }
on {strength: 1.33 }
or {strength: 3.97 }
s {strength: 3.298 }
such {strength: 1.042 }
that {strength: 2.914 }
the {strength: 28.783 }
their {strength: 1.522 }
this {strength: 1.762 }
to {strength: 14.432 }
was {strength: 3.298 }
were {strength: 1.426 }
which {strength: 1.09 }
with {strength: 1.618 }
years {strength: 2.194 }
a {strength: 5.793443659451567 spread: 28.64 }
abuse {strength: 3.58568802972691 spread: 86.36 }
after {strength: 2.193842089248322 spread: 35.69 }


0 {strength: 6.933 }
1 {strength: 7.201 }
19 {strength: 1.054 }
2 {strength: 7.504 }
20 {strength: 1.714 }
21 {strength: 1.409 }
22 {strength: 1.637 }
23 {strength: 1.819 }
24 {strength: 1.897 }
25 {strength: 2.231 }
26 {strength: 2.296 }
27 {strength: 2.331 }
28 {strength: 2.585 }
29 {strength: 2.602 }
3 {strength: 8.826 }
30 {strength: 3.019 }
31 {strength: 2.537 }
32 {strength: 2.601 }
33 {strength: 2.454 }
34 {strength: 2.196 }
35 {strength: 2.124 }
36 {strength: 1.942 }
37 {strength: 1.709 }
38 {strength: 1.556 }
39 {strength: 1.32 }
4 {strength: 9.106 }
40 {strength: 1.574 }
5 {strength: 8.941 }
6 {strength: 8.171 }
7 {strength: 7.367 }
8 {strength: 7.066 }
9 {strength: 6.699 }
a {strength: 18.836 }
all {strength: 1.199 }
also {strength: 1.823 }
an {strength: 1.344 }
and {strength: 65.979 }
are {strength: 4.36 }
as {strength: 3.851 }
at {strength: 5.096 }
be {strength: 2.102 }
been {strength: 1.044 }
but {strength: 1.246 }
by {strength: 1.792 }
city {strength: 1.053 }
county {str

a {strength: 8.208 }
all {strength: 1.353 }
also {strength: 1.245 }
an {strength: 1.526 }
and {strength: 20.663 }
are {strength: 2.239 }
as {strength: 4.899 }
because {strength: 1.072 }
but {strength: 2.326 }
by {strength: 1.309 }
can {strength: 1.569 }
could {strength: 1.98 }
do {strength: 1.526 }
does {strength: 1.807 }
for {strength: 7.429 }
forced {strength: 1.245 }
from {strength: 2.218 }
had {strength: 3.018 }
have {strength: 1.461 }
he {strength: 1.569 }
heavily {strength: 3.234 }
his {strength: 1.655 }
in {strength: 6.845 }
instead {strength: 1.028 }
is {strength: 1.72 }
it {strength: 2.045 }
its {strength: 1.05 }
many {strength: 2.434 }
may {strength: 1.18 }
methods {strength: 1.245 }
more {strength: 2.866 }
most {strength: 1.547 }
must {strength: 1.504 }
not {strength: 6.413 }
of {strength: 16.446 }
often {strength: 1.461 }
on {strength: 62.658 }
only {strength: 1.115 }
or {strength: 3.948 }
other {strength: 2.369 }
own {strength: 1.007 }
people {strength: 1.266 }
s {strength

a {strength: 20.756 }
all {strength: 1.274 }
also {strength: 2.962 }
although {strength: 1.084 }
an {strength: 4.224 }
and {strength: 23.564 }
are {strength: 4.397 }
as {strength: 6.732 }
at {strength: 2.678 }
be {strength: 4.271 }
been {strength: 2.63 }
between {strength: 1.463 }
but {strength: 3.072 }
by {strength: 3.182 }
can {strength: 1.668 }
could {strength: 1.684 }
data {strength: 1.463 }
did {strength: 1.321 }
does {strength: 1.037 }
early {strength: 1.431 }
estimates {strength: 1.21 }
evidence {strength: 4.555 }
findings {strength: 1.463 }
first {strength: 2.062 }
for {strength: 5.312 }
from {strength: 4.113 }
had {strength: 2.315 }
has {strength: 2.11 }
have {strength: 5.012 }
he {strength: 4.255 }
his {strength: 2.851 }
however {strength: 2.11 }
in {strength: 15.897 }
is {strength: 11.733 }
it {strength: 6.464 }
its {strength: 1.573 }
many {strength: 1.305 }
may {strength: 7.063 }
might {strength: 2.047 }
models {strength: 1.021 }
modern {strength: 1.021 }
more {strength: 1.

( suggest one 5 ) {strength: 1.195 spread: 17.29 peak: 12.26 count: 13 }
( suggest or -4 ) {strength: 3.182 spread: 56.01 peak: 28.18 count: 32 }
( suggest other -2 ) {strength: 4.271 spread: 1696.04 peak: 68.78 count: 146 }
( suggest others -1 ) {strength: 2.141 spread: 987.89 peak: 45.53 count: 108 }
( suggest people 2 ) {strength: 1.21 spread: 18.96 peak: 12.55 count: 18 }
( suggest people 3 ) {strength: 1.21 spread: 18.96 peak: 12.55 count: 14 }
( suggest recent -2 ) {strength: 2.22 spread: 700.44 peak: 41.07 count: 88 }
( suggest records -1 ) {strength: 1.195 spread: 283.49 peak: 24.94 count: 58 }
( suggest reports -1 ) {strength: 1.132 spread: 287.61 peak: 24.66 count: 58 }
( suggest results -1 ) {strength: 1.542 spread: 336.21 peak: 28.64 count: 64 }
( suggest s 3 ) {strength: 5.691 spread: 630.24 peak: 61.7 count: 90 }
( suggest scholars -1 ) {strength: 2.22 spread: 983.24 peak: 45.96 count: 107 }
( suggest should 3 ) {strength: 1.147 spread: 96.16 peak: 17.61 count: 26 }
( sug

In [57]:
AKL_Collocation_list

['argue that 1',
 'can be 1',
 'consist of 1',
 'contrast in -1',
 'favour of 1',
 'lack of 1',
 'may be 1',
 'neglect of 1',
 'participate in 1',
 'present with -3',
 'rely on 1',
 'suggest that 1']
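
The cell above returns plain strings, while the expected output lists tuples with an integer distance. A minimal conversion sketch, assuming every entry has the form `base collocate distance` separated by whitespace (`to_triple` is a hypothetical helper, not part of the assignment template):

```python
def to_triple(entry):
    """Split 'base collocate distance' into (base, collocate, int(distance))."""
    base, collocate, distance = entry.split()
    return base, collocate, int(distance)

# A few entries copied from the list above
sample = ['argue that 1', 'contrast in -1', 'present with -3']
triples = [to_triple(e) for e in sample]
for t in triples:
    print(t)
```

Applying the same helper to the full `AKL_Collocation_list` should give one tuple per entry, provided no word contains a space.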

<font color="green">Expected output: </font>

> ('argue', 'that', 1)   
> ('can', 'be', 1)   
> ('consist', 'of', 1)   
> ('contrast', 'in', -1)   
> ('favour', 'of', 1)   
> ('lack', 'of', 1)   
> ('may', 'be', 1)   
> ('neglect', 'of', 1)   
> ('participate', 'in', 1)   
> ('present', 'with', -3)   
> ('rely', 'on', 1)   
> ('suggest', 'that', 1)  
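
For reference, the three filtering conditions that lead to a list like this can be sketched as follows. This is only a sketch under stated assumptions, not the reference solution: the function name `extract_collocations`, the `counts` layout (collocate mapped to ten frequencies for distances -5 to -1 and 1 to 5) and the toy numbers are made up for illustration, and the population standard deviation is assumed for both strength and spread. The thresholds `k0`, `k1` and `U0` are the hyperparameters defined at the top of the notebook.

```python
from statistics import mean, pstdev

# Assumed column order of the ten distance counts
DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

def extract_collocations(base, counts, k0=1, k1=1, U0=10):
    """counts maps collocate -> list of 10 frequencies, one per entry in DISTANCES."""
    freqs = {w: sum(p) for w, p in counts.items()}
    freq_values = list(freqs.values())
    f_bar = mean(freq_values)       # average frequency over all collocates
    sigma = pstdev(freq_values)     # population standard deviation (assumption)
    results = []
    for w, p in counts.items():
        # Condition 1: strength = (freq - mean) / sigma must reach k0
        strength = (freqs[w] - f_bar) / sigma if sigma else 0.0
        # Condition 2: spread (variance of the ten counts) must reach U0
        p_bar = freqs[w] / len(p)
        spread = sum((x - p_bar) ** 2 for x in p) / len(p)
        if strength < k0 or spread < U0:
            continue
        # Condition 3: keep only distances whose count reaches the peak threshold
        peak = p_bar + k1 * spread ** 0.5
        for d, x in zip(DISTANCES, p):
            if x >= peak:
                results.append((base, w, d))
    return results

# Toy counts, invented for illustration only
toy = {
    'on':      [1, 1, 2, 1, 0, 90, 3, 1, 1, 0],
    'the':     [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'for':     [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'heavily': [0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
}
result = extract_collocations('depend', toy)
print(result)
```

With these toy counts only "on" is frequent enough to pass the strength test, and only distance 1 clears its peak threshold, so the sketch keeps the single triple `('depend', 'on', 1)`.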

## TA's Notes

Once you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=206119035) to reserve a demo time.  
Scores are given only after a TA reviews your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.  

## Reference
[Frank Smadja, Retrieving Collocations from Texts: Xtract, Computational Linguistics, Volume 19, 1993](https://aclanthology.org/J93-1007.pdf)