# Assignment 1 - Collocation Tool 
For this assignment, you will write a small Python program to perform collocational analysis using the string processing and NLP tools you've already encountered. Your script should do the following:

- Take a user-defined search term and a user-defined window size.
- Take one specific text which the user can define.
- Find all the context words which appear ± the window size from the search term in that text.
- Calculate the mutual information score for each context word.
- Save the results as a CSV file with (at least) the following columns: the collocate term; how often it appears as a collocate; how often it appears in the text; the mutual information score. 
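
As a rough sketch of how the three user-defined inputs listed above could be collected if the notebook were turned into a script, the block below uses `argparse` from the standard library. The function name `parse_arguments` and the flag names `--term`, `--window` and `--text` are illustrative assumptions, not part of the assignment.

```python
import argparse


def parse_arguments(argv=None):
    """Collect the user-defined inputs (hypothetical flag names)."""
    parser = argparse.ArgumentParser(description="Collocation tool")
    parser.add_argument("--term", required=True, help="search term (node word)")
    parser.add_argument("--window", type=int, default=5,
                        help="number of words on each side of the search term")
    parser.add_argument("--text", required=True, help="path to the text file")
    return parser.parse_args(argv)


# Example call with an explicit argument list instead of sys.argv
args = parse_arguments(["--term", "dog", "--window", "5",
                        "--text", "Bennet_Helen_1910.txt"])
print(args.term, args.window, args.text)
```

Passing an explicit list makes the function easy to try inside a notebook; when run as a script, `parse_arguments()` without arguments reads the command line instead.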

The first thing we need to do is install `spaCy` and the language model that we want to use.

```
$ pip install spacy 
$ python -m spacy download en_core_web_sm
```

In [109]:
# Load spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

In [110]:
# Load one text (novel) from the data folder
import os
filepath = os.path.join("..",
                        "CDS-LANG",
                        "100_english_novels",
                        "corpus",
                        "Bennet_Helen_1910.txt")

with open(filepath, "r") as f:
    text = f.read()

In [111]:
# Run the text through the spaCy pipeline
spacy_doc = nlp(text)

# Create word list
word_list = []

# Keep every token that is neither punctuation nor whitespace
for token in spacy_doc:
    if not token.is_punct and not token.is_space:
        word_list.append(token)

In [112]:
# Convert list to string
str_tokens = ' '.join(str(e) for e in word_list)

In [113]:
# Run the cleaned string through spaCy again so that token indices count words only
word_list_nlp = nlp(str_tokens)

In [114]:
# Define search term
key = "dog"

# Count how often the search term occurs in the text
counter = 0

for token in word_list_nlp:
    if token.text == key:
        counter += 1
    
# Print
print(f"The novel has {counter} of the chosen keyword")

The novel has 3 of the chosen keyword


In [115]:
# Print the token index and text of every occurrence of the search term
for token in word_list_nlp:
    if token.text == key:
        print(token.i, token.text)

17007 dog
46077 dog
46088 dog


In [116]:
# Print the context (5 words on either side) of every occurrence of the search term
for token in word_list_nlp:
    if token.text == key:
        before = max(token.i - 5, 0)  # do not run past the start of the text
        after = token.i + 6
        span = word_list_nlp[before:after]
        print(span)

has rendered superfluous For a dog cart stopped in front of
they chose But what a dog he might have been had
he cared to be a dog Here he was without the


In [117]:
# Create an empty list for the collocates
colloc = []

# Window size: number of words to collect on each side of the search term
window = 5

# Collect the context words before and after every occurrence of the search term
for token in word_list_nlp:
    if token.text == key:
        before = max(token.i - window, 0)  # do not run past the start of the text
        after = token.i + window + 1
        span_before = word_list_nlp[before:token.i]
        span_after = word_list_nlp[token.i + 1:after]
        for word in list(span_before) + list(span_after):
            if not word.is_punct and not word.is_space:
                colloc.append(word)
    
print(colloc)

[has, rendered, superfluous, For, a, cart, stopped, in, front, of, they, chose, But, what, a, he, might, have, been, had, he, cared, to, be, a, Here, he, was, without, the]
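
The windowing step can also be tried on a plain list of strings, without spaCy. The following is a minimal sketch under that assumption; the helper name `context_words` is hypothetical, and the toy sentence is the first context line printed earlier in this notebook.

```python
def context_words(tokens, key, window):
    """Return the words within `window` positions of each occurrence of `key`."""
    collocates = []
    for i, token in enumerate(tokens):
        if token == key:
            start = max(i - window, 0)            # do not run past the start
            before = tokens[start:i]
            after = tokens[i + 1:i + window + 1]  # slicing stops at the end by itself
            collocates.extend(before + after)
    return collocates


# Toy sentence: the first context line printed earlier in this notebook
tokens = "has rendered superfluous For a dog cart stopped in front of".split()
print(context_words(tokens, "dog", 2))
```

Because the window size is a parameter, the same function covers the "user-defined window size" requirement without changing the loop.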


## Calculate mutual information score
For this part of the assignment, I used the formula outlined on the website for the British National Corpus (https://www.english-corpora.org/mutualInformation.asp):

MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)

- A = frequency of node word 
- B = frequency of collocate 
- AB = frequency of collocate near the node word 
- sizeCorpus= size of corpus 
- span = span of words   
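- log (2) is literally the log10 of the number 2: .30103. Dividing by it turns the result into a base-2 logarithm, so the same score comes out with any other base (such as the natural logarithm used by `math.log`), as long as both logarithms use the same base.  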
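
To make the formula concrete, here is a small sketch that wraps it in a function. The function name `mutual_information` and its parameter names are assumptions made for illustration; the example values are the ones this notebook reports for the node word "dog" and the collocate "has".

```python
import math


def mutual_information(ab, a, b, corpus_size, span):
    """MI = log((AB * sizeCorpus) / (A * B * span)) / log(2)."""
    return math.log((ab * corpus_size) / (a * b * span)) / math.log(2)


# Values reported in this notebook for the collocate "has" and the node "dog":
# AB = 1, A = 3, B = 33, sizeCorpus = 52644, span = 10
score = mutual_information(ab=1, a=3, b=33, corpus_size=52644, span=10)
print(round(score, 2))
```

All arguments are plain numbers. In particular, `corpus_size` has to be the number of words, not the list of words itself.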

In [118]:
# load math
import math
math.log

<function math.log>

In [119]:
# Total amount of words in text
sizeCorpus = text.split()

print('Total words in text:', len(sizeCorpus))

Total words in text: 52644


In [120]:
# Frequency of node word (A variable)
A = 0

for token in word_list_nlp:
    if token.text == key:
        A = A + 1
    else:
        pass
    
# Print
print(A)

3


In [121]:
# Check the type of str_tokens
type(str_tokens)

str

In [122]:
# Frequency of collocate (B variable)
B = []

for i in range(0, len(colloc)-1):
    B_word = str(colloc[i])
    counter_B = 0
    for token in spacy_doc:
        if token.text == B_word:
            counter_B += 1
    B.append(counter_B)

print(B)

[33, 4, 1, 17, 1296, 1, 9, 824, 44, 1402, 120, 6, 172, 122, 1296, 866, 65, 213, 172, 662, 866, 2, 1322, 261, 1296, 7, 866, 911, 48]


In [123]:
# Frequency of collocate near the node word (AB variable)
AB = []

for i in range(0, len(colloc)-1):
    B_word = str(colloc[i])
    counter_AB = 0
    for token in colloc:
        if token.text == B_word:
            counter_AB += 1
    AB.append(counter_AB)

print(AB)

[1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 3, 1, 1]
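
Note that the two loops above run over `range(0, len(colloc)-1)`, so the last collocate is skipped, and a word that occurs several times near the search term is counted once per occurrence. One possible alternative, sketched here on toy data with plain strings rather than spaCy tokens, is `collections.Counter`, which gives one count per distinct word. The variable names and the toy lists are assumptions for illustration only.

```python
from collections import Counter

# Toy data standing in for the notebook's token lists (not the real novel)
text_tokens = "a dog is a dog and a cat is a cat".split()
collocates = ["a", "is", "is", "a", "and", "a"]

text_freq = Counter(text_tokens)    # B: how often each word appears in the text
colloc_freq = Counter(collocates)   # AB: how often each word appears as a collocate

# One row per distinct collocate: (word, AB, B)
rows = [(word, colloc_freq[word], text_freq[word]) for word in colloc_freq]
print(rows)
```

This also avoids scanning the whole text once per collocate, since each list is counted in a single pass.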


In [124]:
# Calculate MI score
# Create empty list
MI = []

# Define span
span = 10

# sizeCorpus is a list of words, so its length is the corpus size
for ab, b in zip(AB, B):
    MI_calc = math.log((ab * len(sizeCorpus)) / (A * b * span)) / math.log(2)
    MI.append(MI_calc)

In [125]:
# The same calculation for a single collocate (here the first one)
MI_first = math.log((AB[0] * len(sizeCorpus)) / (A * B[0] * span)) / math.log(2)

## Create CSV file

In [126]:
# import package
import pandas as pd

In [127]:
# Make list (collocates as plain strings)
list_context = list(zip([token.text for token in colloc], B, AB, MI))

# Create dataframe
data = pd.DataFrame(list_context, columns = ["colloc", "B", "AB", "MI"])

# print data
print(data)

In [129]:
# Convert to csv
data.to_csv("Output_Assignment1.csv", encoding = "utf-8")
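
The assignment asks for at least four columns: the collocate, how often it appears as a collocate, how often it appears in the text, and the mutual information score. As a minimal sketch of that layout, the block below writes such a file with the standard-library `csv` module instead of pandas. The column names, the row values and the file name `collocates_example.csv` are all invented for illustration.

```python
import csv

header = ["collocate", "collocate_freq", "text_freq", "MI"]
rows = [("cart", 1, 1, 1.5), ("a", 3, 12, 0.25)]  # invented values

# newline="" lets the csv module control line endings itself
with open("collocates_example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```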