In [1]:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import brown
import csv


# Assignment 1 - *Linguistic Essentials and Collocations*

**NOTE**: I decided to write the entire report in this notebook. Everything you will need to run the program is found here. 

### Problem introduction:

A collocation is a phrase of two or more words that co-occur in a given context more often than would be expected from the frequencies of the individual words. For example, in a set of hospital-related documents, ‘CT’ and ‘scan’ appear together as ‘CT scan’ far more often than chance would predict. ‘CT scan’ is also a meaningful phrase, and collocations in general simply sound "right" to a native speaker. To capture the essence of a language, it is important that NLP models capture and include collocations. 

The task of identifying collocations felt quite daunting at first, but with a large corpus it is really not that challenging. Two methods are tried and tested in this notebook. The first identifies collocations by their frequency, which is the simplest way of detecting them: count how often each bigram occurs, and treat bigrams that occur frequently as likely collocations. I refer to this as the frequency method. 
The second method is the hypothesis test method (t-test). It is a simple statistical test in which the null hypothesis is that the two words are independent of each other. By comparing how often the words occur in total with how often they occur in conjunction, we get statistical evidence for or against the pair being a collocation. *If a word occurs much more frequently next to another specific word than would be expected if the two were independent, the pair is likely a collocation*.
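
As a minimal sketch of the frequency method described above, the snippet below counts adjacent word pairs in a toy sentence. The function name `frequent_bigrams` and the example sentence are my own illustrations, not part of the assignment code.

```python
from collections import Counter

def frequent_bigrams(tokens, threshold=1):
    """Return bigrams that occur more than `threshold` times."""
    # zip pairs every token with its right-hand neighbour
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n > threshold}

# Toy input (hypothetical), lower-cased and whitespace-tokenised
tokens = "the ct scan was clear but the second ct scan was not".split()
result = frequent_bigrams(tokens)
print(result)  # {('ct', 'scan'): 2, ('scan', 'was'): 2}
```

Note that "scan was" is just as frequent as "ct scan" here, which is why raw counts are usually combined with a part-of-speech filter.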


![https://miro.medium.com/max/1205/1*-jIXJtKo0cq9UBY-WDX9Aw.png](hyp_1.png)


![https://miro.medium.com/max/803/1*NR6uUN5N9IJCqOCTUomE9g.png](hyp_2.png)


Being a collocation, "social media" (as seen in the example above) should be much more frequent than if the two words were truly independent.
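
To make the test concrete, here is a small worked sketch of the t-score computation. The counts are hypothetical and chosen only to keep the arithmetic easy to follow, and the helper name `t_score` is mine; for simplicity it uses the token count for both the word and the bigram probabilities.

```python
import math

def t_score(count_w1, count_w2, count_bigram, n_tokens):
    """t-score of a bigram under the null hypothesis that the words are independent."""
    p_w1 = count_w1 / n_tokens
    p_w2 = count_w2 / n_tokens
    expected = p_w1 * p_w2               # bigram probability if the words were independent
    observed = count_bigram / n_tokens   # bigram probability seen in the corpus
    # For rare events the sample variance is approximately the observed probability
    return (observed - expected) / math.sqrt(observed / n_tokens)

# Hypothetical counts: w1 occurs 500 times, w2 400 times, the bigram 100 times
t = t_score(count_w1=500, count_w2=400, count_bigram=100, n_tokens=1_000_000)
print(round(t, 2))  # 9.98
```

A score of about 9.98 is far above 2.576, so with these counts the independence hypothesis would be rejected and the pair kept as a collocation candidate.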


### Installation:

The notebook requires Python and Jupyter Notebook. Once both are installed, install the Natural Language Toolkit (NLTK) with a package manager; pip is the simplest (**pip install nltk**). No other packages are required, as the csv module is part of the Python standard library.

**NOTE**: When installing the toolkit, be sure to also download the corpora used here (the Brown corpus and WordNet).
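
One way to fetch the data from Python is sketched below. The resource identifiers are my reading of what this notebook uses (the Brown corpus, WordNet, and the universal tagset mapping needed for `tagset='universal'`); if NLTK reports a missing resource, its error message names the identifier to download.

```python
# Resource identifiers this notebook appears to rely on (an assumption, see above)
REQUIRED_RESOURCES = ("brown", "wordnet", "universal_tagset")

def download_resources(resources=REQUIRED_RESOURCES):
    """Download each NLTK resource by its identifier."""
    import nltk  # imported here so the list above can be inspected without NLTK installed
    for name in resources:
        nltk.download(name)

# download_resources()  # uncomment to run; needs network access
```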

### My approach: 

I originally did this in two ways, although only one is included below. First I solved 1.1 using the functions and objects from the toolkit, the most important being *BigramCollocationFinder* from nltk.collocations. 
Then I pivoted and chose to implement the functions "from scratch"... sort of. I used the already tagged words from the corpus to filter the bigrams, and nltk.ngrams to generate all the bigrams. I decided to only include bigram collocations where the first word is a noun or an adjective (the filter in the code also admits verbs) and the second word is a noun. Some collocations have a different structure than this, and the filter can easily be changed in the code; the restriction was mostly for simplicity. 
I did it this way to make sure I understood the assignment correctly. Since I got the same results using both the built-in approach and the custom approach, I decided to include only one of them. Here are some articles explaining the built-in approach (I found the Medium article especially useful).

https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a

http://www.nltk.org/howto/collocations.html

https://towardsdatascience.com/collocations-in-nlp-using-nltk-library-2541002998db
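
The tag-based filtering described above can be sketched without the corpus. The tagged tokens below are hand-written for illustration, and `filter_bigrams` with its `first_tags` and `second_tag` parameters is a hypothetical helper, not the function used later in the notebook.

```python
# Hypothetical hand-tagged tokens using universal tagset labels
tagged = [("the", "DET"), ("civil", "ADJ"), ("war", "NOUN"),
          ("ended", "VERB"), ("quickly", "ADV")]

def filter_bigrams(tagged_tokens, first_tags=("NOUN", "ADJ"), second_tag="NOUN"):
    """Keep adjacent word pairs whose tags match the allowed pattern."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs
            if t1 in first_tags and t2 == second_tag]

kept = filter_bigrams(tagged)
print(kept)  # [('civil', 'war')]
```

Widening `first_tags` or changing `second_tag` is all it takes to admit collocations with a different structure.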

Both functions (the frequency and hypothesis methods) take parameters for the corpus and the threshold. The threshold filters out bigrams that show little evidence of being collocations. 

When the notebook is run, two files are generated, "freq.csv" and "hyp.csv", each containing the collocations found by the respective method. To correct non-natural expressions, the function fixCollocation is used. It takes a variable number of words as arguments and forms a bigram from each adjacent pair. The data_filename argument selects which file of stored collocations to use. For each adjacent pair, the function checks whether its words are WordNet synonyms of the words of a collocation stored in the data file and, if so, prints that collocation. 
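
The synonym lookup at the heart of this step can be illustrated with a small stand-in for WordNet. The synonym table, the collocation list and the helper `suggest` below are all hypothetical; the real functions read the collocations from the csv file and query WordNet instead.

```python
# Hypothetical synonym table standing in for WordNet lookups
SYNONYMS = {
    "good": {"good", "well", "fine"},
    "deal": {"deal", "bargain", "trade"},
}

def suggest(w1, w2, collocations, synonyms=SYNONYMS):
    """Return the first stored collocation whose words are synonyms of w1 and w2."""
    for c1, c2 in collocations:
        # A word with no table entry only matches itself
        if w1.lower() in synonyms.get(c1, {c1}) and w2.lower() in synonyms.get(c2, {c2}):
            return f"{c1} {c2}"
    return None

stored = [("social", "media"), ("good", "deal")]
print(suggest("good", "bargain", stored))  # good deal
```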


### Function design choice:


In [2]:

def frequencyMethod(threshold = 3, corpus = brown):
    '''
    Generate candidate collocations using the frequency method as described in lecture 2.
    Only bigrams whose first word is a verb, noun or adjective and whose second word is a noun are counted.
    
    Params:
    - threshold: Lower bound for frequencies; only bigrams occurring more often than this are kept. Defaults to 3.
    - corpus: The text corpus used for the calculations. Defaults to the "brown" corpus.
    
    Output:
    - A dict mapping each frequent bigram ("w1 w2") to its frequency.
    '''
    
    freq = {}
    tags = corpus.tagged_words(tagset='universal')
    bigram_tags = list(nltk.ngrams(tags, 2))
    bigrams_filtered = list(filter(lambda x:x[0][1] in ['VERB', 'NOUN', 'ADJ'] and x[1][1] == 'NOUN', bigram_tags))
    
    for ((w1,t1),(w2,t2)) in bigrams_filtered:
        curr = freq.get(w1+" "+w2,0) # Combine words - use combination as a key
        freq[w1+" "+w2] = curr + 1
    
    freq_filtered = {k: v for k,v in freq.items() if v > threshold}
    return freq_filtered



def hypothesisMethod(threshold = 2.576, corpus = brown):
    '''
    Generate candidate collocations using the hypothesis test (t-test) method as described in lecture 2.
    Only bigrams whose first word is a verb, noun or adjective and whose second word is a noun are tested.
    
    First loops over the words and bigrams to get a count of each. Then loops over the unique filtered bigrams
    in the corpus and calculates the t-value of each. The formula can be found on the slides of lecture 2, or
    in the report.
    
    Params:
    - threshold: Lower bound for the t-value. Defaults to 2.576 ~ 99.5% confidence.
    - corpus: The text corpus used for the calculations. Defaults to the "brown" corpus.
    
    Output:
    - A dict mapping each bigram ("w1 w2") whose t-value exceeds the threshold to that t-value.
    '''
    res = {}
    words = corpus.words()
    
    N = len(words)
    N_bigrams = N-1
    
    freq_words = {words[0] : 1}
    freq_bigrams = {}

    
    for i in range(1, N):
        curr_word_freq = freq_words.get(words[i], 0)
        freq_words[words[i]] = curr_word_freq + 1
        
        curr_bigram_freq = freq_bigrams.get(words[i-1] + " " + words[i], 0)
        freq_bigrams[words[i-1] + " " + words[i]] = curr_bigram_freq + 1
    
    
    tags = corpus.tagged_words(tagset='universal')
    bigram_tags = list(nltk.ngrams(tags, 2))
    bigrams_filtered = set(filter(lambda x:x[0][1] in ['VERB', 'NOUN', 'ADJ'] and x[1][1] == 'NOUN', bigram_tags))
    
    for ((w1, t1), (w2, t2)) in bigrams_filtered:
        bigram = w1 + " " + w2
        w1_prob = freq_words[w1] / N
        w2_prob = freq_words[w2] / N
        
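        bigram_prob = freq_bigrams[bigram] / N_bigrams
        
        # t = (observed - expected) / sqrt(s^2 / N), where expected = P(w1) * P(w2) under
        # independence and the sample variance s^2 is approximated by the observed probability.
        t = (bigram_prob - (w1_prob*w2_prob)) / ((bigram_prob / N)**0.5)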
        if t > threshold:
            res[bigram] = t
    return res


In [3]:

def writeToFile(filename, data):
    '''
    Function that writes data to a csv file.
    
    Params:
    - filename: What the file shall be saved as (remember to include '.csv')
    - data: The data to be stored.
    '''
    # Sort by value (frequency or t-value), highest first
    data_sorted = sorted(data.items(), key=lambda item: item[1], reverse=True)
    
    with open(filename, "w", newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        for key, value in data_sorted:
            writer.writerow([key, value])

def loadData(filename):
    '''
    Function that reads data from the specified file. This function assumes the file is a correctly formatted csv-file.
    
    Params:
    - filename: The file we want to read.
    Output:
    - The data in the form of a list. Each entry is a row in the file.
    
    '''
    data = []
    # Use the same encoding as writeToFile; the with-block closes the file
    with open(filename, 'r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        for row in reader:
            data.append(row)
    return data

    
def getSynonyms(word):
    '''
    Function to get the synonyms of a given word using WordNet.
    
    Params:
    - word: The word we want the synonyms of.
    Output:
    - The synonyms of the input word in the form of a list.
    '''
    synonyms = set()  # a set removes duplicate lemma names
    for syn_set in wordnet.synsets(word):
        for lemma in syn_set.lemmas():
            synonyms.add(lemma.name())
    return list(synonyms)

In [4]:
writeToFile("freq.csv", frequencyMethod())
writeToFile("hyp.csv", hypothesisMethod(threshold = 1))

In [5]:

def checkBigram(w1, w2, data):
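    '''
    Function that checks if a bigram consisting of two words (w1 and w2) can be replaced with a collocation.
    
    Params: 
    - w1, w2: words in bigram we want to investigate.
    - data: the data containing all discovered collocations.
    Output:
    - Prints the first matching collocation, if one is found.
    - Returns True if a collocation is found, otherwise False.
    '''
    w1, w2 = w1.lower(), w2.lower()
    for row in data:
        w1_true, w2_true = row[0].split()
        # Allow at most one word to change: at least one input word must already
        # equal the corresponding word of the collocation.
        if w1 != w1_true.lower() and w2 != w2_true.lower():
            continue
        # Each input word must appear among the WordNet synonyms of its counterpart
        # (a word's synonym list normally includes the word itself).
        if w1 not in {s.lower() for s in getSynonyms(w1_true)}:
            continue
        if w2 in {s.lower() for s in getSynonyms(w2_true)}:
            print(w1_true, w2_true)
            return True
    return False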


In [6]:

def fixCollocation(*args, data_filename = "hyp.csv"):
    '''
    Function that takes a variable size input of words and prints out a preferred collocation for each pair 
    of adjacent words.
    '''
    data = loadData(data_filename)
    for i in range(1, len(args)):
        checkBigram(args[i-1], args[i], data)


In [7]:
print("If a collocation is found it is printed")
print('vvvvvvvvv')

fixCollocation("good", "bargain", "year")

If a collocation is found it is printed
vvvvvvvvv
good deal


In [8]:
print("If none is found, nothing is printed")
fixCollocation("How", "much", "could", "a", "...")

If none is found, nothing is printed


### In practice:

In practice the functions work reasonably well. The largest flaw is that the corpus is likely not big enough to gather sufficient information about the language to capture all collocations, so the function will sometimes make wrong predictions. I tried to combat this slightly by only allowing a single word to change at a time, so as to keep most of the meaning of the bigram. 

In [10]:
fixCollocation("polite", "war")

Civil War
