## Assignment 1 Erik Hystad
Contents:
    * Code: CorrectionTool.py, MyBigramFilterFinder.py  -> independent programs that can be used elsewhere
    * Code/report: Assignment_1_Erik_Hystad.ipynb   -> report/implementation
    * Data: 1_1_1.txt, 1_1_2.txt    -> Collocations from the first task


Requirements:

NLTK
    * corpus.brown
    * corpus.wordnet
    * collocations
tqdm (used for the progress bar in the notebook)
Jupyter Notebook (not necessary if you only use the .py files)

Run all cells to get the results. If you already have the collocation text files, skip to the last cell to use the correction tool.

#### Python files
Alternatively, there are two Python files with classes that you can import into other programs.


##### Task 1.1/MyBigramFilterFinder.py

MyBigramFilterFinder -> bigram_filter = MyBigramFilterFinder() <br>
bigram_filter.get_hypothesis_tested_bigrams -> returns a list of hypothesis tested bigrams<br>
bigram_filter.get_freq_and_noun_adj_filtered -> returns a list of bigrams filtered by frequency and by noun/adjective tags


##### Task 1.2/CorrectionTool.py

CorrectionTool.py -> correction_tool = CorrectionTool(file_path_of_collocation_library)<br>
correction_tool.correct(first_word, second_word)

In [1]:
import math
import nltk
from nltk.corpus import brown
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import wordnet
from tqdm import tqdm

### Task 1.1
I need to find, filter and test all collocations.
First I will load the corpus and find all collocations with the BigramCollocationFinder.

In [2]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
corpus = brown.words()

finder = BigramCollocationFinder.from_words(corpus)

#### Task 1.1.1
Then I apply a frequency filter to the finder, set here to 6, which removes every collocation that
appears fewer than 6 times. I chose 6 to shorten the time the hypothesis testing in 1.1.2 takes to run.
After that I keep only the collocations in which both words are nouns or adjectives.

Finally I write these to a file, '1_1_1.txt'.

In [3]:
finder.apply_freq_filter(6)
frequency_collocations = finder.nbest(bigram_measures.pmi, 10000)


tagged = [nltk.pos_tag(bigram) for bigram in frequency_collocations]
including = ['NN', 'JJ']


# Return True if every word in the bigram is tagged as a noun or an adjective.
# Substring matching means that tags such as NNS, NNP, JJR and JJS are accepted too.
def check(bigram):
    return all(any(cls in tag for cls in including) for _, tag in bigram)


noun_and_adjectives_collocations = [bigram for bigram in tagged if check(bigram)]

path_1 = "1_1_1.txt"
with open(path_1, 'w') as file:
    for bigram in noun_and_adjectives_collocations:
        file.write(bigram[0][0] + ' ' + bigram[1][0] + '\n')
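
Because the tag check looks for 'NN' or 'JJ' anywhere inside the tag, it accepts more than plain nouns and adjectives. The sketch below reproduces that matching rule on a few hand-tagged bigrams. The tagged examples and the helper names `is_noun_or_adjective` and `keep_bigram` are made up for illustration; they are not output from `nltk.pos_tag`.

```python
INCLUDING = ['NN', 'JJ']  # same tag fragments as in the cell above


def is_noun_or_adjective(tag):
    # Substring test: 'NN' matches NN, NNS, NNP, NNPS; 'JJ' matches JJ, JJR, JJS.
    return any(fragment in tag for fragment in INCLUDING)


def keep_bigram(tagged_bigram):
    # A bigram is kept only if every word in it passes the tag test.
    return all(is_noun_or_adjective(tag) for _, tag in tagged_bigram)


# Hand-written (word, tag) pairs, assumed for illustration only.
examples = [
    [('civil', 'JJ'), ('war', 'NN')],
    [('United', 'NNP'), ('States', 'NNPS')],
    [('of', 'IN'), ('the', 'DT')],
    [('very', 'RB'), ('large', 'JJ')],
]

kept = [bigram for bigram in examples if keep_bigram(bigram)]
for bigram in kept:
    print(' '.join(word for word, _ in bigram))
```

Note that proper nouns (NNP, NNPS) pass the filter as well, which may or may not be wanted depending on the task.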

#### Task 1.1.2

Here I run a hypothesis test on the remaining bigrams, i.e. the ones that consist of nouns and adjectives and
occur at least 6 times.

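For each bigram: <br>
&emsp; sample mean = bigram count / corpus length<br>
&emsp; mean of the dist = P(first word) * P(second word)<br>
&emsp; t = (sample mean - mean of the dist) / sqrt(p * (1 - p) / corpus length), where p is the sample mean<br>
&emsp; if t > confidence:<br>
&emsp;&emsp; Write bigram to file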
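
To make the formula concrete, here is a small worked example with made-up counts (they are not taken from the Brown corpus). The function name `t_score` and its parameters are assumptions for this sketch; the arithmetic follows the steps listed above.

```python
import math


def t_score(bigram_count, first_count, second_count, n):
    # Observed probability of the bigram.
    sample_mean = bigram_count / n
    # Expected probability if the two words occurred independently.
    expected = (first_count / n) * (second_count / n)
    # Standard error, using p * (1 - p) with p estimated by the sample mean.
    std_err = math.sqrt(sample_mean * (1 - sample_mean) / n)
    return (sample_mean - expected) / std_err


# Made-up counts: the bigram occurs 30 times in a corpus of one million tokens.
t = t_score(bigram_count=30, first_count=1000, second_count=500, n=1_000_000)
print(round(t, 2), t > 2.576)
```

With these numbers t comes out at about 5.4, so the bigram would pass the 2.576 threshold.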


In [4]:
n = len(corpus)


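def hypothesis_test(collocation, confidence=2.576):
    # 2.576 is the critical value for a significance level of 0.005 (one-sided).
    first, second = collocation
    # Observed probability of the bigram.
    sample_mean = (finder.ngram_fd[collocation] / n)
    # Expected probability if the two words were independent.
    # Note: corpus.count() scans the whole corpus on every call.
    mean_of_the_dist = ((corpus.count(first) / n) * (corpus.count(second) / n))
    # Standard error uses p * (1 - p), with p estimated by the sample mean.
    t = (sample_mean - mean_of_the_dist) / (math.sqrt((sample_mean * (1 - sample_mean))/n))
    return t > confidence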


path_2 = "1_1_2.txt"
with open(path_2, 'w') as file:
    pbar = tqdm(total=len(noun_and_adjectives_collocations), desc='Hypothesis test')
    for bigram in noun_and_adjectives_collocations:
        if hypothesis_test((bigram[0][0], bigram[1][0])):
            file.write(bigram[0][0] + ' ' + bigram[1][0] + '\n')
        pbar.update()
    pbar.close()

Hypothesis test: 100%|██████████| 960/960 [37:42<00:00,  2.36s/it]
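
Most of the time per bigram in the run above probably goes to `corpus.count(...)`, which scans the entire corpus twice for every bigram. A minimal sketch of the same test with all counts tallied once is shown below. It uses `collections.Counter` on a toy token list rather than the Brown corpus, and the names `unigram_counts`, `bigram_counts` and `t_score` are illustrative. In the notebook, the finder's `word_fd` attribute should hold equivalent unigram counts, though that has not been checked against this run.

```python
import math
from collections import Counter

# Toy corpus, assumed for illustration only.
tokens = "the civil war ended and the civil war memorial opened after the war".split()
n = len(tokens)

# Count every word and every adjacent pair once, up front.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def t_score(first, second):
    # Assumes the bigram occurs at least once, otherwise the denominator is zero.
    sample_mean = bigram_counts[(first, second)] / n
    expected = (unigram_counts[first] / n) * (unigram_counts[second] / n)
    return (sample_mean - expected) / math.sqrt(sample_mean * (1 - sample_mean) / n)


print(t_score('civil', 'war'))
```

Each test then costs a few dictionary lookups instead of two passes over the corpus.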


### Task 1.2
Here the tool takes an input, which should be an incorrect collocation, and returns
a corrected version of it. Run the next cell to try it with an input, or call the correction_tool
function with the two words of a bigram.

The correction tool works as follows:<br>
&emsp;    for all synonyms of the first word:<br>
&emsp;&emsp;        if the synonym exists in the collocation library as a first word:<br>
&emsp;&emsp;&emsp;            second_words <- the second word of every collocation that has this synonym in the first position<br>
&emsp;&emsp;&emsp;            for all words in second_words:<br>
&emsp;&emsp;&emsp;&emsp;                for all synonyms of the second word of the input bigram (not the words found above):<br>
&emsp;&emsp;&emsp;&emsp;&emsp;                    if the synonym equals the word from second_words and the result differs from the input -> you have found a correction.<br>

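The WordNet lookups take a few steps: wordnet.synsets(word) returns the synsets of a word, synset.lemmas() returns the lemmas of a synset, and lemma.name() returns the word itself.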

The runtime is not optimal. With
    * n: the number of synonyms of the first word
    * m: the number of collocations in the library
    * p: the number of collocations that start with a given synonym
    * q: the number of synonyms of the second input word
the runtime is about n * (m + p * q).
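
One possible way to reduce this cost is to index the library by first word and to look up the synonyms of the second input word only once. The sketch below assumes a toy collocation list and a hypothetical `SYNONYMS` table standing in for the WordNet lookups; `seconds_by_first` and `correct` are illustrative names, not part of the notebook.

```python
from collections import defaultdict

# Toy collocation library; in the notebook this would be read from 1_1_2.txt.
learned_collocations = [('civil', 'war'), ('civil', 'rights'), ('high', 'school')]

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    'polite': ['polite', 'civil'],
    'war': ['war', 'warfare'],
    'tall': ['tall', 'high'],
    'school': ['school'],
}

# Map each first word to the set of second words it was seen with (built once, O(m)).
seconds_by_first = defaultdict(set)
for first_word, second_word in learned_collocations:
    seconds_by_first[first_word].add(second_word)


def correct(first, second):
    # Synonyms of the second input word are computed once, not per candidate.
    second_synonyms = SYNONYMS.get(second, [second])
    for f_synonym in SYNONYMS.get(first, [first]):
        candidates = seconds_by_first.get(f_synonym, set())
        for s_synonym in second_synonyms:
            # Accept a known collocation that differs from the input bigram.
            if s_synonym in candidates and (f_synonym, s_synonym) != (first, second):
                return f_synonym + ' ' + s_synonym
    return ''


print(correct('polite', 'war'))
```

After the one-time index build, each query needs roughly n * q set lookups instead of a scan over all m collocations per synonym.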

In [26]:
with open(path_2, 'r') as file:
    raw = file.read()

learned_collocations = [(x.split()[0].lower(), x.split()[1].lower()) for x in raw.split('\n') if x != '']
# This is used to check if the first word of the bigram is in the library in O(1) time
learned_collocations_firsts = set([x[0] for x in learned_collocations])

def correction_tool(first, second):
    for synset in wordnet.synsets(first):
        for synonym in synset.lemmas():
            f_synonym = synonym.name()
            if f_synonym in learned_collocations_firsts:
                second_learned = [x[1] for x in learned_collocations if x[0] == f_synonym]
                for s in second_learned:
                    for second_synset in wordnet.synsets(second):
                        for second_synonym in second_synset.lemmas():
                            s_synonym = second_synonym.name()
                            if s == s_synonym and first + ' ' + second != f_synonym + ' ' + s:
                                print('Changed', first + ' ' + second, 'to', f_synonym + ' ' + s + '!')
                                return f_synonym + ' ' + s
    print("Found no matching collocation.")
    return ""

first, second = input("Write a bigram to see if it can be corrected to a collocation, e.g. polite war: ").split()
correction_tool(first, second)

Changed polite war to civil war!


'civil war'