# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll build up some code to discover which terms are used similarly across a document. Our first task is to create a word-frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key will consist of a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.

In [14]:
from pprint import pprint
from collections import defaultdict
import re
import csv
import spacy
from collections import Counter
import numpy as np

In [2]:
nlp = spacy.load('en')

In [3]:
#from previous part, this function loads the paragraphs of the book:
def load_book(book_id):
    id_string = str(book_id)
    # use a context manager so the file is closed after reading
    with open("./data/books/"+id_string+".txt", "r") as text_file:
        booktext = text_file.read()
    paragraphs = booktext.split('\n\n')
    return paragraphs

In [4]:
paragraphs = load_book(84)

In [66]:
def count_words(paragraph, pos = True, lemma = True):
    frequency = Counter()
    doc = nlp(paragraph)
    for eachword in doc:
        # text: lemma or surface form; tag: part of speech or empty string
        text = eachword.lemma_ if lemma else eachword.text
        tag = eachword.pos_ if pos else ""
        frequency[(text, tag)] += 1

    return frequency

In [67]:
print(paragraphs[9])

You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.  I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.


In [68]:
print(count_words(paragraphs[9], pos = True, lemma = True))

Counter({('\n', 'SPACE'): 4, ('-PRON-', 'ADJ'): 4, ('-PRON-', 'PRON'): 3, ('of', 'ADP'): 3, ('to', 'PART'): 2, ('have', 'VERB'): 2, ('the', 'DET'): 2, ('.', 'PUNCT'): 2, ('and', 'CCONJ'): 2, ('will', 'VERB'): 1, ('rejoice', 'VERB'): 1, ('hear', 'VERB'): 1, ('that', 'ADP'): 1, ('no', 'DET'): 1, ('disaster', 'NOUN'): 1, ('accompany', 'VERB'): 1, ('commencement', 'NOUN'): 1, ('an', 'DET'): 1, ('enterprise', 'NOUN'): 1, ('which', 'ADJ'): 1, ('regard', 'VERB'): 1, ('with', 'ADP'): 1, ('such', 'ADJ'): 1, ('evil', 'ADJ'): 1, ('foreboding', 'NOUN'): 1, (' ', 'SPACE'): 1, ('arrive', 'VERB'): 1, ('here', 'ADV'): 1, ('yesterday', 'NOUN'): 1, (',', 'PUNCT'): 1, ('first', 'ADJ'): 1, ('task', 'NOUN'): 1, ('be', 'VERB'): 1, ('assure', 'VERB'): 1, ('dear', 'ADJ'): 1, ('sister', 'NOUN'): 1, ('welfare', 'NOUN'): 1, ('increase', 'VERB'): 1, ('confidence', 'NOUN'): 1, ('in', 'ADP'): 1, ('success', 'NOUN'): 1, ('undertaking', 'NOUN'): 1})


In [69]:
print(count_words(paragraphs[9], pos = True, lemma = True)[('of', 'ADP')]) #which is correct :D

3


__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words` appropriately, passing through the `lemma` and `pos` arguments specified by the caller of `book_TDM`.

To prove your code's function, process `book_id = 84` with both `pos = True` and `lemma = True`, and print the `TDM`'s `.shape` attribute and the first ten terms in `all_words`.

In [130]:
def book_TDM(book_id, pos = True, lemma = True):
    id_string = str(book_id)
    # use a context manager so the file is closed after reading
    with open("./data/books/"+id_string+".txt", "r") as text_file:
        text = text_file.read()
    doc = nlp(text)
    ## the 'master' set, keeps track of the words in all documents
    all_words = set()

    ## store the word frequencies by sentence
    all_doc_frequencies = {}

    ## loop over the sentences
    for j, sentence in enumerate(doc.sents):
        frequency = count_words(sentence.text, pos, lemma) 
        #print(frequency)
        all_doc_frequencies[j] = frequency
        doc_words = set(frequency.keys())
        all_words = all_words.union(doc_words)

    ## create a matrix of zeros: (words) x (documents)
    TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
    ## fix a word ordering for the rows
    all_words = sorted(list(all_words))

    ## loop over the (sorted) document numbers and (ordered) words; fill in matrix
    for j in all_doc_frequencies:
        for i, word in enumerate(all_words):
            TDM[i,j] = all_doc_frequencies[j][word]
    return (TDM, all_words)

#### The function processes the whole book, so it takes a while to run.

In [131]:
TDM_84, all_words_84 = book_TDM(84, pos = True, lemma = True) #this processes TDM for the whole book

In [132]:
TDM_84.shape 

(6151, 3470)

In [133]:
all_words_84[:10]

[('\n', 'SPACE'),
 ('\n\n', 'SPACE'),
 ('\n\n  ', 'SPACE'),
 ('\n\n    ', 'SPACE'),
 ('\n\n     ', 'SPACE'),
 ('\n\n               ', 'SPACE'),
 ('\n\n                    ', 'SPACE'),
 ('\n\n                                                ', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE')]

In [134]:
all_words_84[-10:]

[('yield', 'VERB'),
 ('yon', 'NOUN'),
 ('you', 'PRON'),
 ('young', 'ADJ'),
 ('youngster', 'NOUN'),
 ('your', 'ADJ'),
 ('yourself', 'NOUN'),
 ('youth', 'NOUN'),
 ('youthful', 'ADJ'),
 ('zeal', 'NOUN')]
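
The fill loop in `book_TDM` visits every (term, sentence) pair, although most entries are zero. Below is a minimal sketch of a sparser fill, assuming the same one-`Counter`-per-sentence layout with sentence indices running from 0 upward. It looks up each observed term's row in a dictionary and writes only the non-zero counts. The helper name `fill_tdm` and the toy counters are hypothetical, and nested lists stand in for the NumPy array so the snippet runs on its own.

```python
from collections import Counter

def fill_tdm(all_doc_frequencies):
    """Build a (terms x documents) count matrix from {doc_index: Counter}."""
    # fix a term ordering for the rows, as in book_TDM
    all_words = sorted(set().union(*all_doc_frequencies.values()))
    row_of = {word: i for i, word in enumerate(all_words)}
    # plain nested lists stand in for np.zeros((n_terms, n_docs))
    tdm = [[0] * len(all_doc_frequencies) for _ in all_words]
    # write only the counts that were actually observed
    for j, frequency in all_doc_frequencies.items():
        for word, count in frequency.items():
            tdm[row_of[word]][j] = count
    return tdm, all_words

# toy data: two "sentences" with (text, tag) keys
toy = {
    0: Counter({("monster", "NOUN"): 2, ("be", "VERB"): 1}),
    1: Counter({("be", "VERB"): 1, ("young", "ADJ"): 1}),
}
tdm, words = fill_tdm(toy)
```

The same two loops should work unchanged on an array created with `np.zeros`, writing `TDM[row_of[word], j] = count` instead.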

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which should take two arbitrary numeric vectors and return their cosine similarity, as described in __Section 1.1.2.10__.

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim`) to output a list of cosine similarity values between the word/row `i` and all other rows in the `TDM`.

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!

In [135]:
def sim(u,v): return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [173]:
def term_sims(i, TDM): return [sim(TDM[i,], TDM[x,]) for x in range(TDM.shape[0]) if x!=i]

In [174]:
len(term_sims(2, TDM_84))

6150

In [175]:
term_sims(1000, TDM_84)[:20]

[0.025026082427542392,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.020117019055664216,
 0.0,
 0.06468462273531508,
 0.02643505285727147,
 0.0,
 0.0,
 0.0,
 0.0]
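
As a quick sanity check of the cosine similarity, here is a small worked example on toy vectors. It is a sketch using only the standard library, and `cosine` is a hypothetical stand-in for `sim`, not the notebook's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# rows of a toy term-document matrix (counts per sentence)
row_a = [1, 0, 1, 0]
row_b = [1, 1, 0, 0]   # shares one sentence with row_a
row_c = [0, 1, 0, 1]   # shares no sentence with row_a

sim_ab = cosine(row_a, row_b)   # 1 / (sqrt(2) * sqrt(2)), about 0.5
sim_ac = cosine(row_a, row_c)   # zero dot product, so 0.0
```

Because counts are non-negative, two terms score exactly 0 whenever they never occur in the same sentence, which is why so many entries in the output above are `0.0`.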

__C4.__ _(7 pts)_ Finally, your goal now is to write a function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms most similar to the one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`. Once complete, prove your function's utility on a `TDM` produced for `book_id = 84` and exhibit the top 25 similar terms to both `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

Once computation is complete, comment on the ordered results returned in the markdown cell below. Do you think the algorithm is exhibiting sensible results? What would you do to improve it?

\[Hint: to locate the row containing the term of interest, utilize the list `.index()` method in application to the `terms` argument.\]

<font color=blue>One problem with this method is that newlines, spaces, etc. are counted as words, as we can see at the beginning of the sorted list of words, and they contribute to the similarity calculations. One way to improve the method is to exclude those tokens from the computations. It may also be useful to use a list of stop words as a further filtering criterion.</font>
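
One way to act on this is to drop the whitespace and punctuation keys before building the matrix. The sketch below assumes the `(text, tag)` keys carry the coarse tags `SPACE` and `PUNCT`, as in the output above. The helper `drop_tags` is hypothetical, and it only has an effect when `pos = True`, since otherwise every tag is the empty string.

```python
from collections import Counter

def drop_tags(frequency, excluded=("SPACE", "PUNCT")):
    """Return a copy of the Counter without keys whose tag is excluded."""
    return Counter({key: n for key, n in frequency.items() if key[1] not in excluded})

# toy counts shaped like the output of count_words(..., pos = True)
freq = Counter({("\n", "SPACE"): 4, (".", "PUNCT"): 2, ("monster", "NOUN"): 1})
cleaned = drop_tags(freq)
```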

In [176]:
def most_similar(term, terms, TDM, top = 25):
    sim_info = []
    i = terms.index(term)
    sims = term_sims(i, TDM)
    for j, eachsim in enumerate(sims):
        sim_info.append([j, eachsim, terms[j]])
    sim_data = sorted(sim_info, key=lambda x: x[1], reverse = True)
    output = sim_data[:top]
    return output
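
Note that `term_sims` skips row `i`, so position `j` in its output corresponds to row `j` only for `j < i`, and to row `j + 1` from there on. Below is a minimal sketch of mapping positions back to row indices before labelling them, using hypothetical toy data and a hypothetical helper `rows_for_sims`.

```python
def rows_for_sims(i, n_rows):
    """Row index for each position of a similarity list that omits row i."""
    return [x for x in range(n_rows) if x != i]

terms = ["a", "b", "c", "d"]       # toy term list
i = 1                               # query term "b"
sims = [0.2, 0.9, 0.5]              # toy similarities to "a", "c", "d"

# pair each similarity with its true row before sorting
ranked = sorted(
    ([row, s, terms[row]] for row, s in zip(rows_for_sims(i, len(terms)), sims)),
    key=lambda entry: entry[1],
    reverse=True,
)
```

Here `ranked` labels the similarity `0.9` with row 2 (`"c"`), whereas indexing `terms` by list position would label it with row 1, the query term itself.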

In [177]:
monster_similars = most_similar(('monster', 'NOUN'), all_words_84, TDM_84, top = 25)
beautiful_similars = most_similar(('beautiful', 'ADJ'), all_words_84, TDM_84, top = 25)

In [178]:
print("Similarity info for the word monster as a noun: \n")
pprint(monster_similars)
print("\nSimilarity info for the word beautiful as an adjective: \n")
pprint(beautiful_similars)

Similarity info for the word monster as a noun: 

[[467, 0.17407765595569785, ('asseveration', 'NOUN')],
 [659, 0.17407765595569785, ('besiege', 'VERB')],
 [1225, 0.17407765595569785, ('convulse', 'VERB')],
 [1547, 0.17407765595569785, ('detested', 'ADJ')],
 [1588, 0.17407765595569785, ('dim', 'NOUN')],
 [1645, 0.17407765595569785, ('disown', 'VERB')],
 [2029, 0.17407765595569785, ('existence', 'VERB')],
 [2295, 0.17407765595569785, ('forehead', 'NOUN')],
 [2387, 0.17407765595569785, ('furiously', 'ADV')],
 [2925, 0.17407765595569785, ('inevitable', 'ADJ')],
 [3117, 0.17407765595569785, ('jeer', 'NOUN')],
 [3206, 0.17407765595569785, ('languish', 'VERB')],
 [3278, 0.17407765595569785, ('lid', 'NOUN')],
 [3504, 0.17407765595569785, ('merciless', 'NOUN')],
 [3544, 0.17407765595569785, ('mirror', 'NOUN')],
 [3606, 0.17407765595569785, ('mortal', 'NOUN')],
 [4320, 0.17407765595569785, ('proportionably', 'ADV')],
 [4418, 0.17407765595569785, ('rational', 'ADJ')],
 [4808, 0.17407765595569785