## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Edward Day
    - Email: ED558@drexel.edu
- Group member 2
    - Name: NA
    - Email: NA
- Group member 3
    - Name: NA
    - Email: NA
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: Jacob Rosen 
- Other (other): NA

# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll be building up some code to discover how different terms are utilized similarly across a document. For this, our first task will be to create a word frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key will consist of a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.

In [1]:
paragraph = """Word frequencies are probably the first and easiest 
numerical representation of text to compute. In some communities, 
this is referred to as the bag of words (BOW) model. 
Put simply, the BOW model simply counts up the 
number of times each word appears in a document. 
This of course depends on a few things, e.g., case and lemmatization. 
However, constructing a basic BOW model is quite straightforward, especially using `Counter`. 
Let's use this very paragraph as our example text for the BOW model."""

import spacy
nlp = spacy.load("en")
from collections import Counter
def count_words(paragraph, pos = True, lemma = True):
    frequency = Counter()
    doc = nlp(paragraph)
    for word in doc:
        heading = (word.text if lemma == False else word.lemma_ , "" if pos == False else word.pos_)
        frequency[heading] +=1
    return frequency
count_words(paragraph, pos= True, lemma = True)



Counter({('word', 'NOUN'): 3,
         ('frequency', 'NOUN'): 1,
         ('be', 'AUX'): 3,
         ('probably', 'ADV'): 1,
         ('the', 'DET'): 5,
         ('first', 'ADJ'): 1,
         ('and', 'CCONJ'): 2,
         ('easy', 'ADJ'): 1,
         ('\n', 'SPACE'): 7,
         ('numerical', 'ADJ'): 1,
         ('representation', 'NOUN'): 1,
         ('of', 'ADP'): 4,
         ('text', 'NOUN'): 2,
         ('to', 'PART'): 1,
         ('compute', 'VERB'): 1,
         ('.', 'PUNCT'): 6,
         ('in', 'ADP'): 2,
         ('some', 'DET'): 1,
         ('community', 'NOUN'): 1,
         (',', 'PUNCT'): 6,
         ('this', 'DET'): 3,
         ('refer', 'VERB'): 1,
         ('to', 'ADP'): 1,
         ('as', 'SCONJ'): 2,
         ('bag', 'NOUN'): 1,
         ('(', 'PUNCT'): 1,
         ('BOW', 'PROPN'): 3,
         (')', 'PUNCT'): 1,
         ('model', 'NOUN'): 4,
         ('Put', 'PROPN'): 1,
         ('simply', 'ADV'): 2,
         ('count', 'VERB'): 1,
         ('up', 'ADP'): 1,
         

__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words` appropriately, now passing through the user of `book_TDM`'s specified `lemma` and `pos` arguments.

To provde your code's function, process `book_id = 84` with both of `pos = True` and `lemma = True` and print out the `TDM`'s `.shape` attribute and the first ten terms in `all_words`.

In [2]:
import numpy as np
import spacy
import re
nlp = spacy.load("en")
def book_TDM(book_id, pos = True, lemma = True):
    all_words = set()
    book = open("./data/books/"+book_id+".txt","r").read().strip()
    paragraph = re.split("[\n]{2}",book)
    all_doc_frequencies = {}
    

    for j, Paragraph in enumerate(paragraph):
        frequency = count_words(Paragraph)
        all_doc_frequencies[j] = frequency
        doc_words = set(frequency.keys())
        all_words = all_words.union(doc_words)
    

    TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
    all_words = sorted(list(all_words))
    for j in all_doc_frequencies:
        for i, word in enumerate(all_words):
            TDM[i,j] = all_doc_frequencies[j][word]
    return TDM, all_words

TDM, terms = book_TDM("84", pos = True, lemma = True)
terms[:10]
    

[('\n', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE'),
 ('\n     ', 'SPACE'),
 ('\n                              ', 'SPACE'),
 (' ', 'SPACE'),
 ('  ', 'SPACE'),
 ('    ', 'SPACE'),
 ('     ', 'SPACE'),
 ('               ', 'SPACE')]

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which shoud take two arbitrary numeric vectors and compute/output the cosine similarity, as described in __Section 1.1.2.10__.  

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim`) to output a list of cosine similarity values between the word/row `i` and all others (rows) in the `TDM`. 

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!

In [12]:
def sim(u,v):
    Sim = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return Sim
def term_sims(i,TDM):
    cosines = []
    for j in range(TDM.shape[0]):
        if i != j:
            cosines.append(sim(TDM[i],TDM[j]))
    return cosines

In [13]:
term_sims(100,TDM)

[0.13014461838727612,
 0.0,
 0.0,
 0.0,
 0.0,
 0.13035721261658897,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.04485613040162566,
 0.08640699579358065,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.10290434478774935,
 0.0,
 0.0,
 0.0,
 0.0,
 0.07352146220938077,
 0.0,
 0.0,
 0.0,
 0.0717958158617738,
 0.12284273161059954,
 0.1332554289542996,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.10967253752026672,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.11396057645963795,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0

__C4.__ _(7 pts)_ Finally, your goal now is to a write function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms most similar to one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`. Once complete, prove your function's utility on a `TDM` produced for `book_id = 84` and exhibit the top 25 similar terms to both of `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

Once computation is complete, comment on the ordered results returned in the markdown cell below. Do you think the algorithm is exhibiting sensible result? What would you do to improve?

\[Hint: to locate the row containing the term of interest, utilize the list `.index()` method in application to the `terms` argument.\]

_Response._

In [14]:
def most_similar(term, terms, TDM, top = 25):
    x = terms.index(term)

    y = term_sims(x,TDM)

    z = []
    for i, value in enumerate(y):
        a = [i, value,terms[i]]
        z.append(a)
    b = sorted(z, key=lambda word: word[1], reverse = True)
    print(b[:top])
    


In [10]:
terms[3799]

('monster', 'NOUN')

In [15]:
most_similar(('beautiful','ADJ'),terms,TDM)
most_similar(('monster','NOUN'),terms,TDM)

[[349, 0.37499999999999994, ('Rotterdam', 'PROPN')], [1185, 0.37499999999999994, ('cast', 'VERB')], [5115, 0.37499999999999994, ('singular', 'ADJ')], [6108, 0.37499999999999994, ('when', 'ADV')], [263, 0.35355339059327373, ('Mainz', 'PROPN')], [265, 0.35355339059327373, ('Mannheim', 'PROPN')], [1750, 0.35355339059327373, ('delightful', 'ADJ')], [2050, 0.35355339059327373, ('dull', 'ADJ')], [2473, 0.35355339059327373, ('fifteen', 'NUM')], [3605, 0.35355339059327373, ('lustrous', 'ADJ')], [3685, 0.35355339059327373, ('mean', 'VERB')], [4192, 0.35355339059327373, ('peaked', 'ADJ')], [4271, 0.35355339059327373, ('picturesque', 'ADJ')], [4343, 0.35355339059327373, ('population', 'NOUN')], [5049, 0.35355339059327373, ('ship', 'NOUN')], [5073, 0.35355339059327373, ('shrink', 'VERB')], [5306, 0.35355339059327373, ('steel', 'NOUN')], [5332, 0.35355339059327373, ('stove', 'PROPN')], [5947, 0.35355339059327373, ('varied', 'ADJ')], [5993, 0.35355339059327373, ('vine', 'NOUN')], [6126, 0.3535533905