## Module submission header
### Submission preparation instructions
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replace the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments about the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Austin Eversole
    - Email: ae588@drexel.edu
- Group member 2
    - Name: Gregory Savage
    - Email: gs824@drexel.edu
- Group member 3
    - Name: Robert Thompson
    - Email: rt598@drexel.edu


### Additional submission comments
- Tutoring support received: From Anamika (CA/TA)
- Other (other): NA

# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll be building up some code to discover how different terms are utilized similarly across a document. For this, our first task will be to create a word frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key will consist of a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.

In [None]:
# C1:Function(12/12)

from collections import Counter
import spacy, json, re

#nlp = spacy.load("en")
nlp = spacy.load("en_core_web_sm")

def count_words(paragraph, pos = True, lemma = True):

    #---Your code starts here

    #Counter() contains the number of times each heading is observed in paragraph

    #key consists of heading = (text, tag)
    frequency = Counter()

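    for word in nlp(paragraph):
        # Use the lemma or the surface text, depending on `lemma`
        text = word.lemma_ if lemma else word.text
        # Use the coarse part-of-speech tag, or "" when `pos` is off
        tag = word.pos_ if pos else ""
        heading = (text, tag)

        frequency[heading] += 1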

    #---Your code ends here

    return frequency

Let's make sure your function works by testing it on a short sentence.

In [None]:
# C1:SanityCheck

count_words("The quick brown fox jumps over the lazy dog.", True, True)


Counter({('the', 'DET'): 2,
         ('quick', 'ADJ'): 1,
         ('brown', 'ADJ'): 1,
         ('fox', 'NOUN'): 1,
         ('jump', 'VERB'): 1,
         ('over', 'ADP'): 1,
         ('lazy', 'ADJ'): 1,
         ('dog', 'NOUN'): 1,
         ('.', 'PUNCT'): 1})
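
The effect of the two flags on the keys is easiest to see without running a language model. The following is a minimal sketch, not the graded solution: `TOKENS` and `count_tokens` are hypothetical names, and the hand-labelled `(text, lemma, pos)` triples stand in for what `spacy` would supply through `word.text`, `word.lemma_`, and `word.pos_`.

```python
from collections import Counter

# Hypothetical pre-analysed tokens: (text, lemma, pos).
# In the notebook these values come from spaCy, not from a hand-written list.
TOKENS = [
    ("The", "the", "DET"),
    ("dogs", "dog", "NOUN"),
    ("chase", "chase", "VERB"),
    ("the", "the", "DET"),
    ("dog", "dog", "NOUN"),
]

def count_tokens(tokens, pos=True, lemma=True):
    """Count (text, tag) headings from already-analysed tokens."""
    frequency = Counter()
    for text, lem, tag in tokens:
        heading = (lem if lemma else text, tag if pos else "")
        frequency[heading] += 1
    return frequency

# Lemmas plus tags merge "The"/"the" and "dogs"/"dog" into shared keys
with_both = count_tokens(TOKENS, pos=True, lemma=True)

# Surface text with empty tags keeps every distinct spelling separate
surface_only = count_tokens(TOKENS, pos=False, lemma=False)
```

With both flags on, the five tokens collapse into three headings; with both off, all five spellings remain distinct keys with an empty tag.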

__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words` appropriately, now passing through the user of `book_TDM`'s specified `lemma` and `pos` arguments.

In [None]:
# C2:Function(8/8)

import numpy as np
from collections import Counter
import re

def book_TDM(book_id, pos = True, lemma = True):

  #---Your code starts here---
  from google.colab import drive
  drive.mount('/content/drive')

  # Update to match your current system working directory
  # This can be Google CoLab or Local System
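  full_file_paths = "/content/drive/MyDrive/dsci521/Assignment1/data/books"
  # str() lets book_id be passed as either 84 or "84"
  path = full_file_paths + "/" + str(book_id) + ".txt"
  # Use a context manager so the file handle is closed after reading
  with open(path, "r") as f:
    bookText = f.read().strip()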


  doc = re.split("\n\n+", bookText)
  ## the 'master' set, keeps track of the words in all documents
  all_words = set()

  ## store the word frequencies by paragraph
  all_doc_frequencies = {}

  ## loop over the paragraphs (each blank-line-separated chunk is one "document")
  for j, sentence in enumerate(doc):
    frequency = count_words(sentence, pos, lemma)
    all_doc_frequencies[j] = frequency
    doc_words = set(frequency.keys())
    all_words = all_words.union(doc_words)

  ## create a matrix of zeros: (words) x (documents)
  TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
  ## fix a word ordering for the rows
  all_words = sorted(list(all_words))
  ## loop over the (sorted) document numbers and (ordered) words; fill in matrix
  #print(all_doc_frequencies)
  #print(all_words)
  for j in all_doc_frequencies:
    for i, word in enumerate(all_words):
      #The first value for all_doc_frequencies[j][word] is SPACE
      TDM[i,j] = all_doc_frequencies[j][word]


  #TDM[:10,]
  #---Your code ends here---

  return(TDM, all_words)


To test your function, let's process `book_id = 84` with both `pos = True` and `lemma = True`, then print the `TDM`'s `.shape` attribute and the first ten terms in `all_words`.

In [None]:
# C2:SanityCheck

TDM, terms = book_TDM("84", pos = True, lemma = True)
terms[:10]

Mounted at /content/drive


[('\n', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE'),
 ('\n     ', 'SPACE'),
 ('\n                              ', 'SPACE'),
 (' ', 'SPACE'),
 ('  ', 'SPACE'),
 ('    ', 'SPACE'),
 ('     ', 'SPACE'),
 ('               ', 'SPACE')]

In [None]:
# C2:SanityCheck

TDM.shape

(6262, 723)
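
The shape reads as (terms, paragraphs): one row per distinct heading and one column per blank-line-separated chunk of the book. A toy version of the same construction makes the layout concrete. This is only a sketch: the three counters below are made up and stand in for `count_words` output, and `toy_TDM` is a hypothetical name.

```python
import numpy as np
from collections import Counter

# Hypothetical per-paragraph counters, standing in for count_words() output
doc_frequencies = {
    0: Counter({("monster", "NOUN"): 2, ("the", "DET"): 1}),
    1: Counter({("the", "DET"): 3}),
    2: Counter({("beautiful", "ADJ"): 1, ("monster", "NOUN"): 1}),
}

# Fix a row ordering by sorting the union of all headings
all_words = sorted(set().union(*doc_frequencies.values()))

# Rows are terms, columns are paragraphs
toy_TDM = np.zeros((len(all_words), len(doc_frequencies)))
for j, frequency in doc_frequencies.items():
    for i, word in enumerate(all_words):
        toy_TDM[i, j] = frequency[word]  # a Counter returns 0 for unseen keys

print(all_words)
print(toy_TDM.shape)
```

Because tuples sort element by element, the rows come out ordered by text first and tag second, which is why the whitespace tokens lead the real `terms` list above.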

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which should take two arbitrary numeric vectors and compute/output the `cosine_similarity`, as described in __Section 1.1.2.10__.

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim` function) to output a list of cosine similarity values (`sim_values`) between the word/row `i` and all others (rows) in the `TDM`.

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!

In [None]:
# C3:Function(4/8)
def sim(u,v):

    #---Your code starts here
    cosine_similarity = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
    #---Your code ends here

    return cosine_similarity

In [None]:
# C3:SanityCheck

print("Exactly similar:", sim(np.array([1,2,3]), np.array([1,2,3])))
print("Exactly dissimilar:", sim(np.array([1,2,3]), np.array([-1,-2,-3])))
print("In the middle:", sim(np.array([1,1]), np.array([-1,1])))

Exactly similar: 1.0
Exactly dissimilar: -1.0
In the middle: 0.0
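
One case these checks do not cover is a zero vector, for which the formula divides zero by zero and NumPy yields `nan` with a runtime warning. Rows of a `TDM` built as in C2 always hold at least one nonzero count, so the issue should not arise there, but for arbitrary vectors a guard can help. The sketch below uses a hypothetical `safe_sim` and assumes that returning `0.0` is an acceptable convention; the assignment does not specify one.

```python
import numpy as np

def safe_sim(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return 0.0  # assumed convention; not specified by the assignment
    return float(np.dot(u, v) / denom)

zero_case = safe_sim(np.array([0, 0]), np.array([1, 2]))
same_case = safe_sim(np.array([1, 2, 3]), np.array([1, 2, 3]))
```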


In [None]:
# C3:Function(4/8)

def term_sims(i, TDM):

    #---Your code starts here
    sim_values = [sim(TDM[i], TDM[row]) if row != i else 1 for row in range(len(TDM))]
    #---Your code ends here

    return sim_values
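
The comprehension above calls `sim` once per row. As an alternative in the spirit of the vectorization note, the same values can be obtained from a single matrix-vector product. This is a sketch under the assumption that no row of the matrix is all zeros; `term_sims_vectorized` is a hypothetical name, and it returns a NumPy array rather than a list.

```python
import numpy as np

def term_sims_vectorized(i, TDM):
    """Cosine similarity of row i against every row, via one matrix-vector product."""
    norms = np.linalg.norm(TDM, axis=1)  # Euclidean length of each row
    dots = TDM @ TDM[i]                  # dot product of every row with row i
    return dots / (norms * norms[i])     # assumes no all-zero rows

toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = term_sims_vectorized(0, toy)
```

On the toy matrix, row 0 compared with itself gives 1, with row 1 gives 1/√2, and with the orthogonal row 2 gives 0.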

In [None]:
# C3:SanityCheck

# Compare word/row 0 to all other (rows) in the TDM
term_sims(0, TDM)

[1,
 0.0,
 0.0,
 0.0,
 0.0,
 0.9498986502907771,
 0.045949296887323965,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.017124291893062648,
 0.39813258718259,
 0.46742415729096676,
 0.24199963027323954,
 0.11127332811291032,
 0.4329628779238955,
 0.2063477173114049,
 0.2063477173114049,
 0.9523276661400791,
 0.1477261756313495,
 0.0,
 0.5274414795006903,
 0.027398867028900233,
 0.41119037303226363,
 0.023974008650287704,
 0.0,
 0.058222592436413,
 0.9712507693137749,
 0.0,
 0.0,
 0.0,
 0.041098300543350355,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.010274575135837589,
 0.0,
 0.0,
 0.0,
 0.013699433514450117,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.015818743254626313,
 0.0,
 0.0,
 0.041098300543350355,
 0.35580586003870596,
 0.8189613459883718,
 0.3762721959394875,
 0.0,
 0.06920700173899012,
 0.07534688432947564,
 0.14106638902501817,
 0.16510813272016214,
 0.003424858378612529,
 0.02768280069559605,
 0.09491245952775788,
 0.0,
 0.06

__C4.__ _(7 pts)_ Finally, your goal now is to write a function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms (`top_n_terms`) most similar to the one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`.

\[Hint: to locate the row containing the term of interest, apply the list `.index()` method to the `terms` argument.\]

In [None]:
# C4:Function(6/7)

def most_similar(term, terms, TDM, top = 25):

    #---Your code starts here---
    relevantSims = term_sims(terms.index(term),TDM)
    #print(len(relevantSims)) #6261
    #print(len(terms)) #6262
    fullList = [[rowix, relevantSims[rowix], terms[rowix]] for rowix in range(0,len(relevantSims))]
    finalTerms = sorted(fullList, key = lambda elem: elem[1], reverse = True)
    top_n_terms = finalTerms[:top]
    #Output a sorted list of the top = N terms (top_n_terms) most similar to the one specified (term).
    #The output data type should be a list of lists, with each inner list representing information for a similar term as: [row_ix, similarity, term].
    #---Your code ends here---

    return top_n_terms
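
The ranking step can also be expressed with `np.argsort` instead of building and sorting the full list of lists. The following is a sketch only: `top_n_by_argsort` is a hypothetical helper and the toy terms and similarities are made up. A stable sort on the negated values keeps tied terms in row order, which matches how `sorted(..., reverse = True)` treats ties.

```python
import numpy as np

def top_n_by_argsort(sim_values, terms, top=25):
    """Return [row_ix, similarity, term] for the `top` largest similarities."""
    sims = np.asarray(sim_values, dtype=float)
    # Sort descending; the stable sort keeps tied rows in their original order
    order = np.argsort(-sims, kind="stable")[:top]
    return [[int(ix), float(sims[ix]), terms[ix]] for ix in order]

toy_terms = [("a", "X"), ("b", "X"), ("c", "X"), ("d", "X")]
toy_sims = [0.2, 1.0, 0.2, 0.9]
result = top_n_by_argsort(toy_sims, toy_terms, top=3)
```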

Now, let's test your function's utility on a `TDM` produced for `book_id = 84` and exhibit the top 25 terms most similar to each of `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

In [None]:
# C4:SanityCheck

most_similar(('monster', 'NOUN'), terms, TDM, top = 25)

[[3801, 1, ('monster', 'NOUN')],
 [799, 0.34299717028501764, ('asseveration', 'NOUN')],
 [1561, 0.34299717028501764, ('correct', 'VERB')],
 [2601, 0.34299717028501764, ('formation', 'NOUN')],
 [3867, 0.34299717028501764, ('mutilate', 'VERB')],
 [4362, 0.34299717028501764, ('posterity', 'NOUN')],
 [218, 0.337510803485021, ('I', 'PRON')],
 [2950, 0.3102526139970115, ('hideous', 'ADJ')],
 [1770, 0.2970442628930023, ('demoniacal', 'ADJ')],
 [2594, 0.2970442628930023, ('forgetfulness', 'NOUN')],
 [5, 0.29194407123319405, (' ', 'SPACE')],
 [5538, 0.28020184134909754, ('that', 'SCONJ')],
 [13, 0.2794815423200416, ('!', 'PUNCT')],
 [0, 0.27370899867041026, ('\n', 'SPACE')],
 [29, 0.2718318190198858, ('.', 'PUNCT')],
 [5540, 0.27086782501514084, ('the', 'DET')],
 [677, 0.27083418336873366, ('and', 'CCONJ')],
 [3908, 0.2585438449975096, ('neck', 'NOUN')],
 [3575, 0.25724787771376323, ('loathsome', 'ADJ')],
 [3883, 0.25724787771376323, ('narrative', 'NOUN')],
 [20, 0.25250143182442775, (',', 'PUN

In [None]:
# C4:SanityCheck

most_similar(('beautiful', 'ADJ'), terms, TDM, top = 25)

[[933, 1, ('beautiful', 'ADJ')],
 [976, 0.40824829046386296, ('beneath', 'ADV')],
 [356, 0.37499999999999994, ('Rotterdam', 'PROPN')],
 [1200, 0.37499999999999994, ('castle', 'NOUN')],
 [2991, 0.37499999999999994, ('horrid', 'ADJ')],
 [5108, 0.37499999999999994, ('singularly', 'ADV')],
 [263, 0.35355339059327373, ('Mainz', 'PROPN')],
 [265, 0.35355339059327373, ('Mannheim', 'PROPN')],
 [1759, 0.35355339059327373, ('delineate', 'ADJ')],
 [2059, 0.35355339059327373, ('dun', 'PROPN')],
 [2492, 0.35355339059327373, ('fifth', 'ADJ')],
 [3613, 0.35355339059327373, ('luxuriance', 'NOUN')],
 [3686, 0.35355339059327373, ('meandering', 'ADJ')],
 [4197, 0.35355339059327373, ('pearly', 'ADJ')],
 [4345, 0.35355339059327373, ('populous', 'ADJ')],
 [5044, 0.35355339059327373, ('shipping', 'NOUN')],
 [5067, 0.35355339059327373, ('shrivel', 'VERB')],
 [5296, 0.35355339059327373, ('steep', 'ADJ')],
 [5320, 0.35355339059327373, ('straight', 'ADJ')],
 [5948, 0.35355339059327373, ('variegate', 'VERB')],
 [

In [None]:
# C4:Inline

# Comment on the ordered results returned in the sanity checks.
# Do you think the algorithm is exhibiting sensible results? print "Yes" or "No"
print("Yes")

Yes
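
When judging the lists above, note that whitespace and punctuation tokens such as `(' ', 'SPACE')` and `('!', 'PUNCT')` appear among the neighbours of `('monster', 'NOUN')`. One possible way to read the results without them is to filter the output of `most_similar` by tag. The sketch below assumes a hypothetical helper `drop_tags`, which is not part of the assignment; the sample rows are copied from the sanity-check output above.

```python
def drop_tags(similar_terms, excluded=("SPACE", "PUNCT")):
    """Keep only [row_ix, similarity, term] entries whose POS tag is not excluded."""
    return [entry for entry in similar_terms if entry[2][1] not in excluded]

# A few rows copied from the ('monster', 'NOUN') output above
sample = [
    [3801, 1, ("monster", "NOUN")],
    [5, 0.29194407123319405, (" ", "SPACE")],
    [13, 0.2794815423200416, ("!", "PUNCT")],
    [3908, 0.2585438449975096, ("neck", "NOUN")],
]
filtered = drop_tags(sample)
```

Filtering after ranking leaves the `TDM` and the similarity values untouched, so it changes only what is displayed.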
