# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Initialize `spacy`'s English model and write a function called `load_book(book_id)`, which reads the book with the specified number (given as a string, `book_id`) and uses a regular expression with `re.split()` to divide the loaded `book` (a string) into a list of `paragraphs` (strings).

Test your code on `book_id = '84'`, and print the number of paragraphs in the resulting output.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you are only required to determine a _reasonable_ split criterion, and _not_ to remove markup or non-substantive content.
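
One reasonable split criterion is a run of one or more blank lines. The sketch below illustrates that idea on an in-memory string rather than a book file; the helper name `split_paragraphs` and the sample text are assumptions made for illustration, not part of the assignment.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines (lines that are empty or whitespace-only)."""
    # \n\s*\n matches a newline, optional whitespace, and another newline,
    # so "\n\n", "\n  \n", and "\n\n\n" all count as one paragraph break.
    pieces = re.split(r"\n\s*\n", text)
    # Drop empty strings produced by leading or trailing blank lines.
    return [p for p in pieces if p.strip()]

sample = "Letter 1\n\nYou will rejoice to hear\nthat no disaster...\n\n\nI arrived here yesterday."
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

Single newlines inside a paragraph are kept, so hard-wrapped lines stay together in one paragraph.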

In [1]:
from pprint import pprint
from collections import defaultdict
import re
import csv

In [2]:
import spacy
nlp = spacy.load('en')

In [5]:
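def load_book(book_id):
    """Read ./data/books/<book_id>.txt and split it into paragraphs."""
    path = "./data/books/" + str(book_id) + ".txt"
    # The context manager closes the file even if reading fails.
    with open(path, "r") as text_file:
        booktext = text_file.read()
    # A blank line (two consecutive newlines) is taken as the paragraph break.
    paragraphs = re.split(r'\n\n', booktext)
    return paragraphs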

In [6]:
paragraphs = load_book('84')
print("Number of paragraphs is: " + str(len(paragraphs)) + " (This value is a rough count.)\n")
print(paragraphs[9])  # The 10th paragraph is printed as a sample.

Number of paragraphs is: 725 (This value is a rough count.)

You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.  I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.


__B2.__ _(10 points)_ Write a function called `kwic(paragraphs, search_terms)` that accepts a list of `paragraphs` (strings) and a set of `search_terms` (strings). The function should:

1. initialize `data` as a `defaultdict` of lists;
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`,

where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
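
To make the target structure concrete, here is a minimal sketch of the same nested loop that swaps `spacy` for a naive stand-in: sentences are split after `.`, `!`, or `?`, and words are matched with a simple regular expression. The names `naive_sentences`, `naive_words`, and `kwic_sketch` are hypothetical, and this tokenization will not match `spacy`'s.

```python
import re
from collections import defaultdict

def naive_sentences(paragraph):
    # Assumption: a sentence ends at ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def naive_words(sentence):
    # Assumption: a token is a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

def kwic_sketch(paragraphs, search_terms):
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):
        for j, sentence in enumerate(naive_sentences(paragraph)):
            words = naive_words(sentence)
            for k, word in enumerate(words):
                if word in search_terms:
                    # Store the location and the whole tokenized sentence.
                    data[word].append([[i, j, k], words])
    return data

toy = ["I saw the monster. It fled.", "No monster here!"]
result = kwic_sketch(toy, {"monster"})
print(result["monster"])
```

Each entry pairs a location `[i, j, k]` with the full token list of the sentence, which is the shape that the later parts of the module rely on.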

In [50]:
def kwic(paragraphs, search_terms=frozenset()):
    """Scan every sentence for search terms and collect [[i, j, k], sentence] hits."""
    data = defaultdict(lambda: [])
    for i, eachparagraph in enumerate(paragraphs):
        doc = nlp(eachparagraph)  # full spacy pipeline; repeated on every call
        for j, sentence in enumerate(doc.sents):
            for k, eachword in enumerate(sentence):
                if eachword.text in search_terms:
                    # k is the word's position within the sentence
                    data[eachword.text].append([[i, j, k], [w.text for w in sentence]])
    return data

__B3.__ _(2 points)_ Prove your `kwic` search function's utility using the pre-processed paragraphs from book `84` and __B1__. Exhibit examples of the key words `Frankenstein` and `monster` in context, comment on the run time of this program, and explain why it runs so darn slowly and, in particular, why it would not support repeated queries. Note: if you think it doesn't, then just confirm `kwic`'s function, and proceed to part __B5__. You can comment here after completing the module.

In [51]:
data_Frankenstein_monster = kwic(load_book(84), {'Frankenstein', 'monster'})

In [54]:
pprint(data_Frankenstein_monster)

defaultdict(<function kwic.<locals>.<lambda> at 0x10b45f0d0>,
            {'Frankenstein': [[[2, 0, 0], ['Frankenstein', ',']],
                              [[104, 2, 11],
                               ['So',
                                'much',
                                'has',
                                'been',
                                'done',
                                ',',
                                'exclaimed',
                                'the',
                                'soul',
                                'of',
                                '\n',
                                'Frankenstein',
                                '--',
                                'more',
                                ',',
                                'far',
                                'more',
                                ',',
                                'will',
                                'I',
                                'achi

                           'of',
                           'past',
                           'misfortunes',
                           'pressed',
                           'upon',
                           'me',
                           ',',
                           'I',
                           'began',
                           'to',
                           'reflect',
                           'on',
                           'their',
                           '\n',
                           'cause',
                           '--',
                           'the',
                           'monster',
                           'whom',
                           'I',
                           'had',
                           'created',
                           ',',
                           'the',
                           'miserable',
                           'daemon',
                           'whom',
                           'I',
                     

<font color=blue>The code runs very slowly because every call processes the entire book with `spacy` (splitting it into sentences and tokens) before comparing each token to the search words. Nothing is stored between calls, so each new query repeats all of this work. Also, nothing is reported until the function returns, so we must wait for the whole loop to finish every time we run the code. (Even a small error costs a full run to discover!)</font>
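
The repeated work can be made visible by counting how often the expensive step runs. In the sketch below, `tokenize` is a hypothetical stand-in for the costly `nlp(paragraph)` call (it only splits on whitespace), and `scan_query` is an illustrative name, not the `kwic` function itself.

```python
import time

calls = {"tokenize": 0}

def tokenize(paragraph):
    # Stand-in for the costly nlp(paragraph) call; counts how often it runs.
    calls["tokenize"] += 1
    return paragraph.split()

def scan_query(paragraphs, term):
    # Re-tokenizes every paragraph on every query, like a scan-based search.
    return [i for i, p in enumerate(paragraphs) if term in tokenize(p)]

paragraphs = ["the monster fled", "a cold wind", "the cold monster"]
start = time.perf_counter()
for term in ["monster", "cold", "wind"]:
    scan_query(paragraphs, term)
elapsed = time.perf_counter() - start

# 3 queries x 3 paragraphs = 9 tokenization passes for a 3-paragraph text.
print(calls["tokenize"], f"{elapsed:.6f}s")
```

The number of passes grows with the number of queries times the number of paragraphs, which is why a scan of this kind is a poor fit for repeated searches.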

__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor here in the process. Presently, we have our pre-processing `load_book` function just splitting a document into paragraphs. This function should be modified not only to:

1. split a `book` into paragraphs and loop over them, but
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instances in `document`, and the triple-list format of our document will give us fast access to extract the sentence for KWiC. Exhibit this modified version of `load_book` on `book_id = '84'` and print out the `[i,j,k]` locations of the word `'monster'` from `index`.
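
The relationship between the two return values can be sketched on a tiny, already tokenized document. Here `toy_document` is hand-written in place of `spacy` output, and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_index(document):
    """Map each word to the list of [i, j, k] positions where it occurs."""
    index = defaultdict(list)
    for i, paragraph in enumerate(document):
        for j, sentence in enumerate(paragraph):
            for k, word in enumerate(sentence):
                index[word].append([i, j, k])
    return index

toy_document = [
    [["I", "saw", "the", "monster", "."], ["It", "fled", "."]],
    [["No", "monster", "here", "!"]],
]
toy_index = build_index(toy_document)
print(toy_index["monster"])  # [[0, 0, 3], [1, 0, 1]]

# Every stored position points back at the word it was filed under.
for i, j, k in toy_index["monster"]:
    assert toy_document[i][j][k] == "monster"
```

The loop at the end states the invariant that makes the index useful: following any stored `[i, j, k]` into the document returns the word used as the key.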

In [69]:
def mod_load_book(book_id):
    document = []  # document[i][j][k]: paragraph i, sentence j, word k
    index = defaultdict(list)
    # load the book
    path = "./data/books/" + str(book_id) + ".txt"
    with open(path, "r") as text_file:
        booktext = text_file.read()
    paragraphs = re.split(r'\n\n', booktext)
    # build the document and the index in a single pass
    for i, eachparagraph in enumerate(paragraphs):
        doc = nlp(eachparagraph)
        document.append([])
        for j, sentence in enumerate(doc.sents):
            document[-1].append([])
            for k, eachword in enumerate(sentence):
                document[-1][-1].append(eachword.text)
                index[eachword.text].append([i, j, k])
    return document, index

In [70]:
document, index = mod_load_book(84)

In [109]:
print(document[10][4])
print("\n")
print(document[10][4][33])
print([index['delight'][1], document[index['delight'][1][0]][index['delight'][1][1]]])

['I', 'try', 'in', 'vain', 'to', 'be', 'persuaded', 'that', 'the', 'pole', 'is', 'the', 'seat', 'of', '\n', 'frost', 'and', 'desolation', ';', 'it', 'ever', 'presents', 'itself', 'to', 'my', 'imagination', 'as', 'the', '\n', 'region', 'of', 'beauty', 'and', 'delight', '.', ' ']


delight
[[10, 4, 33], ['I', 'try', 'in', 'vain', 'to', 'be', 'persuaded', 'that', 'the', 'pole', 'is', 'the', 'seat', 'of', '\n', 'frost', 'and', 'desolation', ';', 'it', 'ever', 'presents', 'itself', 'to', 'my', 'imagination', 'as', 'the', '\n', 'region', 'of', 'beauty', 'and', 'delight', '.', ' ']]


In [73]:
print(" ".join(document[10][4]))
print(" ".join(document[10][0]))

I try in vain to be persuaded that the pole is the seat of 
 frost and desolation ; it ever presents itself to my imagination as the 
 region of beauty and delight .  
I am already far north of London , and as I walk in the streets of 
 Petersburgh , I feel a cold northern breeze play upon my cheeks , which 
 braces my nerves and fills me with delight .  


In [74]:
index['delight']

[[10, 0, 39],
 [10, 4, 33],
 [75, 1, 24],
 [77, 4, 67],
 [83, 5, 12],
 [85, 11, 15],
 [113, 0, 16],
 [121, 2, 34],
 [133, 0, 4],
 [134, 0, 6],
 [196, 2, 7],
 [201, 5, 13],
 [262, 13, 10],
 [296, 0, 29],
 [318, 10, 5],
 [331, 2, 12],
 [334, 0, 5],
 [335, 2, 9],
 [337, 0, 55],
 [338, 2, 36],
 [342, 0, 35],
 [351, 0, 34],
 [376, 3, 8],
 [444, 3, 8],
 [478, 7, 21],
 [486, 8, 29],
 [505, 4, 1],
 [518, 4, 18],
 [527, 2, 16],
 [587, 2, 15],
 [613, 1, 11],
 [638, 18, 16]]

__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic(document, index, search_terms)` that loops through all specified `search_terms` to identify indices from `index[word]` for the key word-containing sentences and use them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
Confirm your output again by exhibiting examples of the key words `Frankenstein` and `monster` in context.
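
A minimal sketch of the lookup step, assuming a hand-built `toy_document` and `toy_index` in the formats described in __B4__ (both are illustrative stand-ins, as is the name `lookup_sentences`). Each hit costs one dictionary lookup and two list indexings, with no re-tokenization.

```python
toy_document = [
    [["I", "saw", "the", "monster", "."], ["It", "fled", "."]],
    [["No", "monster", "here", "!"]],
]
toy_index = {"monster": [[0, 0, 3], [1, 0, 1]], "fled": [[0, 1, 1]]}

def lookup_sentences(document, index, search_terms):
    data = {}
    for term in search_terms:
        # index.get(...) returns [] for unseen terms instead of raising KeyError.
        hits = [[[i, j, k], document[i][j]] for i, j, k in index.get(term, [])]
        if hits:
            data[term] = hits
    return data

found = lookup_sentences(toy_document, toy_index, {"fled", "ghost"})
print(found)
```

Terms that never occur (here `"ghost"`) simply produce no entry, so the result contains only words that were actually found.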

In [191]:
def fast_kwic(doc, inx, search_items=frozenset()):
    data = defaultdict(list)
    for searchword in search_items:
        # .get() avoids adding empty entries to the defaultdict index
        for i, j, k in inx.get(searchword, []):
            # doc[i][j] is the full sentence that contains the key word
            data[searchword].append([[i, j, k], doc[i][j]])
    return data

In [192]:
fast_kwic_Frankenstein_monster = fast_kwic(document, index, {'Frankenstein', 'monster'})

In [193]:
pprint(fast_kwic_Frankenstein_monster)

defaultdict(<class 'list'>,
            {'Frankenstein': [[[2, 0, 0], ['Frankenstein', ',']],
                              [[104, 2, 11],
                               ['So',
                                'much',
                                'has',
                                'been',
                                'done',
                                ',',
                                'exclaimed',
                                'the',
                                'soul',
                                'of',
                                '\n',
                                'Frankenstein',
                                '--',
                                'more',
                                ',',
                                'far',
                                'more',
                                ',',
                                'will',
                                'I',
                                'achieve',
                            

                           'proportionate',
                           'to',
                           'his',
                           'crimes',
                           '.',
                           ' ']],
                         [[654, 8, 5],
                          ['Let',
                           'the',
                           'cursed',
                           'and',
                           'hellish',
                           'monster',
                           '\n',
                           'drink',
                           'deep',
                           'of',
                           'agony',
                           ';',
                           'let',
                           'him',
                           'feel',
                           'the',
                           'despair',
                           'that',
                           'now',
                           'torments',
                           'me',
           

__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the input: `load_book(book_id, pos = True, lemma = True):`, to accept two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both the `document` and the `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. When you've completed this part, exhibit your function's utility by using its output in the `fast_kwic` function to search for the key terms `('cold', 'NOUN')` and `('cold', 'ADJ')`.

Note that this function's output should still consist of a `document` and an `index` in the same format, aside from the replacement of `word` with `heading`. This allows the same use of the output in `fast_kwic`, with searches made more specific by the textual features.
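
The four combinations of `pos` and `lemma` can be sketched with a small stand-in for a `spacy` token. The `Token` namedtuple below is an assumption that mimics only the three attributes used here, and `make_heading` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for a spacy token: only the three attributes used here.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

def make_heading(token, pos=True, lemma=True):
    # text: lemma or surface form; tag: part of speech or empty string.
    text = token.lemma_ if lemma else token.text
    tag = token.pos_ if pos else ""
    return (text, tag)

token = Token(text="colder", lemma_="cold", pos_="ADJ")
for pos in (True, False):
    for lemma in (True, False):
        print(pos, lemma, make_heading(token, pos=pos, lemma=lemma))
```

Because a heading is a plain tuple, it is hashable and can serve directly as a key in `index` and as a member of the `search_terms` set.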

In [194]:
def mod2_load_book(book_id, pos=True, lemma=True):
    document = []  # document[i][j][k] is a (text, tag) heading
    index = defaultdict(list)
    # load the book
    path = "./data/books/" + str(book_id) + ".txt"
    with open(path, "r") as text_file:
        booktext = text_file.read()
    paragraphs = re.split(r'\n\n', booktext)
    # build the document and the index in a single pass
    for i, eachparagraph in enumerate(paragraphs):
        doc = nlp(eachparagraph)
        document.append([])
        for j, sentence in enumerate(doc.sents):
            document[-1].append([])
            for k, eachword in enumerate(sentence):
                # text: lemma or surface form; tag: part of speech or ""
                text = eachword.lemma_ if lemma else eachword.text
                tag = eachword.pos_ if pos else ""
                heading = (text, tag)
                # k is the same position in both structures,
                # so document[i][j][k] == heading
                document[-1][-1].append(heading)
                index[heading].append([i, j, k])
    return document, index

In [195]:
doc1, index1 = mod2_load_book(84, pos = True, lemma = True)

In [196]:
print([index1[('delight', 'NOUN')][1], doc1[index1[('delight', 'NOUN')][1][0]][index1[('delight', 'NOUN')][1][1]]])

[[10, 4, 33], [('-PRON-', 'PRON'), ('try', 'VERB'), ('in', 'ADP'), ('vain', 'ADJ'), ('to', 'PART'), ('be', 'VERB'), ('persuade', 'VERB'), ('that', 'ADP'), ('the', 'DET'), ('pole', 'NOUN'), ('be', 'VERB'), ('the', 'DET'), ('seat', 'NOUN'), ('of', 'ADP'), ('\n', 'SPACE'), ('frost', 'NOUN'), ('and', 'CCONJ'), ('desolation', 'NOUN'), (';', 'PUNCT'), ('-PRON-', 'PRON'), ('ever', 'ADV'), ('present', 'VERB'), ('-PRON-', 'PRON'), ('to', 'ADP'), ('-PRON-', 'ADJ'), ('imagination', 'NOUN'), ('as', 'ADP'), ('the', 'DET'), ('\n', 'SPACE'), ('region', 'NOUN'), ('of', 'ADP'), ('beauty', 'NOUN'), ('and', 'CCONJ'), ('delight', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE')]]


In [197]:
print(doc1[10][4])

[('-PRON-', 'PRON'), ('try', 'VERB'), ('in', 'ADP'), ('vain', 'ADJ'), ('to', 'PART'), ('be', 'VERB'), ('persuade', 'VERB'), ('that', 'ADP'), ('the', 'DET'), ('pole', 'NOUN'), ('be', 'VERB'), ('the', 'DET'), ('seat', 'NOUN'), ('of', 'ADP'), ('\n', 'SPACE'), ('frost', 'NOUN'), ('and', 'CCONJ'), ('desolation', 'NOUN'), (';', 'PUNCT'), ('-PRON-', 'PRON'), ('ever', 'ADV'), ('present', 'VERB'), ('-PRON-', 'PRON'), ('to', 'ADP'), ('-PRON-', 'ADJ'), ('imagination', 'NOUN'), ('as', 'ADP'), ('the', 'DET'), ('\n', 'SPACE'), ('region', 'NOUN'), ('of', 'ADP'), ('beauty', 'NOUN'), ('and', 'CCONJ'), ('delight', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE')]


In [198]:
print(index1[('cold', 'NOUN')])

[[14, 2, 2], [292, 1, 12], [304, 8, 45], [385, 2, 24], [662, 0, 16], [665, 1, 25], [687, 1, 1]]


In [199]:
fast_kwic_POS_LEMMA = fast_kwic(doc1, index1, {('cold', 'NOUN'), ('cold', 'ADJ')})

In [200]:
pprint(fast_kwic_POS_LEMMA)

defaultdict(<class 'list'>,
            {('cold', 'ADJ'): [[[10, 0, 22],
                                [('-PRON-', 'PRON'),
                                 ('be', 'VERB'),
                                 ('already', 'ADV'),
                                 ('far', 'ADV'),
                                 ('north', 'ADV'),
                                 ('of', 'ADP'),
                                 ('london', 'PROPN'),
                                 (',', 'PUNCT'),
                                 ('and', 'CCONJ'),
                                 ('as', 'ADP'),
                                 ('-PRON-', 'PRON'),
                                 ('walk', 'VERB'),
                                 ('in', 'ADP'),
                                 ('the', 'DET'),
                                 ('street', 'NOUN'),
                                 ('of', 'ADP'),
                                 ('\n', 'SPACE'),
              