## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Edward Day
    - Email: ED558@drexel.edu
- Group member 2
    - Name: Sophia Lee
    - Email: 
- Group member 3
    - Name: Sahar Siddiqi
    - Email: 
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: Jacob Rosen 
- Other (other): 

# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Initialize `spacy`'s English model and write a function called `load_book(book_id)`, which reads a book of specified number (as a string, `book_id`) and executes a regular expressiion to `re.split()` the loaded `book` (a string) into a list of `paragraphs` (strings). 

Test your code on `book_id = '84'`, and print the number of paragraphs in the resulting output.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you are only required to determine a _reasonable_ split criterion, and _not_ to remove markup or non-substantive content.

In [1]:
import re 
import spacy
nlp = spacy.load("en")

def load_book(book_id):
    book = open("./data/books/"+book_id+".txt","r").read().strip()

    
    paragraphs = re.split("[\n]{2}",book)
    
    return paragraphs
book_84 = load_book('84')
print(book_84[20:22])
print(len(book_84[20:22])) 

['Yet do not suppose, because I complain a little or because I can\nconceive a consolation for my toils which I may never know, that I am\nwavering in my resolutions.  Those are as fixed as fate, and my voyage\nis only now delayed until the weather shall permit my embarkation.  The\nwinter has been dreadfully severe, but the spring promises well, and it\nis considered as a remarkably early season, so that perhaps I may sail\nsooner than I expected.  I shall do nothing rashly:  you know me\nsufficiently to confide in my prudence and considerateness whenever the\nsafety of others is committed to my care.', 'I cannot describe to you my sensations on the near prospect of my\nundertaking.  It is impossible to communicate to you a conception of\nthe trembling sensation, half pleasurable and half fearful, with which\nI am preparing to depart.  I am going to unexplored regions, to "the\nland of mist and snow," but I shall kill no albatross; therefore do not\nbe alarmed for my safety or if I sh

__B2.__ _(10 points)_ Write a function called `kwic(paragraphs, search_terms)` that accepts a list of string `paragraphs` and a set of `search_term` strings. The function should:

1. initialize `data` as a `defaultdict` of lists
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`,

where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-paragraph indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```

In [7]:
import spacy
from collections import defaultdict
nlp = spacy.load("en")

data = defaultdict(list)
       



def kwic(paragraphs, search_terms):
    data = defaultdict(list)
    i=0
    for paragraph in paragraphs:
        doc = nlp(paragraph)
        j = 0 
        for sentence in doc.sents:
            k = 0
            for word in sentence:
                if word.text in search_terms :
                    data[word].append([[i, j, k], sentence])
                k = k+1
            j=j+1
        i=i+1    
    return data

__B3__ _(2 points)_ Prove your `kwic` search function's utility using the pre-processed paragraphs from book `84` and __B1__. Exhibit examples of the key words `Frankenstein` and `monster` in context and and comment on the run time of this program and explain why it runs so darn slow, and in particular would not support repeated queries. Note: if you think it doesn't, then just confirm `kwic`'sfunction, and proceed to part __B5__. You can comment here after completing the module.

_Response._ 

In [8]:
kwic(book_84,['Frankenstein','monster'])

# Run time about 20-30 seconds

defaultdict(list,
            {Frankenstein: [[[1, 0, 0], Frankenstein,]],
             Frankenstein: [[[103, 3, 11],
               So much has been done, exclaimed the soul of
               Frankenstein--more, far more, will I achieve; treading in the steps
               already marked, I will pioneer a new way, explore unknown powers, and
               unfold to the world the deepest mysteries of creation.]],
             monster: [[[124, 12, 9],
               shutters, I beheld the wretch--the miserable monster whom I had
               created.  ]],
             Frankenstein: [[[131, 4, 4], "My dear
               Frankenstein," exclaimed he, "how glad I am to see you!  ]],
             Frankenstein: [[[134, 2, 4],
               But, my dear Frankenstein," continued he, stopping
               short and gazing full in my face, "]],
             monster: [[[136, 3, 6], I dreaded to
               behold this monster, but I feared still more that Henry should see him.]],
      

__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor here in the process. Presently, we have our pre-processing `load_book` function just splitting a document into paragraphs. This function should be modified not only to:

1. split a `book` into paragraphs and loop over them, but
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instance in `document`, and the triple-list format of our document will allow us fast access to extract the sentence for KWiC. Exhibit this modified version of `load_book`'s function on `book_id = '84'` and print out the `[i,j,k]` locations of the word `'monster'` from `index`.

In [28]:
def load_book(book_id):
    book = open("./data/books/"+book_id+".txt","r").read().strip()
    paragraphs = re.split("[\n]{2}",book)
    index = defaultdict(list)
    document = []

    for i,paragraph in enumerate(paragraphs):
        doc = nlp(paragraph)

        document.append([])
        for j,sentence in enumerate(doc.sents):

            document[-1].append([])
            for k,word in enumerate(sentence):
                document[-1][-1].append(word.text)
                index[word.text].append([i, j, k])

        
    return document, index

document, index = load_book('84')



In [29]:
index['monster']

[[124, 12, 9],
 [136, 3, 6],
 [139, 3, 4],
 [142, 1, 4],
 [243, 3, 29],
 [261, 3, 18],
 [280, 0, 2],
 [321, 1, 35],
 [345, 11, 6],
 [380, 12, 5],
 [397, 1, 46],
 [437, 1, 10],
 [439, 0, 3],
 [477, 6, 8],
 [478, 6, 6],
 [510, 1, 8],
 [527, 0, 1],
 [538, 24, 8],
 [560, 3, 31],
 [585, 5, 26],
 [587, 13, 72],
 [606, 3, 2],
 [615, 2, 11],
 [633, 8, 9],
 [639, 1, 21],
 [644, 4, 17],
 [653, 7, 5],
 [663, 7, 2],
 [673, 0, 39],
 [673, 1, 2],
 [709, 12, 1]]

__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic(document, index, search_terms)` that loops through all specified `search_terms` to identify indices from `index[word]` for the key word-containing sentences and use them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
Confirm your output again by exhibiting examples of the key words `Frankenstein` and `monster` in context.

In [30]:
def fast_kwic(document, index, search_terms):
    data = defaultdict(list)
    
    for term in search_terms:
        for n in range(0,len(index[term])):
            data[term].append([index[term][n], document[index[term][n][0]][index[term][n][1]]])

    return data

fast_kwic(document, index, ['Frakenstein'])
fast_kwic(document, index, ['monster'])

defaultdict(list,
            {'monster': [[[124, 12, 9],
               ['shutters',
                ',',
                'I',
                'beheld',
                'the',
                'wretch',
                '--',
                'the',
                'miserable',
                'monster',
                'whom',
                'I',
                'had',
                '\n',
                'created',
                '.',
                ' ']],
              [[136, 3, 6],
               ['I',
                'dreaded',
                'to',
                '\n',
                'behold',
                'this',
                'monster',
                ',',
                'but',
                'I',
                'feared',
                'still',
                'more',
                'that',
                'Henry',
                'should',
                'see',
                'him',
                '.',
                '\n']],
              [[139, 3, 4],
    

In [32]:
len(fast_kwic(document, index, ['monster']))

1

__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the input: `load_book(book_id, pos = True, lemma = True):`, to accept two boolean arguments, `pos` and `lemma` specifying how to identify each word as a key term. In particular, each word will now be represented in both of the `document` and `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. When you've completed this part, exhibit your function's utility by using its ouput in the `fast_kwic` function to search for the key terms `('cold', NOUN)` and `('cold', ADJ)`.

Note this functions output should still consist of a `document` and `index` in the same format aside from the replacement of `word` with `heading`, which will allow for the same use of output in `fast_kwic`, although more specified by the textual features.

In [33]:
def load_book(book_id,pos = True, lemma = True):
    book = open("./data/books/"+book_id+".txt","r").read().strip()
    paragraphs = re.split("[\n]{2}",book)
    index = defaultdict(list)
    document = []

    for i,paragraph in enumerate(paragraphs):
        doc = nlp(paragraph)

        document.append([])
        for j,sentence in enumerate(doc.sents):

            document[-1].append([])
            for k,word in enumerate(sentence):
                document[-1][-1].append(word.text)
                words = (word.text if lemma == False else word.lemma_ , "" if pos == False else word.pos_)
                index[words].append([i, j, k])
        
    return document, index
doc, index = load_book('84')

In [34]:
fast_kwic(doc, index, search_terms={('cold','NOUN')})

defaultdict(list,
            {('cold',
              'NOUN'): [[[13, 2, 2],
               ['The',
                '\n',
                'cold',
                'is',
                'not',
                'excessive',
                ',',
                'if',
                'you',
                'are',
                'wrapped',
                'in',
                'furs',
                '--',
                'a',
                'dress',
                'which',
                'I',
                'have',
                '\n',
                'already',
                'adopted',
                ',',
                'for',
                'there',
                'is',
                'a',
                'great',
                'difference',
                'between',
                'walking',
                'the',
                '\n',
                'deck',
                'and',
                'remaining',
                'seated',
                'motionless',
      

defaultdict(list,
            {('cold',
              'NOUN'): [[[13, 2, 2],
               ['The',
                '\n',
                'cold',
                'is',
                'not',
                'excessive',
                ',',
                'if',
                'you',
                'are',
                'wrapped',
                'in',
                'furs',
                '--',
                'a',
                'dress',
                'which',
                'I',
                'have',
                '\n',
                'already',
                'adopted',
                ',',
                'for',
                'there',
                'is',
                'a',
                'great',
                'difference',
                'between',
                'walking',
                'the',
                '\n',
                'deck',
                'and',
                'remaining',
                'seated',
                'motionless',
      