# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of the principal words used in a book. More generally, KWiC is the concept behind the 'find in page' utility of document viewers and web browsers. This module builds up a KWiC utility for quickly finding key word-containing sentences and 'most relevant' paragraphs.

__B1.__ _(3 points)_ Initialize `spacy`'s English model and write a function called `load_book(book_id)`, which reads the book with the specified number (passed as a string, `book_id`) and uses a regular expression with `re.split()` to split the loaded `book` (a string) into a list of `paragraphs` (strings).

Test your code on `book_id = '84'`, and print the number of paragraphs in the resulting output.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you are only required to determine a _reasonable_ split criterion, and _not_ to remove markup or non-substantive content.
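As a rough illustration of one _reasonable_ split criterion, the sketch below splits on blank (or whitespace-only) lines. The inline `book` string is a toy stand-in for a loaded book, and the pattern is only an assumption about how paragraphs are separated in the text; other criteria may suit a given file better.

```python
import re

# Toy stand-in for a loaded book (assumption: paragraphs are separated by blank lines).
book = "First paragraph,\nstill the first.\n\nSecond paragraph.\n\n\n   \nThird paragraph.\n"

# Split wherever one or more blank or whitespace-only lines separate blocks of text,
# then drop any empty pieces left over.
paragraphs = [p for p in re.split(r"\n\s*\n", book.strip()) if p.strip()]

print(len(paragraphs))
```

Line breaks inside a paragraph are left untouched here, which is consistent with the note above that no further clean-up is required.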

In [None]:
# code here

__B2.__ _(10 points)_ Write a function called `kwic(paragraphs, search_terms)` that accepts a list of `paragraphs` (strings) and a set of `search_terms` (strings). The function should:

1. initialize `data` as a `defaultdict` of lists;
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`,

where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
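The nested loops and the resulting structure can be sketched without `spacy` on a toy input. In the sketch below, `toy_sentences` and `kwic_sketch` are hypothetical names, and the regular-expression sentence splitter is only a crude stand-in for `doc.sents`; it is not how `spacy` segments text. It also assumes that `k` counts words within the sentence, matching the `document[i][j][k]` convention used later in this module.

```python
import re
from collections import defaultdict

def toy_sentences(paragraph):
    """Hypothetical stand-in for spaCy's doc.sents: crude sentence and word splitting."""
    raw = re.findall(r"[^.!?]+[.!?]", paragraph)            # text up to ., ! or ?
    return [re.findall(r"\w+|[^\w\s]", s) for s in raw]     # words and punctuation marks

def kwic_sketch(paragraphs, search_terms):
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):                   # paragraph-in-book
        for j, sentence in enumerate(toy_sentences(paragraph)):  # sentence-in-paragraph
            for k, word in enumerate(sentence):                  # word-in-sentence (assumed)
                if word in search_terms:
                    data[word].append([[i, j, k], sentence])
    return data

paragraphs = ["The monster fled. I pursued the monster!", "Nobody spoke."]
data = kwic_sketch(paragraphs, {"monster"})
for location, sentence in data["monster"]:
    print(location, " ".join(sentence))
```

Every call repeats the full pass over all paragraphs, which is worth keeping in mind when timing the real function.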

In [None]:
# code here

__B3.__ _(2 points)_ Demonstrate your `kwic` search function's utility using the paragraphs of book `84` produced in __B1__. Exhibit examples of the key words `Frankenstein` and `monster` in context, comment on the run time of this program, and explain why it runs so slowly and, in particular, why it would not support repeated queries. Note: if you think it does not run slowly, then just confirm that `kwic` functions correctly, and proceed to part __B5__. You can comment here after completing the module.

_Response._ 

In [None]:
# code here

__B4.__ _(10 points)_ The cost of _indexing_ a given book turns out to be the limiting factor in this process. Presently, our pre-processing function `load_book` only splits a document into paragraphs. Modify this function so that it also processes and indexes the text. It should:

1. split a `book` into paragraphs and loop over them;
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`.

Pre-computing the `index` will allow us to efficiently look up the location of every instance of a word in `document`, and the triple-nested list format of our document will give us fast access to the containing sentence for KWiC. Exhibit this modified version of the `load_book` function on `book_id = '84'` and print out the `[i,j,k]` locations of the word `'monster'` from `index`.
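To make the two structures concrete, here is a minimal sketch on a hand-tokenized toy `document` (an assumption for illustration; in the assignment the tokens would come from `spacy`). It shows how an inverted `index` maps each word to its `[i, j, k]` locations.

```python
from collections import defaultdict

# Hand-tokenized toy document: paragraphs -> sentences -> word strings.
document = [
    [["The", "monster", "fled", "."], ["I", "pursued", "the", "monster", "!"]],
    [["Nobody", "spoke", "."]],
]

# Inverted index: word -> list of [i, j, k] locations in `document`.
index = defaultdict(list)
for i, paragraph in enumerate(document):
    for j, sentence in enumerate(paragraph):
        for k, word in enumerate(sentence):
            index[word].append([i, j, k])

print(index["monster"])    # [[0, 0, 1], [0, 1, 3]]
print(document[0][1][3])   # monster
```

Building the index is a single pass over the text; afterwards each lookup is a dictionary access rather than a fresh scan of the book.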

In [None]:
# code here

__B5.__ _(5 points)_ Finally, write a new function called `fast_kwic(document, index, search_terms)` that loops through all specified `search_terms`, looks up the locations of the key word-containing sentences in `index[word]`, and uses them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
Confirm your output again by exhibiting examples of the key words `Frankenstein` and `monster` in context.
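The lookup step can be sketched as follows, again on a toy `document` and a hand-written `index` (both assumptions for illustration). The name `fast_kwic_sketch` is hypothetical, and using `.get` is just one possible design choice: it avoids adding empty entries to a `defaultdict` index when a search term never occurs.

```python
from collections import defaultdict

# Toy inputs in the format described in B4 (hand-built for illustration).
document = [
    [["The", "monster", "fled", "."], ["I", "pursued", "the", "monster", "!"]],
    [["Nobody", "spoke", "."]],
]
index = defaultdict(list, {"monster": [[0, 0, 1], [0, 1, 3]], "spoke": [[1, 0, 1]]})

def fast_kwic_sketch(document, index, search_terms):
    data = defaultdict(list)
    for term in search_terms:
        # .get leaves the index unchanged for terms that were never indexed.
        for i, j, k in index.get(term, []):
            data[term].append([[i, j, k], document[i][j]])
    return data

data = fast_kwic_sketch(document, index, {"monster", "ghost"})
for location, sentence in data["monster"]:
    print(location, " ".join(sentence))
```

No text processing happens at query time, so repeated queries only cost the dictionary lookups and list accesses shown here.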

In [None]:
# code here

__B6.__ _(5 points)_ Your goal here is to modify the pre-processing in `load_book` one more time. Make a small modification to the signature, `load_book(book_id, pos = True, lemma = True)`, so that it accepts two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both the `document` and the `index` as a tuple, `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False` and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and should otherwise contain `word.pos_`. When you've completed this part, exhibit your function's utility by using its output in the `fast_kwic` function to search for the key terms `('cold', 'NOUN')` and `('cold', 'ADJ')`.

Note that this function's output should still consist of a `document` and an `index` in the same format, aside from the replacement of `word` with `heading`. This allows the output to be used in `fast_kwic` exactly as before, with searches made more specific by the textual features.
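The construction of a `heading` can be sketched in isolation. Below, `ToyToken` is a hypothetical stand-in for a `spacy` token that carries only the three attributes named above (`text`, `lemma_`, `pos_`), and `make_heading` is an illustrative helper name, not a required part of the solution.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, with only the attributes used here.
ToyToken = namedtuple("ToyToken", ["text", "lemma_", "pos_"])

def make_heading(word, pos=True, lemma=True):
    text = word.lemma_ if lemma else word.text   # lemma or surface form
    tag = word.pos_ if pos else ""               # part of speech, or empty string
    return (text, tag)

colds = ToyToken(text="colds", lemma_="cold", pos_="NOUN")
print(make_heading(colds))                          # ('cold', 'NOUN')
print(make_heading(colds, pos=False, lemma=False))  # ('colds', '')
```

Because tuples are hashable, a `heading` can serve as a dictionary key in `index` and as a member of the `search_terms` set in the same way a plain string did.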

In [None]:
# code here