# 08a Exercises: Spacy & regular expressions

In this exercise, we will practice writing regex search patterns, explore the Spacy Matcher engine, write custom Spacy pipeline components, and learn how to add attributes to Spacy Doc objects (the same approach applies to Span and Token objects).

### Preliminaries: load modules and a spacy model
We will use the `en_core_web_lg` model for better POS-tagging and named entity recognition performance. However, if this takes too long to run, switch to the `en_core_web_sm` model.

In [1]:
import spacy, re
from pprint import pprint
from spacy import displacy
from collections import Counter, defaultdict

# There are two ways to load a model
# 1. use spacy.load
nlp = spacy.load('en_core_web_lg')

# 2. import as a module 
# import en_core_web_lg
# nlp = en_core_web_lg.load()

### PART A: Identify all the important characters in the text

#### Exercise 1: open and read the `emma-austen.txt` file as a single string

In [2]:
with open('emma-austen.txt', encoding  = 'utf-8') as f: 
    lines = f.read()

#### Exercise 2: Finding all mentions of Jane Fairfax in the file
Write a regex search that will return the start index position of **all** the mentions of "Jane Fairfax" in the text. Note that she may be referred to as "Miss Fairfax" as well. Hint: use `finditer` and a mix of lookahead and alternation operators.

In [3]:
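# Note: findall returns the matched strings rather than their start positions,
# and the spaces around "|" are part of each alternative.
re.findall('Jane Fairfax | Miss Fairfax', lines)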

['Jane Fairfax ',
 'Jane Fairfax ',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 'Jane Fairfax ',
 'Jane Fairfax ',
 'Jane Fairfax ',
 'Jane Fairfax ',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 ' Miss Fairfax',
 ' Miss Fairfax',
 ' Miss Fairfax',
 'Jane Fairfax ',
 ' Miss Fa
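
The exercise asks for start positions rather than matched strings. A minimal sketch of one way to get them with `finditer` and a lookahead, run on a made-up `sample` string (an assumption for illustration, not text taken from the file). Using `\s+` also covers names that are split across a line break:

```python
import re

# Hypothetical sample text; in the notebook this would be the `lines` string.
sample = "Jane Fairfax came. Then Miss\nFairfax left. Jane Austen wrote."

# Match "Jane" or "Miss" only when followed by whitespace and "Fairfax".
# The lookahead keeps the match short; its start is the start of the mention.
pattern = re.compile(r"\b(?:Jane|Miss)(?=\s+Fairfax\b)")

starts = [m.start() for m in pattern.finditer(sample)]
print(starts)
```

"Jane Austen" is skipped because the lookahead fails there, and no leading or trailing space is required around the name.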

#### Exercise 3: identify the set of main characters in the book 
We will leverage Spacy's [EntityRecognizer](https://spacy.io/api/entityrecognizer), which is already part of the loaded pipeline (cf. `nlp.pipe_names`), and use two rules of thumb to identify the main characters: (i) entities tagged as persons; and (ii) entities whose spans have two or more words, e.g. Emma Woodhouse. In other words, we assume the author gives the full name of important characters as they are introduced.

First process the text with Spacy (apply the `nlp` object to the string containing the text). **Make sure to save the Spacy Doc object to the variable `doc_emma`.** Use the `.ents` attribute to get the detected entities. Then use the `.label_` attribute to identify the *PERSON* entities; finally use the `.text` attribute to recover the str form of the span.

Store the names (string form) of the main characters in a set with the variable name `main_characters`.

In [4]:
doc_emma = nlp(lines) #Store the document processed with spacy

In [16]:
main_characters_raw = [ent.text for ent in doc_emma.ents if ent.label_ == "PERSON" and len(ent.text.split()) == 2]

In [17]:
# Deduplicate while preserving first-seen order.
# Note: the NER model mislabels some headers, such as "CHAPTER X", as PERSON.
main_characters = list(dict.fromkeys(main_characters_raw))

print(main_characters)

['Jane Austen', 'Emma Woodhouse', 'Miss Woodhouse', 'Farmer Mitchell’s', 'Frank Churchill', 'Miss Bates', 'Frank\nChurchill', 'Donwell Abbey', 'Miss Smith', 'Harriet Smith', 'Harriet\nSmith', 'Harriet Smith’s', 'Miss Prince', 'Miss Richardson', 'Robert Martin', 'Harriet good', 'John Knightley', 'Harriet, Harriet', 'Knightley.—“Robert Martin', 'CHAPTER X', 'Jane Fairfax', 'John\nKnightley', 'Jane\nFairfax', 'John Knightley.—“It', 'CHAPTER XIII', 'Harriet\nseemed', 'her;—William Coxe', 'William Coxe', 'Harriet bore', 'Miss\nWoodhouse', 'Miss\nBates', 'Jane Bates', 'Miss Hawkins,—I', 'Miss Hawkins', 'Harriet: Harriet', 'Elizabeth Martin', 'Miss Woodhouse?)—for', 'Augusta Hawkins', 'Elizabeth\nMartin', 'Philip Elton', 'York Tan', 'well:—a\nman', 'Henry supplanted.—Mr', 'near—“Miss Bates', 'Jane Fairfax’s', 'Harriet rather', 'Anne Cox', 'William Cox', 'Miss Smith?—Very', 'John Saunders', 'William Larkins', 'Miss\nSmith', 'Miss Bates?—I', 'Harriet earnestly', 'Clara Partridge', 'James Cooper
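
The raw list contains line-broken duplicates ("Frank\nChurchill"), possessives, and false positives such as "CHAPTER X". A minimal sketch of one possible clean-up step, assuming a name is made only of capitalised alphabetic words; the helper `clean_names` and its rules are hypothetical and will also drop legitimate names that do not fit that shape:

```python
import re

# Assumption: a name token is a capital letter followed by lowercase letters.
_NAME_TOKEN = re.compile(r"[A-Z][a-z]+")

def clean_names(raw_names):
    """Normalise whitespace, strip possessives, and keep only name-like strings."""
    cleaned = set()
    for name in raw_names:
        name = " ".join(name.split())        # collapse newlines and repeated spaces
        name = re.sub(r"[’']s$", "", name)   # drop a trailing possessive
        tokens = name.split(" ")
        if len(tokens) >= 2 and all(_NAME_TOKEN.fullmatch(t) for t in tokens):
            cleaned.add(name)
    return cleaned

raw = ["Emma Woodhouse", "Frank\nChurchill", "Frank Churchill",
       "Harriet Smith’s", "CHAPTER X", "Knightley.—“Robert Martin"]
print(clean_names(raw))
```

Returning a set also matches the exercise's request to store the names in a set.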

#### Exercise 4: Get a sense of how each character is portrayed in the text
We will leverage rule-based matching (using the [Matcher](https://spacy.io/usage/rule-based-matching)).
One pattern we can use is to look for spans that are tagged `PERSON` and then look for adjectives surrounding them, for example (1) "*beautiful* Emma" or (2) "Emma Woodhouse is *charming*".

The pattern for (1) has been done for you: we look for spans with 1 or more tokens (using the [quantifier](https://spacy.io/usage/rule-based-matching#quantifiers) "OP": "{1,}") that have been labeled as `PERSON`, together with the preceding token when it has an ADJ part-of-speech tag. **Your task is to do the same for (2), add it to the matcher, and then run the matcher on `doc_emma`.**

In [18]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern1 = [{"POS": "ADJ"}, 
            {"ENT_TYPE": "PERSON", "OP": "{1,}"}]
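# Pattern (2): one or more PERSON tokens, a form of "be", then an adjective.
# Note: Spacy's English models usually tag copular "be" as AUX rather than VERB,
# so the "POS": "VERB" constraint may prevent this pattern from matching.
pattern2 = [{"ENT_TYPE": "PERSON", "OP": "{1,}"}, {"LEMMA": "be", "POS": "VERB"}, {"POS": "ADJ"}]
matcher.add('Patterns added', [pattern1, pattern2])  # Register both patterns under a single key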
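
To see pattern (2) in isolation, here is a minimal sketch on a hand-annotated `Doc`, so no trained model is needed. The POS tags, lemmas and entity labels are typed in by hand (assumptions for illustration), "is" is tagged AUX, and the sketch matches on the lemma alone. It uses the "+" quantifier, which should be equivalent to "{1,}":

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

demo_nlp = spacy.blank("en")

# Hand-built annotations standing in for what a trained pipeline would predict.
demo_doc = Doc(
    demo_nlp.vocab,
    words=["Emma", "Woodhouse", "is", "charming", "."],
    pos=["PROPN", "PROPN", "AUX", "ADJ", "PUNCT"],
    lemmas=["Emma", "Woodhouse", "be", "charming", "."],
    ents=["B-PERSON", "I-PERSON", "O", "O", "O"],
)

demo_matcher = Matcher(demo_nlp.vocab)
# One or more PERSON tokens, then any form of "be", then an adjective.
person_be_adj = [{"ENT_TYPE": "PERSON", "OP": "+"}, {"LEMMA": "be"}, {"POS": "ADJ"}]
demo_matcher.add("PERSON_BE_ADJ", [person_be_adj])

texts = {span.text for span in demo_matcher(demo_doc, as_spans=True)}
print(texts)
```

Because of the quantifier, the matcher may return overlapping spans (for instance with and without the first name), much like "amiable Jane" and "amiable Jane Fairfax" for pattern (1).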

In [19]:
matches = matcher(doc_emma)  # Returns a list of (match_id, start, end) tuples

In [20]:
for match_id, start, end in matches:
    span = doc_emma[start:end]
    print(span.text)

poor James
poor Isabella
Dear Emma
little Frank
dear Emma
little Perrys
little Welch
bad thing?—why
dear Emma
little George
dear Harriet
dearest Harriet
dear Harriet
dear Harriet
dear Harriet
dear Harriet
poor Isabella
poor Isabella
poor Isabella
poor Isabella
dear Emma
dear Emma
Little Emma
dear Isabella
dear Emma
little Bella
little Bella
good Bateses
amiable Jane
amiable Jane Fairfax
dear Emma
ill

ill
behind!—Most
dear Emma
dear Isabella
poor Isabella
poor Emma
poor Harriet
young Martin
poor Harriet
poor Isabella
dear Emma
poor Jane
dear Emma
dear Emma
dear Jane
dear Jane
poor Harriet
fortunate Miss
fortunate Miss Hawkins
charming Augusta
charming Augusta Hawkins
dear Emma
dear Emma
dear Emma
poor Jane
poor Jane Fairfax
dear Emma
little Henry
dear Emma
little Henry
dear Emma
dear Emma
Little Henry
perfect Jane
perfect Jane Fairfax
young Cox
young Cox
little Harriet
good exchange?—You
little Emma
poor Isabella
dear Emma
foolish preparation!—You
dear William
dear William Larkins
last

#### Exercise 5 (provided): Run matcher on the document 
**Note:** we set the `as_spans` parameter as True so the results will be returned as Spacy.Span objects. 

In [28]:
# Calling matcher(doc_emma, as_spans=True) directly raised E084 (label ID not in the StringStore).
# The steps below try to make the label available before requesting spans:

# Step 1: Run the matcher without as_spans=True to get (match_id, start, end) tuples
matches = matcher(doc_emma)

# Step 2: Collect the string form of each match label
unique_labels = set()
for match_id, start, end in matches:
    unique_labels.add(nlp.vocab.strings[match_id])

# Attempt to register the labels. Note: indexing the StringStore with a str only
# returns its hash; StringStore.add() is the call that actually registers a string.
for label in unique_labels:
    _ = nlp.vocab.strings[label]

# Step 3: Run the matcher again with as_spans=True
matches = matcher(doc_emma, as_spans=True)

# Collect main character descriptions
main_char_desc = []
for span in matches:
    main_char_desc.append((span.label_, span.text))

# Print the main character descriptions
print(main_char_desc)


ValueError: [E084] Error assigning label ID 8749331614196081296 to span: not in StringStore.
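
One plausible cause of this error (an assumption, not verified against this notebook) is that `doc_emma` was created with a different vocab than the one the matcher uses, for example if `nlp` was reloaded after `doc_emma` was built. In that case the label hash exists in one `StringStore` but not in the other. A minimal sketch of that situation with two stand-alone stores, both hypothetical stand-ins for the real vocabs:

```python
from spacy.strings import StringStore

matcher_store = StringStore()  # stands in for the vocab the Matcher was built with
doc_store = StringStore()      # stands in for a Doc created with a different vocab

label = "Patterns added"
label_hash = matcher_store.add(label)  # add() registers the string and returns its hash

known_before = label in doc_store      # the second store has never seen the label
doc_store.add(label)                   # register it explicitly
resolved = doc_store[label_hash]       # the same hash now resolves to the label

print(known_before, resolved)
```

If this is indeed the cause, re-running `doc_emma = nlp(lines)` with the current `nlp` object, or calling `doc_emma.vocab.strings.add(...)` with the matcher key, would be two things to try.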

#### Exercise 6: add two custom pipeline components
Create two new functions called `identify_main_characters` and `characters_descs` and move your code for Exercises 3 and 4 into them, respectively. The objective is to add these as custom components that are also applied when calling `nlp` on a text.

Each method should extend the attributes for the Doc object. 
- For `identify_main_characters`, a new `main_characters` attribute should be added and it should hold the set of the found main character names (in str form) after processing is done. 
- For `characters_descs`, a new `characters_descriptions` attribute should be added. This should hold the set of adjective+character names found (in str form). 

Refer to `08a_spacy.ipynb` and look for how to add **custom components** and **extension attributes**. See also the Spacy documentation on (1) [Creating custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) and (2) [Extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes).

In [11]:
import spacy
from spacy.tokens import Doc
from spacy.matcher import Matcher
from spacy.language import Language



Main Characters: set()
Character Descriptions: set()
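
A minimal sketch of how the first component could look, assuming a stand-in pipeline: a blank English model plus an `entity_ruler` with two hand-written PERSON patterns, used here in place of the statistical NER in `en_core_web_lg` so the example runs without a model download. The names `demo_nlp` and `demo_doc` are hypothetical:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# force=True lets the cell be re-run without an "extension already exists" error.
Doc.set_extension("main_characters", default=None, force=True)

@Language.component("identify_main_characters")
def identify_main_characters(doc):
    # Keep PERSON entities whose text has at least two whitespace-separated words.
    doc._.main_characters = {
        ent.text for ent in doc.ents
        if ent.label_ == "PERSON" and len(ent.text.split()) >= 2
    }
    return doc

# Stand-in pipeline (assumption): rule-based entities instead of a trained NER.
demo_nlp = spacy.blank("en")
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Emma Woodhouse"},
    {"label": "PERSON", "pattern": "Harriet"},
])
demo_nlp.add_pipe("identify_main_characters", last=True)

demo_doc = demo_nlp("Emma Woodhouse visited Harriet.")
print(demo_doc._.main_characters)
```

Note that the extension is read under the name it was registered with (`doc._.main_characters`), which is separate from the component name. `characters_descs` could follow the same shape, running the matcher inside the component and storing the span texts in `doc._.characters_descriptions`.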


#### Exercise 7 (provided): add the custom components and run processing on the text again
Inspect the outputs of the two custom components to see the main characters in the text as well as an idea of how the characters are portrayed in it.

In [12]:
# Add the component to the pipeline
print(nlp.pipe_names)
doc_emma_new  = nlp(lines)
doc_emma_new._.identify_main_characters, doc_emma_new._.characters_descriptions

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'identify_main_characters', 'characters_descs']


AttributeError: [E046] Can't retrieve unregistered extension attribute 'identify_main_characters'. Did you forget to call the `set_extension` method?

### Part B: Write a function that extracts the text in each chapter of the book
The content layout of the book is as follows: 

**Contents**

 VOLUME I.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.

 VOLUME II.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.

 VOLUME III.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.
 CHAPTER XIX.


#### Exercise 8 (provided): open the file and read its contents

In [None]:
with open('emma-austen.txt', encoding  = 'utf-8') as f: 
    lines = f.readlines()

#### Exercise 9: compile two regex objects to identify lines for volume and chapter headers 
An initial set of solutions has been provided, but it needs to be corrected. You will need to (1) add or change some parts of the patterns and (2) simplify them by removing duplicated parts. Note: you should assume the following:
- volume and chapter headers can be numbered in Arabic or Roman numerals ("Volume 5"/"Volume V" etc.)
- the headers can be title-cased, lowercased or uppercased (e.g. "Volume", "volume" or "VOLUME")

In [None]:
### YOUR SOLUTION HERE ###
import re
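# Header keyword, whitespace, then an Arabic or Roman numeral (any case).
# The trailing \b stops a partial match on ordinary words such as "mixed".
r_volume = re.compile(r'volume\s+(\d+|[ivxlcdm]+)\b', re.IGNORECASE)
r_chapter = re.compile(r'chapter\s+(\d+|[ivxlcdm]+)\b', re.IGNORECASE)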
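
A quick way to check header patterns against the stated assumptions is to run them over a few hand-written lines. The sketch below uses its own stricter variants (`header_volume`, `header_chapter`, both hypothetical) that also require the header to be the only thing on the line, and a small `classify` helper introduced only for this illustration:

```python
import re

# Assumed patterns: optional trailing "." and nothing else on the line.
header_volume = re.compile(r"volume\s+(\d+|[ivxlcdm]+)\.?\s*$", re.IGNORECASE)
header_chapter = re.compile(r"chapter\s+(\d+|[ivxlcdm]+)\.?\s*$", re.IGNORECASE)

def classify(line):
    """Return 'volume', 'chapter' or 'text' for a single line."""
    if header_volume.match(line):
        return "volume"
    if header_chapter.match(line):
        return "chapter"
    return "text"

samples = ["VOLUME I\n", "Volume 5\n", "volume iii.\n", "CHAPTER XIV\n",
           "Chapter 2\n", "Chapter and verse were quoted.\n"]
for line in samples:
    print(f"{line.strip()!r} -> {classify(line)}")
```

Because `match` anchors at the start of the line, lines with leading whitespace (such as indented table-of-contents entries) are classified as text by these variants.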

#### Exercise 10: use the two regex objects above in the following code snippet
The goal is to populate a dictionary, which we will name `book`. Each volume of the text will have an entry in `book`, which in turn contains the chapters in that volume. Each chapter is a list of lines in the order they appear in the text. **All of the keys in your dictionary must be strings.**

NOTE: you need to identify two areas in the code snippet that need changes to meet the specifications above. 

In [None]:
### YOUR SOLUTION HERE ###

book = {}
curr_vol = None
curr_chap = None

for l in lines:
    vline = r_volume.match(l)
    cline = r_chapter.match(l)

    if vline:
        # Use the matched header text (a str) as the key, not the Match object
        curr_vol = vline.group(0)
        if curr_vol not in book:
            book[curr_vol] = {}
            curr_chap = None
        continue

    elif cline and curr_vol is not None:
        # Same here: the chapter key must be a str
        curr_chap = cline.group(0)
        if curr_chap not in book[curr_vol]:
            book[curr_vol][curr_chap] = []

    elif curr_chap is not None and curr_vol is not None:
        book[curr_vol][curr_chap].append(l)

for v in book:
    print(f'{v}\n\n')
    for c in book[v]:
        print(f'{c}\n{book[v][c][3]}')

#### Sanity check: make sure your changes to the code snippet achieved the desired output

In [None]:
for k,v in book.items():
    print(k, len(v))
    for k2, v2 in v.items():
        print('\t\t', k2, len(v2))

### Part C: Identifying the characters and portrayal information on a cleaner version of the text

#### Exercise 11: Apply your custom spacy components on each chapter

Collect the set of main characters and character descriptions from these. Compare it with your initial set applied to the contents of the entire .txt file. 

Apply `nlp` to the text associated with each chapter. Note: when defining the custom components and setting the new attributes (using `.set_extension`), pass `force=True`; this allows the same `nlp` object to be reused, with the added attributes re-registered each time without raising an error.

In [None]:
### YOUR SOLUTION HERE ###
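
A minimal sketch of the per-chapter loop, assuming `book` has the nested volume → chapter → list-of-lines shape from Part B. The helper `collect_per_chapter`, the miniature `toy_book` and the stand-in analyser `find_names` are all hypothetical and exist only to make the sketch runnable without a model:

```python
def collect_per_chapter(book, analyse):
    """Apply `analyse` to each chapter's text and union the resulting sets."""
    combined = set()
    for volume, chapters in book.items():
        for chapter, chapter_lines in chapters.items():
            text = "".join(chapter_lines)  # lines from readlines() keep their "\n"
            combined |= set(analyse(text))
    return combined

# Hypothetical miniature book, for illustration only.
toy_book = {
    "VOLUME I": {
        "CHAPTER I": ["Emma Woodhouse was handsome.\n", "Mr. Woodhouse agreed.\n"],
        "CHAPTER II": ["Frank Churchill arrived.\n"],
    }
}
known_names = ["Emma Woodhouse", "Frank Churchill"]

def find_names(text):
    # Stand-in for the Spacy component: plain substring lookup.
    return {name for name in known_names if name in text}

print(collect_per_chapter(toy_book, find_names))
```

In the notebook, `analyse` could instead be something like `lambda text: nlp(text)._.main_characters`, assuming the custom components are registered and added to the pipeline.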