# 08a Exercises: spaCy & regular expressions

In this exercise, we will practice writing regex search patterns, explore the spaCy Matcher engine, write custom spaCy pipeline components, and learn how to add attributes to spaCy Doc objects (the same approach applies to Span and Token objects).

### Preliminaries: load modules and a spacy model
We will use the `en_core_web_lg` model for better POS-tagging and named entity recognition performance. However, if this takes too long to run, switch to the `en_core_web_sm` model.

In [None]:
import re
import spacy
from pprint import pprint
from spacy import displacy
from collections import Counter, defaultdict

# Download the model (only needed once per environment)
!python -m spacy download en_core_web_lg

# There are two ways to load a model
# 1. use spacy.load
# nlp = spacy.load('en_core_web_lg')
# 2. import as a module
import en_core_web_lg
nlp = en_core_web_lg.load()

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### PART A: Identify all the important characters in the text

#### Exercise 1: open and read the `emma-austen.txt` file as a single string

In [None]:
with open('emma-austen.txt', encoding  = 'utf-8') as f:
    lines = f.read()
print(lines)

﻿The Project Gutenberg eBook of Emma, by Jane Austen

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Emma

Author: Jane Austen

Release Date: August, 1994 [eBook #158]
[Most recently updated: December 14, 2021]

Language: English

Character set encoding: UTF-8

Produced by: An Anonymous Volunteer and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***




Emma

by Jane Austen


Contents

 VOLUME I.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER 
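
The printed text appears to begin with an invisible byte-order mark (U+FEFF), which the `utf-8` codec keeps as the first character of the string. A minimal sketch, using a made-up in-memory byte string rather than the real file, of how the `utf-8-sig` codec drops it. Note that switching codecs shifts every character offset in the string by one.

```python
# Stand-in for the first bytes of the file (an assumption, not read from disk).
raw = b"\xef\xbb\xbfThe Project Gutenberg eBook of Emma, by Jane Austen"

with_bom = raw.decode("utf-8")         # keeps U+FEFF as character 0
without_bom = raw.decode("utf-8-sig")  # strips a leading BOM if present

print(repr(with_bom[0]))
print(with_bom.find("Jane"), without_bom.find("Jane"))
```

The same codec name can be passed to `open(..., encoding='utf-8-sig')`.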

#### Exercise 2: Finding all mentions of Jane Fairfax in the file
Write a regex search that returns the start index of **all** mentions of "Jane Fairfax" in the text. Note that she may also be referred to as "Miss Fairfax". Hint: use `finditer` and a mix of lookahead and alternation operators.

In [None]:
### YOUR SOLUTION HERE ###
import re
for item in re.finditer("Jane |Miss (Fairfax)", lines):
  print(item.start(0))

41
519
783
153505
153779
185108
185266
186010
271923
273776
274196
274339
274471
274865
276384
277362
277773
277803
278185
279010
280137
280388
280856
281017
281183
281866
282037
282540
283083
283309
284658
284863
285174
285311
285553
286321
286408
286558
287968
288303
288897
289782
290831
293273
293533
294447
294757
295888
297730
300962
301198
302039
303281
305321
306427
306593
309186
309726
310829
313563
347110
347450
355290
355491
356444
357809
358009
358752
359028
360490
361660
361815
362339
383097
383441
384241
385253
385778
386822
388316
389859
390990
392990
395146
398098
398277
398863
399558
401816
401918
402760
403224
403350
404073
404123
404329
405435
406932
407470
408328
409330
411309
412071
412847
412989
413773
414487
414833
415271
415713
416046
420752
423110
423204
424980
426202
427999
428691
431121
431707
432163
432604
433243
433366
433783
436005
436345
436612
438446
438529
440140
440190
440518
442915
443822
444079
461791
462226
466541
471195
471470
482423
506011
506489
50
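
In the pattern `Jane |Miss (Fairfax)`, the `|` operator has the lowest precedence, so the regex reads as "`Jane ` or `Miss Fairfax`" and also reports every bare "Jane " (the first offset printed above, 41, appears to be the "Jane Austen" in the Gutenberg title line). A minimal sketch on a made-up sample string, showing one way to group the alternation and use a lookahead as the hint suggests; the `\s+` is an assumption that also allows a name split across a line break.

```python
import re

# Made-up sample, not taken from the novel.
sample = "by Jane Austen. Miss Fairfax sang; Jane Fairfax played. Jane, come here."

# Ungrouped alternation: "Jane " OR "Miss Fairfax".
loose = [m.start() for m in re.finditer(r"Jane |Miss (Fairfax)", sample)]

# Grouped alternation plus lookahead: "Jane" or "Miss", only when "Fairfax" follows.
strict = [m.start() for m in re.finditer(r"\b(?:Jane|Miss)(?=\s+Fairfax\b)", sample)]

print(loose)   # includes the offset of "Jane Austen"
print(strict)  # only the two Fairfax mentions
```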

#### Exercise 3: identify the set of main characters in the book
We will leverage spaCy's EntityRecognizer (https://spacy.io/api/entityrecognizer), which is already part of the loaded pipeline (cf. `nlp.pipe_names`), and use two rules of thumb to identify the main characters: (i) entities tagged as persons; and (ii) entities whose spans contain at least 2 words, e.g. Emma Woodhouse. That is, we assume the author gives the full name of important characters as they are introduced.

First process the text with spaCy (apply the `nlp` object to the string containing the text). **Make sure to save the spaCy Doc object to the variable `doc_emma`.** Use the `.ents` attribute to get the detected entities. Then use the `.label_` attribute to identify the *PERSON* entities; finally use the `.text` attribute to recover the str form of each span.

Store the names (string form) of the main characters in a set with the variable name `main_characters`.

In [None]:
### YOUR SOLUTION HERE ###
doc_emma = nlp(lines)

# Keep PERSON entities whose span covers at least two tokens
persons = []
for sentence in doc_emma.sents:
    for ent in sentence.ents:
        if ent.label_ == 'PERSON' and len(ent) >= 2:
            persons.append(ent.text)

main_characters = set(persons)
print(main_characters)



{'Robert Martin', 'Harriet more conversable', 'de Genlis’ Adelaide', 'Donwell Lane', 'Miss F', 'near—“Miss Bates', 'Harriet\n', 'Miss Prince', 'John ostler', 'Harriet good', 'Elizabeth Martin', 'Harriet exultingly', 'Jane Austen', 'Elizabeth\nMartin', 'Harriet’s', 'Miss Smith!—Miss Smith', 'Miss Woodhouse?)—for', 'F. C. Weston Churchill', 'Jane Fairfax’s', 'Frank Churchill.—He', 'James Cooper', 'Robin_', 'CHAPTER XVI', 'Harriet indignantly.—“Oh!', 'William Larkins', 'William Coxe', 'Emma Woodhouse-ing', 'Redistributing Project\nGutenberg-tm', 'Farmer Mitchell’s', 'John Saunders', 'Harriet Smith!—It', 'CHAPTER XIII', 'Miss Smith!—Very', 'Humph—Harriet', 'Miss Hawkins!—Good\nmorning', 'Miss Bates', 'John Knightley.—“It', 'Miss F.', 'herself.—Robert Martin', 'Jane Fairfax.—And', 'Harriet earnestly', 'E. The', 'David Widger\n\n*', 'Knightley.—“Robert Martin', 'Jane Fairfax', 'Abbey fish-ponds', 'Project Gutenberg-tm', 'Miss\nBates', 'Miss Woodhouse', 'Miss W.', 'Harriet\nseemed', 'Harriet 

#### Exercise 4: Get a sense of how each character is portrayed in the text
We will leverage rule-based matching (using the Matcher, https://spacy.io/usage/rule-based-matching).
One pattern we can use is to look for spans that are tagged `PERSON` and for the adjectives surrounding them, e.g. (1) "*beautiful* Emma" or (2) "Emma Woodhouse is *charming*".

The pattern for (1) has been done for you: we look for spans of 1 or more tokens (using the [quantifier](https://spacy.io/usage/rule-based-matching#quantifiers) "OP": "{1,}") that have been labeled as `PERSON`, preceded by a token that has an ADJ part-of-speech tag. **Your task is to do the same for (2), add it to the matcher, and then run the matcher on `doc_emma`.**

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern1 = [{"POS": "ADJ"},
            {"ENT_TYPE": "PERSON", "OP": "{1,}"}]
matcher.add("Main char description #1", [pattern1])

In [None]:
### YOUR SOLUTION HERE ###
pattern2 = [{"ENT_TYPE": "PERSON", "OP": "{1,}"},
            {"POS": "VERB"},
            {"POS": "ADJ"}]
matcher.add("Main char description #2", [pattern2])
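
The English models typically tag a copular "is" as `AUX` rather than `VERB`, so a pattern that requires `{"POS": "VERB"}` between the name and the adjective may never match example (2). A minimal sketch of the difference, assuming hand-written tags on a tiny `Doc` (a stand-in for the model's predictions, so it runs without downloading a model) and using the `"+"` operator, which is equivalent to `"{1,}"`:

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-written annotations (assumed, not model output).
doc = Doc(
    vocab,
    words=["Emma", "Woodhouse", "is", "charming", "."],
    pos=["PROPN", "PROPN", "AUX", "ADJ", "PUNCT"],
    ents=["B-PERSON", "I-PERSON", "O", "O", "O"],
)

person = {"ENT_TYPE": "PERSON", "OP": "+"}

# Requires a VERB between the name and the adjective.
verb_only = Matcher(vocab)
verb_only.add("VERB_ONLY", [[person, {"POS": "VERB"}, {"POS": "ADJ"}]])

# Accepts either an auxiliary or a full verb.
aux_or_verb = Matcher(vocab)
aux_or_verb.add("AUX_OR_VERB", [[person, {"POS": {"IN": ["AUX", "VERB"]}}, {"POS": "ADJ"}]])

verb_only_hits = {span.text for span in verb_only(doc, as_spans=True)}
aux_or_verb_hits = {span.text for span in aux_or_verb(doc, as_spans=True)}
print(verb_only_hits)
print(aux_or_verb_hits)
```

Whether to accept full verbs as well ("Harriet looked *happy*") is a design choice; the `IN` set keeps both options open.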

#### Exercise 5 (provided): Run matcher on the document
**Note:** we set the `as_spans` parameter to True so that the results are returned as spaCy `Span` objects.

In [None]:
matches = matcher(doc_emma, as_spans = True)

main_char_desc = set()
for span in matches:
    main_char_desc.add((span.label_, span.text))

#### Exercise 6: add two custom pipeline components
Create two new functions called `identify_main_characters` and `characters_descs` and move your code for Exercises 3 and 4 into them, respectively. The objective is to add these as custom components so that they are also applied whenever `nlp` is called on a text.

Each method should extend the attributes for the Doc object.
- For `identify_main_characters`, a new `main_characters` attribute should be added and it should hold the set of the found main character names (in str form) after processing is done.
- For `characters_descs`, a new `character_descriptions` attribute should be added. This should hold the set of adjective+character names found (in str form).

Refer to `08a_spacy.ipynb` and look for how to add **custom components** and **extension attributes**. See also the spaCy documentation on (1) [Creating custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) and (2) [Extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes).
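
A minimal sketch of the two mechanisms working together, assuming a blank English pipeline (no model download needed). The component name `collect_capitalised_words` and the attribute `capitalised_words` are hypothetical and only illustrate the wiring; here the extension is registered once, outside the component, with a guard so the cell can be re-run.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the extension once; the guard avoids an error on re-registration.
if not Doc.has_extension("capitalised_words"):
    Doc.set_extension("capitalised_words", default=None)

@Language.component("collect_capitalised_words")
def collect_capitalised_words(doc):
    # Store a per-document result on the custom attribute and return the doc.
    doc._.capitalised_words = {token.text for token in doc if token.is_title}
    return doc

nlp_demo = spacy.blank("en")  # tokenizer only, no trained components
nlp_demo.add_pipe("collect_capitalised_words")

doc = nlp_demo("Emma Woodhouse met Mr. Knightley at Hartfield.")
print(nlp_demo.pipe_names)
print(doc._.capitalised_words)
```

The attribute is always read back under exactly the name passed to `set_extension`, via `doc._.<name>`.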

In [None]:
### YOUR SOLUTION HERE ###
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc



@Language.component("identify_main_characters")
def identify_main_characters(doc):
    '''
    Identifies main characters from the doc
    '''
    main_characters = set()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            main_characters.add(ent.text)
    # set_extension is a classmethod; force=True lets the component run repeatedly
    Doc.set_extension("main_characters", default=set(), force=True)
    doc._.main_characters = main_characters
    return doc


@Language.component("characters_descs")
def characters_descs(doc):
    '''
    Looks for specific character descriptions
    '''
    # Use the vocab of the doc being processed instead of the global `nlp`
    matcher = Matcher(doc.vocab)

    # Pattern 1: Adjectives followed by persons
    pattern1 = [{"POS": "ADJ"},{"ENT_TYPE": "PERSON", "OP": "{1,}"}]
    matcher.add("Main_char_description_1", [pattern1])

    pattern2 = [{"ENT_TYPE": "PERSON", "OP": "{1,}"},{"POS": "VERB"},{"POS": "ADJ"}]
    matcher.add("Main_char_description_2", [pattern2])

    matches = matcher(doc, as_spans=True)
    main_char_desc = set()
    for span in matches:
        main_char_desc.add((span.label_, span.text))
    # set_extension is a classmethod; force=True lets the component run repeatedly
    Doc.set_extension("character_descriptions", default=set(), force=True)
    doc._.character_descriptions = main_char_desc
    return doc


#### Exercise 7 (provided): add the custom components and run processing on the text again
Inspect the outputs of the two custom components to see the main characters in the text and to get an idea of how they are portrayed in it.

In [None]:
# Add the component to the pipeline
nlp.add_pipe("identify_main_characters")
nlp.add_pipe("characters_descs")
print(nlp.pipe_names)
doc_emma_new  = nlp(lines)
# Read the attribute under the name the component registers: `character_descriptions`
doc_emma_new._.main_characters, doc_emma_new._.character_descriptions

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'identify_main_characters', 'characters_descs']

### Part B: Write a function that extracts the text in each chapter of the book
The content layout of the book is as follows:

**Contents**

 VOLUME I.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.

 VOLUME II.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.

 VOLUME III.
 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CHAPTER V.
 CHAPTER VI.
 CHAPTER VII.
 CHAPTER VIII.
 CHAPTER IX.
 CHAPTER X.
 CHAPTER XI.
 CHAPTER XII.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.
 CHAPTER XIX.


#### Exercise 8 (provided): open the file and read its contents

In [None]:
with open('emma-austen.txt', encoding  = 'utf-8') as f:
    lines = f.readlines()

#### Exercise 9: compile two regex objects to identify lines for volume and chapter headers
An initial set of solutions has been provided for you, but it needs to be corrected. You will need to (1) add or change some parts of the patterns and (2) simplify them by removing duplicated sub-patterns. Note: you should assume the following:
- volume and chapter headers can be numbered in Arabic or Roman numerals ("Volume 5"/"Volume V" etc.)
- the headers can be title-cased, lower-cased or upper-cased (e.g. "Volume", "volume" or "VOLUME")
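
A minimal sketch of how the two assumptions above can be expressed, using a single hypothetical combined pattern `r_header` (the exercise itself asks for two separate patterns). `re.IGNORECASE` handles all casings at once, and because `match` anchors at the start of the string, indented entries such as those in the contents listing are not treated as headers.

```python
import re

# Hypothetical combined pattern: keyword in any case, then digits or Roman numerals.
r_header = re.compile(r"(volume|chapter) (\d+|[ivxlc]+)\b", re.IGNORECASE)

# Made-up sample lines.
samples = ["VOLUME I", "Volume 5", "chapter xiv", " CHAPTER II.", "CHAPTERS of her life"]

parsed = []
for s in samples:
    m = r_header.match(s)  # anchored at position 0
    parsed.append((m.group(1).upper(), m.group(2).upper()) if m else None)

print(parsed)
```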

In [None]:
len(lines)

16868

In [None]:
### YOUR SOLUTION HERE ###
import re
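# re.IGNORECASE covers "VOLUME", "Volume" and "volume" (and lower-case numerals)
# without duplicating the character class
r_volume  = re.compile(r'VOLUME (\d+|[IXV]+)\b', re.IGNORECASE)
r_chapter = re.compile(r'CHAPTER (\d+|[IXV]+)\b', re.IGNORECASE)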

#### Exercise 10: use the two regex object above in the following code snippet
The goal is to populate a dictionary named `book`. Each volume of the text has an entry in `book`, which in turn contains the chapters of that volume. Each chapter is a list of lines in the order they appear in the text. **All of the keys in your dictionary must be strings.**

NOTE: you need to identify two areas in the code snippet that need changes to meet the specifications above.

In [None]:
### YOUR SOLUTION HERE ###

book = {}
curr_vol = None
curr_chap = None

for l in lines:
    # match() anchors at the start of the line, so the indented
    # entries of the contents listing are not picked up as headers
    vline = r_volume.match(l)
    cline = r_chapter.match(l)

    if vline:
        curr_vol = vline.group(1)
        if curr_vol not in book:
            book[curr_vol] = {}
            curr_chap = None
        continue

    elif cline:
        curr_chap = cline.group(1)
        if curr_vol and curr_chap not in book[curr_vol]:
            book[curr_vol][curr_chap] = []

    elif curr_chap and curr_vol:
        book[curr_vol][curr_chap].append(l)

for v in book:
    print(f'{v}\n\n')
    for c in book[v]:
        print(f'{c}\n{book[v][c][3]}')

I


I
happy disposition, seemed to unite some of the best blessings of

II
which for the last two or three generations had been rising into

III
have his friends come and see him; and from various united causes, from

IV
and decided in her ways, Emma lost no time in inviting, encouraging,

V
Knightley, “of this great intimacy between Emma and Harriet Smith, but

VI
direction and raised the gratitude of her young vanity to a very good

VII
for Emma’s services towards her friend. Harriet had been at Hartfield,

VIII
spending more than half her time there, and gradually getting to have a

IX
herself. He was so much displeased, that it was longer than usual

X
prevent the young ladies from tolerably regular exercise; and on the

XI
to superintend his happiness or quicken his measures. The coming of her

XII
Mr. Woodhouse, who did not like that any one should share with him in

XIII
Knightley, in this short visit to Hartfield, going about every morning

XIV
walked into Mrs. Weston’s drawing

#### Sanity check: make sure your changes to the code snippet achieved the desired output

In [None]:
for k,v in book.items():
    print(k, len(v))
    for k2, v2 in v.items():
        print('\t\t', k2, len(v2))

I 18
		 I 333
		 II 166
		 III 181
		 IV 346
		 V 203
		 VI 287
		 VII 263
		 VIII 425
		 IX 559
		 X 271
		 XI 215
		 XII 349
		 XIII 313
		 XIV 238
		 XV 327
		 XVI 188
		 XVII 125
		 XVIII 258
II 18
		 I 282
		 II 231
		 III 393
		 IV 162
		 V 342
		 VI 309
		 VII 222
		 VIII 668
		 IX 327
		 X 240
		 XI 350
		 XII 238
		 XIII 186
		 XIV 394
		 XV 308
		 XVI 335
		 XVII 219
		 XVIII 295
III 19
		 I 123
		 II 445
		 III 163
		 IV 226
		 V 304
		 VI 523
		 VII 363
		 VIII 268
		 IX 236
		 X 355
		 XI 425
		 XII 278
		 XIII 338
		 XIV 322
		 XV 250
		 XVI 351
		 XVII 302
		 XVIII 407
		 XIX 472
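
The counts above agree with the contents listing (18, 18 and 19 chapters). A minimal sketch of turning that comparison into an automated check; `check_layout` and `toy_book` are hypothetical names, and the toy data only illustrates what a failing report looks like.

```python
def check_layout(book, expected_chapters):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for vol, n_expected in expected_chapters.items():
        chapters = book.get(vol, {})
        if len(chapters) != n_expected:
            problems.append(f"volume {vol}: {len(chapters)} chapters, expected {n_expected}")
        for chap, chap_lines in chapters.items():
            if not isinstance(chap, str):
                problems.append(f"volume {vol}: chapter key {chap!r} is not a string")
            if not chap_lines:
                problems.append(f"volume {vol}, chapter {chap}: no lines")
    return problems

# Toy input with one missing chapter and one empty chapter.
toy_book = {"I": {"I": ["line\n"], "II": ["line\n"]}, "II": {"I": []}}
report = check_layout(toy_book, {"I": 2, "II": 2})
print(report)
```

For the real dictionary, the expected counts would be `{"I": 18, "II": 18, "III": 19}`.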


### Part C: Identifying the characters and portrayal information on a cleaner version of the text

#### Exercise 11: Apply your custom spacy components on each chapter

Collect the set of main characters and character descriptions from each chapter. Compare them with your initial sets, which were obtained from the contents of the entire .txt file.

Apply `nlp` to the text associated with each chapter. Note: when setting the new attributes with `.set_extension` inside the custom components, pass `force=True`; this lets the same `nlp` object be reused, because the extension is re-registered on every call instead of raising an error.
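
A minimal sketch of the comparison step, assuming the per-chapter results have already been collected into a dictionary keyed by (volume, chapter). The helper `summarise_characters` and the toy sets are hypothetical; in the notebook the sets would come from `doc._.main_characters`.

```python
from collections import Counter

def summarise_characters(per_chapter, whole_text):
    """Count in how many chapters each name appears and diff against the whole-text set."""
    counts = Counter(name for names in per_chapter.values() for name in names)
    union = set(counts)
    return {
        "chapter_counts": counts,
        "only_in_chapters": union - whole_text,
        "only_in_whole_text": whole_text - union,
    }

# Toy data standing in for real pipeline output.
per_chapter = {
    ("I", "I"): {"Emma", "Miss Woodhouse"},
    ("I", "II"): {"Emma", "Frank Churchill"},
}
whole_text = {"Emma", "Frank Churchill", "CHAPTER XVI"}

summary = summarise_characters(per_chapter, whole_text)
print(summary["chapter_counts"].most_common(1))
print(summary["only_in_chapters"], summary["only_in_whole_text"])
```

Names that occur in many chapters are better candidates for main characters than one-off detections, and the `only_in_whole_text` set tends to expose header noise such as chapter titles.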

In [None]:
for v in book:
    print(f'{v}\n\n')
    for c in book[v]:
        print(f'{c}\n{book[v][c][3]}')

        # Join the lines of the chapter text
        chapter_text = " ".join(book[v][c])

        # Process the chapter text using the NLP pipeline
        doc = nlp(chapter_text)

        # Extract main characters and character descriptions using the custom components
        main_characters = doc._.main_characters
        character_descriptions = doc._.character_descriptions

        # Print main characters and descriptions for each chapter
        print("Main characters:", main_characters)
        print("Character descriptions:", character_descriptions)


I


I
happy disposition, seemed to unite some of the best blessings of

Main characters: {'Elton', 'Woodhouse', 'Emma Woodhouse', 'vex', 'Weston', 'Isabella', 'large.—And', 'Emma', 'Taylor', 'Hannah', 'James', 'Woodhouses', 'Miss Woodhouse', 'Knightley'}
Character descriptions: {('Main_char_description_1', 'Dear Emma'), ('Main_char_description_1', 'poor Isabella'), ('Main_char_description_1', 'poor James')}
II
which for the last two or three generations had been rising into

Main characters: {'Frank', 'Woodhouse', 'Weston', 'Churchill', 'Miss Bates', 'Frank\n Churchill', 'Taylor', 'Emma', 'Perry', 'Frank Churchill', 'Enscombe'}
Character descriptions: {('Main_char_description_1', 'dear Emma'), ('Main_char_description_1', 'little Frank')}
III
have his friends come and see him; and from various united causes, from

Main characters: {'Elton', 'Woodhouse', 'Donwell Abbey', 'Weston', 'Serle', 'Martin', 'Harriet Smith', 'Miss Bates', 'Bates', 'Emma', 'Goddard', 'Miss Smith', 'James', 'Smith'