# Using SpaCy to Extract Information from Cast Biographies

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>INF 558: Building Knowledge Graphs</u></sub>

SpaCy is an open-source software library for advanced natural language processing (NLP). SpaCy provides a one-stop-shop for tasks commonly used in any NLP project, including: Tokenisation, Lemmatisation, Part-of-speech (POS) tagging, Entity recognition, Dependency parsing, Sentence recognition, Word-to-vector transformations and many more methods for cleaning and normalising text data.

This notebook introduces some applied examples of NLP tasks to extract information from unstructured data using spaCy. The extracted structured data we produce can be used for downstream applications, such as creating Knowledge Graphs!

In [4]:
!pip install spacy

Collecting spacy
  Downloading spacy-2.3.5-cp38-cp38-manylinux2014_x86_64.whl (10.5 MB)
[K     |████████████████████████████████| 10.5 MB 3.1 MB/s 
[?25hCollecting plac<1.2.0,>=0.9.6
  Downloading plac-1.1.3-py2.py3-none-any.whl (20 kB)
Collecting catalogue<1.1.0,>=0.0.7
  Downloading catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp38-cp38-manylinux2014_x86_64.whl (35 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp38-cp38-manylinux2014_x86_64.whl (20 kB)
Collecting srsly<1.1.0,>=1.0.2
  Downloading srsly-1.0.5-cp38-cp38-manylinux2014_x86_64.whl (186 kB)
[K     |████████████████████████████████| 186 kB 20.3 MB/s 
Collecting wasabi<1.1.0,>=0.4.0
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp38-cp38-manylinux2014_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 17.1 MB/s 
[?25hCollecting blis<0.8.0,>=0.4.0
  Downloadi

## Language Model

There are various different types of models in spaCy. We well use an available pretrained statistical model for English (`en_core_web_sm`). Let’s download then load it.

In [5]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 6.5 MB/s 
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... [?25ldone
[?25h  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=92c54a13feec2de5ce779d41c8fac80e63a54b8dd31a16e8d70d010651b2ddab
  Stored in directory: /tmp/pip-ephem-wheel-cache-tpyk295r/wheels/ee/4d/f7/563214122be1540b5f9197b52cb3ddb9c4a8070808b22d5a84
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


In [6]:
import spacy
import en_core_web_sm
import csv

We will store the model in an nlp object which is a language model instance.

In [7]:
nlp = en_core_web_sm.load()

## Sentence Segmentation

Sentence Segmentation is the process of locating the start and end of sentences in a given text. This allows you to you divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part of speech tagging and entity extraction.

First, let's load a cast biography from the provided sxample `tsv` file

In [12]:
data = [["https://www.imdb.com/name/nm0000095", """Woody Allen was born Allan Stewart Konigsberg on December 1, 1935 in Brooklyn, New York, to Nettie (Cherrie), a bookkeeper, and Martin Konigsberg, a waiter and jewellery engraver. His father was of Russian Jewish descent, and his maternal grandparents were Austrian Jewish immigrants. As a young boy, he became intrigued with magic tricks and playing the clarinet, two hobbies that he continues today. Allen broke into show business at 15 years when he started writing jokes for a local paper, receiving $200 a week. He later moved on to write jokes for talk shows but felt that his jokes were being wasted. His agents, Charles Joffe and Jack Rollins, convinced him to start doing stand-up and telling his own jokes. Reluctantly he agreed and, although he initially performed with such fear of the audience that he would cover his ears when they applauded his jokes, he eventually became very successful at stand-up. After performing on stage for a few years, he was approached to write a script for Warren Beatty to star in: What's New Pussycat (1965) and would also have a moderate role as a character in the film. During production, Woody gave himself more and better lines and left Beatty with less compelling dialogue. Beatty inevitably quit the project and was replaced by Peter Sellers, who demanded all the best lines and more screen-time. It was from this experience that Woody realized that he could not work on a film without complete control over its production. Woody's theoretical directorial debut was in What's Up, Tiger Lily? (1966); a Japanese spy flick that he dubbed over with his own comedic dialogue about spies searching for the secret recipe for egg salad. His real directorial debut came the next year in the mockumentary Take the Money and Run (1969). He has written, directed and, more often than not, starred in about a film a year ever since, while simultaneously writing more than a dozen plays and several books of comedy. While best known for his romantic comedies Annie Hall (1977) and Manhattan (1979), Woody has made many transitions in his films throughout the years, transitioning from his "early, funny ones" of Bananas (1971), Love and Death (1975) and Everything You Always Wanted to Know About Sex * But Were Afraid to Ask (1972); to his more storied and romantic comedies of Annie Hall (1977), Manhattan (1979) and Hannah and Her Sisters (1986); to the Bergmanesque films of Stardust Memories (1980) and Interiors (1978); and then on to the more recent, but varied works of Crimes and Misdemeanors (1989), Husbands and Wives (1992), Mighty Aphrodite (1995), Celebrity (1998) and Deconstructing Harry (1997); and finally to his films of the last decade, which vary from the light comedy of Scoop (2006), to the self-destructive darkness of Match Point (2005) and, most recently, to the cinematically beautiful tale of Vicky Cristina Barcelona (2008). Although his stories and style have changed over the years, he is regarded as one of the best filmmakers of our time because of his views on art and his mastery of filmmaking."""], ["https://www.imdb.com/name/nm0001804", """Actor Stanley Tucci was born on November 11, 1960, in Peekskill, New York. He is the son of Joan (Tropiano), a writer, and Stanley Tucci, an art teacher. His family is Italian-American, with origins in Calabria. Tucci took an interest in acting while in high school, and went on to attend the State University of New York's Conservatory of Theater Arts in Purchase. He began his professional career on the stage, making his Broadway debut in 1982, and then made his film debut in Prizzi's Honor (1985). In 2009, Tucci received his first Academy Award nomination for his turn as a child murderer in The Lovely Bones (2009). He also received a BAFTA nomination and a Golden Globe nomination for the same role. Other than The Lovely Bones, Tucci has recently had noteworthy supporting turns in a broad range of movies including Lucky Number Slevin (2006), The Devil Wears Prada (2006) and Captain America: The First Avenger (2011). Tucci reached his widest audience yet when he played Caesar Flickerman in box office sensation The Hunger Games (2012). While maintaining an active career in movies, Tucci received major accolades for some work in television. He won an Emmy and a Golden Globe for his role in TV movie Winchell (1998), an Emmy for a guest turn on Monk (2002), and a Golden Globe for his role in HBO movie Conspiracy (2001). Tucci has also had an extensive career behind the camera. His directorial efforts include Big Night (1996), The Impostors (1998), Joe Gould's Secret (2000) and Blind Date (2007), and he did credited work on all of those screenplays with the exception of Joe Gould's Secret (2000). Tucci has three children with Kate Tucci, who passed away in 2009. Tucci married Felicity Blunt in August 2012."""]]

for (idx, (act, bio)) in enumerate(data):
    print(f'[{idx:2d}] >', act)
    biog = bio

[ 0] > https://www.imdb.com/name/nm0000095
[ 1] > https://www.imdb.com/name/nm0001804


Here's the full biography text:

In [13]:
biog

"Actor Stanley Tucci was born on November 11, 1960, in Peekskill, New York. He is the son of Joan (Tropiano), a writer, and Stanley Tucci, an art teacher. His family is Italian-American, with origins in Calabria. Tucci took an interest in acting while in high school, and went on to attend the State University of New York's Conservatory of Theater Arts in Purchase. He began his professional career on the stage, making his Broadway debut in 1982, and then made his film debut in Prizzi's Honor (1985). In 2009, Tucci received his first Academy Award nomination for his turn as a child murderer in The Lovely Bones (2009). He also received a BAFTA nomination and a Golden Globe nomination for the same role. Other than The Lovely Bones, Tucci has recently had noteworthy supporting turns in a broad range of movies including Lucky Number Slevin (2006), The Devil Wears Prada (2006) and Captain America: The First Avenger (2011). Tucci reached his widest audience yet when he played Caesar Flickerman

Let’s read a text using spaCy and store in a `doc` object which is a container for accessing linguistic annotations.

In [14]:
doc = nlp(biog)

In spaCy, the `sents` property is used to extract sentences. Here’s how you would extract the sentences for a given input text:

In [15]:
for idx, sent in enumerate(doc.sents):
    print(f'[{idx:2d}] >', sent)
    mysent = str(sent)

[ 0] > Actor Stanley Tucci was born on November 11, 1960, in Peekskill, New York.
[ 1] > He is the son of Joan (Tropiano), a writer, and Stanley Tucci, an art teacher.
[ 2] > His family is Italian-American, with origins in Calabria.
[ 3] > Tucci took an interest in acting while in high school, and went on to attend the State University of New York's Conservatory of Theater Arts in Purchase.
[ 4] > He began his professional career on the stage, making his Broadway debut in 1982, and then made his film debut in Prizzi's Honor (1985).
[ 5] > In 2009, Tucci received his first Academy Award nomination for his turn as a child murderer in The Lovely Bones (2009).
[ 6] > He also received a BAFTA nomination and a Golden Globe nomination for the same role.
[ 7] > Other than The Lovely Bones, Tucci has recently had noteworthy supporting turns in a broad range of movies including Lucky Number Slevin (2006), The Devil Wears Prada (2006) and Captain America: The First Avenger (2011).
[ 8] > Tucci re

Here's the sentence we will work on moving forward:

In [16]:
mysent

'Tucci married Felicity Blunt in August 2012.'

## Tokenization & POS tagging

Tokenization is the next step after sentence detection. It allows you to identify the basic units in your text. These basic units are called tokens. Tokenization is useful because it breaks a text into meaningful units. These units are used for further analysis, like part of speech tagging.

Parts-of-speech (POS) is a grammatical role that explains how a particular word is used in a sentence. There are eight parts-of-speech: Noun, Pronoun, Adjective, Verb, Adverb, Preposition, Conjunction, Interjection.

You can print tokens and their POS tages by iterating on the `doc` object:

In [17]:
doc = nlp(mysent)
for w in doc:
    print(f'{w.text:15s} [{w.tag_:5s} | {w.pos_:6s} | {spacy.explain(w.tag_)}]')

Tucci           [NNP   | PROPN  | noun, proper singular]
married         [VBD   | VERB   | verb, past tense]
Felicity        [NNP   | PROPN  | noun, proper singular]
Blunt           [NNP   | PROPN  | noun, proper singular]
in              [IN    | ADP    | conjunction, subordinating or preposition]
August          [NNP   | PROPN  | noun, proper singular]
2012            [CD    | NUM    | cardinal number]
.               [.     | PUNCT  | punctuation mark, sentence closer]


## Relation Extraction & Dependency Parsing

The POS tags alone are not sufficient for various cases and require further analysis like dependency parsing. Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. Now, let’s extract the dependency relations among entities:

In [18]:
for w in doc: 
    print(f'{w.text:15s} [{w.dep_}]')

Tucci           [nsubj]
married         [ROOT]
Felicity        [compound]
Blunt           [dobj]
in              [prep]
August          [pobj]
2012            [nummod]
.               [punct]


## Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

In [19]:
from spacy import displacy
options = {"distance": 120}
displacy.render(doc, style="dep", options=options)

## Entity recognition

Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works-of-art and nationalities / religion.

Let's parse our sentence, then access the identified entities using the `doc` object's `.ents` method. With this method called on the `doc` we can access additional `token` methods, specifically `.label_`:

In [20]:
for ent in doc.ents:
    print(f'{ent.text:15s} [{ent.label_}]')

Tucci           [PERSON]
Felicity Blunt  [PERSON]
August 2012     [DATE]


## Rule-Based Matching

Rule-based matching is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).

Rule-based matching can use regular expressions to extract entities or relations from an unstructured text. It’s different from extracting text using regular expressions only in the sense that regular expressions don’t consider the lexical and grammatical attributes of the text.

The spaCy library comes with `Matcher` tool that can be used to specify custom rules for phrase matching. The process to use the `Matcher` tool is pretty straight forward. Here's an example:

In [21]:
from spacy.matcher import Matcher

# define the pattern 
pattern = [{'POS': 'PROPN'},
           {'LOWER': 'married'},
           {'ENT_TYPE': 'PERSON'}]
   
# Matcher class object 
matcher = Matcher(nlp.vocab) 
matcher.add("matching_1", None, pattern) 

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

Tucci married Felicity


**Notes**:
- You can find additional examples and use-cases in [SpaCy's documentation](https://spacy.io/usage/rule-based-matching).
- You can use the online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher) to test spaCy's rule-based `Matcher` by creating token patterns interactively and executing them.
- Here's a nice [article](https://stackabuse.com/python-for-nlp-vocabulary-and-phrase-matching-with-spacy/) you can review. In the article, the author explores vocabulary and phrase matching using the spaCy library. He defines patterns and detects phrases that match the defined patterns. 

Now, you know how to perform some basic NLP tasks like sentence segmentation, tokenization, POS tagging, entity recognition, and - most important - Rule-Based Matching. You now have enough knowledge about how to get the entities and the relations between entities and extract structured data that can be used for downstream applications, such as building a Knowledge Graph! Congratulations!

You can start applying this knowledge on the tasks you are required to do for Homework 02 of the class :)

In [3]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Samuel Alexander Mendes was born on August 1, 1965 in Reading, England, UK to parents James Peter Mendes, a retired university lecturer from University of Southern California, and Valerie Helene Mendes, an author. He later attended New York's High School of Performing Arts who writes children's books.")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

options = {"distance": 120}
displacy.render(doc, style="dep", options=options)

Samuel compound Mendes PROPN []
Alexander compound Mendes PROPN []
Mendes nsubjpass born VERB [Samuel, Alexander]
was auxpass born VERB []
born ROOT born VERB [Mendes, was, on, to]
on prep born VERB [August]
August pobj on ADP [1, ,, 1965, in]
1 nummod August PROPN []
, punct August PROPN []
1965 nummod August PROPN []
in prep August PROPN [Reading]
Reading pobj in ADP [,, England, ,, UK]
, punct Reading PROPN []
England conj Reading PROPN []
, punct Reading PROPN []
UK appos Reading PROPN []
to prep born VERB [parents]
parents pobj to ADP []
James compound Mendes PROPN []
Peter compound Mendes PROPN []
Mendes ROOT Mendes PROPN [James, Peter, ,, lecturer, ,, and, Mendes, .]
, punct Mendes PROPN []
a det lecturer NOUN []
retired amod lecturer NOUN []
university compound lecturer NOUN []
lecturer appos Mendes PROPN [a, retired, university, from]
from prep lecturer NOUN [University]
University pobj from ADP [of]
of prep University PROPN [California]
Southern compound California PROPN []
C

In [5]:
from spacy.symbols import nsubj, VERB
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{writes, attended}
