# Using spaCy to Extract Information from Cast Biographies

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>CSCI 563/INF 558: Building Knowledge Graphs</u> during Spring 2020 at University of Southern California (USC).</sub>

spaCy is an open-source software library for advanced natural language processing (NLP). It covers the tasks most NLP projects need, including tokenization, lemmatization, part-of-speech (POS) tagging, entity recognition, dependency parsing, sentence segmentation, and word-to-vector transformations, along with other methods for cleaning and normalizing text data.

This notebook introduces some applied examples of NLP tasks to extract information from unstructured data using spaCy. The extracted structured data we produce can be used for downstream applications, such as creating Knowledge Graphs!

## Language Model

spaCy offers several types of models. We will use a pretrained statistical model for English (`en_core_web_sm`). Let’s download it and then load it.

In [None]:
!python3 -m spacy download en_core_web_sm

In [1]:
import spacy
import en_core_web_sm
import csv

We will store the model in an `nlp` object, which is a language model instance.

In [2]:
nlp = en_core_web_sm.load()

## Sentence Segmentation

Sentence segmentation is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units, which you’ll use when performing tasks such as part-of-speech tagging and entity extraction.

First, let's load a cast biography from the provided sample `tsv` file:

In [3]:
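# open the file in a `with` block so the handle is closed after reading
with open('entities_bio_sample.tsv', newline='') as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='\t')
    for idx, (act, bio) in enumerate(tsv_reader):
        print(f'[{idx:2d}] >', act)
        biog = bio  # after the loop, biog holds the biography from the last row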

[ 0] > https://www.imdb.com/name/nm0000095
[ 1] > https://www.imdb.com/name/nm0001804
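The loop above keeps only the last biography. If you want every biography available by actor URL, one option is to collect the rows into a dictionary. The following is a minimal sketch, assuming the file always has two tab-separated columns (URL, then biography). The helper name `read_biographies` is hypothetical, and the in-memory sample uses shortened placeholder text rather than the real file contents.

```python
import csv
import io


def read_biographies(tsv_file):
    """Map each actor URL to its biography, given an open two-column TSV file."""
    biographies = {}
    for row in csv.reader(tsv_file, delimiter='\t'):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        actor_url, biography = row
        biographies[actor_url] = biography
    return biographies


# Stand-in for open('entities_bio_sample.tsv', newline=''); the biography
# texts below are shortened placeholders, not the real file contents.
sample_file = io.StringIO(
    "https://www.imdb.com/name/nm0000095\t(first biography text)\n"
    "https://www.imdb.com/name/nm0001804\tActor Stanley Tucci was born on November 11, 1960, in Peekskill, New York.\n"
)
biographies = read_biographies(sample_file)
print(list(biographies))
```

With the real file, you would pass the handle from `open('entities_bio_sample.tsv', newline='')` instead of `sample_file`.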


Here's the full biography text:

In [4]:
biog

"Actor Stanley Tucci was born on November 11, 1960, in Peekskill, New York. He is the son of Joan (Tropiano), a writer, and Stanley Tucci, an art teacher. His family is Italian-American, with origins in Calabria. Tucci took an interest in acting while in high school, and went on to attend the State University of New York's Conservatory of Theater Arts in Purchase. He began his professional career on the stage, making his Broadway debut in 1982, and then made his film debut in Prizzi's Honor (1985). In 2009, Tucci received his first Academy Award nomination for his turn as a child murderer in The Lovely Bones (2009). He also received a BAFTA nomination and a Golden Globe nomination for the same role. Other than The Lovely Bones, Tucci has recently had noteworthy supporting turns in a broad range of movies including Lucky Number Slevin (2006), The Devil Wears Prada (2006) and Captain America: The First Avenger (2011). Tucci reached his widest audience yet when he played Caesar Flickerman

Let’s process the text with spaCy and store the result in a `doc` object, which is a container for accessing linguistic annotations.

In [5]:
doc = nlp(biog)

In spaCy, the `sents` property is used to extract sentences. Here’s how you would extract the sentences for a given input text:

In [6]:
for idx, sent in enumerate(doc.sents):
    print(f'[{idx:2d}] >', sent)
    mysent = str(sent)  # after the loop, mysent holds the last sentence of the biography

[ 0] > Actor Stanley Tucci was born on November 11, 1960, in Peekskill, New York.
[ 1] > He is the son of Joan (Tropiano), a writer, and Stanley Tucci, an art teacher.
[ 2] > His family is Italian-American, with origins in Calabria.
[ 3] > Tucci took an interest in acting while in high school, and went on to attend the State University of New York's Conservatory of Theater Arts in Purchase.
[ 4] > He began his professional career on the stage, making his Broadway debut in 1982, and then made his film debut in Prizzi's Honor (1985).
[ 5] > In 2009, Tucci received his first Academy Award nomination for his turn as a child murderer in The Lovely Bones (2009).
[ 6] > He also received a BAFTA nomination and a Golden Globe nomination for the same role.
[ 7] > Other than The Lovely Bones, Tucci has recently had noteworthy supporting turns in a broad range of movies including Lucky Number Slevin (2006)
[ 8] > , The Devil Wears Prada (2006) and Captain America:
[ 9] > The First Avenger (2011).
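Notice that the statistical model splits one sentence into three pieces in the output above (lines 7 to 9), around the movie titles. Where predictable boundaries matter more than linguistic nuance, a purely rule-based splitter is one alternative. The sketch below assumes spaCy v3, where the built-in `sentencizer` component can be added by name to a blank pipeline. It splits on sentence-final punctuation only and may make different mistakes, so treat it as something to compare against, not as a fix. The name `rule_nlp` is ours.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components,
# so this runs without downloading a model (assumes spaCy v3).
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")  # rule-based splitter driven by punctuation

# two sentences taken from the biography above
text = ("His family is Italian-American, with origins in Calabria. "
        "He also received a BAFTA nomination and a Golden Globe nomination for the same role.")

sentences = [sent.text for sent in rule_nlp(text).sents]
for idx, sent in enumerate(sentences):
    print(f'[{idx:2d}] >', sent)
```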


Here's the sentence we will work on moving forward:

In [7]:
mysent

'Tucci married Felicity Blunt in August 2012.'

## Tokenization & POS tagging

Tokenization is the next step after sentence detection. It breaks a text into its basic meaningful units, called tokens, which are then used for further analysis such as part-of-speech tagging.
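Tokenization in spaCy is rule-based, so you can try it on its own before involving the statistical model. Here is a small sketch, assuming a blank English pipeline created with `spacy.blank("en")`. The variable names are ours.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
blank_nlp = spacy.blank("en")

tokens = [token.text for token in blank_nlp("Tucci married Felicity Blunt in August 2012.")]
print(tokens)
# → ['Tucci', 'married', 'Felicity', 'Blunt', 'in', 'August', '2012', '.']
```

Note how the final period is split off from `2012` and becomes a token of its own.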

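A part of speech (POS) is a grammatical role that describes how a particular word is used in a sentence. Traditional English grammar distinguishes eight parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection. spaCy reports a coarse-grained tag in `pos_` (for example `PROPN`, `ADP`, `NUM`) and a fine-grained tag in `tag_` (for example `NNP`, `VBD`).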

You can print tokens and their POS tags by iterating over the `doc` object:

In [8]:
doc = nlp(mysent)
for w in doc:
    print(f'{w.text:15s} [{w.tag_:5s} | {w.pos_:6s} | {spacy.explain(w.tag_)}]')

Tucci           [NNP   | PROPN  | noun, proper singular]
married         [VBD   | VERB   | verb, past tense]
Felicity        [NNP   | PROPN  | noun, proper singular]
Blunt           [NNP   | PROPN  | noun, proper singular]
in              [IN    | ADP    | conjunction, subordinating or preposition]
August          [NNP   | PROPN  | noun, proper singular]
2012            [CD    | NUM    | cardinal number]
.               [.     | PUNCT  | punctuation mark, sentence closer]


## Relation Extraction & Dependency Parsing

POS tags alone are not sufficient in many cases, which call for further analysis such as dependency parsing. Dependency parsing extracts the dependency parse of a sentence, which represents its grammatical structure as relations between tokens. Now, let’s print the dependency relation of each token:

In [9]:
for w in doc: 
    print(f'{w.text:15s} [{w.dep_}]')

Tucci           [nsubj]
married         [ROOT]
Felicity        [compound]
Blunt           [dobj]
in              [prep]
August          [pobj]
2012            [nummod]
.               [punct]
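The labels above already contain a relation: the `nsubj` and `dobj` tokens attached to the `ROOT` verb form a (subject, verb, object) triple. The sketch below shows one way to read that triple off a parse. To keep it runnable without a downloaded model, it builds the `Doc` by hand (this assumes the spaCy v3 `Doc` constructor). The `deps` labels are copied from the output above, while the `heads` indices are our assumption about which token each one attaches to. The helpers `phrase` and `extract_svo` are hypothetical names.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated copy of the parse printed above (assumes spaCy v3).
vocab = spacy.blank("en").vocab
words = ["Tucci", "married", "Felicity", "Blunt", "in", "August", "2012", "."]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "nummod", "punct"]
heads = [1, 1, 3, 1, 1, 4, 5, 1]  # assumed absolute index of each token's head
demo_doc = Doc(vocab, words=words, heads=heads, deps=deps)


def phrase(token):
    """Join a token with its subtree (e.g. compound modifiers) in document order."""
    return " ".join(t.text for t in token.subtree)


def extract_svo(doc):
    """Yield (subject, verb, object) triples for each ROOT that has both arguments."""
    for token in doc:
        if token.dep_ != "ROOT":
            continue
        subjects = [child for child in token.children if child.dep_ == "nsubj"]
        objects = [child for child in token.children if child.dep_ == "dobj"]
        for subj in subjects:
            for obj in objects:
                yield (phrase(subj), token.text, phrase(obj))


triples = list(extract_svo(demo_doc))
print(triples)
# → [('Tucci', 'married', 'Felicity Blunt')]
```

In the notebook you could call `extract_svo(doc)` on the model's own parse instead of `demo_doc`. Passive sentences and other constructions use different labels, so this covers only the simple active case.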


## Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

In [10]:
from spacy import displacy
options = {"distance": 120}
displacy.render(doc, style="dep", options=options)

## Entity recognition

Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works-of-art and nationalities / religion.

Our sentence has already been parsed into `doc`, so we can access the identified entities through its `.ents` property. Each entity is a `Span` object that exposes additional attributes, such as `.label_`:

In [11]:
for ent in doc.ents:
    print(f'{ent.text:15s} [{ent.label_}]')

Felicity Blunt  [PERSON]
August 2012     [DATE]


## Rule-Based Matching

Rule-based matching is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).

Rule-based matching can use regular expressions to extract entities or relations from unstructured text. It differs from extraction with regular expressions alone, because regular expressions do not consider the lexical and grammatical attributes of the text.

The spaCy library comes with a `Matcher` tool that can be used to specify custom rules for phrase matching. Using the `Matcher` is straightforward. Here's an example:

In [12]:
from spacy.matcher import Matcher

# define the pattern 
pattern = [{'POS': 'PROPN'},
           {'LOWER': 'married'},
           {'ENT_TYPE': 'PERSON'}]
   
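# Matcher class object
matcher = Matcher(nlp.vocab)
# spaCy v3 expects the patterns as a list; early v2 releases used
# matcher.add("matching_1", None, pattern) instead
matcher.add("matching_1", [pattern])

# each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
print(span.text)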

Tucci married Felicity
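The match above stops at `Felicity` because each dictionary in the pattern matches exactly one token. To capture multi-token names, you can add the `OP` quantifier and ask the matcher to keep the longest match. The sketch below assumes spaCy v3 (for `greedy="LONGEST"` and `as_spans=True`). It builds the `Doc` by hand with the POS tags and entities printed earlier in this notebook, so it runs without a downloaded model. The rule name `MARRIED_TO` and the variable names are ours.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Hand-annotated copy of the sentence, using the POS tags and entities shown above.
vocab = spacy.blank("en").vocab
words = ["Tucci", "married", "Felicity", "Blunt", "in", "August", "2012", "."]
pos = ["PROPN", "VERB", "PROPN", "PROPN", "ADP", "PROPN", "NUM", "PUNCT"]
ents = ["O", "O", "B-PERSON", "I-PERSON", "O", "B-DATE", "I-DATE", "O"]
demo_doc = Doc(vocab, words=words, pos=pos, ents=ents)

# "+" lets a pattern element match one or more consecutive tokens
pattern = [{"POS": "PROPN", "OP": "+"},
           {"LOWER": "married"},
           {"ENT_TYPE": "PERSON", "OP": "+"}]

demo_matcher = Matcher(vocab)
# greedy="LONGEST" keeps only the longest of several overlapping matches
demo_matcher.add("MARRIED_TO", [pattern], greedy="LONGEST")

spans = demo_matcher(demo_doc, as_spans=True)
matched_texts = [span.text for span in spans]
print(matched_texts)
# → ['Tucci married Felicity Blunt']
```

In the notebook you could run the same pattern against the model's `doc`. The result then depends on the tags and entities the model predicts.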


**Notes**:
- You can find additional examples and use-cases in [spaCy's documentation](https://spacy.io/usage/rule-based-matching).
- You can use the online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher) to test spaCy's rule-based `Matcher` by creating token patterns interactively and executing them.
- Here's a nice [article](https://stackabuse.com/python-for-nlp-vocabulary-and-phrase-matching-with-spacy/) you can review. In the article, the author explores vocabulary and phrase matching using the spaCy library. He defines patterns and detects phrases that match the defined patterns. 

Now you know how to perform some basic NLP tasks such as sentence segmentation, tokenization, POS tagging, dependency parsing, entity recognition, and, most importantly, rule-based matching. You know enough to identify entities and the relations between them, and to extract structured data that can be used for downstream applications, such as building a Knowledge Graph! Congratulations!

You can start applying this knowledge to the tasks you are required to do for Homework 02 of the class :)