# Named Entity Recognition

[Download relevant files](https://melaniewalsh.org/spacy.zip)

In this lesson, we're going to learn about a text analysis method called **Named Entity Recognition** (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

<img src="../images/Ada-Lovelace-NER.png" width="100%">

# Why is NER Useful?

Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we'll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we'll also do in a later lesson!).

# Natural Language Processing (NLP)

Named Entity Recognition is a fundamental task in the field of *natural language processing* (NLP). What is NLP, exactly? NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is using statistics and computers to understand human language. Applications of NLP are all around us. Have you ever heard of a little thing called *spellcheck*? How about autocomplete, Google Translate, chatbots, and Siri? These are all examples of NLP in action!

Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts are now getting eerily good. (If you don't believe me, check out [this app that will autocomplete your sentences](https://transformer.huggingface.co/doc/gpt2-large/qCNMTfzephfZMBkryTNvSRKQ/edit) with GPT-2, a state-of-the-art text generation model. When I ran it, the model generated a mini-lecture from a "university professor" that sounds spookily close to home...)

<img src="../images/GPT-2.png">

Open-source NLP tools are getting very good, too. We're going to use one of these open-source tools, the Python library `spaCy`, for our Named Entity Recognition tasks in this lesson.

# How spaCy Works

<img src="../images/Ada-Lovelace-NER.png">

The screenshot above shows spaCy correctly identifying named entities in Ada Lovelace's *New York Times* obituary (something that we'll test out for ourselves below). How does spaCy know that "Ada Lovelace" is a person and that "1843" is a date?

Well, spaCy doesn't *know*, not for sure anyway. Instead, spaCy is making a very educated guess. This "guess" is based on what spaCy has learned about the English language after seeing lots of other examples.

That's a colloquial way of saying: spaCy relies on machine learning models that were trained on a large amount of carefully labeled texts. (These texts were, in fact, often labeled and corrected by hand.) This is similar to our <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Overview.html#1)-LDA-is-an-Unsupervised-Algorithm">topic modeling work</a> from the previous lesson, except our topic model wasn't using labeled data.

The English-language spaCy model that we're going to use in this lesson was trained on an annotated corpus called ["OntoNotes"](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf): 2 million+ words drawn from "news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech," which were meticulously tagged by a group of researchers and professionals for people's names and places, for nouns and verbs, for subjects and objects, and much more. (Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.)

When spaCy identifies people and places in Ada Lovelace's obituary, in other words, its NLP model is actually making a series of *predictions* about the text based on what it has learned about how people and places function in English-language sentences.

# NER with spaCy

# Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

# Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [2]:
import spacy
from spacy import displacy

We're also going to import the `Counter` class from Python's built-in `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also changing the pandas default settings for the maximum number of rows displayed and the maximum column width).

In [3]:
from collections import Counter

In [4]:
import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)

# Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [6]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean and Vietnamese don't currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages with external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

# Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [7]:
nlp = spacy.load('en_core_web_sm')

# Create a Processed spaCy Document

`document = nlp(open(filepath, encoding='utf-8').read())`

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we `open()` and `.read()` Ada Lovelace's obituary. Then we run `nlp()` on the text and create our `document`.

In [159]:
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"

document = nlp(open(filepath, encoding='utf-8').read())
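
The one-line `open(...).read()` pattern works, but it never explicitly closes the file. A minimal sketch of an alternative is shown here, assuming a hypothetical helper named `read_text_file` (it is not part of spaCy or of this lesson's files). It reads the file with Python's built-in `pathlib` and returns a string that can be handed to `nlp()`:

```python
from pathlib import Path


def read_text_file(filepath, encoding="utf-8"):
    """Return the full contents of a text file as one string.

    Hypothetical helper for illustration. Path.read_text() opens the
    file, reads it, and closes it again before returning.
    """
    return Path(filepath).read_text(encoding=encoding)


# Usage (assumes `nlp` and `filepath` are defined as in the cells above):
# document = nlp(read_text_file(filepath))
```

Either approach should produce the same `document`; the helper just makes the reading step reusable for the other texts in this lesson.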

# spaCy Named Entities

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


Above is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels. To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style=` parameter set to "ent"  (short for entities):

In [35]:
displacy.render(document, style="ent")

From a quick glance at the text above, we can see that spaCy is doing quite well with NER. But it's definitely not perfect.

Though spaCy correctly identifies "Ada Lovelace" as a `PERSON` in the first sentence, just a few sentences later it labels her as a `WORK_OF_ART`. Though spaCy correctly identifies "London" as a place `GPE` a few paragraphs down, it incorrectly identifies "Jacquard" as a place `GPE`, too (when really "Jacquard" is a type of loom, named after [Joseph Marie Jacquard](https://en.wikipedia.org/wiki/Jacquard_machine)).

This inconsistency is very important to note and keep in mind. If we wanted to use spaCy's NER for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn't be perfect. That's why understanding the limitations of this tool is so crucial. While spaCy's NER can be very good for identifying entities in broad strokes, it can't be relied upon for anything exact and fine-grained — not out of the box anyway.
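
To make "manual correction" a little more concrete, here is a minimal sketch of one way to apply hand-written fixes after the fact. It works on plain `(text, label)` pairs rather than on spaCy objects, and the `corrections` dictionary and the `apply_corrections` function are hypothetical names invented for this illustration. The replacement labels are our own judgment calls, not spaCy output.

```python
# (text, label) pairs as spaCy reported them for the obituary.
predicted = [
    ("Ada Lovelace", "PERSON"),
    ("Jacquard", "GPE"),
    ("the Countess of Lovelace", "FAC"),
    ("London", "GPE"),
]

# Hand-written corrections, keyed by the exact entity text.
# These replacement labels are assumptions made for illustration.
corrections = {
    "Jacquard": "PRODUCT",                 # a type of loom, not a place
    "the Countess of Lovelace": "PERSON",  # a person, not a building
}


def apply_corrections(pairs, corrections):
    """Return a new list of (text, label) pairs with manual overrides applied."""
    return [(text, corrections.get(text, label)) for text, label in pairs]


corrected = apply_corrections(predicted, corrections)
for text, label in corrected:
    print(text, label)
```

Keying the corrections on the entity text alone relabels every occurrence of that text. That is too blunt when the same string is labeled correctly in one place and incorrectly in another, so treat this as a starting point rather than a full cleaning workflow.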

# Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from Ada Lovelace's obituary.

In [38]:
document.ents

(first,
 Ada Lovelace,
 1843,
 Jacquard,
 British,
 Charles Babbage’s,
 Lovelace —,
 1852,
 36,
 first,
 the Analytical Engine,
 Swiss,
 Jacob Bernoulli,
 Walter Isaacson,
 The Innovators,
 British,
 Lord Byron,
 Romantic,
 one,
 Betty Alexandra Toole,
 Lovelace,
 the mid-20th century,
 the Defense Department,
 October,
 Lady Lovelace,
 The London Examiner,
 Sciences,
 Dec. 10, 1815,
 London,
 Annabella Milbanke,
 8,
 Medea,
 Lovelace,
 Smith Collection/Gado/Getty Images
 
  Lovelace,
 British,
 the day,
 Mary Somerville,
 Somerville,
 Lovelace to Babbage,
 17,
 two-foot,
 almost two decades,
 William King,
 Somerville,
 1835,
 19,
 the Countess of Lovelace,
 1839,
 two,
 Somerville,
 Trigonometry,
 Cubic and Biquadratic Equations,
 1840,
 Lovelace,
 Augustus De Morgan,
 London,
 first,
 1843,
 27,
 Lovelace,
 the Babbage Analytical Engine,
 nearly three,
 Notes,
 first,
 Ursula Martin,
 the University of Oxford,
 Lovelace’s,
 less than a decade later,
 Nov. 27, 1852,
 Notes,
 Claire C

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop. `For` each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [40]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

first ORDINAL
Ada Lovelace PERSON
1843 DATE
Jacquard GPE
British NORP
Charles Babbage’s PERSON
Lovelace — WORK_OF_ART
1852 DATE
36 CARDINAL
first ORDINAL
the Analytical Engine ORG
Swiss NORP
Jacob Bernoulli PERSON
Walter Isaacson PERSON
The Innovators WORK_OF_ART
British NORP
Lord Byron PERSON
Romantic NORP
one CARDINAL
Betty Alexandra Toole PERSON
Lovelace PERSON
the mid-20th century DATE
the Defense Department ORG
October DATE
Lady Lovelace PERSON
The London Examiner ORG
Sciences ORG
Dec. 10, 1815 DATE
London GPE
Annabella Milbanke PERSON
8 DATE
Medea PERSON
Lovelace PERSON
Smith Collection/Gado/Getty Images

 Lovelace ORG
British NORP
the day DATE
Mary Somerville PERSON
Somerville PERSON
Lovelace to Babbage WORK_OF_ART
17 DATE
two-foot QUANTITY
almost two decades DATE
William King PERSON
Somerville ORG
1835 DATE
19 DATE
the Countess of Lovelace FAC
1839 DATE
two CARDINAL
Somerville ORG
Trigonometry GPE
Cubic and Biquadratic Equations ORG
1840 DATE
Lovelace PERSON
Augustus De Morgan 
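
A long flat printout like this can be hard to scan. As a rough sketch, the same pairs can be grouped by label with a `defaultdict`. The sample below hard-codes a few pairs copied from the output above so that it runs on its own; the variable names `pairs` and `entities_by_label` are just illustrative choices.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above. With a real
# document you could build the list like this instead:
# pairs = [(ent.text, ent.label_) for ent in document.ents]
pairs = [
    ("first", "ORDINAL"),
    ("Ada Lovelace", "PERSON"),
    ("1843", "DATE"),
    ("Jacquard", "GPE"),
    ("British", "NORP"),
    ("1852", "DATE"),
    ("London", "GPE"),
]

# Map each label to the list of entity texts that received it.
entities_by_label = defaultdict(list)
for text, label in pairs:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```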

To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [42]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Ada Lovelace
Charles Babbage’s
Jacob Bernoulli
Walter Isaacson
Lord Byron
Betty Alexandra Toole
Lovelace
Lady Lovelace
Annabella Milbanke
Medea
Lovelace
Mary Somerville
Somerville
William King
Lovelace
Augustus De Morgan
Lovelace
Ursula Martin
Lovelace’s
Claire Cain Miller
Ada Lovelace


# Practicing with *Lost in the City*

For the rest of this lesson, we're going to work with Edward P. Jones's short story collection *Lost in the City*.

<img src="https://mybinder.org/static/images/logo_social.png" width="150" align="left"> *If you're using this Jupyter notebook in Binder (in the cloud), please uncomment the cell below and work with only the first story from _Lost in the City_. The Binder notebook is currently having issues loading the entire collection.*

In [None]:
#file = "../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt"
#document = nlp(open(file, encoding="utf-8").read())

In [72]:
filepath = "../texts/literature/Jones-Lost-in-The-City.txt"
document = nlp(open(filepath, encoding="utf-8").read())

# Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

To extract and count the people identified in *Lost in the City*, we will follow the same model as above, using an `if` statement that will pull out words only if their "ent" label matches "PERSON."

> 🐍 **Python Review** 🐍

>*While we demonstrate how to extract named entities in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.* 

Here's the code all together:

In [None]:
people = []
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        people.append(named_entity.text)
        
people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

Here's the code broken up. We make a list of all the people identified in *Lost in the City*:

In [50]:
people = []
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        people.append(named_entity.text)

In [51]:
people

['Betsy Ann Morgan',
 'Robert',
 'Miles Patterson',
 'Miles',
 'Betsy Ann',
 'Jenny',
 'Robert',
 'Betsy Ann',
 'Betsy Ann',
 'Miles',
 'Buster',
 'Betsy Ann',
 'Miles',
 'Miles',
 'Betsy Ann’s',
 'Betsy Ann',
 'Betsy Ann',
 'Walter Creed’s',
 'Jenny',
 'Creed',
 'Miles',
 'Miles',
 'Miles Patterson',
 'Miles',
 'Jenny',
 'Jenny Creed',
 'Betsy Ann',
 'Jenny’s',
 'Robert Morgan',
 'Jenny',
 'Walter Creed',
 'Jenny',
 'Clara',
 'Robert’s',
 'Clara',
 'Jenny',
 'Clara',
 'Robert',
 'Jenny',
 'Hahn',
 'Jenny',
 'William',
 'Alice Hobson',
 'Jenny',
 'Clara',
 'Robert',
 'Clara',
 'Jenny’s',
 'Clara',
 'Robert',
 'Robert',
 'Jenny',
 'Jenny',
 'Clara',
 'Robert Morgan',
 'Robert',
 'Miss Jenny',
 'Jenny',
 'Clara',
 'Robert',
 'Robert',
 'Clara',
 'Robert',
 'Clara',
 'Robert',
 'Jenny',
 'Robert',
 'Robert',
 'Miss Jenny',
 'Jenny',
 'Jenny',
 'Woulda',
 'Jenny',
 'Betsy Ann',
 'Betsy Ann',
 'Oscar Jackson',
 'Jenny’s',
 'Jake Horton',
 'Robert Morgan',
 'Jenny’s',
 'Robert',
 'Clara',
 '

Then we count how many times each unique person appears in this list with the `Counter` class:

In [55]:
people_tally = Counter(people)

In [56]:
people_tally.most_common()

[('Cassandra', 97),
 ('Anita', 88),
 ('Joyce', 82),
 ('Melanie', 65),
 ('Rickey', 57),
 ('Woodrow', 54),
 ('Sherman', 47),
 ('Marie', 45),
 ('Penny', 44),
 ('Betsy Ann', 40),
 ('Madeleine', 40),
 ('Diane', 38),
 ('Samuel', 37),
 ('Garrett', 37),
 ('Gladys', 33),
 ('Manny', 33),
 ('Sam', 33),
 ('Robert', 31),
 ('Jenny', 30),
 ('Humphrey', 30),
 ('Avis', 27),
 ('Maddie', 26),
 ('Carol', 25),
 ('Pearl', 24),
 ('Marvin', 24),
 ('Maude', 24),
 ('Rhonda', 23),
 ('Ralph', 21),
 ('Jenkins', 16),
 ('Miles', 15),
 ('Sandy', 15),
 ('Mildred', 15),
 ('Rita', 15),
 ('Marvella', 14),
 ('Anacostia', 13),
 ('Anna', 13),
 ('Marcus', 13),
 ('Clara', 12),
 ('Jesus', 12),
 ('Wesley', 11),
 ('Adam', 11),
 ('John Henry', 10),
 ('Jenny’s', 9),
 ('Darlene', 9),
 ('Baxter', 9),
 ('Smokey', 9),
 ('’s', 8),
 ('Alice', 8),
 ('Angelo', 8),
 ('Brenda', 8),
 ('Lydia', 8),
 ('Saunders', 8),
 ('Lonney', 7),
 ('Tommy', 7),
 ('Joe', 7),
 ('Patricia', 6),
 ('Hazel', 6),
 ('Mama Joyce', 6),
 ('Clovis', 6),
 ('Taylor', 6),
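
Notice that the tally counts "Jenny" and "Jenny’s" separately, and that a bare "’s" shows up as a "person." One possible cleanup step, sketched here under the assumption that a trailing possessive never belongs to the name itself, is to strip the possessive before counting. The helper `strip_possessive` is a hypothetical name for this illustration, and the sample list is a short excerpt from the `people` list above.

```python
from collections import Counter


def strip_possessive(name):
    """Remove a trailing possessive ('s or ’s) from an entity string."""
    for suffix in ("’s", "'s"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


# A short sample drawn from the `people` list above.
sample_people = [
    "Jenny", "Jenny’s", "Robert", "Robert’s",
    "Jenny", "Betsy Ann’s", "Betsy Ann",
]

cleaned = [strip_possessive(name) for name in sample_people]
# Drop anything that was only a possessive marker and is now empty.
cleaned = [name for name in cleaned if name]

print(Counter(cleaned).most_common())
```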

Then we make a dataframe from this list with `pd.DataFrame()`:

In [57]:
df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|:---:|:---:|:---:|
|0|Cassandra|97|
|1|Anita|88|
|2|Joyce|82|
|3|Melanie|65|
|4|Rickey|57|
|5|Woodrow|54|
|6|Sherman|47|
|7|Marie|45|
|8|Penny|44|
|9|Betsy Ann|40|


To write this dataframe (or any dataframe!) to a CSV file, we can use `df.to_csv()`. To create a CSV file of character counts, uncomment the cell below:

In [None]:
#df.to_csv("Lost-in-the-City-characters.csv", encoding='utf-8', index=False)

# Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "GPE" or "LOC." These are the type labels for "countries, cities, states" and "locations, mountain ranges, bodies of water."

In [60]:
places = []
for named_entity in document.ents:
    if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
        places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Washington|45|
|1|Cassandra|30|
|2|Madeleine|25|
|3|Kentucky|21|
|4|Georgia|21|
|5|Vernelle|21|
|6|Santiago|17|
|7|Maddie|16|
|8|Baltimore|12|
|9|Carmena|12|


Do you notice anything off about this list...?
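
One thing you may have spotted: several of these "places" (Cassandra, Madeleine, Maddie) also appear in our list of people. A crude way to surface that overlap, sketched below, is to compare the two tallies. The counts are copied from the outputs above, and the names `people_counts`, `place_counts`, `ambiguous`, and `likely_places` are illustrative.

```python
# Counts copied from the outputs above: people from `people_tally`,
# places from the GPE/LOC dataframe.
people_counts = {"Cassandra": 97, "Madeleine": 40, "Maddie": 26, "Betsy Ann": 40}
place_counts = {
    "Washington": 45,
    "Cassandra": 30,
    "Madeleine": 25,
    "Kentucky": 21,
    "Maddie": 16,
}

# Names that were labeled as a place in some sentences and as a person in others.
ambiguous = sorted(set(people_counts) & set(place_counts))
print(ambiguous)

# Keep a place only if it was never labeled PERSON (a rough heuristic).
likely_places = {
    place: count
    for place, count in place_counts.items()
    if place not in people_counts
}
print(likely_places)
```

This heuristic can be wrong in both directions. A real place that happens to share a name with a character would be thrown out, so the overlap list is best treated as a set of candidates to check by hand.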

# Get Streets & Parks

|Type Label|Description|
|:---:|:---:|
|FAC|Buildings, airports, highways, bridges, etc.|

To extract and count streets and parks (which show up a lot in *Lost in the City*!), we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "FAC." This is the type label for "buildings, airports, highways, bridges, etc."

In [62]:
streets = []
for named_entity in document.ents:
    if named_entity.label_ == "FAC":
        streets.append(named_entity.text)

streets_tally = Counter(streets)

df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df

| |street|count|
|:---:|:---:|:---:|
|0|Myrtle Street|10|
|1|5th Street|10|
|2|F Street|6|
|3|Vernelle Wise|5|
|4|New Jersey Avenue|4|
|5|11th Street|4|
|6|Connecticut Avenue|4|
|7|10th Street|4|
|8|Clifton Street|3|
|9|12th Street|3|
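
By now we have written nearly the same loop three times, changing only the label. As a sketch of how that pattern could be wrapped up for reuse, the hypothetical function `tally_entities` below takes `(text, label)` pairs plus a set of wanted labels. The sample pairs are made up for illustration and are not real model output.

```python
from collections import Counter


def tally_entities(pairs, wanted_labels):
    """Count entity texts whose label is in `wanted_labels`.

    `pairs` is an iterable of (text, label) tuples, for example
    [(ent.text, ent.label_) for ent in document.ents].
    Returns a list of (text, count) tuples, most frequent first.
    """
    wanted = set(wanted_labels)
    matches = [text for text, label in pairs if label in wanted]
    return Counter(matches).most_common()


# Hypothetical sample pairs for illustration (not real model output).
sample_pairs = [
    ("Myrtle Street", "FAC"),
    ("Washington", "GPE"),
    ("Myrtle Street", "FAC"),
    ("5th Street", "FAC"),
    ("Anacostia", "LOC"),
    ("Washington", "GPE"),
    ("Washington", "GPE"),
]

print(tally_entities(sample_pairs, {"FAC"}))
print(tally_entities(sample_pairs, {"GPE", "LOC"}))
```

The returned list has the same shape as `Counter.most_common()`, so it could be passed to `pd.DataFrame()` with a `columns=` argument just like in the cells above.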


# Get Works of Art

|Type Label|Description|
|:---:|:---:|
|WORK_OF_ART|Titles of books, songs, etc.|

To extract and count works of art, we can follow a similar-ish model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the "ent" label "WORK_OF_ART").

> 🐍 **Python Review** 🐍

>We can use a [*list comprehension*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to get our list of named entities in a single line of code! Closely examine the first line of code below:

In [155]:
works_of_art = [named_entity.text for named_entity in document.ents if named_entity.label_ == "WORK_OF_ART"]

art_tally = Counter(works_of_art)

df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df

| |work_of_art|count|
|:---:|:---:|:---:|
|0|Bible|9|
|1|the Declaration of Independence|2|
|2|A Deficit of Decency|1|
|3|Poor Richard’s Almanack|1|
|4|The Declaration of Independence|1|
|5|Profiles in Courage|1|
|6|the Gilded Age|1|
|7|the Bible Belt|1|
|8|I Have a Dream|1|
|9|the Pledge of Allegiance|1|
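
Two rows in this output, "the Declaration of Independence" and "The Declaration of Independence," differ only in capitalization, so the tally splits one work into two entries. A rough sketch of one fix is to normalize each title before counting. The helper `normalize_title` is a hypothetical name, and dropping a leading "the" is a simplifying assumption that will alter titles where the article really is part of the name.

```python
from collections import Counter

# Rebuild a small list of titles from three rows of the output above.
titles = (
    ["the Declaration of Independence"] * 2
    + ["The Declaration of Independence"]
    + ["Bible"] * 9
)


def normalize_title(title):
    """Drop a leading "the" (any capitalization) and surrounding whitespace."""
    stripped = title.strip()
    if stripped.lower().startswith("the "):
        stripped = stripped[4:]
    return stripped


merged_tally = Counter(normalize_title(title) for title in titles)
print(merged_tally.most_common())
```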


# Your Turn!

Now it's your turn to take a crack at NER with a whole new text!


|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


In this section, you're going to extract and count named entities from Barack Obama's memoir *The Audacity of Hope*. We're exploring Obama's memoir because it's chock full of named entities.

Read in and process the text file:

In [146]:
file = "../texts/literature/Obama-The-Audacity-of-Hope.txt"

document = nlp(open(file, encoding='utf-8').read())

**1.** Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you've chosen) in *The Audacity of Hope*. If you need help, study the examples above.

In [157]:
#Your Code Here 👇 


**2.** What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**3.** What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**4.** What's an insight that you might be able to glean about *The Audacity of Hope* based on your NER extraction?

**#**Your answer here. (Double click this cell to type your answer.)