# Named Entity Recognition — Workbook

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

---

## NER with spaCy

### Install spaCy and Download Language Model

First we need to install spaCy and download the English-language model (`en_core_web_sm`), which will process our texts and make predictions about them. This model was trained on the annotated "OntoNotes" corpus. You can install spaCy and download the `en_core_web_sm` model by running the cell below:

In [71]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting pip
  Using cached pip-21.0.1-py3-none-any.whl (1.5 MB)
Collecting wheel
  Using cached wheel-0.36.2-py2.py3-none-any.whl (35 kB)
Collecting setuptools
  Using cached setuptools-56.0.0-py3-none-any.whl (784 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.34.2
    Uninstalling wheel-0.34.2:
      Successfully uninstalled wheel-0.34.2
  Attempting uninstall: setuptools
    Found existing installation: setuptools 50.3.2
    Uninstalling setuptools-50.3.2:
      Successfully uninstalled setuptools-50.3.2
  Attempting uninstall: pip
    Found existing installation: pip 20.3.3
    Uninstalling pip-20.3.3:
      Successfully uninstalled pip-20.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.1.0 requires wrapt>=1.11.1, which is not installed.
tensorboard 2.

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data. We're also raising the pandas display limits for maximum rows and maximum column width.

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese don't currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages with external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

In [4]:
import spacy
from spacy import displacy
import en_core_web_sm
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

### Load Language Model

Once the model is downloaded, we need to load it.

In [15]:
nlp = en_core_web_sm.load()

## Process Document

In the cell below, we open and read the chapter "My Name" from *The House on Mango Street*. Then we run `nlp()` on the text and create our document.

In [16]:
filepath = "../texts/literature/House-on-Mango-Street/04-My-Name.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style=` parameter set to "ent" (short for entities):

In [17]:
displacy.render(document, style="ent")

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the "My Name" chapter.

In [11]:
document.ents

(English,
 Spanish,
 nine,
 Mexican,
 Sunday,
 mornings,
 the Chinese year,
 Chinese,
 Chinese,
 Mexicans,
 Spanish,
 Magdalena,
 Magdalena,
 Nenny,
 Esperanza)

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [10]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

English LANGUAGE
Spanish NORP
nine CARDINAL
Mexican NORP
Sunday DATE
mornings TIME
the Chinese year DATE
Chinese NORP
Chinese NORP
Mexicans NORP
Spanish LANGUAGE
Magdalena PERSON
Magdalena PERSON
Nenny PERSON
Esperanza PERSON


To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [12]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Magdalena
Magdalena
Nenny
Esperanza
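
The same kind of loop can also sort every entity into one bucket per label in a single pass. The following is a minimal sketch rather than part of the original workbook: `entity_pairs` is a hand-copied stand-in for `[(ent.text, ent.label_) for ent in document.ents]`, taken from the output printed above so that the example runs without loading a model, and `entities_by_label` is an illustrative name.

```python
from collections import defaultdict

# Stand-in for [(ent.text, ent.label_) for ent in document.ents],
# copied by hand from the output printed above.
entity_pairs = [
    ("English", "LANGUAGE"), ("Spanish", "NORP"), ("nine", "CARDINAL"),
    ("Mexican", "NORP"), ("Sunday", "DATE"), ("mornings", "TIME"),
    ("the Chinese year", "DATE"), ("Chinese", "NORP"), ("Chinese", "NORP"),
    ("Mexicans", "NORP"), ("Spanish", "LANGUAGE"), ("Magdalena", "PERSON"),
    ("Magdalena", "PERSON"), ("Nenny", "PERSON"), ("Esperanza", "PERSON"),
]

# Map each label to the list of entity strings that carry it.
entities_by_label = defaultdict(list)
for text, label in entity_pairs:
    entities_by_label[label].append(text)

print(entities_by_label["PERSON"])
print(entities_by_label["LANGUAGE"])
```

Grouping first makes it easy to compare labels side by side, for example to notice that "Spanish" was tagged once as `NORP` and once as `LANGUAGE`.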


## NER with The House on Mango Street

In [19]:
text = open("../texts/literature/The-House-on-Mango-Street-Sandra-Cisneros.txt", encoding='utf-8').read()

In [20]:
document = nlp(text)

## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

To extract and count the people, we will use an `if` statement that keeps an entity only if its "ent" label (`named_entity.label_`) is "PERSON".

In [None]:
people = []

for named_entity in document.ents:
    # Your Code Here
        #Your Code Here

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

In [21]:
people = []

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|:---:|:---:|:---:|
|0|Nenny|35|
|1|Sally|31|
|2|Lucy|29|
|3|Rachel|25|
|4|Ruthie|13|
|5|Benny|9|
|6|Cathy|8|
|7|Earl|8|
|8|Esperanza|7|
|9|Geraldo|7|
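
For readers new to `Counter`, this small sketch shows what `most_common()` returns before the result reaches pandas. The `names` list is made-up illustrative data, not output from the novel, and `top_two` is an illustrative name.

```python
from collections import Counter

# Hypothetical list of names, standing in for the `people` list built above.
names = ["Nenny", "Sally", "Nenny", "Lucy", "Sally", "Nenny"]

tally = Counter(names)

# most_common() returns (item, count) pairs sorted from most to least frequent;
# passing a number keeps only that many pairs.
top_two = tally.most_common(2)
print(top_two)  # [('Nenny', 3), ('Sally', 2)]
```

Each pair becomes one row of the DataFrame, which is why `columns=['character', 'count']` names exactly two columns.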


## Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "GPE" or "LOC". These are the type labels for "countries, cities, states" and "non-GPE locations, mountain ranges, bodies of water".

In [None]:
places = []

for named_entity in document.ents:
    # Your Code Here
        # Your Code Here

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

In [11]:
places = []

for named_entity in document.ents:
    if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
        places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|But|26|
|1|There|20|
|2|Ruthie|9|
|3|Mango Street|7|
|4|Benny’s|5|
|5|And|5|
|6|Rachel|3|
|7|You’re|3|
|8|Where|3|
|9|Somebody|3|
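
When more than two labels are of interest, chaining `or` comparisons becomes unwieldy. One option is to test membership in a set of labels. This sketch is only an illustration under stated assumptions: `tally_labels` is a hypothetical helper, and `sample_pairs` is invented data standing in for `(named_entity.text, named_entity.label_)` pairs.

```python
from collections import Counter

def tally_labels(pairs, wanted_labels):
    """Count entity strings whose label is in wanted_labels.

    pairs: iterable of (text, label) tuples (hypothetical input format).
    """
    wanted = set(wanted_labels)
    return Counter(text for text, label in pairs if label in wanted)

# Invented sample data, not taken from the novel.
sample_pairs = [
    ("Chicago", "GPE"), ("Mexico", "GPE"), ("Lake Michigan", "LOC"),
    ("Chicago", "GPE"), ("Sally", "PERSON"),
]

place_counts = tally_labels(sample_pairs, {"GPE", "LOC"})
print(place_counts.most_common())
```

With a spaCy document, the pairs could be built as `[(ent.text, ent.label_) for ent in document.ents]`.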


## Get Languages

|Type Label|Description|
|:---:|:---:|
|LANGUAGE|Any named language.|

To extract and count languages, we can follow the same model as the examples above, changing our `if` statement to match the "ent" label "LANGUAGE".

In [14]:
languages = []

for named_entity in document.ents:
    if named_entity.label_ == "LANGUAGE":
        languages.append(named_entity.text)

languages_tally = Counter(languages)

df = pd.DataFrame(languages_tally.most_common(), columns = ['language', 'count'])
df

| |language|count|
|:---:|:---:|:---:|


## Get Another Entity

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


In [None]:
# Your Code Here

## Your Turn!

Now it's your turn to take a crack at NER with a whole new text!


```{toggle}
|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|
```

In this section, you're going to extract and count named entities from *The Autobiography of Benjamin Franklin*.

Open and read the text file:

In [187]:
filepath = "../texts/literature/The-Autobiography-of-Benjamin-Franklin.txt"
text = open(filepath, encoding='utf-8').read()

To process the book in smaller chunks (if working in Binder or on a computer with memory constraints):

In [188]:
chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))
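
Because `nlp.pipe` yields one document per chunk, entities then have to be gathered from every item in `chunked_documents` rather than from a single `document.ents`. Here is a minimal sketch of that extra loop, under stated assumptions: `collect_entities` is a hypothetical helper, and the `fake_documents` built with `SimpleNamespace` only imitate the `.ents`, `.text`, and `.label_` attributes of spaCy objects so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def collect_entities(documents, label):
    """Return the text of every entity with the given label across documents."""
    found = []
    for document in documents:
        for named_entity in document.ents:
            if named_entity.label_ == label:
                found.append(named_entity.text)
    return found

def fake_entity(text, label):
    # Stand-in for a spaCy entity span: only .text and .label_ are imitated.
    return SimpleNamespace(text=text, label_=label)

# Stand-ins for the per-line documents in `chunked_documents` (invented data).
fake_documents = [
    SimpleNamespace(ents=[fake_entity("Boston", "GPE"), fake_entity("Josiah", "PERSON")]),
    SimpleNamespace(ents=[]),  # blank lines produce documents with no entities
    SimpleNamespace(ents=[fake_entity("Boston", "GPE")]),
]

places_tally = Counter(collect_entities(fake_documents, "GPE"))
print(places_tally.most_common())
```

With the real data, the call would be `collect_entities(chunked_documents, "PERSON")`, and the resulting list can be passed to `Counter` exactly as in the earlier examples.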

To process the book all at once (if working on a computer with a larger amount of memory):

In [62]:
document = nlp(text)

**1.** Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you've chosen) in the book. If you need help, study the examples above.

In [None]:
#Your Code Here 👇 

**2.** What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**3.** What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**4.** What's an insight that you might be able to glean about the book based on your NER extraction?

**#**Your answer here. (Double click this cell to type your answer.)