# People & Places: Named Entity Recognition

A Jupyter Notebook created for a **Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[*Click here to view or register for our current list of workshops*](http://dartgo.org/RRADworkshops), including a workshop next week on [**Stylometry**](https://libcal.dartmouth.edu/event/11237111) (the study of a text's or author's style - to discover authorship of anonymous documents, among other things) and other workshops on Bibliometric Analysis, the customized use of Large Language Models, Text Generation in R, and more....

*Created by*:
+ Jeremy Mikecz, Research Data Services (Dartmouth Library)


In this lesson, we will learn how to extract named entities (names of people, places, groups, institutions, etc.) from text files and then analyze the results.

For example, one of our goals is to create a map of all countries mentioned in the State of the Union corpus. 

What steps do you anticipate we will need to take to accomplish this project? List the steps in the markdown cell below:

## I. Using spaCy

**spaCy** is a Python library designed for fast and efficient Natural Language Processing (for more: [What's spaCy?](https://spacy.io/usage/spacy-101#whats-spacy)). In a previous lesson we used the **Natural Language Toolkit** (NLTK), which does similar things. However, where NLTK is largely designed for instruction and research, spaCy is designed for production, including the fast processing of large amounts of text.

It offers a variety of Natural Language Processing (NLP) features, including:
+ pre-processing tools
    + tokenization - dividing text into words (and punctuation marks, numbers, etc.)
    + sentence boundary detection - dividing texts into sentences
    + lemmatization - identifying the root or base form of words
+ linguistic annotations
    + part-of-speech (POS) tags and dependencies - labeling each word's part of speech (noun, proper noun, verb, adjective, etc.) and its dependency relations (which words modify which other words? adjectives --> nouns; subject --> verb --> object, etc.)
    + Named Entity Recognition (NER) - identifying names of objects, whether people, places, organizations, or other "entities" like book or product titles
    + word vectorization - assigning each word a numerical vector that places it in a multi-dimensional space, where similar words sit close to one another

And much more....
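
To make the idea of word vectors more concrete, here is a minimal sketch that compares toy vectors using cosine similarity. The three-dimensional vectors below are invented for illustration (real word vectors have many more dimensions and come from a trained model), and `cosine_similarity` is a helper name of our own, not a spaCy function.

```python
import math

# Invented 3-dimensional vectors, for illustration only.
toy_vectors = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.82, 0.15],
    "banana": [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Values closer to 1 mean the two vectors point in a more similar direction.
king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print("king vs. queen: ", round(king_queen, 3))
print("king vs. banana:", round(king_banana, 3))
```

With these made-up numbers, "king" and "queen" score as more similar to each other than "king" and "banana" do, which is the kind of relationship a trained model aims to capture.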

### Working with Foreign Languages

+ You can use [spaCy's existing language models](https://spacy.io/usage/models) for languages from English, Spanish, and Mandarin Chinese to Kyrgyz and Yoruba.
+ You can modify one of these existing language models.
+ Or you can create a new language model from scratch.
    



## II. Installing spaCy

If you want to run spaCy on your own computer, [the spaCy instructions](https://spacy.io/usage) recommend installing it in a *virtual environment*. To create and activate a virtual environment and then install spaCy, you would run the following in a terminal:

```
python -m venv .env    #creates a new virtual environment called ".env"
source .env/bin/activate     #activates the ".env" virtual environment (macOS / Linux)
pip install -U pip setuptools wheel   
pip install -U spacy             #installs spaCy
python -m spacy download en_core_web_sm   #installs English model from spaCy
```

However, **we are going to install spaCy in JupyterHub**. To do so, uncomment the following cells (remove the leading "#") and run them in JupyterHub.

In [38]:
#!pip install -U spacy

In [39]:
#!python -m spacy download en_core_web_sm
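
After installing, you may want to confirm that both spaCy and the English model can be found by Python. The sketch below uses only the standard library; `is_installed` is a helper name of our own, and it assumes the model was installed as a package named `en_core_web_sm` (which is what the download command above does).

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Report the status of spaCy and of the small English model.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either one is reported as missing, re-run the corresponding install cell above.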

## III. About Named Entity Recognition with spaCy

Basic named entity recognizers commonly identify the following types of entities:

```
place names
person names
group names
miscellaneous / other entities
```

**spaCy**'s NER identifies a wider range of entity types.

Examine the list of [entity types identified by spaCy](https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218) below:

```
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including "%".
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
```

Brainstorm some ways you could apply Named Entity Recognition to the types of texts and documents that researchers in your field typically work with. What projects can you envision? What questions could you answer?

## IV. Import Packages

First, let's import all necessary Python packages.

In [40]:
import spacy
import collections
import pandas as pd
from spacy.lang.en.examples import sentences
from spacy import displacy   #for visualizing word types and relationships

Second, since we will be working with English texts, we need to import one of spaCy's English models.

In [41]:
#https://spacy.io/models/en
nlp = spacy.load("en_core_web_sm")

## V. Linguistic Tagging with spaCy

In [42]:
print(sentences)

['Apple is looking at buying U.K. startup for $1 billion', 'Autonomous cars shift insurance liability toward manufacturers', 'San Francisco considers banning sidewalk delivery robots', 'London is a big city in the United Kingdom.', 'Where are you?', 'Who is the president of France?', 'What is the capital of the United States?', 'When was Barack Obama born?']


In [43]:
doc = nlp(sentences[0])                  #try substituting sentence #0 with another sentence
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


<div class="alert alert-info" role="alert">
    <p style="color:blue"><b>Exercises</b>:</p> 
    <p style="color:blue">Apply spaCy's Part-of-Speech (POS) detection to a sentence or short text of your choice. 
    </p>
</div>

In [44]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


## VI. Applying Named Entity Recognition (NER) in spaCy

In [45]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
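
The two numbers printed for each entity are character offsets into the original string: `start_char` is the index of the entity's first character and `end_char` is the index just past its last character. A minimal sketch, using plain Python slicing on the sentence and the offsets copied from the output above (no spaCy needed):

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above.
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

for start, end, label in spans:
    # Slicing with [start:end] excludes the character at index `end`.
    print(repr(text[start:end]), label)
```

This is why these offsets are handy later on: they let us locate each entity in the original text and pull out its surrounding context.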


Let's move beyond spaCy's sample sentences and try extracting named entities (NEs) from song lyrics and other texts. Examine the results below.

In [46]:
#below: partial lyrics from Boyz II Men's "All Around the World"
lyrics = """
London, Paris, Monte Carlo, Germany and Rome
Different places, different faces still it feels like home
China, Holland, Belgium, Rio, Africa, Japan
That's the way we live and we do the best we can
Here we go on another tour on the road again
Feelin' good it's alright
Just enjoying ourselves
Come and take a flight with
Boyz II Men back around the world
And we're comin' through your town
All we do is for you
'Cause you've always been there
And we appreciate you
Keisha, Kelly, Tonya, Stacy, Mica and LaShaun
Kathy, Trina, Carla, Lisa, Cheri, and Diane
All these girls around the world are fly in every land
And it's hard to choose, but there's one for every man
Here we go on another tour on the road again
Feelin' good it's alright
Just enjoying ourselves
Come and take a flight with
Boyz II Men back around the world
And we're comin' through your town
All we do, we do it for you
'Cause you've always been there
And we appreciate you
Houston, Phoenix, Carolina, Jersey, and the Keys
Denver, Boston, Mississippi, Georgia, Tennessee
Dallas, Cleveland, Cali, Philly, New York, and DC
That's the life we live and it's the only life
"""

doc_lyrics = nlp(lyrics)                  
for ent in doc_lyrics.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


London 1 7 GPE
Paris 9 14 GPE
Monte Carlo 16 27 PERSON
Germany 29 36 GPE
Rome 41 45 GPE
China 105 110 GPE
Holland 112 119 GPE
Belgium 121 128 GPE
Rio 130 133 GPE
Africa 135 141 LOC
Japan 143 148 GPE
Feelin 243 249 PRODUCT
Boyz II Men 321 332 ORG
Keisha 465 471 PERSON
Kelly 473 478 PERSON
Tonya 480 485 GPE
Stacy 487 492 ORG
Mica 494 498 PERSON
LaShaun 503 510 GPE
Trina 518 523 GPE
Carla 525 530 PERSON
Lisa 532 536 PERSON
Cheri 538 543 PERSON
Diane 549 554 PERSON
Feelin 710 716 PRODUCT
Boyz II Men 788 799 ORG
Houston 939 946 GPE
Phoenix 948 955 GPE
Carolina 957 965 GPE
Jersey 967 973 GPE
the Keys
Denver 979 994 FAC
Boston 996 1002 GPE
Mississippi 1004 1015 GPE
Georgia 1017 1024 GPE
Tennessee 1026 1035 GPE
Dallas 1036 1042 GPE
Cleveland 1044 1053 GPE
Cali 1055 1059 GPE
Philly 1061 1067 GPE
New York 1069 1077 GPE
DC 1083 1085 GPE


To better understand these results, we need to understand what this model was trained on. At the beginning of this notebook we loaded the **en_core_web_sm** model. Let's examine [spaCy's documentation for this model](https://spacy.io/models/en).

What texts / sources was this model trained on?

How does it differ from spaCy's other English models?

How accurate are these models at NER?


We can experiment with different types of text. Below, we will extract NEs from an excerpt from Jack Kerouac's *On the Road* (1957).

In [47]:
novel_excerpt = """
part one
1

I first met Dean not long after my wife and I split up. I had just gotten over a serious illness that I won’t bother to talk about, except that it had something to do with the miserably weary split-up and my feeling that everything was dead. With the coming of Dean Moriarty began the part of my life you could call my life on the road. Before that I’d often dreamed of going West to see the country, always vaguely planning and never taking off. Dean is the perfect guy for the road because he actually was born on the road, when his parents were passing through Salt Lake City in 1926, in a jalopy, on their way to Los Angeles. First reports of him came to me through Chad King, who’d shown me a few letters from him written in a New Mexico reform school. I was tremendously interested in the letters because they so naively and sweetly asked Chad to teach him all about Nietzsche and all the wonderful intellectual things that Chad knew. At one point Carlo and I talked about the letters and wondered if we would ever meet the strange Dean Moriarty. This is all far back, when Dean was not the way he is today, when he was a young jailkid shrouded in mystery. Then news came that Dean was out of reform school and was coming to New York for the first time; also there was talk that he had just married a girl called Marylou.
One day I was hanging around the campus and Chad and Tim Gray told me Dean was staying in a cold-water pad in East Harlem, the Spanish Harlem. Dean had arrived the night before, the first time in New York, with his beautiful little sharp chick Marylou; they got off the Greyhound bus at 50th Street and cut around the corner looking for a place to eat and went right in Hector’s, and since then Hector’s cafeteria has always been a big symbol of New York for Dean. They spent money on beautiful big glazed cakes and creampuffs.
"""
doc_kerouac = nlp(novel_excerpt)                  
for ent in doc_kerouac.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


one 6 9 CARDINAL
1 10 11 CARDINAL
first 15 20 ORDINAL
Dean 25 29 PERSON
Dean Moriarty 274 287 ORG
Dean 460 464 PERSON
Salt Lake City 577 591 GPE
1926 595 599 DATE
Los Angeles 630 641 GPE
First 643 648 ORDINAL
Chad King 683 692 PERSON
New Mexico 745 755 GPE
Chad 858 862 GPE
Nietzsche 886 895 PERSON
Chad 943 947 GPE
one 957 960 CARDINAL
Carlo 967 972 PERSON
Dean Moriarty 1051 1064 ORG
Dean 1093 1097 PERSON
today 1120 1125 DATE
Dean 1196 1200 PERSON
New York 1244 1252 GPE
first 1261 1266 ORDINAL
Marylou 1332 1339 PRODUCT
One day 1341 1348 DATE
Chad 1385 1389 GPE
Tim Gray 1394 1402 PERSON
Dean 1411 1415 PERSON
East Harlem 1451 1462 LOC
the Spanish Harlem 1464 1482 ORG
Dean 1484 1488 PERSON
the night before 1501 1517 TIME
first 1523 1528 ORDINAL
New York 1537 1545 GPE
Greyhound 1611 1620 NORP
50th Street 1628 1639 FAC
New York 1787 1795 GPE
Dean 1800 1804 PERSON


<div class="alert alert-info" role="alert">
    <p style="color:blue"><b>Exercises</b>:</p> 
    <p style="color:blue">Try applying <b>spaCy's named entity recognition (NER)</b> to a text of your choosing. Copy and paste the above code into the cell below, but insert song lyrics, an excerpt from a novel, or other text.</p>
</div>

## VII. Visualize spaCy tagging using displacy

We can visualize NEs, relationships between words (a.k.a. "dependencies"), and other linguistic annotations using spaCy's visualization tool [displacy](https://spacy.io/universe/project/displacy).
For more on how to use displacy, see: https://spacy.io/usage/visualizers.

In [48]:
displacy.render(doc, style = "ent")

In [49]:
displacy.render(doc, style = "dep")

<div class="alert alert-info" role="alert">
    <p style="color:blue"><b>Exercises</b>:</p> 
    <p style="color:blue">Use the displacy functions to visualize named entities and dependencies for other texts of your choosing.</p>
</div>

## VIII. Extracting entities from one State of the Union address

In previous lessons, we applied basic text analysis methods to a corpus of 233 State of the Union addresses given by Presidents of the United States from 1791 - 2023.

In one of those lessons, we stored all 233 State of the Union addresses in one .tsv file (tab separated values). This will save us from having to read in all 233 text files individually. Let's begin by opening this tsv file.

In [50]:
#sotudir = Path("state-of-the-union-dataset","txt")
sotudf = pd.read_csv("sotudf.tsv", encoding = "utf-8", sep = "\t", index_col = 0)
sotudf = sotudf.sort_values(by = ['year'])
sotudf.tail()

| Unnamed: 0 | pres | year | numtoks | tokens | fulltext | ltoks | ltoks_ns |
|---|---|---|---|---|---|---|---|
| 156 | Obama | 2014 | 7017 | ['Mr', 'Speaker', 'Mr', 'Vice', 'President', '... | Mr. Speaker, Mr. Vice President, Members of Co... | ['mr', 'speaker', 'mr', 'vice', 'president', '... | ['mr', 'speaker', 'mr', 'vice', 'president', '... |
| 157 | Obama | 2015 | 6961 | ['Mr', 'Speaker', 'Mr', 'Vice', 'President', '... | Mr. Speaker, Mr. Vice President, Members of Co... | ['mr', 'speaker', 'mr', 'vice', 'president', '... | ['mr', 'speaker', 'mr', 'vice', 'president', '... |
| 158 | Obama | 2016 | 5628 | ['Mr', 'Speaker', 'Mr', 'Vice', 'President', '... | Mr. Speaker, Mr. Vice President, Members of Co... | ['mr', 'speaker', 'mr', 'vice', 'president', '... | ['mr', 'speaker', 'mr', 'vice', 'president', '... |
| 207 | Trump | 2017 | 5095 | ['Thank', 'you', 'very', 'much', 'Mr', 'Speake... | Thank you very much. Mr. Speaker, Mr. Vice Pre... | ['thank', 'you', 'very', 'much', 'mr', 'speake... | ['thank', 'much', 'mr', 'speaker', 'mr', 'vice... |
| 208 | Trump | 2018 | 5204 | ['Mr', 'Speaker', 'Mr', 'Vice', 'President', '... | Mr. Speaker, Mr. Vice President, Members of Co... | ['mr', 'speaker', 'mr', 'vice', 'president', '... | ['mr', 'speaker', 'mr', 'vice', 'president', '... |


Let's open one recent SOTU address:

In [51]:
# In the code below, we are opening the 2009 SOTU 
# presidential address (which would have been Obama's first SOTU address)
sotu = sotudf[sotudf['year'] == 2009]['fulltext'].item()
sotu[:200]

"Madame Speaker, Mr. Vice President, Members of Congress, and the First Lady of\nthe United States:\n\nI've come here tonight not only to address the distinguished men and women in\nthis great chamber, but"


Let's once again review the list of named entities that spaCy can extract for us:

```
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including "%".
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
```

In the markdown cell below, brainstorm some research questions about a given SOTU address that spaCy's NER could help you answer, given the types of entities it can identify.




To begin processing this text with spaCy, we will pass it to the **nlp** pipeline, which returns a spaCy `Doc` object.

In [52]:
doc_sotu = nlp(sotu)



We can retrieve the named entities from the resulting `Doc` object (here, `doc_sotu`) using its **.ents** attribute.

In [53]:
for i, ent in enumerate(doc_sotu.ents):
    if i < 10:   # here, we are indicating we only want to return the first 10 entities
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

Congress 47 55 ORG
First 65 70 ORDINAL
the United States 79 96 GPE
tonight 114 121 TIME
Americans 292 301 NORP
tonight 1136 1143 TIME
American 1157 1165 NORP
the United States of America 1219 1247 GPE
Earth 1584 1589 LOC
America 1623 1630 GPE


To understand what these labels mean, we can run:

In [54]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

We can even print out these explanations with each entity:

In [55]:
for i, ent in enumerate(doc_sotu.ents):
    if i < 10:   # here, we are indicating we only want to return the first 10 entities
        print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))

Congress 47 55 ORG Companies, agencies, institutions, etc.
First 65 70 ORDINAL "first", "second", etc.
the United States 79 96 GPE Countries, cities, states
tonight 114 121 TIME Times smaller than a day
Americans 292 301 NORP Nationalities or religious or political groups
tonight 1136 1143 TIME Times smaller than a day
American 1157 1165 NORP Nationalities or religious or political groups
the United States of America 1219 1247 GPE Countries, cities, states
Earth 1584 1589 LOC Non-GPE locations, mountain ranges, bodies of water
America 1623 1630 GPE Countries, cities, states


We can save the names and labels of these entities into various formats. Below we are saving info about each entity into a tuple and placing those tuples into a list.

In [56]:
#ents = [(e.text, e.label_, e.kb_id_) for e in doc_sotu.ents]
ents = [(e.text, e.label_, e.start_char, e.end_char) for e in doc_sotu.ents]
print(ents[:10])

[('Congress', 'ORG', 47, 55), ('First', 'ORDINAL', 65, 70), ('the United States', 'GPE', 79, 96), ('tonight', 'TIME', 114, 121), ('Americans', 'NORP', 292, 301), ('tonight', 'TIME', 1136, 1143), ('American', 'NORP', 1157, 1165), ('the United States of America', 'GPE', 1219, 1247), ('Earth', 'LOC', 1584, 1589), ('America', 'GPE', 1623, 1630)]
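
As one example of another format, the sketch below turns a few of these tuples into dictionaries and writes them out as CSV text using only the standard library. The column names and the file name `ents.csv` mentioned in the comment are our own choices for illustration.

```python
import csv
import io

# The first few tuples from the `ents` list printed above.
ents = [
    ("Congress", "ORG", 47, 55),
    ("First", "ORDINAL", 65, 70),
    ("the United States", "GPE", 79, 96),
]

# Column names chosen for this example.
fieldnames = ["text", "label", "start_char", "end_char"]

# One dictionary per entity, keyed by column name.
ent_dicts = [dict(zip(fieldnames, ent)) for ent in ents]

# Write to an in-memory buffer; to save a file instead, swap in
# open("ents.csv", "w", newline="", encoding="utf-8") (a hypothetical file name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(ent_dicts)
print(buffer.getvalue())
```

A list of dictionaries like `ent_dicts` can also be passed directly to `pd.DataFrame(...)` if you would rather work with the entities in pandas.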


We may use the start and end character span for each entity to identify it within its context. For example, the code below prints out the entity and the 50 characters immediately preceding and following it:

In [57]:
char_span = 50
for ent in ents:
    start_char = ent[2] - char_span
    if start_char < 0:
        start_char = 0
    end_char = ent[3] + char_span
    if end_char > len(doc_sotu.text):
        end_char = len(doc_sotu.text)
    print(ent[0], "=", ent[1])
    print(doc_sotu.text[start_char: end_char])

Congress = ORG
Madame Speaker, Mr. Vice President, Members of Congress, and the First Lady of
the United States:

I've c
First = ORDINAL
 Mr. Vice President, Members of Congress, and the First Lady of
the United States:

I've come here tonigh
the United States = GPE
ident, Members of Congress, and the First Lady of
the United States:

I've come here tonight not only to address the 
tonight = TIME
 First Lady of
the United States:

I've come here tonight not only to address the distinguished men and wom
Americans = NORP
and women who
sent us here.

I know that for many Americans watching right now, the state of our economy is
a
tonight = TIME
are
living through difficult and uncertain times, tonight I want every American to
know this:

We will rebu
American = NORP
fficult and uncertain times, tonight I want every American to
know this:

We will rebuild, we will recover, 
the United States of America = GPE
know this:

We will rebuild, we will recover, and the United States of America wil
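
The same windowing logic can be written more compactly with `max` and `min`, which clamp the window to the bounds of the text. This is only a sketch of an alternative; `context_window` is a helper name of our own, and the sample string is the opening of the address shown earlier.

```python
def context_window(text, start, end, span=50):
    """Return text[start:end] plus up to `span` characters on each side."""
    left = max(0, start - span)           # never go below index 0
    right = min(len(text), end + span)    # never go past the end of the text
    return text[left:right]

# Opening of the 2009 address, copied from the output shown earlier.
sample = ("Madame Speaker, Mr. Vice President, Members of Congress, "
          "and the First Lady of\nthe United States:")

# "Congress" occupies characters 47-55 in this text.
print(context_window(sample, 47, 55, span=10))
```

In the notebook, this could be called as `context_window(doc_sotu.text, ent[2], ent[3])` for each tuple in `ents`.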

Or we can save only one type of entity. Below, we focus only on person names:

In [58]:
person_names = []
for ent in ents:
    if ent[1] == "PERSON":
        #person_names.append((ent[0], ent[1]))
        person_names.append(ent[0])
person_names[:10]

#list comprehension to produce the same results in one line of code:
#person_names = [ent[0] for ent in ents if ent[1] == "PERSON"]


['Biden',
 'Joe',
 'Teddy Roosevelt',
 'Orrin Hatch',
 'Edward Kennedy',
 'Leonard Abess',
 "Ty'Sheoma Bethea",
 'God Bless',
 'God Bless']

For place names, spaCy offers at least two different types ("GPE" and "LOC") of place name entities. In the code below, we save both.

In [59]:
place_names = [(ent[0], ent[1]) for ent in ents if ent[1] in ['GPE', 'LOC']]


In [60]:
place_names[:20]


[('the United States', 'GPE'),
 ('the United States of America', 'GPE'),
 ('Earth', 'LOC'),
 ('America', 'GPE'),
 ('Minneapolis', 'GPE'),
 ('America', 'GPE'),
 ('Washington', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('the moon', 'LOC'),
 ('China', 'GPE'),
 ('Germany', 'GPE'),
 ('Japan', 'GPE'),
 ('Korea', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE')]

We can create a frequency list of the place names found within this address using the **Counter** class from Python's **collections** module.

In [61]:
pnfreqs = collections.Counter(place_names)

In [62]:
pnfreqs

Counter({('America', 'GPE'): 15,
         ('Iraq', 'GPE'): 4,
         ('the United States of America', 'GPE'): 2,
         ('Afghanistan', 'GPE'): 2,
         ('South Carolina', 'GPE'): 2,
         ('the United States', 'GPE'): 1,
         ('Earth', 'LOC'): 1,
         ('Minneapolis', 'GPE'): 1,
         ('Washington', 'GPE'): 1,
         ('the moon', 'LOC'): 1,
         ('China', 'GPE'): 1,
         ('Germany', 'GPE'): 1,
         ('Japan', 'GPE'): 1,
         ('Korea', 'GPE'): 1,
         ('Pakistan', 'GPE'): 1,
         ('Guantanamo Bay', 'LOC'): 1,
         ('Israel', 'GPE'): 1,
         ('Miami', 'GPE'): 1,
         ('Greensburg', 'GPE'): 1,
         ('Kansas', 'GPE'): 1,
         ('Dillon', 'GPE'): 1,
         ('United States of America', 'GPE'): 1})

## IX. Extracting Named Entities from the entire corpus

We can now scale up and extract NEs from the entire SOTU corpus. A researcher interested in identifying and mapping locations mentioned in these addresses, for example, may want to export all place names. They could extract both "GPEs" (geopolitical entities like countries) and "LOC" (other types of place names). But, for the example below, we will focus on GPEs, which are easier to map.

First, we will create a function that extracts all GPEs from a given text.



In [63]:
def extract_placenames(text):
    """Return a list of the GPE entities (as strings) found in text."""
    doc = nlp(text)
    #to keep "LOC" entities as well, test for: ent.label_ in ['GPE', 'LOC']
    gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return gpes

Then, we can apply this function to all our texts found in our sotu dataframe. The code below creates a new column called "gpes" that stores a list of the GPEs found within each text.

*Note: This will take several minutes to run across the corpus of 233 SOTU addresses. It took 5 minutes to complete on my relatively fast laptop*. Run the cell below when you have 5 - 20 minutes to spare, then check out the results.

In [64]:
sotudf['gpes'] = sotudf['fulltext'].apply(extract_placenames)
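
If the `apply` call above feels slow, spaCy's `nlp.pipe` method processes texts in batches, which is generally faster than calling `nlp` on one text at a time. A minimal sketch, assuming `nlp` is the pipeline loaded earlier in this notebook; the function name `extract_gpes_batch` and the default batch size of 20 are our own choices, not recommended values.

```python
def extract_gpes_batch(nlp, texts, batch_size=20):
    """Return one list of GPE strings per input text, in input order."""
    results = []
    # nlp.pipe yields one Doc per text, in the same order as the input.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([ent.text for ent in doc.ents if ent.label_ == "GPE"])
    return results
```

With the dataframe from above, this could be used as `sotudf['gpes'] = extract_gpes_batch(nlp, sotudf['fulltext'].tolist())`. How much time it saves will depend on your machine and your corpus.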

Once you have created the new "gpes" column with the code above, run the code below to create one long list of all GPEs found within the SOTU corpus and save it to a text file.

In [65]:

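#flatten the per-address lists of GPEs into one long list
all_gpes = [gpe for gpes in sotudf['gpes'] for gpe in gpes]
#entities can span line breaks; replace those with spaces so each GPE fits on one line
all_gpes = [gpe.replace("\n", " ") for gpe in all_gpes]
print(len(all_gpes))

#save one GPE per line
with open('sotudf_gpes.txt', 'w', encoding='utf-8') as f:
    for gpe in all_gpes:
        f.write(f"{gpe}\n")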


18636


We can then create a frequency list of these GPEs, which could be used in subsequent efforts to create a map of all countries in the corpus.

In [67]:
sotu_gpes_freqs = collections.Counter(all_gpes)
sotu_gpes_freqs.most_common(20)

[('the United States', 3478),
 ('America', 1418),
 ('States', 1187),
 ('Mexico', 680),
 ('United States', 537),
 ('Spain', 454),
 ('Great Britain', 446),
 ('China', 347),
 ('France', 338),
 ('Washington', 323),
 ('Cuba', 308),
 ('Texas', 249),
 ('Japan', 222),
 ('Russia', 190),
 ('Germany', 153),
 ('California', 145),
 ('Nicaragua', 136),
 ('Iraq', 129),
 ('The United States', 126),
 ('New York', 118)]

You may observe some problems here. "The United States", "the United States", "United States", and "America" are counted as separate entities, although we would want to aggregate them into one. We also have a few state and city names mixed in with the country names. More data cleaning is necessary. But it is a good starting point....
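
One possible first cleaning step is to map the different spellings of a name onto a single label before counting. The sketch below is only an illustration: the counts are copied from the frequency list above, while the `ALIASES` table and the `normalize_gpe` helper are our own hypothetical choices (a real project would need a much fuller alias table, and a decision about ambiguous strings such as "States").

```python
from collections import Counter

# A few (name, count) pairs copied from the frequency list above.
raw_freqs = Counter({
    "the United States": 3478,
    "America": 1418,
    "Mexico": 680,
    "United States": 537,
    "The United States": 126,
})

# Hypothetical alias table: lowercased name (without a leading "the") -> preferred label.
ALIASES = {
    "united states": "United States",
    "america": "United States",
}

def normalize_gpe(name):
    """Map known variants of a place name to one preferred label."""
    key = " ".join(name.split()).lower()   # collapse whitespace, ignore case
    if key.startswith("the "):
        key = key[len("the "):]            # drop a leading article
    # Names that are not in the alias table are returned unchanged.
    return ALIASES.get(key, name.strip())

# Re-count, adding each variant's total to its preferred label.
merged = Counter()
for name, count in raw_freqs.items():
    merged[normalize_gpe(name)] += count

print(merged.most_common())
```

Whether "America" should always be merged into "United States" is itself a research decision, so treat the alias table as something to review by hand rather than as a finished solution.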