# Using Named Entity Recognition (NER)

**Named entities** are noun phrases that refer to specific locations, people, organizations, and so on. With **named entity recognition**, you can find the named entities in your texts and also determine what kind of named entity they are.

Here’s the list of named entity types from the <a href = "https://www.nltk.org/book/ch07.html#sec-ner">NLTK book</a>:

<table>
    <tr><th>NE type</th>	<th>Examples</th></tr>
    <tr><td>ORGANIZATION</td>	<td>Georgia-Pacific Corp., WHO</td></tr>
    <tr><td>PERSON</td>	<td>Eddy Bonte, President Obama</td></tr>
    <tr><td>LOCATION</td>	<td>Murray River, Mount Everest</td></tr>
    <tr><td>DATE</td>	<td>June, 2008-06-29</td></tr>
    <tr><td>TIME</td>	<td>two fifty a m, 1:30 p.m.</td></tr>
    <tr><td>MONEY</td>	<td>175 million Canadian dollars, GBP 10.40</td></tr>
    <tr><td>PERCENT</td>	<td>twenty pct, 18.75 %</td></tr>
    <tr><td>FACILITY</td>	<td>Washington Monument, Stonehenge</td></tr>
    <tr><td>GPE</td>	<td>South East Asia, Midlothian</td></tr>
</table>

You can use nltk.ne_chunk() to recognize named entities. Let’s use lotr_pos_tags again to test it out:

In [1]:
import nltk
from nltk.tokenize import word_tokenize
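
The calls in this section rely on NLTK data packages that don’t ship with the library itself. If you hit a LookupError, a helper along these lines may get you going. Treat it as a sketch: download_ner_resources and NER_RESOURCES are hypothetical names, and the list of package names is an assumption that covers both older and newer NLTK releases, so names that your version doesn’t recognize may produce an error message that you can ignore.

```python
# Data packages used by word_tokenize(), pos_tag(), and ne_chunk().
# Assumption: newer NLTK releases look for the "_tab" / "_eng" variants,
# while older releases look for the plain names.
NER_RESOURCES = (
    "punkt",
    "punkt_tab",
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker",
    "maxent_ne_chunker_tab",
    "words",
)


def download_ner_resources(resources=NER_RESOURCES):
    """Try to download each data package; needs a network connection."""
    import nltk  # imported here so defining the helper doesn't need NLTK

    for name in resources:
        nltk.download(name, quiet=True)
```

Calling download_ner_resources() once per environment should be enough, since NLTK caches the data locally.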

In [2]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."

In [3]:
words_in_lotr_quote = word_tokenize(lotr_quote)
print(words_in_lotr_quote)

['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']


In [4]:
lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
print(lotr_pos_tags)

[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]


In [5]:
tree = nltk.ne_chunk(lotr_pos_tags)

Now take a look at the visual representation:

In [6]:
tree.draw()

Here’s what you get: a window opens showing the sentence as a tree, with one leaf per tagged word.


See how Frodo has been tagged as a PERSON? You also have the option to use the parameter binary=True if you just want to know what the named entities are but not what kind of named entity they are:

In [7]:
tree = nltk.ne_chunk(lotr_pos_tags, binary=True)
tree.draw()

Now all you see is that Frodo is an NE.
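
If you’re working somewhere without a graphical display, tree.draw() won’t help much. As a rough, text-only alternative, you could list each chunk alongside its label. The sketch below assumes the tree came from nltk.ne_chunk(), so chunks have a label() method and hold (word, tag) pairs. Both list_chunks and demo are hypothetical names, not part of NLTK.

```python
def list_chunks(tree):
    """Return (label, phrase) pairs for the chunks directly under the root.

    Plain (word, tag) tuples have no label() method, so they are skipped.
    """
    return [
        (node.label(), " ".join(word for word, _tag in node))
        for node in tree
        if hasattr(node, "label")
    ]


def demo():
    # Requires NLTK plus its tokenizer, tagger, and chunker data packages.
    import nltk
    from nltk.tokenize import word_tokenize

    quote = "It's a dangerous business, Frodo, going out your door."
    tags = nltk.pos_tag(word_tokenize(quote))
    print(list_chunks(nltk.ne_chunk(tags)))               # specific labels
    print(list_chunks(nltk.ne_chunk(tags, binary=True)))  # generic NE label
```

Running demo() prints one list per call to nltk.ne_chunk(), which makes it easy to compare the two labeling modes side by side.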

That’s how you can identify named entities! But you can take this one step further and extract named entities directly from your text. Create a string from which to extract named entities. You can use this quote from <a href = "https://en.wikipedia.org/wiki/The_War_of_the_Worlds" >The War of the Worlds</a>:

In [8]:
quote = """
Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that
for countless centuries Mars has been the star of war—but failed to
interpret the fluctuating appearances of the markings they mapped so well.
All that time the Martians must have been getting ready.

During the opposition of 1894 a great light was seen on the illuminated
part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
and then by other observers. English readers heard of it first in the
issue of Nature dated August 2."""

Now create a function to extract named entities:

In [9]:
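def extract_ne(quote):
    # Tokenize, tag parts of speech, then chunk named entities.
    words = word_tokenize(quote, language='english')
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    # Chunks are subtrees labeled "NE"; plain (word, tag) tuples have no label().
    return set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE"
    )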

With this function, you gather all named entities, with no repeats. To do that, you tokenize by word, apply part-of-speech tags to those words, and then extract named entities based on those tags. Because you included binary=True, the named entities you’ll get won’t be labeled more specifically. You’ll just know that they’re named entities.
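
Returning a set throws away how often each entity appears. If the number of mentions matters to you, one option is to swap the set for a collections.Counter. This is only a sketch: count_ne is a hypothetical name, and it assumes a tree shaped like the one nltk.ne_chunk() returns, with chunks that have a label() method and contain (word, tag) pairs.

```python
from collections import Counter


def count_ne(tree, label="NE"):
    """Count how often each chunk with the given label appears under the root."""
    return Counter(
        " ".join(word for word, _tag in node)
        for node in tree
        if hasattr(node, "label") and node.label() == label
    )
```

Given a tree built with nltk.ne_chunk(tags, binary=True), count_ne(tree).most_common() would list the entities from most to least frequent.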

Take a look at the information you extracted:

In [10]:
extract_ne(quote)

{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}

You missed the city of Nice, possibly because NLTK interpreted it as a regular English adjective, but you still got the following:

1. **An institution**: 'Lick Observatory'

2. **A planet**: 'Mars'

3. **A publication**: 'Nature'

4. **People**: 'Perrotin', 'Schiaparelli'