# POS and NER with Spacy


POS tagging and NER are essential tasks in NLP.

- POS (Part-of-speech) is used for information extraction (finding all the adjectives associated with a person or a product, for example) and facilitates language understanding for complex NLP tasks (text generation, for instance).


- NER (Named entity recognition) is used across many domains to identify specific entities from the text (medical terms, legal concepts, people, …).

When parsing a text with a Spacy model: ```doc = nlp(text)```, Spacy also performs POS tagging and NER.


In [None]:
"""
Here are the POS tags:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CCONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
"""


"""
Here are the NER tags:
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including “%”.
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
"""

In [1]:
# install spacy if you haven't done so already and download the small English model
!pip install -U spacy
!python -m spacy download en_core_web_sm

# install NLTK
!pip install nltk

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 31.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# load spacy and the small English model
import spacy
nlp = spacy.load("en_core_web_sm")
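The tag abbreviations listed at the top of the notebook do not have to be memorised. As a small sketch, `spacy.explain` looks a label up in Spacy's built-in glossary; the labels chosen here are just examples.

```python
import spacy

# spacy.explain returns the glossary description of a POS or NER label
for label in ["PROPN", "SCONJ", "GPE", "FAC"]:
    print(f"{label:6} {spacy.explain(label)}")
```

This only needs the `spacy` package itself, not a downloaded model.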

## Part of Speech Tagging

Let's start by exploring POS tagging.


In [3]:
text = "If you don't know where you are going any road can take you there."
doc = nlp(text)

# print the part of speech of each token
for token in doc:
    print(f"{token.text}\t {token.pos_} ")

If	 SCONJ 
you	 PRON 
do	 AUX 
n't	 PART 
know	 VERB 
where	 SCONJ 
you	 PRON 
are	 AUX 
going	 VERB 
any	 DET 
road	 NOUN 
can	 AUX 
take	 VERB 
you	 PRON 
there	 ADV 
.	 PUNCT 


In [4]:
# and now for some Shakespeare

doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc:
    print(t, t.pos_)

Grace VERB
me PRON
no DET
grace NOUN
, PUNCT
nor CCONJ
uncle VERB
me PRON
no DET
uncle NOUN


Spacy correctly identifies _grace_ and _uncle_ as verbs in their first occurrence and as nouns (the expected reading) in their second.

NLTK, on the other hand, is confused: Grace and Uncle are identified as nouns in all occurrences.

In [None]:
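import nltk

# resources needed by word_tokenize and pos_tag
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

tokens = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")

# tag each token with the universal tagset
nltk.pos_tag(tokens, tagset='universal')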
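Coming back to the information-extraction use mentioned in the introduction (finding the adjectives associated with a person or a product), here is a minimal sketch. It assumes that adjectives are attached to their noun through the `amod` dependency label, which is what the English Spacy models generally produce. `adjectives_by_noun` is a hypothetical helper, and the tokens are hand-built stand-ins so that the logic can be checked without loading a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def adjectives_by_noun(tokens):
    """Map each noun or proper noun to the adjectives that modify it."""
    found = defaultdict(list)
    for token in tokens:
        # keep adjectives attached to a noun as adjectival modifiers
        if (
            token.pos_ == "ADJ"
            and token.dep_ == "amod"
            and token.head.pos_ in {"NOUN", "PROPN"}
        ):
            found[token.head.text].append(token.text)
    return dict(found)


# Stand-ins exposing the same attributes as Spacy tokens (text, pos_, dep_, head)
road = SimpleNamespace(text="road", pos_="NOUN", dep_="nsubj", head=None)
long_ = SimpleNamespace(text="long", pos_="ADJ", dep_="amod", head=road)
dusty = SimpleNamespace(text="dusty", pos_="ADJ", dep_="amod", head=road)

print(adjectives_by_noun([long_, dusty, road]))
```

With a loaded model, the same function can be called directly on a parsed text, for example `adjectives_by_noun(nlp("A long dusty road"))`, since iterating over a `doc` yields tokens with these attributes. The result then depends on how the model parses the sentence.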

## Named Entity Recognition (NER)

Now let's see how we can extract the names of people, places, etc. from a text with Spacy.

Let's see which persons can be found in Alice in Wonderland.




In [5]:
import requests
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# text from Alice in Wonderland
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')

# remove the header, the footer and some weird characters
text = ' '.join(r.text.split('***')[1:])
text = text.split("END OF THE PROJECT GUTENBERG")[0]
text = text.encode('ascii', errors='ignore').decode('ascii')
print(text)

 START OF THE PROJECT GUTENBERG EBOOK ALICES ADVENTURES IN WONDERLAND  

[Illustration]




Alices Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.   A Caucus-Race and a Long Tale
 CHAPTER IV.    The Rabbit Sends in a Little Bill
 CHAPTER V.     Advice from a Caterpillar
 CHAPTER VI.    Pig and Pepper
 CHAPTER VII.   A Mad Tea-Party
 CHAPTER VIII.  The Queens Croquet-Ground
 CHAPTER IX.    The Mock Turtles Story
 CHAPTER X.     The Lobster Quadrille
 CHAPTER XI.    Who Stole the Tarts?
 CHAPTER XII.   Alices Evidence




CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, and what is the use of a book, thought Alice
without pi
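The printed text shows "Alices Adventures" and "The Queens Croquet-Ground": dropping every non-ASCII character also drops the apostrophes. A possible alternative, sketched below, maps typographic quotes and dashes to ASCII equivalents before discarding the rest. It assumes the lost characters are the curly quotes U+2018/U+2019/U+201C/U+201D, and `to_ascii_keep_apostrophes` is a hypothetical helper name.

```python
def to_ascii_keep_apostrophes(raw):
    """Replace typographic quotes and dashes, then drop remaining non-ASCII."""
    replacements = {
        "\u2018": "'",  # left single quote
        "\u2019": "'",  # right single quote / apostrophe
        "\u201c": '"',  # left double quote
        "\u201d": '"',  # right double quote
        "\u2014": "-",  # long dash
    }
    for fancy, plain in replacements.items():
        raw = raw.replace(fancy, plain)
    # anything still outside ASCII (e.g. a byte-order mark) is discarded
    return raw.encode("ascii", errors="ignore").decode("ascii")


sample = "\ufeffAlice\u2019s Adventures \u2014 \u201cI\u2019ll\u201d"
print(to_ascii_keep_apostrophes(sample))  # Alice's Adventures - "I'll"
```

Keeping the apostrophes preserves contractions and possessives such as "I'll" and "Alice's", which may help the tokenizer and the entity recognizer further down.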

In [10]:
# and parse the text
doc = nlp(text)

# Find all the 'persons' in the text
persons = []
# For each entity in the doc
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add to the list of persons
        persons.append(ent.text)

# note we could have written the last bit in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# list the 20 most common ones
Counter(persons).most_common(20)

[('Alice', 359),
 ('Hatter', 54),
 ('Queen', 53),
 ('Mouse', 27),
 ('Bill', 9),
 ('William', 7),
 ('Lory', 7),
 ('Majesty', 6),
 ('Gryphon', 6),
 ('Mary Ann', 4),
 ('Knave', 4),
 ('Ill', 3),
 ('Mabel', 3),
 ('Said', 3),
 ('Tis', 3),
 ('Panther', 3),
 ('Down', 2),
 ('Edwin', 2),
 ('Mercia', 2),
 ('Found', 2)]

The Rabbit, although a very frequent character in the book, does not appear in the top 20 of identified persons.

Let's see how the Rabbit entity is classified.


In [12]:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

[(('Rabbit', 'ORG'), 6),
 (('the White Rabbit', 'ORG'), 4),
 (('the White Rabbit', 'FAC'), 3),
 (('Rabbits', 'PERSON'), 1),
 (('The Rabbit Sends', 'WORK_OF_ART'), 1),
 (('RabbitsPat', 'ORG'), 1),
 (('the White Rabbit', 'WORK_OF_ART'), 1),
 (('the White\r\nRabbit', 'ORG'), 1),
 (('The White Rabbit', 'WORK_OF_ART'), 1)]

Interestingly, the Rabbit is identified as an organisation (ORG), a facility (FAC) and even a work of art! But, apart from a single 'Rabbits', never as a person.
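The counts above are split across spelling variants ('the White Rabbit', 'The White Rabbit', 'the White\r\nRabbit'). A rough sketch of how the variants could be merged before counting labels follows. `labels_by_entity` is a hypothetical helper, the normalisation rules (collapse whitespace, lowercase, drop a leading "the") are assumptions, and the sample pairs are copied from part of the output above.

```python
from collections import Counter, defaultdict


def labels_by_entity(pairs):
    """Group (text, label) pairs by normalised entity text and count labels."""
    grouped = defaultdict(Counter)
    for text, label in pairs:
        # collapse line breaks and repeated spaces, ignore case
        key = " ".join(text.split()).lower()
        # drop a leading article so 'the White Rabbit' matches 'White Rabbit'
        if key.startswith("the "):
            key = key[4:]
        grouped[key][label] += 1
    return grouped


pairs = (
    [("Rabbit", "ORG")] * 6
    + [("the White Rabbit", "ORG")] * 4
    + [("the White Rabbit", "FAC")] * 3
    + [
        ("the White Rabbit", "WORK_OF_ART"),
        ("the White\r\nRabbit", "ORG"),
        ("The White Rabbit", "WORK_OF_ART"),
    ]
)

for name, counts in labels_by_entity(pairs).items():
    print(name, counts.most_common())
```

In the notebook, the pairs would come straight from the parsed text, e.g. `labels_by_entity((ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text)`.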

Let's see if we get better results by using a larger Spacy model.


In [None]:
# Download and load the large English model.
# Note: comment out the download line once the model has been downloaded
# to avoid downloading it each time you run the notebook!
!python -m spacy download en_core_web_lg


In [None]:
nlp_lg = spacy.load("en_core_web_lg")

In [None]:
# and parse the text, this time with the large model
doc = nlp_lg(text)

# Find all the 'persons' in the text
persons = []
# For each entity in the doc
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add to the list of persons
        persons.append(ent.text)

# note we could have written the last bit in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# list the 20 most common ones
Counter(persons).most_common(20)


In [None]:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

Well, that did not really work out either. The poor rabbit is classified as an organisation and still not as a person or character.

Note that with the larger model, Alice is identified as a PERSON 293 times, against only 191 times with the smaller model (these counts depend on the model version: the small-model run shown above reports 359). So although the model still can't identify the entity class of the Rabbit, it does a better job on other characters.
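To compare the two models without reading two separate lists, the counts can be lined up side by side. This is only a sketch: `compare_counts` is a hypothetical helper, and the counts in the example are made up for illustration, not results from the book.

```python
from collections import Counter


def compare_counts(small, large, top=10):
    """Return (name, count_small, count_large) for the most frequent names."""
    # rank names by their combined frequency across both models
    names = [name for name, _ in (small + large).most_common(top)]
    # a Counter returns 0 for names that one model never produced
    return [(name, small[name], large[name]) for name in names]


# made-up toy counts, for illustration only
small = Counter({"Alice": 3, "Hatter": 1})
large = Counter({"Alice": 5, "Queen": 2})

for row in compare_counts(small, large):
    print(row)
```

In the notebook, `small` and `large` would be the `Counter(persons)` objects built from `nlp(text)` and `nlp_lg(text)` respectively.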

Let's see which other ORGs we can find in the book.

In [None]:
orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
Counter(orgs).most_common(10)

In [None]:
# and works of art

woas = [ent.text for ent in doc.ents if ent.label_ == 'WORK_OF_ART']
Counter(woas).most_common(10)