# Entity Extraction Exercise in Class

In this example, we will use NLTK for entity extraction. 
- First, set up a Python environment.
- Optionally, switch pip to a faster mirror: https://mirrors.bfsu.edu.cn/help/pypi/
- Install NLTK: ``pip install nltk``
- Download the NLTK data distribution. Start a Python interpreter, run ``import nltk``, then open the NLTK downloader with ``nltk.download()``. If ``nltk.download()`` cannot connect, download the data manually from https://github.com/nltk/nltk_data/tree/gh-pages or https://pan.baidu.com/s/1wONWpaa86_wnsIksKda8eQ (code: tfon)
- Unzip the downloaded file into the folder returned by ``nltk.data.find(".")``
- Unzip each zip file in the ten folders: *chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers*
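
After a manual install it can help to confirm which data packages NLTK can actually see. The following is a minimal sketch, not part of the original exercise: the helper name `check_resources` is hypothetical, and the resource paths are assumptions that may differ between NLTK releases (newer versions may use names such as `punkt_tab`).

```python
import nltk

def check_resources(paths):
    """Return {resource_path: True/False} depending on whether NLTK can find it."""
    status = {}
    for path in paths:
        try:
            nltk.data.find(path)   # raises LookupError when the resource is missing
            status[path] = True
        except LookupError:
            status[path] = False
    return status

# Assumed resource paths for the tokenizer, tagger, NE chunker and tag help
wanted = [
    "tokenizers/punkt",
    "taggers/averaged_perceptron_tagger",
    "chunkers/maxent_ne_chunker",
    "corpora/words",
    "help/tagsets",
]

for path, found in check_resources(wanted).items():
    print(f"{path}: {'found' if found else 'missing'}")

# Directories that NLTK searches for data
print(nltk.data.path)
```

Anything reported as missing either was not unzipped into one of the listed directories or goes by a different name in the installed NLTK version.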

In [1]:
# import packages 
import nltk
from nltk import word_tokenize
from nltk import Tree

In [8]:
nltk.download("tagsets")

[nltk_data] Error loading tagset: [WinError 10060]
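[nltk_data]     A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.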


False

In [23]:
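# Tokenize the sentence into words
raw = """John was born in Liverpool, to Julia and Alfred Lennon"""
# word_tokenize returns word tokens; the punkt sentence tokenizer would
# return the whole sentence as a single item, which pos_tag cannot tag per word
tokens = word_tokenize(raw)
tokens

['John', 'was', 'born', 'in', 'Liverpool', ',', 'to', 'Julia', 'and', 'Alfred', 'Lennon']

In [24]:
# POS-tag the tokens
# Output: a list of (token, tag) tuples
tagged = nltk.pos_tag(tokens)
print(tagged)

[('John', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Liverpool', 'NNP'), (',', ','), ('to', 'TO'), ('Julia', 'NNP'), ('and', 'CC'), ('Alfred', 'NNP'), ('Lennon', 'NNP')]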


In [25]:
type(tagged)

list

To see detailed information about a tag, use the following statement:

In [26]:
#You can call the following method to get info about any pos tag
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


### Chunking:
- Use ``ne_chunk`` provided by NLTK. ``ne_chunk`` takes POS-tagged tokens as input and adds ``NE`` (named entity) labels to the sentence. Its output is an ``nltk.Tree`` object.
- ``ne_chunk`` produces 2-level trees:
 - Nodes on Level-1: tokens outside any chunk, plus one subtree per chunk
 - Nodes on Level-2: tokens inside a chunk (the label of the subtree is the label of the chunk)
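
To make the two levels concrete, here is a small sketch that builds a fragment of such a tree by hand (so it runs without the tagger or chunker models) and tells the two kinds of level-1 node apart. The fragment is an illustration based on the example sentence used below, not actual `ne_chunk` output.

```python
from nltk import Tree

# Hand-built fragment shaped like an ne_chunk result
tree = Tree("S", [
    ("born", "VBN"),                        # outside any chunk: a plain (word, tag) tuple
    ("in", "IN"),
    Tree("GPE", [("Liverpool", "NNP")]),    # a chunk: a subtree labelled with the entity type
])

for node in tree:                           # level 1
    if isinstance(node, Tree):
        # level 2: the tagged tokens inside the chunk
        print("chunk", node.label(), node.leaves())
    else:
        print("token", node)
```

Checking `isinstance(node, Tree)` is the test used throughout this exercise to decide whether a level-1 node is a chunk.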


In [27]:
#import related packages
from nltk import pos_tag, ne_chunk
text = """John was born in Liverpool, to Julia and Alfred Lennon"""
chunks = ne_chunk(pos_tag(word_tokenize(text)))
print(chunks)
chunks.draw()

(S
  (PERSON John/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Liverpool/NNP)
  ,/,
  to/TO
  (GPE Julia/NNP)
  and/CC
  (PERSON Alfred/NNP Lennon/NNP))


In [28]:
type(chunks)

nltk.tree.Tree

In [29]:
for i in chunks:
    print(i, type(i))

(PERSON John/NNP) <class 'nltk.tree.Tree'>
('was', 'VBD') <class 'tuple'>
('born', 'VBN') <class 'tuple'>
('in', 'IN') <class 'tuple'>
(GPE Liverpool/NNP) <class 'nltk.tree.Tree'>
(',', ',') <class 'tuple'>
('to', 'TO') <class 'tuple'>
(GPE Julia/NNP) <class 'nltk.tree.Tree'>
('and', 'CC') <class 'tuple'>
(PERSON Alfred/NNP Lennon/NNP) <class 'nltk.tree.Tree'>
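
For downstream processing, the same tree can be flattened into one tag per token. The sketch below uses `tree2conlltags` from `nltk.chunk` to produce IOB tags. The tree is rebuilt by hand from the output above (the name `chunks_copy` is hypothetical) so that the snippet runs without the tagger or chunker models.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built copy of the printed ne_chunk result
chunks_copy = Tree("S", [
    Tree("PERSON", [("John", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("Liverpool", "NNP")]),
    (",", ","),
    ("to", "TO"),
    Tree("GPE", [("Julia", "NNP")]),
    ("and", "CC"),
    Tree("PERSON", [("Alfred", "NNP"), ("Lennon", "NNP")]),
])

# Each item is (word, pos_tag, iob_tag):
# B-<label> starts a chunk, I-<label> continues it, O marks tokens outside any chunk
iob = tree2conlltags(chunks_copy)
for word, pos, tag in iob:
    print(word, pos, tag)
```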


In [30]:
# Print the label and the tokens of every chunk subtree
for i in chunks:
    if isinstance(i, Tree):
        print(i.label())
        for token, pos in i.leaves():
            print(token, pos)

PERSON
John NNP
GPE
Liverpool NNP
GPE
Julia NNP
PERSON
Alfred NNP
Lennon NNP


Problem: Julia is labeled as GPE instead of PERSON.
One workaround is to add a last name for Julia, such as Kim:

In [31]:
text = """John was born in Liverpool, to Julia Kim and Alfred Lennon"""
chunks = ne_chunk(pos_tag(word_tokenize(text)))
print(chunks)
chunks.draw()

(S
  (PERSON John/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Liverpool/NNP)
  ,/,
  to/TO
  (PERSON Julia/NNP Kim/NNP)
  and/CC
  (PERSON Alfred/NNP Lennon/NNP))


Traverse the chunked tree structure to get each chunk and words inside each chunk:

In [37]:
#Traverse the level-1 nodes in the tree: 
for i in chunks:
    print(i, type(i))

(PERSON John/NNP) <class 'nltk.tree.Tree'>
('was', 'VBD') <class 'tuple'>
('born', 'VBN') <class 'tuple'>
('in', 'IN') <class 'tuple'>
(GPE Liverpool/NNP) <class 'nltk.tree.Tree'>
(',', ',') <class 'tuple'>
('to', 'TO') <class 'tuple'>
(PERSON Julia/NNP Kim/NNP) <class 'nltk.tree.Tree'>
('and', 'CC') <class 'tuple'>
(PERSON Alfred/NNP Lennon/NNP) <class 'nltk.tree.Tree'>


In [39]:
# i[0] is the first leaf of a chunk subtree, or the word of a (word, tag) tuple
for i in chunks:
    print(i[0], type(i))

('John', 'NNP') <class 'nltk.tree.Tree'>
was <class 'tuple'>
born <class 'tuple'>
in <class 'tuple'>
('Liverpool', 'NNP') <class 'nltk.tree.Tree'>
, <class 'tuple'>
to <class 'tuple'>
('Julia', 'NNP') <class 'nltk.tree.Tree'>
and <class 'tuple'>
('Alfred', 'NNP') <class 'nltk.tree.Tree'>


In [19]:
#Traverse the level-2 nodes in the sub-tree: 
for i in chunks:
    if isinstance(i, Tree):
        print("Chunk detected!")
        for token, pos in i.leaves():
            print(token, pos)

Chunk detected!
John NNP
Chunk detected!
Liverpool NNP
Chunk detected!
Julia NNP
Kim NNP
Chunk detected!
Alfred NNP
Lennon NNP
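
The traversal above prints tokens one by one; in practice the tokens of each chunk are usually joined into a single entity string. The following is a minimal sketch of that step. The helper name `extract_entities` is hypothetical, and the tree is rebuilt by hand from the printed output so the snippet does not need the NLTK models.

```python
from nltk import Tree

def extract_entities(tree):
    """Collect (label, phrase) pairs from a 2-level chunk tree."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):
            # join the words of all (word, tag) leaves in this chunk
            phrase = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), phrase))
    return entities

# Hand-built copy of the chunk tree for the "Julia Kim" sentence
chunk_tree = Tree("S", [
    Tree("PERSON", [("John", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("Liverpool", "NNP")]),
    (",", ","),
    ("to", "TO"),
    Tree("PERSON", [("Julia", "NNP"), ("Kim", "NNP")]),
    ("and", "CC"),
    Tree("PERSON", [("Alfred", "NNP"), ("Lennon", "NNP")]),
])

entities = extract_entities(chunk_tree)
print(entities)
```

With the real `chunks` object from the cells above, `extract_entities(chunks)` should work the same way, since it relies only on `label()` and `leaves()`.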


In [40]:
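# Chunk noun phrases (NP) with a regular-expression grammar over POS tags:
# optional adjectives followed by one or more nouns, or nouns alone
sen = """the little dog barked at the cat"""
grammar = "NP: {<JJ>*<NN.*>+}\n {<NN.*>+}"
cp = nltk.RegexpParser(grammar)
mychunk = cp.parse(pos_tag(word_tokenize(sen)))
mychunk.draw()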

In [43]:
#print(mychunk)
for i in mychunk:
    print(i, type(i))

('the', 'DT') <class 'tuple'>
(NP little/JJ dog/NN) <class 'nltk.tree.Tree'>
('barked', 'VBD') <class 'tuple'>
('at', 'IN') <class 'tuple'>
('the', 'DT') <class 'tuple'>
(NP cat/NN) <class 'nltk.tree.Tree'>
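
In the output above, the determiner "the" stays outside each NP because the grammar has no pattern for `DT`. As a sketch of one possible variation (not part of the original exercise), the grammar below adds an optional determiner. The tagged tokens are copied by hand from the output above so the snippet runs without the tokenizer or tagger models; the names `np_grammar` and `np_phrases` are hypothetical.

```python
from nltk import RegexpParser, Tree

# POS-tagged tokens copied from the output above
tagged_sen = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

# NP = optional determiner, any number of adjectives, one or more nouns
np_grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
parser = RegexpParser(np_grammar)
np_tree = parser.parse(tagged_sen)
print(np_tree)

# Join the words of each NP chunk into a phrase
np_phrases = [
    " ".join(word for word, tag in node.leaves())
    for node in np_tree
    if isinstance(node, Tree) and node.label() == "NP"
]
print(np_phrases)
```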
