# NLP_4 Parts of Speech Tagging and Named Entity Recognition

[Python for NLP: Parts of Speech Tagging and Named Entity Recognition](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/)

## Parts of Speech (POS) Tagging

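Part-of-speech (POS) tagging assigns the appropriate part of speech to every word in a sentence. Once the text has been processed into tokens, each token exposes its coarse-grained part of speech through the ```pos_``` attribute and its fine-grained tag through the ```tag_``` attribute. If the meaning of a tag is unclear, pass the tag name to ```spacy.explain()``` for a fuller description.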

In [2]:
import spacy
sp = spacy.load('en_core_web_sm')

sen = sp("I like to play football. I hated it in my childhood though")
print(sen.text)

I like to play football. I hated it in my childhood though


In [3]:
print(sen[7].pos_)
print(sen[7].tag_)
print(spacy.explain(sen[7].tag_))

VERB
VBD
verb, past tense


In [10]:
for word in sen:
    print(f'{word.text:12} {word.pos_:10} {word.tag_:8} {spacy.explain(word.tag_)}')
# In an f-string, any expression can be written as {expression}; an integer after the ':' sets the minimum field width for that value.

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival "to"
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           DET        PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       SCONJ      IN       conjunction, subordinating or preposition
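
The column alignment in the loop above comes from Python's format specification rather than from spaCy. As a minimal sketch using only the standard library (the `format_row` helper and the sample rows are illustrative and not part of the original notebook), a width after the colon pads each string to at least that many characters:

```python
# Sample (text, pos, tag) rows copied from the output above; no spaCy needed.
rows = [("I", "PRON", "PRP"), ("football", "NOUN", "NN")]

def format_row(text, pos, tag):
    """Left-align each field and pad it to a minimum width of 12, 10, and 8."""
    return f"{text:12} {pos:10} {tag:8}"

for row in rows:
    # repr() makes the trailing padding visible.
    print(repr(format_row(*row)))

# A nested replacement field such as {text:{width}} yields the same result,
# which is why both spellings appear in this notebook.
width = 12
assert f"{'I':{width}}" == f"{'I':12}"
```

Strings longer than the width are not truncated, so an unusually long token shifts the columns that follow it.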


### Why Is POS Tagging Useful?

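POS tagging is especially useful because it can tell apart different uses of the same word. The tagger assigns a tag based on context, so a word such as "google" receives a different tag when it is used as a verb than when it is used as a noun.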

In [6]:
sen = sp('Can you google it?')
word = sen[2]

print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       VERB       VB       verb, base form


In [7]:
sen = sp('Can you search it on google?')
word = sen[5]

print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       PROPN      NNP      noun, proper singular


### Finding the Number of POS Tags

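To count how often each POS tag occurs in a text, call ```count_by(spacy.attrs.POS)``` on the spaCy document. It returns a dictionary whose keys are POS tag IDs and whose values are the number of occurrences. Use ```vocab[id].text``` to look up the tag name that an ID corresponds to.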

In [3]:
sen = sp("I like to play football. I hated it in my childhood though")

num_pos = sen.count_by(spacy.attrs.POS)
num_pos

{85: 1, 90: 1, 92: 2, 94: 1, 95: 3, 97: 1, 98: 1, 100: 3}

In [4]:
for k,v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:8}: {v}')

85. ADP     : 1
90. DET     : 1
92. NOUN    : 2
94. PART    : 1
95. PRON    : 3
97. PUNCT   : 1
98. SCONJ   : 1
100. VERB    : 3
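
If tag names are more convenient than integer IDs, an alternative is to tally the ```pos_``` strings directly. The sketch below assumes only that each token exposes a ```pos_``` attribute. The `SimpleNamespace` stand-ins and the `count_pos` helper are hypothetical, chosen so the example runs without a loaded model; with spaCy, the document `sen` could be passed in their place.

```python
from collections import Counter
from types import SimpleNamespace

def count_pos(tokens):
    """Count tokens per coarse-grained POS name (anything with a .pos_ attribute)."""
    return Counter(token.pos_ for token in tokens)

# Stand-in tokens carrying the POS names from the tagging output earlier in this section.
pos_names = ["PRON", "VERB", "PART", "VERB", "NOUN", "PUNCT", "PRON",
             "VERB", "PRON", "ADP", "DET", "NOUN", "SCONJ"]
tokens = [SimpleNamespace(pos_=name) for name in pos_names]

counts = count_pos(tokens)
for name, n in sorted(counts.items()):
    print(f"{name:8}: {n}")
```

The totals agree with the ```count_by()``` result above, but the keys are already readable names, so no lookup through ```vocab``` is needed.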


### Visualizing Parts of Speech Tags

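POS tags and the relationships between words are well suited to visualization, and spaCy provides the ```displacy``` module for this. Call ```render()``` from displacy, pass the spaCy document as the argument, set ```style``` to "dep" (which draws the dependency parse with each token labeled by its POS tag), and set ```jupyter``` to True to display the result inline.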

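To display the result outside a Jupyter notebook, use ```serve()``` instead: it starts a local web server and delivers the visualization as HTML that can be viewed in a browser.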

In [9]:
from spacy import displacy

sen = sp(u"I like to play football. I hated it in my childhood though")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

In [10]:
displacy.serve(sen, style='dep', options={'distance': 120})


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [17/Dec/2019 14:06:31] "GET / HTTP/1.1" 200 9473


Shutting down server on port 5000.
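
When a running server is not needed, another option is to write the markup to a file and open it later. This sketch assumes that ```displacy.render()``` returns the markup as a string when ```jupyter=False```; a placeholder string is used here so the example runs without spaCy. The `save_markup` helper and the file name are hypothetical.

```python
from pathlib import Path
import tempfile

def save_markup(markup, path):
    """Write rendered markup (an SVG or HTML string) to a UTF-8 text file."""
    path = Path(path)
    path.write_text(markup, encoding="utf-8")
    return path

# Placeholder standing in for the string assumed to come from
# displacy.render(sen, style='dep', jupyter=False).
markup = "<svg xmlns='http://www.w3.org/2000/svg'></svg>"

out_file = save_markup(markup, Path(tempfile.gettempdir()) / "dep_parse.svg")
print(out_file.read_text(encoding="utf-8") == markup)
```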


## Named Entity Recognition

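Named entity recognition (NER) finds entities of different types in a text, such as organizations, people, and places. To get the entities, use the ```ents``` attribute of the spaCy document. To get the type label of an entity, use its ```label_``` attribute. For a fuller description of a label, pass it to ```spacy.explain()```.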

In [28]:
import spacy
sp = spacy.load('en_core_web_sm')

sen = sp(u'Manchester United Company is looking to sign Harry Kane for $90 million')
print(sen.ents)

for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

(Manchester United Company, Harry Kane, $90 million)
Manchester United Company - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


### Adding New Entities

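We can also add new entities to a document. Suppose NesfruitaQQQQQ is an organization, but the model did not label it as an entity.

First, import the ```Span``` class from ```spacy.tokens```. Next, get the hash value of the entity type label (think of it as a fingerprint for the string). Then call ```Span()``` with the token index range that covers the unlabeled entity (the end index is exclusive), together with the hash value of the label. Finally, append the resulting ```Span``` object to the document's existing list of entities.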

In [32]:
sen = sp('NesfruitaQQQQQ is setting up a new company in India')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

India - GPE - Countries, cities, states


In [33]:
from spacy.tokens import Span

ORG = sen.vocab.strings['ORG'] # get the hash value of the "ORG" label
new_entity = Span(sen, 0, 1, label=ORG) # mark the first token as a new entity and give it that label
sen.ents = list(sen.ents) + [new_entity] # merge it into the existing entity list

In [34]:
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

NesfruitaQQQQQ - ORG - Companies, agencies, institutions, etc.
India - GPE - Countries, cities, states
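
The two integers passed to ```Span()``` are token positions with an exclusive end, so ```Span(sen, 0, 1)``` covers only the first token. For an entity that spans several tokens, it can help to compute the range rather than count by hand. The following is a minimal sketch on a plain list of strings; `find_token_span` is a hypothetical helper, and splitting on whitespace only approximates spaCy's tokenizer.

```python
def find_token_span(tokens, phrase):
    """Return the half-open (start, end) index range of phrase in tokens, or None."""
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            return start, start + n
    return None

# Whitespace split is a stand-in for spaCy tokenization (an assumption for this sketch).
tokens = "NesfruitaQQQQQ is setting up a new company in India".split()

print(find_token_span(tokens, ["NesfruitaQQQQQ"]))   # (0, 1)
print(find_token_span(tokens, ["new", "company"]))   # (5, 7)
```

With spaCy, the same search could run over ```[t.text for t in sen]```, and the resulting pair could be passed as the start and end arguments of ```Span()```.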


### Counting Entities

Earlier, ```count_by()``` was available for counting POS tags. No equivalent method is used here for entities, so we simply loop over them and count.

In [25]:
sen = sp('Manchester United Company is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United Company - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional
100 Million Dollars - MONEY - Monetary values, including unit


In [26]:
len([ent for ent in sen.ents if ent.label_=='PERSON'])

2
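
To count every label in one pass instead of filtering for a single one, ```collections.Counter``` works well. The sketch assumes only that each entity exposes a ```label_``` attribute. The `SimpleNamespace` stand-ins and the `count_entity_labels` helper are hypothetical so that the example runs without a loaded model; with spaCy, ```sen.ents``` could be passed in their place.

```python
from collections import Counter
from types import SimpleNamespace

def count_entity_labels(entities):
    """Count entities per label (anything with a .label_ attribute)."""
    return Counter(ent.label_ for ent in entities)

# Stand-ins for the five entities printed above.
entities = [
    SimpleNamespace(text="Manchester United Company", label_="ORG"),
    SimpleNamespace(text="Harry Kane", label_="PERSON"),
    SimpleNamespace(text="$90 million", label_="MONEY"),
    SimpleNamespace(text="David", label_="PERSON"),
    SimpleNamespace(text="100 Million Dollars", label_="MONEY"),
]

label_counts = count_entity_labels(entities)
print(label_counts["PERSON"])
print(dict(label_counts))
```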

### Visualizing Named Entities

spaCy's ```displacy``` module can also visualize named entities. Call ```render()``` from displacy, pass the spaCy document as the argument, set ```style``` to "ent", and set ```jupyter``` to True to display the result inline.

In [36]:
from spacy import displacy

sen = sp('Manchester United Company is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
displacy.render(sen, style='ent', jupyter=True)

The NER visualization can also be limited to specific entity types: pass a dictionary as ```options``` whose ```ents``` key holds the list of entity labels to display.

In [37]:
ent_filter = {'ents': ['ORG']} # show only entities labeled "ORG"
displacy.render(sen, style='ent', jupyter=True, options=ent_filter)
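
The options dictionary is plain Python, so it can be built by a small helper. The sketch below assumes that the "ent" visualizer also accepts a ```colors``` key mapping labels to color values, in addition to the ```ents``` key described above. `build_ent_options` and the color codes are hypothetical.

```python
def build_ent_options(labels, colors=None):
    """Build an options dict for the 'ent' style: labels to show, plus optional colors."""
    options = {"ents": list(labels)}
    if colors:
        # Keep only the colors for labels that will actually be displayed.
        options["colors"] = {
            label: color for label, color in colors.items() if label in options["ents"]
        }
    return options

options = build_ent_options(
    ["ORG", "PERSON"],
    colors={"ORG": "#aa9cfc", "MONEY": "#ffd966"},
)
print(options)
```

The result could then be passed as ```options=options``` to ```displacy.render()```.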

In [31]:
displacy.serve(sen, style='ent')


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [17/Dec/2019 14:15:37] "GET / HTTP/1.1" 200 2160


Shutting down server on port 5000.
