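# Semantics
Semantics refers to the meaning of a sentence.  
NLP handles semantics with the following techniques:  
1. Word Segmentation: splitting running text into the correct words.  
2. Stemming: reducing derived words to their stem.  
3. Lemmatization: converting inflected words to their dictionary form.  
4. Part-of-Speech Tagging: identifying the part of speech of each word.
5. Parsing: analyzing the grammatical structure of each sentence with a syntax tree in order to interpret it correctly.  
6. Sentence Breaking: splitting a text into its complete sentences, also known as sentence tokenization.  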

### 1. Word Segmentation
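Common segmentation packages: jieba (Chinese), NLTK (English)  

> jieba  
- The cut function segments Chinese text into words  
- The add_word function adds a custom word  
- The load_userdict function loads a custom dictionary file  
- The lcut function returns the segmentation result as a list

> Ckip  
- A Chinese word segmentation package developed by Academia Sinica, Taiwan  
- Supports several NLP tasks (POS, NER, etc.)  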

##### jieba
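jieba builds a Trie from the words in its dictionary and uses it to construct a DAG (directed acyclic graph) of every possible way to split the sentence, then picks the most probable path based on word frequency.  
Words that are not in the dictionary are recognized with an HMM (hidden Markov model).  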
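
The dictionary-to-DAG idea can be sketched in plain Python. This is only an illustration, not jieba's actual implementation: the `FREQ` table is a hypothetical toy dictionary, a plain `dict` stands in for the Trie, and the HMM step for unknown words is left out (unknown characters simply become single-character words).

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real data).
FREQ = {"足球": 50, "運動": 40, "足球運動": 5, "需要": 60, "大家": 70,
        "足": 10, "球": 10, "運": 5, "動": 5}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Map each start index to the end indices of every dictionary word starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to a single-character edge
    return dag

def best_path(sentence, dag):
    """Pick the path through the DAG with the highest total log-probability."""
    n = len(sentence)
    route = {n: (0.0, n)}
    # Work backwards so route[j] is already known when route[i] is computed.
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

text = "足球運動需要大家"
print(best_path(text, build_dag(text)))
# ['足球', '運動', '需要', '大家']
```

With these toy frequencies, splitting into 足球 and 運動 scores higher than the single long word 足球運動, which is why the two-word split wins.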

In [2]:
import jieba

In [3]:
sentence = "足球運動需要大家一起來推廣，歡迎加入我們的行列！"
print("Input: {}".format(sentence))

Input: 足球運動需要大家一起來推廣，歡迎加入我們的行列！


In [12]:
words1 = jieba.cut(sentence, cut_all=False) # precise mode (default)
words2 = jieba.cut(sentence, cut_all=True) # full mode: fast, lists every possible word
words3 = jieba.cut_for_search(sentence) # search engine mode

In [13]:
print("Precise mode:")
for word in words1:
    print(word+'/', end='')

print("\nFull mode:")
for word in words2:
    print(word+'/', end='')

print("\nSearch engine mode:")
for word in words3:
    print(word+'/', end='')

Precise mode:
足球/運動/需要/大家/一起/來/推廣/，/歡迎/加入/我們/的/行列/！/
Full mode:
足球/運/動/需要/大家/一起/來/推/廣/，/歡/迎/加入/我/們/的/行列/！/
Search engine mode:
足球/運動/需要/大家/一起/來/推廣/，/歡迎/加入/我們/的/行列/！/

In [14]:
text = '考試即將結束'
words4 = jieba.cut(text, cut_all=False)
for word in words4:
    print(word+'/', end='')

考試/即將/結束/

In [16]:
# To treat "即將" and "結束" as a single word, either load a custom dict.txt with load_userdict or add the word directly with add_word
jieba.add_word('即將結束', freq=None, tag=None)

# Test the segmentation
text = '考試即將結束'
words4 = jieba.cut(text, cut_all=False)
for word in words4:
    print(word+'/', end='')

考試/即將結束/
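
For the load_userdict route mentioned in the comment above, jieba reads a plain-text file with one entry per line: the word, then an optional frequency, then an optional POS tag, separated by spaces. The sketch below only writes such a file with the standard library; the file name `userdict.txt`, the frequency `10`, and the tag `v` are illustrative assumptions.

```python
import os
import tempfile

# Hypothetical entries: the frequency (10) and POS tag (v) are illustrative only.
entries = ["即將結束 10 v", "足球運動"]

# Write the dictionary file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")

# With jieba installed, the file would then be loaded with:
#   jieba.load_userdict(path)
with open(path, encoding="utf-8") as f:
    print(f.read())
```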

In [20]:
# jieba's tokenize also returns the start and end offsets of each word
import jieba
sentence = "足球運動需要大家一起來推廣，歡迎加入我們的行列！"
words = jieba.tokenize(sentence)
list(words)
# for tk in words:
#     print("word {}\t\t start: {} \t\t end:{}".format(tk[0],tk[1],tk[2]))

[('足球', 0, 2),
 ('運動', 2, 4),
 ('需要', 4, 6),
 ('大家', 6, 8),
 ('一起', 8, 10),
 ('來', 10, 11),
 ('推廣', 11, 13),
 ('，', 13, 14),
 ('歡迎', 14, 16),
 ('加入', 16, 18),
 ('我們', 18, 20),
 ('的', 20, 21),
 ('行列', 21, 23),
 ('！', 23, 24)]
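
The offsets are half-open character positions, so `sentence[start:end]` recovers each word. A small sketch using a few of the tuples from the output above; the `highlight` helper is a hypothetical name introduced only for this example.

```python
sentence = "足球運動需要大家一起來推廣，歡迎加入我們的行列！"

# A few (word, start, end) tuples copied from the tokenize output above.
tokens = [("足球", 0, 2), ("推廣", 11, 13), ("行列", 21, 23)]

# Each offset pair slices the original sentence back to the word.
for word, start, end in tokens:
    assert sentence[start:end] == word

def highlight(text, start, end):
    """Wrap text[start:end] in square brackets."""
    return text[:start] + "[" + text[start:end] + "]" + text[end:]

print(highlight(sentence, 11, 13))
# 足球運動需要大家一起來[推廣]，歡迎加入我們的行列！
```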

##### Ckip

In [1]:
from ckiptagger import data_utils , construct_dictionary , WS , POS , NER

In [None]:
# Download the model data from the Academia Sinica server
data_utils.download_data_url("./")

In [4]:
def cutText():
    ws = WS("./data")
    pos = POS("./data")
    ner = NER("./data")
    # Custom word weights; left empty here, so the dictionary printed below is empty
    word_to_weight = {
        
    }
    dictionary = construct_dictionary(word_to_weight)
    print(dictionary)
    
    sentence_list = ["指揮中心表示，今日新增之26943例本土病例，為12577例男性、14350例女性、16例調查中" , 
                     "被軍事迷戲稱「妖怪」的中共山東號航空母艦，昨天被記錄到現身台海中線以西，沿中國沿海南下通過金門外海，隨後跟著一艘疑似補給艦，也疑似被在西南空域盤旋的美國空軍RC-135鎖定目標，畫面也被記錄下來。" , 
                     "前鋒格林（JaMychal Green）在選秀夜當天遭金塊交易至雷霆，7月20日則傳出他和雷霆完成買斷，並且將加盟勇士，為衛冕軍補充禁區戰力。"]
    
    word_sentence_list = ws(sentence_list)
    pos_sentence_list = pos(word_sentence_list)
    entity_sentence_list = ner(word_sentence_list , pos_sentence_list)
    
    # Release the models once tagging is done
    del ws
    del pos
    del ner
    
    def print_word_pos_sentence(word_sentence , pos_sentence):
        for word , tag in zip(word_sentence , pos_sentence):
            print(f"{word}({tag})" , end="\u3000")
        print()  # end the line once, after all tokens of the sentence
            
    for i , sentence in enumerate(sentence_list):
        print("\n")
        print(f"'{sentence}'")
        print_word_pos_sentence(word_sentence_list[i] , pos_sentence_list[i])
        for entity in sorted(entity_sentence_list[i]):  # entities of this sentence only
            print(entity)
    return

In [5]:
cutText()

[]


'指揮中心表示，今日新增之26943例本土病例，為12577例男性、14350例女性、16例調查中'
指揮(Na)　中心(Nc)　表示(VE)　，(COMMACATEGORY)　今日(Nd)　新增(VJ)　之(DE)　26943(Neu)　例(Na)　本土(Nc)　病例(Na)　，(COMMACATEGORY)　為(VG)　12577(Neu)　例(Na)　男性(Na)　、(PAUSECATEGORY)　14350(Neu)　例(Na)　女性(Na)　、(PAUSECATEGORY)　16(Neu)　例(Na)　調查(VE)　中(Ng)　
(12, 17, 'CARDINAL', '26943')
(24, 29, 'CARDINAL', '12577')
(33, 38, 'CARDINAL', '14350')
(42, 44, 'CARDINAL', '16')


'被軍事迷戲稱「妖怪」的中共山東號航空母艦，昨天被記錄到現身台海中線以西，沿中國沿海南下通過金門外海，隨後跟著一艘疑似補給艦，也疑似被在西南空域盤旋的美國空軍RC-135鎖定目標，畫面也被記錄下來。'
被(P)　軍事迷(Na)　戲稱(VE)　「(PARENTHESISCATEGORY)　妖怪(Na)　」(PARENTHESISCATEGORY)　...

### 2. Stemming
Common stemmers: Porter, Snowball, Lancaster (English)  

Stemming reduces derived words to their stem, for example:  
plays/playing/played => play  
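
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter or Snowball algorithm; the `toy_stem` function and its three-suffix list are assumptions made up for this illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["plays", "playing", "played", "running"]:
    print(w, "->", toy_stem(w))
# plays -> play
# playing -> play
# played -> play
# running -> runn
```

The last line shows why real stemmers need many more rules: blindly removing "ing" leaves the doubled consonant behind.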

In [9]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
#nltk.download('all')
pst = PorterStemmer()
snowball = SnowballStemmer('english')

print(pst.stem('eating'))
print(pst.stem('passed'))
print(pst.stem('puts'))

print("=================")

print(snowball.stem('eating'))
print(snowball.stem('passed'))
print(snowball.stem('puts'))

eat
pass
put
=================
eat
pass
put


### 3. Lemmatization
Common package: WordNet (English)  

Lemmatization converts inflected words to their dictionary form, for example:  
is/are/been => be  
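
Unlike stemming, lemmatization cannot be done by trimming suffixes: "is", "are", and "been" share no common prefix with "be", so a lookup keyed on the word and its part of speech is needed. The sketch below uses a tiny hypothetical exception table (`IRREGULAR`) to show the principle; WordNet ships a far larger one plus additional rules.

```python
# Hypothetical exception table, for illustration only.
IRREGULAR = {
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("been", "v"): "be",
    ("indexes", "n"): "index",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return IRREGULAR.get((word.lower(), pos), word)

print(toy_lemmatize("been", "v"))  # be
print(toy_lemmatize("been", "n"))  # been (no noun entry, so the word is unchanged)
```

The second call illustrates why the part of speech matters: the same surface form maps to a different lemma, or to none, depending on the tag passed in.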

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wnl = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Student\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
# Pass the part of speech as the second argument (it defaults to noun)
nltk.download('omw-1.4')
print(wnl.lemmatize('indexes','n'))  # noun
print(wnl.lemmatize('struggling','v')) # verb
print(wnl.lemmatize('saddest', 'a')) # adjective

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Student\AppData\Roaming\nltk_data...


index
struggle
sad


In [12]:
from nltk.corpus import wordnet as wn

# Find which WordNet synsets (synonym sets) a word belongs to
wn.synsets('motorcar')

[Synset('car.n.01')]

In [13]:
wn.synsets('trunk')

[Synset('trunk.n.01'),
 Synset('trunk.n.02'),
 Synset('torso.n.01'),
 Synset('luggage_compartment.n.01'),
 Synset('proboscis.n.02')]

In [14]:
# List the synonyms (lemma names) contained in the car.n.01 synset
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [15]:
# The word trunk has several senses; list the synonyms for each sense
for synset in wn.synsets('trunk'):
    print(synset.lemma_names())

['trunk', 'tree_trunk', 'bole']
['trunk']
['torso', 'trunk', 'body']
['luggage_compartment', 'automobile_trunk', 'trunk']
['proboscis', 'trunk']


In [16]:
# Look up the WordNet definition of the car.n.01 synset
wn.synset('car.n.01').definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [17]:
# Look up the WordNet definition of the trunk.n.01 synset
wn.synset('trunk.n.01').definition()

'the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber'

In [18]:
# Find the hypernyms (more general synsets) of car.n.01; hyper = above
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hypernyms()
types_of_motorcar

[Synset('motor_vehicle.n.01')]

In [19]:
# Find the hyponyms (more specific synsets) of car.n.01
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

In [20]:
# After finding the hyponym synsets, collect the individual words (lemmas) from each synset
sorted( lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas() )

['Model_T',
 'S.U.V.',
 'SUV',
 'Stanley_Steamer',
 'ambulance',
 'beach_waggon',
 'beach_wagon',
 'bus',
 'cab',
 'compact',
 'compact_car',
 'convertible',
 'coupe',
 'cruiser',
 'electric',
 'electric_automobile',
 'electric_car',
 'estate_car',
 'gas_guzzler',
 'hack',
 'hardtop',
 'hatchback',
 'heap',
 'horseless_carriage',
 'hot-rod',
 'hot_rod',
 'jalopy',
 'jeep',
 'landrover',
 'limo',
 'limousine',
 'loaner',
 'minicar',
 'minivan',
 'pace_car',
 'patrol_car',
 'phaeton',
 'police_car',
 'police_cruiser',
 'prowl_car',
 'race_car',
 'racer',
 'racing_car',
 'roadster',
 'runabout',
 'saloon',
 'secondhand_car',
 'sedan',
 'sport_car',
 'sport_utility',
 'sport_utility_vehicle',
 'sports_car',
 'squad_car',
 'station_waggon',
 'station_wagon',
 'stock_car',
 'subcompact',
 'subcompact_car',
 'taxi',
 'taxicab',
 'tourer',
 'touring_car',
 'two-seater',
 'used-car',
 'waggon',
 'wagon']

In [24]:
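# Show the full hypernym paths (keep walking up from the hypernyms)
# Each path starts at the root synset (entity) and moves through progressively more specific synsets such as object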
motorcar = wn.synset('car.n.01')
motorcar.hypernym_paths()

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('container.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')],
 [Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('conveyance.n.03'),
  Synset('vehicle.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')]]

In [25]:
# Jump straight to the root hypernym
motorcar = wn.synset('car.n.01')
motorcar.root_hypernyms()

[Synset('entity.n.01')]

In [27]:
# Using whales as an example
right = wn.synset("right_whale.n.01")
minke = wn.synset("minke_whale.n.01")

In [29]:
# Lowest common hypernym of the right whale and the minke whale (their closest shared ancestor)
right.lowest_common_hypernyms(minke)

[Synset('baleen_whale.n.01')]

In [30]:
# Right whale vs. orca
orca = wn.synset("orca.n.01")
right.lowest_common_hypernyms(orca)

[Synset('whale.n.02')]

In [33]:
# Right whale vs. tortoise
# The two species are further apart, so the result is no longer a whale category but a higher-level one: vertebrate
tortoise = wn.synset("tortoise.n.01")
right.lowest_common_hypernyms(tortoise)

[Synset('vertebrate.n.01')]

In [34]:
# Right whale vs. novel
novel = wn.synset("novel.n.01")
right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

In [35]:
# Depth of each synset: the number of levels between it and the root
print(wn.synset('baleen_whale.n.01').min_depth())
print(wn.synset('whale.n.02').min_depth())
print(wn.synset('vertebrate.n.01').min_depth())
print(wn.synset('entity.n.01').min_depth())

14
13
8
0


In [36]:
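# Similarity based on the hypernym/hyponym structure (the closer to 1, the more similar the paths)
print(right.path_similarity(right))     # right whale vs. itself
print(right.path_similarity(minke))     # right whale vs. minke whale
print(right.path_similarity(orca))      # right whale vs. orca
print(right.path_similarity(tortoise))  # right whale vs. tortoise
print(right.path_similarity(novel))     # right whale vs. novel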

1.0
0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216
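
These scores follow NLTK's definition of path similarity, 1 / (shortest_path_distance + 1), where the distance is the number of edges on the shortest path between the two synsets in the hypernym hierarchy. The sketch below inverts that formula to recover the distances from the scores printed above; the dictionary keys are just labels chosen for this example.

```python
# Scores copied from the output above.
scores = {
    "minke_whale": 0.25,
    "orca": 0.16666666666666666,
    "tortoise": 0.07692307692307693,
    "novel": 0.043478260869565216,
}

# similarity = 1 / (distance + 1)  =>  distance = 1 / similarity - 1
for name, score in scores.items():
    distance = round(1 / score) - 1
    print(f"right_whale -> {name}: {distance} edges")
# right_whale -> minke_whale: 3 edges
# right_whale -> orca: 5 edges
# right_whale -> tortoise: 12 edges
# right_whale -> novel: 22 edges
```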


### 4. Part-of-Speech Tagging
Common packages: jieba (Chinese), Ckip (Chinese), NLTK (English)  

POS tagging identifies the part of speech of each word, for example: verb, noun, adjective, ...

In [5]:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)

In [7]:
# After tokenizing the sentence, tag the part of speech of each token
nltk.download('averaged_perceptron_tagger')
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Student\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
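
The tagger returns ordinary (word, tag) tuples, so the result can be filtered and counted with the standard library alone. A small sketch that copies the tagged sentence from the output above instead of calling NLTK:

```python
from collections import Counter

# Tagged sentence copied from the pos_tag output above.
tagged_sent = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
               ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
               ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
               ('lazy', 'JJ'), ('dog', 'NN')]

# How often does each Penn Treebank tag occur?
tag_counts = Counter(tag for _, tag in tagged_sent)
print(tag_counts['JJ'], tag_counts['NN'])  # 3 2

# Penn Treebank noun tags all start with NN, adjective tags with JJ.
nouns = [word for word, tag in tagged_sent if tag.startswith('NN')]
adjectives = [word for word, tag in tagged_sent if tag.startswith('JJ')]
print(nouns)       # ['fox', 'dog']
print(adjectives)  # ['brown', 'quick', 'lazy']
```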


In [15]:
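# pos_tag_sents() tags several tokenized sentences in one call

# (Incorrect!)
from nltk.book import  *  # unnecessary here; it fails below because the gutenberg corpus has not been downloaded
tagged_sent = nltk.pos_tag_sents(sentence)  # expects a list of tokenized sentences (a list of lists of str), not a single string
print(tagged_sent)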

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.


LookupError: 
**********************************************************************
  Resource gutenberg not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('gutenberg')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/gutenberg

  Searched in:
    - 'C:\\Users\\Student/nltk_data'
    - 'C:\\Users\\Student\\anaconda3\\envs\\NLP\\nltk_data'
    - 'C:\\Users\\Student\\anaconda3\\envs\\NLP\\share\\nltk_data'
    - 'C:\\Users\\Student\\anaconda3\\envs\\NLP\\lib\\nltk_data'
    - 'C:\\Users\\Student\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [18]:
# (Correct!) Split the text into sentences and tokenize each one before tagging
text1 = 'European Parliament Vice President Nicola Beer in Taipei yesterday urged China to refrain from “threatening gestures” that could alter the “status quo” in the Taiwan Strait.'
text2 = 'Eligibility to receive a second COVID-19 vaccine booster is to be expanded to include adults aged 50 or older from tomorrow, the Central Epidemic Command Center (CECC) said yesterday.'
tagged_sent = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text2))
print(tagged_sent)

[[('Eligibility', 'NN'), ('to', 'TO'), ('receive', 'VB'), ('a', 'DT'), ('second', 'JJ'), ('COVID-19', 'JJ'), ('vaccine', 'NN'), ('booster', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('expanded', 'VBN'), ('to', 'TO'), ('include', 'VB'), ('adults', 'NNS'), ('aged', 'VBN'), ('50', 'CD'), ('or', 'CC'), ('older', 'JJR'), ('from', 'IN'), ('tomorrow', 'NN'), (',', ','), ('the', 'DT'), ('Central', 'NNP'), ('Epidemic', 'NNP'), ('Command', 'NNP'), ('Center', 'NNP'), ('(', '('), ('CECC', 'NNP'), (')', ')'), ('said', 'VBD'), ('yesterday', 'NN'), ('.', '.')]]


In [27]:
import jieba.posseg
result = jieba.posseg.cut("指揮中心表示，今日新增之26,943例本土病例，為12,577例男性、14,350例女性、16例調查中")
for x in result:
    print(x)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Student\AppData\Local\Temp\jieba.cache
Loading model cost 0.459 seconds.
Prefix dict has been built successfully.


指揮/v
中心/n
表示/v
，/x
今日/t
新增/v
之/f
26/m
,/x
943/m
例/v
本土/n
病例/n
，/x
為/p
12/m
,/x
577/m
例/v
男性/n
、/x
14/m
,/x
350/m
例/v
女性/n
、/x
16/m
例/n
調查/vn
中/f


### 5. Parsing
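Common packages: spaCy, NLTK

Parsing analyzes the grammatical structure of a sentence with a syntax tree in order to determine which interpretation is correct.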

<img src="./NLP_parsing.png">
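
As a rough illustration of building a syntax tree, the sketch below implements a tiny recursive-descent parser in plain Python. The grammar, the lexicon, and the `parse` function are all hypothetical and cover only a handful of words; real parsers rely on much larger grammars or trained models.

```python
# Hypothetical toy grammar: each non-terminal maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Adj", "N"], ["Det", "N"]],
    "VP": [["V", "NP"]],
}
# Hypothetical lexicon: word -> part-of-speech category.
LEXICON = {"the": "Det", "a": "Det", "brown": "Adj", "lazy": "Adj",
           "fox": "N", "dog": "N", "chases": "V"}

def parse(symbol, tokens, pos):
    """Try to parse `symbol` at tokens[pos]; return (tree, next_pos) or None."""
    if symbol not in GRAMMAR:  # word category: check the lexicon
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR[symbol]:
        children, cur = [], pos
        for child in production:
            result = parse(child, tokens, cur)
            if result is None:
                break  # this expansion failed; try the next one
            subtree, cur = result
            children.append(subtree)
        else:
            return (symbol, *children), cur
    return None

tokens = "the brown fox chases the lazy dog".split()
result = parse("S", tokens, 0)
# Accept the parse only if it consumed every token.
tree = result[0] if result and result[1] == len(tokens) else None
print(tree)
```

The returned tree is a nested tuple whose first element is the grammar symbol, so the subject noun phrase is `tree[1]` and the verb phrase is `tree[2]`.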

### 6. Sentence Breaking
Common packages: spaCy (English), NLTK (English)

Sentence breaking splits a paragraph or document into its individual sentences.
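
A naive baseline helps show why dedicated sentence tokenizers exist. The sketch below splits after ".", "!" or "?" followed by whitespace; the `naive_sent_split` function is a hypothetical helper, not part of NLTK or spaCy.

```python
import re

def naive_sent_split(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sent_split("It is a board game. It is very old! Do you play?"))
# ['It is a board game.', 'It is very old!', 'Do you play?']

# The rule breaks on abbreviations, which trained tokenizers are designed to handle.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```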

##### NLTK

In [21]:
from nltk.tokenize import sent_tokenize
text = 'Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.'
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence+'\n')

Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.



In [29]:
# Combined example: word_tokenize + pos_tag + lemmatize
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)
tagged_sent = pos_tag(tokens)

In [26]:
print(tokens)

['football', 'is', 'a', 'family', 'of', 'team', 'sports', 'that', 'involve', ',', 'to', 'varying', 'degrees', ',', 'kicking', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']


In [28]:
print(tagged_sent)

[('football', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('family', 'NN'), ('of', 'IN'), ('team', 'NN'), ('sports', 'NNS'), ('that', 'WDT'), ('involve', 'VBP'), (',', ','), ('to', 'TO'), ('varying', 'VBG'), ('degrees', 'NNS'), (',', ','), ('kicking', 'VBG'), ('a', 'DT'), ('ball', 'NN'), ('to', 'TO'), ('score', 'VB'), ('a', 'DT'), ('goal', 'NN'), ('.', '.')]


In [32]:
# Note that sports is reduced to sport, and is is reduced to be
wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN # map the Penn Treebank tag to a WordNet POS, falling back to noun
    lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos)) # lemmatize the word using the POS from the previous line
print(lemmas_sent)

['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']


##### spacy

In [2]:
import spacy
# If the next line fails, run this in a terminal first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

# Split the text into sentences
for item in doc.sents:
    print(item.text)

# Store the sentences in a list
sentences_as_list = list(doc.sents)
len(sentences_as_list)

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


3

In [3]:
import random
random.choice(sentences_as_list)

# Show the lemma (base form) of each token
for word in doc:
    print(word.text, word.lemma_)

All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They they
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .
Everyone everyone
has have
the the
right right
to to
life life
, ,
liberty liberty
and and
security security
of of
person person
. .


In [4]:
sentence = list(doc.sents)[1]
# Print each token in the sentence
for word in sentence:
    print(word.text)

They
are
endowed
with
reason
and
conscience
and
should
act
towards
one
another
in
a
spirit
of
brotherhood
.


In [5]:
# Print the attributes of each token (text / coarse POS / fine-grained tag)
for item in doc:
    print(item.text, item.pos_, item.tag_)

All DET DT
human ADJ JJ
beings NOUN NNS
are AUX VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are AUX VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should AUX MD
act VERB VB
towards ADP IN
one NUM CD
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood NOUN NN
. PUNCT .
Everyone PRON NN
has VERB VBZ
the DET DT
right NOUN NN
to ADP IN
life NOUN NN
, PUNCT ,
liberty NOUN NN
and CCONJ CC
security NOUN NN
of ADP IN
person NOUN NN
. PUNCT .


In [6]:
# Keep only the verbs
verbs = []
for item in doc:
    if item.pos_ == 'VERB':
        verbs.append(item.text)
print(verbs)

['born', 'endowed', 'act', 'has']


In [8]:
doc2 = nlp("John McCain and I visited the Apple Store in Manhattan.")
# Recognize named entities
for item in doc2.ents:
    print(item)

John McCain
the Apple Store
Manhattan


In [9]:
# Show the entity type (label) of each recognized entity
for item in doc2.ents:
    print(item.text, item.label_)

John McCain PERSON
the Apple Store ORG
Manhattan GPE


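# Syntax
Syntax refers to the structure of a sentence and the relationships between its words.  
NLP handles syntax with the following techniques:  
1. Named Entity Recognition (NER): labeling the category of words and phrases in a text (e.g. time, location, organization, etc.)  
2. Word Sense Disambiguation: reading the surrounding context to assign the appropriate meaning to a word.  
3. Natural Language Generation (NLG): generating text based on historical data.  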
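
Word Sense Disambiguation can be sketched with a simplified Lesk-style overlap count: choose the sense whose gloss shares the most content words with the context. Everything below is an assumption for illustration: the `SENSES` table holds two short glosses written for this example, the stopword list is minimal, and `simplified_lesk` is not a library function.

```python
# Hypothetical glosses for two senses of "trunk", written for this example.
SENSES = {
    "trunk.n.01": "the main stem of a tree; usually covered with bark",
    "luggage_compartment.n.01": "compartment in an automobile that carries luggage or shopping or tools",
}
STOPWORDS = {"the", "a", "an", "of", "in", "or", "that", "with", "is", "to", "and"}

def content_words(text):
    """Lowercase the words, strip simple punctuation, and drop stopwords."""
    return {w.strip(".,;").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

print(simplified_lesk("He put the luggage in the trunk of the automobile", SENSES))
# luggage_compartment.n.01
print(simplified_lesk("The bark on the trunk of the old tree was rough", SENSES))
# trunk.n.01
```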