## 2023 ChemiCloud Task 3 
Keyword tuple extraction.
Can this be treated as two subtasks, NER + RE (Relation Extraction), and solved with existing checkpoints?
- [CKIP tagger NER paper](https://arxiv.org/pdf/1908.11046.pdf) (results for the Att-BiLSTM-CNN model)
- [CKIP uses the OntoNotes entity type list](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)

Entity | Description
|---|---|
PERSON|People, including fictional
NORP | Nationalities or religious or political groups
FACILITY| Buildings, airports, highways, bridges, etc.
ORGANIZATION | Companies, agencies, institutions, etc.
GPE | Countries, cities, states
LOCATION | Non-GPE locations, mountain ranges, bodies of water
PRODUCT | Vehicles, weapons, foods, etc. (Not services)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK OF ART| Titles of books, songs, etc.
LAW | Named documents made into laws 
LANGUAGE| Any named language
DATE | Absolute or relative dates or periods
TIME | Times smaller than a day
PERCENT| Percentage (including “%”)
MONEY | Monetary values, including unit
QUANTITY| Measurements, as of weight or distance
ORDINAL| “first”, “second”
CARDINAL| Numerals that do not fall under another type
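
To make the NER + RE split more concrete, here is a minimal sketch of a co-occurrence baseline for the RE half: entities that fall inside the same sentence are paired as candidate tuples, which a relation classifier could then accept or reject. The helper names (`sentence_spans`, `candidate_pairs`), the `(start, end, label)` tuple layout, and the `FOOD` / `CHEMICAL` labels are assumptions for illustration; they are not part of CKIP or OntoNotes.

```python
import re
from itertools import combinations


def sentence_spans(text):
    """Yield (start, end) character spans, splitting after 。！？ or a newline."""
    start = 0
    for m in re.finditer(r"[。！？\n]", text):
        yield start, m.end()
        start = m.end()
    if start < len(text):
        yield start, len(text)


def candidate_pairs(text, entities):
    """Pair up entities (start, end, label) that share a sentence."""
    pairs = []
    for s, e in sentence_spans(text):
        inside = [ent for ent in entities if s <= ent[0] and ent[1] <= e]
        pairs.extend(combinations(inside, 2))
    return pairs


# Hypothetical example sentence and hand-written entity spans
text = "哈根達斯的香草冰淇淋被驗出環氧乙烷。食藥署表示將持續抽驗。"
entities = [(0, 4, "ORG"), (5, 10, "FOOD"), (13, 17, "CHEMICAL"), (18, 21, "ORG")]
pairs = candidate_pairs(text, entities)
for a, b in pairs:
    print(text[a[0]:a[1]], a[2], "<->", text[b[0]:b[1]], b[2])
```

Co-occurrence over-generates by design, so it is only a starting point for whatever RE checkpoint ends up being used.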

- ckip tagger 
- link: https://github.com/ckiplab/ckiptagger 

|F1 score| OntoNotes 5.0 |WNUT 
|---|---|---|
|ckip tagger NER |88.4% ± 0.18|42.26% ± 0.82

- ckip transformers 
- link: https://github.com/ckiplab/ckip-transformers 

|F1 score| OntoNotes 5.0 |
|---|---|
|ckip transformers NER (bert-base-chinese) | 81.17% |


In [2]:
import os 
import pandas as pd 
root = '/home/nanaeilish/projects/2023-chemicloud/'
data_dir = os.path.join(root, 'data')
filename = '食品安全_ws.csvpkl'
df = pd.read_pickle(os.path.join(data_dir, filename))


In [4]:
from ckip_transformers.nlp import CkipNerChunker

titlecol = '新聞標題'  # news title column
textcol = '新聞內容'   # news body column

# Load the CKIP NER checkpoint (BERT-base)
ner_driver = CkipNerChunker(model="bert-base")

texts = df[textcol].tolist()
titles = df[titlecol].tolist()

# Run NER over the article bodies and the titles
df['ner_text'] = ner_driver(texts)
df['ner_title'] = ner_driver(titles)

Tokenization: 100%|██████████| 133/133 [00:00<00:00, 943.43it/s]
Inference: 100%|██████████| 2/2 [01:15<00:00, 37.76s/it]
Tokenization: 100%|██████████| 133/133 [00:00<00:00, 64408.55it/s]
Inference: 100%|██████████| 1/1 [00:02<00:00,  2.16s/it]


In [40]:

from termcolor import colored


colormap = {
    'PERSON': 'red',
    'FOOD': 'blue',
    'PRODUCT': 'blue',
    'CHEMICAL': 'magenta',      
    'ORG': 'green',
    'DATE': 'yellow',
}


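def color_ner_text(text, nertokens):
    """Print text with every entity span suffixed by its label and colored."""
    # Insert markup from the end of the text backwards so that the offsets
    # of earlier spans stay valid; sort here so callers do not have to.
    for nertoken in sorted(nertokens, key=lambda t: t.idx[0], reverse=True):
        b, e = nertoken.idx
        color = colormap.get(nertoken.ner, 'cyan')  # unlisted labels are cyan
        typed_text = text[b:e] + f'_{nertoken.ner}' + ' '
        span = colored(typed_text, color)
        text = text[:b] + span + text[e:]

    print(text)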


In [56]:
news_id = 13 
df.iloc[news_id][titlecol]

'快訊 哈根達斯 香草冰淇淋 再驗出禁用農藥 香港喊停售'

In [57]:



text = df.iloc[news_id][textcol] 
nertokens = df.iloc[news_id]['ner_text'][0] 
color_ner_text(text = text, nertokens = nertokens)

▲[31m哈根達斯_PERSON [0m香草冰淇淋再被驗出致癌物環氧乙烷。（圖／食藥署提供）
記者[31m鄒鎮宇_PERSON [0m／綜合報導
[36m美國_GPE [0m知名冰淇淋品牌「Häagen-Dazs（哈根達斯）」日前進口[36m台灣_GPE [0m的「香草冰淇淋」驗出禁用農藥，被邊境攔下依規定銷毀。沒想到，[32m香港食物環境衞生署_ORG [0m食物安全中心[33m10日_DATE [0m發出公告，[31m哈根達斯_PERSON [0m的香草冰淇淋被驗出[32m歐盟_ORG [0m禁用的農藥環氧乙烷，因此要求業界停止使用或出售。
據《東網》報導，[32m香港食物環境衞生署_ORG [0m食物安全中心發現，哈根達斯[36m75毫升_QUANTITY [0m、[36m100毫升_QUANTITY [0m、[36m473毫升_QUANTITY [0m、[36m9.46公升_QUANTITY [0m裝的香草冰淇淋被驗出農藥環氧乙烷，因此立刻跟進口商溝通，並通知業界停止使用或出售，後續將進行調查。
[31m哈根達斯_PERSON [0m[33m6月21日_DATE [0m時也在[36m香港_GPE [0m被驗出有環氧乙烷，當時[36m香港_GPE [0m[31m哈根達斯_PERSON [0m致歉，並停售、撤回商品，豈料[33m本月10日_DATE [0m又再被驗出含有環氧乙烷。
據悉，[31m哈根達斯_PERSON [0m[33m6月21日_DATE [0m時進口[36m台灣_GPE [0m的香草冰淇淋也驗出環氧乙烷，在邊境被攔下1164盒、[36m5471.34公斤_QUANTITY [0m的產品，因不符合食品安全衛生管理法[36m第15_ORDINAL [0m條有關「農藥殘留容許量標準」規定，須依規定退運或銷毀。
[32m食藥署_ORG [0m北區管理中心簡任技正[31m吳宗熹_PERSON [0m過去受訪時指出，環氧乙烷為農藥一種，具致癌風險，依目前規定不在食品中檢出，國內也沒有核准作為農藥使用，因此進口產品也不得檢出。
►按這訂閱Podcast《[36m小編沒收工_WORK_OF_ART [0m》每天熱門話題聊不完


### NER 
- spaCy NER 
- ckip-transformers NER 
- ckip tagger NER (not as good as ckip-transformers)
#### References 
https://www.jsjkx.com/EN/10.11896/jsjkx.200800181 -> food safety news 
#### Issue: the chemical substance 環氧乙烷 (ethylene oxide) is not detected (Chemical NER) 
-> Fall back on 易庭's dictionary tree  
-> Droidtown (卓騰) Articut Chemical NER (Python API available): https://blog.droidtown.co/post/643573663484067840/chemical 
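
The dictionary-tree option can be sketched as a character trie with longest-match lookup. This is a generic illustration, not 易庭's implementation: `build_trie`, `find_terms`, the `"$end"` marker key, and the two-entry term list are all assumptions made for the example.

```python
def build_trie(terms):
    """Build a nested-dict character trie; "$end" marks a complete term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def find_terms(text, trie, label="CHEMICAL"):
    """Scan left to right, keeping the longest dictionary match at each position."""
    matches = []
    i = 0
    while i < len(text):
        node, longest = trie, None
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if "$end" in node:
                longest = j + 1
        if longest is None:
            i += 1
        else:
            matches.append((i, longest, text[i:longest], label))
            i = longest  # continue after the match, so matches never overlap
    return matches


# Hypothetical dictionary; a real one would come from a curated chemical list
chem_trie = build_trie(["環氧乙烷", "環氧"])
sample = "香草冰淇淋被驗出農藥環氧乙烷，因此要求停售。"
chem_matches = find_terms(sample, chem_trie)
print(chem_matches)
```

Because the offsets come straight from scanning the raw text, they can be passed to `color_ner_text` without further conversion.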


#### Issue: 香草冰淇淋 (vanilla ice cream) is not detected (food NER)  
-> Droidtown (卓騰) Articut Food NER: https://api.droidtown.co/ArticutAPI/document/#ArticutAPI
 
- foodBERT
  https://github.com/chambliss/foodbert -> English only  
- Chinese NER datasets: https://zhuanlan.zhihu.com/p/529541521
    - Food-related NER dataset: Wanchuang Cup (萬創杯) Traditional Chinese Medicine NER dataset
      https://aiqianji.com/openoker/Chinese-DeepNER-Pytorch 
    - It includes food entity annotations, but they seem to lean toward whole, unprocessed foods:
```
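FOOD: A substance that meets the body's normal physiological and biochemical energy needs and can sustain a normal lifespan. For humans, a substance that meets the needs of normal daily activity and helps prolong life is called food. Examples: 苹果 (apple), 茶 (tea), 木耳 (wood ear mushroom), 萝卜 (radish)
FOOD_GROUP: The dietary regimen of traditional Chinese medicine divides foods into four natures: cold, hot, warm, and cool. The collective name that TCM contraindications use for foods sharing a common property is recorded as a food group. Examples: 油腻食物 (greasy food), 辛辣食物 (spicy food), 凉性食物 (cool-natured food)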
```
- English food datasets: 
- TASTEset (Recipe Dataset and Food Entities Recognition Benchmark): https://github.com/taisti/tasteset 
```
ingredients,ingredients_entities
"5 ounces rum
4 ounces triple sec
3 ounces Tia Maria
20 ounces orange juice
","[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""entity"": ""5""},{""start"": 2, ""end"": 8, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 9, ""end"": 12, ""type"": ""FOOD"", ""entity"": ""rum""},{""start"": 13, ""end"": 14, ""type"": ""QUANTITY"", ""entity"": ""4""},{""start"": 15, ""end"": 21, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 22, ""end"": 32, ""type"": ""FOOD"", ""entity"": ""triple sec""},{""start"": 33, ""end"": 34, ""type"": ""QUANTITY"", ""entity"": ""3""},{""start"": 35, ""end"": 41, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 42, ""end"": 51, ""type"": ""FOOD"", ""entity"": ""Tia Maria""},{""start"": 52, ""end"": 54, ""type"": ""QUANTITY"", ""entity"": ""20""},{""start"": 55, ""end"": 61, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 62, ""end"": 74, ""type"": ""FOOD"", ""entity"": ""orange juice""}]"
``` 
- FoodBase corpus 
https://academic.oup.com/database/article/doi/10.1093/database/baz121/5611291?login=true
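
For reference, the TASTEset sample shown earlier stores entities as a JSON string inside a CSV column, with character offsets into the `ingredients` field. The sketch below parses a shortened copy of that row and checks that every `start`/`end` pair really slices out its `entity` text. The helper names `load_rows` and `misaligned` are assumptions; only the column names and the record layout come from the sample.

```python
import csv
import io
import json

# Shortened copy of the TASTEset row shown above (first two ingredient lines)
SAMPLE_CSV = '''ingredients,ingredients_entities
"5 ounces rum
4 ounces triple sec
","[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""entity"": ""5""},{""start"": 2, ""end"": 8, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 9, ""end"": 12, ""type"": ""FOOD"", ""entity"": ""rum""},{""start"": 13, ""end"": 14, ""type"": ""QUANTITY"", ""entity"": ""4""},{""start"": 15, ""end"": 21, ""type"": ""UNIT"", ""entity"": ""ounces""},{""start"": 22, ""end"": 32, ""type"": ""FOOD"", ""entity"": ""triple sec""}]"
'''


def load_rows(csv_text):
    """Parse the CSV and decode the JSON-encoded entity column of each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        rows.append((row["ingredients"], json.loads(row["ingredients_entities"])))
    return rows


def misaligned(text, entities):
    """Return the entities whose offsets do not slice out their own text."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["entity"]]


rows = load_rows(SAMPLE_CSV)
text, entities = rows[0]
foods = [e["entity"] for e in entities if e["type"] == "FOOD"]
print(foods)
print(misaligned(text, entities))
```

The same alignment check is a cheap way to catch offset bugs in any other span-annotated source before merging it with CKIP output.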


In [88]:
from ArticutAPI import Articut

articut = Articut()
# Default: public quota mode,
# which replenishes 2,000 characters per hour

inputSTR = df.iloc[news_id][textcol]

result = articut.parse(inputSTR)
food_ner_tokens = articut.NER.getFood(result)
chemi_ner_tokens = articut.getChemicalLIST(result)
print(food_ner_tokens)  # offsets are wrong: they do not index into inputSTR
print(chemi_ner_tokens)

[[[118, 178, '草冰淇淋']], [], [], [], [], [], [], [], [], [], [], [], [[46, 76, '冰淇淋']], [], [], [], [], [], [], [], [], [[0, 57, '香草冰淇淋']], [], [], [], [], [], [], [], [[80, 117, '中心']], [], [[76, 137, '香草冰淇淋']], [], [], [], [], [], [], [], [], [], [], [[80, 117, '中心']], [], [], [], [], [], [], [], [[101, 162, '香草冰淇淋']], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[205, 266, '香草冰淇淋']], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[51, 84, '管理中心']], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
[[(319, 323, '環氧乙烷')], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [(341, 345, '環氧乙烷')], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [(280, 284, '環氧乙烷')], [], [], [], [], [], [], [], [], [(313, 317, '環氧乙烷')], [], [], [], [], [], [], [], [(275, 279, '環氧乙烷')], [], [], [], [], [(337, 341, '環氧乙烷')], [], [], [], [], [], [], [],

In [72]:
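from easydict import EasyDict as edict


def format_ner_tokens(ner_tokens, ner_type='FOOD'):
    """Flatten Articut's per-segment match lists into CKIP-style tokens."""
    # Each match is (start, end, text). The start offset is recomputed as
    # end - len(text) instead of being taken from the match.
    return [edict({'idx': (end - len(word), end),
                   'text': word,
                   'ner': ner_type})
            for matches in ner_tokens
            for _, end, word in matches]


format_food_tokens = format_ner_tokens(food_ner_tokens, ner_type='FOOD')
format_chemi_tokens = format_ner_tokens(chemi_ner_tokens, ner_type='CHEMICAL')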


In [78]:
format_food_tokens

[{'idx': [174, 178], 'text': '草冰淇淋', 'ner': 'FOOD'},
 {'idx': [73, 76], 'text': '冰淇淋', 'ner': 'FOOD'},
 {'idx': [52, 57], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [115, 117], 'text': '中心', 'ner': 'FOOD'},
 {'idx': [132, 137], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [115, 117], 'text': '中心', 'ner': 'FOOD'},
 {'idx': [157, 162], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [261, 266], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [80, 84], 'text': '管理中心', 'ner': 'FOOD'}]

In [87]:
# add the extra tokens into ckip ner tokens 
# df.iloc[13]['ner_text'][0] + 
# ner_tokens = format_food_tokens + format_chemi_tokens 
# Sort by start offset so that back-to-front insertion in color_ner_text keeps offsets valid
ner_tokens = sorted(format_chemi_tokens, key = lambda x: x.idx[0])
color_ner_text(text = text, nertokens = ner_tokens)

▲哈根達斯香草冰淇淋再被驗出致癌物環氧乙[35m烷。（圖_CHEMICAL [0m／食藥署提供）
記者鄒鎮宇／綜合報導
美國知名冰淇淋品牌「Häagen-Dazs（哈根達斯）」日前進口台灣的「香草冰淇淋」驗出禁用農藥，被邊境攔下依規定銷毀。沒想到，香港食物環境衞生署食物安全中心10日發出公告，哈根達斯的香草冰淇淋被驗出歐盟禁用的農藥環氧乙烷，因此要求業界停止使用或出售。
據《東網》報導，香港食物環境衞生署食物安全中心發現，哈根達斯75毫升、100毫升、473毫升、9.46公升裝的香草冰淇淋被驗出農藥環氧乙烷，因此立刻跟進口商溝通，並通知業界停止使用或出售，後續將進行調查。
哈根[35m達斯6月_CHEMICAL [0m2[35m1日時也_CHEMICAL [0m在香港被驗出有環氧乙烷，當時香港哈根達斯致歉，並停售、撤回[35m商品，豈_CHEMICAL [0m料本[35m月10日_CHEMICAL [0m又再被驗出含有環氧乙烷。
據[35m悉，哈根_CHEMICAL [0m[35m達斯6月_CHEMICAL [0m21日時進口台灣的香草冰淇淋也驗出環氧乙烷，在邊境被攔下1164盒、5471.34公斤的產品，因不符合食品安全衛生管理法第15條有關「農藥殘留容許量標準」規定，須依規定退運或銷毀。
食藥署北區管理中心簡任技正吳宗熹過去受訪時指出，環氧乙烷為農藥一種，具致癌風險，依目前規定不在食品中檢出，國內也沒有核准作為農藥使用，因此進口產品也不得檢出。
►按這訂閱Podcast《小編沒收工》每天熱門話題聊不完


In [80]:
ner_tokens

[{'idx': [20, 24], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [52, 57], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [73, 76], 'text': '冰淇淋', 'ner': 'FOOD'},
 {'idx': [80, 84], 'text': '管理中心', 'ner': 'FOOD'},
 {'idx': [115, 117], 'text': '中心', 'ner': 'FOOD'},
 {'idx': [115, 117], 'text': '中心', 'ner': 'FOOD'},
 {'idx': [132, 137], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [157, 162], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [174, 178], 'text': '草冰淇淋', 'ner': 'FOOD'},
 {'idx': [261, 266], 'text': '香草冰淇淋', 'ner': 'FOOD'},
 {'idx': [275, 279], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [280, 284], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [313, 317], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [319, 323], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [337, 341], 'text': '環氧乙烷', 'ner': 'CHEMICAL'},
 {'idx': [341, 345], 'text': '環氧乙烷', 'ner': 'CHEMICAL'}]
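
The token list above contains an exact duplicate (`中心` twice) and would overlap with CKIP spans once the two sources are combined, as the commented-out lines in the previous cell intend. One possible merge policy, sketched here under stated assumptions, is to drop exact duplicates and keep the longer span whenever two spans overlap. Spans are plain `(start, end, label)` tuples rather than `NerToken` or `edict` objects, and `merge_spans` is a hypothetical helper.

```python
def merge_spans(*span_lists):
    """Merge (start, end, label) spans from several sources.

    Exact duplicates are dropped; when two spans overlap, the longer one
    wins (ties go to the span that starts earlier).
    """
    unique = sorted({s for spans in span_lists for s in spans},
                    key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for span in unique:
        # Keep the span only if it is disjoint from everything kept so far
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# Hypothetical spans standing in for CKIP output and the extra FOOD/CHEMICAL tokens
ckip_spans = [(0, 4, "PERSON"), (30, 33, "DATE")]
extra_spans = [(5, 10, "FOOD"), (7, 10, "FOOD"),
               (13, 17, "CHEMICAL"), (13, 17, "CHEMICAL")]
merged = merge_spans(ckip_spans, extra_spans)
print(merged)
```

Preferring the longer span is a judgment call; a source-priority rule (for example, always trusting the chemical dictionary) would be an equally reasonable alternative.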

In [None]:
# TODO: fix the indexing issue by re-locating each entity in the raw text with a regex match
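
A minimal sketch of the regex-based fix, assuming only the surface strings returned by Articut are trusted and the offsets are discarded: each string is searched for in the raw text with `re.finditer`, longest strings first, and a match is skipped when it falls inside a span that is already taken. The function name `reindex_by_regex` and the sample text are assumptions for illustration.

```python
import re


def reindex_by_regex(text, surface_forms, label):
    """Locate every occurrence of each surface form in text.

    Longer forms are matched first, and a match is skipped if it overlaps
    a span already taken (so '冰淇淋' inside '香草冰淇淋' is not reported twice).
    """
    taken, spans = [], []
    for form in sorted(set(surface_forms), key=lambda f: (-len(f), f)):
        for m in re.finditer(re.escape(form), text):
            if any(m.start() < e and b < m.end() for b, e in taken):
                continue
            taken.append((m.start(), m.end()))
            spans.append({"idx": (m.start(), m.end()), "text": form, "ner": label})
    return sorted(spans, key=lambda s: s["idx"])


# Hypothetical text; the surface forms mirror the Articut output shown above
raw = "哈根達斯香草冰淇淋再被驗出致癌物環氧乙烷。知名冰淇淋品牌"
food_spans = reindex_by_regex(raw, ["草冰淇淋", "冰淇淋", "香草冰淇淋", "中心"], "FOOD")
chem_spans = reindex_by_regex(raw, ["環氧乙烷", "環氧乙烷"], "CHEMICAL")
print(food_spans)
print(chem_spans)
```

This repairs the offsets but not the labels: false positives such as `中心` would still be tagged `FOOD` wherever they occur, so a stop list or a dictionary check is still needed.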