# Recommend Keywords
To automatically recommend Wikidata keywords to users, we need to accomplish two things.

First, we tokenize the input string and find which tokens can serve as keywords for the sentence. We use ckip-tagger, an NLP tool developed by the CKIP Lab of Academia Sinica. With its help, we can run named entity recognition (NER) on the input sentence and obtain candidate words that are potentially Wikidata keywords.

Second, after getting the list of candidate words, we check whether each one is a Wikidata keyword. To do so, we send a request to the Wikidata API and search for the word to see whether it is recorded in Wikidata.

After these two steps, we obtain a list of words that appear in the input string and are also Wikidata keywords. This list is what we recommend to users.

## Load Data
Before we start trying this feature, we'll load the input data from Depositar.

Previously, we downloaded metadata of datasets from Depositar through its API, randomly selected 10 datasets, and stored them in a file named `example_depositar_data.json` in the `04_data/` directory. Since we'll only use this as an example input, there's no need to update this file, and the code for calling the API is not included in this notebook.
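
For orientation, each record in the file is assumed to have roughly the shape sketched below. The key names are inferred from the fields that `get_metadata` reads; the values are placeholders rather than real Depositar data, and real records likely carry more fields.

```python
import json

# Hypothetical record: only the keys read by get_metadata are shown,
# and every value is a placeholder.
example_record = {
    "title": "<dataset title>",
    "notes": "<dataset description>",
    "resources": [
        {"name": "<resource name>", "description": "<resource description>"},
    ],
    "organization": {
        "title": "<organization title>",
        "description": "<organization description>",
    },
}

# The file is assumed to hold a JSON list of such records (indices 0 to 9).
example_file_content = json.dumps([example_record], ensure_ascii=False, indent=2)
print(example_file_content)
```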

### Obtain metadata from datasets

In [1]:
import json
import pandas as pd
# function definition
def get_metadata(data, data_index):
    with open(data, 'r', encoding='utf-8') as file:
        data = json.load(file)

        title = data[data_index]['title']
        notes = data[data_index]['notes']

        resources_names = []
        resources_desps = []
        for item in data[data_index]['resources']:
            if 'name' in item:
                resources_names.append(item['name'])
                # a resource may lack a description; fall back to an empty string
                resources_desps.append(item.get('description', ''))

        organization_title = data[data_index]['organization']['title']
        organization_desp = data[data_index]['organization']['description']

    df = pd.DataFrame({
        'Title': [title],
        'Notes': [notes],
        'Resource Names': [resources_names],
        'Resource Descriptions': [resources_desps],
        'Organization Title': [organization_title],
        'Organization Description': [organization_desp]
    })

    return df

We can choose one dataset by its index (from 0 to 9):

In [2]:
dataset_idx = 9

### Output

In [3]:
# the example file holds 10 datasets, so valid indices are 0 to 9
if 0 <= dataset_idx <= 9:
    df = get_metadata('../04_data/example_depositar_data.json', dataset_idx)
    input_list = []
    #data_string = df.to_string(index=False, header=False)
    for entity in df:
        print(entity, ':', df[entity][0])
        input_list.append(df[entity][0])
else:
    print('input number in the interval from 0 to 9')

Title : 第二批次<提案七> 高雄市旗津區中洲漁港水環境改善計畫-中洲漁港老舊碼頭、疏浚及景觀營造
Notes : 本計畫工程完工後，改善中洲漁港老舊碼頭，維持漁港既有功能外，提供漁民停泊之安全。並清理港區水域淤泥及廢棄物，提供足夠水深。部分設施老舊問題著手進行改善及周邊整體景觀改造，讓整體港區呈現全新的風貌，更提升港區環境品質。
計畫內容包含:
(1) 碼頭工程210公尺(B區102公尺+C區108公尺)
(2) 遮休憩設施(約263m)及欄杆(約67m)
(3) 意象設施佈告欄3座
(4) 環境綠美化(約330m)
Resource Names : ['「全國水環境改善計畫」【高雄市旗津區中洲漁港水環境改善計畫】 整體計畫工作計畫書', '高雄市旗津區中洲漁港水環境改善計畫簡報', '資訊公開查核表_中洲漁港老舊碼頭疏浚及景觀營造', '1080628施工前說明會議紀錄', '工程介紹及說明_老舊碼頭疏浚及景觀營造', '生態檢核自評表_中洲漁港老舊碼頭疏浚及景觀營造', '全民督工專線和專人聯繫窗口', '生態檢核專章報告']
Resource Descriptions : ['中華民國一 O 六年十月', '提案機關：高雄市政府', '施工階段', '1080708-高市海洋工字第10831803800號', '簡報單位:誠蓄工程顧問(股)有限公司', '計畫提報/調查設計/施工階段', '工程告示牌', '中華民國 109 年 12 月 ']
Organization Title : 「全國水環境改善計畫」高雄市政府生態檢核暨相關工作計畫
Organization Description : 經濟部研擬「全國水環境改善計畫」，透過跨部會協調整合，積極推動治水、淨水、親水一體，推動結合生態保育、水質改善及周邊地景之水環境改善，以加速改善全國水環境，期能恢復河川生命力及親水永續水環境。
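
Note that `Resource Names` and `Resource Descriptions` are lists, so `input_list` mixes plain strings with lists of strings. If the NER step is meant to receive one string per entry, the list-valued fields can be joined first. A minimal sketch, where `flatten_inputs` is a hypothetical helper and the newline separator is an arbitrary choice:

```python
def flatten_inputs(values, sep="\n"):
    """Return one string per entry, joining list-valued entries with `sep`."""
    flat = []
    for value in values:
        if value is None:
            flat.append("")  # treat a missing field as empty text
        elif isinstance(value, (list, tuple)):
            flat.append(sep.join(str(v) for v in value))
        else:
            flat.append(str(value))
    return flat

# Placeholder values standing in for the metadata fields printed above.
sample = ["title text", ["resource A", "resource B"], None, "notes text"]
print(flatten_inputs(sample))
```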


## Step 1: NER task

### Import Models
Here we import the transformer models and run NER on our input data.

In [4]:
# Import model
from transformers import BertTokenizerFast, AutoModelForTokenClassification
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/bert-tiny-chinese-ner')

# NLP task model
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
ws_driver  = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")


### NER task

In [5]:
# NER task
ner = ner_driver(input_list)

### Output
Below is the list of entities returned by NER, excluding numeric and date classes:

In [6]:
# Show results
avoid_class = ['QUANTITY', 'CARDINAL', 'DATE', 'ORDINAL']
keyword_map = {}
for sentence_ner in ner:
    for entity in sentence_ner:
        if entity[1] in avoid_class:
            continue
        keyword_map[entity[0]] = entity[1]

for key, value in keyword_map.items():
    print(key, ':', value)

高雄市 : GPE
旗津區 : GPE
中洲 : GPE
高雄市政府 : ORG
經濟部 : ORG


## Step 2: Searching through Wikidata API
Now that we have the candidate words from the previous step, we check whether each one is a Wikidata keyword.

### Request Wikidata API

In [7]:
import requests

def wiki_search(search_term):
    # pass the query via `params` so that requests URL-encodes the search term
    url = "https://www.wikidata.org/w/api.php"
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "search": search_term,
        "language": "zh",
    }

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    # organize the response; an empty "search" list also means no results
    if data.get("search"):
        for result in data["search"]:
            qid = result["id"]
            label = result.get("label", "No label available")
            description = result.get("description", "No description available")
            print(f"QID: {qid}, Label: {label}, Description: {description}")
    else:
        print("No results found.")
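
Since the goal is a list of recommended keywords rather than printed text, it may help to separate parsing from printing. The sketch below works on an already-decoded response dictionary, so it runs offline. `parse_search_results` is a hypothetical helper, and the sample dictionary is hand-written to mimic only the `id`, `label`, and `description` fields that `wiki_search` reads.

```python
def parse_search_results(data):
    """Return (qid, label, description) tuples from a decoded response."""
    results = []
    for result in data.get("search", []):
        results.append((
            result["id"],
            result.get("label", ""),
            result.get("description", "No description available"),
        ))
    return results

# Hand-written sample shaped like the fields the notebook reads.
sample_response = {
    "search": [
        {"id": "Q706704", "label": "Cijin District",
         "description": "district of Kaohsiung City, Taiwan"},
    ]
}
print(parse_search_results(sample_response))
print(parse_search_results({"search": []}))  # empty list when nothing matches
```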

### Output

Here are the search results:

In [8]:
for item in keyword_map:
    print(item)
    wiki_search(item)
    print('-------------------------------------------')

高雄市


QID: Q181557, Label: Kaohsiung, Description: special municipality of Taiwan
QID: Q15910959, Label: Kaohsiung City, Description: 1945-1979 provincial city in Taiwan
QID: Q13178092, Label: Kaohsiung City, Description: 1979-2010 special municipality in Taiwan
QID: Q15915730, Label: Takao City, Description: city of Taiwan under Japanese rule
QID: Q3846600, Label: Kaohsiung Museum of Fine Arts, Description: art museum in Kaohsiung, Taiwan
QID: Q713142, Label: Sanmin District, Description: district of Kaohsiung, Taiwan
QID: Q718014, Label: Zuoying District, Description: district of Kaohsiung City, Taiwan
-------------------------------------------
旗津區


QID: Q706704, Label: Cijin District, Description: district of Kaohsiung City, Taiwan
-------------------------------------------
中洲


QID: Q6960335, Label: Nakasu, Description: red-light district in Fukuoka, Japan
QID: Q16075758, Label: Chen Zisheng, Description: Qing dynasty person CBDB = 81558
QID: Q49130357, Label: Nakasu, Description: Japanese family name (中洲, なかす)
QID: Q10875799, Label: Zhongzhou, Description: township in Yueyang, Hunan, China
QID: Q1039189, Label: Nakasu-Kawabata Station, Description: metro station in Fukuoka, Fukuoka prefecture, Japan
QID: Q10875809, Label: 中洲镇 (淳安县), Description: No description available
QID: Q10875808, Label: Zhongzhou, Description: town in Huaiji, Zhaoqing, Guangdong, China
-------------------------------------------
高雄市政府


QID: Q6366077, Label: Kaohsiung City Government, Description: The local government of Kaohsiung
QID: Q113580970, Label: 高雄市政府, Description: No description available
QID: Q72325842, Label: Public Works Bureau,Kaohsiung City Government, Description: No description available
QID: Q60995902, Label: Environmental Protection Bureau, Kaohsiung City Government, Description: government agency of Kaohsiung, Taiwan
QID: Q15918055, Label: Bureau of Cultural Affairs of Kaohsiung, Description: No description available
QID: Q11673241, Label: Mass Rapid Transit Bureau, Kaohsiung City Government, Description: No description available
QID: Q15900495, Label: Kaohsiung City Police, Description: No description available
-------------------------------------------
經濟部


QID: Q697113, Label: Ministry of Economic Affairs, Description: registration authority
QID: Q81886673, Label: Ministry of Economic Affairs and Digital Transformation, Description: No description available
QID: Q3958441, Label: economic sector, Description: conceptual grouping of economic activities
QID: Q31046, Label: Water Resources Agency, MOEA, Description: government agency in Taiwan
QID: Q6865835, Label: Sweden's Minister for Finance, Description: Swedish cabinet minister
QID: Q109797916, Label: 經濟部水利署第二河川局, Description: No description available
QID: Q109798375, Label: 經濟部水利署第七河川局, Description: No description available
-------------------------------------------
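
To arrive at the recommendation list described in the introduction, the candidate words can be filtered by whether the search returned anything. A minimal sketch, in which `recommend_keywords` and the injected `has_results` callable are hypothetical; in the notebook, `has_results` would wrap a call to the Wikidata API.

```python
def recommend_keywords(candidates, has_results):
    """Keep the candidates for which the lookup reports at least one hit."""
    return [word for word in candidates if has_results(word)]

# Offline stand-in for the API lookup, used only to make the sketch runnable.
known = {"高雄市", "旗津區"}
recommended = recommend_keywords(["高雄市", "旗津區", "not-a-keyword"], known.__contains__)
print(recommended)
```

A hit does not guarantee the right entity. In the results above, 中洲 matches places in Japan and China rather than the fishing port in Kaohsiung, so some disambiguation would still be needed before recommending it.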
