# Word associations
Find word associations for a given input word and return representative example words.


## Goal
Find concrete representational words for the input that can easily be visualized. Avoid abstract words and direct synonyms.

## Examples

Example 1:
- Input: `up` or `上`
- Output 1: `balloon,風船,ふうせん`
- Output 2: `ceiling,天井,てんじょう`
- ...

Example 2:
- Input: `down` or `下`
- Output 1: `floor,床,ゆか`
- Output 2: `subway,地下鉄,ちかてつ`



In [3]:
import pandas as pd
import numpy as np

# Step 1 Word embeddings corpus

Load the pretrained chiVe word embeddings: https://github.com/WorksApplications/chiVe

In [4]:
import gensim

embeddings_chive = gensim.models.KeyedVectors.load("data/chive/chive-1.2-mc90_gensim/chive-1.2-mc90.kv")

In [5]:
input_token = '上'
negative_token = '下'

In [6]:
input_token_embedding = embeddings_chive[input_token]

### Get embedding of token

In [7]:
(input_token, input_token_embedding[0:3])

('上', array([0.05878441, 0.06310406, 0.11182886], dtype=float32))

### Find most similar words in corpus based on word-embedding
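
For intuition, `most_similar` ranks vocabulary entries by the cosine similarity between their vectors and the query vector. The following is a minimal sketch of that ranking with toy 2-dimensional vectors. The words, the values, and the helper names `cosine` and `rank_by_similarity` are illustrative assumptions, not part of chiVe or gensim.

```python
import math

# Toy 2-dimensional vectors (hypothetical values, not taken from chiVe).
toy_vectors = {
    "up": [1.0, 0.0],
    "above": [0.9, 0.1],
    "balloon": [0.6, 0.8],
    "floor": [-1.0, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_word, vectors, topn=10):
    """Return (word, score) pairs sorted by similarity to query_word, excluding the query itself."""
    query = vectors[query_word]
    scores = [(word, cosine(query, vec)) for word, vec in vectors.items() if word != query_word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = rank_by_similarity("up", toy_vectors)
print([word for word, _ in ranked])
```

Because the score depends only on how often words appear in similar contexts, frequent function words and antonyms can rank high, which is what the real query on `上` shows.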

In [8]:
# Each result is a (word, similarity) pair; keep only the word.
most_similar=[word[0] for word in embeddings_chive.most_similar(input_token, topn=10)]
print(f'{input_token}: | Outputs: {most_similar}')

上: | Outputs: ['下', 'と', '様', 'て', 'に', 'を', '、', 'だけ', '其れ', 'が']


The most similar words include particles and other stop words, the antonym (下), synonyms, and abstract words (様).

Let's first get rid of the particles.

In [9]:
token_stopwords = ['｜','|','・',' ',' ','も','の','は','が','に','へ','こと','たり','や','し','など','ない','又','なんか','方','もの','事','物',
                   'ご','☆','ですよ','です','よ','と','で','だ','た','て','を','だけ','ます','其れ','ね','','!', '#','%','&',"'",'(',')','+','-','.','/','‐','―',
                   '’','“','”','…','│','。','〇〇','〈','〉','「','」','『','』','【','】','、', '~','×',':','[',']','一','777']

In [10]:
def filter_stopwords(words, token_stopwords):
    return [word for word in words if word not in token_stopwords]

most_similar=[word[0] for word in embeddings_chive.most_similar('上', topn=100)]
most_similar_filtered_stop=filter_stopwords(most_similar, token_stopwords)
print(f'{input_token}: | Outputs: {most_similar_filtered_stop}')

上: | Outputs: ['下', '様', '居る', '乗せる', '其の', '中', '其処', '場合', '乗っかる', 'ば', '此れ', '部分', '此の', '有る', '乗っける', '更に', '際', '為る', 'おく', '置く', '等', '代わり', 'から', '成る', '或いは', '所', '同様', '形', '敷く', '勿論', '若しくは', '実際', '侭', 'そして', '言う', '基本的', '為', '周り', '一般的', 'ながら', '出来る', '無い', '間', '大きな', '先ず', '時', '自体', '上部', '良く', '端', '合う', '直接', '筈', '予め', '尚且つ', '此処', '番下', '単に', 'れる', '常', '食み出る', 'より', '状態', 'みたい', '先程', '思う', '全て', '全体', '沿う', '訳', '必ず', '右側', '以外']


The filtered list still contains the antonym (下). Let's remove it by passing the antonym as a negative example.
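
As a rough sketch of what a positive/negative query does: the query vector is built by adding the unit vectors of the positive words and subtracting those of the negative words, and candidates are then ranked by cosine similarity to that combined vector. gensim's actual implementation also averages and weights the inputs, so treat this as an approximation. The toy vectors and the name `most_similar_toy` are assumptions made for illustration.

```python
import math

# Toy vectors (hypothetical): dimension 0 is "vertical direction",
# dimension 1 is a component shared by generic positional words.
toy_vectors = {
    "up": [0.6, 0.8],
    "down": [-0.6, 0.8],
    "balloon": [1.0, 0.1],
    "place": [0.0, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar_toy(vectors, positive, negative=(), topn=10):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the input words."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    inputs = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

plain = most_similar_toy(toy_vectors, positive=["up"])
contrast = most_similar_toy(toy_vectors, positive=["up"], negative=["down"])
print(plain[0][0], contrast[0][0])
```

In this toy setup, subtracting `down` cancels the shared positional component, so the generic word `place` drops and `balloon` moves to the top. With real embeddings the subtraction can also amplify unrelated directions, so the outcome is less predictable.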

In [11]:
# Use the antonym as a negative example to push results away from it.
most_similar=[word[0] for word in embeddings_chive.most_similar(positive=[input_token], negative=[negative_token], topn=100)]
most_similar_filtered_stop=filter_stopwords(most_similar, token_stopwords)
print(f'{input_token}: | Outputs: {most_similar_filtered_stop}')

上: | Outputs: ['不特定', 'インターネット', '匿名性', 'ターネット', 'ネット', '自体', 'ンターネット', '紙媒体', '当方', '送受信', '媒体', '個人情報漏洩', 'ウェブサーバー', '客側', 'パソコン', '性質', '使う', 'ミラーリング', 'ユーザー', 'OS', 'ソフト', 'リアルタイム', '電子データ', '閲覧者', 'mtbf', 'アプリケーションソフト', 'スパム', 'dnsbl', 'PC', '電子メール', '実際', '重々', '単体', 'リティソフト', 'メールソフト', '可用性', 'windowsos', '無い', '社会通念', '物理的', 'コミュニケーションツール', '利用方法', 'オフライン', '情報量', 'クラウド', '加味', '営業目的', '実害', 'スパイウェア', 'インゲーム', '予め', 'ハードディスク', 'ウェブサービス', 'ハードウェア', 'レピュテーション', '店側', '利用状況', '善し悪し', 'パターンファイル', '乗せる', 'cifs', 'ソコン', 'MAC', 'ティソフト', '限る', '事柄', '書き込む', '御客様', 'ウェブブラウザ', '乗っける', '鵜呑み', 'usbhdd', '切り分け', 'uuid', 'linuxpc', 'データベースサーバー', 'Eメール', 'サーバー', 'bitos', '代物', '閲覧', 'ジャニオタ', 'データ', '羅列', 'ipad', '公然性', '中傷', 'バンドル', '一般ユーザー', 'DBサーバー', '思う', 'フリーズ', 'dropbox', '仕事柄', 'データ化', '信頼性', '照らし合わせる', 'ントソフト', 'googledocs', '前提']


The results now make little sense for our goal. Let's keep only nouns.

In [12]:
from sudachipy import dictionary

In [13]:
tokenizer_obj = dictionary.Dictionary().create()  

In [38]:
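def filter_nouns(words):
    # Tokenize each word with Sudachi and keep only its first morpheme.
    morphemes = (tokenizer_obj.tokenize(word)[0] for word in words)
    # part_of_speech()[0] is the top-level part-of-speech tag; '名詞' means noun.
    return [m.dictionary_form() for m in morphemes if m.part_of_speech()[0] == '名詞']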
    
most_similar_nouns = filter_nouns(most_similar_filtered_stop)
most_similar_nouns[0:15]

['気球',
 'バルーン',
 '雲',
 '頭上',
 '夜空',
 '飛行機',
 '上空',
 '自機',
 '円盤',
 'ヒコーキ',
 '筐体',
 '入道雲',
 '半透明',
 '夕空',
 '物体']

The result is still far from the expected output. Let's add more positive and negative example words to the query.

In [39]:
most_similar=[word[0] for word in embeddings_chive.most_similar(positive=['上','天井','風船','空','天面','上司'], negative=['下','床'], topn=500)]
most_similar_filtered_stop=filter_stopwords(most_similar, token_stopwords)
most_similar_nouns = filter_nouns(most_similar_filtered_stop)
print(f'{input_token}: | Outputs: {most_similar_nouns[0:50]}')

上: | Outputs: ['気球', 'バルーン', '雲', '頭上', '夜空', '飛行機', '上空', '自機', '円盤', 'ヒコーキ', '筐体', '入道雲', '半透明', '夕空', '物体', '実機', '画面', '向こう', 'UFO', 'ヘリウムガス', '真上', '尾翼', '丸', '背面', '吹き出し', '画用紙', '水平線', '天辺', '点滅', '飛行船', '球体', '逆様', 'ボトル', '赤外線', 'ポエアリー', '一', '翼面', '立体的', '奴', '半円', '窓', 'ビー玉', '紙吹雪', '綿雲', 'デコレーション', '球形', '夕焼け', 'スヌーピー', 'ジェット', 'ジャンボジェット']


### Add English translation

In [40]:
import goslate
gs = goslate.Goslate()
from jamdict import Jamdict
jam = Jamdict()


In [41]:
most_similar_nouns[0]

'気球'

In [42]:
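# Note: most_similar_nouns[0][0] is only the first character ('気') of '気球',
# so the sense returned below is for 気, not for the whole word.
# Pass most_similar_nouns[0] to look up the full word.
result = jam.lookup(most_similar_nouns[0][0])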
result.entries[0].senses[0]

spirit/mind/heart ((noun (common) (futsuumeishi)))

In [78]:
def translate_words(jp_words):
    translated = []
    for jp_word in jp_words:
        result = jam.lookup(jp_word)
        if result.entries and result.entries[0].kanji_forms:
            result_entry = result.entries[0]
            translated.append((result_entry.kanji_forms[0], result_entry.kana_forms[0], result_entry.senses[0].gloss[0]))
    return translated
translate_words(['飛行機'])

[(飛行機, ひこうき, aeroplane)]

In [79]:
translated = translate_words(most_similar_nouns[0:30])
translated

[(気球, ききゅう, balloon),
 (雲, くも, cloud),
 (頭上, ずじょう, overhead),
 (夜空, よぞら, night sky),
 (飛行機, ひこうき, aeroplane),
 (上空, じょうくう, sky),
 (自機, じき, player character or vehicle (in video games)),
 (円盤, えんばん, disk),
 (筐体, きょうたい, case (of a machine, computer, etc.)),
 (入道雲, にゅうどうぐも, cumulonimbus),
 (半透明, はんとうめい, semi-transparent),
 (夕空, ゆうぞら, evening sky),
 (物体, ぶったい, object),
 (実機, じっき, real machine (as opposed to a model or simulation)),
 (画面, がめん, screen (of a TV, computer, etc.)),
 (向こう, むこう, opposite side),
 (ＵＦＯ, ユーフォー, unidentified flying object),
 (真上, まうえ, just above),
 (尾翼, びよく, tail (of an aircraft)),
 (丸, まる, circle),
 (背面, はいめん, rear),
 (吹き出し, ふきだし, speech balloon (in a comic strip)),
 (画用紙, がようし, drawing paper),
 (水平線, すいへいせん, horizon (related to bodies of water)),
 (天辺, てっぺん, top),
 (点滅, てんめつ, switching on and off (of a light)),
 (飛行船, ひこうせん, airship)]
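
`translate_words` skips any word whose lookup returns no entries or whose first entry has no kanji form, which may be why kana-only words such as バルーン and ヒコーキ are missing from the list above. The following is a minimal sketch of a variant that falls back to the kana form. It relies only on the entry attributes used above (`kanji_forms`, `kana_forms`, `senses[0].gloss`) and assumes that `str()` of each form yields its text. The names `entry_to_triple` and `translate_words_with_fallback`, and the stub lookup used for the demonstration, are hypothetical.

```python
from types import SimpleNamespace

def entry_to_triple(entry):
    """Return (written form, reading, first gloss), or None if the entry is incomplete."""
    if not entry.kana_forms or not entry.senses or not entry.senses[0].gloss:
        return None
    reading = str(entry.kana_forms[0])
    # Fall back to the reading when the entry has no kanji form (kana-only words).
    written = str(entry.kanji_forms[0]) if entry.kanji_forms else reading
    return (written, reading, str(entry.senses[0].gloss[0]))

def translate_words_with_fallback(jp_words, lookup):
    """Translate words with a lookup callable (e.g. jam.lookup), keeping kana-only entries."""
    translated = []
    for jp_word in jp_words:
        result = lookup(jp_word)
        if not result.entries:
            continue  # no dictionary entry at all
        triple = entry_to_triple(result.entries[0])
        if triple is not None:
            translated.append(triple)
    return translated

# Stub dictionary standing in for Jamdict so the sketch runs without the database.
def make_entry(kanji, kana, gloss):
    return SimpleNamespace(kanji_forms=kanji, kana_forms=kana,
                           senses=[SimpleNamespace(gloss=[gloss])])

stub_entries = {
    "気球": make_entry(["気球"], ["ききゅう"], "balloon"),
    "バルーン": make_entry([], ["バルーン"], "balloon"),
}

def stub_lookup(word):
    entry = stub_entries.get(word)
    return SimpleNamespace(entries=[entry] if entry else [])

demo = translate_words_with_fallback(["気球", "バルーン", "未登録"], stub_lookup)
print(demo)
```

With the real dictionary, the call would be `translate_words_with_fallback(most_similar_nouns[0:30], jam.lookup)`.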

In [83]:
def find_words(pos, neg, n):
    # Fetch a large candidate pool (topn=700); the stopword and noun filters below discard many entries.
    most_similar=[word[0] for word in embeddings_chive.most_similar(positive=pos, negative=neg, topn=700)]
    most_similar_filtered_stop=filter_stopwords(most_similar, token_stopwords)
    most_similar_nouns = filter_nouns(most_similar_filtered_stop)
    # Translate only the first n nouns; words that translate_words cannot resolve are dropped.
    return translate_words(most_similar_nouns[0:n])

In [92]:
words = find_words(pos=['上','頂','頂上','トップ','天井','風船','空','天面','上司'], neg=['下','床'], n=50)
words

[(天辺, てっぺん, top),
 (山頂, さんちょう, summit (of a mountain)),
 (雲海, うんかい, sea of clouds),
 (稜線, りょうせん, ridgeline),
 (雲, くも, cloud),
 (山, やま, mountain),
 (上部, じょうぶ, top part),
 (頭上, ずじょう, overhead),
 (上空, じょうくう, sky),
 (下界, げかい, the world),
 (三角錐, さんかくすい, triangular pyramid),
 (山々, やまやま, (many) mountains),
 (水平線, すいへいせん, horizon (related to bodies of water)),
 (中腹, ちゅうふく, halfway up (down) a mountain),
 (真上, まうえ, just above),
 (眼下, がんか, under one's eyes),
 (富士山, ふじさん, Mount Fuji),
 (夜空, よぞら, night sky),
 (地平線, ちへいせん, horizon (related to land)),
 (天空, てんくう, sky),
 (連峰, れんぽう, mountain range),
 (気球, ききゅう, balloon),
 (向こう, むこう, opposite side),
 (岩山, いわやま, rocky mountain),
 (入道雲, にゅうどうぐも, cumulonimbus),
 (景色, けしき, scenery),
 (見晴らし, みはらし, view),
 (雲間, くもま, rift between clouds),
 (半円, はんえん, semicircle),
 (雪山, ゆきやま, snowy mountain),
 (噴煙, ふんえん, (eruption of) smoke),
 (円錐, えんすい, cone),
 (絶景, ぜっけい, superb view),
 (岳, たけ, peak),
 (山肌, やまはだ, mountain's surface),
 (夕空, ゆうぞら, evening sky),
 (逆さま, さかさま, inv

In [95]:
words = find_words(pos=['下','底'], neg=['上'], n=50)
words

[(底部, ていぶ, base),
 (底なし, そこなし, bottomless),
 (上げ底, あげぞこ, false bottom),
 (内側, うちがわ, inside),
 (奥底, おくそこ, depths),
 (底割れ, そこわれ, situation where the bottom has dropped out),
 (凹み, くぼみ, hollow),
 (穴, あな, hole),
 (鍋底, なべぞこ, (inner) bottom of a pot),
 (空洞, くうどう, cave),
 (蓋, ふた, cover),
 (底面, ていめん, bottom),
 (淵, ふち, deep pool),
 (水面, すいめん, water's surface),
 (奈落, ならく, Naraka),
 (丸底, まるぞこ, round-bottom),
 (水底, すいてい, sea or river bottom),
 (裂け目, さけめ, tear),
 (楔, くさび, wedge),
 (底抜け, そこぬけ, bottomless (bucket, etc.)),
 (水抜き, みずぬき, draining (esp. pipes from water for the winter)),
 (溝, こう, 10^32),
 (足元, あしもと, at one's feet),
 (船底, せんてい, ship's bottom),
 (外側, そとがわ, exterior),
 (川底, かわぞこ, riverbed),
 (隙間, すきま, crevice),
 (支え棒, ささえぼう, stay bar),
 (上端, じょうたん, upper end),
 (淀み, よどみ, stagnation),
 (下側, したがわ, underside),
 (穴蔵, あなぐら, cellar),
 (沈澱, ちんでん, precipitation),
 (上蓋, あげぶた, trap door),
 (上部, じょうぶ, top part),
 (奥, おく, inner part),
 (中層, ちゅうそう, middle part),
 (浮き, うき, floating),
 (付け根, つけね, root)]
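
The Examples section gives outputs in the form `english,kanji,kana`, while `find_words` returns `(kanji, kana, english)` tuples. The following is a minimal sketch of a formatter that reorders each tuple and joins it into one line. The function name `to_output_line` is hypothetical, and the use of the `csv` module (so that glosses containing commas are quoted) is a design assumption rather than something the notebook specifies.

```python
import csv
import io

def to_output_line(triple):
    """Format a (kanji, kana, english) tuple as 'english,kanji,kana'."""
    kanji, kana, english = (str(part) for part in triple)
    buffer = io.StringIO()
    # csv.writer quotes any field that itself contains a comma.
    csv.writer(buffer).writerow([english, kanji, kana])
    return buffer.getvalue().rstrip("\r\n")

simple_line = to_output_line(("気球", "ききゅう", "balloon"))
quoted_line = to_output_line(("筐体", "きょうたい", "case (of a machine, computer, etc.)"))
print(simple_line)
print(quoted_line)
```

Applied to the results above, this would be `[to_output_line(w) for w in words]`.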