<h1>Word2Vec (Thai words)</h1>
<p>In this notebook, we train a word2vec model on a Thai corpus and inspect the nearest neighbours of a few words.</p>

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
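# Read the whole corpus file; the context manager closes the file afterwards
with open('dataset/orchid97-utf8.crp', 'r', encoding='utf-8') as f:
    fileString = f.read()
# Records are separated by '#'; keep only those that contain '/'-tagged tokens
testArr = fileString.split("#")
testArr = [row for row in testArr if '/' in row]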

In [3]:
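sentences = []   # sentence text with spaces removed
wordsList = []   # one list of word tokens per sentence

for row in testArr:
    # The text before the first '//' holds the sentence, the text after it the tagged words
    parts = row.split('//')
    if "\n" in parts[0] and len(parts) > 1:
        sentence = parts[0].split("\n")[1]
        sentence = sentence.replace(" ", "")
        sentences.append(sentence)

        # Keep the word from each "word/TAG" line and drop <space> markers
        partsArr = parts[1].split("\n")
        partsArr = [p.split("/")[0] for p in partsArr]
        partsArr = [p for p in partsArr if p != '<space>']
        # Drop the first and last entries, which are not word tokens
        wordsList.append(partsArr[1:-1])

# Space-joined version of each tokenized sentence
tokenized_sentences = [" ".join(words) for words in wordsList]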
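<p>The parsing steps above can be sketched as a standalone function and tried on a tiny record. The function name <code>parse_orchid</code> and the sample string are assumptions made for illustration; the sample only imitates the layout that the loop above expects and is not copied from the corpus file.</p>

```python
def parse_orchid(text):
    """Return (sentences, words_list) from '#'-separated, '//'-delimited records."""
    sentences, words_list = [], []
    for row in text.split("#"):
        if "/" not in row:
            continue  # skip fragments without tagged tokens
        parts = row.split("//")
        if "\n" in parts[0] and len(parts) > 1:
            # Second line of the first part is the sentence text
            sentences.append(parts[0].split("\n")[1].replace(" ", ""))
            # Each following line looks like "word/TAG"; keep the word only
            tokens = [p.split("/")[0] for p in parts[1].split("\n")]
            tokens = [t for t in tokens if t != "<space>"]
            words_list.append(tokens[1:-1])  # drop the empty first/last entries
    return sentences, words_list


# Hypothetical sample record (made up for this sketch)
sample = (
    "#1\nการประชุม ทางวิชาการ//\n"
    "การ/FIXN\nประชุม/VACT\n<space>/PUNC\nทาง/NCMN\nวิชาการ/NCMN\n//\n"
)
sents, words = parse_orchid(sample)
print(sents)
print(words)
```

<p>Wrapping the logic in a function makes it easy to check the tokenization on a few hand-written records before running it over the full file.</p>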

In [4]:
wordsList[0:5]

[['การ', 'ประชุม', 'ทาง', 'วิชาการ', 'ครั้ง', 'ที่ 1'],
 ['โครงการวิจัยและพัฒนา', 'อิเล็กทรอนิกส์', 'และ', 'คอมพิวเตอร์'],
 ['ปีงบประมาณ', '2531'],
 ['เล่ม', '1'],
 ['ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ']]

In [5]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
from gensim.models import word2vec
print("Training model...")
# Note: `size` is the gensim < 4.0 argument name; gensim 4.x renamed it to `vector_size`
model = word2vec.Word2Vec(wordsList, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "telex_context"
model.save(model_name)

2017-04-28 08:01:26,696 : INFO : 'pattern' package not found; tag filters are not available for English
2017-04-28 08:01:26,702 : INFO : collecting all words and their counts
2017-04-28 08:01:26,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-28 08:01:26,731 : INFO : PROGRESS: at sentence #10000, processed 118968 words, keeping 9396 word types
2017-04-28 08:01:26,760 : INFO : PROGRESS: at sentence #20000, processed 229315 words, keeping 15114 word types
2017-04-28 08:01:26,776 : INFO : collected 17185 word types from a corpus of 288841 raw words and 24879 sentences
2017-04-28 08:01:26,777 : INFO : Loading a fresh vocabulary
2017-04-28 08:01:26,789 : INFO : min_count=40 retains 815 unique words (4% of original 17185, drops 16370)
2017-04-28 08:01:26,790 : INFO : min_count=40 leaves 233066 word corpus (80% of original 288841, drops 55775)
2017-04-28 08:01:26,794 : INFO : deleting the raw counts dictionary of 17185 items
2017-04-28 08:01:26,796 : INF

Training model...


2017-04-28 08:01:27,744 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-04-28 08:01:27,748 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-28 08:01:27,752 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-28 08:01:27,756 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-28 08:01:27,757 : INFO : training on 1444205 raw words (794203 effective words) took 0.9s, 858647 effective words/s
2017-04-28 08:01:27,758 : INFO : precomputing L2-norms of word weight vectors
2017-04-28 08:01:27,765 : INFO : saving Word2Vec object under telex_context, separately None
2017-04-28 08:01:27,766 : INFO : not storing attribute syn0norm
2017-04-28 08:01:27,768 : INFO : not storing attribute cum_table
2017-04-28 08:01:27,807 : INFO : saved telex_context
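
<p>The log above reports that <code>min_count=40</code> retains 815 of 17185 word types (4%) while still covering 80% of the running words. A minimal sketch of that filtering step, using only the standard library, is shown below. The helper name <code>apply_min_count</code> and the toy corpus are assumptions for illustration and are not part of gensim.</p>

```python
from collections import Counter


def apply_min_count(tokenized_sentences, min_count):
    """Count every token, then keep only words seen at least `min_count` times."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Toy corpus (hypothetical data, not taken from the notebook's corpus)
toy_corpus = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
counts, kept = apply_min_count(toy_corpus, min_count=2)

unique_share = len(kept) / len(counts)                   # share of word types retained
token_share = sum(kept.values()) / sum(counts.values())  # share of running words retained
print(sorted(kept), round(unique_share, 2), round(token_share, 2))
```

<p>Running the same helper on <code>wordsList</code> with different thresholds is a quick way to see how many words would remain queryable before retraining the model.</p>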


In [6]:
model.most_similar('วิทยาศาสตร์')

[('สารนิเทศ', 0.9880535006523132),
 ('มหาวิทยาลัย', 0.9817095398902893),
 ('สถาบัน', 0.9740793108940125),
 ('สาขา', 0.9730427265167236),
 ('วารสาร', 0.9635933041572571),
 ('ห้องสมุด', 0.9582613110542297),
 ('เครือข่ายคอมพิวเตอร์', 0.952242374420166),
 ('แห่ง', 0.9518786668777466),
 ('ทรัพยากร', 0.9424996376037598),
 ('NECTEC', 0.9415864944458008)]

In [7]:
model.most_similar('โครงการ')

[('ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ', 0.9383119344711304),
 ('เสนอ', 0.9129125475883484),
 ('คณะ', 0.9037826061248779),
 ('อิเล็กทรอนิกส์', 0.9027509689331055),
 ('ปี', 0.901549220085144),
 ('เครือข่ายคอมพิวเตอร์', 0.8999563455581665),
 ('สารนิเทศ', 0.891416609287262),
 ('สนับสนุน', 0.8882615566253662),
 ('วิชาการ', 0.8881869912147522),
 ('NECTEC', 0.8856964707374573)]

In [8]:
model.most_similar('NECTEC')

[('เครือข่ายคอมพิวเตอร์', 0.9833523035049438),
 ('นโยบาย', 0.971004068851471),
 ('งบประมาณ', 0.9627659320831299),
 ('จัดตั้ง', 0.9621474742889404),
 ('มหาวิทยาลัย', 0.9554195404052734),
 ('ทรัพยากร', 0.9538665413856506),
 ('ร่วม', 0.9534153342247009),
 ('เอกชน', 0.9524781107902527),
 ('สถาบัน', 0.9502981901168823),
 ('สารนิเทศ', 0.9473577737808228)]

In [9]:
model.most_similar('ประชุม')

[('วิชาการ', 0.9838699102401733),
 ('ครั้ง', 0.935802161693573),
 ('ปีงบประมาณ', 0.9299863576889038),
 ('2532', 0.9235069751739502),
 ('วิศวกรรมไฟฟ้า', 0.9214415550231934),
 ('ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ', 0.9207188487052917),
 ('ภาควิชา', 0.9121106863021851),
 ('มหาวิทยาลัยเชียงใหม่', 0.9048099517822266),
 ('ผลงานวิจัย', 0.9021814465522766),
 ('2536', 0.8995680809020996)]
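
<p>The scores returned by <code>most_similar</code> are cosine similarities between word vectors. The ranking can be sketched in plain Python as follows; the two-dimensional vectors and the function names are made up for illustration and have nothing to do with the trained model's 500-dimensional vectors.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors
toy_vectors = {"x": [1.0, 0.0], "y": [0.9, 0.1], "z": [0.0, 1.0]}
ranked = most_similar(toy_vectors, "x", topn=2)
print(ranked)
```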

<h4>Some words might not exist in the vocabulary</h4>

In [10]:
# Raises KeyError if the word is not in the trained vocabulary
# (for example, because it occurs fewer than min_word_count times)
model.most_similar('เกม')
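
<p>One way to handle words that are missing from the vocabulary is to catch the <code>KeyError</code> that the lookup raises. The sketch below assumes only that the object passed in exposes a <code>most_similar(word, topn=...)</code> method; <code>safe_most_similar</code> and the stub class are hypothetical names, and the stub's scores are invented so the helper can be exercised without a trained model.</p>

```python
def safe_most_similar(vectors, word, topn=10):
    """Return neighbours of `word`, or an empty list if the lookup raises KeyError."""
    try:
        return vectors.most_similar(word, topn=topn)
    except KeyError:
        return []


class _StubVectors:
    """Hypothetical stand-in for a trained model, used only to exercise the helper."""

    def __init__(self, table):
        self._table = table

    def most_similar(self, word, topn=10):
        if word not in self._table:
            raise KeyError("word '%s' not in vocabulary" % word)
        return self._table[word][:topn]


# Invented neighbours and scores, for illustration only
stub = _StubVectors({"โครงการ": [("เสนอ", 0.91), ("คณะ", 0.90)]})
found = safe_most_similar(stub, "โครงการ", topn=1)
missing = safe_most_similar(stub, "เกม")
print(found, missing)
```

<p>Catching the exception avoids depending on how a particular gensim version exposes its vocabulary, so the same helper could be called with the notebook's <code>model</code> object.</p>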

<h3>TODO</h3>
<ul>
<li>Apply word2vec to a Pokémon GO corpus using Thai word segmentation</li>
<li>Compare bag-of-words with word2vec</li>
</ul>