# Homework: Word Embedding

In this exercise, you will work on the skip-gram neural network architecture for Word2Vec. You will be using Keras to train your model. 

The sample code for skip-gram model is given. Your job is to incorporate the tokenizer model that you created in HomeWork-1 to tokenize raw text and turn it into word vectors.

You must complete the following tasks:
1. Read/clean text files
2. Indexing (Assign a number to each word)
3. Create skip-grams (inputs for your model)
4. Create the skip-gram neural network model
5. Visualization
6. Evaluation (Using pre-trained, not using pre-trained)
    (classify topic from 4 categories) 
    
This notebook assumes you have already installed Tensorflow and Keras with python3 and had GPU enabled. If you run this exercise on GCloud using the provided disk image you are all set.



In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import math
import glob
import re
import random
import collections
import os
import sys
from keras.preprocessing import sequence
from keras.models import Sequential, Model
from keras.layers import GRU, Dropout, Flatten
from keras.models import load_model
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Masking
from keras.layers.merge import Dot
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing import sequence
from keras import backend as K
from keras.optimizers import Adam

random.seed(42)

Using TensorFlow backend.


## Step 1: Read/clean text files

The given code can be used to processed the pre-tokenzied text file from the wikipedia corpus. In your homework, you must replace those text files with raw text files.  You must use your own tokenizer to process your text files

In [2]:
!wget https://www.dropbox.com/s/eexden7246sgfzf/BEST-TrainingSet.zip
!wget https://www.dropbox.com/s/n87fiy25f2yc3gt/wiki.zip
!unzip wiki.zip
!unzip BEST-TrainingSet.zip

--2020-02-23 06:29:08--  https://www.dropbox.com/s/eexden7246sgfzf/BEST-TrainingSet.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/eexden7246sgfzf/BEST-TrainingSet.zip [following]
--2020-02-23 06:29:14--  https://www.dropbox.com/s/raw/eexden7246sgfzf/BEST-TrainingSet.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca99f59df488eaa81f5422d894b.dl.dropboxusercontent.com/cd/0/inline/AyqB52ofp0itcj4CZv59YrifPbFIfPBNDT0T-xjCqWcL4W-MHxE0vV5YUQdu2w_t-DAg7abfI7cjRZ86V7WaKk9mncBNx__fIzJRLI9YBY0WwizQv_dxQqXFVbbP1gmb5YQ/file# [following]
--2020-02-23 06:29:14--  https://uca99f59df488eaa81f5422d894b.dl.dropboxusercontent.com/cd/0/inline/AyqB52ofp0itcj4CZv59YrifPbFIfPBNDT0T-xjCqWcL4W-MHxE0vV5YUQdu2w_t-DAg7abfI7cjRZ86V7WaKk9mnc

In [0]:
#Step 1: read the wikipedia text file
with open("wiki/thwiki_chk.txt") as f:
    raw_text = [] 
    #The text file is already tokenized BUT...
    #we've replaced all the spaces between words, so you have to use your tokenizer.
    raw_text.extend(re.sub(r"\s+","",f.read()))
    #since the wiki file is very large, we will only use 1/20 of the whole wiki file in this homework
    # if you have enough memeory and want to add more training data, please feel free to edit this code
    # to include more data
    raw_text = raw_text[:len(raw_text)//20]


In [0]:
# Create a character map
CHARS = [
  '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+',
  ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8',
  '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E',
  'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
  'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
  'n', 'o', 'other', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
  'z', '}', '~', 'ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', 'ช',
  'ซ', 'ฌ', 'ญ', 'ฎ', 'ฏ', 'ฐ', 'ฑ', 'ฒ', 'ณ', 'ด', 'ต', 'ถ', 'ท',
  'ธ', 'น', 'บ', 'ป', 'ผ', 'ฝ', 'พ', 'ฟ', 'ภ', 'ม', 'ย', 'ร', 'ฤ',
  'ล', 'ว', 'ศ', 'ษ', 'ส', 'ห', 'ฬ', 'อ', 'ฮ', 'ฯ', 'ะ', 'ั', 'า',
  'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู', 'ฺ', 'เ', 'แ', 'โ', 'ใ', 'ไ',
  'ๅ', 'ๆ', '็', '่', '้', '๊', '๋', '์', 'ํ', '๐', '๑', '๒', '๓',
  '๔', '๕', '๖', '๗', '๘', '๙', '‘', '’', '\ufeff'
]
CHARS_MAP = {v: k for k, v in enumerate(CHARS)}
char = np.array(CHARS)

In [0]:
def create_n_gram_df(df, n_pad):
  """
  Given an input dataframe, create a feature dataframe of shifted characters
  Input:
  df: timeseries of size (N)
  n_pad: the number of context. For a given character at position [idx],
    character at position [idx-n_pad/2 : idx+n_pad/2] will be used 
    as features for that character.
  
  Output:
  dataframe of size (N * n_pad) which each row contains the character, 
    n_pad_2 characters to the left, and n_pad_2 characters to the right
    of that character.
  """
  n_pad_2 = int((n_pad - 1)/2)
  for i in range(n_pad_2):
      df['char-{}'.format(i+1)] = df['char'].shift(i + 1)
      df['char{}'.format(i+1)] = df['char'].shift(-i - 1)
  return df[n_pad_2: -n_pad_2]


def prepare_wiki_feature(raw_text_input):
    """
    Transform the path to a directory containing processed files 
    into a feature matrix and output array
    """
    # we use padding equals 21 here to consider 10 characters to the left
    # and 10 characters to the right as features for the character in the middle
    n_pad = 21
    n_pad_2 = int((n_pad - 1)/2)
    pad = [{'char': ' ', 'target': True}]
    df_pad = pd.DataFrame(pad * n_pad_2)

    df = []

    df.append(pd.DataFrame(  {'char': raw_text_input}))

    df = pd.concat(df)
    # pad with empty string feature
    df = pd.concat((df_pad, df, df_pad))

    # map characters to numbers, use 'other' if not in the predefined character set.
    df['char'] = df['char'].map(lambda x: CHARS_MAP.get(x, 80))

    # Use nearby characters as features
    df_with_context = create_n_gram_df(df, n_pad=n_pad)

    char_row = ['char' + str(i + 1) for i in range(n_pad_2)] + \
             ['char-' + str(i + 1) for i in range(n_pad_2)] + ['char']

    # convert pandas dataframe to numpy array to feed to the model
    x_char = df_with_context[char_row].as_matrix()

    return x_char

#A function for displaying our features in text
def print_features(tfeature,index):
    feature = np.array(tfeature[index],dtype=int).reshape(21,1)
    #Convert to string
    char_list = char[feature]
    left = ''.join(reversed(char_list[10:20].reshape(10))).replace(" ", "")
    center = ''.join(char_list[20])
    right =  ''.join(char_list[0:10].reshape(10)).replace(" ", "")
    word = ''.join([left,' ',center,' ',right])
    print(center + ': ' + word )

## <font color='blue'>Homework Question1:</font>
<font color='blue'>Use your own tokenizer (aka word segmentation model)  to define word boundaries and split the given text file into words. </font>

In [11]:
# TODO#1 
#load your word segmentation model here!
def get_your_nn():
  input1 = Input(shape=(21,))
  x = Embedding(178,32,input_length=21)(input1)
  x = GRU(32,reset_after=True)(x)
  x = Dense(512, activation='relu')(x)
  x = Dropout(0.2)(x)
  x = Dense(256, activation='relu')(x)
  out = Dense(1, activation='sigmoid')(x)

  model = Model(inputs=input1, outputs=out)
  model.compile(optimizer=Adam(),
                loss='binary_crossentropy',
                metrics=['acc'])
  
  model.load_weights('/content/model_weight_nn.h5')
  model.summary()
  return model

model = get_your_nn()

#load weights here/ or alternatively you can also load your entire model








Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 21)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 21, 32)            5696      
_________________________________________________________________
gru_2 (GRU)                  (None, 32)                6336      
_________________________________________________________________
dense_4 (Dense)              (None, 512)               16896     
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_6 (Dense)              (None, 1)               

In [12]:
x_char= prepare_wiki_feature(raw_text)
#feel free to edit prepare_wiki_feature if your model has different input format
# As a sanity check, we print out the size of the data.
print(' data shape: ', x_char.shape)


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




 data shape:  (7592023, 21)


In [0]:
def char_to_word(raw_text, y_pred):
    """ add spaces between words in the raw text based on your prediction
    """
    split_text=""
    for char, y in zip(raw_text,y_pred):
        if y == 1:
            split_text+=" "
            split_text+=char
        else:
            split_text+=char
    return split_text.split(" ")

In [0]:
####TOKENIZATION
###THIS MIGHT TAKE ABOUT 10 MINS on feed forward models
y_pred = model.predict(x_char)
prob_to_class = lambda p: 1 if p[0]>=0.5 else 0
y_pred = np.apply_along_axis(prob_to_class,1,y_pred)
del x_char #clear up some memory

In [15]:
tokens= char_to_word(raw_text, y_pred)
#print out first 100 words for sanity check
print(tokens[0:100])

['', 'หน้า', 'หลัก', 'วิกิพีเดีย', 'ดำเนินการ', 'โดยมูลนิธิวิกิมี', 'เดีย', 'องค์กร', 'ไม่', 'แสวง', 'ผล', 'กำไร', 'ผู้', 'ดำเนินการ', 'อีก', 'หลาย', 'ได้', 'แก่__NOEDITSECTION__', 'ดาราศาสตร์', 'ดาราศาสตร์', 'คือวิชาวิทยาศาสตร์', 'ที่', 'ศึกษาวัตถุท้องฟ้า', '(', 'อาทิ', 'ดาวฤกษ์', 'ดาว', 'เคราะห์', 'ดาว', 'หาง', 'และ', 'ดาราจักร', ')', 'รวม', 'ทั้ง', 'ปรากฏการณ์', 'ทาง', 'ธรรมชาติ', 'ต่างๆ', 'ที่', 'เกิด', 'ขึ้น', 'จาก', 'นอก', 'ชั้น', 'บรรยากาศ', 'ของ', 'โลก', 'โดย', 'ศึกษา', 'เกี่ยว', 'กับ', 'วิวัฒนาการลักษณะ', 'ทาง', 'กายภาพ', 'ทาง', 'เคมี', 'ทาง', 'อุตุนิยมวิทยา', 'และ', 'การ', 'เคลื่อน', 'ที่', 'ของ', 'วัตถุ', 'ท้องฟ้า', 'ตลอดจน', 'ถึง', 'การ', 'กำเนิด', 'และ', 'วิวัฒนาการ', 'ของ', 'เอกภพดาราศาสตร์', 'เป็น', 'หนึ่ง', 'ใน', 'สาขา', 'ของ', 'วิทยาศาสตร์', 'ที่', 'เก่า', 'แก่', 'ที่สุด', 'นักดาราศาสตร์', 'ใน', 'วัฒนธรรม', 'โบราณ', 'สังเกตการณ์', 'ดวง', 'ดาว', 'บน', 'ท้องฟ้า', 'ใน', 'เวลากลาง', 'คืน', 'และ', 'วัตถุ', 'ทาง', 'ดาราศาสตร์']


In [16]:
print("total word count:", len(tokens))

total word count: 1707140


## Step 2: Indexing (Assign a number to each word)

The code below generates an indexed dataset(each word is represented by a number), a dictionary, a reversed dictionary

## <font color='blue'>Homework Question 2:</font>
<font color='blue'>“UNK” is often used to represent an unknown word (a word which does not exist in your dictionary/training set). You can also represent a rare word with this token as well.  How do you define a rare word in your program? Explain in your own words and capture the screenshot of your code segment that is a part of this process</font>

 + <font color='blue'>edit or replace create_index with your own code to set a threshold for rare words and replace them with "UNK"</font>

In [17]:
#step 2:Build dictionary and build a dataset(replace each word with its index)
def create_index(input_text):
    # TODO#2:edit or replace this function
    words = [word for word in input_text ]
    word_count = list()
    #use set and len to get the number of unique words
    word_count.extend(collections.Counter(words).most_common(len(set(words))))
    #include a token for unknown word
    word_count.append(("UNK",0))
    #print out 10 most frequent words
    print(word_count[:10])
    dictionary = dict()
    dictionary["for_keras_zero_padding"] = 0
    for word in word_count:
        dictionary[word[0]] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    data = list()
    for word in input_text:
        data.append(dictionary[word])

    return data,dictionary, reverse_dictionary

dataset,dictionary, reverse_dictionary=create_index(tokens)


[('ที่', 46831), ('ใน', 43820), ('เป็น', 35842), ('และ', 35524), ('การ', 34916), ('ของ', 33581), ('มี', 29459), ('ได้', 24652), ('"', 18158), (')', 17556)]


In [18]:
print("output sample (dataset):",dataset[:10])
print("output sample (dictionary):",{k: dictionary[k] for k in list(dictionary)[:10]})
print("output sample (reverse dictionary):",{k: reverse_dictionary[k] for k in list(reverse_dictionary)[:10]})

output sample (dataset): [30475, 303, 178, 1046, 2338, 18726, 3088, 1120, 26, 6817]
output sample (dictionary): {'for_keras_zero_padding': 0, 'ที่': 1, 'ใน': 2, 'เป็น': 3, 'และ': 4, 'การ': 5, 'ของ': 6, 'มี': 7, 'ได้': 8, '"': 9}
output sample (reverse dictionary): {0: 'for_keras_zero_padding', 1: 'ที่', 2: 'ใน', 3: 'เป็น', 4: 'และ', 5: 'การ', 6: 'ของ', 7: 'มี', 8: 'ได้', 9: '"'}


In [19]:
len(dictionary)

112106

# Step3: Create skip-grams (inputs for your model)
Keras has a skipgrams-generator, the cell below shows us how it generates skipgrams 

## <font color='blue'>Homework Question 3:</font>
<font color='blue'>The negative samples are sampled from sampling_table.  Look through Keras source code to find out how they sample negative samples. Discuss the sampling technique taught in class and compare it to the Keras source code.</font>



In [20]:
# Step 3: Create data samples
vocab_size = len(dictionary)
skip_window = 1       # How many words to consider left and right.

sample_set= dataset[:10]
sampling_table = sequence.make_sampling_table(vocab_size)
#TO DO#3 check out keras source code and find out how their sampling technique works. Describe it in your own words.
'''
    ###################################################  Answer  #############################################################

    A skipgram model receive an input as a word vector and generate a one hot vector of the center word with 1 skip_window around
    and find around 1 skip_window proability with make_sampling_table command ,so 
    and choose the word around with the highest proability as a positive sample like a binary classification.

    ###########################################################################################################################
'''
couples, labels = skipgrams(sample_set, vocab_size, window_size=skip_window, sampling_table=sampling_table)
word_target, word_context = zip(*couples)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(couples, labels)

for i in range(8):
    print(reverse_dictionary[couples[i][0]],reverse_dictionary[couples[i][1]])



[[30475, 91507], [18726, 3088], [18726, 2338], [1046, 2338], [30475, 303], [303, 178], [303, 105621], [1046, 178], [6817, 26], [1120, 3088], [1120, 852], [1046, 44598], [1120, 26], [18726, 99459], [303, 30475], [1120, 106094], [18726, 36464], [1046, 20927], [6817, 55393], [303, 77237]] [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
 โอกาสการคลอดลูก
โดยมูลนิธิวิกิมี เดีย
โดยมูลนิธิวิกิมี ดำเนินการ
วิกิพีเดีย ดำเนินการ
 หน้า
หน้า หลัก
หน้า คำนวณlog
วิกิพีเดีย หลัก


# Step 4: create the skip-gram model
## <font color='blue'>Homework Question 4:</font>
 <font color='blue'>Q4:  In your own words, discuss why Sigmoid is chosen as the activation function in the  skip-gram model.</font>

In [21]:
#reference: https://github.com/nzw0301/keras-examples/blob/master/Skip-gram-with-NS.ipynb
dim_embedddings = 32
V= len(dictionary)

#step1: select the embedding of the target word from W
w_inputs = Input(shape=(1, ), dtype='int32')
w = Embedding(V, dim_embedddings)(w_inputs)

#step2: select the embedding of the context word from C
c_inputs = Input(shape=(1, ), dtype='int32')
c  = Embedding(V, dim_embedddings)(c_inputs)

#step3: compute the dot product:c_k*v_j
o = Dot(axes=2)([w, c])
o = Reshape((1,), input_shape=(1, 1))(o)

#step4: normailize dot products into probability
o = Activation('sigmoid')(o)
#TO DO#4 Question: Why sigmoid?
'''
  #######################################################################  Answer of Why sigmoid?  ####################################################################### 

  Multi classification task with softmax is too slow because It mean the length of vocabulary multi class, so the output vector may be 10000000 class. 
  we might change the task to a binary classification that is skip-grams (0 for not a couple of word, 1 for a couple of word) with sigmoid instead with negative sampling .

  #########################################################################################################################################################################
'''
SkipGram = Model(inputs=[w_inputs, c_inputs], outputs=o)
SkipGram.summary()
opt=Adam(lr=0.01)
SkipGram.compile(loss='binary_crossentropy', optimizer=opt)

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 1, 32)        3587392     input_4[0][0]                    
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 1, 32)        3587392     input_5[0][0]                    
____________________________________________________________________________________________

In [22]:
# you don't have to spend too much time training for your homework, you are allowed to do it on a smaller corpus
# currently the dataset is 1/20 of the full text file.
for _ in range(5):
    prev_i=0
    #it is likely that your GPU won't be able to handle large input
    #just do it 100000 words at a time
    for i in range(len(dataset)//100000):
        #generate skipgrams
        data, labels = skipgrams(sequence=dataset[prev_i*100000:(i*100000)+100000], vocabulary_size=V, window_size=2, negative_samples=4.)
        x = [np.array(x) for x in zip(*data)]
        y = np.array(labels, dtype=np.int32)
        if x:
            loss = SkipGram.train_on_batch(x, y)
        prev_i = i 
        print(loss,i*100000)




0.69315165 0
0.69312096 100000
0.69308937 200000
0.6930213 300000
0.69290763 400000
0.69271946 500000
0.6924523 600000
0.6920628 700000
0.6914731 800000
0.6906833 900000
0.6896429 1000000
0.68835855 1100000
0.68672246 1200000
0.6846438 1300000
0.68228185 1400000
0.67931986 1500000
0.6756787 1600000
0.66923994 0
0.66514516 100000
0.65961295 200000
0.65174717 300000
0.6438066 400000
0.63489115 500000
0.6251258 600000
0.61583066 700000
0.6060089 800000
0.594386 900000
0.5814119 1000000
0.5690461 1100000
0.5550006 1200000
0.54028165 1300000
0.52714646 1400000
0.5120137 1500000
0.4950847 1600000
0.46456814 0
0.4537776 100000
0.43747854 200000
0.4144349 300000
0.39771774 400000
0.38064986 500000
0.3608763 600000
0.34867433 700000
0.3407871 800000
0.3283194 900000
0.31433374 1000000
0.30424708 1100000
0.29361713 1200000
0.28448236 1300000
0.28038 1400000
0.27436388 1500000
0.2670767 1600000
0.23507935 0
0.23994696 100000
0.23578498 200000
0.22181827 300000
0.21995093 400000
0.21722032 50000

In [0]:
SkipGram.save_weights('my_skipgram32_weights-hw.h5')

In [24]:
#Get weight of the embedding layer
final_embeddings=SkipGram.get_weights()[0]
print(final_embeddings)

[[ 0.00696474  0.00201385  0.01166249 ...  0.02834919  0.04179854
  -0.0412469 ]
 [-0.61383915 -0.6563922  -0.5923705  ...  0.6761808  -0.5882664
   0.6094683 ]
 [-0.56639445 -0.59205955 -0.66191715 ...  0.6394608  -0.587136
   0.6385333 ]
 ...
 [ 0.01673509 -0.04318354  0.03632034 ... -0.04882034 -0.00669356
   0.04439757]
 [-0.03680561 -0.03709161 -0.03314473 ... -0.02201749  0.03523887
   0.01686216]
 [ 0.00256364 -0.04951977 -0.0125877  ...  0.02965248  0.03053318
  -0.03666181]]


In [25]:
final_embeddings.shape

(112106, 32)

# Step 5: Intrinsic Evaluation: Word Vector Analogies
## <font color='blue'>Homework Question 5: </font>
<font color='blue'> Read section 2.1 and 2.3 in this [lecture note](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes02-wordvecs2.pdf). Come up with 10 semantic analogy examples and report results produced by your word embeddings </font>


In [0]:
from sklearn.metrics.pairwise import cosine_similarity

def similarity(u, v):
    return np.squeeze(cosine_similarity(u.reshape(1, -1), v.reshape(1, -1)))

def complete_analogy(word_a, word_b, word_c, embeddings_index):
    
    # Get the word embeddings v_a, v_b and v_c 
    e_a, e_b, e_c = embeddings_index[dictionary[word_a]], embeddings_index[dictionary[word_b]], embeddings_index[dictionary[word_c]]
    
    words = dictionary.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:        
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c] :
            continue
        
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)
        cosine_sim = similarity(e_b - e_a + e_c, embeddings_index[dictionary[w]])
        
        # If the cosine_sim is more than the max_cosine_sim seen so far,
            # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word
            
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        
    return best_word

In [56]:
complete_analogy('สิงหาคม', 'ธันวาคม', 'ชาย', final_embeddings)

'ซีก'

In [47]:
complete_analogy('สิงหาคม', 'ธันวาคม', 'อังกฤษ', final_embeddings)

'สันสกฤต'

In [48]:
complete_analogy('แผ่นดิน', 'เกาะ', 'อังกฤษ', final_embeddings)

'ลัทธิ'

In [49]:
complete_analogy('แผ่นดิน', 'เกาะ', 'ประเทศ', final_embeddings)

'หน่วย'

In [50]:
complete_analogy('อาณาจักร', 'ประเทศ', 'ชาย', final_embeddings)

'สุด'

In [51]:
complete_analogy('ระยะ', 'กิโลเมตร', 'เวลา', final_embeddings)

'อุบัติ'

In [52]:
complete_analogy('ชาย', 'หญิง', 'พระราชา', final_embeddings)

'ชีอะห์'

In [53]:
complete_analogy('ระยะ', 'กิโลเมตร', 'น้ำหนัก', final_embeddings)

'ภาษาศาสตร์'

In [54]:
complete_analogy('เหนือ', 'ใต้', 'ตะวันตก', final_embeddings)

'กิ'

In [55]:
complete_analogy('ไป', 'มา', 'รับ', final_embeddings)

'เขน'

The complete_analogy function return the word vector that have maximum cosine similarity with word target - 3rd position input
and the 1st and 2nd position input.

The result is return the word vector that have maximum cosine similariy but have no language meaning
Its may be caused by the tokenizer or the word embedding  isn't have enough accuracy.

# Step 6: Extrinsic Evaluation

## <font color='blue'>Homework Question6:</font>
<font color='blue'>
Use the word embeddings from the skip-gram model as pre-trained weights in a classification model. Compare the result the with the same classification model that does not use the pre-trained weights. 
</font>


In [0]:
all_news_filepath = glob.glob('BEST-TrainingSet/news/*.txt')
all_novel_filepath = glob.glob('BEST-TrainingSet/novel/*.txt')
all_article_filepath = glob.glob('BEST-TrainingSet/article/*.txt')
all_encyclopedia_filepath = glob.glob('BEST-TrainingSet/encyclopedia/*.txt')

In [0]:
#preparing data for the classificaiton model
#In your homework, we will only use the first 2000 words in each text file
#any text file that has less than 2000 words will be padded
#reason:just to make this homework feasible under limited time and resource
max_length = 2000
def word_to_index(word):
    if word in dictionary:
        return dictionary[word]
    else:#if unknown
        return dictionary["UNK"]


def prep_data():
    input_text = list()
    for textfile_path in [all_news_filepath, all_novel_filepath, all_article_filepath, all_encyclopedia_filepath]:
        for input_file in textfile_path:
            f = open(input_file,"r") #open file with name of "*.txt"
            text = re.sub(r'\|', ' ', f.read()) # replace separation symbol with white space           
            text = re.sub(r'<\W?\w+>', '', text)# remove <NE> </NE> <AB> </AB> tags
            text = text.split() #split() method without an argument splits on whitespace 
            indexed_text = list(map(lambda x:word_to_index(x), text[:max_length])) #map raw word string to its index   
            if 'news' in input_file:
                input_text.append([indexed_text,0]) 
            elif 'novel' in input_file:
                input_text.append([indexed_text,1]) 
            elif 'article' in input_file:
                input_text.append([indexed_text,2]) 
            elif 'encyclopedia' in input_file:
                input_text.append([indexed_text,3]) 
            
            f.close()
    random.shuffle(input_text)
    return input_text

input_data = prep_data()
train_data = input_data[:int(len(input_data)*0.6)]
val_data = input_data[int(len(input_data)*0.6):int(len(input_data)*0.8)]
test_data = input_data[int(len(input_data)*0.8):]

train_input = [data[0] for data in train_data]
train_input = sequence.pad_sequences(train_input, maxlen=max_length) #padding
train_target = [data[1] for data in train_data]
train_target=to_categorical(train_target, num_classes=4)



val_input = [data[0] for data in val_data]
val_input = sequence.pad_sequences(val_input, maxlen=max_length) #padding
val_target = [data[1] for data in val_data]
val_target=to_categorical(val_target, num_classes=4)

test_input = [data[0] for data in test_data]
test_input = sequence.pad_sequences(test_input, maxlen=max_length) #padding
test_target = [data[1] for data in test_data]
test_target=to_categorical(test_target, num_classes=4)

del input_data, val_data,train_data, test_data

In [39]:
#the classification model
#TO DO#6 find out how to initialize your embedding layer with pre-trained weights, evaluate and observe
#don't forget to compare it with the same model that does not use pre-trained weights
#you can use your own model too! and feel free to customize this model as you wish
cls_model = Sequential()
cls_model.add(Embedding(len(dictionary), 32, input_length=max_length))
cls_model.layers[0].set_weights([final_embeddings])
cls_model.add(GRU(32))
cls_model.add(Dropout(0.2))
cls_model.add(Dense(100, activation='relu'))
cls_model.add(Dense(4, activation='softmax'))
cls_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
cls_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 2000, 32)          3587392   
_________________________________________________________________
gru_3 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               3300      
_________________________________________________________________
dense_8 (Dense)              (None, 4)                 404       
Total params: 3,597,336
Trainable params: 3,597,336
Non-trainable params: 0
_________________________________________________________________


In [40]:
print('Train...')
cls_model.fit(train_input, train_target,
          epochs=50,verbose=2,
          validation_data=[val_input, val_target])

Train...
Train on 303 samples, validate on 101 samples
Epoch 1/50
 - 34s - loss: 1.3690 - acc: 0.3399 - val_loss: 1.3709 - val_acc: 0.3465
Epoch 2/50
 - 33s - loss: 1.3379 - acc: 0.3861 - val_loss: 1.3587 - val_acc: 0.3465
Epoch 3/50
 - 32s - loss: 1.3378 - acc: 0.3861 - val_loss: 1.3521 - val_acc: 0.3465
Epoch 4/50
 - 32s - loss: 1.3323 - acc: 0.3861 - val_loss: 1.3525 - val_acc: 0.3465
Epoch 5/50
 - 32s - loss: 1.3281 - acc: 0.3861 - val_loss: 1.3560 - val_acc: 0.3465
Epoch 6/50
 - 32s - loss: 1.3255 - acc: 0.3861 - val_loss: 1.3535 - val_acc: 0.3465
Epoch 7/50
 - 32s - loss: 1.3210 - acc: 0.3861 - val_loss: 1.3531 - val_acc: 0.3465
Epoch 8/50
 - 32s - loss: 1.3131 - acc: 0.3861 - val_loss: 1.3513 - val_acc: 0.3465
Epoch 9/50
 - 32s - loss: 1.3133 - acc: 0.3861 - val_loss: 1.3497 - val_acc: 0.3465
Epoch 10/50
 - 33s - loss: 1.2992 - acc: 0.3861 - val_loss: 1.3443 - val_acc: 0.3465
Epoch 11/50
 - 33s - loss: 1.2864 - acc: 0.3861 - val_loss: 1.3431 - val_acc: 0.3465
Epoch 12/50
 - 33s 

<keras.callbacks.History at 0x7fe04b024240>

In [41]:
cls_model.evaluate(test_input,test_target,verbose=2)

[2.2955935562358185, 0.5196078431372549]

In [42]:
cls_model = Sequential()
cls_model.add(Embedding(len(dictionary), 32, input_length=max_length))
cls_model.add(GRU(32))
cls_model.add(Dropout(0.2))
cls_model.add(Dense(100, activation='relu'))
cls_model.add(Dense(4, activation='softmax'))
cls_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
cls_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 2000, 32)          3587392   
_________________________________________________________________
gru_4 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dropout_4 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               3300      
_________________________________________________________________
dense_10 (Dense)             (None, 4)                 404       
Total params: 3,597,336
Trainable params: 3,597,336
Non-trainable params: 0
_________________________________________________________________


In [43]:
print('Train...')
cls_model.fit(train_input, train_target,
          epochs=50,verbose=2,
          validation_data=[val_input, val_target])

Train...
Train on 303 samples, validate on 101 samples
Epoch 1/50
 - 34s - loss: 1.3820 - acc: 0.3597 - val_loss: 1.3794 - val_acc: 0.3465
Epoch 2/50
 - 32s - loss: 1.3668 - acc: 0.3861 - val_loss: 1.3714 - val_acc: 0.3465
Epoch 3/50
 - 33s - loss: 1.3458 - acc: 0.3861 - val_loss: 1.3624 - val_acc: 0.3465
Epoch 4/50
 - 33s - loss: 1.3171 - acc: 0.3861 - val_loss: 1.3554 - val_acc: 0.3465
Epoch 5/50
 - 33s - loss: 1.2714 - acc: 0.3861 - val_loss: 1.3490 - val_acc: 0.3465
Epoch 6/50
 - 33s - loss: 1.1935 - acc: 0.3861 - val_loss: 1.3242 - val_acc: 0.3465
Epoch 7/50
 - 33s - loss: 1.0376 - acc: 0.4653 - val_loss: 1.2750 - val_acc: 0.4356
Epoch 8/50
 - 33s - loss: 0.8323 - acc: 0.7294 - val_loss: 1.1855 - val_acc: 0.4752
Epoch 9/50
 - 33s - loss: 0.6845 - acc: 0.8713 - val_loss: 1.1693 - val_acc: 0.5050
Epoch 10/50
 - 33s - loss: 0.5241 - acc: 0.9538 - val_loss: 1.1296 - val_acc: 0.5446
Epoch 11/50
 - 33s - loss: 0.3670 - acc: 0.9670 - val_loss: 1.0915 - val_acc: 0.5446
Epoch 12/50
 - 33s 

<keras.callbacks.History at 0x7fe03e2b6cf8>

In [44]:
cls_model.evaluate(test_input,test_target,verbose=2)

[2.6213486662098004, 0.6176470576548109]

Model using pretrained embedding have less loss than model not using pretrained embedding on the test set,
that is 2.2955935562358185 and 2.6213486662098004.

But have accuracy less than compare to the model not using pretrained embedding .