<h2>Analyzing Most Similar Words and Their Changes in Pre-trained Embedding</h2>

<h4>Step 1: Get the vocabulary and dictionary of words</h4>

<p>Before this step, we saved the weights of the best model from our earlier experiments. We read the "wd_vec_dict_SNLI.p" pickle file, which holds the trained embedding used in section 3.1. The "f.p" pickle file holds the dictionary of pre-trained fastText embeddings.</p>

In [1]:
# Packages that need to be imported
import pickle as pkl
import torch
from torch import nn
from torch import dist
from torch import norm
import nltk
from nltk.corpus import stopwords
import os

# Constants and paths used throughout the notebook
max_vocab_size = 10000
PAD_IDX = 0
UNK_IDX = 1
folder = os.getcwd()

In [2]:
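# Here we also need to build the vocabulary
def build_vocab(hypo_tokens, prem_tokens, max_vocab_size):
    # Note: the arguments are not used. The vocabulary is read from the global `f`
    # (the pre-trained embedding table, already truncated to max_vocab_size rows),
    # whose "," column holds the tokens.
    # Returns:
    # token2id: dictionary where keys represent tokens and corresponding values represent indices
    # id2token: list of tokens, where id2token[i] returns the token that corresponds to index i

    id2token = list(f[","])
    token2id = dict(zip(f[","], range(2,2+len(f))))
    # Reserve indices 0 and 1 for the padding and unknown tokens
    id2token = ['<pad>', '<unk>'] + id2token
    token2id['<pad>'] = PAD_IDX
    token2id['<unk>'] = UNK_IDX
    return token2id, id2token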

In [3]:
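# Load the trained and pre-trained embeddings
with open("wd_vec_dict_SNLI.p", "rb") as fh:
    wd_vec_dict = pkl.load(fh)
with open("/Users/ludi/Desktop/tars/f.p", "rb") as fh:
    f = pkl.load(fh)
# Keep only the first max_vocab_size rows of the pre-trained table
f = f[:max_vocab_size]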

In [4]:
# load data and tokens
hypo_data_tokens_train = pkl.load(open(folder+"/../all_data_pickle/hypo_data_tokens_train.p", "rb"))
prem_data_tokens_train = pkl.load(open(folder+"/../all_data_pickle/prem_data_tokens_train.p", "rb"))
hypo_data_tokens_val = pkl.load(open(folder+"/../all_data_pickle/hypo_data_tokens_val.p", "rb"))
prem_data_tokens_val = pkl.load(open(folder+"/../all_data_pickle/prem_data_tokens_val.p", "rb"))

all_hypo_data_tokens_train = pkl.load(open(folder+"/../all_data_pickle/all_hypo_data_tokens_train.p", "rb"))
all_prem_data_tokens_train = pkl.load(open(folder+"/../all_data_pickle/all_prem_data_tokens_train.p", "rb"))
all_hypo_data_tokens_val = pkl.load(open(folder+"/../all_data_pickle/all_hypo_data_tokens_val.p", "rb"))
all_prem_data_tokens_val = pkl.load(open(folder+"/../all_data_pickle/all_prem_data_tokens_val.p", "rb"))

# Vocabulary
# Build the token/index mappings according to max_vocab_size
token2id, id2token = build_vocab(all_hypo_data_tokens_train, all_prem_data_tokens_train, max_vocab_size)

<h4>Step 2: Get Pair-wise Distances and Get the Most Similar 10 Pairs</h4>

<p>First, we build an id2vec list that maps each word index to its embedding vector.</p>

In [5]:
#The build_id_vec function
def build_id_vec_function(wd_vec, id_token):
    
    '''
    @param wd_vec: the dictionary that includes tokens and vectors
    @param id_token: i.e. id2token
    '''
    
    output = []
    for token in id_token:
        if token in wd_vec:
            output.append(torch.from_numpy(wd_vec[token]))
        else:
            output.append(None)
    return output

id2vec = build_id_vec_function(wd_vec_dict, id2token)

<p>Next, we generate a list of word pairs sorted by distance, so that the most similar pairs come first.</p>

<p>We use the Euclidean distance between embedding vectors. An alternative is the cosine similarity between word vectors. However, in our trials it provided little information about word similarity, so we do not use it.</p>
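<p>To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The helper names and the toy vectors are made up for illustration; the notebook itself computes distances with <em>torch.dist</em>. The sketch shows that cosine similarity ignores vector length, while Euclidean distance does not.</p>

```python
import math

def euclidean_distance(u, v):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical): v is u scaled by 3, so they point the same way
u = [1.0, 2.0, 2.0]
v = [3.0, 6.0, 6.0]

toy_euclidean = euclidean_distance(u, v)  # grows with the difference in length
toy_cosine = cosine_similarity(u, v)      # depends only on direction
```

<p>For these toy vectors the cosine similarity is maximal even though the vectors are far apart in Euclidean terms, which is one way the two measures can disagree.</p>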

In [6]:
# Compare the similarity between tokens
def build_similar_pairs(id_vec):
    '''
    @param id_vec: i.e. id2vec
    '''
    output = []
    for i in range(1, len(id_vec)):
        if id_vec[i] is not None:
            for j in range(i+1, len(id_vec)):
                if id_vec[j] is not None:
                    output.append((i, j, dist(id_vec[i], id_vec[j], p = 2).item()))
                    
        # Report progress every 1000 tokens
        if i % 1000 == 999:
            print("[{}/{}] tokens has been visited.".format(i+1,len(id_vec)))
    
    # Finally, sort the pairs by ascending distance (most similar first)
    output = sorted(output, key=lambda x: x[2], reverse=False)
    
    return output

similarity_pairs = build_similar_pairs(id2vec)

[1000/10002] tokens has been visited.
[2000/10002] tokens has been visited.
[3000/10002] tokens has been visited.
[4000/10002] tokens has been visited.
[5000/10002] tokens has been visited.
[6000/10002] tokens has been visited.
[7000/10002] tokens has been visited.
[8000/10002] tokens has been visited.
[9000/10002] tokens has been visited.
[10000/10002] tokens has been visited.
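
<p>Storing and sorting every pair is expensive for a vocabulary of this size, since the number of pairs grows quadratically with the number of tokens. As a possible alternative, the sketch below keeps only the k closest pairs with <em>heapq.nsmallest</em>. It is a plain-Python illustration under assumed toy data, not the method used in this notebook; the function name <em>k_closest_pairs</em> is hypothetical.</p>

```python
import heapq
import math
from itertools import combinations

def k_closest_pairs(id_vec, k):
    """Return the k closest (i, j, distance) triples, skipping missing vectors."""
    # Keep the original index next to each available vector
    valid = [(i, vec) for i, vec in enumerate(id_vec) if vec is not None]
    # Lazily generate all pairs; nothing is stored beyond the k best
    pairs = (
        (i, j, math.dist(u, v))
        for (i, u), (j, v) in combinations(valid, 2)
    )
    return heapq.nsmallest(k, pairs, key=lambda p: p[2])

# Toy data (hypothetical): index 1 has no vector, like a token missing from wd_vec_dict
toy_id_vec = [[0.0, 0.0], None, [3.0, 4.0], [0.0, 1.0]]
closest = k_closest_pairs(toy_id_vec, k=2)
```

<p>Because stop words are filtered out after ranking, k would need to be larger than 10 in practice, or the stop words would have to be removed before the pairs are generated.</p>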


<p>We do not want function words such as "the" in our comparison, since we want to compare words that carry meaning. We therefore collect function words from the <strong>NLTK</strong> <em>stopwords</em> corpus, add determiners and numbers, and drop every pair that contains one of them.</p>

In [7]:
def elim_func_words(similar_pairs, id_token):
    '''
    @param similar_pairs: the sorted list of pairs with similarity scores
    @param id_token: i.e. id2token
    '''
    
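    # Stop words, determiners, and the numbers 0-99, kept in a set for fast lookup
    stp_wds = set(stopwords.words('english')) | {'a', 'an'} | set(map(str, range(100)))
    output = []
    i_temp = 0
    # Stop after 10 pairs, or when the candidate list is exhausted
    while len(output) < 10 and i_temp < len(similar_pairs):
        id_1, id_2, distance = similar_pairs[i_temp]
        if id_token[id_1] not in stp_wds and id_token[id_2] not in stp_wds:
            output.append((id_token[id_1], id_token[id_2], distance))
        i_temp += 1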
    
    return output

most_similar_pairs = elim_func_words(similarity_pairs, id2token)

In [8]:
most_similar_pairs

[('young', 'child', 9.58918285369873),
 ('landing', 'cluster', 9.736421585083008),
 ('man', 'guys', 9.81108283996582),
 ('group', 'hat', 9.829873085021973),
 ('young', 'casual', 10.05659008026123),
 ('group', 'wonderful', 10.079551696777344),
 ('young', 'grass', 10.100854873657227),
 ('man', 'sites', 10.119364738464355),
 ('young', 'across', 10.165505409240723),
 ('area', 'young', 10.191558837890625)]

<h4>Step 3: See Their Representation in Pre-trained Embeddings</h4>

<p>In this step, we recompute the pair-wise distances of the same ten pairs using the pre-trained embeddings.</p>

In [9]:
def get_pretrained_dist(pretrained_dict, most_similar):
    '''
    @param pretrained_dict: the dictionary of pre-trained embeddings
    @param most_similar: the 10 most similar pairs we get in step 2
    '''
    
    pretrained_wd_vec = {}
    for i in range(len(pretrained_dict[','])):
        pretrained_wd_vec[pretrained_dict.iloc[i][0]] = pretrained_dict.iloc[i][1]
    
    output = []
    for tk_1, tk_2,_ in most_similar:
        pre_train_dist = dist(torch.tensor(pretrained_wd_vec[tk_1])
                            , torch.tensor(pretrained_wd_vec[tk_2])
                            , p = 2).item()
        output.append((tk_1, tk_2, pre_train_dist))
    
    return output

pretrained_distance = get_pretrained_dist(f, most_similar_pairs)

In [10]:
pretrained_distance

[('young', 'child', 1.8071242570877075),
 ('landing', 'cluster', 2.6011438369750977),
 ('man', 'guys', 1.7539427280426025),
 ('group', 'hat', 2.304748296737671),
 ('young', 'casual', 2.0644662380218506),
 ('group', 'wonderful', 2.1527419090270996),
 ('young', 'grass', 2.329569101333618),
 ('man', 'sites', 2.251317262649536),
 ('young', 'across', 2.2015292644500732),
 ('area', 'young', 1.8896145820617676)]
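
<p>The two lists are on different scales, so one way to compare them is by rank and by ratio rather than by raw distance. The sketch below does this for the ten pairs; the distances are copied from the two outputs above and rounded to three decimals, and the variable names are made up for illustration.</p>

```python
# (token_1, token_2, trained distance, pre-trained distance), rounded to 3 decimals
pairs = [
    ("young", "child", 9.589, 1.807),
    ("landing", "cluster", 9.736, 2.601),
    ("man", "guys", 9.811, 1.754),
    ("group", "hat", 9.830, 2.305),
    ("young", "casual", 10.057, 2.064),
    ("group", "wonderful", 10.080, 2.153),
    ("young", "grass", 10.101, 2.330),
    ("man", "sites", 10.119, 2.251),
    ("young", "across", 10.166, 2.202),
    ("area", "young", 10.192, 1.890),
]

# Ratio of trained to pre-trained distance for each pair
ratios = [(a, b, trained / pretrained) for a, b, trained, pretrained in pairs]

# Re-rank the same pairs by their pre-trained distance
by_pretrained = sorted(pairs, key=lambda p: p[3])
closest_pretrained = by_pretrained[0][:2]

for rank, (a, b, trained, pretrained) in enumerate(by_pretrained, start=1):
    print("{:2d}. {:8s} {:10s} trained={:.3f} pre-trained={:.3f}".format(
        rank, a, b, trained, pretrained))
```

<p>Re-ranking makes it easier to see which pairs are also close in the pre-trained space and which ones became close only after training.</p>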

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/ludi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True