<h1>Math</h1>

Now we will look at how we can unify the concepts from word2vec and GloVe

---

<h4>Unifying word2vec and GloVe</h4>

This discussion is motivated by the paper <strong><u>Neural Word Embedding as Implicit Matrix Factorisation</u></strong>

just from the title we can try to guess  the approach this paper is going to take

we know that Neural Word Embeddings means Word Embeddings that are learned from a neural network model like word2vec

and we know that matrix factorisation is the model that we use for GloVe

so what we are trying to do is show that word2vec can be framed as a matrix factorisation problem

---

<h4>GloVe</h4>

Again lets review what matrix factorisation looks like at a very basic level

for GloVe we have a big matrix call it A , $A_{ij}$ is just the weighted and logged word-context counts between word $i$ and word $j$ and this is approximately equal to the input word vector for word $i$ dotted with the output word vector for word $j$

$$A_{ij} = \log \left(X_{ij}\right) \approx w_i^Tu_j$$

we are going to try to come up with something in the same format but with a different matrix that we want to factorise

---

<h4>word2vec objective</h4>

first lets start with our objective function

$$J = \sum_{w\in W}\sum_{c \in W} count(w,c) \left\{\log \sigma \left(\vec w^T \vec c \right) + kE_{n \sim p(n)} \left[\log \sigma(-\vec w^T \vec n)\right]\right\}$$

$W = set \ of \ all \ words \ in \ the \ vocabulary$

$w = input \ word, \ \vec w = \ input \ word \ vector$

$c = context \ word , \vec c = \ context \ word \ vector$

$n = negative \ sample , \vec n = negative \ sample \ word \ vector$

$p(n) = unigram \ distribution$

$k \ = \ number \ of \ negative \ samples \ drawn$

lets try to break down this equation

if we look closely this is just the binary cross entropy or log-likelihood , the first term is for the positive samples , and the second term is for the negative samples

as we can see the $\sigma$ in the second term has a $-$ in front of the logit which means that this is the probability of t he output being 0

we also see that we have $count(w,c)$ in front of this expression , why ?

notice that we are not summing over each sample in our dataset like we usually do , instead we see that this sum is explicitly taken over every possible middle word with every possible context word

and of course in our dataset that middle word and that context word may appear several times , so this objective is the total objective for the entire dataset not just one sample , not just one sentence

the second term is also of interest , we can see it has this expected value symbol , again why ?

well of course its because negative samples are probabilistic , they are sampels that we draw randomly , so we do not know there actual values , we can only reaon about their statistics

we want to multiply this by k , since for each word-context pair we will have k-negative samples

note : so the second term is just saying that the sum is the same as  count x expected value which we know is true

---

<h4>Split into 2 terms</h4>

The next thing we can do is split up the objective into two seperate terms , this way we can see how we can simplify the second term which is a little more complicated

$$J = \sum_{w \in W}\sum_{c \in W}count(w,c) \left\{\log \sigma \left(\vec w^T \vec c\right)\right\} + \sum_{w \in W}\sum_{c \in W}count(w,c) \left\{kE_{n \sim p(n)} \left[\log \sigma(-\vec w^T \vec n)\right]\right\}$$

what we can see then , considering the second term , is that nothing depends on the context word $c$ so we can effectively just count the $w$s without even considering $c$

$$J = \sum_{w \in W}\sum_{c \in W}count(w,c) \left\{\log \sigma \left(\vec w^T \vec c\right)\right\} + \sum_{w \in W}count(w) \left\{kE_{n \sim p(n)} \left[\log \sigma(-\vec w^T \vec n)\right]\right\}$$

---

<h4>Express expectation as ratio of counts</h4>

The next thing we can consider is the expectation itself , note that the expectation in the limit is just the count of that item divided by the total

so we can write the expectation in terms of counts :

$$E_{n \sim p(n)} \left[\log \sigma \left(-\vec w^T \vec n\right)\right] = \sum_{n \in W} \frac{count(n)}{\vert \Omega \vert} \log \sigma\left(- \vec w^T \vec n\right)$$

$\Omega$ : corpus , $\vert \Omega \vert$ : corpus size

now what we can do with this is , consider that our current input-context pair is $(w,c)$

so we can , rather trivially , split out the context word $c$ from this sum just so that is sits by itself

$$= \frac{count(c)}{\vert \Omega \vert} \log \left(-\vec w^T \vec c \right) + \sum_{n \in W - \{c\}} \frac{count(n)}{\vert \Omega \vert} \log \left(-\vec w^T \vec n\right)$$

---

<h4>Local objective</h4>

so why would we want to do tihs ?

well now we can take what we had before , and combine them to take things a step further

in order to do this derivation , we only need to consider a single entry in the target matrix

so this entry depends only on $(w,c)$

so anything that doesnot depends on $(w,c)$ will be thrown out

so we can look at this full objective function which contains the entire dataset and pickout only the part that depends on the current $w$ and the current $c$

That allows us to get rid of all but one term from the negative sampling expectation

$$J = \sum_{w \in W}\sum_{c \in W}count(w,c) \left\{\log \sigma \left(\vec w^T \vec c\right)\right\} + \boxed{\sum_{w \in W}count(w)} \left\{kE_{n \sim p(n)} \left[\log \sigma(-\vec w^T \vec n)\right]\right\}$$

$$E_{n \sim p(n)} \left[\log \sigma \left(-\vec w^T \vec n\right)\right] = \boxed{\frac{count(c)}{\vert \Omega \vert} \log \left(-\vec w^T \vec c \right)} + \sum_{n \in W - \{c\}} \frac{count(n)}{\vert \Omega \vert} \log \left(-\vec w^T \vec n\right)$$

we will call this the local objective for the specific word $w$ and the specific context word $c$

$$J_{w,c} = count(w,c) \left\{\log \sigma \left(\vec w^T \vec c\right)\right\} + count(w) k \frac{count(c)}{\vert \Omega \vert} \log \sigma \left(-\vec w^T \vec c\right)$$

what this allows us to do is get rid of all these summations and expectation from the original objective

---

<h4>Optimise the local objective</h4>

what we want to do is find out what logit (thats the thing that inside the sigmoid) , optimises this local $j$ , but why do we care about this ?

well , whatever that logit is , it is equal to the dot product of $w$ and $c$ , and we know that this is equal to the entry (w,c) of whatever matrix we are trying to factorise

again we dont know what matrix we want to factorise , but now we can clearly see that $w^Tc$ gives us a matrix entry , and this matrix entry will optimise this j , so we want to solve for whatever optimises j

$$J_{w,c} = count(w,c) \left\{\log \sigma \left(\vec w^T \vec c\right)\right\} + count(w) k \frac{count(c)}{\vert \Omega \vert} \log \sigma \left(-\vec w^T \vec c\right)$$


$Let: \ x = \ \vec w^T \vec c$

$Optimise: J_{w,c} \ w.r.t \ x$

$Solution: \frac{\partial J_{w,c}}{\partial x} = 0$

---

<h4>High-level steps</h4>

Take derivative and set to 0

first lets get an expression for the derivative :

$$\frac{\partial J_{w,c}}{\partial x} = count(w,c) \sigma(-x) -kcount(w)\frac{count(c)}{\vert \Omega \vert} \sigma(x) = 0$$

hint : remembr that $\sigma(x) = 1 - \sigma(-x)$ also $\sigma(-x) = 1-\sigma(x)$

then we can manipulate this expression a little bit:

$$e^{2x} - \left(\frac{count(w,c)\vert\Omega\vert}{kcount(w)count(c)}-1\right)e^{x} - \frac{count(w,c)\vert \Omega \vert}{k count(w)count(c)} = 0$$

hint : just expand the sigmoid and simplify

the next step is to recognise that this looks a little bit like a quadratic equation except that there is an exponent 

so if we set a new variable $y = e^{x}$ , then we get a quadratic which we can solve

we yeild

$$y = \frac{count(w,c) \vert \Omega \vert}{k count(w) count(c)}$$

then we can plug in back original variables ($w^Tc$) to get that :

$$\vec w^T \vec c = \log \left(\frac{count(w,c)}{count(w)}\frac{\vert \Omega \vert}{count(c)}\right) - \log k$$

---

<h4>Pointwise Mutual Information</h4>

notice how we seperated the term in a very specific way

$$\vec w^T \vec c = \log \left(\frac{count(w,c)}{count(w)}\frac{\vert \Omega \vert}{count(c)}\right) - \log k$$

the reason we did this is because these count proportions represent probabilities

so 

$$\frac{\vert \Omega \vert}{count(c)} = \frac{1}{p(c)}$$

and 

$$\frac{count(w,c)}{count(w)} = \frac{p(w,c)}{p(w)}$$

to get this divide both top and bottom by $\vert \Omega \vert$

In fact when these specific probabilities are combined in this specific way we arrive at a quantity called the Pointwise Mutual Information (PMI matrix) which is a well know quantity in information theory (just like cross entropy)

In general the Pointwise Mutual Information between two discrete variables $x$ and $y$ is given by :

$$PMI(x,y) = \log \frac{P(x,y)}{P(x)P(y)}$$

but notice how in the expression we derived , there is this extra $- log k$ term , this just means this is the shifted PMI matrix , but the basic idea is the same , and for k=1 we would get back the unshifted PMI matrix

 --- 
 
 now one thing we can do is manipulate our PMI a little bit to express it in terms of quantities we already know about
 
 so one term , specifically the one that represents context , comes from the unigram distribution , and we know that one simple modification we can make in word2vec is to smooth out this distribution by raising it to the power 0.75
 
$$\vec w^T \vec c = \log \left(\frac{count(w,c)}{count(w)}\frac{1}{\tilde p(c)}\right)$$

---

<h4>Similarities to GloVe</h4>

One interesting thing that allows us to relate this to GloVe is that , if we discard most of the terms , we can see that the matrix we want to factorise is in both cases proportional to the log of the count of every word-context pair

$$\vec w^Tc \propto \log count(w,c)$$

so this should give us confidence that both word2vec and GloVe are doing very similar things

<h1>code</h1>

In [1]:
# this is basically the same as GloVe code
# but this time we use construct the PMI matrix isntead

In [2]:
import numpy as np
from glob import glob
import string
from sklearn.metrics.pairwise import pairwise_distances
from datetime import datetime
import json
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

In [3]:
V = 20000
D = 300
N_FILES = None
CONTEXT_SZ = 10
XMAX = 100
ALPHA = 0.75
EPOCHS = 10
REG = 0.0
PATH = 'outputs/PMI/ALS/'

In [4]:
def tokenise(sent):
    sent = sent.lower()
    sent = sent.translate(str.maketrans('','',string.punctuation))
    tokens = sent.split()
    return tokens

def get_sentences(path='datasets/wiki/',V=20000,n_files=None):
    files = glob(path+'*.txt')
    files = files[:n_files]
    # first we need to get word2count to identify our top words
    # we will make word2idx once we filter out the top words
    word2count = {}

    # this is a list of lists , each inner list is a sentence of indexes
    sentences = []

    # we need to limit covabulary
    # first we get word2count 
    print('counting words')

    for i,f in enumerate(files):
        for line in open(f,encoding = "utf8"):
            line = line.rstrip()
            if line and line[0] not in ('[', '*', '-', '|', '=', '{', '}'):
                # use lines instead of sentences
                tokens = tokenise(line)
                if len(tokens) < 2:
                    continue
                for token in tokens:
                    word2count[token] = word2count.get(token,0) + 1

        print('finished counting : ',i+1,'/',len(files),' files')

    print('finisehd counting')
    print('processing files')

    # now we use word2count to identify most frequent words
    # we need the special <none> token to replace words that wont make it to our vocabulary
    words  = ['<none>'] + [w for (w,c) in sorted(word2count.items() , reverse=True, key=lambda x: x[1])[:V-1]] 
    word2idx = {w:i for w,i in zip(words,range(V))}
    none = word2idx['<none>']
    for i,f in enumerate(files):
        # in the wiki files each line is a paragraph , we will be taking each paragraph as a sentence
        # we also want to remove header lines
        for line in open(f,encoding = "utf8"):
            line = line.rstrip()
            # skip headers , ...
            if line and line[0] not in ('[', '*', '-', '|', '=', '{', '}'):
                tokens = tokenise(line)
                if len(tokens) < 2:
                    continue
                # now we update word2idx and word2count
                # and we append tokenised line to our sentences
                sentence = [word2idx.get(token,none) for token in tokens]
                sentences.append(sentence)

        print('finished processing : ',i+1,'/',len(files),' files')

    print('finished processing data')
    return sentences,word2idx



In [5]:
sentences,word2idx = get_sentences(V=V,n_files=N_FILES)

counting words
finished counting :  1 / 69  files
finished counting :  2 / 69  files
finished counting :  3 / 69  files
finished counting :  4 / 69  files
finished counting :  5 / 69  files
finished counting :  6 / 69  files
finished counting :  7 / 69  files
finished counting :  8 / 69  files
finished counting :  9 / 69  files
finished counting :  10 / 69  files
finished counting :  11 / 69  files
finished counting :  12 / 69  files
finished counting :  13 / 69  files
finished counting :  14 / 69  files
finished counting :  15 / 69  files
finished counting :  16 / 69  files
finished counting :  17 / 69  files
finished counting :  18 / 69  files
finished counting :  19 / 69  files
finished counting :  20 / 69  files
finished counting :  21 / 69  files
finished counting :  22 / 69  files
finished counting :  23 / 69  files
finished counting :  24 / 69  files
finished counting :  25 / 69  files
finished counting :  26 / 69  files
finished counting :  27 / 69  files
finished counting :  2

In [6]:
def construct_PMI_matrix(sentences):
    X = np.zeros((V,V)) # stores count(w,c)
    N_sentences = len(sentences)
    for k,sentence in enumerate(sentences):
        # just a print to keep progress
        if (k+1)%1000 == 0:
            print('finished: ',k+1,'/',N_sentences)
        N = len(sentence)
        for i in range(N):
            start = max(i-CONTEXT_SZ,0)
            end = min(N,i+CONTEXT_SZ+1)
            w1 = sentence[i]
            
            for j in range(start,i):
                w2 = sentence[j]
                X[w1][w2] += 1
                
            for j in range(i+1,end):
                w2 = sentence[j]
                X[w1][w2] += 1
                
    c_counts = X.sum(axis=0)**ALPHA # counts get smoothed
    c_probs = c_counts/c_counts.sum()
    w_counts = X.sum(axis=1)
    X = X /w_counts[:,None]/c_probs
    PMI = np.log(X+1)
    PMI[PMI<0] = 0 # mentioned in paper    
    return PMI

In [7]:
PMI = construct_PMI_matrix(sentences)
# free some memory
sentences = []

finished:  1000 / 1271592
finished:  2000 / 1271592
finished:  3000 / 1271592
finished:  4000 / 1271592
finished:  5000 / 1271592
finished:  6000 / 1271592
finished:  7000 / 1271592
finished:  8000 / 1271592
finished:  9000 / 1271592
finished:  10000 / 1271592
finished:  11000 / 1271592
finished:  12000 / 1271592
finished:  13000 / 1271592
finished:  14000 / 1271592
finished:  15000 / 1271592
finished:  16000 / 1271592
finished:  17000 / 1271592
finished:  18000 / 1271592
finished:  19000 / 1271592
finished:  20000 / 1271592
finished:  21000 / 1271592
finished:  22000 / 1271592
finished:  23000 / 1271592
finished:  24000 / 1271592
finished:  25000 / 1271592
finished:  26000 / 1271592
finished:  27000 / 1271592
finished:  28000 / 1271592
finished:  29000 / 1271592
finished:  30000 / 1271592
finished:  31000 / 1271592
finished:  32000 / 1271592
finished:  33000 / 1271592
finished:  34000 / 1271592
finished:  35000 / 1271592
finished:  36000 / 1271592
finished:  37000 / 1271592
finished: 

finished:  298000 / 1271592
finished:  299000 / 1271592
finished:  300000 / 1271592
finished:  301000 / 1271592
finished:  302000 / 1271592
finished:  303000 / 1271592
finished:  304000 / 1271592
finished:  305000 / 1271592
finished:  306000 / 1271592
finished:  307000 / 1271592
finished:  308000 / 1271592
finished:  309000 / 1271592
finished:  310000 / 1271592
finished:  311000 / 1271592
finished:  312000 / 1271592
finished:  313000 / 1271592
finished:  314000 / 1271592
finished:  315000 / 1271592
finished:  316000 / 1271592
finished:  317000 / 1271592
finished:  318000 / 1271592
finished:  319000 / 1271592
finished:  320000 / 1271592
finished:  321000 / 1271592
finished:  322000 / 1271592
finished:  323000 / 1271592
finished:  324000 / 1271592
finished:  325000 / 1271592
finished:  326000 / 1271592
finished:  327000 / 1271592
finished:  328000 / 1271592
finished:  329000 / 1271592
finished:  330000 / 1271592
finished:  331000 / 1271592
finished:  332000 / 1271592
finished:  333000 / 

finished:  591000 / 1271592
finished:  592000 / 1271592
finished:  593000 / 1271592
finished:  594000 / 1271592
finished:  595000 / 1271592
finished:  596000 / 1271592
finished:  597000 / 1271592
finished:  598000 / 1271592
finished:  599000 / 1271592
finished:  600000 / 1271592
finished:  601000 / 1271592
finished:  602000 / 1271592
finished:  603000 / 1271592
finished:  604000 / 1271592
finished:  605000 / 1271592
finished:  606000 / 1271592
finished:  607000 / 1271592
finished:  608000 / 1271592
finished:  609000 / 1271592
finished:  610000 / 1271592
finished:  611000 / 1271592
finished:  612000 / 1271592
finished:  613000 / 1271592
finished:  614000 / 1271592
finished:  615000 / 1271592
finished:  616000 / 1271592
finished:  617000 / 1271592
finished:  618000 / 1271592
finished:  619000 / 1271592
finished:  620000 / 1271592
finished:  621000 / 1271592
finished:  622000 / 1271592
finished:  623000 / 1271592
finished:  624000 / 1271592
finished:  625000 / 1271592
finished:  626000 / 

finished:  884000 / 1271592
finished:  885000 / 1271592
finished:  886000 / 1271592
finished:  887000 / 1271592
finished:  888000 / 1271592
finished:  889000 / 1271592
finished:  890000 / 1271592
finished:  891000 / 1271592
finished:  892000 / 1271592
finished:  893000 / 1271592
finished:  894000 / 1271592
finished:  895000 / 1271592
finished:  896000 / 1271592
finished:  897000 / 1271592
finished:  898000 / 1271592
finished:  899000 / 1271592
finished:  900000 / 1271592
finished:  901000 / 1271592
finished:  902000 / 1271592
finished:  903000 / 1271592
finished:  904000 / 1271592
finished:  905000 / 1271592
finished:  906000 / 1271592
finished:  907000 / 1271592
finished:  908000 / 1271592
finished:  909000 / 1271592
finished:  910000 / 1271592
finished:  911000 / 1271592
finished:  912000 / 1271592
finished:  913000 / 1271592
finished:  914000 / 1271592
finished:  915000 / 1271592
finished:  916000 / 1271592
finished:  917000 / 1271592
finished:  918000 / 1271592
finished:  919000 / 

finished:  1171000 / 1271592
finished:  1172000 / 1271592
finished:  1173000 / 1271592
finished:  1174000 / 1271592
finished:  1175000 / 1271592
finished:  1176000 / 1271592
finished:  1177000 / 1271592
finished:  1178000 / 1271592
finished:  1179000 / 1271592
finished:  1180000 / 1271592
finished:  1181000 / 1271592
finished:  1182000 / 1271592
finished:  1183000 / 1271592
finished:  1184000 / 1271592
finished:  1185000 / 1271592
finished:  1186000 / 1271592
finished:  1187000 / 1271592
finished:  1188000 / 1271592
finished:  1189000 / 1271592
finished:  1190000 / 1271592
finished:  1191000 / 1271592
finished:  1192000 / 1271592
finished:  1193000 / 1271592
finished:  1194000 / 1271592
finished:  1195000 / 1271592
finished:  1196000 / 1271592
finished:  1197000 / 1271592
finished:  1198000 / 1271592
finished:  1199000 / 1271592
finished:  1200000 / 1271592
finished:  1201000 / 1271592
finished:  1202000 / 1271592
finished:  1203000 / 1271592
finished:  1204000 / 1271592
finished:  120

In [12]:
# Super vectorised ALS
def train():
    W = np.random.randn(V,D)/np.sqrt(V+D)
    U = np.random.randn(V,D)/np.sqrt(V+D)
    b = np.random.randn(V)/np.sqrt(V)
    c = np.random.randn(V)/np.sqrt(V)
    mu = PMI.mean()
    costs = []
    for epoch in range(EPOCHS):
        t0 = datetime.now()
        # super vectorisation !

        # update W
        A = U.T@U + REG*np.eye(D) # DxD matrix
        B = (PMI -b[:,None]-c-mu)@U # VxD matrix
        W = np.linalg.solve(A,B.T).T 
        
        print('epoch : ',epoch+1,' updated W')

        # update U
        A = W.T@W + REG*np.eye(D) # DxD matrix
        B = (PMI -b[:,None]-c-mu).T@W# VxD matrix
        U = np.linalg.solve(A,B.T).T 
        
        print('epoch : ',epoch+1,' updated U')
        
        # update b
        num = (PMI- W@U.T-c-mu).sum(axis=1)
        denom = V + REG
        b = num/denom 
        
        print('epoch : ',epoch+1,' updated b')
        
        # update c
        num = (PMI- W@U.T-b[:,None]-mu).sum(axis=0)
        denom = V + REG
        c = num/denom 
            
        print('epoch : ',epoch+1,' updated c')
    
        cost = np.sum((W@U.T+b[:,None]+c+mu-PMI)**2)
        costs.append(cost)        
        print('epoch: ',epoch+1,'/',EPOCHS,' cost : ',cost,' time : ',datetime.now()-t0)
        
    return W,U

In [13]:
W,U = train()

epoch :  1  updated W
epoch :  1  updated U
epoch :  1  updated b
epoch :  1  updated c
epoch:  1 / 10  cost :  62641735.59257191  time :  0:00:43.758131
epoch :  2  updated W
epoch :  2  updated U
epoch :  2  updated b
epoch :  2  updated c
epoch:  2 / 10  cost :  54633490.358161174  time :  0:00:42.450739
epoch :  3  updated W
epoch :  3  updated U
epoch :  3  updated b
epoch :  3  updated c
epoch:  3 / 10  cost :  53514036.9768321  time :  0:00:42.222430
epoch :  4  updated W
epoch :  4  updated U
epoch :  4  updated b
epoch :  4  updated c
epoch:  4 / 10  cost :  53166679.76127403  time :  0:00:44.283504
epoch :  5  updated W
epoch :  5  updated U
epoch :  5  updated b
epoch :  5  updated c
epoch:  5 / 10  cost :  53023205.1390841  time :  0:00:43.071669
epoch :  6  updated W
epoch :  6  updated U
epoch :  6  updated b
epoch :  6  updated c
epoch:  6 / 10  cost :  52952926.55454735  time :  0:00:42.149108
epoch :  7  updated W
epoch :  7  updated U
epoch :  7  updated b
epoch :  7 

In [18]:
# lets save our weights
np.savez(PATH+'weights.npz' , W, U)
# and word2idx
with open(PATH+'word2idx.json' , 'w') as f:
    json.dump(word2idx, f)

In [23]:
# load weights
npz = np.load(PATH+'weights.npz')
W,U = npz['arr_0'],npz['arr_1'].T
# lpad word2idx
with open(PATH+'word2idx.json') as f:
    word2idx = json.load(f)

In [25]:
def get_analogy(w1,w2,w4):
    E = (W+U)/2
    print('using averaged weight matricies : ')
    analogy(E,word2idx,w1,w2,w4)
    print('------------------------------------------')
    E = np.concatenate((W,U),axis=1)
    print('using concatenated weight matricies : ')
    analogy(E,word2idx,w1,w2,w4)
    print('------------------------------------------')
    print('------------------------------------------')



# king - man = ? - woman
def analogy(E,word2idx,w1,w2,w4):
    # first lets get our vector
    D = E.shape[1]
    king = word2idx.get(w1,None)
    man = word2idx.get(w2,None)
    woman = word2idx.get(w4,None)
    if king is None or man is None or woman is None:
        print('word not in dictionary')
        return
    king = E[king]
    man = E[man]
    woman = E[woman]
    queen = king - man + woman

    # next we calculate the distance between our vector and all other vectors
    # once using euclidean distance then again using cosine
    idx2word = {v:k for k,v in word2idx.items()}
    metrics = ['euclidean','cosine']

    for metric in metrics:
        distances = pairwise_distances(queen.reshape(1,D),E,metric=metric)
        # now we need to consider the 4 closest neighbours to that point
        # not to return a word in [w1,w2,w4]
        closest =  np.argpartition(distances[0], 4)[:4]
        closest = [idx2word[i] for i in closest]

        for word in closest:
            if word not in [w1,w2,w4]:
                print(metric,' distance :',w1,"-",w2,'=',word,'-',w4)
                break




In [30]:
get_analogy('france', 'paris', 'london')
get_analogy('france', 'paris', 'rome')
get_analogy('paris', 'france', 'italy')
get_analogy('france', 'french', 'english')
get_analogy('japan', 'japanese', 'chinese')
get_analogy('japan', 'japanese', 'italian')
get_analogy('japan', 'japanese', 'australian')
get_analogy('december', 'november', 'june')
get_analogy('true', 'false',  'bad')
get_analogy('fix','break','black')
get_analogy('even','odd','3')
get_analogy('woman', 'man',  'male')
get_analogy('king', 'emperor',  'empire')
get_analogy('north', 'south',  'west')
get_analogy('mosque', 'church',  'christianity')
get_analogy('six', 'five',  'three')
get_analogy('summer', 'hot',  'cold')
get_analogy('fire', 'water',  'south')
get_analogy('christianity', 'bible',  'quran')
get_analogy('arabic', 'arab',  'american')
get_analogy('fly', 'flying',  'walking')
get_analogy('proton', 'electron',  'negative')
get_analogy('king', 'prince', 'princess')
get_analogy('man', 'woman', 'she')
get_analogy('february', 'january',  'october')
get_analogy('heir', 'heiress',  'princess')
get_analogy('france', 'paris',  'tokyo')
get_analogy('france', 'paris',  'beijing')
get_analogy('france', 'paris',  'rome')
get_analogy('france', 'paris',  'berlin')
get_analogy('miami', 'florida', 'texas')
get_analogy('france', 'french',  'english')
get_analogy('japan', 'japanese',  'chinese')
get_analogy('china', 'chinese',  'american')
get_analogy('japan', 'japanese',  'italian')
get_analogy('japan', 'japanese',  'australian')
get_analogy('man', 'woman',  'mother')
get_analogy('nephew', 'niece',  'aunt')
get_analogy('man', 'woman',  'actress')

using averaged weight matricies : 
euclidean  distance : france - paris = kingdom - london
cosine  distance : france - paris = england - london
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : france - paris = england - london
cosine  distance : france - paris = britain - london
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : france - paris = italy - rome
cosine  distance : france - paris = papal - rome
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : france - paris = italy - rome
cosine  distance : france - paris = papal - rome
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : paris - france = venice - italy
cosine  distance : paris - france = milan - italy
-----------------------------------

euclidean  distance : fly - flying = go - walking
cosine  distance : fly - flying = walk - walking
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : proton - electron = positive - negative
cosine  distance : proton - electron = positive - negative
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : proton - electron = positive - negative
cosine  distance : proton - electron = positive - negative
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : king - prince = queen - princess
cosine  distance : king - prince = queen - princess
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : king - prince = daughter - princess
cosine  distance : king - prince = queen - princess
------------------------------------

In [42]:
# now lets try for SVD!
PATH = 'outputs/PMI/SVD/'

In [40]:
# lets rewrite our train function
def train():
    mu = PMI.mean()
    model = TruncatedSVD(n_components=D)
    t0 = datetime.now()
    # fit
    Z = model.fit_transform(PMI-mu)
    U = model.components_
    S = model.explained_variance_
    S_inv = np.linalg.inv(np.diag(S))
    W = Z@S_inv
    cost = np.sum((W@U+mu-PMI)**2)
    print(' cost : ',cost,' time : ',datetime.now()-t0)
    return W,U

In [41]:
W,U = train()

 cost :  70520442.41442539  time :  0:01:10.918986


In [43]:
# lets save our weights
np.savez(PATH+'weights.npz' , W, U)
# and word2idx
with open(PATH+'word2idx.json' , 'w') as f:
    json.dump(word2idx, f)

In [44]:
# load weights
npz = np.load(PATH+'weights.npz')
W,U = npz['arr_0'],npz['arr_1'].T
# lpad word2idx
with open(PATH+'word2idx.json') as f:
    word2idx = json.load(f)

In [45]:
get_analogy('france', 'paris', 'london')
get_analogy('france', 'paris', 'rome')
get_analogy('paris', 'france', 'italy')
get_analogy('france', 'french', 'english')
get_analogy('japan', 'japanese', 'chinese')
get_analogy('japan', 'japanese', 'italian')
get_analogy('japan', 'japanese', 'australian')
get_analogy('december', 'november', 'june')
get_analogy('true', 'false',  'bad')
get_analogy('fix','break','black')
get_analogy('even','odd','3')
get_analogy('woman', 'man',  'male')
get_analogy('king', 'emperor',  'empire')
get_analogy('north', 'south',  'west')
get_analogy('mosque', 'church',  'christianity')
get_analogy('six', 'five',  'three')
get_analogy('summer', 'hot',  'cold')
get_analogy('fire', 'water',  'south')
get_analogy('christianity', 'bible',  'quran')
get_analogy('arabic', 'arab',  'american')
get_analogy('fly', 'flying',  'walking')
get_analogy('proton', 'electron',  'negative')
get_analogy('king', 'prince', 'princess')
get_analogy('man', 'woman', 'she')
get_analogy('february', 'january',  'october')
get_analogy('heir', 'heiress',  'princess')
get_analogy('france', 'paris',  'tokyo')
get_analogy('france', 'paris',  'beijing')
get_analogy('france', 'paris',  'rome')
get_analogy('france', 'paris',  'berlin')
get_analogy('miami', 'florida', 'texas')
get_analogy('france', 'french',  'english')
get_analogy('japan', 'japanese',  'chinese')
get_analogy('china', 'chinese',  'american')
get_analogy('japan', 'japanese',  'italian')
get_analogy('japan', 'japanese',  'australian')
get_analogy('man', 'woman',  'mother')
get_analogy('nephew', 'niece',  'aunt')
get_analogy('man', 'woman',  'actress')

using averaged weight matricies : 
euclidean  distance : france - paris = britain - london
cosine  distance : france - paris = britain - london
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : france - paris = britain - london
cosine  distance : france - paris = britain - london
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : france - paris = italy - rome
cosine  distance : france - paris = italy - rome
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : france - paris = italy - rome
cosine  distance : france - paris = roman - rome
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : paris - france = rome - italy
cosine  distance : paris - france = bologna - italy
-----------------------------------

euclidean  distance : fly - flying = walk - walking
cosine  distance : fly - flying = trails - walking
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : proton - electron = positive - negative
cosine  distance : proton - electron = positive - negative
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : proton - electron = addition - negative
cosine  distance : proton - electron = negatively - negative
------------------------------------------
------------------------------------------
using averaged weight matricies : 
euclidean  distance : king - prince = queen - princess
cosine  distance : king - prince = queen - princess
------------------------------------------
using concatenated weight matricies : 
euclidean  distance : king - prince = kings - princess
cosine  distance : king - prince = queen - princess
---------------------------------