Today I want to show you an extremely cool thing discovered by Arora et al. in the paper [Linear Algebraic Structure of Word Senses, with Applications to Polysemy](https://arxiv.org/abs/1601.03764). The paper is part of a series in which the authors try to explain the properties of word embeddings theoretically. In this work they assume that an ordinary word embedding, obtained for example with the word2vec or GloVe algorithms, mixes several senses of the word, and they show how to pick those senses out with a sparse coding technique. Super cool stuff.

More formally, let $\nu_{tie}$ be the embedding of the word *tie*. We assume that each word embedding is a linear combination of its senses
$$\nu_{tie} \approx \alpha_1 \nu_{tie1} + \alpha_2 \nu_{tie2} + \alpha_3 \nu_{tie3} + \dots$$
where $\nu_{tie1}, \nu_{tie2}, \dots$ are the vectors of the individual senses and $\alpha_1, \alpha_2, \dots$ are their coefficients.
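
To make the assumption concrete, here is a toy sketch in NumPy. The sense vectors and coefficients are random made-up values (not taken from any real embedding), and the names `senses`, `alphas` and `word_vec` are mine. It only shows that if the sense vectors were known, the coefficients could be read back with least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300

# Three made-up "sense" vectors, hypothetical stand-ins for tie1, tie2, tie3.
senses = rng.normal(size=(3, dim))
alphas = np.array([0.6, 0.3, 0.1])

# The word vector is modeled as a weighted sum of its sense vectors.
word_vec = alphas @ senses

# With known senses, least squares recovers the mixing coefficients.
recovered, *_ = np.linalg.lstsq(senses.T, word_vec, rcond=None)
print(np.round(recovered, 3))
```

In practice the sense vectors are not known in advance, which is why a dictionary of atoms has to be learned from all embeddings at once.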

This notebook shows the magic of the k-SVD algorithm: how to apply it to word embeddings to obtain the different senses of a word.

In [22]:
import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine


**1. Load the word embeddings through gensim interface**

Here we load the GloVe vectors and convert them to word2vec format so that we can manipulate them through gensim. Because of their size, the embeddings are not included at the path below. You can download them [here](https://nlp.stanford.edu/projects/glove/) and adjust the path if needed. Note that both the authors of the paper and I used the 300d vectors.

In [2]:
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("./embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)

In [3]:
embeddings = model  # load_word2vec_format already returns KeyedVectors, so .wv is not needed

index2word = embeddings.index2word
embedds = embeddings.vectors

In [5]:
print(embedds.shape)

(400000, 300)


So we have 400000 unique words, each represented by a 300-dimensional vector.

**2. Installing and applying ksvd to the embedding matrix**

Now we need to obtain the atoms of discourse through sparse recovery. To do that we will use the ksvd package.

In [13]:
!pip install ksvd
from ksvd import ApproximateKSVD



From the paper we know that the number of atoms is 2000 and the sparsity parameter k is 5. I trained two versions: one on 10000 embeddings and one on the whole embedding matrix. Because this process takes quite a lot of time, especially for the whole matrix, I saved the resulting matrices so you can just load them.
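
To make the sparsity parameter concrete: each embedding is approximated using at most k atoms of the dictionary. The sketch below is a plain-NumPy orthogonal matching pursuit run on a random toy dictionary. The function name `omp` and the toy sizes are my own choices for illustration, and this is not the code of the ksvd package.

```python
import numpy as np

def omp(dictionary, x, k):
    """Greedy orthogonal matching pursuit: code x with at most k atoms (rows of dictionary)."""
    residual = x.astype(float)
    support = []
    coefs = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        scores = np.abs(dictionary @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit all chosen atoms jointly by least squares.
        sub = dictionary[support].T  # shape (dim, number of chosen atoms)
        coefs, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ coefs
    code = np.zeros(dictionary.shape[0])
    code[support] = coefs
    return code

# Toy data: 60 unit-norm atoms in 30 dimensions, and a vector built from 3 of them.
rng = np.random.default_rng(0)
toy_dictionary = rng.normal(size=(60, 30))
toy_dictionary /= np.linalg.norm(toy_dictionary, axis=1, keepdims=True)
x = 1.0 * toy_dictionary[3] + 0.5 * toy_dictionary[17] - 0.8 * toy_dictionary[42]

code = omp(toy_dictionary, x, k=3)
error = np.linalg.norm(x - code @ toy_dictionary)
print(np.count_nonzero(code), error)
```

The learned dictionary in the notebook plays the role of `toy_dictionary`, and one row of the sparse code matrix plays the role of `code`.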

In [5]:
%%time
# %%time must be the first line of the cell to time it; a bare %time line measures nothing.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
# Rows of `dictionary` are the atoms; each row of `gamma` is the sparse code of one word.
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


In [26]:
#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
# .files is a plain list of array names, so indexing it works across NumPy versions
dictionary = dictionary[dictionary.files[0]]

In [7]:
#print(gamma.shape)
print(dictionary.shape)

(10000, 2000)
(2000, 300)
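
The two shapes fit together as follows: each row of `gamma` is the sparse code of one word over the 2000 atoms, and `gamma @ dictionary` approximates the embedding matrix. As I read the paper, the senses of a word are the atoms with nonzero coefficients in its row. The sketch below uses small synthetic matrices with the same layout; `active_atoms` is a hypothetical helper name of mine.

```python
import numpy as np

def active_atoms(gamma_row):
    """Return (atom_index, coefficient) pairs for nonzero entries, largest magnitude first."""
    idx = np.flatnonzero(gamma_row)
    order = np.argsort(-np.abs(gamma_row[idx]))
    return [(int(i), float(gamma_row[i])) for i in idx[order]]

# Synthetic stand-ins with the notebook's layout, but much smaller sizes.
rng = np.random.default_rng(0)
n_words, n_atoms, dim, k = 4, 20, 8, 5
toy_dictionary = rng.normal(size=(n_atoms, dim))
toy_gamma = np.zeros((n_words, n_atoms))
for row in toy_gamma:
    # Each word uses exactly k atoms in this toy setup.
    cols = rng.choice(n_atoms, size=k, replace=False)
    row[cols] = rng.normal(size=k)

reconstruction = toy_gamma @ toy_dictionary  # shape (n_words, dim), like the embeddings
atoms_of_first_word = active_atoms(toy_gamma[0])
print(reconstruction.shape)
print(atoms_of_first_word)
```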


In [6]:
#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)
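
A note on the file format: when `np.savez_compressed` receives an array as a positional argument, it stores it under an automatic name (`arr_0`, `arr_1`, ...), which is why the loading cells take the first stored array. A minimal round trip, using a temporary directory and a made-up file name:

```python
import os
import tempfile
import numpy as np

matrix = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dictionary_demo.npz")  # hypothetical file name
    # Positional arrays are stored under automatic names: arr_0, arr_1, ...
    np.savez_compressed(path, matrix)
    with np.load(path) as archive:
        names = archive.files         # list of stored array names
        restored = archive[names[0]]  # read the first (and only) array

print(names, restored.shape)
```

Passing a keyword instead, for example `np.savez_compressed(path, dictionary=matrix)`, would store the array under the name `dictionary`.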

**3. Determining relationships between atoms/dictionaries and the source matrix**

Let's play with the dictionary we obtained by finding the nearest words for several randomly chosen atoms.
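
As a rough picture of what such a nearest-words lookup computes, here is a NumPy sketch of a top-n search by cosine similarity. The function name `top_n_by_cosine` and the two-dimensional toy vectors are made up for illustration; gensim's own implementation may differ in details.

```python
import numpy as np

def top_n_by_cosine(query, vectors, words, n=10):
    """Return the n (word, cosine similarity) pairs closest to query."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_vectors @ unit_query
    best = np.argsort(-sims)[:n]  # indices of the largest similarities
    return [(words[i], float(sims[i])) for i in best]

toy_words = ["north", "east", "south", "north-east"]
toy_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
result = top_n_by_cosine(np.array([0.1, 1.0]), toy_vectors, toy_words, n=2)
print(result)
```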

In [33]:
embeddings.similar_by_vector(dictionary[1354,:])


[('вслепую', 0.19667410850524902),
 ('успевать', 0.19195224344730377),
 ('незамедлительно', 0.18898837268352509),
 ('киссинген', 0.18743178248405457),
 ('дознано', 0.18590040504932404),
 ('хотек', 0.17938834428787231),
 ('безотлагательно', 0.17875441908836365),
 ('всепокорно', 0.17833231389522552),
 ('неослабно', 0.17833179235458374),
 ('молниеносно', 0.1781741827726364)]

In [15]:
embeddings.similar_by_vector(dictionary[1350,:])


[('applemans', 0.48566657304763794),
 ('psystar', 0.48538529872894287),
 ('aluminio', 0.46252337098121643),
 ('autlan', 0.4580308794975281),
 ('tongli', 0.45648980140686035),
 ('neuromarketing', 0.4531436562538147),
 ('thongrung', 0.4484277665615082),
 ('keitt', 0.44737738370895386),
 ('tom.fowler@chron.com', 0.441282719373703),
 ('sintered', 0.4411150813102722)]

In [14]:
embeddings.similar_by_vector(dictionary[1546,:])


[('lodovico', 0.5247857570648193),
 ('tasso', 0.5008430480957031),
 ('ariosto', 0.49769654870033264),
 ('frigerio', 0.4863497316837311),
 ('khayyám', 0.4795286953449249),
 ('pleasance', 0.47886714339256287),
 ('torquato', 0.4760018289089203),
 ('aliki', 0.4667115807533264),
 ('maini', 0.4652693271636963),
 ('blagojević', 0.46242815256118774)]

In [13]:
embeddings.similar_by_vector(dictionary[1850,:])


[('porcelain', 0.6797035336494446),
 ('handcrafted', 0.6475120186805725),
 ('inlaid', 0.6458134055137634),
 ('tapestries', 0.6427910923957825),
 ('ceramics', 0.6425598859786987),
 ('metalwork', 0.6384821534156799),
 ('rugs', 0.6371402144432068),
 ('carpets', 0.6327065825462341),
 ('embroidery', 0.629633903503418),
 ('handicrafts', 0.6282943487167358)]

Impressive result! The atoms really are centroids of similar words. Now let's take a couple of polysemous words and find their nearest atoms, which should represent the different meanings. I'm going to take 'tie' and 'spring' from the paper.

In [16]:
itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
string_emb = embedds[ispring]  # embedding of 'spring'; the name is kept because later cells use it

In [20]:
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )
    
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:15]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #718: semifinal quarterfinal finals play-off semi-final playoff semifinals semi-finals matches qualifying
Atom #860: win winning second third last fourth champion first fifth ,
Atom #1609: everton tottenham middlesbrough 1-0 chelsea fulham liverpool 2-0 2-1 sunderland
Atom #928: assists scored goals scoring rebounds points goal steals turnovers touchdowns
Atom #282: . , same but the though well one when which
Atom #1705: quarterback nfl broncos patriots cowboys touchdowns seahawks 49ers raiders redskins
Atom #1829: want 'll let you n't go tell come able ca
Atom #16: trousers tunic blouse dresses pants skirts sleeveless satin sweater blouses
Atom #711: something indeed really quite kind seems thing always certainly very
Atom #266: saying officials tuesday but thursday wednesday monday week earlier however
Atom #449: . which , already country although year while more countries
Atom #912: june april july october september january november march february december
Atom #723: 2-7 3-7 1-
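
The search loop above is repeated for every word, so it can be folded into one small function. This is a vectorized sketch under the assumption that no atom is the zero vector; `nearest_atoms` is a hypothetical name of mine, and the toy dictionary is made up.

```python
import numpy as np

def nearest_atoms(dictionary, word_vec, n=6):
    """Indices of the n atoms with the smallest cosine distance to word_vec."""
    atom_norms = np.linalg.norm(dictionary, axis=1)
    sims = dictionary @ word_vec / (atom_norms * np.linalg.norm(word_vec))
    # Cosine distance is 1 - cosine similarity, so sort ascending by distance.
    return np.argsort(1.0 - sims)[:n].tolist()

toy_dictionary = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_word = np.array([0.9, 0.2])
closest = nearest_atoms(toy_dictionary, toy_word, n=2)
print(closest)
```

With the notebook's variables, a call like `nearest_atoms(dictionary, tie_emb, n=15)` would stand in for the loop that builds `simlist`.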

In [74]:
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )
    
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:6]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre


In [21]:
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )
    
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #912: june april july october september january november march february december
Atom #282: . , same but the though well one when which
Atom #1829: want 'll let you n't go tell come able ca
Atom #449: . which , already country although year while more countries
Atom #860: win winning second third last fourth champion first fifth ,
Atom #266: saying officials tuesday but thursday wednesday monday week earlier however
Atom #1004: analysts earnings expectations expected investors market expect recent economy outlook
Atom #711: something indeed really quite kind seems thing always certainly very
Atom #1669: organizations organization education promote community organized educational development initiative addition
Atom #121: help need needs because needed helping keep already especially well


Okay, just out of curiosity, let's do the same for the Russian fastText embeddings and see what happens. I downloaded the source embeddings from [RusVectores](http://rusvectores.org). They were trained on the Russian National Corpus with dimensionality 300.

In [10]:
fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')

In [11]:
embeddings = fasttext_model  # KeyedVectors.load already returns the vectors, so .wv is not needed

index2word = embeddings.index2word
embedds = embeddings.vectors

In [12]:
embedds.shape

(164996, 300)

In [14]:
%%time
# %%time must be the first line of the cell to time it; a bare %time line measures nothing.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors[:10000]  # fit on the first 10000 embeddings only
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


In [34]:
dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
# .files is a plain list of array names, so indexing it works across NumPy versions
dictionary = dictionary[dictionary.files[0]]

In [35]:
embeddings.similar_by_vector(dictionary[1024,:], 20)


[('исчезать', 0.6854609251022339),
 ('бесследно', 0.6593252420425415),
 ('исчезавший', 0.6360634565353394),
 ('бесследный', 0.5998549461364746),
 ('исчезли', 0.5971367955207825),
 ('исчез', 0.5862340927124023),
 ('пропадать', 0.5788886547088623),
 ('исчезлотец', 0.5788123607635498),
 ('исчезнувший', 0.5623885989189148),
 ('исчезинать', 0.5610565543174744),
 ('ликвидироваться', 0.5551878809928894),
 ('исчезнуть', 0.551397442817688),
 ('исчезнет', 0.5356274247169495),
 ('исчезание', 0.531707227230072),
 ('устраняться', 0.5174376368522644),
 ('ликвидируть', 0.5131562948226929),
 ('ликвидировать', 0.5120065212249756),
 ('поглощаться', 0.5077806115150452),
 ('исчезаний', 0.5074601173400879),
 ('улетучиться', 0.5068254470825195)]

In [20]:
itie = index2word.index('коса')  # 'коса': braid, scythe, or spit of land
ispring = index2word.index('ключ')  # 'ключ': key, clue, or water spring

tie_emb = embedds[itie]
string_emb = embedds[ispring]

In [23]:
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )
    
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))

Atom #185: загадка загадк загадкай загад проблема вопрос разгадка загадать парадокс задача
Atom #1217: дверь дверью двери дверка дверной калитка ставень запереть дверь-то настежь
Atom #1213: папка бумажник сейф сундук портфель чемодан ящик сундучк пачка сундучок
Atom #1978: кран плита крышка вентиль клапан электроплита котел плитка раковина посуда
Atom #1796: карман пазуха кармашек бумажник карманута карманбыть пазух карманчик карманьол кармашка
Atom #839: кнопка кнопф нажимать кноп клавиша нажать кнопа кнопочка рычажок нажатие
Atom #989: отыскивать искать отыскиваться поискать разыскивать разыскиваться поиск поискивать отыскать отыскаться
Atom #414: молоток молот топор пила колот молотобоец молотой кувалда молота умолот
Atom #1140: капиталец капитал капиталовек капиталист капитально капитализм -капиталист капитальный капиталоемкий капиталовложение
Atom #878: хранитель хранить храниться хранивший хранивать хранимый храниваться хранилище хранеть хранившийся



In [24]:
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )
    
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))

Atom #883: косой русый кудряшка косичка челка русой черноволосой кудрявый кудряш светло-русый
Atom #40: кустарник заросль осока ивняк трава бурьян папоротник кустик полукустарник бурьяна
Atom #215: ниточка паучок бусинка паутинка жердочка стебелька веточка стебелек травинка пупырышек
Atom #688: волос валюта кудри валютный борода валютчик ус бивалютный коса усы
Atom #386: плечотец грудь шея подбородок бедро грудью ляжка плечо затылок живот
Atom #676: веревка канат бечевка веревочка бечевкий шест репшнур жердь веревочный ремень
Atom #414: молоток молот топор пила колот молотобоец молотой кувалда молота умолот
Atom #127: сюртучок сюртук галстучок фрак панталоны галстучек сюртуки галстук платье галстух
Atom #592: салфетка скатерть салфеточка платок шаль полотенце кружевной кружевцо кисея шелка
Atom #703: шлюпка катер баркас фок-мачта грот-мачта мачта фрегат судно корвет шхуна



In [25]:
np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)

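If you don't read Russian, here is a short guide. 'ключ' can mean a key, a clue, or a water spring; its nearest atoms group words about riddles and puzzles, doors, buttons to press, and searching. 'коса' can mean a braid, a scythe, or a spit of land; its nearest atoms group words about hair, grass and shrubs, ropes, tools such as hammers and axes, and boats. So trust me, it works well here too.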