# Fun with word embeddings

#### This notebook explains word embeddings in a simple way. Using Python and bpemb, we run several experiments to see what word embeddings actually are and how they work.

For this, we use the bpemb byte-pair embeddings from https://github.com/bheinzerling/bpemb
They need to be imported first.


In [4]:
from bpemb import BPEmb
import numpy as np
# a smaller german model is used, so vocab_size is set to 25000 and dimensions is set to 25
bpemb_ger = BPEmb(lang="de", vs=25000, dim=25)

What does it mean that the dimensions are set to 25? What are word embeddings in the first place?

Let's have a look at them in a simple way.

In [13]:
# The words that are about to be inspected are first defined
word1 = "Mann"
word2 = "Frau"
word3 = "Tochter"
word4 = "Kind"

words = [word1, word2, word3, word4]
# Now a function _encode is defined, which creates very simple word embeddings for those words
def _encode(words):
    # First the words are mapped to indices so they are easier to handle
    word_indices = list(range(len(words)))
    
    # Each word gets a vector in which every dimension has a specific or unspecific meaning.
    # We define 2 dimensions here:
    word_vectors = {i: np.array([0, 0]) for i in word_indices}
    # Since this is a simple example, we are allowed to define the "meaning" of the dimensions ourselves.
    # The first dimension gets the name "grownup", because it should define whether someone is grown up.
    # The second dimension gets the name "related", because it defines whether the context is about being related to someone.
    
    # Now we can define our word vectors manually with those meanings
    ## Mann - man / husband
    word_vectors[0][0] = 1
    word_vectors[0][1] = -1
    ## Frau - woman / wife
    word_vectors[1][0] = 1
    word_vectors[1][1] = -1
    ## Tochter - daughter
    word_vectors[2][0] = 2
    word_vectors[2][1] = -2
    ## Kind - kid / child
    word_vectors[3][0] = 2
    word_vectors[3][1] = 0
    
    # Now the word vectors are defined
    
    return word_vectors

In [14]:
word_vectors = _encode(words)
print(word_vectors)

{0: array([ 1, -1]), 1: array([ 1, -1]), 2: array([ 2, -2]), 3: array([2, 0])}


NumPy allows us to do element-wise calculations, so we can apply mathematical operations to these word vectors.
Let's find out what happens when "Mann" and "Frau" are combined.

In [9]:
# Remember, "Mann" has index 0 and "Frau" has index 1

In [15]:
combined_word_vector = word_vectors[0] + word_vectors[1]

In [16]:
print(combined_word_vector)

[ 2 -2]


This combined word vector holds the sum of the vectors for "Mann" and "Frau", which is the same as the vector for "Tochter". Our hard-coded model therefore encodes the association that man and woman combined point at daughter.
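
The same idea can be written more compactly by keying the vectors by word instead of by index. This is only a sketch using the hand-picked values from this notebook; the names `toy_vectors` and `words_matching` are made up for illustration.

```python
import numpy as np

# Hand-made 2-dimensional vectors: [grownup, related].
# These are the illustrative values chosen in this notebook, not learned ones.
toy_vectors = {
    "Mann": np.array([1, -1]),
    "Frau": np.array([1, -1]),
    "Tochter": np.array([2, -2]),
    "Kind": np.array([2, 0]),
}

def words_matching(vector, vectors):
    """Return every word whose vector equals `vector` exactly."""
    return [word for word, vec in vectors.items() if np.array_equal(vec, vector)]

combined = toy_vectors["Mann"] + toy_vectors["Frau"]
print(combined)                                # [ 2 -2]
print(words_matching(combined, toy_vectors))   # ['Tochter']
```

An exact match only works because the values were chosen by hand. Learned vectors almost never match exactly, which is why a similarity search is needed later on.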

### Word embeddings in bpemb

Now we know about the dimension parameter of the BPEmb model. But a few words should be said about the vocab_size parameter.
In general, an NLP vocabulary reserves tokens that represent rare or unknown words. Depending on your hardware, the vocabulary size will restrict you at some point in your model building, since a larger vocabulary needs more memory. Simply put, a vocab_size of 25,000 tokens means roughly 24,999 real tokens and 1 token that represents the rare ones. (There may also be other tokens, for example for special encodings or other special word parts.) Increasing the vocabulary size to 50,000 results in more distinct tokens that, in the 25,000 model, would be "hidden" in the rare token. Note that bpemb works with byte-pair encoded subword units, so a rare word is usually split into several smaller known pieces rather than replaced as a whole.
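
The effect of the vocabulary size on rare words can be sketched with a tiny word-level vocabulary. This is a simplified illustration and not how bpemb builds its subword vocabulary; `build_vocab`, `to_ids` and the `<unk>` token name are assumptions made for this example.

```python
from collections import Counter

def build_vocab(tokens, vocab_size, unk_token="<unk>"):
    """Keep the (vocab_size - 1) most frequent tokens and reserve one slot for rare ones."""
    most_common = [tok for tok, _ in Counter(tokens).most_common(vocab_size - 1)]
    return {tok: idx for idx, tok in enumerate([unk_token] + most_common)}

def to_ids(tokens, vocab, unk_token="<unk>"):
    """Map tokens to ids; anything outside the vocabulary falls back to the rare token."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

corpus = "der mann und die frau und das kind und der hund".split()

small = build_vocab(corpus, vocab_size=3)   # only "und" and "der" fit next to <unk>
large = build_vocab(corpus, vocab_size=9)   # every distinct word gets its own id

print(to_ids(["und", "kind"], small))   # [1, 0] -> "kind" is hidden in the rare token
print(to_ids(["und", "kind"], large))   # "kind" now has its own id
```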

bpemb ships pretrained vectors that allow us to run some experiments and understand them better. What happens when we do the same as above with bpemb?

In [22]:
embeds1 = bpemb_ger.embed('Mann')
embeds2 = bpemb_ger.embed('Frau')

embed_add = embeds1 + embeds2
bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.9576594233512878),
 ('▁mann', 0.9351305365562439),
 ('▁freund', 0.8832379579544067),
 ('▁mutter', 0.8588590621948242),
 ('▁vater', 0.8556250333786011),
 ('▁kind', 0.8479531407356262)]

Stop, let's go through each line to see what happens.

In [28]:
embeds1 = bpemb_ger.embed('Mann')
print(embeds1)

[[-0.69423   0.142541  0.367741  0.114064 -0.191817  0.329579 -0.181716
   1.041099 -1.150414  0.57117  -0.039958  0.936779  0.35633  -0.221654
   0.423722  0.433451 -0.308177 -0.793244  0.912121 -0.83934  -0.194813
   0.740967 -0.4088    0.140494  0.922742]]


We use the .embed function of our bpemb model, which splits the input 'Mann' into the subword tokens used by bpemb and then returns the corresponding vectors.

The result is a two-dimensional NumPy array with one row of 25 values (the dimension we chose) per token. Since "Mann" results in only one token, we have only one row of 25 values.

NumPy arrays can be added with +, which sums them element by element rather than joining them.

In [29]:
embed_add = embeds1 + embeds2
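
What the addition above does to the shape can be checked without downloading a model. The sketch below uses random arrays as stand-ins for the output of .embed (one row of 25 values for a single token); the variable names are hypothetical.

```python
import numpy as np

# Stand-ins for the output of .embed(): one row per subword token, 25 columns.
# Random values are used only so the sketch runs without a pretrained model.
rng = np.random.default_rng(0)
vec_a = rng.normal(size=(1, 25))
vec_b = rng.normal(size=(1, 25))

# NumPy adds the arrays element by element, so the shape stays the same.
summed = vec_a + vec_b
print(summed.shape)                                         # (1, 25)
print(np.isclose(summed[0][3], vec_a[0][3] + vec_b[0][3]))  # True

# Plain Python lists behave differently: + concatenates instead of adding.
joined = vec_a[0].tolist() + vec_b[0].tolist()
print(len(joined))                                          # 50
```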

What we have now is a new vector that carries the following information: we want a word that combines the information of "Mann" and "Frau". To find it, we use .most_similar and ask for the top 6 results.


In [30]:
bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.9576594233512878),
 ('▁mann', 0.9351305365562439),
 ('▁freund', 0.8832379579544067),
 ('▁mutter', 0.8588590621948242),
 ('▁vater', 0.8556250333786011),
 ('▁kind', 0.8479531407356262)]
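
The second value in each pair above is a similarity score. Assuming these scores are cosine similarities (the usual choice for this kind of nearest-neighbour lookup), the ranking can be sketched in a few lines. The three-dimensional vectors and the function `most_similar_toy` are invented for this illustration and are not taken from bpemb.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(query, vectors, topn=3):
    """Score every word against the query and return the best `topn` pairs."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vocabulary with 3 dimensions per word.
toy = {
    "frau": [1.0, 0.9, 0.1],
    "mann": [0.9, 1.0, 0.1],
    "kind": [0.5, 0.5, 0.9],
    "klar": [-0.8, 0.2, 0.1],
}

query = [a + b for a, b in zip(toy["frau"], toy["mann"])]
for word, score in most_similar_toy(query, toy):
    print(word, round(score, 3))
```

As in the real output, the two words that were summed stay closest to the query, and a related word follows at some distance.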

"▁frau" and "▁mann" are still closest to our combined vector, but we also find other words like "▁freund" (friend), "▁mutter" (mother) and "▁vater" (father). At the 6th position we find "▁kind".

While we told our hand-written _encode function which information each dimension contains, the bpemb model had to learn this on its own from a lot of Wikipedia articles. So how about we try to find out more about the meaning of each dimension? For this, we iterate through the dimensions of our embed_add vector, set each one to 1 and print the top 3 results. Let's see if we can get "kid" or even "daughter" to the top with this.

In [31]:
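for i in range(25):
    # Note: this assignment does not copy the array. Both names refer to the same data,
    # so the dimensions set to 1 accumulate from one iteration to the next.
    new_embed_add = embed_add
    new_embed_add[0][i] = 1
    print(bpemb_ger.most_similar(embed_add, topn=3))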

[('▁frau', 0.8940346240997314), ('▁mann', 0.8566429615020752), ('▁freund', 0.808347761631012)]
[('▁frau', 0.8886023759841919), ('▁mann', 0.8520528078079224), ('▁geliebte', 0.8106622695922852)]
[('▁frau', 0.8825910091400146), ('▁mann', 0.853992223739624), ('▁freund', 0.8089397549629211)]
[('▁frau', 0.8841907978057861), ('▁mann', 0.8498778343200684), ('▁freund', 0.8016359806060791)]
[('▁frau', 0.8412380814552307), ('▁mann', 0.8276678323745728), ('▁freund', 0.7763669490814209)]
[('▁frau', 0.8366867899894714), ('▁mann', 0.8289816975593567), ('▁freund', 0.7783395051956177)]
[('▁mann', 0.812994122505188), ('▁frau', 0.7430222630500793), ('▁freund', 0.7360460162162781)]
[('▁mann', 0.7824007272720337), ('felt', 0.7378767132759094), ('▁eigentlich', 0.7248014211654663)]
[('▁klar', 0.6995150446891785), ('▁gerade', 0.659595787525177), ('▁eigentlich', 0.6595355868339539)]
[(',', 0.6670671701431274), ('▁paar', 0.6461077928543091), ('▁klar', 0.6414101123809814)]
[('▁klar', 0.6853453516960144), ('▁gera

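As we can see, we get a lot of different results, but we are further away from having "kid" in the top three than before. One reason is that the loop changes embed_add in place, so by the later iterations many dimensions have been overwritten with 1 and little of the original vector is left. Another is that everything the model learnt about the language had to be squeezed into 25 dimensions. That is far too few for each dimension to have a clear meaning. We simply don't know what each dimension represents.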
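
To change only one dimension per step, the vector has to be copied before it is modified. The sketch below shows the difference on a small stand-in array with 5 dimensions instead of 25; it does not call bpemb, and the names `base`, `alias` and `variants` are hypothetical.

```python
import numpy as np

base = np.zeros((1, 5))

# Plain assignment: both names point at the same array, so edits accumulate.
alias = base
for i in range(5):
    alias[0][i] = 1
print(base)          # [[1. 1. 1. 1. 1.]]

# Copying first keeps the original intact and changes one dimension per step.
base = np.zeros((1, 5))
variants = []
for i in range(5):
    variant = base.copy()
    variant[0][i] = 1
    variants.append(variant)
print(base)          # [[0. 0. 0. 0. 0.]]
print(variants[2])   # [[0. 0. 1. 0. 0.]]
```

With the copy in place, each variant differs from the original in exactly one dimension, so any change in the neighbours could be attributed to that dimension alone.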

Let's try something else: narrow down the information in the word vector by adding extra information.

In [33]:
embeds1 = bpemb_ger.embed('Mann')
embeds2 = bpemb_ger.embed('Frau')
# Adding Birth
embeds3 = bpemb_ger.embed('Geburt')

embed_add = embeds1 + embeds2 + embeds3
bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.948681116104126),
 ('▁kind', 0.8942172527313232),
 ('▁mutter', 0.868741512298584),
 ('▁tod', 0.8545101284980774),
 ('▁mann', 0.8393333554267883),
 ('▁mannes', 0.8388955593109131)]

Oh look! We just got "▁kind" (child) into second place by adding the word vector of "Geburt" (birth).
That is pretty cool, right? But what happens when we swap "Mann" and "Frau"?

In [34]:
embeds1 = bpemb_ger.embed('Frau')
embeds2 = bpemb_ger.embed('Mann')
# Adding Birth
embeds3 = bpemb_ger.embed('Geburt')

embed_add = embeds1 + embeds2 + embeds3
bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.948681116104126),
 ('▁kind', 0.8942172527313232),
 ('▁mutter', 0.868741512298584),
 ('▁tod', 0.8545101284980774),
 ('▁mann', 0.8393333554267883),
 ('▁mannes', 0.8388955593109131)]

Mathematicians won't be surprised: it is the same,
simply because 1+2 is the same as 2+1. But let's see if we can get "▁kind" even higher by weighting the word vectors.
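
Both points, that the order of the summands does not matter and that a weight pulls the sum towards one vector, can be checked on small stand-in vectors. The three-dimensional values below are hypothetical and chosen only for illustration; cosine similarity is assumed as the measure of closeness.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional stand-ins for the three word vectors.
frau = np.array([1.0, 0.2, 0.0])
mann = np.array([0.9, 0.3, 0.1])
geburt = np.array([0.1, 0.2, 1.0])

# Order does not matter for a sum.
print(np.allclose(frau + mann + geburt, geburt + mann + frau))   # True

# A larger weight pulls the sum towards that vector.
plain = frau + mann + geburt
weighted = frau + mann + 1.5 * geburt
print(cosine(plain, geburt) < cosine(weighted, geburt))          # True
```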

In [47]:
embeds1 = bpemb_ger.embed('Frau')
embeds2 = bpemb_ger.embed('Mann')
embeds3 = bpemb_ger.embed('Geburt')

# The information about birth is 1.5x more important than the information about man and woman
embed_add = embeds1 + embeds2 + (1.5*embeds3)
bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.9274512529373169),
 ('▁kind', 0.8905812501907349),
 ('▁geburt', 0.8828873038291931),
 ('▁mutter', 0.8547175526618958),
 ('▁gestorben', 0.8433364629745483),
 ('▁tod', 0.8418509364128113)]

"▁geburt" itself now shows up in the list, but "▁kind" stays in second place, so we still did not get "child" to the top. Maybe adding extra information about children will raise it further. So let's put "Spielzeug" (toys) in.

In [50]:
embeds1 = bpemb_ger.embed('Frau')
embeds2 = bpemb_ger.embed('Mann')
embeds3 = bpemb_ger.embed('Geburt')
embeds4 = bpemb_ger.embed('Spielzeug')


embed_add = embeds1 + embeds2 + (1.5*embeds3) + (2.0)*embeds4
bpemb_ger.most_similar(embed_add, topn=6)

[('▁kind', 0.893251895904541),
 ('▁frau', 0.8610904216766357),
 ('▁ihm', 0.8160647749900818),
 ('▁ihr', 0.8110315799713135),
 ('▁mädchen', 0.8107687830924988),
 ('▁mutter', 0.8054994344711304)]

We did it! By adding the information "Spielzeug" we got the model to understand that we are talking about children. What do you think, can we change one word to get the word "baby" to the top?

Let's try changing "Mann" to "Windeln" (diapers).

In [57]:
embeds1 = bpemb_ger.embed('Frau')
embeds2 = bpemb_ger.embed('Windeln')
embeds3 = bpemb_ger.embed('Geburt')
embeds4 = bpemb_ger.embed('Spielzeug')


embed_add = embeds1 + (2.0)*embeds2 + (1.5*embeds3) + (2.0)*embeds4
bpemb_ger.most_similar(embed_add, topn=6)

[('▁kind', 0.838711678981781),
 ('▁ihr', 0.7817391753196716),
 ('▁paar', 0.778824508190155),
 ('▁mädchen', 0.7774273157119751),
 ('▁ihrem', 0.7575104832649231),
 ('▁frau', 0.7430578470230103)]

"▁kind" is still at the top. Maybe the vocabulary is too small to contain the word "baby". Let's try again with 100,000 tokens.

In [62]:
bpemb_ger_mid = BPEmb(lang="de", vs=100000, dim=25)

  0%|          | 0/10069301 [00:00<?, ?B/s]

downloading https://nlp.h-its.org/bpemb/de/de.wiki.bpe.vs100000.d25.w2v.bin.tar.gz


100%|██████████| 10069301/10069301 [00:02<00:00, 4629091.87B/s]


In [71]:
embeds1 = bpemb_ger_mid.embed('Baby')
embeds2 = bpemb_ger_mid.embed('Windeln')
embeds3 = bpemb_ger_mid.embed('Geburt')
embeds4 = bpemb_ger_mid.embed('Spielzeug')


embed_add = embeds1 + (2.0)*embeds2 + (1.5*embeds3) + (2.0)*embeds4
bpemb_ger_mid.most_similar(embed_add, topn=6)

[('▁wunder', 0.7633698582649231),
 ('▁koffer', 0.7478492856025696),
 ('▁braut', 0.7169751524925232),
 ('▁glück', 0.714860737323761),
 ('▁liebes', 0.7134131789207458),
 ('▁baby', 0.7069087624549866)]

Weird, isn't it? We still don't get "baby" to the top, even though "Baby" itself is now part of the sum. Let's see what is similar to "baby".

In [74]:
embeds_baby = bpemb_ger_mid.embed('Baby')
bpemb_ger_mid.most_similar(embeds_baby, topn=10)

[('▁baby', 1.0),
 ('baby', 0.8973329663276672),
 ('good', 0.877951979637146),
 ('▁happy', 0.8630719780921936),
 ('▁love', 0.8501371741294861),
 ('love', 0.8397102355957031),
 ('▁like', 0.8358559012413025),
 ('▁christmas', 0.8313241004943848),
 ('▁girls', 0.8312001824378967),
 ('▁sweet', 0.8268948793411255)]

Now we see that the nearest neighbours of "baby" are almost all English words. The German bpemb model contains a lot of English words, and that is the context in which "baby" is found.

In this post you learned how to use the bpemb model to extract word embeddings and analyze them. You also learned how to find similar words and how bpemb stores them as vectors.