# LAB2.2: Word embeddings from Wikipedia

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we introduce you to word embeddings. Word embeddings are vector representations of words, learned by a neural network that is trained on a very large corpus to predict the words that occur in the context of a word. The weights of the hidden layer that lead to correct predictions are taken as the vector representation of the meaning of the word. Vector sizes are usually limited to a few hundred dimensions (the models used here have 100 or 300), which is far fewer than the number of distinct context words. The advantage is that the vectors of any two words can be compared dimension by dimension: the vectors are dense, and they match most strongly when the words occur in similar contexts.

Although there are many packages and data sets with embeddings, we focus on publicly available and trainable embeddings, especially for multiple languages. We therefore use the wikipedia2vec package that has pretrained models in various languages.

Acknowledgement: https://wikipedia2vec.github.io/wikipedia2vec/

To use the embeddings created from Wikipedia (in a specific language) you need to do 4 things (also described on the above website):

* Use `pip install Wikipedia2Vec` from the command line/terminal to install the package on your local computer
* Download an embedding model trained on Wikipedia and unpack the compressed file with a decompression application
* Import the package in your notebook
* Load the local copy of the embedding model

We guide you through these steps in this notebook and explain the basic functions. As there are different models for different languages, you can do this for any of the available languages.

Use `pip install Wikipedia2Vec` on the command line to install the package.

In [1]:
%pip install Wikipedia2Vec

You should consider upgrading via the '/Users/piek/opt/anaconda3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


You can download pre-trained models in various languages from: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

Note that there are different variants trained for 100 and 300 dimensions. If your computer has limited capacity, it is better to start with the 100-dimensional variant. For this notebook, we will download enwiki_20180420_100d.pkl.bz2, which is a compressed version of the 100-dimensional embedding model built from the English Wikipedia.

## NOTE!

If you fail to install the Wikipedia2Vec package, do not waste too much time fixing this. You can also switch to the notebook *Lab2.2b.Wikipedia2Vec_Gensim.ipynb*, which explains how you can load the Wikipedia2Vec text models in another package, *Gensim*.

If you successfully installed Wikipedia2Vec, you can proceed here:

In [2]:
# Once the package is installed successfully, you can use the next import in your notebook. There is no need to install it again.
from wikipedia2vec import Wikipedia2Vec

Next we need to download a model in a format that Wikipedia2Vec can load. The binary versions (pkl) are compressed in a bz2 format. You need to decompress the bz2 file to a file with the extension ".pkl".
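If you do not have a decompression application at hand, Python's standard `bz2` module can do the job. The sketch below is one possible way to do this; the function name `decompress_bz2` is only illustrative, and the file name in the usage comment is the download mentioned above, so adapt the path to wherever you stored it.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(source, target=None):
    """Decompress a .bz2 file and return the path of the decompressed file."""
    source = Path(source)
    # By default, strip the final ".bz2" suffix: model.pkl.bz2 -> model.pkl
    target = Path(target) if target else source.with_suffix("")
    # Copy in chunks so that a large model never has to fit in memory at once.
    with bz2.open(source, "rb") as compressed, open(target, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return target


# Usage (adapt the path to your own download):
# decompress_bz2("enwiki_20180420_100d.pkl.bz2")
```

Decompressing a full model can take a while and needs enough free disk space for the decompressed file.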

In [3]:
# Fill in the path to your local copy of an embedding model.
# Here we specify an example of such a path. Adapt the path to where you have stored the download.
# Make sure it is decompressed. The *.bz2 file will not load.

MODEL_FILE='/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/wiki2vec/enwiki_20180420_100d.pkl'
wiki2vec = Wikipedia2Vec.load(MODEL_FILE)

By loading the model, we created an object with the name "wiki2vec" through which we can call functions and attributes.

In [4]:
dir(wiki2vec)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_build_entity_neg_table',
 '_build_uniform_neg_table',
 '_build_unigram_neg_table',
 '_build_word_neg_table',
 'dictionary',
 'get_entity',
 'get_entity_vector',
 'get_vector',
 'get_word',
 'get_word_vector',
 'load',
 'load_text',
 'most_similar',
 'most_similar_by_vector',
 'save',
 'save_text',
 'syn0',
 'syn1',
 'train',
 'train_params']

## 3. Accessing word representations of different models

Models may be stored in different (and sometimes confusing) formats, but they all boil down to these components:

* a matrix of word vectors 
* a vocabulary
* a mapping between vectors in the matrix to the words in the vocabulary (often via indices)

Think about what a matrix is (no, not the movie). You know that a vector is a list of numbers, such that each number is the value for one dimension in an n-dimensional space. Well, if you have a list of m such vectors, you have a matrix of n columns and m rows. Each row corresponds to the vector of a word in the vocabulary.

A matrix of 3 rows and 3 columns:
```
[[.34, .56, .12],
 [.12, .39, .05],
 [.78, .37, .01]]
```

The vocabulary maps each word (the key) to the index of the matrix row that holds the embedding for that word:

```{"dog": 0, "cat" : 1, "car" : 2}```

For this data, a simple lookup function for *dog* will give the embedding *[.34, .56, .12]*.
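A minimal sketch of such a lookup in plain Python, using the toy matrix and vocabulary from above. The names `matrix`, `vocabulary` and `lookup` are only illustrative and are not part of any package.

```python
# Toy data from the text: one row per word, one column per dimension.
matrix = [[.34, .56, .12],
          [.12, .39, .05],
          [.78, .37, .01]]

# The vocabulary maps each word to the index of its row in the matrix.
vocabulary = {"dog": 0, "cat": 1, "car": 2}


def lookup(word):
    """Return the embedding for `word`, or None if the word is unknown."""
    index = vocabulary.get(word)
    if index is None:
        return None
    return matrix[index]


print(lookup("dog"))   # the first row of the matrix
print(lookup("bird"))  # not in the vocabulary
```

Real models work the same way in principle, only with far more rows and columns.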

Now let's see how this is implemented in Wikipedia2Vec.

### 3.1 The vocabulary

In [5]:
# Explore the wiki2vec model as a python object:
print('The model is represented internally as a...')
print(type(wiki2vec))

The model is represented internally as a...
<class 'wikipedia2vec.wikipedia2vec.Wikipedia2Vec'>


The model has a dictionary that contains words. Let's check how big the vocabulary of the model derived from the English Wikipedia is:

In [6]:
vocabulary = wiki2vec.dictionary
print('The model vocabulary is represented internally as a...')
print(type(vocabulary))
print(len(list(vocabulary.words())))

The model vocabulary is represented internally as a...
<class 'wikipedia2vec.dictionary.Dictionary'>
1937422


Almost two million words are present in this model. That is a lot more than in the English WordNet. Let's check some of these:

In [7]:
#####
print('Some words from the model vocabulary:')
print(list(vocabulary.words())[:20]) #Note that :20 gives the first 20 items in the list, print(list(vocabulary.words())[-1]) gives the last word
print()

Some words from the model vocabulary:
[<Word s>, <Word sa>, <Word san>, <Word sand>, <Word sanda>, <Word sandal>, <Word sandali>, <Word sandalia>, <Word sandaliatus>, <Word sandalinas>, <Word sandaling>, <Word sandalio>, <Word sandaliyah>, <Word sandalo>, <Word sandalodes>, <Word sandalon>, <Word sandalops>, <Word sandalore>, <Word sandalov>, <Word sandalow>]



### 3.2 The embeddings

For each word in the vocabulary, we can now get the vector. We assume that 'man' is in the vocabulary. We can use the *get_word_vector* function to look up the vector for the word 'man' in the matrix:

In [20]:
print('Information stored in the vocabulary for the word "man":')
man_vector=wiki2vec.get_word_vector('man')
print(type(man_vector))
print('Distributional meaning of "man" in Wikipedia:', man_vector)

Information stored in the vocabulary for the word "man":
<class 'numpy.memmap'>
Distributional meaning of "man" in Wikipedia: [-2.37749144e-01  2.68582195e-01 -9.62369144e-02  2.70704746e-01
 -2.24097610e-01 -2.48913109e-01  1.06461413e-01  4.12168130e-02
 -5.34945190e-01 -1.44513458e-01 -8.70477855e-02 -1.87745050e-01
  1.98523641e-01 -1.64299533e-01  1.02062628e-01 -1.78317577e-01
 -5.51789738e-02  2.19180398e-02 -2.18049601e-01  1.56891569e-01
 -2.83530265e-01 -3.29926699e-01 -6.78404942e-02  3.50453734e-01
 -3.24131519e-01 -9.09007853e-04 -1.23354875e-01 -3.45233470e-01
 -4.52311546e-01  7.44896114e-01  1.46970570e-01 -1.25839904e-01
 -1.07294962e-01  4.01940018e-01  1.11972339e-01  2.22993977e-02
 -3.72039467e-01  2.02560142e-01  3.16281393e-02  2.91241556e-02
 -2.40586206e-01  1.36774838e-01 -1.75260063e-02  1.01980194e-01
  8.33696201e-02  5.01191735e-01 -3.97316903e-01  4.00523953e-02
 -1.65336326e-01 -1.89155132e-01 -1.44131929e-01  6.28692061e-02
 -5.18540621e-01 -2.63796657e

As expected, a vector is an ordered sequence of numbers, each representing a dimension. These numbers are actually the weights learned by the neural network that are applied to the hidden layer when learning to predict the context words of 'man'.

How many do we have?

In [21]:
print('Number of vector dimensions:', len(man_vector))

Number of vector dimensions: 100


Not a surprise: we loaded a model with 100 dimensions, based on a hidden layer with 100 neurons. This holds for all words, so also for 'dog'.

In [22]:
vector4dog = wiki2vec.get_word_vector('dog')

In [23]:
len(vector4dog)

100

In [24]:
print(vector4dog)

[-0.01304934  0.64761317  0.10454331 -0.3116781   0.14754549  0.0808935
 -0.15227112  0.26503846 -0.64620554 -0.20578592  0.15636262 -0.20720603
  0.41793343  0.03861991 -0.01935025 -0.22413553  0.22274837 -0.34524342
 -0.42599422  0.102845   -0.21360567 -0.02671032  0.19456221  0.3651903
 -0.22647302 -0.27360198  0.03258029 -0.02785098 -0.23588972  0.5077206
  0.37592876 -0.22071666 -0.05057421  0.7909033   0.1343578  -0.07903094
 -0.4099386   0.15587732 -0.00657076  0.1236117  -0.54740536 -0.08774299
 -0.3738407  -0.25297046 -0.4688306  -0.11844479 -0.05014395  0.32674935
 -0.17993684 -0.26620498  0.09679675  0.28913295 -0.4815562  -0.3374474
  0.24882683  0.17436764  0.0888719  -0.18725184 -0.33120757 -0.1903342
  0.05470906 -0.61491376 -0.42699674 -0.10787722  0.13698857 -0.14450763
  0.05210021 -0.5711191  -0.38591218 -0.6626211   0.2417067  -0.01411594
  0.39739552  0.13306352 -0.6726368  -0.22698367  0.1793001   0.24538207
  0.15446481 -0.09232827 -0.02473994 -0.4611105  -0.1317
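The lookups above assume that the word is in the vocabulary. For words that may be missing, a small guard can help. The sketch below relies on *get_word* returning `None` for an unknown word (its default, according to the Wikipedia2Vec documentation); the helper name `vector_or_none` is hypothetical and not part of the package.

```python
def vector_or_none(model, word):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    # get_word returns None (its default) when the word is unknown,
    # so we check that first before asking for the vector.
    if model.get_word(word) is None:
        return None
    return model.get_word_vector(word)


# Usage with the model loaded earlier in this notebook:
# vector = vector_or_none(wiki2vec, 'dog')
# if vector is None:
#     print('Word not in the vocabulary')
```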

Because the representations are compatible across the words, we can compare two vector representations through the cosine similarity function:

![Cosine similarity](./images/cosine-full.png "Cosine similarity formula")

So suppose we have two vectors A and B, each with 100 slots, this formula (taken from the Wikipedia page) tells you to sum the results of multiplying each slot across A and B:

A[0]\*B[0] + A[1]\*B[1] + ... + A[99]\*B[99]

Next we divide this sum by the length of A multiplied by the length of B, where the length of a vector is the square root of the sum of its squared slots. Dividing it that way normalises the value to the range from -1 to 1, so that the result depends only on the angle between the two vectors and not on their lengths. A value of 1 means that the vectors point in the same direction, and 0 means that they are orthogonal.
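As a sketch, the cosine formula can be written out in a few lines of plain Python. The function name `cosine_similarity` and the toy vectors are only illustrative; embedding packages ship their own, much faster implementations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    # Numerator: sum of the slot-by-slot products (the dot product).
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: the lengths of the two vectors multiplied together,
    # where a length is the square root of the sum of the squared slots.
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# Toy 3-dimensional vectors for three words.
dog = [.34, .56, .12]
cat = [.12, .39, .05]
car = [.78, .37, .01]

print('dog-cat:', cosine_similarity(dog, cat))
print('dog-car:', cosine_similarity(dog, car))
```

Because the function only iterates over the slots, it should also accept the vectors returned by *get_word_vector*, for example `cosine_similarity(man_vector, vector4dog)`.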

Embedding software uses such measures to obtain the most similar words. We already have the vector for 'dog', so we can now use the *most_similar_by_vector* function to ask for the words that are most similar:

In [25]:
wiki2vec.most_similar_by_vector(vector4dog, 10)

[(<Word dog>, 0.9999998),
 (<Word dogs>, 0.86373067),
 (<Word cat>, 0.8286425),
 (<Word puppy>, 0.8150868),
 (<Word rabbit>, 0.804229),
 (<Word montarges>, 0.7981079),
 (<Word poodle>, 0.7949788),
 (<Word barfy>, 0.79154897),
 (<Word cockapoo>, 0.7834619),
 (<Word pekapoos>, 0.782865)]

That looks good. We indeed get a list of closely related words. Note that some are not *dogs*.

Wikipedia2Vec also provides another way to get the most similar words directly from a word rather than from the vector of a word. This can be done using the Wikipedia2Vec *Word* object. Word objects are defined in the Wikipedia2Vec package as a special data type (you do not find them in other packages such as Gensim). They provide the index of the vector in the matrix, but also frequency statistics at the token level and the document level.

Let's see what a Wikipedia2Vec *Word* object looks like.

In [26]:
word_object_dog=wiki2vec.get_word('dog')
print(type(word_object_dog))
dir(word_object_dog)

<class 'wikipedia2vec.dictionary.Word'>


['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'doc_count',
 'index',
 'text']

With a *Word* object, we can apply the *most_similar* function directly, without having to get the vector first. We can also get other information, such as the frequency of the word in Wikipedia.

### 3.3 Word frequency statistics

In [18]:
print('Token frequency:', word_object_dog.count)
print('Document frequency:', word_object_dog.doc_count)
wiki2vec.most_similar(word_object_dog, 10)

Token frequency: 116223
Document frequency: 54616


[(<Word dog>, 0.9999998),
 (<Word dogs>, 0.86373067),
 (<Word cat>, 0.8286425),
 (<Word puppy>, 0.8150868),
 (<Word rabbit>, 0.804229),
 (<Word montarges>, 0.7981079),
 (<Word poodle>, 0.7949788),
 (<Word barfy>, 0.79154897),
 (<Word cockapoo>, 0.7834619),
 (<Word pekapoos>, 0.782865)]

Remember that we discussed the notion of information value in class, as expressed by the 'TF\*IDF' formula (term frequency times the inverse document frequency). We can easily get a rough information value score based on the Wikipedia articles by dividing the token count by the document count, which gives the average number of occurrences per article in which the word appears.

In [15]:
print('Information value of "dog"', wiki2vec.get_word('dog').count/wiki2vec.get_word('dog').doc_count)
print('Information value of "cat"', wiki2vec.get_word('cat').count/wiki2vec.get_word('cat').doc_count)

print('Information value of "link"', wiki2vec.get_word('link').count/wiki2vec.get_word('link').doc_count)
print('Information value of "article"', wiki2vec.get_word('article').count/wiki2vec.get_word('article').doc_count)

print('Information value of "Trump"', wiki2vec.get_word('trump').count/wiki2vec.get_word('trump').doc_count)
print('Information value of "Poetin"', wiki2vec.get_word('poetin').count/wiki2vec.get_word('poetin').doc_count)


Information value of "dog" 2.1280027830672332
Information value of "cat" 2.2003139717425433
Information value of "link" 1.7776395603831898
Information value of "article" 1.5689368990990131
Information value of "Trump" 3.7625386706665416
Information value of "Poetin" 2.25


We can observe that general words tend to have lower scores than names of individuals such as *Trump* and *Poetin* (the Dutch spelling of *Putin*). Typical Wikipedia jargon such as *link* and *article* has the lowest information value, as it occurs in many different articles.

Unfortunately, ```TF*IDF``` does not work out-of-the-box for very frequent words such as *the* and *for*. This is because their frequency is so high in comparison to the number of Wikipedia articles that we still get high values:

In [18]:
print('Information value of "the"', wiki2vec.get_word('the').count/wiki2vec.get_word('the').doc_count)
print('Information value of "for"', wiki2vec.get_word('for').count/wiki2vec.get_word('for').doc_count)

Information value of "the" 29.74152814006325
Information value of "for" 5.936486691616065


For Wikipedia as a genre, the document count needs to be boosted for this to work. There are many different variants of ```TF*IDF``` that adjust for peculiarities of the data.
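To make the difference concrete, the sketch below computes both the simple ratio used above and a standard inverse document frequency, log(N / document count). The counts for 'dog' are taken from the output above. The total number of articles `N_DOCS` is NOT stored in the model as far as this notebook shows, so the value here is only an assumed round number for illustration, and the function names are illustrative as well.

```python
import math


def occurrences_per_document(count, doc_count):
    """The ratio used above: average occurrences per article containing the word."""
    return count / doc_count


def inverse_document_frequency(doc_count, n_docs):
    """Standard IDF: close to 0 for words that occur in almost every document."""
    return math.log(n_docs / doc_count)


# Assumption: a rough, made-up total number of articles (not read from the model).
N_DOCS = 5_000_000

# Token and document counts for 'dog' as printed earlier in this notebook.
dog_count, dog_doc_count = 116223, 54616

print('Occurrences per document:', occurrences_per_document(dog_count, dog_doc_count))
print('IDF under the assumed N_DOCS:', inverse_document_frequency(dog_doc_count, N_DOCS))
```

A word that appears in nearly every article gets an IDF close to 0, which is what pushes very frequent words down once the IDF is multiplied in.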

### 3.4 Entity embeddings and statistics

Wikipedia does not only have text but also a lot of entity mentions. Wikipedia2Vec therefore allows you to obtain entities and entity vectors as well. The package defines another specific object type, *Entity*, for this.

In [28]:
scarlett_entity=wiki2vec.get_entity('Scarlett Johansson')
print(type(scarlett_entity))
print(dir(scarlett_entity))

<class 'wikipedia2vec.dictionary.Entity'>
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'count', 'doc_count', 'index', 'title']


We can see that *Entity* has similar attributes and functions to *Word*. There are also a few differences: an *Entity* has a *title* where a *Word* has a *text*.

In [29]:
print('Title:', scarlett_entity.title)
print('Token frequency:', scarlett_entity.count)
print('Document frequency:', scarlett_entity.doc_count)

scarlet_list = wiki2vec.most_similar(scarlett_entity, 10)
print(scarlet_list)

Title: Scarlett Johansson
Token frequency: 687
Document frequency: 582
[(<Entity Scarlett Johansson>, 1.0), (<Word charlize>, 0.76443684), (<Word winslet>, 0.72750795), (<Entity Hilary Swank>, 0.7215545), (<Word paltrow>, 0.71570885), (<Entity Eva Green>, 0.71551895), (<Word noomi>, 0.714082), (<Entity Kate Winslet>, 0.7103845), (<Entity Keira Knightley>, 0.7100761), (<Word blanchett>, 0.7095127)]


The *most_similar* function now gives a mixture of entities and words. This is a nice property of Wikipedia2Vec, because it includes the linked entities as a structure.

Also note that the token frequency of 'Scarlett Johansson' is just a bit higher than the document frequency. This means she is mostly mentioned only once per article in which she occurs.

In the same way as for words, you can get the entity vector as well and ask for the most similar items by vector:

In [30]:
scarlett=wiki2vec.get_entity_vector('Scarlett Johansson')
print(len(scarlett))
wiki2vec.most_similar_by_vector(scarlett, 10)

100


[(<Entity Scarlett Johansson>, 1.0),
 (<Word charlize>, 0.76443684),
 (<Word winslet>, 0.72750795),
 (<Entity Hilary Swank>, 0.7215545),
 (<Word paltrow>, 0.71570885),
 (<Entity Eva Green>, 0.71551895),
 (<Word noomi>, 0.714082),
 (<Entity Kate Winslet>, 0.7103845),
 (<Entity Keira Knightley>, 0.7100761),
 (<Word blanchett>, 0.7095127)]
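Since the results mix *Entity* and *Word* objects, it can be handy to separate them. The sketch below relies only on the attributes listed by `dir()` above (*title* for entities, *text* for words); the helper name `split_neighbours` is hypothetical and not part of the package.

```python
def split_neighbours(results):
    """Split (item, score) pairs into entity titles and word texts."""
    entities, words = [], []
    for item, score in results:
        # Entity objects have a 'title' attribute, Word objects have 'text'.
        if hasattr(item, 'title'):
            entities.append((item.title, score))
        else:
            words.append((item.text, score))
    return entities, words


# Usage with the objects created earlier in this notebook:
# entities, words = split_neighbours(wiki2vec.most_similar(scarlett_entity, 10))
# print(entities)
# print(words)
```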

## 4. Links to existing models available for download

Follow the links to browse available models. The sources listed below contain English models trained using different algorithms, data with different degrees of preprocessing and varying hyperparameter settings. Some resources also include models in other languages.

### Large and commonly used models (English):

* Google word2vec: can be downloaded from here (follow link in instructions): http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

* GloVe (trained on various corpora): https://nlp.stanford.edu/projects/glove/

* FastText embeddings (Facebook): https://fasttext.cc/docs/en/english-vectors.html

* Models with different algorithms, hyperparameters, dimensions and degrees of preprocessing (e.g. dependency parsing windows): https://vecto.readthedocs.io/en/docs/tutorial/getting_vectors.html


### Various models in English & other languages:

* word2vec trained on Wikipedia for various languages: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

* Various algorithms and parameters for English and other languages: http://vectors.nlpl.eu/repository/#

* Word2vec wikipedia for English and German: https://github.com/idio/wiki2vec

* Facebook's fastText (https://fasttext.cc) for languages other than English: https://fasttext.cc/docs/en/crawl-vectors.html 


Gensim even lets you download models directly via its API.

Tip: You can build your own word embedding model from a text corpus. Normally, you need a very large corpus to do this, but it may be beneficial to create a dedicated word embedding model for your application. If you build your own embedding space, you can visualise it in:
    http://projector.tensorflow.org/?config=https://wikipedia2vec.github.io/projector_files/config.json

# End of this notebook