# Mangoes: Create embeddings

This notebook shows how to create embeddings from a co-occurrence matrix. The examples use extracts from Wikipedia (English and French).

First, import the module:

In [1]:
import mangoes
from IPython.display import display

## Content of this notebook

1. [Just create a representation](#1.-Just-create-a-representation)
2. [Apply transformations to the co-occurrence matrix](#2.-Apply-transformations-to-the-co-occurrence-matrix)
3. [Example with annotated text](#3.-Example-with-annotated-text)

## 1. Just create a representation

To create embeddings, you first have to compute a co-occurrence matrix (see the co-occurrence notebook).

In [2]:
import mangoes.counting

# Target words: the words we want vectors for
words = mangoes.Vocabulary(["anarchist", "communism", "societies", "state"])
corpus = mangoes.Corpus("data/wiki_article_en")
# Context words: the corpus vocabulary, limited to 10 entries by the truncate filter
contexts = corpus.create_vocabulary(filters=[mangoes.corpus.truncate(10)])
coocc_count = mangoes.counting.count_cooccurrence(corpus, words, contexts)

The returned value is a `mangoes.CountBasedRepresentation` object. It is a first representation in which the vector for each word holds its raw co-occurrence counts with each context word:

In [3]:
coocc_count.pprint(display=display)

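| | the | , | of | . | and | " | in | a | to | as |
|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 25 | 5 | 6 | 1 | 8 | 5 | 4 | 0 | 2 | 1 |
| communism | 0 | 2 | 2 | 0 | 3 | 0 | 1 | 0 | 0 | 0 |
| societies | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| state | 15 | 4 | 1 | 5 | 3 | 0 | 1 | 0 | 2 | 2 |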
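
For intuition about where such counts come from, here is a minimal sketch of window-based co-occurrence counting in plain Python. It does not use Mangoes: the function name `count_window_cooccurrences`, the toy sentence and the symmetric window of one token are all assumptions made for illustration, and the library's own windowing options may differ.

```python
from collections import Counter


def count_window_cooccurrences(tokens, words, contexts, window=1):
    """Count how often each context word occurs within `window` tokens of each target word."""
    counts = {word: Counter() for word in words}
    for i, token in enumerate(tokens):
        if token not in counts:
            continue  # not a target word
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            # Skip the target position itself and anything outside the context vocabulary
            if j != i and tokens[j] in contexts:
                counts[token][tokens[j]] += 1
    return counts


# Toy corpus (hypothetical, not taken from the Wikipedia extract)
tokens = "the anarchist opposed the state and the state opposed the anarchist".split()
counts = count_window_cooccurrences(
    tokens, words={"anarchist", "state"}, contexts={"the", "and", "opposed"}
)
for word in sorted(counts):
    print(word, dict(counts[word]))
```

Each row of the table above plays the same role as one of these counters: one entry per context word, holding the number of times it was seen near the target word.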


You can get the vector representation of a word:

In [4]:
print(coocc_count["anarchist"].toarray())

[[25  5  6  1  8  5  4  0  2  1]]


Or find its closest words:

In [5]:
print(coocc_count.get_closest_words("anarchist", 3))

[('state', 0.07678875479127578), ('communism', 0.5825498791786463), ('societies', 0.8228909132560464)]
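
The second element of each pair is a distance, so smaller means closer. The values above are consistent with the cosine distance (1 minus the cosine similarity) between the raw count vectors, which the following plain-Python sketch recomputes from the count table. The helper `cosine_distance` is an illustrative name, not part of Mangoes, and the metric is inferred here from the numbers rather than taken from the library's documentation.

```python
import math

# Raw co-occurrence counts copied from the table in section 1
vectors = {
    "anarchist": [25, 5, 6, 1, 8, 5, 4, 0, 2, 1],
    "communism": [0, 2, 2, 0, 3, 0, 1, 0, 0, 0],
    "societies": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "state": [15, 4, 1, 5, 3, 0, 1, 0, 2, 2],
}


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


target = "anarchist"
distances = {
    word: cosine_distance(vectors[target], vector)
    for word, vector in vectors.items()
    if word != target
}
# Sort from closest to farthest
closest = sorted(distances.items(), key=lambda item: item[1])
print(closest)
```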


## 2. Apply transformations to the co-occurrence matrix 

This matrix can also serve as the source for the `mangoes.create_representation` function, which can apply two kinds of transformation to it:

### 2.1 Apply weighting

The `mangoes.weighting` module provides several functions you can apply:
* joint_probabilities: $P(w,c)$, probability of finding the context word c in the context of the word w
* conditional_probabilities: $P(c|w)$, probability of finding the context word c given w (linear normalization of the matrix)
* probabilities_ratio: $\frac{P(c|w)}{P(c)}$, ratio between the probability of c given w and the probability of c. Note: $\frac{P(c|w)}{P(c)} = \frac{P(w|c)}{P(w)} = \frac{P(w,c)}{P(w)P(c)}$
* pmi (Pointwise Mutual Information): $\log\left(\frac{P(c|w)}{P(c)}\right)$
* positive pmi: $\max\left(\log\left(\frac{P(c|w)}{P(c)}\right), 0\right)$, taken as 0 where the logarithm is not defined
* shifted ppmi: $\max\left(\log\left(\frac{P(c|w)}{P(c)}\right) - \log(shift), 0\right)$, taken as 0 where the logarithm is not defined
* tf-idf

You can also use your own functions.

In [6]:
import mangoes.weighting
ppmi_representation = mangoes.create_representation(coocc_count, weighting=mangoes.weighting.PPMI())

The returned value is still a mangoes.CountBasedRepresentation object.

In [7]:
ppmi_representation.pprint(display=display)

| | the | , | of | . | and | " | in | a | to | as |
|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 0.082065 | 0.0 | 0.146603 | 0.0 | 0.0 | 0.552069 | 0.146603 | 0.0 | 0.0 | 0.0 |
| communism | 0.0 | 0.723919 | 1.011601 | 0.0 | 0.975233 | 0.0 | 0.723919 | 0.0 | 0.0 | 0.0 |
| societies | 0.0 | 2.110213 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| state | 0.117783 | 0.0 | 0.0 | 0.916291 | 0.0 | 0.0 | 0.0 | 0.0 | 0.405465 | 0.693147 |
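
The PPMI values above can be checked by hand from the raw counts, taking $P(c|w)$ as the count divided by the row total and $P(c)$ as the column total divided by the grand total, with a natural logarithm. The sketch below does this in plain Python; the function `ppmi` is an illustrative helper written for this check, not the Mangoes implementation.

```python
import math

contexts = ["the", ",", "of", ".", "and", '"', "in", "a", "to", "as"]
# Raw co-occurrence counts copied from the table in section 1
counts = {
    "anarchist": [25, 5, 6, 1, 8, 5, 4, 0, 2, 1],
    "communism": [0, 2, 2, 0, 3, 0, 1, 0, 0, 0],
    "societies": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "state": [15, 4, 1, 5, 3, 0, 1, 0, 2, 2],
}


def ppmi(counts):
    """Positive PMI: max(log(P(c|w) / P(c)), 0), with 0 where the count is 0."""
    total = sum(sum(row) for row in counts.values())
    column_totals = [sum(column) for column in zip(*counts.values())]
    weighted = {}
    for word, row in counts.items():
        row_total = sum(row)
        values = []
        for count, column_total in zip(row, column_totals):
            if count == 0:
                values.append(0.0)  # log(0) is undefined, so the weight is set to 0
            else:
                pmi = math.log((count / row_total) / (column_total / total))
                values.append(max(pmi, 0.0))
        weighted[word] = values
    return weighted


weights = ppmi(counts)
for word, values in weights.items():
    print(word, [round(value, 6) for value in values])
```

For example, `societies` co-occurs only with `,`, so $P(c|w) = 1$ and $P(c) = 12/99$, which gives $\log(99/12) \approx 2.1102$, as in the table.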


### 2.2 Apply dimensionality reduction

Finally, you can apply a reduction. The `mangoes.reduction` module provides two reductions: PCA and SVD.

In [8]:
import mangoes.reduction
embeddings = mangoes.create_representation(coocc_count, reduction=mangoes.reduction.PCA(dimensions=3))

The returned value is a `mangoes.Embeddings` object:

In [9]:
embeddings.pprint(display=display)

| | 0 | 1 | 2 |
|---|---|---|---|
| anarchist | 28.095744 | -2.650896 | -0.775352 |
| communism | 1.669892 | -2.120991 | 3.270904 |
| societies | 0.198125 | 0.054066 | 0.574083 |
| state | 16.144398 | 4.832018 | 1.003958 |
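
As a rough illustration of what a reduction does, the sketch below applies a truncated singular value decomposition to the raw counts with NumPy, keeping three dimensions. It is a generic sketch, not the Mangoes code: the helper name `truncated_svd_embeddings` and the choice to scale the left singular vectors by the singular values are assumptions, and the output is not expected to match the table above, since sign and scaling conventions can differ.

```python
import numpy as np

# Raw co-occurrence counts copied from the table in section 1 (rows: target words)
counts = np.array(
    [
        [25, 5, 6, 1, 8, 5, 4, 0, 2, 1],   # anarchist
        [0, 2, 2, 0, 3, 0, 1, 0, 0, 0],    # communism
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],    # societies
        [15, 4, 1, 5, 3, 0, 1, 0, 2, 2],   # state
    ],
    dtype=float,
)


def truncated_svd_embeddings(matrix, dimensions):
    """Return (U_k * S_k, all singular values) for the top `dimensions` components."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dimensions] * s[:dimensions], s


embeddings, singular_values = truncated_svd_embeddings(counts, dimensions=3)
print(embeddings.shape)
print(np.round(singular_values, 3))
```

Each word keeps one row, but the ten context columns are replaced by three dense coordinates ordered by decreasing singular value.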


### 2.3 Apply weighting and reduction

You can chain weighting and reduction:

In [10]:
ppmi = mangoes.weighting.PPMI()
svd = mangoes.reduction.SVD(dimensions=3)

embeddings = mangoes.create_representation(coocc_count, weighting=ppmi, reduction=svd)
embeddings.pprint(display=display)

| | 0 | 1 | 2 |
|---|---|---|---|
| anarchist | 0.009603 | 0.177659 | -0.062846 |
| communism | -0.003327 | 1.203092 | -1.253218 |
| societies | 0.00172 | -0.773731 | -1.963088 |
| state | 1.224045 | 0.002964 | -0.000155 |


## 3. Example with annotated text

The same steps apply to an annotated corpus, here a lemmatized and POS-tagged French extract:

In [11]:
import nltk
import string

# Requires the NLTK stopwords data: nltk.download("stopwords")
french_stopwords = nltk.corpus.stopwords.words("french")

annotated_corpus = mangoes.Corpus("data/wiki_article_fr.lemmatized", reader=mangoes.corpus.BROWN)

# Vocabulary of lemmas: drop stopwords and punctuation, then truncate to 30 entries
lemma_vocabulary = annotated_corpus.create_vocabulary(
    attributes="lemma",
    filters=[mangoes.corpus.remove_elements(french_stopwords),
             mangoes.corpus.remove_elements(string.punctuation),
             mangoes.corpus.truncate(30)])

# Vocabulary of (lemma, POS) pairs, filtered on the lemma attribute
pos_lemma_vocabulary = annotated_corpus.create_vocabulary(
    attributes=("lemma", "POS"),
    filters=[mangoes.corpus.remove_elements(french_stopwords, attribute="lemma"),
             mangoes.corpus.remove_elements(string.punctuation, attribute="lemma")])

# Count co-occurrences, then apply PPMI weighting (ppmi is defined in section 2.3) and a 10-dimensional SVD
cc = mangoes.counting.count_cooccurrence(annotated_corpus, lemma_vocabulary, pos_lemma_vocabulary)
embeddings = mangoes.create_representation(cc, weighting=ppmi, reduction=mangoes.reduction.SVD(dimensions=10))

embeddings.pprint(display=display)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ilimp,1.287783e-15,-0.0449463,0.1772036,-7.212714e-15,-1.029505e-14,-1.560793,-2.664022e-13,-2.09849,-5.635196,-0.05128225
*Meillet,5.937093e-15,-3.259013e-14,-9.867858e-14,-5.604915,-4.182209e-14,7.711643e-15,-8.325202e-15,-2.228843e-15,3.124124e-15,5.894733e-16
être,-7.702594e-16,-1.905841e-15,8.634973e-15,-4.498624e-15,2.092449e-14,3.412206e-14,4.629353,-5.873572e-13,-1.395793e-14,-5.815339e-15
*des,-3.85002e-15,0.8271542,1.57615,-3.852014e-14,-2.570367e-14,-1.948464,6.703539e-13,5.299007,-1.404107,-0.2238731
avoir,-1.934141e-16,3.276014e-15,8.558733e-15,-5.389409e-15,1.573786e-14,3.4072e-14,4.402983,-5.59993e-13,-1.295699e-14,-5.172201e-15
Parry,1.863747e-15,-1.285951e-14,-3.715705e-14,-2.177613,-1.603885e-14,2.807308e-15,-3.143711e-15,-6.550441e-16,1.314644e-15,2.283033e-16
linguiste,4.836326e-15,-0.5586713,-5.321274,9.600303e-14,8.949775e-15,-1.46314,2.20963e-13,1.566156,-0.3386486,-0.0354533
*du,-2.763908e-15,0.9076774,-0.1484263,-4.452175e-15,-1.809198e-15,-0.1547503,-5.659706e-14,-0.4078286,0.2421257,-6.891467
étude,-6.882001e-15,1.428334,0.2065183,-1.172871e-14,1.681144e-14,1.784244,2.103397e-13,1.842485,-1.103165,-0.07936164
*au,-1.132399e-15,0.3112931,-0.8876912,1.806507e-14,5.084142e-14,4.461321,7.652931e-14,0.9701313,-2.17039,-0.1398462


In [12]:
cc.pprint(display=display)

Unnamed: 0_level_0,ilimp,*Meillet,*des,être,avoir,Parry,linguiste,*du,étude,*au,...,résultat,présent,héberger,université,*Harvard,élève,Albert,*Lord,profondément,renouveler
Unnamed: 0_level_1,CLS,NPP,P+D,V,V,NPP,NC,P+D,NC,P+D,...,NC,NC,VPP,NC,NPP,NC,NPP,NPP,ADV,VPP
ilimp,0,0,0,1,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
*Meillet,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
être,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
*des,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
avoir,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Parry,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
linguiste,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
*du,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
étude,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
*au,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
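
To make the vocabulary filters used in this section more concrete, here is a plain-Python sketch of the same idea on a handful of `(lemma, POS)` tokens: drop stopwords and punctuation, then keep a limited number of entries. Everything in it is hypothetical (the helper `build_lemma_vocabulary`, the toy tokens and the one-word stopword list), and it assumes that truncating keeps the most frequent entries, which is a guess based on the first example, where `truncate(10)` yielded very common tokens.

```python
import string
from collections import Counter


def build_lemma_vocabulary(annotated_tokens, stopwords, max_size):
    """Return up to `max_size` lemmas, most frequent first, without stopwords or punctuation."""
    punctuation = set(string.punctuation)
    lemma_counts = Counter(lemma for lemma, _pos in annotated_tokens)
    kept = [
        lemma
        for lemma, _count in lemma_counts.most_common()  # sorted by decreasing frequency
        if lemma not in stopwords and lemma not in punctuation
    ]
    return kept[:max_size]


# Toy annotated tokens (hypothetical, not read from the lemmatized corpus file)
annotated_tokens = [
    ("le", "DET"), ("linguiste", "NC"), ("étudier", "V"), ("le", "DET"),
    ("langue", "NC"), (",", "PONCT"), ("le", "DET"), ("linguiste", "NC"),
    ("écrire", "V"), (".", "PONCT"),
]
vocabulary = build_lemma_vocabulary(annotated_tokens, stopwords={"le"}, max_size=2)
print(vocabulary)
```

In the notebook, the row vocabulary is built from lemmas only, while the column vocabulary keeps the `(lemma, POS)` pair, which is why the co-occurrence table above has a second header row of POS tags.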
