# Mangoes: Composition

This notebook illustrates how to use the composition module of mangoes to create phrase vectors by combining word vectors. 

In this notebook, we will focus on adjective-noun (AN) phrases. 

In [7]:
import mangoes
import mangoes.composition

import sklearn.metrics

from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

## Content of this notebook

1. [Get a representation to learn from](#1.-Get-a-representation-to-learn-from)
2. [Compose](#2.-Compose)
3. [Learn the parameters and compose](#3.-Learn-the-parameters-and-compose)


## 1. Get a representation to learn from

Since we want to learn composition parameters from an embedding, we first need to build a representation that covers adjectives, nouns, and the AN phrases themselves.

### Vocabulary
Let's load the corpus and choose a vocabulary:

In [8]:
corpus = mangoes.Corpus("data/wiki_article_en", lower=True, ignore_punctuation=True)


In [9]:
# Inspect the Corpus attributes
print(corpus.content)
# Boolean flag: is the corpus annotated?
print(corpus.annotated)
print(corpus.nb_sentences)
print(corpus.name)
print(corpus.language)
print(corpus.lower)

data/wiki_article_en
False
382
data/wiki_article_en
None
True


In [10]:
adjectives = ['french', 'american', 'spanish', 'russian', 'italian', 'german', 'european', 
              'mexican', 'socialist', 'libertarian', 'marxist', 'first', 'social', 'free', 
              'important', 'different', 'new', 'general', 'great']
nouns = ['anarchist', 'revolution', 'movement', 'school', 'association', 'section', 'revolt']
adj_nouns = ['french revolution', 'mexican revolution', 'spanish revolution', 
             'german revolution', 'russian revolution',
             'german anarchist', 'italian anarchist', 'american anarchist', 
             'spanish anarchist', 'russian anarchist',
             'social revolt', 'social movement', 'social revolution',
             'socialist section', 'italian section']

In [11]:
vocabulary = mangoes.Vocabulary(adjectives + nouns + adj_nouns)

### Create the representation

In [12]:
# choose the 100 most frequent words as context words
context_words = corpus.create_vocabulary(filters=[mangoes.corpus.truncate(100)])
# consider the sentence as context
context = mangoes.context.Sentence(vocabulary=context_words)
# count the cooccurrences 
coocc_count = mangoes.counting.count_cooccurrence(corpus, vocabulary, context)
# apply PPMI to the counts to get a first sparse representation
sparse_representation = mangoes.create_representation(coocc_count,
                                                      weighting=mangoes.weighting.PPMI())
# apply SVD to the weighted counts to get a dense representation
representation = mangoes.create_representation(sparse_representation,
                                               reduction=mangoes.reduction.SVD(dimensions=20))

display(representation.to_df())

word,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
french,0.9232378,0.2518977,0.1976913,0.5823899,0.1021536,-2.346297,0.3608367,-1.178371,-1.025432,0.2269521,-1.552892,0.5255774,0.5269176,-1.348291,2.090963,0.1986439,0.3327315,1.449723,-0.5417322,2.358469
american,1.011921,0.2307812,0.2674187,0.9504112,-0.6560689,1.190331,-0.2447758,-2.387688,2.246507,-0.4227611,1.097492,-1.759385,-0.199945,0.9939877,0.1766173,1.64364,-0.1360573,0.4157244,-0.2086322,3.515391
spanish,0.38449,-0.9919648,0.7720455,1.026798,-0.4559113,-0.4552277,0.7956519,0.398461,-0.7528701,0.4851349,0.5279303,-0.5622603,-0.3753503,0.9983573,-0.5887961,0.6857075,0.3892328,0.006850663,0.9775506,1.396507
russian,0.007174436,-0.7566456,-1.523177,1.475042,1.545955,0.7723306,-0.06539087,0.4700323,-0.6036873,-0.2046548,-2.844825,-0.1218294,-0.08964543,-1.388543,1.002374,2.457476,1.001777,-0.4625475,-0.5243237,2.551947
italian,-0.02207848,0.7327243,-0.8932005,0.1973344,0.4093956,-0.6570296,-0.833809,-0.1060872,0.4371266,-2.335336,-0.8960702,-0.1982598,-0.3248204,0.6170946,-1.345871,-1.785683,3.152765,-0.5528974,1.344408,3.146816
german,0.09044918,-0.6259619,-0.2763328,0.2095167,-1.075197,-0.3684076,-1.101813,-0.3883773,-0.2962404,2.034811,-0.5165124,-1.010789,0.9372634,-1.042798,-3.383304,1.07547,0.2720697,0.3899385,-1.884152,3.002028
european,-0.5195341,-0.0005981056,0.5892556,0.2222857,2.540496,0.7769772,-0.002348189,1.16948,0.5883077,0.2417877,-0.3999941,-0.8548218,-0.4825368,2.367484,0.3601892,-0.4478236,-1.478835,3.417065,-2.105688,2.960658
mexican,1.019551,1.232368,-0.5785188,0.2142487,-0.037979,0.2093605,0.3066307,0.753988,1.115593,1.917064,-1.144818,0.4401947,-1.642987,-0.6103295,-0.3001053,-1.314326,-0.2331189,0.982483,1.500832,1.239624
socialist,0.181204,-0.301455,1.2634,0.1836036,-0.709078,1.466179,0.6692192,-0.5454151,-0.6887641,1.08553,0.1190846,0.9056595,-0.1471218,-1.181902,2.723317,-0.3553119,0.6974363,-0.9300267,-0.904045,3.264135
libertarian,-0.3673574,0.3232677,-1.599944,0.2356308,-2.054945,1.699276,0.1281625,0.09704219,-0.8798461,-2.1702,0.06833526,0.6001506,-1.566836,-0.3923883,0.1983477,-0.1108656,-0.3294645,0.898691,0.4923463,2.593628
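
The PPMI weighting step used when building the sparse representation can be illustrated on a toy count matrix. This is a plain-Python sketch under the usual definition $\mathrm{PPMI}(w, c) = \max\left(0, \log \frac{P(w, c)}{P(w)P(c)}\right)$ with a natural logarithm. The names `ppmi` and `toy_counts` are hypothetical, and the actual `mangoes.weighting.PPMI` implementation may differ in details such as log base or smoothing.

```python
import math


def ppmi(counts):
    """Return the PPMI-weighted version of a word-by-context count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]        # word marginals
    col_sums = [sum(col) for col in zip(*counts)]  # context marginals
    weighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, n in enumerate(row):
            if n == 0:
                # log(0) is undefined; a zero count gets a zero weight
                new_row.append(0.0)
            else:
                pmi = math.log(n * total / (row_sums[i] * col_sums[j]))
                new_row.append(max(0.0, pmi))      # clip negative PMI to 0
        weighted.append(new_row)
    return weighted


# Toy cooccurrence counts: 3 words (rows) x 3 context words (columns)
toy_counts = [[4, 0, 1],
              [0, 3, 1],
              [1, 1, 2]]
toy_ppmi = ppmi(toy_counts)
```

Pairs that cooccur less often than chance would predict, such as the first word with the third context, end up with a weight of zero.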


## 2. Compose

In this example, we will derive a new vector $\mathbf{p}$ for the phrase *'spanish revolution'* from the vectors $\mathbf{u}$ and $\mathbf{v}$ representing the words *'spanish'* and *'revolution'*.

Several composition models are available:


### Additive

In this model, $\mathbf{p}$ is obtained as a weighted sum of $\mathbf{u}$ and $\mathbf{v}$:

$$\mathbf{p = \alpha u + \beta v}$$

The `mangoes.composition.AdditiveComposer` lets you learn $\alpha$ and $\beta$ from your representation and apply them:

In [13]:
additive = mangoes.composition.AdditiveComposer(representation)
additive.fit()

print("alpha=", additive.alpha, ", beta=", additive.beta)

alpha= 0.23998432440049766 , beta= 0.18570667435097205
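
One plausible way to estimate such weights is ordinary least squares over all observed (adjective, noun, phrase) triples, minimizing $\sum_i \lVert \mathbf{p}_i - \alpha \mathbf{u}_i - \beta \mathbf{v}_i \rVert^2$. The sketch below solves the resulting 2x2 normal equations in plain Python. It is an assumption about the fitting procedure, not a description of what `AdditiveComposer.fit()` does internally, and `fit_additive` and `compose_additive` are hypothetical helper names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fit_additive(triples):
    """Least-squares alpha, beta for p ~ alpha*u + beta*v over (u, v, p) triples."""
    suu = sum(dot(u, u) for u, v, p in triples)
    suv = sum(dot(u, v) for u, v, p in triples)
    svv = sum(dot(v, v) for u, v, p in triples)
    sup = sum(dot(u, p) for u, v, p in triples)
    svp = sum(dot(v, p) for u, v, p in triples)
    # Solve [[suu, suv], [suv, svv]] @ [alpha, beta] = [sup, svp] by Cramer's rule
    det = suu * svv - suv * suv
    alpha = (sup * svv - svp * suv) / det
    beta = (svp * suu - sup * suv) / det
    return alpha, beta


def compose_additive(u, v, alpha, beta):
    """Weighted sum p = alpha*u + beta*v."""
    return [alpha * x + beta * y for x, y in zip(u, v)]


# Toy triples generated exactly with alpha=0.5 and beta=0.25
triples = [
    ([1.0, 0.0], [0.0, 2.0], [0.5, 0.5]),
    ([2.0, 1.0], [1.0, 1.0], [1.25, 0.75]),
]
alpha, beta = fit_additive(triples)
predicted = compose_additive([1.0, 0.0], [0.0, 2.0], alpha, beta)
```

On this noise-free toy data the fit recovers the generating weights; on real embeddings the phrase vectors are only approximated.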


Now we can create a new vector for the phrase "spanish revolution" using this composer:

In [14]:
predicted_vector = additive.predict("spanish", "revolution")

We can measure the cosine distance between this vector and the 'observed' one:

In [15]:
observed_vector = representation[('spanish', 'revolution')]
sklearn.metrics.pairwise_distances(observed_vector.reshape(1,-1), predicted_vector.reshape(1,-1), metric='cosine')

array([[0.25506289]])
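
For reference, the cosine distance reported by scikit-learn is one minus the cosine similarity, so 0 means the two vectors point in the same direction and larger values mean they diverge. A minimal plain-Python version (the name `cosine_distance` is only illustrative):

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposite vectors
```

The distance ignores vector length, which is why the small values of $\alpha$ and $\beta$ learned above do not by themselves push the predicted vector away from the observed one.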

And check that the observed vector is (one of) the closest word(s) to the predicted one:

In [16]:
representation.get_closest_words(predicted_vector)

[('spanish', 0.11374605394199377),
 ('revolution', 0.19878176530487057),
 (Bigram(first='spanish', second='revolution'), 0.2550628862532164),
 ('great', 0.4006380255217601),
 ('american', 0.47959369999437507),
 (Bigram(first='social', second='revolt'), 0.4991334812931577),
 (Bigram(first='italian', second='anarchist'), 0.5272597951652993),
 ('revolt', 0.545477326236689),
 ('first', 0.555967046805788),
 ('socialist', 0.5626577557557526)]

In [17]:
representation.get_closest_words(representation["free"])

[('free', 0.0),
 (Bigram(first='social', second='revolt'), 0.38078405645451285),
 (Bigram(first='spanish', second='revolution'), 0.40018786699686204),
 ('revolution', 0.4019392918969279),
 ('important', 0.4038526248912737),
 ('libertarian', 0.4609095858567147),
 ('anarchist', 0.4702513974745852),
 ('school', 0.48370411646330513),
 (Bigram(first='socialist', second='section'), 0.48880976973598966),
 (Bigram(first='american', second='anarchist'), 0.4998128424672039)]

### Dilation

We can now do the same with the dilation model, where:
$$ \mathbf{p = (u \cdot u)v + (\lambda - 1)(u \cdot v)u} $$
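
In its usual formulation, $\mathbf{p} = (\mathbf{u} \cdot \mathbf{u})\mathbf{v} + (\lambda - 1)(\mathbf{u} \cdot \mathbf{v})\mathbf{u}$, dilation stretches the component of $\mathbf{v}$ that lies along $\mathbf{u}$ by a factor $\lambda$ and leaves the orthogonal component unchanged, up to an overall scaling by $\mathbf{u} \cdot \mathbf{u}$. The sketch below applies that formula to toy vectors; `dilate` is a hypothetical helper, not part of mangoes.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dilate(u, v, lam):
    """Dilation composition: p = (u.u) v + (lam - 1)(u.v) u."""
    uu = dot(u, u)
    uv = dot(u, v)
    return [uu * y + (lam - 1.0) * uv * x for x, y in zip(u, v)]


u = [1.0, 0.0]
v = [1.0, 1.0]
stretched = dilate(u, v, 3.0)   # component of v along u is multiplied by 3
unchanged = dilate(u, v, 1.0)   # lam = 1 gives (u.u) v, i.e. v itself here
```

With $\lambda < 1$ the component of $\mathbf{v}$ along $\mathbf{u}$ is shrunk rather than stretched.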

In [18]:
dilation = mangoes.composition.DilationComposer(representation)
dilation.fit()
print("lambda =", dilation.lambda_)

lambda = 0.0479736328125


In [19]:
predicted_vector = dilation.predict('spanish', 'revolution')
representation.get_closest_words(predicted_vector)

[('revolution', 0.08883381060292639),
 ('socialist', 0.3877974085450775),
 (Bigram(first='social', second='revolt'), 0.3892323092098494),
 ('free', 0.39044831954995074),
 ('new', 0.418613193096521),
 ('russian', 0.42490693718161976),
 (Bigram(first='socialist', second='section'), 0.4370704922434915),
 ('section', 0.45431084064948757),
 (Bigram(first='spanish', second='revolution'), 0.4592912722438147),
 ('revolt', 0.4728781544830122)]

### Full Additive

This model uses two weight matrices $\mathbf{A}$ and $\mathbf{B}$ such that:

$$\mathbf{p = Au + Bv}$$
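
A common way to estimate two such matrices is to stack each adjective vector with its noun vector and solve a single linear least-squares problem. The NumPy sketch below does this on synthetic data; it is an illustration under that assumption, `fit_full_additive` is a hypothetical name, and the fitting procedure inside `FullAdditiveComposer` (for example any regularization) may differ.

```python
import numpy as np


def fit_full_additive(U, V, P):
    """Least-squares A, B such that P[i] ~ A @ U[i] + B @ V[i].

    U, V, P are arrays of shape (n_phrases, d) with aligned rows.
    """
    d = U.shape[1]
    X = np.hstack([U, V])                      # shape (n, 2d): [u_i, v_i] per row
    W, *_ = np.linalg.lstsq(X, P, rcond=None)  # solves X @ W = P, W is (2d, d)
    return W[:d].T, W[d:].T                    # W stacks A^T on top of B^T


# Synthetic, noise-free data generated from known matrices
rng = np.random.default_rng(0)
d, n = 3, 50
A_true = rng.normal(size=(d, d))
B_true = rng.normal(size=(d, d))
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
P = U @ A_true.T + V @ B_true.T

A, B = fit_full_additive(U, V, P)
predicted = A @ U[0] + B @ V[0]
```

Compared with the additive model, which has two scalar parameters, this model has $2d^2$ parameters, so it needs many more observed phrases to be estimated reliably.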


In [20]:
full_additive = mangoes.composition.FullAdditiveComposer(representation)
full_additive.fit()

In [21]:
predicted_vector = full_additive.predict("spanish", "revolution")
representation.get_closest_words(predicted_vector)

[(Bigram(first='spanish', second='revolution'), 0.0798612203420066),
 (Bigram(first='russian', second='revolution'), 0.1849092152475038),
 ('revolution', 0.3999667609136286),
 ('free', 0.42422954264848434),
 ('spanish', 0.516363951687342),
 (Bigram(first='german', second='revolution'), 0.6007668725140161),
 ('anarchist', 0.612696317569871),
 ('school', 0.6315393529618919),
 ('libertarian', 0.6483982101625264),
 ('important', 0.65414547506035)]

### Lexical Function

Finally, we can also learn a matrix $\mathbf{U}$ for each adjective with the "lexical function" model, where:
$$ \mathbf{p = Uv} $$
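
Here the adjective is no longer a vector but a linear map applied to the noun vector. A minimal NumPy sketch of how such a matrix could be fit by least squares from the (noun, phrase) pairs observed for one adjective is shown below; `fit_lexical` is a hypothetical helper and the procedure is an assumption, not necessarily what `LexicalComposer.fit()` does.

```python
import numpy as np


def fit_lexical(nouns, phrases):
    """Least-squares U such that U @ nouns[i] ~ phrases[i].

    nouns and phrases are arrays of shape (n_pairs, d) with aligned rows.
    """
    # Solve nouns @ X = phrases; row i gives nouns[i] @ X = phrases[i],
    # which is the same as X.T @ nouns[i] = phrases[i], so U = X.T
    X, *_ = np.linalg.lstsq(nouns, phrases, rcond=None)
    return X.T


# Two toy training pairs in a 3-dimensional space (fewer pairs than dimensions)
nouns = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0]])
phrases = np.array([[0.5, 1.0, 0.0],
                    [1.0, 0.0, 2.0]])

U = fit_lexical(nouns, phrases)
predicted = U @ nouns[0]
```

When an adjective has fewer training pairs than dimensions, as in this toy case, an unregularized fit can reproduce the training phrases exactly. In this notebook's vocabulary, 'spanish' appears in only two AN phrases, so a near-perfect match on a training phrase says little about how well the matrix generalizes to unseen nouns.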

In [22]:
spanish = mangoes.composition.LexicalComposer(representation, "spanish")
spanish.fit()

In [23]:
predicted_vector = spanish.predict('revolution')
representation.get_closest_words(predicted_vector)

[(Bigram(first='spanish', second='revolution'), 0.0),
 (Bigram(first='russian', second='revolution'), 0.2748850980055798),
 ('revolution', 0.2797298800799053),
 ('free', 0.40018786699686204),
 ('spanish', 0.43533042891574325),
 ('libertarian', 0.597037930537175),
 ('anarchist', 0.6433865256501664),
 ('revolt', 0.6611179002823516),
 ('russian', 0.661944569060751),
 ('section', 0.6661411980893858)]