Measuring semantic association with static word embeddings and with PMI.

NOTE: the questions and comments refer to material already covered in the theory session.

In [1]:
%%capture
!pip install watermark

In [2]:
%load_ext watermark

In [3]:
%watermark -udvp gensim,nltk,numpy,pandas,sklearn

Last updated: 2024-02-14

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

gensim : 4.3.2
nltk   : 3.8.1
numpy  : 1.25.2
pandas : 1.5.3
sklearn: 1.2.2



## Pre-trained embeddings vs. _from scratch_

We start by downloading a small set of pre-trained embeddings through gensim.

In [4]:
import gensim.downloader as api

api.info().keys()

dict_keys(['corpora', 'models'])

In [5]:
api.info()["models"].keys()  # available models

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

In [6]:
glove_wikig = api.load('glove-wiki-gigaword-50') # GloVe Wikipedia+Gigaword dim=50

In [7]:
len(glove_wikig.index_to_key)

400000

In [8]:
glove_wikig.index_to_key[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [9]:
glove_wikig["hello"]

array([-0.38497 ,  0.80092 ,  0.064106, -0.28355 , -0.026759, -0.34532 ,
       -0.64253 , -0.11729 , -0.33257 ,  0.55243 , -0.087813,  0.9035  ,
        0.47102 ,  0.56657 ,  0.6985  , -0.35229 , -0.86542 ,  0.90573 ,
        0.03576 , -0.071705, -0.12327 ,  0.54923 ,  0.47005 ,  0.35572 ,
        1.2611  , -0.67581 , -0.94983 ,  0.68666 ,  0.3871  , -1.3492  ,
        0.63512 ,  0.46416 , -0.48814 ,  0.83827 , -0.9246  , -0.33722 ,
        0.53741 , -1.0616  , -0.081403, -0.67111 ,  0.30923 , -0.3923  ,
       -0.55002 , -0.68827 ,  0.58049 , -0.11626 ,  0.013139, -0.57654 ,
        0.048833,  0.67204 ], dtype=float32)

In [10]:
"palabra_muy_rara" in glove_wikig, "the" in glove_wikig

(False, True)

In [11]:
# are the vectors normalized?
import numpy as np

np.linalg.norm(glove_wikig["hello"])

4.2438774
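The norm above is well over 1, so these vectors are not unit length. As a minimal sketch with made-up toy vectors (not GloVe values), the snippet below checks that cosine similarity is the same as the dot product of the L2-normalized vectors; the helper name `unit` is introduced here only for illustration.

```python
import numpy as np

# Toy vectors with made-up values (not taken from GloVe).
v1 = np.array([3.0, 4.0, 0.0])
v2 = np.array([1.0, 2.0, 2.0])

def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector so that its L2 norm is 1 (hypothetical helper)."""
    return v / np.linalg.norm(v)

# Cosine similarity computed on the raw vectors...
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
# ...equals the plain dot product once both vectors are unit length.
dot_of_units = np.dot(unit(v1), unit(v2))

print(cosine, dot_of_units)
```

Because cosine similarity divides by the norms, it gives the same answer whether or not the stored vectors are normalized; the norm only matters for operations that add or average raw vectors.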

Now we train embeddings on a corpus we might care about: US presidential inaugural addresses. (The corpora are tiny, so this is only illustrative.)

In [12]:
# download the speeches:
import nltk
nltk.download('inaugural')

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


True

In [13]:
from nltk.corpus import inaugural

print(inaugural.fileids()[-8:])

['1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt', '2013-Obama.txt', '2017-Trump.txt', '2021-Biden.txt']


In [14]:
# we use George W. Bush and Obama:
bush_corpus = inaugural.raw('2001-Bush.txt') + "\n" + inaugural.raw('2005-Bush.txt')
obama_corpus = inaugural.raw('2009-Obama.txt') + "\n" + inaugural.raw('2013-Obama.txt')

Pay attention to the **preprocessing** we choose!

In a supervised task, we can try several preprocessing options and keep the one that performs best.
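To make the effect of these choices concrete, here is a small sketch that applies four preprocessing variants to a toy sentence. The `preprocess` function and its regex tokenizer are hypothetical stand-ins written for this illustration; they are not the NLTK tokenizers used in the cells of this notebook.

```python
import re
from string import punctuation

def preprocess(text: str, lowercase: bool = True, drop_punct: bool = True) -> list:
    """Hypothetical helper: regex tokenization with two optional cleaning steps."""
    # Words are runs of word characters; any other non-space character is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if drop_punct:
        tokens = [t for t in tokens if t not in punctuation]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens

sample = "Freedom, and Liberty. Freedom!"  # toy text, not from the corpus

# One token list per combination of (lowercase, drop_punct).
variants = {
    (lc, dp): preprocess(sample, lowercase=lc, drop_punct=dp)
    for lc in (True, False)
    for dp in (True, False)
}

for options, tokens in variants.items():
    print(options, len(set(tokens)), tokens)
```

Even on one sentence the vocabulary size changes with the options, which in turn changes which words pass `min_count` and which co-occurrences are counted.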

In [15]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
# Build one list of tokens per sentence. Co-occurrence windows
# are then formed within sentence boundaries.
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation

# As an example, we only remove punctuation and lowercase the tokens.
# Note: tokens such as '--' are not removed by this check.
bush_sentences = []
for sentence in sent_tokenize(bush_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    bush_sentences.append(words_)

obama_sentences = []
for sentence in sent_tokenize(obama_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    obama_sentences.append(words_)

In [17]:
print(bush_sentences[-1])
print(obama_sentences[-1])

['may', 'god', 'bless', 'you', 'and', 'may', 'he', 'watch', 'over', 'the', 'united', 'states', 'of', 'america']
['god', 'bless', 'you', 'and', 'may', 'he', 'forever', 'bless', 'these', 'united', 'states', 'of', 'america']


Pay attention to the **hyperparameters** we use! What does each one mean?

NOTE: any hyperparameter not covered in detail in the theory session can be explained here.

In [18]:
from gensim.models import Word2Vec

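params = {
    "vector_size": 100,      # dimensionality of the word vectors
    "alpha": 0.025,          # initial learning rate
    "window": 10,            # max distance between target and context word (same as GloVe)
    "min_count": 5,          # ignore words with total frequency below this
    "max_vocab_size": None,  # no cap on vocabulary size while building it
    "seed": 33,              # random seed for initialization
    "workers": 2,            # number of worker threads
    "sg": 1,                 # 1 = skip-gram, 0 = CBOW
    "negative": 5,           # number of negative samples per positive example
    "epochs": 2,             # number of passes over the corpus
}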

In [19]:
w2v_bush = Word2Vec(bush_sentences, **params)
w2v_obama = Word2Vec(obama_sentences, **params)
# To save the model:
# w2v_obama.save("obama_w2v.model")

In [20]:
print(len(w2v_bush.wv.index_to_key), len(w2v_obama.wv.index_to_key))

117 136


How do we measure similarity?

In [21]:
# some interesting words:
words = [
    "america", "freedom", "hope", "god", "american", "citizens", "democracy",
    "liberty", "freedoms", "liberties", "rights", "justice", "equality",
    "opportunity", "nation", "security", "peace", "war",
]

for word in words:
    if word in w2v_bush.wv and word in w2v_obama.wv:
        print(f"{word} is in both vocabularies")

america is in both vocabularies
freedom is in both vocabularies
hope is in both vocabularies
god is in both vocabularies
american is in both vocabularies
citizens is in both vocabularies
liberty is in both vocabularies
nation is in both vocabularies


In [22]:
# with gensim:
print(w2v_bush.wv.n_similarity(["nation"], ["god"]))
print(w2v_obama.wv.n_similarity(["nation"], ["god"]))

# CAREFUL! can we compare similarity values across different models?

0.9221342
0.99276114
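Raw cosine values from separately trained models live on different scales, so a higher number in one model does not by itself mean a stronger association. One possible workaround, sketched here with made-up 2-d toy embeddings, is to compare ranks within each model instead of raw values. The function `similarity_percentile` is a hypothetical helper for this illustration, not a gensim API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_percentile(vectors: dict, w1: str, w2: str) -> float:
    """Fraction of the remaining words that are less similar to w1 than w2 is."""
    target = cosine(vectors[w1], vectors[w2])
    others = [w for w in vectors if w not in (w1, w2)]
    below = sum(cosine(vectors[w1], vectors[w]) < target for w in others)
    return below / len(others)

# Toy embeddings (made-up values) standing in for two separately trained models.
# In model_b all vectors point in nearly the same direction, as happens in the tiny
# models trained in this notebook, so every cosine is close to 1.
model_a = {"nation": [1.0, 0.0], "god": [0.9, 0.1], "war": [0.0, 1.0], "hope": [0.5, 0.5]}
model_b = {"nation": [1.0, 1.0], "god": [1.0, 0.9], "war": [1.0, 0.8], "hope": [1.0, 0.95]}

raw_a = cosine(model_a["nation"], model_a["god"])
raw_b = cosine(model_b["nation"], model_b["god"])
pct_a = similarity_percentile(model_a, "nation", "god")
pct_b = similarity_percentile(model_b, "nation", "god")

print(raw_a, pct_a)
print(raw_b, pct_b)
```

In this toy setup the raw similarity is higher in `model_b`, yet "god" ranks lower among the neighbors of "nation" there than in `model_a`. Ranks are only one option and still depend on each model's vocabulary.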


In [23]:
# by hand:
import numpy as np
from gensim.models import KeyedVectors

def cossim(v1: np.ndarray, v2: np.ndarray) -> float:
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def get_vector(embeddings: KeyedVectors, word: str) -> np.ndarray:
    return embeddings[word]

def similarity(embeddings: KeyedVectors, word1: str, word2: str) -> float:
    return cossim(get_vector(embeddings, word1), get_vector(embeddings, word2))

print(similarity(w2v_bush.wv, "nation", "god"))
print(similarity(w2v_bush.wv, "god", "god"))

0.9221342
1.0


In [24]:
# the most similar words (with gensim):
print(w2v_bush.wv.most_similar("god"))
print(w2v_obama.wv.most_similar("god"))

[('the', 0.9594517946243286), ('of', 0.9570308327674866), ('a', 0.9568242430686951), ('in', 0.9549646377563477), ('and', 0.9525467753410339), ('we', 0.9509894847869873), ('our', 0.9501854181289673), ('will', 0.948925793170929), ('justice', 0.947930097579956), ('because', 0.9462941884994507)]
[('and', 0.9961374402046204), ('is', 0.9960247874259949), ('are', 0.9959288239479065), ('must', 0.9958963394165039), ('all', 0.9957402944564819), ('we', 0.9957396984100342), ('the', 0.9956983327865601), ('our', 0.9956619739532471), ('not', 0.9956296682357788), ('of', 0.9955527782440186)]


In [25]:
# by hand:
def most_similar(
        embeddings: KeyedVectors, word: str, topn: int = 10, remove_words: list = []
) -> list:
    word_vector = get_vector(embeddings, word)
    words = embeddings.index_to_key
    sims = []
    for w in words:
        if w not in remove_words + [word]:
            sims.append((w, cossim(word_vector, get_vector(embeddings, w))))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]

In [26]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

most_similar(w2v_bush.wv, "god", remove_words=stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('justice', 0.94793004),
 ('freedom', 0.9435305),
 ('history', 0.94246864),
 ('americans', 0.9399281),
 ('know', 0.9395295),
 ('liberty', 0.9385651),
 ('society', 0.93821764),
 ('one', 0.93680054),
 ('time', 0.93362),
 ('america', 0.93257916)]

In [27]:
most_similar(w2v_obama.wv, "god", remove_words=stopwords)

[('must', 0.99589634),
 ('us', 0.99531686),
 ('--', 0.995276),
 ('oath', 0.9950712),
 ('time', 0.994988),
 ('today', 0.994887),
 ('one', 0.99487513),
 ('freedom', 0.9947009),
 ('common', 0.9946429),
 ('every', 0.99462664)]

In [28]:
# in the pre-trained embeddings:
glove_wikig.most_similar("god")

[('divine', 0.8702991604804993),
 ('heaven', 0.840563178062439),
 ('christ', 0.8195050954818726),
 ('faith', 0.7931702733039856),
 ('allah', 0.7795287370681763),
 ('holy', 0.7723780870437622),
 ('sacred', 0.7701617479324341),
 ('true', 0.7637671828269958),
 ('jesus', 0.7632808685302734),
 ('gods', 0.7590562105178833)]

In [29]:
glove_wikig.most_similar("argentina")

[('uruguay', 0.8979324698448181),
 ('brazil', 0.8712695240974426),
 ('portugal', 0.8712440133094788),
 ('chile', 0.8661478757858276),
 ('paraguay', 0.8622300028800964),
 ('spain', 0.8610804080963135),
 ('costa', 0.8431476950645447),
 ('ecuador', 0.8335972428321838),
 ('rica', 0.8328091502189636),
 ('argentine', 0.8223130106925964)]

In [30]:
# we can average the vectors and then compute the similarity:
glove_wikig.n_similarity(["fun", "funny"], ["argentina"])

0.039312582

In [31]:
# by hand:
def similarity_multiple(model, words1: list, words2: list) -> float:
    # average vector of each word list:
    v1 = np.mean([get_vector(model, w) for w in words1], axis=0)
    v2 = np.mean([get_vector(model, w) for w in words2], axis=0)
    # similarity between the two averages:
    return cossim(v1, v2)

similarity_multiple(glove_wikig, ["fun", "funny"], ["argentina"])

0.039312575
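Averaging the vectors first and then taking the cosine is not the same as averaging the pairwise cosines. The sketch below shows the difference on made-up 2-d toy vectors; the variable names reuse words from the cell above only as labels.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy vectors with made-up values; "fun" is deliberately longer than "funny".
fun, funny, country = [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Option 1 (what similarity_multiple does): average the raw vectors, then cosine.
cos_of_mean = cosine(mean_vector([fun, funny]), country)
# Option 2: cosine for each word separately, then average the scores.
mean_of_cos = (cosine(fun, country) + cosine(funny, country)) / 2

print(cos_of_mean, mean_of_cos)
```

With raw averaging, longer vectors weigh more in the mean, so the two options can disagree. Normalizing each vector before averaging would remove that length effect.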

In [32]:
paises = ["argentina", "brazil", "chile", "mexico", "peru", "uruguay", "venezuela"]
target_words = ["fun", "funny", "happy", "joy", "happiness", "cheerful", "party"]
for pais in paises:
    sim_ = similarity_multiple(glove_wikig, [pais], target_words)
    print(f"{pais}: {sim_:.2f}")

argentina: 0.16
brazil: 0.24
chile: 0.11
mexico: 0.24
peru: 0.09
uruguay: 0.03
venezuela: 0.12


In [33]:
# analogies
def analogias(
        embeddings: KeyedVectors, x1: str, x2: str, y1: str,
        topn: int = 10, remove_words: list = []
) -> list:
    """
    "x1" is to "x2" as "y1" is to ...
    """
    v_x1 = get_vector(embeddings, x1)
    v_x2 = get_vector(embeddings, x2)
    v_y1 = get_vector(embeddings, y1)
    v_y2 = v_y1 + (v_x2 - v_x1)
    # # gensim version (normalizes each vector before summing):
    # v_x1_normalized = v_x1 / np.linalg.norm(v_x1)
    # v_x2_normalized = v_x2 / np.linalg.norm(v_x2)
    # v_y1_normalized = v_y1 / np.linalg.norm(v_y1)
    # v_y2 = np.mean([v_y1_normalized, v_x2_normalized, -v_x1_normalized], axis=0)
    words = embeddings.index_to_key
    sims = []
    for w in words:
        if w not in [x1, x2, y1] + remove_words:
            sims.append((w, cossim(v_y2, get_vector(embeddings, w))))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]


In [34]:
analogias(glove_wikig, "argentina", "tango", "brazil", remove_words=stopwords)

[('flamenco', 0.70419997),
 ('vallenato', 0.6495547),
 ('dance', 0.6477156),
 ('capoeira', 0.6441592),
 ('vibe', 0.63394487),
 ('bossa', 0.62979794),
 ('dancing', 0.6267666),
 ('sanremo', 0.6194488),
 ('frevo', 0.6183353),
 ('samba', 0.61694306)]

In [35]:
analogias(glove_wikig, "cumbia", "argentina", "samba", remove_words=stopwords)

[('portugal', 0.75611746),
 ('brazil', 0.7116333),
 ('valencia', 0.70446074),
 ('porto', 0.7000426),
 ('rio', 0.6864369),
 ('spain', 0.67428696),
 ('madrid', 0.66925627),
 ('1-0', 0.6683855),
 ('costa', 0.66641146),
 ('barcelona', 0.66575617)]

In [36]:
# with gensim:
glove_wikig.most_similar(positive=["samba", "argentina"], negative=["cumbia"])

[('portugal', 0.7374612092971802),
 ('rio', 0.6975606679916382),
 ('brazil', 0.6965111494064331),
 ('porto', 0.6963253617286682),
 ('valencia', 0.6952071785926819),
 ('1-0', 0.6584232449531555),
 ('madrid', 0.6578923463821411),
 ('barcelona', 0.6550120711326599),
 ('costa', 0.6534873843193054),
 ('monaco', 0.651835024356842)]
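The hand-written `analogias` and the gensim call rank the candidates slightly differently. According to the comment in the function above, gensim normalizes each vector before combining them, while the by-hand version adds raw vectors. The following sketch, with made-up 2-d toy vectors and hypothetical candidate names `c1` and `c2`, shows that this choice alone can change the top answer.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to L2 norm 1."""
    return v / np.linalg.norm(v)

def best_match(query: np.ndarray, candidates: dict) -> str:
    """Name of the candidate with the highest cosine similarity to the query."""
    return max(candidates, key=lambda w: float(np.dot(unit(query), unit(candidates[w]))))

# Toy vectors with made-up values; x1 and x2 are much longer than y1.
x1, x2, y1 = np.array([4.0, 0.0]), np.array([4.0, 4.0]), np.array([1.0, 0.0])
candidates = {"c1": np.array([1.0, 1.0]), "c2": np.array([1.0, 3.0])}

# Raw offset: long vectors dominate the resulting direction.
raw_query = y1 + (x2 - x1)
# Normalized offset: every word contributes with the same length.
unit_query = unit(y1) + unit(x2) - unit(x1)

best_raw = best_match(raw_query, candidates)
best_unit = best_match(unit_query, candidates)
print(best_raw, best_unit)
```

Taking the mean instead of the sum of the three vectors only rescales the query, so it does not affect the cosine ranking.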

## Embedding evaluation

How do we measure the quality of embeddings? Extrinsically (on a downstream task) or **intrinsically** (on benchmarks such as word-similarity datasets).

We borrow code from [word-embeddings-benchmarks](https://github.com/kudkudak/word-embeddings-benchmarks/tree/master) to evaluate on SimLex999.

In [37]:
import pandas as pd

def fetch_simlex999() -> pd.DataFrame:
    df = pd.read_csv('https://www.dropbox.com/s/0jpa1x8vpmk3ych/EN-SIM999.txt?dl=1', sep="\t")
    return df[['word1', 'word2', 'POS', 'SimLex999']]

df_simlex = fetch_simlex999()

In [38]:
len(df_simlex)

999

In [39]:
df_simlex.sample(6)

       word1      word2 POS  SimLex999
978    bring    restore   V       2.62
894     find  disappear   V       0.77
418   sinner      saint   N        1.6
191    steak       meat   N       7.47
807    greet       meet   V       6.17
831  succeed        try   V       3.98


In [40]:
def compute_similarities(embeddings: KeyedVectors, df: pd.DataFrame):
    words1 = df["word1"].tolist()
    words2 = df["word2"].tolist()
    missing_words = 0
    for word in words1 + words2:
        if word not in embeddings:
            missing_words += 1
    if missing_words > 0:
        print(f"Missing {missing_words} words. Will replace them with mean vector")
    mean_vector = np.mean(embeddings.vectors, axis=0, keepdims=True)
    A = np.vstack([get_vector(embeddings, w) if w in embeddings else mean_vector for w in words1])
    B = np.vstack([get_vector(embeddings, w) if w in embeddings else mean_vector for w in words2])
    scores = np.array([cossim(v1, v2) for v1, v2 in zip(A, B)])
    return scores

In [41]:
df_simlex["glove_scores"] = compute_similarities(glove_wikig, df_simlex)
df_simlex.head(2)

   word1        word2 POS  SimLex999  glove_scores
0    old          new   A       1.58       0.61972
1  smart  intelligent   A        9.2      0.776587


In [42]:
similarity(glove_wikig, "old", "new")

0.61971974

In [43]:
import scipy.stats  # import the submodule explicitly; a bare `import scipy` may not load it

scipy.stats.spearmanr(df_simlex["SimLex999"], df_simlex["glove_scores"]).correlation

0.2645792192990813
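Spearman correlation only looks at how well the two rankings agree, not at the raw scores, which is why cosine values and 0-10 human ratings can be compared directly. As a minimal sketch, the snippet below computes it by hand on made-up toy scores using the classic formula for data without ties. `scipy.stats.spearmanr` also handles ties, which this simplified version does not.

```python
def ranks(values):
    """Rank from 1 (smallest) to n. Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid only without ties."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy scores (made-up): human ratings vs. model similarities for five word pairs.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [0.1, 0.3, 0.2, 0.5, 0.4]

rho = spearman_no_ties(human, model)
print(rho)
```

Two adjacent swaps in the model ranking are enough to bring the correlation down from 1 to 0.8 in this five-pair example.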

Does it make sense to evaluate the embeddings trained on presidential speeches on these benchmarks?

In [44]:
w2v_obama_scores = compute_similarities(w2v_obama.wv, df_simlex)
scipy.stats.spearmanr(df_simlex["SimLex999"], w2v_obama_scores).correlation

Missing 1949 words. Will replace them with mean vector


0.019195523182419714
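The message above says that 1949 word slots were missing, out of 999 pairs with two words each, so almost every score compares the mean vector with itself or with one known word. Before reading anything into such a correlation it helps to measure coverage and, possibly, to evaluate only on fully covered pairs. The sketch below does this on toy data; `split_by_coverage` is a hypothetical helper and the pairs and vocabulary are illustrative.

```python
def split_by_coverage(pairs, vocab):
    """Return the pairs whose two words are in the vocabulary, and the covered fraction."""
    covered = [(a, b) for a, b in pairs if a in vocab and b in vocab]
    return covered, len(covered) / len(pairs)

# Toy benchmark pairs and toy vocabulary (illustrative, not the real data).
toy_pairs = [("old", "new"), ("steak", "meat"), ("god", "nation")]
toy_vocab = {"old", "new", "god", "nation", "the"}

covered, coverage = split_by_coverage(toy_pairs, toy_vocab)
print(covered, coverage)
```

Restricting to covered pairs makes the score meaningful but no longer comparable across models with different vocabularies, so coverage should be reported alongside the correlation.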

## PMI

We use GloVe to compute the co-occurrences because its compiled tools are faster than a pure Python function.

In [45]:
!git clone https://github.com/stanfordnlp/GloVe.git

fatal: destination path 'GloVe' already exists and is not an empty directory.


In [46]:
%%capture
!cd GloVe && make

In [47]:
# we use all the speeches to have more data:
files = inaugural.fileids()
presidents_corpus = ""
for f in files:
    presidents_corpus = presidents_corpus + "\n" + inaugural.raw(f)
presidents_sentences = []
for sentence in sent_tokenize(presidents_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    presidents_sentences.append(words_)

In [48]:
# save the corpus with one sentence per line
with open('sentences.txt', 'w') as f:
    for sentence in presidents_sentences:
        f.write(' '.join(sentence) + '\n')

In [49]:
# build the vocabulary
!GloVe/build/vocab_count -min-count 5 -verbose 0 < sentences.txt > vocab.txt

BUILDING VOCABULARY
Using vocabulary of size 2734.



In [50]:
!head -2 vocab.txt

the 10194
of 7185


In [51]:
# read the vocabulary as a dict:
str2count = {}
with open("vocab.txt", "r") as f:
    for line in f:
        word, count = line.split()
        str2count[word] = int(count)

str2idx = dict(zip(str2count.keys(), range(len(str2count))))

In [52]:
str2count["american"], str2idx["the"]

(171, 0)

In [53]:
# build a .bin file with the co-occurrences
!GloVe/build/cooccur -vocab-file vocab.txt -verbose 0 -window-size 10 -distance-weighting 0 < sentences.txt > coocs.bin
# windows: +-10 tokens, no weighting by distance to the center word

COUNTING COOCCURRENCES
Merging cooccurrence files: processed 481014 lines.



In [54]:
# technical details (not important) for reading the co-occurrences as a sparse matrix
import array
from ctypes import Structure, c_int, c_double, sizeof
from os import path
from scipy import sparse
from tqdm import tqdm

class CREC(Structure):
    """ctypes mirror of the C struct GloVe uses for (idx1, idx2, value) co-occurrence records
    """
    _fields_ = [('idx1', c_int),
                ('idx2', c_int),
                ('value', c_double)]


class IncrementalCOOMatrix:
    """class to create scipy.sparse.coo_matrix
    """

    def __init__(self, shape, dtype=np.double):
        self.dtype = dtype
        self.shape = shape
        self.rows = array.array('i')
        self.cols = array.array('i')
        self.data = array.array('d')

    def append(self, i, j, v):
        m, n = self.shape
        if (i >= m or j >= n):
            raise Exception('Index out of bounds')
        self.rows.append(i)
        self.cols.append(j)
        self.data.append(v)

    def tocoo(self):
        rows = np.frombuffer(self.rows, dtype=np.int32)
        cols = np.frombuffer(self.cols, dtype=np.int32)
        data = np.frombuffer(self.data, dtype=self.dtype)
        return sparse.coo_matrix((data, (rows, cols)), shape=self.shape)


def build_cooc_matrix(str2idx, cooc_file):
    """
    Build the full co-occurrence matrix from the binary GloVe co-occurrence file and the GloVe vocab
    Row and column indices are numeric indices from vocab_file
    There must be (i,j) for every (j,i) such that C[i,j]=C[j,i]
    """
    vocab_size = len(str2idx)  # vocab size (largest word index)
    size_crec = sizeof(CREC)  # crec: GloVe's co-occurrence record struct
    C = IncrementalCOOMatrix((vocab_size, vocab_size))
    K = path.getsize(cooc_file) / size_crec # total number of co-occurrence records
    pbar = tqdm(total=K)
    # open bin file and store coocs in C
    with open(cooc_file, 'rb') as f:
        # read and overwrite into cr while there is data
        cr = CREC()
        while (f.readinto(cr) == size_crec):
            C.append(cr.idx1-1, cr.idx2-1, cr.value) # because GloVe indices start at 1
            pbar.update(1)
    pbar.close()
    return C.tocoo().tocsr()

In [55]:
# Now build the matrix:
cooc_matrix = build_cooc_matrix(str2idx, "coocs.bin")

def get_cooc(w1: str, w2: str, str2idx: dict = str2idx, cooc_matrix=cooc_matrix) -> float:
    if w1 not in str2idx:
        print(f"{w1} not in vocab")
        return 0.
    if w2 not in str2idx:
        print(f"{w2} not in vocab")
        return 0.
    idx1 = str2idx[w1]
    idx2 = str2idx[w2]
    return cooc_matrix[idx1, idx2]

100%|██████████| 481014/481014.0 [00:00<00:00, 704929.30it/s]


In [56]:
# how should these values be interpreted?
pares = [
    ("the", "the"), ("of", "the"), ("the", "of"), ("united", "states"), ("americans", "war")
]
for par in pares:
    print(par, get_cooc(*par))

('the', 'the') 26220.0
('of', 'the') 13142.0
('the', 'of') 13142.0
('united', 'states') 167.0
('americans', 'war') 1.0
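To help read these numbers, here is a toy pure-Python counter that assumes symmetric, unweighted windows confined to each sentence. It is a simplified sketch, not a reimplementation of GloVe's `cooccur` tool, and `count_cooccurrences` is a name introduced only for this illustration.

```python
from collections import Counter

def count_cooccurrences(sentences, window):
    """Count ordered (word, context) pairs within +-window tokens of each position."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

toy = [["a", "b", "a"]]  # one toy sentence
counts = count_cooccurrences(toy, window=2)

print(counts[("a", "b")], counts[("b", "a")], counts[("a", "a")])
```

Under these assumptions the counts are symmetric, as with ('of', 'the') and ('the', 'of') above, and a word co-occurring with another instance of itself is counted once from each side, which is one reason diagonal entries such as ('the', 'the') can be so large.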


In [57]:
# co-occurrences with a single word
idx = str2idx["god"]
coocs = cooc_matrix[idx, :].toarray()[0]
coocs_dict = dict(zip(str2idx.keys(), coocs))
sorted(coocs_dict.items(), key=lambda x: x[1], reverse=True)[:20]

[('the', 92.0),
 ('and', 92.0),
 ('of', 74.0),
 ('in', 41.0),
 ('to', 38.0),
 ('our', 30.0),
 ('bless', 28.0),
 ('that', 24.0),
 ("'s", 22.0),
 ('you', 22.0),
 ('i', 21.0),
 ('we', 20.0),
 ('a', 15.0),
 ('almighty', 15.0),
 ('with', 14.0),
 ('america', 14.0),
 ('as', 13.0),
 ('is', 12.0),
 ('will', 12.0),
 ('may', 12.0)]

In [58]:
def get_pmi(words_w: list, words_c: list, str2idx=str2idx, cooc_matrix=cooc_matrix):
    """One PMI value for each word in W with respect to the words in C"""
    idx_w = [str2idx[w] for w in words_w if w in str2idx]
    idx_c = [str2idx[w] for w in words_c if w in str2idx]
    total_count = cooc_matrix.sum()
    count_c = cooc_matrix[idx_c, :].sum()
    counts_w = cooc_matrix.sum(axis=0)[:,idx_w]
    counts_w_c = cooc_matrix[idx_c,:][:,idx_w].sum(axis=0)
    return np.array(pmi(counts_w_c, counts_w, count_c, total_count)).flatten()


def pmi(counts_wc, counts_w, count_c, count_tot):
    """
    PMI for given word counts of lists of words W and C. It works vectorized across W
    if needed.
    Param:
        - counts_wc: co-occurrence array between C and W
        - counts_w: co-occurrence array for W words
        - count_c: co-occurrence count for C
        - count_tot: total co-occurrence count
    """
    numerador = counts_wc * count_tot
    denominador = counts_w * count_c
    res = np.log(numerador / denominador)
    return res

In [59]:
get_pmi(["god", "america"], ["almighty"])

array([3.81201175, 0.37664903])

In [60]:
get_pmi(["united", "country"], ["states"])

array([ 2.934153  , -0.53648942])
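The `pmi` function above computes log(count_wc * count_tot / (count_w * count_c)), which is the log of the ratio between the observed joint probability and the one expected under independence. The worked example below uses made-up counts to show the arithmetic; `pmi_from_counts` is a hypothetical scalar version written for this illustration.

```python
import math

def pmi_from_counts(count_wc, count_w, count_c, total):
    """PMI = log( P(w, c) / (P(w) * P(c)) ), with probabilities estimated from counts."""
    p_wc = count_wc / total
    p_w = count_w / total
    p_c = count_c / total
    return math.log(p_wc / (p_w * p_c))

# Made-up counts: w and c co-occur 20 times more often than chance would predict.
associated = pmi_from_counts(count_wc=10, count_w=100, count_c=50, total=10_000)
# Made-up counts: w and c co-occur exactly as often as independence predicts.
independent = pmi_from_counts(count_wc=5, count_w=100, count_c=500, total=10_000)

print(associated, independent)
```

Positive values mean the pair co-occurs more than expected by chance, values near zero mean roughly independent, and negative values (like "country" with "states" above) mean less than expected.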

What kind of associations does PMI capture? And embedding similarity?

(Careful: we are comparing different corpora.)

In [61]:
most_similar(glove_wikig, "country")

[('nation', 0.91625977),
 ('bringing', 0.871822),
 ('now', 0.8387486),
 ('countries', 0.8238735),
 ('still', 0.8222363),
 ('already', 0.8211523),
 ('far', 0.8190747),
 ('has', 0.8157515),
 ('well', 0.8132111),
 ('decades', 0.8061228)]

In [62]:
all_words = list(str2idx.keys())
pmis = get_pmi(all_words, ["country"])
pmis_dict = dict(zip(all_words, pmis))
sorted(pmis_dict.items(), key=lambda x: x[1], reverse=True)[:10]

  res = np.log(numerador / denominador)


[('beloved', 2.87804230551936),
 ('ashamed', 2.7751478823567273),
 ('father', 2.5440361613933407),
 ('sake', 2.448245096751665),
 ('believes', 2.4386756457355143),
 ('extensive', 2.3921556301006213),
 ('summoned', 2.3433654659311896),
 ('section', 2.2839420454603885),
 ('title', 2.28045164052062),
 ('ardent', 2.2683302799882754)]
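The stray `res = np.log(...)` line in the output above looks like the remainder of a NumPy warning, presumably raised because words that never co-occur with "country" have a zero count and the log of zero is minus infinity. A common variant that avoids this, not used elsewhere in this notebook, is positive PMI (PPMI), which clips negative values to zero. The sketch below uses made-up counts and a hypothetical `ppmi` function that mirrors the signature of `pmi`.

```python
import numpy as np

def ppmi(counts_wc, counts_w, count_c, count_tot):
    """Positive PMI: like pmi, but zero counts and negative values are mapped to 0."""
    counts_wc = np.asarray(counts_wc, dtype=float)
    counts_w = np.asarray(counts_w, dtype=float)
    # log(0) yields -inf and a "divide by zero" warning; silence it locally.
    with np.errstate(divide="ignore"):
        values = np.log(counts_wc * count_tot / (counts_w * count_c))
    return np.maximum(values, 0.0)

# Made-up counts for three words: frequent co-occurrence, none at all, and a single one.
toy = ppmi(counts_wc=[10, 0, 1], counts_w=[100, 100, 100], count_c=50, count_tot=10_000)
print(toy)
```

PPMI does not address the other pattern visible in the list above: PMI tends to favor low-frequency words, so raising the minimum count or requiring a minimum number of co-occurrences are options worth trying as well.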

In [75]:
# How do we measure first-order co-occurrences? What do embeddings measure?
print(glove_wikig.n_similarity(["very"], ["good"]))
print(glove_wikig.n_similarity(["bad"], ["good"]))
print(glove_wikig.n_similarity(["watermelon"], ["good"]))
print(glove_wikig.n_similarity(["black"], ["white"]))

0.89199126
0.79648936
0.18132898
0.9058137
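One way to read these numbers: PMI measures first-order association (do two words appear together?), while embedding similarity is closer to second-order association (do two words appear in similar contexts?). That would explain why antonyms such as "bad" and "good" score high. The toy sketch below builds co-occurrence counts from four made-up sentences, using the whole sentence as the window, and contrasts the two notions.

```python
import math
from collections import Counter

# Toy corpus (made-up): "good" and "bad" never share a sentence but share contexts.
sentences = [
    ["very", "good", "movie"],
    ["very", "bad", "movie"],
    ["good", "idea"],
    ["bad", "idea"],
]

# Count ordered (word, context) pairs, using the whole sentence as the window.
counts = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        for j, context in enumerate(tokens):
            if i != j:
                counts[(word, context)] += 1

vocab = sorted({t for s in sentences for t in s})

def row(word):
    """Co-occurrence vector of a word over the whole vocabulary."""
    return [counts[(word, c)] for c in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

first_order = counts[("good", "bad")]            # direct co-occurrence
second_order = cosine(row("good"), row("bad"))   # similarity of their contexts

print(first_order, second_order)
```

Here the direct co-occurrence count is zero while the context vectors are identical, the extreme case of two words that are distributionally interchangeable without ever appearing together.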


## Bonus track

Identifying collocations with NPMI

In [76]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

collocations_model = Phrases(
    sentences=obama_sentences,
    min_count=3, threshold=.8, # NPMI threshold value
    scoring='npmi', connector_words=ENGLISH_CONNECTOR_WORDS)

In [77]:
collocations_model.export_phrases()

{'my_fellow': 0.8909582035968001,
 'health_care': 0.9053459170173651,
 'men_and_women': 0.9767488312387391,
 'god_bless': 0.8106918340347303,
 'united_states': 0.9607150061023549,
 'complete_until': 1.0}
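NPMI rescales PMI to the range [-1, 1] by dividing by -log P(a, b), which is why a fixed threshold such as 0.8 is easy to interpret. The sketch below assumes the standard definition and estimates probabilities from made-up token and bigram counts; `npmi` here is a hand-written function for illustration, not gensim's internal scorer.

```python
import math

def npmi(count_a, count_b, count_ab, total):
    """Normalized PMI: log(P(a,b) / (P(a) P(b))) / -log(P(a,b))."""
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Made-up counts: the two words only ever appear together.
always_together = npmi(count_a=3, count_b=3, count_ab=3, total=1000)
# Made-up counts: the bigram is exactly as frequent as independence predicts.
independent = npmi(count_a=100, count_b=100, count_ab=10, total=1000)

print(always_together, independent)
```

A score of 1.0, as for 'complete_until' above, therefore suggests that in this small corpus the two words were never seen apart, which is more a symptom of sparse data than evidence of a true collocation.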

In [78]:
sentences_with_collocations = collocations_model[obama_sentences]

In [79]:
print(sentences_with_collocations[0])
print(sentences_with_collocations[-1])

['my_fellow', 'citizens', 'i', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', 'grateful', 'for', 'the', 'trust', 'you', 'have', 'bestowed', 'mindful', 'of', 'the', 'sacrifices', 'borne', 'by', 'our', 'ancestors']
['god_bless', 'you', 'and', 'may', 'he', 'forever', 'bless', 'these', 'united_states', 'of', 'america']


In [80]:
# now we can train embeddings treating the collocations as tokens:
# w2v_model = Word2Vec(sentences_with_collocations, ...)