<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/notebooks/03_Embeddings_PMI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Measuring semantic association with static word embeddings and with PMI.

---

TASK: answer every numbered question marked in bold.

In [None]:
!pip install -U gensim watermark
# NOTE: the session must be restarted (Restart Session) after installing!

In [None]:
%load_ext watermark

In [None]:
%watermark -udvp nltk,numpy,pandas,sklearn,gensim

## Pre-trained vs. _from scratch_ embeddings

We start by downloading a set of pre-trained embeddings through gensim (a small one).

In [None]:
import gensim.downloader as api

api.info().keys()

In [None]:
api.info()["models"].keys()  # available models

In [None]:
glove_wikig = api.load('glove-wiki-gigaword-50') # GloVe Wikipedia+Gigaword dim=50

In [None]:
len(glove_wikig.index_to_key)

In [None]:
glove_wikig.index_to_key[:10]

In [None]:
glove_wikig["hello"]

In [None]:
"palabra_muy_rara" in glove_wikig, "the" in glove_wikig

In [None]:
import numpy as np

np.linalg.norm(glove_wikig["hello"])

**QUESTION 1** Do the `glove_wikig` embeddings come normalized? What does "normalized" mean?

Next we train embeddings on US presidential speeches (the corpora are tiny, so this is only illustrative).

In [None]:
# download the speeches:
import nltk
nltk.download('inaugural')

In [None]:
from nltk.corpus import inaugural

print(inaugural.fileids()[-8:])

In [None]:
# we use George W. Bush and Obama:
bush_corpus = inaugural.raw('2001-Bush.txt') + "\n" + inaugural.raw('2005-Bush.txt')
obama_corpus = inaugural.raw('2009-Obama.txt') + "\n" + inaugural.raw('2013-Obama.txt')

Pay attention to the **preprocessing** we choose!

In a supervised task, we can try several options and keep the one that performs best.
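
To see why this choice matters, here is a minimal sketch that applies two preprocessing variants to the same toy sentence and compares the resulting vocabularies. The function names and the sentence are made up for illustration, and the sketch uses only the standard library rather than the NLTK tokenizers used in this notebook.

```python
import re

def preprocess_keep_case(text: str) -> list:
    """Split on whitespace only: case and attached punctuation are kept."""
    return text.split()

def preprocess_lower_alpha(text: str) -> list:
    """Lowercase and keep only alphabetic runs (drops punctuation and digits)."""
    return re.findall(r"[a-z]+", text.lower())

toy_text = "Freedom is not free. FREEDOM, we hold, is earned!"  # invented example

vocab_a = set(preprocess_keep_case(toy_text))
vocab_b = set(preprocess_lower_alpha(toy_text))

# The first variant treats "Freedom" and "FREEDOM," as different types;
# the second one merges them into "freedom".
print(len(vocab_a), sorted(vocab_a))
print(len(vocab_b), sorted(vocab_b))
```

With a corpus as small as these speeches, merging surface forms can decide whether a word reaches `min_count` at all.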

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
# Build a list of tokens for each sentence. Co-occurrence windows
# are formed within sentence boundaries.
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation

# Here we only strip punctuation and lowercase, as an example.
# Note: `word not in punctuation` is a substring test, so multi-character
# tokens such as "--" or "''" are kept.
bush_sentences = []
for sentence in sent_tokenize(bush_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    bush_sentences.append(words_)

obama_sentences = []
for sentence in sent_tokenize(obama_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    obama_sentences.append(words_)

**QUESTION 2** What does "co-occurrence window" mean in the context of these embeddings?

In [None]:
print(bush_sentences[-1])
print(obama_sentences[-1])

Pay attention to the **hyperparameters** we use:

In [None]:
from gensim.models import Word2Vec

params = {
    "vector_size": 100,
    "alpha": 0.025,
    "window": 10,  # same as GloVe
    "min_count": 5,
    "max_vocab_size": None,
    "sg": 1,  # 0: CBOW, 1: Skip-gram
    "negative": 5,
    "epochs": 2,
    "seed": 33,
    "workers": 2,
}
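
To make the `window` setting concrete, the sketch below lists the (center, context) pairs that a symmetric window produces inside a single sentence. `context_pairs` and the toy sentence are hypothetical names and data, not part of gensim. As far as I know, gensim also shrinks the window at random during training by default, which this sketch ignores.

```python
def context_pairs(sentence: list, window: int) -> list:
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, sentence[j]))
    return pairs

toy_sentence = ["we", "the", "people", "declare", "today"]  # invented example

narrow = context_pairs(toy_sentence, window=1)
wide = context_pairs(toy_sentence, window=10)

for pair in narrow:
    print(pair)
print(len(narrow), len(wide))
```

Because pairs never cross a sentence boundary, a window of 10 on a five-token sentence simply pairs every token with every other token.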

**QUESTION 3** What do the parameters `min_count`, `negative`, and `epochs` mean?

In [None]:
w2v_bush = Word2Vec(bush_sentences, **params)
w2v_obama = Word2Vec(obama_sentences, **params)
# To save:
# w2v_obama.save("obama_w2v.model")

In [None]:
print(len(w2v_bush.wv.index_to_key), len(w2v_obama.wv.index_to_key))

In [None]:
# some interesting words:
words = [
    "america", "freedom", "hope", "god", "american", "citizens", "democracy",
    "liberty", "freedoms", "liberties", "rights", "justice", "equality",
    "opportunity", "nation", "security", "peace", "war",
]

for word in words:
    if word in w2v_bush.wv and word in w2v_obama.wv:
        print(f"{word} is in both vocabularies")

In [None]:
# Measuring similarity with gensim:
print(w2v_bush.wv.n_similarity(["nation"], ["god"]))
print(w2v_obama.wv.n_similarity(["nation"], ["god"]))

**QUESTION 4** Is it valid to compare the two similarity values above with each other?

In [None]:
# by hand:
import numpy as np
from gensim.models import KeyedVectors

def cossim(v1: np.ndarray, v2: np.ndarray) -> float:
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def get_vector(embeddings: KeyedVectors, word: str) -> np.ndarray:
    return embeddings[word]

def similarity(embeddings: KeyedVectors, word1: str, word2: str) -> float:
    return cossim(get_vector(embeddings, word1), get_vector(embeddings, word2))

print(similarity(w2v_bush.wv, "nation", "god"))
print(similarity(w2v_bush.wv, "god", "god"))
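
As a quick sanity check of what cosine similarity measures, here is a dependency-free sketch on hand-picked 2D vectors. The `cosine` helper is a plain-Python stand-in for the `cossim` function above, and the vectors are invented for illustration.

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

orthogonal = cosine([1.0, 0.0], [0.0, 1.0])     # perpendicular directions
same_dir = cosine([1.0, 2.0], [2.0, 4.0])       # same direction, different length
opposite = cosine([1.0, 2.0], [-1.0, -2.0])     # opposite direction

print(orthogonal, same_dir, opposite)
```

The second case shows that cosine similarity ignores vector length: only direction matters.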

In [None]:
# the most similar words (with gensim):
print(w2v_bush.wv.most_similar("god"))
print(w2v_obama.wv.most_similar("god"))

In [None]:
# by hand:
def most_similar(
        embeddings: KeyedVectors, word: str, topn: int = 10, remove_words: list = None
) -> list:
    word_vector = get_vector(embeddings, word)
    # build the exclusion set once; the query word itself is always excluded
    excluded = set(remove_words or []) | {word}
    sims = []
    for w in embeddings.index_to_key:
        if w not in excluded:
            sims.append((w, cossim(word_vector, get_vector(embeddings, w))))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]

In [None]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

most_similar(w2v_bush.wv, "god", remove_words=stopwords)

In [None]:
most_similar(w2v_obama.wv, "god", remove_words=stopwords)

In [None]:
# with the pre-trained embeddings:
glove_wikig.most_similar("god")

In [None]:
glove_wikig.most_similar("argentina")

In [None]:
# we can average the vectors and then compute the similarity:
glove_wikig.n_similarity(["fun", "funny"], ["argentina"])

In [None]:
# by hand:
def similarity_multiple(model, words1: list, words2: list) -> float:
    # averages:
    v1 = np.mean([get_vector(model, w) for w in words1], axis=0)
    v2 = np.mean([get_vector(model, w) for w in words2], axis=0)
    # similarity:
    return cossim(v1, v2)

similarity_multiple(glove_wikig, ["fun", "funny"], ["argentina"])
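
Averaging vectors first and then taking the cosine is not the same as averaging the pairwise cosines. The sketch below shows the difference on invented 2D vectors, using plain-Python helpers (`cosine`, `mean_vector`) that are stand-ins for the numpy code above.

```python
import math

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors: list) -> list:
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

group = [[1.0, 0.0], [0.0, 1.0]]  # two "words" pointing in different directions
target = [1.0, 1.0]

# Option 1 (what similarity_multiple does): average, then compare.
sim_of_mean = cosine(mean_vector(group), target)
# Option 2: compare each word, then average the similarities.
mean_of_sims = sum(cosine(v, target) for v in group) / len(group)

print(sim_of_mean, mean_of_sims)
```

Here the averaged vector points exactly at the target even though neither individual vector does, so the two options give different answers.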

In [None]:
paises = ["argentina", "brazil", "chile", "mexico", "peru", "uruguay", "venezuela"]
target_words = ["fun", "funny", "happy", "joy", "happiness", "cheerful", "party"]
for pais in paises:
    sim_ = similarity_multiple(glove_wikig, [pais], target_words)
    print(f"{pais}: {sim_:.2f}")

In [None]:
# analogies
def analogias(
        embeddings: KeyedVectors, x1: str, x2: str, y1: str,
        topn: int = 10, remove_words: list = None
) -> list:
    """
    "x1" is to "x2" as "y1" is to ...
    """
    v_x1 = get_vector(embeddings, x1)
    v_x2 = get_vector(embeddings, x2)
    v_y1 = get_vector(embeddings, y1)
    v_y2 = v_y1 + (v_x2 - v_x1)
    # # gensim version (normalizes before adding):
    # v_x1_normalized = v_x1 / np.linalg.norm(v_x1)
    # v_x2_normalized = v_x2 / np.linalg.norm(v_x2)
    # v_y1_normalized = v_y1 / np.linalg.norm(v_y1)
    # v_y2 = np.mean([v_y1_normalized, v_x2_normalized, -v_x1_normalized], axis=0)
    # build the exclusion set once; the three query words are always excluded
    excluded = {x1, x2, y1} | set(remove_words or [])
    sims = []
    for w in embeddings.index_to_key:
        if w not in excluded:
            sims.append((w, cossim(v_y2, get_vector(embeddings, w))))
    return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]
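
The vector arithmetic behind the analogy function can be checked on a toy vocabulary. In the sketch below the 2D vectors are hand-built so that the first coordinate loosely encodes "royalty" and the second "gender"; the words, vectors, and the `solve_analogy` helper are all invented for illustration and say nothing about real embeddings.

```python
import math

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_vectors = {  # hand-built, not learned
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

def solve_analogy(vectors: dict, x1: str, x2: str, y1: str) -> str:
    """x1 is to x2 as y1 is to ...: nearest word to y1 + (x2 - x1) by cosine."""
    target = [y + (b - a) for a, b, y in zip(vectors[x1], vectors[x2], vectors[y1])]
    # the three query words are excluded, as in the function above
    candidates = [w for w in vectors if w not in (x1, x2, y1)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = solve_analogy(toy_vectors, "man", "woman", "king")
print(best)
```

Excluding the query words matters: without that filter, the nearest neighbor of the target vector is often one of the inputs.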


In [None]:
analogias(glove_wikig, "argentina", "tango", "brazil", remove_words=stopwords)

In [None]:
analogias(glove_wikig, "cumbia", "argentina", "samba", remove_words=stopwords)

In [None]:
# with gensim:
glove_wikig.most_similar(positive=["samba", "argentina"], negative=["cumbia"])

## Evaluating embeddings

How do we measure the quality of embeddings? Either extrinsically (by performance on a downstream task) or **intrinsically** (by agreement with human judgments, such as word-similarity ratings).

We use code from [word-embeddings-benchmarks](https://github.com/kudkudak/word-embeddings-benchmarks/tree/master) to evaluate on SimLex999.
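
This kind of evaluation compares how humans rank word pairs with how the model ranks them. The sketch below computes a rank correlation by hand on four invented scores. It is a simplified version that assumes no ties, unlike `scipy.stats.spearmanr`, which handles them, and the numbers are made up rather than taken from SimLex999.

```python
def ranks(values: list) -> list:
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_no_ties(x: list, y: list) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [9.0, 7.5, 3.2, 1.0]   # invented human ratings for four word pairs
model = [0.8, 0.9, 0.3, 0.1]   # invented cosine similarities for the same pairs

rho = spearman_no_ties(human, model)
# Any increasing rescaling of the model scores leaves the ranks unchanged.
rho_rescaled = spearman_no_ties(human, [100 * m ** 3 for m in model])

print(rho, rho_rescaled)
```

Only the ordering of the scores matters, so the human ratings and the cosine similarities do not need to be on the same scale.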

In [None]:
import pandas as pd

def fetch_simlex999() -> pd.DataFrame:
    df = pd.read_csv('https://www.dropbox.com/s/0jpa1x8vpmk3ych/EN-SIM999.txt?dl=1', sep="\t")
    return df[['word1', 'word2', 'POS', 'SimLex999']]

df_simlex = fetch_simlex999()

In [None]:
len(df_simlex)

In [None]:
df_simlex.sample(6)

In [None]:
def compute_similarities(embeddings: KeyedVectors, df: pd.DataFrame):
    words1 = df["word1"].tolist()
    words2 = df["word2"].tolist()
    missing_words = 0
    for word in words1 + words2:
        if word not in embeddings:
            missing_words += 1
    if missing_words > 0:
        print(f"Missing {missing_words} words. Will replace them with mean vector")
    mean_vector = np.mean(embeddings.vectors, axis=0, keepdims=True)
    A = np.vstack([get_vector(embeddings, w) if w in embeddings else mean_vector for w in words1])
    B = np.vstack([get_vector(embeddings, w) if w in embeddings else mean_vector for w in words2])
    scores = np.array([cossim(v1, v2) for v1, v2 in zip(A, B)])
    return scores

In [None]:
df_simlex["glove_scores"] = compute_similarities(glove_wikig, df_simlex)
df_simlex.head(2)

In [None]:
similarity(glove_wikig, "old", "new")

In [None]:
import scipy.stats  # import the submodule explicitly; a bare `import scipy` may not load it

scipy.stats.spearmanr(df_simlex["SimLex999"], df_simlex["glove_scores"]).correlation

**QUESTION 5** What does the value immediately above represent? How is it interpreted? What is it useful for?

In [None]:
w2v_obama_scores = compute_similarities(w2v_obama.wv, df_simlex)
scipy.stats.spearmanr(df_simlex["SimLex999"], w2v_obama_scores).correlation

**QUESTION 6** Does it make sense to evaluate the embeddings trained on the presidential speeches on these benchmarks?

**QUESTION 7** Why would we use our own embeddings instead of pre-trained ones?

## PMI

We use the GloVe tools to compute the co-occurrences because they are much faster than doing it with a pure-Python function.
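
For reference, this is roughly what the slow pure-Python alternative would look like: a sketch that counts symmetric, unweighted co-occurrences inside each sentence. `count_cooccurrences` and the toy sentences are invented for illustration, and the counts are not guaranteed to match the GloVe tool's output in every detail.

```python
from collections import Counter

def count_cooccurrences(sentences: list, window: int) -> Counter:
    """Count symmetric co-occurrences of tokens at most `window` positions apart."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # look only to the left and record both directions,
            # so each unordered pair of positions is visited once
            for j in range(max(0, i - window), i):
                counts[(word, sentence[j])] += 1
                counts[(sentence[j], word)] += 1
    return counts

toy_sentences = [["we", "the", "people"], ["the", "people", "speak"]]  # invented

counts = count_cooccurrences(toy_sentences, window=2)
print(counts[("the", "people")], counts[("people", "the")], counts[("we", "speak")])
```

"we" and "speak" never share a sentence, so their count stays at zero: windows do not cross sentence boundaries.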

In [None]:
!git clone https://github.com/stanfordnlp/GloVe.git

In [None]:
%%capture
!cd GloVe && make

In [None]:
# we use all the speeches to have more data:
files = inaugural.fileids()
presidents_corpus = ""
for f in files:
    presidents_corpus = presidents_corpus + "\n" + inaugural.raw(f)
presidents_sentences = []
for sentence in sent_tokenize(presidents_corpus):
    words_ = [word.lower() for word in word_tokenize(sentence) if word not in punctuation]
    presidents_sentences.append(words_)

In [None]:
# save the corpus with one sentence per line
with open('sentences.txt', 'w') as f:
    for sentence in presidents_sentences:
        f.write(' '.join(sentence) + '\n')

In [None]:
# build the vocabulary
!GloVe/build/vocab_count -min-count 5 -verbose 0 < sentences.txt > vocab.txt

In [None]:
!head -2 vocab.txt

In [None]:
# read the vocabulary as a dict:
str2count = {}
with open("vocab.txt", "r") as f:
    for line in f:
        word, count = line.split()
        str2count[word] = int(count)

str2idx = dict(zip(str2count.keys(), range(len(str2count))))

In [None]:
str2count["american"], str2idx["the"]

In [None]:
# build a .bin file with the co-occurrences
!GloVe/build/cooccur -vocab-file vocab.txt -verbose 0 -window-size 10 -distance-weighting 0 < sentences.txt > coocs.bin
# windows: +-10 tokens, not weighted by distance to the center word

In [None]:
# technical details (not important) for reading the co-occurrences as a sparse matrix
import array
from ctypes import Structure, c_int, c_double, sizeof
from os import path
from scipy import sparse
from tqdm import tqdm

class CREC(Structure):
    """ctypes mirror of the C struct GloVe uses to store (idx, idx, cooc) triples in its binary file
    """
    _fields_ = [('idx1', c_int),
                ('idx2', c_int),
                ('value', c_double)]


class IncrementalCOOMatrix:
    """class to create scipy.sparse.coo_matrix
    """

    def __init__(self, shape, dtype=np.double):
        self.dtype = dtype
        self.shape = shape
        self.rows = array.array('i')
        self.cols = array.array('i')
        self.data = array.array('d')

    def append(self, i, j, v):
        m, n = self.shape
        if (i >= m or j >= n):
            raise Exception('Index out of bounds')
        self.rows.append(i)
        self.cols.append(j)
        self.data.append(v)

    def tocoo(self):
        rows = np.frombuffer(self.rows, dtype=np.int32)
        cols = np.frombuffer(self.cols, dtype=np.int32)
        data = np.frombuffer(self.data, dtype=self.dtype)
        return sparse.coo_matrix((data, (rows, cols)), shape=self.shape)


def build_cooc_matrix(str2idx, cooc_file):
    """
    Build the full co-occurrence matrix from the co-occurrence data in a GloVe binary file.
    Row and column indices are the 0-based word indices in `str2idx`.
    The file is expected to contain both (i, j) and (j, i), so that C[i, j] == C[j, i].
    """
    vocab_size = len(str2idx)  # vocabulary size (largest word index + 1)
    size_crec = sizeof(CREC)  # CREC: GloVe's co-occurrence record struct
    C = IncrementalCOOMatrix((vocab_size, vocab_size))
    K = path.getsize(cooc_file) // size_crec  # total number of co-occurrence records
    pbar = tqdm(total=K)
    # open the binary file and store the co-occurrences in C
    with open(cooc_file, 'rb') as f:
        # read into cr, overwriting it, while there is data
        cr = CREC()
        while (f.readinto(cr) == size_crec):
            C.append(cr.idx1-1, cr.idx2-1, cr.value)  # because GloVe indices start at 1
            pbar.update(1)
    pbar.close()
    return C.tocoo().tocsr()
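
The `IncrementalCOOMatrix` helper above accumulates (row, column, value) triples, which is the coordinate (COO) layout. A minimal sketch of that idea with plain Python lists, on an invented 3x3 example, could look like this:

```python
# Coordinate (COO) layout: three parallel lists, one entry per stored value.
rows = [0, 0, 2]
cols = [1, 2, 0]
vals = [3.0, 1.0, 1.0]
shape = (3, 3)

# Expand to a dense list of lists only to inspect it; for a real vocabulary
# this is exactly the step we want to avoid.
dense = [[0.0] * shape[1] for _ in range(shape[0])]
for i, j, v in zip(rows, cols, vals):
    dense[i][j] += v  # repeated (i, j) entries add up, as when scipy converts COO to CSR

stored = len(vals)
total_cells = shape[0] * shape[1]

for row in dense:
    print(row)
print(f"{stored} stored values out of {total_cells} cells")
```

Only the stored triples take up memory, which is what makes a vocabulary-by-vocabulary matrix manageable.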

**QUESTION 8** What does it mean for the co-occurrence matrix to be sparse?

In [None]:
# Here we go!
cooc_matrix = build_cooc_matrix(str2idx, "coocs.bin")

def get_cooc(w1: str, w2: str, str2idx: dict = str2idx, cooc_matrix=cooc_matrix) -> float:
    if w1 not in str2idx:
        print(f"{w1} not in vocab")
        return 0.
    if w2 not in str2idx:
        print(f"{w2} not in vocab")
        return 0.
    idx1 = str2idx[w1]
    idx2 = str2idx[w2]
    return cooc_matrix[idx1, idx2]

In [None]:
pares = [
    ("the", "the"), ("of", "the"), ("the", "of"), ("united", "states"), ("americans", "war")
]
for par in pares:
    print(par, get_cooc(*par))

**QUESTION 9** How should the value `('the', 'the') 26396.0` be interpreted?

In [None]:
# co-occurrences with a single word
idx = str2idx["god"]
coocs = cooc_matrix[idx, :].toarray()[0]
coocs_dict = dict(zip(str2idx.keys(), coocs))
sorted(coocs_dict.items(), key=lambda x: x[1], reverse=True)[:20]

In [None]:
def get_pmi(words_w: list, words_c: list, str2idx=str2idx, cooc_matrix=cooc_matrix):
    """One PMI per word in W with respect to the words in C.

    Words missing from the vocabulary are silently dropped, so the output only
    lines up with `words_w` when every word in it is in `str2idx`.
    """
    idx_w = [str2idx[w] for w in words_w if w in str2idx]
    idx_c = [str2idx[w] for w in words_c if w in str2idx]
    total_count = cooc_matrix.sum()
    count_c = cooc_matrix[idx_c, :].sum()
    counts_w = cooc_matrix.sum(axis=0)[:, idx_w]
    counts_w_c = cooc_matrix[idx_c, :][:, idx_w].sum(axis=0)
    return np.array(pmi(counts_w_c, counts_w, count_c, total_count)).flatten()


def pmi(counts_wc, counts_w, count_c, count_tot):
    """
    PMI for given word counts of lists of words W and C. It works vectorized across W
    if needed. Pairs that never co-occur yield -inf (numpy emits a divide-by-zero warning).
    Param:
        - counts_wc: co-occurrence array between C and W
        - counts_w: co-occurrence array for W words
        - count_c: co-occurrence count for C
        - count_tot: total co-occurrence count
    """
    numerador = counts_wc * count_tot
    denominador = counts_w * count_c
    res = np.log(numerador / denominador)
    return res
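
The formula in `pmi` is log(count(w, c) * N / (count(w) * count(c))), where N is the total co-occurrence count. The scalar sketch below evaluates it on three invented sets of counts to show the sign convention. `pmi_from_counts` is a hypothetical stand-in that uses `math.log` instead of numpy.

```python
import math

def pmi_from_counts(count_wc: float, count_w: float, count_c: float, count_tot: float) -> float:
    """Scalar PMI: log of observed co-occurrence over what independence would predict."""
    return math.log((count_wc * count_tot) / (count_w * count_c))

# Invented counts: (co-occurrences of w with c, count of w, count of c, total)
independent = pmi_from_counts(1, 3, 3, 9)    # observed equals expected
attracted = pmi_from_counts(4, 5, 5, 10)     # co-occur more often than expected
repelled = pmi_from_counts(1, 5, 5, 10)      # co-occur less often than expected

print(independent, attracted, repelled)
```

A PMI of zero means the pair co-occurs exactly as often as chance would predict; positive and negative values indicate attraction and avoidance.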

In [None]:
get_pmi(["god", "america"], ["almighty"])

In [None]:
get_pmi(["united", "country"], ["states"])

**QUESTION 10** In general terms, how does the _type of association_ captured by PMI differ from the one captured by embedding similarity?

NOTE: these PMI values come from a different corpus than the embeddings above.

In [None]:
most_similar(glove_wikig, "country")

In [None]:
all_words = list(str2idx.keys())
pmis = get_pmi(all_words, ["country"])
pmis_dict = dict(zip(all_words, pmis))
sorted(pmis_dict.items(), key=lambda x: x[1], reverse=True)[:10]

In [None]:
print(glove_wikig.n_similarity(["very"], ["good"]))
print(glove_wikig.n_similarity(["bad"], ["good"]))
print(glove_wikig.n_similarity(["watermelon"], ["good"]))
print(glove_wikig.n_similarity(["black"], ["white"]))

**QUESTION 11** Why is the association between "bad" and "good" relatively high when measured with embeddings?

## Bonus track

Identifying collocations with NPMI
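
NPMI (normalized PMI) divides PMI by -log p(a, b), which bounds the score between -1 and 1 and makes a fixed threshold such as 0.8 meaningful. The sketch below follows that standard definition on invented counts. The `npmi` helper is hypothetical and is not gensim's own scorer, whose counting details may differ.

```python
import math

def npmi(count_a: int, count_b: int, count_ab: int, total: int) -> float:
    """Normalized PMI of a word pair from raw counts, using the standard definition."""
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Invented counts over a corpus of 100 tokens:
always_together = npmi(10, 10, 10, 100)  # the two words only ever appear together
independent = npmi(10, 10, 1, 100)       # co-occur exactly as often as chance predicts
partial = npmi(10, 10, 5, 100)           # somewhere in between

print(always_together, independent, partial)
```

Pairs that always appear together score close to 1 and independent pairs close to 0, so a threshold of 0.8 keeps only strongly bound pairs.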

In [None]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

collocations_model = Phrases(
    sentences=obama_sentences,
    min_count=3, threshold=.8,  # NPMI threshold value
    scoring='npmi', connector_words=ENGLISH_CONNECTOR_WORDS)

In [None]:
collocations_model.export_phrases()

In [None]:
sentences_with_collocations = collocations_model[obama_sentences]

In [None]:
print(sentences_with_collocations[0])
print(sentences_with_collocations[-1])

In [None]:
# now we can train embeddings treating the collocations as tokens:
# w2v_model = Word2Vec(sentences_with_collocations, ...)