# Text Vector Representations

### José Pablo Kiesling Lange

In [1]:
from pprint import pprint
import numpy as np
import random
from collections import defaultdict, Counter
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from gensim.models import Word2Vec

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TheKi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

According to the dataset documentation ([sklearn.datasets.fetch_20newsgroups](https://scikit-learn.org/0.19/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups)), some parameters can be used strategically before processing to save operations. To restrict the dataset to the categories indicated in the lab, their exact names must first be looked up.

In [3]:
newsgroups_train = fetch_20newsgroups(subset='all')

In [4]:
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [5]:
categories = [
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "rec.autos",
]

In [6]:
newsgroups_train.data[0]

"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

As the sample shows, the elements of the dataset are email-style posts, so the `remove` parameter can be used to drop the unwanted lines and keep only the body of each message. This is done so that the model does not learn an association between header tokens such as "From:" and the content of a post.

In [7]:
remove_content = (
    "headers",
    "footers",
    "quotes",
)

In [8]:
dataset = fetch_20newsgroups(subset="all", categories=categories, remove=remove_content)
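
As a rough illustration of what removing the headers does to a post, the sketch below strips the header block by hand. It assumes that the header ends at the first blank line, which matches the sample post shown above; `strip_header` and the shortened post are hypothetical and written only for this example, they are not the scikit-learn implementation.

```python
def strip_header(post: str) -> str:
    """Drop everything up to and including the first blank line (assumed end of the header)."""
    _header, _separator, body = post.partition("\n\n")
    return body


# Shortened, made-up post in the same format as the sample above
raw_post = (
    "From: someone@example.com\n"
    "Subject: Pens fans reactions\n"
    "Lines: 12\n"
    "\n"
    "I am sure some bashers of Pens fans are pretty confused.\n"
)

body = strip_header(raw_post)
print(body)
```

With the header gone, tokens such as `From:` or `Subject:` no longer reach the vectorizers.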

In [9]:
corpus = dataset.data

## Corpus preprocessing

In [11]:
corpus_preprocessed = []

### Removing non-alphabetic characters

Since the corpus contains non-alphabetic characters, a function is needed to remove them. It takes a text as input and returns the processed text.

In [12]:
def remove_non_alpha(text):
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    text = text.replace("\t", " ")
    return ''.join(c for c in text if c.isalpha() or c.isspace())

### Case Folding

To standardize the words, all text is converted to lowercase. This reduces variability in how words are represented and simplifies later processing.

In [13]:
def case_folding(text):
    return text.lower()

In [14]:
def preprocessing_pipeline(text):
    text = remove_non_alpha(text)
    text = case_folding(text)
    return text
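
A quick check of the pipeline on a made-up sentence (the sample string is an assumption for illustration) shows a side effect worth knowing: characters are deleted rather than replaced by a space, so apostrophes collapse (`isn't` becomes `isnt`) and words separated only by punctuation are merged (`here.The` becomes `herethe`).

```python
def remove_non_alpha(text):
    # Same logic as the notebook function: normalize whitespace, keep letters and spaces
    for whitespace in ("\n", "\r", "\t"):
        text = text.replace(whitespace, " ")
    return "".join(c for c in text if c.isalpha() or c.isspace())


def preprocessing_pipeline(text):
    return remove_non_alpha(text).lower()


sample = "Be careful here.The problem isn't the rich!\nIt's 1993."
tokens_sample = preprocessing_pipeline(sample).split()
print(tokens_sample)
```

Merged tokens like `herethe` also appear in the real corpus, as the sampled documents further down show.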

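The idea behind adding `<SOF>` and `<EOF>` markers at the start and end of each preprocessed text is to keep the last word of one email from being related to the first word of the next. The loop below does not insert them, and it does not need to: each document is kept as its own token list, so context windows never cross document boundaries.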

In [15]:
for text in corpus:
    text_preprocessed = preprocessing_pipeline(text)
    corpus_preprocessed.append(text_preprocessed)

In [17]:
tokens = [email.split() for email in corpus_preprocessed]

## Building the TF-IDF representation

In [18]:
vectorizer = TfidfVectorizer()

In [19]:
X = vectorizer.fit_transform(corpus_preprocessed)


In [20]:
vectorizer.get_feature_names_out()

array(['aa', 'aaa', 'aaaaaaaaaaaa', ..., 'zwischen', 'zx', 'zxr'],
      dtype=object)

In [21]:
X.shape

(3615, 35198)
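
To make the weights stored in `X` less opaque, here is a minimal pure-Python sketch of the computation. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy corpus and the `tfidf` helper are invented for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized rows)."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + count)) + 1 for t, count in df.items()}
    rows = []
    for toks in tokenized:
        raw = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


toy = ["the rich make the money", "the car has the engine", "the gun laws"]
weights = tfidf(toy)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.4f}")
```

In this toy corpus `the` gets the highest weight in the first document even though it occurs in every document, because the smoothed IDF never falls below 1 and its term frequency is the highest.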

In [22]:
def get_top_terms_for_document(document, amount=5):
    row = vectorizer.transform([document])
    row = row.tocsr()[0]

    order = np.argsort(row.data)[::-1]
    top_ids = row.indices[order][:amount]
    top_scores = row.data[order][:amount]

    features = vectorizer.get_feature_names_out()
    
    return list(zip(features[top_ids], top_scores))

In [23]:
random.seed(327)
doc_indices = random.sample(range(len(corpus_preprocessed)), 3)

In [24]:
corpus_sample = [corpus_preprocessed[idx] for idx in doc_indices]

In [25]:
for document in corpus_sample:
    tops = get_top_terms_for_document(document, amount=5)

    print(f"\nDocument: {document} - top TF-IDF terms:")
    for term, score in tops:
        print(f"  {term:20s} {score:.4f}")


Document: wow i hadnt realized how venomous this was getting  be careful herethe problem isnt the rich but the values and the systems that make the rich rich  things are designed in such a way that in order to go with the system and make money everything else we care about goes to shit  i have to constantly remind myself that the goal of human society is not to make money  money doesnt make us happy it just prevents certain things making us more unhappy  therefore dont shoot the rich  shoot the conservatives  drewcifer - top TF-IDF terms:
  rich                 0.4764
  money                0.2636
  make                 0.2499
  shoot                0.2198
  the                  0.1952

Document:  anas a high rank israeli officer was killed during a clash whith a hamas anas mujahid  the terrorist israelis chased and killed a young mujahid anas using antitank missiles  the terrorist zionists cut the mujahids anas body into small pieces to the extend that his body was not recognize

### Highest-weighted words in some documents

**Doc 1**: The terms point to a discussion that is probably about wealth, with the loaded term `shoot` also appearing, which can change the tone.

**Doc 2**: The text is clearly about terrorism and armed conflict. TF-IDF works very well here because it highlights very specific, technical terms for the topic, which are unlikely to appear in documents about cars.

**Doc 3**: Although its top words are more common, it shows how TF-IDF can highlight distinctive vocabulary even when it is not strictly technical, helping to identify the main idea or style of the text.

Overall, TF-IDF prioritizes distinctive vocabulary: proper nouns or terms specific to a conflict. The high weight of these words also indicates that they are rare in the rest of the corpus but highly relevant in the particular documents analyzed.

### Limitations of TF-IDF with respect to semantics

* **It does not capture relations between words**:
The clearest case is **Doc 2**, where `missile` and `missiles` appear as separate terms even though they are semantically the same.

* **It ignores synonyms**:
For example, if `cash` were used instead of `money`, TF-IDF would treat it as a completely different word, without recognizing that both refer to the same thing. The same applies to `missiles` and `rockets`.

* **It does not understand context**:
The term `shoot` in **Doc 1** could refer to firing a weapon, a photo shoot, or even a colloquial expression. TF-IDF does not distinguish these meanings; it only sees frequency.

* **Sensitivity to common terms**:
Words like `the` show up among the top terms because no stop-word list was applied and scikit-learn's smoothed IDF never drops below 1, so a word that is very frequent within a document keeps a significant weight even if it occurs across the whole corpus. This can "pollute" the semantic analysis.
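
The first two limitations can be made concrete with a small sketch. It uses plain bag-of-words counts instead of full TF-IDF weights (an assumption to keep it short; reweighting does not change which dimensions overlap), and the example sentences are invented.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = Counter("army fired missiles".split())
doc_b = Counter("troops launched rockets".split())  # same idea, different words
doc_c = Counter("army fired missile".split())       # singular instead of plural

print(cosine(doc_a, doc_b))  # no shared terms at all
print(cosine(doc_a, doc_c))  # 'missile' and 'missiles' count as different terms
```

Two sentences that say nearly the same thing with synonyms are orthogonal in this space, and the singular and plural forms contribute nothing to the similarity of the other pair.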


## Building the PPMI representation

The implementation of both methods was based on a **Stack Overflow** discussion: https://stackoverflow.com/questions/58701337/how-to-construct-ppmi-matrix-from-a-text-corpus

The results are shown for the first of the three documents analyzed above.

### Build a word-context co-occurrence matrix with a sliding window

In [26]:
def co_occurrence_df(sentences, window_size=2, top_k=800):
    tokenized_docs = [s.split() for s in sentences]
    unigrams = Counter()
    for toks in tokenized_docs:
        unigrams.update([t for t in toks])
    vocab = [w for w,_ in unigrams.most_common(top_k)]
    vocab_set = set(vocab)

    co_counts = defaultdict(Counter)
    for toks in tokenized_docs:
        n = len(toks)
        for i, w in enumerate(toks):
            if w not in vocab_set:
                continue
            L = max(0, i - window_size)
            R = min(n, i + window_size + 1)
            ctx = toks[L:i] + toks[i+1:R]
            for c in ctx:
                if c in vocab_set:
                    co_counts[w][c] += 1

    df = pd.DataFrame(0, index=vocab, columns=vocab, dtype=np.int32)
    for w, row in co_counts.items():
        for c, cnt in row.items():
            df.at[w, c] = cnt
    return df

In [27]:
def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of this text
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1
    
    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

In [28]:
co_occurrence_matrix = co_occurrence(corpus_sample[:1], window_size=2)

In [29]:
co_occurrence_matrix

Unnamed: 0,a,about,and,are,be,but,care,careful,certain,conservatives,...,to,unhappy,us,values,venomous,was,way,we,with,wow
a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
about,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
and,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
are,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
be,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
was,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
way,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
we,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
with,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
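
To see how the sliding window produces these counts, the following sketch traces a single short phrase with `window_size=2`. It is a simplified, dictionary-based variant written for illustration (the `window_pairs` name is an assumption); like `co_occurrence`, it looks only at the tokens to the right of each position and counts each unordered pair once per occurrence.

```python
from collections import Counter


def window_pairs(tokens, window_size=2):
    """Count unordered token pairs that are at most `window_size` positions apart."""
    counts = Counter()
    for i, token in enumerate(tokens):
        # Only look ahead: the symmetric entry is implied by sorting the pair
        for neighbor in tokens[i + 1 : i + 1 + window_size]:
            counts[tuple(sorted((token, neighbor)))] += 1
    return counts


pairs = window_pairs("money doesnt make us happy".split(), window_size=2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

`money` and `happy` are four positions apart, so they never form a pair with this window size.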


### Compute the PPMI matrix

In [30]:
def to_ppmi(df: pd.DataFrame) -> pd.DataFrame:
    row_tot = df.sum(axis=1).astype(float)
    col_tot = df.sum(axis=0).astype(float)
    total = col_tot.sum()

    # Expected counts under independence: row_total * col_total / total
    expected = np.outer(row_tot.values, col_tot.values) / (total if total>0 else 1.0)
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = df.values / expected
    # Pre-fill with zeros: np.log2(..., where=...) leaves the skipped entries untouched
    pmi = np.zeros_like(ratio, dtype=float)
    np.log2(ratio, out=pmi, where=(ratio>0))
    pmi[~np.isfinite(pmi)] = 0.0
    pmi[pmi < 0] = 0.0
    return pd.DataFrame(pmi, index=df.index, columns=df.columns, dtype=np.float32)


In [31]:
def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = -inf; treat zero co-occurrence as 0
    if positive:
        df[df < 0] = 0.0
    return df
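
For reference, both functions follow the standard definition of positive pointwise mutual information. Writing $n_{wc}$ for the co-occurrence count of word $w$ with context $c$, $n_{w\cdot}$ and $n_{\cdot c}$ for the row and column totals, and $N$ for the grand total (this notation is introduced here and does not appear in the code):

$$
\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{n_{wc}\,N}{n_{w\cdot}\,n_{\cdot c}},
\qquad
\mathrm{PPMI}(w, c) = \max\bigl(0,\ \mathrm{PMI}(w, c)\bigr)
$$

The quantity $n_{w\cdot}\,n_{\cdot c} / N$ is the `expected` matrix in the code. `to_ppmi` uses a base-2 logarithm and `pmi` a natural logarithm, so their values differ only by a constant factor of $\ln 2$.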

In [32]:
ppmi = pmi(co_occurrence_matrix)

In [33]:
ppmi

| | a | about | and | are | be | but | care | careful | certain | conservatives | ... | to | unhappy | us | values | venomous | was | way | we | with | wow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 3.124565 | 0.000000 | 0.0 | 0.0 |
| about | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 3.124565 | 0.000000 | 0.0 | 0.0 | ... | 1.738271 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.124565 | 0.0 | 0.0 |
| and | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 2.431418 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| are | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| be | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 3.124565 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 3.124565 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| was | 0.000000 | 0.000000 | 0.0 | 0.0 | 3.124565 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 3.124565 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| way | 3.124565 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| we | 0.000000 | 3.124565 | 0.0 | 0.0 | 0.000000 | 0.0 | 3.124565 | 0.000000 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| with | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | ... | 1.738271 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
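
As a worked example of the computation, the sketch below applies the same steps (observed count divided by expected count, natural logarithm, negatives clipped to zero) to a tiny hand-made co-occurrence table using only the standard library. The counts are invented for illustration and the `ppmi_value` helper is not part of the notebook.

```python
import math

# Symmetric toy co-occurrence counts (made up)
counts = {
    "gun":  {"gun": 0, "laws": 4, "the": 2},
    "laws": {"gun": 4, "laws": 0, "the": 2},
    "the":  {"gun": 2, "laws": 2, "the": 8},
}
vocab = list(counts)
row_tot = {w: sum(counts[w].values()) for w in vocab}
col_tot = {c: sum(counts[w][c] for w in vocab) for c in vocab}
total = sum(row_tot.values())


def ppmi_value(w, c):
    observed = counts[w][c]
    if observed == 0:
        return 0.0  # log(0) is undefined; zero co-occurrence maps to 0
    expected = row_tot[w] * col_tot[c] / total
    return max(0.0, math.log(observed / expected))


for w in vocab:
    print(f"{w:5s}", [round(ppmi_value(w, c), 3) for c in vocab])
```

`gun` and `laws` co-occur more often than chance predicts and get a positive score, while `gun` and `the` co-occur less often than expected, so their negative PMI is clipped to zero.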


### Discuss the advantages and disadvantages of PPMI compared to TF-IDF

#### Advantages

- PPMI captures semantic relations between words better because it looks at which words co-occur within a context window, not just at how often a term occurs in a document.
- PPMI can highlight terms that are informative in specific contexts, given how often they co-occur.
- PPMI is less sensitive to document frequency and more sensitive to co-occurrence structure, which can be beneficial in certain semantic analyses.


#### Disadvantages

- PPMI can be computationally more expensive than TF-IDF, especially on large text corpora.
- PPMI can produce sparse, high-dimensional matrices, which can make them hard to handle and analyze.
- PPMI does not take into account the frequency of terms within a document, which can be a limitation in certain information retrieval settings.

## Building the Word2Vec representation

In [34]:
w2v = Word2Vec(
    sentences=tokens,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)

In [35]:
wv = w2v.wv

### Closest and farthest words in the vector space

In [36]:
def most_similar_words(word, topn=10):
    if word not in wv:
        return []
    return wv.most_similar(word, topn=topn)

In [37]:
def least_similar_words(word, topn=10, search_top_k=5000):
    if word not in wv:
        return []
    vocab = list(wv.key_to_index.keys())[:search_top_k]
    v = wv[word] / np.linalg.norm(wv[word])
    sims = []
    for w in vocab:
        if w == word: 
            continue
        u = wv[w] / np.linalg.norm(wv[w])
        sims.append((w, float(np.dot(v, u))))
    sims.sort(key=lambda x: x[1])
    return sims[:topn]
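
The scores returned by both helpers are cosine similarities: the dot product of two vectors after scaling each to unit length, which is what `least_similar_words` computes explicitly. A dependency-free sketch with made-up three-dimensional vectors (the real embeddings here have 100 dimensions):

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors, not taken from the trained model
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "engine": [0.8, 0.3, 0.1],
    "treaty": [-0.2, 0.1, 0.9],
}

# Sort the other words from least to most similar to 'car'
ranked = sorted(
    (w for w in toy_vectors if w != "car"),
    key=lambda w: cosine_similarity(toy_vectors["car"], toy_vectors[w]),
)
print("least similar to 'car':", ranked[0])
print("most similar to 'car':", ranked[-1])
```

Sorting in ascending order and taking the head of the list gives the farthest words, mirroring `sims.sort(...)` followed by `sims[:topn]` above.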

In [38]:
for w in ["car", "government", "gun", "israel", "ford"]:
    print(f"\nClosest to '{w}':")
    for t, s in most_similar_words(w, topn=8):
        print(f"  {t:15s}  {s:.3f}")
    print(f"\nFarthest from '{w}':")
    for t, s in least_similar_words(w, topn=8, search_top_k=5000):
        print(f"  {t:15s}  {s:.3f}")


Closest to 'car':
  price            0.853
  engine           0.841
  driver           0.833
  little           0.822
  delorean         0.819
  dealership       0.815
  oil              0.815
  post             0.814

Farthest from 'car':
  turkish          -0.179
  by               -0.173
  armenian         -0.162
  muslim           -0.129
  against          -0.119
  kurdish          -0.089
  exterminate      -0.086
  reparations      -0.069

Closest to 'government':
  force            0.855
  citizens         0.855
  israel           0.853
  serbians         0.837
  irish            0.834
  society          0.825
  federal          0.825
  land             0.822

Farthest from 'government':
  my               -0.105
  edt              0.014
  her              0.032
  am               0.035
  sadikov          0.044
  briefing         0.047
  ago              0.050
  v                0.065

Closest to 'gun':
  crime            0.848
  laws             0.846
  choice      

### Differences from TF-IDF and PPMI

#### Word2Vec vs TF-IDF

**Word2Vec**, as seen in class, learns dense, continuous vectors that capture semantics from context. Two words with no lexical overlap, such as `car` and `automobile`, can end up close together.

**TF-IDF** is sparse and lexical: it measures the importance of terms per document and does not understand synonyms or context.

#### Word2Vec vs PPMI

**PPMI** already captures the statistical association of co-occurrences, but in a huge |V|×|V| matrix and without parametric learning.

**Word2Vec** can be seen as a non-linear compression of co-occurrences that produces compact, generalizable embeddings.

## Comparative evaluation

In [39]:
y = dataset.target
target_names = dataset.target_names

In [40]:
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    corpus_preprocessed, y, test_size=0.2, random_state=327, stratify=y
)

#### TF-IDF

In [41]:
X_train_tfidf = vectorizer.fit_transform(X_train_txt)
X_test_tfidf  = vectorizer.transform(X_test_txt)

clf_tfidf = LogisticRegression(max_iter=1000, n_jobs=None)
clf_tfidf.fit(X_train_tfidf, y_train)
pred_tfidf = clf_tfidf.predict(X_test_tfidf)
acc_tfidf = accuracy_score(y_test, pred_tfidf)
print(f"TF-IDF accuracy: {acc_tfidf:.4f}")

TF-IDF accuracy: 0.7939


#### PPMI

In [42]:
def docs_from_word_matrix(word_matrix: pd.DataFrame, docs_tokens):
    idx = word_matrix.index
    word2row = {w:i for i,w in enumerate(idx)}
    W = word_matrix.values
    D = []
    for toks in docs_tokens:
        rows = [word2row[w] for w in toks if w in word2row]
        if rows:
            D.append(W[rows].mean(axis=0))
        else:
            D.append(np.zeros(W.shape[1], dtype=np.float32))
    return np.vstack(D)

In [43]:
ppmi_cooc = co_occurrence_df(X_train_txt, window_size=2, top_k=800)
ppmi_mat = to_ppmi(ppmi_cooc)

In [44]:
X_train_toks = [s.split() for s in X_train_txt]
X_test_toks  = [s.split() for s in X_test_txt]

In [45]:
X_train_ppmi = docs_from_word_matrix(ppmi_mat, X_train_toks)
X_test_ppmi  = docs_from_word_matrix(ppmi_mat, X_test_toks)

clf_ppmi = LogisticRegression(max_iter=1000)
clf_ppmi.fit(X_train_ppmi, y_train)
pred_ppmi = clf_ppmi.predict(X_test_ppmi)
acc_ppmi = accuracy_score(y_test, pred_ppmi)
print(f"PPMI (avg word vectors) accuracy: {acc_ppmi:.4f}")

PPMI (avg word vectors) accuracy: 0.7012


#### Word2Vec

In [46]:
tokens_train = [s.split() for s in X_train_txt]
w2v_cls = Word2Vec(
    sentences=tokens_train,
    vector_size=100, window=5,
    min_count=2, sg=1, epochs=10, workers=4
)
wv_cls = w2v_cls.wv

In [47]:
def doc_avg_w2v(tokens, wv):
    vecs = [wv[w] for w in tokens if w in wv]
    if vecs:
        return np.mean(vecs, axis=0)
    return np.zeros(wv.vector_size, dtype=np.float32)

In [48]:
X_train_w2v = np.vstack([doc_avg_w2v(t, wv_cls) for t in tokens_train])
X_test_w2v  = np.vstack([doc_avg_w2v(s.split(), wv_cls) for s in X_test_txt])

In [49]:
clf_w2v = LogisticRegression(max_iter=1000)
clf_w2v.fit(X_train_w2v, y_train)
pred_w2v = clf_w2v.predict(X_test_w2v)
acc_w2v = accuracy_score(y_test, pred_w2v)
print(f"Word2Vec (avg word vectors) accuracy: {acc_w2v:.4f}")

Word2Vec (avg word vectors) accuracy: 0.7607


### Comparison table

In [50]:
results = pd.DataFrame({
    "Representation": ["TF-IDF", "PPMI", "Word2Vec"],
    "Accuracy": [acc_tfidf, acc_ppmi, acc_w2v]
})
results

| | Representation | Accuracy |
|---|---|---|
| 0 | TF-IDF | 0.793914 |
| 1 | PPMI | 0.701245 |
| 2 | Word2Vec | 0.760719 |


Based on these results, **TF-IDF** achieved the highest accuracy **(79.39%)**, which is consistent with its strength in document classification tasks. **Word2Vec** reached **76.07%**, showing that the embeddings capture useful semantic relations, although averaging the word vectors loses part of the contextual structure. Finally, **PPMI** reached **70.12%**: while it highlights co-occurrence associations between words, its high dimensionality and sparsity reduce its practical effectiveness for classification.
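
The loss of structure from averaging is easy to demonstrate: the mean of a set of word vectors does not depend on word order, so two sentences with the same words and opposite meanings get the same document vector. A minimal sketch with invented two-dimensional vectors standing in for the Word2Vec embeddings (`doc_average` mirrors the idea of `doc_avg_w2v` but uses plain lists):

```python
# Made-up 2-dimensional "embeddings" for illustration only
toy_wv = {
    "police": [1.0, 0.0],
    "shot": [0.0, 1.0],
    "suspect": [0.5, 0.5],
}


def doc_average(tokens, vectors, dim=2):
    """Average the vectors of known tokens; fall back to a zero vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


a = doc_average("police shot suspect".split(), toy_wv)
b = doc_average("suspect shot police".split(), toy_wv)
print(a, b, a == b)
```

Both sentences map to the same point, so a classifier trained on averaged vectors cannot tell them apart.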

## Final discussion

### How each representation captures (or fails to capture) semantic relations

**TF-IDF**, as seen throughout the lab, relies only on term frequency and does not capture semantic relations. **PPMI**, in contrast, does carry some semantics, since it highlights associations between words that tend to appear in similar contexts, which makes it possible to identify meaningful co-occurrence relations. Above all, **Word2Vec** learns dense, continuous representations in which words with similar contexts sit close together in the vector space, letting it capture synonyms, analogies, and semantic similarity more naturally than the other two approaches.

### Scenarios where each technique is most useful

**TF-IDF** is very useful in document classification and information retrieval tasks, where what matters is identifying relevant terms. **PPMI** is better suited to semantic studies, since its matrix exposes the associations between words and makes it possible to explore co-occurrence patterns. **Word2Vec** is well suited to tasks that require capturing similarity and deeper semantic relations, such as semantic search, synonym detection, and sentiment analysis.


### Practical limitations (memory, compute time, interpretability)

**TF-IDF** is efficient and easy to interpret, but it produces sparse, high-dimensional vectors and lacks semantics. **PPMI**, as the execution times show, poses the greatest practical challenge: the co-occurrence matrix has size |V|×|V| (where |V| is the vocabulary size), which can become expensive on large corpora. **Word2Vec**, although it produces dense, compact embeddings, requires more training time and hyperparameter tuning, and its representations are not directly interpretable, which makes it hard to explain why two words end up close together in the vector space.
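
A back-of-the-envelope estimate makes the memory argument concrete. It takes the 35,198-term vocabulary reported by the TF-IDF vectorizer above as |V| and assumes dense `float32` storage (4 bytes per entry). Both are simplifying assumptions, and the result is presumably why the PPMI vocabulary was capped with `top_k=800` in the evaluation.

```python
VOCAB_SIZE = 35_198      # vocabulary size reported by X.shape above
BYTES_PER_FLOAT32 = 4    # assumed dense float32 storage
EMBEDDING_DIM = 100      # vector_size used for Word2Vec
TOP_K = 800              # vocabulary cap used for the PPMI evaluation

full_ppmi_bytes = VOCAB_SIZE * VOCAB_SIZE * BYTES_PER_FLOAT32
capped_ppmi_bytes = TOP_K * TOP_K * BYTES_PER_FLOAT32
w2v_bytes = VOCAB_SIZE * EMBEDDING_DIM * BYTES_PER_FLOAT32

for label, size in [
    ("PPMI, full vocabulary", full_ppmi_bytes),
    ("PPMI, top 800 words", capped_ppmi_bytes),
    ("Word2Vec, 100 dimensions", w2v_bytes),
]:
    print(f"{label:26s} {size / 1e6:10.1f} MB")
```

Under these assumptions the full dense matrix needs close to 5 GB, against roughly 14 MB for the embeddings and under 3 MB for the capped PPMI matrix.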