# Semantic Similarity Between Keywords Using Word Embeddings

This notebook demonstrates how to:
- Load keywords from `reports.csv`
- Use `flair` to embed each keyword with static word embeddings (fastText in this run; GloVe is an alternative)
- Embed a user-defined query term
- Compute cosine similarity and find the most semantically similar keywords
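
The similarity measure named in the last step can be illustrated on toy vectors. This is a minimal sketch, not the notebook's own code (the cells below call `sklearn.metrics.pairwise.cosine_similarity`). The helper name `cosine` and the example vectors are assumptions made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

print(round(cosine(a, b), 4))  # vectors 45 degrees apart -> 0.7071
print(cosine(a, c))            # orthogonal vectors -> 0.0
```

Because the score depends only on the angle between the vectors, scaling either vector leaves it unchanged.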

In [1]:
# !pip install flair

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from flair.embeddings import WordEmbeddings
from flair.data import Sentence
from sklearn.metrics.pairwise import cosine_similarity

## Load and prepare keywords from reports.csv

In [13]:
# Load reports.csv and extract keywords
reports_path = Path("../data/api/reports.csv")
df = pd.read_csv(reports_path).fillna("")

keywords = set()
for kw_list in df["keywords"]:
    kws = [k.strip().lower() for k in kw_list.split(",") if k.strip()]
    keywords.update(kws)
keywords = sorted(keywords)
print(f"Loaded {len(keywords)} unique keywords.")

Loaded 1665 unique keywords.
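
To see what the normalization in the cell above does to individual cells, here is a small sketch on made-up input. The function name `extract_keywords` and the sample strings are hypothetical stand-ins for the `keywords` column; the splitting, trimming, lowercasing and de-duplication mirror the loop above.

```python
def extract_keywords(cells):
    """Split comma-separated cells, trim and lowercase entries, drop blanks and duplicates."""
    unique = set()
    for cell in cells:
        unique.update(k.strip().lower() for k in cell.split(",") if k.strip())
    return sorted(unique)

# Hypothetical cell values standing in for df["keywords"]
sample_cells = ["Revenue, Income", "income , ,Cash Expenses", ""]
print(extract_keywords(sample_cells))
# ['cash expenses', 'income', 'revenue']
```

Empty cells (such as those produced by `fillna("")`) contribute nothing, and case or whitespace variants of the same keyword collapse into one entry.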


In [18]:
# Convert to a DataFrame for tabular display
keywords_df = pd.DataFrame(keywords, columns=["keyword"])
keywords_df.to_csv("../data/api/keywords_alphabetical.csv", index=False)

keywords_df.head(20)  # show the first 20 rows



Unnamed: 0,keyword
0,"""tripadvisor"
1,% offense
2,%share/channel
3,& my
4,+ voice
5,120 day
6,2018
7,2019
8,2020
9,2021


## Load Word Embedding model

https://flairnlp.github.io/docs/tutorial-embeddings/classic-word-embeddings

In [42]:
embedding = WordEmbeddings('en')  # 'en' loads fastText vectors; 'en-glove' loads GloVe instead.
print("Embedding model loaded.")

Embedding model loaded.


## Embed each keyword and build a dictionary

In [43]:
keyword_vectors = {}
for kw in keywords:
    sentence = Sentence(kw, use_tokenizer=True)
    if len(sentence) == 0:
        continue  # skip keywords that produce no tokens
    embedding.embed(sentence)
    # Average the token embeddings so multi-word keyphrases get a single vector
    vector = np.mean([token.embedding.cpu().numpy() for token in sentence], axis=0)
    keyword_vectors[kw] = vector

print(f"Embedded {len(keyword_vectors)} keywords.")

Embedded 1665 keywords.
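
The averaging step used for multi-word keyphrases can be illustrated without loading a model. This is a minimal sketch: the helper `mean_pool` and the 3-dimensional token vectors are invented for illustration and are not actual flair embeddings.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one phrase vector."""
    return np.mean(np.stack(token_vectors), axis=0)

# Hypothetical 3-dimensional token vectors for a two-word keyphrase
tokens = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0])]
phrase_vector = mean_pool(tokens)
print(phrase_vector)  # [2. 1. 1.]
```

Mean pooling ignores word order, so two keyphrases made of the same words in a different order end up with the same vector.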


## Search for keywords similar to a given query

In [44]:
query = "earnings"
query_sentence = Sentence(query, use_tokenizer=True)

if len(query_sentence) > 0:
    embedding.embed(query_sentence)
    # Average the token embeddings so multi-word queries get a single vector
    query_vector = np.mean([token.embedding.cpu().numpy() for token in query_sentence], axis=0).reshape(1, -1)
    # Cosine similarity between the query and every keyword vector
    scores = {}
    for kw, vec in keyword_vectors.items():
        scores[kw] = cosine_similarity(query_vector, vec.reshape(1, -1))[0][0]

    # Keep the 10 highest-scoring keywords
    top_k = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]
    print(f"Top keywords similar to '{query}':\n")
    for kw, score in top_k:
        print(f"{kw}: {score:.4f}")
else:
    print(f"'{query}' could not be embedded.")

Top keywords similar to 'earnings':

income: 0.7223
revenue performance: 0.6793
revenue forecast: 0.6475
revenues: 0.6327
cash expenses: 0.6320
revenue: 0.6244
sales payroll: 0.6233
corporate sales: 0.6106
revenue management: 0.6104
the revenue management: 0.6080
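
The ranking above scores one keyword at a time. As a possible alternative, the same ranking can be computed in a single matrix operation. The sketch below is an assumption-laden illustration, not the notebook's code: `top_k_similar` is a hypothetical helper, it expects a 1-D query vector (not the `(1, -1)` shape used above), and the toy vectors are made up.

```python
import numpy as np

def top_k_similar(query_vector, keyword_vectors, k=10):
    """Return the k keywords with the highest cosine similarity to query_vector."""
    names = list(keyword_vectors)
    matrix = np.vstack([keyword_vectors[n] for n in names])
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector)
    # Zero-length vectors get similarity 0 instead of a division error
    sims = np.divide(matrix @ query_vector, norms,
                     out=np.zeros(len(names)), where=norms > 0)
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(names[i], float(sims[i])) for i in order]

# Hypothetical 2-dimensional vectors, not real embeddings
toy = {
    "income": np.array([1.0, 1.0]),
    "weather": np.array([-1.0, 0.0]),
    "revenue": np.array([1.0, 0.0]),
}
result = top_k_similar(np.array([1.0, 0.0]), toy, k=2)
print(result)  # "revenue" ranks first, then "income"
```

With the notebook's variables, a call along the lines of `top_k_similar(query_vector.ravel(), keyword_vectors)` should give the same ordering as the loop, up to floating-point rounding.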
