# <h1 align="center"><font color="red">Working with Embeddings: Closed versus Open Source</font></h1>

<font color="pink">Senior Data Scientist: Dr. Eddy Giusepe Chirinos Isidro</font>

* Notebook based on the tutorial by [Ida Silfverskiöld](https://towardsdatascience.com/working-with-embeddings-closed-versus-open-source-39491f0b95c2).

* Dataset: [ilsilfverskiold/linkedin_profiles_synthetic](https://huggingface.co/datasets/ilsilfverskiold/linkedin_profiles_synthetic)

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*EaD1Iv5O6UX5EInQ09SYUA.png)

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*PO0sRdIm0Cn6Ni6qd7nTrw.png)

# <font color="green">Installing and importing the dataset</font>

In [None]:
#%pip install datasets scikit-learn matplotlib plotly -qq
#%pip install "nbformat>=4.2.0"

#%pip install --upgrade ipywidgets

In [1]:
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import plotly.express as px

# Synthetic LinkedIn profiles with the embeddings:
dataset = load_dataset("ilsilfverskiold/linkedin_profiles_synthetic")
profiles = dataset['train']

# Anonymous job descriptions with embeddings:
dataset = load_dataset("ilsilfverskiold/linkedin_recruitment_questions_embedded")
applications = dataset['train']

In [2]:
# Profiles with the different embeddings - pick the embeddings you'd like to use:
profiles

Dataset({
    features: ['FirstName', 'LastName', 'Headline', 'Location', 'About Me', 'Experience', 'Education', 'Skills', 'Certifications', 'Recommendations', 'text', 'embeddings_nv-embed-v1', 'embeddings_nv-embedqa-e5-v5', 'embeddings_bge-m3', 'embeddings_arctic-embed-l', 'embeddings_mistral-7b-v2', 'embeddings_gte-large-en-v1.5', 'embeddings_text-embedding-ada-002', 'embeddings_text-embedding-3-small', 'embeddings_voyage-3', 'embeddings_mxbai-embed-large-v1 '],
    num_rows: 6904
})

In [3]:
# Go through the applications to see which query you'll search with:
applications

Dataset({
    features: ['application', 'position', 'natural_language', 'embeddings_nv-embed-v1', 'embeddings_nv-embedqa-e5-v5', 'embeddings_bge-m3', 'embeddings_arctic-embed-l', 'embeddings_mistral-7b-v2', 'embeddings_text-embedding-ada-002', 'embeddings_text-embedding-3-small', 'embeddings_voyage-3', 'embeddings_mxbai-embed-large-v1'],
    num_rows: 20
})

In [7]:
application = applications[1] # deciding on the second application
application_text = application['natural_language']
print("Application we're looking for: \n\n",application_text)

Application we're looking for: 

 We're is seeking a dynamic Product Marketing Manager to drive the growth of our market leading CMS in the enterprise market
You will play a pivotal role in positioning us as the preferred solution for developers and web development teams within large organizations
You will collaborate closely with sales, marketing, and product teams to develop and execute effective go-to-market strategies, create compelling messaging, and support a successful enterprise sales motion


In [8]:
# Get the query embedding for one embedding model; here we pick mxbai-embed-large-v1:
query_embedding_vector = np.array(application['embeddings_mxbai-embed-large-v1'])

# Note: in the profiles dataset this column name ends with a trailing space
embeddings_list = [np.array(emb) for emb in profiles['embeddings_mxbai-embed-large-v1 ']]
texts = profiles['text']
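The trailing space in the profiles column name is easy to miss. As a minimal sketch (the helper `find_column` and the list `example_columns` are hypothetical names introduced here, not part of the original notebook), the column can be looked up while ignoring surrounding whitespace:

```python
def find_column(column_names, wanted):
    """Return the actual column name matching `wanted`, ignoring surrounding whitespace."""
    matches = [name for name in column_names if name.strip() == wanted.strip()]
    if len(matches) != 1:
        raise KeyError(f"Expected exactly one column matching {wanted!r}, found {matches}")
    return matches[0]

# Stand-in for profiles.column_names, shortened for illustration
example_columns = ["text", "embeddings_bge-m3", "embeddings_mxbai-embed-large-v1 "]

profile_column = find_column(example_columns, "embeddings_mxbai-embed-large-v1")
print(repr(profile_column))
```

In the notebook, one could pass `profiles.column_names` instead of `example_columns` and then index `profiles[profile_column]`.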

In [10]:
texts[:3]

['Augmented Reality Developer | Creating Immersive Experiences Dedicated and innovative AR developer with a passion for creating engaging and interactive experiences. Skilled in Unity, ARKit, and ARCore. Collaborative team player with a strong background in computer science and software engineering. Senior AR Developer | Pixelloid | Zurich, Switzerland | 2020-2022 | Led the development of multiple AR projects for various clients, including a popular museum exhibit and a cutting-edge industrial training platform. Utilized Unity and ARKit to create immersive experiences that increased user engagement by 30%.; AR Developer | Breadwinner Studios | Berlin, Germany | 2018-2020 | Worked on several AR projects, including a mobile app for furniture shopping and a virtual try-on platform for fashion brands. Contributed to the development of in-house tools for tracking and optimizing AR experiences.; Junior AR Developer | Whimsy Tech | Amsterdam, Netherlands | 2016-2018 | Collaborated with a team

# <font color="green">Similarity</font>

In [11]:
# Let's first try to calculate the cosine similarity (without clustering):
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity between the query and every profile embedding:
similarities = [cosine_similarity(query_embedding_vector, emb) for emb in embeddings_list]
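The per-profile loop above can also be written as a single matrix operation. The following is a minimal sketch on random toy data (the function `cosine_similarity_matrix` and the `toy_*` arrays are illustrative assumptions, not part of the original notebook); it checks that the vectorized form gives the same values as the loop:

```python
import numpy as np

def cosine_similarity_matrix(query, matrix):
    """Cosine similarity between one query vector and every row of `matrix`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    query_norm = np.linalg.norm(query)
    row_norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ query) / (row_norms * query_norm)

# Toy data standing in for the profile embeddings and the query embedding
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 8))
toy_query = rng.normal(size=8)

vectorized = cosine_similarity_matrix(toy_query, toy_embeddings)

# Same computation, one row at a time, as in the cell above
looped = np.array([
    np.dot(toy_query, emb) / (np.linalg.norm(toy_query) * np.linalg.norm(emb))
    for emb in toy_embeddings
])

print(np.allclose(vectorized, looped))
```

With the notebook's variables this would be called as `cosine_similarity_matrix(query_embedding_vector, np.array(embeddings_list))`.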

In [13]:
results = list(zip(range(1, len(texts) + 1), similarities, texts))
sorted_results = sorted(results, key=lambda x: x[1], reverse=True)

# Display the top results:
print("\nSimilarity Results (sorted from highest to lowest):")
for idx, sim, text in sorted_results[:20]:  # adjust if you want to show more
    percentage = (sim + 1) / 2 * 100  # rescale cosine similarity from [-1, 1] to [0, 100]
    text_preview = ' '.join(text.split()[:10])
    print(f"Text {idx} similarity: {percentage:.2f}% - Preview: {text_preview}...")



Similarity Results (sorted from highest to lowest):
Text 3615 similarity: 89.59% - Preview: Product Marketing Manager | Building Go-to-Market Strategies for Growth Results-driven...
Text 6299 similarity: 89.56% - Preview: Product Marketing Manager | Driving Growth & Customer Engagement Results-driven...
Text 3232 similarity: 89.09% - Preview: Product Marketing Manager | Driving Product Growth through Data-Driven Strategies...
Text 5959 similarity: 88.90% - Preview: Product Marketing Manager | Data-Driven Growth Expert Results-driven Product Marketing...
Text 5635 similarity: 88.84% - Preview: Product Marketing Manager | Driving Growth through Data-Driven Marketing Strategies...
Text 5835 similarity: 88.74% - Preview: Product Marketing Manager | Cloud-Based SaaS Results-driven Product Marketing Manager...
Text 139 similarity: 88.66% - Preview: Product Marketing Manager | Scaling Growth through Data-Driven Strategies Experienced...
Text 6688 similarity: 88.48% - Preview: Product Marketi
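The percentages printed above are not raw cosine values: the cell rescales each similarity with `(sim + 1) / 2 * 100`, so a displayed 89.59% corresponds to a cosine similarity of about 0.7918. The sketch below shows that mapping and its inverse, plus a top-k ranking using `np.argsort`; the helper names and the toy similarities are assumptions made for illustration only:

```python
import numpy as np

def to_percentage(sim):
    """Rescale a cosine similarity from [-1, 1] to [0, 100], as in the notebook."""
    return (sim + 1) / 2 * 100

def to_cosine(percentage):
    """Invert the rescaling to recover the raw cosine similarity."""
    return percentage / 100 * 2 - 1

def top_k(similarities, k):
    """Return (1-based text number, similarity) pairs for the k highest scores."""
    similarities = np.asarray(similarities)
    order = np.argsort(similarities)[::-1][:k]
    return [(int(i) + 1, float(similarities[i])) for i in order]

# Toy similarities standing in for the real list
toy_similarities = [0.10, 0.79, -0.25, 0.55]
ranked = top_k(toy_similarities, k=2)

print(ranked)
print(round(to_cosine(89.59), 4))
```

Reporting the raw cosine alongside the percentage makes scores easier to compare with other tools, which usually report the unscaled value.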

In [14]:
# Cluster the profile embeddings with k-means:
embeddings_array = np.array(embeddings_list)

num_clusters = 10  # you can pick another number here

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(embeddings_array)

cluster_labels = kmeans.labels_

# Project the embeddings to 2D with PCA for plotting only; the clustering above uses the full vectors:
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_array)
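The cell above fixes `num_clusters = 10` and reduces to two PCA components without checking either choice. One possible way to sanity-check both, sketched here on synthetic data with three well-separated groups (the toy data, the range of `k`, and the use of the silhouette score are assumptions, not the tutorial's method), is to compare silhouette scores across several values of `k` and to look at how much variance the 2D projection retains:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Toy data: three well-separated groups in 8 dimensions
rng = np.random.default_rng(0)
toy_centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
toy_embeddings = np.vstack([c + rng.normal(size=(50, 8)) for c in toy_centers])

# Silhouette score for several candidate cluster counts (higher is better)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_embeddings)
    scores[k] = silhouette_score(toy_embeddings, labels)

best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)

# Fraction of the total variance kept by the 2D projection used for plotting
toy_pca = PCA(n_components=2).fit(toy_embeddings)
retained = float(toy_pca.explained_variance_ratio_.sum())
print(f"Variance retained by 2 components: {retained:.2%}")
```

On real embeddings the retained variance may be low, in which case clusters that look overlapping in the 2D plot may still be well separated in the full space.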

In [15]:
# Let's now see how the query fits into the clustering:
query_embedding_array = query_embedding_vector.reshape(1, -1)
reduced_query_embedding = pca.transform(query_embedding_array)

# Let's also predict which cluster the query would belong to:
query_cluster_label = kmeans.predict(query_embedding_array)[0]
print(f"The query belongs to cluster {query_cluster_label}")

The query belongs to cluster 5
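`kmeans.predict` assigns the query to the cluster whose centroid is closest in Euclidean distance, whereas the ranking earlier in the notebook uses cosine similarity. A minimal sketch on random toy data (all `toy_*` names are illustrative assumptions) that reproduces the prediction by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for the profile embeddings and the query
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(200, 8))
toy_query = rng.normal(size=(1, 8))

toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(toy_embeddings)

# Euclidean distance from the query to each centroid
distances = np.linalg.norm(toy_kmeans.cluster_centers_ - toy_query, axis=1)
nearest_centroid = int(np.argmin(distances))

predicted = int(toy_kmeans.predict(toy_query)[0])
print(nearest_centroid == predicted)
```

Because the two steps use different distance measures, a profile with a high cosine similarity to the query is not guaranteed to land in the query's cluster.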


# <font color="green">Visualizing the embeddings</font>

In [20]:
# Let's now visualise the cluster with the query mapped out as well

labels = ['Data Point'] * len(embeddings_array)

truncated_texts = []
for text in texts:
    words = text.strip().split()
    truncated_text = ' '.join(words[:5]) if len(words) >= 5 else text.strip()
    truncated_texts.append(truncated_text)

query_words = application_text.strip().split()
truncated_query_text = ' '.join(query_words[:5]) if len(query_words) >= 5 else application_text.strip()

df = pd.DataFrame({
    'Component 1': reduced_embeddings[:, 0],
    'Component 2': reduced_embeddings[:, 1],
    'Cluster': cluster_labels.astype(str),
    'Label': labels,
    'Text': truncated_texts
})

df_query = pd.DataFrame({
    'Component 1': [reduced_query_embedding[0, 0]],
    'Component 2': [reduced_query_embedding[0, 1]],
    'Cluster': [str(query_cluster_label)],
    'Label': ['Query'],
    'Text': [truncated_query_text]
})

df = pd.concat([df, df_query], ignore_index=True)

fig = px.scatter(
    df,
    x='Component 1',
    y='Component 2',
    color='Cluster',
    hover_data=['Label', 'Text'],
    symbol=df['Label'].apply(lambda x: 'x' if x == 'Query' else 'circle'),
    symbol_map='identity',  # use 'x' and 'circle' as literal marker symbols
    size=df['Label'].apply(lambda x: 10 if x == 'Query' else 5),
    title='Embedding Clusters Visualization with Truncated Texts'
)

fig.show()

In [21]:
# Let's now do semantic search but only in the correct cluster to see if it helps filter out irrelevant results
cluster_indices = np.where(cluster_labels == query_cluster_label)[0]

cluster_embeddings = embeddings_array[cluster_indices]
cluster_texts = [texts[i] for i in cluster_indices]

similarities_in_cluster = []
for idx, emb in zip(cluster_indices, cluster_embeddings):
    sim = cosine_similarity(query_embedding_vector, emb)
    similarities_in_cluster.append((idx, sim))

similarities_in_cluster.sort(key=lambda x: x[1], reverse=True)

top_n = 40  # adjust this number if you want to display more matches
top_matches = similarities_in_cluster[:top_n]

print(f"\nTop {top_n} similar texts in the same cluster as the query:")
for idx, sim in top_matches:
    percentage = (sim + 1) / 2 * 100
    text_preview = ' '.join(texts[idx].split()[:10])
    print(f"Text {idx+1} similarity: {percentage:.2f}% - Preview: {text_preview}...")


Top 40 similar texts in the same cluster as the query:
Text 3615 similarity: 89.59% - Preview: Product Marketing Manager | Building Go-to-Market Strategies for Growth Results-driven...
Text 3232 similarity: 89.09% - Preview: Product Marketing Manager | Driving Product Growth through Data-Driven Strategies...
Text 5959 similarity: 88.90% - Preview: Product Marketing Manager | Data-Driven Growth Expert Results-driven Product Marketing...
Text 5635 similarity: 88.84% - Preview: Product Marketing Manager | Driving Growth through Data-Driven Marketing Strategies...
Text 5835 similarity: 88.74% - Preview: Product Marketing Manager | Cloud-Based SaaS Results-driven Product Marketing Manager...
Text 139 similarity: 88.66% - Preview: Product Marketing Manager | Scaling Growth through Data-Driven Strategies Experienced...
Text 6688 similarity: 88.48% - Preview: Product Marketing Manager | Driving Business Growth through Data-Driven Insights...
Text 6405 similarity: 88.27% - Preview: Product Mar
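Comparing the two result lists shows the cost of filtering by cluster: Text 6299 ranks second in the full search (89.56%) but does not appear in the cluster-restricted list, so it was assigned to a different cluster. A rough sketch for quantifying this, using toy similarities and labels (the function `recall_within_cluster` and all toy values are hypothetical), computes what fraction of the global top-k survives the cluster filter:

```python
import numpy as np

def top_k_indices(similarities, k):
    """Indices of the k highest similarities, best first."""
    return np.argsort(similarities)[::-1][:k]

def recall_within_cluster(similarities, cluster_labels, query_cluster, k):
    """Fraction of the global top-k that lies in the query's cluster, plus the missed indices."""
    similarities = np.asarray(similarities)
    cluster_labels = np.asarray(cluster_labels)
    global_top = top_k_indices(similarities, k)
    in_cluster = cluster_labels[global_top] == query_cluster
    kept = global_top[in_cluster]
    missed = global_top[~in_cluster]
    return len(kept) / len(global_top), missed.tolist()

# Toy values: the second-best match sits in a different cluster than the query
toy_similarities = [0.79, 0.78, 0.77, 0.40, 0.76]
toy_labels = [5, 2, 5, 5, 5]

recall, missed = recall_within_cluster(toy_similarities, toy_labels, query_cluster=5, k=4)
print(recall, missed)
```

With the notebook's variables, this would be called with `similarities`, `cluster_labels`, and `query_cluster_label`; the returned indices are 0-based, so add 1 to match the printed text numbers.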