# Embeddings

In this lab we will see how an embedding model works and how to visualize its output.

## Dependencies


In [1]:
!pip install \
  cohere==5.15.0 \
  umap-learn[plot]==0.5.7 \
  altair==5.5.0 \
  datasets==3.6.0 \
  usearch==2.17.7 \
  np==1.0.2 \
  fastavro==1.10.0 \
  httpx-sse==0.4.0 \
  types-requests==2.32.0.20250328 \
  dill==0.3.8 \
  multiprocess==0.70.16 \
  xxhash==3.5.0 \
  datashader==0.18.1 \
  pyct==0.5.0 \
  fsspec==2025.3.0



In [2]:
import cohere
import pandas as pd
import getpass

api_key = getpass.getpass("Enter your Cohere API Key:")
co = cohere.Client(api_key)

Enter your Cohere API Key:··········


## Sentence similarity


In [6]:
import numpy as np
phrases = ["The One Ring is in Mordor", "I love soup", "The Ring holds the power to control all others", "The One Ring is inside Mordor"]

model="embed-english-v3.0"
input_type="search_query"

res = co.embed(texts=phrases,
                model=model,
                input_type=input_type,
                embedding_types=['float'])

(p1, p2, p3, p4) = res.embeddings.float

# Compare them with cosine similarity: the dot product divided by the product of the vector norms
def calculate_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(calculate_similarity(p1, p2))

print(calculate_similarity(p2, p3))

print(calculate_similarity(p1, p3))

print(calculate_similarity(p1, p4))

0.14846361318201975
0.15578270310509199
0.5286080687211606
0.9316496595119268
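
To help read the scores above, here is a minimal sketch of cosine similarity on hand-made toy vectors. The vectors and the helper name `cosine_similarity` are assumptions chosen for illustration; they are not real embeddings from the model. The sketch only shows the range of the measure: close to 1 for vectors pointing the same way, close to 0 for orthogonal vectors, and close to -1 for opposite vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (hypothetical helper for this sketch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, not model output
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # opposite vectors

print(same_direction, orthogonal, opposite)
```

Because the measure ignores vector length, the two near-identical Mordor phrases score close to 1 while the unrelated soup phrase scores close to 0.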


## Word Embeddings
Let's look at the vector representation of a few words:


In [7]:
words = pd.DataFrame({'text':
  [
      'robot',
      'car',
      'bacon'
  ]})

words

    text
0  robot
1    car
2  bacon


In [8]:
words_emb = co.embed(texts=list(words['text']), model='embed-english-v2.0').embeddings

In [9]:
word_0 = words_emb[0]
word_1 = words_emb[1]
word_2 = words_emb[2]

In [10]:
word_2

[-0.9995117,
 2.5078125,
 2.3574219,
 0.42382812,
 -0.09765625,
 0.34960938,
 -0.5966797,
 1.0224609,
 -0.8461914,
 2.3261719,
 -1.7128906,
 -0.5913086,
 0.67041016,
 3.5097656,
 1.4179688,
 0.7763672,
 -1.5068359,
 2.1171875,
 0.84472656,
 -0.65722656,
 -0.4650879,
 1.2519531,
 -0.33618164,
 -0.021606445,
 2.1953125,
 -1.9541016,
 -0.10180664,
 -0.75146484,
 -1.8447266,
 0.13476562,
 -1.1347656,
 -1.5830078,
 0.12841797,
 0.70214844,
 0.76220703,
 0.20410156,
 0.82910156,
 -0.8569336,
 -0.0007324219,
 -0.56591797,
 -0.20092773,
 1.1445312,
 0.6088867,
 -1.8085938,
 1.8095703,
 -1.4140625,
 1.1494141,
 3.0703125,
 -0.41308594,
 3.2695312,
 -1.09375,
 -0.05810547,
 -0.78271484,
 0.4741211,
 -0.37475586,
 -1.2265625,
 -0.97558594,
 -5.3046875,
 -0.07800293,
 -0.9042969,
 -0.38427734,
 1.1914062,
 -1.3085938,
 0.9199219,
 0.60839844,
 2.7050781,
 -1.2910156,
 -3.0371094,
 -0.059417725,
 0.14123535,
 -2.6210938,
 0.041137695,
 -0.14355469,
 -1.9042969,
 -0.22631836,
 1.3935547,
 0.25952148

## Sentence embeddings
Now the vector representation of a few sentences:


In [11]:
import pandas as pd

sentences = pd.DataFrame({'text':
  [
   'Where is the One Ring?',
   'The One Ring is in Mordor',
   'What power does the Ring hold?',
   'The Ring holds the power to control all others',
   'Where do Hobbits live?',
   'Hobbits live in The Shire',
   'What is the Elven bread?',
   'Elven bread is Lembas, sustenance for a long journey',
  ]})

print(sentences)


                                                text
0                             Where is the One Ring?
1                          The One Ring is in Mordor
2                     What power does the Ring hold?
3     The Ring holds the power to control all others
4                             Where do Hobbits live?
5                          Hobbits live in The Shire
6                           What is the Elven bread?
7  Elven bread is Lembas, sustenance for a long j...


In [12]:
emb = co.embed(texts=list(sentences['text']), model='embed-english-v2.0').embeddings

for e in emb:
    print(e)

[0.6894531, -1.6904297, -2.5546875, 0.14904785, -0.70654297, -0.5991211, -1.1914062, 0.006713867, -0.026138306, 3.0, -0.8652344, 1.3671875, -0.23547363, -0.17687988, -0.640625, -0.36645508, -0.86328125, 2.3085938, -0.1026001, -0.06500244, 1.1865234, 0.24279785, -1.3291016, -0.29614258, 0.8613281, -1.0859375, 0.6088867, -1.6123047, -1.3056641, -1.4931641, 0.13928223, -2.0546875, -0.04550171, 2.2441406, 0.7963867, -0.64208984, 0.55126953, -1.5068359, 0.5625, 1.65625, -1.8837891, 0.08251953, -1.2021484, -1.8515625, -0.61083984, 0.39233398, -1.2607422, 0.3828125, 0.7285156, 0.8574219, 1.8486328, -1.6142578, 2.0996094, -0.1986084, 0.49926758, -0.3955078, -4.9375, -1.84375, 2.1699219, 0.5239258, -1.5302734, 0.6015625, 0.6977539, -0.58984375, -1.6279297, 0.74072266, -0.7871094, -0.86083984, 0.6948242, -0.7807617, -2.2539062, 2.8652344, -1.0419922, 0.32055664, -0.17321777, 1.3173828, -0.44458008, -1.0410156, -0.7988281, -3.4589844, -0.052642822, -0.39868164, 0.3557129, 0.74560547, -0.6201172, 

In [13]:
import umap
import altair as alt

In [14]:
def umap_plot(text, emb):
    """Project the embeddings to 2D with UMAP and return an Altair scatter plot."""
    cols = list(text.columns)
    # A small n_neighbors suits this tiny dataset of a few sentences
    reducer = umap.UMAP(n_neighbors=2)
    umap_embeds = reducer.fit_transform(emb)
    df_explore = text.copy()
    df_explore['x'] = umap_embeds[:,0]
    df_explore['y'] = umap_embeds[:,1]

    # Plot: one circle per row, with the text columns shown in the tooltip
    chart = alt.Chart(df_explore).mark_circle(size=60).encode(
        x=alt.X('x', scale=alt.Scale(zero=False)),
        y=alt.Y('y', scale=alt.Scale(zero=False)),
        tooltip=cols
    ).properties(
        width=700,
        height=400
    )
    return chart

In [15]:
chart = umap_plot(sentences, emb)
chart.interactive()



## Larger embedding sets
We will download some Wikipedia articles and explore how similar they are.

We will use a Cohere dataset of Wikipedia articles that have already been run through its embedding model.

[Cohere/wikipedia-2023-11-embed-multilingual-v3](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3)


In [16]:
from datasets import load_dataset

lang = "simple"
top_k = 5
max_docs = 5000

docs_stream = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", lang, split="train", streaming=True)


In [17]:
titles = []
texts = []
doc_embeddings = []

for doc in docs_stream:
    titles.append(doc['title'])
    texts.append(doc['text'])
    doc_embeddings.append(doc['emb'])
    if len(titles) >= max_docs:
        break

wiki_articles = pd.DataFrame({
    'title': titles,
    'text': texts,
    'emb': doc_embeddings
})
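
As a side note, the "take the first `max_docs` items from a stream" pattern used above can also be written with `itertools.islice`. The sketch below is only an illustration under stated assumptions: `fake_stream` is a hypothetical stand-in generator, not the Hugging Face dataset, and `limit` plays the role of `max_docs`.

```python
from itertools import islice

def fake_stream():
    """Hypothetical endless stream of records shaped like the dataset rows."""
    i = 0
    while True:
        yield {"title": f"Article {i}", "text": f"Body {i}", "emb": [float(i), 0.0]}
        i += 1

limit = 3  # stands in for max_docs

# islice stops pulling from the stream after `limit` items
first_docs = list(islice(fake_stream(), limit))

titles_demo = [d["title"] for d in first_docs]
embeddings_demo = [d["emb"] for d in first_docs]
print(titles_demo)
```

Either form reads only as many records as needed, which is the point of streaming a large dataset instead of downloading it in full.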

In [18]:
articles = wiki_articles[['title', 'text']]
embeds = np.array(doc_embeddings)

In [19]:
def umap_plot_big(text, emb):
    """Same as umap_plot, with a larger neighborhood for the bigger dataset."""
    cols = list(text.columns)
    # A larger n_neighbors gives more weight to global structure across thousands of points
    reducer = umap.UMAP(n_neighbors=100)
    umap_embeds = reducer.fit_transform(emb)
    df_explore = text.copy()
    df_explore['x'] = umap_embeds[:,0]
    df_explore['y'] = umap_embeds[:,1]

    # Plot: one circle per row, with the text columns shown in the tooltip
    chart = alt.Chart(df_explore).mark_circle(size=60).encode(
        x=alt.X('x', scale=alt.Scale(zero=False)),
        y=alt.Y('y', scale=alt.Scale(zero=False)),
        tooltip=cols
    ).properties(
        width=700,
        height=400
    )
    return chart

In [20]:
chart = umap_plot_big(articles, embeds)
chart.interactive()



## Full search example

In [21]:
docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = np.asarray(doc_embeddings)

In [22]:
query = 'What is Enigma?'

In [24]:
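# Embed the query with the multilingual v3 model, matching the dataset embeddings
response = co.embed(texts=[query], model='embed-multilingual-v3.0', input_type="search_query")
query_embedding = response.embeddings
query_embedding = np.asarray(query_embedding)

# Dot-product score between the query and every document
dot_scores = np.matmul(query_embedding, doc_embeddings.transpose())[0]
# Indices of the top_k highest scores, in no particular order
top_k_hits = np.argpartition(dot_scores, -top_k)[-top_k:].tolist()

# Sort ascending by score, so the best match ends up last in the list
top_k_hits.sort(key=lambda x: dot_scores[x])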
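
The retrieval step above can be sketched on toy data without calling the API. The vectors below are made up for illustration, and `ranked_ids` is a hypothetical name. Note one deliberate variation: this sketch sorts with `reverse=True` so the best match comes first, whereas the cell above sorts in ascending order.

```python
import numpy as np

# Toy 2D "document embeddings" (assumed values, not model output)
doc_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])
query_vector = np.array([[1.0, 0.0]])  # shape (1, dim), like the embedded query
k = 2

# Dot product of the query against every document: [1.0, 0.8, 0.0, -1.0]
scores = np.matmul(query_vector, doc_vectors.T)[0]

# argpartition finds the k largest scores without fully sorting the array
candidate_ids = np.argpartition(scores, -k)[-k:].tolist()

# Order only those k candidates, best match first
ranked_ids = sorted(candidate_ids, key=lambda i: scores[i], reverse=True)
print(ranked_ids)  # [0, 1]
```

Partitioning first and sorting only the `k` survivors avoids sorting every score, which matters more as the number of documents grows.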

In [25]:
print("Query:", query)
for doc_id in top_k_hits:
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'])
    print(docs[doc_id]['url'], "\n")

Query: What is Enigma?
Microsoft
Bing is a search engine similar to Google. It used to be under the MSN brand and was later known as Live Search, but became its own service in 2009. Bing is known for the different images that appear on the background of its home page.
https://simple.wikipedia.org/wiki/Microsoft 

Alan Turing
Using cryptanalysis, he helped to break the codes of the Enigma machine. After that, he worked on other German codes.
https://simple.wikipedia.org/wiki/Alan%20Turing 

Microsoft
Internet Explorer is a piece of software that lets people look at things online (known as browsing) and download things from the Internet. In 2015, it was replaced by Microsoft Edge.
https://simple.wikipedia.org/wiki/Microsoft 

Encyclopedia
An encyclopedia (also known in English as an encyclopædia) is a collection (usually a book or website) of information.  Some are called "encyclopedic dictionaries".
https://simple.wikipedia.org/wiki/Encyclopedia 

Alan Turing
Alan was a brilliant mathem