# Getting Started With Embeddings: Notebook Companion



## 1. Embedding a dataset


In [3]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "hf_xxx"  # placeholder: paste your own access token here and never publish a real one

The first time you generate embeddings, the API may take a while (approximately 20 seconds) to respond, because the model has to be downloaded and loaded on the server; subsequent calls are much faster. To handle this, we wrap the request in the `retry` decorator (install it with `pip install retry`): if `output = query(texts)` fails, the call is retried after a 10-second wait, for up to three attempts in total.
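To make the retry behaviour concrete, here is a minimal standard-library sketch of a decorator that behaves roughly like `@retry(tries=3, delay=10)`. It is an illustration only, not the `retry` package's actual implementation; `simple_retry` and `flaky_query` are hypothetical names, and the delay is set to 0 so the example runs instantly.

```python
import functools
import time


def simple_retry(tries=3, delay=10):
    """Call the wrapped function up to `tries` times, sleeping `delay` seconds after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator


# Hypothetical stand-in for the API call: fails twice ("model loading"), then succeeds.
calls = {"count": 0}


@simple_retry(tries=3, delay=0)
def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model is still loading")
    return [[0.1, 0.2, 0.3]]


result = flaky_query()
print(calls["count"], result)  # 3 [[0.1, 0.2, 0.3]]
```

The key point is that the wrapped function must raise an exception on failure; a function that silently returns `None` would never trigger a retry.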

In [2]:
%%capture
!pip install retry

In [4]:
import requests
from retry import retry

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [8]:
@retry(tries=3, delay=10)
def query(texts):
    """Return one embedding per input text; raise so that @retry tries again on failure."""
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    if isinstance(result, list):
        return result
    # Any non-list payload is an error message, typically because the model is still loading.
    raise RuntimeError(
        f"The request failed (the model may still be loading): {result}"
    )

In [11]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans’ Benefits?"]

output = query(texts)

In [12]:
import pandas as pd

embeddings = pd.DataFrame(output)

In [None]:
print(embeddings)

         0         1         2         3         4         5         6    \
0  -0.023889  0.055259 -0.011655 -0.033414 -0.012261 -0.024873 -0.012663   
1  -0.012688  0.046874 -0.010502 -0.020384 -0.013361  0.042322  0.016628   
2   0.000494  0.119412  0.005229 -0.092734  0.007773 -0.005325  0.034506   
3  -0.029711  0.023298 -0.057041 -0.012183 -0.013710  0.029796  0.063739   
4  -0.025628  0.070389 -0.017380 -0.056567  0.028576  0.052823  0.067062   
5  -0.022656  0.021160  0.005105 -0.046494  0.009074  0.041495  0.054268   
6  -0.002911  0.060791 -0.009176 -0.006133  0.040492  0.036594  0.002054   
7  -0.080526  0.059888 -0.048847 -0.040176 -0.063342  0.041848  0.119045   
8  -0.034388  0.072501  0.014440 -0.036695  0.014019  0.063070  0.034683   
9  -0.005964  0.025044 -0.003182 -0.025243 -0.039823 -0.012772  0.044713   
10 -0.039008 -0.010610 -0.007383 -0.050190 -0.002518 -0.041641  0.026969   
11 -0.095983 -0.063012 -0.116906 -0.059075 -0.051323 -0.003439  0.018687   
12 -0.011629

## 2. Host embeddings for free on the Hugging Face Hub


In [13]:
%%capture
!pip install huggingface-hub

In [14]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `Unal` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `Unal`


In [15]:
!huggingface-cli repo create embedded_faqs_medicare --type dataset --organization ITESM

git version 2.34.1
git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)

You are about to create datasets/ITESM/embedded_faqs_medicare
Proceed? [Y/n] Y
(Request ID: Root=1-6757b04d-278730e3104e85567512e6c1;7b63979a-4c16-4489-8328-0171044fe57e)

403 Forbidden: You don't have the rights to create a dataset under the namespace "ITESM".
Cannot access content at: https://huggingface.co/api/repos/create.
Make sure your token has the correct permissions.
{"error":"You don't have the rights to create a dataset under the namespace \"ITESM\""}


In [16]:
# Sets up Git LFS for your user account. Colab instances already ship with git-lfs, so this is usually unnecessary there.
#!git lfs install

In [17]:
hf_username = "your-username"  # placeholder: your Hugging Face user name
!git clone https://{hf_username}:{hf_token}@huggingface.co/datasets/ITESM/embedded_faqs_medicare

Cloning into 'embedded_faqs_medicare'...
remote: Enumerating objects: 7, done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 7 (from 1)
Unpacking objects: 100% (7/7), 47.65 KiB | 7.94 MiB/s, done.
Encountered 1 file(s) that should have been pointers, but weren't:
	embeddings.csv


In [18]:
embeddings.to_csv("embedded_faqs_medicare/embeddings.csv", index=False)
print(embeddings.shape)

(13, 384)


Change into the cloned repository, `embedded_faqs_medicare`.

In [19]:
%cd embedded_faqs_medicare/

/content/embedded_faqs_medicare


In [20]:
!git lfs track *.csv
!git add .gitattributes
!git add embeddings.csv

"embeddings.csv" already supported


In [21]:
!git config --global user.email "your email here"
!git config --global user.name "your git user here"

In [22]:
!git commit -m "First version of the embedded_faqs_medicare dataset"
!git push

[main 363172e] First version of the embedded_faqs_medicare dataset
 1 file changed, 3 insertions(+), 14 deletions(-)
 rewrite embeddings.csv (100%)
remote: Password authentication in git is no longer supported. You must use a user access token or an SSH key instead. See https://huggingface.co/blog/password-git-deprecation
fatal: Authentication failed for 'https://huggingface.co/datasets/ITESM/embedded_faqs_medicare/'


## 3. Get the most similar Frequently Asked Questions to a query


In [23]:
%%capture
!pip install datasets

In [24]:
import torch
from datasets import load_dataset

faqs_embeddings = load_dataset('ITESM/embedded_faqs_medicare')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [25]:
question = ["How can Medicare help me?"]
output = query(question)

In [26]:
query_embeddings = torch.tensor(output, dtype=torch.float)
print(f"The size of our embedded dataset is {dataset_embeddings.shape} and of our embedded query is {query_embeddings.shape}.")

The size of our embedded dataset is torch.Size([13, 384]) and of our embedded query is torch.Size([1, 384]).


In [27]:
%%capture
!pip install -U sentence-transformers

In [28]:
from sentence_transformers.util import semantic_search

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)


In [29]:
[texts[hit['corpus_id']] for hit in hits[0]]

['How can I get help with my Medicare Part A and Part B premiums?',
 'What is Medicare and who can get it?',
 'How do I sign up for Medicare?',
 'What are the different parts of Medicare?',
 'Will my Medicare premiums be higher because of my higher income?']
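To show what the ranking step is doing, here is a minimal pure-Python sketch of cosine-similarity search. It is an approximation of the idea behind `semantic_search` (which, to my knowledge, scores by cosine similarity by default), not the library's implementation. `cosine_similarity`, `top_k_search`, and the two-dimensional toy vectors are hypothetical; the real embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, corpus_vecs, top_k=5):
    """Return the top_k corpus entries most similar to the query, best first."""
    scored = [
        {"corpus_id": idx, "score": cosine_similarity(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy data: the query points almost exactly along the first corpus vector.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

toy_hits = top_k_search(toy_query, toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])  # [0, 2]
```

Each hit carries a `corpus_id`, the row index of the matching FAQ, which is why the cell above can look up the original question with `texts[...]`.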

In [30]:

question = ["I lost my card, how can I get a replacement Medicare card"]

# Embed the new question with the same model used for the FAQs.
output = query(question)
query_embeddings = torch.tensor(output, dtype=torch.float)

# Reload the hosted FAQ embeddings as a float tensor.
faqs_embeddings = load_dataset('ITESM/embedded_faqs_medicare')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

# Retrieve the three closest FAQs and print their text.
hits = semantic_search(query_embeddings, dataset_embeddings, top_k=3)

for hit in hits[0]:
    print(texts[hit['corpus_id']])

How do I get a replacement Medicare card?
How do I sign up for Medicare?
How can I get help with my Medicare Part A and Part B premiums?
