# Getting Started With Embeddings: Notebook Companion



![](/../assets/80_getting_started_with_embeddings/thumbnail.png)

## 1. Embedding a dataset


In [16]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "<YOUR_HF_TOKEN>"  # paste your own access token; never publish a real one
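A token written directly into a cell is visible to anyone who opens the notebook. One possible alternative, sketched here, reads it from an environment variable and falls back to an interactive prompt. The helper name `load_hf_token` and the default variable name `HF_TOKEN` are assumptions made for this illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def load_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, prompting for it only if unset.

    `load_hf_token` and the default variable name are illustrative choices.
    """
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed value so it is not echoed into cell output.
        token = getpass("Hugging Face token: ")
    return token
```

With something like this in place, the cell above could become `hf_token = load_hf_token()`, which keeps the secret out of the saved notebook.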

In [17]:
%%capture
!pip install retry

In [18]:
import requests
from retry import retry

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [19]:
@retry(tries=3, delay=10)
def query(texts):
    """Request embeddings for `texts`; raising lets `retry` try again."""
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    if isinstance(result, list):
        return result
    # Any non-list payload (for example {"error": ...} while the model loads)
    # is raised so the function never silently returns None.
    raise RuntimeError(f"Embedding request failed: {result}")
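The `@retry(tries=3, delay=10)` decorator re-runs `query` whenever it raises, which is why the function signals a loading model with an exception. The sketch below shows roughly what such a decorator does, using only the standard library. It is not the `retry` package's implementation; `simple_retry` and `flaky` are hypothetical names used for illustration.

```python
import functools
import time


def simple_retry(tries=3, delay=0.0):
    """Re-run the wrapped function up to `tries` times, sleeping between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Give up and re-raise once the final attempt has failed.
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(tries=3, delay=0)
def flaky():
    """Fail twice, then succeed, mimicking a model that needs time to load."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("still loading")
    return "ready"


outcome = flaky()
```

Here `flaky` fails on its first two calls and succeeds on the third, so the decorated call returns normally after three attempts.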

In [20]:
texts = ["How to activate Gameloft service?",
        "How to activate Kitmania service?",
        "How do I activate channels in dialog tv ?",
        "How to stop automatic reloading in my mobile?",
        "How can I change my subscription plan ?",
        "How to upgrade my data package in fiber ?",
        "How much would it be to switch to fiber from adsl ?",
        "Can I request to change my router ?",
        "Where can I buy a sim card ?",
        "Where's the nearest outlet of dialog in Galle ?",
        "How much is it for the unlimited packages ?",
        "Is there any constraints on unlimited packages ?",
        "What are the benifits of switching to dialog?",
        "How much is the coverage of dialog over the country?",
        "How do I reload my sim using mydialog app ?"]

output = query(texts)

In [21]:
import pandas as pd

embeddings = pd.DataFrame(output)

In [22]:
print(embeddings)

         0         1         2         3         4         5         6    \
0  -0.029219 -0.085512 -0.077847 -0.094003  0.015870  0.037714  0.099469   
1  -0.008334 -0.033018 -0.079689 -0.125476  0.020065  0.011969 -0.010356   
2   0.084577 -0.160313 -0.066792 -0.065669 -0.032276  0.011812  0.037455   
3  -0.004485  0.055164  0.059364  0.032874 -0.010698 -0.004722  0.014576   
4  -0.009960  0.006991  0.002154  0.001607 -0.000578  0.055080 -0.011466   
5   0.045093 -0.089408  0.059152 -0.029200 -0.023397 -0.062992 -0.059605   
6  -0.005542 -0.084162  0.028302  0.068891 -0.001797 -0.069586  0.019558   
7  -0.035500 -0.007333  0.031845 -0.023818  0.007876 -0.001336 -0.077200   
8  -0.025864 -0.019233  0.032804 -0.000257 -0.017394  0.001911 -0.001474   
9   0.022011 -0.055546 -0.081685  0.030110 -0.053201  0.000104  0.030299   
10 -0.020933  0.012309  0.003850 -0.011158 -0.022893 -0.011582  0.039846   
11 -0.002877 -0.057657  0.000468 -0.066328 -0.003648 -0.000191 -0.065667   
12  0.004036

## 2. Host embeddings for free on the Hugging Face Hub


In [23]:
%%capture
!pip install huggingface-hub

In [24]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid.
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential he

In [25]:
!huggingface-cli repo create embedded_faqs_telco --type dataset

git version 2.25.1
git-lfs/2.9.2 (GitHub; linux amd64; go 1.13.5)

You are about to create datasets/Sandeepa/embedded_faqs_telco
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/datasets/Sandeepa/embedded_faqs_telco

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/datasets/Sandeepa/embedded_faqs_telco



In [26]:
# Colab instances already ship with git-lfs; uncomment the line below if your environment still needs it initialised.
#!git lfs install

In [27]:
# Replace YOUR_HF_TOKEN with a token that has write access; never publish a real one.
!git clone https://Sandeepa:YOUR_HF_TOKEN@huggingface.co/datasets/Sandeepa/embedded_faqs_telco

Cloning into 'embedded_faqs_telco'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), 520 bytes | 520.00 KiB/s, done.


In [28]:
embeddings.to_csv("embedded_faqs_telco/embeddings.csv", index=False)
print(embeddings.shape)

(15, 384)


Changing directory to our repo `embedded_faqs_telco`.

In [29]:
%cd embedded_faqs_telco/

/content/embedded_faqs_telco


In [30]:
!git lfs track *.csv
!git add .gitattributes
!git add embeddings.csv

Tracking "embeddings.csv"


In [31]:
!git config --global user.email "Sandeepdevin@gmail.com"
!git config --global user.name "Sandeepa"

In [32]:
!git commit -m "Follow through embedded tutorial with telco questions"
!git push

[main 012291b] Follow through embedded tutorial with telco questions
 2 files changed, 4 insertions(+)
 create mode 100644 embeddings.csv
Uploading LFS objects: 100% (1/1), 122 KB | 0 B/s, done.
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 483 bytes | 483.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0)
To https://huggingface.co/datasets/Sandeepa/embedded_faqs_telco
   a2c38e9..012291b  main -> main


## 3. Get the most similar Frequently Asked Questions to a query


In [33]:
%%capture
!pip install datasets

In [34]:
import torch
from datasets import load_dataset

faqs_embeddings = load_dataset('Sandeepa/embedded_faqs_telco')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

Downloading and preparing dataset csv/Sandeepa--embedded_faqs_telco to /root/.cache/huggingface/datasets/Sandeepa___csv/Sandeepa--embedded_faqs_telco-81f94fff8292a887/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/Sandeepa___csv/Sandeepa--embedded_faqs_telco-81f94fff8292a887/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



In [35]:
question = ["How can I activate Dialog Tv packages?"]
output = query(question)

In [36]:
query_embeddings = torch.FloatTensor(output)
print(f"The size of our embedded dataset is {dataset_embeddings.shape} and of our embedded query is {query_embeddings.shape}.")

The size of our embedded dataset is torch.Size([15, 384]) and of our embedded query is torch.Size([1, 384]).


In [37]:
%%capture
!pip install -U sentence-transformers

In [38]:
from sentence_transformers.util import semantic_search

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)
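By default, `semantic_search` scores each corpus embedding against the query with cosine similarity and keeps the `top_k` highest scores. The dependency-free sketch below illustrates that ranking idea on toy two-dimensional vectors. It is not the library's implementation; `cosine_similarity`, `top_k_matches`, and the toy data are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vec, corpus_vecs, top_k=5):
    """Score every corpus vector against the query and keep the best `top_k`."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    # Highest similarity first, mirroring the order of the notebook's hits.
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy data: three 2-dimensional "embeddings" and one query vector.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
toy_hits = top_k_matches(toy_query, toy_corpus, top_k=2)
```

The query points almost along the first axis, so corpus vector 0 ranks first and the diagonal vector 2 second. The real call does the same kind of comparison on 384-dimensional tensors.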

In [39]:
[texts[hit['corpus_id']] for hit in hits[0]]

['How do I activate channels in dialog tv ?',
 'How much is the coverage of dialog over the country?',
 "Where's the nearest outlet of dialog in Galle ?",
 'What are the benifits of switching to dialog?',
 'How to activate Kitmania service?']
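The list above shows only the matched questions. Each hit also carries a similarity score, which helps when judging how close a match really is. The sketch below pairs each question with its score, assuming the hit dictionaries expose `corpus_id` and `score` keys. `label_hits` and the example data are hypothetical and do not reproduce the notebook's actual scores.

```python
def label_hits(hits_for_query, corpus_texts):
    """Pair each hit's text with its similarity score, rounded for display."""
    return [
        (corpus_texts[hit["corpus_id"]], round(hit["score"], 3))
        for hit in hits_for_query
    ]


# Made-up texts and scores, used only to show the shape of the result.
example_texts = ["first question", "second question", "third question"]
example_hits = [
    {"corpus_id": 2, "score": 0.81234},
    {"corpus_id": 0, "score": 0.4},
]
labelled = label_hits(example_hits, example_texts)
```

In the notebook this would be called as `label_hits(hits[0], texts)`, since `hits` holds one list of results per query.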