# FAQ Engine Using Open Source Embeddings

An **embedding** is a numerical representation of a piece of information: text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, which makes it useful for many industry applications.

Given the text "What is the main benefit of voting?", an embedding of the sentence could be represented in a vector space as a list of 384 numbers (for example, [0.84, 0.42, ..., 0.02]). Since this list captures the meaning, we can do exciting things, like calculating the distance between different embeddings to determine how closely the meanings of two sentences match.
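
As a minimal sketch of that distance computation, the snippet below implements cosine similarity in plain Python. The three-dimensional vectors and their names are made-up toy values, not real model output; they only illustrate that sentences with similar meaning should score higher.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
voting_benefit = [0.84, 0.42, 0.02]
why_vote = [0.80, 0.45, 0.05]
pasta_recipe = [0.05, 0.10, 0.95]

# The first pair should score much higher than the second.
print(cosine_similarity(voting_benefit, why_vote))
print(cosine_similarity(voting_benefit, pasta_recipe))
```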

## Is this a "New" thing?

This is not quite a revolutionary discovery. At the end of the day, it is another way of encoding unstructured data and text into a format that our models can process.
Some earlier examples of encoding are one-hot encoding, count-based representations, and TF-IDF.

### One-hot encoding

It creates one very long vector per word, as long as the number of words in your vocabulary, filled with zeroes except for the position that represents the given word. The result is a very sparse vector, which makes it a poor choice for large vocabularies. This encoding is better suited to deep learning models working with small vocabularies or to encoding labels.
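
A minimal sketch of the idea, assuming a hypothetical five-word vocabulary (`one_hot` and `vocabulary` are illustrative names, not part of any library):

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Hypothetical five-word vocabulary.
vocabulary = ["card", "medicare", "premium", "replacement", "sign"]

print(one_hot("medicare", vocabulary))  # [0, 1, 0, 0, 0]
```

The vector grows with the vocabulary, and a word that is not in the vocabulary makes `vocabulary.index` raise a `ValueError`.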

### Count-based representations

They try to squeeze a whole sentence into a single vector. There are two main approaches: Bag of Words and Bag of N-Grams. Bag of Words counts the number of times each word appears in a document. Bag of N-Grams counts the number of times each group of N adjacent words, called an n-gram, appears in a document.
The problem with Bag of Words is that it does not take the context of each word into consideration. N-grams try to solve this by grabbing some of the adjacent words as context, but that is still not enough to capture the meaning of a sentence, and it is far less efficient than embeddings.
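
The difference between the two approaches can be sketched with the standard library. `bag_of_ngrams` is a hypothetical helper that splits on whitespace, which is a simplification of real tokenization:

```python
from collections import Counter

def bag_of_ngrams(text, n=1):
    """Count every run of n consecutive words in the text."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)

a = "the dog bites the man"
b = "the man bites the dog"

print(bag_of_ngrams(a) == bag_of_ngrams(b))            # True: word order is lost
print(bag_of_ngrams(a, n=2) == bag_of_ngrams(b, n=2))  # False: bigrams keep some order
```

With `n=1` the two sentences are indistinguishable even though they mean opposite things; bigrams recover part of the ordering at the cost of a much larger set of possible features.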

### TF-IDF

It keeps track of how many times a word appears in a document and how often the same word occurs in the other documents throughout the entire training data.
This aims to identify common words like "a", "is", or "he" and to distinguish them from words that are context-specific.
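
A minimal sketch of the computation, assuming the plain textbook variant (term frequency times `log(N / document frequency)`); libraries typically use smoothed variants, so their numbers will differ. The three-document corpus is hypothetical:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical mini corpus.
documents = [
    "what is medicare",
    "what is tricare",
    "what is the premium",
]
scores = tf_idf(documents)
print(scores[0])
```

Words that appear in every document ("what", "is") score zero, while "medicare" gets a positive weight because it is specific to the first document.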

Although these approaches have been helpful in NLP for years, they have serious shortcomings:

- They take little or no context into account
- They cannot deal with unknown words that were not seen during the training phase
- They create very sparse vectors

## Embeddings

Embeddings aim to represent words in a vector space using dense vectors, where similar words are closer together and dissimilar words are further apart. This is a very powerful concept because it allows us to use mathematical operations on vectors to answer questions about words. For example, we can calculate the distance between two words and find the closest word to a given word.
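
As a rough illustration of finding the closest word, the sketch below uses Euclidean distance over toy two-dimensional vectors. The vectors and the `closest_word` helper are invented for this example; real word embeddings are learned and have hundreds of dimensions.

```python
import math

# Toy 2-D word vectors, invented for illustration; real embeddings are learned.
word_vectors = {
    "doctor": [0.90, 0.80],
    "nurse": [0.85, 0.75],
    "hospital": [0.70, 0.90],
    "banana": [-0.60, 0.10],
}

def closest_word(word, vectors):
    """Return the other word whose vector has the smallest Euclidean distance."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: math.dist(target, vectors[w]))

print(closest_word("doctor", word_vectors))  # nurse
```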


There are many pre-trained, open source word embeddings available, and they are usually trained on large corpora of text. Some examples are Word2Vec, GloVe, and FastText.

Although it is possible to train your own embedding, using it as a layer before the core of your model to create vector representations of the ingested text, doing so would be very time- and resource-consuming and would require a considerable text corpus.
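
To make the idea of an embedding layer concrete, here is a minimal sketch, assuming the layer is simply a lookup table from token ids to dense vectors. `EmbeddingLayer`, the five-word vocabulary, and the dimension of 4 are hypothetical; the values are random and untrained, whereas a real layer would learn them from a large corpus.

```python
import random

class EmbeddingLayer:
    """Lookup table mapping each token id to a dense vector."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # Randomly initialised; training would adjust these values.
        self.table = [
            [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)
        ]

    def __call__(self, token_ids):
        # One vector per token id, in the same order as the input.
        return [self.table[i] for i in token_ids]

# Hypothetical vocabulary mapping words to token ids.
vocabulary = {"how": 0, "do": 1, "i": 2, "sign": 3, "up": 4}
layer = EmbeddingLayer(vocab_size=len(vocabulary), dim=4)

token_ids = [vocabulary[w] for w in "how do i sign up".split()]
vectors = layer(token_ids)
print(len(vectors), len(vectors[0]))  # 5 4
```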


## Embeddings as a Service

What follows is a step-by-step guide to implementing a FAQ engine using open source technologies and embedding models, thanks to the Hugging Face Inference API.



![](/../assets/80_getting_started_with_embeddings/thumbnail.png)

## 1. Embedding a dataset


First, we need to install the dependencies, set up the authentication credentials for the Inference API, and pick the model we want to use.



In [None]:
#!pip install retry python-dotenv requests pandas huggingface-hub datasets sentence-transformers

In [2]:
import os
from dotenv import load_dotenv
import requests
from retry import retry
import pandas as pd
import torch
from datasets import load_dataset
from sentence_transformers.util import semantic_search
load_dotenv()


True

In [3]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

The first time you generate the embeddings it may take a while (approximately 20 seconds) for the API to return them.

We use the `retry` decorator so that if a call to `query(texts)` fails, it waits 10 seconds and tries again, for up to three attempts in total.

This happens because on the first request the model needs to be downloaded and loaded on the server; subsequent calls are much faster.

In [4]:
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}

In [5]:
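@retry(tries=3, delay=10)
def query(texts):
    # Send the texts to the Inference API; a successful call returns one embedding per text.
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    if isinstance(result, list):
        return result
    # Any other payload is treated as a failure (typically the model is still loading).
    # Raising here lets the retry decorator wait and try again instead of returning None.
    raise RuntimeError(
        f"The Inference API did not return embeddings (the model may still be loading): {result}"
    )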
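
For readers unfamiliar with the decorator, the behaviour described above can be approximated by hand. This is only a sketch of the wait-and-retry logic, not the `retry` library's actual implementation; `call_with_retry` and `flaky` are hypothetical names, and the delay is set to 0 so the example runs instantly.

```python
import time

def call_with_retry(func, tries=3, delay=10):
    """Call func(); on an exception wait `delay` seconds and try again.

    Gives up and re-raises the last exception after `tries` attempts in total.
    """
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception:
            if attempt == tries:
                raise
            time.sleep(delay)

# Hypothetical flaky function: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model is loading")
    return "ok"

result = call_with_retry(flaky, tries=3, delay=0)
print(result, calls["count"])  # ok 3
```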

## About the example

We will use Medicare FAQs as our text corpus. At this stage we could just as well base our data on PDF files, Word documents, websites, or any other format that we can extract text from.

In [6]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans’ Benefits?"]

output = query(texts)


In [11]:
print(f'{len(texts)}/{len(output)} questions embedded \n\n Example embedding:\n Dim:{len(output[0])} \n value: {output[0]}')


13/13 questions embedded 

 Example embedding:
 Dim:384 
 value: [-0.023889435455203056, 0.055258527398109436, -0.011654871515929699, -0.033414289355278015, -0.01226054597645998, -0.024872783571481705, -0.012663397006690502, 0.025345897302031517, 0.018508462235331535, -0.08350814133882523, -0.0930199921131134, 0.014486260712146759, -0.017410926520824432, -0.08834367990493774, -0.004479092080146074, -0.04632588103413582, -0.013193883001804352, 0.03538179397583008, 0.062311142683029175, 0.048589639365673065, -0.05911844223737717, 0.05413540080189705, -0.06439687311649323, 0.03402399271726608, 0.006636396516114473, 0.03591707721352577, -0.0678376629948616, -0.01773529127240181, -0.012721833772957325, 0.046462420374155045, 0.10864363610744476, 0.023821400478482246, -0.02699640765786171, 0.037173960357904434, 0.09759815037250519, -0.027030108496546745, -0.04542990401387215, 0.031817324459552765, -0.033746279776096344, -0.015198435634374619, -0.021535668522119522, 0.014811212196946144, -0.02

Let's make a DataFrame out of it so that it is easier to work with.

In [12]:

embeddings = pd.DataFrame(output)

In [31]:
print(embeddings.iloc[0])

0     -0.023889
1      0.055259
2     -0.011655
3     -0.033414
4     -0.012261
         ...   
379    0.027754
380    0.020411
381    0.005778
382    0.034098
383   -0.006889
Name: 0, Length: 384, dtype: float64


## 2. Host embeddings on the Hugging Face Hub

Hugging Face works as a cloud directory of datasets, models, and metrics. It is a place where you can find models and datasets shared by the community and share your own, with Git LFS handling the large files.

In [None]:
!huggingface-cli repo create embedded_faqs_medicare --type dataset

In [None]:
# This command is required to install git-lfs; however, it is already installed in Colab instances.
#!git lfs install

Now clone the empty dataset repository to your local machine:

In [None]:
!git clone https://{Your HF username}:{your token}@huggingface.co/datasets/ITESM/embedded_faqs_medicare

Cloning into 'embedded_faqs_medicare'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.


### Save our Embeddings

At this point we could have opted to store our embeddings in a vector store such as Pinecone, Chroma, or FAISS. However, for the sake of simplicity, we will save them to a CSV file.

In [14]:
embeddings.to_csv("embedded_faqs_medicare_test/embeddings.csv", index=False)
print(embeddings.shape)

(13, 384)


*cd* into our repo directory

In [None]:
%cd embedded_faqs_medicare_test

/content/embedded_faqs_medicare


We need to track our file in the version control system, then commit it and push it to the remote repository. This is done with the following commands:

In [None]:
!git lfs track *.csv
!git add .gitattributes
!git add embeddings.csv
!git commit -m "First version of the embedded_faqs_medicare dataset"
!git push

Tracking "embeddings.csv"


## 3. Get the most similar Frequently Asked Questions to a query

We get the dataset from the `train` split of the loaded repository object, convert it to a pandas DataFrame, and then to a NumPy array that torch can turn into a tensor. It is important to cast the data to float32 so that it matches the dtype of the query embedding and avoids errors during the similarity search.

In [18]:
faqs_embeddings = load_dataset('AlejoTorres2001/embedded_faqs_medicare_test')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

Downloading data: 100%|██████████| 106k/106k [00:00<00:00, 147kB/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00,  1.34it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 317.25it/s]
Generating train split: 13 examples [00:00, 356.37 examples/s]


In [26]:
question = ["tell me how to sign un for Medicare"]
output = query(question)

We will create an embedding of the query that can represent its semantic meaning. We then compare it to each embedding in our FAQ dataset to identify which is closest to the query in vector space.

In [27]:
query_embeddings = torch.FloatTensor(output)
print(f"The size of our embedded dataset is {dataset_embeddings.shape} and of our embedded query is {query_embeddings.shape}.")

The size of our embedded dataset is torch.Size([13, 384]) and of our embedded query is torch.Size([1, 384]).


The last step is to perform a semantic search on the embedding space. This is done by using the cosine similarity between the embedding of the query and the embedding of each document. The result is a list of documents sorted by similarity.
In this case we retrieve the top 5 most similar documents to the query.

In [32]:

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)

print(hits)

[[{'corpus_id': 3, 'score': 0.7229963541030884}, {'corpus_id': 5, 'score': 0.6329377889633179}, {'corpus_id': 2, 'score': 0.5877507328987122}, {'corpus_id': 0, 'score': 0.5482667684555054}, {'corpus_id': 4, 'score': 0.50208580493927}]]
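
Roughly speaking, the search scores every corpus embedding against the query with cosine similarity and keeps the `top_k` highest scores. The sketch below reproduces that idea in plain Python for a single query; `simple_semantic_search` is a hypothetical helper, not the `sentence_transformers` implementation, and the three-dimensional vectors are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def simple_semantic_search(query_vector, corpus_vectors, top_k=5):
    """Score every corpus vector against the query and keep the best top_k."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vector, vector)}
        for i, vector in enumerate(corpus_vectors)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

# Toy 3-D vectors, invented for illustration (the real ones have 384 dimensions).
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vector = [0.9, 0.1, 0.0]

toy_hits = simple_semantic_search(query_vector, corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])  # [0, 1]
```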


By using the position of each FAQ in the original text corpus, we can retrieve the exact text chunk through the `corpus_id` inside the `hits` object.
This process would be the same if we had used a large text divided into chunks instead of a list of FAQs.

In [29]:
[texts[hit['corpus_id']] for hit in hits[0]]

['How do I sign up for Medicare?',
 'How do I sign up for Medicare Part B if I already have Part A?',
 'How do I terminate my Medicare Part B (medical insurance)?',
 'How do I get a replacement Medicare card?',
 'Can I sign up for Medicare Part B if I am working and have health insurance through an employer?']
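
For the long-text case, the corpus would first have to be split into chunks whose list positions play the role of `corpus_id`. A minimal sketch of one possible splitter, assuming fixed-size word windows with some overlap (`chunk_text`, `chunk_size`, and `overlap` are hypothetical names and values, not something this guide prescribes):

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words; consecutive chunks share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

# Placeholder text made of numbered words, so the windows are easy to inspect.
long_text = " ".join(f"word{i}" for i in range(12))
chunks = chunk_text(long_text, chunk_size=5, overlap=2)

# The list index of each chunk is what corpus_id would point to.
for corpus_id, chunk in enumerate(chunks):
    print(corpus_id, chunk)
```

The resulting `chunks` list could then be embedded with `query(chunks)` in the same way as the FAQ list.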