# An extended approach using embeddings
As stated in OpenAI's documentation, embeddings are commonly used for:
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)

Using an embedding model, we can obtain a vector representing the input text. If we do this for a set of documents and keep the resulting vectors in a vector store, we can measure the similarity between each document and the user prompt and retrieve the best match(es). We can then augment the system or user prompt with the retrieved information.
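To make the retrieval step concrete, here is a minimal sketch that uses hand-made 3-dimensional vectors in place of real embeddings. The vectors, the document names and the helpers `cosine_sim` and `rank_documents` are all made up for illustration; a real embedding model returns much longer vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    """Return document names sorted from most to least similar to the query."""
    scores = {name: cosine_sim(query_vec, vec) for name, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy "embeddings": hypothetical values, not produced by any model.
toy_doc_vecs = {
    "cv_designer": np.array([0.9, 0.1, 0.0]),
    "cv_data_scientist": np.array([0.1, 0.9, 0.2]),
}
toy_query_vec = np.array([0.2, 0.8, 0.1])

ranking = rank_documents(toy_query_vec, toy_doc_vecs)
print(ranking)  # ['cv_data_scientist', 'cv_designer']
```

The query vector points in nearly the same direction as the second document, so that document is ranked first. Its text is what would be added to the prompt.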

[Here](https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/) is an intuitive explanation of embeddings.

**IMPORTANT:** Be sure to load an embedding model through the `Model Embedding Setting` panel of the `Local Server` tab; otherwise the server will default to the main model and get stuck.

In [60]:
from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

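def get_embedding(text):
    # Newlines are replaced with spaces before the text is sent to the server.
    text = text.replace("\n", " ")
    # The model name is left empty; the embedding model loaded on the server is used (see the note above).
    embedding = client.embeddings.create(input=[text], model="").data[0].embedding
    # Shape (1, n_dims), the 2-D layout scikit-learn expects.
    embedding = np.array(embedding).reshape(1, -1)
    embedding = normalize(embedding)  # This didn't remove the bias 😭 (discussed below)
    return embedding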

def find_relevant_context(query, rag_docs, vector_store, top_k=1):
    query_embedding = get_embedding(query)
    # One scalar score per stored document (cosine_similarity returns a 1x1 array here).
    similarities = np.array([
        cosine_similarity(query_embedding, document_embedding)[0, 0]
        for document_embedding in vector_store
    ])

    # Indices of the top_k highest scores, best match first.
    top_k_indices = np.argsort(similarities)[-top_k:][::-1]

    return [rag_docs[idx] for idx in top_k_indices]

import pdfx
import re  # imported as a module so the built-in compile() is not shadowed

# Matches runs of two or more spaces.
regex_spaces = re.compile(r"  +")

def get_pdf_text(pdf_filepath):
    pdf = pdfx.PDFx(pdf_filepath)
    
    text = pdf.get_text()
    text = text.replace("\n", " ")
    text = regex_spaces.sub(" ", text)
    if pdf.get_references():
        text += " references: " + str(pdf.get_references_as_dict())
    
    return text

Now we load our documents and create a vector store to perform semantic search on.

In [66]:
rag_docs = [
    get_pdf_text("Sebastiaan-Indesteege-CV2024.pdf"),
    get_pdf_text("John-Doe-CV.pdf")
]

# One embedding per document, in the same order as rag_docs.
vector_store = [get_embedding(doc) for doc in rag_docs]

In [67]:
user_query = "salesman"
find_relevant_context(user_query, rag_docs, vector_store, top_k=1)

['John Doe Full Address ▪ City, State, ZIP ▪ Phone Number ▪ E-mail OBJECTIVE: Design apparel print for an innovative retail company EDUCATION : UNIVERSITY OF MINNESOTA City, State College of Design May 2011 Bachelor of Science in Graphic Design Cumulative GPA 3.93 , Dean’s List Twin cities Iron Range Scholarship WORK EXPERIENCE: AMERICAN EAGLE City, State Sales Associate July 2009 - present Collaborated with the store merchandiser creating displays to attract clientele Use my trend awareness to assist customers in their shopping experience Thoroughly scan every piece of merchandise for inventory control Process shipment to increase my product knowledge PLANET BEACH City, State Spa Consultant Aug. 2008 - present Sell retail and memb erships to meet company sales goals Build organizational skills by single handedly running all operating procedures Communicate with clients to fulfill their wants and needs Attend promotional events to market our services Handle cash and deposits du

In [68]:
user_query = "artist"
find_relevant_context(user_query, rag_docs, vector_store, top_k=1)

["Sebastiaan Indesteege Junior Data Scientist indesteege.sebastiaan@gmail.com 0489 10 29 01 Brussels, Belgium 31/07/1991 github.com/Huraqan sebastiaan-indesteege-08702a56 huraqan Following the arts and sciences that interest me, I hope to one day understand the essence of existence before my own fades away. Today artificial intelligence is more interesting than ever and I strive to know more. Languages English Français Nederlands Español Skills Programming C#, Python, GDScript, GLSL Machine Learning Scikit-learn, Huggingface transformers & diffusers, PyTorch, NLP: NLTK & Spacy, CV: Stable Diffusion, AUDIO: STT & TTS & RVC Data Analysis SQL, Pandas, Numpy, Matplotlib, Seaborn Deployment Streamlit, FastAPI, Docker, Gradio Visual Design Photoshop, Pixel Shading Music Production Composition, Mixing, Mastering Soft Skills Autodidact — I can learn alone People skills — I can lead a team Curious — I like to know moreProjects BeCode, Corporate use-cases & personal challenges 03/2024

Let's do an experiment with random strings to see whether both documents are retrieved with equal probability.

In [40]:
import random
import string

def generate_random_string(length):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for i in range(length))

Run the cell below as many times as you like: the cosine similarity score seems to be consistently biased towards the first document...

In [72]:
for _ in range(10):
    user_query = generate_random_string(length=20)
    print(find_relevant_context(user_query, rag_docs, vector_store, top_k=1))

["Sebastiaan Indesteege Junior Data Scientist indesteege.sebastiaan@gmail.com 0489 10 29 01 Brussels, Belgium 31/07/1991 github.com/Huraqan sebastiaan-indesteege-08702a56 huraqan Following the arts and sciences that interest me, I hope to one day understand the essence of existence before my own fades away. Today artificial intelligence is more interesting than ever and I strive to know more. Languages English Français Nederlands Español Skills Programming C#, Python, GDScript, GLSL Machine Learning Scikit-learn, Huggingface transformers & diffusers, PyTorch, NLP: NLTK & Spacy, CV: Stable Diffusion, AUDIO: STT & TTS & RVC Data Analysis SQL, Pandas, Numpy, Matplotlib, Seaborn Deployment Streamlit, FastAPI, Docker, Gradio Visual Design Photoshop, Pixel Shading Music Production Composition, Mixing, Mastering Soft Skills Autodidact — I can learn alone People skills — I can lead a team Curious — I like to know moreProjects BeCode, Corporate use-cases & personal challenges 03/2024
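To put a number on this rather than eyeballing the printed text, one option is to tally which document index wins over many random queries. The sketch below is self-contained: `count_top_hits` and `fake_embed` are hypothetical helpers, and `fake_embed` returns random vectors only so that the example runs without a server. In the notebook you would pass `get_embedding` and `vector_store` instead.

```python
from collections import Counter
import random
import string

import numpy as np

def count_top_hits(queries, doc_vectors, embed):
    """Count how often each document index is the best cosine match."""
    # Flatten each stored vector (works for shape (n,) and (1, n)) and L2-normalize.
    docs = np.vstack([np.ravel(v) for v in doc_vectors])
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    hits = Counter()
    for query in queries:
        q = np.ravel(embed(query))
        q = q / np.linalg.norm(q)
        # Dot products of unit vectors are cosine similarities.
        hits[int(np.argmax(docs @ q))] += 1
    return hits

# Stand-in embedder (assumption): random vectors, unrelated to the text.
rng = np.random.default_rng(0)
def fake_embed(text):
    return rng.normal(size=16)

fake_store = [fake_embed("doc 0"), fake_embed("doc 1")]
random.seed(0)
alphabet = string.ascii_letters + string.digits
queries = ["".join(random.choice(alphabet) for _ in range(20)) for _ in range(200)]

hits = count_top_hits(queries, fake_store, fake_embed)
print(hits)
```

With the real embeddings, a roughly even split between index 0 and index 1 would suggest no preference, while a count heavily skewed toward index 0 would back up the impression of bias.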

Why? Well, maybe we're missing an important step to make sure our embeddings don't become too general: chunking.

**Chunking** splits up a text into smaller pieces so that the embeddings capture finer detail. Usually some overlap is used to make sure the chunks preserve the context at their boundaries.
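As a rough illustration of the idea, here is a minimal character-based sketch. `chunk_text` and its parameters are hypothetical names, and real pipelines often split on sentences or tokens rather than raw characters.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

sample_text = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = chunk_text(sample_text, chunk_size=10, overlap=3)
print(sample_chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk would then get its own embedding, and the vector store would need to remember which document every chunk came from.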

Yeah, we're not doing that in this notebook. I'm sure plenty of libraries offer chunking along with a properly featured vector store.

It's time to move on to chatbots 😜

**Sidenote:** Some other possible causes for the bias suggested by ChatGPT (the irony):

- **Normalization Issues:** If the embeddings are not properly normalized, the similarity calculations might be skewed. (This does not seem to be the issue here: normalization has been applied, and cosine similarity ignores vector length anyway.)
- **Embeddings Distribution:** If one of your documents has an embedding that is closer to the centroid of all embeddings, it might appear more relevant.
- **Embedding Model Characteristics:** The embedding model might have a bias towards certain types of content, which can affect the similarity scores.
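Two of these points can be checked with a few lines of NumPy. First, cosine similarity already divides by the vector lengths, so L2-normalizing the embeddings beforehand cannot change the scores. Second, a rough way to probe the centroid idea is to subtract the mean embedding before comparing. The vectors below are synthetic, and `cosine_scores` and `mean_center` are hypothetical helper names, so treat this as a sketch of a diagnostic rather than a fix verified on the CVs.

```python
import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

def mean_center(vectors):
    """Subtract the mean vector (the centroid) from every row."""
    return vectors - vectors.mean(axis=0, keepdims=True)

rng = np.random.default_rng(42)
# Synthetic "embeddings" that share a large common component
# (an assumption made purely for illustration).
common = np.full(8, 5.0)
docs = common + rng.normal(size=(4, 8))
query = common + rng.normal(size=8)

raw = cosine_scores(query, docs)
scaled = cosine_scores(query / np.linalg.norm(query),
                       docs / np.linalg.norm(docs, axis=1, keepdims=True))
print(np.allclose(raw, scaled))  # normalizing first leaves the scores unchanged

# Centering: remove the shared component from documents and query alike.
centroid = docs.mean(axis=0)
centered = cosine_scores(query - centroid, mean_center(docs))
print(raw.round(3), centered.round(3))
```

If the raw scores are all close to each other while the centered scores are spread out, a shared component is dominating the comparison. That would fit the centroid explanation above.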

Be sure to let me know if you know more. My contact info is in my CV 😁