# Using Ollama for embedding

Finding similar datasets in a large collection can be difficult, especially when we are looking for conceptual rather than literal similarities. Often we do not know exactly what we are looking for and only have a vague idea. This is where similarity search comes in handy.

Related datasets can be found by embedding them in a high-dimensional vector space and then ranking them with a similarity metric. With mdmodels, this is done with the `embedding` function, which takes one or more datasets and creates a vector representation of them. We can then compare these vectors to find the most similar datasets.
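
To make the idea of a similarity metric concrete, here is a minimal sketch of cosine similarity between two vectors, using NumPy only. The three-dimensional vectors and their names are made up for illustration; real embeddings have far more dimensions and come from the embedding model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Invented toy "embeddings", not output of a real model.
hiking = np.array([0.9, 0.1, 0.0])
climbing = np.array([0.8, 0.3, 0.1])
accounting = np.array([0.0, 0.2, 0.9])

sim_related = cosine_similarity(hiking, climbing)      # vectors point in a similar direction
sim_unrelated = cosine_similarity(hiking, accounting)  # vectors are nearly orthogonal
sim_scaled = cosine_similarity(hiking, 10 * hiking)    # scaling does not change the angle
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why it is a common choice for comparing embeddings.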

In this example, we will be using a simple data model of persons and their hobbies. The task will be to find the most similar persons to a given query.

> To run this example, you need [Ollama](https://ollama.com/download) running on your machine with the model `mxbai-embed-large` available.
> You can also leave out `base_url` if you prefer to use OpenAI's embedding API.


In [3]:
import rich
import numpy as np

from mdmodels import DataModel
from mdmodels.llm import embedding

In [12]:
# Let's start by loading the data model and the precomputed embeddings
model = DataModel.from_markdown("model.md")
embeddings = np.load("embeddings.npy")

# Next, we load the persons from the JSONL file
with open("persons.jsonl", "r") as file:
    persons = [model.Person.model_validate_json(line) for line in file]
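
For readers unfamiliar with JSONL: each line of the file is a standalone JSON document. The sketch below parses two hypothetical records with the standard library only. The field names `name`, `age` and `hobbies` follow the `Person` constructor used later in this example, while the values are invented. In the cell above, `model_validate_json` additionally validates each line against the data model.

```python
import io
import json

# Hypothetical JSONL content: one JSON object per line (values are invented).
sample = io.StringIO(
    '{"name": "Ada", "age": 36, "hobbies": ["chess", "math"]}\n'
    '{"name": "Lin", "age": 24, "hobbies": ["climbing"]}\n'
)

# Parse line by line, skipping empty lines.
records = [json.loads(line) for line in sample if line.strip()]
names = [record["name"] for record in records]
```

Because the search later maps similarity scores back to persons by position, the order of the records is assumed to match the order of the rows in `embeddings.npy`.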


To find the persons most similar to a given query, we define a function that takes a query and prints the best matches. The function may look complicated at first, but it essentially does the following:

1) Embed the query
2) Compute the cosine similarity between the query embedding and all person embeddings
3) Print the most similar persons

It makes use of the `embedding` function that we imported earlier, which expects either a `str`, a `DataModel` instance or a list of `DataModel` instances.
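
Steps 2 and 3 can be tried in isolation, without Ollama running. The following is a minimal sketch on toy data; the helper `top_n` and the toy vectors are hypothetical and not part of mdmodels. It assumes the stored embeddings form a matrix with one row per person and that the query embedding is a single vector of the same dimension.

```python
import numpy as np


def top_n(embeddings: np.ndarray, query: np.ndarray, n: int = 1) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the n rows closest to the query."""
    query = np.asarray(query, dtype=float).ravel()  # accept shape (d,) or (1, d)
    sims = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
    order = np.argsort(sims)[::-1][:n]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]


# Toy stand-ins for the stored person embeddings (one row per person).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([[0.9, 0.2, 0.0]])  # shape (1, 3)

ranked = top_n(toy_embeddings, toy_query, n=2)
```

Here `ranked` lists row 0 first and row 1 second, since the query points mostly along the first axis; row 2 is orthogonal to the query and is left out.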


In [10]:
def find(query: str | DataModel, n: int = 1) -> None:
    """Print the n persons most similar to the query."""

    console = rich.console.Console()

    # 1) Embed the query
    query_embedding = embedding(
        query,
        model="mxbai-embed-large",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )

    # 2) Compute the cosine similarity between the query and all persons
    def cosine_similarity(a, b):
        # a holds one embedding per row; b is the query embedding
        return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))

    similarities = cosine_similarity(embeddings, query_embedding)

    # 3) Print the most similar persons
    if n == 1:
        console.print(f"Query: {query}\nAnswer: {persons[np.argsort(similarities)[::-1][0]]}")
    else:
        top_indices = np.argsort(similarities)[::-1][:n]
        
        answer = f"Query: {query}\nAnswers:"
        
        for i in top_indices:
            answer += f"\n  - ({similarities[i]:.2f}) {persons[i]}"

        console.print(answer)

Great, now we can use the function to ask questions about the dataset! First, we will explore how the function works by searching for persons similar to a given string query. This is useful when we have a vague idea of what we are looking for but are not sure about the exact details. After that, we will create a new person and find the most similar ones in the dataset, which is useful when we want to find related datasets.


In [11]:
# Similarity search by string
find("I am young and I want to learn how to code. Who should I ask?")
find("Who is the oldest person?")
find("What are the hobbies of the person named Jane Smith?")
find("Who is the person who likes to play chess?")

# Create a new person and find related ones
new_person = model.Person(name="Jack", age=32, hobbies=["Rust", "Ableton"])
find(new_person, n=3)



## Conclusion

In this example, we have seen how embeddings can be used to find similar datasets within a larger collection. This is useful when we want to find related datasets but are not sure about the exact details. We have also seen how to use the `embedding` function to embed a query and then use cosine similarity to find the most similar datasets. This is, of course, a very simple example, but it shows the power of embeddings and how seamlessly mdmodels integrates with the LLM ecosystem.