# Basic embedding retrieval with Chroma & Monitoring with Langtrace

### 1. Set up Langtrace

- Sign up for a free account on [Langtrace](https://langtrace.ai)

- Create a Project and generate an API Key

- Install the Langtrace Python SDK and import the required modules


## Set up modules & environment

In [4]:
%pip install -Uq chromadb numpy datasets langtrace-python-sdk

Note: you may need to restart the kernel to use updated packages.


In [5]:
import os
# Paste the API key generated in step 1; never commit a real key to a shared notebook
os.environ['LANGTRACE_API_KEY'] = '<YOUR_LANGTRACE_API_KEY>'
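
Writing a key directly into a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is either already exported in the shell or typed in interactively. `ensure_api_key` is a hypothetical helper name and is not part of the Langtrace SDK:

```python
import os
from getpass import getpass


def ensure_api_key(var_name="LANGTRACE_API_KEY", ask=getpass):
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not part of the Langtrace SDK.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it never appears in cell output
        key = ask(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` before `langtrace.init()` would keep the secret out of the saved notebook while leaving the rest of the flow unchanged.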

## Initialize Langtrace & load the dataset
As a demonstration, we use the SciQ dataset, available from Hugging Face.

In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.



In [6]:
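from langtrace_python_sdk import langtrace
from datasets import load_dataset

# Initialize Langtrace; the API key is taken from the LANGTRACE_API_KEY environment variable set above
langtrace.init()

# Get the SciQ dataset from Hugging Face
dataset = load_dataset("sciq", split="train")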

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))


Number of questions with support:  10481


sent to https://langtrace.ai/api/trace with 1 spans
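
The filter step in the cell above keeps only rows whose `support` field is non-empty. A minimal sketch of the same predicate on plain Python dictionaries. The rows are invented for illustration and are not taken from SciQ, and the extra `.strip()` is an assumption that goes slightly beyond the notebook's `!= ""` check:

```python
def has_support(row):
    """True when the row carries a non-blank supporting passage."""
    return row.get("support", "").strip() != ""


# Toy rows (invented for illustration, not real SciQ records)
rows = [
    {"question": "q1", "support": "Some supporting passage."},
    {"question": "q2", "support": ""},
    {"question": "q3", "support": "   "},
]

kept = [row for row in rows if has_support(row)]
print(len(kept), "of", len(rows), "rows kept")
```

With the `datasets` library, the same function could be passed directly as `dataset.filter(has_support)`.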


## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the SciQ dataset into Chroma with just a few lines of code.

In [11]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()
# Create a Chroma collection to store the supporting evidence. We don't need to specify an embedding function; the default will be used.
# get_or_create_collection is used instead of create_collection so that re-running this cell
# does not fail with "UniqueConstraintError: Collection sciq_supports already exists".
collection = client.get_or_create_collection("sciq_supports")
# Embed and store the first 100 supports for this demo
collection.add(
    ids=[str(i) for i in range(100)],  # IDs are just strings
    documents=dataset["support"][:100],
    metadatas=[{"type": "support"} for _ in range(100)],
)
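
The cell above loads only the first 100 supports. To load more, one option is to add records in fixed-size batches. A minimal sketch, assuming a hypothetical helper `batched_records` and an arbitrary batch size (neither comes from the notebook or from Chroma):

```python
def batched_records(documents, batch_size=256):
    """Yield (ids, docs, metadatas) tuples covering `documents` in order.

    Hypothetical helper; IDs are the string form of each document's position,
    matching the ID scheme used in the notebook cell.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        docs = list(documents[start:start + batch_size])
        ids = [str(start + offset) for offset in range(len(docs))]
        metadatas = [{"type": "support"} for _ in docs]
        yield ids, docs, metadatas


# Example with placeholder strings standing in for real supports
for ids, docs, metadatas in batched_records(["a", "b", "c", "d", "e"], batch_size=2):
    print(ids, docs)
```

Each yielded tuple could then be passed on as `collection.add(ids=ids, documents=docs, metadatas=metadatas)`.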

## Querying the data

Once the data is loaded, we can use Chroma to find supporting evidence for the questions in the dataset.
In this example, we retrieve the most relevant result according to the embedding similarity score.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.
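
To make "most relevant according to embedding similarity" concrete, here is a toy sketch of an exact nearest-neighbor lookup using squared Euclidean distance. It is not how Chroma is implemented (as far as I know, Chroma uses an approximate index and a real embedding model); the three-dimensional vectors and the `nearest` helper are invented for illustration:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, n_results=1):
    """Return (doc_id, distance) pairs for the closest documents, nearest first."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Toy "embeddings" (invented; real embeddings have hundreds of dimensions)
doc_vecs = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.0, 0.0, 1.0],
}
top = nearest([0.9, 0.1, 0.0], doc_vecs)
print(top)
```

A smaller distance means a closer match, so the first pair returned plays the role of the single result requested with `n_results=1`.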

In [12]:
results = collection.query(
    query_texts=dataset["question"][:15],
    n_results=1,  # return only the single closest support per question
)
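
`collection.query` returns a dictionary of parallel lists with one inner list per query text, so `results['documents'][i][0]` is the top hit for the i-th question. A minimal sketch of pairing each question with its best match and distance, assuming distances are included in the result (which I believe is the default). `top_hits` is a hypothetical helper, and `mock_results` is hand-written to mimic that shape rather than produced by Chroma:

```python
def top_hits(questions, results):
    """Pair each question with its best-matching document and distance."""
    hits = []
    for i, question in enumerate(questions):
        docs = results["documents"][i]
        dists = results["distances"][i]
        if not docs:  # no match returned for this query
            hits.append((question, None, None))
            continue
        hits.append((question, docs[0], dists[0]))
    return hits


# Hand-written stand-in for a query result (values are invented)
mock_results = {
    "ids": [["7"], ["3"]],
    "documents": [["support text A"], ["support text B"]],
    "distances": [[0.42], [0.97]],
}
for question, doc, dist in top_hits(["question one", "question two"], mock_results):
    print(f"{question!r} -> {doc!r} (distance {dist})")
```

Looking at the distance alongside the text can help spot weak matches, where the retrieved support is only loosely related to the question.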

We display the query questions along with their retrieved supports:

In [14]:
# Print the question and the corresponding support
for i, q in enumerate(dataset['question'][:15]):
    print(f"Question: {q}")
    print(f"Retrieved support: {results['documents'][i][0]}")
    print()

Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Retrieved support: Agents of Decomposition The fungus-like protist saprobes are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes. For instance, many types of oomycetes grow on dead animals or algae. Saprobic protists have the essential function of returning inorganic nutrients to the soil and water. This process allows for new plant growth, which in turn generates sustenance for other organisms along the food chain. Indeed, without saprobe species, such as protists, fungi, and bacteria, life would cease to exist as all organic carbon became “tied up” in dead organisms.

Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Retrieved support: Without Coriolis Effect the global winds would blow north to south