# Basic embedding retrieval with Chroma

This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.

## What are embeddings?

Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for AI-powered tools and algorithms. They can represent text, images, and soon audio and video.

To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than the vectors of dissimilar data.
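The idea of "nearer" vectors can be made concrete with a small sketch. The three-dimensional vectors below are hand-written toy values, not the output of a real embedding model, and cosine similarity is only one common way to compare vectors; a vector store may use a different distance function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", chosen by hand for illustration only.
toy_embeddings = {
    "a cat sat on the mat": [0.9, 0.1, 0.0],
    "a kitten rested on the rug": [0.8, 0.2, 0.1],
    "stock prices fell sharply": [0.0, 0.1, 0.9],
}

cat = toy_embeddings["a cat sat on the mat"]
kitten = toy_embeddings["a kitten rested on the rug"]
stocks = toy_embeddings["stock prices fell sharply"]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, stocks)

print(f"cat vs kitten: {sim_related:.3f}")
print(f"cat vs stocks: {sim_unrelated:.3f}")
```

With these toy values, the two sentences about animals score much higher than the unrelated pair, which is the property a trained embedding model is meant to provide.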

## Embeddings and retrieval

We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.
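As a rough sketch of what similarity search means, the snippet below ranks a few documents by their distance to a query vector and keeps the closest ones. The two-dimensional vectors are made up by hand, and `squared_l2` and `top_k` are hypothetical helper names for this illustration; a real vector store does this at scale with real embeddings and an index.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, corpus, k=1):
    """Return the k (document, distance) pairs closest to query_vector."""
    scored = [(doc, squared_l2(query_vector, vec)) for doc, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical documents with hand-assigned toy vectors (not real embeddings).
corpus = {
    "Photosynthesis converts light into chemical energy.": [0.9, 0.1],
    "Mitochondria produce ATP for the cell.": [0.7, 0.4],
    "The Coriolis effect deflects global winds.": [0.1, 0.9],
}

# Pretend this is the embedding of the question "Why do winds curve?"
query_vector = [0.2, 0.8]

hits = top_k(query_vector, corpus, k=2)
for doc, distance in hits:
    print(f"{distance:.2f}  {doc}")
```

No keyword from the question has to appear in the retrieved document; only the position of the vectors matters.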


In [None]:
%pip install -Uq chromadb numpy datasets

## Example Dataset

As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).

Dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.


In [2]:
from google.colab import files
uploaded = files.upload()

Saving Exit_person_profiles.csv to Exit_person_profiles.csv


In [9]:
import pandas as pd
exit_df = pd.read_csv('Exit_person_profiles.csv')
exit_df.head(1)

| Unnamed: 0 | Name | Email | Membership Duration | Interests | Expertise | Ambition | Family Status | Location | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Aiden Smith | Aiden.Smith@email.com | 4 years | Healthcare, Nonprofit | Public health, Fundraising | Run a successful nonprofit organization | Single | Washington D.C. | Recently started a new job as a fundraiser fo... |


In [None]:
# Get the SciQ dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset("sciq", split="train")

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

Number of questions with support:  10481


## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the SciQ dataset into Chroma with just a few lines of code.


In [31]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [50]:
# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding function; the default will be used.
collection4 = client.create_collection('sample_collection4')
# Original code: collection = client.create_collection("sciq_supports")

In [24]:
# Open the CSV and store each line (including the header row) as a string in the "lines" list
with open('Exit_person_profiles.csv', 'r') as f:
    lines = f.readlines()

In [34]:
# Create a list of IDs, one per line of the CSV, to be fed into a Chroma collection
ids = [str(i) for i in range(len(lines))]

In [51]:
# Embed and store the lines in the collection
collection4.add(
    ids=ids,  # IDs are just strings
    documents=lines,
    metadatas=[{"type": "support"} for _ in range(len(lines))],
)
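Reading the file with `readlines()` also stores the header row as a document and leaves the CSV quoting in the text. One possible alternative, sketched below, parses the rows with the standard `csv` module and builds the three parallel lists from them. The sample rows, the reduced column set, the `build_records` helper, and the choice of `name` as metadata are all assumptions made for illustration, not part of the original notebook.

```python
import csv
import io

# Hypothetical two-row sample using a subset of the profile columns;
# the values are made up and stand in for the real CSV file.
sample_csv = io.StringIO(
    "Name,Interests,Expertise,Location\n"
    'Person A,"Healthcare, Nonprofit","Public health, Fundraising",City X\n'
    'Person B,"Design, Architecture",Interior design,City Y\n'
)


def build_records(csv_file):
    """Turn CSV rows into parallel lists of ids, documents, and metadatas."""
    ids, documents, metadatas = [], [], []
    # DictReader consumes the header row, so it never becomes a document.
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        ids.append(str(row_number))
        # Label each value with its column name so the text is self-describing.
        documents.append(
            ", ".join(f"{column}: {value.strip()}" for column, value in row.items())
        )
        metadatas.append({"name": row["Name"].strip()})
    return ids, documents, metadatas


ids, documents, metadatas = build_records(sample_csv)
print(ids)
print(documents[0])
print(metadatas)
```

The resulting lists have the same shape as the `ids`, `documents`, and `metadatas` arguments used in the cell above, so they could be passed to `add` in the same way.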

In [None]:
# Original code: creates IDs and metadata for the first 100 entries and adds the first 100 supports from the dataset
'''collection.add(
    ids=[str(i) for i in range(0, 100)],  # IDs are just strings
    documents=dataset["support"][:100],
    metadatas=[{"type": "support"} for _ in range(0, 100)
    ],
)'''

In [62]:
# Query prompt
results = collection4.query(
    query_texts=["I'm looking for someone who knows how to build a home"],
    n_results=4
)

In [64]:
results['documents']

[['Paxon Wallace,Paxon.Wallace@email.com, 1 year ," Homesteading, renewable energy, DIY projects "," Solar panel installation, wind turbine maintenance, carpentry ", Build a self-sufficient homestead that runs entirely on renewable energy sources and share his experiences with others through workshops and tutorials , Single ," rural area, Vermont ", Is a skilled carpenter and has experience installing solar panels and wind turbines. He is interested in learning more about permaculture design and natural building techniques. \n',
  'Trace McLoughlin,Trace.McLoughlin@email.com, 8 months ," Home improvement, DIY projects, woodworking "," Stay-at-home dad, handyman, woodworker ", Start a custom furniture making business ," Married, three children ", Houston ," Shares step-by-step guides and videos of his DIY projects, offers tips and advice on home repair and improvement "\n',
  'Alexander Saunders,Alexander.Saunders@email.com, 7 months ," Design, Architecture "," Interior design, Sustaina

## Querying the data

Once the data is loaded, we can use Chroma to find supporting evidence for the questions in the dataset.
In this example, we retrieve the most relevant result according to the embedding similarity score.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.


In [None]:
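# Note: `collection` is the SciQ collection from the original code (commented out above), not `collection4`
results = collection.query(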
    query_texts=dataset["question"][:10],
    n_results=1)
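The query result is nested: for each key there is one inner list per query text, holding that query's hits in rank order, which is why a document is addressed as `results['documents'][i][0]`. The sketch below walks a hand-built dictionary of that shape. The values are hypothetical, and it assumes the result also carries `ids` and `distances` keys laid out the same way as `documents`; `flatten_results` is a helper name invented for this example.

```python
# Hand-built stand-in for a query result with two queries and two hits each.
hypothetical_results = {
    "ids": [["3", "7"], ["1", "4"]],
    "documents": [["doc 3 text", "doc 7 text"], ["doc 1 text", "doc 4 text"]],
    "distances": [[0.21, 0.35], [0.10, 0.42]],
}
queries = ["first question", "second question"]


def flatten_results(queries, results):
    """Flatten nested per-query hit lists into (query, rank, id, document, distance) rows."""
    rows = []
    per_query = zip(queries, results["ids"], results["documents"], results["distances"])
    for query, ids, docs, dists in per_query:
        for rank, (doc_id, doc, dist) in enumerate(zip(ids, docs, dists), start=1):
            rows.append((query, rank, doc_id, doc, dist))
    return rows


rows = flatten_results(queries, hypothetical_results)
for row in rows:
    print(row)
```

A flat list of rows like this can be easier to print, filter, or load into a table than the nested dictionary.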

We display the query questions along with their retrieved supports.

In [None]:
# Print the question and the corresponding support
for i, q in enumerate(dataset['question'][:10]):
    print(f"Question: {q}")
    print(f"Retrieved support: {results['documents'][i][0]}")
    print()

Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Retrieved support: Agents of Decomposition The fungus-like protist saprobes are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes. For instance, many types of oomycetes grow on dead animals or algae. Saprobic protists have the essential function of returning inorganic nutrients to the soil and water. This process allows for new plant growth, which in turn generates sustenance for other organisms along the food chain. Indeed, without saprobe species, such as protists, fungi, and bacteria, life would cease to exist as all organic carbon became “tied up” in dead organisms.

Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Retrieved support: Without Coriolis Effect the global winds would blow north to south

## What's next?

Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications.

The core embeddings-based retrieval functionality demonstrated here is at the heart of many powerful AI applications, like using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/chat_with_your_documents), as well as memory for agents like [BabyAGI](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).

Chroma is already integrated with many popular AI application frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html).

Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)

We are [hiring](https://trychroma.notion.site/careers-chroma-9d017c3007c7478ebd85bad854101497?pvs=4)!