# Vector databases with Chroma


We will use semantic search. Semantic search means searching by meaning, as opposed to lexical search, where the search engine looks for literal matches of the query words (or variants of them) without understanding the overall meaning of the query.
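To make the distinction concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings (an assumption for illustration only; an embedding model produces much longer vectors), and `lexical_match` and `cosine_similarity` are hypothetical helper names, not part of chromadb.

```python
import math

def lexical_match(query: str, document: str) -> bool:
    """Lexical search: True only if the texts share at least one literal word."""
    return bool(set(query.lower().split()) & set(document.lower().split()))

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (NOT produced by a real model)
toy_embeddings = {
    "film about gangsters": [0.9, 0.1, 0.0],
    "movie on mobsters": [0.8, 0.2, 0.1],
    "recipe for apple pie": [0.0, 0.1, 0.9],
}

query = "film about gangsters"
for doc in ("movie on mobsters", "recipe for apple pie"):
    similarity = cosine_similarity(toy_embeddings[query], toy_embeddings[doc])
    print(doc, "| lexical match:", lexical_match(query, doc), "| similarity:", round(similarity, 2))
```

Neither document shares a word with the query, so the lexical check fails for both, while the vector comparison still ranks the gangster-related text far above the recipe.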


### Sources
- PyPI: https://pypi.org/project/chromadb/
- GitHub: https://github.com/chroma-core/chroma
- Blog: https://blog.langchain.dev/langchain-chroma/
- Getting started: https://docs.trychroma.com/getting-started
- Semantic Search: https://en.wikipedia.org/wiki/Semantic_search

### Contents
0. Install packages
1. Getting started with chromadb
2. Downloading data

## 0. Install packages

In [14]:
!pip install chromadb


## 1. Getting started with chromadb

In [2]:
# Create a client (in-memory by default, so the data is not persisted)
import chromadb
chroma_client = chromadb.Client()

Using embedded DuckDB without persistence: data will be transient


In [3]:
# Create a collection; with no embedding_function argument, Chroma uses its default
collection = chroma_client.create_collection(name="my_collection")

No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction



## 2. Downloading data
We will use three Quentin Tarantino film scripts: Pulp Fiction, Reservoir Dogs, and Jackie Brown.

In [5]:
# Requires the third-party wget package (pip install wget)
import wget

# wget.download returns the name of the local file it saved
pulp_fiction = wget.download('https://assets.scriptslug.com/live/pdf/scripts/pulp-fiction-1994.pdf')
res_dogs = wget.download('https://assets.scriptslug.com/live/pdf/scripts/reservoir-dogs-1992.pdf')
jackie_brown = wget.download('https://assets.scriptslug.com/live/pdf/scripts/jackie-brown-1997.pdf')


In [8]:
import glob
my_pdfs = glob.glob('*.pdf')
my_pdfs

['pulp-fiction-1994.pdf',
 'jackie-brown-1997.pdf',
 'pulp-fiction-1994 (1).pdf',
 'reservoir-dogs-1992.pdf',
 'temp.pdf']
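The listing contains `pulp-fiction-1994 (1).pdf`, which suggests the download cell was run more than once and a second copy was saved under a numbered name. A minimal sketch of one way to avoid that, using only the standard library: `local_name` and `download_once` are hypothetical helpers, and whether the server accepts requests made through `urllib` is not verified here.

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_name(url: str) -> str:
    """Return the last path component of the URL, used as the local file name."""
    return os.path.basename(urlparse(url).path)

def download_once(url: str, directory: str = ".") -> str:
    """Download url into directory unless a file with that name already exists."""
    target = os.path.join(directory, local_name(url))
    if not os.path.exists(target):
        # Only hit the network when the file is missing
        urllib.request.urlretrieve(url, target)
    return target
```

Calling `download_once('https://assets.scriptslug.com/live/pdf/scripts/pulp-fiction-1994.pdf')` a second time would return the existing path instead of creating a duplicate.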

In [10]:
# Note: the documents here are file-name strings, so the names are what gets embedded
collection.add(
    documents=["pulp-fiction-1994.pdf", "jackie-brown-1997.pdf", "reservoir-dogs-1992.pdf"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2", "id3"]
)
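The call above passes three file-name strings as documents, so queries are matched against the names rather than the script text. To search the scripts themselves, the extracted text would have to be passed instead, usually split into smaller chunks. The following is a sketch under stated assumptions: `chunk_text` and `build_records` are hypothetical helpers, the PDF text is assumed to have been extracted into a string already (extraction is not shown), and the chunk size of 1000 characters is an arbitrary choice.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

def build_records(source: str, text: str, size: int = 1000, overlap: int = 100) -> dict:
    """Build parallel documents/metadatas/ids lists, one entry per chunk."""
    chunks = chunk_text(text, size, overlap)
    return {
        "documents": chunks,
        "metadatas": [{"source": source, "chunk": i} for i in range(len(chunks))],
        "ids": [f"{source}-{i}" for i in range(len(chunks))],
    }

# Placeholder text standing in for an extracted script (not real script content)
records = build_records("jackie-brown-1997.pdf", "placeholder script text " * 200)
print(len(records["documents"]), "chunks, first id:", records["ids"][0])

# With the collection created earlier, the chunks could then be added with:
# collection.add(**records)
```

Storing the file name in each chunk's metadata keeps the link back to the originating script, and the ids stay unique because they combine the file name with the chunk index.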

In [11]:
results = collection.query(
    query_texts=["Who is Jackie Brown"],
    n_results=2
)

In [12]:
results

{'ids': [['id2', 'id1']],
 'embeddings': None,
 'documents': [['jackie-brown-1997.pdf', 'pulp-fiction-1994.pdf']],
 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]],
 'distances': [[0.6836943030357361, 1.7194474935531616]]}
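Each field in the result holds one inner list per query text, which is why the values are nested one level deep. A smaller distance means a closer match, so `id2` (the Jackie Brown file name) is the nearest hit for this query. As a small sketch, the result can be flattened into ranked rows; the `results` dictionary below is a copy of the output shown above, and `rank_hits` is a hypothetical helper.

```python
# Copy of the query output shown above
results = {
    "ids": [["id2", "id1"]],
    "embeddings": None,
    "documents": [["jackie-brown-1997.pdf", "pulp-fiction-1994.pdf"]],
    "metadatas": [[{"source": "my_source"}, {"source": "my_source"}]],
    "distances": [[0.6836943030357361, 1.7194474935531616]],
}

def rank_hits(results: dict, query_index: int = 0) -> list[tuple]:
    """Return (id, document, distance) rows for one query, nearest first."""
    rows = zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    )
    return sorted(rows, key=lambda row: row[2])

for doc_id, document, distance in rank_hits(results):
    print(f"{doc_id}  {document}  distance={distance:.3f}")
```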