# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

## Vectorstore retrieval


In [4]:
from getpass import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Your API key: ")

### Similarity Search

In [5]:
from langchain.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
persist_directory = 'learn/chroma/'

In [6]:
embedding = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [7]:
print(vectordb._collection.count())

330


In [8]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [9]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [10]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [11]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [12]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

Last class, we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` (`MMR`) strives to achieve both relevance to the query *and diversity* among the results.
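To make the trade-off concrete, here is a minimal pure-Python sketch of greedy MMR selection over toy 2-D vectors. The function names (`cosine`, `mmr`), the `lambda_mult` weight, and the vectors are illustrative assumptions and not the vectorstore's internal code. It uses the common formulation in which each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 reduces to plain similarity ranking;
    lower values penalize candidates that resemble already-selected ones.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data (made up for illustration): docs 0 and 1 are near-duplicates,
# doc 2 is less similar to the query but points in a different direction.
query = [1.0, 0.0]
docs = [[1.0, 0.2], [1.0, 0.25], [1.0, -0.6]]

top_by_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
top_by_mmr = mmr(query, docs, k=2)

print("similarity:", top_by_similarity)
print("mmr:", top_by_mmr)
```

In this toy setup, plain similarity returns the two near-duplicates, while MMR keeps the best match and swaps the duplicate for the more distinct document. Setting `lambda_mult=1.0` in the sketch recovers the plain similarity ranking.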

In [17]:
question = "what is the blue planet?"
docs_ss = vectordb.similarity_search(question, k=3)

In [18]:
docs_ss[0].page_content[:100]

'Unlike the other 63 major planets and satellites in our solar system, the planet Earth is the\nonly o'

In [19]:
docs_ss[1].page_content[:100]

'www.har un ya hya.com - www.har un ya hya.net\nen.harunyahya.tv'

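Now run the same query with `MMR` and compare the results. In this run the first two hits match the similarity search; `MMR` changes the ranking only when a more diverse candidate outscores a redundant one.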

In [20]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [21]:
docs_mmr[0].page_content[:100]

'Unlike the other 63 major planets and satellites in our solar system, the planet Earth is the\nonly o'

In [22]:
docs_mmr[1].page_content[:100]

'www.har un ya hya.com - www.har un ya hya.net\nen.harunyahya.tv'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.

In [23]:
question = "what is the blue planet?"

In [24]:
docs = vectordb.similarity_search(question, k=5)

In [25]:
for d in docs:
    print(d.metadata)

{'page': 93, 'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf'}
{'page': 9, 'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf'}
{'page': 79, 'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf'}
{'page': 86, 'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf'}
{'page': 84, 'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf'}
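
Metadata like the `source` and `page` fields printed above can be used to narrow results to a specific document or page. The sketch below shows the idea in plain Python on made-up chunks. The `Chunk` class, the `filter_by_metadata` helper, and the file names are hypothetical stand-ins, not part of LangChain or Chroma.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, where):
    """Keep chunks whose metadata matches every key/value pair in `where`."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]


# Made-up chunks with the same metadata shape as the output above.
chunks = [
    Chunk("text from page 93", {"source": "book_a.pdf", "page": 93}),
    Chunk("text from page 9", {"source": "book_a.pdf", "page": 9}),
    Chunk("text from another file", {"source": "book_b.pdf", "page": 1}),
]

only_book_a = filter_by_metadata(chunks, {"source": "book_a.pdf"})
page_93 = filter_by_metadata(chunks, {"source": "book_a.pdf", "page": 93})

print([c.metadata for c in only_book_a])
print([c.metadata for c in page_93])
```

With a real vectorstore the filtering is usually done by the store itself. For example, LangChain's Chroma wrapper accepts a `filter` dictionary on `similarity_search`, as in `vectordb.similarity_search(question, k=3, filter={"source": "English_THE_CREATION_OF_THE_UNIVERSE.pdf"})`. Treat the exact argument name as something to confirm against the version you have installed.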
