# WIP QA Metadata Vector Store

Set up a simple Question-Answering system with LangChain and CassIO, using Cassandra as the Vector Database.

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._

In [None]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader

The following line imports the Cassandra flavor of a LangChain vector store:

In [None]:
from langchain.vectorstores.cassandra import Cassandra

A database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Astra DB instance_ is available. Adjust as needed.

In [None]:
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = 'astra_db' # 'astra_db'/'local'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

Both an LLM and an embedding function are required.

The cell below instantiates the LLM and embeddings of choice. The logic is kept in the notebook for clarity.

In [None]:
import os
from llm_choice import suggestLLMProvider

llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)

if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')

## NOTE: skip to "... or Load From Existing" if the table is already populated

## A minimal example

This is a minimal use of the Cassandra vector store. The store is created and filled in one step, then queried to retrieve the relevant parts of the indexed text. Those parts are stuffed into a prompt, which is finally used to answer a question.

The following creates an "index creator", which knows about the type of vector store, the embedding to use and how to preprocess the input text:

_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_

In [None]:
table_name = 'vs_test_md_' + llmProvider

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)
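
To give a feel for what `chunk_size=400` and `chunk_overlap=0` mean, here is a simplified, pure-Python sketch of greedy character-based chunking. It is not LangChain's implementation: the function name `split_into_chunks` and the choice of a blank line as separator are assumptions made for illustration only.

```python
def split_into_chunks(text, chunk_size=400, separator="\n\n"):
    """Greedy, simplified stand-in for a character-based splitter.

    Pieces separated by `separator` are merged until adding the next piece
    would push the chunk past `chunk_size` characters. A single piece longer
    than `chunk_size` is kept whole (no hard cut), so a chunk can exceed the limit.
    There is no overlap between consecutive chunks.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # the current chunk is full: close it and start a new one
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Four paragraphs of 150, 150, 150 and 500 characters.
sample = "\n\n".join(["a" * 150, "b" * 150, "c" * 150, "d" * 500])
chunks = split_into_chunks(sample, chunk_size=400)
print([len(c) for c in chunks])  # [302, 150, 500]
```

The first two paragraphs fit together in one chunk (150 + 2 + 150 characters), the third starts a new one, and the oversized fourth paragraph ends up alone and above the limit. Each chunk is later embedded and stored as one row.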

Load some local texts (three short stories by E. A. Poe will do):

In [None]:
loader1 = TextLoader('texts/amontillado.txt', encoding='utf8')
loader2 = TextLoader('texts/mask.txt', encoding='utf8')
loader3 = TextLoader('texts/manuscript.txt', encoding='utf8')
loaders = [loader1, loader2, loader3]

This takes a few seconds to run, as it must calculate embedding vectors for a number of chunks of the input text:

In [None]:
# Note: certain LLM providers need a workaround to evaluate batch embeddings
#       (as done in this cell, which inserts one chunk at a time).
#       As of 2023-06-29, Azure OpenAI would error with:
#           "InvalidRequestError: Too many inputs. The max number of inputs is 1"
if llmProvider == 'Azure_OpenAI':
    from langchain.indexes.vectorstore import VectorStoreIndexWrapper
    for loader in loaders:
        docs = loader.load()
        subdocs = index_creator.text_splitter.split_documents(docs)
        #
        print(f'subdocument {0} ...', end=' ')
        vs = index_creator.vectorstore_cls.from_documents(
            subdocs[:1],
            index_creator.embedding,
            **index_creator.vectorstore_kwargs,
        )
        print('done.')
        for sdi, sd in enumerate(subdocs[1:]):
            print(f'subdocument {sdi+1} ...', end=' ')
            # the keyword is `metadatas` (plural); `metadata=` would be silently ignored
            vs.add_texts(texts=[sd.page_content], metadatas=[sd.metadata])
            print('done.')
        #
    index = VectorStoreIndexWrapper(vectorstore=vs)

In [None]:
if llmProvider != 'Azure_OpenAI':
    index = index_creator.from_loaders(loaders)

_Note: depending on how you load rows into your store, there may be ways to add your own metadata (see the LangChain docs). For now, each row has a `source` metadata field holding the file path, and that is the field used below._
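
As a rough illustration of adding metadata of your own, the sketch below enriches chunks before they would be written to a store. The `Chunk` class is a hypothetical stand-in for a document object with `page_content` and `metadata` attributes, and `tag_chunks`, `author` and `chunk_index` are names invented for this example. None of this is part of the notebook's actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document: some text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_chunks(chunks, **extra):
    """Return copies of `chunks` with `extra` and a running index added to the metadata."""
    tagged = []
    for i, chunk in enumerate(chunks):
        # build a new dict so the original chunk's metadata is left untouched
        metadata = {**chunk.metadata, **extra, "chunk_index": i}
        tagged.append(Chunk(page_content=chunk.page_content, metadata=metadata))
    return tagged


chunks = [
    Chunk("first chunk of text", {"source": "texts/amontillado.txt"}),
    Chunk("second chunk of text", {"source": "texts/amontillado.txt"}),
]
tagged = tag_chunks(chunks, author="E. A. Poe")
print(tagged[1].metadata)
```

Any field added this way could then, in principle, be used in a `filter` just like `source` is used in the rest of this notebook.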

## ... or Load From Existing

Use the following cell if the table has been populated already:

In [None]:
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name='vs_test_md_' + llmProvider,
)
loaded_index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)

## QA with metadata

In [None]:
Q1 = "Is the storm scary?"
Q2 = "Who arrives in the room?"

### No metadata (baseline case)

In [None]:
print(loaded_index.query(Q1))
print("="*20)
print(loaded_index.query(Q2))

### With metadata

In [None]:
r_k = {"search_kwargs": {"filter": {"source": "texts/manuscript.txt"}}}

print(loaded_index.query(Q1, retriever_kwargs=r_k))
print("="*20)
print(loaded_index.query(Q2, retriever_kwargs=r_k))

In [None]:
r_k2 = {"search_kwargs": {"filter": {"source": "texts/amontillado.txt"}}}

print(loaded_index.query(Q1, retriever_kwargs=r_k2))
print("="*20)
print(loaded_index.query(Q2, retriever_kwargs=r_k2))
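
Conceptually, a metadata filter restricts the candidate rows to those whose metadata match exactly, and similarity ranking then happens within that subset. The toy sketch below shows this on in-memory data with made-up two-dimensional vectors. The `filtered_search` function is hypothetical and says nothing about how the Cassandra store applies the filter internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filtered_search(query_vec, rows, k=2, metadata_filter=None):
    """Keep rows matching every key/value in `metadata_filter`, then rank by cosine.

    `rows` is a list of (vector, text, metadata) tuples.
    """
    metadata_filter = metadata_filter or {}
    candidates = [
        row for row in rows
        if all(row[2].get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda row: cosine(query_vec, row[0]), reverse=True)
    return [(text, metadata) for _, text, metadata in ranked[:k]]


rows = [
    ([1.0, 0.0], "chunk A", {"source": "texts/manuscript.txt"}),
    ([0.9, 0.1], "chunk B", {"source": "texts/amontillado.txt"}),
    ([0.0, 1.0], "chunk C", {"source": "texts/manuscript.txt"}),
]
query = [1.0, 0.0]

unfiltered = filtered_search(query, rows, k=2)
only_manuscript = filtered_search(
    query, rows, k=2, metadata_filter={"source": "texts/manuscript.txt"}
)
print([text for text, _ in unfiltered])       # ['chunk A', 'chunk B']
print([text for text, _ in only_manuscript])  # ['chunk A', 'chunk C']
```

With the filter in place, "chunk B" is excluded even though it is the second-closest match. This is why the same question can get a different answer depending on the `source` it is restricted to.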

## Spawning a "retriever" from the index

### Baseline

In [None]:
retriever_0 = loaded_index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
})

In [None]:
retriever_0.get_relevant_documents(Q2)

### With metadata

In [None]:
retriever_m = loaded_index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
    'filter': {'source': 'texts/manuscript.txt'},
})
retriever_m.get_relevant_documents(Q2)

## MMR test

### No metadata

In [None]:
Qx = "Who is scared?"
for i, doc in enumerate(myCassandraVStore.search(Qx, search_type='mmr', k=2)):
    print(f'[{i:2}]: {doc.metadata["source"]} ==> {doc.page_content}')

### With metadata

In [None]:
for i, doc in enumerate(myCassandraVStore.search(Qx, search_type='mmr', k=2, filter={'source': 'texts/amontillado.txt'})):
    print(f'[{i:2}]: {doc.metadata["source"]} ==> {doc.page_content}')
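
For reference, here is a minimal pure-Python sketch of maximal marginal relevance (MMR) selection: at each step it picks the document that best balances relevance to the query against similarity to the documents already chosen. The function name `mmr`, the `lambda_mult` parameter and the toy vectors are assumptions for illustration. The store's own MMR search may differ in details such as how many candidates it fetches first.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of up to `k` documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks by relevance only; lower values penalize redundancy more.
    """
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [
    [1.0, 0.1],   # most relevant to the query
    [1.0, 0.12],  # nearly a duplicate of the first
    [0.8, -0.6],  # less relevant, but points in a different direction
]
query = [1.0, 0.0]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more diverse third vector. With `lambda_mult=1.0` the selection reduces to plain similarity ranking.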