In [None]:
%%capture
!pip install llama-index==0.10.25 llama-index-embeddings-cohere llama-index-llms-cohere qdrant-client llama-index-vector-stores-qdrant 

In [None]:
import os
from getpass import getpass

In [None]:
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")
os.environ['CO_API_KEY'] = CO_API_KEY  # so Cohere clients created later can read it

# Querying

- 📊 Now that you've loaded your data and built an index, it's time to focus on the core of an LLM application: querying.

- 🤖 At its simplest, querying is a single prompt call to an LLM: asking a question, requesting a summary, or giving more complex instructions.

- 🔗 For more advanced uses, querying can include repeated or chained prompt calls to an LLM, or even a reasoning loop across multiple components.

Let's first instantiate the `qdrant` vector store.

In [None]:
import os

file_path = 'persisted_storage/qdrant/docstore.json'
print(os.path.exists(file_path))

In [None]:
import qdrant_client
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext

embed_model = CohereEmbedding(model_name="embed-english-v3.0")

# initialize qdrant client
client = qdrant_client.QdrantClient(
    path="persisted_storage/qdrant"
)

vector_store = QdrantVectorStore(
    client=client, 
    collection_name="it_can_be_done",
    path="persisted_storage/qdrant/",
    prefer_grpc=True
)

# assign qdrant vector store to storage context
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    persist_dir="persisted_storage/qdrant"
    )

# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, 
    embed_model=embed_model,
    storage_context=storage_context
)

# 🧐 The `QueryEngine`

A `QueryEngine` is a higher-level construct that uses an `Index` (and, by extension, a `Retriever`) to answer queries.

It uses the `Retriever` to fetch the relevant data, then applies additional logic to generate a response to the query.

Here's what happens under the hood:

- 📚 **Retrieval**: Find and return the most relevant documents from the `Index` using strategies like "top-k" semantic retrieval.

- 🔧 **Postprocessing**: Optionally rerank, transform, or filter retrieved Nodes, often based on specific metadata like keywords.

- 🔄 **Response Synthesis**: Combine the query, relevant data, and prompt to generate a response from your LLM.
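
The three stages above can be sketched in plain Python. This is a toy illustration, not LlamaIndex's implementation: the corpus sentences are invented, word overlap stands in for embedding similarity, and `echo_llm` is a hypothetical stub that replaces a real LLM call.

```python
# Toy illustration only: word overlap stands in for embedding similarity,
# and `echo_llm` stands in for a real LLM call. All sentences are invented.
CORPUS = [
    "Persistence turns a setback into progress.",
    "A steady effort outlasts a lucky start.",
    "The weather today is sunny and mild.",
]


def words(text):
    """Lowercase the text and split it into a set of punctuation-free words."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, top_k=2):
    """Retrieval: score every text against the query and keep the top_k."""
    query_words = words(query)
    denom = max(len(query_words), 1)  # guard against an empty query
    scored = [(len(query_words & words(text)) / denom, text) for text in CORPUS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def postprocess(nodes, required_keyword):
    """Postprocessing: keep only nodes that mention the required keyword."""
    return [(s, t) for s, t in nodes if required_keyword.lower() in words(t)]


def synthesize(query, nodes, llm):
    """Response synthesis: combine the query and retrieved text into one prompt."""
    context = "\n".join(text for _, text in nodes)
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


def echo_llm(prompt):
    """Hypothetical stub: returns the prompt so you can inspect what was sent."""
    return prompt


query = "How does persistence help after a setback?"
nodes = postprocess(retrieve(query), required_keyword="setback")
print(synthesize(query, nodes, echo_llm))
```

Printing the prompt instead of calling a model makes it easy to see which retrieved text survived postprocessing and reached the synthesis step.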

Note that LlamaIndex offers [a wide variety of Query Engines](https://github.com/run-llama/llama_index/tree/main/llama-index-core/llama_index/core/query_engine). We won't touch on all of them in this course, but I encourage you to explore what's available and think about how you might use them.


In [None]:
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r")

query_engine = index.as_query_engine(llm=llm, streaming=True)

response = query_engine.query(
    "What do the Sikh Stoics believe?"
)

response.print_response_stream()

### Streaming response

In [None]:
response = query_engine.query(
    "What poems by Rudyard Kipling are in this book?"
)

response.print_response_stream()

### 💬 Chat Engine

In [None]:
chat_engine = index.as_chat_engine(llm=llm)

chat_engine.chat_repl()

### Chat modes

#### Simple

Chat with the LLM directly, without making use of a knowledge base. To use this mode set `chat_mode="simple"`.

Corresponds to [`SimpleChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/simple.py). 

#### Condense question

Generate a standalone question from the conversation context and the last message. Then, ask the query engine for a response. To use this mode set `chat_mode="condense_question"`.

Corresponds to [`CondenseQuestionChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/condense_question.py).

#### Context 

Retrieve text from the index based on the user's message. Utilize this context to formulate a response. To use this mode set `chat_mode="context"`.

Corresponds to [`ContextChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/context.py).

#### Condense plus context

Condense the conversation and the latest user message into a standalone question. Then build a context for the standalone question from a retriever. Finally, pass the context, along with the prompt and the user message, to the LLM to generate a response. To use this mode set `chat_mode="condense_plus_context"`.

Corresponds to [`CondensePlusContextChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py).
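
The condense-then-retrieve-then-answer flow can be sketched with stand-in components. This is a rough illustration under stated assumptions, not the `CondensePlusContextChatEngine` source: `condense`, `chat_turn`, `make_recording_llm`, and the prompt wording are all hypothetical, and the LLM and retriever are stubs.

```python
def condense(history, message, llm):
    """Step 1: ask the LLM to rewrite the follow-up as a standalone question."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (
        "Rewrite the follow-up message as a standalone question.\n"
        f"Chat history:\n{transcript}\n"
        f"Follow-up: {message}\n"
        "Standalone question:"
    )
    return llm(prompt)


def chat_turn(history, message, llm, retrieve):
    """Run one chat turn: condense, retrieve context, then answer."""
    # With no history there is nothing to condense, so use the message as-is.
    question = condense(history, message, llm) if history else message
    # Step 2: build the context from whatever the retriever returns.
    context = "\n".join(retrieve(question))
    # Step 3: answer using the context plus the original user message.
    answer = llm(f"Context:\n{context}\n\nUser: {message}\nAssistant:")
    history.append(("user", message))
    history.append(("assistant", answer))
    return question, answer


def make_recording_llm():
    """Hypothetical stub LLM that records prompts and returns numbered replies."""
    calls = []

    def llm(prompt):
        calls.append(prompt)
        return f"reply-{len(calls)}"

    return llm, calls


llm, calls = make_recording_llm()
history = [("user", "What poems by Rudyard Kipling are in this book?"),
           ("assistant", "stub answer")]
question, answer = chat_turn(
    history,
    "Which of them is the shortest?",
    llm,
    retrieve=lambda q: [f"passage retrieved for: {q}"],
)
print(question, answer, len(calls))
```

Note that each turn with existing history costs two LLM calls here: one to condense and one to answer.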

#### ReAct

Use a ReAct agent loop with query engine tools. To use this mode set `chat_mode="react"`.

Corresponds to [`ReActAgent`](https://github.com/run-llama/llama_index/blob/37c95965426bddae82cec1ad49d3aa82e8bfe819/llama-index-core/llama_index/core/agent/react/base.py#L36).

#### Best

Select the best chat engine based on the current LLM. To use this mode set `chat_mode="best"`.

Corresponds to `OpenAIAgent` if you are using an OpenAI model that supports the function calling API; otherwise, corresponds to `ReActAgent`.

In [None]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    llm=llm,
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about a book of poems called 'It Can Be Done'."
    ),
)

chat_engine.chat_repl()
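
The `token_limit=1500` setting above caps how much chat history is passed back to the LLM on each turn. The idea can be sketched as follows, assuming for illustration only that a token is a whitespace-separated word and that the oldest messages are dropped first. `ChatMemoryBuffer` uses a real tokenizer, and its exact trimming rules may differ; `count_tokens` and `trim_history` are hypothetical helpers.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages, token_limit):
    """Keep the most recent messages whose combined size fits in token_limit."""
    kept = []
    used = 0
    # Walk from newest to oldest, stopping at the first message that won't fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    # Restore chronological order before returning.
    return list(reversed(kept))


messages = ["one two three", "four five", "six"]
print(trim_history(messages, token_limit=3))
```

A smaller limit keeps prompts short and cheap, at the cost of the chat engine forgetting earlier turns sooner.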

# Customizing Querying

- 🔧 **Customizing Retrieval**: Use LlamaIndex's low-level composition API to adjust the `similarity_top_k` value for more granular control over query results.

- 📈 **Adding Post-Processing**: Implement a step to ensure only nodes meeting a minimum similarity score are included, balancing between data richness and relevance.

- 🎚️ **SimilarityPostprocessor**: Set a similarity score threshold, compatible only with embedding-based retrievers, to ensure high relevance.
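
To see how a top-k setting and a similarity cutoff interact, here is a small self-contained sketch. The vectors and document names are made up, `top_k_with_cutoff` is a hypothetical helper, and cosine similarity is an assumption for illustration; the metric actually used depends on how your vector store collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_with_cutoff(query_vec, doc_vecs, top_k, cutoff):
    """Take the top_k most similar documents, then drop any below the cutoff."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in scored[:top_k] if score >= cutoff]


# Invented 2-D "embeddings" for four documents.
docs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
query = [1.0, 0.0]

for name, score in top_k_with_cutoff(query, docs, top_k=3, cutoff=0.42):
    print(name, round(score, 3))
```

The cutoff is applied after the top-k selection, so the final list can be shorter than `top_k`, or even empty if nothing is similar enough.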

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# configure a retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# configure a post processor
similarity_processor = SimilarityPostprocessor(similarity_cutoff=0.42)

# configure a response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm)

# create a query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[similarity_processor],
)


In [None]:
query_engine.query("How do the authors express the theme of resilience in the face of challenges?")

In [None]:
query_engine.query("Compare the portrayal of internal versus external battles in the narratives and poems")

In [None]:
client.close()