<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/using-vectara-with-llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara and LlamaIndex

## About Vectara

[Vectara](https://vectara.com/) is the trusted GenAI and semantic search platform that provides an easy-to-use API for document indexing and querying. 

Vectara provides an end-to-end managed service for Retrieval Augmented Generation or [RAG](https://vectara.com/grounded-generation/), which includes:

1. A way to extract text from document files and chunk them into sentences.

2. The state-of-the-art [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model. Each text chunk is encoded into a vector embedding using Boomerang, and stored in the Vectara internal knowledge (vector+text) store. Thus, when using Vectara with LlamaIndex you do not need to call a separate embedder model - this happens automatically within the Vectara backend.

3. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and [MMR](https://vectara.com/get-diverse-results-and-comprehensive-summaries-with-vectaras-mmr-reranker/)).

4. An option to create a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview) based on the retrieved documents, including citations.

See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

The main benefits of using Vectara for a RAG application are:
* **Easy to use**: Vectara takes care of many of the details required for a fully functional, highly scalable, and robust RAG application, so you don't have to code up these pieces and maintain them over time.
* **Scalable and secure**: Building a GenAI application may seem easy at first, but scaling it is another matter. Vectara provides instant scalability to millions of documents while maintaining data security and privacy, as well as latency SLAs.

## About LlamaIndex

LlamaIndex is a "data framework" to help you build LLM apps:

1. It includes **data connectors** to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
2. It provides ways to **structure your data** (indices, graphs) so that this data can be easily used with LLMs.
3. It provides an **advanced retrieval/query interface over your data**: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.

LlamaIndex's high-level API allows beginner users to use LlamaIndex to ingest and query their data in just a few lines of code, whereas its lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.

Vectara is implemented in LlamaIndex as a [Managed Service](https://docs.llamaindex.ai/en/stable/community/integrations/managed_indices.html#vectara), which abstracts Vectara's powerful API so that it integrates easily into LlamaIndex.

In this notebook, we will demonstrate some of the great ways you can use Vectara together with LlamaIndex.

## Getting Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

To get started with Vectara, [sign up](https://vectara.com/integrations/llamaindex) (if you haven't already) and follow our [quickstart](https://docs.vectara.com/docs/quickstart) guide to create a corpus and an API key. 

Once you have these, you can provide them as environment variables, which will be used by the LlamaIndex code later on:

In [1]:
#!pip install -U llama-index llama-index-indices-managed-vectara arxiv

import os
os.environ['VECTARA_API_KEY'] = "<YOUR_VECTARA_API_KEY>"
os.environ['VECTARA_CORPUS_ID'] = "<YOUR_VECTARA_CORPUS_ID>"
os.environ['VECTARA_CUSTOMER_ID'] = "<YOUR_VECTARA_CUSTOMER_ID>"
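
Before moving on, it can help to confirm that all three variables are actually set. The following is a minimal sketch: `missing_vectara_vars` is a hypothetical helper (it is not part of LlamaIndex or Vectara), and treating values that still start with `<YOUR_` as missing is an assumption based on the placeholders used above.

```python
import os

# Variable names are taken from the cell above; the helper itself is illustrative.
REQUIRED_VARS = ("VECTARA_API_KEY", "VECTARA_CORPUS_ID", "VECTARA_CUSTOMER_ID")


def missing_vectara_vars(env=None):
    """Return the required variable names that are unset or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Treat empty values and untouched "<YOUR_...>" placeholders as missing.
        if not value or value.startswith("<YOUR_"):
            missing.append(name)
    return missing


# Example with an explicit mapping, so the result does not depend on your environment.
example_env = {
    "VECTARA_API_KEY": "abc123",
    "VECTARA_CORPUS_ID": "<YOUR_VECTARA_CORPUS_ID>",
}
print(missing_vectara_vars(example_env))  # ['VECTARA_CORPUS_ID', 'VECTARA_CUSTOMER_ID']
```

Calling `missing_vectara_vars()` with no argument checks `os.environ` instead of the example mapping.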



## Loading Data Into Vectara

As mentioned above, Vectara is a RAG managed service, and in many cases data may be uploaded to the index ahead of time (e.g. by using [Airbyte](https://docs.airbyte.com/integrations/destinations/vectara), directly via Vectara's [indexing API](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing) or using tools like [vectara-ingest](https://github.com/vectara/vectara-ingest)), but another easy way is via the VectaraIndex constructor: `from_documents()`.

For this notebook we will assume the Vectara corpus is empty and will load PDF documents from arXiv, using Python's [arxiv](https://github.com/lukasschwab/arxiv.py) library. We will pull in the top papers whose titles relate to embedding models or sentence embeddings:

In [2]:
import arxiv

client = arxiv.Client()
search = arxiv.Search(
  query = "(ti:embedding model) OR (ti:sentence embedding)",
  max_results = 100,
  sort_by = arxiv.SortCriterion.Relevance
)
papers = list(client.results(search))

In [3]:
[p.entry_id for p in papers][:5]

['http://arxiv.org/abs/2402.14776v1',
 'http://arxiv.org/abs/2007.01852v2',
 'http://arxiv.org/abs/1910.13291v1',
 'http://arxiv.org/abs/2104.06719v1',
 'http://arxiv.org/abs/1511.08198v3']

Next, download the arXiv papers and upload them into Vectara using `insert_file()`.

In [4]:
import shutil
from llama_index.indices.managed.vectara import VectaraIndex

data_folder = 'temp'
os.makedirs(data_folder, exist_ok=True)

# Create Vectara Index
index = VectaraIndex()

# Upload content for all papers
for paper in papers:
    try:
        paper_fname = paper.download_pdf(data_folder)
    except Exception as e:
        # paper_fname is not assigned when the download fails, so report the entry id instead
        print(f"Paper {paper.entry_id} failed to download with error {e}")
        continue
    metadata = {
        'url': paper.entry_id,
        'title': paper.title,
        'author': str(paper.authors[0]),
        'published': str(paper.published.date())
    }        
    index.insert_file(file_path=paper_fname, metadata=metadata)

shutil.rmtree(data_folder)
del papers, index

LLM is explicitly disabled. Using MockLLM.
Embeddings have been explicitly disabled. Using MockEmbedding.


Two important things to note here:
1. Vectara processes each uploaded file on the backend and performs appropriate chunking, so you don't need to apply any local processing or choose a chunking strategy.
2. We have used the fields `url`, `title`, `author`, and `published` as metadata fields (for simplicity, `author` is the first author when there are several). You will need to make sure those fields are defined in your Vectara corpus as [filterable metadata fields](https://docs.vectara.com/docs/learn/metadata-search-filtering/filter-overview) so that we can filter by them at query time.

So that's it for upload. 

## Querying with the VectaraIndex
We can now ask questions using the VectaraIndex.

In [5]:
index = VectaraIndex()
query = "What is sentence embedding?"

LLM is explicitly disabled. Using MockLLM.
Embeddings have been explicitly disabled. Using MockEmbedding.


In [6]:
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(query)
print(response)

Sentence embedding is a technique used in Natural Language Processing (NLP) that involves encoding sentences into fixed-length numerical vectors for various downstream tasks. This process typically includes an encoder, such as an LSTM encoder, to convert the input sentence into a meaningful representation [1]. The encoded sentence vectors can then be used for tasks like classification and inference, where a softmax classifier may be employed to make predictions based on the sentence embeddings [2]. Additionally, sentence embedding models may utilize both German and English encoders to generate embeddings for multilingual applications [3]. The goal of sentence embedding is to capture the semantic meaning of sentences in a continuous vector space, enabling efficient processing and analysis of textual data for a wide range of NLP tasks.


Note that the response here is fully generated by Vectara. There is no additional LLM involved (or API key you need to set up). The response also includes citations (marked in square brackets), which point to the references Vectara used to generate this response.
<br>
When we use `print(response)` it simply prints the response text. But the `response` object also has the citations:

In [7]:
[(inx, n.node.metadata['url']) for inx,n in enumerate(response.source_nodes)]

[(0, 'http://arxiv.org/abs/2305.03010v1'),
 (1, 'http://arxiv.org/abs/1904.05542v1'),
 (2, 'http://arxiv.org/abs/1904.05542v1'),
 (3, 'http://arxiv.org/abs/1904.05542v1'),
 (4, 'http://arxiv.org/abs/2305.15077v2')]
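
If you want to work with the citation markers programmatically, a small regular expression is enough to pull them out of the response text. The sketch below only extracts the numbers. How a citation number maps to a position in `response.source_nodes` is not spelled out in this notebook, so that step is deliberately left out. `cited_numbers` and the sample text are hypothetical.

```python
import re


def cited_numbers(text):
    """Return the distinct citation numbers found in square brackets, in order of first use."""
    seen = []
    # Match single markers like "[1]" as well as grouped markers like "[3, 5]".
    for group in re.findall(r"\[(\d+(?:\s*,\s*\d+)*)\]", text):
        for part in group.split(","):
            number = int(part)
            if number not in seen:
                seen.append(number)
    return seen


sample = "Encoders map sentences to vectors [1]. Fusion helps [3, 5]. See also [1]."
print(cited_numbers(sample))  # [1, 3, 5]
```

In the notebook you would call `cited_numbers(str(response))` on the generated answer.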

Vectara supports [maximal marginal relevance](https://docs.vectara.com/docs/api-reference/search-apis/reranking#maximal-marginal-relevance-mmr-reranker) (MMR) natively in the backend, and this is available as a query mode.

Let's see an example of how to use MMR. We will run the same query, but this time with `vectara_query_mode="mmr"`, where `mmr_diversity_bias=0.3` sets the tradeoff between relevance and diversity (0.0 is full relevance, 1.0 is only diversity):

In [8]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
    vectara_query_mode="mmr",
    mmr_k=50,
    mmr_diversity_bias=0.3,
)
response = query_engine.query(query)
print(response)

Sentence embedding is a crucial text processing technique in NLP, with various models proposed for tasks like author profiling, sentiment classification, and textual entailment [1]. Different methods like ELMo, BERT, and SBERT-WK have been used to compute sentence embeddings by leveraging deep contextualized word models and fusion techniques [3, 5]. These models aim to preserve the original sentence meanings effectively in the embedded vectors, enhancing performance in document classification and sentiment analysis tasks [3]. Additionally, exploring analogical relationships in sentence embedding spaces can reveal regularities and unique properties for semantic matching capabilities [6, 7]. The Relational Sentence Embedding (RSE) method introduces relation modeling to leverage multi-source relational data for improved generalizability in supervised sentence embedding learning [7].


In [9]:
[(inx, n.node.metadata['url']) for inx,n in enumerate(response.source_nodes)]

[(0, 'http://arxiv.org/abs/1703.03130v1'),
 (1, 'http://arxiv.org/abs/2305.15077v2'),
 (2, 'http://arxiv.org/abs/1808.05505v3'),
 (3, 'http://arxiv.org/abs/1904.05542v1'),
 (4, 'http://arxiv.org/abs/2002.09620v2')]

As you can see, the results are now reranked in a way that provides more diversity instead of maximizing pure relevance. This in turn results in a different set of chunks used to generate the response.
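
To build some intuition for what the diversity bias does, here is a small, self-contained sketch of greedy MMR reranking. It is a generic formulation of the technique, not Vectara's backend implementation. The function `mmr_rerank`, the toy scores, and the exact scoring formula are assumptions made for illustration.

```python
def mmr_rerank(relevance, similarity, top_k, diversity_bias):
    """Greedy MMR: pick items one at a time, trading relevance against redundancy.

    relevance: list of query-to-item scores.
    similarity: square matrix of item-to-item similarities.
    diversity_bias: 0.0 ranks by relevance only, 1.0 by diversity only.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < top_k:

        def score(i):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return (1.0 - diversity_bias) * relevance[i] - diversity_bias * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_rerank(relevance, similarity, top_k=2, diversity_bias=0.0))  # [0, 1]
print(mmr_rerank(relevance, similarity, top_k=2, diversity_bias=0.3))  # [0, 2]
```

With a bias of 0.0 the two near-duplicates are returned together. With a bias of 0.3 the second slot goes to the less relevant but more distinct item, which mirrors the change in source URLs seen above.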

So far we've used Vectara's internal summarization capability, which is the best way for most users.

You can still use LlamaIndex's standard VectorStore `as_query_engine()` method, in which case Vectara's summarization won't be used. Instead, an external LLM (like OpenAI's GPT-4 or similar) and a custom prompt from LlamaIndex generate the summary. For this option, just set `summary_enabled=False`.

For this you would need to specify your own OpenAI API key in the environment:

> `os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'`

In [10]:
from llama_index.llms.openai import OpenAI
llm=OpenAI(model="gpt-3.5-turbo", temperature=0)

In [11]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
    summary_enabled=False,
    llm=llm
)
response = query_engine.query(query)
print(response)

Sentence embedding refers to the process of converting a sentence into a fixed-length numerical vector representation that captures the meaning and context of the sentence.


## Using Auto Retriever with Vectara

LlamaIndex's auto-retriever functionality is really cool. 
It is most useful when you have metadata fields (like in our case of papers from Arxiv), and would like a query that references a metadata field to be automatically interpreted in the right way.

For example, if I ask "what is a paper about climate change risks published after 2020", the auto-retriever would (behind the scenes) interpret this as the query "what is a paper about climate change risks" along with a filter condition of "published > 2020".

Let's see how this works with the Vectara Index.
First, we have to define a `VectorStoreInfo` structure that describes the metadata fields the auto-retriever needs to know about to do its job:

In [12]:
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="information about a paper",
    metadata_info=[
        MetadataInfo(
            name="published",
            description="The date the paper was published",
            type="string",
        ),
        MetadataInfo(
            name="author",
            description="The author of the paper",
            type="string",
        ),
        MetadataInfo(
            name="title",
            description="The title of the paper",
            type="string",
        ),
        MetadataInfo(
            name="url",
            description="The URL for this paper",
            type="string",
        ),
    ],
)

Auto-retrieval is implemented as a query transformation that runs before Vectara is called.

Now we can define the `VectaraAutoRetriever`, which can perform auto-retrieval using Vectara:

In [13]:
from llama_index.indices.managed.vectara import VectaraAutoRetriever
retriever = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    verbose=True
)
res = retriever.retrieve("What is sentence embedding, based on papers before 2019?")
[(r.metadata['published'], r.text) for r in res]

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')


[('2017-03-09',
  'Instead of using a vector, we use a 2-D matrix\nto represent the embedding, with each row of the matrix attending on a different\npart of the sentence. We also propose a self-attention mechanism and a special\nregularization term for the model. As a side effect, the embedding comes with an\neasy way of visualizing what speciﬁc parts of the sentence are encoded into the\nembedding. We evaluate our model on 3 different tasks: author proﬁling, senti-\nment classiﬁcation and textual entailment. Results show that our model yields a\nsigniﬁcant performance gain compared to other sentence embedding methods in\nall of the 3 tasks.'),
 ('2018-08-16',
  'This\nproblem can be alleviated by obtaining more of para-\nphrase sentence pairs. Conclusion Sentence embedding is one of the most important text\nprocessing techniques in NLP. To date,  various sen-\ntence embedding models have been proposed and have\nyielded good performances in document classification\nand sentiment analys

As you can see, the auto-retriever was able to translate the natural language text into a shorter query and a proper filter condition (in this case `doc.published < '2019'`).
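
The verbose output shows two stages: a list of implicit filters as `(field, operator, value)` tuples, and a final filter string. The sketch below illustrates how such tuples could be rendered into that string format. It is not the actual `VectaraAutoRetriever` code: `build_filter_string` is a hypothetical helper, and joining several clauses with `and` is an assumption.

```python
def build_filter_string(filters):
    """Turn (field, operator, value) tuples into a filter expression on doc metadata."""
    clauses = []
    for field, operator, value in filters:
        # String values are quoted; numbers are written as-is.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        clauses.append(f"(doc.{field} {operator} {rendered})")
    # Joining clauses with "and" is an assumption made for this sketch.
    return " and ".join(clauses)


print(build_filter_string([("published", "<", "2019")]))  # (doc.published < '2019')
```

Because `published` is stored as a string, the comparison against `'2019'` is a string comparison, which works here since the dates are in `YYYY-MM-DD` format.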

We can also, of course, ask a question directly. For that we use the `VectaraQueryEngine`, which can work with the `VectaraAutoRetriever` directly:

In [14]:
from llama_index.indices.managed.vectara.query import VectaraQueryEngine
from llama_index.indices.managed.vectara import VectaraAutoRetriever

ar = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    summary_enabled=True,
    summary_num_results=5,
    verbose=True
)

query_engine = VectaraQueryEngine(retriever=ar)
response = query_engine.query("What is sentence embedding, based on papers before 2019?")
print(response)

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')
Sentence embedding is a crucial technique in natural language processing, aiming to represent the semantic meaning of sentences in a low-dimensional vector form. Various models have been proposed to generate embedding vectors that capture the essence of sentences, enhancing performance in tasks like machine translation, document classification, sentiment analysis, and more [4]. These models focus on preserving the original sentence meanings effectively in the embedded vectors, leading to significant performance gains compared to other methods [2]. By transforming sentences into structured representations, sentence embedding can improve the outcomes of NLP tasks, such as document classification and machine translation [4]. The Siamese Continuous Bag of Words (CBOW) model is one approach that efficiently estimates high-quality sentence embeddings

## Advanced querying with QueryFusionRetriever

The [QueryFusionRetriever](https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion.html#reciprocal-rerank-fusion-retriever) is an advanced query mechanism in which the original query is pre-processed to generate N variations. Each of these rephrased queries is then run against the Vectara engine, and rank fusion is used to combine the best results.

Let's see this in action:

In [15]:
query = "is SBERT a dual encoder? what type of DL architecture does it use?"
query_engine = index.as_query_engine(
    similarity_top_k=3,
    summary_enabled=False,
    llm=llm,
)
response = query_engine.query(query)
print(response)

SBERT is a dual encoder. It uses a dual-encoder architecture.


In [16]:
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
import nest_asyncio

rf_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=2)],
    similarity_top_k=2,
    num_queries=5,  # this includes the original query; set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)

nest_asyncio.apply()     # apply nested async to run in a notebook
query_engine = RetrieverQueryEngine.from_args(rf_retriever)
response = query_engine.query(query)
print(response)

Generated queries:
1. What is SBERT and how does it differ from other encoder models?
2. Comparison of SBERT with other dual encoder models in natural language processing.
3. Deep dive into the architecture of SBERT and its implementation in deep learning.
4. Exploring the benefits and drawbacks of using SBERT as a dual encoder in machine learning tasks.
SBERT is a dual encoder model. It uses a siamese network structure, which is a type of deep learning architecture.


We can see how the QueryFusionRetriever created additional query variations (they are displayed since we used `verbose=True`) and then the overall response includes the results fused together. This is very helpful in this case because the QueryFusionRetriever creates sub-questions that inquire about the specific architecture of SBERT which is relevant context to answering this question properly.
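
The `reciprocal_rerank` mode refers to reciprocal rank fusion. As a rough illustration of the idea, here is a minimal sketch of the standard formula, in which each document scores the sum of `1 / (k + rank)` over every result list it appears in. This is not LlamaIndex's implementation: the function name, the sample document ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids; k=60 is a conventional smoothing constant."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


results_per_query = [
    ["paper_a", "paper_b", "paper_c"],  # original query
    ["paper_b", "paper_a", "paper_d"],  # generated variation 1
    ["paper_b", "paper_c", "paper_a"],  # generated variation 2
]
print(reciprocal_rank_fusion(results_per_query))
# ['paper_b', 'paper_a', 'paper_c', 'paper_d']
```

A document that ranks consistently well across the query variations (`paper_b` here) ends up ahead of one that is top for a single variation only.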

## The Vectara-RAG LlamaPack

[Llama Packs](https://docs.llamaindex.ai/en/stable/community/llama_packs/root.html) are a community-driven hub of prepackaged modules for LlamaIndex.

Vectara's integration with LlamaIndex provides Vectara RAG, a Llama Pack with ready-to-go RAG.

Try it out for yourself by following these [instructions](https://github.com/run-llama/llama-hub/tree/main/llama_hub/llama_packs/vectara_rag).

## Summary

In this notebook we've seen various examples of using Vectara with LlamaIndex, which provides the following benefits:
* Vectara provides a complete RAG pipeline, so you don't have to deal with a lot of the details around data ingestion: pre-processing, chunking, embedding, etc. Instead all these steps are handled automatically and efficiently in Vectara. 
* Being a platform, Vectara uses its own internal Embedding model (Boomerang), its own vector storage, and calls the LLM for summarization, so you don't have to maintain separate API keys and relationships with additional vendors or install other products.
* Vectara is built for large-scale GenAI applications, and with the tools provided by LlamaIndex like Auto Retrieval and Query Fusion, you can easily build and test advanced RAG applications at enterprise scale.