# RAG with LlamaIndex + Milvus + Llama

References
- https://docs.llamaindex.ai/en/stable/examples/vector_stores/MilvusIndexDemo/
- https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/milvus/?h=milvusvectorstore#llama_index.vector_stores.milvus.MilvusVectorStore

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG

MY_CONFIG.DB_URI = './rag_2_llamaindex.db'
MY_CONFIG.COLLECTION_NAME = 'llamaindex_papers'
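
The `my_config` module is not shown in this notebook. A minimal sketch of what it might contain, assuming a plain class that holds the attributes used in the later steps. The model names and the embedding length are placeholder assumptions, not the values the notebook actually uses:

```python
# Hypothetical stand-in for my_config.py; every value below is an assumption.
class MyConfig:
    EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"    # assumed Hugging Face model id
    EMBEDDING_LENGTH = 384                        # must match the model's vector size
    LLM_MODEL = "meta/meta-llama-3-8b-instruct"   # assumed Replicate model id
    DB_URI = None                                 # set per notebook
    COLLECTION_NAME = None                        # set per notebook


MY_CONFIG = MyConfig()

# Same overrides as the configuration cell above.
MY_CONFIG.DB_URI = "./rag_2_llamaindex.db"
MY_CONFIG.COLLECTION_NAME = "llamaindex_papers"
```

Keeping the database path and collection name as per-notebook overrides lets several notebooks share one config module while writing to separate Milvus files.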

## Step-2: Setup Embeddings

In [2]:
# Route Hugging Face downloads through a mirror.
# Comment out the next two lines if https://huggingface.co/ is reachable.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
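
The mirror can also be made opt-in instead of being edited by hand. A minimal sketch, assuming a hypothetical `USE_HF_MIRROR` environment variable (this flag is not defined by Hugging Face or by this notebook) and a hypothetical helper `configure_hf_endpoint`:

```python
import os
from typing import Optional


def configure_hf_endpoint(use_mirror: bool,
                          mirror: str = "https://hf-mirror.com") -> Optional[str]:
    """Set HF_ENDPOINT to the mirror when requested; return the value in effect."""
    if use_mirror:
        os.environ["HF_ENDPOINT"] = mirror
    return os.environ.get("HF_ENDPOINT")


# USE_HF_MIRROR is a hypothetical flag, e.g. `export USE_HF_MIRROR=1`
# before starting Jupyter.
endpoint = configure_hf_endpoint(os.environ.get("USE_HF_MIRROR") == "1")
```

This should run before any Hugging Face library is imported, since the endpoint is typically read when the library is loaded.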

In [3]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(
    model_name = MY_CONFIG.EMBEDDING_MODEL
)
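
The `dim` value passed to Milvus in Step-3 has to match the length of the vectors this embedding model produces. A small sanity check, sketched with a hypothetical helper and a stand-in vector so it runs without downloading a model:

```python
from typing import Sequence


def check_embedding_dim(vector: Sequence[float], expected_dim: int) -> int:
    """Raise if the embedding length differs from the configured dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, expected {expected_dim}"
        )
    return len(vector)


# Stand-in vector; 384 is an assumed dimension, not the notebook's actual value.
# In the notebook the check would look like this (assumed usage):
#   vec = Settings.embed_model.get_text_embedding("hello")
#   check_embedding_dim(vec, MY_CONFIG.EMBEDDING_LENGTH)
fake_vector = [0.0] * 384
check_embedding_dim(fake_vector, 384)
```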



## Step-3: Connect to Milvus

In [4]:
# connect to vector db
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri=MY_CONFIG.DB_URI,
    dim=MY_CONFIG.EMBEDDING_LENGTH,
    collection_name=MY_CONFIG.COLLECTION_NAME,
    overwrite=False,  # keep the existing collection so we load the index from db
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print("✅ Connected Llama-index to Milvus instance: ", MY_CONFIG.DB_URI)

✅ Connected Llama-index to Milvus instance:  ./rag_2_llamaindex.db


## Step-4: Load Document Index from DB

In [5]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, storage_context=storage_context
)

print("✅ Loaded index from vector db:", MY_CONFIG.DB_URI)

✅ Loaded index from vector db: ./rag_2_llamaindex.db
CPU times: user 255 ms, sys: 18.9 ms, total: 274 ms
Wall time: 271 ms


## Step-5: Setup LLM

In [6]:
from llama_index.llms.replicate import Replicate
from llama_index.core import Settings

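# Requires a Replicate API token in the REPLICATE_API_TOKEN environment variable.
llm = Replicate(
    model=MY_CONFIG.LLM_MODEL,
    temperature=0.1,  # low temperature for more deterministic answers
)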

Settings.llm = llm

## Step-6: Query

In [7]:
query_engine = index.as_query_engine()
res = query_engine.query("What was the training data used to train Granite models?")
print(res)



Based on the provided context information, the training data used to train the Granite models includes:

* 3.5T to 4.5T tokens of code data
* Natural language datasets related to code
* High-quality data with two phases of training:
	+ Phase 1: 4 trillion tokens of code data comprising 116 languages
	+ Phase 2: 500B tokens (80% code and 20% language data) from various domains, including technical, mathematics, and web documents

Note that the data is tokenized via byte pair encoding (BPE) and the same tokenizer as StarCoder is employed.
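
Behind `query_engine.query(...)`, the question is embedded, the most similar chunks are fetched from the vector store, and those chunks are passed to the LLM as context. The retrieval step can be sketched in plain Python as below. This is an illustration with toy vectors and cosine similarity, not the actual LlamaIndex or Milvus implementation, and the metric configured in Milvus may differ:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones are much longer.
chunks = [
    ("Granite training data", [0.9, 0.1, 0.0]),
    ("Attention heads", [0.1, 0.9, 0.0]),
    ("Benchmark results", [0.5, 0.5, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
```

In LlamaIndex, the number of retrieved chunks can be set with `index.as_query_engine(similarity_top_k=...)`.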


In [8]:
query_engine = index.as_query_engine()
res = query_engine.query("What is attention mechanism?")
print(res)



Based on the provided context information, it appears that the attention mechanism is a technique used in the encoder self-attention in layer 5 of 6, which allows the model to focus on specific parts of the input when processing it. This is evident from the visualizations provided, which show the attention heads attending to distant dependencies in the input text.

In the first example, the attention heads are shown to attend to a distant dependency of the verb "making", completing the phrase "making...more difficult". In the second example, the attention heads are shown to exhibit behavior related to the structure of the sentence, with different heads performing different tasks.

From this, it can be inferred that the attention mechanism is a way for the model to selectively focus on certain parts of the input, allowing it to better understand the context and relationships between different elements in the input.


In [9]:
query_engine = index.as_query_engine()
res = query_engine.query("When was the moon landing?")
print(res)



I'm happy to help! However, I don't see any information about the moon landing in the provided context. The text appears to be discussing IBM Granite Code Models and their performance on various benchmarks. Therefore, I cannot provide an answer to the query about the moon landing. If you could provide more context or clarify the question, I'd be happy to try and assist you further!
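
The answer above shows the model declining because the retrieved chunks are unrelated to the question. One way to handle such cases before calling the LLM is to drop low-scoring chunks and return a fixed message when none remain. A minimal sketch with hypothetical names, toy scores, and an arbitrary cutoff; LlamaIndex ships a `SimilarityPostprocessor` with a `similarity_cutoff` parameter that serves a similar purpose:

```python
from typing import List, Tuple

NO_CONTEXT_MESSAGE = "No relevant context found for this question."


def filter_by_score(scored_chunks: List[Tuple[float, str]],
                    cutoff: float) -> List[str]:
    """Keep only chunks whose similarity score reaches the cutoff."""
    return [text for score, text in scored_chunks if score >= cutoff]


def build_context_or_fallback(scored_chunks: List[Tuple[float, str]],
                              cutoff: float = 0.5) -> str:
    """Join the surviving chunks, or return a fallback when none survive."""
    kept = filter_by_score(scored_chunks, cutoff)
    if not kept:
        return NO_CONTEXT_MESSAGE
    return "\n\n".join(kept)


# Toy data: (similarity score, chunk text). The scores are made up.
on_topic = [(0.82, "Chunk about Granite training data."), (0.31, "Unrelated chunk.")]
off_topic = [(0.12, "Benchmark table."), (0.08, "Attention visualization.")]
```

A suitable cutoff depends on the embedding model and the similarity metric, so it is best tuned on a few known on-topic and off-topic questions.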
