# RAG With llama-index + Milvus + Qwen - Part 2

References

- https://studio.nebius.com/
- https://docs.llamaindex.ai/en/stable/examples/vector_stores/MilvusIndexDemo/
- https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/milvus/?h=milvusvectorstore#llama_index.vector_stores.milvus.MilvusVectorStore

- 1) Check the `.env` file
- 2) Kernel: studio1

## Step-1: Configuration

In [23]:
import os
from dotenv import load_dotenv
load_dotenv()

if os.getenv('NEBIUS_API_KEY'):
    print ("✅ Found NEBIUS_API_KEY in environment, using it")
else:
    raise ValueError("❌ NEBIUS_API_KEY not found in environment. Please set it in .env file before running this script.")

✅ Found NEBIUS_API_KEY in environment, using it


In [3]:
! pip install -r requirements.txt


## Step-2: Setup Embedding Model

The embedding model can run locally (fast) or in the cloud.

If running locally:
- choose smaller models
- embedding is faster, but accuracy is lower

If running in the cloud:
- we can run large models (billions of parameters)

In [27]:
import os
from llama_index.core import Settings

# Option 1: Running embedding models on Nebius cloud
from llama_index.embeddings.nebius import NebiusEmbedding
EMBEDDING_MODEL = 'Qwen/Qwen3-Embedding-8B'  # 8B params
EMBEDDING_LENGTH = 4096  # Length of the embedding vector
Settings.embed_model = NebiusEmbedding(
                        model_name=EMBEDDING_MODEL,
                        embed_batch_size=50,  # Batch size for embedding (default is 10)
                        api_key=os.getenv("NEBIUS_API_KEY")  # if not specified here, it is taken from the environment variable
                       )

# Option 2: Running embedding models locally
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# Settings.embed_model = HuggingFaceEmbedding(
#     # model_name = 'sentence-transformers/all-MiniLM-L6-v2' # 23 M params
#     model_name = 'BAAI/bge-small-en-v1.5'  # 33M params
#     # model_name = 'Qwen/Qwen3-Embedding-0.6B'  # 600M params
#     # model_name = 'BAAI/bge-en-icl'  # 7B params
#     #model_name = 'intfloat/multilingual-e5-large-instruct'  # 560M params
# )
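
Switching between Option 1 and Option 2 also changes the vector length, and `EMBEDDING_LENGTH` is later passed to Milvus as `dim`. The sketch below keeps the two in sync with a small lookup table. The helper name `embedding_dim` and the table `EMBEDDING_DIMS` are hypothetical, and the dimensions are the values commonly listed on each model card, so verify them before relying on them.

```python
# Assumed output dimensions for models mentioned in this notebook.
# These values are taken from the model cards; double-check before use.
EMBEDDING_DIMS = {
    "Qwen/Qwen3-Embedding-8B": 4096,
    "Qwen/Qwen3-Embedding-0.6B": 1024,
    "BAAI/bge-small-en-v1.5": 384,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "intfloat/multilingual-e5-large-instruct": 1024,
}


def embedding_dim(model_name: str) -> int:
    """Return the vector length for a known model, or fail with a clear message."""
    try:
        return EMBEDDING_DIMS[model_name]
    except KeyError:
        raise ValueError(
            f"Unknown embedding dimension for {model_name!r}; add it to EMBEDDING_DIMS"
        ) from None


EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-8B"
EMBEDDING_LENGTH = embedding_dim(EMBEDDING_MODEL)
print(EMBEDDING_MODEL, EMBEDDING_LENGTH)
```

A more direct check, if an API call is acceptable, is to embed a short string with `Settings.embed_model.get_text_embedding("test")` and compare its `len()` with `EMBEDDING_LENGTH`.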



## Step-3: Connect to Milvus

In [28]:
from pymilvus import MilvusClient

DB_URI = './rag.db'  # For embedded instance
COLLECTION_NAME = 'rag'

milvus_client = MilvusClient(DB_URI)
print ("✅ Connected to Milvus instance: ", DB_URI)


✅ Connected to Milvus instance:  ./rag.db


In [29]:
# connect to vector db
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri=DB_URI,
    dim=EMBEDDING_LENGTH,
    collection_name=COLLECTION_NAME,
    overwrite=False,  # keep the existing collection and its data
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print ("✅ Connected Llama-index to Milvus instance: ", DB_URI )

✅ Connected Llama-index to Milvus instance:  ./rag.db


## Step-4: Load Document Index from DB

In [30]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, storage_context=storage_context)

print ("✅ Loaded index from vector db:", DB_URI )

✅ Loaded index from vector db: ./rag.db
CPU times: user 2.73 ms, sys: 0 ns, total: 2.73 ms
Wall time: 2.67 ms


## Step-5: Setup LLM

In [34]:
from llama_index.llms.nebius import NebiusLLM
from llama_index.core import Settings

Settings.llm = NebiusLLM(
                #model='openai/gpt-oss-120b',
                model='Qwen/Qwen3-30B-A3B',
                # model='deepseek-ai/DeepSeek-R1-0528',
                api_key=os.getenv("NEBIUS_API_KEY")  # if not specified, it is taken from the environment variable
    )


## Step-6: Query

In [35]:
query_engine = index.as_query_engine()
#res = query_engine.query("What is AstaP")
#print(res)

response_object = query_engine.query("What is AstaP")

# Print the response string (might be empty)
print("Final Answer:", str(response_object))

# 🔬 CRITICAL: Check what information was actually retrieved
print("\n--- Retrieved Source Nodes ---")
for i, node in enumerate(response_object.source_nodes):
    print(f"Node {i+1}:")
    print(f"Score (Similarity): {node.score}")
    print(f"Text Content: {node.node.text[:500]}...") # Print first 500 chars
    print("------")

Final Answer: Empty Response

--- Retrieved Source Nodes ---
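
An `Empty Response` with no source nodes suggests that retrieval returned nothing. One likely cause, though not the only one, is that the `rag` collection in `./rag.db` is missing or empty, for example if ingestion wrote to a different path or collection name. The sketch below checks this. The helper names `collection_row_count` and `describe_collection` are hypothetical; they only assume a client object with `has_collection` and `get_collection_stats` methods, as `MilvusClient` provides, and that the stats dictionary carries a `row_count` entry.

```python
from typing import Any, Optional


def collection_row_count(client: Any, collection_name: str) -> Optional[int]:
    """Return the number of rows in a collection, or None if it does not exist."""
    if not client.has_collection(collection_name):
        return None
    stats = client.get_collection_stats(collection_name)
    # Assumption: the stats dict exposes the entity count under 'row_count'.
    return int(stats.get("row_count", 0))


def describe_collection(client: Any, collection_name: str) -> str:
    """Turn the row count into a short human-readable diagnosis."""
    count = collection_row_count(client, collection_name)
    if count is None:
        return f"Collection '{collection_name}' does not exist"
    if count == 0:
        return f"Collection '{collection_name}' exists but is empty"
    return f"Collection '{collection_name}' holds {count} rows"
```

In this notebook it could be called as `print(describe_collection(milvus_client, COLLECTION_NAME))`. If the collection turns out to be missing or empty, the documents need to be ingested into the same `DB_URI` and `COLLECTION_NAME` before querying.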


## Making sure the model uses context

Let's ask a generic factual question: "When was the moon landing?"

The model should know the answer to this from its training data.

But since we are querying documents, we want the model to find answers only within the documents.

It should come back with something like "the provided context does not have information about the moon landing".

In [10]:
query_engine = index.as_query_engine()
res = query_engine.query("When was the moon landing?")
print(res)

Empty Response
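
Here `Empty Response` most likely means that no nodes were retrieved at all, so the LLM was never asked; it is not the refusal described above. Once retrieval does return nodes, one way to push the model toward context-only answers is a stricter question-answering prompt. The sketch below is an assumption, not the notebook's own method: the names `STRICT_QA_TEMPLATE` and `render_prompt` are hypothetical, and the wording is only one possible phrasing. The placeholders `{context_str}` and `{query_str}` follow the names LlamaIndex uses in its QA prompts.

```python
# A stricter QA prompt that tells the model to rely on retrieved context only.
STRICT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context above, not prior knowledge.\n"
    "If the context does not contain the answer, reply exactly:\n"
    "'The provided context does not have information about this.'\n"
    "Query: {query_str}\n"
    "Answer: "
)


def render_prompt(context_str: str, query_str: str) -> str:
    """Fill the template so the final prompt can be inspected before use."""
    return STRICT_QA_TEMPLATE.format(context_str=context_str, query_str=query_str)


print(render_prompt("AstaP is described in the documents.", "When was the moon landing?"))
```

To try it with the query engine, the template can be wrapped as `PromptTemplate(STRICT_QA_TEMPLATE)` (imported from `llama_index.core`) and passed as `index.as_query_engine(text_qa_template=...)`. Whether the model actually follows the instruction depends on the model and should be checked with questions like the moon-landing one.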
