# Naive RAG with Milvus and LangChain

This notebook implements a naive RAG pipeline with Milvus, LangChain, and Hugging Face models. It is meant as a starting point for your own code.


### Load (quantized) Phi-4 for Apple Silicon hardware

The default `transformers` implementation is too slow on my MacBook, even when it is set to use the `mps` device, so I use the `mlx-lm` library instead. On `cuda` platforms, I recommend `unsloth`.


In [1]:
%%capture
!pip install langchain_milvus # TODO: Get rid of warning message

In [2]:
%%capture
!pip install langchain_community langchain_huggingface

In [3]:
%%capture
## For CUDA platforms like Google Colab; comment out the Unsloth lines on Apple Silicon
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install "pymilvus[model]"

In [4]:
!pip install requests



In [5]:
import torch

if torch.backends.mps.is_available():
    from mlx_lm import load

    model, tokenizer = load(
        "mlx-community/phi-4-4bit"
    )  # <= replace with smaller model depending on WiFi bandwidth

elif torch.cuda.is_available():
    from unsloth import FastLanguageModel

    model_name = "unsloth/Phi-4-unsloth-bnb-4bit"
    max_seq_length = 2048
    load_in_4bit = True

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )

else:
    raise Exception(
        "You most likely don't have sufficient hardware to run this notebook... :("
    )

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

### Integration with LangChain


In [6]:
from langchain_core.messages import HumanMessage

if torch.backends.mps.is_available():
    from langchain_community.llms.mlx_pipeline import MLXPipeline as Pipeline
    from langchain_community.chat_models.mlx import ChatMLX as Chat

    llm = Pipeline(
        model=model,
        tokenizer=tokenizer,
        pipeline_kwargs={"max_tokens": 1024, "temp": 0.1},
    )

elif torch.cuda.is_available():
    import transformers
    from langchain_huggingface import HuggingFacePipeline as Pipeline
    from langchain_huggingface import ChatHuggingFace as Chat

    FastLanguageModel.for_inference(model)

    hf_pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        # device="cuda",
        # repetition_penalty=1.15,
        return_full_text=False,
        max_new_tokens=1024,
        # output_scores=True,
        # use_cache=False,
        # truncation=True
    )

    llm = Pipeline(pipeline=hf_pipeline)

chat = Chat(llm=llm)

Device set to use cuda:0


### Test language model

On Apple Silicon, you can ignore the warning. It comes from a breaking change introduced in one of the underlying libraries in the past couple of weeks, which is why I pin `mlx-lm==0.20.6`.


In [7]:
messages = [
    HumanMessage(
        content="What happens when an unstoppable force meets an immovable object?"
    ),
]

res = chat.invoke(messages)
print(res.content)

As a language model, I cannot provide a definitive answer to the question of what happens when an unstoppable force meets an immovable object, as it is a classic paradox that challenges the definitions of "unstoppable" and "immovable." This scenario is often used to explore concepts in philosophy, physics, and logic, highlighting the limitations of language and the nature of absolutes.

In philosophy, this paradox is used to question the nature of reality and the limits of human understanding. In physics, it challenges the principles of motion and force, as the existence of both an unstoppable force and an immovable object simultaneously defies the laws of classical mechanics.

Ultimately, the paradox serves as a thought experiment rather than a problem with a clear solution, encouraging deeper exploration of the concepts involved.


In [8]:
import os

import requests

# Replace with your GitHub repository details
OWNER = "microsoft"  # e.g., "octocat"
REPO = "vscode"  # e.g., "hello-world"

# Base URL for GitHub API
BASE_URL_ISSUES = f"https://api.github.com/repos/{OWNER}/{REPO}/issues"

# Optional token: needed for private repositories and to raise the rate limit.
# Unauthenticated requests are limited to 60 per hour, fewer than the 100 pages requested below.
# Assumption: the token is read from a GITHUB_TOKEN environment variable.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"} if GITHUB_TOKEN else {}


def get_all_issues(max_pages=100):
    """Fetch up to max_pages pages of open and closed issues."""
    issues = []
    page = 1  # GitHub paginates results (default 30 per page)

    while page <= max_pages:
        response = requests.get(
            BASE_URL_ISSUES,
            headers=HEADERS,
            params={"state": "all", "page": page},
            timeout=30,
        )
        response.raise_for_status()  # Raise an error for bad HTTP status codes
        data = response.json()
        if not data:
            break  # No more issues to fetch
        issues.extend(data)
        page += 1

    return issues


# Fetch issues
all_issues = get_all_issues()


HTTPError: 403 Client Error: rate limit exceeded for url: https://api.github.com/repos/microsoft/vscode/issues?state=all&page=61
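
The `403` above is GitHub's rate limit: unauthenticated clients get 60 REST requests per hour, so the loop fails on page 61 and `all_issues` is never assigned. The sketch below shows one way to stop gracefully and keep what was already fetched. It relies on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that GitHub sends; `seconds_until_reset`, `fetch_pages`, and the injectable `get_page` callable are hypothetical names introduced here so the logic can be tried without network access.

```python
import time
from typing import Callable, List, Mapping, Tuple


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait for a fresh quota (0.0 if requests remain)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # The reset time is given in seconds since the Unix epoch (UTC)
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


def fetch_pages(
    get_page: Callable[[int], Tuple[list, Mapping[str, str]]],
    max_pages: int,
) -> List:
    """Collect items page by page and stop early once the quota is used up."""
    items = []
    for page in range(1, max_pages + 1):
        data, headers = get_page(page)
        if not data:
            break  # No more items to fetch
        items.extend(data)
        if seconds_until_reset(headers, time.time()) > 0:
            print(f"Rate limit reached after page {page}; keeping {len(items)} items.")
            break
    return items
```

In the notebook, `get_page` could wrap the existing `requests.get(BASE_URL_ISSUES, params={"state": "all", "page": page})` call and return `(response.json(), response.headers)`. Adding `"per_page": 100` to the parameters would also fetch more issues per request.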

In [None]:
#print(all_issues)
#for issue in all_issues:
#  print("keys")
#  print(issue.keys())
#  print(issue['title'])
#  print(issue['body'])

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_issues(all_issues):
    # Build the splitter once and reuse it for every issue
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    all_docs = []
    for issue in all_issues:
        # The body is None for issues without a description
        concatenated = issue["title"] + " " + (issue.get("body") or "")
        # create_documents already splits the text into chunks
        all_docs.extend(text_splitter.create_documents([concatenated]))
    return all_docs


issue_chunks = split_issues(all_issues)
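
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # How far each new window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.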

In [None]:
docs = [chunk.page_content for chunk in issue_chunks]
print(len(docs))

In [None]:
# from pymilvus import model

# # If connection to https://huggingface.co/ failed, uncomment the following path
# # import os
# # os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# # This will download a small embedding model "paraphrase-albert-small-v2" (~50MB).
# embedding_fn = model.DefaultEmbeddingFunction()

# # Text strings to search from.
# # docs = [
# #     "Artificial intelligence was founded as an academic discipline in 1956.",
# #     "Alan Turing was the first person to conduct substantial research in AI.",
# #     "Born in Maida Vale, London, Turing was raised in southern England.",
# # ]

# vectors = embedding_fn.encode_documents(docs)
# # The output vector has 768 dimensions, matching the collection that we just created.
# print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)

# # Each entity has id, vector representation, raw text, and a subject label that we use
# # to demo metadata filtering later.
# data = [
#     {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
#     for i in range(len(vectors))
# ]

# print("Data has", len(data), "entities, each with fields: ", data[0].keys())
# print("Vector dim:", len(data[0]["vector"]))

### Prepare the Data


### Build naive RAG with Milvus and LangChain


In [None]:
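from langchain_huggingface import HuggingFaceEmbeddings

# Same sentence-transformers model, loaded through the langchain_huggingface wrapper
# (SentenceTransformerEmbeddings in langchain_community is a deprecated alias)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)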

In [None]:
from langchain_milvus import Milvus, Zilliz

vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=issue_chunks,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
    index_params={
        "metric_type": "COSINE",
        "index_type": "FLAT",  # <= NOTE: Currently a bug where langchain_milvus defaults to "HNSW" index, which doesn't work with Milvus Lite
        "params": {},
    },
)


### Test vector database


In [None]:
query = "Why is my UI slow?"
res = vectorstore.similarity_search(query, k=1)
print(res[0].page_content[0:1024] + "...")
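
Conceptually, a `FLAT` index with the `COSINE` metric compares the query embedding against every stored vector and returns the best matches. The pure-Python sketch below illustrates that idea on toy 2-dimensional vectors; `cosine_similarity` and `flat_search` are hypothetical helpers written for this explanation, not Milvus code, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flat_search(query_vec, doc_vecs, k=1):
    """Score every stored vector against the query and return the top k as (score, index)."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, idx in flat_search([2.0, 0.1], toy_vectors, k=2):
    print(f"vector {idx}: similarity {score:.3f}")
```

Exhaustive search like this is exact but scales linearly with the number of stored vectors, which is acceptable for a demo-sized collection.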

### Extra LangChain stuff


In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant that answers questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
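
To see what the model eventually receives, the sketch below fills a shortened copy of the template by hand. It assumes that placeholder substitution behaves roughly like `str.format` for this template; `FakeDoc`, `SHORT_TEMPLATE`, and the two example passages are made up for illustration and do not come from the issue data.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str


def format_docs(docs):
    # Same logic as in the notebook: join chunks with a blank line between them
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened version of the prompt template, for illustration only
SHORT_TEMPLATE = "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"

demo_docs = [
    FakeDoc("Scrolling lags in large files."),
    FakeDoc("Disabling extensions restores performance."),
]
filled = SHORT_TEMPLATE.format(
    context=format_docs(demo_docs), question="Why is my UI slow?"
)
print(filled)
```

The retrieved chunks end up inside the `<context>` tags, separated by blank lines, and the user's question inside the `<question>` tags.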

### LangChain Expression Language


In [None]:
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()

# Invoke the RAG chain with a specific question and retrieve the response
res = rag_chain.invoke(query)
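
The `|` syntax can look opaque at first. The toy sketch below mimics the idea with plain Python: each step wraps a function, `|` chains two steps, and a fan-out runs several branches on the same input, much like the dict at the start of `rag_chain`. `Step`, `fan_out`, and the `toy_*` names are hypothetical and far simpler than LangChain's actual runnables.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


toy_retriever = Step(lambda q: f"[documents matching '{q}']")  # placeholder retriever
toy_passthrough = Step(lambda q: q)  # forwards the question unchanged
toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_llm = Step(lambda text: text.upper())  # placeholder for the language model

toy_chain = fan_out(context=toy_retriever, question=toy_passthrough) | toy_prompt | toy_llm
print(toy_chain.invoke("Why is my UI slow?"))
```

The question flows into both branches, the resulting dict fills the prompt, and the prompt text is handed to the model, which is the same shape as the chain above.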

In [None]:
import textwrap

# TODO: Better text wrapping in Colab
print(textwrap.fill(res, width=80, replace_whitespace=False, drop_whitespace=False))
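
One possible approach to the wrapping TODO is to wrap each line of the response separately, so that the model's own line breaks and blank lines survive. This is a sketch under that assumption; `wrap_paragraphs` is a hypothetical helper and has not been tested against Colab's output area.

```python
import textwrap


def wrap_paragraphs(text, width=80):
    """Wrap every line on its own so existing line breaks are preserved."""
    wrapped = [
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in text.splitlines()
    ]
    return "\n".join(wrapped)
```

In the cell above, `print(wrap_paragraphs(res))` would replace the `textwrap.fill` call.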

### You have successfully built and run a RAG pipeline using Milvus, Hugging Face, and LangChain libraries!
