# Naive RAG with Milvus and LangChain

This notebook implements naive RAG with Milvus, LangChain, and Hugging Face models. It is meant as a starting point for your own code.


### Load (quantized) Phi-4 for Apple Silicon hardware

The default `transformers` implementation is too slow on my MacBook, even when it is set to use the `mps` device, so I use the `mlx-lm` library instead. On `cuda` platforms, I recommend `unsloth`.


In [1]:
%%capture
!pip install langchain_milvus # TODO: Get rid of warning message

In [2]:
%%capture
!pip install langchain_community langchain_huggingface

In [3]:
%%capture
# For CUDA platforms like Google Colab; comment these lines out on Apple Silicon
!pip install unsloth
# Also get the latest nightly Unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install "pymilvus[model]"

In [4]:
!pip install requests



In [7]:
import torch

if torch.backends.mps.is_available():
    from mlx_lm import load

    model, tokenizer = load(
        "mlx-community/phi-4-4bit"
    )  # <= replace with smaller model depending on WiFi bandwidth

elif torch.cuda.is_available():
    from unsloth import FastLanguageModel

    model_name = "unsloth/Phi-4-unsloth-bnb-4bit"
    max_seq_length = 2048
    load_in_4bit = True

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,  # use the variable defined above instead of a second hard-coded name
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )

else:
    raise Exception(
        "You most likely don't have sufficient hardware to run this notebook... :("
    )

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Integration with LangChain


In [8]:
from langchain_core.messages import HumanMessage

if torch.backends.mps.is_available():
    from langchain_community.llms.mlx_pipeline import MLXPipeline as Pipeline
    from langchain_community.chat_models.mlx import ChatMLX as Chat

    llm = Pipeline(
        model=model,
        tokenizer=tokenizer,
        pipeline_kwargs={"max_tokens": 1024, "temp": 0.1},
    )

elif torch.cuda.is_available():
    import transformers
    from langchain_huggingface import HuggingFacePipeline as Pipeline
    from langchain_huggingface import ChatHuggingFace as Chat

    FastLanguageModel.for_inference(model)

    hf_pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        # device="cuda",
        # repetition_penalty=1.15,
        return_full_text=False,
        max_new_tokens=1024,
        # output_scores=True,
        # use_cache=False,
        # truncation=True
    )

    llm = Pipeline(pipeline=hf_pipeline)

chat = Chat(llm=llm)

Device set to use cuda:0



### Test language model

On Apple Silicon, you can ignore the warning. It comes from a breaking change introduced in one of the underlying libraries within the past couple of weeks, which is why I pin `mlx-lm==0.20.6`.


In [13]:
import os

import requests

# Replace with your GitHub repository details
OWNER = "microsoft"  # e.g., "octocat"
REPO = "vscode"  # e.g., "hello-world"
# Read the token from the environment instead of hard-coding it;
# leave it unset to send unauthenticated (more tightly rate-limited) requests
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")

# Base URL for GitHub API
BASE_URL_ISSUES = f"https://api.github.com/repos/{OWNER}/{REPO}/issues"

# Headers (include the token if accessing private repositories or to increase the rate limit)
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"} if GITHUB_TOKEN else {}


def get_all_issues(max_pages=100):
    """Fetch issues page by page until a page is empty or max_pages is reached."""
    issues = []
    page = 1  # GitHub paginates results (default 30 per page)

    while page <= max_pages:
        response = requests.get(
            BASE_URL_ISSUES,
            headers=HEADERS,
            params={"state": "all", "page": page},
            timeout=30,  # Avoid hanging forever on a stalled connection
        )
        response.raise_for_status()  # Raise an error for bad HTTP status codes
        data = response.json()
        if not data:
            break  # No more issues to fetch
        issues.extend(data)  # Note: this endpoint also returns pull requests
        page += 1

    return issues


# Fetch issues
all_issues = get_all_issues()
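
The loop above stops on the first empty page or once the page cap is hit. Here is a minimal offline sketch of those two stop conditions. `fetch_all_pages`, `fake_fetch_page`, and `FAKE_ISSUES` are hypothetical stand-ins I made up so the logic can be tried without a network connection or a token; they are not part of the GitHub API or `requests`.

```python
from typing import Callable, List


def fetch_all_pages(
    fetch_page: Callable[[int], List[dict]], max_pages: int = 100
) -> List[dict]:
    """Collect items page by page until a page is empty or max_pages is reached."""
    items: List[dict] = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        if not data:
            break  # An empty page means there is nothing left to fetch
        items.extend(data)
    return items


# Hypothetical stand-in for the remote API: 65 fake issues, 30 per page
FAKE_ISSUES = [{"number": n, "title": f"Issue {n}"} for n in range(1, 66)]


def fake_fetch_page(page: int, per_page: int = 30) -> List[dict]:
    start = (page - 1) * per_page
    return FAKE_ISSUES[start : start + per_page]


fetched = fetch_all_pages(fake_fetch_page)  # stops on the empty 4th page
capped = fetch_all_pages(fake_fetch_page, max_pages=2)  # stops at the page cap
print(len(fetched), len(capped))
```

With the default page size of 30, a cap of 100 pages means at most 3,000 issues are fetched, which is worth keeping in mind for a repository as large as `microsoft/vscode`.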


In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_issues(all_issues):
    """Split each issue (title + body) into overlapping text chunks."""
    # Build the splitter once instead of once per issue
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    all_docs = []
    for issue in all_issues:
        body = issue.get("body")
        if body is None:
            continue  # Skip issues without a description
        concatenated = issue["title"] + " " + body
        # create_documents already splits the text into chunks
        all_docs.extend(text_splitter.create_documents([concatenated]))
    return all_docs


issue_chunks = split_issues(all_issues)
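
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts fixed-size windows with a shared overlap. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraph and line boundaries); `sliding_window_chunks` is a hypothetical helper used only for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
        start += step
    return windows


sample_text = "".join(str(i % 10) for i in range(25))  # 25 digits: 0123456789012...
toy_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(piece)
```

Each window repeats the last three characters of the previous one, which is the same idea as the 200-character overlap used above: context that straddles a chunk boundary still appears intact in at least one chunk.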

In [15]:
docs = [chunk.page_content for chunk in issue_chunks]
print(len(docs))

4207


### Prepare the Data


### Build naive RAG with Milvus and LangChain


In [16]:
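# SentenceTransformerEmbeddings from langchain_community is deprecated;
# langchain_huggingface (installed above) provides the replacement class
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)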


In [17]:
from langchain_milvus import Milvus, Zilliz

vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=issue_chunks,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
    # NOTE: langchain_milvus currently defaults to an "HNSW" index,
    # which doesn't work with Milvus Lite, so set "FLAT" explicitly
    index_params={
        "metric_type": "COSINE",
        "index_type": "FLAT",
        "params": {},
    },
)
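
Conceptually, a `FLAT` index with the `COSINE` metric is an exhaustive search: the query vector is compared against every stored vector and the best matches are returned. The sketch below shows that idea in plain Python on toy 2-D vectors. It is an illustration under that assumption, not Milvus's actual implementation, and `cosine_similarity` and `flat_search` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flat_search(
    query_vec: Sequence[float], vectors: List[Sequence[float]], k: int = 1
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and return the top k (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top_matches = flat_search([2.0, 0.1], toy_vectors, k=2)
print(top_matches)
```

Exhaustive search is exact but scales linearly with the number of vectors, which is fine for a few thousand chunks like here.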


### Test vector database


In [18]:
query = "Why is my UI slow?"
res = vectorstore.similarity_search(query, k=1)
print(res[0].page_content[0:1024] + "...")

Very slow performance of moving a chat into the editor I am having a Chat session that I move into the editor. The entire UI freezes when I do so, here is the perf trace:

[Trace-20250120T084527.json.zip](https://github.com/user-attachments/files/18473913/Trace-20250120T084527.json.zip)

<img width="1680" alt="Image" src="https://github.com/user-attachments/assets/abf2996a-1bee-412d-a885-4d6f8ed88fdf" />

Not sure if it matters, but I seem to have included a pasted image:

<img width="196" alt="Image" src="https://github.com/user-attachments/assets/906f1e68-4a0c-4131-98e7-8f086d3f67ea" />...


### Extra LangChain stuff


In [19]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant that provides answers to questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
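
To see what the model actually receives, here is a small sketch of how the retrieved chunks end up inside the prompt. It assumes the default f-string-style template format, for which plain `str.format` behaves roughly the same way. `FakeDoc` is a hypothetical stand-in for LangChain's document class, and `SHORT_TEMPLATE` is a shortened version of the template above.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only page_content is needed."""

    page_content: str


def format_docs(docs):
    # Same helper as above: join chunk texts with a blank line in between
    return "\n\n".join(doc.page_content for doc in docs)


SHORT_TEMPLATE = "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"

retrieved = [
    FakeDoc("The UI freezes when moving a chat into the editor."),
    FakeDoc("Saving a file takes up to half a minute."),
]
filled_prompt = SHORT_TEMPLATE.format(
    context=format_docs(retrieved), question="Why is my UI slow?"
)
print(filled_prompt)
```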

### LangChain Expression Language


In [20]:
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()

# Invoke the RAG chain with a specific question and retrieve the response
res = rag_chain.invoke(query)
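
If the `|` syntax looks like magic, the toy sketch below shows the general idea: each step wraps a function, `|` composes two steps, and a dict of steps runs every branch on the same input. This is only an illustration of the pattern, not LangChain's implementation; `Step`, `parallel`, and the fake retriever and model are all hypothetical.

```python
from typing import Any, Callable


class Step:
    """A tiny composable unit: wraps a function and supports `|` chaining."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b means: run a first, then feed its output into b
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)


def parallel(**steps: Step) -> Step:
    """Run every named step on the same input and collect the results in a dict."""
    return Step(lambda x: {name: step.invoke(x) for name, step in steps.items()})


passthrough = Step(lambda x: x)
fake_retriever = Step(lambda q: ["doc about " + q, "another doc"])
join_docs = Step(lambda found: "\n\n".join(found))
fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda text: text.upper())  # stands in for the language model

toy_chain = (
    parallel(context=fake_retriever | join_docs, question=passthrough)
    | fill_prompt
    | fake_llm
)
toy_result = toy_chain.invoke("slow ui")
print(toy_result)
```

The shape mirrors `rag_chain`: the question fans out into a retrieval branch and a passthrough branch, the resulting dict fills the prompt, and the prompt goes to the model.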

In [21]:
import textwrap

print(textwrap.fill(res, width=80, replace_whitespace=False, drop_whitespace=False))

 Based on the information provided, there are several potential reasons for the 
slow performance of your UI in VS Code:

1. **Performance Issue with Chat 
Session**: Moving a chat session into the editor causes the entire UI to freeze.
 This could be due to resource-intensive operations or conflicts within the 
editor when handling chat data, especially if a pasted image is included.

2. 
**General Slow Performance**: You mentioned that VS Code has become super slow, 
with tasks like saving or changing source control taking up to half a minute. 
This could be due to:
   - **High CPU Usage**: Your system has a powerful CPU 
(e.g., AMD Ryzen 7 6800H with 16 cores), but if VS Code is consuming too many 
resources, it could slow down.
   - **Memory Usage**: Your system has 63.19GB of
 total memory, with 20.83GB free. Ensure that other applications are not 
consuming excessive memory, which could impact VS Code's performance.
   - 
**Extensions**: Although you've disabled all extensions, s

### You have successfully built and run a RAG pipeline using Milvus, Hugging Face, and LangChain libraries!
