# Hybrid Search RAG Pipeline in LlamaIndex

This notebook demonstrates how to build a Hybrid Search Retrieval Augmented Generation (RAG) pipeline using Open Source Models using `HuggingFace` and `FastEmbeddings` with `llama-index`

## Setup

First, install the necessary packages:




## Install Necessary Packages and save Access Tokens:

In [1]:
!pip install llama-index-vector-stores-chroma
!pip install llama-index
!pip install llama-index-embeddings-fastembed

Collecting llama-index-vector-stores-chroma
  Downloading llama_index_vector_stores_chroma-0.1.10-py3-none-any.whl (5.0 kB)
Collecting chromadb<0.6.0,>=0.4.0 (from llama-index-vector-stores-chroma)
  Downloading chromadb-0.5.3-py3-none-any.whl (559 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m559.5/559.5 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-vector-stores-chroma)
  Downloading llama_index_core-0.10.53.post1-py3-none-any.whl (15.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.4/15.4 MB[0m [31m57.9 MB/s[0m eta [36m0:00:00[0m
Collecting chroma-hnswlib==0.7.3 (from chromadb<0.6.0,>=0.4.0->llama-index-vector-stores-chroma)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m82.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting fastap

In [4]:
#!pip install llama-index-llms-huggingface-api
!pip install llama-index-llms-huggingface

Collecting llama-index-llms-huggingface
  Downloading llama_index_llms_huggingface-0.2.4-py3-none-any.whl (11 kB)
Collecting text-generation<0.8.0,>=0.7.0 (from llama-index-llms-huggingface)
  Downloading text_generation-0.7.0-py3-none-any.whl (12 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers[torch]<5.0.0,>=4.37.0->llama-index-llms-huggingface)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
Collecting accelerate>=0.21.0 (from transformers[torch]<5.0.0,>=4.37.0->llama-index-llms-huggingface)
  Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m314.1/314.1 kB[0m [31m26.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, text-generation, accelerate, llama-index-llms-huggingface
  Attempting uninstall: tokenizers
    

## Set Up Hugging Face API Token

In [3]:
import os
from getpass import getpass

# HUGGINGFACEHUB_API_TOKEN = getpass("API:")

# # Set the API token in the environment variable
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

## Load and Split Medical Documents:



In [20]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()

## Set Up FastEmbeddings Embeddings and HuggingFace LLM



In [21]:
from llama_index.embeddings.fastembed import FastEmbedEmbedding
# define embedding function
embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

In [69]:
# from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

# llm = HuggingFaceInferenceAPI(
#     model_name="HuggingFaceH4/zephyr-7b-alpha", token=HUGGINGFACEHUB_API_TOKEN
# )
from llama_index.core import PromptTemplate
MODEL = "HuggingFaceH4/zephyr-7b-alpha"
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    tokenizer_name=MODEL,
    model_name=MODEL,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_kwargs={"max_length": 4096},
)

# llm = HuggingFaceLLM(
#     context_window=4096,
#     max_new_tokens=256,
#     generate_kwargs={"temperature": 0.7, "do_sample": False},
#     system_prompt=system_prompt,
#     query_wrapper_prompt=query_wrapper_prompt,
#     tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
#     model_name="StabilityAI/stablelm-tuned-alpha-3b",
#     device_map="auto",
#     stopping_ids=[50278, 50279, 50277, 1, 0],
#     tokenizer_kwargs={"max_length": 4096},
#     # uncomment this if using CUDA to reduce memory usage
#     # model_kwargs={"torch_dtype": torch.float16}
# )

from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained(MODEL).encode
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [70]:
chat = [
  {"role": "user", "content": "Hello, how are you?"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
llm.complete(tokenizer.apply_chat_template(chat, tokenize=False))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|user|>
Hello, how are you?</s>



CompletionResponse(text="\nI do not have feelings or emotions, but i'm programmed to respond to your inquiries and provide you with helpful information. how can i assist you today?", additional_kwargs={}, raw={'model_output': tensor([[    1,   523, 28766, 11123, 28766,  3409, 28766,  1838, 28766, 28767,
            13, 16230, 28725,   910,   460,   368, 28804,     2, 28705,    13,
         28789, 28766,  4816,  8048, 12738, 28766, 28767,    13, 28737,   511,
           459,   506,  9388,   442, 13855, 28725,   562,   613, 28742, 28719,
          2007,  1591,   298,  9421,   298,   574,   297, 10851,   497,   304,
          3084,   368,   395, 10865,  1871, 28723,   910,   541,   613,  6031,
           368,  3154, 28804,     2]])}, logprobs=None, delta=None)

## Define LLM and Embedding in Settings

By default LlamaIndex uses OpenAI, so we need to override the settings

In [71]:
from llama_index.core import Settings

Settings.llm = llm

Settings.embed_model = embed_model

## Create Vectorstore with Chroma

In [42]:
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb

## Index your document

First we save the data in disk
- Create a Persist directory where the data will be stored
- Define a unique collection for each index.
- Store the data in StorageContext

In [72]:
!rm -rf ./chroma_db

In [79]:
db = chromadb.PersistentClient(path="./chroma_db_v2")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [80]:
db.list_collections()

[<chromadb.api.models.Collection.Collection at 0x7c8977108580>]

In [81]:
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

In [82]:
index.as_query_engine().query("summerize Tarun's role at AI Planet")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Response(response="\nTarun is a software developer at AI Planet, where he works on developing AI-based solutions for various industries. He is passionate about using technology to solve real-world problems and is constantly exploring new ways to improve the company's products. Tarun is also a mentor and coach to other developers, helping them to improve their skills and advance their careers. Overall, Tarun is a key member of the AI Planet team, contributing to the company's growth and success.", source_nodes=[NodeWithScore(node=TextNode(id_='81140a75-2458-410e-80d9-29fce7ac0183', embedding=None, metadata={'page_label': '10', 'file_name': 'ncert short story.pdf', 'file_path': '/content/data/ncert short story.pdf', 'file_type': 'application/pdf', 'file_size': 863728, 'creation_date': '2024-07-09', 'last_modified_date': '2024-07-09'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_ke

## Load the index

Notice, when you load, we don't use `documents`

In [83]:
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

In [84]:
query_engine = index.as_query_engine()
response = query_engine.query("summerize Tarun's role at AI Planet")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [85]:
response.response

'\nTarun is a software developer at AI Planet, where he works on developing AI-based solutions for various industries. He is also involved in research and development of new AI technologies and algorithms.'

In [86]:
response = query_engine.query("summarize Loscalzo Jonathan's role at AI Planet")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [87]:
response.response

"\nLoscalzo Jonathan is a co-founder and CEO of AI Planet, a company that provides AI-powered solutions for various industries. As the CEO, he is responsible for overseeing the company's operations, strategy, and growth. He has extensive experience in the technology industry and has held leadership roles in several other companies. At AI Planet, he aims to leverage the power of AI to transform businesses and improve people's lives."