# RAG

**Retrieval-Augmented Generation (RAG)** is a technique that enhances generative AI by adding a retrieval mechanism, which fetches relevant documents or passages from a large corpus. A vector database makes this retrieval step efficient, because relevant passages are found by similarity search over embeddings.<br>
**RAG** is used here to build a conversational chat system grounded in domain knowledge.

## How RAG Works with a Vector Database

1. **Document Embedding**:
   - Each document or passage in the corpus is converted into a high-dimensional vector using a pre-trained embedding model. As seen earlier, a BAAI BGE English model (`bge-large-en`) was used for the embeddings.
   - These embeddings capture the semantic meaning of the text, enabling more accurate retrieval of relevant information.

2. **Storing Embeddings in a Vector Database**:
   - The embeddings are stored in a vector database, a specialized database optimized for storing and querying high-dimensional vectors.
   - The vector database allows for efficient similarity search, finding the most relevant documents based on their vector representations.

3. **Query Embedding**:
   - When a query is presented, it is also converted into a high-dimensional vector using the same embedding model.
   - This query embedding represents the semantic meaning of the query.

4. **Vector Search**:
   - The query embedding is used to search the vector database.
   - The vector database performs a nearest neighbor search to find the most similar document embeddings to the query embedding.
   - The top-K most relevant documents or passages are retrieved based on their similarity to the query embedding.

5. **Generative Component**:
   - The retrieved documents are passed to the generative component, typically a large language model (LLM).
   - The generative model uses the information from the retrieved documents to generate a coherent and contextually appropriate response.
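
The five steps above can be sketched end to end in plain Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a Python list, the corpus sentences are made up, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical rather than part of any library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit vectors stored as sparse dicts (equals cosine similarity)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def retrieve(query, index, k=2):
    """Steps 3-4: embed the query, then return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, passages):
    """Step 5: place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up sample sentences, used only to exercise the sketch.
corpus = [
    "A petition must be filed within the limitation period.",
    "Bail may be granted by the court in bailable offences.",
    "A contract requires offer, acceptance and consideration.",
]
# Steps 1-2: embed every document once and keep (text, vector) pairs as the "database".
index = [(doc, embed(doc)) for doc in corpus]

top = retrieve("When can the court grant bail?", index, k=1)
prompt = build_prompt("When can the court grant bail?", top)
print(prompt)
```

In the real pipeline the prompt would be sent to the LLM; a real embedding model also captures meaning beyond word overlap, which this toy version does not.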




In [1]:
#Install dependencies
!pip install -q -U bitsandbytes
!pip install -q sentence_transformers
!pip install -q accelerate==0.21.0 transformers==4.31.0 tokenizers==0.13.3
!pip install -q einops==0.6.1
!pip install -q xformers==0.0.22.post7
!pip install -q langchain==0.1.4
!pip install -q chromadb FlagEmbedding



[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kaggle-environments 1.14.11 requires transformers>=4.33.1, but you have transformers 4.31.0 which is incompatible.
sentence-transformers 3.0.1 requires transformers<5.0.0,>=4.34.0, but you have transformers 4.31.0 which is incompatible.[0m[31m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 3.0.1 requires transformers<5.0.0,>=4.34.0, but you have transformers 4.31.0 which is incompatible.[0m[31m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 r

In [2]:
import torch
import transformers
from torch import cuda, bfloat16
from transformers import StoppingCriteria, StoppingCriteriaList

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# In langchain 0.1.x these integrations live in langchain_community;
# the older langchain.* import paths are deprecated aliases.
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma

## Loading the LLM
- Choosing the right LLM is crucial for RAG. Here we load a custom Llama-2-7b-chat model, fine-tuned for legal tasks, through the Hugging Face pipeline.
- Because of limited resources, the model is quantized and loaded in 4-bit precision.
- You can try loading it in full precision if you have enough GPU memory.
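
To see why 4-bit loading helps, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is a sketch under stated assumptions: the parameter count is rounded to 7 billion, and the hypothetical helper `weight_memory_gb` ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumed round figure for a "7b" model
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the weights drop from roughly 14 GB in half precision to roughly 3.5 GB in 4-bit, which is what makes the model fit on a single modest GPU.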

In [3]:

llm = "Hashif/Indian-legal-Llama-2-7b-v2"  # model ID on the Hugging Face Hub; try out different models

# Load the LLM into a text-generation pipeline.

def load_pipeline(model_name):
    model_id = model_name
    device = f"cuda:{cuda.current_device()}" if cuda.is_available() else "cpu"
    # set quantization configuration to load large model with less GPU memory
    # this requires the `bitsandbytes` library
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16,
    )

    # begin initializing HF items; an access token is only needed for gated or private models
    model_config = transformers.AutoConfig.from_pretrained(
        model_id,
        # use_auth_token=hf_auth
    )

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map="auto",
        # use_auth_token=hf_auth
    )
    model.eval()
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_id,
        # use_auth_token=hf_auth
    )
    # Stopping criteria cut generation off at these sequences to give cleaner outputs.
    stop_list = ["\nHuman:", "\n```\n", "\n\n"]

    # Note: the tokenizer may add special tokens (e.g. BOS) to each sequence,
    # so check that these IDs match what the model actually generates.
    stop_token_ids = [tokenizer(x)["input_ids"] for x in stop_list]
    stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

    class StopOnTokens(StoppingCriteria):
        def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
        ) -> bool:
            for stop_ids in stop_token_ids:
                if torch.eq(input_ids[0][-len(stop_ids) :], stop_ids).all():
                    return True
            return False

    stopping_criteria = StoppingCriteriaList([StopOnTokens()])
    generate_text = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,  # langchain expects the full text
        task="text-generation",
        # we pass model parameters here too
        stopping_criteria=stopping_criteria,  # without this model rambles during chat
        temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic
        max_new_tokens=2048,  # adjust it according to the model
        repetition_penalty=1.1,  # without this output begins repeating
    )
    return generate_text
pipeline = load_pipeline(llm)
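
The `StopOnTokens` class above compares the tail of the generated token IDs with each stop sequence. The same check can be sketched with plain Python lists, which makes the logic easier to test in isolation. The function name `ends_with_any` and the token IDs are hypothetical, chosen only for illustration; the sketch also adds a length guard so that a stop sequence longer than the generated text is skipped rather than compared.

```python
def ends_with_any(generated_ids, stop_sequences):
    """Return True when the generated IDs end with one of the stop sequences."""
    for stop in stop_sequences:
        # Skip empty sequences and sequences longer than what has been generated.
        if stop and len(stop) <= len(generated_ids):
            if generated_ids[-len(stop):] == stop:
                return True
    return False


# Hypothetical token IDs standing in for tokenized stop strings.
stop_sequences = [[13, 29950], [13, 13]]

print(ends_with_any([1, 5, 13, 13], stop_sequences))  # tail matches [13, 13]
print(ends_with_any([1, 5, 13], stop_sequences))      # no stop sequence at the tail
```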





## KEY COMPONENTS

1. **LLM Chain**
    - Chains let you go beyond a single API call to a language model by linking multiple calls in a logical sequence, combining several components into one coherent application.
    - The LangChain framework is used to build the chains for RAG.
    - The chain combines the inputs and responses of the chat system so that the conversation does not fall out of context.

2. **Vector Database**
    - The vector database holds our knowledge as embedding vectors.

3. **Retriever**
    - The retriever fetches relevant information from the vector database, here using a similarity search such as cosine similarity.

4. **Memory**
    - The memory component remembers previous queries and responses.
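
As a rough illustration of the memory component, the sketch below keeps every turn in a buffer and prepends it to the next question. This is not LangChain's implementation: `BufferMemory` and `with_history` are hypothetical names, and a real conversational retrieval chain typically uses the history to rewrite a follow-up into a standalone question before retrieval rather than simply concatenating text.

```python
class BufferMemory:
    """Toy conversation buffer: stores every turn and replays it as text."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        """Record one completed exchange."""
        self.turns.append((question, answer))

    def history(self):
        """Render all stored turns as a plain-text transcript."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def with_history(memory, question):
    """Prefix the new question with the stored conversation, if any."""
    past = memory.history()
    return f"{past}\nHuman: {question}" if past else f"Human: {question}"


memory = BufferMemory()
first = with_history(memory, "What is a petition?")
memory.save("What is a petition?", "A formal written request to a court.")
follow_up = with_history(memory, "How do I draft one?")
print(follow_up)
```

Because the earlier turn travels with the follow-up, "one" in the second question can be resolved to "a petition".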

In [4]:
""" If you are working on Kaggle notebooks:
       1. Zip and upload your vector database to Google Drive
       2. Download and unzip it into the Kaggle working directory by running this cell """

!conda install -q -y gdown
# Pass the Google Drive file ID directly; recent gdown versions no longer need --id.
!gdown -q 1pSOOesrWzNRb3hrb-erFcEox6Tlz02Yh
!unzip -q vectordb2.zip




In [5]:
model_name = "BAAI/bge-base-en"  # embedding model; must match the model used to build the vector database
encode_kwargs = {"normalize_embeddings": True}  # normalize so that similarity scores correspond to cosine similarity
model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cuda"},  # change the device to "cpu" if you are not on a GPU
    encode_kwargs=encode_kwargs,
)
llm = HuggingFacePipeline(pipeline=pipeline)

def make_chain(llm):

    persist_directory = "/kaggle/working/vectordb2"  # path to your vector database
    vectordbs = Chroma(
        persist_directory=persist_directory, embedding_function=model_norm
    )
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    retriever = vectordbs.as_retriever(search_kwargs={"k": 3})
    qa = ConversationalRetrievalChain.from_llm(
        llm,
        retriever=retriever,
        memory=memory
    )
    return qa

qa = make_chain(llm)


## INFERENCE
- Try a query to see how the chain performs and inspect the answer.

In [6]:
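query = "hello"
# Build the prompt without leading indentation so stray whitespace does not reach the model.
prompt = (
    "#Instruction: You are a legal advisor, give services to the clients like drafting petitions, "
    "clearing doubts and providing legal assistance according to their queries\n"
    f"#client:{query}\n"
    "#Answer: "
)
result = qa.invoke({"question": prompt})  # invoke() replaces the deprecated direct call qa(...)
result['answer']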


' \n"The norms for empanelment of academic counsellors include familiarity with ODL mode, learners and their needs, difference between ODL and conventional face to face education, awareness about instructional design, learner-centric approach in blended mode of learning, use of different delivery media including online and computer mediated communication, and information and communication technology enabled learning."\n\nEducation Law Consultant, Advocate\n\nNote: This is a summary of the provided law and is not intended to be a definitive analysis of all its aspects. It is suggested that you consult with a qualified legal professional before making any decisions based on this summary.\n\nDisclaimer: The information provided is for general purposes only and should not be construed as professional advice. It is recommended to consult with a legal expert for specific cases or situations."\n\nIndian Kanoon - http://indiankanoon.org/doc/139829370/\n\n[/] The norms for empanelment of academ

' \n                        **/\n\nThe answer to your question is not available as it is based on a hypothetical scenario and does not provide any specific information or advice. I am here to assist you with any legal questions or issues you might have, but I cannot provide legal advice without knowing more about your case. Please share more details or ask a specific question, and I will do my best to help. Remember, this is for general information purposes only and should not be considered as legal advice. It is important to consult a qualified lawyer for personalized guidance.'