<a href="https://colab.research.google.com/github/SoumojjalSen/RAG-Chatbot/blob/main/RAG_chatbot_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG System Using Llama2 With Hugging Face

In [1]:
!pip install pypdf



In [2]:
!pip install -q transformers einops accelerate langchain bitsandbytes

In [3]:
## Embedding
!pip install sentence_transformers



In [4]:
!pip install llama_index



In [5]:
!pip install llama-index-llms-huggingface



In [6]:
from llama_index.core import VectorStoreIndex,SimpleDirectoryReader,ServiceContext,PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts.prompts import SimpleInputPrompt

* VectorStoreIndex: This typically represents an index that stores vector embeddings of text data. It enables fast similarity searches, so you can quickly find relevant content based on context or keywords by comparing vector embeddings.

* SimpleDirectoryReader: This reads files from a specified directory. It helps in loading multiple files for processing, such as documents you want to index or analyze.

* ServiceContext: This bundles the resources used during indexing and querying, such as the LLM, the embedding model, and chunking parameters like the chunk size. Recent LlamaIndex releases deprecate it in favor of the global Settings object.
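
To make the similarity search behind a vector index concrete, here is a minimal pure-Python sketch. It is not how VectorStoreIndex is implemented internally; the function names `cosine_similarity` and `top_k`, and the tiny 3-dimensional vectors, are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings; real models produce hundreds of dimensions.
toy_store = [
    ("housing prices in California", [0.9, 0.1, 0.0]),
    ("handwritten digit images", [0.0, 0.2, 0.9]),
    ("census income statistics", [0.7, 0.6, 0.1]),
]

results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The query vector points in nearly the same direction as the first stored vector, so that text ranks highest. A real index applies the same idea to embeddings produced by a model.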

In [8]:
ls

sample_data/


In [10]:
documents = SimpleDirectoryReader("sample_data").load_data()
documents

[Document(id_='c5c8e0f1-1386-4486-92a2-e13fa21a7a70', embedding=None, metadata={'file_path': '/content/sample_data/README.md', 'file_name': 'README.md', 'file_type': 'text/markdown', 'file_size': 962, 'creation_date': '2025-01-07', 'last_modified_date': '2000-01-01'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text="This directory includes a few sample datasets to get you started.\n\n*   `california_housing_data*.csv` is California housing data from the 1990 US\n    Census; more information is available at:\n    https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub\n\

In [11]:
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
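## Wrap each user query in role markers before it is sent to the model.
## Note: Llama 2 chat models were trained on the [INST] ... [/INST] format;
## the <|USER|>/<|ASSISTANT|> tags used here follow the StableLM convention.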
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
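
For comparison, the single-turn layout that Llama 2 chat models were trained on can be sketched as a plain string builder. The helper name `format_llama2_chat` is hypothetical, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself.

```python
def format_llama2_chat(system_prompt: str, user_message: str) -> str:
    """Build a single-turn prompt in the Llama 2 chat layout.

    The BOS token is left out on the assumption that the tokenizer adds it.
    """
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )

prompt = format_llama2_chat(
    "You are a Q&A assistant.",
    "What does the README describe?",
)
print(prompt)
```

In this notebook the closest analogue would be a query wrapper along the lines of `[INST] {query_str} [/INST]`, though how it combines with the separately passed system prompt may depend on the LlamaIndex version.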

In [12]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
The token `first_token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `first_token`


In [13]:
import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    # temperature controls sampling randomness, but it only takes effect when
    # do_sample=True; with do_sample=False decoding is greedy and it is ignored
    generate_kwargs={"temperature": 0.2, "do_sample": False},
    system_prompt=system_prompt,  # system prompt defined above
    query_wrapper_prompt=query_wrapper_prompt,  # query wrapper defined above
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",  # Llama 2 chat model with 7 billion parameters
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    # 8-bit quantization (bitsandbytes) reduces GPU memory usage and requires CUDA;
    # newer transformers versions expect a BitsAndBytesConfig passed as
    # quantization_config instead of load_in_8bit
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
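
A quick back-of-the-envelope calculation shows why 8-bit loading matters on a Colab GPU. This sketch counts weights only (no activations or KV cache) and assumes Llama-2-7b has roughly 6.74 billion parameters; `weight_memory_gb` is an illustrative helper, not a library function.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: Llama-2-7b has roughly 6.74 billion parameters.
NUM_PARAMS = 6.74e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 16-bit floats: 2 bytes each
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # 8-bit quantized: 1 byte each

print(f"float16: ~{fp16_gb:.1f} GB, int8: ~{int8_gb:.1f} GB")
```

Under these assumptions, quantizing to 8 bits roughly halves the memory needed for the weights, which is what leaves headroom for generation on a smaller GPU.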

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.

In [14]:
%pip install llama-index-embeddings-langchain

Collecting llama-index-embeddings-langchain
  Downloading llama_index_embeddings_langchain-0.3.0-py3-none-any.whl.metadata (661 bytes)
Downloading llama_index_embeddings_langchain-0.3.0-py3-none-any.whl (2.5 kB)
Installing collected packages: llama-index-embeddings-langchain
Successfully installed llama-index-embeddings-langchain-0.3.0


In [15]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.14-py3-none-any.whl.metadata (2.9 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading langchain_community-0.3.14-py3-none-any.whl (2.5 MB)
Downloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Downloading pydantic_settings-2.7.1-py3-none-any.whl (29 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv, httpx-sse, pydantic-settings, langchain-community
Succes

In [16]:
## The LLM is now fully loaded; next, set up the embedding model

# langchain-community (installed above) provides HuggingFaceEmbeddings;
# the older langchain.embeddings.huggingface import path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

# all-mpnet-base-v2 is a sentence-transformers model: it maps sentences and
# paragraphs to a 768-dimensional dense vector space and can be used for
# tasks like clustering or semantic search.
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)


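We now have the embedding model, the LLM, and the documents, so the next step is to combine them. Older LlamaIndex versions did this through ServiceContext, which bundled the LLM, the embedding model, and the chunking settings used for indexing and querying.

Because ServiceContext is deprecated, we set the same values on the global Settings object instead.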

In [17]:
# service_context=ServiceContext.from_defaults(
#     chunk_size=1024,
#     llm=llm,
#     embed_model=embed_model
# )

from llama_index.core import Settings

Settings.chunk_size = 1024
Settings.llm = llm
Settings.embed_model = embed_model
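
The chunk size caps how much text goes into each indexed node. As a rough illustration of the idea, the sketch below splits a list of tokens into overlapping windows. It uses whitespace-separated words as a stand-in for model tokens, and the `split_into_chunks` helper and its `overlap` parameter are assumptions for this example; LlamaIndex's own splitter is more careful about sentence boundaries.

```python
def split_into_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog again".split()
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Larger chunks give the LLM more context per retrieved node but make each embedding less specific; 1024 is the value this notebook uses.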

In [18]:
index=VectorStoreIndex.from_documents(documents)

In [19]:
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x7dd50477b2e0>

In [20]:
query_engine=index.as_query_engine()
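
Conceptually, a query engine retrieves the chunks most relevant to the question and places them in a prompt for the LLM. The sketch below shows only the prompt-assembly step. The template text and the `build_rag_prompt` helper are hypothetical, loosely modeled on a typical RAG question-answering prompt rather than copied from LlamaIndex.

```python
# Hypothetical template, modeled loosely on a typical RAG question-answering prompt.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

def build_rag_prompt(query, retrieved_chunks):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)

rag_prompt = build_rag_prompt(
    "Which census year is the housing data from?",
    ["`california_housing_data*.csv` is California housing data from the 1990 US Census."],
)
print(rag_prompt)
```

Because the answer is grounded in whatever was retrieved, a question that the indexed documents do not cover leaves the model to fall back on its own knowledge.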

In [27]:
response=query_engine.query("Write the code in C++ to calculate the LCS taking sample strings")

In [28]:
print(response)

The code to calculate the LCS in C++ is as follows:
```
#include <iostream>
#include <cstring>
#include <cstdlib>
using namespace std;

// Function to calculate the LCS of two strings
void lcs(string X, string Y, int m, int n, int *c, char *b) {
    // Initialize the counters and the bit vector
    *c = 0;
    *b = "";
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            if (X[i] == Y[j]) {
                (*c)++;
                (*b) += "↖";
            } else if ((*c) > (i - 1) + (j - 1)) {
                (*c)++;
                (*b) += "↑";
            } else {
                (*c)++;
                (*b) += "←";
            }
        }
    }
}

// Example usage
int main() {
    string X = "
