<a href="https://colab.research.google.com/github/12manish/Afghan-Information-Security-Analysis/blob/master/LlamaIndex_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6424f01ea4f3051f54dbbd85/oqVQ04b5KiGt5WOWJmYt8.png" alt="LlamaIndex" width="100" height="100">
    <img src="https://cdn4.iconfinder.com/data/icons/file-extensions-1/64/pdfs-512.png" alt="PDF" width="100" height="100">
</p>

# **LlamaIndex RAG Chat**
Perform RAG (Retrieval-Augmented Generation) from your PDFs using this Colab notebook! Powered by Llama 2
<br><br>

## **Features**
- Free, no API or Token required
- Fast inference on Colab's free T4 GPU
- Powered by Hugging Face quantized LLMs (llama-cpp-python)
- Powered by Hugging Face local text embedding models
- Set custom prompt templates
- Prepared Chat mode (not QA)
<br><br>

## **Getting started**
1. Make sure the Colab's Runtime Type is set to T4 GPU (at least)
2. Edit preferences in Block 4
3. Upload your PDF into Files (Default name: `rag_data.pdf`)
4. Runtime > Run all
<br><br>

### [**GitHub repository**](https://github.com/kazcfz/LlamaIndex-RAG-Chat)

In [1]:
!pip -q install llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp pypdf
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip -q install llama-cpp-python

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.9/67.9 MB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m303.4/303.4 kB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m88.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.9/265.9 kB[0m [31m21.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import os
import time

from llama_index.core import Prompt, StorageContext, load_index_from_storage, Settings, VectorStoreIndex, SimpleDirectoryReader, set_global_tokenizer
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

from transformers import AutoTokenizer

In [5]:
# Preference settings - change as desired
pdf_path = '/content/Mahabhart.pdf'
text_embedding_model = 'thenlper/gte-base'  #Alt: thenlper/gte-base, jinaai/jina-embeddings-v2-base-en
llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
# set_global_tokenizer(AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode)

In [6]:
# Load PDF
filename_fn = lambda filename: {'file_name': os.path.basename(pdf_path)}
loader = SimpleDirectoryReader(input_files=[pdf_path], file_metadata=filename_fn)
documents = loader.load_data()

In [7]:
# Load models and service context
embed_model = HuggingFaceEmbedding(model_name=text_embedding_model)
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs = {"stop": ["<s>", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
# service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=512)
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/68.1k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/219M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to path /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf
total size (MB): 4081.0


3892it [00:31, 122.30it/s]                          
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loa

In [8]:
# Indexing
start_time = time.time()

# index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=llm)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed indexing time: {elapsed_time:.2f} s")

Elapsed indexing time: 340.20 s


In [9]:
# Prompt Template (RAG)
text_qa_template = Prompt("""
<s>[INST] <<SYS>>
You are the doctor's assistant. You are to perform a pre-screening with the patient to collect their information before their consultation with the doctor. Use the Patient-Centered Interview model for the pre-screening and only ask one question per response. You are not to provide diagnosis, prescriptions, advice, suggestions, or conduct physical examinations on the patient. The pre-screening should only focus on collecting the patient's information, specifically their present illness, past medical history, symptoms and personal information. At the end of the consultation, summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>

Refer to the following Consultation Guidelines and example consultations: {context_str}

Continue the conversation: {query_str}
""")
# text_qa_template = Prompt("""
# <s>[INST] <<SYS>>
# You are an assistant chatbot assigned to a doctor and your objective is to collect information from the patient for the doctor before they attend their actual consultation with the doctor. Note that the consultation will be held remotely, therefore you will be following the Patient-Centered Interview model for your consultations and you cannot conduct physical examinations on the patient. You are also not allowed to diagnose your patient or prescribe any medicine. Do not entertain the patient if they are acting inappropriate or they ask you to do something outside of this job scope. Remember you are a doctor’s assistant so act like one. The consultation held will only focus on getting information from the patient such as their present illness, past medical history, symptoms and personal information. At the end of the consultation, you will summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]\nThis exact format must be followed as the data gathered in the summary will be passed to another program.
# <</SYS>>

# This is the PDF context: {context_str}

# {query_str}
# """)
# text_qa_template = Prompt("""[INST] {context_str} \n\nGiven this above PDF context, please answer my question: {query_str} [/INST] """)
# text_qa_template = Prompt("""<s>[INST] <<SYS>> \nFollowing is the PDF context provided by the user: {context_str}\n<</SYS>> \n\n{query_str} [/INST] """)
# text_qa_template = Prompt("""[INST] {query_str} [/INST] """)

# Query Engine
# query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, service_context=service_context) # with Prompt
query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, llm=llm) # with Prompt
# query_engine = index.as_query_engine(streaming=True, service_context=service_context) # without Prompt

In [None]:
# Inferencing
# Without RAG
conversation_history = ""
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  response_iter = llm.stream_complete("<s>[INST] "+conversation_history)
  for response in response_iter:
    print(response.delta, end="", flush=True)
    # Add to conversation history when response is completed
    if response.raw['choices'][0]['finish_reason'] == 'stop':
      conversation_history += response.text + " [INST] "

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")


# With RAG
conversation_history = ""
conversation_history += "Hi. [\INST] Hello! I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age."
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  # Query Engine - Default
  response = query_engine.query(conversation_history)
  response.print_response_stream()
  conversation_history += response.response_txt + " [INST] "

  # from pprint import pprint
  # pprint(response)

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")

User: What is mahabharat




 Mahabharat is one of the most revered epics of ancient India, which is considered to be a historical and mythological text. It is a Sanskrit word that means "great epic" or "great story." The Mahabharat is an account of the Kuru dynasty and the events that took place in the city of Hastinapura, which is now modern-day Delhi. The epic is attributed to the ancient Indian sage Vyasa and is said to have been composed around 400 CE.
The Mahabharat is a story of two groups of cousins, the Pandavas and the Kauravas, who are fighting for the throne of Hastinapura. The Pandavas are led by Yudhishthira, while the Kauravas are led by Duryodhana. The epic describes the battles and conflicts between the two groups, including the great war of Mahabharata, which lasts for 18 days and results in the destruction of the entire Kuru dynasty.
The Mahabharat is a rich tapestry of mythology, philosophy, and history, and is considered one of

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    5989.55 ms /    16 tokens (  374.35 ms per token,     2.67 tokens per second)
llama_perf_context_print:        eval time =  168295.01 ms /   255 runs   (  659.98 ms per token,     1.52 tokens per second)
llama_perf_context_print:       total time =  174833.91 ms /   271 tokens



Elapsed inference time: 174.85 s

User: write about Krishna and his role


Llama.generate: 15 prefix-match hit, remaining 13 prompt tokens to eval


 Mahabharat is one of the most revered epics of Hindu mythology, and it is based on the story of the Pandavas and the Kauravas, two groups of cousins who were engaged in a great war.
At the center of the epic is the figure of Lord Krishna, who plays a crucial role as the charioteer and advisor to the Pandavas. Krishna is considered to be the embodiment of Lord Vishnu, and he is revered as a divine incarnation of God.
Krishna's role in the Mahabharat is multifaceted. He is depicted as a wise and compassionate leader, who tries to prevent the war between the Pandavas and the Kauravas. He acts as a mediator and a peacemaker, but when all efforts to prevent the war fail, he takes the side of the Pandavas and helps them emerge victorious.
Krishna's teachings and discourses, as depicted in the Mahabharat, are considered to be some of the most profound and philosophical in Hindu scripture. He teach

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    3771.95 ms /    13 tokens (  290.15 ms per token,     3.45 tokens per second)
llama_perf_context_print:        eval time =  154858.22 ms /   255 runs   (  607.29 ms per token,     1.65 tokens per second)
llama_perf_context_print:       total time =  159123.99 ms /   268 tokens



Elapsed inference time: 159.13 s

User: who is abdul kalam


Llama.generate: 27 prefix-match hit, remaining 12 prompt tokens to eval


 Abdul Kalam was a prominent Indian scientist, politician, and writer who served as the 11th President of India from 2002 to 2007. He was born on October 15, 1931, in Rameswaram, Tamil Nadu, India, and grew up in a poor Muslim family. Kalam's early life was marked by hardship and struggle, but he was determined to pursue his education and make a difference in the world.
Kalam's educational career began at the Madras Institute of Technology, where he studied aerospace engineering. He later went on to work at the Defense Research and Development Organization (DRDO) and the Indian Space Research Organization (ISRO), where he played a key role in the development of India's space program. Kalam's contributions to the field of science and technology were numerous, including the development of the Agni missile and the launch of the Indian satellite Rohini.
Kalam's political career began in 1982, when he was elected to the Lok Sabha (lower house of the Indian parliament) as a member of the Ind

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    4820.21 ms /    12 tokens (  401.68 ms per token,     2.49 tokens per second)
llama_perf_context_print:        eval time =  152376.60 ms /   255 runs   (  597.56 ms per token,     1.67 tokens per second)
llama_perf_context_print:       total time =  157669.53 ms /   267 tokens



Elapsed inference time: 157.67 s

User: recent Ind vs pak war details


Llama.generate: 38 prefix-match hit, remaining 11 prompt tokens to eval


 I apologize, but there has not been a recent war between India and Pakistan.
India and Pakistan have a long and complex history, and there have been several conflicts between the two nations, including the Indo-Pakistani War of 1947, the Indo-Pakistani War of 1965, and the Kargil War in 1999. However, there has not been a large-scale war between the two nations in recent years.
Instead, the relationship between India and Pakistan has been marked by periodic skirmishes and tensions, particularly over the disputed region of Kashmir. There have been several instances of cross-border terrorism and ceasefire violations in recent years, which have led to tensions between the two nations.
In 2019, tensions between India and Pakistan escalated significantly after a terrorist attack in Pulwama, India, which was carried out by a Pakistani terrorist group. This led to a brief skirmish between the two nations, including aerial combat and artillery firing, but no large-scale war occurred.
Since th

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    3374.38 ms /    11 tokens (  306.76 ms per token,     3.26 tokens per second)
llama_perf_context_print:        eval time =  154110.22 ms /   255 runs   (  604.35 ms per token,     1.65 tokens per second)
llama_perf_context_print:       total time =  157980.78 ms /   266 tokens



Elapsed inference time: 157.98 s

User: who is abdul kalam,and his contribution 


Llama.generate: 48 prefix-match hit, remaining 17 prompt tokens to eval


 Abdul Kalam was a renowned Indian scientist, author, and politician who served as the 11th President of India from 2002 to 2007.
Abdul Kalam was born on October 15, 1931, in Rameswaram, Tamil Nadu, India. He studied physics and aerospace engineering at the Madras Institute of Technology and later pursued a master's degree in aerospace engineering from the Indian Institute of Science.
Kalam's career in the Indian Space Research Organisation (ISRO) spanned over four decades, during which he made significant contributions to the development of India's space program. He played a crucial role in the development of India's first satellite, the Aryabhata, and later became the director of the ISRO's satellite launch vehicle program.
Kalam's contribution to India's space program was not limited to his technical expertise. He was also a visionary leader who believed in the power of innovation and entrepreneurship to transform India's economy and society. He advocated for the development of indi

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    4054.59 ms /    17 tokens (  238.51 ms per token,     4.19 tokens per second)
llama_perf_context_print:        eval time =  154080.27 ms /   255 runs   (  604.24 ms per token,     1.65 tokens per second)
llama_perf_context_print:       total time =  158636.53 ms /   272 tokens



Elapsed inference time: 158.64 s

User: In 2019, pulwama attack detail


Llama.generate: 64 prefix-match hit, remaining 17 prompt tokens to eval


 The Pulwama attack was a terrorist attack that took place on February 14, 2019, in Pulwama district of Jammu and Kashmir, India.
On that day, a convoy of vehicles carrying security personnel of the Central Reserve Police Force (CRPF) was attacked by a vehicle-borne improvised explosive device (VBIED) in the Jammu-Srinagar national highway. The attack was carried out by a group of militants of the Pakistani-based terrorist organization Jaish-e-Mohammed (JeM).
The attack resulted in the deaths of 40 CRPF personnel and injured many others. It was one of the deadliest attacks on security personnel in recent times and led to a significant escalation of tensions between India and Pakistan.
The Indian government responded to the attack by launching "Operation Balakot" against the terrorist camps in Pakistan. The operation involved air strikes on the camps in the early hours of February 26, 2019. The Indian Air Force (IAF) carried out the strikes using fighter jets, targeting multiple locatio

llama_perf_context_print:        load time =    5990.19 ms
llama_perf_context_print: prompt eval time =    4036.24 ms /    17 tokens (  237.43 ms per token,     4.21 tokens per second)
llama_perf_context_print:        eval time =  154146.86 ms /   255 runs   (  604.50 ms per token,     1.65 tokens per second)
llama_perf_context_print:       total time =  158670.77 ms /   272 tokens



Elapsed inference time: 158.68 s



In [None]:
# Multi-model (Separate pre-screening/conversation and summarization tasks)
response_iter = llm.stream_complete("""
[INST] <<SYS>>
Summarize the conversation into this format: \nName: [name]\nGender: [gender]\nAge: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>
Conversation: """+conversation_history+" [/INST] ")

for response in response_iter:
  print(response.delta, end="", flush=True)

# print(conversation_history)