Objective:
Retrieve information from a set of LLM research papers using retrieval-augmented generation (RAG).

Steps:
1. Store the research papers in Google Drive
2. Load the text of the research papers
3. Split the documents into chunks
4. Embed the chunks with a Hugging Face embedding model
5. Store the embeddings in a Weaviate vector database
6. Use similarity search to fetch the k nearest chunks for a query
7. Use the Mistral LLM via the Hugging Face API to 'augment' the query with those chunks and generate a concise response



In [1]:
# install the required libraries
!pip install weaviate-client langchain tiktoken pypdf rapidocr-onnxruntime


Collecting weaviate-client
  Downloading weaviate_client-4.5.5-py3-none-any.whl (306 kB)
Collecting langchain
  Downloading langchain-0.1.16-py3-none-any.whl (817 kB)
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Collecting rapidocr-onnxruntime
  Downloading rapidocr_onnxruntime-1.3.16-py3-none-any.whl (14.9 MB)

In [2]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-no

In [3]:
# Use an open-source embedding model hosted on Hugging Face

from langchain.embeddings import HuggingFaceEmbeddings
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(
  model_name=embedding_model_name,
  model_kwargs=model_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
# Google Drive is mounted; list the folder of research-paper PDFs we will load
!ls ./drive/MyDrive/llms/research_papers

attention.pdf  lora.pdf		peft.pdf	    rag_for_intensive_nlp.pdf  yolo.pdf
llama2.pdf     one_bit_llm.pdf	rag_evaluation.pdf  rag_vs_finetune.pdf


In [5]:
# reading the directory of pdf files
from langchain.document_loaders import PyPDFDirectoryLoader

data = PyPDFDirectoryLoader('drive/MyDrive/llms/research_papers').load()

In [6]:
len(data)

197
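
The loader appears to return one document per PDF page (the metadata printed later in this notebook carries `page` and `source` keys), so 197 is a page count rather than a paper count. A minimal sketch for counting pages per paper is below. `PageStub` is a hypothetical stand-in so the snippet runs without LangChain; in the notebook, `data` could be passed to `pages_per_source` instead, assuming each item exposes a `metadata` dict.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in for a loaded page: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(documents) -> Counter:
    """Count how many loaded pages came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


# Toy data, not the real papers
toy_pages = [
    PageStub("page one", {"source": "attention.pdf", "page": 0}),
    PageStub("page two", {"source": "attention.pdf", "page": 1}),
    PageStub("page one", {"source": "lora.pdf", "page": 0}),
]
page_counts = pages_per_source(toy_pages)
print(page_counts)
```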

In [7]:
data[0].page_content

'Published as a conference paper at ICLR 2015\nNEURAL MACHINE TRANSLATION\nBYJOINTLY LEARNING TO ALIGN AND TRANSLATE\nDzmitry Bahdanau\nJacobs University Bremen, Germany\nKyungHyun Cho Yoshua Bengio∗\nUniversit ´e de Montr ´eal\nABSTRACT\nNeural machine translation is a recently proposed approach to machine transla-\ntion. Unlike the traditional statistical machine translation, the neural machine\ntranslation aims at building a single neural network that can be jointly tuned to\nmaximize the translation performance. The models proposed recently for neu-\nral machine translation often belong to a family of encoder–decoders and encode\na source sentence into a ﬁxed-length vector from which a decoder generates a\ntranslation. In this paper, we conjecture that the use of a ﬁxed-length vector is a\nbottleneck in improving the performance of this basic encoder–decoder architec-\nture, and propose to extend this by allowing a model to automatically (soft-)search\nfor parts of a source sentenc

In [8]:
# split the data into compatible chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
text_chunks = text_splitter.split_documents(data)
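
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is an illustration only, not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and newlines). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

With 200-character chunks, a single chunk often holds only one or two sentences, which may explain why some retrieved passages later in this notebook look fragmentary.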

In [9]:
type(text_chunks), len(text_chunks)

(list, 4690)

In [10]:
text_chunks[1].page_content

'KyungHyun Cho Yoshua Bengio∗\nUniversit ´e de Montr ´eal\nABSTRACT\nNeural machine translation is a recently proposed approach to machine transla-'

In [11]:
# Next, load these chunks into our Weaviate vector database

from langchain.vectorstores import Weaviate
import weaviate

In [12]:
from google.colab import userdata
WEAVIATE_URL = 'https://mb-rag-llm-799vd5g0.weaviate.network'
WEAVIATE_API_KEY = userdata.get('WEAVIATE_API_KEY')

weaviate_client = weaviate.Client(
    url=WEAVIATE_URL, auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY)
)

In [13]:
vector_db = Weaviate.from_documents(
    text_chunks, embeddings, client=weaviate_client, by_text=False
)

In [18]:
searches = vector_db.similarity_search("what is a llm?", k=5)

In [19]:
for idx, content in enumerate(searches):
  print(f"Search {idx}: {content.page_content}")

Search 0: contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Search 1: limited the development of LLMs to a few players. There have been public releases of pretrained LLMs
Search 2: (Askell et al., 2021a) similar to Section 3.3. We observe that the safety capabilities of LLMs can be efficiently
Search 3: better understand and analyze the varied behavior exhibited by LLMs across different demographic groups.
70
Search 4: LLMs with knowledge from a reference textual
database, which enables them to act as a natu-
ral language layer between a user and textual
databases, reducing the risk of hallucinations.
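
Conceptually, the similarity search above embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows a brute-force version of that idea with cosine similarity on toy 2-D vectors. It is an assumption-laden illustration: the metric Weaviate actually uses depends on its configuration, vector databases typically use an approximate index rather than a full scan, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for chunk embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```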


## As seen above, the raw search results are incoherent and inconsistent answers to our question. Let's see whether passing them through an LLM improves the response.

In [20]:
from langchain.prompts import ChatPromptTemplate

In [21]:
template = """You are an datascientist assistant bot. Users will come to you and ask you various questions
on datascience and machine learning. You need to use the retrieved context that is provided to you and provide answer
to the user based on their question. Try to be elaborative and use upto 1000 words in your reply. Your response should be complete
and shouldn't break in the middle of a sentence. Also, if you don't know something, tell them politely that you have no idea on this topic.
Question: {query}
Context: {context}
Answer:
"""

In [22]:
template

"You are an datascientist assistant bot. Users will come to you and ask you various questions\non datascience and machine learning. You need to use the retrieved context that is provided to you and provide answer\nto the user based on their question. Try to be elaborative and use upto 1000 words in your reply. Your response should be complete\nand shouldn't break in the middle of a sentence. Also, if you don't know something, tell them politely that you have no idea on this topic.\nQuestion: {query}\nContext: {context}\nAnswer:\n"

In [23]:
prompt=ChatPromptTemplate.from_template(template)

In [24]:
# Helpers for composing the chain (the unused ChatOpenAI import is dropped; Mistral is used instead)
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

In [25]:
# Using Mistral via Hugging Face API

from langchain import HuggingFaceHub

llm_model = HuggingFaceHub(
    huggingfacehub_api_token=userdata.get('HUGGINGFACE_API_KEY'),
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",
    model_kwargs={"temperature":1, "max_length":180}
)

  warn_deprecated(


In [26]:
output_parser=StrOutputParser()
retriever=vector_db.as_retriever()

In [27]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm_model
    | output_parser
)
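
The chain above passes the retriever's output straight into `{context}`, so the prompt receives the string form of a list of `Document` objects, metadata included (this is visible in the printed responses). One common alternative is to join only the page text before filling the template. The sketch below shows that step in plain Python. `DocStub`, `format_docs`, and `build_prompt` are hypothetical names, and the template is a shortened version of the one defined earlier.

```python
from dataclasses import dataclass, field


@dataclass
class DocStub:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(documents, separator: str = "\n\n") -> str:
    """Join only the text of each retrieved document, dropping metadata."""
    return separator.join(doc.page_content.strip() for doc in documents)


def build_prompt(query: str, documents) -> str:
    """Fill a shortened question/context/answer template with plain text."""
    template = "Question: {query}\nContext: {context}\nAnswer:\n"
    return template.format(query=query, context=format_docs(documents))


# Toy retrieved chunks, not real search results
toy_docs = [
    DocStub("RAG systems retrieve passages first. ", {"page": 2}),
    DocStub("The generator then conditions on them.", {"page": 1}),
]
toy_prompt = build_prompt("what is a RAG", toy_docs)
print(toy_prompt)
```

In the notebook's chain this would likely correspond to `{"context": retriever | format_docs, "query": RunnablePassthrough()}`. That variant is untested here, so treat it as an assumption.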

In [28]:
response = rag_chain.invoke('what is an LLM?')
print(response)

Human: You are an datascientist assistant bot. Users will come to you and ask you various questions
on datascience and machine learning. You need to use the retrieved context that is provided to you and provide answer
to the user based on their question. Try to be elaborative and use upto 1000 words in your reply. Your response should be complete
and shouldn't break in the middle of a sentence. Also, if you don't know something, tell them politely that you have no idea on this topic.
Question: what is an LLM?
Context: [Document(page_content='contribute to the responsible development of LLMs.\n∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n†Second author', metadata={'page': 0, 'source': 'drive/MyDrive/llms/research_papers/llama2.pdf'}), Document(page_content='limited the development of LLMs to a few players. There have been public releases of pretrained LLMs', metadata={'page': 2, 'source': 'drive/MyDrive/llms/research_papers/llama2.pdf'}), Document(page_cont

In [29]:
response = rag_chain.invoke('what are someways to finetune a llm')
print(response)

Human: You are an datascientist assistant bot. Users will come to you and ask you various questions
on datascience and machine learning. You need to use the retrieved context that is provided to you and provide answer
to the user based on their question. Try to be elaborative and use upto 1000 words in your reply. Your response should be complete
and shouldn't break in the middle of a sentence. Also, if you don't know something, tell them politely that you have no idea on this topic.
Question: what are someways to finetune a llm
Context: [Document(page_content='contribute to the responsible development of LLMs.\n∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n†Second author', metadata={'page': 0, 'source': 'drive/MyDrive/llms/research_papers/llama2.pdf'}), Document(page_content='limited the development of LLMs to a few players. There have been public releases of pretrained LLMs', metadata={'page': 2, 'source': 'drive/MyDrive/llms/research_papers/llama2.pdf'})

In [30]:
response = rag_chain.invoke('what is a RAG')
print(response)

Human: You are an datascientist assistant bot. Users will come to you and ask you various questions
on datascience and machine learning. You need to use the retrieved context that is provided to you and provide answer
to the user based on their question. Try to be elaborative and use upto 1000 words in your reply. Your response should be complete
and shouldn't break in the middle of a sentence. Also, if you don't know something, tell them politely that you have no idea on this topic.
Question: what is a RAG
Context: [Document(page_content='for the generated answer. Indeed, RAG systems are\noften used in applications where the factual con-\nsistency of the generated text w.r.t. the grounded', metadata={'page': 2, 'source': 'drive/MyDrive/llms/research_papers/rag_evaluation.pdf'}), Document(page_content='examples/rag/ . An interactive demo of RAG models can be found at https://huggingface.co/rag/\n2', metadata={'page': 1, 'source': 'drive/MyDrive/llms/research_papers/rag_for_intensive_nl

In [36]:
answer = response.split(sep="Answer:")[1].strip()
print(answer)

RAG (Retrieval-Augmentation-Generation) is a technique used in natural language processing (NLP) to improve the quality of generated text by incorporating context relevant to the question being asked. It involves three main steps: retrieval, augmentation, and generation.

During the retrieval step, the model retrieves relevant passages from a knowledge base or corpus that contain information related to the question being asked. These passages are then used as input
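
The extraction step above indexes `[1]` after splitting on "Answer:", which raises an `IndexError` when the marker is missing and drops text if the model itself writes "Answer:" again. A slightly more defensive sketch is below. `extract_answer` is a hypothetical helper, and it still assumes that the first "Answer:" in the returned text is the one from the prompt template, which may not hold if a retrieved chunk contains that string.

```python
def extract_answer(generated_text: str, marker: str = "Answer:") -> str:
    """Return everything after the first marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Toy response that echoes the prompt before the generated answer
toy_response = "Question: what is a RAG\nContext: some chunk\nAnswer:\n RAG combines retrieval with generation."
toy_answer = extract_answer(toy_response)
print(toy_answer)
```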
