# RAG Research

One challenge that individual scientists and research organizations face daily is the need for quick, reliable information support. Existing solutions fall short: Internet search engines can return unreliable information and are limited to online sources, while LLMs (large language models) are prone to "hallucinations." RAG (Retrieval-Augmented Generation) addresses these shortcomings by generating answers based on verified data.

We therefore decided to develop a RAG-based AI assistant that draws on grant and university websites for organizational questions and on biology literature for scientific ones. The assistant has two main tasks:
- Supporting organizational tasks related to grant submissions (grant aggregation, selection recommendations, document preparation assistance);
- Answering specialized scientific questions.

## Import packages

In [239]:
! pip install nltk numpy pandas unidecode python-dotenv tqdm rouge-score chromadb pdfminer.six docarray pymupdf
! pip install transformers torch scikit-learn
! pip install langchain langchain-core langchain-community langchain_experimental langchain-chroma langchain_mistralai



In [240]:
import os
import json
import nltk
import string
import numpy as np
import pandas as pd
from unidecode import unidecode
import transformers
import torch
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path
from tqdm import tqdm
from dotenv import load_dotenv
from getpass import getpass
from rouge_score import rouge_scorer

from langchain_community.document_loaders import PyMuPDFLoader, PDFMinerLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_mistralai.chat_models import ChatMistralAI
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain.chains import RetrievalQA
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains.question_answering.stuff_prompt import CHAT_PROMPT as DEFAULT_PROMPT
from langchain.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_experimental.text_splitter import SemanticChunker

## Setup environment variables

Define the following environment variables in a `.env` file, in the shell environment, or at the prompt this notebook shows when a variable is missing:
1. MISTRAL_API_KEY

In [241]:
env_variables = [
  'MISTRAL_API_KEY',
]

load_dotenv()

for key in env_variables:
  value = os.getenv(key)

  if value is None:
    value = getpass(key)

  os.environ[key] = value

## Download NLTK dictionaries

These NLTK resources (tokenizer models, stopword lists, and WordNet) are needed for the text preprocessing below.

In [242]:
dict_ids = [
  'punkt_tab',
  'punkt',
  'stopwords',
  'wordnet',
]

for dict_id in dict_ids:
  nltk.download(dict_id, quiet=True)

## Setup metrics

### Text preprocessing

Define a text preprocessing function to apply before calculating any metric. It normalizes the text so that superficial differences (case, punctuation, inflection) do not affect the scores. The preprocessing involves several steps:
1. Lowercasing
2. Stopword and punctuation removal
3. Lemmatization
4. Accent removal (transliteration to ASCII)

In [243]:
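lemmatizer = nltk.stem.WordNetLemmatizer()

# Build the stop set once: English and Russian stopwords plus punctuation
stopset = set(
  nltk.corpus.stopwords.words('english')
  + nltk.corpus.stopwords.words('russian')
  + list(string.punctuation)
)

def preprocess(corpus: str) -> str:
  corpus = corpus.lower()
  tokens = nltk.word_tokenize(corpus)
  tokens = [t for t in tokens if t not in stopset]
  # WordNet is an English resource; tokens it does not know are returned unchanged
  tokens = [lemmatizer.lemmatize(t) for t in tokens]
  corpus = ' '.join(tokens)
  # Transliterate to ASCII (strips accents, romanizes Cyrillic)
  corpus = unidecode(corpus)
  return corpus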

#### Example

In [244]:
sentences = [
  'The quick brown fox jumps over the lazy dog',
  'Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich',
  'Дует на море циклон, попадает на Цейлон',
]

for before in sentences:
  after = preprocess(before)

  print(f'Before: {before}')
  print(f'After: {after}')
  print()

Before: The quick brown fox jumps over the lazy dog
After: quick brown fox jump lazy dog

Before: Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich
After: zwolf boxkampfer jagen viktor quer uber den grossen sylter deich

Before: Дует на море циклон, попадает на Цейлон
After: duet more tsiklon popadaet tseilon



### Embedding Initialization

Here we initialize the Llama 3 embeddings model. `OllamaEmbeddings` is a LangChain wrapper that requests embeddings from a model served by Ollama, a tool for running language models locally. With `llama3`, each input text is mapped to a 4096-dimensional vector.

`OllamaEmbeddings` requires a running local Ollama server, which can be found at https://ollama.com, with the `llama3` model pulled.

In [245]:
embeddings = OllamaEmbeddings(model='llama3')

#### Example

In [246]:
text_embedding = np.array(embeddings.embed_query('Hello, brave new world'))

In [247]:
text_embedding.shape

(4096,)

In [248]:
text_embedding

array([ 1.79492247, -1.9449054 ,  4.83772612, ..., -1.88062811,
       -2.57043958,  0.60931432])

### Set up sentences for metrics examples

In [249]:
expected_answer = 'An integral is a fundamental concept in calculus that represents the area under a curve'
predicted_answer = 'In mathematics, an integral is a fundamental concept asdasdasdasdasdas in calculus that represents the area under a curve or, more generally, the accumulation of quantities'

### Average embeddings cosine similarity metric

This function calculates the average cosine similarity between the expected answers and the LLM-predicted answers, using their embeddings. Cosine similarity measures how closely two non-zero vectors point in the same direction; it is the cosine of the angle between them:

$$
K(a, b) = \frac{\sum \limits_{i=1}^n a_i b_i}{\sqrt{\sum \limits_{i=1}^n a_i^2} \cdot \sqrt{\sum \limits_{i=1}^n b_i^2}}
$$

In [250]:
def embeddings_cosine_sim_metric(expected_answers: list[str], predicted_answers: list[str]) -> float:
  results = []

  for expected_answer, predicted_answer in zip(expected_answers, predicted_answers):
    expected_answer = preprocess(expected_answer)
    predicted_answer = preprocess(predicted_answer)

    expected_embedding = np.array(embeddings.embed_query(expected_answer))
    predicted_embedding = np.array(embeddings.embed_query(predicted_answer))

    sim = cosine_similarity(
      expected_embedding.reshape(1, -1),
      predicted_embedding.reshape(1, -1),
    )[0][0]

    results.append(sim)

  return np.mean(results)

#### Example

In [251]:
embeddings_cosine_sim_metric([expected_answer], [predicted_answer])

0.8501560579200915
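
To make the formula concrete, here is a minimal pure-Python sketch of the same computation on toy vectors. The helper name `cosine_similarity_1d` and the three-dimensional "embeddings" are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on the 4096-dimensional vectors.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only, not real embeddings).
same_direction = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction)  # close to 1: the vectors point the same way
print(orthogonal)      # 0.0: the vectors are perpendicular
```

Because the score depends only on direction, scaling a vector does not change it, which is why the first pair scores as (almost exactly) identical.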

### BLEU Metric

This function calculates the average BLEU (Bilingual Evaluation Understudy) score between expected answers and predicted answers. The BLEU score is a measure that compares a candidate translation of text to one or more reference translations.

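A smoothing function is defined for the BLEU calculation. Smoothing matters for short answers: if the predicted answer shares no n-gram of some order (for example, no 4-gram) with the expected answer, the unsmoothed BLEU score collapses to zero. The smoothing function (here NLTK's `method4`) prevents this.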

In [252]:
smoothie_f = nltk.translate.bleu_score.SmoothingFunction().method4

def bleu_metric(expected_answers, predicted_answers):
  scores = []

  for expected_answer, predicted_answer in zip(expected_answers, predicted_answers):
    expected_answer = preprocess(expected_answer)
    predicted_answer = preprocess(predicted_answer)

    predicted_tokens = nltk.word_tokenize(predicted_answer)
    expected_tokens = [nltk.word_tokenize(expected_answer)]

    score = nltk.translate.bleu_score.sentence_bleu(
      expected_tokens,
      predicted_tokens,
      smoothing_function=smoothie_f,
    )

    scores.append(score)

  return np.mean(scores)

#### Example

In [253]:
bleu_metric([expected_answer], [predicted_answer])

0.30661487102926754
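
The need for smoothing can be seen from the n-gram precisions that BLEU combines. The sketch below is a simplified illustration, not NLTK's implementation: the helper names and the two toy sentences are assumptions, and it computes only the clipped n-gram precisions, leaving out the brevity penalty and the geometric mean.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: a candidate n-gram counts at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat is on a mat".split()

precisions = [modified_precision(reference, candidate, n) for n in (1, 2, 3, 4)]
print(precisions)
```

Here the 3-gram and 4-gram precisions are zero, so a geometric mean over all four orders would be zero even though the sentences clearly overlap. Smoothing replaces such zero counts with small positive values.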

### ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation 1-gram Scoring)

In [254]:
rogue_1_scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

def rogue_1_metric(expected_answers, predicted_answers):
  scores = []

  for expected_answer, predicted_answer in zip(expected_answers, predicted_answers):
    expected_answer = preprocess(expected_answer)
    predicted_answer = preprocess(predicted_answer)

    result = rogue_1_scorer.score(
      expected_answer,
      predicted_answer,
    )

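    # result['rouge1'] is a Score(precision, recall, fmeasure) tuple, so the
    # np.mean below averages all three values, not only the F-measure
    scores.append(result['rouge1'])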

  return np.mean(scores)

#### Example

In [255]:
rogue_1_metric([expected_answer], [predicted_answer])

0.7733918128654972
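
For reference, ROUGE-1 reduces to unigram overlap counts. The following is a minimal sketch under simplifying assumptions (whitespace tokens, no stemming); the function name `rouge_1` and the toy sentences are illustrative and not part of the `rouge_score` package.

```python
from collections import Counter

def rouge_1(reference_tokens, candidate_tokens):
    """Unigram precision, recall and F1 between two token lists."""
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    precision = overlap / len(candidate_tokens) if candidate_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

reference = "integral area under curve".split()
candidate = "integral measures area under the curve".split()

precision, recall, f1 = rouge_1(reference, candidate)
print(precision, recall, f1)
```

Recall is 1.0 here because every reference token appears in the candidate, while precision is lower because the candidate contains two extra tokens.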

### ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation Longest Common Subsequence)

In [256]:
rogue_l_scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def rogue_l_metric(expected_answers, predicted_answers):
  scores = []

  for expected_answer, predicted_answer in zip(expected_answers, predicted_answers):
    expected_answer = preprocess(expected_answer)
    predicted_answer = preprocess(predicted_answer)

    result = rogue_l_scorer.score(
      expected_answer,
      predicted_answer,
    )

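    # result['rougeL'] is a Score(precision, recall, fmeasure) tuple, so the
    # np.mean below averages all three values, not only the F-measure
    scores.append(result['rougeL'])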

  return np.mean(scores)

#### Example

In [257]:
rogue_l_metric([expected_answer], [predicted_answer])

0.7733918128654972
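
ROUGE-L replaces unigram overlap with the length of the longest common subsequence (LCS), so word order matters. The sketch below shows the LCS computation on toy sentences; the function name `lcs_length` and the examples are assumptions for illustration, not the `rouge_score` internals.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists
    (dynamic programming, one row at a time)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

lcs = lcs_length(reference, candidate)
recall = lcs / len(reference)
precision = lcs / len(candidate)
print(lcs, precision, recall)
```

Reversing a sentence keeps every unigram but shortens the LCS, which is the case where ROUGE-L and ROUGE-1 diverge.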

## Setup document loaders

### Define documents path

In [258]:
docs_dir = Path('./docs')

### PyMuPDF

In [259]:
def get_py_mu_pdf_docs():
  docs = []

  for file in docs_dir.iterdir():
    if file.is_file() and file.suffix == '.pdf':
      loader = PyMuPDFLoader(file, extract_images=True)
      docs.extend(loader.load())

  return docs

### PDFMiner

`PDFMinerLoader` is a LangChain document loader built on pdfminer.six, a library for extracting text from PDF files. It loads the text content of each PDF document. For additional details, please refer to: https://github.com/pdfminer/pdfminer.six

In [260]:
def get_pdf_miner_docs():
  docs = []

  for file in docs_dir.iterdir():
    if file.is_file() and file.suffix == '.pdf':
      loader = PDFMinerLoader(file)
      docs.extend(loader.load())

  return docs

## Setup document splitters

### Recursive character text splitter

The `RecursiveCharacterTextSplitter` splits the loaded documents into chunks of no more than 700 characters each, with no overlap between chunks.

In [261]:
def split_docs_by_character_splitter(docs):
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=0,
    length_function=len,
  )

  return text_splitter.split_documents(docs)
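
The idea behind recursive splitting can be sketched in a few lines of plain Python. This is a simplified illustration and not LangChain's implementation: the function name `recursive_split`, the separator order, and the absence of overlap handling are all assumptions of the sketch.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, trying
    coarse separators (paragraphs) before fine ones (words, characters)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part is still too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
print(chunks)
```

With `chunk_size=7` the first paragraph is too long, so it is split again on spaces, while the second paragraph fits and stays whole.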

### Semantic splitter

The `SemanticChunker` divides text into chunks based on similarity in the embedding space. It first segments the text into sentences, groups each sentence with its neighbors (windows of three sentences by default), embeds each group, and then keeps adjacent groups together when their embeddings are similar, starting a new chunk where the distance between neighbors is large.

In [262]:
def split_docs_by_semantic_splitter(docs):
  text_splitter = SemanticChunker(embeddings)

  # split_documents() accepts Document objects; create_documents() expects raw strings
  return text_splitter.split_documents(docs)
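
One common way to combine adjacent sentences that are close in embedding space is to cut wherever the distance between neighbors exceeds a threshold. The sketch below illustrates that idea only; the function names, the hand-made two-dimensional vectors, and the fixed threshold of 0.5 are assumptions, and a library chunker may pick its breakpoints differently.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def split_at_breakpoints(sentences, vectors, threshold):
    """Start a new chunk wherever two neighboring sentence vectors
    are farther apart than the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]

sentences = ["Neurons fire.", "Axons conduct.", "Grants fund labs.", "Deadlines matter."]
# Hand-made toy vectors: the first two and the last two point in similar directions.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

semantic_chunks = split_at_breakpoints(sentences, vectors, threshold=0.5)
print(semantic_chunks)
```

The only large jump is between the second and third sentence, so the text is cut there into a neuroscience chunk and a grants chunk.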

## Setup LLMs

### Llama 2

In [263]:
def get_llama2_llm(temperature=0):
  return Ollama(model='llama2', temperature=temperature)

### Llama 3

In [264]:
def get_llama3_llm(temperature=0):
  return Ollama(model='llama3', temperature=temperature)

### OpenBioLLM 8B

#### Setup PyTorch device

In [265]:
device = (
  'cuda'
  if torch.cuda.is_available()
  else 'mps'
  if torch.backends.mps.is_available()
  else 'cpu'
)

#### Setup LLM

For additional details, please visit https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B

The `openbiollm_parser` function is designed to remove unnecessary elements from the OpenBioLLM output.

In [266]:
def openbiollm_parser(output):
  # Keep only the text that follows the prompt's final marker
  marker = 'Helpful Answer: '
  idx = output.find(marker)
  if idx != -1:
    return output[idx + len(marker):]
  return output

def get_openbiollm_8b_llm(temperature=0.0001):
  model = 'aaditya/OpenBioLLM-Llama3-8B'
  model_kwargs = {'torch_dtype': torch.bfloat16}
  pipeline = transformers.pipeline(
    'text-generation',
    model=model,
    model_kwargs=model_kwargs,
    device=device,
  )
  terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids('<|eot_id|>')
  ]
  llm = HuggingFacePipeline.from_model_id(
    model_id=model,
    task='text-generation',
    model_kwargs=model_kwargs,
    pipeline_kwargs={
      'max_new_tokens': 256,
      'eos_token_id': terminators,
      'do_sample': True,
      'temperature': temperature,
      'top_p': 0.9,
    },
  )
  return llm | openbiollm_parser

### Mistral

In [267]:
def get_mistral_llm(temperature=0):
  return ChatMistralAI(temperature=temperature)

## Setup index stores

### DocArray

For additional details, please visit https://docs.docarray.org

In [268]:
def get_doc_array_vector_store(docs=[]):
  index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
  ).from_documents(docs)
  return index.vectorstore

### Chroma

For additional details, please visit https://www.trychroma.com

In [269]:
def get_chroma_vector_store(docs=[]):
  vector_store = Chroma.from_documents(docs, embeddings)
  return vector_store

## Setup prompt templates

### Few-shot prompting

Few-shot prompting supplies example questions together with their expected answers, which helps the LLM produce answers in the style of the provided dataset.

In [270]:
example_prompt_template = """
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Question: {question}
"""
few_shot_examples = [
  {
    "question": "Which cranial nerves are motor?",
    "answer": "Oculomotor\nTrochlear \nAbducens\nAccessory\nHypoglossal"
  },
  {
    "question": "Which of the cranial nerves have both sensory and motor control ?",
    "answer": "Trigeminal\nFacial\nGlossopharyngeal\nVagus"
  },
  {
    "question": "Which regions of the cross section of the spinal cord have a larger ventral horn ?",
    "answer": "The cervical and lumbar regions have larger ventral horns. The thoracic region has a smaller ventral horn region because it controls the trunk so not many motor neurones are coming out. Thoracic region has a more prominent lateral horn where preganglionic neurones are present"
  },
  {
    "question": "What are the subdivisions of the vertebral column ?",
    "answer": "Cervical = 8\nThoracic= 12\nLumbar=5 \nSacral=5 \nCoccygeal"
  },
]
example_prompt = ChatPromptTemplate.from_messages(
  [
    ("human", example_prompt_template),
    ("ai", "{answer}\n"),
  ],
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
  example_prompt=example_prompt,
  examples=few_shot_examples,
  input_variables=["question"],
)
base_prompt = ChatPromptTemplate.from_template("""
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
You answer in very short sentences and do not include extra information.

{context}

Question: {question}
Helpful Answer:
""")
final_few_shot_prompt = ChatPromptTemplate.from_messages(
  [
    few_shot_prompt,
    base_prompt
  ]
)
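
To see what the assembled prompt looks like, the sketch below rebuilds the same message layout with plain tuples instead of LangChain templates. The function `build_few_shot_messages`, the shortened instruction text, and the placeholder context are hypothetical and serve only to show the order of the messages.

```python
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know."
)

def build_few_shot_messages(examples, context, question):
    """Return (role, text) pairs: one human/ai pair per example,
    followed by the real question with the retrieved context."""
    messages = []
    for example in examples:
        messages.append(("human", f"{INSTRUCTION}\nQuestion: {example['question']}"))
        messages.append(("ai", example["answer"]))
    messages.append(
        ("human", f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:")
    )
    return messages

examples = [
    {
        "question": "Which cranial nerves are motor?",
        "answer": "Oculomotor\nTrochlear\nAbducens\nAccessory\nHypoglossal",
    },
]

messages = build_few_shot_messages(
    examples,
    context="(retrieved chunks go here)",
    question="What are the subdivisions of the vertebral column?",
)

for role, text in messages:
    print(f"[{role}] {text}\n")
```

The model sees the worked examples as earlier turns of the conversation, so the final answer tends to follow their short, list-like format.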

## Setup experiments

### Load QA dataset

In [271]:
qa_df = pd.read_csv('brainscape.csv')
qa_df

Unnamed: 0,question,answer
0,What are the afferent cranial nerve nuclei?,Trigeminal sensory nucleus- fibres carry gener...
1,What is the order of the cranial nerves ?,1-olfactory\n2-optic\n3-oculomotor\n4-trochlea...
2,What are the efferent cranial nerve nuclei?,Edinger-westphal nucleus\nOculomotor nucleus\n...
3,Which nuclei share the embryo logical origin -...,Oculomotor nucleus Trochlear nucleus Abducens ...
4,Which nuclei share the embryo logical origin- ...,Trigeminal motor nucleus Facial motor nucleus ...
...,...,...
1047,What is the purpose of gephyrin in the glycine...,Involved in anchoring the receptor to a specif...
1048,What is the glycine receptor involved in ?,Reflex response\nCauses reciprocal inhibition ...
1049,What happens in hyperperplexia ?,It’s an exaggerated reflex Often caused by a m...
1050,What is hyperperplexia treated with ?,Benzodiazepine


### Setup experiments grid search parameters

In [272]:
doc_loaders = [
  ('PyMuPDF', get_py_mu_pdf_docs),
  ('PDFMiner', get_pdf_miner_docs),
]

text_splitters = [
  ('Character splitter', split_docs_by_character_splitter),
  ('Semantic splitter', split_docs_by_semantic_splitter),
]

llms = [
  # ('LLaMA 2', get_llama2_llm()),
  ('LLaMA 3', get_llama3_llm()),
  # ('OpenBioLLM Llama3 8B', get_openbiollm_8b_llm()),
  # ('Mistral', get_mistral_llm()),
]

vector_stores = [
  ('DocArray', get_doc_array_vector_store),
  # ('Chroma', get_chroma_vector_store),
]

prompts = [
  ('Default', DEFAULT_PROMPT),
  # ('Few-shot prompting', final_few_shot_prompt),
]
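
The grid above expands to the Cartesian product of all options, plus the with/without-documents switch used in the search loop. As a rough sketch of how many configurations that yields, the snippet below enumerates the combinations by name only; the variable names are illustrative and the callables are left out.

```python
from itertools import product

# Names copied from the active (uncommented) grid entries.
loader_names = ["PyMuPDF", "PDFMiner"]
splitter_names = ["Character splitter", "Semantic splitter"]
llm_names = ["LLaMA 3"]
vector_store_names = ["DocArray"]
use_docs_options = [False, True]
prompt_names = ["Default"]

combinations = [
    combo
    for combo in product(
        loader_names, splitter_names, llm_names,
        vector_store_names, use_docs_options, prompt_names,
    )
    # Runs without documents are only evaluated with the DocArray store.
    if combo[4] or combo[3] == "DocArray"
]

print(len(combinations))  # 2 * 2 * 1 * 1 * 2 * 1 = 8
```

Each configuration answers every question in the dataset, so uncommenting more LLMs, stores, or prompts multiplies the total run time accordingly.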

### Load cached RAGs responses

In [273]:
cache_path = Path('cache.json')

if not os.path.exists(cache_path):
  data = {}
  with open(cache_path, 'w') as file:
    json.dump(data, file)

with open(cache_path, 'r') as f:
  cache = json.load(f)

cache.keys()

dict_keys(['PDFMiner_Character splitter_LLaMA 3_Doc Array In Memory Search_True', 'PDFMiner_Character splitter_LLaMA 3_Doc Array In Memory Search_False', 'PDFMiner_Character splitter_OpenBioLLM Llama3 8B_Doc Array In Memory Search_True', 'PDFMiner_Character splitter_LLaMA 2_Doc Array In Memory Search_False', 'PDFMiner_Character splitter_LLaMA 2_Doc Array In Memory Search_True', 'PDFMiner_Character splitter_OpenBioLLM Llama3 8B_Doc Array In Memory Search_False'])
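
The cache is a two-level dictionary: a configuration key maps to a dictionary of question/answer pairs. The sketch below shows the lookup pattern in isolation. The helpers `make_key` and `cached_answer`, as well as the `fake_rag` stand-in for the real chain, are hypothetical names introduced for this example.

```python
import json
import tempfile
from pathlib import Path

def make_key(*parts):
    """Join configuration parts into one cache key (non-strings are converted)."""
    return "_".join(str(part) for part in parts)

def cached_answer(cache, key, question, compute):
    """Return the cached answer, calling compute(question) only on a miss."""
    bucket = cache.setdefault(key, {})
    if question not in bucket:
        bucket[question] = compute(question)
    return bucket[question]

calls = []

def fake_rag(question):
    # Stand-in for the real RAG chain; records each call.
    calls.append(question)
    return f"answer to: {question}"

cache = {}
key = make_key("PDFMiner", "Character splitter", "LLaMA 3", "DocArray", True)
first = cached_answer(cache, key, "What is a neuron?", fake_rag)
second = cached_answer(cache, key, "What is a neuron?", fake_rag)  # cache hit

# The cache survives a JSON round trip, as with cache.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.json"
    path.write_text(json.dumps(cache))
    restored = json.loads(path.read_text())

print(key, len(calls), restored == cache)
```

Only the first lookup reaches the model; the second is served from the dictionary, which is what makes an interrupted grid search resumable.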

### Conduct the grid search and assess RAG metrics

In [274]:
df = pd.DataFrame()

questions = qa_df['question'].tolist()
expected_answers = qa_df['answer'].tolist()

for loader_name, get_docs in tqdm(doc_loaders, desc='Loaders'):
  docs = get_docs()

  for splitter_name, split_docs in tqdm(text_splitters, desc='Text splitters'):
    splitted_docs = split_docs(docs)

    for llm_name, llm in tqdm(llms, desc='LLMs', leave=False):
      for vector_store_name, get_vector_store in tqdm(vector_stores, desc='Vector Stores', leave=False):
        for use_docs in tqdm((False, True), desc='Use Docs', leave=False):
          for prompt_name, prompt_template in tqdm(prompts, desc='Prompts', leave=False):
            # Runs without documents are evaluated only with the DocArray vector store
            if not use_docs and vector_store_name != 'DocArray':
              continue

            # Create the final RAG out of all building blocks;
            # without documents, the vector store is left empty
            vector_store = get_vector_store(splitted_docs if use_docs else [])
            qa_llm = RetrievalQA.from_chain_type(
              llm=llm,
              chain_type='stuff',
              retriever=vector_store.as_retriever(search_kwargs={"k" : 10}),
              verbose=False,
              chain_type_kwargs = {
                'prompt': prompt_template,
                'document_separator': '<<<<<>>>>>'
              },
            )

            # Generate RAG responses
            predicted_answers = []

            for question in tqdm(questions, desc='Questions', leave=False):
              key = '_'.join([
                loader_name,
                splitter_name,
                llm_name,
                vector_store_name,
                str(use_docs),  # str.join() accepts only strings
              ])

              if key not in cache:
                cache[key] = {}
              if question not in cache[key]:
                cache[key][question] = qa_llm.invoke(question)['result']

              predicted_answers.append(cache[key][question])

              with open(cache_path, 'w') as f:
                json.dump(cache, f)

            # Evaluate metrics
            cos_sim = embeddings_cosine_sim_metric(expected_answers, predicted_answers)
            bleu_score = bleu_metric(expected_answers, predicted_answers)

            # Save results
            row = pd.DataFrame({
              'loader': loader_name,
              'splitter': splitter_name,
              'llm': llm_name,
              'vector_store': vector_store_name,
              'use_docs': use_docs,
              'prompt': prompt_name,
              'cos_sim': cos_sim,
              'bleu': bleu_score,
            }, index=[0])
            df = pd.concat([df, row], ignore_index=True)

Loaders:   0%|          | 0/2 [00:06<?, ?it/s]


KeyboardInterrupt: 

In [None]:
df

Unnamed: 0,llm,vector_store,use_docs,prompt,cos_sim,bleu
0,LLaMA 2,DocArray,False,Default,0,0
1,LLaMA 2,DocArray,False,Few-shot prompting,0,0
2,LLaMA 2,DocArray,True,Default,0,0
3,LLaMA 2,DocArray,True,Few-shot prompting,0,0
4,LLaMA 2,Chroma,False,Default,0,0
5,LLaMA 2,Chroma,False,Few-shot prompting,0,0
6,LLaMA 2,Chroma,True,Default,0,0
7,LLaMA 2,Chroma,True,Few-shot prompting,0,0
8,LLaMA 3,DocArray,False,Default,0,0
9,LLaMA 3,DocArray,False,Few-shot prompting,0,0
