<a href="https://colab.research.google.com/github/Shriansh16/Different_RAG_Techniques/blob/main/01_Query_Rewriting_and_Document_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

### Reading the PDF files from the source directory

In [9]:
loader = DirectoryLoader('/content', glob='*.pdf', loader_cls=PyPDFLoader)
document = loader.load()

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=50,
    length_function=len
)
new_docs = text_splitter.split_documents(documents=document)
doc_strings = [doc.page_content for doc in new_docs]
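
The splitter above caps each chunk at 800 characters and repeats 50 characters between neighbouring chunks. The sketch below illustrates only that size/overlap arithmetic with a plain sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines), and `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 50) -> list[str]:
    """Hypothetical helper: fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz" * 100  # 2,600 characters of toy text
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-50:] == chunks[1][:50])
```

A larger overlap lowers the chance that a sentence is cut in half at a chunk boundary, at the cost of storing and embedding more redundant text.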

## BGE Embeddings

In [11]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # unit-length vectors, so the dot product equals cosine similarity
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

In [12]:
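# Optional sanity check: Chroma.from_documents computes the embeddings itself,
# so `vectors` is not used again later in the notebook.
vectors = embeddings.embed_documents(doc_strings)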
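
The `normalize_embeddings=True` setting matters because, once vectors have unit length, a plain dot product gives the same value as cosine similarity. The sketch below checks this on two made-up 3-dimensional vectors using only the standard library. The helper names and the toy vectors are assumptions for illustration, not part of the notebook.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both numbers agree up to floating-point error.
print(cosine_similarity(a, b), dot(a_unit, b_unit))
```

Assuming normalization is active, the same check on the real embeddings (`dot(vectors[0], vectors[0])`) should come out very close to 1.0.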

### Creating Retriever using Vector DB

In [15]:
db = Chroma.from_documents(new_docs, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 4})

In [17]:
from groq import Groq

### Query Rewriter or Augmented Queries

In [19]:
import os

def rewrite_query(text):
    """Ask the LLM for five alternative phrasings of the user question."""
    # Read the API key from the environment instead of hardcoding it in the notebook.
    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
    completion = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": """You are an AI language model assistant. Your task is to generate five
      different versions of the given user question to retrieve relevant documents from a vector
      database. By generating multiple perspectives on the user question, your goal is to help
      the user overcome some of the limitations of the distance-based similarity search.
      Provide these alternative questions separated by newlines.
      Only provide the query, do not do numbering at the start of the questions."""
            },
            {
                "role": "user",
                "content": text
            }
        ],
        temperature=0.6,
        max_tokens=1024,
        top_p=1
    )
    return completion.choices[0].message.content

In [41]:
query = "what is self attention?"

In [28]:
# The function is named rewrite_query so that the `query` string does not shadow it.
variants_of_query = rewrite_query(query)

In [29]:
variants_of_query

'What is the concept of self-attention in machine learning \nSelf-attention mechanism and its applications in deep learning \nHow does self-attention work in neural networks \nExplain the role of self-attention in transformer architectures \nWhat are the benefits of using self-attention in natural language processing tasks'

In [30]:
queries = variants_of_query.split('\n')

In [31]:
queries

['What is the concept of self-attention in machine learning ',
 'Self-attention mechanism and its applications in deep learning ',
 'How does self-attention work in neural networks ',
 'Explain the role of self-attention in transformer architectures ',
 'What are the benefits of using self-attention in natural language processing tasks']
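
Splitting on `'\n'` keeps trailing spaces, as the output above shows, and it would also keep blank lines or numbering if the model ignored the prompt. A slightly more defensive parser might look like the sketch below. `parse_query_variants` is a hypothetical helper, and the numbering and bullet patterns it strips are assumptions about what an LLM might emit.

```python
import re


def parse_query_variants(raw: str) -> list[str]:
    """Hypothetical helper: split LLM output into clean, de-duplicated queries."""
    variants = []
    seen = set()
    for line in raw.splitlines():
        # Drop leading numbering or bullets such as "1. ", "2) ", "- ", "* ".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", line).strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            variants.append(cleaned)
    return variants


raw_output = (
    "1. What is self-attention? \n"
    "\n"
    "- How does self-attention work \n"
    "How does self-attention work"
)
print(parse_query_variants(raw_output))
```

De-duplicating here avoids sending identical queries to the retriever, which would only return the same chunks again.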

In [33]:
# Retrieve the top-k chunks for every query variant; `invoke` replaces the
# deprecated `get_relevant_documents`.
docs = [retriever.invoke(q) for q in queries]

## Removing Duplicate Docs (Retrieved Chunks)

In [36]:
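# Different query variants often retrieve the same chunk, so keep each chunk once.
unique_contents = set()   # chunk texts seen so far
unique_docs = []          # first Document seen for each distinct text
for sublist in docs:
    for doc in sublist:
        if doc.page_content not in unique_contents:
            unique_docs.append(doc)
            unique_contents.add(doc.page_content)
# Converting the set to a list does not preserve retrieval order.
unique_contents = list(unique_contents)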

In [37]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

## Creating Query (Original Query) and Doc Pairs

In [42]:
# Pair the original query with every unique chunk for the cross-encoder.
pairs = [[query, doc] for doc in unique_contents]

In [43]:
pairs[0]

['what is self attention?',
 'Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations\nfor different layer types. n is the sequence length, d is the representation dimension, k is the kernel\nsize of convolutions and r the size of the neighborhood in restricted self-attention.\nLayer Type Complexity per Layer Sequential Maximum Path Length\nOperations\nSelf-Attention O(n2 · d) O(1) O(1)\nRecurrent O(n · d2) O(n) O(n)\nConvolutional O(k · n · d2) O(1) O(logk(n))\nSelf-Attention (restricted) O(r · n · d) O(1) O(n/r)\n3.5 Positional Encoding\nSince our model contains no recurrence and no convolution, in order for the model to make use of the\norder of the sequence, we must inject some information about the relative or absolute position of the']

## Getting Scores on Pairs using Cross Encoder

In [44]:
scores = cross_encoder.predict(pairs)
scores

array([-1.7087057 ,  0.7559761 ,  2.9585962 ,  2.5362718 ,  1.3514689 ,
       -2.7385678 ,  6.508275  , -0.6133841 , -3.1395483 , -0.35369748,
       -0.8745614 ], dtype=float32)

In [45]:
scored_docs = zip(scores, unique_contents)
# Sort by cross-encoder score only, highest first.
sorted_docs = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
sorted_docs

[(6.508275,
  'the number of operations required to relate signals from two arbitrary input or output positions grows\nin the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\nit more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is\nreduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. Self-attention has been'),
 (2.9585962,
  'encoder.\n• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to\nall positions in the decoder up to and including that position. We need to prevent leftward\ninformation flow in the decode

## Reranking the documents

In [47]:
reranked_docs = [doc for _, doc in sorted_docs]
reranked_docs[:3]

['the number of operations required to relate signals from two arbitrary input or output positions grows\nin the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\nit more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is\nreduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. Self-attention has been',
 'encoder.\n• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to\nall positions in the decoder up to and including that position. We need to prevent leftward\ninformation flow in the decoder to preserve the auto-regre
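
The reranking above works on plain strings, so metadata such as the source file or page number is lost. The sketch below shows one way the score, sort and truncate steps could be wrapped so that whole document objects are kept. `Doc`, `toy_score` and `rerank` are hypothetical names, and the word-overlap scorer is only a placeholder standing in for `cross_encoder.predict`.

```python
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


def toy_score(pairs):
    """Placeholder scorer: counts query words that appear in each passage."""
    scores = []
    for query, passage in pairs:
        passage_words = set(passage.lower().split())
        scores.append(float(sum(w in passage_words for w in query.lower().split())))
    return scores


def rerank(query, docs, score_fn, top_n=3):
    """Score (query, passage) pairs and return the top_n (score, doc) tuples."""
    scores = score_fn([[query, d.page_content] for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]


docs = [
    Doc("positional encoding injects order information", {"page": 5}),
    Doc("self attention relates positions of a single sequence", {"page": 2}),
    Doc("attention weights are computed with softmax", {"page": 4}),
]
top = rerank("what is self attention", docs, toy_score, top_n=2)
for score, doc in top:
    print(score, doc.metadata, doc.page_content)
```

In the notebook, the same function could presumably be called as `rerank(query, unique_docs, cross_encoder.predict)`, which would keep each chunk's metadata alongside its score.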

## The LLM can now use these chunks to generate the response
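
The notebook stops before the generation step. As a rough sketch of what that step could look like, the helper below packs the top reranked chunks into chat messages. `build_rag_messages`, the prompt wording and the `max_chunks` default are all assumptions for illustration, not the author's implementation.

```python
def build_rag_messages(question: str, chunks: list[str], max_chunks: int = 3) -> list[dict]:
    """Hypothetical helper: pack the top reranked chunks into chat messages."""
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk.strip()}"
        for i, chunk in enumerate(chunks[:max_chunks], start=1)
    )
    system_prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


example_chunks = [
    "Self-attention relates different positions of a single sequence.",
    "Positional encodings inject order information.",
    "chunk three",
    "chunk four",
]
messages = build_rag_messages("what is self attention?", example_chunks)
print(messages[0]["content"])
```

With the real data, `reranked_docs` would take the place of `example_chunks`, and the resulting `messages` list could be passed to `client.chat.completions.create(...)` in the same way as in the query-rewriting cell.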