## FlashRank

**[FlashRank](https://python.langchain.com/docs/integrations/retrievers/flashrank-reranker/)** is an ultra-lightweight and high-speed Python library designed to enhance search and retrieval pipelines with efficient **re-ranking** capabilities. It leverages **State-of-the-Art (SoTA)** cross-encoder models to improve search result relevance, making it an ideal choice for applications requiring accurate information retrieval.

### Key Features:
- **Fast and Lightweight**: Optimized for speed and minimal resource consumption.
- **Seamless Integration**: Easily integrates with existing search pipelines.
- **State-of-the-Art Models**: Utilizes advanced cross-encoders for high-quality re-ranking.
- **Flexible and Scalable**: Supports various search and retrieval tasks across different domains.
- **Community Acknowledgment**: Developed with appreciation for the AI community and the contributions of model creators.

FlashRank is an excellent choice for developers looking to boost search performance without compromising on efficiency.
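
Outside of LangChain, the library can also be called directly. The following is a minimal sketch, assuming the `Ranker` and `RerankRequest` classes of the `flashrank` package and its default model. The helper names `to_passages` and `rerank_with_flashrank` and the sample sentences are hypothetical and only for illustration.

```python
def to_passages(texts):
    """Wrap raw strings in the dict shape FlashRank works with: id, text, meta."""
    return [{"id": i, "text": t, "meta": {}} for i, t in enumerate(texts)]


def rerank_with_flashrank(query, texts):
    """Rerank `texts` against `query` (hypothetical helper, not part of flashrank)."""
    # Imported lazily so that to_passages() works even without flashrank installed.
    from flashrank import Ranker, RerankRequest

    ranker = Ranker()  # assumption: the default model is fetched on first use
    request = RerankRequest(query=query, passages=to_passages(texts))
    return ranker.rerank(request)  # passages with a relevance score attached


# Hypothetical sample data.
passages = to_passages(
    [
        "Multi-head attention runs several attention layers in parallel.",
        "Paris is the capital of France.",
    ]
)
```

Calling `rerank_with_flashrank("What is multi-head attention?", [...])` should return the same passages ordered by relevance. The rest of this notebook uses the LangChain wrapper instead.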


### Step-1: Required Package Installation

These dependencies set up a complete environment for building a RAG system with the FlashRank reranker.

In [1]:
!pip install flashrank langchain langchain_community langchain-groq python-dotenv huggingface-hub langchain-huggingface langchain-text-splitters faiss-cpu python-docx

In [None]:
!pip install sentence-transformers==2.2.2 InstructorEmbedding==1.0.1

### Step-2: Imports

These imports set up the environment for our experiment.

In [32]:
import torch
try:
    import intel_extension_for_pytorch as ipex  # optional: only needed for Intel XPU support
except ImportError:
    ipex = None
import langchain
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_huggingface import HuggingFacePipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter

### Step-3: LLM Setup

In this step, we use the `llama-3.1-8b-instant` model from Groq. To access the model, you need to create an API key.  
For steps to generate your API key, see: [GROQ API_Key_Generation](https://github.com/AryanKarumuri/Gen-AI-Projects/blob/main/README.md#api-key-generation-guide)

In [None]:
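load_dotenv()  # read GROQ_API_KEY from a local .env file, if present
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if GROQ_API_KEY:
    llm = ChatGroq(groq_api_key=GROQ_API_KEY, model_name="llama-3.1-8b-instant")
    print("Groq API key loaded")  # never print the key itself
else:
    # Fail early: later cells need `llm` to be defined.
    raise ValueError("GROQ_API_KEY is not set. Add it to your environment or .env file.")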

### Step-4: Device Setup

The `get_device()` function checks for the availability of different devices (**CUDA**, **XPU**, or **CPU**) and returns the appropriate device for computation.

#### **Functionality**

- **CUDA Availability Check**  
  - The function first checks if a CUDA-capable GPU is available using `torch.cuda.is_available()`.  
  - If CUDA is available, it selects the GPU as the device and prints the name of the GPU.  

- **XPU Availability Check**  
  - If CUDA is not available, the function checks for an Intel XPU accelerator using `torch.xpu.is_available()`.  
  - If XPU is available, it selects the XPU device and clears the XPU cache using `torch.xpu.empty_cache()` to release memory cached by earlier runs.  

- **Fallback to CPU**  
  - If neither CUDA nor XPU is available, the function defaults to using the **CPU** and prints `"Using CPU"`.  

In [4]:
def get_device() -> torch.device:
    """Check and return the appropriate device (CUDA, XPU, or CPU)."""
    if torch.cuda.is_available():
        device_type = "cuda"
        device = torch.device(device_type)
        print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")
    elif hasattr(torch, "xpu") and torch.xpu.is_available():  # torch.xpu is missing in some builds
        device_type = "xpu"
        device = torch.device(device_type)
        torch.xpu.empty_cache()  # Empty the XPU cache if using XPU
        print(f"Using device: {torch.xpu.get_device_name()}")
    else:
        device_type = "cpu"
        device = torch.device(device_type)
        print("Using CPU")
        
    return device

In [None]:
get_device()

### Step-5: Doc Conversion

`TextLoader` is designed specifically for loading plain text (.txt) files. It won't work directly with .docx files, as they contain additional formatting and are not plain text.

The `convert_docx_to_txt()` function reads the contents of a `.docx` file and converts it into a plain text (`.txt`) file. This is useful when extracting text from Word documents for further processing or analysis.

#### **Function Overview**
- **Input:**  
  - `docx_path` - Path to the input `.docx` file.  
  - `txt_path` - Path to the output `.txt` file.  

- **Output:**  
  - A `.txt` file containing the extracted text from the Word document. Each paragraph is written to the text file on a new line.

#### **Functionality**
1. **Read DOCX File:**  
    - The function uses `Document()` from the `python-docx` library to load the Word document.  

2. **Write to TXT File:**  
    - Opens the output text file in write mode (`"w"`) with UTF-8 encoding to ensure compatibility with various characters.  

3. **Extract and Save Paragraphs:**  
    - Iterates through all paragraphs in the `.docx` file using `doc.paragraphs`.  
    - Writes each paragraph’s text to the `.txt` file, followed by a newline (`\n`) for formatting.

In [44]:
from docx import Document
def convert_docx_to_txt(docx_path, txt_path):
    doc = Document(docx_path)
    with open(txt_path, "w", encoding="utf-8") as f:
        for para in doc.paragraphs:
            f.write(para.text + "\n")

convert_docx_to_txt("data/attention.docx", "attention.txt")

In [45]:
documents = TextLoader("attention.txt").load()

In [None]:
documents

### Step-6: Document Loading and Chunking

- **`RecursiveCharacterTextSplitter`**: Breaks the loaded document into smaller chunks that are easier for models to process. Here it produces chunks of at most 500 characters with a 100-character overlap between consecutive chunks.
- **`text_splitter.split_documents(documents)`**: Splits the loaded documents (`documents`) into chunks and stores them in the `texts` list.

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
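
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch using a plain sliding window over characters. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` implements. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split `text` into fixed-size chunks whose neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the window reached the end of the text
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one. The overlap keeps a sentence that straddles a chunk boundary visible in both chunks.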

### Step-7: Getting Embeddings

- **`EMBEDDING_MODEL_NAME`**: Specifies the name of the pre-trained embedding model, `"hkunlp/instructor-large"`, an instruction-tuned model for generating text embeddings.

- **`HuggingFaceInstructEmbeddings`**: Initialized with the given model name and with instructions that describe how to represent documents and queries for retrieval.

In [None]:
# Pre-trained embedding model.
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"

# Embeddings
embeddings = HuggingFaceInstructEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    embed_instruction="Represent the document for retrieval:",
    query_instruction="Represent the question for retrieving supporting documents:",
)
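
The two instructions are not added to the stored text. As far as we understand the Instructor approach, each input is paired with its instruction before being encoded. The sketch below only builds those pairs and loads no model. The helper names `document_pairs` and `query_pair` are hypothetical, and the exact pairing inside `HuggingFaceInstructEmbeddings` is an assumption.

```python
EMBED_INSTRUCTION = "Represent the document for retrieval:"
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents:"


def document_pairs(texts):
    """Pair every document chunk with the document instruction."""
    return [[EMBED_INSTRUCTION, text] for text in texts]


def query_pair(question):
    """Pair a user question with the query instruction."""
    return [QUERY_INSTRUCTION, question]


pairs = document_pairs(["Multi-head attention uses several heads."])
qpair = query_pair("What is multi-head attention?")
print(pairs[0])
print(qpair)
```

Using different instructions for documents and queries lets the same model embed both sides of the retrieval task in a way suited to each.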

### Step-8: DB Retriever

- **`FAISS.from_documents(texts, embeddings)`**  
  - Embeds each chunk in `texts` with the embedding model (`embeddings`) and builds a FAISS index from the resulting vectors.  
  - FAISS is an efficient similarity search library used for searching in large datasets of vector embeddings.  

- **`.as_retriever()`**  
  - Converts the FAISS index into a retriever object, making it easy to query for relevant documents using similarity search.  

- **`search_kwargs={"k": 7}`**  
  - Specifies that the retriever should return the **top 7 most relevant documents** for each query, based on similarity scores.  

In [None]:
retriever = FAISS.from_documents(texts, embeddings).as_retriever(search_kwargs={"k": 7})
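
To make the role of `k` concrete, here is a brute-force sketch of top-k nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance, which we believe is the default of the LangChain FAISS wrapper. FAISS itself uses optimized index structures rather than this loop. The function names and the toy 2-D vectors are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors closest to the query."""
    scored = [(squared_l2(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0])  # smaller distance means more similar
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" for four chunks (hypothetical values).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
# [0, 2]
```

With `k=7`, the retriever above hands the seven nearest chunks to the next stage, whether or not all of them are truly relevant.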

In [12]:
query = "What is multi-head attention?"

In [13]:
result = retriever.invoke(query)

In [18]:
result

[Document(metadata={'source': './attention.txt'}, page_content='MultiHead(Q, K, V ) = Concat(head1, ..., headh)WO\nwhere headi = Attention(QWQ, KWK, V WV )\ni\ti\ti\n\n\nWhere the projections are parameter matrices WQ ∈ Rdmodel×dk , WK ∈ Rdmodel×dk , WV ∈ Rdmodel×dv\n\nand WO ∈ Rhdv ×dmodel .\ni\ti\ti\n\nIn this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.'),
 Document(metadata={'source': './attention.txt'}, page_content='output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.\n\n\nMultiHead(Q, K, V ) = Concat(head1, ..., headh)WO\nwh

### Description:

The `pretty_print_docs()` function neatly displays a list of documents by printing their content and metadata in a formatted manner.

In [15]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [
                f"Document {i+1}:\n\n{d.page_content}\nMetadata: {d.metadata}"
                for i, d in enumerate(docs)
            ]
        )
    )

In [16]:
pretty_print_docs(result)

Document 1:

MultiHead(Q, K, V ) = Concat(head1, ..., headh)WO
where headi = Attention(QWQ, KWK, V WV )
i	i	i


Where the projections are parameter matrices WQ ∈ Rdmodel×dk , WK ∈ Rdmodel×dk , WV ∈ Rdmodel×dv

and WO ∈ Rhdv ×dmodel .
i	i	i

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
Metadata: {'source': './attention.txt'}
----------------------------------------------------------------------------------------------------
Document 2:

output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.


MultiHead(Q, K, V ) = Concat(head1,

### Step-9: ContextualCompressionRetriever and FlashrankRerank Setup

In [None]:
compressor = FlashrankRerank()

In [20]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
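
The compression retriever works in two stages: the base retriever fetches candidates, then the reranker scores each candidate against the query and keeps only the best ones. The sketch below imitates that flow with a toy word-overlap score standing in for the cross-encoder. The names `toy_relevance` and `rerank` and the sample passages are hypothetical. The `top_n` argument mirrors, as an assumption, the `top_n` setting of `FlashrankRerank`.

```python
def toy_relevance(query: str, passage: str) -> float:
    """Fraction of query words found in the passage (a stand-in for a cross-encoder score)."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, candidates, top_n=3):
    """Score every candidate against the query and keep the top_n highest-scoring ones."""
    scored = [
        {"text": text, "relevance_score": toy_relevance(query, text)}
        for text in candidates
    ]
    scored.sort(key=lambda doc: doc["relevance_score"], reverse=True)
    return scored[:top_n]


# Stage 1 output: candidates as a base retriever might return them (hypothetical).
candidates = [
    "positional encoding uses sine and cosine functions",
    "multi-head attention runs several attention layers in parallel",
    "the encoder stacks six identical layers",
    "attention maps a query and key-value pairs to an output",
]

# Stage 2: rerank and truncate.
reranked = rerank("what is multi-head attention", candidates, top_n=2)
for doc in reranked:
    print(doc["relevance_score"], doc["text"])
```

A real cross-encoder reads the query and the passage together, so it can judge relevance far better than word overlap. The retrieve-then-rerank structure is the same.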

In [21]:
compressed_docs = compression_retriever.invoke("What is multi-head attention?")

In [22]:
compressed_docs

[Document(metadata={'id': 3, 'relevance_score': 0.9992928, 'source': './attention.txt'}, page_content='Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n\nScaled Dot-Product Attention\tMulti-Head Attention\n\n\n\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.'),
 Document(metadata={'id': 2, 'relevance_score': 0.99906945, 'source': './attention.txt'}, page_content='Multi-Head Attention\nInstead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention f

In [23]:
pretty_print_docs(compressed_docs)

Document 1:

Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum

Scaled Dot-Product Attention	Multi-Head Attention



Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
Metadata: {'id': 3, 'relevance_score': 0.9992928, 'source': './attention.txt'}
----------------------------------------------------------------------------------------------------
Document 2:

Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in pa

### Step-10: Chain Setup
The code builds a **RetrievalQA** chain that answers the question using the documents returned by the compression retriever.
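
By default, `RetrievalQA.from_chain_type` uses the "stuff" chain type: the retrieved documents are concatenated into a single prompt together with the question. Here is a rough sketch of that idea. The template wording and the helper name `build_stuff_prompt` are hypothetical and not LangChain's actual prompt.

```python
# Hypothetical template; LangChain ships its own default prompt.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question, passages):
    """Join all retrieved passages into one context block and fill the template."""
    context = "\n\n".join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    "What is multi-head attention?",
    [
        "Multi-Head Attention consists of several attention layers running in parallel.",
        "An attention function maps a query and a set of key-value pairs to an output.",
    ],
)
print(prompt)
```

Because every retrieved passage ends up in the prompt, reranking and trimming the candidates first keeps the prompt short and focused.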

In [33]:
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
chain.invoke(query)

2025-03-25 09:54:23,250 - httpx - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


{'query': 'What is multi-head attention?',
 'result': 'Multi-head attention is a type of attention mechanism used in transformer models, particularly in natural language processing and machine translation tasks. It\'s an extension of the scaled dot-product attention mechanism.\n\nIn traditional attention, the attention weights are computed by taking the dot product of the query and key vectors, and then applying a scaling factor and a softmax function to get the final weights.\n\nMulti-head attention, on the other hand, linearly projects the query, key, and value vectors into multiple subspaces (or "heads") of dimensionality `dk`, `dk`, and `dv` respectively, where `h` is the number of heads. Each head computes attention weights independently, and the final output is a weighted sum of all the heads.\n\nThis allows the model to jointly attend to information from different representation subspaces at different positions. The intuition behind multi-head attention is that different heads c

### Below is the result for an out-of-context query. The retriever still returns documents, ranked only by relevance score

In [38]:
pretty_print_docs(compression_retriever.invoke("Who is the prime minister of India?"))

Document 1:

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

Attention Is All You Need






Ashish Vaswani∗ Google Brain avaswani@google.com
Noam Shazeer∗ Google Brain noam@google.com
Niki Parmar∗ Google Research nikip@google.com
Jakob Uszkoreit∗ Google Research usz@google.com
Metadata: {'id': 0, 'relevance_score': 0.9969857, 'source': './attention.txt'}
----------------------------------------------------------------------------------------------------
Document 2:

∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
Metadata: {'id': 3, 'relevance_score': 0.93468195, 'source': './atte

### Below is a more practical example that passes the same query through the LLM

In [36]:
query = "Who is the prime minister of India?"
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
chain.invoke(query)

2025-03-25 10:17:51,874 - httpx - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


{'query': 'Who is the prime minister of India?',
 'result': "I don't have the most current information on the prime minister of India. My knowledge cutoff is December 2023, and I may not have information on recent changes or updates. For the most accurate and up-to-date information, I recommend checking a reliable news source or official government website."}