
**Model Options:**
- **Nano**: ~4MB, blazing fast model with competitive performance (ranking precision).
- **Small**: ~34MB, slightly slower with the best performance (ranking precision).
- **Medium**: ~110MB, slower model with the best zero-shot performance (ranking precision).
- **Large**: ~150MB, slower model with competitive performance (ranking precision) for 100+ languages.
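
As a rough illustration of the size trade-off listed above, the sketch below picks the largest option that fits a memory budget. The sizes are the approximate figures from the list; the helper name `largest_option_within` is hypothetical and not part of the library.

```python
# Approximate model sizes in MB, taken from the list above.
MODEL_SIZES_MB = {"Nano": 4, "Small": 34, "Medium": 110, "Large": 150}


def largest_option_within(budget_mb):
    """Return the largest option whose model fits the budget, or None."""
    fitting = [
        (size, name) for name, size in MODEL_SIZES_MB.items() if size <= budget_mb
    ]
    # Tuples compare by size first, so max() picks the biggest model that fits.
    return max(fitting)[1] if fitting else None


print(largest_option_within(50))  # Small
print(largest_option_within(2))   # None
```

Size is only one criterion: the Large option is the multilingual one, so a bigger budget does not by itself make it the right choice for English-only data.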

**FlashRank**: An ultra-lite, super-fast Python library for re-ranking search and retrieval results.

- **Ultra-lite**: No heavy dependencies. Runs on CPU with a tiny ~4MB reranking model.
- **Super-fast**: Speed depends on the number of tokens in passages and query, plus model depth.
- **Cost-efficient**: Ideal for serverless deployments with low memory and time requirements.
- **Based on State-of-the-Art Cross-encoders**: Includes models like ms-marco-TinyBERT-L-2-v2 (default), ms-marco-MiniLM-L-12-v2, rank-T5-flan, and ms-marco-MultiBERT-L-12.
- **Sleek Models for Efficiency**: Designed for minimal overhead in user-facing scenarios.

_FlashRank is tailored for scenarios that need efficient and effective reranking, balancing ranking quality against resource usage._

In [None]:
!pip install flashrank

In [None]:
# Helper function for printing docs


def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [
                f"Document {i+1}:\n\n{d.page_content}\nMetadata: {d.metadata}"
                for i, d in enumerate(docs)
            ]
        )
    )

In [None]:
query = "How to speedup LLMs?"

In [None]:
passages = [
   {
      "id":1,
      "text":"Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta": {"additional": "info1"}
   },
   {
      "id":2,
      "text":"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta": {"additional": "info2"}
   },
   {
      "id":3,
      "text":"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta": {"additional": "info3"}

   },
   {
      "id":4,
      "text":"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta": {"additional": "info4"}
   },
   {
      "id":5,
      "text":"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta": {"additional": "info5"}
   }
]
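
Each passage above is a dict with `id`, `text`, and `meta` keys. If the input is a plain list of strings, a small helper can wrap it in that layout. This is a minimal sketch; `to_passages` is a hypothetical name and the empty `meta` dict is an assumption.

```python
def to_passages(texts, start_id=1):
    """Wrap plain strings in the id/text/meta dict layout used above."""
    return [
        {"id": start_id + offset, "text": text, "meta": {}}
        for offset, text in enumerate(texts)
    ]


sample = to_passages(["vLLM uses PagedAttention.", "Medusa removes the draft model."])
print(sample[1]["id"])  # 2
```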


In [None]:
from flashrank.Ranker import Ranker, RerankRequest

In [None]:
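def get_result(query, passages, choice):
    """Rerank `passages` against `query` with the model selected by `choice`."""
    if choice == "Nano":
        # Default model: ms-marco-TinyBERT-L-2-v2 (~4MB)
        ranker = Ranker()
    elif choice == "Small":
        ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
    elif choice == "Medium":
        ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")
    elif choice == "Large":
        ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")
    else:
        # Without this branch an unknown choice fails later with a NameError
        raise ValueError(f"Unknown choice: {choice!r}")

    # Wrap the query and passages in a request, then rerank
    rerank_request = RerankRequest(query=query, passages=passages)
    results = ranker.rerank(rerank_request)
    print(results)

    return results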
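
`get_result` builds a new `Ranker` on every call, so each call pays the model-loading cost again. One possible way to avoid that is to cache one ranker per option. The sketch below is an assumption-laden illustration: `make_cached_loader` and `fake_build` are hypothetical names, and a stand-in builder is used so the code runs without downloading a model.

```python
from functools import lru_cache

# Option labels and model names copied from get_result; None means "library default".
MODEL_NAMES = {
    "Nano": None,
    "Small": "ms-marco-MiniLM-L-12-v2",
    "Medium": "rank-T5-flan",
    "Large": "ms-marco-MultiBERT-L-12",
}


def make_cached_loader(build):
    """Return load(choice) that calls build(model_name) at most once per choice."""

    @lru_cache(maxsize=None)
    def load(choice):
        if choice not in MODEL_NAMES:
            raise ValueError(f"Unknown choice: {choice!r}")
        return build(MODEL_NAMES[choice])

    return load


# Stand-in builder (hypothetical) that records how often it is called.
calls = []


def fake_build(model_name):
    calls.append(model_name)
    return object()


load = make_cached_loader(fake_build)
first = load("Small")
second = load("Small")
print(first is second, calls)  # True ['ms-marco-MiniLM-L-12-v2']
```

In the notebook, the builder could be something like `lambda name: Ranker() if name is None else Ranker(model_name=name, cache_dir="/opt")`, which mirrors the constructor calls already used in `get_result`.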

In [None]:
%%time
print("sunny")

In [None]:
%%time
get_result(query,passages,"Nano")

In [None]:
%%time
get_result(query,passages,"Small")

In [None]:
%%time
get_result(query,passages,"Medium")
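
`%%time` reports a single run, and the first run for a model may also include downloading and loading it. For a steadier comparison between options, one approach is to repeat the call and keep the fastest time. This is a minimal sketch using only the standard library; `best_of` is a hypothetical helper name.

```python
import time


def best_of(fn, *args, repeats=3):
    """Call fn(*args) `repeats` times; return (last_result, fastest_seconds)."""
    fastest = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        fastest = min(fastest, time.perf_counter() - start)
    return result, fastest


# Demonstrated on a cheap built-in so the sketch runs anywhere.
result, seconds = best_of(sorted, [3, 1, 2])
print(result, f"{seconds:.6f}s")
```

In the notebook this could be called as `best_of(get_result, query, passages, "Nano")`, keeping in mind that `get_result` prints its results on every repeat.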

In [None]:
!pip install langchain_community

In [None]:
!pip install langchain_openai

In [None]:
from google.colab import userdata
OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')

In [None]:
import os
os.environ["OPENAI_API_KEY"]=OPENAI_API_KEY

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings  # maintained home of the class; langchain_community's copy is deprecated
from langchain_community.vectorstores import FAISS

In [None]:
documents = TextLoader("/content/state_of_the_union.txt").load()

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
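
To make `chunk_size=500` and `chunk_overlap=100` concrete, the sketch below cuts fixed-width character windows in which each window starts 400 characters after the previous one. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and lines, so its real chunks will generally differ. `sliding_windows` is a hypothetical name.

```python
def sliding_windows(text, chunk_size=500, chunk_overlap=100):
    """Fixed-width character windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_windows("x" * 1200)
print([len(c) for c in chunks])  # [500, 500, 400]
```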

In [None]:
texts = text_splitter.split_documents(documents)

In [None]:
# Tag each chunk with its position so reranked results can be traced back
for idx, doc in enumerate(texts):
    doc.metadata["id"] = idx

In [None]:
texts

In [None]:
embedding = OpenAIEmbeddings(model="text-embedding-ada-002")

In [None]:
!pip install faiss-cpu

In [None]:
retriever = FAISS.from_documents(texts, embedding).as_retriever(search_kwargs={"k": 10})

In [None]:
query = "What did the president say about Ketanji Brown Jackson"

In [None]:
docs = retriever.invoke(query)

In [None]:
pretty_print_docs(docs)

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

In [None]:
llm = ChatOpenAI(temperature=0)

In [None]:
compressor = FlashrankRerank()

In [None]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
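
Conceptually, the compression retriever first fetches candidates from the base retriever (`k=10` here) and then passes them to the compressor, which rescores them against the query and may keep fewer. The toy sketch below illustrates only that second stage. The word-overlap scorer stands in for a cross-encoder, and `rerank_candidates`, `word_overlap`, and `top_n` are hypothetical names, not the library's API.

```python
def rerank_candidates(query, candidates, score, top_n=3):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def word_overlap(query, text):
    """Toy scorer: number of distinct query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up candidates, standing in for first-stage retrieval results.
candidates = [
    "Bake bread at high heat",
    "Quantization shrinks an LLM",
    "vLLM speeds up LLM inference and serving",
]
top = rerank_candidates("speed up LLM inference", candidates, word_overlap, top_n=2)
print(top)
# ['vLLM speeds up LLM inference and serving', 'Quantization shrinks an LLM']
```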

In [None]:
# Reuse `query` so both retrievers see the same question (the name here was previously misordered)
compressed_docs = compression_retriever.invoke(query)

In [None]:
len(compressed_docs)


In [None]:
compressed_docs

In [None]:
print([doc.metadata["id"] for doc in compressed_docs])
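
The printed ids are easier to interpret next to each document's position before reranking. The sketch below maps each id to its old and new position; `rank_shifts` is a hypothetical helper and the ids in the example are made up.

```python
def rank_shifts(before_ids, after_ids):
    """Map each id kept after reranking to (old_position, new_position), 0-based.

    old_position is None when the id was not in the first-stage results.
    """
    old_position = {doc_id: pos for pos, doc_id in enumerate(before_ids)}
    return {
        doc_id: (old_position.get(doc_id), new_pos)
        for new_pos, doc_id in enumerate(after_ids)
    }


print(rank_shifts([12, 7, 30, 4], [30, 12]))  # {30: (2, 0), 12: (0, 1)}
```

In the notebook, the two lists could be built as `[d.metadata["id"] for d in docs]` and `[d.metadata["id"] for d in compressed_docs]`.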

In [None]:
pretty_print_docs(compressed_docs)

In [None]:
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)

In [None]:
chain.invoke(query)