<a href="https://colab.research.google.com/github/AIWalaBro/Advanced_Rag/blob/main/1_HybridSearch_in_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances
import numpy as np


## Document preprocessing and Query Preprocessing

In [2]:
# Sample documents
documents = [
    "This is a list which containig sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

In [3]:
query = "keyword-based search"

In [4]:
import re
def preprocess_text(text):
  # convert text to lowercase
  text = text.lower()
  # remove the punctuation
  text = re.sub(r'[^\w\s]','',text)
  return text

In [5]:
preprocess_documents = [preprocess_text(doc) for doc in  documents]

In [6]:
preprocess_documents

['this is a list which containig sample documents',
 'keywords are important for keywordbased search',
 'document analysis involves extracting keywords',
 'keywordbased search relies on sparse embeddings']

In [7]:
query

'keyword-based search'

In [8]:
# let pass our query through the preprocess text
preprocess_query = preprocess_text(query)

In [9]:
preprocess_query

'keywordbased search'

### obeservation: when you passed your query from preprocess text function that time it should be "keyword based search" but instead what happend combine this word into sinlge one like "keywordbased search"
- `impact:` when you transform into embedding that time it will be differnt in tfidfvectorizer

In [10]:
# convert this into the sparse vectors
vector = TfidfVectorizer()
X = vector.fit_transform(preprocess_documents)
X

<4x21 sparse matrix of type '<class 'numpy.float64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [11]:
X.toarray()

array([[0.        , 0.        , 0.37796447, 0.        , 0.37796447,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.37796447, 0.        , 0.        , 0.37796447, 0.        ,
        0.        , 0.37796447, 0.        , 0.        , 0.37796447,
        0.37796447],
       [0.        , 0.4533864 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.4533864 , 0.4533864 , 0.        ,
        0.        , 0.35745504, 0.35745504, 0.        , 0.        ,
        0.        , 0.        , 0.35745504, 0.        , 0.        ,
        0.        ],
       [0.46516193, 0.        , 0.        , 0.46516193, 0.        ,
        0.        , 0.46516193, 0.        , 0.        , 0.46516193,
        0.        , 0.        , 0.36673901, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.43671931, 0.        , 0.        , 0.       

In [12]:
X.toarray().shape

(4, 21)

In [13]:
X.toarray()[0]

array([0.        , 0.        , 0.37796447, 0.        , 0.37796447,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.37796447, 0.        , 0.        , 0.37796447, 0.        ,
       0.        , 0.37796447, 0.        , 0.        , 0.37796447,
       0.37796447])

In [14]:
# Raw implemenation of KEYWORD Search

query_embedding = vector.transform([preprocess_query])
query_embedding.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        ]])

In [15]:
query_embedding = vector.transform([preprocess_query])
query_embedding.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        ]])

#### Observation: see above 2 cells having huge differnce in embedding represntation.

#### Notes: tfidf advanced version BM25. it genreates the sparse vectors in most of the places.but here we are not using BM25

## ReRanking - Sparse vectores

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
similarities = cosine_similarity(X, query_embedding)

In [18]:
similarities
# see you can see for those words you got highest similarity

array([[0.        ],
       [0.50551777],
       [0.        ],
       [0.48693426]])

In [19]:
# with the help of this code i can do ranking

np.argsort(similarities, axis =0)

array([[0],
       [2],
       [3],
       [1]])

In [20]:
#Ranking
ranked_indices = np.argsort(similarities,axis=0)[::-1].flatten()

In [21]:
ranked_indices

array([1, 3, 2, 0])

In [22]:
ranked_documents = [documents[i] for i in ranked_indices]

In [23]:
# output of the ranked documents

for i, doc in enumerate(ranked_documents):
  print(f"rank =  {i+1} : {doc}" )

rank =  1 : Keywords are important for keyword-based search.
rank =  2 : Keyword-based search relies on sparse embeddings.
rank =  3 : Document analysis involves extracting keywords.
rank =  4 : This is a list which containig sample documents.


### ReRanking - Dense Vectors

In [24]:
# let suppose we having the embedidng of the sentences

document_embeddings = np.array([
    [0.634, 0.234, 0.867, 0.042, 0.249],
    [0.123, 0.456, 0.789, 0.321, 0.654],
    [0.987, 0.654, 0.321, 0.123, 0.456]
])

In [25]:
# and you having the sample search query

# Sample search query (represented as a dense vector)
query_embedding = np.array([[0.789, 0.321, 0.654, 0.987, 0.123]])

In [26]:
# # Calculate cosine similarity between query and documents
similarities = cosine_similarity(document_embeddings, query_embedding)

In [27]:
similarities

array([[0.73558979],
       [0.67357898],
       [0.71517305]])

In [28]:
# lets find out rank indices
ranked_indices = np.argsort(similarities, axis=0)[::-1].flatten()

In [29]:
ranked_indices

array([0, 2, 1])

In [30]:
[document_embeddings[i] for i in ranked_indices]

[array([0.634, 0.234, 0.867, 0.042, 0.249]),
 array([0.987, 0.654, 0.321, 0.123, 0.456]),
 array([0.123, 0.456, 0.789, 0.321, 0.654])]

In [31]:


# Output the ranked documents
for i, idx in enumerate(ranked_indices):
    print(f"Rank {i+1}: Document {idx+1}")

Rank 1: Document 1
Rank 2: Document 3
Rank 3: Document 2


### Keyword search on real time documents

In [32]:
doc_path = "/content/explainable.pdf"

In [33]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.3.1-py3-none-any.whl (295 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.8/295.8 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pypdf
Successfully installed pypdf-4.3.1


In [34]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.2.13-py3-none-any.whl.metadata (2.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain<0.3.0,>=0.2.15 (from langchain_community)
  Downloading langchain-0.2.15-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.3.0,>=0.2.35 (from langchain_community)
  Downloading langchain_core-0.2.35-py3-none-any.whl.metadata (6.2 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain_community)
  Downloading langsmith-0.1.106-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain_community)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.22.0-py3-none-any.whl.metadata (7.2 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,

In [35]:
from langchain_community.document_loaders import PyPDFLoader

In [36]:
loader = PyPDFLoader(doc_path)

In [37]:
docs = loader.load()
docs

[Document(metadata={'source': '/content/explainable.pdf', 'page': 0}, page_content='Loan\nDefault\nPrediction\nwith\nExplainable\nAI\nProject\nOverview\nOverview\nLoan\ndefault\nprediction\nis\na\ncritical\napplication\nin\nthe\nfinancial\nindustry ,\nwhere\nlenders\nand\ninstitutions\naim\nto\nassess\nthe\nrisk\nassociated\nwith\nproviding\nloans\nto\nindividuals\nor\nbusinesses.\nThe\ngoal\nis\nto\nidentify\npotential\nborrowers\nmore\nlikely\nto\ndefault\non\ntheir\nloans,\nallowing\nlenders\nto\nmake\ninformed\ndecisions\nand\nmitigate\nfinancial\nrisks.\nIn\nthe\ncontext\nof\nloan\ndefault\nprediction,\nmachine\nlearning\nmodels\nhave\nshown\npromising\nresults\nin\naccurately\npredicting\nwhether\na\nborrower\nwill\ndefault\non\na\nloan.\nThese\nmodels\nleverage\nhistorical\ndata,\nsuch\nas\npast\nloan\nperformance,\nfinancial\nhistory ,\nemployment\ndetails,\nand\nother\nrelevant\nfactors,\nto\npredict\nthe\nlikelihood\nof\nfuture\nloan\ndefaults.\nHowever ,\nas\nmachine\nlearni

In [38]:
# convert into the chunks or tokens
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [39]:
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)

In [40]:
chunks = splitter.split_documents(docs)
chunks

[Document(metadata={'source': '/content/explainable.pdf', 'page': 0}, page_content='Loan\nDefault\nPrediction\nwith\nExplainable\nAI\nProject\nOverview\nOverview\nLoan\ndefault\nprediction\nis\na\ncritical\napplication\nin\nthe\nfinancial\nindustry ,\nwhere\nlenders\nand\ninstitutions\naim\nto\nassess\nthe'),
 Document(metadata={'source': '/content/explainable.pdf', 'page': 0}, page_content='aim\nto\nassess\nthe\nrisk\nassociated\nwith\nproviding\nloans\nto\nindividuals\nor\nbusinesses.\nThe\ngoal\nis\nto\nidentify\npotential\nborrowers\nmore\nlikely\nto\ndefault\non\ntheir\nloans,\nallowing\nlenders\nto\nmake'),
 Document(metadata={'source': '/content/explainable.pdf', 'page': 0}, page_content='allowing\nlenders\nto\nmake\ninformed\ndecisions\nand\nmitigate\nfinancial\nrisks.\nIn\nthe\ncontext\nof\nloan\ndefault\nprediction,\nmachine\nlearning\nmodels\nhave\nshown\npromising\nresults\nin\naccurately\npredicting'),
 Document(metadata={'source': '/content/explainable.pdf', 'page': 0}, p

### loads the models

In [41]:
from langchain.embeddings import HuggingFaceEmbeddings, HuggingFaceInferenceAPIEmbeddings

In [None]:
HF_TOKEN = "YOUR HUGGING FACE TOKEN"

In [48]:
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key = HF_TOKEN, model_name = "BAAI/bge-base-en-v1.5")

`note:` as you know for the keyword search we are not use dense embedding we use sparse embedding. it created on the vocabulary.

In [49]:
!pip install chromadb



In [50]:
from langchain.vectorstores import Chroma

#### till not created sparse vectors, created only dense vectors.

In [51]:
vectorestore = Chroma.from_documents(chunks, embeddings)

In [52]:
vectorstore_retriever = vectorestore.as_retriever(search_kwargs = {"k":3})

In [53]:
vectorstore_retriever
# i got an object, i will get the top3 results on the basis of similarity search

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceInferenceAPIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7a40ab04c760>, search_kwargs={'k': 3})

### Keyword Search

In [54]:
# its an update version of the tfidf only
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [55]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

In [56]:
keyword_retriever = BM25Retriever.from_documents(chunks)

In [57]:
keyword_retriever

BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7a40abd83940>)

In [58]:
keyword_retriever.k = 3

### vectorretriver  created is the dense vectors and it is created from hugging face embeddings and keyword retriver is sparse vectors and it is created from tfidf.

In [59]:
ensemble_retriever = EnsembleRetriever(retrievers = [vectorstore_retriever, keyword_retriever],weights=[0.3, 0.7])

# 0.3 giving to vectorstore retriever  and 0.7 giving to the keyword_retrivers

# Mixing vector search and keyword search for Hybrid search
`hybrid_score = (1 — alpha) * sparse_score + alpha * dense_score`

In [60]:
# i have to use at the end model quantized version of the model

model_name = "HuggingFaceH4/zephyr-7b-beta"

In [61]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.5/137.5 MB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.3


In [62]:
! pip install accelerate



#### Note :  always keep on the gpu while using accelerate else throws an error because it managing the GPU and it will be utilizing an GPU.
- ####  both package are important while using quantized version of the model

In [63]:
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline) # Changed AutoModelFORCausalLM to AutoModelForCausalLM
from langchain import HuggingFacePipeline

In [64]:
# function for laoding 4-bit quantized model

def load_quantized_model(model_name:str):

  '''
  model_name: Name or path of the model to be loaded.
  return: loaded quantized model.
  '''
  quantized_model_configuration = BitsAndBytesConfig(
      load_in_4bit = True,
      bnb_4bit_use_doble_quant = True,
      bnb_4bit_quant = "nf4",
      bnb_4bit_compute_dtype = torch.bfloat16)

  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      torch_dtype = torch.bfloat16,
      quantization_config = quantized_model_configuration,
      device_map = "auto")

  return model


### every model has there own tokenization.
- 1. infirst phase we convert into the token thatmean token embedding
- 2. then psotional encoding
- 3. then self attention and after that to the neural network.

In [65]:

# initializing tokenizer
def initialize_tokenizer(model_name: str):
    """
    model_name: Name or path of the model for tokenizer initialization.
    return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer

In [66]:
# initializing tokenizer
tokenizer = initialize_tokenizer(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

In [67]:
# load the model
model = load_quantized_model(model_name)

Unused kwargs: ['bnb_4bit_use_doble_quant', 'bnb_4bit_quant']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


config.json:   0%|          | 0.00/638 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

# for vectorstore used Chromadb
chorma db has 3 variants
- 1. to the cloud
- 2. to the local disc
- 3. in the rambased
- `here we are using ram based.`

# Pipeline Creation

In [74]:

pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=2048,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

In [68]:
pipeline('text-generation',
         model = model,
         tokenizer = tokenizer,
         use_cache = True,
         device_map = "auto",
         max_length = 2048,
         do_sample = True,
         top_k = 5,
         num_return_sequences = 1,
         eos_token_id = tokenizer.eos_token_id,
         pad_token_id = tokenizer.eos_token_id)

<transformers.pipelines.text_generation.TextGenerationPipeline at 0x7a3f6cb88940>

In [75]:
llm = HuggingFacePipeline(pipeline = pipeline)

In [76]:
from langchain.chains import RetrievalQA

In [77]:
# normal chain means vector store retriver
# i am not passing the ensemble retriver.
normal_chain = RetrievalQA.from_chain_type(
    llm = llm, chain_type = "stuff", retriever = vectorstore_retriever)


In [78]:
# creating hybrid chain
# in ensemble retrival (dense vector + sparse vector)
hybrid_chain = RetrievalQA.from_chain_type(
    llm = llm, chain_type = "stuff", retriever = ensemble_retriever)

In [79]:
response1 = normal_chain.invoke("What is Loan Default Prediction?")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


hybrid search

In [83]:
response2 = hybrid_chain.invoke('what is loan default prediction')

In [89]:
print(response2['result'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Loan
Default
Prediction
with
Explainable
AI
Project
Overview
Overview
Loan
default
prediction
is
a
critical
application
in
the
financial
industry ,
where
lenders
and
institutions
aim
to
assess
the

of
loan
default
prediction
and
its
impact
on
financial
decision-making
2.
Understand
the
data
preprocessing
techniques
to
clean,
encode,
and
balance
data
for
accurate
model
training
3.
Utilizing

using
the
command
pip
install
-r
requirements.txt
4.
All
the
instructions
for
running
the
code
are
present
in
readme.md
Project
Takeaways
1.
Understanding
the
significance
of
loan
default
prediction

in
accurately
predicting
whether
a
borrower
will
default
on
a
loan.
These
models
leverage
historical
data,
such
as
past
loan
performance,
financial
history ,
employment
details,
and
other
relevant

allowing
lenders
to
make
informed
decisions


In [None]:
print(response2.get['result'])