<a href="https://colab.research.google.com/github/HarshSonaiya/DL/blob/main/Approach1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain elasticsearch langchain_community
!pip install pypdf sentence_transformers


In [2]:
from typing import List

from elasticsearch import Elasticsearch
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from sentence_transformers import SentenceTransformer




In [3]:
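# The 8.x client requires an explicit host; the URL below is an assumed local node.
es_client = Elasticsearch("http://localhost:9200")  # Replace with your Elasticsearch connection details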

In [9]:
index_name = 'temp'

# Define mappings: a BM25-scored text field and a 384-dimensional dense vector field
mappings = {
    "properties": {
        "content": {
            "type": "text",
            "similarity": "BM25"
        },
        "dense_vector": {
            "type": "dense_vector",
            "dims": 384
        }
    }
}

# Create the index with the mapping
es_client.indices.create(index=index_name, mappings=mappings)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'temp'})

In [12]:
def extract_content_from_pdf(file: str) -> List[Document]:
    """
    Extract and split content from a PDF file into chunks.

    Args:
        file (str): Path to the PDF file.

    Returns:
        List[Document]: The chunked Documents, each carrying the
        page_content and metadata extracted from the PDF.
    """
    loader = PyPDFLoader(file)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=400)
    chunks = splitter.split_documents(docs)
    return chunks
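
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that uses a plain fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and line breaks); `sliding_window_chunks` is a hypothetical helper used only to illustrate how consecutive chunks share text.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Hypothetical helper: split text into fixed-width, overlapping windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(windows)
```

In this simplified model, `chunk_size=1000` with `chunk_overlap=400` would advance each window by 600 characters, which is consistent with neighbouring chunks repeating a large share of their text.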


In [13]:
chunks = extract_content_from_pdf("/content/LSTM.pdf")


In [14]:
chunks

[Document(metadata={'source': '/content/LSTM.pdf', 'page': 0}, page_content='Communicated by Ronald Williams\nLong Short-Term Memory\nSepp Hochreiter\nFakult ¨at f¨ur Informatik, Technische Universit ¨at M ¨unchen, 80290 M ¨unchen, Germany\nJ¨urgen Schmidhuber\nIDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland\nLearning to store information over extended time intervals by recurrent'),
 Document(metadata={'source': '/content/LSTM.pdf', 'page': 0}, page_content='Long Short-Term Memory\nSepp Hochreiter\nFakult ¨at f¨ur Informatik, Technische Universit ¨at M ¨unchen, 80290 M ¨unchen, Germany\nJ¨urgen Schmidhuber\nIDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland\nLearning to store information over extended time intervals by recurrent\nbackpropagation takes a very long time, mostly because of insufﬁcient,decaying error backﬂow. We brieﬂy review Hochreiter’s (1991) analysis ofthis problem, then address it by introducing a novel, efﬁcient, gradient-based method called long short-term memory (

In [17]:
DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
dense_embedding_model = SentenceTransformer(DENSE_MODEL)

In [18]:
def create_dense_vector(doc: Document, model: SentenceTransformer) -> List[float]:
    """
    Encode a single Document with a SentenceTransformer model.

    Args:
        doc (Document): A Document object with 'page_content'.
        model (SentenceTransformer): An instance of SentenceTransformer.

    Returns:
        List[float]: The embedding of the document's page content.
    """
    # Encode the page content and convert the NumPy array to a plain list
    return model.encode(doc.page_content).tolist()


In [19]:
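for i, doc in enumerate(chunks):
    # Embed the chunk and store it alongside its raw text
    dense_embedding = create_dense_vector(doc, dense_embedding_model)

    document = {
        "content": doc.page_content,
        "dense_vector": dense_embedding,
    }

    # `document=` is the 8.x client parameter for the document body
    es_client.index(index=index_name, id=str(i), document=document)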
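
The loop above issues one indexing request per chunk. One possible alternative is to batch the documents with `elasticsearch.helpers.bulk`. The sketch below only builds the actions in the shape that helper accepts. `build_bulk_actions` and `fake_embed` are hypothetical names introduced for illustration, and the actual bulk call is left as a comment because it needs a running cluster.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def build_bulk_actions(
    texts: Iterable[str],
    embed: Callable[[str], List[float]],
    index_name: str,
) -> Iterator[Dict]:
    """Hypothetical helper: yield one bulk action per text chunk."""
    for i, text in enumerate(texts):
        yield {
            "_index": index_name,
            "_id": str(i),
            "_source": {"content": text, "dense_vector": embed(text)},
        }


def fake_embed(text: str) -> List[float]:
    """Stand-in embedding for illustration only (assumption, not a real model)."""
    return [float(len(text)), 0.0]


actions = list(build_bulk_actions(["first chunk", "second"], fake_embed, "temp"))
print(actions[0])

# With a live cluster, the actions could be sent in one call:
# from elasticsearch import helpers
# helpers.bulk(es_client, actions)
```

In the notebook, `embed` could be something like `lambda t: dense_embedding_model.encode(t).tolist()` and `texts` the `page_content` of each chunk.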

In [20]:
user_query = "What is the use of Gated Cell Units in LSTMs ?"

In [21]:
bm25_query = {
    "query": {
        "match": {
            "content": {
                "query": user_query
            }
        }
    },
    "size": 10
}

bm25_results = es_client.search(index=index_name, body=bm25_query)


In [22]:
bm25_results

ObjectApiResponse({'took': 617, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 160, 'relation': 'eq'}, 'max_score': 8.395834, 'hits': [{'_index': 'temp', '_id': '127', '_score': 8.395834, '_source': {'content': 'To ﬁnd out about LSTM’s practical limitations we intend to apply it to\nreal-world data. Application areas will include time-series prediction, musiccomposition, and speech processing. It will also be interesting to augmentsequence chunkers (Schmidhuber, 1992b, 1993) by LSTM to combine theadvantages of both.\nAppendix\nA.1 Algorithm Details. In what follows, the index kranges over output\nunits, iranges over hidden units, cjstands for the jth memory cell block, cv\nj\ndenotes the vth unit of memory cell block cj,u,l,mstand for arbitrary units,\nand tranges over all time steps of a given input sequence.\nThe gate unit logistic sigmoid (with range [0 ,1]) used in the experiments\nis\nf(x)=1\n1+exp(−x). (A.1)', 

In [23]:
query_vector = dense_embedding_model.encode(user_query).tolist()

dense_query = {
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'dense_vector') + 1.0",
                "params": {
                    "query_vector": query_vector
                }
            }
        }
    },
    "size": 10
}

dense_results = es_client.search(index=index_name, body=dense_query)


In [24]:
dense_results

ObjectApiResponse({'took': 63, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 161, 'relation': 'eq'}, 'max_score': 1.560598, 'hits': [{'_index': 'temp', '_id': '127', '_score': 1.560598, '_source': {'content': 'To ﬁnd out about LSTM’s practical limitations we intend to apply it to\nreal-world data. Application areas will include time-series prediction, musiccomposition, and speech processing. It will also be interesting to augmentsequence chunkers (Schmidhuber, 1992b, 1993) by LSTM to combine theadvantages of both.\nAppendix\nA.1 Algorithm Details. In what follows, the index kranges over output\nunits, iranges over hidden units, cjstands for the jth memory cell block, cv\nj\ndenotes the vth unit of memory cell block cj,u,l,mstand for arbitrary units,\nand tranges over all time steps of a given input sequence.\nThe gate unit logistic sigmoid (with range [0 ,1]) used in the experiments\nis\nf(x)=1\n1+exp(−x). (A.1)', '
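
The `+ 1.0` in the script shifts cosine similarity from the range [-1, 1] to [0, 2], because Elasticsearch does not accept negative scores from `script_score`. The `max_score` of 1.560598 above therefore corresponds to a cosine similarity of roughly 0.56. A minimal pure-Python sketch of the same scoring, with `shifted_cosine` as a hypothetical helper name:

```python
import math
from typing import Sequence


def shifted_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity shifted by +1.0 so the result lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) + 1.0


print(shifted_cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(shifted_cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(shifted_cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```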

In [34]:
# response = es_client.search(
#     index="temp",
#     body={
#         "query": {
#             "match": {
#                 "content": {
#                     "query": user_query
#                 }
#             }
#         }
#     },
#     knn={
#         "field": "dense_vector",
#         "query_vector": query_vector,
#         "k": 10,
#         "num_candidates": 100
#     },
#     rank={
#         "rrf": {}
#     }
# )

  response  = es_client.search(


BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: to perform knn search on field [dense_vector], its mapping must have [index] set to [true]')
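
The error above states that the `dense_vector` mapping must have `index` set to `true` before a kNN search can run on it. A sketch of a mapping variant that would satisfy this, assuming cosine similarity to mirror the `cosineSimilarity` script used earlier. `knn_mappings` is a hypothetical name, and the index would most likely need to be recreated and repopulated for the change to take effect.

```python
# Hypothetical mapping variant with the dense field indexed for kNN search.
knn_mappings = {
    "properties": {
        "content": {"type": "text", "similarity": "BM25"},
        "dense_vector": {
            "type": "dense_vector",
            "dims": 384,             # all-MiniLM-L6-v2 produces 384-dim vectors
            "index": True,           # requested by the error message above
            "similarity": "cosine",  # assumption: mirrors cosineSimilarity in the script query
        },
    }
}

# With a live cluster, the index would be recreated with this mapping:
# es_client.indices.create(index="temp", mappings=knn_mappings)
print(knn_mappings["properties"]["dense_vector"])
```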

In [31]:
# AuthorizationException: AuthorizationException(403, 'security_exception', 'current license is non-compliant for [Reciprocal Rank Fusion (RRF)]')
# BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: to perform knn search on field [dense_vector], its mapping must have [index] set to [true]')


  resp = es_client.search(


BadRequestError: BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [retriever].')

In [35]:
def rrf_rank(bm25_results, dense_results, rank_constant=20):
    combined_scores = {}

    # Process BM25 results
    for rank, hit in enumerate(bm25_results['hits']['hits']):
        doc_id = hit['_id']
        score = 1 / (rank + 1 + rank_constant)  # 1-based index for rank
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + score

    # Process Dense results
    for rank, hit in enumerate(dense_results['hits']['hits']):
        doc_id = hit['_id']
        score = 1 / (rank + 1 + rank_constant)  # 1-based index for rank
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + score

    # Sort by combined RRF score
    ranked_results = sorted(combined_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked_results[:5]  # Return top 5 results

# Combine results using RRF
top_5_results = rrf_rank(bm25_results, dense_results)

In [36]:
top_5_results

[('127', 0.09523809523809523),
 ('33', 0.08391608391608392),
 ('40', 0.08347826086956522),
 ('41', 0.045454545454545456),
 ('141', 0.043478260869565216)]
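
The scores above can be checked by hand. With `rank_constant=20`, a document at 1-based rank r contributes 1 / (20 + r) per list. The sketch below lists the rank combinations whose sums match the printed scores. Apart from document 127, which is first in both result sets, which list supplied which rank is not visible in the output, so it is not assumed here. `rrf_contribution` is a hypothetical helper.

```python
def rrf_contribution(rank: int, rank_constant: int = 20) -> float:
    """Contribution of a single 1-based rank to the RRF score."""
    return 1 / (rank_constant + rank)


# Rank combinations whose sums match the scores printed above.
reconstructed = {
    "127": rrf_contribution(1) + rrf_contribution(1),  # first in both lists
    "33": rrf_contribution(2) + rrf_contribution(6),
    "40": rrf_contribution(3) + rrf_contribution(5),
    "41": rrf_contribution(2),                         # present in one list only
    "141": rrf_contribution(3),                        # present in one list only
}
for doc_id, score in reconstructed.items():
    print(doc_id, round(score, 6))
```

Documents that appear in both lists dominate the ranking: a single first-place finish is worth at most 1/21, less than any two appearances within the top ten of both lists.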

In [38]:
rrf_doc_ids = [doc_id for doc_id, _ in top_5_results]

print("BM25 Results for Specified Document IDs:")
for hit in bm25_results['hits']['hits']:
    if hit['_id'] in rrf_doc_ids:
        doc_id = hit['_id']
        score = hit['_score']  # The BM25 score
        content = hit['_source']['content']  # Adjust based on your actual document structure
        print(f"Document ID: {doc_id}, BM25 Score: {score}, Content: {content}")

BM25 Results for Specified Document IDs:
Document ID: 127, BM25 Score: 8.395834, Content: To ﬁnd out about LSTM’s practical limitations we intend to apply it to
real-world data. Application areas will include time-series prediction, musiccomposition, and speech processing. It will also be interesting to augmentsequence chunkers (Schmidhuber, 1992b, 1993) by LSTM to combine theadvantages of both.
Appendix
A.1 Algorithm Details. In what follows, the index kranges over output
units, iranges over hidden units, cjstands for the jth memory cell block, cv
j
denotes the vth unit of memory cell block cj,u,l,mstand for arbitrary units,
and tranges over all time steps of a given input sequence.
The gate unit logistic sigmoid (with range [0 ,1]) used in the experiments
is
f(x)=1
1+exp(−x). (A.1)
Document ID: 41, BM25 Score: 6.765766, Content: 4.4 Memory Cell Blocks. Smemory cells sharing the same input gate
and the same output gate form a structure called a memory cell block of sizeS. Memory cell 
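
The loop above only prints documents that appear in the BM25 hits, so a document that reached the RRF top 5 through the dense results alone would be skipped. A sketch of one way to look up each fused ID in both result sets. `collect_hits` is a hypothetical helper, and the toy responses are made up to mimic the shape of a search response.

```python
from typing import Dict, List, Tuple


def collect_hits(ranked: List[Tuple[str, float]], *responses: Dict) -> List[Dict]:
    """Hypothetical helper: attach content to each fused (doc_id, score) pair."""
    by_id = {}
    for response in responses:
        for hit in response["hits"]["hits"]:
            # Keep the first copy seen; the stored content is the same in every result set.
            by_id.setdefault(hit["_id"], hit["_source"]["content"])
    return [
        {"id": doc_id, "rrf_score": score, "content": by_id[doc_id]}
        for doc_id, score in ranked
    ]


# Toy responses (made up) that mimic the shape of an Elasticsearch search response.
toy_bm25 = {"hits": {"hits": [{"_id": "a", "_source": {"content": "alpha"}}]}}
toy_dense = {"hits": {"hits": [{"_id": "b", "_source": {"content": "beta"}}]}}
fused = collect_hits([("b", 0.05), ("a", 0.04)], toy_bm25, toy_dense)
for row in fused:
    print(row)
```

In the notebook this would be called as `collect_hits(top_5_results, bm25_results, dense_results)`.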

In [8]:
# Clean up: delete the index
es_client.indices.delete(index="temp")

ObjectApiResponse({'acknowledged': True})