# AI-Powered Document Retrieval System

This Jupyter Notebook demonstrates an AI-powered document retrieval system using advanced NLP techniques. The system employs:

1. Document loading and text splitting
2. Sentence transformer-based text embedding
3. Chroma vector storage
4. Similarity-based retrieval

We use the SentenceTransformer model "all-MiniLM-L6-v2" for creating text embeddings, and leverage the langchain library for document processing and retrieval operations. This notebook walks through the process of loading documents, creating embeddings, storing them in a vector database, and performing semantic searches to retrieve relevant information.

Key libraries: langchain, sentence_transformers, chromadb

Compared with traditional keyword-based methods, this approach matches documents on meaning rather than exact wording, which suits applications that need nuanced document search and information extraction.

The notebook imports the following components:

1. `TextLoader` (from langchain.document_loaders):
   - Used for loading text documents into the system.

2. `CharacterTextSplitter` (from langchain.text_splitter):
   - Splits loaded documents into smaller chunks for processing.

3. `OpenAIEmbeddings` (from langchain.embeddings):
   - Provides text embedding capabilities using OpenAI's models.
   - Note: Requires an OpenAI API key to use.

4. `SentenceTransformerEmbeddings` (from langchain.embeddings.sentence_transformer):
   - An alternative embedding method using the SentenceTransformer library.
   - Doesn't require an API key and can run locally.

5. `Chroma` (from langchain.vectorstores):
   - A vector store for efficiently storing and retrieving embeddings.

6. `RetrievalQA` (from langchain.chains):
   - Implements a question-answering chain using retrieved documents.

In [1]:
#Please install the following libraries
!pip install langchain
!pip install chromadb
!pip install sentence_transformers
!pip install jmespath




[notice] A new release of pip available: 22.3 -> 24.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

## Document Loaders

Document loaders are powerful tools that serve as the foundation of our AI-powered retrieval system. They act as versatile bridges, connecting our application to a wide array of information sources:

- Text and document files (.txt, .md, .pdf)
- Web pages and HTML content
- Databases and structured data formats
- Email archives and chat logs
- Cloud storage services (Google Drive, Dropbox)
- And many more!

By leveraging document loaders, we can effortlessly ingest and process information from multiple origins, ensuring our system has access to a rich, diverse knowledge base. This flexibility allows us to tackle complex, real-world information retrieval challenges across various domains and data formats.

In [3]:
loader = TextLoader('projects/tax_examples/corpus.txt', encoding='utf-8')
documents = loader.load()

In [4]:
len(documents)

1
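The result above shows that the whole file arrives as a single document. As a rough illustration of why, here is a minimal sketch of what a plain-text loader does conceptually. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for this example, not LangChain classes; the sketch only assumes that the loader reads the entire file into one document and records the path as metadata.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not a LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read one file and wrap its full text in a single document."""
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first paragraph\n\nsecond paragraph")
    demo_path = tmp.name

demo_docs = load_text_file(demo_path)
os.remove(demo_path)

# One file in, one document out, regardless of how long the file is.
print(len(demo_docs), demo_docs[0].metadata["source"] == demo_path)
```

Because the file becomes one large document, splitting it into smaller chunks is left to the next step.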

## Document Transformers

Document transformers are crucial components in our AI-powered retrieval system, acting as intelligent processors that prepare our raw data for advanced analysis:

- Text Splitting: Break down large documents into manageable chunks
  - Preserve context and meaning
  - Optimize for embedding and retrieval processes

- Redundancy Elimination: Identify and remove duplicate or near-duplicate content
  - Improve efficiency of storage and retrieval
  - Enhance the quality of search results

- Content Normalization: Standardize text format and structure
  - Handle inconsistencies in formatting, encoding, or language
  - Ensure uniform processing across diverse document sources

By employing document transformers, we significantly enhance the quality and efficiency of our data pipeline. This step is vital for:

1. Improving the accuracy of embeddings and semantic search
2. Reducing computational overhead in downstream processes
3. Ensuring a cleaner, more relevant dataset for analysis and retrieval

In this section, we'll explore how to leverage document transformers effectively, setting the stage for more accurate and efficient information retrieval.
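The notebook itself applies only text splitting. To make the redundancy-elimination and normalization ideas above more concrete, the following is a minimal sketch that drops chunks which are identical after simple normalization. The helper names `normalize` and `drop_exact_duplicates` are hypothetical, and the normalization rule (lowercasing and collapsing whitespace) is an assumption chosen for illustration; near-duplicate detection would need a fuzzier comparison.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(chunks):
    """Keep the first occurrence of each chunk, comparing normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


sample_chunks = [
    "When to file my return",
    "when to  file my return",  # same text after normalization
    "Where to file my return",
]
unique_chunks = drop_exact_duplicates(sample_chunks)
print(unique_chunks)
```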

In [5]:
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)

texts = text_splitter.split_documents(documents)

Created a chunk of size 320, which is longer than the specified 200
Created a chunk of size 220, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 253, which is longer than the specified 200
Created a chunk of size 249, which is longer than the specified 200
Created a chunk of size 385, which is longer than the specified 200
Created a chunk of size 220, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200
Created a chunk of size 228, which is longer than the specified 200
Created a chunk of size 202, which is longer than the specified 200
Created a chunk of size 359, which is longer than the specified 200
Created a chunk of size 338, which is longer than the specified 200
Created a chunk of size 241, which is longer than the specified 200
Created a chunk of size 642307, which is longer than the specified 200
Created a chunk of size 219, which is longer 

In [6]:
len(texts)

46
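The "Created a chunk of size ..." warnings above appear because a separator-based splitter only cuts at its separator (a blank line, `"\n\n"`, by default for `CharacterTextSplitter`). A stretch of text with no separator is kept whole even when it is longer than `chunk_size`. The sketch below is a simplified illustration of that behavior, not LangChain's actual implementation; `split_on_separator` is a hypothetical helper.

```python
def split_on_separator(text, chunk_size=200, separator="\n\n"):
    """Greedy separator-based splitting.

    Pieces are merged while the result fits in chunk_size, but a single
    piece is never cut, so it may exceed chunk_size.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may already be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs, one 300-character run without blank lines, one short paragraph.
sample = "short one\n\nshort two\n\n" + "x" * 300 + "\n\nshort three"
chunks = split_on_separator(sample, chunk_size=50)
print([len(c) for c in chunks])  # the middle chunk is far above the 50-character target
```

The 642307-character chunk in the log suggests that a large part of the corpus contains no blank lines at all. A splitter that falls back to finer separators, such as LangChain's `RecursiveCharacterTextSplitter`, is one option for keeping chunks close to the target size.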

In [7]:
texts

[Document(page_content='15t such as legislation enacted after it was published go to irs.govpub15t.\n\nwhats new form w4p and form w4r.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='previously form w4p was also used to make withholding elections for nonperiodic payments and eligible rollover distributions\n.\nwithholding elections for nonperiodic payments and eligible rollover distributions are now made on form w4r withholding certificate for nonperiodic payments and eligible rollover distributions.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='also see how to treat 2021 and earlier forms w4p as if they were 2022 or later forms w4p later for an optional computational bridge.\n\nfor more information about form w4r see section 8 of pub.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='employers may use an optional computational bridge to treat 2019 and earlier forms w4 as if they were 2020 or later forms w4 for

## Text Embedding Models: Bridging Language and Mathematics

Text embedding models are the cornerstone of modern natural language processing, serving as powerful translators between human language and machine-understandable numerical representations.


### The Power of Sentence Transformers

In our system, we leverage a sentence-transformers model, a sophisticated tool designed to:

- Transform sentences and paragraphs into a rich, 384-dimensional vector space
- Capture semantic relationships between words and phrases
- Provide a consistent, numerical representation of textual data

By employing these embedding models, we convert raw text into vectors that can be compared numerically. This step is the foundation for the semantic search performed in the rest of the notebook.
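To illustrate how vectors capture semantic relationships, the sketch below compares toy vectors with cosine similarity, a common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors and their names are invented for illustration; real embeddings from this model have 384 dimensions, and the distance metric a vector store actually uses is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
tax_form = [0.9, 0.1, 0.0]
withholding = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

related = cosine_similarity(tax_form, withholding)
unrelated = cosine_similarity(tax_form, weather)
print(related > unrelated)  # vectors pointing in similar directions score higher
```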

In [8]:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

## Vector Stores: Efficient Storage and Retrieval of Embedded Data

Vector stores are specialized databases designed to store and search over embedded data.
In our project, we use Chroma, an AI-native open-source embedding database.

Key features of Chroma:

1. Optimized for storing and querying high-dimensional vectors
2. Enables efficient similarity search
3. Designed for building LLM (Large Language Model) applications
4. Makes knowledge, facts, and skills easily pluggable for LLMs
5. Open-source and free to use under the Apache 2.0 license

In this section, we'll demonstrate how to:

1. Create a Chroma vector store
2. Load text embeddings into Chroma
3. Perform similarity searches to find the stored documents most similar to your query

By leveraging Chroma as our vector store, we can efficiently store, retrieve, and search over our embedded text data, enabling powerful semantic search capabilities in our application.

In [9]:
db = Chroma.from_documents(texts, embeddings)
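To give a feel for what a similarity search over stored vectors involves, here is a minimal in-memory sketch. `TinyVectorStore` is a hypothetical class written for this example and is not how Chroma is implemented: it scans every stored vector, whereas a real vector store uses index structures to avoid a full scan. The two-dimensional vectors and the texts are made up.

```python
import math


class TinyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) pairs."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def similarity_search(self, query_vector, k=3):
        """Return the k texts whose vectors are closest (Euclidean) to the query."""
        ranked = sorted(
            self._items, key=lambda item: math.dist(item[0], query_vector)
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([1.0, 0.0], "send comments through the website")
store.add([0.9, 0.2], "suggestions for future editions")
store.add([0.0, 1.0], "combat zone tax forgiveness")

# The query vector sits near the first two entries, so they are returned first.
top_hits = store.similarity_search([1.0, 0.1], k=2)
print(top_hits)
```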

Embeddings transform text into high-dimensional vectors of floating-point numbers. 
Let's explore what these numeric representations look like and how we can interpret them.

In [10]:
db._collection.get(include=['embeddings'])

{'ids': ['04c85bdb-997f-446f-988b-b8f17ced50cb',
  '0677af37-f7b5-476d-8ab3-8c494754ac52',
  '077cbc1c-582a-45e6-8289-15e8616d819c',
  '07d8961a-9c02-456e-934b-77ac511aee9a',
  '095279d0-b612-4c3c-8ef1-2c9e0cbfd769',
  '1660fd18-89af-4664-8628-fe4327ae6b91',
  '169d37d1-3896-4eb0-a14b-0748996f8aa9',
  '18153671-c7c8-47a8-b31c-a4a90b8f0559',
  '1c41424e-23ed-4481-a928-75a7b82ec84c',
  '296b36bb-df0e-45c3-b4b4-627c10cf3dbc',
  '2d47502d-93cc-4413-9cc8-55cc1b986c76',
  '46a527b5-af10-4afb-8c95-0c1ac254d93c',
  '4a9926c2-4468-4c40-b0b0-bc46663f88e1',
  '5218ab43-7277-421a-a34e-e46f362c1b5a',
  '53055b85-060e-423e-a947-348a6eedf272',
  '55957d6f-02b0-44c4-8d68-c65bee8b472a',
  '5b9bbe97-5c53-431e-a2e6-810bac455e50',
  '5fce6445-8f0f-4e9b-969e-4e2e930a5c46',
  '600876eb-e8bc-45be-8571-c274e3152bb3',
  '64401cb4-25de-4095-bfd2-4a5e9f087ff4',
  '687b8d8f-89fa-43f9-9fc6-e6c0bb3af4e1',
  '6bde4bfa-0a6d-4920-8c03-2aa47ee634b7',
  '813c9c24-e46a-482e-ac59-840eda3f2096',
  '8a30b35c-2342-4360-8ec1-

## Retrievers: Efficient Querying of Vector Stores

Retrievers let us query our vector store and fetch relevant documents efficiently. In our system, we use a vector store retriever, which provides a simplified interface to our Chroma vector store.

Key features of Vector Store Retrievers:

1. Seamless integration with vector stores (e.g., Chroma)
2. Utilization of advanced search functionalities:
   - Similarity search
   - Maximal Marginal Relevance (MMR)
3. Customizable search parameters for fine-tuned retrieval

In this section, we'll demonstrate how to:

1. Create a retriever from our vector store
2. Customize the retrieval process
3. Perform queries and analyze the results

The retriever leverages the vector store's search capabilities to find the most relevant documents based on the semantic similarity between the query and the stored document embeddings.

Advanced Usage:

- Experiment with different `k` values to balance between precision and recall
- Explore other search parameters like `score_threshold` (used with `search_type="similarity_score_threshold"`) for filtering results
- Consider using MMR search for increased diversity in retrieved documents:
  `retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 3})`

By effectively using retrievers, we can efficiently support question-answering and information retrieval capabilities in our application.

In [11]:
retriever = db.as_retriever(search_kwargs={"k": 3})

In [12]:
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000016948DD7ED0>, search_kwargs={'k': 3})
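The retriever above uses plain similarity search. Since Maximal Marginal Relevance (MMR) is mentioned as an alternative, the following is a minimal sketch of the idea: each pick trades relevance to the query against similarity to the documents already chosen. The function `mmr_select`, the weight name `lambda_mult`, and the toy vectors are assumptions made for illustration; this is not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 means pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [
    [1.0, 0.0],    # most relevant
    [0.99, 0.05],  # nearly identical to the first candidate
    [0.6, 0.8],    # less relevant, but different from the first
]
picked = mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.3)
print(picked)  # the near-duplicate is skipped in favor of the more distinct candidate
```

With `lambda_mult=1.0` the same call reduces to ordinary similarity ranking and returns the two near-identical candidates instead.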

## Test Case 1

### Retrieving Relevant Information

The `get_relevant_documents` method is a powerful tool in our retrieval system.
It allows us to fetch a list of documents that are most relevant to a given query.
Recent LangChain releases deprecate `get_relevant_documents` in favor of `retriever.invoke(query)`, which is why the call below emits a deprecation warning.

Key features:

1. Returns a list of relevant documents based on semantic similarity
2. Utilizes the underlying vector store for efficient retrieval
3. Customizable through the retriever's search parameters

In this section, we'll explore how to use this method and interpret its results.

Interpretation of results:

- Each returned document is an object containing the document's content and metadata
- Documents are ordered by relevance to the query
- The number of returned documents depends on the retriever's configuration (e.g., the `k` value)

Practical applications:

1. Question Answering: Use retrieved documents to formulate answers to user queries
2. Information Summarization: Synthesize key information from relevant documents
3. Document Recommendations: Suggest related documents based on user interests

Tips for effective use:

- Experiment with different queries to understand the retrieval behavior
- Adjust the retriever's search parameters to fine-tune results
- Consider post-processing the retrieved documents for further analysis or presentation

By leveraging the `get_relevant_documents` method, you can efficiently extract relevant information from your document collection, enabling a wide range of applications in information retrieval and natural language processing.

In [13]:
docs = retriever.get_relevant_documents("Where to send comments?")

  warn_deprecated(


In [14]:
docs

[Document(page_content='we welcome your comments about this publication and suggestions for future editions.\n\nyou can send us comments through irs.govformcomments.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='this section provides specific requirements for substitute submissions of form w4.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='state bonus payments state bonus payments t tax forgiven combat zone related tax help spouse in missing status how to get tax help terrorist related forgiveness terrorist or military w when to file when to file my return where to file where to file my return y yugoslavia the kosovo area.', metadata={'source': 'C:\\AI574\\corpus.txt'})]
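The raw list above is hard to scan. As one possible way to post-process retrieved documents for presentation, the sketch below prints a ranked, truncated preview of each result along with its source. `SimpleDocument` and `preview` are hypothetical stand-ins for illustration; the sketch assumes only that each result exposes `page_content` and `metadata`, as the output above shows.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in exposing the same two fields as a retrieved result."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(doc, max_chars=200):
    """Return the first max_chars characters, marking truncation with '...'."""
    text = doc.page_content
    return text if len(text) <= max_chars else text[:max_chars] + "..."


results = [
    SimpleDocument(
        "you can send us comments through irs.govformcomments.",
        {"source": "corpus.txt"},
    ),
    SimpleDocument("x" * 250, {"source": "corpus.txt"}),  # long enough to be truncated
]

for rank, doc in enumerate(results, start=1):
    print(f"{rank}. {preview(doc)}  (source: {doc.metadata['source']})")
```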

## Test Case 2

In [15]:
docs2 = retriever.get_relevant_documents("Can you deduct state and local income taxes?")

In [16]:
docs2

[Document(page_content='if you have a tax question not answered by this publication check irs.gov and how to get tax help at the end of this publication.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='employees who write exempt on form w4 in the space below step 4c shall have no federal income tax withheld from their paychecks except in the case of certain supplemental wages.', metadata={'source': 'C:\\AI574\\corpus.txt'}),
 Document(page_content='generally an employee may claim exemption from federal income tax withholding because they had no federal income tax liability last year and expect none this year.\nsubstitute submissions of form w4 general requirements for any system set up to electronically receive a form w4 or form w4p are discussed earlier under electronic submission of forms w4 and w4p.', metadata={'source': 'C:\\AI574\\corpus.txt'})]