<a href="https://colab.research.google.com/github/Prajna1999/medbot/blob/main/medical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Keyword lookup with vector lookup



# Documentation for Medical Data Notebook

## Table of Contents
1. [Overview](#overview)
2. [Initial Setup](#initial-setup)
3. [Imports](#imports)
4. [Configuration](#configuration)
5. [Main Functionality](#main-functionality)
6. [Usage](#usage)
7. [Additional Notes](#additional-notes)

## Overview
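This notebook builds a question-answering pipeline over medical documents (PDF files) with LlamaIndex. It parses the documents into nodes, indexes the same nodes in both a vector index and a keyword index, and combines the two lookups in a custom retriever that feeds a query engine.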

## Initial Setup
Install the required packages before running any other cell. `pypdf` is installed so that PDF files in the data folder can be read.

### Required Packages:
```python
!pip install llama-index
!pip install Ipython
!pip install pypdf
```

## Imports
The notebook imports the LlamaIndex index, retriever, and callback classes, the IPython display helpers, the `openai` client, and `tiktoken` for token counting.

### Libraries and Modules:
```python
import logging
import sys
from llama_index import ...
from IPython.display import Markdown, display, HTML
from typing import List
import openai
from getpass import getpass
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
...
```

(Note: The ellipsis (`...`) is a placeholder indicating additional items. Refer to the original code for a complete list.)

## Configuration
After the imports, two things are configured: the OpenAI API key and a token counter that is attached through a callback manager.

### OpenAI Configuration:
To utilize the OpenAI services, input your API key when prompted:
```python
openai.api_key=getpass("Enter your OAI key: ")
```
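When the notebook is run repeatedly, typing the key each time can be avoided by checking an environment variable first. The sketch below is one possible approach and is not part of the original notebook: `read_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is an assumed variable name.

```python
import os
from getpass import getpass


def read_api_key(env_var="OPENAI_API_KEY", prompt="Enter your OAI key: "):
    """Return the key from the environment, or prompt for it if unset.

    Hypothetical helper: the function name and the env_var default are
    assumptions made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed characters, as in the notebook's own cell.
        key = getpass(prompt).strip()
    return key
```

In the notebook this would be used as `openai.api_key = read_api_key()`. The prompt only appears when the variable is missing or empty.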

### Token Counter Configuration:
The token counter tallies the tokens consumed by calls that run with this callback manager. It tokenizes with the `gpt-3.5-turbo` encoding from `tiktoken`.
```python
token_counter = TokenCountingHandler(...)
callback_manager = CallbackManager([token_counter])
```
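Once queries have run, the counter's totals can be gathered in one place. The sketch below assumes the handler exposes the four attributes named in the code; the notebook itself only reads `total_embedding_token_count`, so the other three are assumptions. `summarize_token_usage` is a hypothetical helper, and the stand-in object holds made-up numbers so that the example runs without llama-index installed.

```python
from types import SimpleNamespace


def summarize_token_usage(counter):
    """Collect the running totals from a token counter into a dict.

    Hypothetical helper: it only reads attributes, so any object that
    exposes these four counters works.
    """
    return {
        "embedding": counter.total_embedding_token_count,
        "llm_prompt": counter.prompt_llm_token_count,
        "llm_completion": counter.completion_llm_token_count,
        "llm_total": counter.total_llm_token_count,
    }


# Stand-in with made-up numbers; in the notebook, pass `token_counter`.
fake_counter = SimpleNamespace(
    total_embedding_token_count=120,
    prompt_llm_token_count=300,
    completion_llm_token_count=45,
    total_llm_token_count=345,
)
print(summarize_token_usage(fake_counter))
```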

## Main Functionality
The core of the notebook loads the documents, parses them into nodes, builds a vector index and a keyword index over the same nodes, and queries both through a custom retriever. The loading step looks like this:

```python
...
# install the PDF parser used when reading .pdf files
!pip install pypdf
# load data
documents=SimpleDirectoryReader('/content/medical_folder', recursive=True, exclude_hidden=True)...
```

(Note: This is a concise summary. For detailed operations, refer to the notebook directly.)

## Usage
1. Begin with the initial setup to ensure all dependencies are installed.
2. Follow with the imports section.
3. Proceed with the configuration, ensuring you have your OpenAI API key.
4. Execute the main functionality sections sequentially.

## Additional Notes
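- The code was run against the library versions shown in the install logs (for example `llama-index` 0.8.2.post1 and `openai` 0.27.8). Later releases of these libraries changed import paths and APIs, so if the cells fail on a fresh install, pin these versions rather than upgrading.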
- Secure your OpenAI API key and do not expose it in shared or public environments.
- For extensive data operations, consider monitoring resource usage to avoid potential system slowdowns.
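
Because library versions affect compatibility, it can help to record which versions are actually installed before running the cells. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages installed by this notebook.
for name in ("llama-index", "openai", "tiktoken", "pypdf"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```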



In [2]:
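import logging
import sys

# Send INFO-level logs to stdout so LlamaIndex messages appear in the cell output.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
# Note: this second stdout handler makes each log message appear twice.
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))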


In [3]:
!pip install llama-index


Collecting llama-index
  Downloading llama_index-0.8.2.post1-py3-none-any.whl (676 kB)
Collecting tiktoken (from llama-index)
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting dataclasses-json (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting langchain>=0.0.262 (from llama-index)
  Downloading langchain-0.0.265-py3-none-any.whl (1.5 MB)
Collecting openai>=0.26.4 (from llama-index)
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)

In [4]:
!pip install Ipython

Collecting jedi>=0.16 (from Ipython)
  Downloading jedi-0.19.0-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.0


In [10]:
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    SimpleKeywordTableIndex,
    ServiceContext,
    StorageContext,
)

from IPython.display import Markdown, display, HTML

In [6]:
from typing import List

OpenAI API key: you need your own OpenAI API key.

In [7]:
import openai
from getpass import getpass
openai.api_key=getpass("Enter your OAI key: ")

Enter your OAI key: ··········


In [8]:
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
    verbose=False  # set to True to see usage printed to the console
)

callback_manager = CallbackManager([token_counter])

Load Data
1. Load the documents
2. Initialize a service context
3. Initialize a storage context

In [12]:
# install pypdf, which is needed to read the PDF files
!pip install pypdf


Collecting pypdf
  Downloading pypdf-3.15.1-py3-none-any.whl (271 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.15.1


In [42]:
#load data
documents=SimpleDirectoryReader('/content/medical_folder', recursive=True, exclude_hidden=True).load_data()
print(documents)

[Document(id_='e79ea734-9ff5-4b83-8653-5f336dc3cfd7', embedding=None, metadata={'page_label': '1', 'file_name': 'D134BC00001 - AU 0301 Informed Consent Form(3).pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='9e5e6fc6577f38688816056fc4df613ffab1925e233364d18676e6279e73bbac', text=' \nStudy Code: D134BC00001   \nNMA Pregnancy  Participant Information Sheet/Consent Form  Version 1.0 dated 25May2021  \nLocal governance version  - Peter MacCallum Cancer Centre dated 05Jul2021                                              Page 1 of 4 Peter MacCallum Cancer Centre  Locations  \n305 Grattan Street  Melbourne  \nMelbourne  Bendigo  \nVictoria 3000 Australia  Box Hill   \n           Moorabbin  \nPostal Address  Sunshine   \nLocked Bag 1 A’Beckett Street      \nVictoria 8006 Australia     \n   \nPhone  +61 3 8559 5000   \nFax +61 3 8559 7379   \nABN  42 100 504 883   \npetermac.org  \n \nParticipant Information Sheet/ Consent Form  in the Event of Pre

In [43]:
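# Initialize a service context; chunk_size caps the number of tokens per node
service_context = ServiceContext.from_defaults(
    chunk_size=2048,
    callback_manager=callback_manager,
)
# Split the loaded documents into nodes with the context's node parser
node_parser = service_context.node_parser
nodes = node_parser.get_nodes_from_documents(documents)
# No embeddings have been computed yet, so this prints 0
print(token_counter.total_embedding_token_count)
print(nodes)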

0
[TextNode(id_='69275fad-9dec-4967-9dec-a7f2df9a3861', embedding=None, metadata={'page_label': '1', 'file_name': 'D134BC00001 - AU 0301 Informed Consent Form(3).pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e79ea734-9ff5-4b83-8653-5f336dc3cfd7', node_type=None, metadata={'page_label': '1', 'file_name': 'D134BC00001 - AU 0301 Informed Consent Form(3).pdf'}, hash='9e5e6fc6577f38688816056fc4df613ffab1925e233364d18676e6279e73bbac')}, hash='eea8364c6abce57ac9d4cad3f55e416ef9b057a683962652624b04118dd30b21', text='Study Code: D134BC00001   \nNMA Pregnancy  Participant Information Sheet/Consent Form  Version 1.0 dated 25May2021  \nLocal governance version  - Peter MacCallum Cancer Centre dated 05Jul2021                                              Page 1 of 4 Peter MacCallum Cancer Centre  Locations  \n305 Grattan Street  Melbourne  \nMelbourne  Bendigo  \nVictoria 3000 Australia  Box Hill   \n   

In [44]:
# Initialize a storage context and register the nodes in its document store
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
print(token_counter.total_embedding_token_count)

0


Build a vector index and a keyword index over the same document store

In [45]:
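# Both indexes share storage_context, so they are built over the same nodes.
# No service_context is passed, so the token counter is likely not attached here
# (pass service_context=service_context to count embedding tokens).
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_index = SimpleKeywordTableIndex(nodes, storage_context=storage_context)
print(token_counter.total_embedding_token_count)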

0
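
To make the difference between the two lookups concrete, here is a toy illustration in plain Python. It is not how LlamaIndex implements either index: the keyword side is a simple word-to-node table, the vector side uses bag-of-words counts with cosine similarity in place of learned embeddings, and the three short texts are made up for the example.

```python
import math
import re
from collections import defaultdict

# Made-up node texts, used only for illustration.
docs = {
    "n1": "consent form for the pregnancy study",
    "n2": "tumour scans and study drug continuation",
    "n3": "contact phone and postal address",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


# Keyword table: each word maps to the set of node ids containing it.
keyword_table = defaultdict(set)
for node_id, text in docs.items():
    for word in tokenize(text):
        keyword_table[word].add(node_id)

vocab = sorted(keyword_table)


def keyword_lookup(query):
    """Return ids of nodes that share at least one word with the query."""
    hits = set()
    for word in tokenize(query):
        hits |= keyword_table.get(word, set())
    return hits


def bag_of_words(text):
    words = tokenize(text)
    return [words.count(term) for term in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_lookup(query, top_k=2):
    """Return the top_k node ids ranked by cosine similarity to the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(t)), nid) for nid, t in docs.items()]
    scored.sort(reverse=True)
    return [nid for _, nid in scored[:top_k]]


print(keyword_lookup("study drug"))
print(vector_lookup("study drug", top_k=1))
```

The keyword lookup returns every node that mentions any query word, while the vector lookup ranks all nodes and keeps only the closest `top_k`.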


In [46]:
# import QueryBundle
from llama_index import QueryBundle

# import NodeWithScore
from llama_index.schema import NodeWithScore

# Retrievers
from llama_index.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    KeywordTableSimpleRetriever,
)

from typing import List

In [47]:
class CustomRetriever(BaseRetriever):
    """Custom retriever that combines vector (semantic) search with keyword search."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        keyword_retriever: KeywordTableSimpleRetriever,
        mode: str = "AND",
    ) -> None:
        """Init params."""

        self._vector_retriever = vector_retriever
        self._keyword_retriever = keyword_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""

        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        keyword_nodes = self._keyword_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        keyword_ids = {n.node.node_id for n in keyword_nodes}

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in keyword_nodes})

        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(keyword_ids)
        else:
            retrieve_ids = vector_ids.union(keyword_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
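
The AND/OR logic in `_retrieve` can be tried without any index. The sketch below mirrors that logic on a stand-in node type: `ScoredNode` and `combine` are hypothetical names, not LlamaIndex classes, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredNode:
    """Stand-in for a retrieved node; not a LlamaIndex class."""
    node_id: str
    score: float


def combine(vector_nodes, keyword_nodes, mode="AND"):
    """Merge two result lists by node id using set intersection or union."""
    if mode not in ("AND", "OR"):
        raise ValueError("Invalid mode.")
    vector_ids = {n.node_id for n in vector_nodes}
    keyword_ids = {n.node_id for n in keyword_nodes}
    # When an id appears in both lists, the keyword entry overwrites the
    # vector entry, mirroring the dict.update call in CustomRetriever.
    combined = {n.node_id: n for n in vector_nodes}
    combined.update({n.node_id: n for n in keyword_nodes})
    if mode == "AND":
        ids = vector_ids & keyword_ids
    else:
        ids = vector_ids | keyword_ids
    # Sorted so the output order is deterministic (set order is not).
    return [combined[i] for i in sorted(ids)]


vector_hits = [ScoredNode("a", 0.91), ScoredNode("b", 0.80)]
keyword_hits = [ScoredNode("b", 1.0), ScoredNode("c", 1.0)]

print([n.node_id for n in combine(vector_hits, keyword_hits, "AND")])  # ['b']
print([n.node_id for n in combine(vector_hits, keyword_hits, "OR")])   # ['a', 'b', 'c']
```

With the default `"AND"` mode, a query returns nothing when the two retrievers have no node in common. `"OR"` is the more forgiving choice when keyword matches are sparse.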

Plug the retriever into a query engine and run some queries

In [48]:
from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

#define the custom retriever
vector_retriever=VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=2
)
keyword_retriever=KeywordTableSimpleRetriever(
    index=keyword_index,
)
#instantiate custom retriever
custom_retriever=CustomRetriever(
    vector_retriever,
    keyword_retriever
)
#define the response synthesizer
response_synthesizer=get_response_synthesizer()

#assemble query engine
custom_query_engine=RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)

# vector query engine
# vector_query_engine = RetrieverQueryEngine(
#     retriever=vector_retriever,
#     response_synthesizer=response_synthesizer,
# )
# # keyword query engine
# keyword_query_engine = RetrieverQueryEngine(
#     retriever=keyword_retriever,
#     response_synthesizer=response_synthesizer,
# )


Run the retrieval query


In [49]:
def execute_query():
    """Prompt for a query, run it through custom_query_engine, and return the response.

    Prints the reason and returns None if the query is empty, fails, or yields nothing.
    """
    query = input("Please enter your query: ").strip()
    if not query:
        print("Error: The query must be a non-empty string.")
        return None

    try:
        response = custom_query_engine.query(query)
    except Exception as e:
        print(f"An error occurred while executing the query: {e}")
        return None

    if response is None or not getattr(response, "response", None):
        print("Error: The query returned no results.")
        return None

    print(response)
    return response

In [51]:
response = execute_query()
if response is not None:
    answer = getattr(response, "response", response)  # error values have no .response
    display(HTML(f'<p style="font-size:16px">{answer}</p>'))

Please enter your query: Translate the paragraph within quotes; "Este anexo es para el estudio principal (D9950C00001) en el que ya participa. Este anexo es  para su posible continuación con el mismo fármaco o fármacos del estudio que ha recibido en  el estudio, aunque las exploraciones de su tumor muestren que su cáncer quizás haya  empeorado."
INFO:llama_index.indices.keyword_table.retrievers:> Starting query: Translate the paragraph within quotes; "Este anexo es para el estudio principal (D9950C00001) en el que ya participa. Este anexo es  para su posible continuación con el mismo fármaco o fármacos del estudio que ha recibido en  el estudio, aunque las exploraciones de su tumor muestren que su cáncer quizás haya  empeorado."
> Starting query: Translate the paragraph within quotes; "Este anexo es para el estudio principal (D9950C00001) en el que ya participa. Este anexo es  para su posible continuación con el mismo fármaco o fármacos del estudio que ha recibido en  el estudio, aunqu

In [36]:
display(HTML(f'<p style="font-size:16px">{response.response}</p>'))
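
The display call interpolates the answer text directly into HTML, so characters such as `<` or `&` in an answer would be read as markup. One option is to escape the text first, as in this minimal sketch; `to_html_paragraph` is a hypothetical helper, not part of the notebook.

```python
import html


def to_html_paragraph(text, font_size_px=16):
    """Wrap text in a styled <p>, escaping characters HTML treats specially."""
    safe = html.escape(str(text))
    return f'<p style="font-size:{font_size_px}px">{safe}</p>'


print(to_html_paragraph("dose < 5 mg & stable"))
```

In the notebook this would be used as `display(HTML(to_html_paragraph(response.response)))`.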