[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough we will see how to use Pinecone for semantic search. To begin we must install the required prerequisite libraries:

In [1]:
!pip install -qU \
  pinecone-client==3.1.0 \
  pinecone-datasets==0.7.0 \
  sentence-transformers==2.2.2

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Download

In this notebook we will skip the data preparation steps as they can be very time consuming and jump straight into it with the prebuilt dataset from *Pinecone Datasets*. If you'd rather see how it's all done, please refer to [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

Let's go ahead and download the dataset.

In [2]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# drop the existing metadata column; the blob column takes its place below
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()



In [None]:
print(len(dataset))

80000


## Serverless or Pod-based?

Before getting started, decide whether to use a serverless or a pod-based index.

In [2]:
import os
os.environ['OPENAI_API_KEY'] = ''
os.environ["PINECONE_API_KEY"] = ""
os.environ["PINECONE_ENVIRONMENT"] = "gcp-starter"


In [174]:
import os

use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"
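
The flag above is only true when `USE_SERVERLESS` is set to some casing of `true`. If other common spellings such as `1` or `yes` should also count, a small helper along the following lines could be used. This is a minimal sketch: `env_flag` is a hypothetical name and not part of the Pinecone client.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Accept a few common "truthy" spellings, ignoring case and whitespace
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["USE_SERVERLESS"] = "True"
use_serverless = env_flag("USE_SERVERLESS")
print(use_serverless)
```

When the variable is unset, the helper falls back to `default`, which keeps the pod-based path as the default just like the original one-liner.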

## Creating an Index

Now the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [175]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [176]:
from pinecone import ServerlessSpec, PodSpec

if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    spec = PodSpec(environment=environment)

Now we create a new index, named by `index_name` (`mediclassify2` in the cell below). It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [177]:
index_name = "mediclassify2"

In [178]:
import time

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of minilm
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 30}},
 'total_vector_count': 30}

Upsert the data:

In [None]:
# from tqdm.auto import tqdm

# for batch in tqdm(dataset.iter_documents(batch_size=500), total=160):
#     index.upsert(batch)

  0%|          | 0/160 [00:00<?, ?it/s]
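
The `total=160` in the commented-out loop follows from the dataset size: 80,000 rows in batches of 500 give 160 batches. The batching itself can be sketched in plain Python as follows. `batched` is a hypothetical helper written for illustration, and it is not how `dataset.iter_documents` is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


rows = range(80_000)  # stand-in for the 80K documents
num_batches = sum(1 for _ in batched(rows, 500))
print(num_batches)  # 160
```

Each yielded batch would be passed to `index.upsert` in turn; the last batch is simply shorter when the row count is not a multiple of the batch size.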

## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question. Let's begin.

In [179]:
from sentence_transformers import SentenceTransformer,util
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
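
The final `Normalize()` module in the printout means the model returns unit-length embeddings. For unit-length vectors the dot product and the cosine similarity are the same number, which is why an index using either the `dotproduct` or the `cosine` metric ranks these embeddings identically. A small sketch with toy 3-dimensional vectors (the real embeddings have 384 dimensions):

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, normalized the way the model's Normalize() layer would
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

print(math.isclose(dot(a, b), cosine(a, b)))  # True
```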

Now let's query.

In [23]:
query = "What are hospital's phone numbers?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': 'f1a4dcd6-8e8c-40fd-991c-c76ae8319d8a',
              'metadata': {'text': 'Patient’s Responsibilities:\n'
                                   '•Complete Information:  To provide '
                                   'complete information about his/her health, '
                                   'including past \n'
                                   'condition, past illness, hospitalization, '
                                   'medications, or any other matter '
                                   'pertaining to his health,\n'
                                   'etc.Complete Demographic Details:  To '
                                   'provide complete and accurate information '
                                   'regarding\n'
                                   'his identity, address, insurance cover, or '
                                   'any other information. Abide by Hospital '
                                   'Rules:  To \n'
                                

In the returned response `xc` we can see the passages most relevant to our query. We don't have any exact matches, but the returned passages are similar in the topics they cover. We can reformat this response to be a little easier to read:

In [24]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.47: Patient’s Responsibilities:
•Complete Information:  To provide complete information about his/her health, including past 
condition, past illness, hospitalization, medications, or any other matter pertaining to his health,
etc.Complete Demographic Details:  To provide complete and accurate information regarding
his identity, address, insurance cover, or any other information. Abide by Hospital Rules:  To 
abide by hospital rules & responsibilities - no smoking policy, visitor policy, not to bring 
outside food, flowers, arms/weapons to the hospital. Minimum Luggage/Follow Infection 
Control: To keep minimum luggage in the ward for infection prevention and also abide by 
infection control norms as educated by the treating team. Respect & Courtesy for Others:  To 
treat other patients, attendants, visitors, and hospital staff with courtesy. Inform Treating 
Team: Not to take any medication/alternative therapy without the knowledge of the treating
0.41: team.Pay Fees: To pay for the

These are good results. Let's modify the words being used to see if we still surface similar results.

Here we used different terms in our query than those in the returned documents. We substituted **"city"** for **"metropolis"** and **"populated"** for **"number of people"**.

Despite these very different terms and the *lack* of term overlap between the query and the returned documents, we get highly relevant results. This is the power of *semantic search*.

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [None]:
#pc.delete_index(index_name)

In [180]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline



In [181]:
lang_model_name = "./automatic_ticket_classification/checkpoints/roberta-base-squad2"

In [182]:
def extract_text(xc):
    results = []
    for result in xc['matches']:
        results.append(result['metadata']['text'])
    return results

In [183]:
def get_answer(xc, user_input):
    """Extract one answer span per retrieved document for `user_input`."""
    docs = extract_text(xc)
    print(docs)

    # Load the extractive QA model and tokenizer once, before the loop.
    # Named qa_model to avoid confusion with the global SentenceTransformer `model`.
    qa_model = AutoModelForQuestionAnswering.from_pretrained(lang_model_name)
    tokenizer = AutoTokenizer.from_pretrained(lang_model_name)

    # Initialize an empty list to store answers
    answers = []

    # Iterate over each context string and find the answer
    for context in docs:
        # Tokenize the question together with the context
        inputs = tokenizer(user_input, context, return_tensors="pt", truncation=True)

        # Run a single forward pass (no gradients needed for inference)
        with torch.no_grad():
            outputs = qa_model(**inputs)

        # Decode the span between the most likely start and end tokens
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits) + 1
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

        # Skip if answer contains '<s>' (the span starts at the sequence-start token)
        if '<s>' in answer:
            continue

        # Remove the end-of-sequence token. str.strip("</s>") would strip
        # individual characters such as a leading "s", not the token itself.
        answer = answer.replace("</s>", "").strip()

        # Add answer to the list
        answers.append(answer)

    if len(answers)<=0:
        print("No answers")


    return answers

In [184]:
query = "What kind of treatments are available?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)

# for result in xc['matches']:
#     print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

candidate_answers = extract_text(xc)

In [44]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [185]:
embedding = model.encode("what is hospital's address?")
candidate_embeddings = model.encode(candidate_answers)

similarities = util.cos_sim(embedding, candidate_embeddings)
print(similarities)
best_answer_index = similarities.argmax()
best_answer = candidate_answers[best_answer_index]
best_answer

tensor([[0.2034, 0.2506, 0.2641, 0.3274, 0.4053]])


"General Medicine & Infectious Diseases :\n    • Scope of Services: The department provides treatment for a range of infectious \ndiseases, including but not limited to:\n        ◦ Fever\n        ◦ Cough & Cold\n        ◦ Chronic Cough\n        ◦ Jaundice\n        ◦ Diabetes\n        ◦ Hypertension\n        ◦ Ischemic Heart Disease\n        ◦ Stroke (Paralysis)\n        ◦ Thyroid Disorders\n        ◦ Anemia (Low Hemoglobin)\n        ◦ Asthma & COPD\n        ◦ Vertigo\n        ◦ Migraine\n        ◦ Tuberculosis & HIV\n        ◦ Physician Fitness before Surgery\n        ◦ Fitness Certificate\n        ◦ Health Checkup\n        ◦ Vaccination\n        ◦ Central Line or Dialysis Catheter Insertion\n        ◦ Any other Medical Problem\nVelocity Hospital's General Medicine & Infectious Diseases department offers \ncomprehensive care for patients with infectious diseases, providing skilled diagnosis and\ntreatment under the care of experienced physicians."

==================================================================

In [3]:
os.environ['OPENAI_API_KEY'] = ''
os.environ["PINECONE_API_KEY"] = ""
os.environ["PINECONE_ENVIRONMENT"]="gcp-starter"

In [4]:
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

api_key, environment

('c90b562b-0985-43e7-9949-dc502745b36e', 'gcp-starter')

In [7]:
from pinecone import Pinecone as PineconeClient

from transformers import AutoTokenizer, AutoModel
import torch
import os
import time
import re

index_name = "mediclassify3"

# The index is expected to exist already; this cell only connects to it

# configure client
pc = PineconeClient(
    api_key=api_key,
    environment=environment
)
# connect to index
index = pc.Index(index_name)

time.sleep(1)
# view index stats
index.describe_index_stats()




{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 63}},
 'total_vector_count': 63}

In [8]:
pc.list_indexes(), index_name

({'indexes': [{'dimension': 384,
               'host': 'mediclassify2-998l0bo.svc.aped-4627-b74a.pinecone.io',
               'metric': 'cosine',
               'name': 'mediclassify2',
               'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
               'status': {'ready': True, 'state': 'Ready'}},
              {'dimension': 384,
               'host': 'mediclassify1-998l0bo.svc.aped-4627-b74a.pinecone.io',
               'metric': 'cosine',
               'name': 'mediclassify1',
               'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
               'status': {'ready': True, 'state': 'Ready'}},
              {'dimension': 384,
               'host': 'mediclassify3-998l0bo.svc.aped-4627-b74a.pinecone.io',
               'metric': 'cosine',
               'name': 'mediclassify3',
               'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
               'status': {'ready': True, 'state': 'Ready'}},
              {'

In [9]:
def clear_pinecone_index(api_key, index_name, environment='gcp-starter'):
    # Reuses the global `pc` client; `api_key` and `environment` are kept
    # only so that existing calls keep working.

    # list_indexes() returns index descriptions, so compare against their names
    if index_name in pc.list_indexes().names():
        # Delete the index
        pc.delete_index(index_name)
        print(f"Index '{index_name}' deleted.")
        time.sleep(5)


# Usage
clear_pinecone_index(api_key, index_name)

In [211]:
pc.list_indexes()

{'indexes': [{'dimension': 384,
              'host': 'mediclassify2-998l0bo.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'mediclassify2',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 384,
              'host': 'mediclassify1-998l0bo.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'mediclassify1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}}]}

In [240]:
from pinecone import ServerlessSpec

index_name = "mediclassify3"
# Recreate the index; dimension and metric must match the embedding model
pc.create_index(
    index_name,
    dimension=384,
    metric='cosine',
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
print(f"Index '{index_name}' recreated.")

Index 'mediclassify3' recreated.


In [241]:
# connect to index
index = pc.Index(index_name)

time.sleep(1)
# view index stats
index.describe_index_stats()


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [17]:

# Load a pre-trained language model and tokenizer
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
#model_name = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed_text(text):
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

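def is_relevant(result, question):
    # Example of a simple relevance check.
    # Note: the index uses the cosine metric, whose scores never exceed 1,
    # so the threshold must be lowered before this check can pass.
    score_threshold = 1  # Define a threshold score for relevance
    return result['score'] > score_threshold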

def answer_question(question,k=1):
    # Embed the question
    question_embedding = embed_text(question)
    
    # Perform the similarity search in the filtered index
    results = index.query(vector=question_embedding, top_k=k, include_metadata=True)
    #print(results)
    
    # # Check for proper response handling
    # if not results.matches:
    #     return "No results found."

    # # Process the results based on score and filter irrelevant ones
    # relevant_results = [result for result in results.matches if result.score > 1]

    # # Check the number of relevant results
    # if not relevant_results:
    #     return "No sufficiently relevant results found.", results

    # if relevant_results[0].score > 2.2:
    #     # Return the first two results
    #     return "\n".join(process_result(result, question) for result in relevant_results[:1]) , results

    # Print each match's score, then return all processed results joined together
    for res in results.matches:
        print(res.score)
    return "\n\n===================\n".join(process_result(result, question) for result in results.matches)
    
def process_result(result, question):
    text = result['metadata']['text']
    q = question.lower()
    # Keywords are tested in order, so a question containing both "phone" and
    # "emergency" is handled by the "phone" branch.
    if "phone" in q:
        match = re.search(r"Phone:\s*(\d+)", text)
        return match.group(1) if match else ""
    elif "address" in q:
        match = re.search(r"Address :([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)", text)
        return ", ".join(match.groups()) if match else "Address not found"
    elif "emergency" in q:
        match = re.search(r"Emergency:\s*(\d+)", text)
        return match.group(1) if match else ""
    elif "email" in q:
        match = re.search(r"Email:\s*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})", text)
        return match.group(1) if match else ""
    elif "departments" in q:
        match = re.search(r"Departments of hospital:\s*(.*?)(?=\n[A-Z]|\Z)", text, re.DOTALL)
        return match.group(1).strip() if match else ""
    # No keyword matched: fall back to the full passage
    return text
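
`embed_text` averages `last_hidden_state` over every token position. For a single string that is fine, because nothing is padded. If a batch of strings were embedded with `padding=True`, the padded positions would also enter the average. A mask-aware mean avoids that. The following is a minimal sketch on plain Python lists, where `masked_mean_pool` is a hypothetical helper and the padding row is shown as zeros for simplicity:

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    # Guard against an all-zero mask
    return [t / max(count, 1) for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
mask = [1, 1, 0]

plain_mean = [sum(col) / len(tokens) for col in zip(*tokens)]
print(plain_mean)                      # padding drags the average down
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

In PyTorch the same idea is usually written by multiplying the hidden states by the expanded attention mask and dividing by the mask sum.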
    
    
    

In [11]:
def show_results (question,k=1):
    top_results = answer_question(question,k)
    print(f"Question : {question}")
    print("="*100)
    print(top_results)

In [12]:
show_results("who are visiting consultants?")

0.389739305
Question : who are visiting consultants?
Visiting Consultants :
Dr. Milan Modi , Specialization: Pulmonologist, Qualifications: DNB Chest Medicine, 
IDCCM, EDIC
Dr. Anil Patel ,Specialization: Nephrologist and Transplant Physician, Qualifications: 
MD Medicine, DNB Nephrology
Dr. Jaydeep Patel (Hirpara), Specialization: Nephrologist and Transplant 
Physician,Qualifications: MD Medicine, DM Nephrology
Dr. Mehul M. Luhar,Specialization: Psychiatrist,Qualifications: MBBS, DPM
Dr. Pintu Bhakhar,Specialization: Gastroenterology,Qualifications: DNB Medicine, 
DNB Gastro
Dr. Pallav Parikh,Specialization: Gastroenterology,Qualifications: MD Medicine (Gold 
Medalist), DNB Gastro
Dr. Pravin Borsadia,Specialization: Gastroenterology,Qualifications: MD Medicine, 
DNB Gastro
Dr. Yuti Maniya,Specialization: Ear, Nose & Throat Surgeon,Qualifications: MBBS, 
DLO(ENT)
Alisha Virani,Specialization: Audiologist and Speech Therapist,Qualifications: MASLP


In [13]:
show_results("give me emergency phone number")

0.389199108
Question : give me emergency phone number
7435081000
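
The question above contains both "emergency" and "phone". Because `process_result` tests "phone" first, the main phone number is returned rather than the emergency line. One way to make the priority explicit is an ordered table of keywords and patterns, with the more specific keyword first. This is a sketch only: `PATTERNS` and `extract_field` are hypothetical names, and the sample text is taken from the hospital passage shown elsewhere in this notebook.

```python
import re

# Ordered from most to least specific: the first matching keyword wins
PATTERNS = [
    ("emergency", r"Emergency:\s*(\d+)"),
    ("phone", r"Phone:\s*(\d+)"),
    ("email", r"Email:\s*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"),
]


def extract_field(question, text):
    """Return the first field whose keyword appears in the question."""
    q = question.lower()
    for keyword, pattern in PATTERNS:
        if keyword in q:
            match = re.search(pattern, text)
            return match.group(1) if match else ""
    # No keyword matched: fall back to the full passage
    return text


sample = (
    "Phone: 7435081000 \n"
    "Emergency: 7435082000 \n"
    "Email: velocityhospital@gmail.com \n"
)

print(extract_field("give me emergency phone number", sample))  # 7435082000
```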


In [14]:
show_results("provide email")

0.271324128
Question : provide email



In [19]:
show_results("Please provide Hospital's Address Details")

{'matches': [{'id': 'd95ae71b-c092-4ca5-9aab-44c694854a8b',
              'metadata': {'text': 'Velocity Multispeciality Hospital - '
                                   'Surat \n'
                                   'Basic Information\n'
                                   'Address :  3rd Floor, Velocity '
                                   'Multispeciality Hospital, Velocity '
                                   'Business Hub, Near \n'
                                   'Madhuvan Circle, LP Savani Road, Adajan, '
                                   'Surat-395009.\n'
                                   ' \n'
                                   'Phone: 7435081000 \n'
                                   'Emergency: 7435082000 \n'
                                   'Email: velocityhospital@gmail.com \n'
                                   'Departments : \n'
                                   '- Anesthesia\n'
                                   '- Arthroscopy & Sports Medicine\n'
                  

In [292]:
show_results("what are patient's rights?")

0.397088856
Question : what are patient's rights?
Patient’s Rights:
Respect, Privacy & Safety: To be treated with respect, privacy, dignity, and in a safe 
environment.
Respect for Values & Belief: To be treated with respect regarding individuality, personal 
values, beliefs, spiritual and cultural traditions.
Safety: To be free from all forms of physical injury, abuse, and neglect. 
Confidentiality: To expect confidentiality of all records and communications to the 
extent provided by law.
Right to Information: To receive information on diagnosis, plan of care, prognosis, 
medications, progress during hospitalization/treatment, post-discharge care, etc.
Explanation About Care: To be explained about proposed care including diagnostic tests 
performed, associated risks, alternatives, possible complications, any modification 
regarding plan of care, and expected results of treatment. 
Cost: To know the expected cost of treatment.
Informed Consent: To give informed consent before the tran

In [282]:
show_results("what are patient's responsibilities?")

0.415157825
Question : what are patient's responsibilities?
Patient Responsibilities:
1. Provide complete health information.
2. Provide accurate demographic details.
3. Follow hospital rules (no smoking, visitor policy).
4. Keep minimum luggage for infection control.
5. Treat others with respect and courtesy.
6. Inform treating team before taking any medication.
7. Pay fees as per hospital regulations.
8. Respect other patients' urgent medical needs.
9. Follow prescribed treatment plan.
10. Accept measures for privacy and confidentiality.


In [283]:
show_results("who are visiting consultants?")

0.324123263
Question : who are visiting consultants?
Visiting Consultants :
Dr. Milan Modi , Specialization: Pulmonologist, Qualifications: DNB Chest Medicine, 
IDCCM, EDIC
Dr. Anil Patel ,Specialization: Nephrologist and Transplant Physician, Qualifications: 
MD Medicine, DNB Nephrology
Dr. Jaydeep Patel (Hirpara), Specialization: Nephrologist and Transplant 
Physician,Qualifications: MD Medicine, DM Nephrology
Dr. Mehul M. Luhar,Specialization: Psychiatrist,Qualifications: MBBS, DPM
Dr. Pintu Bhakhar,Specialization: Gastroenterology,Qualifications: DNB Medicine, 
DNB Gastro
Dr. Pallav Parikh,Specialization: Gastroenterology,Qualifications: MD Medicine (Gold 
Medalist), DNB Gastro
Dr. Pravin Borsadia,Specialization: Gastroenterology,Qualifications: MD Medicine, 
DNB Gastro
Dr. Yuti Maniya,Specialization: Ear, Nose & Throat Surgeon,Qualifications: MBBS, 
DLO(ENT)
Alisha Virani,Specialization: Audiologist and Speech Therapist,Qualifications: MASLP


In [1]:
show_results("what kind of facilities are provided in the hospital? ")

NameError: name 'show_results' is not defined

In [262]:
show_results("What is included in full body health checkup? ")

Question : What is included in full body health checkup? 
Health Checkup:
Velocity Hospital offers a comprehensive Full Body Health Checkup program in Surat. 
The program is designed to provide a detailed assessment of overall health, detect 
potential health issues early, and empower individuals to make informed decisions about
their lifestyle and preventive healthcare. Here are some key details about the health 
checkup:
    • General Physical Examination: A thorough examination to assess overall physical 
health, including measurements of blood pressure, heart rate, and body mass index 
(BMI).
    • Blood Tests: Comprehensive blood tests to analyze cholesterol levels, blood sugar, 
liver and kidney function, thyroid hormones, and complete blood count (CBC) to 
identify underlying medical conditions.
    • Imaging Studies: Advanced imaging techniques such as X-rays, ultrasound, and 
MRI scans to evaluate the condition of organs, bones, and soft tissues.


In [293]:
show_results("what is the treatment for digestive issues?",k=3)

0.21406661
0.207396626
0.195580885
Question : what is the treatment for digestive issues?
Specialists :
- Dr. Apoorva Shah: Pediatrics and Neonatology
- Dr. Binjul Solanki: Anesthesia
- Dr. Hardik C. Shah: Orthopedic Surgery
- Dr. Hardik G. Sheth: Orthopedic and Joint Replacement Surgery
- Dr. Hardik R. Shah: Pediatrics and Neonatology
- Dr. Jignesh Patel: Hepatobiliary and Gastrointestinal Surgery
- Dr. Ketan Rupala: Urosurgery
- Dr. Maulik Patel: Neuro and Spine Surgery
- Dr. Mitesh Patel: Orthopedic Surgery
- Dr. Prayag Makwana: Neurology
- Dr. Rahul Shah: Oncosurgery and Oncology
- Dr. Rinkan Virani: General and Laparoscopic Surgery
- Dr. Rohan Jariwala: General Medicine and Infectious Disease
- Dr. Ruchi Desai Thakor: Obstetrics, Gynecology, and Infertility
- Dr. Rushin B. Thakor: Plastic and Reconstructive Surgery
- Dr. Vitrag Shah: General Medicine and Critical Care
- Dr. Zankhna Shah: Obstetrics, Gynecology, and Infertility

approved by the US FDA.
• ICU & NICU equipped with du

In [294]:
show_results("does the hospital has general surgery department?")

0.444913626
Question : does the hospital has general surgery department?
List of Departments : 
- Anesthesia
- Arthroscopy & Sports Medicine
- Critical Care & Emergency Medicine
- Hepatobiliary & Gastrointestinal Surgery
- General & Laparoscopic Surgery
- General Medicine & Infectious Diseases
- Maxillofacial Surgery
- Neurology
- Neurosurgery
- Nutrition & Dietetics
- Obstetrics & Gynecology & Infertility
- Oncosurgery
- Orthopedics & Joint Replacement
- Pediatrics & Neonatology
- Physiotherapy
- Plastic & Reconstructive Surgery
- Urosurgery


In [295]:
show_results("provide me details on general medicine department")

0.48629719
Question : provide me details on general medicine department
General Medicine & Infectious Diseases :
    • Scope of Services: The department provides treatment for a range of infectious 
diseases, including but not limited to:
        ◦ Fever
        ◦ Cough & Cold
        ◦ Chronic Cough
        ◦ Jaundice
        ◦ Diabetes
        ◦ Hypertension
        ◦ Ischemic Heart Disease
        ◦ Stroke (Paralysis)
        ◦ Thyroid Disorders
        ◦ Anemia (Low Hemoglobin)
        ◦ Asthma & COPD
        ◦ Vertigo
        ◦ Migraine
        ◦ Tuberculosis & HIV
        ◦ Physician Fitness before Surgery
        ◦ Fitness Certificate
        ◦ Health Checkup
        ◦ Vaccination
        ◦ Central Line or Dialysis Catheter Insertion
        ◦ Any other Medical Problem
Velocity Hospital's General Medicine & Infectious Diseases department offers 
comprehensive care for patients with infectious diseases, providing skilled diagnosis and
treatment under the care of experienced physi

In [301]:
show_results("give me details on Orthopedics",k=1)

0.40055421
Question : give me details on Orthopedics
Orthopedic and joint replacement services:
Velocity Hospital provides comprehensive orthopedic care, including treatment for
knee, hip, and joint issues. Specialties include sports medicine, pediatric 
orthopedics, arthritis diagnosis, and pain management. Advanced treatments such 
as total knee and hip replacement, arthroscopic surgery, and spine surgery are 
offered. The hospital's digital orthopedic operating suite ensures precise implant 
sizing and positioning. Led by experts like Dr. Mitesh Patel and Dr. Hardik G. 
Sheth, the department offers treatments for various orthopedic conditions, ensuring
early mobilization and optimal outcomes for patients.


In [2]:
import pandas as pd

# Provide the column names explicitly
column_names = ["Question", "Answer"]

# Read the CSV file with the specified column names
file = r"D:\Yeshiva\Spring24\NLP\NLP-Project\dataset\documents\tickets.csv"
cl = pd.read_csv(file, header=None, names=column_names)

# Now you can access your DataFrame with appropriate column names
cl.head()


| | Question | Answer |
|---|---|---|
| 0 | What specialized services does the Arthroscopy... | Arthroscopy Sports Medicine |
| 1 | How does the hospital describe its approach to... | Arthroscopy Sports Medicine |
| 2 | Can you list some of the specific procedures a... | Arthroscopy Sports Medicine |
| 3 | Who are the leading specialists mentioned in t... | Arthroscopy Sports Medicine |
| 4 | What are some of the features and facilities a... | Arthroscopy Sports Medicine |


In [3]:
cl['Answer'].unique()

array(['Arthroscopy Sports Medicine', 'HR Department', 'Criticial Care',
       'Hepatobiliary & Gastrointestinal Surgery',
       'General and Laparoscopic Surgery',
       'General Medicine and Infectious Diseases', 'Health Checkup',
       'Maxillofacial Surgery', 'Neurology', 'Neurosurgery',
       'Nutrition & Dietetics', 'Obstetrics and Gynecology',
       'Oncosurgery', 'Orthopaedics',
       'Patient Rights and Responsibilities', 'Paediatrics Neonatology',
       'Physiotherapy', 'Plastic Reconstructive Surgery',
       'Scope of Services', 'Urosurgery', 'Visiting Consultants'],
      dtype=object)
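
With 21 department labels, some classes will probably have fewer questions than others. The SVM further below is created with `class_weight='balanced'`, which in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up labels: three questions for one department, one for another
labels = ["Neurology", "Neurology", "Neurology", "Urosurgery"]
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it cost more during training.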

In [None]:
import streamlit as st
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
import joblib

from utils.utils import *


if 'cleaned_data' not in st.session_state:
    st.session_state['cleaned_data'] =''
if 'sentences_train' not in st.session_state:
    st.session_state['sentences_train'] =''
if 'sentences_test' not in st.session_state:
    st.session_state['sentences_test'] =''
if 'labels_train' not in st.session_state:
    st.session_state['labels_train'] =''
if 'labels_test' not in st.session_state:
    st.session_state['labels_test'] =''
if 'svm_classifier' not in st.session_state:
    st.session_state['svm_classifier'] =''

 
st.title("Let's build our Model...")
 
# Create tabs
tab_titles = ['Data Preprocessing', 'Model Training', 'Model Evaluation',"Save Model"]
tabs = st.tabs(tab_titles)

# Adding content to each tab

#Data Preprocessing TAB...
with tabs[0]:
    st.header('Data Preprocessing')
    st.write('Here we preprocess the data...')

    # Capture the CSV file
    data = st.file_uploader("Upload CSV file",type="csv")

    button = st.button("Load data",key="data")

    if button:
        with st.spinner('Wait for it...'):
            our_data=read_data(data)
            embeddings=get_embeddings()
            st.session_state['cleaned_data'] = create_embeddings_model(our_data,embeddings)
        st.success('Done!')


#Model Training TAB
with tabs[1]:
    st.header('Model Training')
    st.write('Here we train the model...')
    button = st.button("Train model",key="model")
    
    if button:
        with st.spinner('Wait for it...'):
            (st.session_state['sentences_train'],
             st.session_state['sentences_test'],
             st.session_state['labels_train'],
             st.session_state['labels_test']) = split_train_test__data(st.session_state['cleaned_data'])

            # Initialize a support vector machine. class_weight='balanced' weights each
            # class inversely to its frequency in the training labels, so departments
            # with few examples are not dominated by the larger ones
            st.session_state['svm_classifier'] = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

            # fit the support vector machine
            st.session_state['svm_classifier'].fit(st.session_state['sentences_train'], st.session_state['labels_train'])
        st.success('Done!')

#Model Evaluation TAB
with tabs[2]:
    st.header('Model Evaluation')
    st.write('Here we evaluate the model...')
    button = st.button("Evaluate model",key="Evaluation")

    if button:
        with st.spinner('Wait for it...'):
            y_pred = st.session_state['svm_classifier'].predict(st.session_state['sentences_test'])
            accuracy_score_val = accuracy_score(st.session_state['labels_test'], y_pred)
            precision_val = precision_score(st.session_state['labels_test'], y_pred, average='weighted')
            recall_val = recall_score(st.session_state['labels_test'], y_pred, average='weighted')
            f1_val = f1_score(st.session_state['labels_test'], y_pred, average='weighted')

            st.success(f"Validation accuracy of svm_classifier : {100*accuracy_score_val:.2f}%")
            st.write(f"Precision: {precision_val}")
            st.write(f"Recall: {recall_val}")
            st.write(f"F1 Score: {f1_val}")

            # st.write("Confusion Matrix:")
            # cm = confusion_matrix(st.session_state['labels_test'], y_pred)
            # st.write(cm)


            st.write("A sample run:")


            #text="lack of communication regarding policy updates salary, can we please look into it?"
            text="Rude driver with scary driving"
            st.write("***Our issue*** : "+text)

            # Convert our text to its numerical (embedding) representation
            embeddings= get_embeddings()
            query_result = embeddings.embed_query(text)

            #Sample prediction using our trained model
            result= st.session_state['svm_classifier'].predict([query_result])
            st.write("***Department it belongs to*** : "+result[0])
            

        st.success('Done!')

#Save model TAB
with tabs[3]:
    st.header('Save model')
    st.write('Here we save the model...')

    button = st.button("Save model",key="save")
    if button:

        with st.spinner('Wait for it...'):
            joblib.dump(st.session_state['svm_classifier'], 'modelsvm.pk1')
        st.success('Done!')
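
The training tab above passes `class_weight='balanced'` to `SVC`. scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that arithmetic in plain Python so the effect is easy to see. The label list is hypothetical and chosen only to be imbalanced; it is not taken from the uploaded CSV.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, deliberately imbalanced labels (illustration only)
labels = ["Neurology"] * 6 + ["Physiotherapy"] * 3 + ["Health Checkup"] * 1

weights = balanced_class_weights(labels)
for department, weight in sorted(weights.items()):
    print(f"{department}: {weight:.2f}")
```

Rare departments receive larger weights, so a misclassified example from a small class costs more during training than one from a large class.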



In [None]:
import streamlit as st
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib
import xgboost as xgb

from utils.utils import *


if 'cleaned_data' not in st.session_state:
    st.session_state['cleaned_data'] =''
if 'sentences_train' not in st.session_state:
    st.session_state['sentences_train'] =''
if 'sentences_test' not in st.session_state:
    st.session_state['sentences_test'] =''
if 'labels_train' not in st.session_state:
    st.session_state['labels_train'] =''
if 'labels_test' not in st.session_state:
    st.session_state['labels_test'] =''
if 'xgb_classifier' not in st.session_state:
    st.session_state['xgb_classifier'] =''
if 'label_encoder' not in st.session_state:
    st.session_state['label_encoder'] = LabelEncoder()
 
st.title("Let's build our Model...")
 
# Create tabs
tab_titles = ['Data Preprocessing', 'Model Training', 'Model Evaluation', "Save Model"]
tabs = st.tabs(tab_titles)

# Adding content to each tab

# Data Preprocessing TAB...
with tabs[0]:
    st.header('Data Preprocessing')
    st.write('Here we preprocess the data...')

    # Capture the CSV file
    data = st.file_uploader("Upload CSV file", type="csv")

    button = st.button("Load data", key="data")

    if button:
        with st.spinner('Wait for it...'):
            our_data = read_data(data)
            embeddings = get_embeddings()
            st.session_state['cleaned_data'] = create_embeddings_model(our_data, embeddings)
        st.success('Done!')


# Model Training TAB
with tabs[1]:
    st.header('Model Training')
    st.write('Here we train the model...')
    button = st.button("Train model", key="model")
    
    if button:
        with st.spinner('Wait for it...'):
            st.session_state['sentences_train'], st.session_state['sentences_test'], \
            st.session_state['labels_train'], st.session_state['labels_test'] = \
                split_train_test__data(st.session_state['cleaned_data'])

            # Encode string labels to integers
            st.session_state['labels_train_encoded'] = st.session_state['label_encoder'].fit_transform(st.session_state['labels_train'])
            
            # Initialize XGBoost classifier
            st.session_state['xgb_classifier'] = xgb.XGBClassifier()

            # fit the XGBoost classifier
            st.session_state['xgb_classifier'].fit(st.session_state['sentences_train'], 
                                                    st.session_state['labels_train_encoded'])
        st.success('Done!')

# Model Evaluation TAB
with tabs[2]:
    st.header('Model Evaluation')
    st.write('Here we evaluate the model...')
    button = st.button("Evaluate model", key="Evaluation")

    if button:
        with st.spinner('Wait for it...'):
            y_pred = st.session_state['xgb_classifier'].predict(st.session_state['sentences_test'])
            
            # Decode integer labels back to original string labels
            decoded_labels = st.session_state['label_encoder'].inverse_transform(y_pred)
            
            accuracy_score_val = accuracy_score(st.session_state['labels_test'], decoded_labels)
            precision_val = precision_score(st.session_state['labels_test'], decoded_labels, average='weighted')
            recall_val = recall_score(st.session_state['labels_test'], decoded_labels, average='weighted')
            f1_val = f1_score(st.session_state['labels_test'], decoded_labels, average='weighted')

            st.success(f"Validation accuracy of xgb_classifier : {100*accuracy_score_val:.2f}%")
            st.write(f"Precision: {precision_val}")
            st.write(f"Recall: {recall_val}")
            st.write(f"F1 Score: {f1_val}")

            st.write("A sample run:")

            text = "Rude driver with scary driving"
            st.write("***Our issue*** : " + text)

            embeddings = get_embeddings()
            query_result = embeddings.embed_query(text)

            result = st.session_state['xgb_classifier'].predict([query_result])

            # Decode integer label to string label
            department = st.session_state['label_encoder'].inverse_transform(result)[0]
            st.write("***Department it belongs to*** : " + department)

        st.success('Done!')


# Save model TAB
with tabs[3]:
    st.header('Save model')
    st.write('Here we save the model...')

    button = st.button("Save model", key="save")
    if button:
        with st.spinner('Wait for it...'):
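            # Note: the saved model predicts integer class ids; the fitted label_encoder
            # is still needed to map them back to department names
            joblib.dump(st.session_state['xgb_classifier'], 'modelxgb.pk1')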
        st.success('Done!')
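
The XGBoost cell above trains on integer-encoded labels and decodes predictions back to department names with `LabelEncoder`. As a rough illustration of that round trip, the sketch below builds the same kind of mapping in plain Python: sorted unique labels become consecutive integer ids. The helper name `fit_label_mapping` and the sample labels are hypothetical and stand in for the real encoder and data.

```python
def fit_label_mapping(labels):
    """Return (classes, to_id): sorted unique labels and their integer ids."""
    classes = sorted(set(labels))
    to_id = {label: idx for idx, label in enumerate(classes)}
    return classes, to_id


# Hypothetical training labels (illustration only)
train_labels = ["Neurology", "Health Checkup", "Neurology", "Orthopaedics"]

classes, to_id = fit_label_mapping(train_labels)

# Encode: string label -> integer id (what the classifier is trained on)
encoded = [to_id[label] for label in train_labels]

# Decode: integer id -> string label (what is shown to the user)
decoded = [classes[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Because decoding depends on the mapping learned at training time, any script that later loads the saved model also needs the same fitted encoder to turn predictions back into department names.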


In [None]:
import streamlit as st
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib

from utils.utils import *

if 'cleaned_data' not in st.session_state:
    st.session_state['cleaned_data'] =''
if 'sentences_train' not in st.session_state:
    st.session_state['sentences_train'] =''
if 'sentences_test' not in st.session_state:
    st.session_state['sentences_test'] =''
if 'labels_train' not in st.session_state:
    st.session_state['labels_train'] =''
if 'labels_test' not in st.session_state:
    st.session_state['labels_test'] =''
if 'rf_classifier' not in st.session_state:
    st.session_state['rf_classifier'] =''

 
st.title("Let's build our Model...")
 
# Create tabs
tab_titles = ['Data Preprocessing', 'Model Training', 'Model Evaluation', "Save Model"]
tabs = st.tabs(tab_titles)

# Adding content to each tab

# Data Preprocessing TAB...
with tabs[0]:
    st.header('Data Preprocessing')
    st.write('Here we preprocess the data...')

    # Capture the CSV file
    data = st.file_uploader("Upload CSV file", type="csv")

    button = st.button("Load data", key="data")

    if button:
        with st.spinner('Wait for it...'):
            our_data = read_data(data)
            embeddings = get_embeddings()
            st.session_state['cleaned_data'] = create_embeddings_model(our_data, embeddings)
        st.success('Done!')


# Model Training TAB
with tabs[1]:
    st.header('Model Training')
    st.write('Here we train the model...')
    button = st.button("Train model", key="model")
    
    if button:
        with st.spinner('Wait for it...'):
            st.session_state['sentences_train'], st.session_state['sentences_test'], \
            st.session_state['labels_train'], st.session_state['labels_test'] = \
                split_train_test__data(st.session_state['cleaned_data'])

            # Initialize Random Forest Classifier
            st.session_state['rf_classifier'] = make_pipeline(StandardScaler(), RandomForestClassifier())

            # fit the Random Forest Classifier
            st.session_state['rf_classifier'].fit(st.session_state['sentences_train'], 
                                                    st.session_state['labels_train'])
        st.success('Done!')

# Model Evaluation TAB
with tabs[2]:
    st.header('Model Evaluation')
    st.write('Here we evaluate the model...')
    button = st.button("Evaluate model", key="Evaluation")

    if button:
        with st.spinner('Wait for it...'):
            y_pred = st.session_state['rf_classifier'].predict(st.session_state['sentences_test'])
            accuracy_score_val = accuracy_score(st.session_state['labels_test'], y_pred)
            precision_val = precision_score(st.session_state['labels_test'], y_pred, average='weighted')
            recall_val = recall_score(st.session_state['labels_test'], y_pred, average='weighted')
            f1_val = f1_score(st.session_state['labels_test'], y_pred, average='weighted')

            st.success(f"Validation accuracy of rf_classifier : {100*accuracy_score_val:.2f}%")
            st.write(f"Precision: {precision_val}")
            st.write(f"Recall: {recall_val}")
            st.write(f"F1 Score: {f1_val}")

            st.write("A sample run:")

            text = "Rude driver with scary driving"
            st.write("***Our issue*** : " + text)

            embeddings = get_embeddings()
            query_result = embeddings.embed_query(text)

            result = st.session_state['rf_classifier'].predict([query_result])
            st.write("***Department it belongs to*** : " + result[0])

        st.success('Done!')

# Save model TAB
with tabs[3]:
    st.header('Save model')
    st.write('Here we save the model...')

    button = st.button("Save model", key="save")
    if button:
        with st.spinner('Wait for it...'):
            joblib.dump(st.session_state['rf_classifier'], 'modelrf.pk1')
        st.success('Done!')
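
All three evaluation tabs report precision, recall and F1 with `average='weighted'`, which averages the per-class scores weighted by each class's support (its number of true examples). The plain-Python sketch below shows this for recall, on hypothetical labels and predictions. One consequence worth knowing: for single-label classification, weighted recall works out to the same value as accuracy, so the "Recall" line in the app should match the accuracy figure.

```python
from collections import Counter


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        # Correct predictions among the examples whose true label is cls
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        score += (n_true / total) * (hits / n_true)
    return score


# Hypothetical ground truth and predictions (illustration only)
y_true = ["Neurology", "Neurology", "Neurology", "Physiotherapy", "Health Checkup"]
y_pred = ["Neurology", "Neurology", "Physiotherapy", "Physiotherapy", "Neurology"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = weighted_recall(y_true, y_pred)

print(f"accuracy:        {accuracy:.2f}")
print(f"weighted recall: {recall:.2f}")
```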


In [None]:
Can you describe the features of the ICU and NICU at Velocity Hospital?,HR Department

What specific services are offered under General & Laparoscopic Surgery at Velocity Hospital?,General and Laparoscopic Surgery

Could you explain the pulmonary function tests offered during the health checkup?,Health Checkup

What sets Velocity Hospital's neurosurgery department apart from other hospitals in terms of expertise and technology?,Neurosurgery


---