# SympAI: RAG-Enabled Chatbot (Dev Notebook)
This notebook uses a simple RAG pipeline to inject medical context into Groq-hosted LLaMA responses for a symptom-based assistant.

In [10]:
!pip install groq sentence-transformers faiss-cpu pandas kaggle


In [11]:
import os
from groq import Groq
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

In [12]:
# Read the Groq API key from the GROQ_API_KEY environment variable; never hardcode it in the notebook
client = Groq(api_key=os.environ["GROQ_API_KEY"])
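
Credentials such as the Groq key are safer read from the environment than written into a cell. The sketch below shows one way to fail early with a readable message when a variable is missing. The helper name `get_required_env` and the variable `SYMPAI_DEMO_KEY` are hypothetical and used only for illustration.

```python
import os


def get_required_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Demo with a placeholder value; real keys should be exported in the shell instead.
os.environ["SYMPAI_DEMO_KEY"] = "placeholder-value"
print(get_required_env("SYMPAI_DEMO_KEY"))
```

With a helper like this, the client could be created as `Groq(api_key=get_required_env("GROQ_API_KEY"))`.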

## Download and Prep the Clinical Text Dataset


In [13]:
# Configure Kaggle API credentials: set KAGGLE_USERNAME and KAGGLE_KEY in your shell,
# or place kaggle.json in ~/.kaggle. Do not hardcode them in the notebook.
for var in ('KAGGLE_USERNAME', 'KAGGLE_KEY'):
    if var not in os.environ:
        print(f"{var} is not set; the Kaggle CLI will need ~/.kaggle/kaggle.json instead")

In [14]:
# Download the dataset
!kaggle datasets download -d tboyle10/medicaltranscriptions

# Extract with Python's zipfile module, since the `unzip` command is not available on Windows
import zipfile
with zipfile.ZipFile('medicaltranscriptions.zip') as zf:
    zf.extractall()

Dataset URL: https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions
License(s): CC0-1.0
medicaltranscriptions.zip: Skipping, found more recently modified local copy (use --force to force download)


In [15]:
# Load the clinical text dataset
df = pd.read_csv('mtsamples.csv')
print(f"Dataset loaded with {len(df)} records")
df.head()

In [9]:
# Explore the dataset structure
print("Dataset columns:", df.columns.tolist())
print("\nMedical specialties distribution:")
print(df['medical_specialty'].value_counts().head(10))

# Check for missing values
print("\nMissing values per column:")
print(df.isna().sum())

Dataset columns: ['Unnamed: 0', 'description', 'medical_specialty', 'sample_name', 'transcription', 'keywords']

Medical specialties distribution:
medical_specialty
Surgery                          1103
Consult - History and Phy.        516
Cardiovascular / Pulmonary        372
Orthopedic                        355
Radiology                         273
General Medicine                  259
Gastroenterology                  230
Neurology                         223
SOAP / Chart / Progress Notes     166
Obstetrics / Gynecology           160
Name: count, dtype: int64

Missing values per column:
Unnamed: 0              0
description             0
medical_specialty       0
sample_name             0
transcription          33
keywords             1068
dtype: int64
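
The counts above show 33 rows with no transcription. In the fact-building cell that follows, those rows fall through to the generic template because no keywords can match an empty string. One option is to drop them before building facts. The sketch below uses a small toy DataFrame rather than the real dataset, and treating whitespace-only text as missing is an assumption.

```python
import pandas as pd

# Toy stand-in for the dataset: one usable row, one missing, one whitespace-only.
toy = pd.DataFrame({
    "description": ["Chest pain workup", "Knee injury follow-up", "Allergy consult"],
    "transcription": ["ASSESSMENT: stable angina.", None, "   "],
})

# Normalize missing values to empty strings, strip whitespace, keep non-empty rows.
text = toy["transcription"].fillna("").str.strip()
cleaned = toy[text != ""].reset_index(drop=True)

print(f"{len(toy)} rows -> {len(cleaned)} rows with usable transcriptions")
```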


In [10]:
# Create generalized, de-identified facts from the dataset for safe RAG use

def create_generalized_fact(row):
    """Turn one dataset row into a templated, generalized statement."""
    # Missing values are NaN (a float), so fall back to ""; strip stray leading/trailing spaces
    specialty = row['medical_specialty'].strip() if isinstance(row['medical_specialty'], str) else ""
    symptom = row['description'].strip() if isinstance(row['description'], str) else ""
    text = row['transcription'] if isinstance(row['transcription'], str) else ""

    # Flatten line breaks and lowercase for keyword matching
    text_clean = text.lower().replace("\n", " ").strip()

    # Pick a template based on keywords found in the transcription
    if any(word in text_clean for word in ("diagnosis", "assessment", "impression")):
        fact = f"In {specialty}, symptoms like '{symptom}' may be evaluated for diagnostic conditions."
    elif any(word in text_clean for word in ("symptom", "pain", "fever")):
        fact = f"Patients in {specialty} presenting with '{symptom}' may be assessed for related issues."
    else:
        fact = f"In {specialty}, patients with health concerns described as '{symptom}' may receive further evaluation."

    return fact

# Create a corpus of generalized, safe medical facts
df['medical_fact'] = df.apply(create_generalized_fact, axis=1)

# Length filter kept as a safeguard (every template exceeds 30 characters, so no rows are dropped here)
df = df[df['medical_fact'].str.len() > 30].reset_index(drop=True)

# Display a sample safe fact
print("Sample generalized medical fact:")
print(df['medical_fact'].iloc[0])

# Extract clean corpus for embedding
medical_corpus = df['medical_fact'].tolist()
print(f"\nPrepared {len(medical_corpus)} generalized facts for embedding")

Sample generalized medical fact:
In Allergy / Immunology, symptoms like 'A 23-year-old white female presents with complaint of allergies.' may be evaluated for diagnostic conditions.

Prepared 4999 generalized facts for embedding
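
The sample fact still carries details from the source description, such as the patient's age. A minimal sketch of one extra scrubbing step is shown below. The pattern and the helper name `scrub_description` are hypothetical, and a real de-identification pass would need far broader coverage than a single regex.

```python
import re

# Hypothetical pattern: matches ages written like "23-year-old" or "23 year old".
AGE_PATTERN = re.compile(r"\b\d{1,3}[- ]year[- ]old\b", re.IGNORECASE)


def scrub_description(text: str) -> str:
    """Strip surrounding whitespace and mask explicit ages."""
    return AGE_PATTERN.sub("[age]", text.strip())


print(scrub_description(" A 23-year-old white female presents with complaint of allergies."))
```

A helper like this could be applied to `row['description']` before it is inserted into the template.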


In [11]:
# Create embeddings for the medical corpus
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# Process in batches to handle memory constraints
batch_size = 64
embeddings = []

print("Creating embeddings (this may take a while)...")
for i in range(0, len(medical_corpus), batch_size):
    batch = medical_corpus[i:i+batch_size]
    batch_embeddings = model.encode(batch)
    embeddings.append(batch_embeddings)
    print(f"Processed {min(i+batch_size, len(medical_corpus))}/{len(medical_corpus)} documents")

# Combine all batches
embeddings = np.vstack(embeddings)
print(f"Created embeddings with shape: {embeddings.shape}")

# Build FAISS index for fast similarity search
print("Building FAISS index...")
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))
print("FAISS index built successfully")

Loading sentence transformer model...
Creating embeddings (this may take a while)...
Processed 64/4999 documents
Processed 128/4999 documents
Processed 192/4999 documents
...
Processed 2048/4999 documents
[output truncated]
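
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. To make that concrete, here is a small NumPy sketch of the same computation on toy vectors. The function name `flat_l2_search` is hypothetical, and this is an illustration of the idea rather than how FAISS is implemented internally.

```python
import numpy as np


def flat_l2_search(corpus_vecs: np.ndarray, query_vecs: np.ndarray, k: int):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_corpus).
    diffs = query_vecs[:, None, :] - corpus_vecs[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest corpus vectors per query, nearest first.
    indices = np.argsort(sq_dists, axis=1)[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")
distances, indices = flat_l2_search(corpus, query, k=2)
print(indices[0], distances[0])
```

The return order (distances first, then indices) mirrors what `index.search` returns in the retrieval cell.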

In [12]:
def retrieve_context(query, k=3):
    """Retrieve relevant medical information based on the query."""
    query_embedding = model.encode([query])
    distances, indices = index.search(np.array(query_embedding).astype('float32'), k)

    # Get the retrieved documents
    retrieved_docs = [medical_corpus[i] for i in indices[0]]

    # For debugging
    print(f"Retrieved {len(retrieved_docs)} documents with distances: {distances[0]}")

    return retrieved_docs
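
The distances printed by `retrieve_context` are squared L2 values, which are harder to read than cosine similarities. If the embeddings are unit-normalized (an assumption here; it is worth verifying with `np.linalg.norm` on the `embeddings` array), the two are related by `cosine = 1 - d / 2`. A minimal sketch:

```python
import numpy as np


def squared_l2_to_cosine(sq_dist):
    """Convert squared L2 distances between unit vectors to cosine similarity."""
    return 1.0 - np.asarray(sq_dist, dtype=float) / 2.0


# Check the identity on two random unit vectors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(squared_l2_to_cosine(np.sum((a - b) ** 2)), a @ b)

# Distances reported in this notebook's test run.
print(squared_l2_to_cosine([0.48530215, 0.50388, 0.5220967]))
```

Under that assumption, the three distances from the test run correspond to cosine similarities of roughly 0.76, 0.75 and 0.74.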

In [13]:
def ask_sympai(user_input):
    # Retrieve relevant medical context
    context = retrieve_context(user_input)

    # Format the context block with clear separation between documents
    context_block = "\n\nGENERAL CLINICAL BACKGROUND:\n" + "\n\n---\n\n".join(context)

    # Safe, strict system prompt
    system_prompt = (
        "You are SympAI, a virtual symptom assistant. "
        "You ONLY respond to symptom-related health questions. "
        "You must not impersonate any patient or assume personal health data. "
        "Use the clinical background only to support your responses in general terms. "
        "Always remind the user to consult a real healthcare provider for diagnosis or treatment. "
        "Politely decline requests for non-medical or entertainment topics."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input + context_block}
    ]

    # Get response from Groq
    response = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=messages,
        temperature=0.4,
        max_tokens=800
    )

    return response.choices[0].message.content
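
Prompt assembly can be pulled out of `ask_sympai` into a pure function, which makes it testable without a network call. The sketch below is one possible shape. The function name `build_messages` and the `QUESTION:` label are assumptions, not part of the notebook.

```python
def build_messages(system_prompt, user_input, context_docs):
    """Assemble chat messages with retrieved context kept apart from the question."""
    context_block = "\n\n---\n\n".join(context_docs)
    user_content = (
        f"QUESTION:\n{user_input}\n\n"
        f"GENERAL CLINICAL BACKGROUND:\n{context_block}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "You are SympAI.",
    "What causes a sore throat?",
    ["Fact A", "Fact B"],
)
print(messages[1]["content"])
```

Labelling the question and the background separately keeps retrieved text from running straight into the end of the user's sentence.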

In [14]:
# Test the enhanced SympAI with diverse medical questions
test_questions = [
    "I've been experiencing chest pain and shortness of breath. What could this indicate?",
    "What are the symptoms of urinary tract infections?",
    "My child has a sore throat with white patches. Could it be strep throat?",
    "I'm having joint pain and fatigue. What conditions might this suggest?"
]

for question in test_questions:
    print(f"\n\nQUESTION: {question}")
    print("\nANSWER:")
    print(ask_sympai(question))
    print("\n" + "-"*80)



QUESTION: I've been experiencing chest pain and shortness of breath. What could this indicate?

ANSWER:
Retrieved 3 documents with distances: [0.48530215 0.50388    0.5220967 ]
I'm not a doctor, but I can try to provide some general information that might be helpful. However, please keep in mind that I'm not a substitute for a medical professional, and you should consult a healthcare provider for a proper evaluation and diagnosis.

Chest pain and shortness of breath can be symptoms of various conditions, including cardiovascular and pulmonary disorders. Based on the information you provided, it's possible that your symptoms may be related to a condition that affects the heart or lungs.

Some potential causes of chest pain and shortness of breath could include:

1. Coronary artery disease: This is a condition where the blood vessels that supply the heart become narrowed or blocked, leading to chest pain, shortness of breath, and other symptoms.
2. Angina: This is a type of chest pain 

## Save the Index and Corpus for Future Use

This section saves the FAISS index and corpus data so you don't need to rebuild the embeddings every time.

In [15]:
# Save the FAISS index and corpus for future use
import pickle

# Save the FAISS index
faiss.write_index(index, 'medical_faiss_index.bin')

# Save the corpus and other necessary data
with open('medical_corpus_data.pkl', 'wb') as f:
    pickle.dump({
        'medical_corpus': medical_corpus,
        'df_info': {
            'columns': df.columns.tolist(),
            'shape': df.shape
        }
    }, f)

print("Saved FAISS index and corpus data for future use")

Saved FAISS index and corpus data for future use
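
Because the corpus is a plain list of strings, JSON is a possible alternative to pickle. It is human-readable and, unlike unpickling, loading it cannot execute arbitrary code. The sketch below round-trips a toy corpus through a temporary file. The helper names and the `.json` filename are hypothetical.

```python
import json
import os
import tempfile


def save_corpus(corpus, path):
    """Write the corpus (a list of strings) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"medical_corpus": corpus}, f, ensure_ascii=False)


def load_corpus(path, expected_count=None):
    """Read the corpus back and optionally check it matches the index size."""
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)["medical_corpus"]
    if expected_count is not None and len(corpus) != expected_count:
        raise ValueError(f"Expected {expected_count} documents, found {len(corpus)}")
    return corpus


demo = ["In Cardiology, symptoms may be evaluated.", "In Neurology, patients may be assessed."]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "medical_corpus_data.json")
    save_corpus(demo, path)
    restored = load_corpus(path, expected_count=len(demo))

print(restored == demo)
```

Passing `index.ntotal` as `expected_count` would catch an index and corpus that have drifted out of sync.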


## Loading the Saved Index and Corpus (For Future Sessions)

Use this code in future sessions to load the saved FAISS index and corpus instead of rebuilding the embeddings.

In [16]:
# Load the saved index and corpus in a future session
import faiss
import pickle

# Load the FAISS index
index = faiss.read_index('medical_faiss_index.bin')

# Load the corpus data
with open('medical_corpus_data.pkl', 'rb') as f:
    data = pickle.load(f)
    medical_corpus = data['medical_corpus']

# Note: the SentenceTransformer model must also be re-created before calling retrieve_context
print(f"Loaded index with {index.ntotal} vectors and {len(medical_corpus)} documents")