# 🏥 ShifaMind - Google Colab Setup

**One-click setup for running ShifaMind on Google Colab**

This notebook will:
1. Mount your Google Drive (where your data is stored)
2. Clone the ShifaMind repository
3. Install only the dependencies Colab doesn't have
4. Configure paths to your Google Drive data
5. Run the complete pipeline

---

## Prerequisites

Your Google Drive should have data organized like this:

```
My Drive/
└── ShifaMind/
    └── 01_Raw_Datasets/
        └── Extracted/
            ├── umls-2025AA-metathesaurus-full/
            │   └── 2025AA/META/
            │       ├── MRCONSO.RRF
            │       ├── MRDEF.RRF
            │       └── MRSTY.RRF
            ├── icd10cm-CodesDescriptions-2024/
            │   └── icd10cm-codes-2024.txt
            ├── mimic-iv-3.1/
            │   └── (MIMIC-IV files)
            └── mimic-iv-note-2.2/note/
                └── discharge.csv.gz
```

**If your paths are different**, you'll set them in Step 3.
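As a rough illustration, the layout above can be written down as a list of relative paths and checked in one pass. This is only a sketch: `REQUIRED_FILES` and `missing_files` are hypothetical names that do not come from the ShifaMind repository, and the list covers only the files named explicitly in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory layout above (under Extracted/).
REQUIRED_FILES = [
    "umls-2025AA-metathesaurus-full/2025AA/META/MRCONSO.RRF",
    "umls-2025AA-metathesaurus-full/2025AA/META/MRDEF.RRF",
    "umls-2025AA-metathesaurus-full/2025AA/META/MRSTY.RRF",
    "icd10cm-CodesDescriptions-2024/icd10cm-codes-2024.txt",
    "mimic-iv-note-2.2/note/discharge.csv.gz",
]


def missing_files(base_path):
    """Return the required files that are absent under base_path (hypothetical helper)."""
    extracted = Path(base_path) / "01_Raw_Datasets" / "Extracted"
    return [rel for rel in REQUIRED_FILES if not (extracted / rel).is_file()]
```

Once Drive is mounted, `missing_files("/content/drive/MyDrive/ShifaMind")` should return an empty list if every file is in place.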

---

## Step 1: Mount Google Drive

This connects your Google Drive to Colab so we can access your data files.

In [2]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Verify mount
if os.path.exists('/content/drive/MyDrive'):
    print("✅ Google Drive mounted successfully!")
else:
    print("❌ Drive mount failed. Please try again.")

Mounted at /content/drive
✅ Google Drive mounted successfully!


## Step 2: Clone ShifaMind Repository

This downloads the ShifaMind code from GitHub.

In [3]:
import os

# Remove existing clone if present
if os.path.exists('/content/ShifaMind_Capstone'):
    print("🗑️  Removing existing ShifaMind directory...")
    !rm -rf /content/ShifaMind_Capstone

# Clone repository
print("📥 Cloning ShifaMind repository...")
!git clone https://github.com/SyedMohammedSameer/ShifaMind_Capstone.git /content/ShifaMind_Capstone

# Change to repo directory
%cd /content/ShifaMind_Capstone

print("\n✅ Repository cloned successfully!")
print("📂 Current directory:", os.getcwd())

📥 Cloning ShifaMind repository...
Cloning into '/content/ShifaMind_Capstone'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 19 (delta 3), reused 8 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (19/19), 67.87 KiB | 13.57 MiB/s, done.
Resolving deltas: 100% (3/3), done.
/content/ShifaMind_Capstone

✅ Repository cloned successfully!
📂 Current directory: /content/ShifaMind_Capstone


## Step 3: Configure Data Paths

**DEFAULT PATHS (Most users can skip this)**

If your data is at `/content/drive/MyDrive/ShifaMind/01_Raw_Datasets/`, just run this cell as-is.

**CUSTOM PATHS**

If your data is elsewhere, modify the `BASE_PATH` variable below:

In [4]:
import os
from pathlib import Path

# ============================================================================
# CONFIGURE YOUR DATA PATH HERE
# ============================================================================

# Option 1: Use default path (recommended)
BASE_PATH = "/content/drive/MyDrive/ShifaMind"

# Option 2: Use custom path (uncomment and modify if needed)
# BASE_PATH = "/content/drive/MyDrive/YourCustomFolder/ShifaMind"

# ============================================================================
# SET ENVIRONMENT VARIABLE
# ============================================================================

os.environ['SHIFAMIND_BASE_PATH'] = BASE_PATH

# ============================================================================
# VALIDATE PATHS
# ============================================================================

print("🔍 Validating data paths...\n")
print(f"Base Path: {BASE_PATH}")
print("="*70)

# Check critical paths
paths_to_check = {
    "Base Directory": BASE_PATH,
    "Raw Datasets": f"{BASE_PATH}/01_Raw_Datasets",
    "UMLS MRCONSO": f"{BASE_PATH}/01_Raw_Datasets/Extracted/umls-2025AA-metathesaurus-full/2025AA/META/MRCONSO.RRF",
    "ICD-10 Codes": f"{BASE_PATH}/01_Raw_Datasets/Extracted/icd10cm-CodesDescriptions-2024/icd10cm-codes-2024.txt",
    "MIMIC Notes": f"{BASE_PATH}/01_Raw_Datasets/Extracted/mimic-iv-note-2.2/note/discharge.csv.gz",
}

all_exist = True
for name, path in paths_to_check.items():
    exists = Path(path).exists()
    status = "✅" if exists else "❌"
    print(f"{status} {name}")
    if not exists:
        all_exist = False
        print(f"   Missing: {path}")

print("="*70)
if all_exist:
    print("\n🎉 All paths validated! Ready to proceed.")
else:
    print("\n⚠️  Some paths are missing. Please:")
    print("   1. Verify your BASE_PATH is correct")
    print("   2. Ensure data files are uploaded to Google Drive")
    print("   3. Check folder and file names match exactly")
    print("\n   Then re-run this cell.")

🔍 Validating data paths...

Base Path: /content/drive/MyDrive/ShifaMind
✅ Base Directory
✅ Raw Datasets
✅ UMLS MRCONSO
✅ ICD-10 Codes
✅ MIMIC Notes

🎉 All paths validated! Ready to proceed.


## Step 4: Install Dependencies

Colab already has most packages (PyTorch, NumPy, pandas, etc.). We only install what's missing.
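To see what is actually missing before installing anything, a check along the following lines may help. It is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and it tests only whether a distribution is installed, not whether it meets a minimum version.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Package names taken from the install cell in this step.
wanted = ["transformers", "gradio", "scispacy", "jsonlines", "plotly"]
print("Not installed:", missing_packages(wanted))
```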

In [5]:
print("📦 Installing missing dependencies...\n")
print("Note: This may take 2-3 minutes.\n")

# Install only packages not in Colab by default.
# The specifiers are quoted so the shell does not treat '>' as an output redirect.
!pip install -q "transformers>=4.35.0"
!pip install -q "gradio>=4.0.0"
!pip install -q "scispacy>=0.5.3"
!pip install -q "jsonlines>=3.1.0"
!pip install -q "plotly>=5.14.0"

# Download spacy medical model
# Note: scispaCy models are normally installed with pip from the scispaCy release
# archives; if this command fails, install en_core_sci_sm that way instead.
print("\n📥 Downloading medical NLP model...")
!python -m spacy download en_core_sci_sm

print("\n✅ All dependencies installed!")

📦 Installing missing dependencies...

Note: This may take 2-3 minutes.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
pytensor 2.35.1 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
📥 Download

## Step 5: Verify GPU Access

ShifaMind runs much faster on GPU. Let's check if GPU is available.

In [6]:
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU Available: {gpu_name}")
    print(f"   Memory: {gpu_memory:.1f} GB")
    print("\n🚀 Training will use GPU acceleration!")
else:
    print("⚠️  No GPU detected. Training will use CPU (much slower).")
    print("\n💡 To enable GPU in Colab:")
    print("   1. Go to Runtime > Change runtime type")
    print("   2. Select 'T4 GPU' or 'A100 GPU'")
    print("   3. Click Save")
    print("   4. Restart from Step 1")

✅ GPU Available: NVIDIA A100-SXM4-40GB
   Memory: 42.5 GB

🚀 Training will use GPU acceleration!


---

# 🚀 Pipeline Execution

Now that setup is complete, run the ShifaMind pipeline in order.
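Because a `!python ...` line in a notebook cell does not raise an error when the script exits with a failure, a later step can start even though an earlier one failed. One way to enforce the order is sketched below; `run_in_order` is a hypothetical helper, not part of the repository, and the script names are the ones used in the cells that follow.

```python
import subprocess
import sys

# Script names as used in the pipeline cells of this notebook.
PIPELINE = [
    "final_knowledge_base_generator.py",
    "final_model_training.py",
    "final_evaluation.py",
]


def run_in_order(commands):
    """Run each command in turn; return the index of the first failure, or None."""
    for index, command in enumerate(commands):
        result = subprocess.run(command)
        if result.returncode != 0:
            return index  # stop here so later steps do not run on bad inputs
    return None


# Example (assumes the current directory is the cloned repository):
# failed = run_in_order([[sys.executable, script] for script in PIPELINE])
```

Running the cells one at a time, as this notebook does, remains the simplest option. The helper is only useful if you want a single cell that stops at the first failing step.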

---

## Pipeline Step 1: Generate Knowledge Base

**Time**: ~5-10 minutes

This parses UMLS and ICD-10 to create a structured medical knowledge base.

In [7]:
%cd /content/ShifaMind_Capstone

print("🏗️  Generating Clinical Knowledge Base...\n")
!python final_knowledge_base_generator.py

print("\n✅ Knowledge base generated!")
print("📄 Output: 03_Models/clinical_knowledge_base.json")

/content/ShifaMind_Capstone
🏗️  Generating Clinical Knowledge Base...

SHIFAMIND: CLINICAL KNOWLEDGE BASE GENERATOR
2025-11-21 00:08:09,153 - INFO - UMLS Path: /content/drive/MyDrive/ShifaMind/01_Raw_Datasets/Extracted/umls-2025AA-metathesaurus-full/2025AA/META
2025-11-21 00:08:09,153 - INFO - ICD-10 Path: /content/drive/MyDrive/ShifaMind/01_Raw_Datasets/Extracted/icd10cm-CodesDescriptions-2024
2025-11-21 00:08:09,153 - INFO - Output: /content/drive/MyDrive/ShifaMind/03_Models/clinical_knowledge_base.json
2025-11-21 00:08:09,154 - INFO - STARTING FILTERED KNOWLEDGE BASE GENERATION
2025-11-21 00:08:09,154 - INFO - 
🔍 Validating file paths...
2025-11-21 00:08:09,157 - INFO -   ✅ MRCONSO: 2128.5 MB
2025-11-21 00:08:09,158 - INFO -   ✅ ICD-10: 6.1 MB
2025-11-21 00:08:09,158 - INFO - PARSING ICD-10 CODES
2025-11-21 00:08:09,158 - INFO - File: /content/drive/MyDrive/ShifaMind/01_Raw_Datasets/Extracted/icd10cm-CodesDescriptions-2024/icd10cm-codes-2024.txt
2025-11-21 00:08:10,432 -

## Pipeline Step 2: Train Model

**Time**: ~6 hours on GPU (T4), ~1 hour on A100

This trains ShifaMind through 3 stages:
- Stage 1: Diagnosis head (3 epochs)
- Stage 2: Concept head (2 epochs)
- Stage 3: Joint fine-tuning (3 epochs)

**⚠️ Important**: This is a long-running process. Make sure:
1. You have GPU enabled
2. Your Colab session won't time out (keep the browser tab active or use Colab Pro)
3. You have enough Google Drive space for model checkpoints (~2 GB)
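The free-space requirement can be checked before starting the run. The sketch below uses `shutil.disk_usage` from the standard library; `has_free_space` is a hypothetical helper, and the figures reported for a mounted Drive folder may not match your Drive quota exactly, so treat the result as a rough guide.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Example (assumes Drive is mounted as in Step 1):
# has_free_space("/content/drive/MyDrive", 2)
```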

In [8]:
%cd /content/ShifaMind_Capstone

print("🎓 Training ShifaMind Model...\n")
print("This will take several hours. Progress will be shown below.\n")
print("="*70)

!python final_model_training.py

print("\n✅ Training complete!")
print("📦 Model saved to: 03_Models/checkpoints/shifamind_model.pt")

/content/ShifaMind_Capstone
🎓 Training ShifaMind Model...

This will take several hours. Progress will be shown below.

2025-11-21 00:08:49,151 - INFO - 🖥️  Device: cuda
2025-11-21 00:08:49,727 - INFO - SHIFAMIND: MODEL TRAINING PIPELINE
2025-11-21 00:08:49,727 - INFO - Output Directory: /content/drive/MyDrive/ShifaMind/04_Results/experiments/training_run
2025-11-21 00:08:49,727 - INFO - Checkpoint Directory: /content/drive/MyDrive/ShifaMind/03_Models/checkpoints
2025-11-21 00:08:49,727 - INFO - DATA VALIDATION
2025-11-21 00:08:49,728 - INFO -    ✅ MIMIC Notes: 1086.4 MB
2025-11-21 00:08:49,729 - INFO -    ✅ UMLS MRCONSO: 2128.5 MB
2025-11-21 00:08:49,730 - INFO -    ✅ UMLS MRSTY: 205.2 MB
2025-11-21 00:08:51,219 - INFO -    ✅ MIMIC Diagnoses: 32.0 MB
2025-11-21 00:08:51,219 - INFO -    ✅ Output Directory: exists
2025-11-21 00:08:51,220 - INFO -    ✅ Checkpoint Directory: exists
2025-11-21 00:08:51,220 - INFO - ✅ All data validation checks passed
2025-11-21 00:08:

## Pipeline Step 3: Evaluate Model

**Time**: ~10-15 minutes

This computes comprehensive metrics and generates visualizations.

In [9]:
%cd /content/ShifaMind_Capstone

print("📊 Evaluating Model Performance...\n")
!python final_evaluation.py

print("\n✅ Evaluation complete!")
print("📁 Results saved to: 04_Results/experiments/")

/content/ShifaMind_Capstone
📊 Evaluating Model Performance...

🖥️  Device: cuda
SHIFAMIND 042: COMPREHENSIVE EVALUATION PIPELINE

📁 Output Directory: /content/drive/MyDrive/ShifaMind/04_Results/experiments/042_filtered_concepts
📁 Checkpoint: /content/drive/MyDrive/ShifaMind/03_Models/checkpoints/shifamind_model.pt

STARTING COMPREHENSIVE EVALUATION

📂 Loading model and data...
  ✅ Loaded checkpoint: /content/drive/MyDrive/ShifaMind/03_Models/checkpoints/shifamind_model.pt
  ✅ Loaded 60 concept embeddings
  Loading Bio_ClinicalBERT...
2025-11-21 00:39:17.861801: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-21 00:39:17.877088: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to regist

## Pipeline Step 4: Launch Interactive Demo

**Time**: Instant

This launches a Gradio web interface where you can test predictions on clinical notes.

**Note**: The demo will create a public URL you can share (valid for 72 hours).

In [21]:
import os
import shutil

# The file the script expects
expected_path = "/content/drive/MyDrive/ShifaMind/03_Models/clinical_knowledge_base.json"

# Your actual file (adjust the number if different)
actual_path = "/content/drive/MyDrive/ShifaMind/03_Models/clinical_knowledge_base_043.json"

# Copy with the correct name
if os.path.exists(actual_path):
    shutil.copy2(actual_path, expected_path)
    print(f"✅ Copied KB to: {expected_path}")
else:
    print(f"❌ Source file not found: {actual_path}")
    print("\nTry listing files to find the correct name:")
    !ls -lh /content/drive/MyDrive/ShifaMind/03_Models/*.json

✅ Copied KB to: /content/drive/MyDrive/ShifaMind/03_Models/clinical_knowledge_base.json
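
The cell above hard-codes the `_043` suffix. If the suffix varies between runs, the most recently modified file can be chosen instead. This is a sketch under the assumption that knowledge-base files follow the `clinical_knowledge_base_*.json` naming seen above; `newest_match` is a hypothetical helper.

```python
from pathlib import Path


def newest_match(directory, pattern="clinical_knowledge_base_*.json"):
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example (assumes the Drive layout used in this notebook):
# newest_match("/content/drive/MyDrive/ShifaMind/03_Models")
```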


In [22]:
# Set environment variable
os.environ['SHIFAMIND_BASE_PATH'] = '/content/drive/MyDrive/ShifaMind'

print(f"📂 Using BASE_PATH: {os.environ['SHIFAMIND_BASE_PATH']}")
print("🌐 Launching demo...\n")

!python final_demo.py

📂 Using BASE_PATH: /content/drive/MyDrive/ShifaMind
🌐 Launching demo...

📂 Using BASE_PATH: /content/drive/MyDrive/ShifaMind
📦 Looking for model at: /content/drive/MyDrive/ShifaMind/03_Models/checkpoints/shifamind_model.pt
📚 Looking for KB at: /content/drive/MyDrive/ShifaMind/03_Models/clinical_knowledge_base.json
🏥 SHIFAMIND: LIVE DEMO (FILTERED)

📂 Loading checkpoint...
  ✅ Loaded 60 concepts

📦 Loading Bio_ClinicalBERT...
2025-11-21 01:15:17.001287: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-21 01:15:17.017312: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763687717.035611   22272 cud

---

# 🎉 Setup Complete!

You've successfully:
- ✅ Set up ShifaMind on Google Colab
- ✅ Generated the clinical knowledge base
- ✅ Trained the model (if you ran the training step)
- ✅ Evaluated performance
- ✅ Tested inference

## Next Steps

1. **Explore Results**: Check the `04_Results/experiments/` folder for metrics and visualizations
2. **Use the Demo**: Run the demo cell above to get an interactive web interface
3. **Integrate into Your App**: Import `ShifaMindPredictor` and use it in your own code

## Troubleshooting

- **Path Errors**: Re-run Step 3 and verify all paths exist
- **Out of Memory**: Use a smaller batch size or upgrade to Colab Pro for more RAM
- **GPU Not Available**: Go to Runtime > Change runtime type > Select GPU
- **Session Timeout**: Use Colab Pro or keep browser tab active during training

## Support

For issues or questions:
- Check the README.md in the repository
- Review the documentation in the `docs/` folder
- Contact: Mohammed Sameer Syed

---

**Built with precision. Designed for transparency. Created for better healthcare.**

---