HIPAA-aware ETL + Patient 360 + NLP/RAG search for continuity of care.
- ETL pipeline: cleans and standardizes raw CSVs, de-identifies patients, and builds a Patient 360 mart.
- Population dashboard: overall trends and utilization (Streamlit).
- Patient 360 UI + Q/A: patient summary + grounded answers from notes (Streamlit + RAG).
- Drops direct identifiers (name, address, SSN, etc.).
- Removes full birth/death dates; keeps year only.
- Buckets ages and masks the birth year for patients aged 90+, per HIPAA Safe Harbor practice.
- Hashes patient/encounter identifiers with a configurable salt.
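The de-identification rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the field names (`patient_id`, `birthdate`), the 10-year age bands, and the use of HMAC-SHA256 as the salted hash are all assumptions made for the example. Only the `EHR_HASH_SALT` variable name comes from this README.

```python
import hashlib
import hmac
import os
from datetime import date

# Assumption: the salt is read from EHR_HASH_SALT; the fallback value is illustrative only.
SALT = os.environ.get("EHR_HASH_SALT", "change_me")


def hash_identifier(raw_id: str, salt: str = SALT) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier."""
    return hmac.new(salt.encode("utf-8"), raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


def age_bucket(age: int) -> str:
    """Map an age to a 10-year band, collapsing everyone aged 90 or older into '90+'."""
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def deidentify(record: dict, as_of: date) -> dict:
    """Build a de-identified row; direct identifiers are dropped by never copying them."""
    birth = date.fromisoformat(record["birthdate"])
    # Age in whole years on the reference date.
    age = as_of.year - birth.year - ((as_of.month, as_of.day) < (birth.month, birth.day))
    return {
        "patient_key": hash_identifier(record["patient_id"]),
        # Keep the year only, and mask it entirely for patients aged 90+.
        "birth_year": None if age >= 90 else birth.year,
        "age_bucket": age_bucket(age),
    }
```

Because the output row is built from an explicit allow-list of fields, a new identifying column in the raw CSV cannot leak into the mart by accident.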
- Phase 1: ETL + Dashboard (Notebook)
pip install -r requirements.txt
jupyter lab
Open notebooks/EHR_ETL_and_Dashboard.ipynb and run all cells.
- Phase 2: NLP + RAG (Notebook)
pip install -r requirements-phase2.txt
jupyter lab
Open notebooks/EHR_Phase2_NLP_RAG.ipynb and run all cells.
Note: some clinical NLP packages are not yet compatible with Python 3.13. If you need scispaCy/medspaCy/Presidio, use Python 3.11.
- Build Relational + Vector Datastores
Build a SQLite relational DB and a FAISS vector index (TF-IDF fallback if FAISS is unavailable).
python src/build_datastores.py
This creates:
  - ehr.db (relational tables)
  - note_chunks_fts (keyword index, if FTS5 is available)
  - notes.faiss (vector index) or tfidf.pkl (fallback)
  - structured “patient summary” chunks are also embedded for semantic retrieval
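The "if FTS5 is available" condition can be handled by attempting to create the virtual table and catching the error SQLite raises when the extension is missing. The sketch below uses only the standard `sqlite3` module. The base table name `note_chunks` and the column names are assumptions for illustration, and `src/build_datastores.py` may be organized differently.

```python
import sqlite3


def build_note_store(conn: sqlite3.Connection, chunks: list[tuple[str, str]]) -> bool:
    """Load (patient_key, text) chunks; return True if an FTS5 keyword index was built."""
    conn.execute("CREATE TABLE IF NOT EXISTS note_chunks (patient_key TEXT, text TEXT)")
    conn.executemany("INSERT INTO note_chunks VALUES (?, ?)", chunks)
    try:
        # patient_key is stored but not tokenized; only the note text is searchable.
        conn.execute(
            "CREATE VIRTUAL TABLE note_chunks_fts USING fts5(patient_key UNINDEXED, text)"
        )
    except sqlite3.OperationalError:
        # FTS5 is not compiled into this SQLite build; keep the plain table only.
        conn.commit()
        return False
    conn.execute("INSERT INTO note_chunks_fts SELECT patient_key, text FROM note_chunks")
    conn.commit()
    return True
```

When the function returns `True`, a query such as `SELECT patient_key FROM note_chunks_fts WHERE note_chunks_fts MATCH ? ORDER BY bm25(note_chunks_fts)` returns matches ranked by BM25. When it returns `False`, callers would need to rely on the vector index alone.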
- Run Patient 360 + Q/A UI
pip install -r requirements-ui.txt
streamlit run dashboard/patient_chatbot.py
Retrieval uses hybrid fusion: keyword (BM25) + semantic (FAISS) with weighted score merging.
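Weighted score merging can be illustrated as follows. This is a sketch under stated assumptions rather than the UI's actual code: both retrievers are assumed to return `{chunk_id: score}` with higher meaning better, scores are min-max normalized before mixing, and the weight `alpha` and function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so keyword and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied; treat them as equally strong.
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(
    keyword: dict[str, float],
    semantic: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Merge two score maps; alpha weights keyword scores, (1 - alpha) semantic scores."""
    kw, sem = min_max(keyword), min_max(semantic)
    ids = kw.keys() | sem.keys()
    # A chunk missing from one retriever contributes 0 from that side.
    merged = {i: alpha * kw.get(i, 0.0) + (1 - alpha) * sem.get(i, 0.0) for i in ids}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while vector similarities usually fall in a narrow range. Without it, one retriever would dominate the merge regardless of `alpha`.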
- Ollama (local):
export OLLAMA_MODEL="llama3.1:8b"
- OpenAI (hosted):
export OPENAI_API_KEY="your_key"
export OPENAI_MODEL="gpt-4o-mini"
If no LLM is configured, the UI returns evidence-only answers.
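One way to turn these environment variables into a backend choice is sketched below. The function name, the precedence (Ollama checked before OpenAI), and the default OpenAI model are assumptions for illustration. The README only states which variables exist and that the UI falls back to evidence-only answers.

```python
import os
from typing import Mapping, Optional


def choose_backend(env: Optional[Mapping[str, str]] = None) -> tuple[str, Optional[str]]:
    """Return (backend, model) based on which environment variables are set."""
    env = os.environ if env is None else env
    if env.get("OLLAMA_MODEL"):
        return "ollama", env["OLLAMA_MODEL"]
    if env.get("OPENAI_API_KEY"):
        # Assumption: default to the model shown in the example export above.
        return "openai", env.get("OPENAI_MODEL", "gpt-4o-mini")
    # No LLM configured: the caller shows retrieved evidence without generation.
    return "evidence-only", None
```

Passing the environment in as a mapping keeps the function easy to test without mutating `os.environ`.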
Optional: set a salt for hashing.
export EHR_HASH_SALT="your_secret_salt"
pip install -r requirements-ui.txt
streamlit run dashboard/app.py
- data/processed/dim_patient.csv
- data/processed/fact_*
- data/processed/mart_patient_360.csv
- data/processed/ehr.db (SQLite)
- data/processed/notes.faiss or data/processed/tfidf.pkl (vector index)