# Unified Clinical Corpus → Chroma Pipeline

This notebook loads up to the first 1,000 rows from each of `medical_data.csv`, `patient_notes.csv`, `PMC-Patients.csv`, and `pubmed_dataset.csv`, cleans and de-identifies every note, keeps each cleaned record as its own document, and stores chunked passages in a Chroma vector database backed by BioClinicalBERT embeddings.

Workflow outline:
1. **Cleaning & Normalisation** – remove noise/special symbols, expand UMLS-style medical abbreviations, lowercase, and redact PHI markers for HIPAA/GDPR compliance.
2. **LangChain Tokenisation & BioClinicalBERT Embeddings** – build one document per cleaned record, measure token counts with a Hugging Face tokenizer, and prepare embeddings with a PubMed/BioBERT-family model.
3. **Chunking & ChromaDB Storage** – split each document into ~100–300 word passages using `RecursiveCharacterTextSplitter`, then insert the chunks into a Chroma collection (metadata includes source, chunk index, timestamps).
4. **Final Outputs** – persist the cleaned corpus, chunk metadata, Chroma persistence directory, and demonstrate a sample semantic query for RAG readiness.



In [None]:
import os
import re
import shutil
from collections import defaultdict
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd

import nltk
nltk.download('punkt', quiet=True)

import spacy
from scispacy.abbreviation import AbbreviationDetector

# LangChain imports; these packages must already be installed in the environment.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Chroma & embeddings config for downstream cells.
# These relative paths resolve against the current working directory, not OUTPUT_ROOT.
EMBEDDING_MODEL_NAME = "pritamdeka/pubmedbert-base-embeddings"
CHROMA_PERSIST_DIR = Path("data/processed/chroma_clinical/").resolve()
CHUNK_METADATA_PATH = Path("data/processed/chunk_metadata.parquet").resolve()


## Paths & Dataset Configuration

We pull all source data from `data/raw/` and constrain to the first 1,000 rows per file to keep the demonstration tractable.


In [11]:
DATA_ROOT = Path('/home/root495/Inexture/CDSS-RAG/data/raw')
OUTPUT_ROOT = Path('/home/root495/Inexture/CDSS-RAG/data/processed')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)

DATASETS = {
    'medical_data': {'path': DATA_ROOT / 'medical_data.csv', 'text_column': 'TEXT'},
    'patient_notes': {'path': DATA_ROOT / 'patient_notes.csv', 'text_column': 'pn_history'},
    'pmc_patients': {'path': DATA_ROOT / 'PMC-Patients.csv', 'text_column': 'patient'},
    'pubmed': {'path': DATA_ROOT / 'pubmed_dataset.csv', 'text_column': 'contents'}
}

MAX_ROWS = 1000
DATASETS


{'medical_data': {'path': PosixPath('/home/root495/Inexture/CDSS-RAG/data/raw/medical_data.csv'),
  'text_column': 'TEXT'},
 'patient_notes': {'path': PosixPath('/home/root495/Inexture/CDSS-RAG/data/raw/patient_notes.csv'),
  'text_column': 'pn_history'},
 'pmc_patients': {'path': PosixPath('/home/root495/Inexture/CDSS-RAG/data/raw/PMC-Patients.csv'),
  'text_column': 'patient'},
 'pubmed': {'path': PosixPath('/home/root495/Inexture/CDSS-RAG/data/raw/pubmed_dataset.csv'),
  'text_column': 'contents'}}

## Load First 1,000 Rows Per Dataset

Each dataset is truncated with `head(1000)` and tagged with `source` plus `source_row_id` for downstream metadata.


In [12]:
raw_dfs = {}
for name, cfg in DATASETS.items():
    df = pd.read_csv(cfg['path']).head(MAX_ROWS)
    df['source'] = name
    df['source_row_id'] = df.index
    raw_dfs[name] = df

{k: len(v) for k, v in raw_dfs.items()}


{'medical_data': 744,
 'patient_notes': 1000,
 'pmc_patients': 1000,
 'pubmed': 1000}

In [13]:
{name: df[[DATASETS[name]['text_column']]].head(2) for name, df in raw_dfs.items()}


{'medical_data':                                                 TEXT
 0  Admission Date:  [**2162-3-3**]              D...
 1  Admission Date:  [**2150-2-25**]              ...,
 'patient_notes':                                           pn_history
 0  17-year-old male, has come to the student heal...
 1  17 yo male with recurrent palpitations for the...,
 'pmc_patients':                                              patient
 0  This 60-year-old male was hospitalized due to ...
 1  A 39-year-old man was hospitalized due to an i...,
 'pubmed':                                             contents
 0  [Biochemical studies on camomile components/II...
 1  [Demonstration of tumor inhibiting properties ...}

## Cleaning & Normalisation

We remove noisy symbols, standardise spacing/case, expand common medical abbreviations (UMLS-inspired) via ScispaCy's abbreviation detector, and redact PHI markers such as emails, phone numbers, MRNs, and dates.


In [14]:
NLP_MODEL = 'en_core_sci_sm'
nlp = spacy.load(NLP_MODEL, disable=['ner'])
nlp.add_pipe('abbreviation_detector')

UMLS_ABBREV_MAP = {
    'HTN': 'hypertension',
    'DM': 'diabetes mellitus',
    'SOB': 'shortness of breath',
    'CAD': 'coronary artery disease',
    'COPD': 'chronic obstructive pulmonary disease',
    'CHF': 'congestive heart failure',
    'Pt': 'patient',
    'BP': 'blood pressure',
    'HR': 'heart rate',
    'c/o': 'complains of'
}

PHI_PATTERNS = {
    'emails': re.compile(r'\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,}\b'),
    'phones': re.compile(r'(?:\+?\d{1,2}[ -]?)?(?:\(\d{3}\)|\d{3})[ -]?\d{3}[ -]?\d{4}'),
    'dates': re.compile(r'\b(?:\d{1,2}[/-]){2}\d{2,4}\b'),
    'mrn': re.compile(r'\b(?:mrn|patient id)\s*[:#]?\s*\d+\b', re.IGNORECASE),
    'names': re.compile(r'\b([A-Z][a-z]+\s[A-Z][a-z]+)\b')
}


def normalize_whitespace(text: str) -> str:
    return re.sub(r'\s+', ' ', str(text)).strip()


def remove_special_chars(text: str) -> str:
    return re.sub(r"[^0-9A-Za-z.,;:!?%\-\s'\/]", ' ', text)


def expand_abbreviations(text: str) -> str:
    doc = nlp(text)
    expanded = text
    for abrv in doc._.abbreviations:
        key = abrv.text.strip()
        if key in UMLS_ABBREV_MAP:
            expanded = re.sub(rf'\b{re.escape(key)}\b', UMLS_ABBREV_MAP[key], expanded)
    for short, long in UMLS_ABBREV_MAP.items():
        expanded = re.sub(rf'\b{re.escape(short)}\b', long, expanded, flags=re.IGNORECASE)
    return expanded


def deidentify(text: str) -> str:
    redacted = text
    for pattern in PHI_PATTERNS.values():
        redacted = pattern.sub('[REDACTED]', redacted)
    return redacted


def clean_text(text: str) -> str:
    text = normalize_whitespace(text)
    text = remove_special_chars(text)
    text = expand_abbreviations(text)
    text = text.lower()
    text = deidentify(text)
    return text

cleaned_dfs = {}
for name, df in raw_dfs.items():
    text_col = DATASETS[name]['text_column']
    df = df.copy()
    df['clean_text'] = df[text_col].fillna('').apply(clean_text)
    cleaned_dfs[name] = df

{k: df[['clean_text']].head(1)['clean_text'].iloc[0][:120] for k, df in cleaned_dfs.items()}





{'medical_data': 'admission date:    2162-3-3    discharge date:    2162-3-25    date of birth:    2080-1-4    sex: m service: medicine al',
 'patient_notes': "17-year-old male, has come to the student health clinic complaining of heart pounding. mr. cleveland's mother has given ",
 'pmc_patients': 'this 60-year-old male was hospitalized due to moderate ards from covid-19 with symptoms of fever, dry cough, and dyspnea',
 'pubmed': ' biochemical studies on camomile components/iii. in vitro studies about the antipeptic activity of  -- -alpha-bisabolol '}
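
The sample output above still contains a surname (`mr. cleveland's`) and year-first dates (`2162-3-3`). Two likely causes are visible in the cell: `clean_text` lowercases before calling `deidentify`, so the capitalised `names` pattern can no longer match, and the `dates` pattern expects the day or month first. The sketch below is one possible adjustment, not the notebook's code: it redacts while the original casing is intact and only then lowercases. The names `SKETCH_PHI_PATTERNS`, `iso_dates`, `titled_names`, and `deidentify_then_lower` are assumptions introduced for illustration, and regex redaction of this kind remains a heuristic that will miss some PHI.

```python
import re

# Hypothetical patterns for illustration; the names and regexes are assumptions.
SKETCH_PHI_PATTERNS = {
    # Year-first dates such as 2162-3-3 or 2150/02/25.
    'iso_dates': re.compile(r'\b\d{4}[/-]\d{1,2}[/-]\d{1,2}\b'),
    # Day/month-first dates such as 03/25/2021 (same regex as the notebook).
    'dates': re.compile(r'\b(?:\d{1,2}[/-]){2}\d{2,4}\b'),
    # A title followed by a capitalised surname, e.g. "Mr. Cleveland".
    'titled_names': re.compile(r'\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+\b'),
}


def deidentify_then_lower(text: str) -> str:
    """Redact while the original casing is intact, then lowercase."""
    redacted = text
    for pattern in SKETCH_PHI_PATTERNS.values():
        redacted = pattern.sub('[REDACTED]', redacted)
    # Lowercase everything, then restore the upper-case redaction marker.
    return redacted.lower().replace('[redacted]', '[REDACTED]')


sample = "Admission Date: 2162-3-3. Mr. Cleveland's mother called on 03/25/2021."
print(deidentify_then_lower(sample))
```

In the notebook's pipeline this would correspond to swapping the last two steps of `clean_text`, so that redaction still runs after `remove_special_chars` (which would otherwise strip the brackets from the marker).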

## Build Per-Document Inputs for Vector Storage

Instead of collapsing everything into a single mega-document, we keep each cleaned row as its own LangChain `Document`. This lets us track provenance per record and evaluate retrieval metrics such as precision@k / recall@k against specific document ids.


In [17]:
document_frames = []
for name, df in cleaned_dfs.items():
    df = df.copy()
    df['doc_id'] = df.apply(lambda row: f"{row['source']}::{int(row['source_row_id'])}", axis=1)
    cleaned_dfs[name] = df
    document_frames.append(df[['doc_id', 'source', 'source_row_id', 'clean_text']])

documents_df = pd.concat(document_frames, ignore_index=True)
print(f"Prepared {len(documents_df)} individual documents for downstream chunking")



Prepared 3744 individual documents for downstream chunking
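
The storage cell further down passes a `chunks` list to `Chroma.from_documents`, but the cell that builds it is not part of this export. As a rough, dependency-free illustration of the idea (a plain word-window splitter standing in for `RecursiveCharacterTextSplitter`, not the notebook's actual configuration), the sketch below splits each record into overlapping passages and carries `doc_id`, `source`, a chunk index, and a timestamp along as metadata. The function name `chunk_records`, the 200-word window, and the 20-word overlap are assumptions chosen to fall inside the ~100–300 word target.

```python
from datetime import datetime, timezone


def chunk_records(records, max_words=200, overlap=20):
    """Split each record's text into overlapping word windows.

    `records` is an iterable of dicts with 'doc_id', 'source' and
    'clean_text' keys (the columns of documents_df).
    Returns a list of {'text': ..., 'metadata': {...}} dicts.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    step = max_words - overlap
    created_at = datetime.now(timezone.utc).isoformat()
    chunks = []
    for rec in records:
        words = rec['clean_text'].split()
        for chunk_index, start in enumerate(range(0, len(words), step)):
            window = words[start:start + max_words]
            chunks.append({
                'text': ' '.join(window),
                'metadata': {
                    'doc_id': rec['doc_id'],
                    'source': rec['source'],
                    'chunk_index': chunk_index,
                    'created_at': created_at,
                },
            })
            if start + max_words >= len(words):
                break  # this window already reached the end of the record
    return chunks


# Synthetic 450-word record, used only to show the window sizes.
demo = [{'doc_id': 'pubmed::0', 'source': 'pubmed',
         'clean_text': ' '.join(f'w{i}' for i in range(450))}]
demo_chunks = chunk_records(demo)
print([(c['metadata']['chunk_index'], len(c['text'].split())) for c in demo_chunks])
```

With `documents_df` in hand, `chunk_records(documents_df.to_dict('records'))` would produce the inputs, and each entry maps onto a LangChain `Document(page_content=..., metadata=...)`.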


In [None]:
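from langchain_community.embeddings import HuggingFaceEmbeddings

# NOTE: this cell hardcodes Bio_ClinicalBERT and does not use EMBEDDING_MODEL_NAME
# (a PubMedBERT checkpoint) from the setup cell. Whichever model builds the
# collection must also be the one used to query it.
embedding_model = HuggingFaceEmbeddings(
    model_name="emilyalsentzer/Bio_ClinicalBERT"
)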


In [21]:
from langchain_community.vectorstores import Chroma

CHROMA_PERSIST_DIR.mkdir(parents=True, exist_ok=True)

vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=str(CHROMA_PERSIST_DIR)
)

print(f"Chunks stored successfully in ChromaDB at {CHROMA_PERSIST_DIR}!")




Chunks stored successfully in ChromaDB at /home/root495/Inexture/CDSS-RAG/notebooks/data/processed/chroma_clinical!


In [None]:
from langchain_community.vectorstores import Chroma

# Load the existing ChromaDB vector store from the shared persistence directory
vectordb = Chroma(
    persist_directory=str(CHROMA_PERSIST_DIR),
    embedding_function=embedding_model
)




In [52]:
query = text1  # your user's question

results = vectordb.similarity_search(query, k=5)

print(f"Retrieved {len(results)} chunks")


Retrieved 5 chunks
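
Because each chunk can carry the `doc_id` of the record it came from, a retrieval like the one above can be scored against a set of records known to be relevant. The helper below is a minimal sketch of document-level precision@k and recall@k; the function name, the choice to collapse repeated ids before ranking, and the example ids are assumptions for illustration, and no relevance labels exist in this notebook.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a single query.

    retrieved_ids: ranked doc ids, best first. Several chunks of the same
                   record may repeat an id, so repeats are collapsed first.
    relevant_ids:  set of doc ids judged relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    unique_ranked = list(dict.fromkeys(retrieved_ids))  # keep order, drop repeats
    top_k = unique_ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Made-up ids in the notebook's "source::row" format.
retrieved = ['pmc_patients::12', 'pmc_patients::12', 'pubmed::7',
             'medical_data::3', 'patient_notes::40', 'pmc_patients::88']
relevant = {'pmc_patients::12', 'medical_data::3', 'pubmed::99'}
p_at_4, r_at_4 = precision_recall_at_k(retrieved, relevant, k=4)
print(p_at_4, r_at_4)
```

The ranked ids could be read from `chunk.metadata['doc_id']` for each result, assuming that key was stored with the chunks when the collection was built.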


In [53]:
for i, chunk in enumerate(results, 1):
    print(f"\n🟦 Chunk {i}")
    print(chunk.page_content)
    print("-" * 80)



🟦 Chunk 1
a 61-year-old female presented to hospital with a 2-week history of profound diarrhea and vomiting. the patient also complained of dull abdominal pain that temporarily resolved with bowel movements. she denied fevers, weight loss, exposure to sick contacts, external food sources, and a travel history. there were no extraintestinal manifestations of inflammatory bowel disease  ibd , such as arthralgias, uveitis, episcleritis, oral ulcers, and aphthous ulcers. nshe was admitted with an initial diagnosis of viral gastroenteritis and treated with supportive therapy. stool testing for clostridium difficile, ova and parasites, viral pcr, and bacterial cultures were negative. a ct abdomen revealed diffuse edematous changes in the ascending colon, transverse colon, and descending colon, as well as hyperemia in the mesentery indicating colitis. non her second day of admission, the patient developed bloody diarrhea, prompting a colonoscopy by the gastroenterology service. the colon