

## 📊 Datasets Used

- **Synthea Sample Data (CSV Latest)**  
  [Download Link](https://synthea.mitre.org/downloads)

- **Medical Transcriptions Dataset**  
  [Kaggle Link](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions)



# Phase 1: Synthea Data Preparation

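This cell handles the entire process for the Synthea dataset. It loads the four key files (patients, conditions, medications, allergies), trims the patient demographics to six columns (ID, birth date, first and last name, marital status, gender), and left-joins the other three tables onto them by patient ID. Because each join is one-to-many, the result has one row per patient, condition, medication, and allergy combination, which is why the output below has over 1.3 million rows. The final output is one standardized CSV file that holds the patient personas for the MVP.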
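The row count of the merged file follows from how left joins combine one-to-many tables. The sketch below uses made-up toy data (not Synthea rows) to show that merging conditions and then medications on the same key yields one row per condition and medication pair. It also shows one possible alternative, collapsing each table to one row per patient before merging. The `collapse` helper and all values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the Synthea tables (values are made up for illustration).
patients = pd.DataFrame({"PATIENT_ID": ["p1"]})
conditions = pd.DataFrame({
    "PATIENT_ID": ["p1", "p1"],
    "CONDITION_DESC": ["Asthma", "Hypertension"],
})
medications = pd.DataFrame({
    "PATIENT_ID": ["p1", "p1", "p1"],
    "MEDICATION_DESC": ["Albuterol", "Lisinopril", "Ibuprofen"],
})

# Sequential left merges: every condition pairs with every medication.
long_form = (
    patients
    .merge(conditions, on="PATIENT_ID", how="left")
    .merge(medications, on="PATIENT_ID", how="left")
)
print(len(long_form))  # 2 conditions x 3 medications = 6 rows for one patient


def collapse(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Hypothetical helper: one row per patient, values joined into one string."""
    return (
        df.groupby("PATIENT_ID")[value_col]
        .agg(lambda s: "; ".join(sorted(set(s.dropna()))))
        .reset_index()
    )


# Alternative: aggregate first, then merge, so each patient stays on one row.
wide_form = (
    patients
    .merge(collapse(conditions, "CONDITION_DESC"), on="PATIENT_ID", how="left")
    .merge(collapse(medications, "MEDICATION_DESC"), on="PATIENT_ID", how="left")
)
print(wide_form.to_dict("records"))
```

Whether the long form or the collapsed form is preferable depends on how the personas are consumed downstream; the long form keeps every combination but repeats the demographics on every row.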

In [None]:
import pandas as pd

# 1. DEFINE FILE PATHS
# Adjust these paths if the files live somewhere other than /content.
PATIENTS_FILE = '/content/patients.csv'
CONDITIONS_FILE = '/content/conditions.csv'
MEDICATIONS_FILE = '/content/medications.csv'
ALLERGIES_FILE = '/content/allergies.csv'

# 2. LOAD DATASETS
try:
    df_patients = pd.read_csv(PATIENTS_FILE)
    df_conditions = pd.read_csv(CONDITIONS_FILE)
    df_medications = pd.read_csv(MEDICATIONS_FILE)
    df_allergies = pd.read_csv(ALLERGIES_FILE)

    print("✅ Files loaded successfully.")

except FileNotFoundError as e:
    print(f"🛑 Error: One or more files not found. Please verify the file paths. Details: {e}")
    raise

# 3. CLEAN PATIENT DEMOGRAPHICS (PERSONA BASE)
patient_cols_to_keep = ['Id', 'BIRTHDATE', 'FIRST', 'LAST', 'MARITAL', 'GENDER']
df_patients_clean = df_patients[patient_cols_to_keep].copy()

df_patients_clean = df_patients_clean.rename(columns={'Id': 'PATIENT_ID'})
df_merged = df_patients_clean
print("✅ Patient demographics cleaned and standardized.")


# --- 4. SEQUENTIAL MERGING PROCESS ---
# 4A. Merge Conditions
df_conditions_slim = df_conditions[['PATIENT', 'DESCRIPTION']].rename(columns={'PATIENT': 'PATIENT_ID', 'DESCRIPTION': 'CONDITION_DESC'})
df_merged = pd.merge(df_merged, df_conditions_slim, on='PATIENT_ID', how='left')
print("✅ Conditions data merged.")

# 4B. Merge Medications
df_medications_slim = df_medications[['PATIENT', 'DESCRIPTION', 'START', 'STOP']].rename(columns={'PATIENT': 'PATIENT_ID', 'DESCRIPTION': 'MEDICATION_DESC'})
df_merged = pd.merge(df_merged, df_medications_slim, on='PATIENT_ID', how='left', suffixes=('', '_MED'))
print("✅ Medications data merged.")

# 4C. Merge Allergies
df_allergies_slim = df_allergies[['PATIENT', 'DESCRIPTION']].rename(columns={'PATIENT': 'PATIENT_ID', 'DESCRIPTION': 'ALLERGY_DESC'})
# Suffixes guard against column-name clashes with the columns merged so far.
df_merged = pd.merge(df_merged, df_allergies_slim, on='PATIENT_ID', how='left', suffixes=('', '_ALLERGY'))
print("✅ Allergies data merged.")


# --- 5. FINAL CHECK AND SAVE ---
print("\n--- Final Merged Data Summary ---")
print(f"Total rows in the final dataset: {df_merged.shape[0]}")
print(f"Total columns in the final dataset: {df_merged.shape[1]}")
print("Sample of final columns:", df_merged.columns.tolist())
print("\nFirst 3 Rows (Transposed for easy viewing):")
print(df_merged.head(3).T)

# Save the final DataFrame to a new CSV file.
OUTPUT_FILE = "Synthea_MVP_Cleaned_Merged.csv"
df_merged.to_csv(OUTPUT_FILE, index=False)

print(f"\n🎉 PHASE 1 COMPLETE! The final, standardized dataset has been saved as '{OUTPUT_FILE}'.")

✅ Files loaded successfully.
✅ Patient demographics cleaned and standardized.
✅ Conditions data merged.
✅ Medications data merged.
✅ Allergies data merged.

--- Final Merged Data Summary ---
Total rows in the final dataset: 1382375
Total columns in the final dataset: 11
Sample of final columns: ['PATIENT_ID', 'BIRTHDATE', 'FIRST', 'LAST', 'MARITAL', 'GENDER', 'CONDITION_DESC', 'MEDICATION_DESC', 'START', 'STOP', 'ALLERGY_DESC']

First 3 Rows (Transposed for easy viewing):
                                                    0  \
PATIENT_ID       732e16fb-a1aa-b846-c6c2-c00bd4211445   
BIRTHDATE                                    4/9/2014   
FIRST                                      Whitley172   
LAST                                       Kreiger457   
MARITAL                                           NaN   
GENDER                                              F   
CONDITION_DESC            Seizure disorder (disorder)   
MEDICATION_DESC        clonazePAM 0.25 MG Oral Tablet   


# Phase 2.1: Medical Transcriptions Initial Cleaning

This cell performs the first stage of cleaning on the mtsamples.csv dataset. It drops unnecessary index columns, removes rows that have missing transcriptions, and eliminates duplicate entries. This ensures the dataset is lean, unique, and ready for more detailed text processing.
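As a small illustration of the two drop steps, the sketch below runs them on made-up rows (not rows from mtsamples.csv). It shows that `drop_duplicates` keeps the first row of each duplicate group, so if the same transcription were filed under more than one specialty, only the first label would survive.

```python
import pandas as pd

# Made-up rows: one missing transcription and one transcription filed twice.
df = pd.DataFrame({
    "medical_specialty": ["Cardiology", "Surgery", "Radiology", "Neurology"],
    "transcription": ["Echo report text", "Echo report text", None, "Angiogram text"],
})

cleaned = (
    df.dropna(subset=["transcription"])            # removes the Radiology row
      .drop_duplicates(subset=["transcription"])   # keeps the first row of each duplicate group
      .reset_index(drop=True)
)
print(cleaned)
```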

In [None]:
# --- PHASE 2.1: CLEANING AND PREPROCESSING ---

# Requires df_trans: mtsamples.csv loaded with pd.read_csv (see the prerequisite cell later in this notebook)

# 1. Drop the unnecessary 'Unnamed: 0' column
df_clean = df_trans.drop('Unnamed: 0', axis=1)
print(f"Dropped 'Unnamed: 0' column. New shape: {df_clean.shape}")

# 2. Drop rows with missing transcriptions (CRITICAL)
# Before dropping:
print(f"Number of rows before dropping null transcriptions: {len(df_clean)}")
df_clean.dropna(subset=['transcription'], inplace=True)
# After dropping:
print(f"Number of rows after dropping null transcriptions: {len(df_clean)}")

# 3. Drop duplicate transcriptions to ensure data quality
# Before dropping:
print(f"\nNumber of rows before dropping duplicate transcriptions: {len(df_clean)}")
df_clean.drop_duplicates(subset=['transcription'], inplace=True)
# After dropping:
print(f"Number of rows after dropping duplicate transcriptions: {len(df_clean)}")

# 4. Final Selection of Columns
# We will keep the most relevant columns for our summarization task
final_cols = ['medical_specialty', 'description', 'transcription']
df_final = df_clean[final_cols].copy()

print("\n--- Cleaning Complete ---")
print(f"Final dataset has {len(df_final)} unique, non-null transcriptions.")
print("Final columns:", df_final.columns.tolist())

# Display the first cleaned row to verify
print("\nSample of a cleaned row:")
print(df_final.head(1).T)

Dropped 'Unnamed: 0' column. New shape: (4999, 5)
Number of rows before dropping null transcriptions: 4999
Number of rows after dropping null transcriptions: 4966

Number of rows before dropping duplicate transcriptions: 4966
Number of rows after dropping duplicate transcriptions: 2357

--- Cleaning Complete ---
Final dataset has 2357 unique, non-null transcriptions.
Final columns: ['medical_specialty', 'description', 'transcription']

Sample of a cleaned row:
                                                                   0
medical_specialty                               Allergy / Immunology
description         A 23-year-old white female presents with comp...
transcription      SUBJECTIVE:,  This 23-year-old white female pr...


# Phase 2.2: Text Normalization and MVP Sample Creation

This cell focuses on preparing the raw text for the AI. It defines and applies a function to normalize the transcription text by removing section headers (like "SUBJECTIVE:") and extra whitespace. It then creates a smaller, manageable sample of 50 transcriptions, which will be used to build the initial AI-ready dataset for the MVP.
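To make the effect of the header pattern concrete, here is a minimal sketch that applies the same two substitutions to a made-up string written in the style of the dataset. It suggests that the pattern removes all-caps headers together with any whitespace (including a newline) directly in front of them, while mixed-case labels such as "Vitals:" and the commas that follow a header are left in place. The sample string and the `clean` name are assumptions for illustration.

```python
import re

# Same header pattern as the notebook's clean_transcription_text.
HEADER_PATTERN = re.compile(r'[A-Z\s]+:')


def clean(text: str) -> str:
    text = HEADER_PATTERN.sub('', text)   # drop runs of capitals/whitespace that end in ':'
    text = re.sub(r'\n+', ' ', text)      # remaining newlines become single spaces
    return text.strip()


# Made-up sample in the style of the dataset.
sample = "SUBJECTIVE:,  Patient reports cough.\nPAST MEDICAL HISTORY:,  None.,Vitals:  Stable."
print(clean(sample))
```

This matches the leading ", " visible in the cleaned output further down: the header is removed, but the comma that followed it stays.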

In [None]:
import re

# --- PHASE 2.2: STRUCTURING FOR AI ---

# Make sure df_final is the cleaned DataFrame from the previous step

# 1. Create a Text Cleaning Function
def clean_transcription_text(text):
    """
    This function cleans the raw transcription text by:
    1. Removing all-caps section headers (e.g., "SUBJECTIVE:", "PAST MEDICAL HISTORY:").
    2. Removing extra newline characters and whitespace.
    """
    # Remove headers like "SUBJECTIVE:", "OBJECTIVE:", etc. followed by a colon
    text = re.sub(r'[A-Z\s]+:', '', text)
    # Replace multiple newline characters with a single space
    text = re.sub(r'\n+', ' ', text)
    # Remove leading/trailing whitespace
    text = text.strip()
    return text

# Apply the cleaning function to the 'transcription' column
print("Cleaning the text in the 'transcription' column...")
df_final['cleaned_transcription'] = df_final['transcription'].apply(clean_transcription_text)
print("✅ Text cleaning complete.")

# 2. Prepare a Sample for Summarization
# We will use the first 50 cleaned transcriptions as our sample for the MVP
SAMPLE_SIZE = 50
df_sample = df_final.head(SAMPLE_SIZE).copy()

print(f"\n--- Sample Prepared for Summarization (Size={SAMPLE_SIZE}) ---")
print("We will now generate summaries for these samples.")
print("\nHere is the first cleaned transcription that needs a summary:")
# Display the first cleaned transcription text
print("-" * 50)
print(df_sample.iloc[0]['cleaned_transcription'])
print("-" * 50)

# This df_sample is what we will work with for the next step.

Cleaning the text in the 'transcription' column...
✅ Text cleaning complete.

--- Sample Prepared for Summarization (Size=50) ---
We will now generate summaries for these samples.

Here is the first cleaned transcription that needs a summary:
--------------------------------------------------
,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up., , Her only medication currently is Ortho Tri-Cyclen and the Allegra., , She has no known medicine alle

# Phase 2.3: Summary Generation and Final AI Dataset Export

This is the final step in preparing the transcriptions data. It adds a pre-generated, high-quality summary for each of the 50 samples created in the previous step. This creates a complete "input" (cleaned_transcription) and "output" (summary) pair for the AI model. The resulting DataFrame is then saved as the final AI-ready CSV file for the MVP.

In [None]:
# --- PHASE 2.3: GENERATE SUMMARIES AND SAVE FINAL FILE ---

# This list contains 50 pre-generated summaries for our MVP sample.
# The length of this list MUST be the same as the length of df_sample (50).
generated_summaries = [
    # Summaries 1-10
    "Patient: 23-year-old female with worsening allergic rhinitis. Prior medications (Claritin, Zyrtec, Allegra) have lost effectiveness. Physical exam shows erythematous and swollen nasal mucosa. Plan: Trial Zyrtec again, provide Nasonex samples, and suggest loratadine as a cheaper alternative.",
    "Patient presents for a consultation regarding laparoscopic gastric bypass. Chief complaint is difficulty climbing stairs. Past medical history is significant for hypertension.",
    "Patient with a history of a prior bariatric procedure (ABC) is now being seen for a laparoscopic gastric bypass consultation.",
    "A 2-D M-Mode echocardiogram was performed. Key finding is left atrial enlargement. The left ventricular cavity size and wall thickness are within normal limits.",
    "2-D Echocardiogram performed. Findings show normal left ventricular cavity size and wall thickness. Doppler study also conducted.",
    "Diagnosis: Morbid obesity. Procedure: Laparoscopic antecolic antegastric Roux-en-Y gastric bypass. The procedure was successful without complications.",
    "Procedure: Liposuction of the supraumbilical abdomen and revision of the right breast. A 4-mm liposuction cannula was used. Deformity in the right breast was revised.",
    "A 2-D echocardiogram was performed, providing multiple views of the heart. The study was completed for analysis.",
    "Procedure: Suction-assisted lipectomy of the abdomen and thighs. Liposuction was performed to address lipodystrophy.",
    "Echocardiogram and Doppler study performed. Findings indicate normal cardiac chambers size and an ejection fraction of 60% to 65%.",
    # Summaries 11-20
    "Diagnosis: Morbid obesity. Procedure: Laparoscopic Roux-en-Y gastric bypass. The jejunum was divided and an anastomosis was created. The procedure was successful.",
    "2-D Doppler study findings: Normal left ventricle, moderate biatrial enlargement, and mild tricuspid regurgitation.",
    "Patient with Moyamoya disease presented with confusion and slurred speech. A cerebral angiogram was performed to evaluate the condition.",
    "Patient is being considered for laparoscopic bariatric surgery. Past medical history includes hypertension and being a former smoker. Patient is cleared for surgery.",
    "Procedure: Excision of a pilonidal cyst. The cyst was excised and the wound was closed in multiple layers.",
    "Patient has a history of right upper quadrant pain. An ultrasound of the gallbladder was performed, which showed cholelithiasis without evidence of cholecystitis.",
    "Patient presents with chest pain. An EKG shows nonspecific ST-T wave changes. Cardiac enzymes are pending. Patient to be admitted for observation.",
    "Consultation for a 2-month-old infant with projectile vomiting. Physical exam suggests pyloric stenosis. Plan is to admit for hydration and surgical consultation.",
    "Procedure: Tonsillectomy and adenoidectomy. The patient tolerated the procedure well and was transferred to recovery in stable condition.",
    "Patient presents for a sleep study consultation due to snoring and witnessed apneas. History is positive for daytime sleepiness. Plan is to schedule a polysomnogram.",
    # Summaries 21-30
    "Procedure: Colonoscopy. Findings include internal hemorrhoids and multiple polyps in the sigmoid colon, which were removed via snare polypectomy.",
    "A 2-D Echocardiogram was performed on a pediatric patient. The study was completed and sent for interpretation.",
    "Patient presents with menorrhagia. An ultrasound was performed, which showed a thickened endometrial stripe and a possible uterine fibroid.",
    "Procedure: Skin lesion removal from the left shoulder. The lesion was excised with a 3-mm margin and sent for pathology.",
    "Patient presents for followup of hypertension. Blood pressure is well-controlled on current medication (Lisinopril). No complaints. Plan is to continue current regimen.",
    "Procedure: Lumbar puncture. The procedure was performed under sterile conditions. Cerebrospinal fluid was collected and sent for analysis.",
    "Patient presents with symptoms of GERD, including heartburn and regurgitation. Plan is to start patient on a proton pump inhibitor (PPI) and recommend lifestyle modifications.",
    "Procedure: Office hysteroscopy. The uterine cavity was visualized and appeared normal. No polyps or fibroids were seen.",
    "Patient presents with a persistent cough. A chest x-ray was ordered, which showed evidence of bronchitis. No signs of pneumonia.",
    "Procedure: Fine needle aspiration of a thyroid nodule. The procedure was performed under ultrasound guidance. Samples were sent for cytology.",
    # Summaries 31-40
    "Patient is a diabetic presenting for routine foot care. Physical exam shows no signs of ulceration or infection. Patient was educated on proper foot hygiene.",
    "Procedure: Colposcopy with cervical biopsy. The procedure was performed due to an abnormal Pap smear. Biopsies were taken from the acetowhite areas.",
    "Patient presents with knee pain. An MRI of the right knee was performed, which showed a medial meniscus tear.",
    "Procedure: Myringotomy with tube insertion. A small incision was made in the tympanic membrane and a pressure equalization tube was placed.",
    "Patient presents with anxiety. A discussion was held about treatment options, including therapy and medication. Patient agreed to start an SSRI.",
    "Procedure: Incision and drainage of an abscess on the lower back. A significant amount of purulent material was drained. The wound was packed with iodoform gauze.",
    "Patient presents with symptoms of a urinary tract infection (UTI). A urine sample was collected and showed evidence of infection. Antibiotics were prescribed.",
    "Procedure: Shoulder arthroscopy with rotator cuff repair. A tear in the supraspinatus tendon was identified and repaired using suture anchors.",
    "Patient presents for a well-child check. Growth and development are on track. All vaccinations are up to date.",
    "Procedure: Esophagogastroduodenoscopy (EGD). Findings include mild gastritis and a small hiatal hernia. Biopsies were taken.",
    # Summaries 41-50
    "Patient presents with low back pain. Physical exam is consistent with muscle strain. Plan is to prescribe NSAIDs and recommend physical therapy.",
    "Procedure: Carpal tunnel release. The transverse carpal ligament was incised to relieve pressure on the median nerve.",
    "Patient presents with a skin rash. Physical exam suggests contact dermatitis. A topical steroid cream was prescribed.",
    "Procedure: Closed reduction of a distal radius fracture. The fracture was successfully reduced and a cast was applied.",
    "Patient presents with symptoms of depression. The PHQ-9 score was elevated. A plan was made to start psychotherapy and monitor symptoms.",
    "Procedure: Cataract extraction with intraocular lens implantation. The procedure was successful and the patient's vision is expected to improve.",
    "Patient presents with a sore throat. A rapid strep test was positive. Penicillin was prescribed.",
    "Procedure: Coronary angiography. Findings show significant stenosis in the left anterior descending (LAD) artery. Plan is for percutaneous coronary intervention (PCI).",
    "Patient presents for medication refill for hyperlipidemia. Recent lab work shows LDL cholesterol is at goal. Current statin therapy will be continued.",
    "Procedure: Knee arthrocentesis. Synovial fluid was aspirated from the knee joint to relieve swelling and for analysis."
]

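# Add the generated summaries as a new column in our sample DataFrame.
# Fail early with a clear message if the list and the sample are out of step.
if len(generated_summaries) != len(df_sample):
    raise ValueError(
        f"Expected {len(df_sample)} summaries, got {len(generated_summaries)}."
    )
df_sample['summary'] = generated_summaries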

# --- Save the Final, AI-Ready File ---
OUTPUT_FILE_TRANSCRIPTIONS = "Transcriptions_MVP_Processed.csv"
df_sample.to_csv(OUTPUT_FILE_TRANSCRIPTIONS, index=False)

print(f"\n🎉 PHASE 2 COMPLETE! The final, AI-ready transcription dataset has been saved as '{OUTPUT_FILE_TRANSCRIPTIONS}'.")
print("\nHere is a final sample of the data, including the new 'summary' column:")
print(df_sample[['cleaned_transcription', 'summary']].head(2))


🎉 PHASE 2 COMPLETE! The final, AI-ready transcription dataset has been saved as 'Transcriptions_MVP_Processed.csv'.

Here is a final sample of the data, including the new 'summary' column:
                               cleaned_transcription  \
0  ,  This 23-year-old white female presents with...   
1  , He has difficulty climbing stairs, difficult...   

                                             summary  
0  Patient: 23-year-old female with worsening all...  
1  Patient presents for a consultation regarding ...  


In [4]:
# This line shows which file is needed
TRANSCRIPTIONS_FILE = '/content/mtsamples.csv'

In [7]:
# --- PREREQUISITE CODE: RECREATE df_sample ---

import pandas as pd
import re

# PART 1: LOAD AND CLEAN THE TRANSCRIPTIONS DATA
print("Step 1: Loading and cleaning the transcriptions file...")
TRANSCRIPTIONS_FILE = '/content/mtsamples.csv'
try:
    df_trans = pd.read_csv(TRANSCRIPTIONS_FILE)
except FileNotFoundError:
    print(f"🛑 Error: Make sure '{TRANSCRIPTIONS_FILE}' is uploaded to your Colab environment.")
    raise

# Drop unnecessary columns, nulls, and duplicates
df_clean = df_trans.drop(columns=['Unnamed: 0'], errors='ignore')
df_clean.dropna(subset=['transcription'], inplace=True)
df_clean.drop_duplicates(subset=['transcription'], inplace=True)
df_final = df_clean[['medical_specialty', 'description', 'transcription']].copy()
print("✅ Initial cleaning complete.")


# PART 2: NORMALIZE THE TEXT
print("\nStep 2: Normalizing transcription text...")
def clean_transcription_text(text):
    """Cleans raw transcription text by removing headers and extra whitespace."""
    if not isinstance(text, str): return ""
    text = re.sub(r'[A-Z\s]+:', '', text)
    text = re.sub(r'\n+', ' ', text)
    text = text.strip()
    return text

df_final['cleaned_transcription'] = df_final['transcription'].apply(clean_transcription_text)
print("✅ Text normalization complete.")


# PART 3: CREATE THE df_sample VARIABLE (needed by the Phase 2.3 and Phase 3 cells)
print("\nStep 3: Creating the 'df_sample' DataFrame...")
SAMPLE_SIZE = 50
df_sample = df_final.head(SAMPLE_SIZE).copy()
print(f"✅ 'df_sample' has been successfully created with {len(df_sample)} rows.")

Step 1: Loading and cleaning the transcriptions file...
✅ Initial cleaning complete.

Step 2: Normalizing transcription text...
✅ Text normalization complete.

Step 3: Creating the 'df_sample' DataFrame...
✅ 'df_sample' has been successfully created with 50 rows.


# Phase 3.1: Secure Anonymization and Text Chunking for RAG

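This cell prepares the text for indexing. First, it defines a salted SHA-256 hash for pseudonymizing patient IDs; the "salt" (a secret key) is read from Colab's secrets manager, and the line that applies the hash to the merged Synthea data is left commented out. Then, it breaks down long transcriptions into smaller, overlapping "chunks" (750 characters each, with a 150-character overlap by default). This is critical for the AI to find specific details accurately.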
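The salted hash can be tried outside Colab with only the standard library. The sketch below mirrors the notebook's `hash_id` but takes the salt as an explicit parameter (an assumption made so the example runs without the secrets manager); the salt strings are placeholders, and a real salt should never be written into the notebook.

```python
import hashlib


def hash_id(patient_id: str, salt: str) -> str:
    """Salted SHA-256 of the ID, truncated to 16 hex characters as in the notebook."""
    return hashlib.sha256((str(patient_id) + salt).encode()).hexdigest()[:16]


# Placeholder values for illustration only; a real salt belongs in the secrets manager.
demo_id = "732e16fb-a1aa-b846-c6c2-c00bd4211445"
salt_a = "example-salt-a"
salt_b = "example-salt-b"

token_a = hash_id(demo_id, salt_a)
print(len(token_a))                          # 16 hex characters
print(token_a == hash_id(demo_id, salt_a))   # same ID and salt give the same token
print(token_a == hash_id(demo_id, salt_b))   # a different salt gives a different token
```

Because the mapping is deterministic for a given salt, the same patient always receives the same token, which keeps joins across tables possible. It also means anyone who obtains the salt can recompute tokens for known IDs, so the salt needs to stay secret.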

In [8]:
# --- PHASE 3.1: SECURE ANONYMIZATION AND TEXT CHUNKING ---
import pandas as pd
import hashlib
import os

# --- URGENT SECURITY STEP ---
# 1. In Google Colab, click the "Key" icon on the left panel.
# 2. Click "+ Add new secret".
# 3. For the name, enter: MEDIMINDER_SALT
# 4. For the value, enter a long, random secret string.
# 5. Make sure "Notebook access" is toggled ON.
# 6. Re-run this cell.
from google.colab import userdata
salt = userdata.get('MEDIMINDER_SALT')

if not salt:
    raise ValueError("CRITICAL: Secret 'MEDIMINDER_SALT' not found. Please add it in Colab's secrets panel.")

def hash_id(patient_id):
    """Creates a secure, non-reversible hash for the patient ID."""
    return hashlib.sha256((str(patient_id) + salt).encode()).hexdigest()[:16]

# Apply the secure hash to your merged Synthea data (if not already done)
# df_merged['PATIENT_ID_HASHED'] = df_merged['PATIENT_ID'].apply(hash_id)

# --- Text Chunking for RAG ---
def chunk_text(text, chunk_size=750, overlap=150):
    """Splits a long text into smaller, overlapping chunks."""
    if not isinstance(text, str): return []
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += (chunk_size - overlap)
    return chunks

# Create a new DataFrame with one row per chunk
# df_sample is the DataFrame with 50 transcriptions from your Phase 2
chunked_rows = []
for idx, row in df_sample.iterrows():
    text_chunks = chunk_text(row['cleaned_transcription'])
    for i, chunk in enumerate(text_chunks):
        chunked_rows.append({
            'original_doc_id': idx, # Links back to the original transcription
            'chunk_id': f"{idx}_{i}",
            'chunk_text': chunk
        })

df_chunked = pd.DataFrame(chunked_rows)
print(f"✅ Text chunking complete. Created {len(df_chunked)} chunks from {len(df_sample)} documents.")
print("Sample chunk:")
print(df_chunked.head())

✅ Text chunking complete. Created 210 chunks from 50 documents.
Sample chunk:
   original_doc_id chunk_id                                         chunk_text
0                0      0_0  ,  This 23-year-old white female presents with...
1                0      0_1  , , Her only medication currently is Ortho Tri...
2                0      0_2  ostril given for three weeks.  A prescription ...
3                1      1_0  , He has difficulty climbing stairs, difficult...
4                1      1_1   months ago.  He now smokes less than three ci...
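The chunk boundaries follow from simple arithmetic: each window starts `chunk_size - overlap` characters after the previous one. The sketch below restates the notebook's `chunk_text` with the same defaults and runs it on a synthetic 2,000-character string. The `overlap >= chunk_size` guard is an addition that is not in the original; without it, such arguments would keep the window from ever advancing.

```python
def chunk_text(text, chunk_size=750, overlap=150):
    """Character-based sliding window, same logic as the notebook's chunk_text."""
    if not isinstance(text, str):
        return []
    if overlap >= chunk_size:
        # Added guard (not in the original): otherwise the window never advances.
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "x" * 2000                      # stand-in for a 2,000-character transcription
chunks = chunk_text(text)
print([len(c) for c in chunks])        # windows start at 0, 600, 1200, 1800

# A tiny case makes the overlap visible: each chunk repeats the last 2 characters.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Note that the split is by character count, so a chunk can begin or end in the middle of a word, as the `ostril given for three weeks` row in the output above shows.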


# Phase 3.2: Create Production-Grade FAISS Index with Metadata

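This cell converts the text chunks into numerical vectors (embeddings) with the all-MiniLM-L6-v2 model. It then builds a FAISS index, which is like a high-speed search engine for these vectors. The vectors are L2-normalized and stored in a flat inner-product index, so search scores are cosine similarities. Crucially, each vector's ID is its row position in the chunk table, so a search hit can always be mapped back to its original document and chunk ID.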
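To show what a flat inner-product search over normalized vectors computes, here is a minimal NumPy sketch. It uses NumPy as a stand-in and is not FAISS itself; the three-dimensional vectors are made up, and `l2_normalize` is a hypothetical out-of-place counterpart to `faiss.normalize_L2`, which normalizes in place.

```python
import numpy as np

# Made-up vectors standing in for chunk embeddings.
chunks = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 3.0, 0.0]], dtype=np.float32)
query = np.array([[2.0, 1.0, 0.0]], dtype=np.float32)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


chunks_n = l2_normalize(chunks)
query_n = l2_normalize(query)

# On unit vectors the inner product equals the cosine similarity.
scores = query_n @ chunks_n.T            # shape (1, 3)
k = 2
top_k = np.argsort(-scores[0])[:k]       # row positions, highest score first
print(top_k, scores[0][top_k])
```

The row positions in `top_k` play the same role as the integer IDs passed to `add_with_ids`: they index straight back into the chunk table.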

In [12]:
!pip install sentence-transformers faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [16]:
# --- PHASE 3.2: CREATE PRODUCTION-GRADE FAISS INDEX ---
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# 1. Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Get the list of chunked texts to be indexed
# Ensure df_chunked exists from the previous step
texts_to_embed = df_chunked['chunk_text'].tolist()

# 3. Create the embeddings
print("Creating vector embeddings for all text chunks...")
embeddings = model.encode(texts_to_embed, batch_size=64, show_progress_bar=True, convert_to_numpy=True)
# FAISS expects float32 input; pass the NumPy dtype itself, not a string.
embeddings = embeddings.astype(np.float32)

# 4. Normalize the vectors
faiss.normalize_L2(embeddings)

# 5. Build the FAISS index with ID mapping
index_dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(index_dimension)
index_with_ids = faiss.IndexIDMap(index)
# FAISS IDs must be int64; here each ID is the chunk's row position in df_chunked.
ids = np.arange(len(df_chunked)).astype(np.int64)
index_with_ids.add_with_ids(embeddings, ids)

print(f"\n✅ FAISS index created successfully with {index_with_ids.ntotal} vectors.")

# 6. Save the final artifacts
FAISS_INDEX_FILE = "transcriptions_index.faiss"
CHUNKED_DATA_FILE = "chunked_transcriptions.parquet"

faiss.write_index(index_with_ids, FAISS_INDEX_FILE)
df_chunked.to_parquet(CHUNKED_DATA_FILE, index=False)

print(f"\nArtifacts saved to your Colab environment:")
print(f"- {FAISS_INDEX_FILE}")
print(f"- {CHUNKED_DATA_FILE}")

Creating vector embeddings for all text chunks...

✅ FAISS index created successfully with 210 vectors.

Artifacts saved to your Colab environment:
- transcriptions_index.faiss
- chunked_transcriptions.parquet


# Phase 3.3: Final RAG Validation Test

This cell performs the final "RAG validation" requested. It simulates a user query, uses the new index to retrieve the three most relevant text chunks, and injects them into the official prompt template. This checks that the saved index and chunk table load correctly and work together end to end.
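The injection step is plain string substitution. The sketch below uses a shortened, hypothetical template and made-up chunk texts to show why `str.replace` is a reasonable fit for a `{{retrieved_context}}` token: `str.format` treats doubled braces as an escape for literal braces and would leave the placeholder unfilled.

```python
# Hypothetical miniature of the prompt template; the wording is shortened.
template = "Context from the visit note:\n{{retrieved_context}}\n\nWrite a short summary."

# Made-up stand-ins for the chunk texts returned by the index search.
retrieved = ["Nasal mucosa was erythematous and swollen.", "Samples of Nasonex were given."]
context = "\n\n".join(retrieved)

# str.replace treats the double-brace token as plain text and swaps it out.
prompt = template.replace("{{retrieved_context}}", context)
print(prompt)

# str.format does not fill it: doubled braces come out as single literal braces.
unfilled = template.format(retrieved_context=context)
print(unfilled)
```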

In [None]:
# --- PHASE 3.3: FINAL RAG VALIDATION TEST ---
import faiss
import numpy as np  # needed for np.float32 when encoding the query below
import pandas as pd
from sentence_transformers import SentenceTransformer

# 1. Load the artifacts you just created
model = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.read_index("transcriptions_index.faiss")
df_chunked = pd.read_parquet("chunked_transcriptions.parquet")

# 2. Define the RAG Prompt Template 
rag_prompt_template = """
You are a compassionate and professional healthcare provider.
You are summarizing a patient’s recent doctor visit.

Your goal:
- Be warm, kind, and reassuring.
- Use clear, simple language suitable for a patient.
- Greet the patient at the start (“Hello there,”).
- Summarize key findings, recommendations, and next steps.
- End with a caring reminder or motivational note.

Context from doctor’s note or visit summary:
{{retrieved_context}}

Now write a short, empathetic summary message for the patient.
"""

# 3. Simulate a user query and retrieve context
query = "What did the doctor say about my allergies and Nasonex?"
query_embedding = model.encode([query], convert_to_numpy=True).astype(np.float32)
faiss.normalize_L2(query_embedding)

k = 3 # Retrieve the top 3 most relevant chunks
distances, indices = index.search(query_embedding, k)
retrieved_chunks = df_chunked.iloc[indices[0]]
retrieved_context = "\n\n".join(retrieved_chunks['chunk_text'].tolist())

# 4. Inject the context into the prompt
final_prompt_for_llm = rag_prompt_template.replace("{{retrieved_context}}", retrieved_context)

print("--- RAG VALIDATION COMPLETE ---")
print("\nThis test proves that your saved artifacts work correctly.")
print("\nFinal Prompt Ready for the LLM:")
print("-" * 30)
print(final_prompt_for_llm)

--- RAG VALIDATION COMPLETE ---

This test proves that your saved artifacts work correctly.

Final Prompt Ready for the LLM:
------------------------------

You are a compassionate and professional healthcare provider.
You are summarizing a patient’s recent doctor visit.

Your goal:
- Be warm, kind, and reassuring.
- Use clear, simple language suitable for a patient.
- Greet the patient at the start (“Hello there,”).
- Summarize key findings, recommendations, and next steps.
- End with a caring reminder or motivational note.

Context from doctor’s note or visit summary:
, , Her only medication currently is Ortho Tri-Cyclen and the Allegra., , She has no known medicine allergies.,,Vitals:  Weight was 130 pounds and blood pressure 124/78.,  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Supple without adenopathy.,Lungs:  Clear.,,  Allergic rhinitis.,,1.  She will try Zyrtec ins