# Women's Health Data Preprocessing - Part 2

## Overview

This notebook is the second in a series on building a dataset for training an LLM to suggest better questions for women's health consultations. It preprocesses the data collected in Part 1.

### Objectives
- Load the data collected in Part 1
- Clean and preprocess the data
- Expand the dismissed questions dataset
- Add demographic context to the data
- Analyze question characteristics
- Implement save points throughout the process

### Why This Matters
Careful preprocessing is essential for a high-quality dataset that can effectively train an LLM. Expanding the dismissed questions dataset and adding demographic context makes the dataset more comprehensive and balanced, so it covers a wider range of women's health concerns across demographic groups.

## 1. Environment Setup

First, let's set up our environment by importing necessary libraries and loading the data collected in Part 1.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import json
import re
import string
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
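from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display  # used by verify_dataframe; explicit import keeps the helper self-contained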

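# Set up plotting style
# (the 'seaborn-whitegrid' Matplotlib style was deprecated in Matplotlib 3.6
# and later removed, so we let seaborn apply its whitegrid theme directly)
sns.set_theme(style="whitegrid")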

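# Download NLTK resources
# (recent NLTK releases load the tokenizer data from 'punkt_tab';
# older releases use 'punkt', so we request both)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')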

# Display versions for reproducibility
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"NLTK version: {nltk.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Define directory structure
data_dir = 'womens_health_data'
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')
checkpoint_dir = os.path.join(data_dir, 'checkpoints')
expanded_dir = os.path.join(data_dir, 'expanded')
figures_dir = os.path.join(data_dir, 'figures')

# Create directories if they don't exist
for directory in [data_dir, raw_dir, processed_dir, checkpoint_dir, expanded_dir, figures_dir]:
    os.makedirs(directory, exist_ok=True)
    print(f"Directory ready: {directory}")

## 2. Helper Functions

Let's create some helper functions for saving and loading checkpoints, as well as for text preprocessing.

In [None]:
def save_checkpoint(df, name):
    """
    Save a dataframe as a checkpoint CSV file.
    
    Parameters:
    - df: pandas DataFrame to save
    - name: name of the checkpoint (without extension)
    
    Returns:
    - path: path to the saved file
    """
    # Create the full path with timestamp to avoid overwriting
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.csv"
    path = os.path.join(checkpoint_dir, filename)
    
    # Save the dataframe
    df.to_csv(path, index=False)
    print(f"Checkpoint saved: {path}")
    
    # Also save a version with a fixed name for easy loading
    fixed_path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    df.to_csv(fixed_path, index=False)
    print(f"Latest version saved: {fixed_path}")
    
    return path

def load_checkpoint(name):
    """
    Load the latest checkpoint for a given name.
    
    Parameters:
    - name: name of the checkpoint (without extension)
    
    Returns:
    - df: loaded DataFrame or None if file doesn't exist
    """
    path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    
    if os.path.exists(path) and os.path.getsize(path) > 0:
        try:
            df = pd.read_csv(path)
            print(f"Checkpoint loaded: {path}")
            print(f"Shape: {df.shape}")
            return df
        except pd.errors.EmptyDataError:
            print(f"Warning: Checkpoint file exists but is empty: {path}")
            return None
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            return None
    else:
        print(f"Checkpoint not found or empty: {path}")
        return None

def verify_dataframe(df, name):
    """
    Verify a dataframe by displaying basic information.
    
    Parameters:
    - df: pandas DataFrame to verify
    - name: name of the dataframe for display purposes
    """
    print(f"\n--- {name} Verification ---")
    print(f"Shape: {df.shape}")
    print("\nFirst 5 rows:")
    display(df.head())
    print("\nData types:")
    display(df.dtypes)
    print("\nMissing values:")
    missing = df.isnull().sum()
    display(missing[missing > 0] if (missing > 0).any() else "No missing values")
    print("\nBasic statistics:")
    display(df.describe(include='all').T)
    print("----------------------------\n")

# Build the stopword set once instead of on every call
ENGLISH_STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Preprocess text by converting to lowercase, removing punctuation and stopwords.
    
    Parameters:
    - text: text to preprocess
    
    Returns:
    - processed_text: preprocessed text ("" for non-string input)
    """
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in ENGLISH_STOP_WORDS]
    
    # Join tokens back into a string
    processed_text = ' '.join(tokens)
    
    return processed_text

def calculate_text_similarity(text1, text2):
    """
    Calculate cosine similarity between two texts.
    
    Parameters:
    - text1: first text
    - text2: second text
    
    Returns:
    - similarity: cosine similarity score
    """
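    if not isinstance(text1, str) or not isinstance(text2, str):
        return 0.0
    
    # Empty input (e.g. a question made up entirely of stopwords)
    # has no terms to compare
    if not text1.strip() or not text2.strip():
        return 0.0
    
    # Build bag-of-words count vectors for the two texts
    try:
        vectors = CountVectorizer().fit_transform([text1, text2])
    except ValueError:
        # Raised when no token survives the vectorizer's default tokenization
        return 0.0
    
    # Cosine similarity between the first and second row
    similarity = cosine_similarity(vectors)[0, 1]
    
    return float(similarity)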
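
To make the similarity score concrete, here is a minimal, library-free sketch of the same idea: cosine similarity over raw word counts. It is an illustration only. `bow_cosine` is a hypothetical helper that splits on whitespace, whereas `CountVectorizer` applies its own default tokenization (for instance, it ignores single-character tokens), so the two can give slightly different scores.

```python
import math
from collections import Counter


def bow_cosine(text1, text2):
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Dot product over the words the two texts share
    shared = counts1.keys() & counts2.keys()
    dot = sum(counts1[word] * counts2[word] for word in shared)

    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    # An empty text has a zero-length vector; define the similarity as 0.0
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


score = bow_cosine("trouble sleeping", "trouble sleeping night sweats")
print(round(score, 3))  # 0.707, i.e. 2 / (sqrt(2) * sqrt(4))
```

A score of 1.0 means the two texts use the same words in the same proportions, and 0.0 means they share no words at all.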

## 3. Load Data from Part 1

Let's load the data we collected in Part 1 from the processed directory.

In [None]:
# Load clinical trials data
clinical_trials_path = os.path.join(processed_dir, 'clinical_trials.csv')
if os.path.exists(clinical_trials_path):
    clinical_trials_df = pd.read_csv(clinical_trials_path)
    print(f"Loaded clinical trials data: {clinical_trials_df.shape}")
else:
    print("Clinical trials data not found. Please run Part 1 first.")
    clinical_trials_df = None

# Load PubMed data
pubmed_path = os.path.join(processed_dir, 'pubmed.csv')
if os.path.exists(pubmed_path):
    pubmed_df = pd.read_csv(pubmed_path)
    print(f"Loaded PubMed data: {pubmed_df.shape}")
else:
    print("PubMed data not found. Please run Part 1 first.")
    pubmed_df = None

# Load dismissed questions data
dismissed_questions_path = os.path.join(processed_dir, 'dismissed_questions.csv')
if os.path.exists(dismissed_questions_path):
    dismissed_questions_df = pd.read_csv(dismissed_questions_path)
    print(f"Loaded dismissed questions data: {dismissed_questions_df.shape}")
else:
    print("Dismissed questions data not found. Please run Part 1 first.")
    dismissed_questions_df = None

# Load medical terminology data
medical_terminology_path = os.path.join(processed_dir, 'medical_terminology.csv')
if os.path.exists(medical_terminology_path):
    medical_terminology_df = pd.read_csv(medical_terminology_path)
    print(f"Loaded medical terminology data: {medical_terminology_df.shape}")
else:
    print("Medical terminology data not found. Please run Part 1 first.")
    medical_terminology_df = None
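
Since every later step depends on the Part 1 outputs, it can help to check for all four files up front and report everything that is missing in one message. The sketch below is one way to do that with the standard library. `find_missing_inputs` is a hypothetical helper, and the demo runs in a temporary directory with placeholder file contents so that it works anywhere.

```python
import os
import tempfile


def find_missing_inputs(directory, filenames):
    """Return the filenames that are absent or empty in `directory`."""
    missing = []
    for filename in filenames:
        path = os.path.join(directory, filename)
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            missing.append(filename)
    return missing


# The four files this notebook expects from Part 1
expected = ["clinical_trials.csv", "pubmed.csv",
            "dismissed_questions.csv", "medical_terminology.csv"]

# Demo in a throwaway directory containing only one of the files
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "pubmed.csv"), "w") as handle:
        handle.write("col_a,col_b\n")  # placeholder content
    missing_files = find_missing_inputs(tmp_dir, expected)

print(missing_files)
```

In the notebook itself, the call would be `find_missing_inputs(processed_dir, expected)`.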

In [None]:
# Verify the loaded data
if clinical_trials_df is not None:
    verify_dataframe(clinical_trials_df, "Clinical Trials")
    
if pubmed_df is not None:
    verify_dataframe(pubmed_df, "PubMed Publications")
    
if dismissed_questions_df is not None:
    verify_dataframe(dismissed_questions_df, "Dismissed Questions")
    
if medical_terminology_df is not None:
    verify_dataframe(medical_terminology_df, "Medical Terminology")

## 4. Analyze Question Characteristics

Let's compare the dismissed questions with their better counterparts on three characteristics: length, lexical similarity, and the number of medical terms each contains.

In [None]:
# Check if we already have the analyzed questions data
analyzed_questions_df = load_checkpoint("analyzed_questions")

# If not, analyze the questions
if analyzed_questions_df is None and dismissed_questions_df is not None:
    # Create a copy of the dismissed questions dataframe
    analyzed_questions_df = dismissed_questions_df.copy()
    
    # Calculate the length of each question
    analyzed_questions_df['DismissedQuestion_Length'] = analyzed_questions_df['DismissedQuestion'].str.len()
    analyzed_questions_df['BetterQuestion_Length'] = analyzed_questions_df['BetterQuestion'].str.len()
    
    # Calculate the ratio of better question length to dismissed question length
    # (a zero-length dismissed question becomes NaN instead of producing inf)
    analyzed_questions_df['Question_Length_Ratio'] = (
        analyzed_questions_df['BetterQuestion_Length']
        / analyzed_questions_df['DismissedQuestion_Length'].replace(0, np.nan)
    )
    
    # Preprocess the questions for text analysis
    analyzed_questions_df['DismissedQuestion_Processed'] = analyzed_questions_df['DismissedQuestion'].apply(preprocess_text)
    analyzed_questions_df['BetterQuestion_Processed'] = analyzed_questions_df['BetterQuestion'].apply(preprocess_text)
    
    # Calculate the similarity between dismissed and better questions
    analyzed_questions_df['Question_Similarity'] = analyzed_questions_df.apply(
        lambda row: calculate_text_similarity(row['DismissedQuestion_Processed'], row['BetterQuestion_Processed']), 
        axis=1
    )
    
    # Extract medical terms from the questions
    # We'll use the medical terminology dataframe to identify medical terms
    if medical_terminology_df is not None:
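        # Drop missing, blank, and duplicate terms: a NaN would break the
        # string matching below, and an empty string would match every question
        medical_terms = [
            term for term in
            medical_terminology_df['Term'].dropna().astype(str).str.strip().str.lower().unique()
            if term
        ]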
        
        # Function to count medical terms in a text
        def count_medical_terms(text, terms):
            if not isinstance(text, str):
                return 0
            text_lower = text.lower()
            count = sum(1 for term in terms if term.lower() in text_lower)
            return count
        
        # Count medical terms in each question
        analyzed_questions_df['DismissedQuestion_MedicalTerms'] = analyzed_questions_df['DismissedQuestion'].apply(
            lambda x: count_medical_terms(x, medical_terms)
        )
        analyzed_questions_df['BetterQuestion_MedicalTerms'] = analyzed_questions_df['BetterQuestion'].apply(
            lambda x: count_medical_terms(x, medical_terms)
        )
        
        # Calculate the difference in medical terms
        analyzed_questions_df['MedicalTerms_Difference'] = analyzed_questions_df['BetterQuestion_MedicalTerms'] - analyzed_questions_df['DismissedQuestion_MedicalTerms']
    
    # Save checkpoint
    save_checkpoint(analyzed_questions_df, "analyzed_questions")
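elif analyzed_questions_df is not None:
    print("Using existing analyzed questions data")
else:
    print("Dismissed questions data is unavailable, so there is nothing to analyze. Please run Part 1 first.")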
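
One caveat about the term counting above: plain substring matching also fires when a short term occurs inside a longer word. The sketch below contrasts it with whole-word matching using regular-expression word boundaries. The term list and the question are made up for illustration, and `count_terms_whole_word` is a hypothetical alternative, not part of the pipeline.

```python
import re


def count_terms_substring(text, terms):
    """Count terms by substring matching, as in the cell above."""
    text_lower = text.lower()
    return sum(1 for term in terms if term in text_lower)


def count_terms_whole_word(text, terms):
    """Count terms that appear as whole words or phrases."""
    text_lower = text.lower()
    count = 0
    for term in terms:
        # \b anchors the match at word boundaries; re.escape guards
        # against terms that contain regex metacharacters
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text_lower):
            count += 1
    return count


terms = ["ms", "iud", "hot flashes"]  # illustrative terms, not from the dataset
question = "My symptoms include hot flashes and fatigue."

print(count_terms_substring(question, terms))   # 2: "ms" matches inside "symptoms"
print(count_terms_whole_word(question, terms))  # 1: only "hot flashes"
```

Word boundaries assume that each term starts and ends with a letter or digit; terms with leading or trailing punctuation would need a different pattern.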

In [None]:
# Verify the analyzed questions data
if analyzed_questions_df is not None:
    verify_dataframe(analyzed_questions_df, "Analyzed Questions")
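
When reading the `*_Processed` columns, keep in mind that the order of the preprocessing steps matters. Removing punctuation before filtering stopwords turns a contraction such as "I'm" into `im`, which a stopword list typically does not contain. The sketch below shows the effect with a tiny stand-in stopword set and whitespace tokenization. Both are assumptions made for the demo, and the function names are hypothetical.

```python
import string

# Tiny stand-in for a stopword list (an assumption for this demo;
# the notebook itself uses NLTK's English list)
DEMO_STOPWORDS = {"i", "i'm", "my", "the", "a"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_then_filter(text):
    """Remove punctuation first, then drop stopwords."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return [token for token in tokens if token not in DEMO_STOPWORDS]


def filter_then_strip(text):
    """Trim edge punctuation, drop stopwords, then remove remaining punctuation."""
    tokens = [token.strip(string.punctuation) for token in text.lower().split()]
    kept = [token for token in tokens if token and token not in DEMO_STOPWORDS]
    return [token.translate(_PUNCT_TABLE) for token in kept]


question = "I'm losing my hair."
print(strip_then_filter(question))  # ['im', 'losing', 'hair']
print(filter_then_strip(question))  # ['losing', 'hair']
```

Whether the leftover `im` matters depends on the downstream use. For the similarity score, it adds a shared token to any pair of questions that both start with "I'm".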

In [None]:
# Visualize the question length comparison
if analyzed_questions_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Create a bar chart
    avg_dismissed_length = analyzed_questions_df['DismissedQuestion_Length'].mean()
    avg_better_length = analyzed_questions_df['BetterQuestion_Length'].mean()
    avg_ratio = analyzed_questions_df['Question_Length_Ratio'].mean()
    
    bars = plt.bar(['Dismissed Question', 'Better Question'], 
                   [avg_dismissed_length, avg_better_length],
                   color=['#ff7f0e', '#2ca02c'])
    
    plt.title('Average Question Length Comparison', fontsize=16)
    plt.ylabel('Average Length (characters)', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 5,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=12)
    
    # Add ratio annotation
    plt.annotate(f'On average, a better question is {avg_ratio:.1f}x the length of its dismissed counterpart',
                xy=(0.5, 0.9), xycoords='axes fraction',
                ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
                fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'question_length_comparison.png'), dpi=300)
    plt.show()
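
Note that the annotation reports the mean of the per-question ratios, which is generally not the same as the ratio of the two bar heights. A small worked example with made-up lengths shows how far apart the two can be:

```python
from statistics import mean

# Hypothetical (dismissed_length, better_length) pairs, in characters
lengths = [(20, 200), (40, 200), (100, 200)]

# Mean of per-question ratios: (10 + 5 + 2) / 3
mean_of_ratios = mean(better / dismissed for dismissed, better in lengths)

# Ratio of the two averages: 200 / (160 / 3)
ratio_of_means = mean(b for _, b in lengths) / mean(d for d, _ in lengths)

print(f"Mean of per-question ratios: {mean_of_ratios:.2f}")   # 5.67
print(f"Ratio of the two bar heights: {ratio_of_means:.2f}")  # 3.75
```

Very short dismissed questions pull the mean of ratios upward, so it is worth stating which of the two figures a chart label refers to.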

In [None]:
# Visualize the medical terms comparison
if analyzed_questions_df is not None and 'MedicalTerms_Difference' in analyzed_questions_df.columns:
    plt.figure(figsize=(12, 6))
    
    # Create a bar chart
    avg_dismissed_terms = analyzed_questions_df['DismissedQuestion_MedicalTerms'].mean()
    avg_better_terms = analyzed_questions_df['BetterQuestion_MedicalTerms'].mean()
    avg_diff = analyzed_questions_df['MedicalTerms_Difference'].mean()
    
    bars = plt.bar(['Dismissed Question', 'Better Question'], 
                   [avg_dismissed_terms, avg_better_terms],
                   color=['#ff7f0e', '#2ca02c'])
    
    plt.title('Average Medical Terms Comparison', fontsize=16)
    plt.ylabel('Average Number of Medical Terms', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=12)
    
    # Add difference annotation
    plt.annotate(f'Better questions contain on average {avg_diff:.1f} more medical terms',
                xy=(0.5, 0.9), xycoords='axes fraction',
                ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
                fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'medical_terms_comparison.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the question similarity distribution
if analyzed_questions_df is not None and 'Question_Similarity' in analyzed_questions_df.columns:
    plt.figure(figsize=(10, 6))
    
    # Create a histogram
    plt.hist(analyzed_questions_df['Question_Similarity'], bins=10, color='#1f77b4', alpha=0.7)
    
    plt.title('Distribution of Similarity Between Dismissed and Better Questions', fontsize=16)
    plt.xlabel('Cosine Similarity', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add mean line
    mean_similarity = analyzed_questions_df['Question_Similarity'].mean()
    plt.axvline(mean_similarity, color='red', linestyle='--', linewidth=2)
    plt.text(mean_similarity + 0.02, plt.ylim()[1] * 0.9, 
             f'Mean: {mean_similarity:.2f}', 
             color='red', fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'question_similarity_histogram.png'), dpi=300)
    plt.show()

## 5. Expand Dismissed Questions Dataset

Now let's expand the dismissed questions dataset to include more examples across different categories.

In [None]:
# Check if we already have the expanded questions data
expanded_questions_df = load_checkpoint("expanded_questions")

# If not, expand the dataset
if expanded_questions_df is None and dismissed_questions_df is not None:
    # Start with the original dismissed questions
    expanded_questions_df = dismissed_questions_df.copy()
    
    # Add more examples to cover additional conditions and categories
    additional_questions = [
        {
            "DismissedQuestion": "I'm having trouble getting pregnant.",
            "BetterQuestion": "My partner and I have been trying to conceive for 14 months with regular unprotected intercourse. I'm 32 years old and have regular periods. I've tracked my ovulation using basal body temperature and ovulation predictor kits. Should we be evaluated for fertility issues, and what specific tests would you recommend for both of us?",
            "Condition": "Infertility",
            "Category": "Reproductive Health",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.8,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "I'm bleeding between periods.",
            "BetterQuestion": "I've been experiencing spotting between periods for the past 3 months. The bleeding is light but lasts for 2-3 days and occurs midway between my regular periods. I'm 47 years old and have no other symptoms. Could this be perimenopause, fibroids, or something that requires further investigation?",
            "Condition": "Abnormal Uterine Bleeding",
            "Category": "Reproductive Health",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 1.9,
            "AgeGroup": "45-54",
            "RacialEthnicConsiderations": "Black/African American"
        },
        {
            "DismissedQuestion": "I'm having memory problems.",
            "BetterQuestion": "I'm 52 and have been experiencing increasing difficulty with short-term memory and word-finding over the past 6 months. It's affecting my work performance. I'm also having night sweats and irregular periods. Could these cognitive changes be related to perimenopause, or should I be concerned about early-onset dementia given my family history?",
            "Condition": "Perimenopausal Cognitive Changes",
            "Category": "Menopause/Aging",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.2,
            "AgeGroup": "45-54",
            "RacialEthnicConsiderations": "Asian"
        },
        {
            "DismissedQuestion": "I'm losing my hair.",
            "BetterQuestion": "I've noticed significant hair thinning at my crown and temples over the past 8 months. I'm 38 and have a regular menstrual cycle, but I've also been experiencing fatigue and cold intolerance. My mother had thyroid issues. Could my hair loss be related to a thyroid condition, female pattern hair loss, or another hormonal imbalance?",
            "Condition": "Female Pattern Hair Loss",
            "Category": "Autoimmune Conditions",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.5,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "Hispanic/Latina"
        },
        {
            "DismissedQuestion": "I feel anxious all the time.",
            "BetterQuestion": "I've been experiencing persistent anxiety with physical symptoms including racing heart, shortness of breath, and insomnia for the past 4 months. These symptoms worsen before my period and are interfering with my daily activities. I have a family history of anxiety disorders. Could this be generalized anxiety disorder, PMDD, or a hormonal imbalance? What diagnostic approach would you recommend?",
            "Condition": "Generalized Anxiety Disorder",
            "Category": "Mental Health",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.9,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "I have a lump in my breast.",
            "BetterQuestion": "I discovered a firm, pea-sized lump in my right breast near the armpit that doesn't move when touched. It wasn't there during my self-exam last month. I'm 42 with no family history of breast cancer, but my maternal aunt had ovarian cancer. The lump isn't painful but feels different from my normal breast tissue. How urgently should this be evaluated, and what specific imaging would you recommend?",
            "Condition": "Breast Mass",
            "Category": "Gynecological Cancers",
            "DismissalFrequency": "Low",
            "DiagnosisDelay": 0.8,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "Black/African American"
        },
        {
            "DismissedQuestion": "I'm having trouble sleeping.",
            "BetterQuestion": "I've been experiencing difficulty falling asleep and staying asleep for the past 3 months. I wake up 3-4 times per night, often with night sweats, and feel unrested in the morning. I'm 51 and my periods have become irregular. Could these sleep disturbances be related to perimenopause? What treatment options might help without increasing my risk of breast cancer, which runs in my family?",
            "Condition": "Perimenopausal Insomnia",
            "Category": "Menopause/Aging",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.3,
            "AgeGroup": "45-54",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "I'm having trouble with sex after having a baby.",
            "BetterQuestion": "I gave birth vaginally 6 months ago and am still experiencing pain during intercourse, along with vaginal dryness despite using lubricant. I'm breastfeeding and haven't had a period yet. The pain is sharp and occurs at the entrance of my vagina. I also notice pain when inserting tampons. Could this be related to hormonal changes from breastfeeding, pelvic floor issues, or inadequate healing from a small tear during delivery?",
            "Condition": "Postpartum Dyspareunia",
            "Category": "Pregnancy/Postpartum",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 4.1,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "Asian"
        },
        {
            "DismissedQuestion": "I'm having dizzy spells.",
            "BetterQuestion": "I've been experiencing recurrent episodes of dizziness and lightheadedness for the past 2 months, particularly when standing up quickly or after prolonged standing. I sometimes feel my heart racing and see spots before my eyes. I'm 29 with low blood pressure historically, and these episodes are worse during my period. Could this be POTS, anemia, or another condition that affects women more frequently?",
            "Condition": "Postural Orthostatic Tachycardia Syndrome",
            "Category": "Cardiovascular Health",
            "DismissalFrequency": "Very High",
            "DiagnosisDelay": 5.8,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "I have no interest in sex anymore.",
            "BetterQuestion": "I've experienced a complete loss of sexual desire for the past year that's causing distress in my relationship. I'm 37, in a stable relationship, and not taking any medications known to affect libido. I don't feel depressed but do feel fatigued. My periods are regular. Could this be female sexual interest/arousal disorder, a hormonal imbalance, or related to another underlying condition? What testing would you recommend?",
            "Condition": "Female Sexual Interest/Arousal Disorder",
            "Category": "Sexual Health",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 4.7,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "Hispanic/Latina"
        }
    ]
    
    # Add the additional questions to the dataframe
    additional_questions_df = pd.DataFrame(additional_questions)
    expanded_questions_df = pd.concat([expanded_questions_df, additional_questions_df], ignore_index=True)
    
    # Save checkpoint
    save_checkpoint(expanded_questions_df, "expanded_questions")
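elif expanded_questions_df is not None:
    print("Using existing expanded questions data")
else:
    print("Dismissed questions data is unavailable, so there is nothing to expand. Please run Part 1 first.")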
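
Because the additional examples are written by hand, a quick schema check before concatenating them can catch typos early. The sketch below is one possible check, not part of the original pipeline. The required keys and the allowed frequency labels are taken from the records and plots in this notebook, while `validate_record` and the demo records are hypothetical.

```python
REQUIRED_KEYS = {
    "DismissedQuestion", "BetterQuestion", "Condition", "Category",
    "DismissalFrequency", "DiagnosisDelay", "AgeGroup",
    "RacialEthnicConsiderations",
}
ALLOWED_FREQUENCIES = {"Very High", "High", "Medium", "Low"}


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    frequency = record.get("DismissalFrequency")
    if frequency is not None and frequency not in ALLOWED_FREQUENCIES:
        problems.append(f"unexpected DismissalFrequency: {frequency!r}")

    delay = record.get("DiagnosisDelay")
    if delay is not None and not (isinstance(delay, (int, float)) and delay >= 0):
        problems.append(f"DiagnosisDelay must be a non-negative number: {delay!r}")

    return problems


good_record = {
    "DismissedQuestion": "I'm losing my hair.",
    "BetterQuestion": "(long form of the question)",  # placeholder text
    "Condition": "Female Pattern Hair Loss",
    "Category": "Autoimmune Conditions",
    "DismissalFrequency": "Medium",
    "DiagnosisDelay": 2.5,
    "AgeGroup": "35-44",
    "RacialEthnicConsiderations": "Hispanic/Latina",
}
bad_record = {"DismissedQuestion": "I'm tired.", "DismissalFrequency": "medium"}

print(validate_record(good_record))  # []
for problem in validate_record(bad_record):
    print(problem)
```

Running `validate_record` over `additional_questions` and stopping on any non-empty result would keep malformed rows out of the checkpoint.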

In [None]:
# Verify the expanded questions data
if expanded_questions_df is not None:
    verify_dataframe(expanded_questions_df, "Expanded Questions")

In [None]:
# Visualize the category distribution in the expanded dataset
if expanded_questions_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Count the categories
    category_counts = expanded_questions_df['Category'].value_counts()
    
    # Create a bar chart
    bars = plt.bar(category_counts.index, category_counts.values, 
                   color=sns.color_palette("viridis", len(category_counts)))
    
    plt.title('Distribution of Categories in Expanded Dataset', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'expanded_categories_distribution.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the dismissal frequency distribution in the expanded dataset
if expanded_questions_df is not None:
    plt.figure(figsize=(10, 6))
    
    # Define the order for dismissal frequency
    order = ['Very High', 'High', 'Medium', 'Low']
    
    # Count the frequencies
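    # (fill_value=0 keeps absent levels at zero; a NaN count would break int(height) below)
    dismissal_counts = expanded_questions_df['DismissalFrequency'].value_counts().reindex(order, fill_value=0)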
    
    # Create a color map based on severity
    colors = ['#d62728', '#ff7f0e', '#ffbb78', '#2ca02c']
    
    # Create a bar chart
    bars = plt.bar(dismissal_counts.index, dismissal_counts.values, color=colors)
    
    plt.title('Distribution of Dismissal Frequencies in Expanded Dataset', fontsize=16)
    plt.xlabel('Dismissal Frequency', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'expanded_dismissal_frequencies.png'), dpi=300)
    plt.show()
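
When counts are aligned to a fixed category order, two edge cases are worth checking: categories with no rows, and labels that fall outside the expected list (which the alignment silently drops). The plain-Python sketch below mirrors that alignment. The label values are made up for illustration.

```python
from collections import Counter

order = ["Very High", "High", "Medium", "Low"]
labels = ["High", "Medium", "High", "Very High", "Moderate"]  # hypothetical values

counts = Counter(labels)

# Align to the fixed order, using 0 for categories that never occur
ordered_counts = [counts.get(label, 0) for label in order]

# Labels outside the expected list would vanish from the chart unnoticed
unexpected = sorted(set(counts) - set(order))

print(ordered_counts)  # [1, 2, 1, 0]
print(unexpected)      # ['Moderate']
```

Printing the unexpected labels before plotting makes spelling variants such as "Moderate" versus "Medium" visible instead of silently shrinking the totals.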

## 6. Add Demographic Context

Now let's add demographic context to the expanded questions dataset by attaching common comorbidities and demographic risk notes to each condition.

In [None]:
# Check if we already have the demographic context data
demographic_context_df = load_checkpoint("demographic_context")

# If not, add demographic context
if demographic_context_df is None and expanded_questions_df is not None:
    # Start with the expanded questions
    demographic_context_df = expanded_questions_df.copy()
    
    # Add demographic context columns
    demographic_context_df['Comorbidities'] = None
    demographic_context_df['ConditionDemographicRiskNotes'] = None
    
    # Update demographic context for each condition
    # NOTE: these entries are simplified, illustrative values; in a real scenario
    # they should be sourced from and verified against the medical literature
    demographic_updates = {
        'Endometriosis': {
            'Comorbidities': 'Irritable Bowel Syndrome; Migraine; Autoimmune Disorders',
            'ConditionDemographicRiskNotes': 'More commonly diagnosed in white women; often delayed diagnosis in women of color.'
        },
        'Chronic Fatigue Syndrome': {
            'Comorbidities': 'Fibromyalgia; Depression; Anxiety',
            'ConditionDemographicRiskNotes': 'More common in women aged 30-50; often misdiagnosed as depression.'
        },
        'Migraine': {
            'Comorbidities': 'Depression; Anxiety; Irritable Bowel Syndrome',
            'ConditionDemographicRiskNotes': '3x more common in women than men; often hormonal component.'
        },
        'Hashimoto\'s Thyroiditis': {
            'Comorbidities': 'Celiac Disease; Rheumatoid Arthritis; Vitiligo',
            'ConditionDemographicRiskNotes': '8x more common in women; peak onset age 30-50.'
        },
        'Rheumatoid Arthritis': {
            'Comorbidities': 'Cardiovascular Disease; Osteoporosis; Depression',
            'ConditionDemographicRiskNotes': '2-3x more common in women; Native American populations have higher prevalence.'
        },
        'Perimenopause': {
            'Comorbidities': 'Insomnia; Depression; Osteopenia',
            'ConditionDemographicRiskNotes': 'Average onset age 47; earlier onset in smokers and those with family history of early menopause.'
        },
        'Postpartum Depression': {
            'Comorbidities': 'Anxiety; PTSD; Thyroid Dysfunction',
            'ConditionDemographicRiskNotes': 'Higher risk in women with history of depression; socioeconomic factors influence diagnosis rates.'
        },
        'Vulvodynia': {
            'Comorbidities': 'Irritable Bowel Syndrome; Fibromyalgia; Interstitial Cystitis',
            'ConditionDemographicRiskNotes': 'Affects up to 16% of women; often undiagnosed in women of color.'
        },
        'Coronary Artery Disease': {
            'Comorbidities': 'Hypertension; Diabetes; Hyperlipidemia',
            'ConditionDemographicRiskNotes': 'Leading cause of death in women; different symptoms than men; Black women at higher risk.'
        },
        'Irritable Bowel Syndrome': {
            'Comorbidities': 'Anxiety; Depression; Fibromyalgia',
            'ConditionDemographicRiskNotes': '2x more common in women; often exacerbated by hormonal fluctuations.'
        },
        'Infertility': {
            'Comorbidities': 'PCOS; Endometriosis; Thyroid Disorders',
            'ConditionDemographicRiskNotes': 'Age is primary factor; Black and Hispanic women less likely to receive treatment.'
        },
        'Abnormal Uterine Bleeding': {
            'Comorbidities': 'Fibroids; Polyps; Endometriosis',
            'ConditionDemographicRiskNotes': 'Fibroids more common in Black women; endometrial cancer risk increases with age.'
        },
        'Perimenopausal Cognitive Changes': {
            'Comorbidities': 'Insomnia; Depression; Anxiety',
            'ConditionDemographicRiskNotes': 'Affects up to 60% of perimenopausal women; often dismissed as "normal aging".'
        },
        'Female Pattern Hair Loss': {
            'Comorbidities': 'PCOS; Thyroid Disorders; Iron Deficiency',
            'ConditionDemographicRiskNotes': 'Affects up to 50% of women by age 50; different pattern than male baldness.'
        },
        'Generalized Anxiety Disorder': {
            'Comorbidities': 'Depression; Insomnia; Irritable Bowel Syndrome',
            'ConditionDemographicRiskNotes': '2x more common in women; often comorbid with hormonal conditions.'
        },
        'Breast Mass': {
            'Comorbidities': 'Fibrocystic Breast Changes; Mastalgia; Nipple Discharge',
            'ConditionDemographicRiskNotes': 'Black women more likely to develop aggressive breast cancer at younger age; Ashkenazi Jewish women higher BRCA risk.'
        },
        'Perimenopausal Insomnia': {
            'Comorbidities': 'Hot Flashes; Anxiety; Depression',
            'ConditionDemographicRiskNotes': 'Affects up to 60% of perimenopausal women; often undertreated.'
        },
        'Postpartum Dyspareunia': {
            'Comorbidities': 'Pelvic Floor Dysfunction; Vaginal Dryness; Perineal Trauma',
            'ConditionDemographicRiskNotes': 'Affects up to 45% of women after childbirth; risk increases with instrumental delivery.'
        },
        'Postural Orthostatic Tachycardia Syndrome': {
            'Comorbidities': 'Ehlers-Danlos Syndrome; Chronic Fatigue Syndrome; Mast Cell Activation Syndrome',
            'ConditionDemographicRiskNotes': '5:1 female to male ratio; often triggered after pregnancy or viral illness.'
        },
        'Female Sexual Interest/Arousal Disorder': {
            'Comorbidities': 'Depression; Relationship Issues; Hormonal Imbalances',
            'ConditionDemographicRiskNotes': 'Affects 10% of women; increases with age but often not addressed by healthcare providers.'
        }
    }
    
    # Update the dataframe with demographic context
    for index, row in demographic_context_df.iterrows():
        condition = row['Condition']
        if condition in demographic_updates:
            demographic_context_df.at[index, 'Comorbidities'] = demographic_updates[condition]['Comorbidities']
            demographic_context_df.at[index, 'ConditionDemographicRiskNotes'] = demographic_updates[condition]['ConditionDemographicRiskNotes']
    
    # Save checkpoint
    save_checkpoint(demographic_context_df, "demographic_context")
else:
    print("Using existing demographic context data")
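
The update loop above walks the dataframe row by row with `iterrows`. A column-wise lookup with `Series.map` is a common alternative that scales better. The sketch below is one way this might look, not the notebook's method: the helper name `apply_updates`, the sample rows, and the placeholder note strings are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the demographic_updates lookup
demographic_updates = {
    'Breast Mass': {
        'Comorbidities': 'Fibrocystic Breast Changes; Mastalgia; Nipple Discharge',
        'ConditionDemographicRiskNotes': 'Example note A',
    },
    'Perimenopausal Insomnia': {
        'Comorbidities': 'Hot Flashes; Anxiety; Depression',
        'ConditionDemographicRiskNotes': 'Example note B',
    },
}

# Hypothetical dataframe; the middle condition has no entry in the lookup
df = pd.DataFrame({
    'Condition': ['Breast Mass', 'Unknown Condition', 'Perimenopausal Insomnia'],
    'Comorbidities': ['', 'Existing value', ''],
    'ConditionDemographicRiskNotes': ['', 'Existing note', ''],
})

def apply_updates(frame, updates, columns):
    """Return a copy of frame with columns overwritten where Condition is in updates."""
    result = frame.copy()
    for column in columns:
        lookup = {cond: fields[column] for cond, fields in updates.items()}
        mapped = result['Condition'].map(lookup)  # NaN where the condition is absent
        # Keep the existing value wherever the condition has no entry
        result[column] = mapped.fillna(result[column])
    return result

updated = apply_updates(
    df, demographic_updates, ['Comorbidities', 'ConditionDemographicRiskNotes']
)
print(updated)
```

Rows whose condition is missing from the lookup keep their existing values, which matches the behavior of the `if condition in demographic_updates` check in the loop.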

In [None]:
# Verify the demographic context data
if demographic_context_df is not None:
    verify_dataframe(demographic_context_df, "Demographic Context")

In [None]:
# Visualize the age group distribution
if demographic_context_df is not None:
    plt.figure(figsize=(10, 6))
    
    # Define the order for age groups
    order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
    
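    # Count the age groups; fill_value=0 keeps absent groups as 0 instead of NaN,
    # which would otherwise break the int() call in the data labels below
    age_counts = demographic_context_df['AgeGroup'].value_counts().reindex(order, fill_value=0)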
    
    # Create a bar chart
    bars = plt.bar(age_counts.index, age_counts.values, 
                   color=sns.color_palette("Blues_d", len(age_counts)))
    
    plt.title('Distribution of Age Groups', fontsize=16)
    plt.xlabel('Age Group', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'age_group_distribution.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the racial/ethnic considerations distribution
if demographic_context_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Count the racial/ethnic considerations
    racial_counts = demographic_context_df['RacialEthnicConsiderations'].value_counts()
    
    # Create a bar chart
    bars = plt.bar(racial_counts.index, racial_counts.values, 
                   color=sns.color_palette("Spectral", len(racial_counts)))
    
    plt.title('Distribution of Racial/Ethnic Considerations', fontsize=16)
    plt.xlabel('Racial/Ethnic Group', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'racial_ethnic_distribution.png'), dpi=300)
    plt.show()

## 7. Analyze Expanded Dataset

Now let's analyze the expanded dataset with demographic context.
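
The `Question_Similarity` column computed in the next cell depends on `calculate_text_similarity`, which is defined earlier in the notebook. As a rough illustration of what a bag-of-words cosine similarity computes, here is a minimal standard-library sketch. The name `bow_cosine_similarity` is hypothetical and is not the notebook's helper, which may instead rely on the imported `CountVectorizer` and `cosine_similarity` from scikit-learn.

```python
import math
from collections import Counter

def bow_cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-token count vectors (hypothetical helper)."""
    if not isinstance(text_a, str) or not isinstance(text_b, str):
        return 0.0
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    if not counts_a or not counts_b:
        return 0.0  # an empty text has no direction to compare
    # Counter returns 0 for tokens that are missing from counts_b
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Made-up example pair, for illustration only
dismissed = "is this pain normal"
better = "could this pelvic pain indicate endometriosis"
score = bow_cosine_similarity(dismissed, better)
print(f"similarity: {score:.2f}")
```

A score near 1 means the two questions share most of their vocabulary. A low score suggests the better question introduces substantially new wording.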

In [None]:
# Check if we already have the analyzed expanded data
analyzed_expanded_df = load_checkpoint("analyzed_expanded")

# If not, analyze the expanded dataset
if analyzed_expanded_df is None and demographic_context_df is not None:
    # Create a copy of the demographic context dataframe
    analyzed_expanded_df = demographic_context_df.copy()
    
    # Calculate the length of each question
    analyzed_expanded_df['DismissedQuestion_Length'] = analyzed_expanded_df['DismissedQuestion'].str.len()
    analyzed_expanded_df['BetterQuestion_Length'] = analyzed_expanded_df['BetterQuestion'].str.len()
    
    # Calculate the ratio of better question length to dismissed question length
    # (an empty dismissed question yields NaN rather than inf)
    analyzed_expanded_df['Question_Length_Ratio'] = (
        analyzed_expanded_df['BetterQuestion_Length']
        / analyzed_expanded_df['DismissedQuestion_Length'].replace(0, np.nan)
    )
    
    # Preprocess the questions for text analysis
    analyzed_expanded_df['DismissedQuestion_Processed'] = analyzed_expanded_df['DismissedQuestion'].apply(preprocess_text)
    analyzed_expanded_df['BetterQuestion_Processed'] = analyzed_expanded_df['BetterQuestion'].apply(preprocess_text)
    
    # Calculate the similarity between dismissed and better questions
    analyzed_expanded_df['Question_Similarity'] = analyzed_expanded_df.apply(
        lambda row: calculate_text_similarity(row['DismissedQuestion_Processed'], row['BetterQuestion_Processed']), 
        axis=1
    )
    
    # Extract medical terms from the questions
    # We'll use the medical terminology dataframe to identify medical terms
    if medical_terminology_df is not None:
        # Drop missing terms so every entry is a lowercase string
        medical_terms = medical_terminology_df['Term'].dropna().astype(str).str.lower().tolist()
        
        # Function to count medical terms in a text (case-insensitive substring match)
        def count_medical_terms(text, terms):
            if not isinstance(text, str):
                return 0
            text_lower = text.lower()
            # terms are already lowercased above
            return sum(1 for term in terms if term in text_lower)
        
        # Count medical terms in each question
        analyzed_expanded_df['DismissedQuestion_MedicalTerms'] = analyzed_expanded_df['DismissedQuestion'].apply(
            lambda x: count_medical_terms(x, medical_terms)
        )
        analyzed_expanded_df['BetterQuestion_MedicalTerms'] = analyzed_expanded_df['BetterQuestion'].apply(
            lambda x: count_medical_terms(x, medical_terms)
        )
        
        # Calculate the difference in medical terms
        analyzed_expanded_df['MedicalTerms_Difference'] = analyzed_expanded_df['BetterQuestion_MedicalTerms'] - analyzed_expanded_df['DismissedQuestion_MedicalTerms']
    
    # Save checkpoint
    save_checkpoint(analyzed_expanded_df, "analyzed_expanded")
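elif analyzed_expanded_df is not None:
    print("Using existing analyzed expanded data")
else:
    print("No demographic context data available; skipping analysis")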

In [None]:
# Verify the analyzed expanded data
if analyzed_expanded_df is not None:
    verify_dataframe(analyzed_expanded_df, "Analyzed Expanded Dataset")
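
`count_medical_terms` above uses substring matching, so a short term can match inside a longer, unrelated word. If that turns out to inflate the counts, one option is a word-boundary regular expression. The sketch below is an assumption about how that could look: `count_terms_whole_word` and the sample terms are hypothetical and are not part of the notebook.

```python
import re

def count_terms_whole_word(text, terms):
    """Count how many terms occur in text as whole words or phrases."""
    if not isinstance(text, str):
        return 0
    count = 0
    for term in terms:
        if not isinstance(term, str) or not term.strip():
            continue  # skip missing or empty terms
        # Lookarounds require a non-word character (or the text edge) on both sides
        pattern = r'(?<!\w)' + re.escape(term.strip()) + r'(?!\w)'
        if re.search(pattern, text, flags=re.IGNORECASE):
            count += 1
    return count

# Hypothetical terms and question text
terms = ['cyst', 'pcos', 'pelvic pain']
text = 'Could my pelvic pain be related to PCOS or cystitis?'

substring_count = sum(1 for term in terms if term in text.lower())
whole_word_count = count_terms_whole_word(text, terms)
print(substring_count, whole_word_count)
```

Here the substring approach counts `cyst` inside `cystitis`, while the whole-word version does not. Which behavior is preferable depends on how the terminology list was built.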

In [None]:
# Visualize the question length ratio by category
if analyzed_expanded_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Calculate the average question length ratio by category
    category_ratios = analyzed_expanded_df.groupby('Category')['Question_Length_Ratio'].mean().sort_values(ascending=False)
    
    # Create a bar chart
    bars = plt.bar(category_ratios.index, category_ratios.values, 
                   color=sns.color_palette("viridis", len(category_ratios)))
    
    plt.title('Average Question Length Ratio by Category', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Average Length Ratio (Better/Dismissed)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}x',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'question_length_ratio_by_category.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the dismissal frequency by category
if analyzed_expanded_df is not None:
    # DataFrame.plot below creates its own figure, so a separate plt.figure() call
    # would only leave an empty figure behind
    
    # Create a cross-tabulation of category and dismissal frequency
    dismissal_by_category = pd.crosstab(analyzed_expanded_df['Category'], analyzed_expanded_df['DismissalFrequency'])
    
    # Reorder the columns
    order = ['Very High', 'High', 'Medium', 'Low']
    dismissal_by_category = dismissal_by_category.reindex(columns=order, fill_value=0)
    
    # Create a stacked bar chart
    dismissal_by_category.plot(kind='barh', stacked=True, figsize=(12, 8),
                              color=['#d62728', '#ff7f0e', '#ffbb78', '#2ca02c'])
    
    plt.title('Dismissal Frequency by Category', fontsize=16)
    plt.xlabel('Count', fontsize=12)
    plt.ylabel('Category', fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.legend(title='Dismissal Frequency')
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissal_frequency_by_category.png'), dpi=300)
    plt.show()
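
Raw counts in a stacked chart can be dominated by categories that simply have more rows. If proportions are of more interest, `pd.crosstab` accepts `normalize='index'` to return row shares instead. The following is a small sketch on made-up data: the category names and values are hypothetical, and only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical sample rows
sample = pd.DataFrame({
    'Category': ['Reproductive', 'Reproductive', 'Reproductive', 'Mental Health'],
    'DismissalFrequency': ['Very High', 'High', 'Very High', 'Medium'],
})
order = ['Very High', 'High', 'Medium', 'Low']

# Each row sums to 1; levels that never occur are added as zero columns
shares = (
    pd.crosstab(sample['Category'], sample['DismissalFrequency'], normalize='index')
    .reindex(columns=order, fill_value=0.0)
)
print(shares.round(2))
```

The resulting frame can be passed to the same `plot(kind='barh', stacked=True)` call to draw bars of equal length.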

In [None]:
# Visualize the diagnosis delay by category
if analyzed_expanded_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Calculate the average diagnosis delay by category
    category_delays = analyzed_expanded_df.groupby('Category')['DiagnosisDelay'].mean().sort_values(ascending=False)
    
    # Create a bar chart
    bars = plt.bar(category_delays.index, category_delays.values, 
                   color=sns.color_palette("Reds_r", len(category_delays)))
    
    plt.title('Average Diagnosis Delay by Category', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Average Diagnosis Delay (years)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'diagnosis_delay_by_category.png'), dpi=300)
    plt.show()

## 8. Prepare Data for Next Notebook

Let's save the preprocessed data for use in the next notebook in the series.

In [None]:
# Save the analyzed expanded dataset to the expanded directory
if analyzed_expanded_df is not None:
    analyzed_expanded_df.to_csv(os.path.join(expanded_dir, 'analyzed_expanded_dataset.csv'), index=False)
    print(f"Saved analyzed expanded dataset to: {os.path.join(expanded_dir, 'analyzed_expanded_dataset.csv')}")

# Save the medical terminology data to the expanded directory
if medical_terminology_df is not None:
    medical_terminology_df.to_csv(os.path.join(expanded_dir, 'medical_terminology.csv'), index=False)
    print(f"Saved medical terminology data to: {os.path.join(expanded_dir, 'medical_terminology.csv')}")

# Save the clinical trials data to the expanded directory
if clinical_trials_df is not None:
    clinical_trials_df.to_csv(os.path.join(expanded_dir, 'clinical_trials.csv'), index=False)
    print(f"Saved clinical trials data to: {os.path.join(expanded_dir, 'clinical_trials.csv')}")

# Save the PubMed data to the expanded directory
if pubmed_df is not None:
    pubmed_df.to_csv(os.path.join(expanded_dir, 'pubmed.csv'), index=False)
    print(f"Saved PubMed data to: {os.path.join(expanded_dir, 'pubmed.csv')}")

In [None]:
# Create a summary of the preprocessing results
preprocessing_summary = {
    "original_dismissed_questions": len(dismissed_questions_df) if dismissed_questions_df is not None else 0,
    "expanded_dismissed_questions": len(expanded_questions_df) if expanded_questions_df is not None else 0,
    "demographic_context_added": demographic_context_df is not None,
    "question_analysis": {
        "avg_dismissed_length": analyzed_expanded_df['DismissedQuestion_Length'].mean() if analyzed_expanded_df is not None else None,
        "avg_better_length": analyzed_expanded_df['BetterQuestion_Length'].mean() if analyzed_expanded_df is not None else None,
        "avg_length_ratio": analyzed_expanded_df['Question_Length_Ratio'].mean() if analyzed_expanded_df is not None else None,
        "avg_similarity": analyzed_expanded_df['Question_Similarity'].mean() if analyzed_expanded_df is not None else None,
        "avg_medical_terms_difference": analyzed_expanded_df['MedicalTerms_Difference'].mean() if analyzed_expanded_df is not None and 'MedicalTerms_Difference' in analyzed_expanded_df.columns else None
    },
    "categories": list(analyzed_expanded_df['Category'].unique()) if analyzed_expanded_df is not None else [],
    "age_groups": list(analyzed_expanded_df['AgeGroup'].unique()) if analyzed_expanded_df is not None else [],
    "racial_ethnic_groups": list(analyzed_expanded_df['RacialEthnicConsiderations'].unique()) if analyzed_expanded_df is not None else []
}

# Save the summary as JSON
with open(os.path.join(expanded_dir, 'preprocessing_summary.json'), 'w') as f:
    json.dump(preprocessing_summary, f, indent=2)

print("Preprocessing summary saved to:", os.path.join(expanded_dir, 'preprocessing_summary.json'))
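
The averages in `preprocessing_summary` come from pandas `.mean()`, which returns NaN for an empty or all-missing column. By default `json.dump` writes that as a bare `NaN` token, which strict JSON parsers reject. One way to guard against this is to replace non-finite floats with `None` before saving. This is a sketch only: the helper `to_json_safe` and the sample values are hypothetical.

```python
import json
import math

def to_json_safe(value):
    """Recursively replace NaN/inf floats with None so the output is strict JSON."""
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value

# Hypothetical summary containing missing values
summary = {
    'question_analysis': {'avg_similarity': float('nan'), 'avg_length_ratio': 2.5},
    'categories': ['Reproductive Health', float('nan')],
}

safe_summary = to_json_safe(summary)
# allow_nan=False makes json.dumps raise if a NaN slipped through
encoded = json.dumps(safe_summary, indent=2, allow_nan=False)
print(encoded)
```

Because `numpy.float64` is a subclass of Python's `float`, the same check should also cover values returned by pandas aggregations.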

In [None]:
# Display the preprocessing summary
print("\n--- Preprocessing Summary ---")
print(f"Original Dismissed Questions: {preprocessing_summary['original_dismissed_questions']}")
print(f"Expanded Dismissed Questions: {preprocessing_summary['expanded_dismissed_questions']}")
print(f"Demographic Context Added: {preprocessing_summary['demographic_context_added']}")
print("\nQuestion Analysis:")
qa = preprocessing_summary['question_analysis']

# Metrics are None when the analysis step was skipped, so format them defensively
def fmt(value, spec):
    return format(value, spec) if value is not None else "n/a"

print(f"  Average Dismissed Question Length: {fmt(qa['avg_dismissed_length'], '.1f')} characters")
print(f"  Average Better Question Length: {fmt(qa['avg_better_length'], '.1f')} characters")
print(f"  Average Length Ratio: {fmt(qa['avg_length_ratio'], '.1f')}x")
print(f"  Average Question Similarity: {fmt(qa['avg_similarity'], '.2f')}")
print(f"  Average Medical Terms Difference: {fmt(qa['avg_medical_terms_difference'], '.1f')}")
print("\nCategories:")
for category in preprocessing_summary['categories']:
    print(f"  - {category}")
print("\nAge Groups:")
for age_group in preprocessing_summary['age_groups']:
    print(f"  - {age_group}")
print("\nRacial/Ethnic Groups:")
for racial_group in preprocessing_summary['racial_ethnic_groups']:
    print(f"  - {racial_group}")

## 9. Conclusion

In this notebook, we preprocessed the data collected in Part 1 for our women's health LLM:

1. **Analyzed Question Characteristics**: We analyzed the differences between dismissed questions and better questions, finding that better questions are significantly longer and contain more medical terminology.

2. **Expanded the Dataset**: We expanded the dismissed questions dataset to include more examples across different categories, increasing the diversity of conditions covered.

3. **Added Demographic Context**: We added demographic context to the dataset, including comorbidities and condition-specific demographic risk notes.

4. **Analyzed Patterns**: We examined dismissal frequency and diagnosis delay across categories, which will help our LLM generate more effective questions for conditions that are frequently dismissed.

This preprocessed data will serve as the foundation for the next steps in our project: analysis and visualization, and training split preparation.

### Next Steps

In the next notebook (Part 3: Analysis and Visualization), we will:
- Perform in-depth analysis of the preprocessed data
- Create visualizations to better understand patterns in women's health questions
- Identify key factors that contribute to question effectiveness
- Prepare the data for the final step: training split preparation