# Women's Health Data Analysis and Visualization - Part 3

## Overview

This notebook is the third in a series on building a dataset for training an LLM to predict better questions for women's health consultations. This part focuses on in-depth analysis and visualization of the preprocessed data from Part 2.

### Objectives
- Load the preprocessed data from Part 2
- Perform comprehensive analysis of women's health questions
- Create detailed visualizations to identify patterns
- Analyze dismissal patterns across different demographics
- Identify key factors that contribute to question effectiveness
- Prepare insights for LLM training

### Why This Matters
Understanding the patterns in women's health questions and dismissal experiences is crucial for training an effective LLM. By visualizing these patterns, we can identify the characteristics associated with questions that are more effective and less likely to be dismissed by healthcare providers.

## 1. Environment Setup

First, let's set up our environment by importing necessary libraries and loading the preprocessed data from Part 2.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import json
import re
import string
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from wordcloud import WordCloud
import matplotlib.cm as cm
from matplotlib.colors import Normalize
import matplotlib.colors as mcolors
from scipy.stats import pearsonr, spearmanr

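# Set up plotting style
# Note: Matplotlib 3.6 renamed the 'seaborn-whitegrid' style to 'seaborn-v0_8-whitegrid'
# and later releases removed the old name, so plt.style.use('seaborn-whitegrid') can fail.
# Seaborn's own theme call gives the same look across versions.
sns.set(style="whitegrid")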

# Set up color palettes
category_palette = sns.color_palette("viridis", 10)
dismissal_palette = {'Very High': '#d62728', 'High': '#ff7f0e', 'Medium': '#ffbb78', 'Low': '#2ca02c'}

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource for tokenization
nltk.download('stopwords')
nltk.download('wordnet')

# Display versions for reproducibility
import matplotlib  # pyplot does not expose __version__; the top-level package does
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"NLTK version: {nltk.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Define directory structure
data_dir = 'womens_health_data'
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')
expanded_dir = os.path.join(data_dir, 'expanded')
checkpoint_dir = os.path.join(data_dir, 'checkpoints')
figures_dir = os.path.join(data_dir, 'figures')
analysis_dir = os.path.join(data_dir, 'analysis')

# Create directories if they don't exist
for directory in [data_dir, raw_dir, processed_dir, expanded_dir, checkpoint_dir, figures_dir, analysis_dir]:
    os.makedirs(directory, exist_ok=True)
    # exist_ok=True means the directory may already have existed
    print(f"Directory ready: {directory}")

## 2. Helper Functions

Let's create some helper functions for data analysis, visualization, and checkpoint management.

In [None]:
def save_checkpoint(df, name):
    """
    Save a dataframe as a checkpoint CSV file.
    
    Parameters:
    - df: pandas DataFrame to save
    - name: name of the checkpoint (without extension)
    
    Returns:
    - path: path to the saved file
    """
    # Create the full path with timestamp to avoid overwriting
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.csv"
    path = os.path.join(checkpoint_dir, filename)
    
    # Save the dataframe
    df.to_csv(path, index=False)
    print(f"Checkpoint saved: {path}")
    
    # Also save a version with a fixed name for easy loading
    fixed_path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    df.to_csv(fixed_path, index=False)
    print(f"Latest version saved: {fixed_path}")
    
    return path

def load_checkpoint(name):
    """
    Load the latest checkpoint for a given name.
    
    Parameters:
    - name: name of the checkpoint (without extension)
    
    Returns:
    - df: loaded DataFrame or None if file doesn't exist
    """
    path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    
    if os.path.exists(path) and os.path.getsize(path) > 0:
        try:
            df = pd.read_csv(path)
            print(f"Checkpoint loaded: {path}")
            print(f"Shape: {df.shape}")
            return df
        except pd.errors.EmptyDataError:
            print(f"Warning: Checkpoint file exists but is empty: {path}")
            return None
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            return None
    else:
        print(f"Checkpoint not found or empty: {path}")
        return None

def verify_dataframe(df, name):
    """
    Verify a dataframe by displaying basic information.
    
    Parameters:
    - df: pandas DataFrame to verify
    - name: name of the dataframe for display purposes
    """
    print(f"\n--- {name} Verification ---")
    print(f"Shape: {df.shape}")
    print("\nFirst 5 rows:")
    display(df.head())
    print("\nData types:")
    display(df.dtypes)
    print("\nMissing values:")
    missing = df.isnull().sum()
    display(missing[missing > 0] if any(missing > 0) else "No missing values")
    print("\nBasic statistics:")
    display(df.describe(include='all').T)
    print("----------------------------\n")

def create_wordcloud(text, title, filename, mask=None, background_color='white', colormap='viridis', max_words=200):
    """
    Create and save a word cloud visualization.
    
    Parameters:
    - text: text to visualize
    - title: title for the plot
    - filename: filename to save the plot (without extension)
    - mask: optional mask image for the word cloud
    - background_color: background color for the word cloud
    - colormap: colormap for the word cloud
    - max_words: maximum number of words to include
    """
    # Create the word cloud
    wordcloud = WordCloud(width=800, height=400, 
                          background_color=background_color,
                          colormap=colormap,
                          max_words=max_words,
                          mask=mask,
                          contour_width=1,
                          contour_color='steelblue')
    
    # Generate the word cloud
    wordcloud.generate(text)
    
    # Plot the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.tight_layout()
    
    # Save the plot
    plt.savefig(os.path.join(figures_dir, f"{filename}.png"), dpi=300, bbox_inches='tight')
    plt.show()

def extract_key_phrases(text, n=10):
    """
    Extract key phrases from text using TF-IDF.
    
    Parameters:
    - text: text to analyze
    - n: number of key phrases to extract
    
    Returns:
    - key_phrases: list of key phrases
    """
    if not isinstance(text, str) or len(text) < 10:
        return []
    
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), max_features=100)
    
    # Fit the vectorizer to the sentences
    try:
        tfidf_matrix = vectorizer.fit_transform(sentences)
        
        # Get feature names
        feature_names = vectorizer.get_feature_names_out()
        
        # Calculate the average TF-IDF score for each feature
        avg_tfidf = np.array(tfidf_matrix.mean(axis=0)).flatten()
        
        # Get the indices of the top n features
        top_indices = avg_tfidf.argsort()[-n:][::-1]
        
        # Get the top n features
        key_phrases = [feature_names[i] for i in top_indices]
        
        return key_phrases
    except Exception as e:
        print(f"Error extracting key phrases: {e}")
        return []

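def calculate_correlation(df, x_col, y_col, method='pearson'):
    """
    Calculate correlation between two columns.
    
    Rows with a missing value in either column are excluded first.
    
    Parameters:
    - df: pandas DataFrame
    - x_col: name of the first column
    - y_col: name of the second column
    - method: correlation method ('pearson' or 'spearman')
    
    Returns:
    - corr: correlation coefficient (NaN if fewer than 2 complete rows)
    - p_value: p-value (NaN if fewer than 2 complete rows)
    """
    if method not in ('pearson', 'spearman'):
        raise ValueError(f"Unknown correlation method: {method}")
    
    # Keep only complete pairs; NaN inputs would otherwise yield NaN or raise an error
    pairs = df[[x_col, y_col]].dropna()
    if len(pairs) < 2:
        return np.nan, np.nan
    
    if method == 'pearson':
        corr, p_value = pearsonr(pairs[x_col], pairs[y_col])
    else:
        corr, p_value = spearmanr(pairs[x_col], pairs[y_col])
    
    return corr, p_value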
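
The checkpoint helpers above write each dataframe twice: once under a timestamped name (a history that is never overwritten) and once under a fixed `_latest` name (the file that gets reloaded). The sketch below reproduces that pattern with only the standard library so it can be run in isolation. The function names `save_rows` and `load_latest` and the sample row are hypothetical and are not part of the notebook.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

def save_rows(rows, fieldnames, name, directory):
    """Write rows twice: a timestamped history file and '<name>_latest.csv'."""
    directory = Path(directory)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    paths = [directory / f"{name}_{stamp}.csv", directory / f"{name}_latest.csv"]
    for path in paths:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return paths

def load_latest(name, directory):
    """Return the rows of '<name>_latest.csv', or None if it is missing or empty."""
    path = Path(directory) / f"{name}_latest.csv"
    if not path.exists() or path.stat().st_size == 0:
        return None
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Round trip in a temporary directory (hypothetical sample row)
with tempfile.TemporaryDirectory() as tmp:
    rows = [{"question": "Why does it hurt?", "words": 4}]
    save_rows(rows, ["question", "words"], "demo", tmp)
    loaded = load_latest("demo", tmp)
    missing = load_latest("absent", tmp)
    n_files = len(list(Path(tmp).glob("demo_*.csv")))

print(loaded)   # the integer 4 comes back as the string '4'
print(missing)
print(n_files)
```

One design consequence worth keeping in mind: CSV does not store types, so values reloaded from a checkpoint may differ in type from the values that were saved. In the pandas-based helpers, an empty string written to a checkpoint is read back as a missing value, which is why later cells call `dropna()` on the key-phrase columns.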

## 3. Load Preprocessed Data

Let's load the preprocessed data from Part 2.

In [None]:
# Load the analyzed expanded dataset
analyzed_expanded_path = os.path.join(expanded_dir, 'analyzed_expanded_dataset.csv')
if os.path.exists(analyzed_expanded_path):
    analyzed_df = pd.read_csv(analyzed_expanded_path)
    print(f"Loaded analyzed expanded dataset: {analyzed_df.shape}")
else:
    print("Analyzed expanded dataset not found. Please run Part 2 first.")
    analyzed_df = None

# Load the medical terminology data
medical_terminology_path = os.path.join(expanded_dir, 'medical_terminology.csv')
if os.path.exists(medical_terminology_path):
    medical_terminology_df = pd.read_csv(medical_terminology_path)
    print(f"Loaded medical terminology data: {medical_terminology_df.shape}")
else:
    print("Medical terminology data not found. Please run Part 2 first.")
    medical_terminology_df = None

# Load the clinical trials data
clinical_trials_path = os.path.join(expanded_dir, 'clinical_trials.csv')
if os.path.exists(clinical_trials_path):
    clinical_trials_df = pd.read_csv(clinical_trials_path)
    print(f"Loaded clinical trials data: {clinical_trials_df.shape}")
else:
    print("Clinical trials data not found. Please run Part 2 first.")
    clinical_trials_df = None

# Load the PubMed data
pubmed_path = os.path.join(expanded_dir, 'pubmed.csv')
if os.path.exists(pubmed_path):
    pubmed_df = pd.read_csv(pubmed_path)
    print(f"Loaded PubMed data: {pubmed_df.shape}")
else:
    print("PubMed data not found. Please run Part 2 first.")
    pubmed_df = None

# Load the preprocessing summary
preprocessing_summary_path = os.path.join(expanded_dir, 'preprocessing_summary.json')
if os.path.exists(preprocessing_summary_path):
    with open(preprocessing_summary_path, 'r') as f:
        preprocessing_summary = json.load(f)
    print("Loaded preprocessing summary")
else:
    print("Preprocessing summary not found. Please run Part 2 first.")
    preprocessing_summary = None

In [None]:
# Verify the loaded data
if analyzed_df is not None:
    verify_dataframe(analyzed_df, "Analyzed Expanded Dataset")
    
if medical_terminology_df is not None:
    verify_dataframe(medical_terminology_df, "Medical Terminology")

## 4. Text Analysis of Questions

Let's perform a detailed text analysis of the dismissed questions and better questions to understand the linguistic differences between them.
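
Before running the full analysis, it may help to see how the unique-word ratio used as a "specificity" score behaves on a single pair of questions. The sketch below is a simplified, standard-library version: it tokenizes with a regular expression instead of `word_tokenize` (so punctuation is not counted), and both example questions are hypothetical rather than drawn from the dataset.

```python
import re

def simple_tokens(text):
    """Lowercase word tokens via a regex (a stand-in for nltk.word_tokenize)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_ratio(text):
    """Unique tokens divided by total tokens; 0.0 for empty input."""
    tokens = simple_tokens(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical example pair, not taken from the dataset
dismissed = "Why does it hurt?"
better = (
    "The pain in my lower abdomen started three months ago, and the pain "
    "gets worse before my period. Could the pain be related to endometriosis?"
)

ttr_dismissed = type_token_ratio(dismissed)
ttr_better = type_token_ratio(better)

print(f"dismissed: {len(simple_tokens(dismissed))} tokens, ratio {ttr_dismissed:.2f}")
print(f"better:    {len(simple_tokens(better))} tokens, ratio {ttr_better:.2f}")
```

In this example the longer, more detailed question scores lower, because repeated words such as "the" and "pain" reduce the ratio. The score therefore reflects lexical diversity and is sensitive to length, so it is best read alongside the word and sentence counts rather than as a direct measure of how specific a question is.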

In [None]:
# Check if we already have the text analysis data
text_analysis_df = load_checkpoint("text_analysis")

# If not, perform text analysis
if text_analysis_df is None and analyzed_df is not None:
    # Create a copy of the analyzed dataframe
    text_analysis_df = analyzed_df.copy()
    
    # Calculate the number of sentences in each question
    text_analysis_df['DismissedQuestion_Sentences'] = text_analysis_df['DismissedQuestion'].apply(
        lambda x: len(sent_tokenize(x)) if isinstance(x, str) else 0
    )
    text_analysis_df['BetterQuestion_Sentences'] = text_analysis_df['BetterQuestion'].apply(
        lambda x: len(sent_tokenize(x)) if isinstance(x, str) else 0
    )
    
    # Calculate the number of words in each question
    text_analysis_df['DismissedQuestion_Words'] = text_analysis_df['DismissedQuestion'].apply(
        lambda x: len(word_tokenize(x)) if isinstance(x, str) else 0
    )
    text_analysis_df['BetterQuestion_Words'] = text_analysis_df['BetterQuestion'].apply(
        lambda x: len(word_tokenize(x)) if isinstance(x, str) else 0
    )
    
    # Calculate the average word length in each question
    def avg_word_length(text):
        if not isinstance(text, str) or len(text) == 0:
            return 0
        words = word_tokenize(text)
        if len(words) == 0:
            return 0
        return sum(len(word) for word in words) / len(words)
    
    text_analysis_df['DismissedQuestion_AvgWordLength'] = text_analysis_df['DismissedQuestion'].apply(avg_word_length)
    text_analysis_df['BetterQuestion_AvgWordLength'] = text_analysis_df['BetterQuestion'].apply(avg_word_length)
    
    # Extract key phrases from each question
    text_analysis_df['DismissedQuestion_KeyPhrases'] = text_analysis_df['DismissedQuestion'].apply(
        lambda x: '; '.join(extract_key_phrases(x, n=5))
    )
    text_analysis_df['BetterQuestion_KeyPhrases'] = text_analysis_df['BetterQuestion'].apply(
        lambda x: '; '.join(extract_key_phrases(x, n=5))
    )
    
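    # Calculate the specificity score (ratio of unique words to total words).
    # This is a type-token ratio, i.e. a lexical-diversity proxy; it tends to fall as text gets longer.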
    def calculate_specificity(text):
        if not isinstance(text, str) or len(text) == 0:
            return 0
        words = word_tokenize(text.lower())
        if len(words) == 0:
            return 0
        unique_words = set(words)
        return len(unique_words) / len(words)
    
    text_analysis_df['DismissedQuestion_Specificity'] = text_analysis_df['DismissedQuestion'].apply(calculate_specificity)
    text_analysis_df['BetterQuestion_Specificity'] = text_analysis_df['BetterQuestion'].apply(calculate_specificity)
    
    # Calculate the question complexity (product of number of sentences and average word length)
    text_analysis_df['DismissedQuestion_Complexity'] = text_analysis_df['DismissedQuestion_Sentences'] * text_analysis_df['DismissedQuestion_AvgWordLength']
    text_analysis_df['BetterQuestion_Complexity'] = text_analysis_df['BetterQuestion_Sentences'] * text_analysis_df['BetterQuestion_AvgWordLength']
    
    # Calculate the complexity ratio (a zero denominator is replaced with 0.1 to avoid division by zero)
    text_analysis_df['Complexity_Ratio'] = text_analysis_df['BetterQuestion_Complexity'] / text_analysis_df['DismissedQuestion_Complexity'].replace(0, 0.1)
    
    # Save checkpoint
    save_checkpoint(text_analysis_df, "text_analysis")
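elif text_analysis_df is not None:
    print("Using existing text analysis data")
else:
    print("No text analysis data available. Please run Part 2 first.")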

In [None]:
# Verify the text analysis data
if text_analysis_df is not None:
    verify_dataframe(text_analysis_df, "Text Analysis")

In [None]:
# Visualize the sentence count comparison
if text_analysis_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Create a bar chart
    avg_dismissed_sentences = text_analysis_df['DismissedQuestion_Sentences'].mean()
    avg_better_sentences = text_analysis_df['BetterQuestion_Sentences'].mean()
    avg_ratio = avg_better_sentences / avg_dismissed_sentences
    
    bars = plt.bar(['Dismissed Question', 'Better Question'], 
                   [avg_dismissed_sentences, avg_better_sentences],
                   color=['#ff7f0e', '#2ca02c'])
    
    plt.title('Average Number of Sentences Comparison', fontsize=16)
    plt.ylabel('Average Number of Sentences', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=12)
    
    # Add ratio annotation
    plt.annotate(f'Better questions have on average {avg_ratio:.1f}x as many sentences as dismissed questions',
                xy=(0.5, 0.9), xycoords='axes fraction',
                ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
                fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'sentence_count_comparison.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the word count comparison
if text_analysis_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Create a bar chart
    avg_dismissed_words = text_analysis_df['DismissedQuestion_Words'].mean()
    avg_better_words = text_analysis_df['BetterQuestion_Words'].mean()
    avg_ratio = avg_better_words / avg_dismissed_words
    
    bars = plt.bar(['Dismissed Question', 'Better Question'], 
                   [avg_dismissed_words, avg_better_words],
                   color=['#ff7f0e', '#2ca02c'])
    
    plt.title('Average Number of Words Comparison', fontsize=16)
    plt.ylabel('Average Number of Words', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=12)
    
    # Add ratio annotation
    plt.annotate(f'Better questions have on average {avg_ratio:.1f}x as many words as dismissed questions',
                xy=(0.5, 0.9), xycoords='axes fraction',
                ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
                fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'word_count_comparison.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the specificity comparison
if text_analysis_df is not None:
    plt.figure(figsize=(12, 6))
    
    # Create a bar chart
    avg_dismissed_specificity = text_analysis_df['DismissedQuestion_Specificity'].mean()
    avg_better_specificity = text_analysis_df['BetterQuestion_Specificity'].mean()
    avg_ratio = avg_better_specificity / avg_dismissed_specificity
    
    bars = plt.bar(['Dismissed Question', 'Better Question'], 
                   [avg_dismissed_specificity, avg_better_specificity],
                   color=['#ff7f0e', '#2ca02c'])
    
    plt.title('Average Specificity Comparison', fontsize=16)
    plt.ylabel('Average Specificity Score', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                 f'{height:.2f}',
                 ha='center', va='bottom', fontsize=12)
    
    # Add ratio annotation
    plt.annotate(f'Average specificity of better questions is {avg_ratio:.2f}x that of dismissed questions',
                xy=(0.5, 0.9), xycoords='axes fraction',
                ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
                fontsize=12)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'specificity_comparison.png'), dpi=300)
    plt.show()

In [None]:
# Create word clouds for dismissed and better questions
if text_analysis_df is not None:
    # Combine all dismissed questions
    all_dismissed = ' '.join(text_analysis_df['DismissedQuestion'].dropna())
    
    # Combine all better questions
    all_better = ' '.join(text_analysis_df['BetterQuestion'].dropna())
    
    # Create word clouds
    create_wordcloud(all_dismissed, 'Word Cloud of Dismissed Questions', 'dismissed_questions_wordcloud', 
                     colormap='Oranges')
    create_wordcloud(all_better, 'Word Cloud of Better Questions', 'better_questions_wordcloud', 
                     colormap='Greens')

In [None]:
# Analyze the key phrases
if text_analysis_df is not None:
    # Extract all key phrases
    dismissed_key_phrases = []
    for phrases in text_analysis_df['DismissedQuestion_KeyPhrases'].dropna():
        dismissed_key_phrases.extend([phrase.strip() for phrase in phrases.split(';') if phrase.strip()])
    
    better_key_phrases = []
    for phrases in text_analysis_df['BetterQuestion_KeyPhrases'].dropna():
        better_key_phrases.extend([phrase.strip() for phrase in phrases.split(';') if phrase.strip()])
    
    # Count the frequency of each key phrase
    dismissed_phrase_counts = pd.Series(dismissed_key_phrases).value_counts().head(15)
    better_phrase_counts = pd.Series(better_key_phrases).value_counts().head(15)
    
    # Plot the top key phrases for dismissed questions
    plt.figure(figsize=(12, 6))
    dismissed_phrase_counts.plot(kind='barh', color='#ff7f0e')
    plt.title('Top Key Phrases in Dismissed Questions', fontsize=16)
    plt.xlabel('Frequency', fontsize=12)
    plt.ylabel('Key Phrase', fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissed_key_phrases.png'), dpi=300)
    plt.show()
    
    # Plot the top key phrases for better questions
    plt.figure(figsize=(12, 6))
    better_phrase_counts.plot(kind='barh', color='#2ca02c')
    plt.title('Top Key Phrases in Better Questions', fontsize=16)
    plt.xlabel('Frequency', fontsize=12)
    plt.ylabel('Key Phrase', fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'better_key_phrases.png'), dpi=300)
    plt.show()

## 5. Correlation Analysis

Let's analyze the correlations between question characteristics and two outcomes: dismissal frequency and diagnosis delay.
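
Dismissal frequency is an ordered label (Low to Very High) that gets encoded as 1 to 4, so the spacing between codes is arbitrary. The sketch below illustrates, under that assumption, why a rank-based coefficient is the safer one to read for such a column: Spearman's value is unchanged when the numeric codes are re-spaced, while Pearson's is not. The pure-Python functions are for illustration only (the notebook itself uses `scipy.stats`), and the word counts and labels are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: word counts and dismissal labels for five questions
word_counts = [5, 8, 12, 20, 60]
labels = ['Very High', 'High', 'Medium', 'Medium', 'Low']

dismissal_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
respaced_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 10}  # same order, different spacing

levels = [dismissal_map[label] for label in labels]
levels_respaced = [respaced_map[label] for label in labels]

r = pearson(word_counts, levels)
rho = spearman(word_counts, levels)
r_respaced = pearson(word_counts, levels_respaced)
rho_respaced = spearman(word_counts, levels_respaced)

print(f"Pearson:  {r:.3f} -> {r_respaced:.3f} after re-spacing the codes")
print(f"Spearman: {rho:.3f} -> {rho_respaced:.3f} after re-spacing the codes")
```

Neither coefficient says anything about causation, and with only four distinct levels there will be many ties, so small coefficients should be interpreted cautiously.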

In [None]:
# Check if we already have the correlation analysis data
correlation_df = load_checkpoint("correlation_analysis")

# If not, perform correlation analysis
if correlation_df is None and text_analysis_df is not None:
    # Convert dismissal frequency to an ordinal scale (labels not in the map become NaN)
    dismissal_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
    text_analysis_df['DismissalFrequency_Numeric'] = text_analysis_df['DismissalFrequency'].map(dismissal_map)
    
    # Define the variables to analyze
    question_vars = [
        'DismissedQuestion_Length', 'BetterQuestion_Length', 'Question_Length_Ratio',
        'DismissedQuestion_Sentences', 'BetterQuestion_Sentences',
        'DismissedQuestion_Words', 'BetterQuestion_Words',
        'DismissedQuestion_AvgWordLength', 'BetterQuestion_AvgWordLength',
        'DismissedQuestion_Specificity', 'BetterQuestion_Specificity',
        'DismissedQuestion_Complexity', 'BetterQuestion_Complexity', 'Complexity_Ratio'
    ]
    
    # Outcomes to correlate against: (column holding the values, label stored in the results)
    targets = [
        ('DismissalFrequency_Numeric', 'DismissalFrequency'),
        ('DiagnosisDelay', 'DiagnosisDelay'),
    ]
    
    # Collect one row per (variable, outcome, method) and build the dataframe once,
    # instead of repeatedly concatenating onto an initially empty dataframe
    rows = []
    for target_col, target_label in targets:
        if target_col not in text_analysis_df.columns:
            print(f"Skipping {target_label}: column not found")
            continue
        for var in question_vars:
            if var not in text_analysis_df.columns:
                continue
            for method in ('pearson', 'spearman'):
                corr, p_value = calculate_correlation(text_analysis_df, var, target_col, method=method)
                rows.append({
                    'Variable1': var,
                    'Variable2': target_label,
                    'Correlation': corr,
                    'P_Value': p_value,
                    'Method': method.capitalize()
                })
    
    # Create the dataframe that stores the correlation results
    correlation_df = pd.DataFrame(rows, columns=['Variable1', 'Variable2', 'Correlation', 'P_Value', 'Method'])
    
    # Save checkpoint
    save_checkpoint(correlation_df, "correlation_analysis")
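elif correlation_df is not None:
    print("Using existing correlation analysis data")
else:
    print("No correlation analysis data available: text analysis data is missing.")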

In [None]:
# Verify the correlation analysis data
if correlation_df is not None:
    verify_dataframe(correlation_df, "Correlation Analysis")

In [None]:
# Visualize the top correlations with dismissal frequency
if correlation_df is not None:
    # Filter for Spearman correlations with dismissal frequency
    dismissal_corrs = correlation_df[
        (correlation_df['Variable2'] == 'DismissalFrequency') & 
        (correlation_df['Method'] == 'Spearman')
    ].copy()
    
    # Sort by absolute correlation
    dismissal_corrs['Abs_Correlation'] = dismissal_corrs['Correlation'].abs()
    dismissal_corrs = dismissal_corrs.sort_values('Abs_Correlation', ascending=False).head(10)
    
    # Plot the correlations
    plt.figure(figsize=(12, 6))
    bars = plt.barh(dismissal_corrs['Variable1'], dismissal_corrs['Correlation'], 
                   color=dismissal_corrs['Correlation'].apply(lambda x: '#2ca02c' if x > 0 else '#d62728'))
    
    plt.title('Top Correlations with Dismissal Frequency', fontsize=16)
    plt.xlabel('Spearman Correlation Coefficient', fontsize=12)
    plt.ylabel('Question Characteristic', fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
    
    # Add data labels
    for bar in bars:
        width = bar.get_width()
        x_pos = width + 0.03 if width > 0 else width - 0.15
        plt.text(x_pos, bar.get_y() + bar.get_height()/2,
                 f'{width:.2f}',
                 va='center', fontsize=10,
                 color='black')
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissal_frequency_correlations.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the top correlations with diagnosis delay
if correlation_df is not None:
    # Filter for Spearman correlations with diagnosis delay
    delay_corrs = correlation_df[
        (correlation_df['Variable2'] == 'DiagnosisDelay') & 
        (correlation_df['Method'] == 'Spearman')
    ].copy()
    
    # Sort by absolute correlation
    delay_corrs['Abs_Correlation'] = delay_corrs['Correlation'].abs()
    delay_corrs = delay_corrs.sort_values('Abs_Correlation', ascending=False).head(10)
    
    # Plot the correlations
    plt.figure(figsize=(12, 6))
    bars = plt.barh(delay_corrs['Variable1'], delay_corrs['Correlation'], 
                   color=delay_corrs['Correlation'].apply(lambda x: '#2ca02c' if x > 0 else '#d62728'))
    
    plt.title('Top Correlations with Diagnosis Delay', fontsize=16)
    plt.xlabel('Spearman Correlation Coefficient', fontsize=12)
    plt.ylabel('Question Characteristic', fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
    
    # Add data labels
    for bar in bars:
        width = bar.get_width()
        x_pos = width + 0.03 if width > 0 else width - 0.15
        plt.text(x_pos, bar.get_y() + bar.get_height()/2,
                 f'{width:.2f}',
                 va='center', fontsize=10,
                 color='black')
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'diagnosis_delay_correlations.png'), dpi=300)
    plt.show()

## 6. Demographic Analysis

Let's analyze how dismissal frequency and diagnosis delay vary across age groups and racial/ethnic groups.
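
The group averages in this section depend on the label-to-number mapping covering every label in the data. With pandas, `Series.map` with a dictionary leaves unmapped labels as NaN, and `groupby(...).mean()` skips NaN by default, so a label variant can silently drop out of an average. The sketch below mirrors that logic in plain Python and makes the skipped labels visible. The records are hypothetical, with a lowercase `'very high'` standing in for such a variant.

```python
from collections import defaultdict
from statistics import mean

dismissal_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}

# Hypothetical records; 'very high' (lowercase) is not a key in dismissal_map
records = [
    {'AgeGroup': '18-24', 'DismissalFrequency': 'High'},
    {'AgeGroup': '18-24', 'DismissalFrequency': 'Very High'},
    {'AgeGroup': '25-34', 'DismissalFrequency': 'Medium'},
    {'AgeGroup': '25-34', 'DismissalFrequency': 'very high'},
    {'AgeGroup': '65+', 'DismissalFrequency': 'Low'},
]

scores = defaultdict(list)  # age group -> list of numeric scores
unmapped = []               # labels that could not be converted

for row in records:
    score = dismissal_map.get(row['DismissalFrequency'])
    if score is None:
        unmapped.append(row['DismissalFrequency'])
        continue
    scores[row['AgeGroup']].append(score)

group_means = {group: mean(values) for group, values in scores.items()}

print(group_means)
print(f"Unmapped labels: {unmapped}")
```

Here the 25-34 average is computed from a single record because the other one was not mapped. Checking the unmapped labels (or, in pandas, counting NaN values in `DismissalFrequency_Numeric`) before plotting is a cheap way to confirm that each bar reflects all of its group's records.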

In [None]:
# Check if we already have the demographic analysis data
demographic_analysis_df = load_checkpoint("demographic_analysis")

# If not, perform demographic analysis
if demographic_analysis_df is None and text_analysis_df is not None:
    # Create a copy of the text analysis dataframe
    demographic_analysis_df = text_analysis_df.copy()
    
    # Convert dismissal frequency to numeric
    if 'DismissalFrequency_Numeric' not in demographic_analysis_df.columns:
        dismissal_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
        demographic_analysis_df['DismissalFrequency_Numeric'] = demographic_analysis_df['DismissalFrequency'].map(dismissal_map)
    
    # Save checkpoint
    save_checkpoint(demographic_analysis_df, "demographic_analysis")
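elif demographic_analysis_df is not None:
    print("Using existing demographic analysis data")
else:
    print("No demographic analysis data available: text analysis data is missing.")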

In [None]:
# Verify the demographic analysis data
if demographic_analysis_df is not None:
    verify_dataframe(demographic_analysis_df, "Demographic Analysis")

In [None]:
# Visualize dismissal frequency by age group
if demographic_analysis_df is not None:
    # Calculate the average dismissal frequency by age group
    age_dismissal = demographic_analysis_df.groupby('AgeGroup')['DismissalFrequency_Numeric'].mean().reset_index()
    
    # Define the order for age groups
    order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
    age_dismissal['AgeGroup'] = pd.Categorical(age_dismissal['AgeGroup'], categories=order, ordered=True)
    age_dismissal = age_dismissal.sort_values('AgeGroup')
    
    # Create a color map based on dismissal frequency
    norm = Normalize(vmin=age_dismissal['DismissalFrequency_Numeric'].min(), 
                     vmax=age_dismissal['DismissalFrequency_Numeric'].max())
    colors = plt.cm.Reds(norm(age_dismissal['DismissalFrequency_Numeric']))
    
    # Plot the dismissal frequency by age group
    plt.figure(figsize=(12, 6))
    bars = plt.bar(age_dismissal['AgeGroup'], age_dismissal['DismissalFrequency_Numeric'], color=colors)
    
    plt.title('Average Dismissal Frequency by Age Group', fontsize=16)
    plt.xlabel('Age Group', fontsize=12)
    plt.ylabel('Average Dismissal Frequency (1=Low, 4=Very High)', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                 f'{height:.2f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissal_by_age_group.png'), dpi=300)
    plt.show()

In [None]:
# Visualize dismissal frequency by racial/ethnic group
if demographic_analysis_df is not None:
    # Calculate the average dismissal frequency by racial/ethnic group
    racial_dismissal = demographic_analysis_df.groupby('RacialEthnicConsiderations')['DismissalFrequency_Numeric'].mean().reset_index()
    
    # Sort by dismissal frequency
    racial_dismissal = racial_dismissal.sort_values('DismissalFrequency_Numeric', ascending=False)
    
    # Create a color map based on dismissal frequency
    norm = Normalize(vmin=racial_dismissal['DismissalFrequency_Numeric'].min(), 
                     vmax=racial_dismissal['DismissalFrequency_Numeric'].max())
    colors = plt.cm.Reds(norm(racial_dismissal['DismissalFrequency_Numeric']))
    
    # Plot the dismissal frequency by racial/ethnic group
    plt.figure(figsize=(12, 6))
    bars = plt.bar(racial_dismissal['RacialEthnicConsiderations'], racial_dismissal['DismissalFrequency_Numeric'], color=colors)
    
    plt.title('Average Dismissal Frequency by Racial/Ethnic Group', fontsize=16)
    plt.xlabel('Racial/Ethnic Group', fontsize=12)
    plt.ylabel('Average Dismissal Frequency (1=Low, 4=Very High)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                 f'{height:.2f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissal_by_racial_ethnic_group.png'), dpi=300)
    plt.show()

In [None]:
# Visualize diagnosis delay by age group
if demographic_analysis_df is not None:
    # Calculate the average diagnosis delay by age group
    age_delay = demographic_analysis_df.groupby('AgeGroup')['DiagnosisDelay'].mean().reset_index()
    
    # Define the order for age groups
    order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
    age_delay['AgeGroup'] = pd.Categorical(age_delay['AgeGroup'], categories=order, ordered=True)
    age_delay = age_delay.sort_values('AgeGroup')
    
    # Create a color map based on diagnosis delay
    norm = Normalize(vmin=age_delay['DiagnosisDelay'].min(), 
                     vmax=age_delay['DiagnosisDelay'].max())
    colors = plt.cm.Reds(norm(age_delay['DiagnosisDelay']))
    
    # Plot the diagnosis delay by age group
    plt.figure(figsize=(12, 6))
    bars = plt.bar(age_delay['AgeGroup'], age_delay['DiagnosisDelay'], color=colors)
    
    plt.title('Average Diagnosis Delay by Age Group', fontsize=16)
    plt.xlabel('Age Group', fontsize=12)
    plt.ylabel('Average Diagnosis Delay (years)', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'delay_by_age_group.png'), dpi=300)
    plt.show()

In [None]:
# Visualize diagnosis delay by racial/ethnic group
if demographic_analysis_df is not None:
    # Calculate the average diagnosis delay by racial/ethnic group
    racial_delay = demographic_analysis_df.groupby('RacialEthnicConsiderations')['DiagnosisDelay'].mean().reset_index()
    
    # Sort by diagnosis delay
    racial_delay = racial_delay.sort_values('DiagnosisDelay', ascending=False)
    
    # Create a color map based on diagnosis delay
    norm = Normalize(vmin=racial_delay['DiagnosisDelay'].min(), 
                     vmax=racial_delay['DiagnosisDelay'].max())
    colors = plt.cm.Reds(norm(racial_delay['DiagnosisDelay']))
    
    # Plot the diagnosis delay by racial/ethnic group
    plt.figure(figsize=(12, 6))
    bars = plt.bar(racial_delay['RacialEthnicConsiderations'], racial_delay['DiagnosisDelay'], color=colors)
    
    plt.title('Average Diagnosis Delay by Racial/Ethnic Group', fontsize=16)
    plt.xlabel('Racial/Ethnic Group', fontsize=12)
    plt.ylabel('Average Diagnosis Delay (years)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'delay_by_racial_ethnic_group.png'), dpi=300)
    plt.show()
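
Group averages like the ones plotted above can be misleading when some groups contain only a handful of records. The sketch below shows one way the group size could be reported next to each mean. It uses a toy frame with the same column names as this notebook; `summarize_group` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd

def summarize_group(df, group_col, value_col):
    """Return the mean and record count of value_col per group (hypothetical helper)."""
    summary = (
        df.groupby(group_col)[value_col]
        .agg(mean="mean", count="count")
        .reset_index()
    )
    # Highest average first, matching the sort order used in the charts
    return summary.sort_values("mean", ascending=False).reset_index(drop=True)

# Toy data for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "AgeGroup": ["18-24", "18-24", "25-34", "25-34", "25-34", "65+"],
    "DismissalFrequency_Numeric": [3, 4, 2, 2, 3, 4],
})
age_summary = summarize_group(toy, "AgeGroup", "DismissalFrequency_Numeric")
print(age_summary)
```

In this toy example the 65+ group has the highest mean but rests on a single record, which a bar chart of means alone would hide.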

## 7. Question Transformation Analysis

Let's analyze how dismissed questions transform into better questions.
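
As a rough illustration of the metrics computed in this section, the sketch below derives the absolute and percentage word increase for two made-up question pairs. The column names mirror those used in this notebook. The data, and the choice to map a zero baseline to `NaN` rather than divide by zero, are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Two made-up question pairs; the second dismissed question is empty
toy = pd.DataFrame({
    "DismissedQuestion_Words": [5, 0],
    "BetterQuestion_Words": [20, 12],
})

# Absolute increase: better minus dismissed
toy["Word_Increase"] = toy["BetterQuestion_Words"] - toy["DismissedQuestion_Words"]

# Percentage increase relative to the dismissed question; a zero baseline
# becomes NaN so it does not produce an infinite value
baseline = toy["DismissedQuestion_Words"].replace(0, np.nan)
toy["Word_Increase_Pct"] = toy["Word_Increase"] / baseline * 100

print(toy)
```

`Series.mean()` skips `NaN` by default, so an average over `Word_Increase_Pct` would reflect only the pairs with a non-zero baseline.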

In [None]:
# Check if we already have the transformation analysis data
transformation_df = load_checkpoint("transformation_analysis")

# If not, perform transformation analysis
if transformation_df is None and demographic_analysis_df is not None:
    # Create a copy of the demographic analysis dataframe
    transformation_df = demographic_analysis_df.copy()
    
    # Calculate the transformation metrics
    transformation_df['Length_Increase'] = transformation_df['BetterQuestion_Length'] - transformation_df['DismissedQuestion_Length']
    transformation_df['Word_Increase'] = transformation_df['BetterQuestion_Words'] - transformation_df['DismissedQuestion_Words']
    transformation_df['Sentence_Increase'] = transformation_df['BetterQuestion_Sentences'] - transformation_df['DismissedQuestion_Sentences']
    transformation_df['Specificity_Increase'] = transformation_df['BetterQuestion_Specificity'] - transformation_df['DismissedQuestion_Specificity']
    transformation_df['Complexity_Increase'] = transformation_df['BetterQuestion_Complexity'] - transformation_df['DismissedQuestion_Complexity']
    
    # Calculate the percentage increases
    # For length, words, and specificity, a zero baseline becomes NaN so that
    # division by zero does not produce infinite values that distort the averages
    transformation_df['Length_Increase_Pct'] = (transformation_df['Length_Increase'] / transformation_df['DismissedQuestion_Length'].replace(0, np.nan)) * 100
    transformation_df['Word_Increase_Pct'] = (transformation_df['Word_Increase'] / transformation_df['DismissedQuestion_Words'].replace(0, np.nan)) * 100
    transformation_df['Sentence_Increase_Pct'] = (transformation_df['Sentence_Increase'] / transformation_df['DismissedQuestion_Sentences'].replace(0, 1)) * 100
    transformation_df['Specificity_Increase_Pct'] = (transformation_df['Specificity_Increase'] / transformation_df['DismissedQuestion_Specificity'].replace(0, np.nan)) * 100
    transformation_df['Complexity_Increase_Pct'] = (transformation_df['Complexity_Increase'] / transformation_df['DismissedQuestion_Complexity'].replace(0, 0.1)) * 100
    
    # Save checkpoint
    save_checkpoint(transformation_df, "transformation_analysis")
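elif transformation_df is not None:
    print("Using existing transformation analysis data")
else:
    print("No demographic analysis data available; skipping transformation analysis")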

In [None]:
# Verify the transformation analysis data
if transformation_df is not None:
    verify_dataframe(transformation_df, "Transformation Analysis")

In [None]:
# Visualize the average transformation metrics
if transformation_df is not None:
    # Calculate the average transformation metrics
    avg_metrics = {
        'Length Increase': transformation_df['Length_Increase'].mean(),
        'Word Increase': transformation_df['Word_Increase'].mean(),
        'Sentence Increase': transformation_df['Sentence_Increase'].mean(),
        'Specificity Increase': transformation_df['Specificity_Increase'].mean(),
        'Complexity Increase': transformation_df['Complexity_Increase'].mean()
    }
    
    # Plot the average transformation metrics
    plt.figure(figsize=(12, 6))
    bars = plt.bar(list(avg_metrics.keys()), list(avg_metrics.values()), color=sns.color_palette("viridis", len(avg_metrics)))
    
    plt.title('Average Question Transformation Metrics', fontsize=16)
    plt.ylabel('Average Increase', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'transformation_metrics.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the average percentage increases
if transformation_df is not None:
    # Calculate the average percentage increases
    avg_pct_increases = {
        'Length': transformation_df['Length_Increase_Pct'].mean(),
        'Words': transformation_df['Word_Increase_Pct'].mean(),
        'Sentences': transformation_df['Sentence_Increase_Pct'].mean(),
        'Specificity': transformation_df['Specificity_Increase_Pct'].mean(),
        'Complexity': transformation_df['Complexity_Increase_Pct'].mean()
    }
    
    # Plot the average percentage increases
    plt.figure(figsize=(12, 6))
    bars = plt.bar(list(avg_pct_increases.keys()), list(avg_pct_increases.values()), color=sns.color_palette("viridis", len(avg_pct_increases)))
    
    plt.title('Average Percentage Increases in Question Transformation', fontsize=16)
    plt.ylabel('Average Percentage Increase (%)', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 10,
                 f'{height:.0f}%',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'transformation_percentage_increases.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the transformation metrics by category
if transformation_df is not None:
    # Calculate the average word increase by category
    category_word_increase = transformation_df.groupby('Category')['Word_Increase'].mean().sort_values(ascending=False)
    
    # Plot the average word increase by category
    plt.figure(figsize=(12, 6))
    bars = plt.bar(category_word_increase.index, category_word_increase.values, 
                   color=sns.color_palette("viridis", len(category_word_increase)))
    
    plt.title('Average Word Increase by Category', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Average Word Increase', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                 f'{height:.1f}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'word_increase_by_category.png'), dpi=300)
    plt.show()
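
The chart above covers only the word increase. The following is a minimal sketch of how several transformation metrics could be averaged per category in one pass. It uses toy data with the same column names as this notebook; the category labels and values are placeholders.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Category": ["A", "A", "B"],   # placeholder category labels
    "Word_Increase": [10, 20, 6],
    "Sentence_Increase": [1, 3, 2],
})

# Average every metric column per category, then order by word increase
metric_cols = ["Word_Increase", "Sentence_Increase"]
category_metrics = (
    toy.groupby("Category")[metric_cols]
    .mean()
    .sort_values("Word_Increase", ascending=False)
)
print(category_metrics)
```

The resulting frame has one row per category and one column per metric, so it could be plotted as grouped bars with `category_metrics.plot(kind='bar')`.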

## 8. Identify Key Factors for Question Effectiveness

Based on our analysis, let's identify the key factors that contribute to question effectiveness.
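
A finding such as "better questions are longer" is easier to trust when the mean increase comes with an uncertainty estimate. The sketch below computes a percentile bootstrap confidence interval for a mean increase. `bootstrap_mean_ci` is a hypothetical helper and the input values are made up, so this illustrates the idea rather than the method used in this notebook.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing entries
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_resamples, values.size), replace=True)
    means = samples.mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper

# Made-up word increases for illustration
word_increase = [12, 18, 9, 25, 14, 20, 11, 16]
mean, lower, upper = bootstrap_mean_ci(word_increase)
print(f"mean={mean:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

If the interval excludes zero, the increase is unlikely to be a sampling artifact. With the real data, the input would presumably be a column such as `transformation_df['Word_Increase']`.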

In [None]:
# Create a summary of key factors
key_factors = {
    "question_length": {
        "finding": "Better questions are significantly longer than dismissed questions",
        "evidence": f"Average length ratio: {preprocessing_summary['question_analysis']['avg_length_ratio']:.1f}x",
        "recommendation": "Encourage patients to provide more detailed questions with context"
    },
    "medical_terminology": {
        "finding": "Better questions contain more medical terminology",
        "evidence": f"Average medical terms difference: {preprocessing_summary['question_analysis']['avg_medical_terms_difference']:.1f}",
        "recommendation": "Include relevant medical terms in questions to improve specificity"
    },
    "question_specificity": {
        "finding": "Better questions are more specific",
        "evidence": f"Average specificity increase: {transformation_df['Specificity_Increase'].mean():.2f}",
        "recommendation": "Include specific details about symptoms, duration, and context"
    },
    "sentence_structure": {
        "finding": "Better questions contain more sentences",
        "evidence": f"Average sentence increase: {transformation_df['Sentence_Increase'].mean():.1f}",
        "recommendation": "Structure questions with multiple sentences covering different aspects"
    },
    "question_complexity": {
        "finding": "Better questions are more complex",
        "evidence": f"Average complexity increase: {transformation_df['Complexity_Increase'].mean():.1f}",
        "recommendation": "Include both simple and complex sentences to convey information effectively"
    },
    "demographic_context": {
        "finding": "Dismissal patterns vary by demographic group",
        "evidence": "Highest dismissal rates observed in specific age and racial/ethnic groups",
        "recommendation": "Tailor questions based on demographic context and known dismissal patterns"
    },
    "symptom_details": {
        "finding": "Better questions include detailed symptom descriptions",
        "evidence": "Key phrases in better questions focus on symptom characteristics",
        "recommendation": "Describe symptoms in detail, including onset, duration, severity, and triggers"
    },
    "medical_history": {
        "finding": "Better questions reference relevant medical history",
        "evidence": "Better questions often mention family history and previous conditions",
        "recommendation": "Include relevant personal and family medical history in questions"
    },
    "question_structure": {
        "finding": "Better questions have a clear structure",
        "evidence": "Better questions typically start with context, then details, then specific query",
        "recommendation": "Structure questions with context first, then details, then specific query"
    },
    "impact_description": {
        "finding": "Better questions describe impact on daily life",
        "evidence": "Better questions often mention how symptoms affect daily activities",
        "recommendation": "Describe how symptoms or conditions impact daily life and functioning"
    }
}

# Save the key factors as JSON
with open(os.path.join(analysis_dir, 'key_factors.json'), 'w') as f:
    json.dump(key_factors, f, indent=2)

print("Key factors saved to:", os.path.join(analysis_dir, 'key_factors.json'))

In [None]:
# Display the key factors
print("\n--- Key Factors for Question Effectiveness ---")
for factor, details in key_factors.items():
    print(f"\n{factor.replace('_', ' ').title()}:")
    print(f"  Finding: {details['finding']}")
    print(f"  Evidence: {details['evidence']}")
    print(f"  Recommendation: {details['recommendation']}")

## 9. Prepare Data for Next Notebook

Let's save the analyzed data for use in the next notebook in the series.
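
Summary statistics computed with pandas can contain `NaN`, infinite values, or NumPy scalar types, which `json.dump` either rejects or writes as non-standard tokens. Below is a minimal sketch of a sanitizing step that could run before saving a summary. `to_json_safe` is a hypothetical helper, not part of this notebook, and it handles scalars and nested dicts and lists only.

```python
import json
import math

def to_json_safe(value):
    """Recursively convert a value into something json.dumps accepts (hypothetical helper)."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    # NumPy scalars expose .item(), which returns the plain Python equivalent
    if hasattr(value, "item") and not isinstance(value, (str, bytes)):
        value = value.item()
    # NaN and infinity are not valid JSON, so store them as null
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value

# Made-up summary for illustration
summary = {"avg_word_increase": float("nan"), "groups": ["18-24", "65+"], "n": 3}
safe_summary = to_json_safe(summary)
print(json.dumps(safe_summary, allow_nan=False))
```

Passing `allow_nan=False` makes `json.dumps` raise an error if a non-finite float slips through, instead of silently writing invalid JSON.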

In [None]:
# Save the transformation analysis data to the analysis directory
if transformation_df is not None:
    transformation_df.to_csv(os.path.join(analysis_dir, 'transformation_analysis.csv'), index=False)
    print(f"Saved transformation analysis data to: {os.path.join(analysis_dir, 'transformation_analysis.csv')}")

# Save the text analysis data to the analysis directory
if text_analysis_df is not None:
    text_analysis_df.to_csv(os.path.join(analysis_dir, 'text_analysis.csv'), index=False)
    print(f"Saved text analysis data to: {os.path.join(analysis_dir, 'text_analysis.csv')}")

# Save the correlation analysis data to the analysis directory
if correlation_df is not None:
    correlation_df.to_csv(os.path.join(analysis_dir, 'correlation_analysis.csv'), index=False)
    print(f"Saved correlation analysis data to: {os.path.join(analysis_dir, 'correlation_analysis.csv')}")

# Save the demographic analysis data to the analysis directory
if demographic_analysis_df is not None:
    demographic_analysis_df.to_csv(os.path.join(analysis_dir, 'demographic_analysis.csv'), index=False)
    print(f"Saved demographic analysis data to: {os.path.join(analysis_dir, 'demographic_analysis.csv')}")

In [None]:
# Create a summary of the analysis results
analysis_summary = {
    "text_analysis": {
        "avg_dismissed_sentences": float(text_analysis_df['DismissedQuestion_Sentences'].mean()) if text_analysis_df is not None else None,
        "avg_better_sentences": float(text_analysis_df['BetterQuestion_Sentences'].mean()) if text_analysis_df is not None else None,
        "avg_dismissed_words": float(text_analysis_df['DismissedQuestion_Words'].mean()) if text_analysis_df is not None else None,
        "avg_better_words": float(text_analysis_df['BetterQuestion_Words'].mean()) if text_analysis_df is not None else None,
        "avg_dismissed_specificity": float(text_analysis_df['DismissedQuestion_Specificity'].mean()) if text_analysis_df is not None else None,
        "avg_better_specificity": float(text_analysis_df['BetterQuestion_Specificity'].mean()) if text_analysis_df is not None else None,
        "avg_dismissed_complexity": float(text_analysis_df['DismissedQuestion_Complexity'].mean()) if text_analysis_df is not None else None,
        "avg_better_complexity": float(text_analysis_df['BetterQuestion_Complexity'].mean()) if text_analysis_df is not None else None
    },
    "transformation_analysis": {
        "avg_length_increase": float(transformation_df['Length_Increase'].mean()) if transformation_df is not None else None,
        "avg_word_increase": float(transformation_df['Word_Increase'].mean()) if transformation_df is not None else None,
        "avg_sentence_increase": float(transformation_df['Sentence_Increase'].mean()) if transformation_df is not None else None,
        "avg_specificity_increase": float(transformation_df['Specificity_Increase'].mean()) if transformation_df is not None else None,
        "avg_complexity_increase": float(transformation_df['Complexity_Increase'].mean()) if transformation_df is not None else None,
        "avg_length_increase_pct": float(transformation_df['Length_Increase_Pct'].mean()) if transformation_df is not None else None,
        "avg_word_increase_pct": float(transformation_df['Word_Increase_Pct'].mean()) if transformation_df is not None else None,
        "avg_sentence_increase_pct": float(transformation_df['Sentence_Increase_Pct'].mean()) if transformation_df is not None else None,
        "avg_specificity_increase_pct": float(transformation_df['Specificity_Increase_Pct'].mean()) if transformation_df is not None else None,
        "avg_complexity_increase_pct": float(transformation_df['Complexity_Increase_Pct'].mean()) if transformation_df is not None else None
    },
    "demographic_analysis": {
        "age_groups": list(demographic_analysis_df['AgeGroup'].unique()) if demographic_analysis_df is not None else [],
        "racial_ethnic_groups": list(demographic_analysis_df['RacialEthnicConsiderations'].unique()) if demographic_analysis_df is not None else [],
        "highest_dismissal_age_group": demographic_analysis_df.groupby('AgeGroup')['DismissalFrequency_Numeric'].mean().idxmax() if demographic_analysis_df is not None else None,
        "highest_dismissal_racial_ethnic_group": demographic_analysis_df.groupby('RacialEthnicConsiderations')['DismissalFrequency_Numeric'].mean().idxmax() if demographic_analysis_df is not None else None,
        "highest_delay_age_group": demographic_analysis_df.groupby('AgeGroup')['DiagnosisDelay'].mean().idxmax() if demographic_analysis_df is not None else None,
        "highest_delay_racial_ethnic_group": demographic_analysis_df.groupby('RacialEthnicConsiderations')['DiagnosisDelay'].mean().idxmax() if demographic_analysis_df is not None else None
    },
    "key_factors": list(key_factors.keys())
}

# Save the analysis summary as JSON
with open(os.path.join(analysis_dir, 'analysis_summary.json'), 'w') as f:
    json.dump(analysis_summary, f, indent=2)

print("Analysis summary saved to:", os.path.join(analysis_dir, 'analysis_summary.json'))

## 10. Conclusion

In this notebook, we analyzed and visualized women's health questions to identify patterns that can help our LLM generate better questions:

1. **Text Analysis**: We analyzed the linguistic characteristics of dismissed and better questions, finding that better questions are longer, more specific, and more complex.

2. **Correlation Analysis**: We identified correlations between question characteristics and dismissal frequency/diagnosis delay, which can help prioritize features for our LLM.

3. **Demographic Analysis**: We analyzed how dismissal patterns vary across age and racial/ethnic groups, finding variations that our LLM should account for.

4. **Question Transformation Analysis**: We analyzed how dismissed questions transform into better questions, measuring the changes in length, word count, sentence count, specificity, and complexity that our LLM can learn from.

5. **Key Factors**: We identified 10 key factors that contribute to question effectiveness, which will guide our LLM in generating better questions.

These insights will be used in the next notebook (Part 4: Training Split Preparation) to prepare the data for LLM training, so that our model can learn to generate questions that are less likely to be dismissed by healthcare providers.

### Next Steps

In the next notebook, we will:
- Prepare the data for LLM training
- Create train/validation/test splits
- Format the data for different LLM frameworks
- Create evaluation metrics for assessing question quality