# Women's Health Training Split Preparation - Part 4

## Overview

This notebook is the fourth and final part of a series on building a dataset for training an LLM to generate better questions for women's health consultations. It prepares the analyzed data from Part 3 for training by creating train/validation/test splits and formatting the data for several LLM fine-tuning formats.

### Objectives
- Load the analyzed data from Part 3
- Create balanced train/validation/test splits
- Format the data for LLM training
- Create evaluation metrics for assessing question quality
- Prepare the final dataset package

### Why This Matters
Careful preparation of training data is crucial for developing an effective LLM. Splits that preserve the category mix, together with consistent formatting, help the model learn to generate better questions for women's health consultations across a wide range of conditions and demographics.

## 1. Environment Setup

First, let's set up our environment by importing the necessary libraries and defining the directory structure. The analyzed data from Part 3 is loaded in Section 3.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import json
import re
import string
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import random
from sklearn.model_selection import train_test_split, StratifiedKFold
import shutil
import zipfile

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

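# Set up plotting style.
# The 'seaborn-whitegrid' Matplotlib style name was deprecated in Matplotlib 3.6,
# so let seaborn apply the whitegrid theme directly.
sns.set(style="whitegrid")

# Display versions for reproducibility
import matplotlib  # pyplot does not expose __version__; the top-level package does
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")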
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Define directory structure
data_dir = 'womens_health_data'
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')
expanded_dir = os.path.join(data_dir, 'expanded')
analysis_dir = os.path.join(data_dir, 'analysis')
checkpoint_dir = os.path.join(data_dir, 'checkpoints')
figures_dir = os.path.join(data_dir, 'figures')
training_dir = os.path.join(data_dir, 'training')
output_dir = os.path.join(data_dir, 'output')

# Create directories if they don't exist
for directory in [data_dir, raw_dir, processed_dir, expanded_dir, analysis_dir, checkpoint_dir, figures_dir, training_dir, output_dir]:
    os.makedirs(directory, exist_ok=True)
    print(f"Directory ready: {directory}")

## 2. Helper Functions

Let's create some helper functions for data splitting, formatting, and checkpoint management.

In [None]:
def save_checkpoint(df, name):
    """
    Save a dataframe as a checkpoint CSV file.
    
    Parameters:
    - df: pandas DataFrame to save
    - name: name of the checkpoint (without extension)
    
    Returns:
    - path: path to the saved file
    """
    # Create the full path with timestamp to avoid overwriting
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.csv"
    path = os.path.join(checkpoint_dir, filename)
    
    # Save the dataframe
    df.to_csv(path, index=False)
    print(f"Checkpoint saved: {path}")
    
    # Also save a version with a fixed name for easy loading
    fixed_path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    df.to_csv(fixed_path, index=False)
    print(f"Latest version saved: {fixed_path}")
    
    return path

def load_checkpoint(name):
    """
    Load the latest checkpoint for a given name.
    
    Parameters:
    - name: name of the checkpoint (without extension)
    
    Returns:
    - df: loaded DataFrame or None if file doesn't exist
    """
    path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    
    if os.path.exists(path) and os.path.getsize(path) > 0:
        try:
            df = pd.read_csv(path)
            print(f"Checkpoint loaded: {path}")
            print(f"Shape: {df.shape}")
            return df
        except pd.errors.EmptyDataError:
            print(f"Warning: Checkpoint file exists but is empty: {path}")
            return None
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            return None
    else:
        print(f"Checkpoint not found or empty: {path}")
        return None

def verify_dataframe(df, name):
    """
    Verify a dataframe by displaying basic information.
    
    Parameters:
    - df: pandas DataFrame to verify
    - name: name of the dataframe for display purposes
    """
    print(f"\n--- {name} Verification ---")
    print(f"Shape: {df.shape}")
    print("\nFirst 5 rows:")
    display(df.head())
    print("\nData types:")
    display(df.dtypes)
    print("\nMissing values:")
    missing = df.isnull().sum()
    display(missing[missing > 0] if any(missing > 0) else "No missing values")
    print("\nBasic statistics:")
    display(df.describe(include='all').T)
    print("----------------------------\n")

def format_for_llm_training(row, format_type='instruction'):
    """
    Format a row of data for LLM training.
    
    Parameters:
    - row: pandas Series containing a row of data
    - format_type: type of formatting to use ('instruction', 'chat', 'completion')
    
    Returns:
    - formatted_data: formatted data for LLM training
    """
    if format_type == 'instruction':
        # Format for instruction tuning (e.g., Alpaca format)
        instruction = "Transform this dismissed women's health question into a better, more specific question that is less likely to be dismissed by healthcare providers."
        input_text = f"Dismissed Question: {row['DismissedQuestion']}\nCondition: {row['Condition']}\nCategory: {row['Category']}"
        output_text = f"Better Question: {row['BetterQuestion']}"
        
        formatted_data = {
            "instruction": instruction,
            "input": input_text,
            "output": output_text
        }
    
    elif format_type == 'chat':
        # Format for chat tuning (e.g., ChatML format)
        system_message = "You are a helpful assistant that transforms vague or easily dismissed women's health questions into better, more specific questions that are less likely to be dismissed by healthcare providers."
        user_message = f"I need help improving this question for my doctor about my {row['Condition']} (Category: {row['Category']}): '{row['DismissedQuestion']}'"
        assistant_message = f"Here's a better way to ask your question: '{row['BetterQuestion']}'"
        
        formatted_data = {
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ]
        }
    
    elif format_type == 'completion':
        # Format for completion tuning (e.g., GPT format)
        prompt = f"Transform this dismissed women's health question into a better, more specific question that is less likely to be dismissed by healthcare providers.\n\nDismissed Question: {row['DismissedQuestion']}\nCondition: {row['Condition']}\nCategory: {row['Category']}\n\nBetter Question:"
        completion = f" {row['BetterQuestion']}"
        
        formatted_data = {
            "prompt": prompt,
            "completion": completion
        }
    
    else:
        raise ValueError(f"Unknown format type: {format_type}")
    
    return formatted_data

def save_jsonl(data, filename):
    """
    Save data as a JSONL file.
    
    Parameters:
    - data: list of JSON-serializable objects
    - filename: name of the file to save
    """
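    # Write with an explicit encoding so the output does not depend on the platform default
    with open(filename, 'w', encoding='utf-8') as f: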
        for item in data:
            f.write(json.dumps(item) + '\n')
    print(f"Saved {len(data)} items to {filename}")

def calculate_category_distribution(df, category_col='Category'):
    """
    Calculate the distribution of categories in a dataframe.
    
    Parameters:
    - df: pandas DataFrame
    - category_col: name of the category column
    
    Returns:
    - distribution: dictionary mapping categories to percentages
    """
    counts = df[category_col].value_counts()
    total = counts.sum()
    distribution = {category: (count / total) * 100 for category, count in counts.items()}
    return distribution

def calculate_distribution_difference(dist1, dist2):
    """
    Calculate the difference between two distributions.
    
    Parameters:
    - dist1: first distribution (dictionary)
    - dist2: second distribution (dictionary)
    
    Returns:
    - difference: average absolute difference in percentages
    """
    all_categories = set(dist1.keys()) | set(dist2.keys())
    differences = []
    
    for category in all_categories:
        pct1 = dist1.get(category, 0)
        pct2 = dist2.get(category, 0)
        differences.append(abs(pct1 - pct2))
    
    return sum(differences) / len(differences) if differences else 0
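
To make the two distribution helpers concrete, here is a minimal stdlib-only sketch on made-up category labels. The names `category_percentages` and `mean_abs_difference`, and the toy labels, are illustrative assumptions and not part of the notebook; they mirror the logic of `calculate_category_distribution` and `calculate_distribution_difference` without requiring pandas.

```python
from collections import Counter


def category_percentages(labels):
    """Map each label to its share of the list, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total * 100 for label, count in counts.items()}


def mean_abs_difference(dist1, dist2):
    """Average absolute percentage-point gap over the union of categories."""
    categories = set(dist1) | set(dist2)
    if not categories:
        return 0.0
    gaps = [abs(dist1.get(c, 0) - dist2.get(c, 0)) for c in categories]
    return sum(gaps) / len(gaps)


# Toy labels (hypothetical): the full dataset is 50/30/20, the split is 60/20/20.
full_labels = ["Hormonal"] * 5 + ["Pain"] * 3 + ["Cardiac"] * 2
split_labels = ["Hormonal"] * 3 + ["Pain"] * 1 + ["Cardiac"] * 1

full_dist = category_percentages(full_labels)
split_dist = category_percentages(split_labels)
difference = mean_abs_difference(full_dist, split_dist)
print(f"Average difference: {difference:.2f} percentage points")
```

A value near 0 means the split mirrors the full dataset. A category that is missing from one side counts as 0% there, so a split that drops a category entirely is penalized by that category's full share.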

## 3. Load Analyzed Data

Let's load the analyzed data from Part 3.

In [None]:
# Load the transformation analysis data
transformation_path = os.path.join(analysis_dir, 'transformation_analysis.csv')
if os.path.exists(transformation_path):
    transformation_df = pd.read_csv(transformation_path)
    print(f"Loaded transformation analysis data: {transformation_df.shape}")
else:
    print("Transformation analysis data not found. Please run Part 3 first.")
    transformation_df = None

# Load the text analysis data
text_analysis_path = os.path.join(analysis_dir, 'text_analysis.csv')
if os.path.exists(text_analysis_path):
    text_analysis_df = pd.read_csv(text_analysis_path)
    print(f"Loaded text analysis data: {text_analysis_df.shape}")
else:
    print("Text analysis data not found. Please run Part 3 first.")
    text_analysis_df = None

# Load the demographic analysis data
demographic_path = os.path.join(analysis_dir, 'demographic_analysis.csv')
if os.path.exists(demographic_path):
    demographic_df = pd.read_csv(demographic_path)
    print(f"Loaded demographic analysis data: {demographic_df.shape}")
else:
    print("Demographic analysis data not found. Please run Part 3 first.")
    demographic_df = None

# Load the analysis summary
analysis_summary_path = os.path.join(analysis_dir, 'analysis_summary.json')
if os.path.exists(analysis_summary_path):
    with open(analysis_summary_path, 'r') as f:
        analysis_summary = json.load(f)
    print("Loaded analysis summary")
else:
    print("Analysis summary not found. Please run Part 3 first.")
    analysis_summary = None

# Load the key factors
key_factors_path = os.path.join(analysis_dir, 'key_factors.json')
if os.path.exists(key_factors_path):
    with open(key_factors_path, 'r') as f:
        key_factors = json.load(f)
    print("Loaded key factors")
else:
    print("Key factors not found. Please run Part 3 first.")
    key_factors = None

In [None]:
# Verify the loaded data
if transformation_df is not None:
    verify_dataframe(transformation_df, "Transformation Analysis")

## 4. Prepare Data for Training

Let's prepare the data for LLM training by selecting the relevant columns and ensuring data quality.

In [None]:
# Check if we already have the training data
training_data_df = load_checkpoint("training_data")

# If not, prepare the training data
if training_data_df is None and transformation_df is not None:
    # Select the relevant columns for training
    training_columns = [
        'DismissedQuestion', 'BetterQuestion', 'Condition', 'Category',
        'DismissalFrequency', 'DiagnosisDelay', 'AgeGroup', 'RacialEthnicConsiderations',
        'Comorbidities', 'ConditionDemographicRiskNotes'
    ]
    
    # Create a copy of the transformation dataframe with only the relevant columns
    training_data_df = transformation_df[training_columns].copy()
    
    # Check for missing values
    missing_values = training_data_df.isnull().sum()
    print("Missing values:")
    print(missing_values[missing_values > 0] if any(missing_values > 0) else "No missing values")
    
    # Fill missing values
    if any(missing_values > 0):
        # Fill missing Comorbidities and ConditionDemographicRiskNotes with empty strings
        if 'Comorbidities' in missing_values and missing_values['Comorbidities'] > 0:
            training_data_df['Comorbidities'] = training_data_df['Comorbidities'].fillna('')
        
        if 'ConditionDemographicRiskNotes' in missing_values and missing_values['ConditionDemographicRiskNotes'] > 0:
            training_data_df['ConditionDemographicRiskNotes'] = training_data_df['ConditionDemographicRiskNotes'].fillna('')
    
    # Add a unique ID for each example
    training_data_df['ExampleID'] = [f"WH{i:04d}" for i in range(1, len(training_data_df) + 1)]
    
    # Save checkpoint
    save_checkpoint(training_data_df, "training_data")
elif training_data_df is not None:
    print("Using existing training data")
else:
    print("No training data available. Please run Part 3 first.")

In [None]:
# Verify the training data
if training_data_df is not None:
    verify_dataframe(training_data_df, "Training Data")

## 5. Create Train/Validation/Test Splits

Let's create train/validation/test splits with a 70/15/15 ratio, stratified by `Category` so that each split keeps roughly the same category mix as the full dataset.

In [None]:
# Check if we already have the split data
train_df = load_checkpoint("train_data")
val_df = load_checkpoint("val_data")
test_df = load_checkpoint("test_data")

# If not, create the splits
if (train_df is None or val_df is None or test_df is None) and training_data_df is not None:
    # Define the split ratios
    train_ratio = 0.7
    val_ratio = 0.15
    test_ratio = 0.15
    
    # Create a stratified split based on Category
    # First, split into train and temp (val + test)
    train_df, temp_df = train_test_split(
        training_data_df,
        test_size=(val_ratio + test_ratio),
        random_state=42,
        stratify=training_data_df['Category']
    )
    
    # Then split temp into val and test
    val_df, test_df = train_test_split(
        temp_df,
        test_size=test_ratio / (val_ratio + test_ratio),
        random_state=42,
        stratify=temp_df['Category']
    )
    
    # Print the split sizes
    print(f"Training set size: {len(train_df)} ({len(train_df) / len(training_data_df) * 100:.1f}%)")
    print(f"Validation set size: {len(val_df)} ({len(val_df) / len(training_data_df) * 100:.1f}%)")
    print(f"Test set size: {len(test_df)} ({len(test_df) / len(training_data_df) * 100:.1f}%)")
    
    # Save checkpoints
    save_checkpoint(train_df, "train_data")
    save_checkpoint(val_df, "val_data")
    save_checkpoint(test_df, "test_data")
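elif train_df is not None and val_df is not None and test_df is not None:
    print("Using existing split data")
else:
    print("No training data available to split. Please run Section 4 first.")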
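
The cell above reaches 70/15/15 in two steps: it first holds out 30% of the data, then splits that held-out part in half, since 0.15 / (0.15 + 0.15) = 0.5. To show what stratification does to the category counts, here is a minimal stdlib-only sketch. It is not scikit-learn's implementation; the function name `stratified_three_way_split` and the toy records are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_three_way_split(records, key, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Split records into train/val/test, applying the ratios within each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    train, val, test = [], [], []
    # Iterate groups in sorted key order so the result is reproducible.
    for group_key in sorted(groups):
        members = groups[group_key]
        rng.shuffle(members)
        n_val = round(len(members) * val_ratio)
        n_test = round(len(members) * test_ratio)
        val.extend(members[:n_val])
        test.extend(members[n_val:n_val + n_test])
        train.extend(members[n_val + n_test:])
    return train, val, test


# Toy records (hypothetical): 60 examples in category "A" and 40 in category "B".
records = [{"id": i, "Category": "A" if i < 60 else "B"} for i in range(100)]
train, val, test = stratified_three_way_split(records, key="Category")

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), dict(Counter(r["Category"] for r in part)))
```

Because the ratios are applied per category, very small categories can round to zero validation or test examples. scikit-learn's `train_test_split` raises an error when a stratification class has fewer than two members, so it is worth checking category counts before splitting.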

In [None]:
# Verify the split data
if train_df is not None:
    verify_dataframe(train_df, "Training Set")
    
if val_df is not None:
    verify_dataframe(val_df, "Validation Set")
    
if test_df is not None:
    verify_dataframe(test_df, "Test Set")

In [None]:
# Visualize the category distribution in each split
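# training_data_df is needed for the full-dataset baseline, so check it as well
if train_df is not None and val_df is not None and test_df is not None and training_data_df is not None: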
    # Calculate the category distributions
    train_dist = calculate_category_distribution(train_df)
    val_dist = calculate_category_distribution(val_df)
    test_dist = calculate_category_distribution(test_df)
    full_dist = calculate_category_distribution(training_data_df)
    
    # Calculate the distribution differences
    train_diff = calculate_distribution_difference(train_dist, full_dist)
    val_diff = calculate_distribution_difference(val_dist, full_dist)
    test_diff = calculate_distribution_difference(test_dist, full_dist)
    
    print(f"Average distribution difference - Train: {train_diff:.2f}%, Val: {val_diff:.2f}%, Test: {test_diff:.2f}%")
    
    # Create a dataframe for plotting
    categories = list(full_dist.keys())
    plot_data = []
    
    for category in categories:
        plot_data.append({
            'Category': category,
            'Full Dataset': full_dist.get(category, 0),
            'Training Set': train_dist.get(category, 0),
            'Validation Set': val_dist.get(category, 0),
            'Test Set': test_dist.get(category, 0)
        })
    
    plot_df = pd.DataFrame(plot_data)
    
    # Melt the dataframe for easier plotting
    melted_df = pd.melt(plot_df, id_vars=['Category'], var_name='Dataset', value_name='Percentage')
    
    # Plot the category distribution
    plt.figure(figsize=(14, 8))
    sns.barplot(x='Category', y='Percentage', hue='Dataset', data=melted_df)
    plt.title('Category Distribution Across Datasets', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Percentage (%)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Dataset')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'category_distribution_across_datasets.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the dismissal frequency distribution in each split
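# training_data_df is needed for the full-dataset baseline, so check it as well
if train_df is not None and val_df is not None and test_df is not None and training_data_df is not None: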
    # Calculate the dismissal frequency distributions
    train_dismissal = train_df['DismissalFrequency'].value_counts(normalize=True) * 100
    val_dismissal = val_df['DismissalFrequency'].value_counts(normalize=True) * 100
    test_dismissal = test_df['DismissalFrequency'].value_counts(normalize=True) * 100
    full_dismissal = training_data_df['DismissalFrequency'].value_counts(normalize=True) * 100
    
    # Create a dataframe for plotting
    dismissal_levels = ['Low', 'Medium', 'High', 'Very High']
    dismissal_data = []
    
    for level in dismissal_levels:
        dismissal_data.append({
            'DismissalFrequency': level,
            'Full Dataset': full_dismissal.get(level, 0),
            'Training Set': train_dismissal.get(level, 0),
            'Validation Set': val_dismissal.get(level, 0),
            'Test Set': test_dismissal.get(level, 0)
        })
    
    dismissal_df = pd.DataFrame(dismissal_data)
    
    # Melt the dataframe for easier plotting
    melted_dismissal = pd.melt(dismissal_df, id_vars=['DismissalFrequency'], var_name='Dataset', value_name='Percentage')
    
    # Define the order for dismissal frequency
    order = ['Low', 'Medium', 'High', 'Very High']
    melted_dismissal['DismissalFrequency'] = pd.Categorical(melted_dismissal['DismissalFrequency'], categories=order, ordered=True)
    
    # Plot the dismissal frequency distribution
    plt.figure(figsize=(12, 6))
    sns.barplot(x='DismissalFrequency', y='Percentage', hue='Dataset', data=melted_dismissal)
    plt.title('Dismissal Frequency Distribution Across Datasets', fontsize=16)
    plt.xlabel('Dismissal Frequency', fontsize=12)
    plt.ylabel('Percentage (%)', fontsize=12)
    plt.legend(title='Dataset')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(figures_dir, 'dismissal_frequency_distribution_across_datasets.png'), dpi=300)
    plt.show()

## 6. Format Data for LLM Training

Let's format the data for different LLM training frameworks.

In [None]:
# Create the training directory structure
instruction_dir = os.path.join(training_dir, 'instruction_format')
chat_dir = os.path.join(training_dir, 'chat_format')
completion_dir = os.path.join(training_dir, 'completion_format')

for directory in [instruction_dir, chat_dir, completion_dir]:
    os.makedirs(directory, exist_ok=True)
    print(f"Created directory: {directory}")

In [None]:
# Format the data for instruction tuning
if train_df is not None and val_df is not None and test_df is not None:
    # Format the training data
    train_instruction = [format_for_llm_training(row, format_type='instruction') for _, row in train_df.iterrows()]
    val_instruction = [format_for_llm_training(row, format_type='instruction') for _, row in val_df.iterrows()]
    test_instruction = [format_for_llm_training(row, format_type='instruction') for _, row in test_df.iterrows()]
    
    # Save the formatted data
    save_jsonl(train_instruction, os.path.join(instruction_dir, 'train.jsonl'))
    save_jsonl(val_instruction, os.path.join(instruction_dir, 'val.jsonl'))
    save_jsonl(test_instruction, os.path.join(instruction_dir, 'test.jsonl'))
    
    # Display an example
    print("\nExample of instruction format:")
    print(json.dumps(train_instruction[0], indent=2))

In [None]:
# Format the data for chat tuning
if train_df is not None and val_df is not None and test_df is not None:
    # Format the training data
    train_chat = [format_for_llm_training(row, format_type='chat') for _, row in train_df.iterrows()]
    val_chat = [format_for_llm_training(row, format_type='chat') for _, row in val_df.iterrows()]
    test_chat = [format_for_llm_training(row, format_type='chat') for _, row in test_df.iterrows()]
    
    # Save the formatted data
    save_jsonl(train_chat, os.path.join(chat_dir, 'train.jsonl'))
    save_jsonl(val_chat, os.path.join(chat_dir, 'val.jsonl'))
    save_jsonl(test_chat, os.path.join(chat_dir, 'test.jsonl'))
    
    # Display an example
    print("\nExample of chat format:")
    print(json.dumps(train_chat[0], indent=2))

In [None]:
# Format the data for completion tuning
if train_df is not None and val_df is not None and test_df is not None:
    # Format the training data
    train_completion = [format_for_llm_training(row, format_type='completion') for _, row in train_df.iterrows()]
    val_completion = [format_for_llm_training(row, format_type='completion') for _, row in val_df.iterrows()]
    test_completion = [format_for_llm_training(row, format_type='completion') for _, row in test_df.iterrows()]
    
    # Save the formatted data
    save_jsonl(train_completion, os.path.join(completion_dir, 'train.jsonl'))
    save_jsonl(val_completion, os.path.join(completion_dir, 'val.jsonl'))
    save_jsonl(test_completion, os.path.join(completion_dir, 'test.jsonl'))
    
    # Display an example
    print("\nExample of completion format:")
    print(json.dumps(train_completion[0], indent=2))
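
Before handing the JSONL files to a training framework, it can help to read them back and confirm that every line parses and carries the expected keys. The sketch below is one possible check, not part of the original pipeline: `EXPECTED_KEYS` and `validate_jsonl` are hypothetical names, the key sets are taken from `format_for_llm_training`, and the demo writes to a temporary file rather than the real output directories.

```python
import json
import os
import tempfile

# Expected top-level keys per format, taken from format_for_llm_training.
EXPECTED_KEYS = {
    "instruction": {"instruction", "input", "output"},
    "chat": {"messages"},
    "completion": {"prompt", "completion"},
}


def validate_jsonl(path, format_type):
    """Return (line_number, problem) tuples for records that fail basic checks."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
            elif set(record) != EXPECTED_KEYS[format_type]:
                problems.append((line_number, f"unexpected keys: {sorted(record)}"))
    return problems


# Demo on a temporary file: the second record is missing its "output" field.
sample = [
    {"instruction": "Rewrite the question.",
     "input": "Dismissed Question: I'm tired all the time.",
     "output": "Better Question: I've had persistent fatigue for 3 months."},
    {"instruction": "Rewrite the question.",
     "input": "Dismissed Question: I'm tired all the time."},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for item in sample:
            f.write(json.dumps(item) + "\n")
    problems = validate_jsonl(sample_path, "instruction")

print(problems)
```

An empty list means every record passed. The same function could be pointed at the `train.jsonl`, `val.jsonl`, and `test.jsonl` files in each format directory.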

## 7. Create Evaluation Metrics

Let's create evaluation metrics for assessing question quality.

In [None]:
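# Create evaluation metrics based on our analysis.
# The targets below are read from the Part 3 summary, so fail early if it is missing.
if analysis_summary is None:
    raise RuntimeError("analysis_summary is not loaded. Please run Part 3 first.")
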
evaluation_metrics = {
    "length_ratio": {
        "description": "Ratio of generated question length to original question length",
        "target": float(analysis_summary["text_analysis"]["avg_better_words"] / analysis_summary["text_analysis"]["avg_dismissed_words"]),
        "min_acceptable": 2.0,
        "calculation": "len(generated_question.split()) / len(original_question.split())"
    },
    "specificity_score": {
        "description": "Measure of question specificity based on unique word ratio",
        "target": float(analysis_summary["text_analysis"]["avg_better_specificity"]),
        "min_acceptable": 0.7,
        "calculation": "len(set(generated_question.split())) / len(generated_question.split())"
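    },
    "medical_term_count": {
        "description": "Number of medical terms included in the question",
        # NOTE: this target reuses the average word increase as a rough proxy;
        # it is not a measured medical-term count.
        "target": float(analysis_summary["transformation_analysis"]["avg_word_increase"]),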
        "min_acceptable": 3,
        "calculation": "sum(1 for term in medical_terms if term.lower() in generated_question.lower())"
    },
    "sentence_count": {
        "description": "Number of sentences in the question",
        "target": float(analysis_summary["text_analysis"]["avg_better_sentences"]),
        "min_acceptable": 2,
        "calculation": "len(sent_tokenize(generated_question))"
    },
    "complexity_score": {
        "description": "Measure of question complexity based on sentence count and word length",
        "target": float(analysis_summary["text_analysis"]["avg_better_complexity"]),
        "min_acceptable": 10,
        "calculation": "len(sent_tokenize(generated_question)) * (sum(len(word) for word in generated_question.split()) / len(generated_question.split()))"
    },
    "context_inclusion": {
        "description": "Whether the question includes context about symptoms, duration, and impact",
        "target": 1.0,
        "min_acceptable": 0.8,
        "calculation": "binary score based on presence of symptom details, duration, and impact description"
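    },
    "similarity_to_original": {
        "description": "Semantic similarity to the original question to ensure the core intent is preserved",
        # NOTE: this target reuses the dismissed-question specificity score as a
        # placeholder; it is not a measured similarity value.
        "target": float(analysis_summary["text_analysis"]["avg_dismissed_specificity"]),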
        "min_acceptable": 0.5,
        "calculation": "cosine_similarity(vectorize(original_question), vectorize(generated_question))"
    }
}

# Save the evaluation metrics as JSON
with open(os.path.join(training_dir, 'evaluation_metrics.json'), 'w') as f:
    json.dump(evaluation_metrics, f, indent=2)

print("Evaluation metrics saved to:", os.path.join(training_dir, 'evaluation_metrics.json'))

In [None]:
# Display the evaluation metrics
print("\n--- Evaluation Metrics for Question Quality ---")
for metric, details in evaluation_metrics.items():
    print(f"\n{metric.replace('_', ' ').title()}:")
    print(f"  Description: {details['description']}")
    print(f"  Target: {details['target']:.2f}")
    print(f"  Minimum Acceptable: {details['min_acceptable']}")
    print(f"  Calculation: {details['calculation']}")
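
The `calculation` fields above are stored as strings, so nothing in the metrics file is executable on its own. As a minimal sketch, three of the simpler metrics could be implemented as follows. The function names are assumptions, and `sentence_count` uses a rough regular expression as a stand-in for NLTK's `sent_tokenize`, which the metric definition refers to.

```python
import re


def length_ratio(original_question, generated_question):
    """Word-count ratio of the generated question to the original."""
    original_words = original_question.split()
    if not original_words:
        return 0.0
    return len(generated_question.split()) / len(original_words)


def specificity_score(question):
    """Share of whitespace-separated tokens that are unique."""
    words = question.split()
    return len(set(words)) / len(words) if words else 0.0


def sentence_count(question):
    """Rough sentence count; a regex stand-in for NLTK's sent_tokenize."""
    parts = re.split(r"(?<=[.!?])\s+", question.strip())
    return len([part for part in parts if part])


# Example pair adapted from the data dictionary in this notebook.
original = "I'm tired all the time."
generated = (
    "I've been experiencing persistent fatigue for the past 3 months that isn't "
    "relieved by rest. It's affecting my ability to work and exercise."
)

scores = {
    "length_ratio": length_ratio(original, generated),
    "specificity_score": specificity_score(generated),
    "sentence_count": sentence_count(generated),
}
print(scores)
```

Each score can then be compared against the `target` and `min_acceptable` values in `evaluation_metrics`. The unique-word ratio is case- and punctuation-sensitive as written, matching the stored calculation string.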

## 8. Create Dataset Documentation

Let's create documentation for the dataset to help users understand its structure and how to use it.

In [None]:
# Create a README file
readme_content = f"""
# Women's Health LLM Dataset

## Overview

This dataset is designed for training LLMs to generate better questions for women's health consultations. It contains pairs of dismissed questions and better alternatives, along with relevant metadata about conditions, categories, and demographic context.

## Dataset Statistics

- Total examples: {len(training_data_df) if training_data_df is not None else 'N/A'}
- Training set: {len(train_df) if train_df is not None else 'N/A'} examples ({format(len(train_df) / len(training_data_df) * 100, '.1f') + '%' if training_data_df is not None and train_df is not None else 'N/A'})
- Validation set: {len(val_df) if val_df is not None else 'N/A'} examples ({format(len(val_df) / len(training_data_df) * 100, '.1f') + '%' if training_data_df is not None and val_df is not None else 'N/A'})
- Test set: {len(test_df) if test_df is not None else 'N/A'} examples ({format(len(test_df) / len(training_data_df) * 100, '.1f') + '%' if training_data_df is not None and test_df is not None else 'N/A'})
- Categories: {', '.join(training_data_df['Category'].unique()) if training_data_df is not None else 'N/A'}
- Conditions: {len(training_data_df['Condition'].unique()) if training_data_df is not None else 'N/A'} unique conditions

## Key Insights

Based on our analysis, we found that better questions for women's health consultations have the following characteristics:

1. **Length**: Better questions are {analysis_summary['transformation_analysis']['avg_length_increase_pct']:.0f}% longer than dismissed questions
2. **Specificity**: Better questions are {analysis_summary['text_analysis']['avg_better_specificity'] / analysis_summary['text_analysis']['avg_dismissed_specificity']:.1f}x more specific
3. **Structure**: Better questions contain {analysis_summary['text_analysis']['avg_better_sentences']:.1f} sentences on average
4. **Medical Terminology**: Better questions include {analysis_summary['transformation_analysis']['avg_word_increase']:.1f} more words, many of which are medical terms
5. **Context**: Better questions provide context about symptoms, duration, and impact on daily life

## Data Format

The dataset is provided in three formats for different LLM training approaches:

1. **Instruction Format**: For instruction tuning (e.g., Alpaca format)
2. **Chat Format**: For chat tuning (e.g., ChatML format)
3. **Completion Format**: For completion tuning (e.g., GPT format)

Each format includes train, validation, and test splits in JSONL format.

## Directory Structure

```
womens_health_data/
├── training/
│   ├── instruction_format/
│   │   ├── train.jsonl
│   │   ├── val.jsonl
│   │   └── test.jsonl
│   ├── chat_format/
│   │   ├── train.jsonl
│   │   ├── val.jsonl
│   │   └── test.jsonl
│   ├── completion_format/
│   │   ├── train.jsonl
│   │   ├── val.jsonl
│   │   └── test.jsonl
│   └── evaluation_metrics.json
├── raw/
│   └── ... (raw data files)
├── processed/
│   └── ... (processed data files)
├── expanded/
│   └── ... (expanded data files)
├── analysis/
│   ├── transformation_analysis.csv
│   ├── text_analysis.csv
│   ├── demographic_analysis.csv
│   ├── analysis_summary.json
│   └── key_factors.json
└── figures/
    └── ... (visualization figures)
```

## Usage

### Loading the Data

```python
import json

# Load the training data
with open('womens_health_data/training/instruction_format/train.jsonl', 'r') as f:
    train_data = [json.loads(line) for line in f]

# Load the validation data
with open('womens_health_data/training/instruction_format/val.jsonl', 'r') as f:
    val_data = [json.loads(line) for line in f]

# Load the test data
with open('womens_health_data/training/instruction_format/test.jsonl', 'r') as f:
    test_data = [json.loads(line) for line in f]
```

### Evaluation

The `evaluation_metrics.json` file contains metrics for evaluating the quality of generated questions. These metrics are based on our analysis of what makes a question less likely to be dismissed by healthcare providers.

## License

This dataset is provided for research and educational purposes only. It does not contain any personally identifiable information.

## Citation

If you use this dataset in your research, please cite it as follows:

```
@dataset{{womens_health_llm_dataset,
  title={{Women's Health LLM Dataset}},
  author={{Your Name}},
  year={{2025}},
  month={{April}}
}}
```

## Contact

For questions or feedback about this dataset, please contact [your email].
"""

# Save the README file
with open(os.path.join(training_dir, 'README.md'), 'w') as f:
    f.write(readme_content)

print("README saved to:", os.path.join(training_dir, 'README.md'))

In [None]:
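# Create a data dictionary.
# A raw string keeps the literal "\n" sequences in the example cells; in a regular
# string they would become real newlines and break the Markdown table rows.
data_dictionary_content = r"""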
# Women's Health LLM Dataset - Data Dictionary

## Original Data Fields

| Field | Description | Type | Example |
| ----- | ----------- | ---- | ------- |
| ExampleID | Unique identifier for each example | string | WH0001 |
| DismissedQuestion | Original question that is likely to be dismissed | string | "I'm tired all the time." |
| BetterQuestion | Improved version of the question that is less likely to be dismissed | string | "I've been experiencing persistent fatigue for the past 3 months that isn't relieved by rest. It's affecting my ability to work and exercise. Could this be related to my thyroid condition or another underlying issue?" |
| Condition | Medical condition related to the question | string | "Chronic Fatigue Syndrome" |
| Category | Broader category of the condition | string | "Chronic Pain/Fatigue" |
| DismissalFrequency | How often questions about this condition are dismissed | string | "High" |
| DiagnosisDelay | Average time (in years) to diagnosis for this condition | float | 4.5 |
| AgeGroup | Age group most affected by this condition | string | "25-34" |
| RacialEthnicConsiderations | Racial/ethnic considerations for this condition | string | "Black/African American" |
| Comorbidities | Common comorbidities associated with this condition | string | "Depression; Anxiety; Irritable Bowel Syndrome" |
| ConditionDemographicRiskNotes | Notes on demographic risk factors for this condition | string | "More common in women aged 30-50; often misdiagnosed as depression." |

## Instruction Format Fields

| Field | Description | Type | Example |
| ----- | ----------- | ---- | ------- |
| instruction | Task instruction for the model | string | "Transform this dismissed women's health question into a better, more specific question that is less likely to be dismissed by healthcare providers." |
| input | Input text containing the dismissed question and context | string | "Dismissed Question: I'm tired all the time.\nCondition: Chronic Fatigue Syndrome\nCategory: Chronic Pain/Fatigue" |
| output | Expected output text containing the better question | string | "Better Question: I've been experiencing persistent fatigue for the past 3 months that isn't relieved by rest. It's affecting my ability to work and exercise. Could this be related to my thyroid condition or another underlying issue?" |

## Chat Format Fields

| Field | Description | Type | Example |
| ----- | ----------- | ---- | ------- |
| messages | Array of message objects | array | [system_message, user_message, assistant_message] |
| messages[0].role | Role of the first message | string | "system" |
| messages[0].content | Content of the system message | string | "You are a helpful assistant that transforms vague or easily dismissed women's health questions into better, more specific questions that are less likely to be dismissed by healthcare providers." |
| messages[1].role | Role of the second message | string | "user" |
| messages[1].content | Content of the user message | string | "I need help improving this question for my doctor about my Chronic Fatigue Syndrome (Category: Chronic Pain/Fatigue): 'I'm tired all the time.'" |
| messages[2].role | Role of the third message | string | "assistant" |
| messages[2].content | Content of the assistant message | string | "Here's a better way to ask your question: 'I've been experiencing persistent fatigue for the past 3 months that isn't relieved by rest. It's affecting my ability to work and exercise. Could this be related to my thyroid condition or another underlying issue?'" |

## Completion Format Fields

| Field | Description | Type | Example |
| ----- | ----------- | ---- | ------- |
| prompt | Prompt text for the model | string | "Transform this dismissed women's health question into a better, more specific question that is less likely to be dismissed by healthcare providers.\n\nDismissed Question: I'm tired all the time.\nCondition: Chronic Fatigue Syndrome\nCategory: Chronic Pain/Fatigue\n\nBetter Question:" |
| completion | Expected completion text | string | " I've been experiencing persistent fatigue for the past 3 months that isn't relieved by rest. It's affecting my ability to work and exercise. Could this be related to my thyroid condition or another underlying issue?" |
"""

# Save the data dictionary
with open(os.path.join(training_dir, 'DATA_DICTIONARY.md'), 'w', encoding='utf-8') as f:
    f.write(data_dictionary_content)

print("Data dictionary saved to:", os.path.join(training_dir, 'DATA_DICTIONARY.md'))
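
The data dictionary above can double as a lightweight schema check before packaging. The following is a minimal sketch, assuming each formatted example is a plain Python dict with the fields listed in the tables; `REQUIRED_FIELDS`, `EXPECTED_CHAT_ROLES`, and `validate_record` are hypothetical names introduced here for illustration and are not defined elsewhere in this notebook.

```python
# Hypothetical field specs mirroring the data dictionary tables (illustration only)
REQUIRED_FIELDS = {
    "instruction": ("instruction", "input", "output"),
    "chat": ("messages",),
    "completion": ("prompt", "completion"),
}
EXPECTED_CHAT_ROLES = ["system", "user", "assistant"]


def validate_record(record, fmt):
    """Return a list of problems found in one record (an empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS[fmt]:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif fmt != "chat" and not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    if fmt == "chat" and "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list):
            problems.append("messages is not a list")
        else:
            # The data dictionary documents exactly three messages in this order
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if roles != EXPECTED_CHAT_ROLES:
                problems.append(f"unexpected roles: {roles}")
    return problems
```

Running `validate_record` over every example in each split before copying the files would surface missing fields early, rather than at fine-tuning time.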

## 9. Prepare Final Dataset Package

Let's prepare the final dataset package for distribution.

In [None]:
# Create a directory for the final dataset package
package_dir = os.path.join(output_dir, 'Womens_Health_LLM_Dataset')
os.makedirs(package_dir, exist_ok=True)
print(f"Created directory: {package_dir}")

# Create subdirectories
package_training_dir = os.path.join(package_dir, 'training')
package_analysis_dir = os.path.join(package_dir, 'analysis')
package_figures_dir = os.path.join(package_dir, 'figures')
package_documentation_dir = os.path.join(package_dir, 'documentation')

for directory in [package_training_dir, package_analysis_dir, package_figures_dir, package_documentation_dir]:
    os.makedirs(directory, exist_ok=True)
    print(f"Created directory: {directory}")

In [None]:
# Copy the training data
# dirs_exist_ok=True (Python 3.8+) allows this cell to be re-run; without it,
# copytree raises FileExistsError when the destination directory already exists
for format_dir in ['instruction_format', 'chat_format', 'completion_format']:
    shutil.copytree(
        os.path.join(training_dir, format_dir),
        os.path.join(package_training_dir, format_dir),
        dirs_exist_ok=True,
    )
shutil.copy(os.path.join(training_dir, 'evaluation_metrics.json'), os.path.join(package_training_dir, 'evaluation_metrics.json'))
print("Copied training data")

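# Copy the analysis data, warning about any file that Part 3 did not produce
for file in ['transformation_analysis.csv', 'text_analysis.csv', 'demographic_analysis.csv', 'analysis_summary.json', 'key_factors.json']:
    source_path = os.path.join(analysis_dir, file)
    if os.path.exists(source_path):
        shutil.copy(source_path, os.path.join(package_analysis_dir, file))
    else:
        print(f"Warning: analysis file not found, skipping: {file}")
print("Copied analysis data")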

# Copy the figures
for file in os.listdir(figures_dir):
    if file.endswith('.png'):
        shutil.copy(os.path.join(figures_dir, file), os.path.join(package_figures_dir, file))
print("Copied figures")

# Copy the documentation
shutil.copy(os.path.join(training_dir, 'README.md'), os.path.join(package_dir, 'README.md'))
shutil.copy(os.path.join(training_dir, 'DATA_DICTIONARY.md'), os.path.join(package_documentation_dir, 'DATA_DICTIONARY.md'))
print("Copied documentation")

In [None]:
# Create a summary of the dataset package
package_summary = {
    "dataset_name": "Women's Health LLM Dataset",
    "version": "1.0.0",
    "created_date": datetime.now().strftime("%Y-%m-%d"),
    "total_examples": len(training_data_df) if training_data_df is not None else 0,
    "train_examples": len(train_df) if train_df is not None else 0,
    "val_examples": len(val_df) if val_df is not None else 0,
    "test_examples": len(test_df) if test_df is not None else 0,
    "categories": list(training_data_df['Category'].unique()) if training_data_df is not None else [],
    "conditions": list(training_data_df['Condition'].unique()) if training_data_df is not None else [],
    "formats": ["instruction", "chat", "completion"],
    "key_metrics": {
        "avg_length_ratio": float(analysis_summary["text_analysis"]["avg_better_words"] / analysis_summary["text_analysis"]["avg_dismissed_words"]),
        "avg_specificity_ratio": float(analysis_summary["text_analysis"]["avg_better_specificity"] / analysis_summary["text_analysis"]["avg_dismissed_specificity"]),
        "avg_sentence_count": float(analysis_summary["text_analysis"]["avg_better_sentences"]),
        "avg_word_increase": float(analysis_summary["transformation_analysis"]["avg_word_increase"]),
        "avg_complexity_increase": float(analysis_summary["transformation_analysis"]["avg_complexity_increase"])
    }
}

# Save the package summary
with open(os.path.join(package_dir, 'dataset_summary.json'), 'w') as f:
    json.dump(package_summary, f, indent=2)

print("Package summary saved to:", os.path.join(package_dir, 'dataset_summary.json'))
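
The two ratio metrics in the summary divide by averages taken from the Part 3 analysis, so a zero or missing denominator would raise an exception when this cell runs. A minimal sketch of a guard is shown below; `safe_ratio` is a hypothetical helper, and the numbers in `example_text_analysis` are illustrative values, not results from the real analysis.

```python
def safe_ratio(numerator, denominator, default=None):
    """Return numerator / denominator as a float, or `default` when the
    denominator is zero or either value is missing."""
    if numerator is None or denominator is None or denominator == 0:
        return default
    return float(numerator) / float(denominator)


# Illustrative values only (assumption): not taken from analysis_summary
example_text_analysis = {"avg_better_words": 36.0, "avg_dismissed_words": 6.0}
ratio = safe_ratio(
    example_text_analysis["avg_better_words"],
    example_text_analysis["avg_dismissed_words"],
)
print(ratio)
```

A `None` result serializes to `null` with `json.dump`, which keeps the summary file valid while making the missing metric visible.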

In [None]:
# Create a zip file of the dataset package
zip_path = os.path.join(output_dir, 'Womens_Health_LLM_Dataset.zip')
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(package_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, os.path.dirname(package_dir))
            zipf.write(file_path, arcname)

print(f"Dataset package created: {zip_path}")
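
Before distributing the archive, it can be worth confirming that it opens cleanly and contains the files the package is expected to have. The sketch below uses only the standard `zipfile` module; `verify_archive` is a hypothetical helper name, and the list of expected members is something the caller supplies.

```python
import zipfile


def verify_archive(zip_path, expected_members):
    """Check archive integrity and report which expected members are missing."""
    with zipfile.ZipFile(zip_path) as zipf:
        # testzip() returns the name of the first corrupt member, or None if all pass
        corrupt = zipf.testzip()
        names = set(zipf.namelist())
    missing = sorted(m for m in expected_members if m not in names)
    return corrupt, missing
```

Because the archive names are written relative to the parent of `package_dir`, members are stored with a `Womens_Health_LLM_Dataset/` prefix and forward slashes, for example `Womens_Health_LLM_Dataset/README.md` and `Womens_Health_LLM_Dataset/dataset_summary.json`.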

## 10. Conclusion

In this notebook, we've prepared the analyzed data from Part 3 for LLM training:

1. **Data Preparation**: We selected the relevant columns and ensured data quality for LLM training.

2. **Train/Validation/Test Splits**: We created balanced splits for training, validation, and testing, ensuring that each split has a similar distribution of categories and dismissal frequencies.

3. **Data Formatting**: We formatted the data for different LLM training frameworks, including instruction tuning, chat tuning, and completion tuning.

4. **Evaluation Metrics**: We created evaluation metrics for assessing question quality based on our analysis of what makes a question less likely to be dismissed.

5. **Dataset Documentation**: We created comprehensive documentation for the dataset, including a README file and a data dictionary.

6. **Final Dataset Package**: We prepared a final dataset package for distribution, including all the necessary files and documentation.

The dataset is now ready for training an LLM to generate better questions for women's health consultations. Any of the three provided formats (instruction, chat, or completion) can be used for training, and the evaluation metrics can be used to assess the quality of the generated questions.

### Next Steps

To use this dataset for LLM training:

1. Choose the data format that matches your fine-tuning setup: instruction (Alpaca-style), chat (ChatML-style), or completion (GPT-style prompt/completion pairs)
2. Load the corresponding formatted data
3. Fine-tune the model using the training set
4. Evaluate the model using the validation set and the provided evaluation metrics
5. Test the final model on the test set

The resulting model should be able to generate better questions for women's health consultations that are less likely to be dismissed by healthcare providers.
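
Steps 2 and 4 above can be sketched as follows. This is a rough illustration under stated assumptions: it assumes the formatted splits are stored as JSON Lines (one JSON object per line), which should be checked against the actual files in the package, and `load_jsonl` and `word_length_ratio` are hypothetical helpers. The word-count ratio is only a crude sanity check and does not replace the metrics in `evaluation_metrics.json`.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumes a JSON Lines file)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def word_length_ratio(dismissed, better):
    """Words in the better question divided by words in the dismissed one.

    Returns None when the dismissed question is empty.
    """
    dismissed_words = len(dismissed.split())
    if dismissed_words == 0:
        return None
    return len(better.split()) / dismissed_words
```

Applied to a model's output for each validation example, a ratio well above 1 indicates that the generated question adds detail relative to the dismissed one, which can then be compared against the `avg_length_ratio` recorded in `dataset_summary.json`.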