# Women's Health Data Collection - Part 1

## Overview

This notebook is the first in a series focused on creating a dataset for training an LLM model to predict better questions for women's health consultations. This part focuses specifically on data collection from multiple sources.

### Objectives
- Set up the necessary environment and dependencies
- Collect data from ClinicalTrials.gov API for women's health studies
- Collect data from PubMed for women's health research
- Create a structured dataset of commonly dismissed women's health questions
- Implement save points to preserve collected data

### Why This Matters
Women's health questions are often dismissed in healthcare settings, leading to delayed diagnoses and treatment. By collecting data on commonly dismissed questions and their better alternatives, we can train an LLM to help women formulate more effective questions that are less likely to be dismissed by healthcare providers.

## 1. Environment Setup

First, let's set up our environment by importing necessary libraries and creating directories for our data.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import json
import requests
import time
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from IPython.display import display

# Set up plotting style
plt.style.use('seaborn-v0_8-whitegrid')  # Updated to avoid deprecation warning
sns.set(style="whitegrid")

# Display versions for reproducibility
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Pandas version: 1.5.3
NumPy version: 1.24.3
Current time: 2025-03-31 21:39:52


In [2]:
# Create directory structure for data storage
# This ensures we have a consistent place to save our collected data

# Set the base directory to your specific path
base_dir = '/Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery'

# Main data directory
data_dir = os.path.join(base_dir, 'womens_health_data')
# Raw data from different sources
raw_dir = os.path.join(data_dir, 'raw')
# Checkpoints to save progress
checkpoint_dir = os.path.join(data_dir, 'checkpoints')
# Processed data after cleaning
processed_dir = os.path.join(data_dir, 'processed')

# Create directories if they don't exist
for directory in [data_dir, raw_dir, checkpoint_dir, processed_dir]:
    os.makedirs(directory, exist_ok=True)
    print(f"Created directory: {directory}")

Created directory: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data
Created directory: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data/raw
Created directory: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data/checkpoints
Created directory: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data/processed


## 2. Helper Functions

Here we create helper functions for saving and loading checkpoints. This will allow us to resume our work without having to re-run time-consuming data collection steps.

In [3]:
def save_checkpoint(df, name):
    """
    Save a dataframe as a checkpoint CSV file.
    
    Parameters:
    - df: pandas DataFrame to save
    - name: name of the checkpoint (without extension)
    
    Returns:
    - path: path to the saved file
    """
    # Create the full path with timestamp to avoid overwriting
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.csv"
    path = os.path.join(checkpoint_dir, filename)
    
    # Save the dataframe
    df.to_csv(path, index=False)
    print(f"Checkpoint saved: {path}")
    
    # Also save a version with a fixed name for easy loading
    fixed_path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    df.to_csv(fixed_path, index=False)
    print(f"Latest version saved: {fixed_path}")
    
    return path

print("save_checkpoint function defined.")

save_checkpoint function defined.


In [4]:
def load_checkpoint(name):
    """
    Load the latest checkpoint for a given name.
    
    Parameters:
    - name: name of the checkpoint (without extension)
    
    Returns:
    - df: loaded DataFrame or None if file doesn't exist
    """
    path = os.path.join(checkpoint_dir, f"{name}_latest.csv")
    
    if os.path.exists(path) and os.path.getsize(path) > 0:
        try:
            df = pd.read_csv(path)
            print(f"Checkpoint loaded: {path}")
            print(f"Shape: {df.shape}")
            display(df.head())  # Display the first 5 rows for verification
            return df
        except pd.errors.EmptyDataError:
            print(f"Warning: Checkpoint file exists but is empty: {path}")
            return None
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            return None
    else:
        print(f"Checkpoint not found or empty: {path}")
        return None

print("load_checkpoint function defined.")

load_checkpoint function defined.


In [5]:
def verify_dataframe(df, name):
    """
    Verify a dataframe by displaying basic information.
    
    Parameters:
    - df: pandas DataFrame to verify
    - name: name of the dataframe for display purposes
    """
    print(f"\n--- {name} Verification ---")
    print(f"Shape: {df.shape}")
    print("\nFirst 5 rows:")
    display(df.head())
    print("\nData types:")
    display(df.dtypes)
    print("\nMissing values:")
    missing = df.isnull().sum()
    display(missing[missing > 0] if any(missing > 0) else "No missing values")
    print("\nBasic statistics:")
    display(df.describe(include='all').T)
    print("----------------------------\n")

print("verify_dataframe function defined.")

verify_dataframe function defined.


## 3. Data Collection from ClinicalTrials.gov

ClinicalTrials.gov is a database of clinical studies conducted around the world. We'll use their API to collect data on women's health studies.

In [6]:
def fetch_clinical_trials(term, max_results=100):
    """
    Fetch clinical trials data from ClinicalTrials.gov API.
    
    Parameters:
    - term: search term for clinical trials
    - max_results: maximum number of results to return
    
    Returns:
    - studies: list of study data dictionaries
    """
    base_url = "https://clinicaltrials.gov/api/query/study_fields"
    
    # Define the fields we want to retrieve
    fields = [
        "NCTId", "BriefTitle", "Condition", "OverallStatus", "Phase", 
        "EnrollmentCount", "StartDate", "CompletionDate", "StudyType",
        "InterventionName", "SponsorCollaboratorsModule", "Gender",
        "MinimumAge", "MaximumAge", "EligibilityCriteria"
    ]
    
    # Construct the query parameters
    params = {
        "expr": term + " AND AREA[Gender]Female",  # Fixed string concatenation
        "fields": ",".join(fields),
        "min_rnk": 1,
        "max_rnk": max_results,
        "fmt": "json"
    }
    
    # Make the API request
    print(f"Fetching clinical trials for term: {term}")
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        studies = data.get("StudyFieldsResponse", {}).get("StudyFields", [])
        print(f"Retrieved {len(studies)} studies")
        return studies
    else:
        print(f"Error: {response.status_code}")
        return []

print("fetch_clinical_trials function defined.")

fetch_clinical_trials function defined.


In [7]:
# Check if we already have the clinical trials data
clinical_trials_df = load_checkpoint("clinical_trials")

# If not, fetch the data
if clinical_trials_df is None:
    # Define search terms for women's health conditions
    search_terms = [
        "endometriosis", 
        "polycystic ovary syndrome", 
        "fibromyalgia", 
        "chronic fatigue syndrome", 
        "autoimmune disorders women", 
        "menopause", 
        "pregnancy complications", 
        "gynecological cancer", 
        "breast cancer"
    ]
    
    # Fetch data for each term
    all_studies = []
    for term in search_terms:
        studies = fetch_clinical_trials(term, max_results=50)
        all_studies.extend(studies)
        time.sleep(1)  # Be nice to the API
    
    # Convert to DataFrame
    clinical_trials_df = pd.json_normalize(all_studies)
    
    # Save checkpoint
    save_checkpoint(clinical_trials_df, "clinical_trials")
else:
    print("Using existing clinical trials data")

# Verify the data
verify_dataframe(clinical_trials_df, "Clinical Trials")

Checkpoint not found or empty: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data/checkpoints/clinical_trials_latest.csv
Fetching clinical trials for term: endometriosis
Error: 404
Fetching clinical trials for term: polycystic ovary syndrome
Error: 404
Fetching clinical trials for term: fibromyalgia
Error: 404
Fetching clinical trials for term: chronic fatigue syndrome
Error: 404
Fetching clinical trials for term: autoimmune disorders women
Error: 404
Fetching clinical trials for term: menopause
Error: 404
Fetching clinical trials for term: pregnancy complications
Error: 404
Fetching clinical trials for term: gynecological cancer
Error: 404
Fetching clinical trials for term: breast cancer
Error: 404
Checkpoint saved: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/womens_health_data/checkpoints/clinical_trials_20250331_214024.csv
Latest version saved: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delive


Data types:


Series([], dtype: object)


Missing values:


'No missing values'


Basic statistics:


ValueError: Cannot describe a DataFrame without columns

## 4. Data Collection from PubMed

PubMed is a database of biomedical literature. We'll use their API to collect research papers on women's health topics.

In [None]:
def fetch_pubmed_ids(term, max_results=100):
    """
    Fetch PubMed IDs for a given search term.
    
    Parameters:
    - term: search term for PubMed
    - max_results: maximum number of results to return
    
    Returns:
    - ids: list of PubMed IDs
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    
    # Construct the query parameters
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json"
    }
    
    # Make the API request
    print(f"Fetching PubMed IDs for term: {term}")
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        ids = data.get("esearchresult", {}).get("idlist", [])
        print(f"Retrieved {len(ids)} PubMed IDs")
        return ids
    else:
        print(f"Error: {response.status_code}")
        return []

def fetch_pubmed_details(ids):
    """
    Fetch details for a list of PubMed IDs.
    
    Parameters:
    - ids: list of PubMed IDs
    
    Returns:
    - articles: list of article data dictionaries
    """
    if not ids:
        return []
    
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    
    # Construct the query parameters
    params = {
        "db": "pubmed",
        "id": ",".join(ids),
        "retmode": "json"
    }
    
    # Make the API request
    print(f"Fetching details for {len(ids)} PubMed articles")
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        articles = []
        
        for pmid in ids:
            article_data = data.get("result", {}).get(pmid, {})
            if article_data:
                # Extract relevant fields
                article = {
                    "PMID": pmid,
                    "Title": article_data.get("title", ""),
                    "PubDate": article_data.get("pubdate", ""),
                    "Source": article_data.get("source", ""),
                    "Authors": ", ".join([author.get("name", "") for author in article_data.get("authors", [])]),
                    "DOI": article_data.get("elocationid", ""),
                }
                articles.append(article)
        
        print(f"Retrieved details for {len(articles)} articles")
        return articles
    else:
        print(f"Error: {response.status_code}")
        return []

print("PubMed API functions defined.")

In [None]:
# Check if we already have the PubMed data
pubmed_df = load_checkpoint("pubmed")

# If not, fetch the data
if pubmed_df is None:
    # Define search terms for women's health research
    search_terms = [
        "women's health AND dismissal",
        "women's health AND bias",
        "endometriosis AND diagnosis delay",
        "fibromyalgia AND women AND misdiagnosis",
        "chronic fatigue syndrome AND women",
        "autoimmune disorders AND women AND diagnosis",
        "women's pain AND dismissal",
        "gender bias AND healthcare",
        "women's health AND communication"
    ]
    
    # Fetch data for each term
    all_articles = []
    for term in search_terms:
        ids = fetch_pubmed_ids(term, max_results=30)
        articles = fetch_pubmed_details(ids)
        all_articles.extend(articles)
        time.sleep(1)  # Be nice to the API
    
    # Convert to DataFrame
    pubmed_df = pd.DataFrame(all_articles)
    
    # Save checkpoint
    save_checkpoint(pubmed_df, "pubmed")
else:
    print("Using existing PubMed data")

# Verify the data
verify_dataframe(pubmed_df, "PubMed Articles")

## 5. Create Dismissed Questions Dataset

Now we'll create a structured dataset of commonly dismissed women's health questions and their better alternatives.

In [None]:
# Check if we already have the dismissed questions data
dismissed_questions_df = load_checkpoint("dismissed_questions")

# If not, create the data
if dismissed_questions_df is None:
    # Create a structured dataset of dismissed questions and better alternatives
    dismissed_questions_data = [
        {
            "DismissedQuestion": "I'm tired all the time.",
            "BetterQuestion": "I've been experiencing persistent fatigue for the past 3 months that isn't relieved by rest. It's affecting my ability to work and exercise. Could this be related to my thyroid condition or another underlying issue?",
            "Condition": "Chronic Fatigue Syndrome",
            "Category": "Chronic Pain/Fatigue",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 4.5,
            "AgeGroup": "25-34"
        },
        {
            "DismissedQuestion": "My periods are really painful.",
            "BetterQuestion": "My menstrual pain is severe enough that I miss work for 2-3 days each month. Over-the-counter pain medications don't help, and I also experience pain during intercourse. Could these symptoms be consistent with endometriosis?",
            "Condition": "Endometriosis",
            "Category": "Reproductive Health",
            "DismissalFrequency": "Very High",
            "DiagnosisDelay": 7.5,
            "AgeGroup": "18-24"
        },
        {
            "DismissedQuestion": "I have pain all over my body.",
            "BetterQuestion": "I'm experiencing widespread muscle pain that moves between different parts of my body, along with specific tender points near my joints. I also have sleep disturbances and morning stiffness. Could these symptoms align with fibromyalgia?",
            "Condition": "Fibromyalgia",
            "Category": "Chronic Pain/Fatigue",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 5.0,
            "AgeGroup": "35-44"
        },
        {
            "DismissedQuestion": "I'm gaining weight even though I'm not eating more.",
            "BetterQuestion": "I've gained 15 pounds in the last 3 months despite maintaining my diet and exercise routine. I'm also experiencing irregular periods, acne, and increased hair growth on my face. Could these symptoms be related to PCOS or a hormonal imbalance?",
            "Condition": "Polycystic Ovary Syndrome",
            "Category": "Reproductive Health",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.5,
            "AgeGroup": "18-24"
        },
        {
            "DismissedQuestion": "I'm having hot flashes.",
            "BetterQuestion": "I'm experiencing sudden intense heat sensations several times daily, followed by sweating and chills. These episodes disrupt my sleep and work. Given that I'm 48 and my periods have become irregular, could this be perimenopause, and what treatment options might help manage these symptoms?",
            "Condition": "Perimenopause",
            "Category": "Menopause/Aging",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 1.5,
            "AgeGroup": "45-54"
        },
        {
            "DismissedQuestion": "I feel anxious a lot.",
            "BetterQuestion": "I'm experiencing persistent anxiety with physical symptoms including rapid heartbeat, shortness of breath, and trembling. These episodes occur multiple times weekly and interfere with my daily activities. Given my family history of anxiety disorders, what evaluation would you recommend?",
            "Condition": "Generalized Anxiety Disorder",
            "Category": "Mental Health",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.0,
            "AgeGroup": "25-34"
        },
        {
            "DismissedQuestion": "My joints hurt.",
            "BetterQuestion": "I'm experiencing symmetric joint pain and swelling in my hands, wrists, and knees, particularly in the morning when stiffness lasts for over an hour. The pain improves with movement but worsens with activity. Given that autoimmune conditions run in my family, should I be tested for rheumatoid arthritis?",
            "Condition": "Rheumatoid Arthritis",
            "Category": "Autoimmune Disorders",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.0,
            "AgeGroup": "35-44"
        },
        {
            "DismissedQuestion": "I have headaches frequently.",
            "BetterQuestion": "I experience severe, throbbing headaches on one side of my head 3-4 times monthly, accompanied by nausea, vomiting, and sensitivity to light and sound. These episodes last 4-72 hours and significantly impair my ability to function. Do these symptoms match migraine criteria, and what preventive treatments might be appropriate?",
            "Condition": "Migraine",
            "Category": "Neurological Conditions",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 1.8,
            "AgeGroup": "25-34"
        },
        {
            "DismissedQuestion": "I'm having trouble sleeping.",
            "BetterQuestion": "For the past 6 months, I've had difficulty falling asleep and staying asleep despite maintaining good sleep hygiene. I feel exhausted during the day, affecting my work performance. I've also noticed increased irritability and difficulty concentrating. Could this be insomnia or another sleep disorder?",
            "Condition": "Insomnia",
            "Category": "Sleep Disorders",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 2.5,
            "AgeGroup": "35-44"
        },
        {
            "DismissedQuestion": "My stomach hurts often.",
            "BetterQuestion": "I experience recurrent abdominal pain and bloating that's relieved by bowel movements. My symptoms worsen with stress and certain foods. I alternate between constipation and diarrhea, sometimes with mucus in my stool. These symptoms have persisted for over 6 months. Could this be IBS or should we investigate other digestive conditions?",
            "Condition": "Irritable Bowel Syndrome",
            "Category": "Digestive Disorders",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.0,
            "AgeGroup": "25-34"
        }
    ]
    
    # Convert to DataFrame
    dismissed_questions_df = pd.DataFrame(dismissed_questions_data)
    
    # Save checkpoint
    save_checkpoint(dismissed_questions_df, "dismissed_questions")
else:
    print("Using existing dismissed questions data")

# Verify the data
verify_dataframe(dismissed_questions_df, "Dismissed Questions")

## 6. Create Medical Terminology Dataset

Let's create a dataset of medical terminology with plain language explanations to help bridge the gap between medical and patient language.

In [None]:
# Check if we already have the medical terminology data
medical_terminology_df = load_checkpoint("medical_terminology")

# If not, create the data
if medical_terminology_df is None:
    # Create a structured dataset of medical terminology with plain language explanations
    medical_terminology_data = [
        {
            "MedicalTerm": "Endometriosis",
            "PlainLanguage": "A condition where tissue similar to the lining of the uterus grows outside the uterus, causing pain and potential fertility issues.",
            "Category": "Reproductive Health",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Fibromyalgia",
            "PlainLanguage": "A condition characterized by widespread muscle pain, fatigue, and tender points throughout the body.",
            "Category": "Chronic Pain/Fatigue",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Polycystic Ovary Syndrome",
            "PlainLanguage": "A hormonal disorder causing enlarged ovaries with small cysts, irregular periods, and excess male hormone levels.",
            "Category": "Reproductive Health",
            "Complexity": "High"
        },
        {
            "MedicalTerm": "Chronic Fatigue Syndrome",
            "PlainLanguage": "A complex disorder characterized by extreme fatigue that can't be explained by any underlying medical condition.",
            "Category": "Chronic Pain/Fatigue",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Perimenopause",
            "PlainLanguage": "The transition period before menopause when the ovaries gradually begin to make less estrogen.",
            "Category": "Menopause/Aging",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Dysmenorrhea",
            "PlainLanguage": "Painful menstrual cramps caused by uterine contractions.",
            "Category": "Reproductive Health",
            "Complexity": "High"
        },
        {
            "MedicalTerm": "Hypothyroidism",
            "PlainLanguage": "A condition where the thyroid gland doesn't produce enough thyroid hormone, causing fatigue, weight gain, and cold intolerance.",
            "Category": "Endocrine Disorders",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Rheumatoid Arthritis",
            "PlainLanguage": "An autoimmune disorder that causes inflammation in the joints, leading to pain, swelling, and potential joint damage.",
            "Category": "Autoimmune Disorders",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Irritable Bowel Syndrome",
            "PlainLanguage": "A disorder affecting the large intestine, causing abdominal pain, bloating, and altered bowel habits.",
            "Category": "Digestive Disorders",
            "Complexity": "Medium"
        },
        {
            "MedicalTerm": "Migraine",
            "PlainLanguage": "A neurological condition characterized by severe, recurring headaches, often accompanied by nausea, vomiting, and sensitivity to light and sound.",
            "Category": "Neurological Conditions",
            "Complexity": "Low"
        }
    ]
    
    # Convert to DataFrame
    medical_terminology_df = pd.DataFrame(medical_terminology_data)
    
    # Save checkpoint
    save_checkpoint(medical_terminology_df, "medical_terminology")
else:
    print("Using existing medical terminology data")

# Verify the data
verify_dataframe(medical_terminology_df, "Medical Terminology")

## 7. Summary of Collected Data

Let's summarize the data we've collected and prepare it for the next notebook in the series.

In [None]:
# Display summary statistics for each dataset
print("\n=== Data Collection Summary ===\n")

print(f"Clinical Trials: {len(clinical_trials_df)} records")
print(f"PubMed Articles: {len(pubmed_df)} records")
print(f"Dismissed Questions: {len(dismissed_questions_df)} records")
print(f"Medical Terminology: {len(medical_terminology_df)} records")

# Count unique conditions in the dismissed questions dataset
unique_conditions = dismissed_questions_df['Condition'].nunique()
print(f"\nUnique conditions in dismissed questions dataset: {unique_conditions}")

# Count categories in the dismissed questions dataset
category_counts = dismissed_questions_df['Category'].value_counts()
print("\nCategory distribution in dismissed questions dataset:")
display(category_counts)

# Calculate average diagnosis delay
avg_delay = dismissed_questions_df['DiagnosisDelay'].mean()
print(f"\nAverage diagnosis delay: {avg_delay:.1f} years")

# Identify conditions with highest dismissal frequency
high_dismissal = dismissed_questions_df[dismissed_questions_df['DismissalFrequency'] == 'Very High']
print("\nConditions with very high dismissal frequency:")
display(high_dismissal[['Condition', 'Category', 'DiagnosisDelay']])

print("\n=== Data Collection Complete ===\n")
print("The collected data is now ready for preprocessing in the next notebook.")

## 8. Next Steps

In this notebook, we've collected data from multiple sources for our women's health LLM model:

1. Clinical trials data from ClinicalTrials.gov
2. Research papers from PubMed
3. A structured dataset of dismissed questions and better alternatives
4. Medical terminology with plain language explanations

In the next notebook (Part 2: Data Preprocessing), we'll clean and preprocess this data to prepare it for analysis and LLM training. This will include:

- Cleaning and standardizing text data
- Extracting features from the collected data
- Merging datasets where appropriate
- Creating a unified dataset for LLM training

All the data collected in this notebook has been saved as checkpoints and can be loaded in the next notebook.