# Women's Health Data Collection - Part 1

## Overview

This notebook is the first in a series on building a dataset for training an LLM to suggest better questions for women's health consultations. This part covers data collection from multiple sources.

### Objectives
- Set up the necessary environment and dependencies
- Collect data from ClinicalTrials.gov API for women's health studies
- Collect data from PubMed for women's health research
- Create a structured dataset of commonly dismissed women's health questions
- Create a medical terminology dataset with plain-language explanations
- Implement save points to preserve collected data

### Why This Matters
Women's health questions are often dismissed in healthcare settings, leading to delayed diagnoses and treatment. By collecting data on commonly dismissed questions and their better alternatives, we can train an LLM to help women formulate more effective questions that are less likely to be dismissed by healthcare providers.

## 1. Environment Setup

First, let's set up our environment by importing necessary libraries and creating directories for our data.

In [29]:
# Import necessary libraries  
import pandas as pd  
import numpy as np  
import os  
import json  
import requests  
import time  
from datetime import datetime  
import matplotlib.pyplot as plt  
import seaborn as sns  
from tqdm.notebook import tqdm  
  
# Set up plotting style - using updated style name  
plt.style.use('seaborn-v0_8-whitegrid')  # Updated from 'seaborn-whitegrid'  
sns.set(style="whitegrid")  
  
# Display versions for reproducibility  
print(f"Pandas version: {pd.__version__}")  
print(f"NumPy version: {np.__version__}")  
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")  

Pandas version: 1.5.3
NumPy version: 1.24.3
Current time: 2025-03-31 21:49:27


In [30]:
# Create directory structure for data storage  
# This ensures we have a consistent place to save our collected data  
  
# Main data directory  
data_dir = 'womens_health_data'  
# Raw data from different sources  
raw_dir = os.path.join(data_dir, 'raw')  
# Checkpoints to save progress  
checkpoint_dir = os.path.join(data_dir, 'checkpoints')  
# Processed data after cleaning  
processed_dir = os.path.join(data_dir, 'processed')  
  
# Create directories if they don't exist  
for directory in [data_dir, raw_dir, checkpoint_dir, processed_dir]:  
    os.makedirs(directory, exist_ok=True)  
  
print(f"Directory structure created at: {os.path.abspath(data_dir)}")  

Directory structure created at: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/documentation/womens_health_data


In [31]:

# You can customize this to your own path if needed  
data_dir = 'womens_health_data'  
  
# Print the absolute path to confirm  
print(f"Data directory: {os.path.abspath(data_dir)}")  

Data directory: /Users/cazandraaporbo/Desktop/FOXX_Health/coding_work/jup_files/delivery/documentation/womens_health_data


## 2. Helper Functions

Create helper functions for saving and loading checkpoints. This will allow us to resume our work without having to re-run time-consuming data collection steps.

In [32]:
def save_checkpoint(df, name):  
    """Save a dataframe as a checkpoint CSV file."""  
    # Create the full path with timestamp to avoid overwriting  
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  
    filename = f"{name}_{timestamp}.csv"  
    path = os.path.join(checkpoint_dir, filename)  
      
    # Save the dataframe  
    df.to_csv(path, index=False)  
    print(f"Checkpoint saved: {path}")  
      
    # Also save a version with a fixed name for easy loading  
    fixed_path = os.path.join(checkpoint_dir, f"{name}_latest.csv")  
    df.to_csv(fixed_path, index=False)  
    print(f"Latest version saved: {fixed_path}")  
      
    return path  
  
print("save_checkpoint function defined.")  

save_checkpoint function defined.


In [33]:
def load_checkpoint(name):  
    """Load the latest checkpoint for a given name."""  
    path = os.path.join(checkpoint_dir, f"{name}_latest.csv")  
      
    if os.path.exists(path) and os.path.getsize(path) > 0:  
        try:  
            df = pd.read_csv(path)  
            print(f"Checkpoint loaded: {path}")  
            print(f"Shape: {df.shape}")  
            return df  
        except pd.errors.EmptyDataError:  
            print(f"Warning: Checkpoint file exists but is empty: {path}")  
            return None  
        except Exception as e:  
            print(f"Error loading checkpoint: {e}")  
            return None  
    else:  
        print(f"Checkpoint not found or empty: {path}")  
        return None  
  
print("load_checkpoint function defined.")  

load_checkpoint function defined.
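
One side effect of CSV checkpoints is worth keeping in mind: column types are re-inferred on load, so identifier columns stored as strings can come back as integers. The sketch below illustrates this with a temporary directory standing in for `checkpoint_dir`; the file name and sample values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Temporary directory used as a stand-in for checkpoint_dir (illustrative only)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_latest.csv")

    # Identifiers are stored as strings before saving
    original = pd.DataFrame({"PMID": ["12345678", "23456789"], "Title": ["A", "B"]})
    original.to_csv(path, index=False)

    # Default load: pandas infers PMID as an integer column
    inferred = pd.read_csv(path)
    # Passing an explicit dtype keeps the identifiers as strings
    preserved = pd.read_csv(path, dtype={"PMID": str})

print("Inferred dtype:", inferred["PMID"].dtype)
print("Preserved values:", preserved["PMID"].tolist())
```

If later steps concatenate or compare identifiers as text, passing `dtype` when loading (or casting with `astype(str)` afterwards) avoids type errors on a resumed run.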


In [27]:
# Define verify_dataframe function  
def verify_dataframe(df, name):  
    """Verify a dataframe by displaying basic information."""  
    print("\n--- " + name + " Verification ---")  
    print("Shape:", df.shape)  
    print("\nFirst 5 rows:")  
    display(df.head())  
    print("\nData types:")  
    display(df.dtypes)  
    print("\nMissing values:")  
    missing = df.isnull().sum()  
    if any(missing > 0):  
        display(missing[missing > 0])  
    else:  
        print("No missing values")  
    print("\nBasic statistics:")  
    display(df.describe(include='all').T)  
    print("----------------------------\n")  
  
print("verify_dataframe function defined.")  

verify_dataframe function defined.


In [34]:
def verify_dataframe(df, name):  
    """Verify a dataframe by displaying basic information."""  
    print(f"\n--- {name} Verification ---")  
    print(f"Shape: {df.shape}")  
    print("\nFirst 5 rows:")  
    display(df.head())  
    print("\nData types:")  
    display(df.dtypes)  
    print("\nMissing values:")  
    missing = df.isnull().sum()  
    if any(missing > 0):  
        display(missing[missing > 0])  
    else:  
        print("No missing values")  
    print("\nBasic statistics:")  
    display(df.describe(include='all').T)  
    print("----------------------------\n")  
  
print("verify_dataframe function defined.")  

verify_dataframe function defined.


In [36]:
with open('Womens_Health_Data_Collection_Part1.ipynb', 'r', encoding='utf-8') as f:
    notebook = json.load(f)

code_cells = [cell for cell in notebook.get('cells', []) if cell.get('cell_type') == 'code']

# Show the next few code cells (4-6)
for i, cell in enumerate(code_cells[3:6], 4):
    source = ''.join(cell.get('source', []))
    print(f"\n--- Code Cell {i} ---")
    print(source[:500] + "..." if len(source) > 500 else source)  # Limit output length


## 3. Data Collection from ClinicalTrials.gov

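ClinicalTrials.gov is a registry of clinical studies conducted around the world. We'll use its API to collect data on women's health studies. In the run recorded below, the requests to the `study_fields` endpoint returned 404 Not Found, so the resulting DataFrame is empty.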

In [37]:
def fetch_clinical_trials(term, max_results=100):
    """
    Fetch clinical trials data from ClinicalTrials.gov API.
    
    Parameters:
    - term: search term
    - max_results: maximum number of results to return
    
    Returns:
    - data: JSON response from the API
    """
    base_url = "https://clinicaltrials.gov/api/query/study_fields"
    
    # Define the fields we want to retrieve
    fields = [
        "NCTId", "BriefTitle", "OfficialTitle", "Condition", "ConditionMeshTerm",
        "Gender", "MinimumAge", "MaximumAge", "HealthyVolunteers",
        "StudyType", "Phase", "EnrollmentCount", "StartDate", "PrimaryCompletionDate",
        "Sponsor", "LeadSponsorName", "LocationCountry", "LocationFacility"
    ]
    
    # Construct the query parameters
    params = {
        "expr": term + " AND AREA[Gender]Female",  # Focus on women's health
        "fields": ",".join(fields),
        "min_rnk": 1,
        "max_rnk": max_results,
        "fmt": "json"
    }
    
    # Make the request
    print(f"Fetching clinical trials for term: {term}")
    response = requests.get(base_url, params=params, timeout=30)  # timeout so a stalled request cannot hang the notebook
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        print(f"Retrieved {len(data['StudyFieldsResponse']['StudyFields'])} studies")
        return data
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None
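
The 404 responses recorded in the next cell suggest that the `api/query/study_fields` endpoint is no longer served. ClinicalTrials.gov documents a newer REST API (v2); the sketch below shows how a request against it might look. The endpoint URL, the `query.cond`, `pageSize` and `format` parameters, and the nested `protocolSection` layout are assumptions based on that API and should be checked against the current documentation. The function names `flatten_study` and `fetch_clinical_trials_v2` are hypothetical, and the sex filter is left to a client-side check on the `Sex` field rather than a server-side expression.

```python
import requests

API_V2_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint


def flatten_study(study):
    """Flatten one v2-style study record into a flat dict (assumed layout)."""
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    eligibility = protocol.get("eligibilityModule", {})
    design = protocol.get("designModule", {})
    conditions = protocol.get("conditionsModule", {}).get("conditions", [])
    sponsor = protocol.get("sponsorCollaboratorsModule", {}).get("leadSponsor", {})
    return {
        "StudyID": ident.get("nctId"),
        "Title": ident.get("briefTitle"),
        "Condition": "; ".join(conditions),
        "Sex": eligibility.get("sex"),
        "MinimumAge": eligibility.get("minimumAge"),
        "StudyType": design.get("studyType"),
        "Sponsor": sponsor.get("name"),
    }


def fetch_clinical_trials_v2(condition, max_results=50):
    """Request one page of studies for a condition and return flat dicts."""
    params = {"query.cond": condition, "pageSize": max_results, "format": "json"}
    response = requests.get(API_V2_URL, params=params, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return [flatten_study(s) for s in response.json().get("studies", [])]


# Offline check with a hand-written record shaped like the assumed response
sample = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000", "briefTitle": "Example study"},
        "conditionsModule": {"conditions": ["Endometriosis", "Pelvic Pain"]},
        "eligibilityModule": {"sex": "FEMALE", "minimumAge": "18 Years"},
    }
}
flat = flatten_study(sample)
print(flat)
```

Because `flatten_study` returns flat dictionaries, the results can be passed straight to `pd.DataFrame` without the list-joining step that the legacy response format required.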

In [10]:
# Check if we already have clinical trials data
clinical_trials_df = load_checkpoint("clinical_trials")

# If not, collect the data
if clinical_trials_df is None:
    # Define search terms for women's health conditions
    search_terms = [
        "Endometriosis", "Polycystic Ovary Syndrome", "Uterine Fibroids",
        "Pregnancy Complications", "Postpartum Depression", "Menopause",
        "Osteoporosis", "Breast Cancer", "Ovarian Cancer", "Cervical Cancer",
        "Autoimmune Disease Women", "Heart Disease Women", "Fibromyalgia",
        "Chronic Fatigue Syndrome", "Thyroid Disease Women"
    ]
    
    # Initialize an empty list to store all studies
    all_studies = []
    
    # Fetch data for each search term
    for term in tqdm(search_terms, desc="Fetching clinical trials"):
        data = fetch_clinical_trials(term, max_results=50)  # Limit to 50 per term to avoid overloading
        
        if data and 'StudyFieldsResponse' in data and 'StudyFields' in data['StudyFieldsResponse']:
            studies = data['StudyFieldsResponse']['StudyFields']
            all_studies.extend(studies)
            
            # Add a small delay to avoid overwhelming the API
            time.sleep(1)
    
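    # Convert to DataFrame
    clinical_trials_df = pd.json_normalize(all_studies)
    
    # Save a checkpoint only if something was collected, so a failed run
    # does not leave an empty "latest" file behind
    if not clinical_trials_df.empty:
        save_checkpoint(clinical_trials_df, "clinical_trials")
    else:
        print("No clinical trials retrieved; checkpoint not saved")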
else:
    print("Using existing clinical trials data")

Checkpoint not found or empty: womens_health_data/checkpoints/clinical_trials_latest.csv


Fetching clinical trials:   0%|          | 0/15 [00:00<?, ?it/s]

Fetching clinical trials for term: Endometriosis
Error: 404
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.26.2</center>
</body>
</html>


In [11]:
# Verify the clinical trials data
verify_dataframe(clinical_trials_df, "Clinical Trials")


--- Clinical Trials Verification ---
Shape: (0, 0)

First 5 rows:



Data types:


Series([], dtype: object)


Missing values:


'No missing values'


Basic statistics:


ValueError: Cannot describe a DataFrame without columns
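
The `ValueError` above is raised because `describe()` cannot summarize a DataFrame that has no columns, which is what an empty collection run produces. A defensive variant could skip the statistics step in that case. The helper name `safe_describe` is hypothetical; this is a minimal sketch, not a replacement for `verify_dataframe`.

```python
import pandas as pd


def safe_describe(df):
    """Return describe() output, or None when the DataFrame has no columns."""
    if df.shape[1] == 0:
        print("DataFrame has no columns; skipping summary statistics.")
        return None
    return df.describe(include="all").T


# An empty frame is skipped instead of raising
empty_summary = safe_describe(pd.DataFrame())

# A frame with data is summarized as usual (sample values are illustrative)
filled_summary = safe_describe(pd.DataFrame({"Enrollment": [10, 20, 30]}))
print(filled_summary)
```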

In [12]:
# Clean and preprocess the clinical trials data
# This step ensures we have consistent data for further analysis

# First, let's check which columns have list values
list_columns = [col for col in clinical_trials_df.columns if clinical_trials_df[col].apply(lambda x: isinstance(x, list)).any()]
print(f"Columns with list values: {list_columns}")

# For list columns, we'll join the values with a semicolon
for col in list_columns:
    clinical_trials_df[col] = clinical_trials_df[col].apply(lambda x: "; ".join(x) if isinstance(x, list) else x)

# Rename columns for clarity
clinical_trials_df = clinical_trials_df.rename(columns={
    'NCTId': 'StudyID',
    'BriefTitle': 'Title',
    'LeadSponsorName': 'Sponsor',
    'EnrollmentCount': 'Enrollment',
    'ConditionMeshTerm': 'ConditionTerms'
})

# Select only the columns we need
selected_columns = [
    'StudyID', 'Title', 'Condition', 'ConditionTerms', 'Gender',
    'MinimumAge', 'MaximumAge', 'StudyType', 'Phase', 'Enrollment',
    'StartDate', 'Sponsor', 'LocationCountry'
]

# Filter columns that exist in our dataframe
existing_columns = [col for col in selected_columns if col in clinical_trials_df.columns]
clinical_trials_df = clinical_trials_df[existing_columns]

# Save the cleaned data
save_checkpoint(clinical_trials_df, "clinical_trials_cleaned")

Columns with list values: []
Checkpoint saved: womens_health_data/checkpoints/clinical_trials_cleaned_20250331_210506.csv
Latest version saved: womens_health_data/checkpoints/clinical_trials_cleaned_latest.csv


'womens_health_data/checkpoints/clinical_trials_cleaned_20250331_210506.csv'

In [13]:
# Verify the cleaned clinical trials data
verify_dataframe(clinical_trials_df, "Cleaned Clinical Trials")


--- Cleaned Clinical Trials Verification ---
Shape: (0, 0)

First 5 rows:



Data types:


Series([], dtype: object)


Missing values:


'No missing values'


Basic statistics:


ValueError: Cannot describe a DataFrame without columns

## 4. Data Collection from PubMed

PubMed is a database of biomedical literature. We'll use the NCBI E-utilities API (ESearch to find article IDs, then ESummary to fetch their metadata) to collect publication records on women's health topics, with a focus on medical dismissal.

In [14]:
def fetch_pubmed_data(term, max_results=100):
    """
    Fetch publication data from PubMed API.
    
    Parameters:
    - term: search term
    - max_results: maximum number of results to return
    
    Returns:
    - data: list of publication data
    """
    # Base URLs for PubMed API (E-utilities)
    esearch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    esummary_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    
    # Step 1: Search for IDs
    search_params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
        "sort": "relevance"
    }
    
    print(f"Searching PubMed for term: {term}")
    search_response = requests.get(esearch_url, params=search_params, timeout=30)
    
    if search_response.status_code != 200:
        print(f"Error in search: {search_response.status_code}")
        return []
    
    search_data = search_response.json()
    id_list = search_data['esearchresult']['idlist']
    
    if not id_list:
        print("No results found")
        return []
    
    print(f"Found {len(id_list)} results")
    
    # Step 2: Get summaries for the IDs
    summary_params = {
        "db": "pubmed",
        "id": ",".join(id_list),
        "retmode": "json"
    }
    
    summary_response = requests.get(esummary_url, params=summary_params, timeout=30)
    
    if summary_response.status_code != 200:
        print(f"Error in summary: {summary_response.status_code}")
        return []
    
    summary_data = summary_response.json()
    
    # Extract relevant information
    publications = []
    for article_id in id_list:
        if article_id in summary_data['result']:
            article = summary_data['result'][article_id]
            
            # Extract authors
            authors = []
            if 'authors' in article and article['authors']:
                authors = [author['name'] for author in article['authors'] if 'name' in author]
            
            # Create publication record
            publication = {
                'PMID': article_id,
                'Title': article.get('title', ''),
                'Authors': '; '.join(authors),
                'Journal': article.get('fulljournalname', ''),
                'PubDate': article.get('pubdate', ''),
                'SearchTerm': term
            }
            
            publications.append(publication)
    
    return publications
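
E-utilities requests can optionally carry an `api_key`, plus `tool` and `email` fields that identify the caller; NCBI's usage guidelines allow roughly 3 requests per second without a key and 10 with one. The sketch below shows one way to build the ESearch parameters with those optional fields. The helper names and the placeholder values in the example call are hypothetical.

```python
def build_esearch_params(term, max_results=100, api_key=None, email=None, tool=None):
    """Build ESearch query parameters, adding optional NCBI identification fields."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
        "sort": "relevance",
    }
    # Only include optional fields that were actually provided
    optional = {"api_key": api_key, "email": email, "tool": tool}
    params.update({key: value for key, value in optional.items() if value})
    return params


def seconds_between_requests(api_key=None):
    """Minimum delay implied by the limits above: 3 requests/s without a key, 10 with one."""
    return 1 / 10 if api_key else 1 / 3


# Example call with placeholder values (not real credentials)
params = build_esearch_params("women pain dismissal", max_results=30, email="you@example.org")
print(params)
print(seconds_between_requests())
```

The one-second pause used in the collection loop is well within these limits, so the helper matters mainly if the number of search terms grows.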

In [15]:
# Check if we already have PubMed data
pubmed_df = load_checkpoint("pubmed")

# If not, collect the data
if pubmed_df is None:
    # Define search terms focusing on women's health and medical dismissal
    search_terms = [
        "women health medical dismissal",
        "gender bias healthcare",
        "women pain dismissal",
        "endometriosis diagnosis delay",
        "women autoimmune disease diagnosis",
        "women heart disease symptoms",
        "women chronic fatigue diagnosis",
        "women fibromyalgia diagnosis",
        "women medical gaslighting",
        "women healthcare communication"
    ]
    
    # Initialize an empty list to store all publications
    all_publications = []
    
    # Fetch data for each search term
    for term in tqdm(search_terms, desc="Fetching PubMed data"):
        publications = fetch_pubmed_data(term, max_results=30)  # Limit to 30 per term
        all_publications.extend(publications)
        
        # Add a small delay to avoid overwhelming the API
        time.sleep(1)
    
    # Convert to DataFrame
    pubmed_df = pd.DataFrame(all_publications)
    
    # Remove duplicates based on PMID
    pubmed_df = pubmed_df.drop_duplicates(subset=['PMID'])
    
    # Save checkpoint
    save_checkpoint(pubmed_df, "pubmed")
else:
    print("Using existing PubMed data")

Checkpoint not found or empty: womens_health_data/checkpoints/pubmed_latest.csv


Fetching PubMed data:   0%|          | 0/10 [00:00<?, ?it/s]

Searching PubMed for term: women health medical dismissal
Found 30 results
Searching PubMed for term: gender bias healthcare
Found 30 results
Searching PubMed for term: women pain dismissal
Found 30 results
Searching PubMed for term: endometriosis diagnosis delay
Found 30 results
Searching PubMed for term: women autoimmune disease diagnosis
Found 30 results
Searching PubMed for term: women heart disease symptoms
Found 30 results
Searching PubMed for term: women chronic fatigue diagnosis
Found 30 results
Searching PubMed for term: women fibromyalgia diagnosis
Found 30 results
Searching PubMed for term: women medical gaslighting
Found 16 results
Searching PubMed for term: women healthcare communication
Found 30 results
Checkpoint saved: womens_health_data/checkpoints/pubmed_20250331_210533.csv
Latest version saved: womens_health_data/checkpoints/pubmed_latest.csv


In [None]:
# Verify the PubMed data
verify_dataframe(pubmed_df, "PubMed Publications")

In [None]:
# Clean and preprocess the PubMed data

# Extract year from PubDate
pubmed_df['Year'] = pubmed_df['PubDate'].str.extract(r'(\d{4})')

# Convert Year to numeric
pubmed_df['Year'] = pd.to_numeric(pubmed_df['Year'], errors='coerce')

# Fill missing years with the median year
median_year = pubmed_df['Year'].median()
pubmed_df['Year'] = pubmed_df['Year'].fillna(median_year)

# Create a URL column for PubMed links
# (cast PMID to str: it is read back as an integer when loaded from a CSV checkpoint)
pubmed_df['URL'] = 'https://pubmed.ncbi.nlm.nih.gov/' + pubmed_df['PMID'].astype(str)

# Save the cleaned data
save_checkpoint(pubmed_df, "pubmed_cleaned")
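
To make the year extraction concrete, the sketch below applies the same regular expression to a few `PubDate` strings. The sample values are made up to resemble typical formats, and keeping missing years as `<NA>` in a nullable integer column is shown as an alternative to median imputation, not as what the cell above does.

```python
import pandas as pd

# Illustrative PubDate strings, including an empty and a missing value
sample = pd.DataFrame({"PubDate": ["2021 Mar 15", "2019 Winter", "", None]})

# Extract the first four-digit run; [0] selects the single capture group as a Series
extracted = sample["PubDate"].str.extract(r"(\d{4})")[0]

# Convert to numbers, then to a nullable integer type so missing years stay missing
sample["Year"] = pd.to_numeric(extracted, errors="coerce").astype("Int64")
print(sample)
```

Leaving gaps as missing keeps imputed years from being mistaken for real publication dates in later analysis; whether that is preferable to the median fill depends on how `Year` is used downstream.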

In [None]:
# Verify the cleaned PubMed data
verify_dataframe(pubmed_df, "Cleaned PubMed Publications")

## 5. Creating a Dataset of Dismissed Questions

Now we'll create a structured dataset of commonly dismissed women's health questions and their better alternatives. This is the core data that will be used to train our LLM.

In [None]:
# Check if we already have dismissed questions data
dismissed_questions_df = load_checkpoint("dismissed_questions")

# If not, create the dataset
if dismissed_questions_df is None:
    # Create a structured dataset of dismissed questions and better alternatives
    dismissed_questions_data = [
        {
            "DismissedQuestion": "I'm tired all the time.",
            "BetterQuestion": "I've been experiencing extreme fatigue for the past 3 months, even after 8 hours of sleep. It's affecting my ability to work. Could this be related to my thyroid condition?",
            "Condition": "Chronic Fatigue Syndrome",
            "Category": "Chronic Pain/Fatigue",
            "DismissalFrequency": "Very High",
            "DiagnosisDelay": 5.2,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "My periods are really painful.",
            "BetterQuestion": "My menstrual pain is so severe that I miss work for 2-3 days each month. Over-the-counter pain medications don't help, and I often vomit from the pain. Could this be endometriosis?",
            "Condition": "Endometriosis",
            "Category": "Reproductive Health",
            "DismissalFrequency": "Very High",
            "DiagnosisDelay": 7.5,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "Black/African American"
        },
        {
            "DismissedQuestion": "I have headaches a lot.",
            "BetterQuestion": "I experience throbbing headaches with visual aura 3-4 times per month, typically around my menstrual cycle. They last 24-48 hours and are accompanied by nausea and sensitivity to light. How can we determine if these are hormonal migraines?",
            "Condition": "Migraine",
            "Category": "Chronic Pain/Fatigue",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.6,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "Hispanic/Latina"
        },
        {
            "DismissedQuestion": "I'm gaining weight for no reason.",
            "BetterQuestion": "I've gained 20 pounds in the last 6 months despite no changes to my diet or exercise routine. I'm also experiencing hair loss, fatigue, and feeling cold all the time. Could this be a thyroid issue?",
            "Condition": "Hashimoto's Thyroiditis",
            "Category": "Autoimmune Conditions",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 4.8,
            "AgeGroup": "45-54",
            "RacialEthnicConsiderations": "Asian"
        },
        {
            "DismissedQuestion": "My joints hurt.",
            "BetterQuestion": "I have symmetric joint pain and stiffness in my hands, wrists, and knees that's worst in the morning and lasts for more than an hour. I also feel fatigued and occasionally feverish. My mother had rheumatoid arthritis. Should I be tested for autoimmune conditions?",
            "Condition": "Rheumatoid Arthritis",
            "Category": "Autoimmune Conditions",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 4.3,
            "AgeGroup": "55-64",
            "RacialEthnicConsiderations": "Native American/Alaska Native"
        },
        {
            "DismissedQuestion": "I'm having hot flashes.",
            "BetterQuestion": "I'm 48 and experiencing intense hot flashes 8-10 times daily, along with night sweats that disrupt my sleep. My periods have become irregular over the past year. What treatment options are available for perimenopause symptoms that are this severe?",
            "Condition": "Perimenopause",
            "Category": "Menopause/Aging",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 2.1,
            "AgeGroup": "45-54",
            "RacialEthnicConsiderations": "White/Caucasian"
        },
        {
            "DismissedQuestion": "I feel sad after having my baby.",
            "BetterQuestion": "I gave birth 6 weeks ago and have been experiencing persistent feelings of hopelessness, difficulty bonding with my baby, and thoughts of harming myself. I'm sleeping only 2-3 hours a night even when the baby is sleeping. Is this postpartum depression requiring immediate treatment?",
            "Condition": "Postpartum Depression",
            "Category": "Pregnancy/Postpartum",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 3.7,
            "AgeGroup": "25-34",
            "RacialEthnicConsiderations": "Black/African American"
        },
        {
            "DismissedQuestion": "Sex is painful.",
            "BetterQuestion": "I experience a burning, tearing pain at the entrance of my vagina during penetration that continues throughout intercourse. This has been happening for 6 months and is affecting my relationship. Could this be vulvodynia or another condition requiring specialized treatment?",
            "Condition": "Vulvodynia",
            "Category": "Sexual Health",
            "DismissalFrequency": "Very High",
            "DiagnosisDelay": 6.4,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "Hispanic/Latina"
        },
        {
            "DismissedQuestion": "I have chest pain sometimes.",
            "BetterQuestion": "I'm experiencing chest tightness and shortness of breath during physical activity, along with unusual fatigue. The discomfort radiates to my jaw and left arm. I'm 58 with a family history of heart disease. Given that heart attack symptoms can be different in women, should I be evaluated for cardiovascular disease?",
            "Condition": "Coronary Artery Disease",
            "Category": "Cardiovascular Health",
            "DismissalFrequency": "High",
            "DiagnosisDelay": 5.0,
            "AgeGroup": "55-64",
            "RacialEthnicConsiderations": "Black/African American"
        },
        {
            "DismissedQuestion": "I have stomach problems.",
            "BetterQuestion": "I've been experiencing severe abdominal pain, bloating, and alternating constipation and diarrhea for over 6 months. The pain is often worse after eating and during my period. I've lost 10 pounds unintentionally. Could this be irritable bowel syndrome, endometriosis affecting my bowels, or something else that requires investigation?",
            "Condition": "Irritable Bowel Syndrome",
            "Category": "Chronic Pain/Fatigue",
            "DismissalFrequency": "Medium",
            "DiagnosisDelay": 3.8,
            "AgeGroup": "35-44",
            "RacialEthnicConsiderations": "Asian"
        }
    ]
    
    # Convert to DataFrame
    dismissed_questions_df = pd.DataFrame(dismissed_questions_data)
    
    # Save checkpoint
    save_checkpoint(dismissed_questions_df, "dismissed_questions")
else:
    print("Using existing dismissed questions data")
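
Because these records are written by hand, a small consistency check can catch typos before the data reaches training. The sketch below validates the column set and the `DismissalFrequency` labels. The function name `validate_questions` is hypothetical, and the allowed labels are simply the three values that appear in the records above.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "DismissedQuestion", "BetterQuestion", "Condition", "Category",
    "DismissalFrequency", "DiagnosisDelay", "AgeGroup", "RacialEthnicConsiderations",
]
# Labels observed in the dataset above (assumed to be the full allowed set)
ALLOWED_FREQUENCIES = ["Medium", "High", "Very High"]


def validate_questions(df):
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"Missing columns: {missing}")
        return problems  # remaining checks need these columns
    unexpected = sorted(set(df["DismissalFrequency"]) - set(ALLOWED_FREQUENCIES))
    if unexpected:
        problems.append(f"Unexpected DismissalFrequency values: {unexpected}")
    if (df["DiagnosisDelay"] < 0).any():
        problems.append("Negative DiagnosisDelay values")
    if df["DismissedQuestion"].duplicated().any():
        problems.append("Duplicate DismissedQuestion entries")
    return problems


# One record copied from the dataset above, plus a deliberately broken copy
good = pd.DataFrame([{
    "DismissedQuestion": "My joints hurt.",
    "BetterQuestion": "Should I be tested for autoimmune conditions?",
    "Condition": "Rheumatoid Arthritis",
    "Category": "Autoimmune Conditions",
    "DismissalFrequency": "Medium",
    "DiagnosisDelay": 4.3,
    "AgeGroup": "55-64",
    "RacialEthnicConsiderations": "Native American/Alaska Native",
}])
bad = good.assign(DismissalFrequency="Extreme")

print(validate_questions(good))
print(validate_questions(bad))
```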

In [None]:
# Verify the dismissed questions data
verify_dataframe(dismissed_questions_df, "Dismissed Questions")

## 6. Creating a Medical Terminology Dataset

To help our LLM understand medical terminology, we'll create a dataset of medical terms with plain language explanations.

In [None]:
# Check if we already have medical terminology data
medical_terminology_df = load_checkpoint("medical_terminology")

# If not, create the dataset
if medical_terminology_df is None:
    # Create a dataset of medical terms with plain language explanations
    medical_terminology_data = [
        {
            "Term": "Endometriosis",
            "Definition": "A condition where tissue similar to the lining of the uterus grows outside the uterus.",
            "PlainLanguage": "Tissue that normally lines the inside of your uterus grows outside of it, causing pain and other symptoms.",
            "Category": "Reproductive Health"
        },
        {
            "Term": "Fibromyalgia",
            "Definition": "A disorder characterized by widespread musculoskeletal pain accompanied by fatigue, sleep, memory and mood issues.",
            "PlainLanguage": "A condition that causes pain all over the body, along with tiredness, sleep problems, and sometimes emotional and mental distress.",
            "Category": "Chronic Pain/Fatigue"
        },
        {
            "Term": "Hypothyroidism",
            "Definition": "A condition in which the thyroid gland doesn't produce enough thyroid hormone.",
            "PlainLanguage": "When your thyroid gland doesn't make enough of certain important hormones, causing symptoms like fatigue and weight gain.",
            "Category": "Autoimmune Conditions"
        },
        {
            "Term": "Polycystic Ovary Syndrome",
            "Definition": "A hormonal disorder causing enlarged ovaries with small cysts on the outer edges.",
            "PlainLanguage": "A condition where a woman's hormones are out of balance, often causing irregular periods, excess hair growth, and small cysts on the ovaries.",
            "Category": "Reproductive Health"
        },
        {
            "Term": "Vulvodynia",
            "Definition": "Chronic pain or discomfort around the opening of the vagina (vulva) for which there is no identifiable cause.",
            "PlainLanguage": "Ongoing pain in the area around the opening of the vagina without a clear cause.",
            "Category": "Sexual Health"
        },
        {
            "Term": "Perimenopause",
            "Definition": "The transition period before menopause when the ovaries gradually begin to make less estrogen.",
            "PlainLanguage": "The time leading up to menopause when your body starts producing less estrogen, causing changes in your periods and other symptoms.",
            "Category": "Menopause/Aging"
        },
        {
            "Term": "Postpartum Depression",
            "Definition": "A mood disorder that can affect women after childbirth, characterized by feelings of extreme sadness, anxiety, and exhaustion.",
            "PlainLanguage": "Depression that happens after having a baby, causing feelings of extreme sadness, worry, and tiredness.",
            "Category": "Pregnancy/Postpartum"
        },
        {
            "Term": "Microvascular Coronary Dysfunction",
            "Definition": "A disease affecting the small coronary artery blood vessels, more common in women than men.",
            "PlainLanguage": "A heart condition where the small blood vessels in the heart don't work properly, often causing chest pain. It's more common in women.",
            "Category": "Cardiovascular Health"
        },
        {
            "Term": "Hashimoto's Thyroiditis",
            "Definition": "An autoimmune disorder in which the immune system attacks the thyroid, often leading to hypothyroidism.",
            "PlainLanguage": "A condition where your immune system attacks your thyroid gland, usually causing it to become underactive.",
            "Category": "Autoimmune Conditions"
        },
        {
            "Term": "Adenomyosis",
            "Definition": "A condition in which the tissue that normally lines the uterus grows into the muscular wall of the uterus.",
            "PlainLanguage": "A condition where the tissue that normally lines your uterus grows into the muscle wall, causing pain and heavy periods.",
            "Category": "Reproductive Health"
        }
    ]
    
    # Convert to DataFrame
    medical_terminology_df = pd.DataFrame(medical_terminology_data)
    
    # Save checkpoint
    save_checkpoint(medical_terminology_df, "medical_terminology")
else:
    print("Using existing medical terminology data")
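
If the `Term` column is meant to line up with the `Condition` column of the dismissed questions dataset (an assumption; the notebook does not say so), a left merge with `indicator=True` shows which conditions still lack a plain-language entry. The sketch uses small excerpts of both datasets.

```python
import pandas as pd

# Small excerpts of the two datasets defined in this notebook
questions = pd.DataFrame({
    "Condition": ["Endometriosis", "Migraine", "Vulvodynia"],
    "DismissedQuestion": [
        "My periods are really painful.",
        "I have headaches a lot.",
        "Sex is painful.",
    ],
})
terms = pd.DataFrame({
    "Term": ["Endometriosis", "Vulvodynia", "Adenomyosis"],
    "Category": ["Reproductive Health", "Sexual Health", "Reproductive Health"],
})

# indicator=True adds a _merge column describing where each row was found
merged = questions.merge(
    terms, how="left", left_on="Condition", right_on="Term", indicator=True
)

# Conditions present in the questions but absent from the terminology list
uncovered = merged.loc[merged["_merge"] == "left_only", "Condition"].tolist()
print(uncovered)
```

On the full datasets the same check would flag conditions such as Migraine, which appears among the dismissed questions but not in the terminology list.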

In [None]:
# Verify the medical terminology data
verify_dataframe(medical_terminology_df, "Medical Terminology")

## 7. Data Collection Summary

Let's summarize the data we've collected and prepare it for the next notebook in the series.

In [None]:
# Create a summary of the collected data
data_summary = {
    "clinical_trials": {
        "count": len(clinical_trials_df),
        "columns": list(clinical_trials_df.columns),
        "conditions": clinical_trials_df['Condition'].nunique() if 'Condition' in clinical_trials_df.columns else 0
    },
    "pubmed": {
        "count": len(pubmed_df),
        "columns": list(pubmed_df.columns),
        "search_terms": pubmed_df['SearchTerm'].nunique() if 'SearchTerm' in pubmed_df.columns else 0
    },
    "dismissed_questions": {
        "count": len(dismissed_questions_df),
        "columns": list(dismissed_questions_df.columns),
        "conditions": dismissed_questions_df['Condition'].nunique() if 'Condition' in dismissed_questions_df.columns else 0,
        "categories": dismissed_questions_df['Category'].nunique() if 'Category' in dismissed_questions_df.columns else 0
    },
    "medical_terminology": {
        "count": len(medical_terminology_df),
        "columns": list(medical_terminology_df.columns),
        "categories": medical_terminology_df['Category'].nunique() if 'Category' in medical_terminology_df.columns else 0
    }
}

# Save the summary as JSON
with open(os.path.join(data_dir, 'data_collection_summary.json'), 'w', encoding='utf-8') as f:
    json.dump(data_summary, f, indent=2)

print("Data collection summary saved to:", os.path.join(data_dir, 'data_collection_summary.json'))
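
As a quick sanity check, the saved summary can be read back and compared with the in-memory dictionary. The sketch below is illustrative only: it uses a small stand-in dictionary with made-up counts and column names, and a temporary directory rather than the notebook's `data_summary` and `data_dir`. `roundtrip_summary` is a hypothetical helper name.

```python
import json
import os
import tempfile


def roundtrip_summary(summary, directory, filename="data_collection_summary.json"):
    """Write a summary dict to JSON, read it back, and return the reloaded copy."""
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in data (hypothetical values and column names, not real collection results)
example_summary = {
    "clinical_trials": {"count": 3, "columns": ["NCTId", "Condition"], "conditions": 2},
    "pubmed": {"count": 5, "columns": ["PMID", "SearchTerm"], "search_terms": 4},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = roundtrip_summary(example_summary, tmp_dir)

# The reloaded dict should match the original exactly
assert reloaded == example_summary
total_records = sum(source["count"] for source in reloaded.values())
print("Total records in example summary:", total_records)
```

If any value in the summary were a NumPy integer rather than a plain Python `int`, `json.dump` would raise a `TypeError`; wrapping such values in `int(...)` before saving is one way to avoid that.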

In [None]:
# Display the summary
print("\n--- Data Collection Summary ---")
print(f"Clinical Trials: {data_summary['clinical_trials']['count']} records")
print(f"PubMed Publications: {data_summary['pubmed']['count']} records")
print(f"Dismissed Questions: {data_summary['dismissed_questions']['count']} records")
print(f"Medical Terminology: {data_summary['medical_terminology']['count']} records")
print("\nTotal Records Collected:", 
      data_summary['clinical_trials']['count'] + 
      data_summary['pubmed']['count'] + 
      data_summary['dismissed_questions']['count'] + 
      data_summary['medical_terminology']['count'])

## 8. Visualize Data Collection Results

Let's create some visualizations to better understand the data we've collected.

In [None]:
# Create a bar chart of the number of records by data source
sources = ['Clinical Trials', 'PubMed Publications', 'Dismissed Questions', 'Medical Terminology']
counts = [
    data_summary['clinical_trials']['count'],
    data_summary['pubmed']['count'],
    data_summary['dismissed_questions']['count'],
    data_summary['medical_terminology']['count']
]

plt.figure(figsize=(10, 6))
bars = plt.bar(sources, counts, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
plt.title('Number of Records by Data Source', fontsize=16)
plt.xlabel('Data Source', fontsize=12)
plt.ylabel('Number of Records', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add count labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{int(height)}',
             ha='center', va='bottom', fontsize=11)

plt.tight_layout()
plt.savefig(os.path.join(data_dir, 'data_collection_summary.png'), dpi=300)
plt.show()

In [None]:
# Visualize the categories in the dismissed questions dataset
if 'Category' in dismissed_questions_df.columns:
    category_counts = dismissed_questions_df['Category'].value_counts()
    
    plt.figure(figsize=(12, 6))
    bars = plt.bar(category_counts.index, category_counts.values, color=sns.color_palette("viridis", len(category_counts)))
    plt.title('Distribution of Categories in Dismissed Questions Dataset', fontsize=16)
    plt.xlabel('Category', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add count labels on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(data_dir, 'dismissed_questions_categories.png'), dpi=300)
    plt.show()

In [None]:
# Visualize the dismissal frequency in the dismissed questions dataset
if 'DismissalFrequency' in dismissed_questions_df.columns:
    # Define the order for dismissal frequency
    order = ['Very High', 'High', 'Medium', 'Low']
    
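    # Count the frequencies; fill_value=0 keeps missing levels from becoming NaN,
    # which would break int(height) in the label loop below
    dismissal_counts = dismissed_questions_df['DismissalFrequency'].value_counts().reindex(order, fill_value=0)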
    
    # Create a color map based on severity
    colors = ['#d62728', '#ff7f0e', '#ffbb78', '#2ca02c']
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(dismissal_counts.index, dismissal_counts.values, color=colors)
    plt.title('Distribution of Dismissal Frequencies', fontsize=16)
    plt.xlabel('Dismissal Frequency', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add count labels on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{int(height)}',
                 ha='center', va='bottom', fontsize=11)
    
    plt.tight_layout()
    plt.savefig(os.path.join(data_dir, 'dismissal_frequencies.png'), dpi=300)
    plt.show()
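
One detail worth illustrating: reindexing value counts against a fixed label order yields `NaN` for any label absent from the data unless a fill value is given, and `int(NaN)` raises an error when labeling bars. The following minimal sketch uses made-up frequencies (not the notebook's data) to show the difference.

```python
import pandas as pd

# Made-up example values; 'Low' is deliberately absent
frequencies = pd.Series(["High", "Very High", "High", "Medium"])
order = ["Very High", "High", "Medium", "Low"]

counts = frequencies.value_counts()

# Without a fill value, the missing 'Low' label becomes NaN
with_nan = counts.reindex(order)

# With fill_value=0, every label has an integer count and plots safely
filled = counts.reindex(order, fill_value=0)

print("Labels with no data:", int(with_nan.isna().sum()))
print("Counts in display order:", filled.tolist())
```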

In [None]:
# Visualize the publication years in the PubMed dataset
if 'Year' in pubmed_df.columns:
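    # Coerce years to integers so the x-axis is numeric and sorts chronologically
    years = pd.to_numeric(pubmed_df['Year'], errors='coerce').dropna().astype(int)
    year_counts = years.value_counts().sort_index()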
    
    plt.figure(figsize=(12, 6))
    plt.plot(year_counts.index, year_counts.values, marker='o', linestyle='-', linewidth=2, markersize=8)
    plt.title('Distribution of PubMed Publications by Year', fontsize=16)
    plt.xlabel('Year', fontsize=12)
    plt.ylabel('Number of Publications', fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.7)
    
    # Add data labels
    for x, y in zip(year_counts.index, year_counts.values):
        plt.text(x, y + 0.3, str(int(y)), ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout()
    plt.savefig(os.path.join(data_dir, 'pubmed_years.png'), dpi=300)
    plt.show()

## 9. Prepare for Next Notebook

Let's save all our data in a format that can be easily loaded in the next notebook in the series.

In [None]:
# Save all datasets to the processed directory for use in the next notebook
clinical_trials_df.to_csv(os.path.join(processed_dir, 'clinical_trials.csv'), index=False)
pubmed_df.to_csv(os.path.join(processed_dir, 'pubmed.csv'), index=False)
dismissed_questions_df.to_csv(os.path.join(processed_dir, 'dismissed_questions.csv'), index=False)
medical_terminology_df.to_csv(os.path.join(processed_dir, 'medical_terminology.csv'), index=False)

print("All datasets saved to the processed directory for use in the next notebook.")
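
For reference, the next notebook could load these files back with a small helper like the one below. This is only a sketch: `load_processed` is a hypothetical function, and the demonstration writes a tiny stand-in CSV to a temporary directory instead of reading the real `processed_dir`.

```python
import os
import tempfile

import pandas as pd

# File stems matching the CSV names written above
DATASET_NAMES = ["clinical_trials", "pubmed", "dismissed_questions", "medical_terminology"]


def load_processed(directory, names=DATASET_NAMES):
    """Load each <name>.csv found in `directory` into a dict of DataFrames."""
    datasets = {}
    for name in names:
        path = os.path.join(directory, f"{name}.csv")
        if os.path.exists(path):
            datasets[name] = pd.read_csv(path)
        else:
            # Missing files are reported rather than raising, so partial runs still load
            print(f"Skipping {name}: {path} not found")
    return datasets


# Demonstration with a stand-in file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = pd.DataFrame({"Term": ["Adenomyosis"], "Category": ["Reproductive Health"]})
    sample.to_csv(os.path.join(tmp_dir, "medical_terminology.csv"), index=False)
    loaded = load_processed(tmp_dir)

print(sorted(loaded))
print(loaded["medical_terminology"].shape)
```

Returning a dictionary keyed by dataset name keeps the loading code in one place and makes it easy to see which sources were actually available.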

## 10. Conclusion

In this notebook, we collected data from multiple sources for our women's health LLM:

1. **Clinical Trials Data**: We collected data on women's health studies from ClinicalTrials.gov.
2. **PubMed Publications**: We gathered research papers on women's health and medical dismissal.
3. **Dismissed Questions Dataset**: We created a structured dataset of commonly dismissed women's health questions and their better alternatives.
4. **Medical Terminology**: We compiled a dataset of medical terms with plain language explanations.

This data will serve as the foundation for an LLM that aims to help women formulate better questions for healthcare consultations.

### Next Steps

In the next notebook (Part 2: Data Preprocessing), we will:
- Clean and preprocess the collected data
- Expand the dismissed questions dataset
- Add demographic context to the data
- Prepare the data for analysis and visualization

This will help us create a more comprehensive and balanced dataset for training the LLM.