# 📊 Data Usage Examples - Cardiac Digital Twins Enhanced

This notebook demonstrates how to load, explore, and use the synthetic datasets provided with the Cardiac Digital Twins Enhanced framework.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Load and explore the synthetic cardiac datasets
- Understand the relationship between parameters and clinical metrics
- Prepare data for machine learning models
- Visualize cardiac function across different conditions
- Use the data for physics-informed learning

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

# Configure display
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

print('📊 Data Usage Examples - Cardiac Digital Twins Enhanced')
print('=' * 60)

## 📁 1. Loading the Datasets

Let's start by loading all the available datasets and exploring their structure.

In [None]:
# Define data directory
data_dir = Path('../data')

# Load all datasets
datasets = {}

# Windkessel parameters
datasets['parameters'] = pd.read_csv(data_dir / 'windkessel_parameters.csv')
print(f'✅ Loaded Windkessel parameters: {datasets["parameters"].shape}')

# Clinical metrics
datasets['metrics'] = pd.read_csv(data_dir / 'clinical_metrics.csv')
print(f'✅ Loaded clinical metrics: {datasets["metrics"].shape}')

# Combined dataset
datasets['combined'] = pd.read_csv(data_dir / 'combined_dataset.csv')
print(f'✅ Loaded combined dataset: {datasets["combined"].shape}')

# Clinical conditions
datasets['conditions'] = pd.read_csv(data_dir / 'clinical_conditions.csv')
print(f'✅ Loaded clinical conditions: {datasets["conditions"].shape}')

# Echo metadata
datasets['echo_meta'] = pd.read_csv(data_dir / 'echo_metadata.csv')
print(f'✅ Loaded echo metadata: {datasets["echo_meta"].shape}')

print(f'\n📊 Total datasets loaded: {len(datasets)}')
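The loading cell above stops with a `FileNotFoundError` at the first CSV that is absent. The following is a minimal sketch of a more defensive loader. The helper name `load_datasets` is hypothetical and not part of the framework, and the demonstration writes a tiny made-up CSV into a temporary directory instead of touching `../data`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_datasets(data_dir, files):
    """Load each CSV that exists; collect the names of missing files."""
    loaded, missing = {}, []
    for key, filename in files.items():
        path = Path(data_dir) / filename
        if path.is_file():
            loaded[key] = pd.read_csv(path)
        else:
            missing.append(filename)
    return loaded, missing


# Demonstration on a throwaway directory with one tiny made-up CSV.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Emax': [2.0, 2.5], 'Emin': [0.05, 0.06]}).to_csv(
        Path(tmp) / 'windkessel_parameters.csv', index=False
    )
    demo_files = {
        'parameters': 'windkessel_parameters.csv',
        'metrics': 'clinical_metrics.csv',  # deliberately absent here
    }
    demo_loaded, demo_missing = load_datasets(tmp, demo_files)

print(sorted(demo_loaded), demo_missing)
```

Returning the missing file names lets the caller decide whether to stop or to continue with the datasets that did load.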

## 🔍 2. Dataset Exploration

Let's explore the structure and content of each dataset.

In [None]:
# Explore Windkessel parameters
print('🔬 WINDKESSEL PARAMETERS')
print('=' * 40)
print(f'Shape: {datasets["parameters"].shape}')
print(f'Columns: {list(datasets["parameters"].columns)}')
print('\nFirst 5 rows:')
display(datasets['parameters'].head())

print('\nStatistical summary:')
display(datasets['parameters'].describe())

In [None]:
# Explore clinical metrics
print('🏥 CLINICAL METRICS')
print('=' * 40)
print(f'Shape: {datasets["metrics"].shape}')
print(f'Columns: {list(datasets["metrics"].columns)}')
print('\nFirst 5 rows:')
display(datasets['metrics'].head())

print('\nStatistical summary:')
display(datasets['metrics'].describe())
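Several of the clinical metrics are linked by standard definitions: stroke volume is `VED - VES`, ejection fraction is `100 * (VED - VES) / VED`, and cardiac output is stroke volume times heart rate. A quick consistency check can be sketched as below. The column names follow the plotting cells of this notebook; the units (mL, %, bpm, L/min) and the example rows are assumptions made for illustration, and the synthetic generator may define its metrics differently.

```python
import pandas as pd


def metric_residuals(df):
    """Stored metrics minus the values recomputed from their definitions."""
    sv = df['VED'] - df['VES']               # stroke volume, assumed mL
    ef = 100.0 * sv / df['VED']              # ejection fraction, assumed %
    co = sv * df['heart_rate'] / 1000.0      # cardiac output, assumed L/min
    return pd.DataFrame({
        'stroke_volume': df['stroke_volume'] - sv,
        'EF': df['EF'] - ef,
        'cardiac_output': df['cardiac_output'] - co,
    })


# Two made-up rows: the first is consistent, the second has a wrong EF.
demo = pd.DataFrame({
    'VED': [120.0, 150.0],
    'VES': [50.0, 90.0],
    'EF': [100.0 * 70.0 / 120.0, 55.0],
    'stroke_volume': [70.0, 60.0],
    'heart_rate': [60.0, 75.0],
    'cardiac_output': [4.2, 4.5],
})
residuals = metric_residuals(demo)
inconsistent = (residuals.abs() > 1e-6).any(axis=1)
print(residuals.round(3))
print(inconsistent.tolist())
```

Rows flagged as inconsistent would point either to a different metric definition in the generator or to a unit mismatch, so the tolerance and units should be adapted before applying this to `datasets['metrics']`.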

In [None]:
# Explore clinical conditions
print('🩺 CLINICAL CONDITIONS')
print('=' * 40)
print(f'Shape: {datasets["conditions"].shape}')
print(f'Conditions: {datasets["conditions"]["condition"].unique()}')
print('\nSamples per condition:')
print(datasets['conditions']['condition'].value_counts())

print('\nFirst 5 rows:')
display(datasets['conditions'].head())

## 📊 3. Data Visualization

Let's create visualizations to understand the data distributions and relationships.

In [None]:
# Plot parameter distributions
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
fig.suptitle('Windkessel Parameter Distributions', fontsize=16, fontweight='bold')

params = ['Emax', 'Emin', 'Tc', 'Rm', 'Ra', 'Rs', 'Ca', 'Cs', 'Cr', 'Ls', 'Rc', 'Vd']

for i, param in enumerate(params):
    ax = axes[i//4, i%4]
    datasets['parameters'][param].hist(bins=30, ax=ax, alpha=0.7, color='skyblue', edgecolor='black')
    ax.set_title(f'{param}', fontweight='bold')
    ax.set_xlabel(param)
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Plot clinical metrics distributions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
fig.suptitle('Clinical Metrics Distributions', fontsize=16, fontweight='bold')

metrics = ['VED', 'VES', 'EF', 'stroke_volume', 'max_pressure', 'min_pressure', 'cardiac_output', 'heart_rate']

for i, metric in enumerate(metrics):
    ax = axes[i//4, i%4]
    datasets['metrics'][metric].hist(bins=30, ax=ax, alpha=0.7, color='lightcoral', edgecolor='black')
    ax.set_title(f'{metric}', fontweight='bold')
    ax.set_xlabel(metric)
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Plot parameter correlations
plt.figure(figsize=(12, 10))
correlation_matrix = datasets['parameters'].corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Windkessel Parameter Correlations', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 🩺 4. Clinical Condition Analysis

Let's analyze how parameters differ across clinical conditions.

In [None]:
# Plot key parameters by condition
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Key Parameters by Clinical Condition', fontsize=16, fontweight='bold')

key_params = ['Emax', 'Emin', 'Rs', 'Ra', 'Ca', 'Vd']

for i, param in enumerate(key_params):
    ax = axes[i//3, i%3]
    sns.boxplot(data=datasets['conditions'], x='condition', y=param, ax=ax)
    ax.set_title(f'{param} by Condition', fontweight='bold')
    ax.set_xlabel('Clinical Condition')
    ax.set_ylabel(param)
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Calculate and display condition statistics
print('📊 PARAMETER STATISTICS BY CONDITION')
print('=' * 50)

for condition in datasets['conditions']['condition'].unique():
    condition_data = datasets['conditions'][datasets['conditions']['condition'] == condition]
    print(f'\n🏥 {condition.upper()}')
    print('-' * 30)

    # Key parameters for this condition
    key_stats = condition_data[['Emax', 'Emin', 'Rs', 'Ra']].describe().loc[['mean', 'std']]
    print(key_stats.round(4))
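The loop above filters the frame once per condition. The same mean and standard deviation can be collected into a single table with `groupby`, which is easier to compare across conditions. This is a sketch on made-up data: the column names follow the cell above, while the condition labels and values are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'condition': ['normal', 'normal', 'heart_failure', 'heart_failure'],
    'Emax': [2.0, 2.2, 1.0, 1.2],
    'Rs': [1.0, 1.1, 1.4, 1.6],
})

# One row per condition, one (parameter, statistic) column per entry.
by_condition = demo.groupby('condition')[['Emax', 'Rs']].agg(['mean', 'std'])
print(by_condition.round(4))
```

With the real data, replacing `demo` by `datasets['conditions']` and extending the column list should give the table for all key parameters at once.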

## 🔗 5. Parameter-Metric Relationships

Let's explore the relationships between Windkessel parameters and clinical metrics.

In [None]:
# Plot key parameter-metric relationships
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Parameter-Metric Relationships', fontsize=16, fontweight='bold')

# Emax vs EF
axes[0,0].scatter(datasets['parameters']['Emax'], datasets['metrics']['EF'], alpha=0.6, color='blue')
axes[0,0].set_xlabel('Emax (Maximum Elastance)')
axes[0,0].set_ylabel('Ejection Fraction (%)')
axes[0,0].set_title('Contractility vs Ejection Fraction')
axes[0,0].grid(True, alpha=0.3)

# Rs vs max_pressure
axes[0,1].scatter(datasets['parameters']['Rs'], datasets['metrics']['max_pressure'], alpha=0.6, color='red')
axes[0,1].set_xlabel('Rs (Systemic Resistance)')
axes[0,1].set_ylabel('Max Pressure (mmHg)')
axes[0,1].set_title('Afterload vs Peak Pressure')
axes[0,1].grid(True, alpha=0.3)

# Tc vs heart_rate
axes[1,0].scatter(datasets['parameters']['Tc'], datasets['metrics']['heart_rate'], alpha=0.6, color='green')
axes[1,0].set_xlabel('Tc (Cardiac Cycle Time)')
axes[1,0].set_ylabel('Heart Rate (bpm)')
axes[1,0].set_title('Cycle Time vs Heart Rate')
axes[1,0].grid(True, alpha=0.3)

# Vd vs stroke_volume
axes[1,1].scatter(datasets['parameters']['Vd'], datasets['metrics']['stroke_volume'], alpha=0.6, color='purple')
axes[1,1].set_xlabel('Vd (Dead Volume)')
axes[1,1].set_ylabel('Stroke Volume (mL)')
axes[1,1].set_title('Dead Volume vs Stroke Volume')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Calculate correlation between parameters and metrics
combined_corr = datasets['combined'].corr()

# Extract parameter-metric correlations
param_cols = datasets['parameters'].columns
metric_cols = datasets['metrics'].columns

param_metric_corr = combined_corr.loc[param_cols, metric_cols]

plt.figure(figsize=(12, 8))
sns.heatmap(param_metric_corr, annot=True, cmap='RdBu_r', center=0,
            square=False, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Parameter-Metric Correlations', fontsize=14, fontweight='bold')
plt.xlabel('Clinical Metrics')
plt.ylabel('Windkessel Parameters')
plt.tight_layout()
plt.show()
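A heatmap shows the full picture, but a ranked list of the strongest parameter-metric pairs is often easier to act on. The sketch below builds such a list from a small synthetic frame in which each target is constructed to depend mostly on one parameter. The data are made up for illustration and do not reflect the actual datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
emax = rng.normal(2.0, 0.3, n)
rs = rng.normal(1.0, 0.2, n)
demo = pd.DataFrame({
    'Emax': emax,
    'Rs': rs,
    # Synthetic targets built so that each depends mostly on one parameter.
    'EF': 20.0 * emax + rng.normal(0.0, 1.0, n),
    'max_pressure': 100.0 * rs + rng.normal(0.0, 5.0, n),
})

# Rows are parameters, columns are metrics, as in the heatmap cell.
corr = demo.corr().loc[['Emax', 'Rs'], ['EF', 'max_pressure']]

# Flatten to (metric, parameter) pairs and sort by absolute correlation.
pairs = corr.unstack()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(3))
```

Applied to `param_metric_corr`, the same two lines (`unstack` followed by `sort_values`) would list the pairs from strongest to weakest.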

## 🤖 6. Machine Learning Data Preparation

Let's prepare the data for machine learning models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Prepare features and targets
feature_names = list(datasets['parameters'].columns)
target_names = list(datasets['metrics'].columns)

# Select columns by name rather than by position, so the split between
# parameters and metrics does not depend on the column order of the CSV
X = datasets['combined'][feature_names].values  # Parameters
y = datasets['combined'][target_names].values   # Clinical metrics

print(f'📊 Features (X): {X.shape} - {feature_names}')
print(f'📊 Targets (y): {y.shape} - {target_names}')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'\n✅ Training set: {X_train.shape}')
print(f'✅ Test set: {X_test.shape}')

In [None]:
# Normalize the data
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)

print('📊 DATA NORMALIZATION SUMMARY')
print('=' * 40)
print(f'X_train - Mean: {X_train_scaled.mean():.3f}, Std: {X_train_scaled.std():.3f}')
print(f'y_train - Mean: {y_train_scaled.mean():.3f}, Std: {y_train_scaled.std():.3f}')
print(f'X_test - Mean: {X_test_scaled.mean():.3f}, Std: {X_test_scaled.std():.3f}')
print(f'y_test - Mean: {y_test_scaled.mean():.3f}, Std: {y_test_scaled.std():.3f}')
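The summary above averages over every entry of each array, which can hide a single badly scaled column. A per-feature check is sketched below on made-up data with two features on very different scales. Because the scaler is fit on the training rows only, the training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Made-up features on very different scales.
X_demo = np.column_stack([
    rng.normal(2.0, 0.3, 500),
    rng.normal(10.0, 4.0, 500),
])
X_tr, X_te = X_demo[:400], X_demo[400:]

scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print('train mean per feature:', X_tr_s.mean(axis=0).round(3))
print('train std per feature: ', X_tr_s.std(axis=0).round(3))
print('test mean per feature: ', X_te_s.mean(axis=0).round(3))
print('test std per feature:  ', X_te_s.std(axis=0).round(3))
```

Using `axis=0` in the notebook's own summary would give the same per-column view for `X_train_scaled` and `y_train_scaled`.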

In [None]:
# Simple baseline model demonstration
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Train simple models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train_scaled)

    # Predict
    y_pred_scaled = model.predict(X_test_scaled)
    y_pred = scaler_y.inverse_transform(y_pred_scaled)

    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results[name] = {'MSE': mse, 'R2': r2}

    print(f'🤖 {name}:')
    print(f'   MSE: {mse:.3f}')
    print(f'   R²:  {r2:.3f}')
    print()
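The baseline cell reports one MSE and one R² averaged over all targets. Since the clinical metrics have different units, the averaged MSE is dominated by the targets with the largest numeric range. A per-target breakdown can be sketched with the `multioutput='raw_values'` option of scikit-learn's metrics. The data below are synthetic: one target is an exact linear function of the features, the other is pure noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Target 0 is an exact linear function of the features; target 1 is noise.
y_demo = np.column_stack([
    X_demo @ np.array([1.0, -2.0, 0.5]),
    rng.normal(size=300),
])

model = LinearRegression().fit(X_demo[:200], y_demo[:200])
y_hat = model.predict(X_demo[200:])

# One score per target column instead of a single averaged value.
r2_each = r2_score(y_demo[200:], y_hat, multioutput='raw_values')
mse_each = mean_squared_error(y_demo[200:], y_hat, multioutput='raw_values')
for name, r2, mse in zip(['target_0', 'target_1'], r2_each, mse_each):
    print(f'{name}: R2={r2:.3f}, MSE={mse:.3f}')
```

In the notebook, zipping the two arrays with `target_names` would show which clinical metrics the baselines predict well and which they do not.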

## 📈 7. Echo Metadata Analysis

Let's explore the echocardiogram metadata.

In [None]:
# Analyze echo metadata
echo_df = datasets['echo_meta']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Echocardiogram Metadata Analysis', fontsize=16, fontweight='bold')

# Age distribution
axes[0,0].hist(echo_df['age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_title('Age Distribution')
axes[0,0].set_xlabel('Age (years)')
axes[0,0].set_ylabel('Frequency')
axes[0,0].grid(True, alpha=0.3)

# Gender distribution
gender_counts = echo_df['gender'].value_counts()
axes[0,1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')
axes[0,1].set_title('Gender Distribution')

# Frame count distribution
frame_counts = echo_df['frame_count'].value_counts().sort_index()
axes[0,2].bar(frame_counts.index, frame_counts.values, alpha=0.7, color='lightcoral')
axes[0,2].set_title('Frame Count Distribution')
axes[0,2].set_xlabel('Number of Frames')
axes[0,2].set_ylabel('Frequency')
axes[0,2].grid(True, alpha=0.3)

# SNR distribution
axes[1,0].hist(echo_df['snr_db'], bins=20, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1,0].set_title('Signal-to-Noise Ratio')
axes[1,0].set_xlabel('SNR (dB)')
axes[1,0].set_ylabel('Frequency')
axes[1,0].grid(True, alpha=0.3)

# Contrast distribution
axes[1,1].hist(echo_df['contrast'], bins=20, alpha=0.7, color='orange', edgecolor='black')
axes[1,1].set_title('Image Contrast')
axes[1,1].set_xlabel('Contrast')
axes[1,1].set_ylabel('Frequency')
axes[1,1].grid(True, alpha=0.3)

# Condition distribution
condition_counts = echo_df['condition'].value_counts()
axes[1,2].pie(condition_counts.values, labels=condition_counts.index, autopct='%1.1f%%')
axes[1,2].set_title('Condition Distribution')

plt.tight_layout()
plt.show()

## 💾 8. Data Export and Saving

Let's demonstrate how to save processed data for later use.

In [None]:
# Save processed data
import pickle

# Create processed data dictionary
processed_data = {
    'X_train': X_train_scaled,
    'X_test': X_test_scaled,
    'y_train': y_train_scaled,
    'y_test': y_test_scaled,
    'scaler_X': scaler_X,
    'scaler_y': scaler_y,
    'feature_names': feature_names,
    'target_names': target_names
}

# Save to pickle file
with open('../data/processed_data.pkl', 'wb') as f:
    pickle.dump(processed_data, f)

print('✅ Processed data saved to ../data/processed_data.pkl')

# Save training data as CSV
train_df = pd.DataFrame(X_train_scaled, columns=feature_names)
train_targets = pd.DataFrame(y_train_scaled, columns=target_names)
train_combined = pd.concat([train_df, train_targets], axis=1)
train_combined.to_csv('../data/training_data_normalized.csv', index=False)

print('✅ Training data saved to ../data/training_data_normalized.csv')
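Saving the fitted scalers alongside the arrays matters because they are needed to map predictions back to the original units. The round trip can be sketched as follows, using a temporary file and made-up targets rather than the notebook's real `processed_data.pkl`. The dictionary keys mirror the ones used above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
y_raw = rng.normal(60.0, 8.0, size=(50, 2))   # made-up targets, original units
scaler_demo = StandardScaler().fit(y_raw)
bundle = {'y_train': scaler_demo.transform(y_raw), 'scaler_y': scaler_demo}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'processed_demo.pkl'
    with open(path, 'wb') as f:
        pickle.dump(bundle, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored scaler maps scaled values back to the original units.
y_back = restored['scaler_y'].inverse_transform(restored['y_train'])
print(np.allclose(y_back, y_raw))
```

Pickle files should only be loaded from trusted sources, and pickled scikit-learn objects are not guaranteed to load across library versions, so recording the scikit-learn version next to the file is a sensible precaution.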

## 📊 9. Data Summary and Statistics

Let's create a comprehensive summary of our datasets.

In [None]:
# Create comprehensive data summary
summary_stats = {}

for name, df in datasets.items():
    if name != 'echo_meta':  # Skip echo metadata for numeric summary
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        summary_stats[name] = {
            'shape': df.shape,
            'numeric_columns': len(numeric_cols),
            'missing_values': df.isnull().sum().sum(),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024 / 1024
        }

# Display summary
print('📊 DATASET SUMMARY STATISTICS')
print('=' * 50)

for name, stats in summary_stats.items():
    print(f'\n📁 {name.upper()}:')
    print(f'   Shape: {stats["shape"]}')
    print(f'   Numeric columns: {stats["numeric_columns"]}')
    print(f'   Missing values: {stats["missing_values"]}')
    print(f'   Memory usage: {stats["memory_usage_mb"]:.2f} MB')

# Overall statistics (row counts are summed across files, so samples that
# appear in several datasets are counted more than once)
total_rows = sum(stats['shape'][0] for stats in summary_stats.values())
total_memory = sum(stats['memory_usage_mb'] for stats in summary_stats.values())
total_missing = sum(stats['missing_values'] for stats in summary_stats.values())

print(f'\n🎯 OVERALL STATISTICS:')
print(f'   Total rows across datasets: {total_rows:,}')
print(f'   Total memory: {total_memory:.2f} MB')
print(f'   Datasets summarized: {len(summary_stats)} of {len(datasets)}')

# Report data quality from the computed count instead of assuming it
if total_missing == 0:
    print('   Data quality: ✅ No missing values')
else:
    print(f'   Data quality: ⚠️ {total_missing} missing values')

## 🎯 10. Next Steps and Recommendations

Based on this data exploration, here are the recommended next steps:

In [None]:
print('🎯 NEXT STEPS AND RECOMMENDATIONS')
print('=' * 50)
print()
print('1. 🤖 MACHINE LEARNING MODELS:')
print('   - Use the normalized training data for neural networks')
print('   - Try physics-informed neural networks (PINNs)')
print('   - Implement multi-task learning for all clinical metrics')
print()
print('2. 🔬 PHYSICS-INFORMED LEARNING:')
print('   - Incorporate Windkessel equations as loss constraints')
print('   - Use parameter-metric correlations for regularization')
print('   - Validate predictions against known physiological ranges')
print()
print('3. 🏥 CLINICAL VALIDATION:')
print('   - Test models on clinical condition datasets')
print('   - Evaluate performance across different disease states')
print('   - Implement uncertainty quantification')
print()
print('4. 📊 DATA AUGMENTATION:')
print('   - Generate more samples for rare conditions')
print('   - Add noise for robustness testing')
print('   - Create time-series data for dynamic modeling')
print()
print('5. 🎥 VIDEO PROCESSING:')
print('   - Generate actual synthetic echocardiogram videos')
print('   - Implement 3D CNN architectures')
print('   - Add temporal attention mechanisms')
print()
print('✅ Data is ready for advanced cardiac digital twin development!')
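One of the recommendations above, adding noise for robustness testing, can be sketched in a few lines. The helper `add_gaussian_noise` and its `relative_sigma` parameter are hypothetical names introduced here for illustration. The noise level of 5% of each column's standard deviation is an arbitrary choice, not a value taken from the framework.

```python
import numpy as np


def add_gaussian_noise(X, relative_sigma=0.05, seed=0):
    """Return a noisy copy of X; noise std is a fraction of each column's std."""
    rng = np.random.default_rng(seed)
    column_std = X.std(axis=0, keepdims=True)
    return X + rng.normal(size=X.shape) * relative_sigma * column_std


rng = np.random.default_rng(7)
# Made-up feature matrix with two columns on different scales.
X_demo = rng.normal(loc=[2.0, 100.0], scale=[0.3, 15.0], size=(1000, 2))
X_noisy = add_gaussian_noise(X_demo, relative_sigma=0.05)

# Empirical noise level per column, relative to that column's spread.
ratio = (X_noisy - X_demo).std(axis=0) / X_demo.std(axis=0)
print(ratio.round(3))
```

Scaling the noise by each column's spread keeps the perturbation comparable across parameters with different units. Evaluating a trained model on `X_noisy` against the clean test set then gives a rough picture of its sensitivity to input noise.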

## 📚 Conclusion

In this notebook, we have:

- ✅ **Loaded and explored** all synthetic cardiac datasets
- ✅ **Visualized** parameter distributions and relationships
- ✅ **Analyzed** clinical conditions and their characteristics
- ✅ **Prepared** data for machine learning models
- ✅ **Demonstrated** baseline model training
- ✅ **Explored** echocardiogram metadata
- ✅ **Saved** processed data for future use

The datasets are now ready for advanced cardiac digital twin development using physics-informed neural networks and deep learning approaches.

**🫀 Ready to build the future of cardiac care through AI-powered digital twins!**