# CGMacros Dataset Exploration - Updated for Actual Data Structure

This notebook explores the CGMacros dataset as it is actually laid out in the raw CSV files.

## Dataset Structure Overview

Based on our analysis, the dataset contains:

1. **CGMacros Files (CGMacros_CSVs/)**: Time-series data for 44 participants
   - Columns: Timestamp, Libre GL, Dexcom GL, HR, Calories, METs, Meal Type, Carbs, Protein, Fat, Fiber, Amount Consumed, Image path
   
2. **bio.csv**: Demographics and lab data
   - Columns: Age, Gender, BMI, Body weight, Height, A1c, Fasting GLU, Insulin, Triglycerides, Cholesterol, etc.
   
3. **microbes.csv**: Microbiome composition
   - Thousands of bacterial species columns with binary/abundance values
   
4. **gut_health_test.csv**: Gut health scores
   - Various gut health metrics (Gut Lining Health, LPS Biosynthesis, etc.)
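
Before loading everything, it can help to check a participant file against the column list above. The sketch below is one minimal way to do that. `EXPECTED_CGMACROS_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the header spellings are assumed to match the list above exactly.

```python
import pandas as pd

# Column names as listed in the overview above. The exact spellings in the
# CSV headers are an assumption and may need adjusting.
EXPECTED_CGMACROS_COLUMNS = [
    'Timestamp', 'Libre GL', 'Dexcom GL', 'HR', 'Calories', 'METs',
    'Meal Type', 'Carbs', 'Protein', 'Fat', 'Fiber', 'Amount Consumed',
    'Image path',
]


def missing_columns(df, expected=EXPECTED_CGMACROS_COLUMNS):
    """Return the expected column names that are absent from df, in order."""
    present = {str(col).strip() for col in df.columns}
    return [col for col in expected if col not in present]


# Toy frame standing in for one participant file (not real data).
toy_df = pd.DataFrame(columns=['Timestamp', 'Libre GL', 'HR', 'Carbs'])
print(missing_columns(toy_df))
```

An empty list means the file has every expected column; anything else points at a header that was renamed or dropped.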

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Load and Examine Individual Data Sources

In [None]:
# Set up data directory
data_dir = Path('../data/raw')

# Load and examine one CGMacros file
# Sort so that the "first" participant file is the same on every run
cgmacros_files = sorted((data_dir / 'CGMacros_CSVs').glob('CGMacros-*.csv'))
print(f"Found {len(cgmacros_files)} CGMacros participant files")

# Load first participant file
if cgmacros_files:
    sample_file = cgmacros_files[0]
    sample_df = pd.read_csv(sample_file)
    print(f"\nSample CGMacros file: {sample_file.name}")
    print(f"Shape: {sample_df.shape}")
    print(f"Columns: {list(sample_df.columns)}")
    print("\nFirst few rows:")
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(sample_df.head())

In [None]:
# Examine data types and missing values
print("Data types and missing values:")
info_df = pd.DataFrame({
    'Column': sample_df.columns,
    'Type': sample_df.dtypes,
    'Non-Null Count': sample_df.count(),
    'Missing Count': sample_df.isnull().sum(),
    'Missing %': (sample_df.isnull().sum() / len(sample_df)) * 100
})
info_df

In [None]:
# Load demographics data
bio_file = data_dir / 'bio.csv'
if bio_file.exists():
    bio_df = pd.read_csv(bio_file)
    print(f"Demographics data shape: {bio_df.shape}")
    print(f"Columns: {list(bio_df.columns)}")
    print("\nFirst few rows:")
    display(bio_df.head())
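else:
    # Fall back to an empty frame so later `bio_df.empty` checks do not raise NameError
    bio_df = pd.DataFrame()
    print("Demographics file not found")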

In [None]:
# Load microbiome data (just examine structure due to size)
microbes_file = data_dir / 'microbes.csv'
if microbes_file.exists():
    microbes_df = pd.read_csv(microbes_file)
    print(f"Microbiome data shape: {microbes_df.shape}")
    print(f"First 10 columns: {list(microbes_df.columns[:10])}")
    print(f"\nSample of first participant microbiome data:")
    display(microbes_df.iloc[0:2, 0:10])  # First 2 rows, first 10 columns
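else:
    # Fall back to an empty frame so later `microbes_df.empty` checks do not raise NameError
    microbes_df = pd.DataFrame()
    print("Microbiome file not found")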

In [None]:
# Load gut health data
gut_health_file = data_dir / 'gut_health_test.csv'
if gut_health_file.exists():
    gut_health_df = pd.read_csv(gut_health_file)
    print(f"Gut health data shape: {gut_health_df.shape}")
    print(f"Columns: {list(gut_health_df.columns)}")
    print("\nFirst few rows:")
    display(gut_health_df.head())
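else:
    # Fall back to an empty frame so later `gut_health_df.empty` checks do not raise NameError
    gut_health_df = pd.DataFrame()
    print("Gut health file not found")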

## 2. Load Data Using Updated Data Loader

In [None]:
# Import our updated data loader
import sys
sys.path.append('../src')

from data_loader_updated import DataLoader

# Initialize data loader
loader = DataLoader(data_dir='../data/raw')

# Load all data
print("Loading all data...")
merged_df = loader.load_all_data()

print(f"\nMerged dataset shape: {merged_df.shape}")
print(f"Participants: {merged_df['participant_id'].nunique()}")
print(f"Total records: {len(merged_df)}")

In [None]:
# Examine the merged dataset structure
print("Column categories in merged dataset:")

# Categorize columns
time_cols = [col for col in merged_df.columns if 'time' in col.lower() or col == 'Timestamp']
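# Match whole words: a plain substring test for 'gl' would also pick up 'Triglycerides'
glucose_cols = [col for col in merged_df.columns if {'gl', 'glu', 'glucose'} & set(col.lower().replace('_', ' ').split())]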
activity_cols = [col for col in merged_df.columns if col in ['HR', 'METs', 'Calories']]
meal_cols = [col for col in merged_df.columns if 'meal' in col.lower() or col in ['Carbs', 'Protein', 'Fat', 'Fiber', 'Amount Consumed']]
demo_cols = [col for col in merged_df.columns if col in ['Age', 'Gender', 'BMI', 'A1c', 'Fasting GLU', 'Insulin']]
gut_health_cols = [col for col in merged_df.columns if any(term in col for term in ['Gut', 'LPS', 'Biofilm', 'Production', 'Metabolism'])]

print(f"Time columns ({len(time_cols)}): {time_cols}")
print(f"Glucose columns ({len(glucose_cols)}): {glucose_cols}")
print(f"Activity columns ({len(activity_cols)}): {activity_cols}")
print(f"Meal columns ({len(meal_cols)}): {meal_cols}")
print(f"Demographics columns ({len(demo_cols)}): {demo_cols}")
print(f"Gut health columns ({len(gut_health_cols)}): {gut_health_cols[:5]}...")  # Show first 5

## 3. Compute and Analyze Target Variable (CCR)

In [None]:
# Import target computation module
from target_updated import compute_ccr, validate_ccr_computation

# Compute CCR
df_with_ccr = compute_ccr(merged_df)

# Validate CCR computation
is_valid, message = validate_ccr_computation(df_with_ccr)
print(f"CCR validation: {message}")
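
The internals of `compute_ccr` are not shown in this notebook. Going only by the name "Carbohydrate Caloric Ratio", one plausible reading is the share of a meal's macronutrient calories that comes from carbohydrates. The sketch below implements that reading under two assumptions: the general Atwater factors (4 kcal/g for carbohydrate and protein, 9 kcal/g for fat), and a value of 0 for rows with no logged macronutrients. `carb_caloric_ratio` is a hypothetical helper, and the project's actual definition may differ.

```python
import numpy as np
import pandas as pd

# General Atwater factors in kcal per gram; using them here is an assumption.
KCAL_PER_GRAM = {'Carbs': 4.0, 'Protein': 4.0, 'Fat': 9.0}


def carb_caloric_ratio(df):
    """Share of macronutrient calories that comes from carbohydrates.

    Rows with no macronutrient calories (for example non-meal rows) get 0.0.
    """
    macros = df[list(KCAL_PER_GRAM)].fillna(0.0)
    kcal = macros * pd.Series(KCAL_PER_GRAM)   # grams -> kcal, column by column
    total = kcal.sum(axis=1)
    ratio = kcal['Carbs'] / total.where(total > 0)  # avoid dividing by zero
    return ratio.fillna(0.0)


# Toy rows (not real data): a meal, a non-meal row, and a carb-free meal.
toy_meals = pd.DataFrame({
    'Carbs': [50.0, np.nan, 0.0],
    'Protein': [25.0, np.nan, 10.0],
    'Fat': [0.0, np.nan, 0.0],
})
toy_ccr = carb_caloric_ratio(toy_meals)
print(toy_ccr.round(3).tolist())
```

Note that under this definition a carb-free meal also yields 0, so a `CCR > 0` filter like the one used in the next cell would drop it along with non-meal rows.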

In [None]:
# Analyze CCR distribution
ccr_values = df_with_ccr['CCR']
non_zero_ccr = ccr_values[ccr_values > 0]

print(f"CCR Statistics:")
print(f"Total records: {len(ccr_values)}")
print(f"Non-zero CCR records: {len(non_zero_ccr)} ({len(non_zero_ccr)/len(ccr_values):.1%})")
print(f"Mean CCR: {non_zero_ccr.mean():.3f}")
print(f"Std CCR: {non_zero_ccr.std():.3f}")
print(f"Min CCR: {non_zero_ccr.min():.3f}")
print(f"Max CCR: {non_zero_ccr.max():.3f}")

# Plot CCR distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram of non-zero CCR values
axes[0].hist(non_zero_ccr, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Distribution of Non-Zero CCR Values')
axes[0].set_xlabel('CCR (Carbohydrate Caloric Ratio)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(non_zero_ccr.mean(), color='red', linestyle='--', label=f'Mean: {non_zero_ccr.mean():.3f}')
axes[0].legend()

# Box plot by participant
participant_ccr = df_with_ccr[df_with_ccr['CCR'] > 0].groupby('participant_id')['CCR'].apply(list)
axes[1].boxplot([ccr_list for ccr_list in participant_ccr if len(ccr_list) > 0], 
                patch_artist=True, 
                boxprops=dict(facecolor='lightblue', alpha=0.7))
axes[1].set_title('CCR Distribution by Participant')
axes[1].set_xlabel('Participant')
axes[1].set_ylabel('CCR')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 4. Explore Data Quality and Patterns

In [None]:
# Analyze data availability per participant
participant_summary = df_with_ccr.groupby('participant_id').agg({
    'CCR': ['count', lambda x: (x > 0).sum()],
    'Libre GL': lambda x: x.notna().sum(),
    'Dexcom GL': lambda x: x.notna().sum(),
    'HR': lambda x: x.notna().sum(),
    # Count logged meals; the notna() guard keeps missing meal labels from being counted as meals
    'Meal Type': lambda x: (x.notna() & (x != 'No Meal')).sum()
})

participant_summary.columns = ['Total_Records', 'Meals_with_CCR', 'Libre_Records', 'Dexcom_Records', 'HR_Records', 'Meal_Records']

print("Data availability per participant:")
display(participant_summary.head(10))

# Plot data availability
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Total records per participant
axes[0,0].bar(range(len(participant_summary)), participant_summary['Total_Records'])
axes[0,0].set_title('Total Records per Participant')
axes[0,0].set_xlabel('Participant ID')
axes[0,0].set_ylabel('Number of Records')

# Meals with CCR per participant
axes[0,1].bar(range(len(participant_summary)), participant_summary['Meals_with_CCR'], color='orange')
axes[0,1].set_title('Meals with CCR per Participant')
axes[0,1].set_xlabel('Participant ID')
axes[0,1].set_ylabel('Number of Meals')

# Glucose data availability
libre_counts = participant_summary['Libre_Records']
dexcom_counts = participant_summary['Dexcom_Records']
x = range(len(participant_summary))
width = 0.35
axes[1,0].bar([i - width/2 for i in x], libre_counts, width, label='Libre GL', alpha=0.7)
axes[1,0].bar([i + width/2 for i in x], dexcom_counts, width, label='Dexcom GL', alpha=0.7)
axes[1,0].set_title('Glucose Data Availability')
axes[1,0].set_xlabel('Participant ID')
axes[1,0].set_ylabel('Number of Records')
axes[1,0].legend()

# Heart rate data availability
axes[1,1].bar(range(len(participant_summary)), participant_summary['HR_Records'], color='red', alpha=0.7)
axes[1,1].set_title('Heart Rate Data Availability')
axes[1,1].set_xlabel('Participant ID')
axes[1,1].set_ylabel('Number of Records')

plt.tight_layout()
plt.show()
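
Where the availability counts above reveal gaps in a glucose trace, short gaps can be bridged by interpolation while long ones are left missing. The sketch below is one way to do that with pandas. `fill_short_gaps` is a hypothetical helper, and the gap threshold is an arbitrary assumption to tune against the sensors' sampling rate.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(glucose, max_gap_points=3):
    """Time-interpolate gaps of at most `max_gap_points` consecutive samples.

    `glucose` must be a Series with a DatetimeIndex. Longer gaps, and gaps at
    the start or end of the trace, are left as NaN.
    """
    is_na = glucose.isna()
    # Give every run of consecutive missing / present values its own id.
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    # Length of the missing run each sample belongs to (0 for present samples).
    run_len = is_na.groupby(run_id).transform('sum')
    filled = glucose.interpolate(method='time', limit_area='inside')
    # Discard interpolated values that fall inside gaps that are too long.
    return filled.mask(is_na & (run_len > max_gap_points))


# Toy trace at 5-minute spacing (not real data): one 1-point gap, one 4-point gap.
toy_index = pd.date_range('2024-01-01 08:00', periods=8, freq='5min')
toy_glucose = pd.Series(
    [100.0, np.nan, 110.0, np.nan, np.nan, np.nan, np.nan, 140.0],
    index=toy_index,
)
toy_filled = fill_short_gaps(toy_glucose, max_gap_points=2)
print(toy_filled)
```

Masking after interpolation is deliberate: passing `limit` to `interpolate` would fill the first few points of a long gap rather than skipping the gap entirely.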

In [None]:
# Analyze glucose patterns
glucose_data = df_with_ccr[['participant_id', 'Libre GL', 'Dexcom GL', 'CCR']].dropna(subset=['Libre GL', 'Dexcom GL'])

if len(glucose_data) > 0:
    print(f"Records with both Libre and Dexcom glucose: {len(glucose_data)}")
    
    # Correlation between Libre and Dexcom
    correlation = glucose_data['Libre GL'].corr(glucose_data['Dexcom GL'])
    print(f"Correlation between Libre and Dexcom: {correlation:.3f}")
    
    # Plot glucose comparison
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Scatter plot
    axes[0].scatter(glucose_data['Libre GL'], glucose_data['Dexcom GL'], alpha=0.6)
    axes[0].plot([glucose_data['Libre GL'].min(), glucose_data['Libre GL'].max()], 
                 [glucose_data['Libre GL'].min(), glucose_data['Libre GL'].max()], 
                 'r--', alpha=0.8)
    axes[0].set_xlabel('Libre GL')
    axes[0].set_ylabel('Dexcom GL')
    axes[0].set_title(f'Libre vs Dexcom Glucose (r={correlation:.3f})')
    
    # Distribution comparison
    axes[1].hist(glucose_data['Libre GL'], bins=30, alpha=0.5, label='Libre GL', density=True)
    axes[1].hist(glucose_data['Dexcom GL'], bins=30, alpha=0.5, label='Dexcom GL', density=True)
    axes[1].set_xlabel('Glucose Level')
    axes[1].set_ylabel('Density')
    axes[1].set_title('Glucose Distribution Comparison')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()
else:
    print("No records with both Libre and Dexcom glucose data")
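
Correlation measures linear association, not agreement: two sensors can correlate almost perfectly while one reads consistently higher. A sketch of two complementary numbers follows, the mean difference (bias) and the mean absolute relative difference (MARD). `sensor_agreement` is a hypothetical helper, and treating the second sensor as the reference for MARD is an arbitrary assumption, since neither CGM is a ground truth.

```python
import pandas as pd


def sensor_agreement(a, b):
    """Bias and MARD of series `a` relative to series `b` on paired samples."""
    paired = pd.concat([a, b], axis=1, keys=['a', 'b']).dropna()
    diff = paired['a'] - paired['b']
    return {
        'n': len(paired),
        'bias': diff.mean(),                                  # same units as input
        'mard_pct': (diff.abs() / paired['b']).mean() * 100,  # percent of reference
        'pearson_r': paired['a'].corr(paired['b']),
    }


# Toy readings (not real data): the second sensor reads exactly 10% higher.
toy_sensors = pd.DataFrame({
    'Libre GL': [100.0, 120.0, None, 150.0],
    'Dexcom GL': [110.0, 132.0, 140.0, 165.0],
})
toy_agreement = sensor_agreement(toy_sensors['Libre GL'], toy_sensors['Dexcom GL'])
print(toy_agreement)
```

In the toy data the correlation is 1 even though every paired reading differs by roughly 9% of the reference value, which is the situation a correlation coefficient alone would hide.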

## 5. Analyze Meal Patterns and CCR Relationships

In [None]:
# Analyze meal types and their CCR patterns
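# Check for the column before indexing it, and skip rows with a missing meal label
if 'Meal Type' in df_with_ccr.columns:
    meal_mask = (df_with_ccr['CCR'] > 0) & df_with_ccr['Meal Type'].notna() & (df_with_ccr['Meal Type'] != 'No Meal')
    meal_data = df_with_ccr[meal_mask].copy()
else:
    meal_data = pd.DataFrame()

if len(meal_data) > 0: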
    print(f"Meal records with CCR: {len(meal_data)}")
    
    # Meal type distribution
    meal_counts = meal_data['Meal Type'].value_counts()
    print(f"\nMeal type distribution:")
    print(meal_counts)
    
    # CCR by meal type
    ccr_by_meal = meal_data.groupby('Meal Type')['CCR'].agg(['mean', 'std', 'count']).round(3)
    print(f"\nCCR by meal type:")
    display(ccr_by_meal)
    
    # Plot meal analysis
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Meal type distribution
    meal_counts.plot(kind='bar', ax=axes[0,0])
    axes[0,0].set_title('Meal Type Distribution')
    axes[0,0].set_ylabel('Count')
    axes[0,0].tick_params(axis='x', rotation=45)
    
    # CCR by meal type (box plot)
    meal_data.boxplot(column='CCR', by='Meal Type', ax=axes[0,1])
    fig.suptitle('')  # Clear the automatic "Boxplot grouped by ..." figure title that pandas adds
    axes[0,1].set_title('CCR Distribution by Meal Type')
    axes[0,1].set_xlabel('Meal Type')
    
    # Macronutrient analysis
    if all(col in meal_data.columns for col in ['Carbs', 'Protein', 'Fat']):
        # Average macronutrients by meal type
        macro_by_meal = meal_data.groupby('Meal Type')[['Carbs', 'Protein', 'Fat']].mean()
        macro_by_meal.plot(kind='bar', ax=axes[1,0], stacked=True)
        axes[1,0].set_title('Average Macronutrients by Meal Type')
        axes[1,0].set_ylabel('Grams')
        axes[1,0].tick_params(axis='x', rotation=45)
        
        # CCR vs Carbs relationship
        axes[1,1].scatter(meal_data['Carbs'], meal_data['CCR'], alpha=0.6)
        axes[1,1].set_xlabel('Carbohydrates (g)')
        axes[1,1].set_ylabel('CCR')
        axes[1,1].set_title('CCR vs Carbohydrates')
    
    plt.tight_layout()
    plt.show()
else:
    print("No meal data with CCR found")

## 6. Explore Participant Demographics and Health Metrics

In [None]:
# Analyze demographics (participant-level data)
if not bio_df.empty:
    print("Demographics Analysis:")
    
    # Basic demographics
    if 'Age' in bio_df.columns:
        print(f"Age - Mean: {bio_df['Age'].mean():.1f}, Range: {bio_df['Age'].min()}-{bio_df['Age'].max()}")
    
    if 'Gender' in bio_df.columns:
        print(f"Gender distribution:\n{bio_df['Gender'].value_counts()}")
    
    if 'BMI' in bio_df.columns:
        print(f"BMI - Mean: {bio_df['BMI'].mean():.1f}, Range: {bio_df['BMI'].min():.1f}-{bio_df['BMI'].max():.1f}")
    
    # Plot demographics
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Age distribution
    if 'Age' in bio_df.columns:
        axes[0,0].hist(bio_df['Age'].dropna(), bins=15, alpha=0.7, color='skyblue', edgecolor='black')
        axes[0,0].set_title('Age Distribution')
        axes[0,0].set_xlabel('Age')
        axes[0,0].set_ylabel('Count')
    
    # BMI distribution
    if 'BMI' in bio_df.columns:
        axes[0,1].hist(bio_df['BMI'].dropna(), bins=15, alpha=0.7, color='lightgreen', edgecolor='black')
        axes[0,1].set_title('BMI Distribution')
        axes[0,1].set_xlabel('BMI')
        axes[0,1].set_ylabel('Count')
        # Add BMI category lines
        axes[0,1].axvline(18.5, color='red', linestyle='--', alpha=0.7, label='Underweight')
        axes[0,1].axvline(25, color='orange', linestyle='--', alpha=0.7, label='Overweight')
        axes[0,1].axvline(30, color='red', linestyle='--', alpha=0.7, label='Obese')
        axes[0,1].legend()
    
    # A1c distribution (diabetes marker)
    if 'A1c' in bio_df.columns:
        axes[1,0].hist(bio_df['A1c'].dropna(), bins=15, alpha=0.7, color='coral', edgecolor='black')
        axes[1,0].set_title('A1c Distribution')
        axes[1,0].set_xlabel('A1c (%)')
        axes[1,0].set_ylabel('Count')
        # Add diabetes cutoffs
        axes[1,0].axvline(5.7, color='orange', linestyle='--', alpha=0.7, label='Prediabetic')
        axes[1,0].axvline(6.5, color='red', linestyle='--', alpha=0.7, label='Diabetic')
        axes[1,0].legend()
    
    # Correlation heatmap of numeric variables
    numeric_cols = bio_df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 1:
        corr_matrix = bio_df[numeric_cols].corr()
        sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[1,1], fmt='.2f')
        axes[1,1].set_title('Demographics Correlation Matrix')
    
    plt.tight_layout()
    plt.show()
else:
    print("No demographics data available")

## 7. Microbiome Data Overview

In [None]:
# Analyze microbiome data
if not microbes_df.empty:
    print(f"Microbiome Analysis:")
    print(f"Number of participants: {len(microbes_df)}")
    print(f"Number of microbial species: {microbes_df.shape[1] - 1}")  # Minus subject column
    
    # Get microbiome columns (exclude subject)
    microbe_cols = [col for col in microbes_df.columns if col != 'subject']
    microbe_data = microbes_df[microbe_cols]
    
    # Calculate diversity metrics
    species_richness = (microbe_data > 0).sum(axis=1)  # Number of species present
    total_abundance = microbe_data.sum(axis=1)  # Total microbial abundance
    
    print(f"\nMicrobiome Diversity:")
    print(f"Average species richness: {species_richness.mean():.1f} ± {species_richness.std():.1f}")
    print(f"Range: {species_richness.min()} - {species_richness.max()} species")
    
    # Most prevalent species
    species_prevalence = (microbe_data > 0).sum().sort_values(ascending=False)
    print(f"\nTop 10 most prevalent species:")
    for i, (species, count) in enumerate(species_prevalence.head(10).items()):
        print(f"{i+1}. {species}: {count}/{len(microbes_df)} participants ({count/len(microbes_df):.1%})")
    
    # Plot microbiome analysis
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Species richness distribution
    axes[0,0].hist(species_richness, bins=20, alpha=0.7, color='forestgreen', edgecolor='black')
    axes[0,0].set_title('Species Richness Distribution')
    axes[0,0].set_xlabel('Number of Species Present')
    axes[0,0].set_ylabel('Number of Participants')
    
    # Total abundance distribution
    axes[0,1].hist(total_abundance, bins=20, alpha=0.7, color='darkblue', edgecolor='black')
    axes[0,1].set_title('Total Microbial Abundance Distribution')
    axes[0,1].set_xlabel('Total Abundance')
    axes[0,1].set_ylabel('Number of Participants')
    
    # Species prevalence (top 20)
    top_20_species = species_prevalence.head(20)
    axes[1,0].barh(range(len(top_20_species)), top_20_species.values)
    axes[1,0].set_yticks(range(len(top_20_species)))
    axes[1,0].set_yticklabels([name[:30] + '...' if len(name) > 30 else name for name in top_20_species.index])
    axes[1,0].set_title('Top 20 Most Prevalent Species')
    axes[1,0].set_xlabel('Number of Participants')
    
    # Richness vs abundance scatter
    axes[1,1].scatter(species_richness, total_abundance, alpha=0.6)
    axes[1,1].set_xlabel('Species Richness')
    axes[1,1].set_ylabel('Total Abundance')
    axes[1,1].set_title('Species Richness vs Total Abundance')
    
    plt.tight_layout()
    plt.show()
else:
    print("No microbiome data available")
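
Species richness ignores how evenly abundance is spread across species. If the microbiome columns hold abundances rather than presence/absence flags, the Shannon index is a common complement. The sketch below assumes non-negative abundance values, and `shannon_diversity` is a hypothetical helper. For purely binary data the index reduces to the log of the richness and adds nothing new.

```python
import numpy as np
import pandas as pd


def shannon_diversity(abundance):
    """Shannon index H = -sum(p * ln p) for each row of an abundance table.

    Rows whose abundances sum to zero get H = 0.
    """
    counts = abundance.clip(lower=0).astype(float)
    totals = counts.sum(axis=1)
    # Row-wise proportions; rows with zero total become NaN and drop out below.
    p = counts.div(totals.where(totals > 0), axis=0)
    # Use ln(1) = 0 wherever p is 0 or NaN so that 0 * ln(0) contributes 0.
    terms = p * np.log(p.where(p > 0, 1.0))
    return -terms.sum(axis=1)


# Toy abundance table (not real data): even, single-species, and empty samples.
toy_abundance = pd.DataFrame(
    [[1, 1, 1, 1], [4, 0, 0, 0], [0, 0, 0, 0]],
    columns=['sp_a', 'sp_b', 'sp_c', 'sp_d'],
)
toy_shannon = shannon_diversity(toy_abundance)
print(toy_shannon.round(3).tolist())
```

A perfectly even sample of four species scores ln 4, while a sample dominated by one species scores 0, even though both could have similar total abundance.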

## 8. Gut Health Scores Analysis

In [None]:
# Analyze gut health scores
if not gut_health_df.empty:
    print("Gut Health Analysis:")
    
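    # Get gut health metrics (exclude subject, and the composite score added below
    # so that re-running this cell does not average the score into itself)
    gut_metrics = [col for col in gut_health_df.columns if col not in ('subject', 'composite_score')]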
    
    print(f"Number of gut health metrics: {len(gut_metrics)}")
    print(f"Metrics: {gut_metrics}")
    
    # Summary statistics
    gut_health_stats = gut_health_df[gut_metrics].describe().round(2)
    print(f"\nGut Health Metrics Summary:")
    display(gut_health_stats)
    
    # Calculate composite gut health score
    gut_health_df['composite_score'] = gut_health_df[gut_metrics].mean(axis=1)
    
    print(f"\nComposite Gut Health Score:")
    print(f"Mean: {gut_health_df['composite_score'].mean():.2f}")
    print(f"Std: {gut_health_df['composite_score'].std():.2f}")
    print(f"Range: {gut_health_df['composite_score'].min():.2f} - {gut_health_df['composite_score'].max():.2f}")
    
    # Plot gut health analysis
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Composite score distribution
    axes[0,0].hist(gut_health_df['composite_score'], bins=15, alpha=0.7, color='purple', edgecolor='black')
    axes[0,0].set_title('Composite Gut Health Score Distribution')
    axes[0,0].set_xlabel('Composite Score')
    axes[0,0].set_ylabel('Number of Participants')
    
    # Heatmap of gut health scores
    gut_health_matrix = gut_health_df[gut_metrics].T  # Transpose for better visualization
    sns.heatmap(gut_health_matrix, cmap='RdYlBu_r', ax=axes[0,1], cbar_kws={'label': 'Score'})
    axes[0,1].set_title('Gut Health Scores Heatmap')
    axes[0,1].set_xlabel('Participant')
    axes[0,1].set_ylabel('Gut Health Metric')
    
    # Box plot of all metrics
    gut_health_df[gut_metrics].boxplot(ax=axes[1,0], rot=45)
    axes[1,0].set_title('Distribution of All Gut Health Metrics')
    axes[1,0].set_ylabel('Score')
    
    # Correlation matrix of gut health metrics
    gut_corr = gut_health_df[gut_metrics].corr()
    sns.heatmap(gut_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1,1], fmt='.2f')
    axes[1,1].set_title('Gut Health Metrics Correlation')
    
    plt.tight_layout()
    plt.show()
else:
    print("No gut health data available")

## 9. Data Quality Summary and Recommendations

In [None]:
# Generate comprehensive data quality report
print("=" * 60)
print("CGMACROS DATASET QUALITY SUMMARY")
print("=" * 60)

print(f"\n📊 OVERALL DATASET:")
print(f"   • Total merged records: {len(merged_df):,}")
print(f"   • Number of participants: {merged_df['participant_id'].nunique()}")
print(f"   • Total features: {merged_df.shape[1]}")
print(f"   • Records with CCR > 0: {len(df_with_ccr[df_with_ccr['CCR'] > 0]):,}")

print(f"\n🎯 TARGET VARIABLE (CCR):")
ccr_stats = df_with_ccr[df_with_ccr['CCR'] > 0]['CCR']
if len(ccr_stats) > 0:
    print(f"   • Valid CCR records: {len(ccr_stats):,}")
    print(f"   • Mean CCR: {ccr_stats.mean():.3f}")
    print(f"   • CCR range: {ccr_stats.min():.3f} - {ccr_stats.max():.3f}")
    print(f"   • CCR variability (CV): {ccr_stats.std()/ccr_stats.mean():.3f}")

print(f"\n📈 DATA MODALITIES:")
glucose_avail = merged_df[['Libre GL', 'Dexcom GL']].notna().any(axis=1).sum()
activity_avail = merged_df[['HR', 'METs', 'Calories']].notna().any(axis=1).sum()
print(f"   • Glucose data: {glucose_avail:,} records ({glucose_avail/len(merged_df):.1%})")
print(f"   • Activity data: {activity_avail:,} records ({activity_avail/len(merged_df):.1%})")
if not bio_df.empty:
    print(f"   • Demographics: {len(bio_df)} participants")
if not microbes_df.empty:
    print(f"   • Microbiome: {len(microbes_df)} participants, {microbes_df.shape[1]-1} species")
if not gut_health_df.empty:
    print(f"   • Gut health: {len(gut_health_df)} participants, {len(gut_metrics)} metrics")

print(f"\n⚠️  DATA QUALITY ISSUES:")
# Check for common issues
issues = []

# Missing data issues
missing_pct = merged_df.isnull().sum() / len(merged_df) * 100
high_missing = missing_pct[missing_pct > 50]
if len(high_missing) > 0:
    issues.append(f"High missing data: {len(high_missing)} columns >50% missing")

# Participant data imbalance
records_per_participant = merged_df['participant_id'].value_counts()
if records_per_participant.std() / records_per_participant.mean() > 0.5:
    issues.append(f"Imbalanced participant data: CV = {records_per_participant.std()/records_per_participant.mean():.2f}")

# CCR data sparsity
ccr_coverage = len(df_with_ccr[df_with_ccr['CCR'] > 0]) / len(merged_df)
if ccr_coverage < 0.1:
    issues.append(f"Low CCR coverage: {ccr_coverage:.1%} of records have valid CCR")

if issues:
    for issue in issues:
        print(f"   • {issue}")
else:
    print(f"   • No major data quality issues detected")

print(f"\n🚀 RECOMMENDATIONS:")
print(f"   • Focus on meal-level prediction rather than individual time points")
print(f"   • Use participant-level features from demographics, microbiome, and gut health")
print(f"   • Engineer temporal and aggregate features from time-series data")
print(f"   • Consider participant-specific models or clustering approaches")
print(f"   • Handle missing glucose data with interpolation or imputation")
print(f"   • Use microbiome diversity features rather than individual species")

print("\n" + "=" * 60)
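
The recommendations on meal-level prediction and aggregate time-series features could be combined roughly as follows: one row per logged meal, with summary statistics of the glucose trace in a window after it. This is a sketch under several assumptions: `Timestamp` is already a datetime column, a meal is any row with `CCR > 0`, and the two-hour window and the `meal_level_table` helper are illustrative choices, not part of the project's modules.

```python
import pandas as pd


def meal_level_table(df, window='2h', glucose_col='Libre GL'):
    """Build one row per meal with post-meal glucose summaries."""
    rows = []
    for pid, grp in df.groupby('participant_id'):
        grp = grp.sort_values('Timestamp')
        trace = grp.set_index('Timestamp')[glucose_col]
        for _, meal in grp[grp['CCR'] > 0].iterrows():
            start = meal['Timestamp']
            # Label slicing on a sorted DatetimeIndex includes both endpoints.
            post = trace.loc[start:start + pd.Timedelta(window)]
            rows.append({
                'participant_id': pid,
                'meal_time': start,
                'CCR': meal['CCR'],
                'baseline_gl': post.iloc[0],  # reading at the meal timestamp
                'peak_gl': post.max(),
                'mean_gl': post.mean(),
            })
    return pd.DataFrame(rows)


# Toy trace (not real data): one meal at 08:00, readings every 30 minutes.
toy_records = pd.DataFrame({
    'participant_id': [1] * 6,
    'Timestamp': pd.date_range('2024-01-01 08:00', periods=6, freq='30min'),
    'Libre GL': [90.0, 120.0, 150.0, 130.0, 100.0, 95.0],
    'CCR': [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
})
toy_meal_table = meal_level_table(toy_records)
print(toy_meal_table)
```

Looping over meals keeps the sketch readable; for the full dataset a vectorized join would likely be faster.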

## 10. Next Steps

Based on this exploration, the next steps should be:

1. **Feature Engineering**: Use the updated feature engineering module to create comprehensive features
2. **Data Aggregation**: Consider meal-level aggregation for more stable predictions
3. **Model Development**: Start with simpler models and progressively add complexity
4. **Validation Strategy**: Use participant-aware cross-validation to avoid data leakage
5. **Baseline Models**: Establish baseline performance before adding multimodal features
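
For step 4, "participant-aware" means that all records from a given participant land on the same side of every train/test split. A minimal NumPy sketch of that idea follows. `participant_folds` is a hypothetical helper, and the fold count and seed are arbitrary.

```python
import numpy as np


def participant_folds(participant_ids, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each participant in one fold only."""
    ids = np.asarray(participant_ids)
    rng = np.random.default_rng(seed)
    # Shuffle participants, not rows, then deal them into n_splits groups.
    shuffled = rng.permutation(np.unique(ids))
    for held_out in np.array_split(shuffled, n_splits):
        test_mask = np.isin(ids, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Toy participant labels, one per record (not real data).
toy_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 6])
toy_folds = list(participant_folds(toy_ids, n_splits=3))
for train_idx, test_idx in toy_folds:
    print(sorted(set(toy_ids[test_idx])), 'held out;', len(train_idx), 'training rows')
```

scikit-learn's `GroupKFold` provides the same guarantee when `participant_id` is passed as the `groups` argument, and is probably the more convenient choice once a modeling pipeline is in place.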

The updated modules (`data_loader_updated.py`, `feature_engineering_updated.py`, `target_updated.py`) are now aligned with the actual dataset structure and ready for use in the modeling pipeline.