### Morphometrics

#### Steps In This Notebook:
1. Load the relevant data files for 2003-2004 and 2005-2006 individually, then combine them. Verify SEQN (the respondent ID number).
2. Wrangle the combined file
3. Combine with the mortality data
4. Perform EDA
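
Step 1 asks for a SEQN check but does not spell one out. A minimal sketch of what such a check might look like, assuming "verify" means: no missing IDs, no duplicate IDs within a cycle, and no ID shared between cycles. The helper `check_seqn` and the toy frames are hypothetical and are not part of the `nhanes` package.

```python
import pandas as pd


def check_seqn(cycle_a: pd.DataFrame, cycle_b: pd.DataFrame, key: str = "SEQN") -> dict:
    """Summarize ID integrity for two survey cycles before stacking them."""
    ids_a = cycle_a[key]
    ids_b = cycle_b[key]
    return {
        # Rows with no respondent ID at all
        "missing_ids": int(ids_a.isna().sum() + ids_b.isna().sum()),
        # IDs that appear more than once within a single cycle
        "duplicates_in_a": int(ids_a.duplicated().sum()),
        "duplicates_in_b": int(ids_b.duplicated().sum()),
        # IDs that appear in both cycles (would collide after concat)
        "shared_ids": len(set(ids_a.dropna()) & set(ids_b.dropna())),
    }


# Toy frames standing in for BMX_C and BMX_D (values are made up).
toy_c = pd.DataFrame({"SEQN": [21005, 21006, 21007], "BMXWT": [70.1, 55.3, 81.0]})
toy_d = pd.DataFrame({"SEQN": [31127, 31128], "BMXWT": [62.4, 90.2]})

report = check_seqn(toy_c, toy_d)
print(report)
```

If every count is zero, SEQN can serve as a unique key in the combined file.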

In [1]:
import pandas as pd
import json

import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from nhanes.utils import get_nhanes_year_code_dict, get_source_code_from_filepath
from nhanes.utils import EmptySectionError, make_long_variable_name
from nhanes.utils import get_vars_to_keep, get_datasets

#### The BMX Files (body measurements)

##### 2003-2004: check that the columns match 2005-2006

2003-2004

Codebook
- SEQN - Respondent sequence number
- BMDSTATS - Body Measures Component Status Code
- BMDRECUF - Height-Length difference flagged
- BMDSUBF - Unusual value noted during data review
- BMDTHICF - Unusual value noted during data review
- BMDLEGF - Unusual value noted during data review
- BMDARMLF - Unusual value noted during data review
- BMDCALFF - Unusual value noted during data review
- BMXWT - Weight (kg)
- BMIWT - Weight Comment
- BMXRECUM - Recumbent Length (cm)
- BMIRECUM - Recumbent Length Comment
- BMXHEAD - Head Circumference (cm)
- BMIHEAD - Head Circumference Comment
- BMXHT - Standing Height (cm)
- BMIHT - Standing Height Comment
- BMXBMI - Body Mass Index (kg/m**2)
- BMXLEG - Upper Leg Length (cm)
- BMILEG - Upper Leg Length Comment
- BMXCALF - Maximal Calf Circumference (cm)
- BMICALF - Maximal Calf Comment
- BMXARML - Upper Arm Length (cm)
- BMIARML - Upper Arm Length Comment
- BMXARMC - Arm Circumference (cm)
- BMIARMC - Arm Circumference Comment
- BMXWAIST - Waist Circumference (cm)
- BMIWAIST - Waist Circumference Comment
- BMXTHICR - Thigh Circumference (cm)
- BMITHICR - Thigh Circumference Comment
- BMXTRI - Triceps Skinfold (mm)
- BMITRI - Triceps Skinfold Comment
- BMXSUB - Subscapular Skinfold (mm)
- BMISUB - Subscapular Skinfold Comment

2005-2006

Codebook
- SEQN - Respondent sequence number
- BMDSTATS - Body Measures Component Status Code
- BMXWT - Weight (kg)
- BMIWT - Weight Comment
- BMXRECUM - Recumbent Length (cm)
- BMIRECUM - Recumbent Length Comment
- BMXHEAD - Head Circumference (cm)
- BMIHEAD - Head Circumference Comment
- BMXHT - Standing Height (cm)
- BMIHT - Standing Height Comment
- BMXBMI - Body Mass Index (kg/m**2)
- BMXLEG - Upper Leg Length (cm)
- BMILEG - Upper Leg Length Comment
- BMXCALF - Maximal Calf Circumference (cm)
- BMICALF - Maximal Calf Comment
- BMXARML - Upper Arm Length (cm)
- BMIARML - Upper Arm Length Comment
- BMXARMC - Arm Circumference (cm)
- BMIARMC - Arm Circumference Comment
- BMXWAIST - Waist Circumference (cm)
- BMIWAIST - Waist Circumference Comment
- BMXTHICR - Thigh Circumference (cm)
- BMITHICR - Thigh Circumference Comment
- BMXTRI - Triceps Skinfold (mm)
- BMITRI - Triceps Skinfold Comment
- BMXSUB - Subscapular Skinfold (mm)
- BMISUB - Subscapular Skinfold Comment

Missing in 2005-2006:

- BMDRECUF (Height-Length difference flagged)
- BMDSUBF (Unusual value noted during data review)
- BMDTHICF (Unusual value noted during data review)
- BMDLEGF (Unusual value noted during data review)
- BMDARMLF (Unusual value noted during data review)
- BMDCALFF (Unusual value noted during data review)

In [2]:
# Read the files
bmx_c = pd.read_csv('../data/raw_data/2003-2004a/BMX_C.csv')
bmx_d = pd.read_csv('../data/raw_data/2005-2006/BMX_D.csv')

# 1. Compare number of columns
print(f"BMX_C.csv has {len(bmx_c.columns)} columns")
print(f"BMX_D.csv has {len(bmx_d.columns)} columns")

# 2. Find columns in C that don't exist in D
missing_in_d = set(bmx_c.columns) - set(bmx_d.columns)
print(f"Columns in BMX_C but not in BMX_D: {missing_in_d}")

# 3. Check data types
for col in set(bmx_c.columns) & set(bmx_d.columns):
    if bmx_c[col].dtype != bmx_d[col].dtype:
        print(f"Data type difference for {col}: {bmx_c[col].dtype} in C, {bmx_d[col].dtype} in D")

BMX_C.csv has 33 columns
BMX_D.csv has 27 columns
Columns in BMX_C but not in BMX_D: {'BMDLEGF', 'BMDARMLF', 'BMDCALFF', 'BMDRECUF', 'BMDTHICF', 'BMDSUBF'}


In [6]:
import pandas as pd

# Read the files
bmx_c = pd.read_csv('../data/raw_data/2003-2004a/BMX_C.csv')
bmx_d = pd.read_csv('../data/raw_data/2005-2006/BMX_D.csv')

# Create a mapping dictionary for more descriptive column names
column_mapping = {
    'SEQN': 'ParticipantID',
    'BMDSTATS': 'BodyMeasuresStatus',
    'BMDRECUF': 'HeightLengthDifferenceFlag',
    'BMDSUBF': 'SubscapularUnusualValueFlag',
    'BMDTHICF': 'ThighUnusualValueFlag',
    'BMDLEGF': 'LegUnusualValueFlag',
    'BMDARMLF': 'ArmLengthUnusualValueFlag',
    'BMDCALFF': 'CalfUnusualValueFlag',
    'BMXWT': 'Weight_kg',
    'BMIWT': 'WeightComment',
    'BMXRECUM': 'RecumbentLength_cm',
    'BMIRECUM': 'RecumbentLengthComment',
    'BMXHEAD': 'HeadCircumference_cm',
    'BMIHEAD': 'HeadCircumferenceComment',
    'BMXHT': 'Height_cm',
    'BMIHT': 'HeightComment',
    'BMXBMI': 'BMI_kgm2',
    'BMXLEG': 'UpperLegLength_cm',
    'BMILEG': 'UpperLegLengthComment',
    'BMXCALF': 'CalfCircumference_cm',
    'BMICALF': 'CalfCircumferenceComment',
    'BMXARML': 'UpperArmLength_cm',
    'BMIARML': 'UpperArmLengthComment',
    'BMXARMC': 'ArmCircumference_cm',
    'BMIARMC': 'ArmCircumferenceComment',
    'BMXWAIST': 'WaistCircumference_cm',
    'BMIWAIST': 'WaistCircumferenceComment',
    'BMXTHICR': 'ThighCircumference_cm',
    'BMITHICR': 'ThighCircumferenceComment',
    'BMXTRI': 'TricepsSkinfold_mm',
    'BMITRI': 'TricepsSkinfoldComment',
    'BMXSUB': 'SubscapularSkinfold_mm',
    'BMISUB': 'SubscapularSkinfoldComment'
}

# Convert BMIHEAD to float in BMX_D
bmx_d['BMIHEAD'] = pd.to_numeric(bmx_d['BMIHEAD'], errors='coerce')

# Add a column to identify the NHANES cycle
bmx_c['Cycle'] = '2003-2004'
bmx_d['Cycle'] = '2005-2006'

# The six flag variables exist only in 2003-2004.
# Keep them and add all-missing columns to 2005-2006 so both frames share the same columns.
flag_columns = ['BMDRECUF', 'BMDSUBF', 'BMDTHICF', 'BMDLEGF', 'BMDARMLF', 'BMDCALFF']
for col in flag_columns:
    if col not in bmx_d.columns:
        bmx_d[col] = None

# Combine the datasets
combined_df = pd.concat([bmx_c, bmx_d], ignore_index=True)

# Apply the new column names
combined_df = combined_df.rename(columns=column_mapping)

# Save the combined dataset
combined_df.to_csv('../data/dataCombined/BMX_combined_labeled.csv', index=False)

# Print summary information
# The cycle label was stored in the 'Cycle' column above (not 'NHANESCycle')
print(f"Combined dataset has {combined_df.shape[0]} rows and {combined_df.shape[1]} columns")
print(f"2003-2004 participants: {sum(combined_df['Cycle'] == '2003-2004')}")
print(f"2005-2006 participants: {sum(combined_df['Cycle'] == '2005-2006')}")
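
The cycle label is stored in a column named `Cycle`, so a lookup under any other name raises a `KeyError`. A small sketch of a per-cycle count that checks the column name first and confirms that the counts add up to the total row count. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the combined frame (values are made up).
toy_combined = pd.DataFrame({
    "ParticipantID": [21005, 21006, 31127],
    "Cycle": ["2003-2004", "2003-2004", "2005-2006"],
})

cycle_col = "Cycle"

# Fail early with a readable message if the column name is wrong.
if cycle_col not in toy_combined.columns:
    raise KeyError(f"{cycle_col!r} not found; available: {list(toy_combined.columns)}")

# One count per cycle, instead of one boolean sum per label.
counts = toy_combined[cycle_col].value_counts()
print(counts)

# Every row should belong to exactly one cycle.
assert counts.sum() == len(toy_combined)
```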

In [8]:

# Read the combined BMX file
bmx = pd.read_csv('../data/dataCombined/BMX_combined_labeled.csv')

# Basic information
print(f"Shape: {bmx.shape}")
print("\nData types:")
print(bmx.dtypes)

Shape: (19593, 34)

Data types:
ParticipantID                  float64
BodyMeasuresStatus             float64
HeightLengthDifferenceFlag     float64
SubscapularUnusualValueFlag    float64
ThighUnusualValueFlag          float64
LegUnusualValueFlag            float64
ArmLengthUnusualValueFlag      float64
CalfUnusualValueFlag           float64
Weight_kg                      float64
WeightComment                  float64
RecumbentLength_cm             float64
RecumbentLengthComment         float64
HeadCircumference_cm           float64
HeadCircumferenceComment       float64
Height_cm                      float64
HeightComment                  float64
BMI_kgm2                       float64
UpperLegLength_cm              float64
UpperLegLengthComment          float64
CalfCircumference_cm           float64
CalfCircumferenceComment       float64
UpperArmLength_cm              float64
UpperArmLengthComment          float64
ArmCircumference_cm            float64
ArmCircumferenceComment        f

In [10]:
# Check for null values
print("\nMissing values by column:")
missing_values = bmx.isnull().sum()
missing_values


Missing values by column:


ParticipantID                      0
BodyMeasuresStatus                 0
HeightLengthDifferenceFlag     19566
SubscapularUnusualValueFlag    19592
ThighUnusualValueFlag          19590
LegUnusualValueFlag            19579
ArmLengthUnusualValueFlag      19588
CalfUnusualValueFlag           19590
Weight_kg                        217
WeightComment                  18888
RecumbentLength_cm             17007
RecumbentLengthComment         19516
HeadCircumference_cm           19032
HeadCircumferenceComment       19592
Height_cm                       1941
HeightComment                  19010
BMI_kgm2                        1957
UpperLegLength_cm               5060
UpperLegLengthComment          18946
CalfCircumference_cm            4937
CalfCircumferenceComment       19062
UpperArmLength_cm                905
UpperArmLengthComment          19053
ArmCircumference_cm              912
ArmCircumferenceComment        19045
WaistCircumference_cm           2524
WaistCircumferenceComment      18928
T

In [11]:
bmx.columns

Index(['ParticipantID', 'BodyMeasuresStatus', 'HeightLengthDifferenceFlag',
       'SubscapularUnusualValueFlag', 'ThighUnusualValueFlag',
       'LegUnusualValueFlag', 'ArmLengthUnusualValueFlag',
       'CalfUnusualValueFlag', 'Weight_kg', 'WeightComment',
       'RecumbentLength_cm', 'RecumbentLengthComment', 'HeadCircumference_cm',
       'HeadCircumferenceComment', 'Height_cm', 'HeightComment', 'BMI_kgm2',
       'UpperLegLength_cm', 'UpperLegLengthComment', 'CalfCircumference_cm',
       'CalfCircumferenceComment', 'UpperArmLength_cm',
       'UpperArmLengthComment', 'ArmCircumference_cm',
       'ArmCircumferenceComment', 'WaistCircumference_cm',
       'WaistCircumferenceComment', 'ThighCircumference_cm',
       'ThighCircumferenceComment', 'TricepsSkinfold_mm',
       'TricepsSkinfoldComment', 'SubscapularSkinfold_mm',
       'SubscapularSkinfoldComment', 'Cycle'],
      dtype='object')

In [16]:
# List of columns to drop - all flag and comment columns
columns_to_drop = [
    'BodyMeasuresStatus',
    'HeightLengthDifferenceFlag',
    'SubscapularUnusualValueFlag',
    'ThighUnusualValueFlag',
    'LegUnusualValueFlag',
    'ArmLengthUnusualValueFlag',
    'CalfUnusualValueFlag',
    'WeightComment',
    'RecumbentLengthComment',
    'HeadCircumferenceComment',
    'HeightComment',
    'UpperLegLengthComment',
    'CalfCircumferenceComment',
    'UpperArmLengthComment',
    'ArmCircumferenceComment',
    'WaistCircumferenceComment',
    'ThighCircumferenceComment',
    'TricepsSkinfoldComment',
    'SubscapularSkinfoldComment'
]

# Drop the columns
bmx_clean = bmx.drop(columns=columns_to_drop)

# Display the new column list
print("Columns remaining in the clean dataset:")
print(bmx_clean.columns.tolist())
print(f"Reduced from {len(bmx.columns)} to {len(bmx_clean.columns)} columns")

# Save the cleaned dataframe to a new CSV file
bmx_clean.to_csv('../data/dataCombined/BMX_clean.csv', index=False)

print(f"Clean BMX dataset saved to BMX_clean.csv with {bmx_clean.shape[0]} rows and {bmx_clean.shape[1]} columns")

Columns remaining in the clean dataset:
['ParticipantID', 'Weight_kg', 'RecumbentLength_cm', 'HeadCircumference_cm', 'Height_cm', 'BMI_kgm2', 'UpperLegLength_cm', 'CalfCircumference_cm', 'UpperArmLength_cm', 'ArmCircumference_cm', 'WaistCircumference_cm', 'ThighCircumference_cm', 'TricepsSkinfold_mm', 'SubscapularSkinfold_mm', 'Cycle']
Reduced from 34 to 15 columns
Clean BMX dataset saved to BMX_clean.csv with 19593 rows and 15 columns


In [17]:
bmx.describe()


Unnamed: 0,ParticipantID,BodyMeasuresStatus,HeightLengthDifferenceFlag,SubscapularUnusualValueFlag,ThighUnusualValueFlag,LegUnusualValueFlag,ArmLengthUnusualValueFlag,CalfUnusualValueFlag,Weight_kg,WeightComment,...,ArmCircumference_cm,ArmCircumferenceComment,WaistCircumference_cm,WaistCircumferenceComment,ThighCircumference_cm,ThighCircumferenceComment,TricepsSkinfold_mm,TricepsSkinfoldComment,SubscapularSkinfold_mm,SubscapularSkinfoldComment
count,19593.0,19593.0,27.0,1.0,3.0,14.0,5.0,3.0,19376.0,705.0,...,18681.0,548.0,17069.0,665.0,14489.0,688.0,17483.0,1745.0,16175.0,3056.0
mean,31261.57495,1.3882,1.0,1.0,1.0,1.0,1.0,1.0,60.309708,2.895035,...,27.852288,1.0,85.520183,1.0,51.456367,1.0,16.219385,1.277937,15.030751,1.064791
std,5912.179825,0.794959,0.0,,0.0,0.0,0.0,0.0,31.503332,0.579273,...,7.807737,0.0,21.415495,0.0,8.095624,0.0,8.219635,0.448111,8.440113,0.246196
min,21005.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.4,1.0,...,10.8,1.0,32.0,1.0,28.0,1.0,2.8,1.0,2.8,1.0
25%,26142.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35.7,3.0,...,22.0,1.0,70.5,1.0,46.2,1.0,9.5,1.0,8.0,1.0
50%,31285.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,63.4,3.0,...,28.7,1.0,86.0,1.0,50.9,1.0,14.0,1.0,13.0,1.0
75%,36382.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,81.1,3.0,...,33.3,1.0,100.6,1.0,56.1,1.0,21.5,2.0,21.0,1.0
max,41474.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,371.0,4.0,...,62.4,1.0,175.0,1.0,94.8,1.0,45.0,2.0,44.0,2.0


In [18]:

# Read the combined BMX file
#bmx = pd.read_csv('../data/dataCombined/BMX_combined_labeled.csv')

# Basic information
print(f"Shape: {bmx.shape}")
print("\nData types:")
print(bmx.dtypes)

# Check for null values
print("\nMissing values by column:")
missing_values = bmx.isnull().sum()
missing_percent = (bmx.isnull().sum() / len(bmx)) * 100
missing_df = pd.DataFrame({'Missing Count': missing_values, 'Missing Percent': missing_percent})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Percent', ascending=False))

# Descriptive statistics for key measurements
measurement_cols = ['Weight_kg', 'Height_cm', 'BMI_kgm2', 'WaistCircumference_cm', 'ArmCircumference_cm', 'ThighCircumference_cm', 'TricepsSkinfold_mm', 'SubscapularSkinfold_mm']

print("\nDescriptive statistics for key measurements:")
print(bmx[measurement_cols].describe())

# Check distribution by cycle
print("\nDistribution by cycle:")
print(bmx['Cycle'].value_counts())

# Check key measurements by cycle to ensure consistency
print("\nMean values by cycle:")
cycle_means = bmx.groupby('Cycle')[measurement_cols].mean()
print(cycle_means)

# Check for implausible values
print("\nChecking for implausible values:")
implausible = {
    'Weight_kg': (bmx['Weight_kg'] < 20) | (bmx['Weight_kg'] > 300),
    'Height_cm': (bmx['Height_cm'] < 100) | (bmx['Height_cm'] > 220),
    'BMI_kgm2': (bmx['BMI_kgm2'] < 10) | (bmx['BMI_kgm2'] > 70),
    'WaistCircumference_cm': (bmx['WaistCircumference_cm'] < 40) | (bmx['WaistCircumference_cm'] > 200)
}

for col, condition in implausible.items():
    count = condition.sum()
    if count > 0:
        print(f"Found {count} potentially implausible values in {col}")

Shape: (19593, 34)

Data types:
ParticipantID                  float64
BodyMeasuresStatus             float64
HeightLengthDifferenceFlag     float64
SubscapularUnusualValueFlag    float64
ThighUnusualValueFlag          float64
LegUnusualValueFlag            float64
ArmLengthUnusualValueFlag      float64
CalfUnusualValueFlag           float64
Weight_kg                      float64
WeightComment                  float64
RecumbentLength_cm             float64
RecumbentLengthComment         float64
HeadCircumference_cm           float64
HeadCircumferenceComment       float64
Height_cm                      float64
HeightComment                  float64
BMI_kgm2                       float64
UpperLegLength_cm              float64
UpperLegLengthComment          float64
CalfCircumference_cm           float64
CalfCircumferenceComment       float64
UpperArmLength_cm              float64
UpperArmLengthComment          float64
ArmCircumference_cm            float64
ArmCircumferenceComment        f
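
The range checks in this cell can be expressed with `Series.between`, with one caveat: `between` returns `False` for missing values, so negating it would count every `NaN` as out of range. The sketch below masks missing values explicitly. The helper name `count_out_of_range` and the toy values are hypothetical. Also note that the combined file includes infants and children (the minimum weight reported by `describe()` is 2.4 kg), so an adult-oriented floor such as 20 kg flags valid pediatric measurements.

```python
import numpy as np
import pandas as pd


def count_out_of_range(values: pd.Series, low: float, high: float) -> int:
    """Count non-missing values that fall outside the closed interval [low, high]."""
    outside = ~values.between(low, high)   # NaN is also True here, so mask it next
    return int((outside & values.notna()).sum())


# Made-up weights: one infant, two adults, one missing, one extreme value.
toy_weight = pd.Series([2.4, 63.4, np.nan, 371.0, 81.1])

n_flagged = count_out_of_range(toy_weight, 20, 300)
print(n_flagged)
```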

In [20]:
import pandas as pd

# Load the BMX combined data
bmx = pd.read_csv('../data/dataCombined/BMX_combined_labeled.csv')

# Columns to drop: status/flag and comment columns, plus measurements with a high percentage of missing values
columns_to_drop = [
    # Flag columns
    'BodyMeasuresStatus',
    'HeightLengthDifferenceFlag',
    'SubscapularUnusualValueFlag',
    'ThighUnusualValueFlag',
    'LegUnusualValueFlag',
    'ArmLengthUnusualValueFlag',
    'CalfUnusualValueFlag',
    
    # Comment columns
    'WeightComment',
    'RecumbentLengthComment',
    'HeadCircumferenceComment',
    'HeightComment',
    'UpperLegLengthComment',
    'CalfCircumferenceComment',
    'UpperArmLengthComment',
    'ArmCircumferenceComment',
    'WaistCircumferenceComment',
    'ThighCircumferenceComment',
    'TricepsSkinfoldComment',
    'SubscapularSkinfoldComment',
    
    # High missing percentage columns
    'HeadCircumference_cm',
    'RecumbentLength_cm'
]

# Drop the columns
bmx_clean = bmx.drop(columns=columns_to_drop)

# Display the new column list
print("Columns remaining in the clean dataset:")
print(bmx_clean.columns.tolist())
print(f"Reduced from {len(bmx.columns)} to {len(bmx_clean.columns)} columns")

# Check for any remaining columns with high missing values
missing_values = bmx_clean.isnull().sum()
missing_percent = (bmx_clean.isnull().sum() / len(bmx_clean)) * 100
missing_df = pd.DataFrame({'Missing Count': missing_values, 'Missing Percent': missing_percent})
print("\nRemaining columns with missing values:")
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Percent', ascending=False))

# Save the cleaned dataframe to a new CSV file
bmx_clean.to_csv('../data/dataCombined/BMX_clean.csv', index=False)

print(f"Clean BMX dataset saved to BMX_clean.csv with {bmx_clean.shape[0]} rows and {bmx_clean.shape[1]} columns")

Columns remaining in the clean dataset:
['ParticipantID', 'Weight_kg', 'Height_cm', 'BMI_kgm2', 'UpperLegLength_cm', 'CalfCircumference_cm', 'UpperArmLength_cm', 'ArmCircumference_cm', 'WaistCircumference_cm', 'ThighCircumference_cm', 'TricepsSkinfold_mm', 'SubscapularSkinfold_mm', 'Cycle']
Reduced from 34 to 13 columns

Remaining columns with missing values:
                        Missing Count  Missing Percent
ThighCircumference_cm            5104        26.050120
UpperLegLength_cm                5060        25.825550
CalfCircumference_cm             4937        25.197775
SubscapularSkinfold_mm           3418        17.445006
WaistCircumference_cm            2524        12.882152
TricepsSkinfold_mm               2110        10.769152
BMI_kgm2                         1957         9.988261
Height_cm                        1941         9.906599
ArmCircumference_cm               912         4.654724
UpperArmLength_cm                 905         4.618997
Weight_kg                        
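
The drop list above is written by hand. As an alternative sketch, sparse columns could be selected by a missing-fraction threshold. The helper `drop_sparse_columns` and the 0.5 cutoff are assumptions for illustration, not values taken from this notebook. A threshold alone would not remove `BodyMeasuresStatus`, which has no missing values and would still need to be dropped by name.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is at most max_missing_frac."""
    missing_frac = df.isna().mean()  # per-column share of NaN values
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df.loc[:, keep]


# Toy frame (values are made up).
toy = pd.DataFrame({
    "ParticipantID": [1, 2, 3, 4],
    "Weight_kg": [70.0, np.nan, 55.0, 81.0],                  # 25% missing
    "HeadCircumference_cm": [np.nan, np.nan, np.nan, 40.0],   # 75% missing
})

kept = drop_sparse_columns(toy, max_missing_frac=0.5)
print(kept.columns.tolist())
```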

In [21]:


# Load all datasets
bmx_clean = pd.read_csv('../data/dataCombined/BMX_clean.csv')
analysis_data = pd.read_csv('../data/dataCombined/NHANES_analysis_data.csv')
seqn_analysis = pd.read_csv('../data/dataCombined/SEQN_analysis.csv')

# Convert ParticipantID to integer in bmx_clean to match SEQN in other files
bmx_clean['ParticipantID'] = bmx_clean['ParticipantID'].astype(int)

# Filter BMX data to only include participants in the analysis dataset
bmx_filtered = bmx_clean[bmx_clean['ParticipantID'].isin(seqn_analysis['SEQN'])]

print(f"Original BMX data: {bmx_clean.shape[0]} participants")
print(f"Filtered BMX data: {bmx_filtered.shape[0]} participants")
print(f"Analysis dataset: {seqn_analysis.shape[0]} participants")

# Merge the filtered BMX data with the mortality analysis dataset
# Rename ParticipantID to SEQN to match the column name in the analysis dataset
bmx_filtered = bmx_filtered.rename(columns={'ParticipantID': 'SEQN'})

# Perform the merge
combined_data = pd.merge(analysis_data, bmx_filtered, on='SEQN', how='left')

# Check if all participants have BMX data
missing_bmx = combined_data[combined_data['Weight_kg'].isnull()].shape[0]
print(f"Number of participants without BMX data: {missing_bmx} ({missing_bmx/combined_data.shape[0]*100:.2f}%)")

# Save the combined dataset
combined_data.to_csv('../data/dataCombined/NHANES_combinedBMX_analysis.csv', index=False)

print(f"Combined dataset saved with {combined_data.shape[0]} rows and {combined_data.shape[1]} columns")

# Summary of key variables
key_vars = ['SEQN', 'Age', 'BMI', 'Weight_kg', 'Height_cm', 'BMI_kgm2', 'WaistCircumference_cm', 'yr5_mort', "permth_exm","permth_int","ucod_leading"]
print("\nSummary of key variables in combined dataset:")
print(combined_data[key_vars].describe())

# Check correlation between BMI from analysis data and BMI from BMX data
if 'BMI' in combined_data.columns and 'BMI_kgm2' in combined_data.columns:
    correlation = combined_data['BMI'].corr(combined_data['BMI_kgm2'])
    print(f"\nCorrelation between BMI and BMI_kgm2: {correlation:.4f}")

Original BMX data: 19593 participants
Filtered BMX data: 3198 participants
Analysis dataset: 3198 participants
Number of participants without BMX data: 0 (0.00%)
Combined dataset saved with 3198 rows and 93 columns
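
The cell above treats a null `Weight_kg` as "no BMX data", but weight can also be missing for a participant who does have a BMX record. A more direct check, sketched here on made-up toy frames, uses the `validate` and `indicator` parameters of `pd.merge`: `validate="one_to_one"` raises if SEQN is duplicated on either side, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the analysis and BMX tables (values are made up).
toy_analysis = pd.DataFrame({"SEQN": [21009, 21010, 31130], "Age": [55.0, 61.5, 70.2]})
toy_bmx = pd.DataFrame({"SEQN": [21009, 31130, 31131], "Weight_kg": [80.6, 67.6, 91.8]})

merged = pd.merge(
    toy_analysis,
    toy_bmx,
    on="SEQN",
    how="left",
    validate="one_to_one",  # raises MergeError if SEQN repeats on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Participants in the analysis table with no BMX record at all.
unmatched = merged.loc[merged["_merge"] == "left_only", "SEQN"].tolist()
print(unmatched)

# Drop the helper column before saving.
merged = merged.drop(columns="_merge")
```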

Summary of key variables in combined dataset:
               SEQN          Age          BMI    Weight_kg    Height_cm  \
count   3198.000000  3198.000000  3198.000000  3198.000000  3198.000000   
mean   30919.298624    65.969512    28.862373    80.600219   166.926276   
std     5939.084536     9.682211     5.966912    18.646620    10.083906   
min    21009.000000    50.000000    13.360000    35.900000   133.700000   
25%    25782.000000    58.083333    24.760000    67.600000   159.500000   
50%    30529.500000    65.500000    28.020000    78.000000   166.750000   
75%    36118.750000    73.562500    31.905000    91.775000   174.200000   
max    41468.000000    84.916667    59.120000   173.500000   203.200000   

          BMI_kgm2  WaistCircumference_cm     yr5_mort   permt