<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Visualization of Inspiration Astronaut Biomarkers

## Background
We are acting as the first data scientist on the Inspiration-Health project. Our first major task is to build a **reproducible pipeline** to clean, transform, and visualize astronaut biomarker data from the SpaceX Inspiration4 mission.
Instead of building predictive models, our focus is on **identifying and visualizing significant deviations from baseline values** to help communicate health changes during and after spaceflight.


## Data
We are working with astronaut biospecimen data from the **SpaceX Inspiration4 Mission** (NASA OSD-575).
The cleaned files in `cleaned_data/` include the Comprehensive Metabolic Panel (CMP), Cardiovascular Multiplex Panel, and Immune Multiplex Panel.

Each dataset is organized in **long format**, where each row corresponds to a single measurement for a given astronaut at a given timepoint.
Key columns include:

- **analyte** – the biomarker name (e.g., *GLUCOSE, CREATININE, SODIUM, ALBUMIN*).
- **value** – measured biomarker value.
- **range_min**, **range_max** – clinical reference ranges for each analyte.
- **units** – measurement units (mg/dL, U/L, etc.).
- **test_type** – indicates panel (CMP, cardiovascular multiplex, immune multiplex).
- **subject_id** – astronaut ID (C001–C004).
- **sex** – M/F.
- **timepoint** – collection time relative to flight (*L-92, L-44, L-3 pre-flight*; *R+1, R+45, R+82 post-flight*).

These features allow us to:
- Track biomarker trajectories across mission phases.
- Compare astronaut values to **baseline (pre-flight average)**.
- Identify deviations outside clinical reference ranges.
- Visualize individual astronaut responses and group trends.
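
As a minimal sketch of the reference-range check, the snippet below flags each measurement as low, normal, or high using the `value`, `range_min`, and `range_max` columns described above. The toy values and the helper name `flag_out_of_range` are illustrative assumptions, not part of the dataset or the project code.

```python
import pandas as pd

# Toy long-format rows using the columns described above (values are made up).
toy_ranges = pd.DataFrame({
    "subject_id": ["C001", "C001", "C002", "C002"],
    "analyte":    ["GLUCOSE", "SODIUM", "GLUCOSE", "SODIUM"],
    "timepoint":  ["R+1", "R+1", "R+1", "R+1"],
    "value":      [110.0, 140.0, 85.0, 133.0],
    "range_min":  [65.0, 135.0, 65.0, 135.0],
    "range_max":  [99.0, 146.0, 99.0, 146.0],
})

def flag_out_of_range(df):
    """Return a copy with a 'range_flag' column: 'low', 'high', or 'normal'."""
    out = df.copy()
    out["range_flag"] = "normal"
    out.loc[out["value"] < out["range_min"], "range_flag"] = "low"
    out.loc[out["value"] > out["range_max"], "range_flag"] = "high"
    return out

flagged = flag_out_of_range(toy_ranges)
print(flagged[["subject_id", "analyte", "value", "range_flag"]])
```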


## Approach

We will build a pipeline that:

1. **Load Data**
   Merge metadata and biomarker tables.

2. **Clean Data**
   Handle missing values, outliers, and consistency issues.

3. **Split Data by Phase**
   Group values into pre-flight vs. post-flight to allow comparison.

4. **Feature Engineering for Visualization**
   - Compute **baseline mean** for each analyte (average of L-92, L-44, L-3).
   - Calculate **delta = value – baseline** at each post-flight timepoint.
   - Flag analytes with statistically or clinically notable deviations.

5. **Visualization**
   - Line plots of biomarker trajectories across timepoints.
   - Heatmaps of deviations (Δ) per astronaut and analyte.
   - Highlight analytes with the largest relative changes.

6. **Communication**
   Produce visuals and narratives that explain *what changed, when it changed, and why it matters* in accessible language.
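
The baseline and delta computations in step 4 could be sketched roughly as follows. This assumes the long-format columns `subject_id`, `analyte`, `timepoint`, and `value` described in the Data section; the helper name `add_baseline_delta`, the `pct_delta` column, and the toy values are hypothetical.

```python
import pandas as pd

PRE_FLIGHT = ["L-92", "L-44", "L-3"]
POST_FLIGHT = ["R+1", "R+45", "R+82"]

# Toy data for one astronaut and one analyte (values are made up).
toy_traj = pd.DataFrame({
    "subject_id": ["C001"] * 6,
    "analyte": ["GLUCOSE"] * 6,
    "timepoint": PRE_FLIGHT + POST_FLIGHT,
    "value": [90.0, 92.0, 94.0, 101.0, 95.0, 92.0],
})

def add_baseline_delta(df):
    """Attach each subject/analyte pre-flight mean and the post-flight deltas."""
    keys = ["subject_id", "analyte"]
    # Baseline = mean of the pre-flight timepoints per astronaut and analyte
    baseline = (
        df[df["timepoint"].isin(PRE_FLIGHT)]
        .groupby(keys)["value"].mean()
        .rename("baseline")
        .reset_index()
    )
    post = df[df["timepoint"].isin(POST_FLIGHT)].merge(baseline, on=keys, how="left")
    post["delta"] = post["value"] - post["baseline"]
    post["pct_delta"] = 100 * post["delta"] / post["baseline"]
    return post

deltas = add_baseline_delta(toy_traj)

# Wide table of deltas (rows: astronaut/analyte, columns: timepoint) for a heatmap
heat = deltas.pivot_table(index=["subject_id", "analyte"], columns="timepoint", values="delta")
heat = heat[POST_FLIGHT]  # keep chronological column order
print(heat)
```

A table shaped like `heat` can then be passed to `sns.heatmap(heat, center=0, cmap="coolwarm", annot=True)` so that increases and decreases from baseline diverge in color around zero.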


In [79]:
# Run this before any other code cell
# It imports the libraries used throughout and loads a panel from the local cleaned_data/ directory
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# ==== CONFIG ====
DATA_DIR = Path("cleaned_data")

# Files (pick the one you want to analyze)
CMP_FILE      = DATA_DIR / "Metabolic_Panel.csv"
CARDIO_FILE   = DATA_DIR / "LSDS-8_Multiplex_serum_cardiovascular_EvePanel_TRANSFORMED_all_astronauts.csv"
IMMUNE_FILE   = DATA_DIR / "LSDS-8_Multiplex_serum_immune_EvePanel_TRANSFORMED_all_astronauts.csv"

# ==== LOAD DATA ====
def load_data(file):
    df = pd.read_csv(file)
    return df

cmp = load_data(CMP_FILE)
cmp.head()

In [80]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn import metrics

def run_model(X_train,y_train,X_test,y_test):

    lin_model = LinearRegression()
    lin_model.fit(X_train,y_train)
    y_pred = lin_model.predict(X_test)
    test_mse = metrics.mean_squared_error(y_test, y_pred)
    return test_mse

import warnings
warnings.filterwarnings("ignore")

pd.options.display.float_format = '{:,.2f}'.format

### Load data
Create and run a function `load_data()` to do your data loading and any merging needed.  You can specify the arguments and returns as needed.

In [81]:
def load_data(file1, file2):
    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)
    merged = pd.merge(df1, df2, how="outer", on=['dteday','hr'])
    return merged

### Clean data
Use the cell below to create and run a function `clean_data()` which cleans up the data as needed.

We started with the Metabolic Panel, which has the following rows that we transformed into columns:
* Sample Name
* albumin_value_gram_per_deciliter
* albumin_range_min_gram_per_deciliter
* albumin_range_max_gram_per_deciliter
* albumin_to_globulin_ratio_value
* albumin_to_globulin_ratio_range_min
* albumin_to_globulin_ratio_range_max
* alkaline_phosphatase_value_units_per_liter
* alkaline_phosphatase_range_min_units_per_liter
* alkaline_phosphatase_range_max_units_per_liter
* alt_value_units_per_liter
* alt_range_min_units_per_liter
* alt_max_units_per_liter
* ast_value_units_per_liter
* ast_range_min_units_per_liter
* ast_max_units_per_liter
* total_bilirubin_value_milligram_per_deciliter
* total_bilirubin_range_min_milligram_per_deciliter
* total_bilirubin_range_max_milligram_per_deciliter
* bun_to_creatinine_ratio_value
* bun_to_creatinine_ratio_range_min
* bun_to_creatinine_ratio_range_max
* calcium_value_milligram_per_deciliter
* calcium_range_min_milligram_per_deciliter
* calcium_range_max_milligram_per_deciliter
* carbon_dioxide_value_millimol_per_liter
* carbon_dioxide_range_min_millimol_per_liter
* carbon_dioxide_range_max_millimol_per_liter
* chloride_value_millimol_per_liter
* chloride_range_min_millimol_per_liter
* chloride_range_max_millimol_per_liter
* creatinine_value_milligram_per_deciliter
* creatinine_range_min_milligram_per_deciliter
* creatinine_range_max_milligram_per_deciliter
* egfr_african_american_value_milliliter_per_minute_per_1.73_meter_squared
* egfr_african_american_range_min_milliliter_per_minute_per_1.73_meter_squared
* egfr_african_american_range_max_milliliter_per_minute_per_1.73_meter_squared
* egfr_non_african_american_value_milliliter_per_minute_per_1.73_meter_squared
* egfr_non_african_american_range_min_milliliter_per_minute_per_1.73_meter_squared
* egfr_non_african_american_range_max_milliliter_per_minute_per_1.73_meter_squared
* globulin_value_gram_per_deciliter
* globulin_range_min_gram_per_deciliter
* globulin_range_max_gram_per_deciliter
* glucose_value_milligram_per_deciliter
* glucose_range_min_milligram_per_deciliter
* glucose_range_max_milligram_per_deciliter
* potassium_value_millimol_per_liter
* potassium_range_min_millimol_per_liter
* potassium_range_max_millimol_per_liter
* total_protein_value_gram_per_deciliter
* total_protein_range_min_gram_per_deciliter
* total_protein_range_max_gram_per_deciliter
* sodium_value_millimol_per_liter
* sodium_range_min_millimol_per_liter
* sodium_range_max_millimol_per_liter
* urea_nitrogen_bun_value_milligram_per_deciliter
* urea_nitrogen_bun_range_min_milligram_per_deciliter
* urea_nitrogen_bun_range_max_milligram_per_deciliter
* astronautID
* timepoint


In [1]:
def clean_data(s1):
    # Standardize dates
    s1['dteday'] = pd.to_datetime(s1['dteday'], errors='coerce')

    # Convert to epoch seconds
    s1['dteday'] = (s1['dteday'] + pd.to_timedelta(s1['hr'], unit='h')).astype('int64') // 10**9

    # Forcing Data into numeric if not already
    for col in s1.columns[1:]:
        s1[col] = pd.to_numeric(s1[col], errors='coerce')

    # Dropping non-numerics
    s2 = s1.dropna()

    print("Dropped " + str(len(s1)-len(s2)) + " rows from our dataset due to missing values.")

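    # Remove extrema: keep only rows where every numeric column lies
    # within 2 standard deviations of that column's mean
    numeric_cols = s2.iloc[:, 2:]
    keep_rows = pd.Series(True, index=s2.index)

    for col in numeric_cols.columns:
        # Get the Series
        series = s2[col]

        # Calculate mean and std
        col_mn = series.mean()
        col_st = series.std()

        # Accumulate the mask so it always stays aligned with s2's index
        keep_rows &= (series > col_mn - 2 * col_st) & (series < col_mn + 2 * col_st)

    # Drop rows that fall outside the range for any column
    s3 = s2[keep_rows].copy()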
    print("Dropped " +str(len(s2)-len(s3))+ " columns due to extrema")

    s4 = s3.drop(columns=['hr'])

    return s4

#### Handling missing values

Several features in the metabolic panel had missing values. Two columns were completely empty, so we dropped them. For the remaining features, we imputed missing values with the mean, median, or mode, depending on the feature.
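
A minimal sketch of this step is shown below. The text does not say which statistic was applied to which feature, so the choice here (median for numeric columns, mode for everything else) is an assumption, and `impute_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def impute_missing(df):
    """Drop fully empty columns, then fill the remaining gaps column by column."""
    # Columns with no observed values carry no information, so drop them
    out = df.dropna(axis=1, how="all").copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: median for numeric features (robust to outliers)
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: most frequent value for non-numeric features
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame with one empty column and gaps in the others (values are made up)
toy_missing = pd.DataFrame({
    "glucose_value": [90.0, np.nan, 94.0],
    "empty_col": [np.nan, np.nan, np.nan],
    "sex": ["M", None, "M"],
})
imputed = impute_missing(toy_missing)
print(imputed)
```

With only four astronauts and six timepoints, any imputed value has a large influence on the baseline, so it may be worth keeping a record of which cells were filled.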

### Split data for training and testing
Create and run the function `split_data()` in the cell below to split the data into training and test sets.  You should use all data up to and including July 31 2012 as the training set, and the data for the period August 1 2012 - December 31 2012 as the test set.

In [83]:
def split_data(s1):
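    # Define cutoff: the first instant of the test period (August 1 2012, 00:00)
    cutoff = int(pd.to_datetime("2012-08-01 00:00:00").timestamp())

    # Training set ends on July 31; the test set starts at the cutoff itself
    before = s1[s1['dteday'] < cutoff]
    after  = s1[s1['dteday'] >= cutoff]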

    target = 'cnt'

    # Split into X and y
    X_train = before.drop(columns=[target])
    y_train = before[target]

    X_test = after.drop(columns=[target])
    y_test = after[target]

    return X_train, y_train, X_test, y_test


### Feature Engineering
Create and run the function `build_features()` below to create any additional derivative features (e.g. time series features) that you wish to use in modeling.  You will need to apply this function to both your training and test sets.
#### Stats
This module provides statistical processing functions
for astronaut serum and biochemical datasets.

##### Overview
1. Defines a consistent analyte metadata map (`ANALYTE_INFO`)
   that assigns each analyte a human-friendly label and unit.

2. Transforms the wide-format astronaut dataset
   (value, min, max triplets for each analyte)
   into a tidy long-format DataFrame for analysis and plotting.

3. Provides statistical comparison functions to test
   whether recovery day 1 (R+1) is significantly shifted
   relative to the pre-launch baseline (L-series).

##### Data Model
The tidy DataFrame returned by `tidy_from_wide` has the following columns:

    astronautID   : string, subject identifier (e.g. "C001")
    timepoint     : string, raw timepoint (e.g. "L-3", "R+1")
    flight_day    : integer, absolute scale where:
                      L-0 = 0 (launch day)
                      R+0 = 3 (last day in space)
                      R+1 = 4 (first recovery day)
                      L-n = -n (pre-launch)
    analyte       : string, machine-readable analyte name
    value         : numeric, observed measurement
    min           : numeric, reference range minimum
    max           : numeric, reference range maximum
    label         : string, human-readable analyte label
    unit          : string, measurement unit
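
The wide-to-tidy reshaping can be sketched as follows. This is not the project's `tidy_from_wide`; it is an illustrative version (named `tidy_from_wide_sketch`) that assumes columns follow the `<analyte>_<value|range_min|range_max>[_<unit>]` naming seen in the Metabolic Panel list. Irregular names such as `alt_max_units_per_liter` would not match and would need renaming first, and the `label` lookup from `ANALYTE_INFO` is omitted.

```python
import re
import pandas as pd

# Assumed naming scheme: <analyte>_<value|range_min|range_max>[_<unit>]
COL_PATTERN = r"^(?P<analyte>.+?)_(?P<kind>value|range_min|range_max)(?:_(?P<unit>.+))?$"
KIND_TO_FIELD = {"value": "value", "range_min": "min", "range_max": "max"}

def flight_day(timepoint):
    """Map 'L-n' to -n and 'R+n' to n + 3 (R+0 is day 3, the last day in space)."""
    phase, n = timepoint[0], int(timepoint[2:])
    return -n if phase == "L" else n + 3

def tidy_from_wide_sketch(wide):
    ids = ["astronautID", "timepoint"]
    # Only melt columns that follow the assumed naming scheme
    value_cols = [c for c in wide.columns if re.match(COL_PATTERN, c)]
    long = wide.melt(id_vars=ids, value_vars=value_cols,
                     var_name="column", value_name="number")
    long = pd.concat([long, long["column"].str.extract(COL_PATTERN)], axis=1)
    long["field"] = long["kind"].map(KIND_TO_FIELD)
    # One row per astronaut/timepoint/analyte with value, min, max side by side
    tidy = long.pivot_table(index=ids + ["analyte"], columns="field",
                            values="number", aggfunc="first").reset_index()
    tidy.columns.name = None
    # Take the unit from the suffix of each analyte's value column
    units = long.loc[long["kind"] == "value", ["analyte", "unit"]].drop_duplicates()
    tidy = tidy.merge(units, on="analyte", how="left")
    tidy["flight_day"] = tidy["timepoint"].map(flight_day)
    return tidy

# Toy wide table (values are made up)
wide = pd.DataFrame({
    "astronautID": ["C001", "C001"],
    "timepoint": ["L-3", "R+1"],
    "albumin_value_gram_per_deciliter": [4.5, 4.1],
    "albumin_range_min_gram_per_deciliter": [3.5, 3.5],
    "albumin_range_max_gram_per_deciliter": [5.0, 5.0],
    "albumin_to_globulin_ratio_value": [1.8, 1.6],
})
tidy = tidy_from_wide_sketch(wide)
print(tidy)
```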

##### Statistical Tests
`analyze_r1_vs_L(tidy_df)` applies two complementary analyses:

1. **Within-astronaut analysis**
   - For each astronaut and analyte, compares the R+1 measurement
     to that astronaut’s baseline distribution of L-values.
   - Implemented as a one-sample t-test:
         H0: R+1 = mean(L)
   - Reports mean baseline, R+1, t-statistic, p-value,
     and effect size (Cohen’s d).

2. **Across-astronaut analysis**
   - Aggregates astronauts by computing each subject’s baseline mean (L-mean).
   - Performs a paired t-test across astronauts:
         H0: mean(L-means) = mean(R+1)
   - Reports group means, t-statistic, p-value,
     and effect size (mean difference / SD of differences).
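
The two tests could be sketched with `scipy.stats` as below. This is an illustration of the described approach rather than the actual body of `analyze_r1_vs_L`; the helper names, the sign convention (positive when R+1 is above baseline), and the toy numbers are assumptions.

```python
import numpy as np
from scipy import stats

def within_astronaut(l_values, r1):
    """One-sample t-test of an astronaut's L-values against their R+1 value."""
    l_values = np.asarray(l_values, dtype=float)
    res = stats.ttest_1samp(l_values, popmean=r1)
    sd = l_values.std(ddof=1)
    return {
        "n_L": len(l_values),
        "mean_L": l_values.mean(),
        "R1": r1,
        # scipy reports (mean(L) - R1) / SE; negate so a rise at R+1 is positive
        "t_stat": -res.statistic,
        "p_value": res.pvalue,
        "effect_size": (r1 - l_values.mean()) / sd,  # Cohen's d
    }

def across_astronauts(l_means, r1_values):
    """Paired t-test of R+1 values against per-astronaut baseline means."""
    l_means = np.asarray(l_means, dtype=float)
    r1_values = np.asarray(r1_values, dtype=float)
    res = stats.ttest_rel(r1_values, l_means)
    diff = r1_values - l_means
    return {
        "mean_L": l_means.mean(),
        "R1": r1_values.mean(),
        "t_stat": res.statistic,
        "p_value": res.pvalue,
        "effect_size": diff.mean() / diff.std(ddof=1),
    }

# Toy numbers (made up): three baseline draws for one astronaut, four astronauts overall
within = within_astronaut([90.0, 92.0, 94.0], r1=101.0)
group = across_astronauts([92.0, 88.0, 95.0, 90.0], [101.0, 93.0, 99.0, 97.0])
print(within)
print(group)
```

With three baseline timepoints per astronaut and four astronauts in total, both tests have very few degrees of freedom, so the effect sizes are likely to be more informative than the p-values alone.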

##### Returned Output
The results are returned as a DataFrame with columns:

    analyte      : analyte name
    astronautID  : subject ID ("ALL" for group-level test)
    test_type    : "within" (per astronaut) or "group" (across astronauts)
    n_L          : number of L timepoints used
    mean_L       : baseline mean
    R1           : R+1 value (or mean across astronauts)
    t_stat       : test statistic
    p_value      : two-sided p-value
    effect_size  : Cohen’s d

##### Use Cases
- Identify analytes that change significantly at R+1 relative to baseline.
- Assess subject-specific vs. group-level recovery patterns.
- Provide both inferential results (p-values) and effect sizes
  for scientific reporting and visualization.

In [3]:
# Stats

### Feature Selection
Use the cell below to create and run the function `feature_select()` which performs feature selection using univariate (filter) methods.  After you analyze the correlations, determine whether you would like to remove any features and do so.

### Prepare Features for Modeling
Our final step in the pipeline is to prepare our feature set for modeling.  In particular, in this step we need to ensure that any categorical variables we may be using are encoded as numeric values in order for the model to function properly.  You might also consider scaling some of your data.

In the below cell create and run a function `prepare_train_feats()` which prepares the training features.

We also need to prepare the features in our test set in the same way to feed into the model.  Use the cell below for the function `prepare_test_feats()` which prepares your test set features.

### Run pipeline
Finally, let's bring everything together in a function to run the entire pipeline for our training data.  Complete the function `run_pipeline()` in the cell below.  The function should call any/all of the functions you have defined above which are needed to load the data, transform it and prepare the features for both the training set and the test set.

In [84]:
def run_pipeline(bike_filename, weather_filename):
    '''
    Runs your pipeline (calling the above functions as needed) to transform the raw data into the training and test data sets for modeling

    Inputs:
        bike_filename(str): name of the file containing the bike data
        weather_filename(str): name of the file containing the weather data

    Returns:
        X_train(pd.DataFrame): dataframe containing the training set inputs
        y_train(pd.DataFrame): dataframe containing the training set labels
        X_test(pd.DataFrame): dataframe containing the test set inputs
        y_test(pd.DataFrame): dataframe containing the test set labels
    '''
    merged = load_data(bike_filename,weather_filename)
    clean = clean_data(merged)
    X_train, y_train, X_test, y_test = split_data(clean)

    return X_train, y_train, X_test, y_test

    

Now that we've prepared our features, we are ready to run our model.  Run the cell below, which trains the model on the training set and calculates and reports the mean squared error (MSE) on the test set.  If everything went well, you should have an MSE below 18500.

In [85]:
bike_datafile = "2011-2012_bikes.csv"
weather_datafile = "2011-2012_weather_messy.csv"
X_train, y_train, X_test, y_test = run_pipeline(bike_datafile, weather_datafile)
mse_score = run_model(X_train, y_train, X_test, y_test)
print('Mean Squared Error on the test set: {:.2f}'.format(mse_score))

assert mse_score < 18500

Dropped 58 rows from our dataset due to missing values.
Dropped 3416 columns due to extrema
Mean Squared Error on the test set: 17888.35
