
# Data Mining for Space Habitats
## *Analyzing ISS Environmental Telemetry for Safer Off-World Living*
**Kenny Jenkins**  
**Department of Computer Science**  
**University of Colorado Boulder**  

### Section 1: Setup and Imports
#### Purpose
Initialize libraries, plotting, and configuration for a multi-mission, multi-modal pipeline that supports exploratory analysis, forecasting, anomaly detection, and biological linkage.
#### What is implemented
1. Core scientific stack: NumPy, Pandas, SciPy, statsmodels, scikit-learn, TensorFlow Keras.
2. Visualization: Matplotlib, Seaborn, Plotly.
3. Time series utilities: STL decomposition, FFT, Kalman filter, SARIMAX, VAR.
4. Streamlit configuration for the dashboard UI.
5. Global settings for figure styles and display options.
6. Warning suppression for cleaner logs.
#### New since AWG
- Seasonal ARIMA and VAR have been included for benchmarking against LSTM forecasts.
- Configuration variables anticipate multi-mission scaling and future ingestion of behavioral and event context such as docking windows.
#### To-Dos
- Expose a single settings object for mission lists, paths, sampling intervals, and model hyperparameters.
- Add a random seed control block for reproducibility across deep learning and statsmodels.
- Parameterize figure sizes and export paths for batch reruns.

In [2]:
# ==== Section 1: Setup & Imports (Jupyter-first, Streamlit-safe) ====

import os, sys, random, warnings, io, json, time
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, Tuple, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly (optional visual QA)
import plotly.express as px
import plotly.graph_objects as go

# Timeseries & stats
from scipy.fft import fft, fftfreq
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import VAR
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from pykalman import KalmanFilter

# Deep learning
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.callbacks import EarlyStopping

# tqdm (Jupyter-friendly)
from tqdm.notebook import tqdm

# ---- Streamlit guard (don’t run page config when in Jupyter) ----
# try:
#    import streamlit as st
#    RUNNING_STREAMLIT = os.environ.get("STREAMLIT_SERVER_RUNNING") == "1"
#    if RUNNING_STREAMLIT:
#        st.set_page_config(page_title="ISS Environmental Telemetry & Omics Explorer", layout="wide")
#        st.title("ISS Environmental Telemetry & Omics Explorer")
#        st.caption("Rodent Research: RR-1, RR-3, RR-6, RR-9, RR-12, RR-19 · Variables: CO₂, Temperature, RH, Pressure, Radiation")
# except Exception:
#    RUNNING_STREAMLIT = False

# ---- Global Settings object (missions, paths, sampling, fig sizes, seeds) ----
@dataclass
class Settings:
    missions: List[str] = field(default_factory=lambda: ['RR-1','RR-3','RR-4','RR-5','RR-6','RR-8','RR-9','RR-12','RR-17','RR-19'])
    sampling: str = "1min"
    seed: int = 42
    fig_size: Tuple[int,int] = (12, 6)

    # Data roots (environment or sensible defaults). Put data OUTSIDE macOS-protected folders.
    data_roots: List[Path] = field(default_factory=lambda: [
        Path(os.environ.get("SPACEHAB_DATA", "~/data/osdr_eda")).expanduser(),         # recommended
        Path.cwd() / "data"                                                             # repo-local fallback
    ])
    # Optional separate omics root
    omics_roots: List[Path] = field(default_factory=lambda: [
        Path(os.environ.get("SPACEHAB_OMICS", "~/data/omics")).expanduser(),
        Path.cwd() / "data" / "omics"
    ])

    # Output dirs (all inside repo to avoid permissions/tcc issues)
    outputs_root: Path = field(default_factory=lambda: Path.cwd() / "outputs")
    preprocessed_dir: Path = field(default_factory=lambda: Path.cwd() / "outputs" / "preprocessed")
    pattern_dir: Path = field(default_factory=lambda: Path.cwd() / "outputs" / "pattern_analysis")
    relationships_dir: Path = field(default_factory=lambda: Path.cwd() / "outputs" / "relationships")
    anomalies_dir: Path = field(default_factory=lambda: Path.cwd() / "outputs" / "anomaly_forecast")

S = Settings()

# Create output dirs
for d in [S.outputs_root, S.preprocessed_dir, S.pattern_dir, S.pattern_dir / "radiation",
          S.relationships_dir, S.anomalies_dir]:
    d.mkdir(parents=True, exist_ok=True)

# ---- Reproducibility (NumPy, Python, TF; statsmodels uses NumPy RNG) ----
os.environ["PYTHONHASHSEED"] = str(S.seed)
os.environ["TF_DETERMINISTIC_OPS"] = "1"   # best-effort determinism
random.seed(S.seed)
np.random.seed(S.seed)
tf.random.set_seed(S.seed)

# ---- Plotting / Pandas display ----
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = S.fig_size
plt.rcParams["axes.titlesize"] = 14
plt.rcParams["axes.labelsize"] = 12

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 200)
pd.set_option("display.float_format", "{:.2f}".format)

# ---- Clean warnings ----
warnings.filterwarnings("ignore")

print("Environment setup complete.")
print("TensorFlow:", tf.__version__)
print("Data roots:", [str(p) for p in S.data_roots])
print("Omics roots:", [str(p) for p in S.omics_roots])
print("Outputs root:", str(S.outputs_root))


Environment setup complete.
TensorFlow: 2.16.2
Data roots: ['/Users/kennethjenkins/data/osdr_eda', '/Users/kennethjenkins/Citizen Science/Data Mining for Space Habitats/data']
Omics roots: ['/Users/kennethjenkins/data/omics', '/Users/kennethjenkins/Citizen Science/Data Mining for Space Habitats/data/omics']
Outputs root: /Users/kennethjenkins/Citizen Science/Data Mining for Space Habitats/outputs


### Section 2: Load and Preview ISS Environmental Telemetry
#### Purpose
Load EDA Rodent Research telemetry and radiation datasets across missions and load GeneLab omics tables for later linkage. Standardize and validate inputs to support cross-mission analysis.
#### What is implemented
1. Ingestion for RR missions: summary tables, minute-level telemetry, and radiation logs.
2. Omics ingestion for GLDS-98, GLDS-99, and GLDS-104 including differential expression, normalized counts, and metadata.
3. Time parsing and schema checks.
4. Quick previews, missing value summaries, and first-pass visualizations saved to disk.
5. Consolidation into an rr_data dictionary by mission.
#### New since AWG
- Plan to expand forecasting beyond RR-1 to additional RR missions prioritized by data completeness.
- Prepare hooks for additional omics datasets in GeneLab.
- Add placeholders for behavioral video and phenotypic data suggested by the group.
#### To-Dos
- Add loader for event context such as docking or EVA intervals when available.
- Create a data availability matrix by mission and variable that the dashboard can display.
- Implement light validation tests that fail fast when column names or date formats drift.
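
The fail-fast validation idea can be sketched on toy frames as follows. This is a minimal sketch, not the notebook's final validator: `require_cols` mirrors the helper defined in the cell below, while `require_parsable_times` and its `max_bad_fraction` threshold are hypothetical names introduced only for illustration.

```python
import pandas as pd

def require_cols(df, cols, name):
    """Raise if any expected column is absent (same idea as the notebook helper)."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"[SCHEMA] {name}: missing columns {missing}")

def require_parsable_times(df, col, name, max_bad_fraction=0.01):
    """Hypothetical check: raise if too many timestamps fail to parse."""
    parsed = pd.to_datetime(df[col], errors="coerce", utc=True)
    bad_fraction = parsed.isna().mean()
    if bad_fraction > max_bad_fraction:
        raise ValueError(f"[SCHEMA] {name}: {bad_fraction:.0%} of '{col}' is unparsable")
    return parsed

good = pd.DataFrame({"Controller_Time_GMT": ["2024-01-01 00:00:00", "2024-01-01 00:01:00"]})
renamed = good.rename(columns={"Controller_Time_GMT": "Time_GMT"})   # column-name drift
garbled = pd.DataFrame({"Controller_Time_GMT": ["2024-01-01 00:00:00", "not a date"]})

results = {}
for label, frame in {"good": good, "renamed": renamed, "garbled": garbled}.items():
    try:
        require_cols(frame, ["Controller_Time_GMT"], label)
        require_parsable_times(frame, "Controller_Time_GMT", label)
        results[label] = "ok"
    except ValueError as err:
        results[label] = str(err)

print(results)
```

Raising instead of printing means a renamed column or a changed date format stops the run at load time rather than surfacing later as an empty resample.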

In [3]:
# ==== Section 2: Load and Preview ISS Environmental and Omics Data ====
from pathlib import Path

# Helper: try to read a file across multiple roots
def find_file(rel_path: str, roots: List[Path]) -> Path | None:
    for root in roots:
        p = (root / rel_path).expanduser()
        if p.exists():
            return p
    return None

def safe_read_csv(path: Path, **kwargs) -> pd.DataFrame | None:
    try:
        return pd.read_csv(path, **kwargs)
    except Exception as e:
        print(f"[READ FAIL] {path}: {e}")
        return None

# ---- Where do the EDA files live relative to the root? ----
# Original filenames are kept; only the base folder changes.
def env_relpaths(rr: str) -> dict:
    return {
        "summary": f"EDA Rodent Research/{rr}_EDA_Summary_table.csv",
        "telemetry": f"EDA Rodent Research/{rr}_EDA_Telemetry_data.csv",
        "radiation": f"EDA Rodent Research/{rr}_EDA_Radiation_data.csv",
    }

# GeneLab omics: allow multiple roots; keys: diff,norm,meta
OMICS_MANIFEST = {
    "GLDS-98": {
        "diff": "RR-1/GLDS-98/GLDS-98_rna_seq_differential_expression.csv",
        "norm": "RR-1/GLDS-98/GLDS-98_rna_seq_Normalized_Counts.csv",
        "meta": "RR-1/GLDS-98/GLDS-98_rna_seq_SampleTable.csv",
    },
    "GLDS-99": {
        "diff": "RR-1/GLDS-99/GLDS-99_rna_seq_differential_expression.csv",
        "norm": "RR-1/GLDS-99/GLDS-99_rna_seq_Normalized_Counts.csv",
        "meta": "RR-1/GLDS-99/GLDS-99_rna_seq_SampleTable.csv",
    },
    "GLDS-104": {
        "diff": "RR-1/GLDS-104/GLDS-104_rna_seq_differential_expression.csv",
        "norm": "RR-1/GLDS-104/GLDS-104_rna_seq_Normalized_Counts.csv",
        "meta": "RR-1/GLDS-104/GLDS-104_rna_seq_visualization_PCA_table.csv",
    },
}

summary_data, telemetry_data, radiation_data = {}, {}, {}
ok_summary, ok_telemetry, ok_radiation = set(), set(), set()

# Basic schema validators (fail-fast on drift)
def require_cols(df: pd.DataFrame, cols: List[str], name: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"[SCHEMA] {name}: missing columns {missing}")

for rr in S.missions:
    rels = env_relpaths(rr)

    # Summary
    p = find_file(rels["summary"], S.data_roots)
    if p:
        df = safe_read_csv(p)
        if df is not None and not df.empty:
            summary_data[rr] = df
            ok_summary.add(rr)
    else:
        print(f"[MISS] {rels['summary']} (check SPACEHAB_DATA)")

    # Telemetry
    p = find_file(rels["telemetry"], S.data_roots)
    if p:
        df = safe_read_csv(p)
        if df is not None and not df.empty:
            # Robust time parse
            if "Controller_Time_GMT" in df.columns:
                df["Controller_Time_GMT"] = pd.to_datetime(df["Controller_Time_GMT"], errors="coerce", utc=True)
            telemetry_data[rr] = df
            ok_telemetry.add(rr)
    else:
        print(f"[MISS] {rels['telemetry']}")

    # Radiation
    p = find_file(rels["radiation"], S.data_roots)
    if p:
        df = safe_read_csv(p)
        if df is not None and not df.empty:
            if "Date" in df.columns:
                df["Date"] = pd.to_datetime(df["Date"], errors="coerce", utc=True)
            radiation_data[rr] = df
            ok_radiation.add(rr)
    else:
        print(f"[MISS] {rels['radiation']}")

    print(f"Checked {rr}")

# ---- Omics Loader (optional for RR-1 linkage later) ----
omics_data: Dict[str, Dict[str, pd.DataFrame]] = {}
for glds, parts in OMICS_MANIFEST.items():
    omics_data[glds] = {}
    for key, rel in parts.items():
        p = find_file(rel, S.omics_roots)
        if p:
            omics_data[glds][key] = safe_read_csv(p)
        else:
            print(f"[MISS] OMICS {glds}:{key} → {rel}")

# Quick previews & availability
availability_rows = []
for rr in S.missions:
    availability_rows.append({
        "RR_Mission": rr,
        "Summary Available": rr in ok_summary,
        "Telemetry Available": rr in ok_telemetry,
        "Radiation Available": rr in ok_radiation
    })
availability_df = pd.DataFrame(availability_rows)
print("\nDataset Availability by Mission:")
display(availability_df)

# Consolidated handle (unchanged interface for later sections)
rr_data = {
    rr: {
        "summary": summary_data.get(rr),
        "telemetry": telemetry_data.get(rr),
        "radiation": radiation_data.get(rr),
    }
    for rr in S.missions
}


[MISS] EDA Rodent Research/RR-1_EDA_Summary_table.csv (check SPACEHAB_DATA)
[MISS] EDA Rodent Research/RR-1_EDA_Telemetry_data.csv
[MISS] EDA Rodent Research/RR-1_EDA_Radiation_data.csv
Checked RR-1
[MISS] EDA Rodent Research/RR-3_EDA_Summary_table.csv (check SPACEHAB_DATA)
[MISS] EDA Rodent Research/RR-3_EDA_Telemetry_data.csv
[MISS] EDA Rodent Research/RR-3_EDA_Radiation_data.csv
Checked RR-3
[MISS] EDA Rodent Research/RR-4_EDA_Summary_table.csv (check SPACEHAB_DATA)
[MISS] EDA Rodent Research/RR-4_EDA_Telemetry_data.csv
[MISS] EDA Rodent Research/RR-4_EDA_Radiation_data.csv
Checked RR-4
[MISS] EDA Rodent Research/RR-5_EDA_Summary_table.csv (check SPACEHAB_DATA)
[MISS] EDA Rodent Research/RR-5_EDA_Telemetry_data.csv
[MISS] EDA Rodent Research/RR-5_EDA_Radiation_data.csv
Checked RR-5
[MISS] EDA Rodent Research/RR-6_EDA_Summary_table.csv (check SPACEHAB_DATA)
[MISS] EDA Rodent Research/RR-6_EDA_Telemetry_data.csv
[MISS] EDA Rodent Research/RR-6_EDA_Radiation_data.csv
Checked RR-6
[MISS

| | RR_Mission | Summary Available | Telemetry Available | Radiation Available |
|---|---|---|---|---|
| 0 | RR-1 | False | False | False |
| 1 | RR-3 | False | False | False |
| 2 | RR-4 | False | False | False |
| 3 | RR-5 | False | False | False |
| 4 | RR-6 | False | False | False |
| 5 | RR-8 | False | False | False |
| 6 | RR-9 | False | False | False |
| 7 | RR-12 | False | False | False |
| 8 | RR-17 | False | False | False |
| 9 | RR-19 | False | False | False |


### Section 3: Data Preprocessing
#### Purpose
Create clean, aligned, and feature-enriched time series at one-minute resolution for telemetry and radiation streams across missions.
#### What is implemented
1. One-minute resampling with conservative interpolation of short gaps.
2. Rolling mean and standard deviation features at 5, 30, and 180 minutes.
3. Orbital day or night proxy using CO₂ distributions where light data is absent.
4. Approximate crew awake flags using UTC windowing.
5. Z-score normalization fields for statistical detectors.
6. Export of preprocessed tables for downstream modules.
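
Items 2, 4, and 5 above can be illustrated on a synthetic one-minute series. This is a sketch under stated assumptions: the `value` column and the generated data are made up, and only the window sizes and the UTC 06:00 to 22:00 awake window are taken from the notebook code.

```python
import numpy as np
import pandas as pd

# One synthetic day at 1-minute resolution (values are illustrative only)
idx = pd.date_range("2024-01-01", periods=1440, freq="1min", tz="UTC")
feats = pd.DataFrame({"value": np.arange(1440, dtype=float)}, index=idx)

# Rolling mean and standard deviation at 5, 30, and 180 minutes
for w in (5, 30, 180):
    roll = feats["value"].rolling(w, min_periods=1)
    feats[f"value_mean_{w}min"] = roll.mean()
    feats[f"value_std_{w}min"] = roll.std()

# Approximate crew-awake flag from a fixed UTC window (06:00 <= hour < 22:00)
feats["Crew_Awake"] = ((feats.index.hour >= 6) & (feats.index.hour < 22)).astype(int)

# Global z-score of the raw column
feats["value_zscore"] = (feats["value"] - feats["value"].mean()) / feats["value"].std()

print(feats.iloc[[0, 4, 400]].T)
```

With `min_periods=1` the rolling mean is defined from the first row, but the rolling standard deviation of a single observation is still NaN, so downstream detectors should expect one leading NaN per `_std_` column.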
#### New since AWG
- Designed to support cross-mission parity to enable fair forecasting and anomaly comparisons.
- Roadmap to add event markers such as docking and payload operations.
#### To-Dos
- Replace orbital proxy with explicit lighting if available from EDA.
- Persist a unified time index per mission to simplify joins with omics and events.
- Add simple quality flags per minute such as original, interpolated, or missing.
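
The resampling, gap-filling, and per-minute quality flags can be sketched on a toy series with one short and one long gap. The data are synthetic and the flag labels (`orig`, `interp`, `missing`) follow the notebook's `build_quality_flags` helper; everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic 30-minute series with a 3-minute gap and a 10-minute gap
idx = pd.date_range("2024-01-01 00:00", periods=30, freq="1min", tz="UTC")
raw = pd.Series(np.linspace(400.0, 429.0, 30), index=idx, name="CO2_ppm_ISS")
raw.iloc[5:8] = np.nan      # short gap (3 minutes)
raw.iloc[15:25] = np.nan    # long gap (10 minutes)

resampled = raw.resample("1min").mean()
filled = resampled.interpolate(limit=5)   # at most 5 consecutive fills per gap

# Flag each minute as original, interpolated, or still missing
flags = pd.Series("missing", index=resampled.index, dtype="object")
flags[resampled.notna()] = "orig"
flags[filled.notna() & resampled.isna()] = "interp"

print(flags.value_counts())
```

Note that `interpolate(limit=5)` does not skip long gaps: it fills the first five minutes of the 10-minute gap and leaves the rest missing. If long gaps should stay entirely empty, they need to be masked back out after interpolation.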

In [None]:
# ==== Section 3: Data Preprocessing ====

def minute_index(df: pd.DataFrame, time_col: str) -> pd.DatetimeIndex:
    idx = pd.to_datetime(df[time_col], errors="coerce", utc=True).dropna()
    return pd.date_range(idx.min().floor("min"), idx.max().ceil("min"), freq=S.sampling)

def build_quality_flags(original: pd.Series, resampled: pd.Series, interp: pd.Series) -> pd.Series:
    # original points that landed exactly on the minute bins
    flag = pd.Series(index=resampled.index, data="missing", dtype="object")
    # mark where we have a value after resample before interpolate
    flag.loc[resampled.index[resampled.notna()]] = "orig"
    # after interpolation, any previously missing-but-now-present becomes 'interp'
    new_filled = interp.notna() & resampled.isna()
    flag.loc[new_filled.index[new_filled]] = "interp"
    return flag

cleaned_telemetry, cleaned_radiation = {}, {}

for rr in availability_df.query("`Telemetry Available` and `Radiation Available`")["RR_Mission"]:
    tel = rr_data[rr]["telemetry"].copy()
    rad = rr_data[rr]["radiation"].copy()

    # --- TELEMETRY ---
    require_cols(tel, ["Controller_Time_GMT"], f"{rr} telemetry")
    tel = tel.set_index(pd.to_datetime(tel["Controller_Time_GMT"], utc=True)).sort_index()
    # keep numeric cols only
    tel_num = tel.select_dtypes(include=[np.number])

    # resample
    tel_res = tel_num.resample(S.sampling).mean()
    tel_pre = tel_res.copy()
    # fill at most 5 consecutive minutes per gap (longer gaps are only partially filled)
    tel_int = tel_res.interpolate(limit=5)

    # rolling features
    wins = [5, 30, 180]
    for col in tel_num.columns:
        for w in wins:
            tel_int[f"{col}_mean_{w}min"] = tel_int[col].rolling(w, min_periods=1).mean()
            tel_int[f"{col}_std_{w}min"]  = tel_int[col].rolling(w, min_periods=1).std()

    # day/night proxy: above-median CO2 is labeled as orbital day
    if "CO2_ppm_ISS" in tel_int.columns:
        thr = tel_int["CO2_ppm_ISS"].median()
        tel_int["Orbital_Day"] = (tel_int["CO2_ppm_ISS"] > thr).astype(int)

    # crew awake heuristic (UTC 06–22)
    tel_int["Hour"] = tel_int.index.hour
    tel_int["Crew_Awake"] = ((tel_int["Hour"] >= 6) & (tel_int["Hour"] < 22)).astype(int)
    tel_int.drop(columns="Hour", inplace=True)

    # z-scores for native telemetry columns
    for col in tel_num.columns:
        std = tel_int[col].std()
        if pd.notna(std) and std > 0:
            tel_int[f"{col}_zscore"] = (tel_int[col] - tel_int[col].mean()) / std

    # quality flags per column (orig/interp/missing)
    for col in tel_num.columns:
        flags = build_quality_flags(
            original=tel_num[col],
            resampled=tel_pre[col],
            interp=tel_int[col]
        )
        tel_int[f"{col}_qflag"] = flags.values

    cleaned_telemetry[rr] = tel_int
    tel_out = S.preprocessed_dir / f"{rr}_cleaned_telemetry.csv"
    tel_int.to_csv(tel_out)
    print(f"[✓] Telemetry preprocessed → {tel_out}")

    # --- RADIATION ---
    require_cols(rad, ["Date"], f"{rr} radiation")
    rad = rad.set_index(pd.to_datetime(rad["Date"], utc=True)).sort_index()
    rad_num = rad.select_dtypes(include=[np.number])

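    # Radiation is usually daily/periodic: keep the sparse minute grid for the quality
    # flags, then forward-fill it (filled minutes are flagged 'interp' below).
    rad_pre = rad_num.resample(S.sampling).last()
    rad_res = rad_pre.ffill()
    rad_int = rad_res.interpolate(limit=5)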

    for col in rad_num.columns:
        for w in wins:
            rad_int[f"{col}_mean_{w}min"] = rad_int[col].rolling(w, min_periods=1).mean()
            rad_int[f"{col}_std_{w}min"]  = rad_int[col].rolling(w, min_periods=1).std()

    # z-scores
    for col in rad_num.columns:
        std = rad_int[col].std()
        if pd.notna(std) and std > 0:
            rad_int[f"{col}_zscore"] = (rad_int[col] - rad_int[col].mean()) / std

    # quality flags
    for col in rad_num.columns:
        flags = build_quality_flags(
            original=rad_num[col],
            resampled=rad_pre[col],
            interp=rad_int[col]
        )
        rad_int[f"{col}_qflag"] = flags.values

    cleaned_radiation[rr] = rad_int
    rad_out = S.preprocessed_dir / f"{rr}_cleaned_radiation.csv"
    rad_int.to_csv(rad_out)
    print(f"[✓] Radiation preprocessed → {rad_out}")

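# Handy aliases for later sections
preprocessed_telemetry = cleaned_telemetry
# Missions with both streams preprocessed; Sections 4 and 5 iterate over this list
complete_missions = [rr for rr in S.missions if rr in cleaned_telemetry and rr in cleaned_radiation]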


In [None]:
print("Telemetry files:", len(list(S.preprocessed_dir.glob("*_cleaned_telemetry.csv"))))
print("Radiation files:", len(list(S.preprocessed_dir.glob("*_cleaned_radiation.csv"))))


### Section 4: Pattern Extraction
#### Purpose
Characterize seasonal and frequency structure to establish baseline rhythms and reveal mission-specific signatures.

#### What is implemented
- STL decomposition on telemetry and radiation with figure exports.
- FFT spectra with safeguards for flat or short signals.
- Diagnostics that log ranges, variance, and sampling integrity.
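
The FFT step can be checked on a synthetic signal with a known period. This sketch uses the same 60-second sample spacing as the notebook's `safe_fft`; the 90-minute sinusoid and the noise level are assumptions chosen only so that the expected peak is known in advance.

```python
import numpy as np
from scipy.fft import fft, fftfreq

sample_spacing_s = 60.0          # 1-minute sampling, as in the notebook
n = 2880                         # two days of minutes
period_min = 90.0                # hypothetical cycle length for the toy signal

rng = np.random.default_rng(0)
t_min = np.arange(n)
y = np.sin(2 * np.pi * t_min / period_min) + 0.1 * rng.normal(size=n)
y = y - y.mean()                 # remove the DC component before the transform

amplitude = 2.0 / n * np.abs(fft(y)[: n // 2])
freq_hz = fftfreq(n, sample_spacing_s)[: n // 2]

peak = int(np.argmax(amplitude[1:])) + 1          # skip the zero-frequency bin
dominant_period_min = 1.0 / freq_hz[peak] / 60.0

print(f"Dominant period: {dominant_period_min:.1f} min")
```

Reporting the peak as a period in minutes rather than a frequency in Hz tends to be easier to compare against orbital and daily cycles.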

#### New since AWG
- Plan to compare STL seasonality with SARIMAX seasonal terms to validate decomposition consistency.
- Prepare cross-mission summary tables of dominant frequencies and seasonal amplitudes.

#### To-Dos
- Add an automated report that ranks variables by seasonal strength per mission.
- Store STL seasonal indices for reuse by forecasting baselines.
- Add Welch spectra as a robustness check for FFT.

In [None]:
# Section 4: Pattern Extraction

# Columns to analyze
radiation_cols = ['GCR_Dose_mGy_d', 'SAA_Dose_mGy_d', 'Total_Dose_mGy_d', 'Accumulated_Dose_mGy_d']
signal_cols = ['Temp_degC_ISS', 'RH_percent_ISS', 'CO2_ppm_ISS']

# Output folders
os.makedirs('pattern_analysis_plots', exist_ok=True)
os.makedirs('pattern_analysis_plots/radiation', exist_ok=True)

# STL settings
stl_period = 1440  # daily pattern
min_required_points = stl_period * 2

# Diagnostic logger
def log_diagnostics(col, rr, ts):
    tqdm.write(f"\n--- Processing {col} for {rr} ---")
    tqdm.write("[Diagnostics]")
    tqdm.write(f"Total data points: {len(ts)}")
    tqdm.write(f"Date range: {ts.index.min()} to {ts.index.max()}")
    tqdm.write(f"Missing values after interpolate: {ts[col].isnull().sum()}")
    tqdm.write("Descriptive statistics:")
    tqdm.write(str(ts[col].describe()))
    tqdm.write(f"Standard deviation: {ts[col].std()}")
    tqdm.write("First few values:")
    tqdm.write(str(ts[col].head()))

# Safe FFT plotting function
def safe_fft(ts_values, col, rr, out_path):
    y = ts_values - np.mean(ts_values)
    if np.std(y) < 1e-6:
        tqdm.write(f"[SKIP FFT] {col} in {rr} is too flat for FFT.")
        return
    N = len(y)
    T = 60.0  # 1-min interval
    yf = fft(y)
    xf = fftfreq(N, T)[:N // 2]
    plt.figure(figsize=(12, 5))
    plt.plot(xf, 2.0 / N * np.abs(yf[0:N // 2]), color='black')
    plt.title(f'{col} Frequency Spectrum ({rr})')
    plt.xlabel('Frequency (Hz)')
    plt.ylabel('Amplitude')
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

# Begin loop
for rr in tqdm(complete_missions, desc='Missions'):
    tqdm.write(f"\n=== Pattern Analysis for {rr} ===")

    # --- TELEMETRY ---
    df_tel = preprocessed_telemetry.get(rr)
    if df_tel is not None:
        for col in tqdm(signal_cols, desc=f'{rr} Telemetry', leave=False):
            if col not in df_tel.columns:
                continue
            try:
                ts = df_tel[[col]].dropna()
                ts = ts.resample('1min').mean().interpolate(limit=5)

                # Drop non-finite
                ts = ts[np.isfinite(ts[col])]
                ts = ts.dropna()
                std_dev = ts[col].std()

                if len(ts) < min_required_points or std_dev < 1e-6:
                    tqdm.write(f"[SKIP] {col} in {rr}: not enough valid data or too flat (std={std_dev:.6f})")
                    continue

                log_diagnostics(col, rr, ts)

                # Raw Plot
                plt.figure(figsize=(12, 4))
                plt.plot(ts.index, ts[col], color='black')
                plt.title(f'{col} Raw Signal Plot ({rr})')
                plt.xlabel('Time')
                plt.ylabel(col)
                plt.tight_layout()
                plt.savefig(f'pattern_analysis_plots/{col}_{rr}_Raw.png')
                plt.close()

                # STL
                try:
                    stl = STL(ts[col], period=stl_period, robust=True)
                    result = stl.fit()
                    fig, axes = plt.subplots(4, 1, figsize=(14, 10), sharex=True)
                    axes[0].plot(ts.index, ts[col], color='black'); axes[0].set_title(f'{col} - Original ({rr})')
                    axes[1].plot(result.trend, color='blue'); axes[1].set_title('Trend')
                    axes[2].plot(result.seasonal, color='green'); axes[2].set_title('Seasonal')
                    axes[3].plot(result.resid, color='red'); axes[3].set_title('Residual')
                    plt.tight_layout()
                    plt.savefig(f'pattern_analysis_plots/{col}_{rr}_STL_decomposition.png')
                    plt.close()
                except Exception as e:
                    tqdm.write(f"[ERROR] STL failed for {col} in {rr}: {e}")
                    continue

                # FFT
                safe_fft(ts[col].values, col, rr, f'pattern_analysis_plots/{col}_{rr}_FFT.png')
                tqdm.write(f"[DONE] {col} for {rr}")
            except Exception as e:
                tqdm.write(f"[ERROR] {col} in {rr}: {e}")

    # --- RADIATION ---
    df_rad = cleaned_radiation.get(rr)
    if df_rad is not None:
        for col in tqdm(radiation_cols, desc=f'{rr} Radiation', leave=False):
            if col not in df_rad.columns:
                continue
            try:
                ts = df_rad[[col]].dropna()
                ts = ts.resample('1min').interpolate(method='linear', limit=5)

                ts = ts[np.isfinite(ts[col])]
                ts = ts.dropna()
                std_dev = ts[col].std()

                if len(ts) < min_required_points or std_dev < 1e-6:
                    tqdm.write(f"[SKIP] {col} in {rr}: not enough valid data or too flat (std={std_dev:.6f})")
                    continue

                log_diagnostics(col, rr, ts)

                # Raw
                plt.figure(figsize=(12, 4))
                plt.plot(ts.index, ts[col], color='black')
                plt.title(f'{col} Raw Signal Plot ({rr})')
                plt.xlabel('Time')
                plt.ylabel(col)
                plt.tight_layout()
                plt.savefig(f'pattern_analysis_plots/radiation/{col}_{rr}_Raw.png')
                plt.close()

                # STL
                try:
                    stl = STL(ts[col], period=stl_period, robust=True)
                    result = stl.fit()
                    fig, axes = plt.subplots(4, 1, figsize=(14, 10), sharex=True)
                    axes[0].plot(ts.index, ts[col], color='black'); axes[0].set_title(f'{col} - Original ({rr})')
                    axes[1].plot(result.trend, color='blue'); axes[1].set_title('Trend')
                    axes[2].plot(result.seasonal, color='green'); axes[2].set_title('Seasonal')
                    axes[3].plot(result.resid, color='red'); axes[3].set_title('Residual')
                    plt.tight_layout()
                    plt.savefig(f'pattern_analysis_plots/radiation/{col}_{rr}_STL_decomposition.png')
                    plt.close()
                except Exception as e:
                    tqdm.write(f"[ERROR] STL failed for {col} in {rr}: {e}")
                    continue

                # FFT
                safe_fft(ts[col].values, col, rr, f'pattern_analysis_plots/radiation/{col}_{rr}_FFT.png')
                tqdm.write(f"[DONE] {col} for {rr}")
            except Exception as e:
                tqdm.write(f"[ERROR] {col} in {rr}: {e}")

### Section 5: Relationship Mapping
#### Purpose
Quantify associations and temporal dependencies across telemetry and radiation variables, and prepare hooks for biological and behavioral linkage.
#### What is implemented
1. Pearson and Spearman correlation matrices with heatmap exports.
2. Granger causality tests at short lags to detect lead-lag structure.
3. Network graphs of strong correlations to visualize system coupling.
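
The reason for computing both coefficients can be seen on a toy pair with a monotonic but non-linear relationship. The data below are synthetic and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# y grows exponentially with x: perfectly monotonic, far from linear
x = np.linspace(0.0, 5.0, 100)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman works on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson is pulled down by the curvature. A large gap between the two matrices is therefore a hint of non-linear coupling.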
#### New since AWG
- Plan to extend analyses to additional tissues such as anterior tibialis and to behavioral or phenotypic signals when available.
- Roadmap for formal interpretability such as SHAP once multivariate models are introduced.
#### To-Dos
- Add partial correlations controlling for time of day and orbital phase.
- Evaluate transfer entropy or directed information as a non-linear alternative to Granger tests.
- Prepare tidy outputs that the dashboard can filter by mission, variable, and method.

In [None]:
# Section 5: Relationship Mapping

from scipy.stats import spearmanr
from statsmodels.tsa.stattools import grangercausalitytests
import networkx as nx  # seaborn (sns) is already imported in Section 1

# Directory to save plots
os.makedirs('relationship_mapping_plots', exist_ok=True)

# Columns to include for correlation and causality tests
telemetry_metrics = ['Temp_degC_ISS', 'RH_percent_ISS', 'CO2_ppm_ISS']
radiation_metrics = ['GCR_Dose_mGy_d', 'SAA_Dose_mGy_d', 'Total_Dose_mGy_d', 'Accumulated_Dose_mGy_d']

for rr in complete_missions:
    print(f"\n--- Analyzing {rr} ---")

    df_tel = preprocessed_telemetry.get(rr)
    df_rad = cleaned_radiation.get(rr)

    if df_tel is None or df_rad is None:
        print(f"[SKIP] Missing data for {rr}")
        continue

    # Merge telemetry and radiation signals (1-minute index); select the metrics first
    # so derived feature columns cannot collide in the join
    merged = df_tel[telemetry_metrics].join(df_rad[radiation_metrics], how='inner').dropna()

    if len(merged) < 2000:
        print(f"[SKIP] Insufficient data points for correlation/causality analysis in {rr}")
        continue

    # Compute Pearson and Spearman correlation matrices
    pearson_corr = merged.corr(method='pearson')
    spearman_corr, _ = spearmanr(merged)

    spearman_corr_df = pd.DataFrame(spearman_corr, index=merged.columns, columns=merged.columns)

    # Save correlation heatmaps
    plt.figure(figsize=(10, 8))
    sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title(f'Pearson Correlation Matrix - {rr}')
    plt.tight_layout()
    plt.savefig(f'relationship_mapping_plots/{rr}_pearson_corr.png')
    plt.close()

    plt.figure(figsize=(10, 8))
    sns.heatmap(spearman_corr_df, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title(f'Spearman Correlation Matrix - {rr}')
    plt.tight_layout()
    plt.savefig(f'relationship_mapping_plots/{rr}_spearman_corr.png')
    plt.close()

    # Granger causality: test telemetry → radiation at lags 1 to 5 minutes (maxlag=5)
    print(f"\n[Granger Causality Tests for {rr}]")
    for rad_col in radiation_metrics:
        for tel_col in telemetry_metrics:
            try:
                data = merged[[rad_col, tel_col]].dropna()
                if data.shape[0] > 200:
                    result = grangercausalitytests(data[[rad_col, tel_col]], maxlag=5, verbose=False)
                    p_values = [round(result[i + 1][0]['ssr_ftest'][1], 4) for i in range(5)]
                    print(f"{tel_col} → {rad_col}: min p-value = {min(p_values)}")
            except Exception as e:
                print(f"Failed Granger test {tel_col} → {rad_col}: {e}")

    # Build and save network graph of strong Pearson correlations
    try:
        G = nx.Graph()
        threshold = 0.6
        for i in range(len(pearson_corr.columns)):
            for j in range(i + 1, len(pearson_corr.columns)):
                var1 = pearson_corr.columns[i]
                var2 = pearson_corr.columns[j]
                weight = pearson_corr.iloc[i, j]
                if abs(weight) > threshold:
                    G.add_edge(var1, var2, weight=round(weight, 2))

        if len(G.edges) > 0:
            plt.figure(figsize=(8, 6))
            pos = nx.spring_layout(G, seed=42)
            nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=2000, font_size=10)
            edge_labels = nx.get_edge_attributes(G, 'weight')
            nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
            plt.title(f'Variable Correlation Network - {rr}')
            plt.tight_layout()
            plt.savefig(f'relationship_mapping_plots/{rr}_network_graph.png')
            plt.close()
    except Exception as e:
        print(f"Failed to create network graph for {rr}: {e}")


### Section 6: Anomaly Detection and Forecasting
#### Purpose
Detect environmental deviations and produce short-term forecasts that are operationally useful for life-support planning.
#### What is implemented
1. Statistical detectors using Z-score and EWMA with exportable plots.
2. LSTM autoencoder reconstruction error for subtle anomalies.
3. Univariate LSTM forecasting with rolling origin validation.
4. Metrics including RMSE and MAE and summary CSV exports.
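
The two statistical detectors in item 1 can be sketched on a synthetic series with one injected spike. The thresholds (|z| > 3, EWMA span of 60, residual beyond 2 standard deviations) match the settings in the code cell below; the data and the spike position are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.DataFrame({"value": rng.normal(size=500)})
ts.loc[250, "value"] += 15.0                     # injected anomaly

# Global z-score detector
ts["zscore"] = (ts["value"] - ts["value"].mean()) / ts["value"].std()
ts["z_anomaly"] = ts["zscore"].abs() > 3

# EWMA residual detector
ts["ewma"] = ts["value"].ewm(span=60).mean()
ts["ewma_resid"] = ts["value"] - ts["ewma"]
ts["ewma_anomaly"] = ts["ewma_resid"].abs() > 2 * ts["ewma_resid"].std()

print("z-score flags:", int(ts["z_anomaly"].sum()))
print("EWMA flags:   ", int(ts["ewma_anomaly"].sum()))
```

Both detectors catch the spike, but the 2-sigma EWMA rule also flags a share of ordinary points, so it is better read as a sensitivity screen than as a confirmed-anomaly list.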

#### New since AWG
- Benchmark Seasonal ARIMA and VAR against LSTM forecasts starting with RR-1 and expanding to other missions.
- Expand the LSTM and anomaly suite to RR-3, RR-6, RR-9, RR-12, and RR-19 using the same preprocessing and evaluation.
- Prepare a path to multivariate forecasting such as multivariate LSTM or attention models where feasible.

#### To-Dos
- Implement SARIMAX grid search for seasonal orders with AIC selection and save diagnostics.
- Add residual diagnostics and Ljung-Box tests for each baseline model.
- Introduce conformal prediction intervals or quantile regression to provide uncertainty bands.
- Create a single scoreboard table per mission that ranks models by RMSE and MAE with ties broken by simplicity.
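
Rolling-origin validation and the RMSE and MAE metrics can be sketched with a persistence baseline, which is a useful floor for any scoreboard. The helper `rolling_origin_splits` and its parameters are hypothetical names for illustration, and the linear toy series is chosen so the errors can be verified by hand.

```python
import numpy as np

def rolling_origin_splits(n, initial, horizon, step):
    """Yield (train_end, test_end) pairs for expanding-window evaluation."""
    origin = initial
    while origin + horizon <= n:
        yield origin, origin + horizon
        origin += step

y = np.arange(100, dtype=float)                  # toy series: y[t] = t
horizon = 5

fold_errors = []
for train_end, test_end in rolling_origin_splits(len(y), initial=50, horizon=horizon, step=5):
    train, test = y[:train_end], y[train_end:test_end]
    forecast = np.full(horizon, train[-1])       # persistence: repeat the last value
    fold_errors.append(test - forecast)

errors = np.concatenate(fold_errors)
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))

print(f"folds={len(fold_errors)}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

Each fold trains only on data before its origin, so the scores reflect how a model would have performed when forecasting forward in time. Any LSTM, SARIMAX, or VAR entry on the scoreboard should beat this baseline to justify its complexity.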

In [None]:
# Section 6: Anomaly Detection and Forecasting (RR-1 only, telemetry + radiation)

import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Optimization: set environment variables before importing TensorFlow.
# They may have no effect if TensorFlow was already imported in this kernel.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Thread limits must be set before TensorFlow executes its first op.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

# Settings
sequence_length = 45
epochs = 2
batch_size = 32
threshold_z = 3
ewma_span = 60
validation_fraction = 0.2

telemetry_cols = ['Temp_degC_ISS', 'RH_percent_ISS', 'CO2_ppm_ISS']
radiation_cols = ['GCR_Dose_mGy_d', 'SAA_Dose_mGy_d', 'Total_Dose_mGy_d', 'Accumulated_Dose_mGy_d']
early_stop = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)

output_root = 'anomaly_forecast_outputs'
os.makedirs(output_root, exist_ok=True)

def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:(i + seq_length)]
        y = data[i + seq_length]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

def run_anomaly_forecast(df, signal_type, rr, col, metrics_summary):
    df_col = df[[col]].dropna().copy()
    df_col.columns = ['value']
    if len(df_col) < 2000 or df_col['value'].std() < 1e-5:
        print(f"[SKIP] {col} in {rr}: insufficient data or low variance")
        return

    df_col = df_col.resample('2min').mean().dropna()
    if len(df_col) < 1000:
        print(f"[SKIP] {col} in {rr}: too short after downsampling")
        return

    print(f"\n--- Processing {col} in {rr} ({len(df_col)} pts) ---")
    col_start = time.time()
    col_dir = os.path.join(output_root, rr, f'{signal_type}_{col}')
    os.makedirs(col_dir, exist_ok=True)

    try:
        ts = df_col.copy()
        ts['zscore'] = (ts['value'] - ts['value'].mean()) / ts['value'].std()
        ts['z_anomaly'] = ts['zscore'].abs() > threshold_z
        ts['ewma'] = ts['value'].ewm(span=ewma_span).mean()
        ts['ewma_resid'] = ts['value'] - ts['ewma']
        ts['ewma_anomaly'] = ts['ewma_resid'].abs() > (2 * ts['ewma_resid'].std())

        # Z-score anomalies
        plt.figure(figsize=(12, 4))
        plt.plot(ts.index, ts['value'], label='Value', color='black')
        plt.scatter(ts[ts['z_anomaly']].index, ts[ts['z_anomaly']]['value'], color='red', label='Z Anomaly')
        plt.title(f'{col} Z-Score Anomalies ({rr})')
        plt.legend(); plt.tight_layout()
        plt.savefig(os.path.join(col_dir, 'zscore_anomalies.png'))
        plt.close()

        # EWMA anomalies
        plt.figure(figsize=(12, 4))
        plt.plot(ts.index, ts['value'], color='black', label='Value')
        plt.plot(ts.index, ts['ewma'], color='blue', label='EWMA')
        plt.scatter(ts[ts['ewma_anomaly']].index, ts[ts['ewma_anomaly']]['value'], color='orange', label='EWMA Anomaly')
        plt.title(f'{col} EWMA Anomalies ({rr})')
        plt.legend(); plt.tight_layout()
        plt.savefig(os.path.join(col_dir, 'ewma_anomalies.png'))
        plt.close()

        # LSTM Autoencoder
        scaler = MinMaxScaler()
        scaled = scaler.fit_transform(ts[['value']].values).astype(np.float32)
        X_seq, _ = create_sequences(scaled, sequence_length)
        if len(X_seq) < 100:
            print(f"[SKIP] {col} in {rr}: insufficient sequences for LSTM")
            return

        split = int(len(X_seq) * (1 - validation_fraction))
        X_train = X_seq[:split]
        X_test = X_seq[split:]

        input_dim = X_train.shape[2]
        timesteps = X_train.shape[1]

        print("  → Training Autoencoder...")
        ae_start = time.time()
        inputs = Input(shape=(timesteps, input_dim))
        encoded = LSTM(16, activation='relu', return_sequences=False)(inputs)
        decoded = RepeatVector(timesteps)(encoded)
        decoded = LSTM(16, activation='relu', return_sequences=True)(decoded)
        outputs = TimeDistributed(Dense(input_dim))(decoded)

        model = Model(inputs, outputs)
        model.compile(optimizer='adam', loss='mse')
        model.fit(X_train, X_train, epochs=epochs, batch_size=batch_size,
                  validation_split=0.1, shuffle=False, callbacks=[early_stop], verbose=1)
        ae_end = time.time()

        X_test_pred = model.predict(X_test, verbose=0)
        mse = np.mean(np.square(X_test - X_test_pred), axis=(1, 2))
        # 95th percentile of held-out errors: about 5% of test windows are flagged by construction
        threshold = np.percentile(mse, 95)

        plt.figure(figsize=(12, 4))
        plt.plot(mse, color='black', label='Reconstruction Error')
        plt.axhline(y=threshold, color='red', linestyle='--', label='Threshold')
        plt.title(f'{col} LSTM Reconstruction Error ({rr})')
        plt.legend(); plt.tight_layout()
        plt.savefig(os.path.join(col_dir, 'lstm_reconstruction_error.png'))
        plt.close()

        # Forecasting
        print("  → Training Forecaster...")
        X_forecast, y_forecast = create_sequences(scaled, sequence_length)
        split_f = int(len(X_forecast) * (1 - validation_fraction))
        X_train_f = X_forecast[:split_f]
        y_train_f = y_forecast[:split_f]
        X_test_f = X_forecast[split_f:]
        y_test_f = y_forecast[split_f:]

        model_forecast = tf.keras.Sequential([
            LSTM(16, activation='relu', input_shape=(sequence_length, 1)),
            Dense(1)
        ])
        model_forecast.compile(optimizer='adam', loss='mse')
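        # validation_split provides the val_loss that EarlyStopping monitors
        model_forecast.fit(X_train_f, y_train_f, epochs=epochs, batch_size=batch_size,
                           validation_split=0.1, shuffle=False, callbacks=[early_stop], verbose=1)

        y_pred = model_forecast.predict(X_test_f, verbose=0)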
        y_test_inv = scaler.inverse_transform(y_test_f.reshape(-1, 1))
        y_pred_inv = scaler.inverse_transform(y_pred)

        rmse = np.sqrt(mean_squared_error(y_test_inv, y_pred_inv))
        mae = mean_absolute_error(y_test_inv, y_pred_inv)

        # Forecast plot
        plt.figure(figsize=(12, 4))
        plt.plot(y_test_inv, label='Actual', color='black')
        plt.plot(y_pred_inv, label='Forecast', color='green')
        plt.title(f'{col} LSTM Forecast ({rr})\nRMSE: {rmse:.2f}, MAE: {mae:.2f}')
        plt.legend(); plt.tight_layout()
        plt.savefig(os.path.join(col_dir, 'lstm_forecast.png'))
        plt.close()

        # Append metrics
        metrics_summary.append({
            'RR': rr,
            'Type': signal_type,
            'Variable': col,
            'RMSE': rmse,
            'MAE': mae,
            'AE_Threshold': threshold,
            'AE_MeanMSE': np.mean(mse),
            'AE_Time': ae_end - ae_start
        })

        print(f"[DONE] {col} in {rr} → RMSE: {rmse:.2f}, MAE: {mae:.2f}")

    except Exception as e:
        print(f"[ERROR] Failed for {col} in {rr}: {e}")

    print(f"  → Total time for {col}: {time.time() - col_start:.1f} sec")

# Run for RR-1
rr = 'RR-1'
metrics_summary = []

# Telemetry
df_tel = preprocessed_telemetry.get(rr)
if df_tel is not None:
    for col in telemetry_cols:
        if col in df_tel.columns:
            run_anomaly_forecast(df_tel, 'telemetry', rr, col, metrics_summary)

# Radiation
df_rad = cleaned_radiation.get(rr)
if df_rad is not None:
    for col in radiation_cols:
        if col in df_rad.columns:
            run_anomaly_forecast(df_rad, 'radiation', rr, col, metrics_summary)

# Save summary
summary_df = pd.DataFrame(metrics_summary)
summary_df.to_csv(os.path.join(output_root, f'{rr}_forecast_summary.csv'), index=False)
print(f"\n[✓] Saved forecast summary for {rr} ({len(metrics_summary)} variables)")


### Section 7: Evaluation and Insights
#### Purpose
Aggregate results across missions, identify the most volatile variables, and link environmental anomalies to biological findings where timing allows.

#### What is implemented
1. Consolidation of forecast metrics into per-mission and all-mission summaries.
2. Ranking of top forecastable variables and top anomaly intensities.
3. RR-1 omics linkage including significant differential expression tables and PCA summaries.
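
Raw RMSE is expressed in each variable's own units, so ranking CO₂ (ppm) against temperature (°C) by RMSE alone favors signals with a small numeric range. A scale-free variant divides RMSE by the signal's standard deviation before ranking. The sketch below assumes a hypothetical `Std` column has been added to the summary table; the numbers are placeholders, not project results.

```python
import pandas as pd

# Placeholder rows shaped like the forecast summary; values are illustrative only.
summary = pd.DataFrame({
    "Mission": ["RR-1"] * 3,
    "Variable": ["Temp_degC_ISS", "RH_percent_ISS", "CO2_ppm_ISS"],
    "RMSE": [0.20, 1.50, 40.0],
    "Std": [0.50, 5.00, 400.0],  # hypothetical per-variable standard deviation
})

# Normalized RMSE: error relative to the signal's own variability.
summary["NRMSE"] = summary["RMSE"] / summary["Std"]
ranked = (summary.sort_values(["Mission", "NRMSE"])
                 .groupby("Mission")
                 .head(3)
                 .reset_index(drop=True))
print(ranked[["Mission", "Variable", "RMSE", "NRMSE"]])
```

In this toy table CO₂ has the largest RMSE but the smallest normalized error, so the two rankings point in opposite directions.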

#### New since AWG
- Compare missions that exhibit environmental anomalies against missions with relatively stable conditions to test biological contrasts suggested by the group.
- Extend linkage to additional GeneLab datasets beyond GLDS-104.

#### To-Dos
- Add an anomaly calendar per mission with merged telemetry and radiation spike timestamps.
- If event context is available, annotate anomaly calendars with docking or EVA intervals and compute overlap rates.
- For omics, add pathway enrichment summaries and export top enriched terms per anomaly window.
- Produce a concise narrative per mission that highlights the strongest patterns and any outliers.
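
The anomaly-calendar and overlap items can be prototyped by grouping spike timestamps into contiguous windows and then checking each window against event intervals. In the sketch below, the 10-minute merge gap, the helper names, the timestamps, and the event interval are invented for illustration; no real event log is assumed.

```python
import pandas as pd

def build_anomaly_windows(timestamps, max_gap="10min"):
    """Merge timestamps closer than max_gap into (start, end, n_points) windows."""
    ts = pd.Series(pd.to_datetime(sorted(timestamps)))
    # A new window starts whenever the gap to the previous spike exceeds max_gap.
    window_id = (ts.diff() > pd.Timedelta(max_gap)).cumsum()
    grouped = ts.groupby(window_id)
    return pd.DataFrame({"start": grouped.min(), "end": grouped.max(),
                         "n_points": grouped.size()}).reset_index(drop=True)

def overlap_rate(windows, events):
    """Fraction of anomaly windows that intersect at least one event interval."""
    hits = [
        any((w.start <= e_end) and (w.end >= e_start) for e_start, e_end in events)
        for w in windows.itertuples()
    ]
    return sum(hits) / len(hits) if hits else 0.0

# Synthetic spike timestamps (not real telemetry).
spikes = ["2020-01-01 00:00", "2020-01-01 00:02", "2020-01-01 00:04",
          "2020-01-01 03:00", "2020-01-02 12:00", "2020-01-02 12:06"]
windows = build_anomaly_windows(spikes)

# Hypothetical event interval (not taken from any real docking or EVA log).
events = [(pd.Timestamp("2020-01-01 02:30"), pd.Timestamp("2020-01-01 03:30"))]
print(windows)
print(f"overlap rate = {overlap_rate(windows, events):.2f}")
```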

In [None]:
# Section 7: Evaluation and Insights

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

summary_dir = 'anomaly_forecast_outputs'
pattern_data_dir = 'pattern_extraction_data'

# Omics (GLDS-104 for RR-1)
deg_path = '/Users/kennethjenkins/Documents/Education/CU Boulder/Data Mining Specialization/Data Mining Projects/Project Checkpoint/Data Sources/Genelab Imics Datasets/RR-1/OSD-104/GLDS-104_rna_seq_differential_expression.csv'
norm_counts_path = '/Users/kennethjenkins/Documents/Education/CU Boulder/Data Mining Specialization/Data Mining Projects/Project Checkpoint/Data Sources/Genelab Imics Datasets/RR-1/OSD-104/GLDS-104_rna_seq_Normalized_Counts.csv'
pca_path = '/Users/kennethjenkins/Documents/Education/CU Boulder/Data Mining Specialization/Data Mining Projects/Project Checkpoint/Data Sources/Genelab Imics Datasets/RR-1/OSD-104/GLDS-104_rna_seq_visualization_PCA_table.csv'


# 7.1 Combine All Forecast Summaries
all_summaries = []
for rr in complete_missions:
    summary_path = os.path.join(summary_dir, f'{rr}_forecast_summary.csv')
    if os.path.exists(summary_path):
        df = pd.read_csv(summary_path)
        df['Mission'] = rr
        all_summaries.append(df)

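# pd.concat raises on an empty list, so fail early with a clearer message
if not all_summaries:
    raise FileNotFoundError(f"No *_forecast_summary.csv files found in {summary_dir}")
combined_summary = pd.concat(all_summaries, ignore_index=True)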
combined_summary.to_csv(os.path.join(summary_dir, 'all_mission_summary.csv'), index=False)
print("[✓] Combined anomaly forecast summaries across missions.")

# 7.2 Top Forecastable and Anomalous Variables
top_forecast = combined_summary.sort_values('RMSE').groupby('Mission').head(3)
top_anomalous = combined_summary.sort_values('AE_MeanMSE', ascending=False).groupby('Mission').head(3)

print("\nTop Forecasted Variables per Mission:")
print(top_forecast[['Mission', 'Variable', 'RMSE', 'MAE']])

print("\nTop Anomalous Variables per Mission:")
print(top_anomalous[['Mission', 'Variable', 'AE_MeanMSE', 'AE_Threshold']])

# 7.3 Visualization: Anomaly Intensity
plt.figure(figsize=(10, 6))
sns.boxplot(x='Type', y='AE_MeanMSE', data=combined_summary)
plt.title("Anomaly Intensity by Variable Type")
plt.tight_layout()
plt.savefig(os.path.join(summary_dir, 'anomaly_boxplot_by_type.png'))
plt.close()

plt.figure(figsize=(12, 6))
sns.barplot(data=combined_summary, x='Mission', y='AE_MeanMSE', hue='Type')
plt.title("Mean Anomaly Score (AE) by Mission and Type")
plt.tight_layout()
plt.savefig(os.path.join(summary_dir, 'anomaly_by_mission.png'))
plt.close()

# 7.4 Composite Anomaly Timeline for RR-1
rr1_env_path = '/Users/kennethjenkins/CU Boulder/Past/pattern_extraction_data/RR-1_telemetry_processed.csv'
rr1_rad_path = '/Users/kennethjenkins/CU Boulder/Past/pattern_extraction_data/RR-1_radiation_processed.csv'

env_df = pd.read_csv(rr1_env_path, index_col=0, parse_dates=True)
rad_df = pd.read_csv(rr1_rad_path, index_col=0, parse_dates=True)

# Combine telemetry and radiation
combined_rr1 = env_df.join(rad_df, how='outer')
combined_rr1 = combined_rr1.sort_index().interpolate(limit=5)

# Define key variables to track
key_vars = [
    'Temp_degC_ISS', 'RH_percent_ISS', 'CO2_ppm_ISS',
    'GCR_Dose_mGy_d', 'SAA_Dose_mGy_d', 'Total_Dose_mGy_d', 'Accumulated_Dose_mGy_d'
]

# Detect anomaly times
anomaly_times = set()
for var in key_vars:
    if var in combined_rr1.columns:
        z = (combined_rr1[var] - combined_rr1[var].mean()) / combined_rr1[var].std()
        spikes = z[np.abs(z) > 3].index
        anomaly_times.update(spikes)

print(f"\n[✓] Found {len(anomaly_times)} unique anomaly timestamps across all key signals.")

# Convert to sorted list
anomaly_times = sorted(anomaly_times)

# 7.5 Load Omics and Evaluate Top Genes
deg_df = pd.read_csv(deg_path)

# Clean column names
deg_df.columns = deg_df.columns.str.strip().str.replace('\u200b', '').str.replace('\xa0', '')

# Rename for easier access
deg_df = deg_df.rename(columns={
    'Adj.p.value_(Space Flight)v(Ground Control)': 'padj',
    'Log2fc_(Space Flight)v(Ground Control)': 'log2FoldChange',
    'SYMBOL': 'gene'
})

# Filter significant genes
deg_filtered = deg_df[deg_df['padj'] < 0.05].sort_values(by='log2FoldChange', ascending=False)
print(f"[✓] Loaded {len(deg_df)} DEGs. {len(deg_filtered)} significant (padj < 0.05).")

# Top 5 upregulated and downregulated genes
top_up = deg_filtered.head(5)
top_down = deg_filtered.tail(5)

print("\nTop 5 Upregulated Genes:")
print(top_up[['gene', 'log2FoldChange', 'padj']])

print("\nTop 5 Downregulated Genes:")
print(top_down[['gene', 'log2FoldChange', 'padj']])


# Load normalized expression data
norm_df = pd.read_csv(norm_counts_path, index_col=0)
top_var_genes = norm_df.var(axis=1).sort_values(ascending=False).head(5)

print("\nTop 5 Variable Genes Across Omics Samples:")
print(top_var_genes)

# PCA metadata
pca_df = pd.read_csv(pca_path)
print(f"[✓] Loaded PCA table with {pca_df.shape[0]} rows and {pca_df.shape[1]} columns.")

# 7.6 Export All Outputs
top_forecast.to_csv(os.path.join(summary_dir, 'top_forecasted_variables.csv'), index=False)
top_anomalous.to_csv(os.path.join(summary_dir, 'top_anomalous_variables.csv'), index=False)
top_up.to_csv(os.path.join(summary_dir, 'rr1_top_upregulated_genes.csv'), index=False)
top_down.to_csv(os.path.join(summary_dir, 'rr1_top_downregulated_genes.csv'), index=False)
top_var_genes.to_csv(os.path.join(summary_dir, 'rr1_top_variable_genes.csv'))
pca_df.to_csv(os.path.join(summary_dir, 'rr1_pca_summary.csv'), index=False)

# Save anomaly timeline
anomaly_df = pd.DataFrame(anomaly_times, columns=['timestamp'])
anomaly_df.to_csv(os.path.join(summary_dir, 'rr1_composite_anomaly_timestamps.csv'), index=False)

print("\nAll multi-signal anomalies and omics insights exported.")


### Section 8: Prescriptive Insights
#### Purpose
Translate analytics into decision aids for habitat operations such as alert thresholds, scheduling guidance, and sampling triggers.
#### What is implemented
1. Preliminary thresholds for CO₂, RH, and temperature grounded in anomaly behavior and forecast accuracy.
2. Radiation-aware scheduling guidance based on SAA variability and cumulative dose monitoring.
3. Triggers for omics or physiological sampling aligned to anomaly criteria.
4. Guidance for embedded forecasting and visualization for real-time use.
#### New since AWG
- Mark these prescriptions as preliminary and subject to revision after cross-mission Seasonal ARIMA benchmarking.
- Add placeholders for LSDA physiological endpoints once access is available.
#### To-Dos
- Calibrate thresholds using ROC curves generated from retrospective anomaly labels once a small labeled set is created.
- Encode recommendations into a JSON policy that the dashboard can read and display.
- Add unit-tested utilities to compute rolling compliance against the policy per mission.
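
One possible shape for the JSON policy and the rolling-compliance utility is sketched below. The limits shown are placeholders chosen only to make the example run, not recommended operational thresholds, and the schema keys (`min`, `max`) and the function name `rolling_compliance` are assumptions.

```python
import json
import pandas as pd

# Placeholder limits for illustration only; not calibrated thresholds.
POLICY_JSON = """
{
  "Temp_degC_ISS":  {"min": 18.0, "max": 27.0},
  "RH_percent_ISS": {"min": 25.0, "max": 75.0}
}
"""

def rolling_compliance(df, policy, window="60min"):
    """Rolling fraction of samples inside [min, max] for each policy variable."""
    result = {}
    for var, limits in policy.items():
        if var not in df.columns:
            continue  # skip variables the mission does not report
        in_range = df[var].between(limits["min"], limits["max"]).astype(float)
        result[var] = in_range.rolling(window).mean()
    return pd.DataFrame(result)

policy = json.loads(POLICY_JSON)
idx = pd.date_range("2020-01-01", periods=6, freq="30min")
df = pd.DataFrame({"Temp_degC_ISS": [22.0, 22.5, 28.0, 29.0, 23.0, 22.0],
                   "RH_percent_ISS": [40.0, 41.0, 39.0, 80.0, 42.0, 40.0]},
                  index=idx)
compliance = rolling_compliance(df, policy)
print(compliance)
```

Keeping the limits in JSON rather than in code would let the dashboard and the compliance utility read the same file, which is the intent of the second to-do item.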

In [None]:
import pandas as pd
import os

# Define base path
base_path = "/Users/kennethjenkins/CU Boulder/preprocessed_data"

# Define telemetry and radiation file paths
telemetry_files = [
    "RR-1_cleaned_telemetry.csv",
    "RR-3_cleaned_telemetry.csv",
    "RR-6_cleaned_telemetry.csv",
    "RR-9_cleaned_telemetry.csv",
    "RR-12_cleaned_telemetry.csv",
    "RR-19_cleaned_telemetry.csv"
]

radiation_files = [
    "RR-1_cleaned_radiation.csv",
    "RR-3_cleaned_radiation.csv",
    "RR-6_cleaned_radiation.csv",
    "RR-9_cleaned_radiation.csv",
    "RR-12_cleaned_radiation.csv",
    "RR-19_cleaned_radiation.csv"
]

# Helper function to generate summary stats for each file
def generate_summary(file_list, prefix):
    summaries = []
    for fname in file_list:
        fpath = os.path.join(base_path, fname)
        df = pd.read_csv(fpath)
        desc = df.describe().T
        desc["mission"] = fname.split("_")[0]
        desc["variable"] = desc.index
        summaries.append(desc.reset_index(drop=True))
    
    full_summary = pd.concat(summaries, ignore_index=True)
    return full_summary[["mission", "variable", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]]

# Generate summaries
telemetry_summary = generate_summary(telemetry_files, "telemetry")
radiation_summary = generate_summary(radiation_files, "radiation")

# Save to disk (optional)
telemetry_summary.to_csv(os.path.join(base_path, "telemetry_descriptive_summary.csv"), index=False)
radiation_summary.to_csv(os.path.join(base_path, "radiation_descriptive_summary.csv"), index=False)

# Display the results in Jupyter
from IPython.display import display

print("Telemetry Descriptive Statistics Summary:")
display(telemetry_summary)

print("\nRadiation Descriptive Statistics Summary:")
display(radiation_summary)


### Section 10: Streamlit Dashboard

In [None]:
# app.py - Streamlit Dashboard for Data Mining for Space Habitats

import streamlit as st
import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

st.set_page_config(page_title="Space Habitat Dashboard", layout="wide")
st.title("🛰️ Space Habitat Anomaly & Forecast Dashboard")
st.markdown("Explore ISS telemetry anomalies, LSTM forecasts, and omics responses (RR-1).")

data_dir = "anomaly_forecast_outputs"

missions = sorted([d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))])
mission = st.selectbox("Select Mission", missions)

summary_path = os.path.join(data_dir, f"{mission}_forecast_summary.csv")
if os.path.exists(summary_path):
    summary_df = pd.read_csv(summary_path)
    variables = summary_df["Variable"].tolist()
    var = st.selectbox("Select Variable", variables)

    col_path = os.path.join(data_dir, mission, var)

    st.subheader(f" {var} Forecast for {mission}")
    fc_path = os.path.join(col_path, "lstm_forecast.png")
    if os.path.exists(fc_path):
        st.image(fc_path, caption="LSTM Forecast", use_column_width=True)

    st.subheader("LSTM Reconstruction Error")
    ae_path = os.path.join(col_path, "lstm_reconstruction_error.png")
    if os.path.exists(ae_path):
        st.image(ae_path, caption="Reconstruction Error", use_column_width=True)

    st.subheader("Anomaly Plots")
    c1, c2 = st.columns(2)
    with c1:
        z_path = os.path.join(col_path, "zscore_anomalies.png")
        if os.path.exists(z_path): st.image(z_path, caption="Z-Score Anomalies", use_column_width=True)
    with c2:
        ewma_path = os.path.join(col_path, "ewma_anomalies.png")
        if os.path.exists(ewma_path): st.image(ewma_path, caption="EWMA Anomalies", use_column_width=True)

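    st.subheader("Metrics Summary")
    metrics_path = os.path.join(col_path, "metrics.txt")
    if os.path.exists(metrics_path):
        with open(metrics_path) as f:
            st.text(f.read())
    else:
        # Fall back to this variable's row in the forecast summary CSV
        st.dataframe(summary_df[summary_df["Variable"] == var].round(4))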

st.divider()

if mission == "RR-1":
    st.header("RR-1 Omics Summary")

    up_path = os.path.join(data_dir, "rr1_top_upregulated_genes.csv")
    down_path = os.path.join(data_dir, "rr1_top_downregulated_genes.csv")

    if os.path.exists(up_path):
        st.subheader("Top Upregulated Genes")
        st.dataframe(pd.read_csv(up_path).round(4))

    if os.path.exists(down_path):
        st.subheader("Top Downregulated Genes")
        st.dataframe(pd.read_csv(down_path).round(4))

    pca_path = os.path.join(data_dir, "rr1_pca_summary.csv")
    if os.path.exists(pca_path):
        st.subheader("Omics PCA Table")
        st.dataframe(pd.read_csv(pca_path).round(4))


### Section 11: Evaluation and Insights

#### 11.1 Forecasting and Anomaly Detection Performance

**LSTM Forecasting Performance**

**Key Observations:**

**Anomaly Detection Effectiveness**

#### 11.2 Biological Linkage via Omics Data (RR-1)

**Findings:**

#### 11.3 Pattern and Relationship Insights

**STL and FFT Observations:**

**Correlation and Causality Analysis:**

#### 11.4 Dashboard Usability

#### 11.5 Key Findings and Future Applications

**Future Work Recommendations:**

#### 11.6 Descriptive Statistical Summary

**Telemetry Variables (RR-1)**

**Radiation Variables (RR-1)**


#### 11.7 Prescriptive Statistical Insights

**Environmental Monitoring Recommendations**

**Radiation-Aware Activity Scheduling**

**Sampling and Health Monitoring Triggers**

**Onboard Predictive Infrastructure**

**Habitat Design and Sensor Deployment**

**Recommendations for Artemis and Future Missions**

---

### Section 12: Conclusion and Lessons Learned

#### Summary of Insights

#### Limitations

#### Project Challenges and Solutions

#### Recommendations and Next Steps


# End of code