# Univariate prediction and persistence models

This notebook combines two univariate workflows: prediction of future atypical AN onset, and persistence versus remission in the cleaned persistence cohort. Both sections reuse the shared helpers in `analysis_utils`, so the preprocessing, cohort definitions, and logistic regression pipelines stay synchronized with the multivariate analyses.

## Imports and shared setup

In [None]:
from analysis_utils import (
    load_base_dataset,
    engineer_baseline_features,
    prepare_univariate_prediction_dataset,
    prepare_persistence_dataset,
    run_univariate_logistic_regressions,
)
from IPython.display import display


In [None]:
# Load the raw data, then derive the baseline features and the named feature sets.
raw_df = load_base_dataset()
feature_df, feature_sets = engineer_baseline_features(raw_df)
print(f'Dataset shape: {raw_df.shape}')
print(f'Feature matrix shape: {feature_df[feature_sets["all_features"]].shape}')


## Univariate prediction of future atypical AN onset

Participants with a full AN diagnosis or with atypical AN onset already present at baseline are removed, mirroring the original risk-prediction experiment. The target, `aan_onset_anywave`, flags any mBMI-defined atypical AN onset across waves 1–6.
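
As an illustration of the cohort rule above, the following sketch applies the same exclusions and builds an any-wave target on plain dictionaries. The field names `full_an_dx`, `aan_onset_baseline`, and `aan_onset_w1` through `aan_onset_w6` are hypothetical; only `aan_onset_anywave` comes from this notebook, and `prepare_univariate_prediction_dataset` may implement the rule differently.

```python
WAVES = range(1, 7)  # follow-up waves 1-6


def make_record(full_an=0, baseline=0, onset_waves=()):
    """Build one toy participant record (all field names are hypothetical)."""
    rec = {"full_an_dx": full_an, "aan_onset_baseline": baseline}
    for w in WAVES:
        rec[f"aan_onset_w{w}"] = int(w in onset_waves)
    return rec


def any_wave_onset(record):
    """Return 1 if onset is flagged at any of waves 1-6, else 0."""
    return int(any(record[f"aan_onset_w{w}"] for w in WAVES))


def build_prediction_cohort(records):
    """Drop full-AN and baseline-onset participants, then attach the target."""
    cohort = []
    for rec in records:
        if rec["full_an_dx"] or rec["aan_onset_baseline"]:
            continue  # excluded from the risk-prediction cohort
        cohort.append({**rec, "aan_onset_anywave": any_wave_onset(rec)})
    return cohort


example_records = [
    make_record(full_an=1),               # excluded: full AN diagnosis
    make_record(baseline=1),              # excluded: onset already at baseline
    make_record(onset_waves=(3,)),        # kept, target = 1
    make_record(),                        # kept, target = 0
]
example_cohort = build_prediction_cohort(example_records)
```

Here two of the four toy participants survive the exclusions, one with a positive target and one with a negative target.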

In [None]:
prediction_df = prepare_univariate_prediction_dataset(
    feature_df, feature_sets['all_features']
)
outcome_counts = prediction_df['aan_onset_anywave'].value_counts().to_dict()
print('Univariate prediction cohort size:', len(prediction_df))
print('Outcome counts:', outcome_counts)
prediction_results = run_univariate_logistic_regressions(
    prediction_df, feature_sets['all_features'], target_col='aan_onset_anywave'
)
display(prediction_results)
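
A univariate analysis is usually understood as fitting one logistic model per predictor and reporting that predictor's odds ratio. The sketch below illustrates that idea only; it is not the implementation of `run_univariate_logistic_regressions`, whose internals are not shown in this notebook. To stay dependency-free it hand-rolls a Newton-Raphson fit, and the feature names and data are made up. A statistics library would normally do the fitting and also report standard errors and p-values.

```python
import math


def fit_univariate_logit(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # Fisher information (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system information * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data (hypothetical): two binary predictors and one binary outcome.
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
features = {
    "risk_flag":       [0, 0, 0, 0, 1, 1, 1, 1],
    "protective_flag": [0, 1, 1, 0, 1, 0, 0, 1],
}

# One model per feature; exp(slope) is that feature's odds ratio.
odds_ratios = {
    name: math.exp(fit_univariate_logit(values, outcome)[1])
    for name, values in features.items()
}
```

With these toy values the outcome rate is 3/4 when `risk_flag` is 1 and 1/4 when it is 0, so its odds ratio is 9; `protective_flag` has the reverse pattern and an odds ratio of 1/9. The fit assumes the predictor is not perfectly separated by the outcome, since the maximum-likelihood estimate does not exist in that case.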


## Univariate persistence vs. remission analyses

The persistence dataset retains participants with baseline or mBMI-defined onset who have complete onset data for waves 1–6. The target, `aan_persistence`, labels cases that return to onset after at least one remission wave.
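
One literal reading of that labeling rule can be sketched as a function over a participant's ordered onset flags. The input layout (baseline followed by waves 1–6, coded 1 for onset and 0 for no onset) and the function name are assumptions for illustration; `prepare_persistence_dataset` may define the label differently, for example in how it treats participants who never remit.

```python
def revisits_onset(wave_flags):
    """Return True if onset recurs after at least one remission wave.

    `wave_flags` is an ordered sequence of 0/1 onset indicators
    (assumed here: baseline first, then waves 1-6).
    """
    seen_onset = False
    seen_remission = False
    for flag in wave_flags:
        if flag and seen_remission:
            return True          # onset again after an intervening remission
        if flag:
            seen_onset = True
        elif seen_onset:
            seen_remission = True  # a no-onset wave following an earlier onset
    return False


# Toy trajectories (hypothetical data).
relapsing = [1, 0, 0, 1, 0, 0, 0]   # onset, remission, onset again
remitted = [1, 1, 1, 0, 0, 0, 0]    # onset, then sustained remission
late_relapse = [0, 0, 1, 0, 1, 0, 0]
continuous = [1, 1, 1, 1, 1, 1, 1]  # never remits, so never "revisits"

labels = [revisits_onset(t) for t in (relapsing, remitted, late_relapse, continuous)]
```

Under this reading a participant with uninterrupted onset is not labeled, because there is no remission wave to return from.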

In [None]:
persistence_df = prepare_persistence_dataset(
    feature_df, feature_sets['all_features']
)
print('Persistence cohort size:', len(persistence_df))
print(persistence_df['aan_persistence'].value_counts().rename('count'))
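# Pass the persistence target explicitly rather than relying on the helper's default.
persistence_results = run_univariate_logistic_regressions(
    persistence_df, feature_sets['all_features'], target_col='aan_persistence'
)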
display(persistence_results)
