Notebooks contain extraneous code that is not essential to the modeling pipeline, such as visualization and figure creation. These steps are briefly acknowledged in the summary but skipped in the more detailed descriptions of each modeling step in Sections 1-3 below.
- Step 1: data processing and setup
  - `00-merge_transcripts_and_codes.ipynb`: create transcript segmentations; convert annotations to data frames (each row is an utterance and columns correspond to the presence of specific outcomes)
  - `01-load_data.ipynb`: look at the distribution of passage lengths, count positive labels per subject, create test (holdout) and training (development) sets, visualize the embedding distribution, create 4 folds
  - `01A-embeds.py`: embeddings for BERT, RoBERTa (unused), and MentalBERT
  - `01B-embeds_long.py`: embeddings for MentalLongformer
- Step 2: modeling
  - `02-lr.ipynb`: perform logistic regression with a range of C values and 4-fold CV
    - Calls `baseline.py`, `baseline_parts.py`, `perm_trn.py`, and `perm_trn_parts.py`
  - `03-roc_auc.ipynb`: calculate ROC AUC from the results of `02-lr.ipynb`
    - Calls `roc_auc.py`
  - `04-tuning.ipynb`: for each model, segmentation, and training scheme (baseline or permuted), find the best C value using inner 3-fold CV and collect the predictions of the selected models (helpful for generating graphs later)
- Step 3: post hocs and visualization
  - `05-results.ipynb`: visualize results, create graphs, perform post hoc tests comparing models and segmentations
Dependencies are specified in `requirements.txt` and `environment.yml`.
This step is performed in `00-merge_transcripts_and_codes.ipynb`.
Transcripts were provided as a mixture of DOC and DOCX files; .doc files were converted to .docx with Microsoft Word and .docx files were converted to text with pandoc 3.1.11.1.
Annotations were provided as separate XLSX spreadsheets, one for each specific outcome. Highlighted excerpts were presented alongside line numbers, although line numbers often did not correspond to utterance breaks in the DOCX transcripts.
Annotations were matched to transcript utterances using two passes (sketched after this list):
- Fuzzy string matching (`thefuzz`)
- Best Levenshtein match
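A minimal sketch of this two-pass matching, assuming `thefuzz` and the `Levenshtein` package; the function name, scorer, and 90-point cutoff are illustrative assumptions, not taken from the notebook:

```python
import Levenshtein
from thefuzz import fuzz, process

def match_excerpt(excerpt, utterances, cutoff=90):
    # Pass 1: accept a high-scoring fuzzy match if one exists.
    hit = process.extractOne(
        excerpt, utterances, scorer=fuzz.token_set_ratio, score_cutoff=cutoff
    )
    if hit is not None:
        return hit[0]
    # Pass 2: fall back to the utterance with the smallest edit distance.
    return min(utterances, key=lambda u: Levenshtein.distance(excerpt, u))
```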
The different segmentations (Original, Monologue, and Turns) were produced as follows:
- Original: removal of header information and other noninformative lines.
- Monologue: starting with Original, remove interviewer utterances and utterances shorter than 13 characters.
- Turns: concatenate blocks of text between interviewer utterances (e.g., ["I: hello", "P: hi"] becomes ["I: hello P: hi"]); see the sketch below.
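A sketch of the Turns segmentation, under the assumption that interviewer utterances are prefixed with "I:"; the function is hypothetical, not the repo's implementation:

```python
def make_turns(utterances):
    """Concatenate each interviewer utterance with the block that follows it."""
    turns, current = [], []
    for utt in utterances:
        # A new interviewer utterance closes the previous turn.
        if utt.startswith("I:") and current:
            turns.append(" ".join(current))
            current = []
        current.append(utt)
    if current:
        turns.append(" ".join(current))
    return turns

# make_turns(["I: hello", "P: hi"]) -> ["I: hello P: hi"]
```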
Performed in `01-load_data.ipynb`. The following applies to all segmentations.
Counts of each specific outcome per subject were generated. A test set was created by manually balancing the presence of outcomes between the test and training sets (process not shown).
Aggregated labels (high-level domains or "Any" label) were generated.
The training set was split using stratified group 4-fold cross-validation. Utterances were grouped by subject so that no subject's data could leak between training and validation folds. Stratification kept roughly even numbers of positive examples in each fold, when possible.
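This matches the behavior of scikit-learn's `StratifiedGroupKFold`; a sketch, where the use of scikit-learn, the variable names, and the random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.random.randn(100, 8)                     # stand-in utterance features
y = np.random.randint(0, 2, size=100)           # stand-in binary labels
subjects = np.random.randint(0, 20, size=100)   # stand-in subject IDs

sgkf = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(sgkf.split(X, y, groups=subjects)):
    # No subject appears in both trn_idx and val_idx.
    pass
```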
This step can be performed prior to 1.2. Embeddings were generated in `01A-embeds.py` and `01B-embeds_long.py`.
Non-Llama embeddings were generated with `AutoModel`, provided by Hugging Face. MentalLongformer is not `AutoModel` compatible, so its embeddings were generated separately in `01B-embeds_long.py`.
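A minimal sketch of `AutoModel`-based embedding extraction; the checkpoint name and the mean pooling over tokens are assumptions, not taken from `01A-embeds.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "mental/mental-bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled embeddings
```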
This step was performed on the NIH Biowulf computing cluster. Jobs were parallelized using the `swarm` command.
The notebook `02-lr.ipynb` generates swarm files that call `baseline.py`, `baseline_parts.py`, `perm_trn.py`, and `perm_trn_parts.py` to perform logistic regression.
To sanity check results, permutation tests were performed. Here, permutation tests refer to shuffling the labels associated with the predictors before training (see the sketch below). Although the notebooks allude to `perm_trn_all` as a sample permutation regime, only `perm_trn.py` was used for all models, segmentations, etc.
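A minimal sketch of the permutation regime, assuming scikit-learn-style objects and stand-in data; only the labels are shuffled, the predictors are untouched:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_trn = rng.normal(size=(100, 8))      # stand-in embeddings
y_trn = rng.integers(0, 2, size=100)   # stand-in labels

y_perm = rng.permutation(y_trn)        # break the X-y association
clf = LogisticRegression(max_iter=1000).fit(X_trn, y_perm)
```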
Because we use 3-fold inner CV to tune the hyperparameter for 4-fold outer CV, the number of logistic regression fits is the product of the following factors (tallied in the sketch after this list):
- Number of embedding models
- Number of segmentations
- Number of labels
- Number of folds (outer CV)
- Number of folds minus one (inner CV)
- Number of C values to test
- Number of training types (baseline/non-permuted or permuted)
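As a worked tally, using the fold counts from the text and placeholder values for the remaining factors (the placeholders are assumptions, not the repo's actual counts):

```python
n_models = 5           # placeholder: number of embedding models
n_segmentations = 3    # Original, Monologue, Turns
n_labels = 10          # placeholder: number of labels
n_outer = 4            # outer CV folds
n_inner = n_outer - 1  # inner CV folds
n_C = 20               # placeholder: number of C values
n_schemes = 2          # baseline vs. permuted

total_fits = (n_models * n_segmentations * n_labels
              * n_outer * n_inner * n_C * n_schemes)
```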
Embeddings are easier to save and load as `.t` PyTorch tensor files, but need to be converted to `numpy` arrays before modeling. Because reading and converting files takes some time, logistic regression fits were not parallelized over every possible factor. In execution, parallelization occurred over embedding model, segmentation, and feature. For embedding models with a larger hidden dimension (Llama), C values were partitioned into four sub-lists and the results were concatenated afterward.
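A sketch of the load-convert-fit cycle; the file name, tensor shape, and C value are placeholders:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

torch.save(torch.randn(32, 768), "embeds.t")  # stand-in for a saved tensor
X = torch.load("embeds.t").numpy()            # torch.Tensor -> numpy array
y = np.random.randint(0, 2, size=32)          # stand-in labels
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
```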
ROC AUC was calculated for every inner CV fit in `03-roc_auc.ipynb`. The C value corresponding to the highest average ROC AUC was selected, and the corresponding outer CV model was pulled in `04-tuning.ipynb`.
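The selection rule amounts to the following; the AUC values and dictionary layout are fabricated purely for illustration:

```python
import numpy as np

# Map each C value to its inner-CV ROC AUCs, then keep the best average.
inner_aucs = {0.01: [0.71, 0.68, 0.70],
              0.10: [0.74, 0.73, 0.75],
              1.00: [0.72, 0.70, 0.71]}
best_C = max(inner_aucs, key=lambda c: np.mean(inner_aucs[c]))  # -> 0.1
```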
These tuned ROC AUC values were saved. Predictions were also saved to generate ROC graphs later.
Post hoc comparisons (Friedman, Nemenyi, Bayesian comparison) were performed across models and across segmentations in `05-results.ipynb`. Comparisons were performed on fold-averaged ROC AUC values.
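A sketch of the Friedman + Nemenyi step (the Bayesian comparison is omitted here), assuming `scipy` and the `scikit-posthocs` package; the AUC matrix is fabricated for illustration, with one row per block (e.g., label) and one column per model:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

aucs = np.array([[0.70, 0.74, 0.72],
                 [0.65, 0.69, 0.66],
                 [0.81, 0.85, 0.80],
                 [0.58, 0.63, 0.60]])

stat, p = friedmanchisquare(*aucs.T)         # omnibus test across models
nemenyi = sp.posthoc_nemenyi_friedman(aucs)  # pairwise follow-up (p-value matrix)
```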