Notebooks contain extraneous code that is not essential to the modeling pipeline, such as visualization and figure creation. These steps are briefly acknowledged in the summary but skipped in the more detailed descriptions of each modeling step in Sections 1-3 below.
- Step 1: data processing and setup
  - `00-merge_transcripts_and_codes.ipynb`: create transcript segmentations; convert annotations to data frames (each row is an utterance and columns correspond to the presence of specific outcomes)
  - `01-load_data.ipynb`: look at the distribution of passage lengths, count positive labels per subject, create test (holdout) and training (development) sets, visualize the embedding distribution, create 4 folds
  - `01A-embeds.py`: embeddings for BERT, RoBERTa (unused), and MentalBERT
  - `01B-embeds_long.py`: embeddings for MentalLongformer
- Step 2: modeling
  - `02-lr.ipynb`: perform logistic regression with a range of C values and 4-fold CV
    - Calls `baseline.py`, `baseline_parts.py`, `perm_trn.py`, and `perm_trn_parts.py`
  - `03-roc_auc.ipynb`: calculate ROC AUC from the results of `02-lr.ipynb`
    - Calls `roc_auc.py`
  - `04-tuning.ipynb`: for each model, segmentation, and training scheme (baseline or permuted), find the best C value using inner 3-fold CV and collect the predictions of the selected models (helpful for generating graphs later)
- Step 3: post hocs and visualization
  - `05-results.ipynb`: visualize results, create graphs, perform post hoc tests comparing models and segmentations
Dependencies are specified in `requirements.txt` and `environment.yml`.
This step is performed in `00-merge_transcripts_and_codes.ipynb`.
Transcripts were provided as a mixture of DOC and DOCX files; .doc files were converted to .docx with Microsoft Word and .docx files were converted to text with pandoc 3.1.11.1.
Annotations were provided as separate XLSX spreadsheets, one for each specific outcome. Highlighted excerpts were presented alongside line numbers, although line numbers often did not correspond to utterance breaks in the DOCX transcripts.
Annotations were matched to transcript utterances using two passes (sketched after this list):
- Fuzzy string matching (`thefuzz`)
- Best Levenshtein match
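A minimal sketch of this two-pass matching, assuming `thefuzz` and the `Levenshtein` package; the function name, scorer, and 90-point cutoff are illustrative assumptions, not taken from the notebook:

```python
import Levenshtein
from thefuzz import fuzz, process

def match_excerpt(excerpt, utterances, cutoff=90):
    # Pass 1: accept a high-scoring fuzzy match if one exists.
    hit = process.extractOne(
        excerpt, utterances, scorer=fuzz.token_set_ratio, score_cutoff=cutoff
    )
    if hit is not None:
        return hit[0]
    # Pass 2: fall back to the utterance with the smallest edit distance.
    return min(utterances, key=lambda u: Levenshtein.distance(excerpt, u))
```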
The different segmentations (Original, Monologue, and Turns) were produced as follows:
- Original: removal of header information and other noninformative lines.
- Monologue: starting with Original, remove interviewer utterances and utterances shorter than 13 characters.
- Turns: concatenate blocks of text between interviewer utterances (e.g., ["I: hello", "P: hi"] becomes ["I: hello P: hi"]); see the sketch below.
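A sketch of the Turns segmentation, under the assumption that interviewer utterances are prefixed with "I:"; the function is hypothetical, not the repo's implementation:

```python
def make_turns(utterances):
    """Concatenate each interviewer utterance with the block that follows it."""
    turns, current = [], []
    for utt in utterances:
        # A new interviewer utterance closes the previous turn.
        if utt.startswith("I:") and current:
            turns.append(" ".join(current))
            current = []
        current.append(utt)
    if current:
        turns.append(" ".join(current))
    return turns

# make_turns(["I: hello", "P: hi"]) -> ["I: hello P: hi"]
```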
Performed in `01-load_data.ipynb`. The following applies to all segmentations.
Counts of each specific outcome per subject were generated. A test set was created by manually balancing the presence of outcomes between the test and training sets (process not shown).
Aggregated labels (high-level domains or "Any" label) were generated.
The training set was split using stratified group 4-fold cross-validation. Utterances were grouped by subject so that no subject's data could leak between training and validation folds. Stratification kept roughly even numbers of positive examples in each fold, when possible.
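This matches the behavior of scikit-learn's `StratifiedGroupKFold`; a sketch, where the use of scikit-learn, the variable names, and the random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.random.randn(100, 8)                     # stand-in utterance features
y = np.random.randint(0, 2, size=100)           # stand-in binary labels
subjects = np.random.randint(0, 20, size=100)   # stand-in subject IDs

sgkf = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(sgkf.split(X, y, groups=subjects)):
    # No subject appears in both trn_idx and val_idx.
    pass
```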
This step can be performed prior to 1.2. Embeddings were generated in `01A-embeds.py` and `01B-embeds_long.py`.
Non-Llama embeddings were generated with `AutoModel`, provided by Hugging Face. MentalLongformer is not `AutoModel` compatible, so its embeddings were generated separately in `01B-embeds_long.py`.
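A minimal sketch of `AutoModel`-based embedding extraction; the checkpoint name and the mean pooling over tokens are assumptions, not taken from `01A-embeds.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "mental/mental-bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled embeddings
```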
This step was performed on the NIH Biowulf computing cluster. Jobs were parallelized using the `swarm` command.
The notebook `02-lr.ipynb` generates swarm files that call `baseline.py`, `baseline_parts.py`, `perm_trn.py`, and `perm_trn_parts.py` to perform logistic regression.
To sanity check results, permutation tests were performed. Here, permutation tests refer to shuffling the labels associated with the predictors before training (see the sketch below). Although the notebooks allude to `perm_trn_all` as a sample permutation regime, only `perm_trn.py` was used for all models, segmentations, etc.
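A minimal sketch of the permutation regime, assuming scikit-learn-style objects and stand-in data; only the labels are shuffled, the predictors are untouched:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_trn = rng.normal(size=(100, 8))      # stand-in embeddings
y_trn = rng.integers(0, 2, size=100)   # stand-in labels

y_perm = rng.permutation(y_trn)        # break the X-y association
clf = LogisticRegression(max_iter=1000).fit(X_trn, y_perm)
```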
Because we use 3-fold inner CV to tune the hyperparameter for 4-fold outer CV, the number of logistic regression fits is the product of the following factors (tallied in the sketch after this list):
- Number of embedding models
- Number of segmentations
- Number of labels
- Number of folds (outer CV)
- Number of folds minus one (inner CV)
- Number of C values to test
- Number of training types (baseline/non-permuted or permuted)
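As a worked tally, using the fold counts from the text and placeholder values for the remaining factors (the placeholders are assumptions, not the repo's actual counts):

```python
n_models = 5           # placeholder: number of embedding models
n_segmentations = 3    # Original, Monologue, Turns
n_labels = 10          # placeholder: number of labels
n_outer = 4            # outer CV folds
n_inner = n_outer - 1  # inner CV folds
n_C = 20               # placeholder: number of C values
n_schemes = 2          # baseline vs. permuted

total_fits = (n_models * n_segmentations * n_labels
              * n_outer * n_inner * n_C * n_schemes)
```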
Embeddings are easier to save and load as `.t` PyTorch tensor files, but need to be converted to `numpy` arrays before modeling. Because reading and converting files takes some time, logistic regression fits were not parallelized over every possible factor. In execution, parallelization occurred over embedding model, segmentation, and feature. For embedding models with a larger hidden dimension (Llama), C values were partitioned into four sub-lists and the results were concatenated afterward.
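A sketch of the load-convert-fit cycle; the file name, tensor shape, and C value are placeholders:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

torch.save(torch.randn(32, 768), "embeds.t")  # stand-in for a saved tensor
X = torch.load("embeds.t").numpy()            # torch.Tensor -> numpy array
y = np.random.randint(0, 2, size=32)          # stand-in labels
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
```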
ROC AUC was calculated for every inner CV fit in `03-roc_auc.ipynb`. The C value corresponding to the highest average ROC AUC was selected, and the corresponding outer CV model was pulled in `04-tuning.ipynb`.
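The selection rule amounts to the following; the AUC values and dictionary layout are fabricated purely for illustration:

```python
import numpy as np

# Map each C value to its inner-CV ROC AUCs, then keep the best average.
inner_aucs = {0.01: [0.71, 0.68, 0.70],
              0.10: [0.74, 0.73, 0.75],
              1.00: [0.72, 0.70, 0.71]}
best_C = max(inner_aucs, key=lambda c: np.mean(inner_aucs[c]))  # -> 0.1
```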
These tuned ROC AUC values were saved. Predictions were also saved to generate ROC graphs later.
Post hoc comparisons (Friedman, Nemenyi, Bayesian comparison) were performed across models and across segmentations in `05-results.ipynb`. Comparisons were performed on fold-averaged ROC AUC values.
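A sketch of the Friedman + Nemenyi step (the Bayesian comparison is omitted here), assuming `scipy` and the `scikit-posthocs` package; the AUC matrix is fabricated for illustration, with one row per block (e.g., label) and one column per model:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

aucs = np.array([[0.70, 0.74, 0.72],
                 [0.65, 0.69, 0.66],
                 [0.81, 0.85, 0.80],
                 [0.58, 0.63, 0.60]])

stat, p = friedmanchisquare(*aucs.T)         # omnibus test across models
nemenyi = sp.posthoc_nemenyi_friedman(aucs)  # pairwise follow-up (p-value matrix)
```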