Comparing phonological feature systems using EEG data

This repository contains the materials for the scientific paper

McCloy, DR & Lee, AKC. Investigating the fit between phonological feature systems and brain responses to speech using EEG. Language, Cognition and Neuroscience.

Raw EEG data related to this project is hosted on the Open Science Framework:

To reproduce the analyses here: clone this repository; download the .tar.gz files from the eeg-data-raw folder on the OSF project page; put them in the folder of the same name within this repository; and run the unpacking script to unpack the data. Then follow the instructions below.


The original development repository for this project is hosted elsewhere. As with many projects, it grew organically, and consequently that repo is not especially orderly. Refactoring and cleanup are underway, and the permanent home of the code will be here; this message will be updated when that process is complete.

Pipeline overview

The steps to reproduce the analysis and manuscript are documented below. Scripts that must be run interactively (e.g., for manual marking of consonant-vowel transitions in the stimuli, or for annotating bad sections of EEG data) include _interactive in the filename.

General setup

  • Paired scripts compress and uncompress the raw EEG data and metadata.
  • downloads a particular historical version of expyfun into a local directory, to ensure that the stimulus generation and experiment runner scripts are always run with the version of expyfun against which they were developed and tested.
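Pinning a dependency to a historical version amounts to cloning it and checking out a fixed commit. The sketch below is illustrative only: the URL points at expyfun's public repository, the commit hash passed in is whatever the setup script actually records, and the `pin_commands`/`install_pinned` helpers are hypothetical, not part of this repo:

```python
import subprocess

# expyfun's public repository; the commit to check out should be the one
# the project's setup script actually records.
EXPYFUN_URL = "https://github.com/LABSN/expyfun.git"

def pin_commands(commit, dest="expyfun"):
    """Build the git commands that clone expyfun and check out one commit."""
    return [
        ["git", "clone", EXPYFUN_URL, dest],
        ["git", "-C", dest, "checkout", commit],
    ]

def install_pinned(commit, dest="expyfun"):
    """Run the clone + checkout (hypothetical helper for illustration)."""
    for cmd in pin_commands(commit, dest):
        subprocess.run(cmd, check=True)
```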

Stimulus generation

NOTE The results of all of the manual/interactive steps of stimulus generation are included in the repository, so this section can be skipped entirely when reproducing the analysis. These scripts are included as a record of the procedure, and as a starting point for future stimulus generation efforts based on different recordings.

  • The first step of stimulus generation involved annotating the raw recordings with .TextGrid files using praat, to mark which syllables should be extracted for use as stimuli. Subsequent steps assume that recordings are in stimuli/recordings, associated TextGrids are in recordings/textgrids, with WAV/TextGrid correspondences determined by parallel subfolders/filenames within each.
  • applies a high-pass filter to the raw recordings (to ameliorate very low frequency noise in the recording), extracts the annotated syllables into individual WAV files, and root-mean-square normalizes the extracted syllables.
  • 003_mark_syllable_boundaries_interactive.praat opens each syllable in praat for annotation of the consonant-vowel transition point. This should be run from an open instance of the praat GUI (Praat → Open Praat script...; Run → Run). The result is a new set of textgrids in stimuli/stimuli-tg.
  • 004_make_syllable_boundary_table.praat parses the syllable textgrids and writes them to a table (params/syllable-boundary-times.tsv). This can be run through the praat GUI, or via praat --run 004_make_syllable_boundary_table.praat stimuli/stimuli-tg params/cv-boundary-times.tsv. WARNING Before running this step, make sure praat’s text writing setting is set to UTF-8 (menu command Praat → Preferences → Text writing preferences) or else subsequent scripts will not be able to read the CV boundary times table.
  • assembles the syllables into stimulus blocks (with different randomizations for each subject). Each block is written out as a single WAV file whose duration is matched to the video stimulus used in that block. The record of subject/block/video/syllable correspondences is written to params/master-dataframe.tsv. Note that the block-length WAV files are for reference only; when the experiment is actually run, individual syllable WAVs are loaded and played individually.
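The filter-and-normalize step above can be sketched with numpy. Both the one-pole high-pass (a stand-in for whatever filter the actual script uses) and the target RMS value are assumptions for illustration, and the function names are hypothetical:

```python
import numpy as np

def highpass(signal, fs, cutoff=80.0):
    """One-pole high-pass filter (a stand-in for the script's actual filter).

    y[n] = a * (y[n-1] + x[n] - x[n-1]), with a set by the cutoff frequency.
    """
    a = 1.0 / (1.0 + 2.0 * np.pi * cutoff / fs)
    out = np.empty_like(signal)
    prev_x = prev_y = 0.0
    for n, x in enumerate(signal):
        prev_y = a * (prev_y + x - prev_x)
        prev_x = x
        out[n] = prev_y
    return out

def rms_normalize(signal, target_rms=0.01):
    """Scale the signal so its root-mean-square amplitude equals target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)
```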

Data collection

  • runs the experiment. It expects to connect to a TDT external sound processor (and thus must run on a Windows computer), and sends synchronization timestamps to both the TDT and the EEG acquisition device via TTL. The raw EEG data were saved continuously on a separate computer, using “BrainVision Recorder” software.

EEG preprocessing

  • is a Bash wrapper around scripts/, which converts the BrainVision .vhdr, .vmrk, and .eeg files into .fif files, and handles cases of mid-experiment recording failure, where multiple raw recordings were generated and partial blocks had to be repeated.
  • This is another Bash wrapper, around scripts/. The wrapper allows skipping subjects and/or re-opening files that have already been annotated (e.g., if corrections are necessary).
  • is a Bash wrapper around scripts/, which, in addition to writing a new raw file with projectors added, also creates epoch and event objects for each subject’s blinks, and a summary CSV of blink detection across all subjects.
  • is a Bash wrapper around scripts/. This creates three versions of epochs objects for each subject: one with traditional stimulus-onset alignment (just in case), one with epochs aligned at the consonant-vowel transition point (“CV alignment”), and one with CV alignment plus truncation of the early part of each epoch (to exclude the brain’s early response to the stimuli, which is thought to be dominated by acoustic-phonetic representations, so that only the later, hopefully phonological, representations remain). It also generates a summary file reporting how many epochs were retained for each subject (to verify that the epoch rejection criteria are not too severe).
  • wraps scripts/ and scripts/. It generates a summary figure (figures/supplement/subject-summary.pdf) and table (processed-data/blinks-epochs-snr.csv), useful for checking for a systematic relationship between the number of retained epochs and each subject’s blink behavior and data SNR.
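The CV-aligned truncation described above reduces to simple index arithmetic. A sketch, with a hypothetical helper name and an illustrative epoch window (the actual tmin/tmax and truncation values live in the scripts):

```python
import numpy as np

def truncated_epoch_indices(cv_time, tmin, tmax, truncation, sfreq):
    """Sample indices (relative to recording start) for a CV-aligned epoch
    whose first `truncation` seconds have been dropped.

    cv_time:    consonant-vowel transition time in the recording (s)
    tmin, tmax: epoch window relative to the CV transition (s); the values
                used in the tests below are illustrative, not the paper's
    truncation: duration to cut from the start of the epoch (s)
    sfreq:      sampling frequency (Hz)
    """
    start = int(round((cv_time + tmin + truncation) * sfreq))
    stop = int(round((cv_time + tmax) * sfreq))
    return np.arange(start, stop)
```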

Supervised learning

  • creates the necessary output directories and loops over subjects and truncation durations, running scripts/ each time. The output goes into processed-data/{analysis_type}/classifiers/{subject_ID}. Note that scripts/ is set up to run phonological-feature-based classification for each value of the truncation duration defined in scripts/, but will only run OVR and pairwise classification when the truncation duration is 0 (i.e., on the untruncated epochs).
  • aggregates the Equal Error Rate and threshold values from each subject into a single table, in processed-data/{analysis_type}.
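For reference, an Equal Error Rate and its threshold can be computed from classifier scores by sweeping thresholds until the false-positive and false-negative rates meet. This is a generic sketch, not the project's implementation, and it assumes both classes are present in the labels:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Find the threshold where false-positive and false-negative rates cross.

    scores: classifier decision values (higher = more 'positive')
    labels: binary ground truth (1 = positive class); both classes assumed present
    Returns (eer, threshold), sweeping thresholds over the observed scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer, best_thr = np.inf, 1.0, None
    for thr in np.unique(scores):
        pred = scores >= thr
        fpr = np.mean(pred[~labels])   # negatives called positive
        fnr = np.mean(~pred[labels])   # positives called negative
        if abs(fpr - fnr) < best_gap:
            best_gap = abs(fpr - fnr)
            eer = (fpr + fnr) / 2.0
            best_thr = thr
    return eer, best_thr
```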

TODO: rest of analysis pipeline