Tools for downloading and using the unique data set for the PSST: Post-Stroke Speech Transcription challenge.

PSST-Challenge/psstdata

psstdata

Python tools for downloading, loading, and using the data for the PSST challenge.

Citing This Work

Robert C. Gale, Mikala Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. 2022. The Post-Stroke Speech Transcription (PSST) Challenge. In Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference, pages 41–55, Marseille, France. European Language Resources Association.

Robert Gale, Mikala Fleegle, Steven Bedrick, and Gerasimos Fergadiotis. 2022. Dataset and tools for the PSST Challenge on Post-Stroke Speech Transcription. March. Project funded by the National Institute on Deafness and Other Communication Disorders grant number R01DC015999-04S1. DOI: 10.5281/zenodo.6326002

Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for Studying Discourse. Aphasiology, 25(11):1286–1307. Supported by NIH-NIDCD R01-DC008524 (2022-2027).

Access to the data

The data is hosted on TalkBank, and protected by password. To get the password and participate in the challenge, please complete this form.

The psstdata tools will prompt for these credentials upon the first download. Credentials are thereafter stored in ~/.config/psstdata/settings.json, and the data files are kept in ~/psst-data. (Tip: you can change where the data is stored by editing settings.json.)
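To check what is stored after the first download, a minimal sketch like the following may help. It assumes only the settings path given above; the helper name `read_settings` is hypothetical, and since the key names inside settings.json are not documented here, it lists whatever keys it finds without printing their values.

```python
import json
from pathlib import Path

# Path taken from this README; the file only exists after the first download.
SETTINGS_PATH = Path.home() / ".config" / "psstdata" / "settings.json"


def read_settings(path=SETTINGS_PATH):
    """Return the parsed settings, or None if the file is absent.

    Hypothetical helper for illustration; not part of psstdata.
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


settings = read_settings()
if settings is None:
    print(f"No settings file at {SETTINGS_PATH} yet")
else:
    # Print key names only, because the file also holds credentials.
    print(sorted(settings))
```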

Just the data, please!

If you're not using Python, or you'd like to write your own data-loading code, you can download the data set directly from TalkBank. Once you have the password, head over to our resource page at TalkBank.

Usage Notes

Conditions for using the PSST Dataset are described on the task website.

Setup

First, please note that this package was developed for and tested with Python 3.8 (macOS and Linux), so switching to this version may work around some problems.

With a minimum of Python 3.? installed, psstdata can be installed using pip:

pip install psstdata  # Install python helpers
python -m psstdata    # Download `./psst-data` into your user directory (437MB on disk)

The python helpers include data loader tools. For more information, see Basic Usage.

Contents

Data Packs

The data retrieved by this tool is described in detail in each data pack's README file. A copy of those files is available in this repository for each of the train, valid, and test data packs. (These three files have only trivial differences.)

Additional Resources

This tool also provides some additional resources to get you set up more quickly. These are referenced in the baseline systems, which you are welcome to use as an example or a jumping-off point!

(Key: python reference, json file)

Basic usage

>>> import psstdata

>>> data = psstdata.load()

psstdata INFO: Downloading a new data version: 2022-03-02
psstdata INFO: Loaded data version 2022-03-02 from /Users/bobby/psst-data

This will download data to the default directory (~/psst-data/) and return an object of type PSSTData, containing the train, valid, and test splits:

>>> len(data.train)

2298

>>> len(data.valid)

341

>>> len(data.test)

652

Each of those splits is a PSSTUtteranceCollection, a collection of PSSTUtterance objects that can be indexed by position or by utterance ID:

>>> data.train[0]

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)

>>> data.train['ACWT02a-BNT01-house']

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)

In practice, you'll only need four fields:

# Print the first four records in the train data

for utterance in data.train[:4]:

    # The key ingredients
    utterance_id = utterance.utterance_id
    transcript = utterance.transcript
    correctness = "Y" if utterance.correctness else "N"
    filename_absolute = utterance.filename_absolute

    print(f"{utterance_id:26s} {transcript:26s} {correctness:11s} {filename_absolute}")

    
""" utterance_id           transcript                 correctness filename_absolute

ACWT02a-BNT01-house        HH AW S                    Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav
ACWT02a-BNT02-comb         K OW M                     Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT02-comb.wav
ACWT02a-BNT03-toothbrush   T UW TH B R AH SH          Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT03-toothbrush.wav
ACWT02a-BNT04-octopus      AA S AH P R OW G P UH S    N           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT04-octopus.wav
"""
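As a possible next step, the sketch below splits each transcript into phoneme symbols and computes the share of utterances marked correct. It assumes transcripts are space-separated symbols, as in the sample output above. `Utterance`, `phonemes`, and `fraction_correct` are hypothetical stand-ins so the example runs without the downloaded data; with the real data, the same functions could be applied to items of `data.train`.

```python
from collections import namedtuple

# Stand-in for PSSTUtterance carrying only the fields used here
# (hypothetical, so the sketch runs without the data set).
Utterance = namedtuple("Utterance", ["utterance_id", "transcript", "correctness"])

# Values copied from the sample output above.
sample = [
    Utterance("ACWT02a-BNT01-house", "HH AW S", True),
    Utterance("ACWT02a-BNT02-comb", "K OW M", True),
    Utterance("ACWT02a-BNT03-toothbrush", "T UW TH B R AH SH", True),
    Utterance("ACWT02a-BNT04-octopus", "AA S AH P R OW G P UH S", False),
]


def phonemes(utterance):
    """Split a transcript into symbols, assuming space-separated phonemes."""
    return utterance.transcript.split()


def fraction_correct(utterances):
    """Return the share of utterances whose correctness flag is True."""
    utterances = list(utterances)
    if not utterances:
        return 0.0
    return sum(1 for u in utterances if u.correctness) / len(utterances)


for u in sample:
    print(f"{u.utterance_id:26s} {len(phonemes(u)):2d} phonemes")
print(f"Fraction correct: {fraction_correct(sample):.2f}")
```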

Uninstalling

Removing the package can be accomplished using pip: pip uninstall psstdata

You may also want to delete the data and configs. (Copy and paste rm -rf commands cautiously, of course!)

  • Data: rm -rf ~/psst-data
  • Configs: rm -rf ~/.config/psstdata
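If you would rather preview the deletion first, a small Python sketch along these lines may be safer than pasting the commands directly. It assumes the default locations listed above (adjust them if you changed the data directory in settings.json); `clean_up` is a hypothetical helper, not part of psstdata, and it defaults to a dry run.

```python
import shutil
from pathlib import Path

# Default locations named in this README.
DEFAULT_TARGETS = [
    Path.home() / "psst-data",
    Path.home() / ".config" / "psstdata",
]


def clean_up(targets=DEFAULT_TARGETS, dry_run=True):
    """Remove each target directory that exists and return the paths found.

    With dry_run=True (the default) nothing is deleted; the paths are
    only reported.
    """
    found = [Path(t) for t in targets if Path(t).is_dir()]
    for path in found:
        if dry_run:
            print(f"Would remove {path}")
        else:
            shutil.rmtree(path)
            print(f"Removed {path}")
    return found


clean_up()  # dry run: reports what would be removed, deletes nothing
```

Calling `clean_up(dry_run=False)` performs the actual removal.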
