psstdata

Python tools for downloading, loading, and using the data for the PSST challenge.

Citing This Work

Robert C. Gale, Mikala Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. 2022. The Post-Stroke Speech Transcription (PSST) Challenge. In Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference, pages 41–55, Marseille, France. European Language Resources Association.

Robert Gale, Mikala Fleegle, Steven Bedrick, and Gerasimos Fergadiotis. 2022. Dataset and tools for the PSST Challenge on Post-Stroke Speech Transcription. March. Project funded by the National Institute on Deafness and Other Communication Disorders grant number R01DC015999-04S1.

Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for Studying Discourse. Aphasiology, 25(11):1286–1307. Supported by NIH-NIDCD R01-DC008524 (2022-2027).

Access to the data

The data is hosted on TalkBank and is password-protected. To get the password and participate in the challenge, please complete this form.

The psstdata tools will prompt for these credentials on the first download. After that, the credentials are stored in ~/.config/psstdata/settings.json, and the data files are kept in ~/psst-data. (Tip: you can change where the data is stored by editing settings.json.)
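To check what is currently configured, the settings file can be inspected directly. The following is a minimal sketch that assumes only that settings.json holds a JSON object; the helper name read_settings is hypothetical and not part of psstdata, and no particular keys are assumed.

```python
import json
from pathlib import Path

# Location given above; the file only exists after the first download.
SETTINGS_PATH = Path.home() / ".config" / "psstdata" / "settings.json"


def read_settings(path=SETTINGS_PATH):
    """Return the parsed settings, or an empty dict if the file is missing.

    Hypothetical helper, not part of psstdata.
    """
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    settings = read_settings()
    # Print only the key names, since the file also holds credentials.
    print(sorted(settings))
```

Printing only the key names avoids echoing the stored password to the terminal.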

Just the data, please!

If you're not using Python, or you'd like to write your own data-loading code, you can download the data set directly from TalkBank. Once you have the password, head over to our resource page at TalkBank.

Usage Notes

Conditions for using the PSST Dataset are described on the task website.

Setup

First, please note that this package was developed for and tested with Python 3.8 (macOS and Linux). If you run into problems on another version, switching to Python 3.8 may serve as a workaround.

With a minimum of Python 3.? installed, psstdata can be installed using pip:

pip install psstdata  # Install python helpers
python -m psstdata    # Download `./psst-data` into your user directory (437MB on disk)

The Python helpers include data-loading tools. For more information, see Basic usage.

Basic usage

>>> import psstdata

>>> data = psstdata.load()

psstdata INFO: Downloading a new data version: 2022-03-02
psstdata INFO: Loaded data version 2022-03-02 from /Users/bobby/psst-data

This will download data to the default directory (~/psst-data/) and return an object of type PSSTData, containing the train, valid, and test splits:

>>> len(data.train)

2298

>>> len(data.valid)

341

>>> len(data.test)

652
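As a quick sanity check on the split sizes above, the following sketch computes each split's share of the total. The counts are copied from the output shown for data version 2022-03-02 and may differ for other versions; with the data loaded, len(data.train) and its siblings could be used instead of the hard-coded numbers.

```python
# Utterance counts copied from the output above (data version 2022-03-02).
split_sizes = {"train": 2298, "valid": 341, "test": 652}

total = sum(split_sizes.values())  # 2298 + 341 + 652 = 3291

for name, size in split_sizes.items():
    share = size / total
    # Name, count, and share of all utterances as a percentage.
    print(f"{name:5s} {size:5d} {share:6.1%}")
```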

Each of those splits is a PSSTUtteranceCollection, a collection of PSSTUtterance objects that can be indexed by position or by utterance ID:

>>> data.train[0]

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)

>>> data.train['ACWT02a-BNT01-house']

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)
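The transcript field in the records above is a space-separated string of ARPAbet-style phoneme symbols. Below is a minimal sketch for turning it into a list of tokens, assuming that whitespace is the only separator (which holds for the examples shown here). The function name transcript_to_phonemes is hypothetical and not part of psstdata.

```python
def transcript_to_phonemes(transcript):
    """Split a space-separated transcript string into a list of symbols."""
    return transcript.split()


phonemes = transcript_to_phonemes("HH AW S")
print(phonemes)       # ['HH', 'AW', 'S']
print(len(phonemes))  # 3
```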

In practice, you will likely need only four fields:

# Print the first four records in the train data

for utterance in data.train[:4]:

    # The key ingredients
    utterance_id = utterance.utterance_id
    transcript = utterance.transcript
    correctness = "Y" if utterance.correctness else "N"
    filename_absolute = utterance.filename_absolute

    print(f"{utterance_id:26s} {transcript:26s} {correctness:11s} {filename_absolute}")

    
""" utterance_id           transcript                 correctness filename_absolute

ACWT02a-BNT01-house        HH AW S                    Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav
ACWT02a-BNT02-comb         K OW M                     Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT02-comb.wav
ACWT02a-BNT03-toothbrush   T UW TH B R AH SH          Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT03-toothbrush.wav
ACWT02a-BNT04-octopus      AA S AH P R OW G P UH S    N           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT04-octopus.wav
"""
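Building on the loop above, the same four fields can be written to a tab-separated manifest for use by other tools. This is a sketch under stated assumptions: write_manifest is a hypothetical helper, and the stand-in Utterance tuple only mimics the four attributes used here so that the example runs without the dataset. The file names in the sample rows are placeholders. In practice you would pass data.train (or another split) instead of the sample list.

```python
import csv
import io
from collections import namedtuple

# Stand-in for PSSTUtterance with only the four attributes used here (assumption).
Utterance = namedtuple(
    "Utterance", "utterance_id transcript correctness filename_absolute"
)


def write_manifest(utterances, out):
    """Write one tab-separated row per utterance to the open text file `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["utterance_id", "transcript", "correctness", "filename_absolute"])
    for u in utterances:
        writer.writerow([
            u.utterance_id,
            u.transcript,
            "Y" if u.correctness else "N",
            u.filename_absolute,
        ])


# Placeholder records; the file names are illustrative, not real paths.
sample = [
    Utterance("ACWT02a-BNT01-house", "HH AW S", True, "ACWT02a-BNT01-house.wav"),
    Utterance("ACWT02a-BNT04-octopus", "AA S AH P R OW G P UH S", False, "ACWT02a-BNT04-octopus.wav"),
]

buffer = io.StringIO()
write_manifest(sample, buffer)
print(buffer.getvalue())
```

Writing to any open text file works the same way, for example `with open("train.tsv", "w", newline="") as f: write_manifest(data.train, f)`.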

Uninstalling

To remove the package, use pip: pip uninstall psstdata

You may also want to delete the data and configs. (Copy and paste rm -rf commands cautiously, of course!)

Data: rm -rf ~/psst-data
Configs: rm -rf ~/.config/psstdata
