Skip to content

Latest commit

 

History

History
93 lines (62 loc) · 3.34 KB

tools.rst

File metadata and controls

93 lines (62 loc) · 3.34 KB

Basic tools

Data validation

This is typically done (repeatedly!) in the process of importing your data into our format for the first time, but you should also do this whenever you make a change to the dataset.

Looks for errors and inconsistency in the metadata, or for missing audios. The validation will pass if formatting instructions are met (see format).

child-project validate /path/to/dataset --help

Example:

# validate the metadata and raw recordings
child-project validate /path/to/dataset

# validate the metadata only
child-project validate /path/to/dataset --ignore-recordings 

# validate the metadata and the recordings of the 'standard' profile
# (in recordings/converted/standard)
child-project validate /path/to/dataset --profile standard 

# validate the metadata and all annotations within /path/to/dataset/annotations
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/*

# validate the metadata and annotations from the 'textgrid' set
child-project validate /path/to/dataset --ignore-recordings --annotations /path/to/dataset/annotations/textgrid/*

Dataset overview

An overview of the contents of a dataset can be obtained with the child-project overview command.

child-project overview --help

Example:

$ child-project overview .

recordings:
lena: 288.00 hours, 0/18 files locally available
olympus: 49.57 hours, 0/3 files locally available
usb: 223.42 hours, 0/20 files locally available

annotations:
alice: 560.99 hours, 0/40 files locally available
alice_vtc: 560.99 hours, 0/40 files locally available
eaf/nk: 1.47 hours, 0/88 files locally available
lena: 272.00 hours, 0/17 files locally available
textgrid/mm: 8.75 hours, 0/525 files locally available
vtc: 560.99 hours, 40/40 files locally available

Compute recordings duration

Compute recordings duration in ms and store in into a column named ‘duration’ in the metadata.

child-project compute-durations /path/to/dataset --help

Compute the correlation between audio files

Compute the correlation between two audio files and prints a divergence score. The divergence is computed over a given duration (default 5min) that can be changed with the --interval option. One segment of that duration is taken randomly, the difference in audio signal is calculated and averaged over the total duration. The result is printed as a divergence score. The closer the score is to 0, the more likely it is the 2 files are identical. We can consider that scores below 0.1 reflect a very high probability that the files are the same. At the other end of the spectrum, values higher than 1 almost certainly means they are different recordings. So a window exists in which we can't be sure and would need additional correlation computations or manual checks. Running the correlation multiple time is useful because files that are different have a high variability in score whereas similar files will have a much more consistent output.

Giving a higher --interval value may take more time to compute.

child-project compare-recordings /path/to/dataset --help