# Corpus Validation
Clean, valid data is essential for successful machine learning. To that end, the `validation` module provides several methods for validating specific properties of a corpus.

In [1]:
import audiomate
from audiomate.corpus import io
from audiomate.corpus import validation

In [2]:
# Remove the data if it already exists, so the download starts from a clean state
import shutil
shutil.rmtree('output/fsd', ignore_errors=True)

## Data
First, we download the Free Spoken Digit corpus and load it.

In [3]:
corpus_path = 'output/fsd'

io.FreeSpokenDigitDownloader().download(corpus_path)
corpus = audiomate.Corpus.load(corpus_path, reader='free-spoken-digits')

## Perform validation and print result

We can either perform a single validation task ...

In [4]:
val = validation.UtteranceTranscriptionRatioValidator(
    max_characters_per_second=6,
    label_list_idx=audiomate.corpus.LL_WORD_TRANSCRIPT
)

result = val.validate(corpus)
print(result.get_report())

Utterance-Transcription-Ratio (word-transcript)

--> Label-List ID: word-transcript
--> Threshold max. characters per second: 6

Result: Failed

Invalid Utterances:
    * 2_theo_34 (6.211180124223603)
    * 6_nicolas_23 (6.172839506172839)
    * 6_nicolas_35 (6.177606177606178)
    * 6_nicolas_7 (6.962576153176675)
    * 6_nicolas_9 (6.354249404289119)
    * 6_yweweler_1 (6.39488409272582)
    * 6_yweweler_10 (6.1443932411674345)
    * 6_yweweler_17 (6.182380216383307)
    * 6_yweweler_3 (6.968641114982579)

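The number in parentheses after each utterance is its characters-per-second ratio, which is compared against the threshold of 6. The following is a minimal sketch of how such a check could work. It is not audiomate's implementation: the helper names (`characters_per_second`, `find_invalid`), the choice to ignore spaces, and the example ids, transcripts, and durations are all assumptions made for illustration.

```python
# Hypothetical sketch of a characters-per-second check.
# This is NOT audiomate's code; names and data are made up for illustration.


def characters_per_second(transcript, duration_seconds):
    """Return the number of non-space characters per second of audio."""
    if duration_seconds <= 0:
        raise ValueError('duration must be positive')
    num_chars = len(transcript.replace(' ', ''))  # assumption: spaces are ignored
    return num_chars / duration_seconds


def find_invalid(utterances, max_characters_per_second):
    """Map utterance id -> ratio for every utterance above the threshold."""
    invalid = {}
    for utt_id, (transcript, duration) in utterances.items():
        ratio = characters_per_second(transcript, duration)
        if ratio > max_characters_per_second:
            invalid[utt_id] = ratio
    return invalid


# Made-up utterances: id -> (transcript, duration in seconds).
utterances = {
    'utt-a': ('six', 0.25),  # 3 characters in 0.25 s, above the threshold
    'utt-b': ('six', 0.75),  # 3 characters in 0.75 s, below the threshold
}

invalid = find_invalid(utterances, max_characters_per_second=6)
print(invalid)
```

An unusually high ratio often hints at a truncated recording or a transcript that does not match the audio, which is why such utterances are worth inspecting before training.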

Or we can combine several validators with `CombinedValidator` and run them in one go.

In [5]:
val = validation.CombinedValidator(validators=[
    validation.UtteranceTranscriptionRatioValidator(
        max_characters_per_second=6, 
        label_list_idx=audiomate.corpus.LL_WORD_TRANSCRIPT
    ),
    validation.LabelCountValidator(
        min_number_of_labels=1,
        label_list_idx=audiomate.corpus.LL_WORD_TRANSCRIPT
    )
])

result = val.validate(corpus)
print(result.get_report())

Label-Count (word-transcript) --> Passed
Utterance-Transcription-Ratio (word-transcript) --> Failed


Label-Count (word-transcript)

--> Label-List ID: word-transcript
--> Min. number of labels: 1

Result: Passed


Utterance-Transcription-Ratio (word-transcript)

--> Label-List ID: word-transcript
--> Threshold max. characters per second: 6

Result: Failed

Invalid Utterances:
    * 2_theo_34 (6.211180124223603)
    * 6_nicolas_23 (6.172839506172839)
    * 6_nicolas_35 (6.177606177606178)
    * 6_nicolas_7 (6.962576153176675)
    * 6_nicolas_9 (6.354249404289119)
    * 6_yweweler_1 (6.39488409272582)
    * 6_yweweler_10 (6.1443932411674345)
    * 6_yweweler_17 (6.182380216383307)
    * 6_yweweler_3 (6.968641114982579)

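The combined report first lists a pass/fail summary per validator and then the detailed report of each one. The sketch below illustrates the general pattern of running several checks and aggregating their results. It is a simplified assumption, not audiomate's implementation: the classes `MinLabelCount`, `MaxLabelCount`, and `Combined`, the dictionary-based result format, and the toy corpus are all hypothetical.

```python
# Hypothetical sketch of the "combine validators" pattern.
# These classes are illustrative only and are NOT part of audiomate.


class MinLabelCount:
    """Flags utterances that have fewer labels than required."""
    name = 'min-label-count'

    def __init__(self, min_number_of_labels):
        self.min_number_of_labels = min_number_of_labels

    def validate(self, corpus):
        invalid = sorted(utt_id for utt_id, labels in corpus.items()
                         if len(labels) < self.min_number_of_labels)
        return {'passed': not invalid, 'invalid': invalid}


class MaxLabelCount:
    """Flags utterances that have more labels than allowed."""
    name = 'max-label-count'

    def __init__(self, max_number_of_labels):
        self.max_number_of_labels = max_number_of_labels

    def validate(self, corpus):
        invalid = sorted(utt_id for utt_id, labels in corpus.items()
                         if len(labels) > self.max_number_of_labels)
        return {'passed': not invalid, 'invalid': invalid}


class Combined:
    """Runs every validator and passes only if all of them pass."""

    def __init__(self, validators):
        self.validators = validators

    def validate(self, corpus):
        results = {v.name: v.validate(corpus) for v in self.validators}
        passed = all(r['passed'] for r in results.values())
        return {'passed': passed, 'results': results}


# Toy corpus: utterance id -> list of labels.
toy_corpus = {'utt-a': ['six'], 'utt-b': ['two', 'two'], 'utt-c': ['one']}

combined = Combined([MinLabelCount(1), MaxLabelCount(1)])
outcome = combined.validate(toy_corpus)

# Print a per-validator summary.
for name, res in outcome['results'].items():
    print(name, '-->', 'Passed' if res['passed'] else 'Failed')
```

Keeping each check in its own small class means new checks can be added without touching the aggregation logic.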

