This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Mitads-1.0.0-alpha

Pre-release
@Mte90 Mte90 released this 31 Jul 12:15
· 53 commits to master since this release

First official release of the Mitads text corpus!

What?

Mitads is an Italian text corpus of sentences extracted from discussions, chats and books, meant to capture a form of spoken Italian that can be used with AI tools such as DeepSpeech.
This dataset is released into the Public Domain. It is generated with the scripts available at https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/MITADS and is built by aggregating different datasets and resources whose terms allow release in this aggregated form (basically, the original datasets cannot be recreated from it).
Because the corpus is generated on the fly, we cannot release the cache files or the intermediate files generated during the process (for license reasons); we release only the final corpus together with a log file.
This corpus doesn't include repeated sentences. We implemented various sanitization steps, but this task is never-ending and requires your help to improve the quality of the corpus itself.

How it works

Every script in the Mitads folder targets a specific resource: it handles the download and the parsing, and generates txt files.
Most scripts cache external resources to speed up development and generation, and apply specific rules to ignore lines, words and so on.
A Python library is included for the common tasks shared by the various scripts.
A final Bash script executes all of them, does a final sanitization, removes duplicate sentences and generates the final corpus.
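
To give a rough idea of the last step (merging the per-resource txt files, sanitizing and dropping duplicate sentences), here is a minimal Python sketch. It is not the actual Mitads code: the function names (sanitize, unique_sentences, build_corpus), the whitespace-only sanitization rule and the case-sensitive exact-match comparison are all assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator


def sanitize(line: str) -> str:
    """Collapse runs of whitespace and trim the line (illustrative rule only)."""
    return re.sub(r"\s+", " ", line).strip()


def unique_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty sanitized sentence once, in first-seen order."""
    seen = set()
    for raw in lines:
        sentence = sanitize(raw)
        # Exact, case-sensitive match: an assumption, the real rules may differ.
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


def build_corpus(source_dir: Path, output_file: Path) -> int:
    """Merge every .txt file in source_dir into one deduplicated corpus file.

    Returns the number of sentences written.
    """
    # List the sources first so the output file is never read as an input.
    sources = sorted(
        path for path in source_dir.glob("*.txt")
        if path.resolve() != output_file.resolve()
    )

    def all_lines() -> Iterator[str]:
        for path in sources:
            with path.open(encoding="utf-8") as handle:
                yield from handle

    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        for sentence in unique_sentences(all_lines()):
            out.write(sentence + "\n")
            count += 1
    return count
```

Calling build_corpus(Path("output"), Path("corpus.txt")) would merge the txt files from a hypothetical output folder. The set of seen sentences is kept in memory, which is simple but may need a different approach for a very large corpus.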

Numbers

Tickets to do before final release:

Next steps

Close the last tickets and integrate this corpus with the script that generates a new model version. From our internal discussions, using a text corpus closer to the Italian spoken between people should improve word recognition a lot.
After the official release we will evaluate how to improve performance and quality, and maybe find new datasets suitable for this project.

Reach us!

Check with @mozitabot on Telegram and join the Mozilla Italia Developers group (we speak Italian there).