Applied Machine Learning - Project 1
Multilingual Dialog Dataset
To provide conversational training data in languages other than English, we propose parsing openly available theatre plays in French. For this purpose, we will curate French dialog datasets by crawling websites that aggregate openly available theatre works in a consistent, parseable format. In addition, we will parse sample interviews released by authors through free sources on the web, as well as language tutorials that feature conversations in French.
Extracted dialogs are stored in XML, where each 's' element is a conversation and each 'utt' element is an utterance.
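As an illustration of the 's' / 'utt' layout, the following is a minimal Python sketch of how such a file might be read into a list of conversations. Only the 's' and 'utt' names come from the description above; the 'dialog' root element, the sample utterances, and the helper name load_conversations are assumptions made for this example and may not match the actual corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Only the 's' and 'utt' element names come from the
# project description; the <dialog> root and the utterance text are assumed.
SAMPLE = """<dialog>
  <s>
    <utt>Bonjour, comment allez-vous ?</utt>
    <utt>Très bien, merci. Et vous ?</utt>
  </s>
  <s>
    <utt>Où est la gare ?</utt>
    <utt>Tout droit, puis à gauche.</utt>
    <utt>Merci beaucoup.</utt>
  </s>
</dialog>"""


def load_conversations(xml_text):
    """Return a list of conversations, each a list of utterance strings."""
    root = ET.fromstring(xml_text)
    conversations = []
    # Each 's' element is one conversation.
    for s in root.iter("s"):
        # Each 'utt' element inside it is one utterance, in document order.
        utterances = [(utt.text or "").strip() for utt in s.iter("utt")]
        conversations.append(utterances)
    return conversations


conversations = load_conversations(SAMPLE)
for index, conversation in enumerate(conversations):
    print(f"Conversation {index}: {len(conversation)} utterances")
```

For a corpus stored on disk, ET.parse(path).getroot() could replace ET.fromstring, with the rest of the function unchanged.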
The combined corpus can be found at: https://drive.google.com/open?id=0B1ItK6JlQ6ImRXAzMm1jSU9aOTA