This corpus is a collection of 400 screenplays from the French TV show Kaamelott. The transcriptions are not official: they were originally scraped automatically from a French website, Hypnoweb, then normalized by automatic procedures to produce a text version.
Ultimately, Corpus Kaamelott intends to be an NLP-ready annotated resource, available in multiple formats.
See the documentation for more information.
At this time, three formats are available:
- text version;
- POS tagged version (word/tag/lemma);
- XML-TEI version.
As things progress, you can evaluate the result of the most recent developments in the sample/ folder.
- cat/ folder groups the lines by speaker.
- sample/ folder is a set of screenplays selected by sampling. By nature, they should be considered a work in progress rather than stable.
- static/cast.txt maps characters to the actors who play them.
- static/characters.txt is a directory of the characters in Kaamelott.
- static/episodes.txt lists all the episodes transcribed on Hypnoweb.
- static/index.txt is a collection of metadata about the original screenplays scraped from Hypnoweb.
- static/ne.txt lists the named entities.
- static/slang.txt is a lexicon of slang expressions in tabulated format.
- static/tagset_map.txt establishes the correspondence between the POS tags used in the corpus and the universal tagset.
- tagged/ folder contains the 400 screenplays in tagged format (word/tag/lemma). Each line lists, in tabulated format, the speaker and their tagged line.
- tools/ folder provides useful scripts for manipulating the corpus, such as a custom reader for NLTK (see below).
- txt/ folder contains the 400 screenplays in text format. As in the tagged version, each line lists, in tabulated format, the speaker and their line.
- xml/ folder hosts the corpus in XML-TEI-compliant format.
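Given the tabulated layout described above, a single tagged line can be split into (word, tag, lemma) triples with a few lines of Python. This is only a sketch: the exact separators (a tab between speaker and dialogue, spaces between word/tag/lemma tokens) are assumed from the description, and the sample line is hypothetical.

```python
# Sketch: parsing one line of the tagged format described above.
# Assumed layout: "Speaker<TAB>word/tag/lemma word/tag/lemma ..."
def parse_tagged_line(line):
    speaker, tagged = line.rstrip("\n").split("\t", 1)
    tokens = []
    for tok in tagged.split():
        # rsplit guards against a "/" occurring inside the word itself
        word, tag, lemma = tok.rsplit("/", 2)
        tokens.append((word, tag, lemma))
    return speaker, tokens

# Hypothetical sample line; the real files may differ slightly.
speaker, tokens = parse_tagged_line("Karadoc\tDe/P/de quoi/PROWH/quoi ?/PONCT/?")
```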
The custom KaamelottCorpusReader
The KaamelottCorpusReader Python class is based on the NLTK CorpusReader API. Make sure NLTK is installed before using it.
Below is an example of use:
```python
# Modules to import
from collections import defaultdict
from KaamelottCorpusReader import KaamelottCorpusReader as KCR

# Parse the tagged corpus
kaam = KCR('./tagged', r'.*\.pos')

# Select a screenplay
tagged = kaam.tagged_corpus('S01E01-heat.pos')

# Get all the rows
rows = tagged.values()

# Make a dictionary of lines by speaker
d = defaultdict(list)
for row in rows:
    for speaker, lines in row:
        for line in lines:
            d[speaker].append(line)

# Who are the speakers in the screenplay?
speakers = d.keys()

# Print the fifth line of character Karadoc
print(d['Karadoc'][4])
# [('De', 'P', 'de'), ('quoi', 'PROWH', 'quoi?'), ('?', 'PONCT', '?')]
```
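The POS tags returned by the reader could then be converted to the universal tagset via static/tagset_map.txt. Below is a minimal sketch, assuming the file holds one corpus-tag/universal-tag pair per tab-separated line; the sample pairs are hypothetical, not taken from the actual file.

```python
# Sketch: loading a corpus-tag -> universal-tag mapping, assuming
# static/tagset_map.txt holds one "corpus_tag<TAB>universal_tag" per line.
def load_tagset_map(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        corpus_tag, universal_tag = line.split("\t")
        mapping[corpus_tag] = universal_tag
    return mapping

# Hypothetical sample rows; the real file may use different pairs.
sample = ["P\tADP", "PROWH\tPRON", "PONCT\t."]
tag_map = load_tagset_map(sample)
```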
Transcribed screenplays come from Hypnoweb.net.
The reference lexicon for spellchecking comes from Lexique 3 (v3.83):
- New, B., Pallier, C., Ferrand, L., & Matos, R. (2001). Une base de données lexicales du français contemporain sur internet : LEXIQUE. L'Année Psychologique, 101, 447-462. http://www.lexique.org
- New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A New French Lexical Database. Behavior Research Methods, Instruments, & Computers, 36 (3), 516-524.
The French slang lexicon was created thanks to Bob.
The POS-tagger was trained on the French Treebank:
- Abeillé, A., Clément, L., & Toussenel, F. (2003). Building a treebank for French. In A. Abeillé (Ed.), Treebanks (pp. 165-187). Dordrecht: Kluwer.
Lemmatization of tokens was made possible thanks to the French LEFFF Lemmatizer by Claude Coulombe, based on the Lefff lexicon:
- Sagot, B. (2010). The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Istanbul, Turkey.