Video-Transcript Generator for Plenary Protocols of the German Bundestag
If you're just looking for a demo of already processed speeches, check out:
Generates time-based transcripts from parliamentary protocols published via the Bundestag Open Data service (https://www.bundestag.de/service/opendata).
The timings for already processed speeches are published at:
- Aeneas ("automagically synchronize audio and text")
- Aeneas Dependencies: Python (2.7.x preferred), FFmpeg, and eSpeak
Step 1: Install Aeneas
For Mac OS, there is an all-in-one installer, which takes care of the dependencies: https://github.com/sillsdev/aeneas-installer/releases.
Step 2: Input files (XML)
XML input files are located at _server/input/xml/. Some example files are already included. To add new files, download the file from https://www.bundestag.de/service/opendata and place it into the same directory.
Step 3: Scrape Media IDs (only if new XML files were added)
Careful: This potentially sends thousands of requests to Bundestag servers!
During forced alignment, we need a local copy of the respective audio file (for a single speech, an agenda item, or the entire meeting, depending on the selection). To show a preview upon completion, we also need the remote URL of the video file. Both file URLs can be retrieved once the Media ID is known. As Media IDs are not included in the original XML files from https://www.bundestag.de/service/opendata, we need to scrape them from the Bundestag Mediathek RSS feed and write them to the respective XML nodes (e.g. <rede [...] media-id="1234567">).
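As a rough illustration of the "write them to the respective XML nodes" part, here is a minimal Python sketch. It is not the scraper shipped with this project. It assumes that each <rede> node carries an id attribute and that a lookup table from speech id to Media ID has already been built from the RSS feed; the function name assign_media_ids, the sample ids, and the root element name are hypothetical.

```python
import xml.etree.ElementTree as ET


def assign_media_ids(xml_text, media_ids):
    """Set media-id on every <rede> whose id is a key in media_ids.

    Returns the updated XML as a string plus the list of speech ids
    for which no Media ID was found (assumption: lookup is keyed by id).
    """
    root = ET.fromstring(xml_text)
    missing = []
    for rede in root.iter("rede"):
        speech_id = rede.get("id")
        media_id = media_ids.get(speech_id)
        if media_id is None:
            # Leave the node untouched so it can be skipped later.
            missing.append(speech_id)
        else:
            rede.set("media-id", str(media_id))
    return ET.tostring(root, encoding="unicode"), missing


# Hypothetical, heavily shortened protocol with two speeches.
sample = (
    '<protokoll>'
    '<rede id="ID1900100"><p>Text</p></rede>'
    '<rede id="ID1900200"><p>Text</p></rede>'
    '</protokoll>'
)
updated, missing = assign_media_ids(sample, {"ID1900100": "1234567"})
print(updated)
print("No Media ID found for:", missing)
```

Returning the unmatched ids separately makes it easy to see which speeches will be skipped in Step 4.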
Step 4: Generate Video-Transcript (Force align XML & Audio)
Open http://localhost/VideoTranscriptGenerator/index.html (or wherever you placed the scripts) -> Choose XML file -> Optionally choose agenda item or single speech -> Generate Video-Transcript -> DONE!
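For readers who want to post-process alignment results themselves, the following Python sketch turns a sync map in Aeneas' JSON output format (a "fragments" list with "begin", "end" and "lines") into simple timed cues. It assumes the raw Aeneas format, which is not necessarily identical to the JSON files this tool generates; the function name fragments_to_cues and the sample sentences are hypothetical.

```python
import json


def fragments_to_cues(sync_map_json):
    """Convert an Aeneas JSON sync map into a list of timed text cues."""
    data = json.loads(sync_map_json)
    cues = []
    for fragment in data.get("fragments", []):
        cues.append({
            # Aeneas writes timestamps as strings in seconds, e.g. "4.120".
            "start": float(fragment["begin"]),
            "end": float(fragment["end"]),
            "text": " ".join(fragment.get("lines", [])),
        })
    return cues


# Hypothetical sync map with two aligned sentences.
sample = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "4.120", "language": "deu",
     "children": [], "lines": ["Sehr geehrter Herr Präsident!"]},
    {"id": "f000002", "begin": "4.120", "end": "9.500", "language": "deu",
     "children": [], "lines": ["Liebe Kolleginnen und Kollegen!"]},
]})

cues = fragments_to_cues(sample)
for cue in cues:
    print("{start:7.3f} - {end:7.3f}  {text}".format(**cue))
```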
The generated JSON, XML and HTML files are saved inside
- Functionality is currently limited to the 19th electoral period (previous periods are published in a different format)
- The forced alignment algorithm needs some configuration tweaks to better deal with the first few sentences and non-speech periods.
- Entire agenda items often include additional text, which is not reflected in the protocols. In some cases, agenda items are also discussed in conjunction with others ("in Verbindung mit"). The Media ID scraper cannot yet handle these inconsistencies. This only affects entire agenda items, though.
- In the XML format, there is currently no differentiation between actual speeches (for which a media file exists) and other speaker contributions, for example during electoral proceedings (e.g. "Wahl des Bundestagspräsidenten") or discussion formats (e.g. "Befragung der Bundesregierung", "Fragestunde"). As a result, the <rede> nodes for these contributions sometimes have no Media ID or a wrong one. For now, they are ignored and not processed.
- Make sure the _server/output directories exist and are writable
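Regarding the alignment tweaks for the first few sentences and non-speech periods mentioned above: Aeneas exposes task configuration keys for head detection and non-speech handling. The Python sketch below only assembles such a configuration string; the helper name build_aeneas_config and the numeric values are illustrative assumptions, not tuned settings and not the configuration this project currently uses.

```python
def build_aeneas_config(language="deu", head_max=None, nonspeech_min=None):
    """Assemble an Aeneas task configuration string (key=value joined by '|')."""
    options = [
        ("task_language", language),
        ("is_text_type", "plain"),
        ("os_task_file_format", "json"),
    ]
    if head_max is not None:
        # Let Aeneas look for up to head_max seconds of untranscribed
        # audio at the start of the file.
        options.append(("is_audio_file_detect_head_max", head_max))
    if nonspeech_min is not None:
        # Treat pauses of at least nonspeech_min seconds as non-speech
        # and remove them from the resulting sync map.
        options.append(("task_adjust_boundary_nonspeech_min", nonspeech_min))
        options.append(("task_adjust_boundary_nonspeech_string", "REMOVE"))
    return "|".join("{}={}".format(key, value) for key, value in options)


config = build_aeneas_config(head_max=10.0, nonspeech_min=0.5)
print(config)
```

The resulting string can be passed to Aeneas as the task configuration; suitable values likely depend on the recording and would need to be found by experiment.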