GermaParl is a R data package that includes:
- a small subset of the linguistically annotated and CWB indexed GermaParl corpus of plenary protocols of the German Bundestag by default;
- functionality to load the the full CWB version of GermaParl from another data source, and
- supplemantary functionality to work with topic models and annotations.
The GitHub repository GermaParlTEI offers the TEI-XML versions of the corpus. The GermaParl data package makes accessible an indexed version of the data. It has been imported into the Corpus Workbench (CWB). Using the CWB keeps the data size modest, ensures performance, exposes the syntax of the Corpus Query Processor (CQP), and generates opportunites to combine quantitative and qualitative approaches to analysing text.
The GermaParl package is designed to be used with polmineR as a toolset for various standard qualitative and quantitative tasks in text analysis (count, dispersion, ngrams, cooccurrences, viewing concordances as well as going back to the original full-text). Using polmineR, you can easily generate data structures (such as term-document matrices) that are required as input for advanced statistical procedures.
The GermaParl package can be installed from the ‘drat’ repository of the GitHub presence of the PolMine Project:
if ("drat" %in% rownames(available.packages()) == FALSE) install.packages("drat") drat::addRepo("polmine") # lowercase necessary in this case install.packages("GermaParl")
After the initial installation, the package only includes a small subset of the GermaParl corpus. The subset serves as sample data and for running package tests. To download the full corpus, use a function to download the full corpus from an external webspace (amazon AWS, for the time being):
To avoid bloating the data that needs to be downloaded - it is somewhat large already -, further annotation can be generated on demand. See the package documentation to learn about the functionality that is available.
The CWB version of GermaParl can be used with any tool for working with CWB indexed corora. The CQP command line tool would be a classic choice. The dissemination mechanism is optimized to work with the polmineR R package. See the documentation for instructions how to install polmineR.
To check whether the installation has been successful, run the following commands. For further instructions, see the documentation of polmineR.
library(polmineR) use("GermaParl") # to activate the corpus in the data package corpus() # to see whether the GERMAPARL corpus is listed size("GERMAPARL") # to learn about the size of the corpus
The data comes with a CC BY-NC-SA 4.0 license. That means:
BY - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NC - NonCommercial — You may not use the material for commercial purposes.
SA - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
See the CC Attribution-NonCommercial-ShareAlike 4.0 Unported License for further explanations.
If you work with GermaParl package, please include the following reference in your bibliography to attribute the language resource:
Blaette, Andreas (2018): GermaParl. R Data Package for the GermaParl Corpus of Plenary Protocols of the German Bundestag (v1.2.1). Available from: https://doi.org/10.5281/zenodo.1312551.
We hope that GermaParl in combination with polmineR will inspire your research and make it more productive. We would be glad to learn what you do with the data, and make your blog entries or publications visible here.
And please do not forget to bring issues that you come across to our attention. Improving data quality is an important concern of the PolMine Project, this is why the data is versioned. The resource will benefit from its community of users and your feedback!