GermaParl R Data Package

The GermaParl R Data Package

About

GermaParl is an R data package that includes:

  • a small subset of the linguistically annotated and CWB-indexed GermaParl corpus of plenary protocols of the German Bundestag, included by default;
  • functionality to load the full CWB version of GermaParl from another data source; and
  • supplementary functionality to work with topic models and annotations.

The GitHub repository GermaParlTEI offers the TEI-XML versions of the corpus. The GermaParl data package provides an indexed version of the data, which has been imported into the Corpus Workbench (CWB). Using the CWB keeps the data size modest, ensures performance, exposes the syntax of the Corpus Query Processor (CQP), and creates opportunities to combine quantitative and qualitative approaches to analysing text.

The GermaParl package is designed to be used with polmineR as a toolset for various standard qualitative and quantitative tasks in text analysis (count, dispersion, ngrams, cooccurrences, viewing concordances as well as going back to the original full-text). Using polmineR, you can easily generate data structures (such as term-document matrices) that are required as input for advanced statistical procedures.
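As a rough illustration of such tasks, the following sketch counts a term, retrieves concordances and runs a CQP query. It assumes that polmineR and GermaParl are already installed and that the corpus is registered under the ID "GERMAPARL"; the query term "Integration" is only an example, and argument names may differ between polmineR versions.

```r
# Minimal sketch under stated assumptions, not a definitive workflow.
corpus_id <- "GERMAPARL"  # assumed corpus ID, as used elsewhere in this README

# Only run the polmineR calls if both packages are installed.
have_pkgs <- requireNamespace("polmineR", quietly = TRUE) &&
  requireNamespace("GermaParl", quietly = TRUE)

if (have_pkgs) {
  library(polmineR)
  use("GermaParl")  # activate the corpus in the data package

  # Assumption: corpus() returns a table with a column named "corpus".
  if (corpus_id %in% corpus()[["corpus"]]) {
    # Quantitative: absolute and relative frequency of an example term
    term_counts <- count(corpus_id, query = "Integration")
    print(term_counts)

    # Qualitative: keyword-in-context lines for the same term
    concordances <- kwic(corpus_id, query = "Integration")

    # CQP syntax: a regular expression matching all word forms that start
    # with "Integration"
    cqp_counts <- count(corpus_id, query = '"Integration.*"', cqp = TRUE)
    print(cqp_counts)
  }
}
```

In an interactive session, printing the `concordances` object displays the hits in context, which is the usual starting point for going back to the full text.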

Installation

The GermaParl package can be installed from the ‘drat’ repository of the GitHub presence of the PolMine Project:

if (!"drat" %in% rownames(installed.packages())) install.packages("drat") # install drat only if it is missing
drat::addRepo("polmine") # lowercase necessary in this case
install.packages("GermaParl")

After the initial installation, the package only includes a small subset of the GermaParl corpus. The subset serves as sample data and for running package tests. To get the full corpus, call the following function, which downloads it from external web storage (Amazon AWS, for the time being):

library(GermaParl)
germaparl_download_corpus()

The download is already somewhat large. To avoid bloating it further, additional annotation can be generated on demand. See the package documentation to learn about the functionality that is available.

Using GermaParl

The CWB version of GermaParl can be used with any tool for working with CWB-indexed corpora. The CQP command line tool would be a classic choice. The dissemination mechanism, however, is optimized to work with the polmineR R package. See the documentation for instructions on how to install polmineR.

To check whether the installation has been successful, run the following commands. For further instructions, see the documentation of polmineR.

library(polmineR)
use("GermaParl") # to activate the corpus in the data package
corpus() # to see whether the GERMAPARL corpus is listed
size("GERMAPARL") # to learn about the size of the corpus
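For scripted or non-interactive use, the same check can be wrapped in a small helper. This is a sketch only: `corpus_is_registered()` is a hypothetical helper, not part of GermaParl or polmineR, and it assumes that `corpus()` returns a table with a column named "corpus".

```r
# Hypothetical helper: TRUE if corpus_id is among the registered corpus IDs.
corpus_is_registered <- function(corpus_id, registered) {
  corpus_id %in% registered
}

# Run the check only if both packages are installed.
if (requireNamespace("polmineR", quietly = TRUE) &&
    requireNamespace("GermaParl", quietly = TRUE)) {
  library(polmineR)
  use("GermaParl")
  # Assumption: the table returned by corpus() has a "corpus" column.
  if (corpus_is_registered("GERMAPARL", corpus()[["corpus"]])) {
    message("GERMAPARL is available, number of tokens: ", size("GERMAPARL"))
  } else {
    message("GERMAPARL is not listed; the full corpus may not have been downloaded yet.")
  }
}
```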

License

The data comes with a CC BY-NC-SA 4.0 license. That means:

BY - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NC - NonCommercial — You may not use the material for commercial purposes.

SA - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

See the CC Attribution-NonCommercial-ShareAlike 4.0 International License for further explanations.

Quoting GermaParl

If you work with the GermaParl package, please include the following reference in your bibliography to attribute the language resource:

Blaette, Andreas (2018): GermaParl. R Data Package for the GermaParl Corpus of Plenary Protocols of the German Bundestag (v1.2.1). Available from: https://doi.org/10.5281/zenodo.1312551.

Feedback

We hope that GermaParl in combination with polmineR will inspire your research and make it more productive. We would be glad to learn what you do with the data, and make your blog entries or publications visible here.

And please do not forget to bring issues that you come across to our attention. Improving data quality is an important concern of the PolMine Project, which is why the data is versioned. The resource will benefit from its community of users and your feedback!