initial commit
Andreas Blätte authored and Andreas Blätte committed May 23, 2023
0 parents commit c1e7e7d
Showing 18 changed files with 8,718 additions and 0 deletions.
355 changes: 355 additions & 0 deletions rmd/01-DataReport.Rmd

Large diffs are not rendered by default.

187 changes: 187 additions & 0 deletions rmd/02-Corpus-Preparation.Rmd

Large diffs are not rendered by default.

70 changes: 70 additions & 0 deletions rmd/03-XML-Structure.Rmd
@@ -0,0 +1,70 @@
# XML Structure{#xml-structure}

When working either with the TEI/XML files directly or with the CWB version of the corpus, it is important to know the structure of the data. In the following, the TEI/XML is presented to illustrate this structure.

In GermaParl, each session is represented in a separate XML file. These files follow a TEI-inspired format that divides each debate into a TEI-Header containing metadata and a text body containing the speeches of individual speakers together with speaker-level metadata. Both elements are described in the following.

## TEI-Header

\scriptsize

```{r germaparl_xml_structure_header, echo = FALSE}
germaparltei_xml <- xml2::read_xml("./data_raw/BT_01_001_min.xml")
xml2::xml_structure(xml2::xml_find_all(germaparltei_xml, ".//teiHeader"))
```

\normalsize

The TEI-Header comprises metadata with general information about the corpus and the encoding project as well as session-specific metadata such as the date, the legislative period and the session number. Of its elements, the most important are:

- **titleStmt/legislativePeriod**: The legislative period of the debate.
- **titleStmt/sessionNo**: The protocol or session number of the debate.
- **edition/package**: The R package used to create the TEI.
- **edition/birthday**: The date the TEI file was created.
- **publicationStmt/date**: The date of the debate.
- **sourceDesc/url**: The source URL of the raw file.
- **sourceDesc/filetype**: The file type of the raw source file.
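
The elements listed above can be read with `xml2`. The following is a minimal sketch, not the project's own code: the header is a hypothetical in-memory example, its nesting (for instance the `fileDesc` and `editionStmt` wrappers) is an assumption inferred from the element paths in the list, and all values are invented for illustration.

```r
library(xml2)

# Hypothetical minimal header. The nesting is an assumption inferred from the
# element paths listed above; all values are invented for illustration.
header_xml <- '<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <legislativePeriod>1</legislativePeriod>
        <sessionNo>1</sessionNo>
      </titleStmt>
      <editionStmt>
        <edition>
          <package>examplepkg</package>
          <birthday>2023-01-01</birthday>
        </edition>
      </editionStmt>
      <publicationStmt>
        <date>1950-01-01</date>
      </publicationStmt>
      <sourceDesc>
        <url>https://example.org/protocol.pdf</url>
        <filetype>pdf</filetype>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>'

header <- read_xml(header_xml)

# XPath expressions for the elements described in the list above.
paths <- c(
  legislative_period = ".//titleStmt/legislativePeriod",
  session_no         = ".//titleStmt/sessionNo",
  package            = ".//edition/package",
  created            = ".//edition/birthday",
  debate_date        = ".//publicationStmt/date",
  source_url         = ".//sourceDesc/url",
  source_filetype    = ".//sourceDesc/filetype"
)

# Extract the text of the first match for each path into a named vector.
header_metadata <- vapply(
  paths,
  function(path) xml_text(xml_find_first(header, path)),
  character(1)
)
print(header_metadata)
```

Because the XPath expressions start with `.//`, they match the listed elements regardless of how deeply they are nested in the header.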

## TEI-Text

\scriptsize

```{r germaparl_xml_structure_text, echo = FALSE}
xml2::xml_structure(xml2::xml_find_all(germaparltei_xml, ".//text"))
```

\normalsize

Every XML file contains exactly one session. The entire debate is wrapped in a `<text>` node which contains a single `<body>` node. In this `<body>` node, every agenda item is encoded as a `<div>` node. Each `<div>` node has the following attributes:

- **type**: The type of the agenda item.
- **n**: The number of the agenda item.
- **what**: The category of the agenda item.
- **desc**: The verbatim call of the agenda item.
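
As an illustration of how these attributes can be used, the following sketch collects the agenda items of a session in a data frame. The XML snippet is a hypothetical stand-in for a real session file, and the attribute values are invented.

```r
library(xml2)

# Hypothetical body with two agenda items; attribute values are invented.
body_xml <- '<TEI><text><body>
  <div type="agenda_item" n="1" what="debate" desc="Example agenda item 1"/>
  <div type="agenda_item" n="2" what="question_time" desc="Example agenda item 2"/>
</body></text></TEI>'

doc <- read_xml(body_xml)

# Every agenda item is a <div> child of the single <body> node.
divs <- xml_find_all(doc, ".//text/body/div")

# One row per agenda item, one column per attribute.
agenda_items <- data.frame(
  type = xml_attr(divs, "type"),
  n    = as.integer(xml_attr(divs, "n")),
  what = xml_attr(divs, "what"),
  desc = xml_attr(divs, "desc"),
  stringsAsFactors = FALSE
)
print(agenda_items)
```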

Within these `<div>` nodes, each contribution of a speaker is encoded as an `<sp>` node.

Each `<sp>` node has a number of attributes, which were already introduced as structural attributes in the data report:

- **who**: The raw name of a speaker before "enhancing" the data. The names may already have been adjusted and harmonized to facilitate the matching performed during this process.
- **parliamentary_group**: The parliamentary group, mostly extracted from the protocol text. The values may already have been adjusted and harmonized to facilitate the matching performed during this process.
- **role**: The parliamentary role of a speaker, derived from the speaker call in the protocol text.
- **position**: The parliamentary position of a speaker, i.e. the governmental office a speaker is associated with in the speaker call in the protocol text.
- **party**: The party affiliation of a speaker, added when enhancing the raw protocol.
- **name**: The full name of a speaker, added when enhancing the raw protocol.

Except for `position`, which is not fully consolidated because of its high variation, these attributes are also part of the CWB corpus. The attribute `who_orignal` is added to the TEI mainly for documentation purposes. As shown before, the naming scheme was changed. This is discussed in `r link_or_footnote("the release note of GermaParl v2 Release Candidate 3.", "https://polmine.github.io/posts/2023/04/03/GermaParl-v2-beta3-Release-Note.html", "2023-05-22")`

The first child of each `<sp>` node is a `<speaker>` node containing the speaker call. This line is used to segment the running text into speeches. After the speaker information is extracted from this line, this element is redundant and is thus not part of the CWB corpus.

Utterances of speakers are then added as paragraphs, which are further children of the `<sp>` node. In addition, interjections by other speakers and non-verbal elements such as transcriber comments are added as `<stage>` nodes. These represent elements which occur during a speaker's turn but are not a substantial part of the current utterance. Each `<stage>` node has an attribute **type** which currently only takes the value "interjection". In the CWB version, these stage nodes are represented as a specific kind of paragraph node of type "stage".
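
The distinction between utterances and `<stage>` nodes matters when extracting the text of a speech. The following sketch assumes that paragraphs are encoded as `<p>` children of `<sp>`; the XML snippet and all attribute values are hypothetical. It builds one row per speech, keeping the speaker's own words and counting the interjections separately.

```r
library(xml2)

# Hypothetical speech; the <p> element name and all values are assumptions.
speech_xml <- '<TEI><text><body><div type="agenda_item" n="1">
  <sp who="Example Speaker" role="mp" name="Example Speaker">
    <speaker>Example Speaker:</speaker>
    <p>First paragraph of the speech.</p>
    <stage type="interjection">(Applause)</stage>
    <p>Second paragraph of the speech.</p>
  </sp>
</div></body></text></TEI>'

doc <- read_xml(speech_xml)
speeches <- xml_find_all(doc, ".//sp")

# One row per speech: speaker attributes, the speaker's own paragraphs
# (without the speaker call and without stage nodes), and the number of
# interjections that occurred during the turn.
speech_table <- do.call(rbind, lapply(speeches, function(sp) {
  data.frame(
    who             = xml_attr(sp, "who"),
    role            = xml_attr(sp, "role"),
    utterance       = paste(xml_text(xml_find_all(sp, "./p")), collapse = " "),
    n_interjections = length(xml_find_all(sp, "./stage[@type='interjection']")),
    stringsAsFactors = FALSE
  )
}))
print(speech_table)
```

Selecting only the `<p>` children leaves out both the `<speaker>` line and the interjections, which mirrors the treatment of the speaker call described above.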

The XML structure above depicts a single speech in a single agenda item. In most XML files, there will be more of these nodes.

## XML in the CWB corpus

While this TEI/XML format provides a structurally annotated representation of the data, the linguistic annotation is added in the form of a hierarchical XML representation. This is the file format imported into the Corpus Workbench. As the format is rather specific to our use case and entirely reproducible from the TEI version, we consider it an intermediate format and, unlike the TEI version, do not provide it. However, it informs the internal structure of the Corpus Workbench version of the corpus, so it is useful to consider its makeup.

This format and its consequences are discussed in some detail in `r link_or_footnote("the release note of GermaParl v2 Release Candidate 3.", "https://polmine.github.io/posts/2023/04/03/GermaParl-v2-beta3-Release-Note.html", "2023-05-22")` For the purposes of this documentation, it is important to note that this structure has consequences for the Corpus Workbench version of the corpus.

The resulting hierarchical structure differs from that of GermaParl v1. It stems from the distinction between document-level and speaker-level annotation, from nested stage paragraphs, and from the linguistic annotation with sentence annotation and named entity recognition.
81 changes: 81 additions & 0 deletions rmd/04-Tools_Packages_Resources.Rmd
@@ -0,0 +1,81 @@
# Data and Resources, Tools and Packages{#data-resources}

While the data, resources, tools and packages used were mentioned in the previous sections, this chapter briefly summarizes them.

## Data

### Protocols

The original protocols were downloaded from the website of the German Bundestag, which is also the source of the *Stammdaten* file. Wikipedia was used for additional information on speakers.

`r if (knitr::is_html_output()) "The following table shows the source of each individual file." else "For a complete list of all sources, please see the online HTML version of this document."`

```{r download-report-bt-sources, echo = FALSE, eval = knitr::is_html_output()}
# Collect all TEI files of the corpus.
tei_files <- list.files(
  "~/lab/github/GermaParlTEI_beta",
  pattern = "\\.xml$",
  full.names = TRUE,
  recursive = TRUE
)

# Read the source metadata (URL, download date, file type) from each header.
download_metadata <- lapply(tei_files, function(tei_file) {
  tei <- xml2::read_xml(tei_file)
  url <- xml2::xml_find_first(tei, "//url") |> xml2::xml_text()
  date <- xml2::xml_find_first(tei, "//sourceDesc/date") |> xml2::xml_text()
  type <- xml2::xml_find_first(tei, "//sourceDesc/filetype") |> xml2::xml_text()
  data.table::data.table(
    file = basename(tei_file),
    url = url,
    filetype = type,
    date = date
  )
})
download_metadata_dt <- data.table::rbindlist(download_metadata)

# Collapse files that share a source URL and file type into one row which
# names the first and the last protocol covered by that source.
download_metadata_tbl <- download_metadata_dt |>
  dplyr::mutate(id_as_numeric = as.integer(gsub("BT_(\\d+)_(\\d+)\\.xml", "\\1\\2", file))) |>
  dplyr::group_by(url, filetype) |>
  dplyr::mutate(min_protocol = file[which.min(id_as_numeric)]) |>
  dplyr::mutate(max_protocol = file[which.max(id_as_numeric)]) |>
  dplyr::ungroup() |>
  dplyr::select(url, date, filetype, min_protocol, max_protocol) |>
  unique()

# Show a range ("first - last") if a source covers more than one protocol.
download_metadata_tbl$protocols <- ifelse(
  download_metadata_tbl$min_protocol != download_metadata_tbl$max_protocol,
  sprintf("%s - %s", download_metadata_tbl$min_protocol, download_metadata_tbl$max_protocol),
  download_metadata_tbl$min_protocol
)
download_metadata_tbl$min_protocol <- NULL
download_metadata_tbl$max_protocol <- NULL
download_metadata_tbl <- download_metadata_tbl[, c("protocols", "date", "filetype", "url")]

knitr::kable(
  download_metadata_tbl,
  format = "html",
  booktabs = TRUE,
  escape = TRUE,
  col.names = c("protocol name(s)", "download date", "original filetype", "source url"),
  caption = "Download Report of the GermaParl corpus"
)
```

### External Data

The majority of the information added to the initial protocols originates from the `r link_or_footnote("Stammdaten of the German Bundestag.", "https://www.bundestag.de/services/opendata", "2023-05-23")` It comprises information about Members of Parliament. Party affiliations are added from Wikipedia. In the preparation process, the Stammdaten file was converted into a data.table object which was then stored in an R data package for versioning and documentation purposes. Information on other speakers is added from Wikipedia. In cases in which no information about a speaker could be found on Wikipedia, `r link_or_footnote("Munzinger", "https://www.munzinger.de/search/start.jsp", "2023-05-23")` proved a valuable resource for speaker names and party affiliations.

## Tools

### Stanford CoreNLP{-}

For the current iteration of the corpus, the Java version of Stanford CoreNLP (version 4.5.x) was used to perform the initial linguistic annotation [@manning_stanford_2014]. More specifically, tokenization, splitting of sentences, Part-of-Speech tagging in the Universal Dependencies tag set and Named Entity Recognition were performed using the default German language model. To make use of the parallel computing capabilities of Stanford CoreNLP from within R, the R wrapper `bignlp` was developed in the context of the PolMine project [@bignlpRPackage]. It is available on `r link_or_footnote("GitHub.", "https://github.com/PolMine/bignlp", "2023-05-23")`

### TreeTagger{-}

To add lemmata and language-specific Part-of-Speech tags to the current corpus, TreeTagger was used [@schmid_probabilistic_1994]. While it is not the most recent solution for adding these annotation layers, TreeTagger is fast and robust.

### Corpus Workbench{-}

The corpus is also provided in the format of the `r link_or_footnote("IMS Corpus Workbench.", "https://cwb.sourceforge.io/", "2023-05-23")` The preparation workflow mainly communicates with the CWB via the `cwbtools` R package [@cwbtoolsRPackage], which is used for the encoding of the data.

### Additional R Packages{-}

The workflow is set up in R [@RCore]. For different parsers, the packages `xml2` [@xml2] and `stringr` [@stringr] provide important functionality. To facilitate the structural annotation of the protocols, the R package `r link_or_footnote("frappp", "https://polmine.github.io/frappp_slides/slides_en.html", "2023-05-23")` is crucial. The R package `trickypdf` [@trickypdf] was used to resolve the two-column layout of PDF files.
1 change: 1 addition & 0 deletions rmd/05-references.Rmd
@@ -0,0 +1 @@
# References
151 changes: 151 additions & 0 deletions rmd/404.html
@@ -0,0 +1,151 @@
<!DOCTYPE html>
<html lang="" xml:lang="">
<head>

<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Page not found | README</title>
<meta name="description" content="" />
<meta name="generator" content="bookdown 0.24 and GitBook 2.6.7" />

<meta property="og:title" content="Page not found | README" />
<meta property="og:type" content="book" />





<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Page not found | README" />







<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black" />




<script src="libs/header-attrs-2.11/header-attrs.js"></script>
<script src="libs/jquery-3.6.0/jquery-3.6.0.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/fuse.js@6.4.6/dist/fuse.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />








<link href="libs/anchor-sections-1.0.1/anchor-sections.css" rel="stylesheet" />
<script src="libs/anchor-sections-1.0.1/anchor-sections.js"></script>




<link rel="stylesheet" href="style.css" type="text/css" />
</head>

<body>



<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">

<div class="book-summary">
<nav role="navigation">


</nav>
</div>

<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./"></a>
</h1>
</div>

<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">

<section class="normal" id="section-">
<div id="page-not-found" class="section level1">
<h1>Page not found</h1>
<p>The page you requested cannot be found (perhaps it was moved or renamed).</p>
<p>You may want to try searching to find the page's new location, or use
the table of contents to find the page you are looking for.</p>
</div>
</section>

</div>
</div>
</div>


</div>
</div>
<script src="libs/gitbook-2.6.7/js/app.min.js"></script>
<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
<script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
<script>
gitbook.require(["gitbook"], function(gitbook) {
gitbook.start({
"sharing": {
"github": false,
"facebook": true,
"twitter": true,
"linkedin": false,
"weibo": false,
"instapaper": false,
"vk": false,
"whatsapp": false,
"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
},
"fontsettings": {
"theme": "white",
"family": "sans",
"size": 2
},
"edit": {
"link": "https://github.com/USERNAME/REPO/edit/BRANCH/%s",
"text": "Edit"
},
"history": {
"link": null,
"text": null
},
"view": {
"link": null,
"text": null
},
"download": null,
"search": false,
"toc": {
"collapse": "subsection"
}
});
});
</script>

</body>

</html>
7 changes: 7 additions & 0 deletions rmd/README.md
@@ -0,0 +1,7 @@
### Build the Book

```
setwd("~/lab/github/BuildingGermaParl/docs")
bookdown::render_book("index.Rmd", "bookdown::gitbook")
bookdown::render_book("index.Rmd", "bookdown::pdf_book")
```