In [None]:
![uq](https://slcladal.github.io/images/uq1.jpg)

# Introduction

This tutorial introduces part-of-speech tagging and syntactic parsing using R. The entire R markdown document for this tutorial can be downloaded [here](https://slcladal.github.io/tagging.Rmd). Another highly recommended tutorial on part-of-speech tagging in R, produced by Andreas Niekler and Gregor Wiedemann, can be found [here](https://tm4ss.github.io/docs/Tutorial_8_NER_POS.html) [see @WN17].

# Part-Of-Speech Tagging

Many analyses of language data require that we distinguish different parts of speech. In order to determine the word class of a certain word, we use a procedure called part-of-speech tagging (commonly referred to as POS-, pos-, or PoS-tagging). Pos-tagging is a common procedure when working with natural language data. Despite being used quite frequently, it is a rather complex issue that requires the application of quite advanced statistical methods. In the following, we will explore different options for pos-tagging and syntactic parsing.

Parts-of-speech, or word categories, refer to the grammatical nature or category of a lexical item, e.g. in the sentence *Jane likes the girl* each lexical item can be classified according to whether it belongs to the group of determiners, verbs, nouns, etc. Pos-tagging refers to a computational process in which information is added to existing text. This process is also called *annotation*. Annotation can be very different depending on the task at hand. The most common type of annotation when it comes to language data is part-of-speech tagging, where the word class is determined for each word in a text and is then added to the word as a tag. However, there are many different ways to tag or annotate texts.

Pos-tagging assigns part-of-speech tags to character strings (these mostly represent words, of course, but also encompass punctuation marks and other elements). This means that pos-tagging is one specific type of annotation, i.e. adding information to data (either by directly adding information to the data itself or by storing information in e.g. a list which is linked to the data). It is important to note that annotation encompasses various types of information such as pauses, overlap, etc. Pos-tagging is just one of these many ways in which corpus data can be *enriched*. Sentiment Analysis, for instance, also annotates texts or words with respect to their emotional value or polarity.
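The two ways of storing annotation mentioned above can be sketched in a few lines of base R. The tags are assigned by hand here, and the object names (`inline`, `standoff`) are illustrative assumptions rather than the output of any tagger.

```r
# example sentence and hand-assigned tags (illustrative only)
sentence <- "Jane likes the girl"
tokens   <- strsplit(sentence, " ", fixed = TRUE)[[1]]
tags     <- c("NNP", "VBZ", "DT", "NN")

# option 1: inline annotation, the tag is attached directly to each word
inline <- paste(tokens, tags, sep = "/", collapse = " ")

# option 2: stand-off annotation, the text stays untouched and the tags
# are stored in a separate table that is linked via character positions
end   <- cumsum(nchar(tokens)) + seq_along(tokens) - 1
start <- end - nchar(tokens) + 1
standoff <- data.frame(start, end, tag = tags)

# the original words can be recovered from the stored positions
substring(sentence, standoff$start, standoff$end)
```

Inline annotation is easy to read and search, whereas the stand-off variant leaves the original text unchanged, which is convenient when several layers of annotation refer to the same text.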

Annotation is required in many machine-learning contexts because annotated texts are commonly used as training sets on which machine learning or deep learning models are trained. These models then predict, for unknown words or texts, what values they would most likely be assigned if the annotation were done manually. Also, it should be mentioned that many online services offer pos-tagging (e.g. [here](http://www.infogistics.com/posdemo.htm) or [here](https://linguakit.com/en/part-of-speech-tagging)).

When pos–tagged, the example sentence could look like the example below.

1. Jane/NNP likes/VBZ the/DT girl/NN

In the example above, `NNP` stands for proper noun (singular), `VBZ` stands for 3rd person singular present tense verb, `DT` for determiner, and `NN` for noun (singular or mass). The pos-tags used by the `openNLP` package are the [Penn English Treebank pos-tags](https://dpdearing.com/posts/2011/12/opennlp-part-of-speech-pos-tags-penn-english-treebank/). A more elaborate description of the tags is summarised in the table below:


library(DT)
Tag <- c("CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB")
Description <- c("Coordinating conjunction", "Cardinal number", "Determiner", "Existential there", "Foreign word", "Preposition or subordinating conjunction", "Adjective", "Adjective, comparative", "Adjective, superlative", "List item marker", "Modal", "Noun, singular or mass", "Noun, plural", "Proper noun, singular", "Proper noun, plural", "Predeterminer", "Possessive ending", "Personal pronoun", "Possessive pronoun", "Adverb", "Adverb, comparative", "Adverb, superlative", "Particle", "Symbol", "to", "Interjection", "Verb, base form", "Verb, past tense", "Verb, gerund or present participle", "Verb, past participle", "Verb, non-3rd person singular present", "Verb, 3rd person singular present", "Wh-determiner", "Wh-pronoun", "Possessive wh-pronoun", "Wh-adverb")
Examples <- c("and, or, but", "one, two, three", "a, the", "There/EX was a party in progress", "persona/FW non/FW grata/FW", "of, on, before, unless", "good, bad, ugly", "better, nicer", "best, nicest", "a., b., 1., 2.", "can, would, will", "tree, chair", "trees, chairs", "John, Paul, CIA", "Johns, Pauls, CIAs", "all/PDT this marble, many/PDT a soul", "John/NNP 's/POS, the parents/NNS '/POS distress", "I, you, he", "my, your, our", "very, enough, not", "later", "latest", "up, off, out", "CO2", "to", "uhm, uh", "go, walk", "walked, saw", "walking, seeing", "walked, thought", "walk, think", "walks, thinks", "which, that", "what, who, whom (wh-pronoun)", "whose", "how, where, why (wh-adverb)")
tags <- data.frame(Tag, Description, Examples)
# inspect results
datatable(tags, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE), filter = "none")


In [None]:
Assigning these pos-tags to words appears to be rather straightforward. However, pos-tagging is quite complex and there are various ways in which a computer can be trained to assign pos-tags. For example, one could use orthographic or morphological information to devise rules such as:

* If a word ends in *ment*, assign the pos-tag `NN` (for common noun)

* If a word does not occur at the beginning of a sentence but is capitalized, assign the pos-tag `NNP` (for proper noun)
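To make the two rules above concrete, here is a minimal sketch in base R. The function name `rule_tag` is hypothetical, and the sketch assumes that the input is a single, already tokenised sentence; it is a toy illustration rather than a usable tagger.

```r
# toy rule-based tagger: returns NA where no rule applies
rule_tag <- function(tokens) {
  tags <- rep(NA_character_, length(tokens))
  # rule 1: words ending in "ment" are tagged as common nouns
  tags[grepl("ment$", tokens)] <- "NN"
  # rule 2: capitalised words that are not sentence-initial are proper nouns
  capitalised <- grepl("^[[:upper:]]", tokens)
  not_initial <- seq_along(tokens) > 1
  tags[capitalised & not_initial] <- "NNP"
  tags
}

# only two of the four words receive a tag
rule_tag(c("The", "government", "praised", "Jane"))
```

The many `NA` values that such rules leave behind illustrate why rules of this kind cover only a small share of the words in a text.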

Using such rules has the disadvantage that pos-tags can only be assigned to a relatively small number of words, as most words will be ambiguous. Think, for instance, of the English plural morpheme and the English third person singular present tense morpheme, which are orthographically identical.

Another option would be to use a dictionary in which each word is assigned a certain pos-tag, so that a program could assign the pos-tag whenever the word occurs in a given text. This procedure has the disadvantage that most words belong to more than one word class, and pos-tagging would thus have to rely on additional information. The problem of words that belong to more than one word class can partly be remedied by including contextual information such as:

* If the previous word is a determiner and the following word is a common noun, assign the pos-tag `JJ` (for a common adjective)
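The contextual rule above could be sketched as follows. The function name `context_tag` is hypothetical, and the sketch assumes that some words have already been tagged (for instance by a dictionary lookup) while the remaining ones are still `NA`.

```r
# fill in JJ for untagged words between a determiner and a common noun
context_tag <- function(tags) {
  n <- length(tags)
  # the rule needs a left and a right neighbour
  if (n < 3) return(tags)
  for (i in 2:(n - 1)) {
    if (is.na(tags[i]) &&
        identical(tags[i - 1], "DT") &&
        identical(tags[i + 1], "NN")) {
      tags[i] <- "JJ"
    }
  }
  tags
}

# e.g. "the ??? girl": the unknown middle word is tagged as an adjective
context_tag(c("DT", NA, "NN"))
```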

This procedure works quite well, but there are still better options. The best way to pos-tag a text is to create a manually annotated training set which resembles the language variety at hand. Based on the frequency of the association between a given word and the pos-tags it is assigned in the training data, it is possible to tag a word with the pos-tag that is most often assigned to that word in the training data.

All of the above methods can and should be optimized by combining them and by additionally including pos-n-grams, i.e. determining the pos-tag of an unknown word based on which sequence of pos-tags is most similar to the sequence at hand and also most common in the training data.

This introduction is extremely superficial and only intends to scratch the surface of the basic procedures that pos-tagging relies on. The interested reader is referred to introductions to machine learning and pos-tagging such as, e.g., https://class.coursera.org/nlp/lecture/149.
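The frequency-based idea described above (assign each word the tag it carries most often in the training data) can be sketched in base R. The tiny training set and the function name `baseline_tag` are made up for illustration; a real training set would contain many thousands of manually tagged tokens.

```r
# toy training data: hand-tagged tokens (illustrative only)
train <- data.frame(
  token = c("the", "walk", "walk", "walk", "dogs", "walks"),
  tag   = c("DT",  "NN",   "VB",   "VB",   "NNS",  "VBZ")
)

# cross-tabulate words and tags, then pick the most frequent tag per word
counts <- table(train$token, train$tag)
most_frequent <- colnames(counts)[apply(counts, 1, which.max)]
names(most_frequent) <- rownames(counts)

# tag new tokens; words not seen in the training data remain NA
baseline_tag <- function(tokens) unname(most_frequent[tokens])

baseline_tag(c("the", "walk", "cats"))
```

In this sketch, *walk* is tagged as `VB` because it occurs twice as a verb and only once as a noun in the toy data, while the unseen word *cats* stays untagged. This is where contextual information such as pos-n-grams would come in.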

There are several different R packages that assist with pos-tagging texts [see @kumar2016mastering]. In this tutorial, we will use the `openNLP`, the `coreNLP`, and the `koRpus` packages (the latter accesses the TreeTagger). Each of these has advantages and shortcomings, and it is worth testing which one best matches one's needs.

**Preparation and session set up**

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R [here](https://slcladal.github.io/intror.html). For this tutorial, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.


# install packages
install.packages("tidyverse")
install.packages("igraph")
install.packages("tm")
install.packages("NLP")
install.packages("openNLP")
install.packages("openNLPdata")
install.packages("coreNLP")
install.packages("koRpus")
install.packages("koRpus.lang.en", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.de", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.es", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.nl", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.it", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.fr", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.pt", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("koRpus.lang.ru", repos="https://undocumeantit.github.io/repos/l10n/")
install.packages("flextable")
# DT and here are also used in the code below
install.packages("DT")
install.packages("here")
# install phrasemachine
phrasemachineurl <- "https://cran.r-project.org/src/contrib/Archive/phrasemachine/phrasemachine_1.1.2.tar.gz"
install.packages(phrasemachineurl, repos=NULL, type="source")
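# install parsent (this requires the pacman package)
install.packages("pacman")
pacman::p_load_gh(c("trinker/textshape", "trinker/parsent"))
# install klippy for copy-to-clipboard button in code chunks (this requires the remotes package)
install.packages("remotes")
remotes::install_github("rlesur/klippy")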


In [None]:
***

<div class="warning" style='padding:0.1em; background-color:#51247a; color:#f2f2f2'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>NOTE</b><br>Downloading and installing the jar files for CoreNLP will take quite long (between 5 and 10 minutes!). The installation will also use data from your plan (the files are approx. 350 MB). It is thus advisable to be logged into an institutional network that has decent connectivity and a good download rate (e.g., a university network).</p>
<p style='margin-left:1em;'>
</p></span>
</div>

***


# download java files for CoreNLP
# (the coreNLP:: prefix is needed because the package has not been loaded yet)
coreNLP::downloadCoreNLP()


In [None]:
Now that we have installed the packages, we activate them as shown below.



# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# load packages
library(tidyverse)
library(igraph)
library(tm)
library(NLP)
library(openNLP)
library(openNLPdata)
library(coreNLP)
library(koRpus)
library(koRpus.lang.en)
library(koRpus.lang.de)
library(koRpus.lang.es)
library(koRpus.lang.nl)
library(koRpus.lang.it)
library(koRpus.lang.fr)
library(koRpus.lang.pt)
library(koRpus.lang.ru)
library(phrasemachine)
library(flextable)
# load function for pos-tagging objects in R
source("https://slcladal.github.io/rscripts/POStagObject.r") 
# syntax tree drawing function
source("https://slcladal.github.io/rscripts/parsetgraph.R")
# activate klippy for copy-to-clipboard button
klippy::klippy()


In [None]:
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

# POS-Tagging with openNLP

In R, we can pos-tag large amounts of text by various means. This section explores pos-tagging using the `openNLP` package. Using the `openNLP` package for pos-tagging works particularly well when the aim is to pos-tag newspaper texts, as the `openNLP` package implements the *Apache OpenNLP Maxent Part of Speech tagger* and comes with pre-trained models. Ideally, pos-taggers should be trained on data resembling the data to be pos-tagged. However, I do not know how to train the *Apache OpenNLP pos-tagger* via R, and it would be great if someone would provide a tutorial on how to do that.

Using pre-trained models has the advantage that we do not need to train the pos-tagger ourselves. However, it also means that one has to rely on models trained on data that may not really resemble the data at hand. This implies that using the tagger for texts that differ from newspaper texts, i.e. the language the models have been trained on, does not work as well, as the model applies the probabilities of newspaper language to the language variety at hand. Pos-tagging with the `openNLP` package requires the `NLP` package and installing the models on which the `openNLP` package is based.

To pos-tag a text, we start by loading an example text into R.


# load corpus data
text <- readLines("https://slcladal.github.io/data/testcorpus/linguistics07.txt", skipNul = T)
# clean data
text <- text %>%
 str_squish() 


# inspect data
text %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Example text.")  %>%
  flextable::border_outer()


In [None]:
Now that the text data has been read into R, we can proceed with the part-of-speech tagging. To perform the pos-tagging, we define a function that relies on the `NLP`, `openNLP`, and `openNLPdata` packages.

***

<div class="warning" style='padding:0.1em; background-color:#51247a; color:#f2f2f2'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>NOTE</b><br>You need to change the path that is used in the code below and include the path to `en-pos-maxent.bin` on your computer!</p>
<p style='margin-left:1em;'>
</p></span>
</div>

***
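Instead of typing the path by hand, the location of a file inside an installed package can be looked up with base R's `system.file()`. The helper name `find_pkg_file` is a hypothetical convenience wrapper, and the sketch assumes that the model sits in the `models` folder of the `openNLPdata` package, as the hard-coded path in the function below suggests.

```r
# look up a file inside an installed package and fail loudly if it is missing
find_pkg_file <- function(pkg, ...) {
  path <- system.file(..., package = pkg)
  if (!nzchar(path)) {
    stop("File not found in package '", pkg, "'.")
  }
  path
}

# only run the lookup if the openNLPdata package is installed
if (requireNamespace("openNLPdata", quietly = TRUE)) {
  model_path <- find_pkg_file("openNLPdata", "models", "en-pos-maxent.bin")
  print(model_path)
}
```

If the lookup succeeds, the resulting `model_path` could be passed to the `model` argument in place of the hard-coded path.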


POStag <- function(object){
  require("stringr")
  require("NLP")
  require("openNLP")
  require("openNLPdata")
  # store the input texts
  corpus.tmp <- object
  # define sentence annotator
  sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
  # define word annotator
  word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
  # define pos annotator
  pos_tag_annotator <- openNLP::Maxent_POS_Tag_Annotator(language = "en", probs = FALSE, 
    # WARNING: YOU NEED TO INCLUDE YOUR OWN PATH HERE!                                            
    model = "C:\\Users\\marti\\OneDrive\\Dokumente\\R\\win-library\\4.1\\openNLPdata\\models\\en-pos-maxent.bin")
  # convert all file content to strings
  Corpus <- lapply(corpus.tmp, function(x){
    x <- as.String(x)  }  )
  # loop over file contents
  lapply(Corpus, function(x){
    y1 <- NLP::annotate(x, list(sent_token_annotator, word_token_annotator))
    y2<- NLP::annotate(x, pos_tag_annotator, y1)
    y2w <- subset(y2, type == "word")
    tags <- sapply(y2w$features, '[[', "POS")
    r1 <- sprintf("%s/%s", x[y2w], tags)
    r2 <- paste(r1, collapse = " ")
    return(r2)  }  )
  }


In [None]:
We now apply this function to our text.



# pos tagging data
textpos <- POStag(object = text)


# inspect data
textpos %>%
  unlist() %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_header_labels(values = list(".")) %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Pos-tagged text.")  %>%
  flextable::border_outer()


In [None]:
The resulting vector contains the part-of-speech tagged text and shows that the function fulfills its purpose of automatically pos-tagging the text. The pos-tagged text could now be processed further, e.g. by extracting all adjectives in the text or by creating concordances of nouns ending in *ment*.
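As a minimal sketch of such further processing, the following base R code extracts all adjectives from a string in the `word/TAG` format. The tagged sentence is a made-up example rather than output from the text above.

```r
# a pos-tagged string in the word/TAG format (made-up example)
tagged <- "The/DT quick/JJ fox/NN likes/VBZ the/DT lazy/JJ dog/NN ./."

# split into word/TAG pairs, then separate words from tags at the last slash
pairs <- strsplit(tagged, " ", fixed = TRUE)[[1]]
words <- sub("/[^/]*$", "", pairs)
tags  <- sub("^.*/", "", pairs)

# keep only the adjectives (tags starting with JJ, i.e. JJ, JJR, and JJS)
words[grepl("^JJ", tags)]
```

Splitting at the last slash rather than the first keeps tokens that themselves contain a slash intact.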

## POS-Tagging non-English texts

By default, `openNLP` is only able to handle English text. However, the functionality of `openNLP` can be extended to languages other than English. In order to extend `openNLP`'s functionality, you need to download the available language models from http://opennlp.sourceforge.net/models-1.5/ and save them in the `models` folder of the `openNLPdata` package in your R library. 

To ease the implementation of openNLP-based pos-tagging, we will write functions for pos-tagging that we can then apply to the texts that we want to tag, rather than chaining individual commands.

### POS-Tagging a Dutch text

One of the languages that openNLP can handle is Dutch. To pos-tag Dutch texts, we write a function that we can then apply to the Dutch text in order to pos-tag it.


postag_nl <- function(object){
  require("stringr")
  require("NLP")
  require("openNLP")
  require("openNLPdata")
  # define sentence annotator
  sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
  # define word annotator
  word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
  # define pos annotator
  pos_tag_annotator <- openNLP::Maxent_POS_Tag_Annotator(language = "nl", probs = FALSE,
    # WARNING: YOU NEED TO INCLUDE YOUR OWN PATH HERE!
    model = "C:\\Users\\marti\\OneDrive\\Dokumente\\R\\win-library\\4.1\\openNLPmodels.nl\\models\\nl-pos-maxent.bin")
  # convert all file content to strings
  Corpus <- lapply(object, function(x){
    x <- as.String(x)  }  )
  # loop over file contents
  lapply(Corpus, function(x){
    y1 <- NLP::annotate(x, list(sent_token_annotator, word_token_annotator))
    y2<- NLP::annotate(x, pos_tag_annotator, y1)
    y2w <- subset(y2, type == "word")
    tags <- sapply(y2w$features, '[[', "POS")
    r1 <- sprintf("%s/%s", x[y2w], tags)
    r2 <- paste(r1, collapse = " ")
    return(r2)  }  )
  }


In [None]:
We now apply this function to our text.



dutchtext <- readLines("D:\\Uni\\UQ\\SLC\\LADAL\\SLCLADAL.github.io\\data/dutch.txt")
# pos tagging data
textpos_nl <- postag_nl(object = dutchtext)


# inspect data
textpos_nl %>%
  unlist() %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Pos-tagged textpos_nl.")  %>%
  flextable::border_outer()


In [None]:
### POS-Tagging a German text



postag_de <- function(object){
  require("stringr")
  require("NLP")
  require("openNLP")
  require("openNLPdata")
  # define sentence annotator
  sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
  # define word annotator
  word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
  # define pos annotator
  pos_tag_annotator <- openNLP::Maxent_POS_Tag_Annotator(language = "de", probs = FALSE,
    # WARNING: YOU NEED TO INCLUDE YOUR OWN PATH HERE!
    model = "C:\\Users\\marti\\OneDrive\\Dokumente\\R\\win-library\\4.1\\openNLPmodels.de\\models\\de-pos-maxent.bin")
  # convert all file content to strings
  Corpus <- lapply(object, function(x){
    x <- as.String(x)  }  )
  # loop over file contents
  lapply(Corpus, function(x){
    y1 <- NLP::annotate(x, list(sent_token_annotator, word_token_annotator))
    y2<- NLP::annotate(x, pos_tag_annotator, y1)
    y2w <- subset(y2, type == "word")
    tags <- sapply(y2w$features, '[[', "POS")
    r1 <- sprintf("%s/%s", x[y2w], tags)
    r2 <- paste(r1, collapse = " ")
    return(r2)  }  )
  }


In [None]:
We now apply this function to our text.



gertext <- readLines("D:\\Uni\\UQ\\SLC\\LADAL\\SLCLADAL.github.io\\data/german.txt")
# pos tagging data
textpos_de <- postag_de(object = gertext)


# inspect data
textpos_de %>%
  unlist() %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Pos-tagged textpos_de.")  %>%
  flextable::border_outer()


In [None]:
***

<div class="warning" style='padding:0.1em; background-color:#51247a; color:#f2f2f2'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>NOTE</b><br>You need to change the paths that are used in the code below and include the paths to the model files (e.g. `en-pos-maxent.bin`) on your computer!</p>
<p style='margin-left:1em;'>
</p></span>
</div>

***


#### pos-tagging texts from different languages

postag_lang <- function(text, language){
  require("NLP")
  require("openNLP")
  require("openNLPdata")
  # WARNING: YOU NEED TO INCLUDE YOUR OWN PATHS HERE!
  # folder that contains the installed model packages
  lib <- "C:\\Users\\marti\\OneDrive\\Dokumente\\R\\win-library\\4.1\\"
  # paths to the pos-tagging models, one per supported language
  models <- c(
    en = paste0(lib, "openNLPdata\\models\\en-pos-maxent.bin"),
    da = paste0(lib, "openNLPmodels.da\\models\\da-pos-maxent.bin"),
    nl = paste0(lib, "openNLPmodels.nl\\models\\nl-pos-maxent.bin"),
    pt = paste0(lib, "openNLPmodels.pt\\models\\pt-pos-maxent.bin"),
    de = paste0(lib, "openNLPmodels.de\\models\\de-pos-maxent.bin"),
    es = paste0(lib, "openNLPmodels.es\\models\\es-pos-maxent.bin"),
    it = paste0(lib, "openNLPmodels.it\\models\\it-pos-maxent.bin"),
    sv = paste0(lib, "openNLPmodels.sv\\models\\sv-pos-maxent.bin")
  )
  # stop early if no model is defined for the requested language
  # (no French model path is defined above, so "fr" is not supported)
  if (!language %in% names(models)) {
    stop("Define language as either\nen, es, de, da, it, pt, sv, or nl!")
  }
  # define sentence annotator
  sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
  # define word annotator
  word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
  # define pos annotator for the requested language only
  pos_tag_annotator <- openNLP::Maxent_POS_Tag_Annotator(language = language, probs = FALSE,
    model = models[[language]])
  # convert all file content to strings
  Corpus <- lapply(text, function(x){
    x <- as.String(x)  }  )
  # loop over file contents
  lapply(Corpus, function(x){
    y1 <- NLP::annotate(x, list(sent_token_annotator, word_token_annotator))
    y2 <- NLP::annotate(x, pos_tag_annotator, y1)
    y2w <- subset(y2, type == "word")
    tags <- sapply(y2w$features, '[[', "POS")
    r1 <- sprintf("%s/%s", x[y2w], tags)
    r2 <- paste(r1, collapse = " ")
    return(r2)  }  )
  }

# pos tagging data
textpos_de2 <- postag_lang(text = gertext, language = "de")


# inspect data
textpos_de2 %>%
  unlist() %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Pos-tagged textpos_de2.")  %>%
  flextable::border_outer()


In [None]:
# POS-Tagging with TreeTagger

An alternative to `openNLP` for pos-tagging texts in R is the `koRpus` package. The `koRpus` package uses the [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), which means that the TreeTagger has to be installed before pos-tagging with the `koRpus` package. The `koRpus` package simply accesses the TreeTagger via R. The fact that the TreeTagger is software outside of R that has to be installed separately before the `koRpus` package and its functions can be used represents a major disadvantage compared to the `openNLP` approach (which also relies on external software, but this is contained within the `openNLP` package). In addition, installing the TreeTagger can be quite tedious (at least if you are running a Windows machine as I do). On the other hand, the `koRpus` package has the advantage that the TreeTagger encompasses more languages than the `openNLP` package, and the TreeTagger can be relatively easily modified and trained on new data.

## Installation on Windows

As the installation of the TreeTagger can be somewhat tedious and time consuming, I would like to expand on what I did to get it going as this might save you some time and frustration.

1. The first thing you should do is to install or re-install a Perl interpreter. You can simply do so by clicking [here](http://www.activestate.com/activeperl/) and following the installation instructions. 

2. Download the TreeTagger (click [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-windows-3.2.3.zip) for a Windows-64 version and [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-windows32-3.2.3.zip) for a Windows-32 version). Move the zip-file to the root directory on your C-drive (C:/) and extract the zip file.

3. Download the parameter files for the languages you need (you will find the parameter files under *Parameter files* [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/#Windows)). Then decompress the parameter files and move them to the sub-directory `TreeTagger/lib`. Rename the parameter files to `<language>-utf8.par`. For example, rename the file `french-par-linux-3.2-utf8.bin` as `french-utf8.par`.

4. Now you need to set a path variable. You need to add the path `C:\TreeTagger\bin` to the `PATH` environment variable. How to do this differs across Windows versions. You can find a video tutorial on how to change path variables in Windows 10 [here](https://www.youtube.com/watch?v=Frrlv_5rGhY). If you have a different Windows version, e.g. Windows 7, I recommend you search for *Set path variable in Windows 7* on YouTube and follow the instructions. 
   
5. Next, open a command prompt window and type the command:


set PATH=C:\TreeTagger\bin;%PATH%


In [None]:
6. Then, go to the directory C:\TreeTagger by typing the command:



cd c:\TreeTagger


In [None]:
7. Now, everything should be running and you can test the tagger, e.g. by pos-tagging the TreeTagger installation file. To do this, type the command:



tag-english INSTALL.txt


In [None]:
If you do not get the pos-tagged results after a few seconds, restart your computer and repeat steps 1 to 7. 
   
If you install the TreeTagger in a different directory, you have to modify the first path in the batch files tag-*.bat using an editor such as `Wordpad`.

The main issue that occurred during my installation was that I had to re-install Java, but once I had done so, it finally worked.
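Before calling the TreeTagger from R, it can help to check that the files are where the code in the next section expects them to be. The following base R sketch assumes the installation directory `C:/TreeTagger` used in the steps above; the object names are illustrative, and the paths need to be adjusted if you installed the TreeTagger elsewhere.

```r
# paths assumed by this tutorial; adjust them if you installed elsewhere
tt_dir <- "C:/TreeTagger"
tt_bat <- file.path(tt_dir, "bin", "tag-english.bat")

# does the batch file exist, and does the PATH variable mention the TreeTagger?
bat_found <- file.exists(tt_bat)
on_path   <- grepl("TreeTagger", Sys.getenv("PATH"), fixed = TRUE)

cat("Batch file found:", bat_found, "\n")
cat("TreeTagger on PATH:", on_path, "\n")
```

If either check returns `FALSE`, it is worth revisiting the installation steps above before trying to pos-tag a text.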

## Using the TreeTagger

In this example, we simply implement the TreeTagger without training it!  This is in fact not a good practice and should be avoided as I have no way of knowing how good the performance is or what I could do to improve its performance!


# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en") 
postagged <- treetag("D:\\Uni\\UQ\\SLC\\LADAL\\SLCLADAL.github.io\\data\\testcorpus/linguistics07.txt")


# inspect data
postagged@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the output.")  %>%
  flextable::border_outer()


In [None]:
We can now paste the text and the pos-tags together using the `paste` function from base R.



postaggedtext <- paste(postagged@tokens$token, postagged@tokens$tag, sep = "/", collapse = " ")


# inspect data
postaggedtext %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "TreeTagger pos-tagged text.")  %>%
  flextable::border_outer()


In [None]:
## POS-Tagging multiple files

To pos-tag several files at once, you need to create a vector of file paths and then apply the `treetag` function to each path. Once we have done so, we inspect the structure of the resulting vector, which now holds the pos-tagged texts.


corpuspath <- here::here("data/testcorpus")
# generate vector of corpus files (with full paths)
cfiles <- list.files(corpuspath, full.names = TRUE)
# set the TreeTagger environment once for all files
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en") 
# apply treetagger to each file in corpus
postagged_files <- sapply(cfiles, function(x){
  x <- treetag(x)
  # combine tokens and tags into one string per file
  paste(x@tokens$token, x@tokens$tag, sep = "/", collapse = " ")
})
# inspect structure of the pos-tagged texts
str(postagged_files)


In [None]:
The first pos-tagged corpus file looks like this.



# inspect data
postagged_files[1] %>%
  as.data.frame() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First pos-tagged corpus file.")  %>%
  flextable::border_outer()


In [None]:
Note that the tag set used for the pos-tagging above differs slightly from the Penn Treebank tag set. For example, sentence-final punctuation is tagged as `SENT` rather than as `.`.

## POS-Tagging non-English texts

In addition to being very flexible, the TreeTagger is appealing because it supports a comparatively large sample of languages. Here is the list of languages (or varieties) that are currently supported: Bulgarian, Catalan, Chinese, Coptic, Czech, Danish, Dutch, English, Estonian, Finnish, French, Spoken French, Old French, Galician, German, Spoken German, Middle High German, Greek, Ancient Greek, Hausa, Italian, Korean, Latin, Mongolian, Norwegian (Bokmaal), Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swahili, Swedish. 

Unfortunately, not all of these language models are available for R. Currently, the `koRpus` package only supports English, German, French, Spanish, Italian, and Dutch. To use the existing pos-tagging models, you need to download the respective parameter files as described above, store them in the `lib` folder of the TreeTagger folder at the root of your `C:/` drive, and install the respective `koRpus.lang` packages (see below). These language packages then have to be activated using the `library` function.
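
Since the examples in this section differ only in the batch file and the language code, the settings can be collected in one place. The following is a minimal sketch under the assumption that TreeTagger is installed in `C:/TreeTagger`, as in this tutorial; `tt_settings` is a hypothetical helper and not part of `koRpus`.

```r
# hypothetical helper (not part of koRpus): look up the TreeTagger batch file
# that belongs to a language code; tt_root is assumed to be the install folder
tt_settings <- function(lang, tt_root = "C:/TreeTagger") {
  scripts <- c(en = "tag-english.bat", de = "tag-german.bat",
               fr = "tag-french.bat",  es = "tag-spanish.bat",
               it = "tag-italian.bat", nl = "tag-dutch.bat")
  if (!lang %in% names(scripts)) {
    stop("No TreeTagger batch file known for language: ", lang)
  }
  list(TT.cmd = file.path(tt_root, "bin", scripts[[lang]]), lang = lang)
}
# settings for German
de <- tt_settings("de")
de$TT.cmd
# the result could then be passed on, e.g.
# koRpus::set.kRp.env(TT.cmd = de$TT.cmd, lang = de$lang)
```

Forward slashes are used in the path because R accepts them on Windows, which avoids the doubled backslashes.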

### POS-Tagging a German text


# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-german.bat", lang="de") 
postagged_german <- treetag(here::here("data", "german.txt"))


# inspect data
postagged_german@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the postagged_german output.")  %>%
  flextable::border_outer()


In [None]:
The German annotation uses the [Stuttgart-Tuebingen tag set](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/STTS-Tagset.pdf) (STTS) which is described in great detail [here](http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf).
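
Once a text is tagged, the tags can be used to summarize or filter the tokens. The sketch below uses a small hand-made table with STTS tags instead of the actual TreeTagger output, so the object `toy_tokens` and its contents are assumptions made for illustration only.

```r
# hand-made stand-in for a tokens table with STTS tags (not real tagger output)
toy_tokens <- data.frame(
  token = c("Das", "ist", "ein", "Haus", "."),
  tag   = c("PDS", "VAFIN", "ART", "NN", "$."),
  stringsAsFactors = FALSE
)
# frequency of each tag, most frequent first
tag_freq <- sort(table(toy_tokens$tag), decreasing = TRUE)
tag_freq
# keep only common nouns (NN) and proper nouns (NE)
nouns <- toy_tokens$token[toy_tokens$tag %in% c("NN", "NE")]
nouns
```

The same two steps should carry over to the `token` and `tag` columns of a real tagged text.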

### POS-Tagging a French text


# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-french.bat", lang="fr") 
postagged_french <- treetag(here::here("data", "french.txt"))


# inspect data
postagged_french@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the postagged_french output.")  %>%
  flextable::border_outer()


In [None]:
The tag set used for pos-tagging French texts is described [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html) and summarized below.



Tag <- c("ABR",  "ADJ",  "ADV",  "DET:ART",  "DET:POS",  "INT",  "KON",  "NAM",  "NOM",  "NUM",  "PRO",  "PRO:DEM",  "PRO:IND",  "PRO:PER",  "PRO:POS",  "PRO:REL",  "PRP",  "PRP:det",  "PUN",  "PUN:cit",  "SENT",  "SYM",  "VER:cond", "VER:futu", "VER:impe", "VER:impf", "VER:infi", "VER:pper", "VER:ppre", "VER:pres", "VER:simp", "VER:subi", "VER:subp")
Description <- c("abbreviation", "adjective", "adverb", "article", "possessive pronoun (ma, ta, ...)", "interjection", "conjunction", "proper name", "noun", "numeral", "pronoun", "demonstrative pronoun", "indefinite pronoun", "personal pronoun", "possessive pronoun (mien, tien, ...)", "relative pronoun", "preposition", "preposition plus article (au,du,aux,des)", "punctuation", "punctuation citation", "sentence tag", "symbol", "verb conditional", "verb future", "verb imperative", "verb imperfect", "verb infinitive", "verb past participle", "verb present participle", "verb present", "verb simple past", "verb subjunctive imperfect", "verb subjunctive present")
tagfr <- data.frame(Tag, Description)
# inspect  results
datatable(tagfr, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")


In [None]:
### POS-Tagging a Spanish text



# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-spanish.bat", lang="es") 
postagged_spanish <- treetag(here::here("data", "spanish.txt"))


# inspect data
postagged_spanish@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the postagged_spanish output.")  %>%
  flextable::border_outer()


In [None]:
The tag set used for pos-tagging Spanish texts is described [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt) and summarized below.



Tag <- c("ACRNM", "ADJ", "ADV", "ALFP", "ALFS", "ART", "BACKSLASH", "CARD", "CC", "CCAD", "CCNEG", "CM", "CODE", "COLON", "CQUE", "CSUBF", "CSUBI", "CSUBX", "DASH", "DM", "DOTS", "FO", "FS", "INT", "ITJN", "LP", "NC", "NEG", "NMEA", "NMON", "NP", "ORD", "PAL", "PDEL", "PE", "PERCT", "PNC", "PPC", "PPO", "PPX", "PREP", "PREP", "PREP/DEL", "QT", "QU", "REL", "RP", "SE", "SEMICOLON", "SLASH", "SYM", "UMMX", "VCLIger", "VCLIinf", "VCLIfin", "VEadj", "VEfin", "VEger", "VEinf", "VHadj", "VHfin", "VHger", "VHinf", "VLadj", "VLfin", "VLger", "VLinf", "VMadj", "VMfin", "VMger", "VMinf", "VSadj", "VSfin", "VSger", "VSinf")
Description <- c("acronym (ISO, CEI)", "Adjectives (mayores, mayor)", "Adverbs (muy, demasiado, cómo)", "Plural letter of the alphabet (As/Aes, bes)", "Singular letter of the alphabet (A, b)", "Articles (un, las, la, unas)", "backslash", "Cardinals", "Coordinating conjunction (y, o)", "Adversative coordinating conjunction (pero)", "Negative coordinating conjunction (ni)", "comma (,)", "Alphanumeric code", "colon (:)", "que (as conjunction)", "Subordinating conjunction that introduces finite clauses (apenas)", "Subordinating conjunction that introduces infinite clauses (al)", "Subordinating conjunction underspecified for subord-type (aunque)", "dash (-)", "Demonstrative pronouns (ésas, ése, esta)", "pos-tag for ...", "Formula", "Full stop punctuation marks", "Interrogative pronouns (quiénes, cuántas, cuánto)", "Interjection (oh, ja)", "left parenthesis", "Common nouns (mesas, mesa, libro, ordenador)", "Negation", "measure noun (metros, litros)", "month name", "Proper nouns", "Ordinals (primer, primeras, primera)", "Portmanteau word formed by a and el", "Portmanteau word formed by de and el", "Foreign word", "percent sign", "Unclassified word", "Clitic personal pronoun (le, les)", "Possessive pronouns (mi, su, sus)", "Clitics and personal pronouns (nos, me, nosotras, te, sí)", "Negative preposition (sin)", "Preposition", "Complex preposition (después del)", "quotation symbol", "Quantifiers (sendas, cada)", "Relative pronouns (cuyas, cuyo)", "right parenthesis", "Se (as particle)", "semicolon", "slash", "Symbols", "measure unit (MHz, km, mA)", "clitic gerund verb", "clitic infinitive verb", "clitic finite verb", "Verb estar. Past participle", "Verb estar. Finite", "Verb estar. Gerund", "Verb estar. Infinitive", "Verb haber. Past participle", "Verb haber. Finite", "Verb haber. Gerund", "Verb haber. Infinitive", "Lexical verb. Past participle", "Lexical verb. Finite", "Lexical verb. Gerund", "Lexical verb. Infinitive", "Modal verb. Past participle", "Modal verb. Finite", "Modal verb. Gerund", "Modal verb. Infinitive", "Verb ser. Past participle", "Verb ser. Finite", "Verb ser. Gerund", "Verb ser. Infinitive")
tagfr <- data.frame(Tag, Description)
# inspect  results
datatable(tagfr, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")


In [None]:
### POS-Tagging an Italian text



# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-italian.bat", lang="it") 
postagged_italian <- treetag(here::here("data", "italian.txt"))


# inspect data
postagged_italian@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the postagged_italian output.")  %>%
  flextable::border_outer()


In [None]:
The tag set used for pos-tagging Italian texts is described [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt) and summarized below.



Tag <- c("ABR", "ADJ", "ADV", "CON", "DET:def", "DET:indef", "FW", "INT", "LS", "NOM", "NPR", "NUM", "PON", "PRE", "PRE:det", "PRO", "PRO:demo", "PRO:indef", "PRO:inter", "PRO:pers", "PRO:poss", "PRO:refl", "PRO:rela", "SENT", "SYM", "VER:cimp", "VER:cond", "VER:cpre", "VER:futu", "VER:geru", "VER:impe", "VER:impf", "VER:infi", "VER:pper", "VER:ppre", "VER:pres", "VER:refl:infi", "VER:remo")
Description <- c("abbreviation", "adjective", "adverb", "conjunction", "definite article", "indefinite article", "foreign word", "interjection", "list symbol", "noun", "name", "numeral", "punctuation", "preposition", "preposition+article", "pronoun", "demonstrative pronoun", "indefinite pronoun", "interrogative pronoun", "personal pronoun", "possessive pronoun", "reflexive pronoun", "relative pronoun", "sentence marker", "symbol", "verb conjunctive imperfect", "verb conditional", "verb conjunctive present", "verb future tense", "verb gerund", "verb imperative", "verb imperfect", "verb infinitive", "verb participle perfect", "verb participle present", "verb present", "verb reflexive infinitive", "verb simple past")
tagfr <- data.frame(Tag, Description)
# inspect  results
datatable(tagfr, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")


In [None]:
### POS-Tagging a Dutch text



# perform  POS  tagging
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-dutch.bat", lang="nl") 
postagged_dutch <- treetag(here::here("data", "dutch.txt"))


# inspect data
postagged_dutch@tokens %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "First 10 rows of the postagged_dutch output.")  %>%
  flextable::border_outer()


In [None]:
The tag set used for pos-tagging Dutch texts is described [here](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt) and summarized below.



Tag <- c("$.", "adj", "adj*kop", "adjabbr", "adv", "advabbr", "conjcoord", "conjsubo", "det__art", "det__demo", "det__indef", "det__poss", "det__quest", "det__rel", "int", "noun*kop", "nounabbr", "nounpl", "nounprop", "nounsg", "num__card", "num__ord", "partte", "prep", "prepabbr", "pronadv", "prondemo", "pronindef", "pronpers", "pronposs", "pronquest", "pronrefl", "pronrel", "punc", "verbinf", "verbpapa", "verbpastpl", "verbpastsg", "verbpresp", "verbprespl", "verbpressg")
Description <- c("sentence-final punctuation", "adjective", "truncated adjective", "abbreviated adjective", "adverb", "abbreviated adverb", "coordinating conjunction", "subordinating conjunction", "article", "attributively used demonstrative pronoun", "attributively used indefinite pronoun", "attributively used possessive pronoun", "attributively used question pronoun", "attributively used relative pronoun", "interjection", "truncated noun", "abbreviated noun", "plural noun", "proper name", "singular noun", "cardinal number", "ordinal number", "particle te", "preposition", "abbreviated preposition", "pronominal adverb", "demonstrative pronoun (used substitutively)", "indefinite pronoun", "personal pronoun", "possessive pronoun", "question pronoun", "reflexive pronoun", "relative pronoun", "(non-sentential) punctuation", "infinitival verb", "past participle verb", "plural past tense verb", "singular past tense verb", "present participle verb", "plural present tense verb", "singular present tense verb")
tagfr <- data.frame(Tag, Description)
# inspect  results
datatable(tagfr, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")


In [None]:
# POS-Tagging with coreNLP

Another package that is very handy for pos-tagging and, more importantly, for syntactic parsing is the `coreNLP` package [see @arnold2015humanities]. 

***

<div class="warning" style='padding:0.1em; background-color:#51247a; color:#f2f2f2'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>NOTE</b><br>Unfortunately, we cannot use it at this point because my machine runs out of memory when I try to run the code below.</p>
<p style='margin-left:1em;'>
</p></span>
</div>

***


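# set the Java heap size (this has to happen before rJava/coreNLP is loaded)
options(java.parameters = "-Xmx4096m")
# initialise coreNLP (requires the Stanford CoreNLP models to be installed)
initCoreNLP(mem = "8g")
# load the text and annotate it (annotateString expects a single string)
text <- readLines("https://slcladal.github.io/data/english.txt")
annotation <- annotateString(paste(text, collapse = " "))
annotation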


In [None]:
# Syntactic Parsing

Parsing refers to another type of annotation in which either structural information (as in the case of XML documents) or syntactic relations are added to text. As syntactic parsing is commonly more relevant in the language sciences, the following will focus only on syntactic parsing. Syntactic parsing builds on pos-tagging and allows us to draw syntactic trees or dependencies. Unfortunately, syntactic parsing still has relatively high error rates when dealing with informal language. It is, however, considerably more reliable when dealing with formal written language.


text <- readLines("https://slcladal.github.io/data/english.txt")
# convert character to string
s <- as.String(text)
# define sentence and word token annotator
sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
# apply sentence and word annotator
a2 <- NLP::annotate(s, list(sent_token_annotator, word_token_annotator))
# define syntactic parsing annotator
parse_annotator <- openNLP::Parse_Annotator()
# apply parser
p <- parse_annotator(s, a2)
# extract parsed information
ptexts <- sapply(p$features, '[[', "parse")
ptexts


# read into NLP Tree objects.
ptrees <- lapply(ptexts, Tree_parse)
# show first tree
ptrees[[1]]


In [None]:
These trees can, of course, also be shown visually, for instance, in the form of a syntax tree (or tree dendrogram). 



# generate syntax tree
parse2graph(ptexts[1], leaf.color='red',
            # to put sentence in title (not advisable for long sentences)
            #title = stringr::str_squish(stringr::str_remove_all(ptexts[1], "\\(\\,{0,1}[A-Z]{0,4}|\\)")), 
            margin=-0.05,
            vertex.color=NA,
            vertex.frame.color=NA, 
            vertex.label.font=2,
            vertex.label.cex=.75,  
            asp=.8,
            edge.width=.5, 
            edge.color='gray', 
            edge.arrow.size=0)


In [None]:
Syntax trees are very handy because they allow us to check how reliably the parser performed. 

We can use the `get_phrase_type_regex` function from the `parsent` package written by Tyler Rinker to extract phrases from the parsed tree.


pacman::p_load_gh(c("trinker/textshape", "trinker/parsent"))
nps <- get_phrase_type_regex(ptexts[1], "NP") %>%
  unlist()
# inspect
nps


In [None]:
We can now remove the brackets and phrase labels from the extracted phrases so that only the leaves, i.e. the words themselves, remain.



nps_text <- stringr::str_squish(stringr::str_remove_all(nps, "\\(\\,{0,1}[A-Z]{0,4}|\\)"))
# inspect
nps_text
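
The regular expression above removes opening brackets followed by up to four capital letters, so labels that contain other characters (such as `PRP$`) can leave residue behind. An alternative is to keep only the material that directly precedes a closing bracket, because in bracketed parses that is where the words sit. The following is a minimal sketch in base R; `parse_leaves` is a hypothetical helper and the example strings are made up.

```r
# hypothetical helper: return the words (leaves) of a bracketed parse string
parse_leaves <- function(parse) {
  # a leaf is a run of non-bracket, non-space characters directly before ")"
  m <- gregexpr("[^()[:space:]]+(?=\\))", parse, perl = TRUE)
  sapply(regmatches(parse, m), paste, collapse = " ")
}
# toy examples (made up for illustration)
leaves1 <- parse_leaves("(NP (DT the) (JJ little) (NN girl))")
leaves2 <- parse_leaves("(NP (PRP$ her) (NN book))")
leaves1
leaves2
```

Because the helper never looks at the labels themselves, it does not depend on the length or spelling of the phrase and word class tags.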


In [None]:
Unfortunately, we can only extract top-level phrases (NPs that are embedded within other NPs are not extracted separately).
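
As an aside, embedded phrases can also be recovered from the bracketed parse string itself by matching each closing bracket with its opening bracket. The following is a minimal sketch of this idea in base R; `extract_nested` is a hypothetical helper, and the parse string is made up rather than produced by the parser.

```r
# hypothetical helper: collect every constituent with a given label,
# including constituents that are embedded in others
extract_nested <- function(parse, label = "NP") {
  chars <- strsplit(parse, "")[[1]]
  open_pos <- integer(0)   # stack with the positions of unmatched "("
  found <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      open_pos <- c(open_pos, i)
    } else if (chars[i] == ")" && length(open_pos) > 0) {
      # the most recent "(" belongs to this ")"
      start <- open_pos[length(open_pos)]
      open_pos <- open_pos[-length(open_pos)]
      constituent <- substr(parse, start, i)
      if (startsWith(constituent, paste0("(", label, " "))) {
        found <- c(found, constituent)
      }
    }
  }
  found
}
# toy parse (made up for illustration)
toy_parse <- "(S (NP (NP (DT the) (NN girl)) (PP (IN with) (NP (DT the) (NN hat)))) (VP (VBZ smiles)))"
all_nps <- extract_nested(toy_parse, "NP")
all_nps
```

Constituents are returned in the order in which they close, so embedded phrases appear before the phrases that contain them.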

In order to extract all phrases, we can use the `phrasemachine` package, which is available from the CRAN archive. 

We now load the `phrasemachine` package and pos-tag the text(s). We will simply re-use the English text we pos-tagged before.


# pos tag text
tagged_documents <- phrasemachine::POS_tag_documents(text)
# inspect
tagged_documents


In [None]:
In a next step, we can use the `extract_phrases` function to extract phrases.



#extract phrases
phrases <- phrasemachine::extract_phrases(tagged_documents,
                                          regex = "(A|N)*N(PD*(A|N)*N)*",
                                          maximum_ngram_length = 8,
                                          minimum_ngram_length = 1)
# inspect
phrases


In [None]:
Now, we have all noun phrases that occur in the English sample text.
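
To make the pattern `(A|N)*N(PD*(A|N)*N)*` more tangible, the sketch below applies it by hand to a toy sentence: each tag is reduced to a single letter (A = adjective, N = noun, P = preposition, D = determiner, O = other), the letters are pasted into one string, and the regular expression is matched against that string. The tag mapping and the toy data are assumptions made for illustration. Note also that this sketch only returns the longest non-overlapping matches and is not meant to reproduce the output of `extract_phrases`.

```r
# toy sentence with Penn Treebank tags (made up for illustration)
tokens <- c("the", "old", "man", "of", "the", "sea", "sleeps")
tags   <- c("DT", "JJ", "NN", "IN", "DT", "NN", "VBZ")
# assumed mapping from full tags to one-letter coarse tags
coarse <- c(DT = "D", JJ = "A", NN = "N", IN = "P")
# one letter per token; unmapped tags become "O" (other)
tag_string <- paste(ifelse(tags %in% names(coarse), coarse[tags], "O"),
                    collapse = "")
# match the noun phrase pattern against the tag string
m <- gregexpr("(A|N)*N(PD*(A|N)*N)*", tag_string, perl = TRUE)[[1]]
starts <- as.integer(m)
ends   <- starts + attr(m, "match.length") - 1
# map the matched positions back to the tokens
toy_phrases <- mapply(function(s, e) paste(tokens[s:e], collapse = " "),
                      starts, ends)
toy_phrases
```

Because every token corresponds to exactly one letter, the start and end positions of a match in the tag string can be used directly as token indices.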

# Citation & Session Info 

Schweinberger, Martin. `r format(Sys.time(), '%Y')`. *POS-Tagging and Syntactic Parsing with R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/tagging.html (Version `r format(Sys.time(), '%Y.%m.%d')`).


@manual{schweinberger`r format(Sys.time(), '%Y')`pos,
  author = {Schweinberger, Martin},
  title = {POS-Tagging and Syntactic Parsing with R},
  note = {https://slcladal.github.io/tagging.html},
  year = {`r format(Sys.time(), '%Y')`},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {`r format(Sys.time(), '%Y.%m.%d')`}
}


sessionInfo()


In [None]:

# References
