![This is an interactive LADAL notebook.](https://slcladal.github.io/images/uq1.jpg)

***

Please copy this Jupyter notebook so that you are able to edit it.

Simply go to: File > Save a copy in Drive.

If you want to run this notebook on your own computer, you need to do 2 things:

1. Make sure that you have R installed.

2. You need to download the [bibliography file](https://slcladal.github.io/bibliography.bib) and store it in the same folder where you store the Rmd file.

Once you have done that, you are good to go.

***

# POS-Tagging and Syntactic Parsing with R

This tutorial introduces part-of-speech tagging and syntactic parsing using R. It is aimed at beginners and intermediate users of R and showcases how to annotate textual data with part-of-speech (pos) tags and how to syntactically parse textual data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with pos-tagging and syntactic parsing. Another highly recommendable tutorial on part-of-speech tagging in R with UDPipe is available [here](https://bnosac.github.io/udpipe/en/), and another tutorial on pos-tagging and syntactic parsing by Andreas Niekler and Gregor Wiedemann can be found [here](https://tm4ss.github.io/docs/Tutorial_8_NER_POS.html) (see Wiedemann and Niekler 2017).

# Part-Of-Speech Tagging

Many analyses of language data require that we distinguish different parts of speech. In order to determine the word class of a certain word, we use a procedure which is called part-of-speech tagging (commonly referred to as POS-, pos-, or PoS-tagging). Pos-tagging is a common procedure when working with natural language data. Despite being used quite frequently, it is a rather complex issue that requires the application of statistical methods that are quite advanced. In the following, we will explore different options for pos-tagging and syntactic parsing.

Parts-of-speech, or word categories, refer to the grammatical nature or category of a lexical item, e.g. in the sentence *Jane likes the girl* each lexical item can be classified according to whether it belongs to the group of determiners, verbs, nouns, etc. Pos-tagging refers to a computational process in which information is added to existing text. This process is also called *annotation*. Annotation can be very different depending on the task at hand. The most common type of annotation when it comes to language data is part-of-speech tagging, where the word class is determined for each word in a text and the word class is then added to the word as a tag. However, there are many different ways to tag or annotate texts.

Pos-tagging assigns part-of-speech tags to character strings (these represent mostly words, of course, but also encompass punctuation marks and other elements). This means that pos-tagging is one specific type of annotation, i.e. adding information to data (either by directly adding information to the data itself or by storing information in e.g. a list which is linked to the data). It is important to note that annotation encompasses various types of information such as pauses, overlap, etc. Pos-tagging is just one of these many ways in which corpus data can be *enriched*. Sentiment Analysis, for instance, also annotates texts or words with respect to their emotional value or polarity.
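
The two ways of storing annotation mentioned above (writing it into the data itself versus keeping it in a separate, linked object) can be illustrated with a minimal sketch in base R. The object names (`standoff`, `inline`) are made up for this illustration and the tags are assigned by hand rather than produced by a tagger.

```r
# the raw text stays untouched
text <- "Jane likes the girl"

# locate each token: start position and length of every run of non-space characters
m <- gregexpr("[^ ]+", text)[[1]]

# linked ("stand-off") annotation: a separate table that points into the text
standoff <- data.frame(
  start = as.integer(m),
  end   = as.integer(m) + attr(m, "match.length") - 1L,
  tag   = c("NNP", "VBZ", "DT", "NN"),  # hand-assigned tags, for illustration only
  stringsAsFactors = FALSE
)

# recover the tokens from the character offsets
standoff$token <- substring(text, standoff$start, standoff$end)

# inline annotation: the tags are written into the text itself
inline <- paste0(standoff$token, "/", standoff$tag, collapse = " ")
inline
```

The linked table leaves the original text intact, which is convenient when several layers of annotation are added to the same data; the inline version is easier to read and to search with regular expressions.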

Annotation is required in many machine-learning contexts because annotated texts are commonly used as training sets on which machine learning or deep learning models are trained that then predict, for unknown words or texts, what values they would most likely be assigned if the annotation were done manually. Also, it should be mentioned that many online services offer pos-tagging (e.g. [here](http://www.infogistics.com/posdemo.htm) or [here](https://linguakit.com/en/part-of-speech-tagging)).

When pos–tagged, the example sentence could look like the example below.

1. Jane/NNP likes/VBZ the/DT girl/NN

In the example above, `NNP` stands for proper noun (singular), `VBZ` stands for 3rd person singular present tense verb, `DT` for determiner, and `NN` for noun (singular or mass). The pos-tags used by the `openNLP` package are the [Penn English Treebank pos-tags](https://dpdearing.com/posts/2011/12/opennlp-part-of-speech-pos-tags-penn-english-treebank/). A more elaborate overview of the tags is shown below:

![Overview of Penn English Treebank part-of-speech tags.](https://slcladal.github.io/images/postagtb.png)
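
To make the `token/TAG` format more tangible, the following sketch splits the tagged example sentence back into a table and attaches the tag descriptions given above. It uses only base R; the names `tag_table` and `tag_labels` are chosen for this illustration, and the approach assumes that tokens do not themselves contain a slash.

```r
# tagged example sentence from above
tagged <- "Jane/NNP likes/VBZ the/DT girl/NN"

# split into token/tag pairs: first at spaces, then at the slash
pairs <- strsplit(strsplit(tagged, " ")[[1]], "/", fixed = TRUE)

# collect tokens and tags in a data frame
tag_table <- data.frame(
  token = vapply(pairs, `[`, character(1), 1),
  tag   = vapply(pairs, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# descriptions of the four tags that occur in the example
tag_labels <- c(
  NNP = "proper noun (singular)",
  VBZ = "3rd person singular present tense verb",
  DT  = "determiner",
  NN  = "noun (singular or mass)"
)

# look up the description for each tag
tag_table$description <- unname(tag_labels[tag_table$tag])
tag_table
```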

Assigning these pos-tags to words appears to be rather straightforward. However, pos-tagging is quite complex and there are various ways by which a computer can be trained to assign pos-tags. For example, one could use orthographic or morphological information to devise rules such as:

* If a word ends in *ment*, assign the pos-tag `NN` (for common noun)

* If a word does not occur at the beginning of a sentence but is capitalized, assign the pos-tag `NNP` (for proper noun)
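
As a minimal sketch, the two rules above could be written in base R as follows. The function name `rule_tag` is made up for this illustration, and the input is assumed to be the tokens of a single sentence. Words that are not covered by either rule remain untagged (`NA`).

```r
# toy rule-based tagger implementing the two orthographic rules above
rule_tag <- function(tokens) {
  # start with no tag for any token
  tags <- rep(NA_character_, length(tokens))
  # rule 1: words ending in "ment" are tagged as common nouns
  tags[grepl("ment$", tokens)] <- "NN"
  # rule 2: capitalized words that are not sentence-initial are tagged as proper nouns
  capitalised <- grepl("^[[:upper:]]", tokens)
  not_first   <- seq_along(tokens) > 1
  tags[capitalised & not_first] <- "NNP"
  tags
}

# apply the rules to an example sentence
tokens <- c("The", "government", "praised", "Jane")
data.frame(token = tokens, tag = rule_tag(tokens))
```

In this example, only *government* and *Jane* receive a tag, while *The* and *praised* stay untagged.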

Using such rules has the disadvantage that pos-tags can only be assigned to a relatively small number of words as most words will be ambiguous (think of the English plural (-(e)s) and the English 3rd person, present tense indicative morpheme (-(e)s), for instance, which are orthographically identical).

Another option would be to use a dictionary in which each word is assigned a certain pos-tag and a program could assign the pos-tag if the word occurs in a given text. This procedure has the disadvantage that most words belong to more than one word class and pos-tagging would thus have to rely on additional information.

The problem of words that belong to more than one word class can partly be remedied by including contextual information such as:

* If the previous word is a determiner and the following word is a common noun, assign the pos-tag `JJ` (for a common adjective)
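
The combination of a dictionary lookup and the contextual rule above can be sketched as follows. The tiny `lexicon` and the function name `apply_context_rule` are invented for this illustration; a real dictionary would be far larger.

```r
# toy dictionary with unambiguous entries only (hypothetical, for illustration)
lexicon <- c(the = "DT", rain = "NN", fell = "VBD")

# contextual rule: an untagged word between a determiner and a common noun is tagged JJ
apply_context_rule <- function(tags) {
  n <- length(tags)
  if (n < 3) return(tags)
  for (i in 2:(n - 1)) {
    if (is.na(tags[i]) &&
        isTRUE(tags[i - 1] == "DT") &&
        isTRUE(tags[i + 1] == "NN")) {
      tags[i] <- "JJ"
    }
  }
  tags
}

# step 1: dictionary lookup ("light" is not in the dictionary and stays NA)
tokens <- c("the", "light", "rain", "fell")
tags   <- unname(lexicon[tokens])

# step 2: fill the remaining gap using the context
tags <- apply_context_rule(tags)
data.frame(token = tokens, tag = tags)
```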


This procedure works quite well but there are still better options.

The best way to pos-tag a text is to create a manually annotated training set which resembles the language variety at hand. Based on the frequency of the association between a given word and the pos-tags it is assigned in the training data, it is possible to tag a word with the pos-tag that is most often assigned to the given word in the training data.

All of the above methods can and should be optimized by combining them and additionally including pos-n-grams, i.e. determining a pos-tag of an unknown word based on which sequence of pos-tags is most similar to the sequence at hand and also most common in the training data.

This introduction is extremely superficial and only intends to scratch the surface of the basic procedures that pos-tagging relies on. The interested reader is referred to introductions on machine learning and pos-tagging such as https://class.coursera.org/nlp/lecture/149.
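
The frequency-based idea described above (tagging a word with the pos-tag it most often carries in the training data) can be sketched in a few lines of base R. The training data below are invented for this illustration and much too small for real use; the names `train`, `most_frequent_tag`, and `tag_tokens` are likewise hypothetical.

```r
# tiny hand-made training set (invented for illustration)
train <- data.frame(
  token = c("the", "run", "was", "long", "they", "run", "daily", "a", "run", "helps"),
  tag   = c("DT",  "NN",  "VBD", "JJ",   "PRP",  "VBP", "RB",    "DT", "NN", "VBZ"),
  stringsAsFactors = FALSE
)

# count how often each word occurs with each tag
counts <- table(train$token, train$tag)

# for each word, keep the tag with the highest count
most_frequent_tag <- colnames(counts)[apply(counts, 1, which.max)]
names(most_frequent_tag) <- rownames(counts)

# tag new tokens; words not seen in the training data get NA
tag_tokens <- function(tokens) {
  unname(most_frequent_tag[tolower(tokens)])
}

tag_tokens(c("They", "run", "fast"))
```

Note that *run* is tagged `NN` here because it occurs more often as a noun in the training data, although it is a verb in *They run fast*. This is exactly the kind of error that contextual information such as pos-n-grams is meant to reduce.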


There are several different R packages that assist with pos-tagging texts (see Kumar and Paul 2016). In this tutorial, we will use the `udpipe` (Wijffels 2021) and the `openNLP` packages (Hornik 2019). Each of these has advantages and shortcomings, and it is worth trying both to see which result best matches one's needs. That said, the `udpipe` package is easy to use, covers a wide range of languages, is very flexible, and very accurate.

**Preparation and session set up**

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R [here](https://slcladal.github.io/intror.html). For this tutorial, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes to install all of the packages), so you do not need to worry if it takes a while.


In [None]:
# install packages
install.packages("tidyverse")
install.packages("igraph")
install.packages("tm")
install.packages("NLP")
install.packages("openNLP")
install.packages("openNLPdata")
install.packages("udpipe")
install.packages("textplot") 
install.packages("ggraph") 
install.packages("ggplot2") 
install.packages("pacman") 
# install phrasemachine
phrasemachineurl <- "https://cran.r-project.org/src/contrib/Archive/phrasemachine/phrasemachine_1.1.2.tar.gz"
install.packages(phrasemachineurl, repos=NULL, type="source")
# install parsent
pacman::p_load_gh(c("trinker/textshape", "trinker/parsent"))


Now that we have installed the packages, we activate them as shown below.



In [None]:
# set options
options(stringsAsFactors = FALSE)     # do not convert strings to factors
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# load packages
library(tidyverse)
library(igraph)
library(tm)
library(NLP)
library(openNLP)
library(openNLPdata)
library(udpipe)
library(textplot)
library(ggraph)
library(ggplot2)
library(phrasemachine)
# load function for pos-tagging objects in R
source("https://slcladal.github.io/rscripts/POStagObject.r") 
# syntax tree drawing function
source("https://slcladal.github.io/rscripts/parsetgraph.R")


Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

# POS-Tagging with UDPipe

UDPipe was developed at the Charles University in Prague, and the `udpipe` R package provides a very easy and handy way for language-agnostic tokenization, pos-tagging, lemmatization, and dependency parsing of raw text in R. It is particularly handy because it addresses major shortcomings of previous methods for pos-tagging, namely:

* it offers a wide range of language models (64 languages at this point)
* it does not rely on external software (like, e.g., TreeTagger, that had to be installed separately and could be challenging when using different operating systems)
* it is really easy to implement as one only needs to install and load the `udpipe` package and download and activate the language model one is interested in
* it allows users to train and tune their own models rather easily

The available pre-trained language models in UDPipe are:

* Afrikaans: afrikaans-afribooms   
* Ancient Greek:   
  + ancient_greek-perseus  
  + ancient_greek-proiel   
* Arabic: arabic-padt   
* Armenian: armenian-armtdp   
* Basque: basque-bdt   
* Belarusian: belarusian-hse   
* Bulgarian: bulgarian-btb   
* Buryat: buryat-bdt  
* Catalan: catalan-ancora   
* Chinese:   
  + chinese-gsd  
  + chinese-gsdsimp  
  + classical_chinese-kyoto  
* Coptic: coptic-scriptorium  
* Croatian: croatian-set  
* Czech  
  + czech-cac  
  + czech-cltt  
  + czech-fictree  
  + czech-pdt  
* Danish: danish-ddt  
* Dutch  
  + dutch-alpino  
  + dutch-lassysmall  
* English  
  + english-ewt  
  + english-gum  
  + english-lines  
  + english-partut  
* Estonian   
  + estonian-edt  
  + estonian-ewt  
* Finnish  
  + finnish-ftb  
  + finnish-tdt  
* French  
  + french-gsd  
  + french-partut  
  + french-sequoia  
  + french-spoken  
* Galician  
  + galician-ctg  
  + galician-treegal  
* German  
  + german-gsd  
  + german-hdt  
* Gothic: gothic-proiel  
* Greek: greek-gdt  
* Hebrew: hebrew-htb  
* Hindi: hindi-hdtb  
* Hungarian: hungarian-szeged  
* Indonesian: indonesian-gsd  
* Irish Gaelic: irish-idt  
* Italian  
  + italian-isdt  
  + italian-partut  
  + italian-postwita  
  + italian-twittiro  
  + italian-vit  
* Japanese: japanese-gsd  
* Kazakh: kazakh-ktb  
* Korean  
  + korean-gsd  
  + korean-kaist  
* Kurmanji: kurmanji-mg  
* Latin  
  + latin-ittb  
  + latin-perseus  
  + latin-proiel  
* Latvian: latvian-lvtb  
* Lithuanian  
  + lithuanian-alksnis  
  + lithuanian-hse  
* Maltese: maltese-mudt  
* Marathi: marathi-ufal  
* North Sami: north_sami-giella  
* Norwegian  
  + norwegian-bokmaal  
  + norwegian-nynorsk  
  + norwegian-nynorsklia  
* Old Church Slavonic: old_church_slavonic-proiel  
* Old French: old_french-srcmf  
* Old Russian: old_russian-torot  
* Persian: persian-seraji  
* Polish  
  + polish-lfg  
  + polish-pdb  
  + polish-sz  
* Portuguese  
  + portuguese-bosque  
  + portuguese-br  
  + portuguese-gsd  
* Romanian  
  + romanian-nonstandard  
  + romanian-rrt  
* Russian  
  + russian-gsd  
  + russian-syntagrus  
  + russian-taiga  
* Sanskrit: sanskrit-ufal  
* Scottish Gaelic: scottish_gaelic-arcosg  
* Serbian: serbian-set  
* Slovak: slovak-snk  
* Slovenian  
  + slovenian-ssj  
  + slovenian-sst  
* Spanish  
  + spanish-ancora  
  + spanish-gsd  
* Swedish  
  + swedish-lines  
  + swedish-talbanken  
* Tamil: tamil-ttb  
* Telugu: telugu-mtg  
* Turkish: turkish-imst  
* Ukrainian: ukrainian-iu  
* Upper Sorbian: upper_sorbian-ufal  
* Urdu: urdu-udtb  
* Uyghur: uyghur-udt  
* Vietnamese: vietnamese-vtb   
* Wolof: wolof-wtb   


To download any of these models, we can use the `udpipe_download_model` function. For example, to download the `english-ewt` model, we would use the call: `udpipe::udpipe_download_model(language = "english-ewt")`.

We start by loading a text.


In [None]:
# load text
text <- readLines("https://slcladal.github.io/data/testcorpus/linguistics06.txt", skipNul = T)
# clean data
text <- text %>%
 str_squish() 


***

You can also use your own data. The code chunk below shows you how to upload two files from your own computer **BUT** to be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Colab Folder Symbol](https://slcladal.github.io/images/ColabFolder.png)

Then on the upload symbol. 

![Colab Upload Symbol](https://slcladal.github.io/images/ColabUpload.png)

Next, upload the files you want to analyze and then enter the respective file names in the `file` argument of the `scan` function. When you then execute the code (like the code chunk below), you will load your own data.


In [None]:
mytext1 <- scan(file = "linguistics01.txt",
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T) %>%
            paste0(collapse = " ")
mytext2 <- scan(file = "linguistics02.txt",
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T) %>%
            paste0(collapse = " ")
# inspect
mytext1; mytext2


To apply the code and functions below to your own data, you will need to modify the code chunks and replace the data we use here with your own data object. 
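
If you have more than two files, repeating the `scan` call for each file becomes tedious. The following sketch reads every `.txt` file in a folder into a named character vector using base R. To keep the example self-contained, it first writes two small demo files into a temporary folder; the folder and file names (`mycorpus`, `doc1.txt`, `doc2.txt`) are made up for this illustration and would be replaced by the location of your own files.

```r
# create a temporary folder with two small demo files (stand-ins for your own data)
demo_dir <- file.path(tempdir(), "mycorpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Linguistics is the scientific", "study of language."),
           file.path(demo_dir, "doc1.txt"))
writeLines("Syntax studies sentence structure.",
           file.path(demo_dir, "doc2.txt"))

# list all txt files in the folder
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# read each file and collapse its lines into a single string
mytexts <- vapply(files, function(f) {
  paste(readLines(f, skipNul = TRUE, warn = FALSE), collapse = " ")
}, character(1))

# use the file names as names of the texts
names(mytexts) <- basename(files)
# inspect
mytexts
```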

***


Now that we have a text that we can work with, we will download a pre-trained language model.


In [None]:
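# download language model (this returns a data frame describing the download)
m_eng_dl <- udpipe::udpipe_download_model(language = "english-ewt")
# load the downloaded model so that it can be used for annotation
m_eng <- udpipe::udpipe_load_model(file = m_eng_dl$file_model)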


If you have downloaded a model before, you can also load it directly from the current Google Drive directory instead of downloading it again.



In [None]:
# load language model from your computer after you have downloaded it once
#m_eng <- udpipe_load_model(file = "english-ewt-ud-2.5-191206.udpipe")


We can now use the model to annotate our text.



In [None]:
# tokenise, tag, dependency parsing
text_anndf <- udpipe::udpipe_annotate(m_eng, x = text) %>%
  as.data.frame() %>%
  dplyr::select(-sentence)
# inspect
head(text_anndf, 10)


It can be useful to extract only the words and their pos-tags and convert them back into a text format (rather than a tabular format). 



In [None]:
tagged_text <- paste(text_anndf$token, "/", text_anndf$xpos, collapse = " ", sep = "")
# inspect tagged text
tagged_text
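
Because the annotation is an ordinary data frame, it can also be summarised with standard R tools. The sketch below uses a small hand-made data frame that only mimics some of the columns of the `udpipe` output (`sentence_id`, `token`, `upos`, `xpos`), so that it runs without downloading a model; the values are written by hand and are not actual model output.

```r
# hand-made stand-in for an annotated data frame (not actual udpipe output)
anndf <- data.frame(
  sentence_id = c(1, 1, 1, 1, 2, 2, 2),
  token = c("Jane", "likes", "the", "girl", "She", "smiles", "."),
  upos  = c("PROPN", "VERB", "DET", "NOUN", "PRON", "VERB", "PUNCT"),
  xpos  = c("NNP", "VBZ", "DT", "NN", "PRP", "VBZ", "."),
  stringsAsFactors = FALSE
)

# one tagged string per sentence instead of one string for the whole text
tagged_by_sentence <- tapply(paste0(anndf$token, "/", anndf$xpos),
                             anndf$sentence_id, paste, collapse = " ")

# frequency of the universal pos-tags
tag_freq <- sort(table(anndf$upos), decreasing = TRUE)

# extract all tokens tagged as common or proper nouns
nouns <- anndf$token[anndf$upos %in% c("NOUN", "PROPN")]

# inspect
tagged_by_sentence; tag_freq; nouns
```

The `upos` column holds the universal pos-tags, which are the same across languages, whereas `xpos` holds the language-specific tags (the Penn Treebank tags in the case of the English model).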


# POS-Tagging non-English texts 

We can apply the same method for annotating, e.g. adding pos-tags, to other languages. For this, we could train our own model, or, we can use one of the many pre-trained language models that `udpipe` provides.

Let us explore how to do this by using example texts from different languages, here from German and Dutch (but we could also annotate texts from any of the wide variety of languages for which UDPipe provides pre-trained models).


We begin by loading a German and a Dutch text.


In [None]:
# load texts
gertext <- readLines("https://slcladal.github.io/data/german.txt") 
duttext <- readLines("https://slcladal.github.io/data/dutch.txt") 


Next, we download the pre-trained language models.



In [None]:
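# download language models
m_ger_dl <- udpipe::udpipe_download_model(language = "german-gsd")
m_dut_dl <- udpipe::udpipe_download_model(language = "dutch-alpino")
# load the downloaded models so that they can be used for annotation
m_ger <- udpipe::udpipe_load_model(file = m_ger_dl$file_model)
m_dut <- udpipe::udpipe_load_model(file = m_dut_dl$file_model)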


Or we load them from the current Google Drive workspace (if we have downloaded and saved them before).



In [None]:
# load language model from your computer after you have downloaded it once
#m_ger	<- udpipe::udpipe_load_model(file = "german-gsd-ud-2.5-191206.udpipe")
#m_dut	<- udpipe::udpipe_load_model(file = "dutch-alpino-ud-2.5-191206.udpipe")


Now, pos-tag the German text.



In [None]:
# tokenise, tag, dependency parsing of German text
ger_pos <- udpipe::udpipe_annotate(m_ger, x = gertext) %>%
  as.data.frame() %>%
  dplyr::summarise(postxt = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  dplyr::pull(postxt)
# inspect
ger_pos


And finally, we also pos-tag the Dutch text.



In [None]:
# tokenise, tag, dependency parsing of Dutch text
nl_pos <- udpipe::udpipe_annotate(m_dut, x = duttext) %>%
  as.data.frame() %>%
  dplyr::summarise(postxt = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  dplyr::pull(postxt)
# inspect
nl_pos


# Dependency Parsing Using UDPipe

In addition to pos-tagging, we can also generate plots showing the syntactic dependencies of the different constituents of a sentence. For this, we generate an object that contains a sentence (in this case, the sentence *Linguistics is the scientific study of language*), and we then plot (or visualize) the dependencies using the `textplot_dependencyparser` function.


In [None]:
# parse text
sent <- udpipe::udpipe_annotate(m_eng, x = "Linguistics is the scientific study of language") %>%
  as.data.frame()
# inspect
head(sent)


We now generate the plot.



In [None]:
# generate dependency plot
dplot <- textplot::textplot_dependencyparser(sent, size = 4) 
# show plot
dplot
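
The plot is drawn from three columns of the annotated data frame: `token_id`, `head_token_id`, and `dep_rel`. Each token points to its head, and the root of the sentence has the head `0`. The sketch below shows how these columns encode the dependency structure, using a hand-constructed table for the example sentence. The values follow the usual Universal Dependencies analysis and are written by hand for illustration; they are not the output of the model, which may differ (in the actual `udpipe` output, the id columns are also stored as character strings rather than numbers).

```r
# hand-constructed dependency table (illustrative, not actual model output)
dep <- data.frame(
  token_id      = 1:7,
  token         = c("Linguistics", "is", "the", "scientific", "study", "of", "language"),
  head_token_id = c(5, 5, 5, 5, 0, 7, 5),
  dep_rel       = c("nsubj", "cop", "det", "amod", "root", "case", "nmod"),
  stringsAsFactors = FALSE
)

# look up the head word of each token (the root has no head word)
heads <- ifelse(dep$head_token_id == 0,
                "ROOT",
                dep$token[match(dep$head_token_id, dep$token_id)])

# edge list: which word depends on which head, and how
edges <- data.frame(dependent = dep$token,
                    relation  = dep$dep_rel,
                    head      = heads,
                    stringsAsFactors = FALSE)
edges
```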


# Syntactic Parsing with openNLP

Parsing refers to another type of annotation in which either structural information (as in the case of XML documents) or syntactic relations are added to text. As syntactic parsing is commonly more relevant in the language sciences, the following will focus only on syntactic parsing. Syntactic parsing builds on pos-tagging and allows drawing syntactic trees or dependencies. Unfortunately, syntactic parsing still has relatively high error rates when dealing with language that is not very formal. However, syntactic parsing is very reliable when dealing with written language.


In [None]:
text <- readLines("https://slcladal.github.io/data/english.txt")
# convert character vector to an NLP String object
s <- as.String(text)
# define sentence and word token annotator
sent_token_annotator <- openNLP::Maxent_Sent_Token_Annotator()
word_token_annotator <- openNLP::Maxent_Word_Token_Annotator()
# apply sentence and word annotator
a2 <- NLP::annotate(s, list(sent_token_annotator, word_token_annotator))
# define syntactic parsing annotator
parse_annotator <- openNLP::Parse_Annotator()
# apply parser
p <- parse_annotator(s, a2)
# extract parsed information
ptexts <- sapply(p$features, '[[', "parse")
ptexts


In [None]:
# read into NLP Tree objects.
ptrees <- lapply(ptexts, Tree_parse)
# show first tree
ptrees[[1]]


These trees can, of course, also be shown visually, for instance, in the form of a syntax tree (or tree dendrogram).



In [None]:
# load function
source("https://slcladal.github.io/rscripts/parsetgraph.R")
# generate syntax tree
parse2graph(ptexts[1], leaf.color='red',
            # to put sentence in title (not advisable for long sentences)
            #title = stringr::str_squish(stringr::str_remove_all(ptexts[1], "\\(\\,{0,1}[A-Z]{0,4}|\\)")), 
            margin=-0.05,
            vertex.color=NA,
            vertex.frame.color=NA, 
            vertex.label.font=2,
            vertex.label.cex=.75,  
            asp=.8,
            edge.width=.5, 
            edge.color='gray', 
            edge.arrow.size=0)


Syntax trees are very handy because they allow us to check how reliably the parser performed.

We can use the `get_phrase_type_regex` function from the `parsent` package written by Tyler Rinker to extract phrases from the parsed tree.


In [None]:
pacman::p_load_gh(c("trinker/textshape", "trinker/parsent"))
nps <- get_phrase_type_regex(ptexts[1], "NP") %>%
  unlist()
# inspect
nps


We can now extract the leaves from the text to get the parsed object.



In [None]:
nps_text <- stringr::str_squish(stringr::str_remove_all(nps, "\\(\\,{0,1}[A-Z]{0,4}|\\)"))
# inspect
nps_text


Unfortunately, we can only extract top-level phrases (the NPs nested within other NPs are not extracted separately).
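
One way to also reach nested phrases is to walk through the bracketed parse string and, for every opening `(NP`, find its matching closing bracket. The following base R sketch does this; the function names `extract_all_phrases` and `strip_brackets` as well as the bracketed example string are made up for this illustration and are not part of `parsent` or `openNLP`.

```r
# extract every phrase with a given label, including nested ones
extract_all_phrases <- function(parse, label = "NP") {
  chars  <- strsplit(parse, "")[[1]]
  # start positions of every "(NP " in the string
  starts <- as.integer(gregexpr(paste0("(", label, " "), parse, fixed = TRUE)[[1]])
  if (starts[1] == -1L) return(character(0))
  vapply(starts, function(start) {
    depth <- 0L
    # move right until the bracket opened at 'start' is closed again
    for (i in start:length(chars)) {
      if (chars[i] == "(") depth <- depth + 1L
      if (chars[i] == ")") depth <- depth - 1L
      if (depth == 0L) return(paste(chars[start:i], collapse = ""))
    }
    NA_character_  # brackets were not balanced
  }, character(1))
}

# remove labels and brackets so that only the words remain
strip_brackets <- function(x) {
  trimws(gsub("\\([^ ()]+ |\\)", "", x))
}

# hand-written bracketed parse (illustrative, not parser output)
parse <- "(S (NP (NP (DT the) (NN study)) (PP (IN of) (NP (NN language)))) (VP (VBZ is) (ADJP (JJ fun))))"

all_nps <- extract_all_phrases(parse, "NP")
strip_brackets(all_nps)
```

For the example string, this returns the complete noun phrase *the study of language* as well as the two noun phrases nested inside it (*the study* and *language*).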

In order to extract all phrases, we can use the `phrasemachine` package from the CRAN archive.

We now use the `phrasemachine` package to pos-tag the text(s) (we will simply re-use the English text we loaded before).


In [None]:
# pos tag text
tagged_documents <- phrasemachine::POS_tag_documents(text)
# inspect
tagged_documents


In a next step, we can use the `extract_phrases` function to extract phrases.



In [None]:
# extract phrases
phrases <- phrasemachine::extract_phrases(tagged_documents,
                                          regex = "(A|N)*N(PD*(A|N)*N)*",
                                          maximum_ngram_length = 8,
                                          minimum_ngram_length = 1)
# inspect
phrases


Now, we have all noun phrases that occur in the English sample text.
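
The `regex` argument used above is a pattern over simplified pos-tags rather than over words. Assuming that `A` stands for adjectives, `N` for nouns, `P` for prepositions, and `D` for determiners, the pattern `(A|N)*N(PD*(A|N)*N)*` describes a noun (optionally preceded by adjectives or nouns) that may be followed by one or more prepositional phrases. The base R sketch below illustrates the general idea by testing every word sequence of a hand-tagged example against the pattern; it is a simplified re-implementation for illustration and not the code that `phrasemachine` uses internally.

```r
# hand-tagged example with simplified tags (assumed coding: D, A, N, P)
tokens <- c("the", "scientific", "study", "of", "language")
coarse <- c("D",   "A",          "N",     "P",  "N")

# the pattern from above, anchored so that the whole tag sequence must match
pattern <- "^(A|N)*N(PD*(A|N)*N)*$"

# test every contiguous span of tokens against the pattern
found <- character(0)
n <- length(tokens)
for (i in seq_len(n)) {
  for (j in i:n) {
    tag_string <- paste(coarse[i:j], collapse = "")
    if (grepl(pattern, tag_string)) {
      found <- c(found, paste(tokens[i:j], collapse = " "))
    }
  }
}
found
```

Spans that start with the determiner *the* are not matched because the pattern does not allow a determiner in initial position.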


# Citation & Session Info 

Schweinberger, Martin. 2022. *POS-Tagging and Syntactic Parsing with R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/tagging.html.


In [None]:
sessionInfo()



# References

Hornik, Kurt. 2019. *OpenNLP: Apache Opennlp Tools Interface*. https://cran.r-project.org/web/packages/openNLP/index.html.

Kumar, Ashish, and Avinash Paul. 2016. *Mastering Text Mining with R*. Packt Publishing Ltd.

Wiedemann, Gregor, and Andreas Niekler. 2017. Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R. In *Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017)*, Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.

Wijffels, Jan. 2021. *Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ’Udpipe’ ’Nlp’ Toolkit.* https://CRAN.R-project.org/package=udpipe.
