![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Exploring Corpus Data

This tutorial showcases how to perform basic corpus linguistics methods to explore language data using the *Australian Corpus of English* (ACE). The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with corpus linguistics.

Most of the applications of Corpus Linguistics are based upon a relatively limited number of key procedures or concepts (e.g. concordancing, word frequencies, annotation or tagging, parsing, collocation, text classification, Sentiment Analysis, Entity Extraction, Topic Modeling, etc.). In the following, we will explore some of these procedures to help you perform such tasks. 


## Preparation and session set up  


Activate required packages.


In [None]:
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(tm)
library(tidytext)
library(wordcloud2)
library(flextable)
library(quanteda.textstats)
library(quanteda.textplots)
library(tidyr)


Once you have initiated the session by executing the code shown above, you are good to go.

# Loading data



In the following, we will use R to query and investigate the *Australian Corpus of English* (ACE). 

We begin by loading the data which consists of two steps: (1) creating a vector or list of all files that we want to load and then (2) loading the files.

We start by creating a list of files we want to load.


In [None]:
# load text
corpusfiles <- list.files(here::here("ACE"), # path to the corpus data
                          # file types you want to analyze, e.g. txt-files
                          # (regular expression: escaped dot, anchored at the end)
                          pattern = "\\.TXT$",
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# inspect
head(corpusfiles)
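
The `pattern` argument of `list.files` is a regular expression, not a shell wildcard, so an unescaped dot matches any character. The following is a minimal sketch of the difference; the folder and file names are made up for illustration and are not part of the ACE.

```r
# create a temporary folder with three hypothetical files
tmp <- file.path(tempdir(), "pattern_demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("A01.TXT", "NOTES_TXT.md", "B02.TXT"))))

# unescaped dot: "." matches any character, so "NOTES_TXT.md" is matched too
loose <- list.files(tmp, pattern = ".*.TXT")
# escaped dot plus end-of-string anchor: only names ending in ".TXT"
strict <- list.files(tmp, pattern = "\\.TXT$")

# compare the two results
loose
strict
```

Escaping the dot and anchoring the pattern is the safer choice whenever a folder may contain files other than the corpus texts.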


Next, we load the data.



In [None]:
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T)
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)


***

## Using your own data

You can also use your own data. You can see below what you need to do to upload and use your own data.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, your own data will be loaded into R and you can then use it in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = T) 
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T)
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**

***

# Cleaning data

We start by cleaning the corpus data by removing tags, artefacts and non-alpha-numeric characters.


In [None]:
corpus <- # remove text identifiers such as A01 (the pattern matches a capital letter followed by two digits)
  stringr::str_remove_all(corpus, "[A-Z][0-9]{2}") %>%
  # remove tags 
  stringr::str_remove_all("<.*?>") %>%
  # remove non-alphanumeric characters 
  stringr::str_remove_all("[^[:alnum:] ]") %>%
  # remove single characters (word boundaries keep the neighbouring words from being glued together)
  stringr::str_remove_all("\\b\\w\\b") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# inspect
substr(corpus[1], start=1, stop=200)
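
To see what the individual cleaning steps do, it can help to run them one at a time on a short string. The following is a minimal sketch on a made-up line (it is not taken from the ACE) that covers the removal of identifiers, tags, and non-alphanumeric characters.

```r
library(stringr)

# a made-up line with an identifier, a tag pair, and punctuation
raw <- "A01 <h>The Budget</h> Prices rose by 5%, he said."

# step 1: remove identifiers (capital letter followed by two digits)
step1 <- str_remove_all(raw, "[A-Z][0-9]{2}")
# step 2: remove tags (anything between < and >)
step2 <- str_remove_all(step1, "<.*?>")
# step 3: remove everything that is neither alphanumeric nor a space
step3 <- str_remove_all(step2, "[^[:alnum:] ]")
# step 4: collapse repeated spaces and trim the ends
clean <- str_squish(step3)

# inspect the result
clean
```

Inspecting the intermediate objects (`step1`, `step2`, `step3`) is a quick way to check that a regular expression removes what it is meant to remove and nothing else.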


# Concordancing

Once we have loaded and cleaned the data, we perform the concordancing and extract the KWICs. 

Concordancing refers to the extraction of words from a given text or texts. Concordances are commonly displayed as keyword-in-context (KWIC) displays, in which the search term is shown together with some preceding and following context. 

Concordancing is helpful for seeing how the term is used in the data, for inspecting how often a given word occurs in a text or a collection of texts, for extracting examples, and it also represents a basic procedure and often the first step in more sophisticated analyses of language data. 

To create these kwics, we use the `kwic` function from the `quanteda` package. 


The `kwic` function has the following schema:

`kwic(x, pattern, window = 5, valuetype = c("glob", "regex", "fixed"), separator = " ", case_insensitive = TRUE, index = NULL)`

The arguments (or parameters) of the `kwic` function mean:

* `x`: a character, corpus, or tokens object  
* `pattern`: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.  
* `window`: the number of context words to be displayed around the keyword  
* `valuetype`: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.  
* `separator`: a character to separate words in the output  
* `case_insensitive`: logical; if TRUE, ignore case when matching a pattern or dictionary values  
* `index`: an index object to specify keywords
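
To make the arguments more concrete, here is a minimal sketch on a made-up two-sentence text (not part of the ACE). It uses a glob pattern, in which `*` stands for any sequence of characters, and a small window of two tokens.

```r
library(quanteda)

# a made-up text, split into tokens
toy <- tokens("The walker walked to the park. Walking is what walkers do.")

# glob pattern: "walk*" matches every token that starts with "walk"
kw <- kwic(toy, pattern = "walk*", window = 2, valuetype = "glob")

# convert to a data frame and show context and keyword columns
kw_df <- as.data.frame(kw)
kw_df[, c("pre", "keyword", "post")]
```

Because `case_insensitive` is `TRUE` by default, the capitalized *Walking* is retrieved as well.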



To start with, we generate kwics for the term *australia* as shown below. 


In [None]:
# create kwic
australia <- lapply(corpus, function(x){
  x <- quanteda::kwic(tokens(x), # define text(s) 
                         # define pattern
                          pattern = "australia",
                         # define window size
                          window = 5) %>%
  # convert into a data frame
  as.data.frame() %>%
  # remove superfluous columns
  dplyr::select(-to, -from, -pattern)
})
# combine list of data frames into one big data frame
australia <- dplyr::bind_rows(australia)
# inspect data
head(australia)


We can also use regular expressions in our search to extract not only *australia* but also more complex and even vague patterns. Vague means that only part of the pattern is specified. For instance, maybe only *walk* is specified and now we want all words containing this sequence including *walking*, *walker*, *walked*, and *walks*. To retrieve such vague patterns, we need to use so-called *regular expressions*. Also, when using a regular expression in the `pattern` argument, we need to specify the `valuetype` as `regex` (as shown below).



In [None]:
# create kwic
austr <- lapply(corpus, function(x){
  x <- quanteda::kwic(x = tokens(x), 
                          pattern = "austr.*",
                          window = 5,
                          valuetype = "regex") %>%
  # convert into a data frame
  as.data.frame() %>%
  # remove superfluous columns
  dplyr::select(-to, -from, -pattern)
})
# combine list of data frames into one big data frame
austr <- dplyr::bind_rows(austr)
# inspect data
head(austr)
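
Regular expressions can retrieve more than intended, so it is worth checking the keywords that a pattern returns. The following is a minimal sketch on a made-up sentence (not from the ACE) that contrasts a broad pattern with a narrower, anchored one.

```r
library(quanteda)

# a made-up sentence containing three similar-looking words
toy <- tokens("The Australian team beat Austria in Australia.")

# broad pattern: matches every token containing "austr", including "Austria"
broad <- kwic(toy, pattern = "austr.*", valuetype = "regex", window = 3)
# narrower pattern: the token has to start with "austral"
narrow <- kwic(toy, pattern = "^austral", valuetype = "regex", window = 3)

# compare the retrieved keywords
as.data.frame(broad)$keyword
as.data.frame(narrow)$keyword
```

Tabulating the `keyword` column of a concordance is a simple way to spot unwanted hits before continuing with an analysis.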


When searching for expressions that consist of several words, such as *new south wales*, we also need to specify in the `pattern` argument that we are looking for a phrase. 



In [None]:
# create kwic
nsw <- lapply(corpus, function(x){
  x <- quanteda::kwic(x = tokens(x), 
                          pattern = quanteda::phrase("new south wales"),
                          window = 5) %>%
  # convert into a data frame
  as.data.frame() %>%
  # clean file names
    dplyr::mutate(docname = stringr::str_remove_all(docname, ".*/"),
                docname = stringr::str_remove_all(docname, "\\.TXT$"))
})
# combine list of data frames into one big data frame
nsw <- dplyr::bind_rows(nsw)
# inspect data
head(nsw)


We could now continue and analyze how certain words and phrases are used.

# Word Frequency

Almost all methods used in text analytics rely on frequency information. Thus, finding out how frequent words are in a text is a fundamental technique in text analytics. Such frequency information often comes in the form of word frequency lists, i.e. lists of word forms and their frequency in a given text or collection of texts.  

As extracting word frequency lists is very important, we will now extract a frequency list from a corpus.

In a first step, we load a corpus, convert everything to lower case, remove non-word symbols (including punctuation), and split the corpus data into individual words.


In [None]:
# load and process corpus
ace_words <- corpus  %>%
  # remove tags
  stringr::str_remove_all("<.*?</.*?>") %>%
  # convert everything to lower case
  tolower() %>%
  # remove non-word characters
  str_replace_all("[^[:alpha:][:space:]]+", "")  %>%
  tm::removePunctuation() %>%
  stringr::str_squish() %>%
  stringr::str_split(" ") %>%
  unlist()
# inspect data
head(ace_words)


Now that we have a vector of words, we can easily create a table representing a word frequency list (as shown below).



In [None]:
# create table
wfreq <- ace_words %>%
  table() %>%
  as.data.frame() %>%
  arrange(desc(Freq)) %>%
  dplyr::rename(word = 1,
                frequency = 2)
# inspect data
head(wfreq, 15)
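
The logic behind a frequency list is easier to follow on a handful of words. The following is a minimal sketch in base R with a made-up word vector (not the ACE data): `table` counts each word form and `order` sorts the counts in decreasing order.

```r
# a made-up vector of nine words
words <- c("the", "cat", "saw", "the", "dog", "and", "the", "dog", "ran")

# count each word form and store the result as a data frame
freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(freq) <- c("word", "frequency")

# sort from most to least frequent
freq <- freq[order(-freq$frequency), ]

# inspect the three most frequent words
head(freq, 3)
```

The frequencies in such a list always add up to the number of tokens in the input, which is a useful sanity check.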


The most frequent words are all function words which are often not meaningful or useful for an analysis. Thus, we now remove these function words (also called *stopwords*) from the frequency list and inspect the list without stopwords.



In [None]:
# create table wo stopwords
wfreq_wostop <- wfreq %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  dplyr::filter(word != "")
# inspect data
head(wfreq_wostop, 15)
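
Removing stopwords amounts to dropping every row whose word appears in a stopword list. The following is a minimal sketch in base R; both the frequency table and the three-word stopword list are made up for illustration and are far shorter than `tidytext::stop_words`.

```r
# a made-up frequency table
freq <- data.frame(word = c("the", "of", "government", "and", "water"),
                   frequency = c(50, 30, 7, 25, 4),
                   stringsAsFactors = FALSE)

# hypothetical mini stopword list
my_stops <- c("the", "of", "and")

# keep only rows whose word is NOT in the stopword list
freq_content <- freq[!freq$word %in% my_stops, ]

# inspect the remaining content words
freq_content
```

Which words count as stopwords depends on the research question, so it is good practice to inspect the list that is being used.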


Such word frequency lists can be visualized in various ways. The most common way to visualize word frequency lists is in the form of bar graphs.



In [None]:
wfreq_wostop %>%
  head(10) %>%
  ggplot(aes(x = reorder(word, -frequency, mean), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "10 most frequent non-stop words \nin the ACE",
       x = "") +
  theme(axis.text.x = element_text(angle = 45, size = 12, hjust = 1))


# Wordclouds

Alternatively, word frequency lists can be visualized as word clouds, although these are less informative than bar graphs. 


In [None]:
# create wordcloud
corpus %>%
  # remove tags
  stringr::str_remove_all("<.*?</.*?>") %>%
  # convert to corpus
  quanteda::corpus() %>%
  # remove punctuation
  quanteda::tokens(remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE) %>%
  # remove stop words
  quanteda::tokens_remove(stopwords("english")) %>%
  # create document-feature matrix
  quanteda::dfm() %>%
  # create word cloud
  quanteda.textplots::textplot_wordcloud(max_words = 200)


The `textplot_wordcloud` function has the following schema:

`textplot_wordcloud(x, min_size = 0.5, max_size = 4, min_count = 3, max_words = 500, color = "darkblue",`
  `font = NULL, adjust = 0, rotation = 0.1, random_order = FALSE, random_color = FALSE,`
  `ordered_color = FALSE, labelcolor = "gray20", labelsize = 1.5, labeloffset = 0, `
  `fixed_aspect = TRUE, comparison = FALSE )`

The arguments (or parameters) of the `textplot_wordcloud` function mean:

* `x`: a dfm or quanteda.textstats::textstat_keyness object  
* `min_size`: size of the smallest word  
* `max_size`: size of the largest word  
* `min_count`: words with frequency below min_count will not be plotted  
* `max_words`: maximum number of words to be plotted. The least frequent terms dropped. The maximum frequency will be split evenly across categories when comparison = TRUE.  
* `color`: colour of words from least to most frequent  
* `font`: font-family of words and labels. Use default font if NULL.  
* `adjust`: adjust sizes of words by a constant. Useful for non-English words for which R fails to obtain correct sizes.  
* `rotation`: proportion of words with 90 degree rotation  
* `random_order`: plot words in random order. If FALSE, they will be plotted in decreasing frequency.  
* `random_color`: choose colours randomly from the colours. If FALSE, the colour is chosen based on the frequency  
* `ordered_color`: if TRUE, then colours are assigned to words in order.  
* `labelcolor`: colour of group labels. Only used when comparison = TRUE.  
* `labelsize`: size of group labels. Only used when comparison = TRUE.  
* `labeloffset`: position of group labels. Only used when comparison = TRUE.  
* `fixed_aspect`: logical; if TRUE, the aspect ratio is fixed. Variable aspect ratio only supported if rotation = 0.  
* `comparison`: logical; if TRUE, plot a wordcloud that compares documents in the same way as wordcloud::comparison.cloud(). If x is a quanteda.textstats::textstat_keyness object, then only the target category's key terms are plotted when comparison = FALSE, otherwise the top max_words / 2 terms are plotted from the target and reference categories.


In [None]:
# create wordcloud
corpus %>%
  stringr::str_remove_all("<.*?</.*?>") %>%
  quanteda::corpus() %>%
  quanteda::tokens(remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE) %>%
  quanteda::tokens_remove(stopwords("english")) %>%
  quanteda::dfm() %>%
  quanteda::dfm_trim(min_termfreq = 5) %>%
  quanteda.textplots::textplot_wordcloud(max_words = 200,
                                         rotation = .25,
                                         color = RColorBrewer::brewer.pal(10, "RdBu"))


# Analyzing Frequencies 

We can also investigate how the use of the term *australia* varies across the files of the ACE. In a first step, we extract the number of words in each file.


In [None]:
# extract number of words per file
Words <- corpus %>%
  stringr::str_split(" ")  %>%
  lengths()
# inspect data
Words


Next, we extract the number of matches in each file.



In [None]:
# extract number of matches per file
Matches <- corpus %>%
  tolower() %>%
  stringr::str_count("australia")
# inspect the number of matches per file
Matches
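
Note that `str_count` counts every occurrence of the character sequence, including occurrences inside longer words such as *australian*. The following is a minimal sketch on a made-up string (not ACE data) that shows how word boundaries (`\\b`) restrict the count to the exact word form.

```r
library(stringr)

# a made-up, already lower-cased string
toy <- "australia is large and australians love australia"

# substring count: also counts the match inside "australians"
n_substring <- str_count(toy, "australia")
# whole-word count: word boundaries exclude "australians"
n_word <- str_count(toy, "\\baustralia\\b")

# compare both counts
c(substring = n_substring, whole_word = n_word)
```

Whether the substring count or the whole-word count is appropriate depends on whether derived forms should be included in the analysis.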


Now, we extract the names of the files and create a table with the file names and the relative frequency of matches per 1,000 words.



In [None]:
# extract file names (without path and file extension)
corpusfiles <- stringr::str_remove_all(corpusfiles, ".*/") %>%
  stringr::str_remove_all("\\.TXT$")
corpusfiles


In [None]:
# create table of results
tb <- data.frame(corpusfiles, Matches, Words) %>%
  dplyr::mutate(Frequency = round(Matches/Words*1000, 2))
# inspect data
tb
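
Relative frequencies make files of different sizes comparable. The following is a minimal sketch of the calculation with made-up counts (not the ACE results): the raw number of matches is divided by the number of words and multiplied by 1,000.

```r
# made-up counts for two hypothetical files
matches <- c(fileA = 12, fileB = 3)
words   <- c(fileA = 8000, fileB = 1500)

# matches per 1,000 words, rounded to two decimals
rel_freq <- round(matches / words * 1000, 2)

# inspect the relative frequencies
rel_freq
```

In this toy example, the second file has fewer matches in absolute terms but a higher relative frequency because it is much shorter.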


We can now visualize the relative frequencies of our search word per file.



In [None]:
# create plot
ggplot(tb, aes(x = reorder(corpusfiles, -Frequency), y = Frequency)) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(y = Frequency +.5, label = Frequency), color = "gray30", size=4) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y ="Relative Frequency (per 1,000 words)", x = "Text type in ACE")


# Dispersion plots

To show where in a text or in a collection of texts certain terms occur, we can use *dispersion plots*. The `quanteda.textplots` package offers a very easy-to-use function, `textplot_xray`, to generate dispersion plots.


In [None]:
# add file names
names(corpus) <- corpusfiles
# generate dispersion plots
quanteda.textplots::textplot_xray(kwic(tokens(corpus), pattern = "australia"),
                                  kwic(tokens(corpus), pattern = "queensland"),
                                  sort = T)
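
The idea behind a dispersion plot is that each match is marked at its position within the text. The following is a minimal sketch of that idea in base R with a made-up token vector; it only illustrates the concept and is not how `textplot_xray` is implemented internally.

```r
# a made-up, already tokenized text
tokens_toy <- c("queensland", "is", "north", "of", "new", "south",
                "wales", "and", "queensland", "is", "large")

# token positions at which the search term occurs
hits <- which(tokens_toy == "queensland")

# relative position of each hit (0 = start of text, 1 = end of text)
rel_pos <- hits / length(tokens_toy)

# inspect the relative positions
round(rel_pos, 2)
```

Using relative rather than absolute positions makes texts of different lengths comparable on the same axis.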


We can modify the plot by saving it into an object and then using `ggplot` to modify its appearance.



In [None]:
# generate and save dispersion plots
dp <- quanteda.textplots::textplot_xray(kwic(tokens(corpus), pattern = "queensland"),
                                        kwic(tokens(corpus), pattern = phrase("new south wales")))
# modify plot
dp + aes(color = keyword) + 
  scale_color_manual(values = c('red', 'blue')) +
  theme(legend.position = "none")


We end the session by calling `sessionInfo()`, which tells us which packages we have used and which versions of R and of those packages were loaded.



In [None]:
sessionInfo()



***

[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)

***
