

# Concordancing using R

This tutorial is based on the interactive Jupyter notebook accompanying the Language Technology and Data Analysis Laboratory (LADAL) tutorials on [text analysis and distant reading](https://ladal.edu.au/textanalysis.html) and on [concordancing](https://ladal.edu.au/kwics.html). It is aimed at beginner and intermediate users of R and showcases basic concordancing techniques for textual data.


**Preparation and session set up**

This tutorial is based on R. If you are new to R, you will find an introduction to it and more information on how to use R [here](https://ladal.edu.au/intror.html). For this tutorial, we need to activate certain *packages* from an R *library* so that the scripts shown below are executed without errors. If you want to run this notebook on your own computer, you will need to install a notebook server (e.g. [Anaconda](https://anaconda.org/)) and then install the R packages using the code in the first cell.

In [None]:
# install packages
install.packages('quanteda')
install.packages('tidyverse')
install.packages('tidytext')
install.packages('tm')
install.packages('flextable')
install.packages('plyr')
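
Reinstalling packages that are already present takes time. As a minimal sketch that is not part of the original notebook, the following checks which of the packages are still missing; the names `needed`, `find_missing` and `missing_pkgs` are our own and purely illustrative.

```r
# packages used in this tutorial
needed <- c("quanteda", "tidyverse", "tidytext", "tm", "flextable", "plyr")

# helper (hypothetical name): return the packages that are not yet installed
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# inspect which packages would still need to be installed
missing_pkgs <- find_missing(needed)
missing_pkgs
```

If `missing_pkgs` is not empty, `install.packages(missing_pkgs)` installs just those packages.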


In [None]:
# set options
options(stringsAsFactors = FALSE)
options(scipen = 999)
options(max.print = 1000)
# load packages
# (plyr is loaded before tidyverse so that it does not mask dplyr functions)
library(plyr)
library(quanteda)
library(tidyverse)
library(tidytext)
library(tm)
library(flextable)

## Loading some data

The data we will use today is a part of the COrpus of Oz Early English (COOEE, Fritz 2007). COOEE is stratified in two ways: it has four temporal slices and four registers. We are using the material from the third time period (1851-1875), and for each register (Government English, Private Written, Public Written and Speech Based) the texts have been joined in a single file. The code reads all the .txt files in the local directory, so if you add any other .txt files there and then rerun this cell, the additional files will also be read and may cause problems.
This cell, and most of those which follow, produce visible outputs. This is good practice as it allows us to see whether the code is producing the results which we expect.


In [None]:
# Load data

my_files <- list.files(pattern = "\\.txt$")
# inspect
my_files

## read in data
cooee <- lapply(my_files, readLines)

# join the lines of each file into a single text object named after the file
for (i in seq_along(cooee)) {
    name <- substr(my_files[i], 1, nchar(my_files[i]) - 4)
    assign(name, paste(cooee[[i]], collapse = " "))
}
#inspect
substr(privateW, 1, 200)
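
To see the reading step in isolation, here is a small sketch that uses two made-up files in a temporary folder instead of the COOEE data. The folder name, file contents and the `demo_` objects are assumptions for illustration only. Rather than creating one variable per file, it stores the texts in a single named vector.

```r
# create a temporary folder with two small made-up text files
demo_dir <- file.path(tempdir(), "kwic_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "privateW.txt"))
writeLines("only line", file.path(demo_dir, "publicW.txt"))

# list the .txt files and read each one into a character vector of lines
demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
demo_lines <- lapply(demo_files, readLines)

# join the lines of each file and name each text after its file
demo_texts <- vapply(demo_lines, paste, character(1), collapse = " ")
names(demo_texts) <- tools::file_path_sans_ext(basename(demo_files))
demo_texts
```

With a named vector, a register can be selected by name, for example `demo_texts[["publicW"]]`.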

## Cleaning the data

In a first step, we take the text we have loaded, convert everything to lower case, remove non-word symbols (including punctuation and digits), and remove extra spaces. Compare the output from the previous cell to the results here.

Part of the code here is a user-defined function. This is a way of modularising code - procedures which we expect to reuse can be split off and then called by other code as a complete procedure. The layout here is important: a function has to be read to memory before it can be called.

At this point, you can choose which group of data you want to explore (`governmentE`, `privateW`, `publicW` or `speechB`; case is important!).

In [None]:
txtclean <- function(x){
  x <- x %>%
    iconv(to = "UTF-8") %>%
    # all lower case
    base::tolower() %>%
    # remove everything except letters and white space (this also drops digits)
    stringr::str_replace_all("[^[:alpha:][:space:]]+", "")  %>%
    # remove punctuation
    tm::removePunctuation() %>%
    # remove superfluous white spaces
    stringr::str_squish() %>%
    paste(collapse = " ")
}

# you can choose which register to look at by changing the argument passed to txtclean
# (governmentE, privateW, publicW, speechB)
kwic_data <- txtclean(publicW)
substr(kwic_data, 1, 200)
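
To make the effect of the cleaning steps visible, the following sketch applies roughly the same sequence to a made-up sentence. It is a base-R approximation of `txtclean`, not the function itself, and `toy_text` and `toy_clean` are illustrative names.

```r
# a made-up sentence with capitals, digits, punctuation and a double space
toy_text <- "Port  Phillip, 1851: 'Australia Felix'!"

# all lower case
toy_clean <- tolower(toy_text)
# remove everything except letters and white space
toy_clean <- gsub("[^[:alpha:][:space:]]+", "", toy_clean)
# trim and collapse runs of white space into single spaces
toy_clean <- gsub("[[:space:]]+", " ", trimws(toy_clean))
toy_clean
# → "port phillip australia felix"
```

Note that the year has disappeared along with the punctuation, so this kind of cleaning is not suitable if you want to search for numbers.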


## Concordancing

**Concordancing**, also known as **keyword-in-context (kwic)** analysis, is a very useful first step in exploring text data. A concordance is a list of all the occurrences of a type in the text, presented as a table where the keyword is centred and aligned and context on either side is also shown.
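
To make the idea concrete, here is a minimal base-R sketch of what a concordancer does: it finds every position of the keyword and collects a fixed number of words on either side. The function `toy_kwic` and the example sentence are our own inventions for illustration; this is not how any particular package implements concordancing.

```r
# build a tiny concordance: one row per occurrence of the keyword
toy_kwic <- function(tokens, keyword, window = 2) {
  # positions at which the keyword occurs
  hits <- which(tokens == keyword)
  rows <- lapply(hits, function(i) {
    before <- tokens[seq_len(i - 1)]
    after <- tokens[seq_len(length(tokens) - i) + i]
    data.frame(
      pre = paste(tail(before, window), collapse = " "),
      keyword = tokens[i],
      post = paste(head(after, window), collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  # stack the rows into one table
  do.call(rbind, rows)
}

# split a made-up sentence into words and search it
toy_words <- strsplit("gold was found in australia and australia felix prospered", " ")[[1]]
toy_conc <- toy_kwic(toy_words, "australia", window = 2)
toy_conc
```

The result has two rows, one for each occurrence of the keyword, with up to two words of context on each side.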

The `quanteda` package includes a concordancing function which allows a number of parameters to be adjusted. We will explore some of these in the remainder of this session. The relevant function is called `kwic`, and minimally it requires us to specify a text to search in and a pattern to match. 

In [None]:
kwic_results <- kwic(
  # define text
  kwic_data, 
  # define search pattern
  pattern = "australia")
# show at most the first 10 rows
head(kwic_results, 10)


The warning message here indicates that `quanteda::kwic` is designed to work on tokenised data (depending on your version of `quanteda`, passing untokenised text may even produce an error rather than a warning). As we will see, the results are exactly the same, but in order to avoid the warning messages, we will create a tokenised version of our text and use that as we go on.

We can check how many tokens of the target there are by querying the number of rows in the results table.

In [None]:
kwic_token <- quanteda::tokens(kwic_data)
kwic_results <- quanteda::kwic(
  # define tokenised text
  kwic_token, 
  # define search pattern
  pattern = "australia")

nrow(kwic_results)
# show at most the first 10 rows
head(kwic_results, 10)

One parameter which we can manipulate with `kwic` is the amount of context we see around the keyword. The default is to provide five tokens before and five after the search term; our next concordance increases this to 10 items either side. What counts as a *token* for these purposes?

In [None]:
kwic_results_longer <- kwic(
  # define text
  kwic_token, 
  # define search pattern
  pattern = "australia", 
  # define context window size
  window = 10)
# show at most the first 10 rows
head(kwic_results_longer, 10)
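
On the question of what counts as a token: one way to find out is to look at the output of `tokens` directly. The sketch below uses a made-up sentence; `toy_tokens` is an illustrative name. Because our cleaned text contains only lower-case letters and single spaces, each token is in effect one whitespace-separated word.

```r
library(quanteda)

# tokenise a short made-up sentence
toy_tokens <- tokens("gold was found in australia")

# number of tokens in the text
ntoken(toy_tokens)

# the tokens themselves as a character vector
as.character(toy_tokens)
```

The context window of `kwic` is counted in these units, so `window = 10` means ten such tokens on either side of the keyword.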

### Concordancing - multiple words

`kwic` can also generate a concordance for a phrase - we just have to tell the function that the pattern we are interested in **is** a phrase using the syntax `pattern = phrase("[Target Words]")`. Again, we can also find out how many tokens of our target word(s) there are by checking how many rows there are in the output table.

In [None]:
kwic_phrase <- kwic(kwic_token, pattern = phrase("australia felix"))
nrow(kwic_phrase)
# show at most the first 10 rows
head(kwic_phrase, 10)
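
The difference that `phrase` makes is easiest to see on a small example. In this sketch, which uses a made-up sentence and illustrative object names, the same two words are passed once as a phrase and once as a plain character vector.

```r
library(quanteda)

toy_tokens <- tokens("we left australia felix for victoria")

# phrase(): the words must occur next to each other, in this order
kwic_seq <- kwic(toy_tokens, pattern = phrase("australia felix"))

# character vector: each element is searched for separately
kwic_any <- kwic(toy_tokens, pattern = c("australia", "felix"))

# one match for the phrase, two matches for the separate words
nrow(kwic_seq)
nrow(kwic_any)
```

Passing several words without `phrase` is therefore a way of building one concordance for a set of alternative keywords.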

### Arranging concordances

Often it is useful to sort the results of a concordance according to what is in the context. Doing this can tell us whether a word commonly co-occurs with other words. This is easy to do for the following context; we can sort the start of the `post` string alphabetically. Sorting by the preceding word is more complex. We leave it as an exercise for you to think about how this might be accomplished programmatically.

In [None]:
kwic_ordered <- kwic(x = kwic_token, pattern = "australia") %>%
  dplyr::arrange(post)
# show at most the first 10 rows
head(kwic_ordered, 10)

A more sophisticated approach can include information about the frequency of the co-occurrence pattern and show us the most frequent result first.

In [None]:
# function to look up the frequency of a word in the frequency table
# (relies on the table post_freq, which is created below)
get_freq <- function(word) {
    post_freq$freq[post_freq$x == word]
}

# get concordance results
kwic_result <- as.data.frame(kwic(kwic_token, pattern = 'australia'))
# add column with first word following search term
kwic_result$post_word <- str_remove_all(kwic_result$post, " .*")
# make table of frequencies of following words
post_freq <- plyr::count(kwic_result$post_word)
# add frequency to concordance table and sort results in descending order
kwic_result$post_freq <- sapply(kwic_result$post_word, get_freq, USE.NAMES = FALSE)
kwic_result[order(-kwic_result$post_freq), ]
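
The same idea can be tried out without any extra packages. The following sketch uses a made-up vector of right-hand contexts and base R's `table` to count the first following word; all object names are illustrative assumptions.

```r
# made-up right-hand contexts, as they might appear in the post column
toy <- data.frame(
  post = c("felix was", "and the", "felix is", "is a", "felix and"),
  stringsAsFactors = FALSE
)

# first word of each context: drop everything from the first space onwards
toy$post_word <- sub(" .*", "", toy$post)

# look up how often each first word occurs overall
toy$post_freq <- as.vector(table(toy$post_word)[toy$post_word])

# most frequent following word first
toy_sorted <- toy[order(-toy$post_freq), ]
toy_sorted
```

Here "felix" follows the keyword three times, so the three rows with that context come first.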

### Saving results

You can export the results of the various analyses we have looked at here. In each case, the results are some kind of table, and R lets you write these to a file easily using the `write.csv` command. Here is an example of exporting a concordance. This code saves a file in the scratch space; you can download it from there to your computer. Alternatively, you can specify a location on your computer to save the file, but you need to supply a complete path (and that should use / dividers, not \\).

In [None]:
# export results
write.csv(kwic_results_longer, 'australia_prw.csv')
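
To check that an export worked, you can read the file back in. The sketch below does this with a small made-up table written to R's temporary folder; the table, the file name and the object names are assumptions for illustration. `file.path` builds the path with / dividers, and `row.names = FALSE` leaves out the column of row numbers that `write.csv` adds by default.

```r
# a small made-up concordance table
toy_results <- data.frame(
  pre = c("found in", "australia and"),
  keyword = c("australia", "australia"),
  post = c("and australia", "felix prospered"),
  stringsAsFactors = FALSE
)

# build a complete path and write the table without row numbers
out_file <- file.path(tempdir(), "toy_concordance.csv")
write.csv(toy_results, out_file, row.names = FALSE)

# read the file back in and compare its shape with the original
toy_check <- read.csv(out_file, stringsAsFactors = FALSE)
identical(dim(toy_check), dim(toy_results))
```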

# Citation & Session Info 

Fritz, Clemens WA. 2007. *From English in Australia to Australian English: 1788-1900*. Frankfurt: Peter Lang.

Schweinberger, Martin. 2022. *Text Analysis and Distant Reading using R*. Brisbane: The University of Queensland. url: https://ladal.edu.au/textanalysis.html.


In [None]:
sessionInfo()

