![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)

# Concordancing with R

This tutorial is the interactive Jupyter notebook accompanying the [*Language Technology and Data Analysis Laboratory* (LADAL) tutorial **Concordancing with R**](https://ladal.edu.au/kwics.html). 


**Preparation and session set up**

We start by activating the packages we need for this tutorial.


In [None]:
# set options
options(warn=-1)  # do not show warnings or messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
# activate packages
suppressMessages(library(quanteda)) # for concordancing
suppressMessages(library(dplyr))    # for table processing
suppressMessages(library(stringr))  # for text processing
suppressMessages(library(writexl))  # for saving data
suppressMessages(library(here))     # for easy pathing


<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>If you are using this notebook on your own computer and you have not already installed the R packages listed above, you need to install them.<br> <a href="https://www.dataquest.io/blog/install-package-r/">Here</a> is a Dataquest post on how to install packages in R.</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>
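
Before installing anything, it can help to check which of the packages are actually missing. The following is a minimal sketch that only uses base R; the object names `required_pkgs` and `missing_pkgs` are made up for this illustration, and the installation line is left commented out on purpose.

```r
# packages used in this notebook
required_pkgs <- c("quanteda", "dplyr", "stringr", "writexl", "here")
# packages that are not yet installed in the current R library
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))
# inspect (an empty result means that nothing needs to be installed)
missing_pkgs
# uncomment the next line to install whatever is missing
# install.packages(missing_pkgs)
```

If `missing_pkgs` is empty, all packages are available and you can continue with the tutorial.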

## Using your own data

While the tutorial uses data from the LADAL website, you can also use your own data. To use your own data, follow the instructions below.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)


Then, when the menu has unfolded, click on the smaller folder symbol (encircled in red in the picture below).

![Small Binder Folder Symbol](https://slcladal.github.io/images/upload2.png)

Now, you are in the main menu and can click on the 'MyData' folder.

![MyData Folder Symbol](https://slcladal.github.io/images/upload3.png)

Now that you are in the MyData folder, you can click on the upload symbol.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze. When you then execute the code chunk below, your files are loaded into R so that you can use them in this notebook.

<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>IMPORTANT: here, we assume that you upload some form of text data - not tabular data! You can upload only txt and docx files!</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                      # full paths - not just the names of the files
                      full.names = TRUE) 
# load files
mytext <- sapply(myfiles, function(x){
  # read the file word by word
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  # paste the words back together into a single string
  x <- paste0(x, collapse = " ")
  # remove superfluous white spaces
  x <- stringr::str_squish(x)
})
# inspect
str(mytext)
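
If the upload folder may also contain files that are not plain text, the file list can be restricted before loading. The sketch below assumes plain-text files with a `.txt` extension and uses a temporary folder with two made-up files (`sample.txt` and `notes.csv`) as a stand-in for `MyData`; it reads the files line by line with `readLines` rather than word by word with `scan`.

```r
# create a temporary folder that stands in for "MyData" (hypothetical demo files)
demo_dir <- file.path(tempdir(), "MyData_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Alice was beginning", "to get very tired."),
           file.path(demo_dir, "sample.txt"))
writeLines("not a text file", file.path(demo_dir, "notes.csv"))
# keep only files whose names end in .txt
txt_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
# read each file line by line and collapse the lines into one string per file
demo_text <- sapply(txt_files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = " ")
})
# inspect
unname(demo_text)
```

Only `sample.txt` is loaded here; the `.csv` file is skipped because it does not match the pattern.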


<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>If you are using your own data, change <code>mytext</code> to <code>text</code> in the code chunk above and do not execute the next code chunk.</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


If you do not use your own data, you can load the default data, Lewis Carroll's *Alice's Adventures in Wonderland*, by executing the following code chunk.


In [None]:
text <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb"))
# inspect first 6 text elements
head(text)


The data consists of many separate text elements. Next, we combine these elements into a single text, clean it by removing superfluous white spaces, and split it into individual words (this is called tokenising).



In [None]:
text <- text %>%
  # collapse lines into a single  text
  paste0(collapse = " ") %>%
  # remove superfluous white spaces
  str_squish() %>%
  # tokenize
  tokens()
# inspect
head(text)


The text is now split into individual words. 
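
To see what these three steps do, here is a small sketch on two made-up lines with irregular spacing (`toy_lines` and the other `toy_` objects are invented for this illustration). It applies the same functions one after the other instead of piping them.

```r
library(quanteda)
library(stringr)
# two toy lines with irregular spacing (made-up example data)
toy_lines <- c("Alice was  beginning", "  to get very tired.")
# collapse the lines into a single string
toy_string <- paste0(toy_lines, collapse = " ")
# remove superfluous white spaces
toy_string <- str_squish(toy_string)
# tokenise
toy_tokens <- tokens(toy_string)
# inspect the tokens and count them
as.list(toy_tokens)
ntoken(toy_tokens)
```

Note that, with the default settings of `tokens`, punctuation is not removed: the final full stop is kept as a token of its own.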

## Creating simple concordances

Now we can extract concordances using the `kwic` function from the `quanteda` package. This function requires:

+ a text (`x`) 
+ a keyword defined by a search pattern (`pattern`) 


In [None]:
mykwic <- kwic(
  # define text
  text, 
  # define target word (this is called the "search pattern")
  pattern = "alice")
# inspect
mykwic %>%
  as.data.frame() %>%
  head()
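
The same call can be tried out on a very short, made-up text, which makes it easier to check the result by eye. The sketch below assumes nothing beyond `quanteda`; the object names `toy` and `toy_kwic` are invented for this illustration.

```r
library(quanteda)
# a short made-up text, tokenised
toy <- tokens("Poor Alice sat down. Alice looked at the Hatter, and the Hatter looked at poor Alice.")
# extract concordances for the search pattern
toy_kwic <- kwic(toy, pattern = "alice")
# inspect as a data frame
as.data.frame(toy_kwic)
# number of hits
nrow(toy_kwic)
```

Although the search pattern is written in lower case, the capitalised *Alice* is found because `kwic` ignores case by default (`case_insensitive = TRUE`).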


To extract the frequency of the search term (*alice*) we can use `nrow` or `length`.



In [None]:
# the keyword column is accessed after converting the concordance to a data frame
nrow(mykwic); length(as.data.frame(mykwic)$keyword)



The results show that there are 386 instances of the search term (*alice*). 

We now increase the size of the context window to 10 words/elements on either side of the keyword (the default is 5 words/elements).


In [None]:
mykwic_longer <- kwic(text, pattern = "alice", 
  # define context window size
  window = 10)
# inspect
mykwic_longer %>%
  as.data.frame() %>%
  head()
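
The effect of the `window` argument is easiest to see on a single made-up sentence. In the sketch below (object names are invented for this illustration), the same keyword is extracted once with two and once with four words of context on either side.

```r
library(quanteda)
# a single made-up sentence, tokenised
toy <- tokens("Alice was beginning to get very tired of sitting by her sister on the bank")
# narrow context: 2 tokens on either side
narrow <- as.data.frame(kwic(toy, pattern = "tired", window = 2))
# wider context: 4 tokens on either side
wide <- as.data.frame(kwic(toy, pattern = "tired", window = 4))
# compare the context columns
narrow[, c("pre", "keyword", "post")]
wide[, c("pre", "keyword", "post")]
```

With `window = 2`, the preceding context is *get very*; with `window = 4`, it grows to *beginning to get very*.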


## Exporting concordances

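To export a concordance table as an MS Excel spreadsheet, we use `write_xlsx`. Note that we use the `here` function to build the file path, so the file is saved in the project's root directory, which is usually, but not necessarily, the same as the current working directory.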


In [None]:
# convert the concordance to a plain data frame before saving it
write_xlsx(as.data.frame(mykwic), here::here("mykwic.xlsx"))
# check where the working directory is
getwd()
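
If you would rather not write into the project folder while experimenting, the same function can write to any other path. The sketch below saves a small made-up table (`toy_conc`, with only the three context columns) to R's temporary directory and then checks that the file exists; the file name `toy_kwic.xlsx` is an arbitrary choice for this illustration.

```r
library(writexl)
# toy concordance table with made-up rows
toy_conc <- data.frame(
  pre = c("said the poor", "thought"),
  keyword = c("Alice", "Alice"),
  post = c("to herself", "to herself ,")
)
# build a path in R's temporary directory
out_file <- file.path(tempdir(), "toy_kwic.xlsx")
# save the table as an Excel file
write_xlsx(toy_conc, out_file)
# check that the file has been created
file.exists(out_file)
```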


## Extracting more than single words

To extract more than just one word, we wrap the search pattern in `phrase()`, which tells `kwic` that the pattern is a sequence of words (this can be anything up to full sentences).


In [None]:
kwic_pooralice <- kwic(text, pattern = phrase("poor alice"))
# inspect
kwic_pooralice %>%
  as.data.frame() %>%
  head()
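
As a quick check on a made-up text (the objects `toy`, `hits_word`, and `hits_phrase` are invented for this illustration), we can compare a single-word search with a phrase search. This is only a sketch of the difference, not part of the analysis of the novel.

```r
library(quanteda)
# a short made-up text, tokenised
toy <- tokens("Poor Alice sat down. Alice looked at the Hatter, and the Hatter looked at poor Alice.")
# search for the single word
hits_word <- kwic(toy, pattern = "alice")
# search for the two-word sequence
hits_phrase <- kwic(toy, pattern = phrase("poor alice"))
# compare the number of hits
nrow(hits_word); nrow(hits_phrase)
# inspect the phrase hits
as.data.frame(hits_phrase)
```

The phrase search returns fewer hits than the single-word search because it only keeps instances of *alice* that directly follow *poor*.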


## Searches using regular expressions

Regular expressions add flexibility by allowing us to search for abstract patterns rather than concrete words or phrases. A regular expression is a special sequence of characters that describes a pattern. For more information about regular expressions in R [see this tutorial](https://ladal.edu.au/regex.html).


To specify that we are using regular expressions, we set `valuetype` to `"regex"`. The search pattern `"\\balic.*|\\bhatt.*"` retrieves elements that contain `alic` or `hatt` followed by any characters, where the `a` in `alic` and the `h` in `hatt` are at a word boundary. The `|` is an operator (like `+`, `-`, or `*`) that stands for *or*.


In [None]:
# define search patterns
patterns <- c("\\balic.*|\\bhatt.*")
kwic_regex <- kwic(text, patterns, 
  # define valuetype
  valuetype = "regex")
# inspect
kwic_regex %>%
  as.data.frame() %>%
  head()
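
To get a feeling for what this pattern does and does not match, it can be tested on a handful of made-up words. The sketch below uses base R's `grepl` as a stand-in for the matching that `kwic` performs internally (an assumption that should hold for a simple pattern like this one) and ignores case, as `kwic` does by default.

```r
# made-up words to test the pattern on
toy_words <- c("Alice", "malice", "Hatter", "chatter", "alice's", "hat")
# the search pattern from above
pattern <- "\\balic.*|\\bhatt.*"
# check which words match (ignoring case)
matches <- grepl(pattern, toy_words, ignore.case = TRUE, perl = TRUE)
# inspect
data.frame(word = toy_words, matches = matches)
```

*malice* and *chatter* are not matched because `alic` and `hatt` do not start at a word boundary there, and *hat* is not matched because it lacks the second `t`.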


## Piping concordances

Quite often, we only want to retrieve patterns if they occur in a certain context. For instance, we might be interested in instances of *alice* but only if the preceding word is *poor* or *little*. Such conditional concordances can be retrieved by piping using `%>%`, which can be translated as *and then*. We then use `filter` to keep only those concordances in which *poor* or *little* directly precedes the keyword. Note that the `$` stands for the end of a string, so that *poor$* means that *poor* is the last element preceding the keyword.


In [None]:
kwic_pipe <- kwic(x = text, pattern = "alice") %>%
  # convert to a data frame so that the pre column is available
  as.data.frame() %>%
  dplyr::filter(stringr::str_detect(pre, "poor$|little$"))
# inspect
kwic_pipe %>%
  as.data.frame() %>%
  head()
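
The role of the `$` becomes clearer on a small made-up table (`toy_conc` and `kept` are invented names for this illustration). In this sketch, the third row contains *poor* in its preceding context, but not as the last element.

```r
library(dplyr)
library(stringr)
# toy concordance table with made-up rows
toy_conc <- data.frame(
  pre = c("said the poor", "what a poor little", "poor thing , said", "so"),
  keyword = "Alice",
  post = c("to herself", "sat down", "looked up", "began")
)
# keep rows where poor or little directly precedes the keyword
kept <- toy_conc %>%
  filter(str_detect(pre, "poor$|little$"))
# inspect
kept
```

Only the first two rows are kept; without the `$`, the third row would be retained as well.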


## Arranging concordances and adding frequency information

When inspecting concordances, it is often useful to re-order them, for instance so that concordances with frequent collocates appear first (at the top). To re-order concordances, we use `arrange`, which takes the column by which we want to re-arrange as its argument.

In the example below, we extract all instances of *alice* and then arrange the instances according to the alphabetical order of the first word in the `post` column.


In [None]:
kwic_ordered <- kwic(x = text, pattern = "alice") %>%
  # convert to a data frame so that the post column is available
  as.data.frame() %>%
  dplyr::arrange(post)
# inspect
kwic_ordered %>%
  as.data.frame() %>%
  head() 


A more useful option may be to arrange concordances according to the frequency of co-occurring terms. In order to do this, we need to extract the co-occurring words and their frequency. We do this by using `mutate`, `group_by`, `n()`, and `str_remove_all`.



In [None]:
kwic_ordered_coll <- kwic(
  # define text
  x = text, 
  # define search pattern
  pattern = "alice") %>%
  # convert to a data frame so that the post column is available
  as.data.frame() %>%
  # extract word following the keyword
  dplyr::mutate(post_word = str_remove_all(post, " .*")) %>%
  # group following words
  dplyr::group_by(post_word) %>%
  # extract frequencies of the following words
  dplyr::mutate(post_word_freq = n()) %>%
  # remove the grouping again
  dplyr::ungroup() %>%
  # arrange/order by the frequency of the following word
  dplyr::arrange(-post_word_freq)
# inspect
kwic_ordered_coll %>%
  as.data.frame() %>%
  head()
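
The logic behind this ordering can also be sketched with base R on a few made-up contexts (all object names below are invented for this illustration): we isolate the first word of each following context, count how often each first word occurs, and attach that count to every row.

```r
# made-up following contexts
toy_conc <- data.frame(
  post = c("was beginning to", "thought it over", "was not a bit",
           "said nothing", "was silent")
)
# keep only the first word (remove everything from the first space onwards)
toy_conc$post_word <- sub(" .*", "", toy_conc$post)
# frequency table of the first words
freq <- table(toy_conc$post_word)
# attach the frequency of its first word to each row
toy_conc$post_word_freq <- as.integer(freq[toy_conc$post_word])
# order rows by descending frequency
toy_ordered <- toy_conc[order(-toy_conc$post_word_freq), ]
# inspect
toy_ordered
```

The three contexts beginning with *was* now appear at the top because *was* is the most frequent first word.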


## Ordering by subsequent elements

We now extract the three words following the keyword (*alice*) and organize the concordances by the frequencies of the following words. 

We begin by creating a clean *post* column, that is, we remove punctuation from `post` and convert it to lower case.


In [None]:
mykwic %>%
  # convert to data frame
  as.data.frame() %>%
  # create new CleanPost
  dplyr::mutate(CleanPost = stringr::str_remove_all(post, "[:punct:]"),
                CleanPost = stringr::str_squish(CleanPost),
                CleanPost = tolower(CleanPost))-> mykwic_following
# inspect
head(mykwic_following)
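
Applied to two made-up context strings (`toy_post` is invented for this illustration), the three cleaning steps have the following effect. The sketch runs the steps one after the other so that each intermediate result can be inspected.

```r
library(stringr)
# made-up following contexts with punctuation and extra spaces
toy_post <- c(", and  the Hatter !", "'s Adventures in Wonderland .")
# remove punctuation
clean_post <- str_remove_all(toy_post, "[:punct:]")
# remove superfluous white spaces
clean_post <- str_squish(clean_post)
# convert to lower case
clean_post <- tolower(clean_post)
# inspect
clean_post
```

Note that removing the apostrophe leaves the possessive `s` behind as a separate element, which then counts as a word in the following steps.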


Next, we extract the first, second, and third words following the keyword.



In [None]:
mykwic_following %>%
  # extract first element after keyword
  dplyr::mutate(FirstWord = stringr::str_remove_all(CleanPost, " .*")) %>%
  # extract second element after keyword
  dplyr::mutate(SecWord = stringr::str_remove(CleanPost, ".*? "),
                SecWord = stringr::str_remove_all(SecWord, " .*")) %>%
  # extract third element after keyword
  dplyr::mutate(ThirdWord = stringr::str_remove(CleanPost, ".*? "),
                ThirdWord = stringr::str_remove(ThirdWord, ".*? "),
                ThirdWord = stringr::str_remove_all(ThirdWord, " .*")) -> mykwic_following
# inspect
head(mykwic_following)
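
The chained `str_remove` calls above work well when every context contains at least three words; for shorter contexts, they return the last available word instead of a missing value. If this matters for your data, one possible alternative is to split the context at spaces and pick the word by position. The helper `nth_word` below is a hypothetical function written for this illustration (it is not part of any package), and the contexts are made up.

```r
# hypothetical helper: return the n-th word of each string, or NA if there is none
nth_word <- function(x, n) {
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(words,
         function(w) if (length(w) >= n) w[n] else NA_character_,
         character(1))
}
# made-up cleaned contexts of different lengths
toy_clean <- c("was beginning to get", "said nothing", "sighed")
# extract the first three words
toy_words <- data.frame(
  CleanPost = toy_clean,
  FirstWord = nth_word(toy_clean, 1),
  SecWord   = nth_word(toy_clean, 2),
  ThirdWord = nth_word(toy_clean, 3)
)
# inspect
toy_words
```

Contexts with fewer than three words now show `NA` in the positions that do not exist, so they do not inflate the frequency of other words.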


Now, we calculate the frequencies of the subsequent words and arrange the concordances in descending order of these frequencies: first by the first word, then by the second word, and finally by the third word following the keyword.



In [None]:
mykwic_following %>%
  # calculate frequency of following words
  # 1st word
  dplyr::group_by(FirstWord) %>%
  dplyr::mutate(FreqW1 = n()) %>%
  # 2nd word
  dplyr::group_by(SecWord) %>%
  dplyr::mutate(FreqW2 = n()) %>%
  # 3rd word
  dplyr::group_by(ThirdWord) %>%
  dplyr::mutate(FreqW3 = n()) %>%
  # ungroup
  dplyr::ungroup() %>%
  # arrange by following words
  dplyr::arrange(-FreqW1, -FreqW2, -FreqW3) -> mykwic_following
# inspect results
head(mykwic_following, 10)


The results now show the concordance arranged by the frequency of the words following the keyword.
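
A more compact way to attach such frequencies is `add_count` from `dplyr`, which adds a count column without explicit grouping and ungrouping. The sketch below illustrates this on a small made-up table with only two word columns (`toy_following` and `toy_sorted` are invented names); it is an alternative formulation, not the approach used above.

```r
library(dplyr)
# made-up first and second words following a keyword
toy_following <- data.frame(
  FirstWord = c("was", "said", "was", "was", "said"),
  SecWord   = c("not", "to", "not", "very", "nothing")
)
toy_sorted <- toy_following %>%
  # add the frequency of each first word
  add_count(FirstWord, name = "FreqW1") %>%
  # add the frequency of each second word
  add_count(SecWord, name = "FreqW2") %>%
  # arrange by descending frequency of the first, then of the second word
  arrange(desc(FreqW1), desc(FreqW2))
# inspect
toy_sorted
```

Rows starting with *was* come first, and among them the rows followed by *not* precede the row followed by *very*, because *not* is the more frequent second word.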


[Back to LADAL](https://ladal.edu.au/kwics.html)
