![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)

# Concordancing with R

This tutorial is the interactive Jupyter notebook accompanying the [*Language Technology and Data Analysis Laboratory* (LADAL) tutorial **Concordancing with R**](https://ladal.edu.au/kwics.html). 


**Preparation and session set up**

We start by activating the packages we need for this tutorial.


In [None]:
# set options
options(warn=-1)  # do not show warnings
# activate packages
library(quanteda) # for concordancing
library(dplyr)    # for table processing
library(stringr)  # for text processing
library(writexl)  # for saving data
library(here)     # for easy pathing


## Using your own data



<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>

While the tutorial uses example data (Lewis Carroll's *Alice in Wonderland*), you can also **use your own data**. To use your own data, click on the folder called `MyTexts` (it is in the menu to the left of the screen) and then simply drag and drop your txt-files into the folder. When you then execute the code chunk below, your files are loaded into R and you can use them in this notebook.<br>
<br>
You can upload <b>only txt-files</b> (simple unformatted files created in or saved by a text editor)! The notebook assumes that you upload some form of text data - not tabular data! <br>
<br>
<b>IMPORTANT</b>: Be sure to then <b>replace `mytext` with `text` in the code chunk below and do not execute the code chunk that loads an example text</b> from the LADAL repository, so that you work with your own data and not the sample data!<br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
myfiles <- list.files(here::here("notebooks/MyTexts"), # path to the corpus data
                      # full paths - not just the names of the files
                      full.names = TRUE)
# loop over the vector 'myfiles' that contains paths to the data
mytext <- sapply(myfiles, function(x){

  # read the content of each file using 'scan'
  x <- scan(x, 
            what = "char",    # specify that the input is characters
            sep = "",         # split the input at any whitespace (one element per word)
            quote = "",       # set quote to an empty string (no quoting)
            quiet = TRUE,     # suppress scan messages
            skipNul = TRUE)   # skip NUL bytes if encountered

  # combine the character vector into a single string with spaces
  x <- paste(x, collapse = " ")

  # remove extra whitespaces using 'str_squish' from the 'stringr' package
  x <- stringr::str_squish(x)

})

# inspect the structure of the text object
str(mytext)


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>If you are using your own data, do not execute the next code chunk and change `mytext` into `text` in the code chunk above.</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


## Loading the example data

If you do not use your own data, you can load the default data, Lewis Carroll's *Alice's Adventures in Wonderland*, by executing the following code chunk.


In [None]:
text <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb"))
# inspect first 6 text elements
head(text)


The data consists of many separate text elements. 
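
The following is a minimal sketch of what "separate text elements" means for the later steps, assuming a small made-up character vector (`toy_text`, which is not the Alice data). When such a vector is tokenised with `tokens()`, each element is treated as its own document, and the `docname` column of a concordance shows which element a hit comes from.

```r
library(quanteda)

# toy stand-in for the `text` object (hypothetical data, not the Alice text)
toy_text <- c("Alice was beginning to get tired.",
              "The Rabbit took a watch out of its pocket.",
              "Alice started to her feet.")

# each element of the character vector becomes one document
toy_tokens <- tokens(toy_text)
ndoc(toy_tokens)

# the docname column records which element each hit comes from
toy_kwic <- kwic(toy_tokens, pattern = "alice", window = 3)
as.data.frame(toy_kwic)
```

Here the keyword occurs in the first and third element, so the concordance has two rows.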

## Creating simple concordances

Now we can extract concordances using the `kwic` function from the `quanteda` package. This function has the following arguments: 

+ `x`: a text or collection of texts. The text needs to be tokenised, i.e. split into individual words, which is why we pass *text* to the `tokens()` function. 
+ `pattern`: a keyword defined by a search pattern  
+ `window`: the size of the context window (how many words before and after)  
+ `valuetype`: the type of pattern matching  
  + "glob" for "glob"-style wildcard expressions;  
  + "regex" for regular expressions; or  
  + "fixed" for exact matching  
+ `separator`: a character to separate words in the output  
+ `case_insensitive`: logical; if TRUE, ignore case when matching a pattern or dictionary values

<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>You can easily adapt the concordance. For instance, to search for a different word, such as *speak*, replace *alice* with *speak* as the pattern. To widen the context window, replace the '5' with '10'; this adds 5 more words of context on each side of the keyword.</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
mykwic <- kwic(
  # tokenise and define text
  tokens(text), 
  # define target word (this is called the "search pattern")
  pattern = phrase("alice"),
  # 5 words before and after
  window = 5,
  # no regex
  valuetype = "fixed",
  # words separated by whitespace
  separator = " ",
  # search should be case insensitive
  case_insensitive = TRUE)

# inspect resulting kwic
mykwic %>%
  # convert into a data frame
  as.data.frame() %>%
  # show only first 10 results
  head(10)
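
The `valuetype` and `pattern` arguments described in this section can also be combined in other ways. The sketch below is an illustration on a made-up sentence (`toy`, not part of the tutorial data): it uses a regular expression to find all tokens that begin with *walk*, and `phrase()` to search for a sequence of two words.

```r
library(quanteda)

# toy text (hypothetical; replace tokens(toy) with tokens(text) for real data)
toy <- "Alice walked on. The White Rabbit walks fast. Alice was walking after the white rabbit."
toks <- tokens(toy)

# regular expression: every token that starts with "walk"
kwic_regex <- kwic(
  toks,
  pattern = "^walk",
  valuetype = "regex",
  window = 3,
  case_insensitive = TRUE)

# multi-word search: phrase() makes kwic look for the word sequence
kwic_phrase <- kwic(
  toks,
  pattern = phrase("white rabbit"),
  valuetype = "fixed",
  window = 3,
  case_insensitive = TRUE)

as.data.frame(kwic_regex)
as.data.frame(kwic_phrase)
```

Without `phrase()`, a pattern containing a space is compared against single tokens and therefore finds nothing, which is why multi-word searches need the wrapper.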


## Exporting concordances

To export a concordance table as an MS Excel spreadsheet, we use `write_xlsx`. Note that the `here` function builds the file path relative to the project root, so the file is saved in the `MyOutput` folder regardless of the current working directory.


In [None]:
# save data to the MyOutput folder (as a plain data frame)
write_xlsx(as.data.frame(mykwic), here::here("notebooks/MyOutput/mykwic.xlsx"))


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>You will find the generated MS Excel spreadsheet named *mykwic.xlsx* in the `MyOutput` folder (located on the left side of the screen).</b> <br><br>Simply double-click the `MyOutput` folder icon, then right-click on the *mykwic.xlsx* file, and choose Download from the dropdown menu to download the file. <br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>
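
Before exporting, it can be useful to trim the concordance to the columns you actually need. The following sketch assumes a toy concordance and a temporary output folder (`MyOutputDemo` inside `tempdir()`, a hypothetical location chosen so that nothing in `MyOutput` is overwritten). It converts the concordance to a plain data frame, keeps four columns, makes sure the folder exists, and writes the spreadsheet.

```r
library(quanteda)
library(writexl)

# toy concordance (hypothetical data)
toy_kwic <- kwic(tokens("Alice saw the Rabbit. Alice ran."),
                 pattern = "alice", window = 3)

# convert to a plain data frame and keep the columns of interest
out_table <- as.data.frame(toy_kwic)[, c("docname", "pre", "keyword", "post")]

# hypothetical output location: a temporary folder instead of notebooks/MyOutput
out_dir <- file.path(tempdir(), "MyOutputDemo")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_kwic.xlsx")

# write the spreadsheet and check that the file was created
write_xlsx(out_table, out_file)
file.exists(out_file)
```

To apply the same steps to the tutorial data, replace `toy_kwic` with `mykwic` and the temporary path with the `here::here()` path used above.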





[Back to LADAL](https://ladal.edu.au/kwics.html)
