![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)

# String processing and cleaning data in R

This tutorial is the interactive Jupyter notebook accompanying the [*Language Technology and Data Analysis Laboratory* (LADAL) tutorial *String Processing in R*](https://ladal.edu.au/string.html).


**Preparation and session set up**

We set up our session by activating the packages we need for this tutorial. 


In [None]:
# set options
options(warn = -1)  # suppress warnings
# load packages
library(dplyr)         # data manipulation and transformation
library(stringr)       # string manipulation functions
library(here)          # for generating relative paths


## Using your own data

<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>

While the tutorial uses example data, you can also **use your own data**. To use your own data, click on the folder called `MyTexts` (it is in the menu to the left of the screen) and then simply drag and drop your txt-files into the folder. When you then execute the code chunk below, you will upload your own data and you can then use it in this notebook.<br>
<br>
You can upload <b>only txt-files</b> (simple unformatted files created in or saved by a text editor)! The notebook assumes that you upload some form of text data - not tabular data! <br>
<br>
<b>IMPORTANT</b>: Be sure to <b>replace `mytext` with `texts` in the code chunk below and do not execute the code chunk which loads the example data</b>, so that you work with your own data rather than the sample data!<br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
myfiles <- list.files(here::here("notebooks/MyTexts"), # path to the corpus data
                      # full paths - not just the names of the files
                      full.names = TRUE)
# loop over the vector 'myfiles' that contains paths to the data
mytext <- sapply(myfiles, function(x){

  # read the content of each file using 'scan'
  x <- scan(x, 
            what = "char",    # specify that the input is characters
            sep = "",         # default separator: split the input at any whitespace
            quote = "",       # set quote to an empty string (no quoting)
            quiet = TRUE,     # suppress scan messages
            skipNul = TRUE)   # skip NUL bytes if encountered

  # combine the character vector into a single string with spaces
  x <- paste(x, collapse = " ")

  # remove extra whitespaces using 'str_squish' from the 'stringr' package
  x <- stringr::str_squish(x)

})

# inspect the structure of the text object
str(mytext)
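
To see what the loading step does to a single file, here is a minimal sketch that runs the same `scan`, `paste`, and `str_squish` sequence on a small throwaway file. The file content and the names `tmp`, `tokens`, and `demo_text` are invented for this illustration and are not part of the tutorial data.

```r
library(stringr)

# hypothetical demo file in a temporary location (not part of the tutorial data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello   world,", "", "this is  a   test."), tmp)

# with sep = "", 'scan' splits the file content at any whitespace
tokens <- scan(tmp, what = "char", sep = "", quote = "", quiet = TRUE)

# glue the tokens back together and normalise the spacing
demo_text <- str_squish(paste(tokens, collapse = " "))
demo_text
```

Whatever the line breaks and spacing in the file were, the result is one string per file with single spaces between the words.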


## Loading the example data

We begin by loading the example data, which consists of three files containing transcripts of conversations from the Irish component of the [*International Corpus of English*](https://www.ice-corpora.uzh.ch/en.html).


In [None]:
# load text
texts <- base::readRDS(url("https://slcladal.github.io/data/iceire_sample.rda", "rb"))
# inspect data
str(texts); substring(texts, 1, 200)


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>The aim of this notebook is to showcase how you can clean text data and then export the cleaned text for further analysis.</b> <br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>

## Reformatting

In a first step, we will reformat the data so that each speech unit is on a separate line. As speech units start with the sequence `<S1A`, we can use this as our starting point.


In [None]:
# replace instances of "<S1A" (in upper or lower case) with "~~~<S1A", split the resulting text, and unlist
texts_split <- stringr::str_replace_all(texts, "<[Ss]1[Aa]", "~~~<S1A") %>%
  stringr::str_split("~~~") %>%
  unlist() 

# create a data frame with 'id', 'corpus', 'file', 'speaker', and 'texts_split' columns
texts_df <- tibble(id = seq_along(texts_split),
                  corpus = rep("ICE-IRE", length(texts_split)),
                  file = str_remove_all(texts_split, "#.*"),
                  speaker = str_remove_all(texts_split, ">.*"), 
                  texts_split)

# inspect the first 10 rows of the created data frame
head(texts_df, 10)
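
The marker-and-split idea used above can be tried on a short string first. This is a sketch on an invented snippet that only imitates the corpus mark-up; `toy`, `toy_marked`, `toy_split`, and `toy_speaker` are illustrative names.

```r
library(stringr)

# toy string imitating the corpus mark-up (invented for illustration)
toy <- "<S1A-001 Riding> <I> <S1A-001$A> <#> Well how are you <S1A-001$B> <#> Fine thanks"

# put a marker in front of every speech unit ...
toy_marked <- str_replace_all(toy, "<[Ss]1[Aa]", "~~~<S1A")

# ... then split at the marker and flatten the resulting list
toy_split <- unlist(str_split(toy_marked, "~~~"))
toy_split

# dropping everything from the first ">" onwards leaves the opening tag
toy_speaker <- str_remove_all(toy_split, ">.*")
toy_speaker
```

Because the string starts with a marker, the first element of `toy_split` is empty, and the header unit has no `$` in its tag. Rows like these are the ones that the speaker-length filter in the cleaning step removes.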


## Cleaning

Now that we have the data in tabular format, it is easy to clean it.


In [None]:
# clean the 'file' and 'speaker' columns and add a tag-free text column
texts_df %>%
  
  # clean 'file' column by removing "<" and everything from the first space or "$" onwards
  dplyr::mutate(file = stringr::str_replace_all(file, "<", ""),
                file = stringr::str_replace_all(file, "[ \\$].*", ""),
                
                # clean 'speaker' column by removing anything before "$"
                speaker = stringr::str_replace_all(speaker, ".*\\$", ""),
                
                # create 't_clean' by removing all tags ("<.*?>") from 'texts_split'
                t_clean = stringr::str_remove_all(texts_split, "<.*?>")) %>%
  
  # keep only rows whose speaker label is one or two characters long
  dplyr::filter(nchar(speaker) >= 1,
                nchar(speaker) <= 2) -> clean_df  # assign the result to 'clean_df'
  
# inspect the first 10 rows of the cleaned data frame
head(clean_df, 10)
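
The tag removal relies on the lazy quantifier in `<.*?>`. The following sketch, on an invented speech unit, contrasts it with the greedy `<.*>` and also shows the speaker extraction; the objects `unit`, `lazy`, `greedy`, and `speaker` are illustrative only.

```r
library(stringr)

# invented speech unit that imitates the corpus mark-up
unit <- "<S1A-001$A> <#> Well <,> how are you <&> laughter </&>"

# lazy match: each tag is removed on its own, the words survive
lazy <- str_squish(str_remove_all(unit, "<.*?>"))
lazy

# greedy match: everything from the first "<" to the last ">" is removed
greedy <- str_squish(str_remove_all(unit, "<.*>"))
greedy

# speaker: drop everything up to and including the "$"
speaker <- str_remove("<S1A-001$A", ".*\\$")
speaker
```

Since this unit starts and ends with a tag, the greedy pattern deletes the whole string, which is why the lazy version is the one to use for stripping mark-up.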


In [None]:
# group the 'clean_df' dataframe by 'corpus' and 'file'
clean_df %>%
  dplyr::group_by(corpus, file) %>%
  
  # concatenate the cleaned text ('t_clean') into a single string for each group
  dplyr::summarise(text = paste0(t_clean, collapse = " ")) %>%
  
  # remove grouping
  dplyr::ungroup() %>%
  
  # extract the 'text' column
  dplyr::pull(text) %>%
  
  # convert the text to lowercase
  tolower() %>%
  
  # remove extra spaces in the resulting character vector
  stringr::str_squish() -> ctexts  # Assign the cleaned and formatted text to 'ctexts'
  
# inspect the structure of the resulting character vector
str(ctexts)
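
As a rough illustration of the grouping step above, the sketch below collapses a small invented table of speech units into one lowercase string per file. The table `toy_df` and its content are assumptions made for this example.

```r
library(dplyr)

# invented speech units: two from one file, one from another
toy_df <- tibble(file = c("S1A-001", "S1A-001", "S1A-002"),
                 t_clean = c("Well how are you", "Fine thanks", "Hello there"))

toy_texts <- toy_df %>%
  # one group per file
  group_by(file) %>%
  # paste all speech units of a file into a single string
  summarise(text = paste0(t_clean, collapse = " ")) %>%
  # keep only the text column as a character vector
  pull(text) %>%
  tolower()

toy_texts
```

The result has one element per file, in the order of the sorted group keys.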


Next, we create a file name for each cleaned text so that the texts can be saved as separate files.



In [None]:
# group the 'clean_df' dataframe by 'corpus' and 'file'
clean_df %>%
  dplyr::group_by(corpus, file) %>%
  
  # concatenate the cleaned text ('t_clean') into a single string for each group
  dplyr::summarise(text = paste0(t_clean, collapse = " ")) %>%
  
  # create a new column 'cfile' by combining 'corpus' and 'file', remove spaces and hyphens
  dplyr::mutate(cfile = paste0(corpus, "_", file, ".txt"),
                cfile = stringr::str_remove_all(cfile, " "),
                cfile = stringr::str_remove_all(cfile, "-")) %>%
  
  # remove grouping
  dplyr::ungroup() %>%
  
  # extract the 'cfile' column and remove superfluous white spaces
  dplyr::pull(cfile) %>%
  stringr::str_squish() -> nms

# use the file names as names of the elements of 'ctexts'
names(ctexts) <- nms

# inspect the cleaned and formatted file names
nms


In [None]:
# save result to disk
# use lapply to iterate over the indices of ctexts
# and write each text to a separate file in the MyOutput folder

invisible(lapply(seq_along(ctexts), function(i) {
  # extract the i-th element of ctexts to get the text content
  text_content <- ctexts[[i]]

  # construct the file path using the 'here' package
  file_path <- here::here("notebooks", "MyOutput", names(ctexts)[i])

  # write the text content to the specified file
  writeLines(text = text_content, con = file_path)
}))
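
To check that an export of this kind worked, the files can be listed and read back. The sketch below does this in a temporary folder with two invented texts, so it does not touch the `MyOutput` folder; `demo`, `out_dir`, `written`, and `first_text` are hypothetical names.

```r
# hypothetical named texts standing in for 'ctexts'
demo <- c("ICEIRE_S1A001.txt" = "well how are you",
          "ICEIRE_S1A002.txt" = "hello there")

# temporary folder used only for this illustration
out_dir <- file.path(tempdir(), "MyOutputDemo")
dir.create(out_dir, showWarnings = FALSE)

# write each text to its own file, named after the vector element
for (i in seq_along(demo)) {
  writeLines(demo[[i]], file.path(out_dir, names(demo)[i]))
}

# list the written files and read the first one back
written <- list.files(out_dir)
written
first_text <- readLines(file.path(out_dir, "ICEIRE_S1A001.txt"))
first_text
```

Building the path from separate components with `file.path` avoids depending on a trailing slash in the folder name.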


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>You will find the txt-files in the `MyOutput` folder (located on the left side of the screen).</b> <br><br>Simply double-click the `MyOutput` folder icon, then highlight the files, and choose *Download* from the dropdown menu to download the files. <br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>



***

[Back to LADAL](https://ladal.edu.au/string.html)

***
