![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)

# String processing and cleaning data in R

This tutorial is the interactive Jupyter notebook accompanying the [*Language Technology and Data Analysis Laboratory* (LADAL) tutorial *String Processing in R*](https://ladal.edu.au/coll.html). 


**Preparation and session set up**

We set up our session by activating the packages we need for this tutorial. 


In [None]:
# set options
options(warn=-1)  # suppress warnings
# load packages
library(dplyr)         # data manipulation and transformation


## Using your own data

<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>

Here, you can **use your own data**. Click on the folder called `MyTexts` (in the menu to the left of the screen) and drag and drop your txt-files into it. Executing the code chunk below then loads your data so that you can use it in this notebook.<br>
<br>
You can upload <b>only txt-files</b> (simple unformatted files created in or saved by a text editor)! The notebook assumes that you upload some form of text data - not tabular data! <br>
<br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
# load function that helps loading texts
source("https://slcladal.github.io/rscripts/loadtxts.R")
# load texts
texts <- loadtxts("notebooks/MyTexts")
# inspect the structure of the text object
str(texts)
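
`loadtxts` is a LADAL helper whose internals are not shown in this notebook. As a rough base-R sketch of what such a loader might do (an assumption, not the helper's actual implementation), the snippet below writes two small demo files to a temporary folder and reads each one into a single string. The temporary folder and the object name `demo_texts` are made up for illustration.

```r
# create a temporary folder with two hypothetical demo files
demo_dir <- tempfile("MyTextsDemo")
dir.create(demo_dir)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

# list all txt-files with their full paths
demo_paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# read each file and collapse its lines into one string;
# sapply() uses the paths as names of the resulting character vector
demo_texts <- sapply(demo_paths, function(p) {
  paste(readLines(p, warn = FALSE), collapse = " ")
})

# inspect
str(demo_texts)
```

In this sketch the names of the resulting vector are the file paths, which is why the reformatting step below strips the folder and the file extension.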


## Reformatting

In a first step, we generate file names from the paths. 


In [None]:
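# extract names of texts: drop the folder part, then the ".txt" extension
# (the dot is escaped and anchored so that only a literal ".txt" ending is removed)
nms <- names(texts) %>%
  stringr::str_remove_all(".*/") %>%
  stringr::str_remove_all("\\.txt$")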
# inspect names
nms
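
The following small sketch shows why the dot in the extension pattern is worth escaping. In a regular expression an unescaped `.` matches any character, so `".txt"` can also remove material from the middle of a file name. The paths are hypothetical and serve only as an illustration.

```r
library(stringr)

# hypothetical paths for illustration only
demo_paths <- c("notebooks/MyTexts/text1.txt",
                "notebooks/MyTexts/atxt_notes.txt")

# remove the folder part
demo_files <- str_remove_all(demo_paths, ".*/")

# unescaped dot: any character followed by "txt" is removed
loose <- str_remove_all(demo_files, ".txt")

# escaped and anchored: only a literal ".txt" at the end is removed
strict <- str_remove_all(demo_files, "\\.txt$")

# base-R alternative without regular expressions
base_nms <- tools::file_path_sans_ext(basename(demo_paths))

loose   # "text1" "_notes"
strict  # "text1" "atxt_notes"
```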


Now, we split the text into speech units (this is optional).



In [None]:
# replace instances of "<S1A" (in upper or lower case) with "~~~<S1A", then split the resulting text at "~~~"
texts_split <- stringr::str_replace_all(texts, "<[Ss]1[Aa]", "~~~<S1A") %>%
  stringr::str_split("~~~")
# inspect
str(texts_split)
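
To make the mark-and-split idea concrete, here is a minimal sketch on a made-up string (the speech-unit headers are hypothetical). It also illustrates that inside square brackets `|` is a literal character rather than "or", so `[S|s]` matches `S`, `s`, and `|`.

```r
library(stringr)

# hypothetical text with two speech-unit headers
demo_text <- "<S1A-001 $A> hello there <s1a-001 $B> hi"

# mark each header with "~~~", then split at the marker
marked <- str_replace_all(demo_text, "<[Ss]1[Aa]", "~~~<S1A")
units <- str_split(marked, "~~~")

# inspect: a list with one character vector per input text
str(units)

# "[S|s]" also accepts a literal "|", "[Ss]" does not
pipe_loose <- str_detect("<|1a", "<[S|s]1[A|a]")   # TRUE
pipe_strict <- str_detect("<|1a", "<[Ss]1[Aa]")    # FALSE
```

Because the text starts with a header, the first element after splitting is an empty string. It can be filtered out later if it is not wanted.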


The text is now in a list format, with each speech unit being an element in the list.

We now repeat each file name as many times as its text has speech units, because we want to create a data frame with the file, the original text, and the clean text.


In [None]:
# extract how many items are in each text
lngth <- lengths(texts_split)
# repeat name as many times as there are elements
files <- rep(nms, lngth) 
# inspect
table(files)
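
As a small illustration of this step, the sketch below uses a made-up list of split texts. `rep()` with a vector of counts repeats each name by the matching count. All object names starting with `demo_` are hypothetical.

```r
# hypothetical split texts: two units in the first text, three in the second
demo_split <- list(c("u1", "u2"), c("u1", "u2", "u3"))
demo_nms <- c("text1", "text2")

# number of speech units per text
demo_lngth <- lengths(demo_split)

# repeat each name as often as its text has units
demo_files <- rep(demo_nms, demo_lngth)

# inspect
table(demo_files)
```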


We now combine the file names and the split texts into a data frame.



In [None]:
# create a data frame of split texts
stexts <- unlist(texts_split) %>% tibble()
# add the file names
stexts_df <- data.frame(files, stexts) %>%
  # add column names
  dplyr::rename(file = 1, 
                text = 2)
# inspect the first 10 rows of the created data frame
head(stexts_df, 10)
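
An alternative sketch, assuming the same kind of input, builds the table in one step with named columns and drops the empty strings that splitting leaves at the start of each text. The data and object names are made up for illustration. Dropping empty rows is an optional extra that the notebook itself does not do.

```r
library(dplyr)

# hypothetical split texts, each starting with an empty element
demo_split <- list(c("", "<S1A-001 $A> hello", "<S1A-001 $B> hi"),
                   c("", "<S1A-002 $A> bye"))
demo_files <- rep(c("text1", "text2"), lengths(demo_split))

# build the table with explicit column names and drop empty units
demo_df <- tibble(file = demo_files, text = unlist(demo_split)) %>%
  filter(text != "")

# inspect
demo_df
```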


## Cleaning

Now that the data is in a tabular format, cleaning is straightforward. We work on the data frame and use `str_remove_all` and `str_replace_all` to strip unwanted character sequences from the column contents. The two functions differ in what they require:

+ `str_remove_all` takes the column to clean and the pattern to remove.  

+ `str_replace_all` additionally takes the replacement for each match of the pattern.  
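
The difference between the two functions can be sketched on a single made-up string. Removing a pattern gives the same result as replacing it with an empty string.

```r
library(stringr)

# hypothetical speech unit with two tags
demo <- "<S1A-001 $A> Hello <#> there"

# remove every tag
removed <- str_remove_all(demo, "<.*?>")

# replacing with "" is equivalent to removing
replaced_empty <- str_replace_all(demo, "<.*?>", "")

# replacing with a visible placeholder instead
replaced_tag <- str_replace_all(demo, "<.*?>", "[TAG]")

removed       # " Hello  there"
replaced_tag  # "[TAG] Hello [TAG] there"
```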


In [None]:
# add a cleaned version of the 'text' column to the data frame
stexts_df %>%
  
  # clean the 'text' column by removing everything between "<" and the next ">"
  dplyr::mutate(text_clean = stringr::str_remove_all(text, "<.*?>"),
                
                # convert 'text_clean' to lower case
                text_clean = tolower(text_clean)) -> clean_df  # assign the result to 'clean_df'
  
# inspect the first 10 rows of the cleaned data frame
head(clean_df, 10)
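
The `?` in `"<.*?>"` makes the match lazy, so each tag is removed on its own. A brief sketch on a made-up string shows what the greedy variant `"<.*>"` would do instead: it matches from the first `<` to the last `>` and deletes the words in between.

```r
library(stringr)

# hypothetical speech unit with two tags
demo <- "<S1A-002 $B> Well <,> I Think So"

# lazy: each tag is matched separately
lazy <- str_remove_all(demo, "<.*?>")

# greedy: one match spanning from the first "<" to the last ">"
greedy <- str_remove_all(demo, "<.*>")

# lower-case the lazy result and collapse repeated spaces
clean <- str_squish(tolower(lazy))

lazy    # " Well  I Think So"
greedy  # " I Think So"
clean   # "well i think so"
```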


In the next step, we aggregate the cleaned texts from the `text_clean` column by file so that we obtain a single clean text per file. We then extract the cleaned texts and store them in an object named `ctexts`.



In [None]:
# group the 'clean_df' data frame by 'file'
clean_df %>%
  dplyr::group_by(file) %>%
  
  # concatenate the cleaned text ('text_clean') into a single string for each file
  dplyr::summarise(text = paste0(text_clean, collapse = " ")) %>%
  
  # remove grouping
  dplyr::ungroup() %>%
  
  # extract the 'text' column ('text_clean' is already lower case)
  dplyr::pull(text) %>%
  
  # remove extra spaces in the resulting character vector
  stringr::str_squish() -> ctexts  # assign the cleaned and formatted text to 'ctexts'
  
# inspect the structure of the resulting character vector
str(ctexts)
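
The aggregation can be sketched on a small made-up table: all rows of a file are pasted into one string, and `str_squish()` then trims the ends and collapses repeated spaces. The data and object names are hypothetical.

```r
library(dplyr)
library(stringr)

# hypothetical cleaned speech units
demo_clean <- tibble(file = c("text1", "text1", "text2"),
                     text_clean = c(" hello  there", " hi ", " bye"))

demo_ctexts <- demo_clean %>%
  group_by(file) %>%
  # one string per file
  summarise(text = paste0(text_clean, collapse = " ")) %>%
  pull(text) %>%
  # trim and collapse whitespace
  str_squish()

demo_ctexts  # "hello there hi" "bye"
```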


We now add names to the cleaned texts.



In [None]:
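# name each cleaned text after its file; this assumes that 'nms' is in the
# same order as the grouped output above (sorted by file name)
names(ctexts) <- nms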
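
Because `summarise()` returns the groups sorted by file name, assigning names by position relies on the name vector being in that same order. A hedged alternative sketch takes the names directly from the summarised table, so texts and names cannot get out of step. The data and object names are made up for illustration.

```r
library(dplyr)

# hypothetical cleaned units whose files are not in alphabetical order
demo_clean <- tibble(file = c("b_text", "a_text", "b_text"),
                     text_clean = c("b one", "a one", "b two"))

# one row per file, sorted by file name
demo_summary <- demo_clean %>%
  group_by(file) %>%
  summarise(text = paste0(text_clean, collapse = " "))

# take the names from the same table as the texts
demo_ctexts <- setNames(demo_summary$text, demo_summary$file)

demo_ctexts
```

Here `unique(demo_clean$file)` would be `"b_text" "a_text"`, which would label the texts the wrong way round if it were assigned by position.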



## Saving to MyOutput

As a final step, we save the results (one file per cleaned text) in the `MyOutput` folder, which is visible on the left side of the screen.


In [None]:
# load function that helps saving texts
source("https://slcladal.github.io/rscripts/savetxts.R")
savetxts(ctexts)
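
`savetxts` is a LADAL helper whose internals are not shown in this notebook. As a rough base-R sketch of the same idea (an assumption, not the helper's actual implementation), each element of a named character vector can be written to its own txt-file. The temporary output folder and the demo texts are hypothetical.

```r
# hypothetical named vector of cleaned texts
demo_ctexts <- c(text1 = "hello there hi", text2 = "bye")

# write to a temporary folder so that nothing in the notebook is overwritten
out_dir <- tempfile("MyOutputDemo")
dir.create(out_dir)

# one file per text, named after the element
for (nm in names(demo_ctexts)) {
  writeLines(demo_ctexts[[nm]], file.path(out_dir, paste0(nm, ".txt")))
}

# inspect the written files
list.files(out_dir)
```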


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>You will find the txt-files in the <code>MyOutput</code> folder (located on the left side of the screen).</b> <br><br>Simply double-click the <code>MyOutput</code> folder icon, then highlight the files, and choose <i>Download</i> from the dropdown menu to download the files. <br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>



***

[Back to LADAL](https://ladal.edu.au/string.html)

***
