![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Case Study: Comparing British and American English

This tutorial shows how you can compare two corpora: the BROWN corpus, which represents American English, and the LOB corpus, which represents British English.

However, please keep in mind that the case studies merely aim to exemplify ways in which R can be used in language-based research rather than to provide detailed procedures for corpus-based research.



## Preparation and session set up

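Load the required packages. The `here` and `quanteda.textplots` packages are called further below via `::`, so they need to be installed but do not need to be loaded.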


In [None]:
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(quanteda.textstats)


## Loading corpus data into R

For the present tutorial, we will load the BROWN and the LOB corpus. Loading corpus data into R consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the BROWN data is in a folder called *BROWN* and the LOB data is in a folder called *LOB*).


In [None]:
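# list the full paths of all files ending in .TXT
# (pattern is a regular expression, so the dot is escaped)
brownfiles <- list.files(here::here("BROWN"), 
                          pattern = "\\.TXT$",
                          full.names = TRUE)
lobfiles <- list.files(here::here("LOB"), 
                          pattern = "\\.TXT$",
                          full.names = TRUE)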


You can then use the `sapply` function to loop over the paths and load the data into R using, for example, the `scan` function as shown below. In addition to loading the file content, we also paste all the content together using the `paste0` function and remove superfluous white spaces using the `str_squish` function from the `stringr` package.



In [None]:
# load each BROWN file and collapse its content into a single string
brown <- sapply(brownfiles, function(x){
  x <- scan(x, what = "char", sep = "", quote = "", 
            quiet = TRUE, skipNul = TRUE)
  x <- paste0(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# load each LOB file in the same way
lob <- sapply(lobfiles, function(x){
  x <- scan(x, what = "char", sep = "", quote = "", 
            quiet = TRUE, skipNul = TRUE)
  x <- paste0(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect data
str(brown); str(lob)


Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.
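As a quick check of the two loading steps, the sketch below runs them on throwaway files in a temporary folder rather than on the real corpora. The folder name, file names, and file contents are invented for illustration; only `list.files`, `scan`, `paste0`, and `str_squish` are taken from the tutorial.

```r
library(stringr)

# create a temporary folder with invented toy files (assumption: stand-ins for corpus files)
toydir <- file.path(tempdir(), "TOYCORPUS")
dir.create(toydir, showWarnings = FALSE)
writeLines(c("A01 0010 1  The   first", "A01 0020 1  toy text."),
           file.path(toydir, "A01.TXT"))
writeLines(c("B01 0010 1  Another", "toy   text."),
           file.path(toydir, "B01.TXT"))
writeLines("not a corpus file", file.path(toydir, "README.md"))

# step 1: list the paths of all files whose names end in .TXT
toyfiles <- list.files(toydir, pattern = "\\.TXT$", full.names = TRUE)

# step 2: loop over the paths and load each file as one string
toy <- sapply(toyfiles, function(f) {
  words <- scan(f, what = "char", sep = "", quote = "",
                quiet = TRUE, skipNul = TRUE)
  stringr::str_squish(paste0(words, collapse = " "))
})

# inspect: one element per .TXT file, named by its path
basename(names(toy))
unname(toy)
```

The `README.md` file is skipped because its name does not match the pattern, and each remaining file ends up as a single string with the line breaks and repeated spaces collapsed.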

***

## Using your own data

You can also use your own data. You can see below what you need to do to upload and use your own data.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, you will load your own data into R and you can then use it in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  x <- paste0(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**

***


## Data processing

In a first step, we clean the data. To do this, we inspect the first 200 characters of the data to see what it looks like and what we need to remove.  


In [None]:
# inspect first 200 characters of the brown corpus
substr(brown[1], start=1, stop=200)
# inspect first 200 characters of the lob corpus
substr(lob[1], start=1, stop=200)


The inspection shows that the BROWN data contains line codes like `A01 0010 1`. The LOB corpus also contains line codes like `A01 1` or `A01 2` as well as tags (pointy brackets and asterisks). As these sequences are not part of the texts themselves, we remove them from the corpus data.
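To see what such a removal does before applying it to the full data, the sketch below applies a line-code pattern to a short string. The sample string is invented for illustration and is not taken from the corpus; the pattern follows the one used for the BROWN data in this tutorial.

```r
library(stringr)

# invented sample in the style of the BROWN line codes (not real corpus text)
sample_text <- "A01 0010 1 The jury said A01 0020 1 it was fine "

# one capital letter and two digits, a four-digit line number,
# a short counter, and a trailing space
line_code <- "[A-Z][0-9]{2} [0-9]{4} [0-9]{1,5} "

# remove every line code, then trim superfluous white space
cleaned <- stringr::str_squish(stringr::str_remove_all(sample_text, line_code))
cleaned
# "The jury said it was fine"
```

Testing a pattern on a small string like this makes it easier to spot cases where the pattern removes too much or too little.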

We start by cleaning the BROWN data.


In [None]:
brown <- data.frame(rep("BROWN", length(names(brown))), names(brown), brown) %>%
  # change column names
  dplyr::rename(corpus = 1, 
                file = 2,
                content = 3) %>%
  # shorten file names (drop the path and the .TXT extension)
  dplyr::mutate(file = stringr::str_remove_all(file, ".*/"),
                file = stringr::str_remove_all(file, "\\.TXT$")) %>%
  # clean corpus data: remove line codes, then everything except letters, digits, and spaces
  dplyr::mutate(content = stringr::str_remove_all(content, "[A-Z][0-9]{2} [0-9]{4} [0-9]{1,5} "),
                content = stringr::str_remove_all(content, "[^[:alnum:] ]"),
                # remove single-character tokens without merging the neighbouring words
                content = stringr::str_replace_all(content, " \\w(?= )", ""))
# inspect
substr(brown[1, 3], start=1, stop=200)


We also clean the LOB data.



In [None]:
lob <- data.frame(rep("LOB", length(names(lob))), names(lob), lob) %>%
  # change column names
  dplyr::rename(corpus = 1, 
                file = 2,
                content = 3) %>%
  # shorten file names (drop the path and the .TXT extension)
  dplyr::mutate(file = stringr::str_remove_all(file, ".*/"),
                file = stringr::str_remove_all(file, "\\.TXT$")) %>%
  # clean corpus data: remove line codes, then everything except letters, digits, and spaces
  dplyr::mutate(content = stringr::str_remove_all(content, "[A-Z][0-9]{2,5} {0,5}[0-9]{1,5} "),
                content = stringr::str_remove_all(content, "[^[:alnum:] ]")) %>%
  # remove remnants of tags and codes
  dplyr::mutate(content = stringr::str_replace_all(content, " [0-9]{1,}(\\w{1,}) ", " \\1 "),
                content = stringr::str_remove_all(content, "[A-Z]{1,3}[0-9]{1,3}"),
                content = stringr::str_remove_all(content, "[0-9]{1,3}[A-Z]{0,3}[a-z]{0,3}"),
                # remove single-character tokens without merging the neighbouring words
                content = stringr::str_replace_all(content, " \\w(?= )", ""))
# inspect
substr(lob[1, 3], start=1, stop=200)


We can now combine the two data frames into one.



In [None]:
corpus <- rbind(brown, lob)
# inspect
str(corpus)


# Comparing the corpora 

There are many ways in which we can compare corpora. Here, we will extract keywords that are characteristic of each of the two corpora and then visualize them in two different ways: in the form of bar graphs and as comparative word clouds.

## Extracting Keywords 

When visualizing keywords in bar graphs, we first need to extract the keywords. This has the advantage that we can tabulate the keywords and inspect them before visualizing them.

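To extract the keywords, we first convert the combined data frame into a `quanteda` corpus object that stores the file and corpus names as document variables.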


In [None]:
# create a quanteda corpus with file and corpus as document variables
comp_corpus <- quanteda::corpus(corpus$content, docvars = data.frame(file = corpus$file,
                                                                  corpus = corpus$corpus))
# inspect
str(comp_corpus)


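In a next step, we convert the data into a document-feature matrix (dfm), which shows how frequent each word is in each document. Because we group the documents by corpus, the dfm here has one row for BROWN and one row for LOB.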
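To make the structure of a grouped dfm more concrete, the sketch below builds one from three invented mini-documents. The texts and the grouping variable `variety` are assumptions made up for illustration; the functions are the same `quanteda` functions used in this tutorial.

```r
library(quanteda)

# three invented mini-documents with a grouping variable (assumption: toy data)
toy_corpus <- quanteda::corpus(
  c(d1 = "the colour of the harbour",
    d2 = "the color of the harbor",
    d3 = "a grey colour"),
  docvars = data.frame(variety = c("BrE", "AmE", "BrE"))
)

# tokenize, merge the documents that share a variety, and count the features
toy_tokens <- quanteda::tokens(toy_corpus, remove_punct = TRUE)
toy_tokens <- quanteda::tokens_group(toy_tokens, groups = variety)
toy_dfm <- quanteda::dfm(toy_tokens)

# rows are the groups, columns are word types, cells are frequencies
toy_dfm
```

After grouping, the three documents collapse into two rows, one per variety, and the frequencies of documents in the same group are added up.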



In [None]:
# create a dfm grouped by corpus (BROWN vs. LOB)
corp_dfm <- tokens(comp_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_group(groups = corpus) %>%
  dfm()


Next, we calculate the keyness of words in the corpora.
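Keyness compares how often a word occurs in a target corpus with how often it occurs in a reference corpus. As far as we are aware, `textstat_keyness` uses a chi-squared measure by default; the base R sketch below illustrates that idea for a single word and is not a re-implementation of the `quanteda` function. All counts are invented for illustration.

```r
# invented counts (assumption: purely illustrative, not taken from LOB or BROWN)
word_target    <- 120     # occurrences of the word in the target corpus
word_reference <- 40      # occurrences of the word in the reference corpus
size_target    <- 100000  # total number of words in the target corpus
size_reference <- 100000  # total number of words in the reference corpus

# 2x2 contingency table: the word versus all other words, per corpus
tab <- matrix(c(word_target, size_target - word_target,
                word_reference, size_reference - word_reference),
              nrow = 2, byrow = TRUE,
              dimnames = list(corpus = c("target", "reference"),
                              token = c("word", "other")))

# chi-squared test of independence
# (R applies the Yates continuity correction to 2x2 tables by default)
test <- chisq.test(tab)

# sign the statistic: positive if the word is over-represented in the target
overused <- tab["target", "word"] > test$expected["target", "word"]
keyness <- if (overused) unname(test$statistic) else -unname(test$statistic)

# inspect the signed statistic and the p-value
keyness
test$p.value
```

A large positive value indicates a word that is characteristic of the target corpus, while a large negative value indicates a word that is characteristic of the reference corpus.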



In [None]:
# calculate keyness with LOB as the target group (BROWN is the reference)
result_keyness <- textstat_keyness(corp_dfm, 
                                   target = "LOB")
# inspect
head(result_keyness)


We can then visualize the keywords as a bar graph using the `textplot_keyness` function that is provided in the `quanteda.textplots` package.
 


In [None]:
# Plot estimated word keyness
quanteda.textplots::textplot_keyness(result_keyness) 


## Comparative Wordclouds 


We can use the `textplot_wordcloud` function to generate not only simple wordclouds but also comparative wordclouds that can be used to compare corpora.

To generate a comparative wordcloud, you need to set the argument `comparison` to `TRUE`; additional arguments can be specified to *prettify* the wordcloud.


In [None]:
# use the dfm grouped by corpus
corp_dfm %>%
  # create word cloud
  quanteda.textplots::textplot_wordcloud(comparison = TRUE, max_words = 100, rotation = .25)


We end the session by calling the session info, which tells us which packages and which versions of the software and packages we have used.



In [None]:
sessionInfo()



***

[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)

***
