![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Analyzing n-grams, collocations, and keyness

This tutorial shows how you can extract and analyze n-grams, collocations, and keyness.

However, please keep in mind that the case studies merely aim to exemplify ways in which R can be used in language-based research rather than providing detailed procedures on how to do corpus-based research.


Collocation refers to the co-occurrence of words. A typical example of a collocation is *Merry Christmas* because the words *merry* and *Christmas* occur together more frequently than would be expected by chance, that is, if words were just randomly strung together.

N-grams are related to collocations in that they are sequences of words that occur together (bigrams are sequences of two words, trigrams of three words, and so on). Fortunately, creating n-gram lists is very easy: to create a bigram list, we can simply take each word and combine it with the following word.
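
To make this idea concrete, here is a minimal sketch in base R. The sentence is a made-up example rather than corpus data, and the object names (`text`, `words`, `bigrams`) are chosen only for illustration.

```r
# toy sentence standing in for a corpus text (made-up example)
text <- "we wish you a merry christmas and a happy new year"
# split the text into words at single spaces
words <- strsplit(text, " ", fixed = TRUE)[[1]]
# combine each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
# inspect
bigrams
```

A text with *n* words yields *n* - 1 bigrams because the last word has no following word to combine with.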

## Preparation and session set up

Activate required packages.


In [None]:
# load packages
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)


## Loading data

To analyze n-grams and collocations, we first need to load some data. In this tutorial, we will use the Australian Corpus of English (ACE), which represents Australian English, and the Brown corpus (BROWN), which represents American English.

Loading corpus data into R consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the ACE data is in a folder called *ACE* and the BROWN data is in a folder called *BROWN*).


In [None]:
# list the ACE files
acefiles <- list.files(here::here("ACE"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# load the files by scanning the content
ace <- sapply(acefiles, function(x){
  x <- scan(x, what = "char",  sep = "", quote = "",  quiet = TRUE,  skipNul = TRUE)
  # combine the scanned words into a single string
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(ace)
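
If you do not have the corpus folders at hand, the two steps can be tried out on throwaway files. The following sketch assumes nothing about ACE or BROWN: it writes two tiny, made-up text files to a temporary folder (`demo_dir`, `demo_files`, and `demo_texts` are hypothetical names) and loads them with base R.

```r
# create a temporary folder with two small text files (made-up stand-ins for corpus files)
demo_dir <- tempfile("demo_corpus_")
dir.create(demo_dir)
writeLines(c("First line of text one.", "Second line."),
           file.path(demo_dir, "text1.txt"))
writeLines("Only line of text two.",
           file.path(demo_dir, "text2.txt"))
# step 1: create a list of the full paths of the files
demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
# step 2: loop over the paths and collapse each file into a single string
demo_texts <- sapply(demo_files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = " ")
})
# inspect: a named character vector with one element per file
str(demo_texts)
```

The result has the same shape as the `ace` object: one string per file, named by the file path.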


Next, we load the BROWN data.



In [None]:
# list the BROWN files
brownfiles <- list.files(here::here("BROWN"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# load the files by scanning the content
brown <- sapply(brownfiles, function(x){
  x <- scan(x, what = "char",  sep = "", quote = "",  quiet = TRUE,  skipNul = TRUE)
  # combine the scanned words into a single string
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(brown)


***

## Using your own data

You can also use your own data. You can see below what you need to do to upload and use your own data.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, your own data is loaded into R and you can then use it in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  # combine the scanned words into a single string
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind though that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**

***

## Cleaning and tokenising

We start by cleaning the corpus data by removing tags, artefacts and non-alpha-numeric characters.


In [None]:
ace_clean <- # remove A01 0001 1 sequences
  stringr::str_remove_all(ace, "[A-Z][0-9]{2,2}") %>%
  # remove tags 
  stringr::str_remove_all("<.*?>") %>%
  # remove non-alphanumeric characters 
  stringr::str_remove_all("[^[:alnum:] ]") %>%
  # replace single characters with a space (so that neighbouring words are not merged)
  stringr::str_replace_all(" \\w ", " ") %>%
  # remove all digits 
  stringr::str_remove_all("\\d") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# inspect
substr(ace_clean[1], start=1, stop=200)
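
To see what such a cleaning pipeline does, it helps to run it on a single string. The sketch below uses base R's `gsub` instead of `stringr` and a made-up line (`raw`) that merely imitates a text code, a tag, and punctuation; it is not taken from the corpus files.

```r
# made-up line imitating a text code, a tag, punctuation, and digits
raw <- "A01 <h>The  Prime Minister's visit</h> cost $ 25 million!"
# replace codes such as A01 with a space
clean <- gsub("[A-Z][0-9]{2}", " ", raw)
# replace tags with a space (non-greedy match)
clean <- gsub("<.*?>", " ", clean, perl = TRUE)
# remove non-alphanumeric characters
clean <- gsub("[^[:alnum:] ]", "", clean)
# replace digits with a space
clean <- gsub("[0-9]", " ", clean)
# remove superfluous white spaces
clean <- gsub("\\s+", " ", trimws(clean))
# inspect
clean
```

Note that removing the apostrophe turns *Minister's* into *Ministers*. Whether such side effects are acceptable depends on the research question.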


Of course, we do the same for the BROWN data.



In [None]:
brown_clean <- # remove A01 0001 1 sequences
  stringr::str_remove_all(brown, "[A-Z][0-9]{2,2}") %>%
  # remove tags 
  stringr::str_remove_all("<.*?>") %>%
  # remove non-alphanumeric characters 
  stringr::str_remove_all("[^[:alnum:] ]") %>%
  # replace single characters with a space (so that neighbouring words are not merged)
  stringr::str_replace_all(" \\w ", " ") %>%
  # remove all digits 
  stringr::str_remove_all("\\d") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# inspect
substr(brown_clean[1], start=1, stop=200)


We now split the clean corpora into individual words.



In [None]:
ace_words <- quanteda::tokenize_fastestword(ace_clean)
brown_words <- quanteda::tokenize_fastestword(brown_clean)
# inspect
str(ace_words)


## Extracting bigrams

Now that we have tokenised the corpus, we can extract n-grams (here: bigrams).


In [None]:
# extract bigrams and tabulate their frequencies
ace_bigrams <- quanteda::tokens_ngrams(tokens(ace_words)) %>%
  unlist() %>%
  tolower() %>%
  table()
# inspect
head(ace_bigrams, 20)


Next, we convert the frequency table into a data frame and order the bigrams by frequency.



In [None]:
# create data frame
ace_bigram_tb <- data.frame(ace_bigrams) %>%
  dplyr::rename(bigram = 1,
                freq = 2) %>%
  dplyr::arrange(-freq) %>%
  dplyr::mutate(corpus = "ace")
# inspect
head(ace_bigram_tb, 20)
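
The counting and sorting step can also be reproduced with base R alone. The following sketch uses a made-up word vector; `freq_tb` is a hypothetical name for the resulting table.

```r
# made-up word vector
words <- c("the", "cat", "saw", "the", "cat", "and", "the", "dog")
# combine each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
# tabulate the bigrams and convert the table into a data frame
freq_tb <- as.data.frame(table(bigram = bigrams),
                         responseName = "freq",
                         stringsAsFactors = FALSE)
# order by descending frequency
freq_tb <- freq_tb[order(-freq_tb$freq), ]
# inspect
head(freq_tb)
```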


Now, we will do the same for the Brown corpus data.



In [None]:
brown_bigram_tb <- quanteda::tokens_ngrams(tokens(brown_words)) %>%
  unlist() %>%
  tolower() %>%
  table() %>%
  data.frame() %>%
  dplyr::rename(bigram = 1,
                freq = 2) %>%
  dplyr::arrange(-freq) %>%
  dplyr::mutate(corpus = "brown")
# inspect
head(brown_bigram_tb, 20)


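We can now combine the two tables. In the same step, we retain only frequencies above 10 and use Fisher's exact test with a Bonferroni-corrected significance threshold to identify bigrams whose frequencies differ significantly between the two corpora.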



In [None]:
bigram_tb <- dplyr::bind_rows(ace_bigram_tb, brown_bigram_tb) %>%
  # retain only frequencies above 10
  dplyr::filter(freq > 10) %>%
  # one column per corpus
  tidyr::spread(corpus, freq) %>%
  dplyr::mutate(ace = tidyr::replace_na(ace, 0),
                brown = tidyr::replace_na(brown, 0),
                ace_all = sum(ace),
                brown_all = sum(brown)) %>%
  # perform a Fisher's exact test for each bigram
  dplyr::rowwise() %>%
  dplyr::mutate(p = fisher.test(matrix(c(ace, brown, ace_all-ace, brown_all-brown), byrow = TRUE, nrow = 2))$p.value,
                oddsratio = unname(fisher.test(matrix(c(ace, brown, ace_all-ace, brown_all-brown), byrow = TRUE, nrow = 2))$estimate)) %>%
  # retain only results that are significant after Bonferroni correction
  dplyr::filter(p < .05/nrow(.)) %>%
  dplyr::arrange(p)
# inspect
head(bigram_tb, 20)
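
The test that is applied to each row can be illustrated for a single bigram. All counts in the sketch below are hypothetical and chosen only to show the layout of the 2x2 table; `n_tests` stands in for the number of bigrams that are tested.

```r
# hypothetical counts for one bigram (not taken from ACE or BROWN)
ace_n     <- 40     # frequency of the bigram in corpus A
brown_n   <- 15     # frequency of the bigram in corpus B
ace_all   <- 10000  # summed frequency of all bigrams in corpus A
brown_all <- 12000  # summed frequency of all bigrams in corpus B
# 2x2 table: the bigram versus all other bigrams, by corpus
tab <- matrix(c(ace_n, brown_n,
                ace_all - ace_n, brown_all - brown_n),
              byrow = TRUE, nrow = 2,
              dimnames = list(c("bigram", "other bigrams"),
                              c("A", "B")))
# Fisher's exact test
ft <- fisher.test(tab)
ft$p.value    # probability of a table at least this extreme under independence
ft$estimate   # odds ratio: values above 1 indicate overuse in corpus A
# Bonferroni correction: divide the significance level by the number of tests
n_tests <- 500
ft$p.value < 0.05 / n_tests
```

The Bonferroni correction is conservative: the more bigrams are tested, the smaller the p-value has to be for a bigram to be retained.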


## Creating a Corpus

We start by creating a table of the ACE data.


In [None]:
ace_df <- data.frame(rep("ACE", length(acefiles)), acefiles, ace) %>%
  # change column names
  dplyr::rename(corpus = 1, 
                file = 2,
                content = 3) %>%
  # shorten file names
  dplyr::mutate(file = stringr::str_remove_all(file, ".*/"),
                file = stringr::str_remove_all(file, ".TXT")) %>%
  # clean corpus data
  dplyr::mutate(content = ace_clean)
# inspect
substr(ace_df[1, 3], start=1, stop=200)


We also create a table of the Brown data.



In [None]:
brown_df <- data.frame(rep("BROWN", length(brownfiles)), brownfiles, brown) %>%
  # change column names
  dplyr::rename(corpus = 1, 
                file = 2,
                content = 3) %>%
  # shorten file names
  dplyr::mutate(file = stringr::str_remove_all(file, ".*/"),
                file = stringr::str_remove_all(file, ".TXT")) %>%
  # clean corpus data
  dplyr::mutate(content = brown_clean)
# inspect
substr(brown_df[1, 3], start=1, stop=200)


We can now combine the two data frames into one.



In [None]:
corpus <- rbind(ace_df, brown_df)
# inspect
str(corpus)


# Comparing the corpora 

There are many ways in which we can compare corpora. Here, we will extract keywords that are characteristic of each of the two corpora and then visualize them in two different ways: in the form of bar graphs and as comparative word clouds.

## Extracting Keywords 

When visualizing keywords in bar graphs, we first need to extract the keywords. This has the advantage that we can tabulate the keywords and inspect them before visualizing them.

To extract the keywords, we combine the two corpora into a single `quanteda` corpus object.


In [None]:
# create a corpus object with file and corpus as document variables
comp_corpus <- quanteda::corpus(corpus$content, docvars = data.frame(file = corpus$file,
                                                                  corpus = corpus$corpus))
# inspect
str(comp_corpus)


In a next step, we convert the data into a document-feature matrix (dfm) which shows how frequent words are in each document (here, the documents are the two corpora because we group the tokens by corpus).



In [None]:
# create a dfm grouped by corpus
corp_dfm <- tokens(comp_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_group(groups = corpus) %>%
  dfm()
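
What a document-feature matrix looks like can be shown with two made-up documents. The sketch below builds the matrix with base R's `table` rather than with `quanteda`, so the resulting object (`dfm_toy`, a hypothetical name) is an ordinary contingency table and not a `quanteda` dfm.

```r
# two made-up documents
docs <- c(doc1 = "the cat sat", doc2 = "the dog sat down")
# split each document into words
tokens_list <- strsplit(docs, " ", fixed = TRUE)
# one row per token with the document it belongs to
long <- data.frame(doc = rep(names(tokens_list), lengths(tokens_list)),
                   feature = unlist(tokens_list, use.names = FALSE))
# rows are documents, columns are features, cells are frequencies
dfm_toy <- table(long$doc, long$feature)
# inspect
dfm_toy
```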


Next, we calculate the keyness of words in the corpora.



In [None]:
# calculate keyness with ACE as the target group
result_keyness <- textstat_keyness(corp_dfm, 
                                   target = "ACE")
# inspect
head(result_keyness)
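
Conceptually, keyness compares how often a word occurs in the target corpus with how often it occurs in the reference corpus. The sketch below computes a signed chi-squared statistic for one word with base R. The counts are hypothetical, and while a chi-squared measure is the default of `textstat_keyness`, the exact computation in that function may differ from this simplified version.

```r
# hypothetical counts (not taken from ACE or BROWN)
target_word  <- 120     # frequency of the word in the target corpus
ref_word     <- 30      # frequency of the word in the reference corpus
target_total <- 100000  # number of words in the target corpus
ref_total    <- 100000  # number of words in the reference corpus
# 2x2 table: the word versus all other words, by corpus
tab <- matrix(c(target_word, ref_word,
                target_total - target_word, ref_total - ref_word),
              byrow = TRUE, nrow = 2,
              dimnames = list(c("word", "other words"),
                              c("target", "reference")))
# chi-squared test (with continuity correction for 2x2 tables)
res <- chisq.test(tab)
# sign the statistic: positive if the word is overused in the target corpus
expected_target <- res$expected["word", "target"]
keyness <- unname(ifelse(target_word > expected_target, 1, -1) * res$statistic)
keyness
```

Words with large positive values are characteristic of the target corpus, words with large negative values are characteristic of the reference corpus.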


We can then visualize the keywords as a bar graph using the `textplot_keyness` function that is provided in the `quanteda.textplots` package.



In [None]:
# Plot estimated word keyness
quanteda.textplots::textplot_keyness(result_keyness) 


## Comparative Wordclouds 


We can use the `textplot_wordcloud` function to generate not only simple wordclouds but also comparative wordclouds that can be used to compare corpora.

To generate a comparative wordcloud, you need to set the argument `comparison` to `TRUE`. In addition, we can specify arguments to *prettify* the wordcloud.


In [None]:
# take the dfm grouped by corpus
corp_dfm %>%
  # create word cloud
  quanteda.textplots::textplot_wordcloud(comparison = TRUE, max_words = 100, rotation = .25)


## Finding collocations

There are various techniques for identifying collocations. To identify collocations without having a pre-defined target term, we can use the `textstat_collocations` function from the `quanteda.textstats` package.

However, before we can apply that function and start identifying collocations, we need to process the data to which we want to apply this function. In the present case, we will apply that function to the sentences in the ACE data which we extract in the code chunk below.


In [None]:
ace_sentences <- ace %>%
  stringr::str_replace_all("[A-Z][0-9]{2,2}", " ") %>%
  tolower() %>%
  # remove tags 
  stringr::str_replace_all("<.*?>", " ") %>%
  # remove words consisting of one or two characters
  stringr::str_replace_all(" \\w{1,2} ", " ") %>%
  # remove all digits 
  stringr::str_replace_all("\\d", " ") %>%
  # remove superfluous white spaces
  stringr::str_squish()%>%
  paste0(collapse= " ") %>%
  stringr::str_split(fixed(".")) %>%
  unlist() %>%
  stringr::str_remove_all("[^[:alnum:] ]") %>%
  stringr::str_squish()
# inspect
head(ace_sentences)


Splitting texts simply at full stops is not optimal, as abbreviations and initials that contain full stops produce unwarranted artifacts such as "sentences" that consist of single characters or very short fragments. Fortunately, these errors do not really matter in the case of our example.

Now that we have split the ACE data into sentences, we can tokenize these sentences and apply the `textstat_collocations` function which identifies collocations.


In [None]:
# create a token object
text_tokens <- tokens(ace_sentences, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english"))
# extract collocations
text_coll <- textstat_collocations(text_tokens, size = 2, min_count = 20)
# inspect
text_coll[1:6, 1:6]
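
The `textstat_collocations` function reports the association measures `lambda` and `z`. To illustrate the underlying idea that collocates occur together more often than expected by chance, the sketch below calculates a different and simpler measure, pointwise mutual information (PMI), for one bigram in a made-up word vector.

```r
# made-up word vector
words <- c("merry", "christmas", "and", "a", "happy", "new", "year",
           "merry", "christmas", "to", "you", "and", "a",
           "merry", "christmas", "to", "all")
# combine each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
# relative frequencies of the two words and of the bigram
p_w1     <- mean(words == "merry")
p_w2     <- mean(words == "christmas")
p_bigram <- mean(bigrams == "merry christmas")
# PMI: observed probability of the bigram relative to the probability
# expected if the two words were independent of each other
pmi <- log2(p_bigram / (p_w1 * p_w2))
pmi
```

A PMI above 0 means that the two words co-occur more often than expected by chance. PMI is known to overrate rare word pairs, which is one reason why a minimum frequency such as `min_count` is useful.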


The resulting table shows the collocations in the ACE data in descending order of collocation strength.


## Visualizing Collocation Networks

Network graphs are a very useful and flexible tool for visualizing relationships between elements such as words, personas, or authors. This section shows how to generate a network graph for collocates of the term *australia* using the `quanteda` package.

In a first step, we generate a document-feature matrix based on the sentences in the ACE data. A document-feature matrix shows how often elements (here, the words that occur in the ACE data) occur in a selection of documents (here, the sentences in the ACE data).


In [None]:
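# create document-feature matrix
text_dfm <- ace_sentences %>% 
    # tokenize the sentences and remove punctuation
    quanteda::tokens(remove_punct = TRUE) %>%
    # remove stopwords
    quanteda::tokens_remove(stopwords("english")) %>%
    quanteda::dfm() %>%
    quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE)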
# inspect
text_dfm[1:6, 1:6]


As we want to generate a network graph of words that collocate with the term *australia*, we use the `calculateCoocStatistics` function to determine which words most strongly collocate with our target term (*australia*).


In [None]:
# load function for co-occurrence calculation
source("https://slcladal.github.io/rscripts/calculateCoocStatistics.R")
# define term
coocTerm <- "australia"
# calculate co-occurrence statistics
coocs <- calculateCoocStatistics(coocTerm, text_dfm, measure="LOGLIK")
# inspect results
coocs[1:20]
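
To give an impression of what a log-likelihood based co-occurrence measure expresses, the sketch below calculates the log-likelihood ratio statistic (G2) for one pair of terms from hypothetical sentence counts. This is a generic textbook formulation; the sourced `calculateCoocStatistics` function may differ in its details.

```r
# hypothetical sentence counts (not taken from the ACE data)
n_sent <- 1000  # sentences in total
n_a    <- 100   # sentences containing the target term
n_b    <- 50    # sentences containing the candidate collocate
n_ab   <- 20    # sentences containing both terms
# observed 2x2 table (rows: collocate present/absent, columns: target present/absent)
observed <- matrix(c(n_ab, n_a - n_ab,
                     n_b - n_ab, n_sent - n_a - n_b + n_ab),
                   nrow = 2)
# expected frequencies if the two terms were independent of each other
expected <- outer(rowSums(observed), colSums(observed)) / n_sent
# log-likelihood ratio statistic
g2 <- 2 * sum(observed * log(observed / expected))
g2
```

Here, the two terms would be expected to co-occur in 5 sentences by chance but co-occur in 20, which results in a high G2 value. Note that this simple version breaks down if a cell of the observed table is 0.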


We now reduce the document-feature matrix to contain only the top 20 collocates of *australia* (plus our target word *australia*).



In [None]:
redux_dfm <- dfm_select(text_dfm, 
                        pattern = c(names(coocs)[1:20], "australia"))


Now, we can transform the document-feature matrix into a feature-co-occurrence matrix as shown below. A feature-co-occurrence matrix shows how often each element in that matrix co-occurs with every other element in that matrix.



In [None]:
tag_fcm <- fcm(redux_dfm)
# inspect
tag_fcm[1:6, 1:6]
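
The step from a document-feature matrix to a feature-co-occurrence matrix can be sketched with base R. The matrix below is made up, and the sketch simply counts the sentences in which two features occur together; the counts and the layout of the object that `fcm` returns may differ from this simplified version.

```r
# made-up document-feature matrix: 3 sentences x 3 features
dfm_toy <- matrix(c(1, 1, 0,
                    1, 0, 1,
                    1, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("s", 1:3),
                                  c("australia", "sydney", "government")))
# 1 if a feature occurs in a sentence, 0 otherwise
bin <- (dfm_toy > 0) * 1
# number of sentences in which each pair of features occurs together
fcm_toy <- crossprod(bin)
# a feature does not co-occur with itself
diag(fcm_toy) <- 0
# inspect
fcm_toy
```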


Using the feature-co-occurrence matrix, we can generate the network graph which shows the terms that collocate with the target term *australia* with the edges representing the co-occurrence frequency. To generate this network graph, we use the `textplot_network` function from the `quanteda.textplots` package.



In [None]:
# generate network graph
textplot_network(tag_fcm, 
                 min_freq = 5, 
                 edge_alpha = 0.1, 
                 edge_size = 5,
                 edge_color = "purple",
                 vertex_labelsize = log(colSums(tag_fcm)))


## Keyness

Another common method that can be used for automated text summarization is keyword extraction. Keyword extraction builds on identifying words that are particularly associated with a certain text. In other words, keyness analysis aims to identify words that are particularly indicative of the content of a certain text.

Below, we identify key words for the ACE and the Brown corpus. We start by creating a weighted document-feature matrix from a corpus containing the texts of both corpora.

In order to create the corpus, we use the cleaned text objects, which consist of one element per corpus file. Thus, in a first step, we create a corpus of the texts.


In [None]:
corp_dom <- quanteda::corpus(c(ace_clean, brown_clean)) 
# add the corpus name as a document variable
docvars(corp_dom, "Corpus") <- c(rep("ACE", length(ace_clean)), 
                                 rep("BROWN", length(brown_clean)))


Next, we generate the document-feature matrix and we clean it by removing stopwords and selected other words. In addition, we group the document-feature matrix by corpus and weight it so that it contains relative rather than absolute frequencies.



In [None]:
dfm_corpus <- corp_dom %>%
  quanteda::tokens(remove_punct = TRUE) %>%
  quanteda::tokens_remove(quanteda::stopwords("english")) %>%
  quanteda::tokens_remove(c("now", "one", "like", "may", "can", "said", "even")) %>%
  quanteda::dfm() %>%
  quanteda::dfm_group(groups = Corpus) %>%
  quanteda::dfm_weight(scheme = "prop")
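
Proportional weighting divides each frequency by the total number of tokens in its row so that groups of different sizes become comparable. The sketch below shows this with base R on made-up counts (`counts`, `rel`, and `top_by_group` are hypothetical names).

```r
# made-up frequencies of three features in two groups
counts <- matrix(c(30, 10, 60,
                    5, 20, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("ACE", "BROWN"),
                                 c("australia", "state", "time")))
# divide each frequency by the sum of its row
rel <- prop.table(counts, margin = 1)
# the relative frequencies in each row add up to 1
rowSums(rel)
# the two most frequent features in each group
top_by_group <- apply(rel, 1, function(r) names(sort(r, decreasing = TRUE))[1:2])
top_by_group
```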


In a next step, we use the `textstat_frequency` function from the `quanteda.textstats` package to extract the most frequent non-stopwords in the two corpora.



In [None]:
# calculate relative frequency by corpus
freq_weight <- quanteda.textstats::textstat_frequency(dfm_corpus, 
                                                      n = 10,
                                                      groups = dfm_corpus$Corpus)


Now, we can simply plot the most common non-stopwords in the two corpora.



In [None]:
ggplot(freq_weight, aes(nrow(freq_weight):1, frequency)) +
     geom_point() +
     facet_wrap(~ group, scales = "free") +
     coord_flip() +
     scale_x_continuous(breaks = nrow(freq_weight):1,
                        labels = freq_weight$feature) +
     labs(x = NULL, y = "Relative frequency")


## Outro


We end the session by calling the session info, which tells us which packages and which versions of the software and packages we have used.


In [None]:
sessionInfo()



***

[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)

***
