![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Topic modelling

This tutorial shows how you can perform topic modeling in R.

Please keep in mind that the case studies merely aim to exemplify ways in which R can be used in language-based research, rather than to provide detailed procedures for doing corpus-based research.

Topic modeling is a technique used in data analysis to identify and extract topics or themes from a large collection of texts. It is a way to automatically identify patterns and insights in large amounts of unstructured data.

Topic modeling is useful in a variety of fields, such as social sciences, humanities, and marketing research. For example, it can be used to analyze customer reviews to identify common themes, to study political speeches to identify key issues and topics, or to analyze social media data to understand public opinion on a particular topic.

The technique works by analyzing the frequency of words in a text corpus and grouping words that frequently co-occur in the same texts into topics. These topics can then be interpreted and labeled based on the words that are most strongly associated with them. The result is a set of topics that represent the most prominent themes in the text corpus.
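
The co-occurrence idea can be illustrated with a minimal base R sketch. This is not a topic model: it only counts how often two words appear in the same document, which is the kind of information a topic model builds on. The four texts are invented for illustration and are not taken from ICLE.

```r
# toy corpus (invented for illustration; not ICLE data)
texts <- c(d1 = "students attend university courses",
           d2 = "university students take courses",
           d3 = "parents stay home with mother and father",
           d4 = "mother and father are at home")

# split each text into lower-case words
words <- strsplit(tolower(texts), "\\s+")

# document-term matrix: rows = documents, columns = words, cells = counts
vocab <- sort(unique(unlist(words)))
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))

# word-by-word co-occurrence: in how many documents do two words appear together?
present <- (dtm > 0) * 1
cooc <- t(present) %*% present

# "students" shares documents with "university" and "courses" but not with "home"
cooc["students", c("university", "courses", "home")]
# "mother" shares documents with "father" and "home" but not with "courses"
cooc["mother", c("father", "home", "courses")]
```

In this toy example, the education words and the family words form two groups that never share a document, which is the pattern a topic model would pick up as two topics.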

## Preparation and session set up

We start by activating the required packages.


In [None]:
# load packages
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(seededlda)


## Loading data

To fit a topic model, we first need to load some data. In this tutorial, we will use the essays written by German and Spanish learners of English provided in the International Corpus of Learner English (ICLE).

Loading corpus data into R consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the ICLE data is in a folder called *ICLE*).


In [None]:
# list the ICLE files of the German (GE) and Spanish (SP) learners
corpusfiles <- list.files(here::here("ICLE"), # path to the corpus data
                       pattern = "GE|SP",
                       # full paths - not just the names of the files
                       full.names = T) 
# load the files by scanning the content
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x, what = "char",  sep = "", quote = "",  quiet = T,  skipNul = T)
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)
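
If you do not have the ICLE data at hand, the two loading steps can be tried out with a small self-contained sketch. The folder, the file names, and the file contents below are invented stand-ins, and `readLines()` is used here instead of `scan()` simply to keep the example short.

```r
# create a temporary folder with two small text files
# (file names and contents are invented stand-ins for corpus files)
demo_dir <- file.path(tempdir(), "demo_corpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("First   line of text one.", "Second line."),
           file.path(demo_dir, "GE_demo1.txt"))
writeLines("Only line   of text two.",
           file.path(demo_dir, "SP_demo2.txt"))

# step 1: create a vector of file paths
demo_files <- list.files(demo_dir, pattern = "GE|SP", full.names = TRUE)

# step 2: loop over the paths and read each file into a single string
demo_corpus <- sapply(demo_files, function(f) {
  x <- readLines(f, warn = FALSE)
  x <- paste(x, collapse = " ")
  gsub("\\s+", " ", trimws(x))   # collapse repeated white space
})

# one element per file, named by its path
length(demo_corpus)
unname(demo_corpus)
```

The result is a named character vector with one element per file, which is the same structure as the `corpus` object above.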


***

## Using your own data

You can also use your own data. The steps below show how to upload your files and load them into R.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, your files are loaded into R and you can use them in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = T) 
# load files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T)
  x <- paste(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**

***

## Cleaning and tokenising

We start by cleaning the corpus data by removing tags (anything enclosed in angle brackets) and superfluous white space.


In [None]:
# remove tags (anything enclosed in angle brackets)
corpus_clean <- stringr::str_remove_all(corpus, "<.*?>") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# inspect
substr(corpus_clean[1], start=1, stop=200)
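
To see what the regular expression `<.*?>` does, here is a small base R sketch on an invented string (the tag names are made up and need not match the real ICLE mark-up). `gsub()` and `trimws()` stand in for `str_remove_all()` and `str_squish()`.

```r
# invented example string with mark-up tags and irregular spacing
raw <- "<ICLE-GE-0001>  Some people  think <quote>television</quote> is bad . "

# non-greedy match: each tag from < to the next > is removed separately
no_tags <- gsub("<.*?>", "", raw, perl = TRUE)
# collapse runs of white space and trim both ends
clean <- trimws(gsub("\\s+", " ", no_tags))
clean

# for comparison: a greedy match removes everything from the first < to the last >
greedy <- trimws(gsub("<.*>", "", raw))
greedy
```

The comparison shows why the question mark matters: without it, the text between the first and the last tag would be deleted together with the tags.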


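We now split the cleaned texts into individual words (tokens), remove stopwords, and create a document-feature matrix. The matrix is then trimmed so that it only retains terms that are frequent overall but occur in no more than 10 percent of the documents.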



In [None]:
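# keep the file paths as document names in case the string functions dropped them
names(corpus_clean) <- names(corpus)
# tokenise the cleaned texts and remove punctuation, numbers, and symbols
toks_corpus <- tokens(corpus_clean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove English stopwords and a few additional patterns
toks_corpus <- tokens_remove(toks_corpus, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))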
dfmat_corpus <- dfm(toks_corpus) %>% 
              dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile",
                       max_docfreq = 0.1, docfreq_type = "prop")
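
The logic of the trimming step can be sketched in base R on an invented count matrix. This is only an approximation of what `dfm_trim()` does with these settings: the way quanteda computes the quantile may differ in detail, and the words and counts below are made up.

```r
# invented document-term counts: 20 documents, 10 words
n_docs <- 20
vocab  <- c("people", "somalia", "exhaust", paste0("rare", 1:7))
dtm <- matrix(0, nrow = n_docs, ncol = length(vocab),
              dimnames = list(paste0("doc", 1:n_docs), vocab))
dtm[, "people"]     <- 3    # 60 occurrences, spread over every document
dtm[1:2, "somalia"] <- 12   # 24 occurrences, concentrated in 2 documents
dtm[3:4, "exhaust"] <- 12   # 24 occurrences, concentrated in 2 documents
for (i in 1:7) dtm[i, paste0("rare", i)] <- 1   # seven words occurring once

term_freq <- colSums(dtm)                # total occurrences per word
doc_prop  <- colSums(dtm > 0) / n_docs   # share of documents containing the word

# keep words at or above the 80th percentile of term frequency
# that occur in no more than 10 percent of the documents
keep <- term_freq >= quantile(term_freq, 0.8, names = FALSE) & doc_prop <= 0.1
names(which(keep))
```

The very frequent word that occurs in every document is dropped, as are the rare words, so that only frequent but document-specific words remain. These are the words most likely to distinguish topics.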


## Unsupervised LDA

We now fit an unsupervised LDA topic model. The argument `k` sets the number of topics; change `k` to extract more or fewer topics.


In [None]:
# set seed
set.seed(1234)
# generate model
tmod_lda <- seededlda::textmodel_lda(dfmat_corpus, k = 15)
# inspect
terms(tmod_lda, 10)
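
The effect of `k` can be checked on a small example. The sketch below assumes that the quanteda and seededlda packages are installed and uses an invented mini corpus that is far too small to produce meaningful topics; it only shows how the shape of the output changes with `k`.

```r
library(quanteda)
library(seededlda)

# invented mini corpus; far too small for meaningful topics
toy <- c(d1 = "students attend university courses and study for exams",
         d2 = "university students take courses in education",
         d3 = "parents stay home with mother and father",
         d4 = "mother and father are at home with the children",
         d5 = "the film was a boring movie to watch",
         d6 = "people watch a movie or film at the weekend")
dfmat_toy <- dfm(tokens_remove(tokens(toy), stopwords("en")))

# fit two models that differ only in the number of topics
set.seed(1234)
tmod_k2 <- textmodel_lda(dfmat_toy, k = 2)
tmod_k3 <- textmodel_lda(dfmat_toy, k = 3)

# terms() returns one column of top terms per topic
dim(terms(tmod_k2, 5))
dim(terms(tmod_k3, 5))
# topics() returns the most likely topic for each document
topics(tmod_k3)
```

There is no single correct value for `k`; a common approach is to fit models with several values and to keep the one whose topics are easiest to interpret.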


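## Semi-supervised (seeded) LDA

In a seeded LDA, we define the topics in advance by providing seed words for each topic in a dictionary. Setting `residual = TRUE` adds a residual topic for content that is not covered by the seed words.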



In [None]:
# semisupervised LDA
dict <- dictionary(list(Computer = c("computers", "information", "machine", "computer"),
                        Education = c("students", "courses", "education", "university"),
                        Movies = c("movie", "film", "commercial", "watch"),
                        Family = c("parents", "home", "mother", "father"),
                        War = c("war", "peace", "somalia", "consequences"),
                        Foreigners = c("foreigners", "germans", "turkish", "turks"),
                        Phone = c("phone", "telephone", "call"),
                        Food = c("mcdonald's", "restaurant", "chips", "fastfood", "taste"),
                        Pets = c("dog*", "walk", "cat", "pet", "happy"),
                        Eco = c("green", "car*", "drive", "speed", "accident", "exhaust"),
                        Dating = c("girl", "boy", "date", "marries")))
tmod_slda <- textmodel_seededlda(dfmat_corpus, dict, residual = TRUE, min_termfreq = 10)
terms(tmod_slda)
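
Some seed words in the dictionary, such as `dog*` and `car*`, are glob patterns, where the asterisk stands for any sequence of characters. The base R sketch below shows, on an invented word list, how broadly such patterns can match. It uses `utils::glob2rx()` to mimic glob matching and is not how quanteda matches dictionary entries internally.

```r
# invented word list to test the glob patterns against
vocab <- c("car", "cars", "care", "career", "carpet", "scar",
           "dog", "dogs", "dogma")

# glob2rx() turns a glob pattern into a regular expression ("car*" becomes "^car")
car_hits <- vocab[grepl(utils::glob2rx("car*"), vocab)]
dog_hits <- vocab[grepl(utils::glob2rx("dog*"), vocab)]
car_hits
dog_hits
```

Because `car*` also matches words such as *care* or *career*, wildcard seeds can pull unrelated words into a topic. Restricting the seed words by frequency, as `min_termfreq = 10` appears to do in the call above, is one way to limit this.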


We can now inspect which topic was assigned to each file (here, the first 20 files).



In [None]:
topics(tmod_slda)[1:20]



Next, we extract the file names and create a data frame that contains the language background and the topic of each document.



In [None]:
# extract the file names (without path and extension) from the document names
files <- stringr::str_replace_all(names(topics(tmod_slda)), ".*/(.*?)\\.txt", "\\1")
topics <- topics(tmod_slda)
language <- ifelse(stringr::str_detect(files, "GE"), "German", "Spanish")
df <- data.frame(language, topics) %>%
  dplyr::mutate_if(is.character, factor)
# inspect
head(df)
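
The file name extraction can be tested in isolation. The sketch below uses base R and two hypothetical file paths; real ICLE file names may look different. It also anchors the language check at the start of the name (`^GE`), which is an assumption about the naming scheme that avoids accidental matches later in the name.

```r
# hypothetical file paths; real ICLE file names may look different
paths <- c("/home/user/ICLE/GEAU1001.txt",
           "/home/user/ICLE/SPM01002.txt")

# keep only the part between the last slash and the ".txt" ending
files <- gsub(".*/(.*?)\\.txt$", "\\1", paths, perl = TRUE)
# base R alternative without a regular expression
files_alt <- tools::file_path_sans_ext(basename(paths))

# derive the learner group from the beginning of the file name
language <- ifelse(grepl("^GE", files), "German", "Spanish")
data.frame(files, language)
```

Both approaches return the bare file names, so either can be used, depending on whether you prefer regular expressions or helper functions.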


We then calculate, for each language group, the percentage of documents assigned to each topic.



In [None]:
dfp <- df %>%
  dplyr::group_by(language, topics) %>%
  dplyr::summarise(freq = n()) %>%
  dplyr::group_by(language) %>%
  dplyr::mutate(all = sum(freq),
                percent = round(freq/all*100, 2))
# inspect
head(dfp)
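
The percentage calculation can be verified on a small invented data set. The sketch below uses base R (`table()` and `prop.table()`) instead of dplyr; the topic assignments are made up.

```r
# invented topic assignments for seven documents
toy_df <- data.frame(
  language = c("German", "German", "German", "German",
               "Spanish", "Spanish", "Spanish"),
  topics   = c("Family", "Family", "War", "Food",
               "Food", "Food", "War"))

# cross-tabulate language by topic
counts <- table(toy_df$language, toy_df$topics)
# convert counts to row percentages (each language sums to 100)
percent <- round(prop.table(counts, margin = 1) * 100, 2)
percent
```

Calculating the percentages within each language group, rather than across the whole data set, keeps the groups comparable even when they contain different numbers of documents.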


Finally, we visualise the percentages in a grouped bar plot.


In [None]:
dfp %>%
  ggplot(aes(x = topics, y = percent, label = percent, fill = language)) +
  geom_bar(stat = "identity", position = position_dodge()) + 
  geom_text(vjust=-0.3, position = position_dodge(0.9)) + 
  theme_bw() +
  coord_cartesian(ylim = c(0, 30)) +
  labs(x = "Topic", y = "Percent") +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90))


## Outro


We end the session by calling the session info, which shows which version of R and which versions of the packages we have used.


In [None]:
sessionInfo()



