![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)


This notebook-based tool accompanies the [Language Technology and Data Analysis Laboratory (LADAL) tutorial *Topic Modelling with R*](https://ladal.edu.au/topicmodels.html).

## Using your own data

<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
To <b>use your own data</b>, click on the folder called <b>`MyTexts`</b> (in the menu on the left of the screen) and drag and drop your txt files into it. <br>When you then execute the code chunk below, your texts are loaded and can be used in this notebook.<br>You can upload <b>only txt files</b> (plain, unformatted files created in or saved by a text editor)! <br>The notebook assumes that you upload text data, not tabular data! 
<br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
# load function that helps loading texts
source("https://slcladal.github.io/rscripts/loadtxts.R")
# load texts
corpus <- loadtxts("notebooks/MyTexts")
# inspect the structure of the text object
str(corpus)
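
The helper `loadtxts()` is sourced from a remote script, so its implementation is not visible in this notebook. As a rough idea of what such a loader does, the base-R sketch below reads every txt file in a folder into a named character vector. The function name `read_txt_folder` and the demo folder are hypothetical, and the real helper may behave differently.

```r
# Hypothetical stand-in for a text loader; the real loadtxts() may differ.
read_txt_folder <- function(path) {
  # list all txt files in the folder (full paths)
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
  # read each file and collapse its lines into a single string
  texts <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = " ")
  }, character(1))
  # use the file paths as element names
  names(texts) <- files
  texts
}

# demo with two small files in a temporary folder
demo_dir <- file.path(tempdir(), "MyTextsDemo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("First line.", "Second line."), file.path(demo_dir, "a.txt"))
writeLines("Another text.", file.path(demo_dir, "b.txt"))
demo_corpus <- read_txt_folder(demo_dir)
str(demo_corpus)
```

Keeping the file paths as names is what later allows each document to be traced back to its file.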


## Cleaning and tokenising

We start by cleaning the corpus data (removing tags and superfluous white space) and then split the cleaned texts into individual words. Punctuation, numbers and symbols are dropped during tokenisation, and stopwords are removed afterwards.


In [None]:
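# load the packages used in this notebook
library(dplyr)
library(quanteda)
library(seededlda)
# remove tags and superfluous white space
corpus_clean <- 
  stringr::str_remove_all(corpus, "<.*?>") %>%
  stringr::str_squish()
# keep the file names so that documents can be identified later
names(corpus_clean) <- names(corpus)
# tokenise the cleaned texts and drop punctuation, numbers and symbols
toks_corpus <- tokens(corpus_clean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove stopwords and a few unwanted tokens
toks_corpus <- tokens_remove(toks_corpus, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))
# create a document-feature matrix and trim it by term and document frequency
dfmat_corpus <- dfm(toks_corpus) %>% 
              dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile",
                       max_docfreq = 0.1, docfreq_type = "prop")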
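
To see what the two cleaning calls do, here is a minimal sketch on two made-up strings. The object names `raw`, `no_tags` and `clean` are illustrative only and are not used elsewhere in the notebook.

```r
# two made-up documents: one with tags and irregular spacing, one without tags
raw <- c(doc1 = "<p>Topic   models are <b>useful</b>.</p>",
         doc2 = "  No tags   here ")

# remove everything between "<" and the next ">" (non-greedy match)
no_tags <- stringr::str_remove_all(raw, "<.*?>")

# trim leading/trailing white space and collapse repeated spaces
clean <- stringr::str_squish(no_tags)

# restore the document names explicitly
names(clean) <- names(raw)
print(clean)
```

The non-greedy `.*?` matters here: a greedy `<.*>` would delete everything from the first `<` to the last `>` in a text, including the words in between.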


## Unsupervised LDA

Now that we have cleaned the data, we can perform the topic modelling. This consists of two steps:

1. First, we perform an unsupervised LDA. We do this to check what topics are in our corpus. 

2. Then, we perform a supervised LDA (based on the results of the unsupervised LDA) to identify meaningful topics in our data. For the supervised LDA, we define so-called *seed terms* that help in generating coherent topics.

Here we look for 15 topics, but you should vary the number of topics (k) and compare the results to see which topics recur in your data.


In [None]:
# set seed
set.seed(1234)
# generate model: change k to different numbers, e.g. 10 or 20 and look for consistencies in the keywords for the topics below.
tmod_lda <- seededlda::textmodel_lda(dfmat_corpus, k = 15)
# inspect
terms(tmod_lda, 10)
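
Comparing several values of k can be sketched as a small loop. The example below is only an illustration: it uses the inaugural speeches that ship with `quanteda` instead of your own texts, and the objects `dfmat_demo`, `k_values` and `top_terms`, as well as the small values of k and `max_iter`, are assumptions chosen to keep the run short.

```r
library(quanteda)
library(seededlda)

# demo data: US inaugural speeches bundled with quanteda (not the notebook's corpus)
toks_demo <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
toks_demo <- tokens_remove(toks_demo, pattern = stopwords("en"))
dfmat_demo <- dfm(toks_demo)

# fit one model per candidate k and keep the top 5 terms of every topic
k_values <- c(3, 5)
top_terms <- lapply(k_values, function(k) {
  set.seed(1234)
  mod <- textmodel_lda(dfmat_demo, k = k, max_iter = 200)
  terms(mod, 5)
})
names(top_terms) <- paste0("k", k_values)

# each element is a matrix with one column per topic
top_terms
```

Keyword groups that reappear across different values of k are good candidates for the seed terms used in the next step.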


## Supervised LDA

Now, we perform a supervised (seeded) LDA. Here we use keywords taken from the results of the unsupervised LDA as *seed terms* to create coherent topics.

<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>IMPORTANT</b>: you need to change and adapt the topics and keywords defined below. Simply replace the topics and seed terms with your own topics and seed terms (based on the results of the unsupervised LDA). 
<br>
</p>
</span>
</div>

<br>


In [None]:
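# semi-supervised (seeded) LDA: define topics and their seed terms
dict <- dictionary(list(Computer = c("computers", "information", "machine", "computer"),
                        Education = c("students", "courses", "education", "university"),
                        Movies = c("movie", "film", "commercial", "watch")))
# fit the seeded model; residual = TRUE adds a topic for content not covered by the seeds
tmod_slda <- textmodel_seededlda(dfmat_corpus, dict, residual = TRUE, min_termfreq = 10)
# inspect the top terms of each topic
terms(tmod_slda)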
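
A seed term can only guide a topic if it actually occurs in the document-feature matrix. The sketch below shows one way to check this on a toy example. The two toy documents and the objects `seeds`, `dfmat_toy` and `seed_hits` are made up for illustration.

```r
library(quanteda)

# toy documents and a toy document-feature matrix
toks_toy <- tokens(c(d1 = "students attend university courses",
                     d2 = "the film was a great movie"))
dfmat_toy <- dfm(toks_toy)

# seed terms as a plain list, then as a quanteda dictionary
seeds <- list(Education = c("students", "courses", "education"),
              Movies = c("movie", "film", "cinema"))
dict_toy <- dictionary(seeds)

# for each topic, keep only the seed terms that occur in the vocabulary
seed_hits <- lapply(seeds, function(s) s[s %in% featnames(dfmat_toy)])
seed_hits
```

Here "education" and "cinema" do not occur in the toy texts, so they drop out of `seed_hits`. Note that this simple check compares literal words and does not expand wildcard patterns.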


Now, we can extract the file names and create a data frame of documents and topics. This shows, in tabular form, which topic is dominant in which file.



In [None]:
# extract file names (without path and extension) from the document names
files <- stringr::str_replace_all(names(topics(tmod_slda)), ".*/(.*?)\\.txt$", "\\1")
cleancontent <- corpus_clean
# dominant topic of each document
doc_topics <- topics(tmod_slda)
# generate data frame
df <- data.frame(files, cleancontent, topics = doc_topics) %>%
  dplyr::mutate_if(is.character, factor)
# inspect
head(df)
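
The file-name step strips the folder path and the `.txt` extension from each document name. A minimal sketch with two made-up paths (the vector `doc_names` is hypothetical) shows the effect, along with a base-R alternative that gives the same result here:

```r
# made-up document names of the kind produced by loading files from a folder
doc_names <- c("notebooks/MyTexts/speech01.txt", "notebooks/MyTexts/report.txt")

# regex version: drop everything up to the last "/" and the trailing ".txt"
files_regex <- stringr::str_replace_all(doc_names, ".*/(.*?)\\.txt$", "\\1")

# base-R alternative: take the file name and remove its extension
files_base <- tools::file_path_sans_ext(basename(doc_names))

files_regex
files_base
```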


## Exporting data

To export a data frame as an MS Excel spreadsheet, we use `write_xlsx` from the `writexl` package. Be aware that we use the `here` function to build the file path relative to the project's root directory.


In [None]:
# save data in the MyOutput folder
writexl::write_xlsx(df, here::here("notebooks/MyOutput/df.xlsx"))
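
To try the export step without touching the `MyOutput` folder, the sketch below writes a small made-up data frame to a temporary file and then shows the path that `here::here()` would build. The objects `demo_df`, `out_path` and `target` are illustrative assumptions.

```r
# small made-up data frame standing in for the real results
demo_df <- data.frame(files = c("a", "b"),
                      topics = c("Education", "Movies"))

# write it to a temporary xlsx file and confirm that the file exists
out_path <- file.path(tempdir(), "df_demo.xlsx")
writexl::write_xlsx(demo_df, path = out_path)
file.exists(out_path)

# here::here() prefixes the project root as detected by the here package
target <- here::here("notebooks", "MyOutput", "df.xlsx")
target
```

The printed `target` depends on where the project lives on your machine, so it will differ between environments.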


<div class="warning" style='padding:0.1em; background-color: rgba(215,209,204,.3); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>You will find the generated MS Excel spreadsheet named "df.xlsx" in the `MyOutput` folder (located on the left side of the screen).</b> <br><br>Simply double-click the `MyOutput` folder icon, then right-click on the "df.xlsx" file, and choose Download from the dropdown menu to download the file. <br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


***

[Back to LADAL](https://ladal.edu.au/topicmodels.html)

***
