# Introduction

This notebook contains an analysis of Reddit posts about the Eurovision Song Contests 2015 and 2025, together with a control dataset consisting of general Reddit posts.

## Set Up

The next code chunk loads the required packages into the R environment. In this interactive Jupyter notebook, the packages are already installed. If you want to run the analysis on your own computer, you may first need to install the packages listed below using the `install.packages()` function.

In any case, you need to activate them using the `library()` function so that their functions can be used in your R session. Each package serves a different purpose in text analysis, and loading them ensures that all the necessary tools are available.


In [None]:
# Activating packages: Loading the necessary libraries for text analysis.

# Load 'tidyverse': 
# This package suite includes tools for data manipulation and visualization (e.g., dplyr for data wrangling and ggplot2 for plots).
library(tidyverse)

# Load 'quanteda': 
# This library provides tools for quantitative text analysis. It's essential for creating document-feature matrices, tokenization, 
# and various text mining tasks.
library(quanteda)

# Load 'udpipe': 
# Udpipe is used for part-of-speech tagging, lemmatization, and dependency parsing. 
# It's useful when conducting deeper linguistic analyses.
library(udpipe)

# Load 'openxlsx': 
# This package allows reading, writing, and editing Excel files, which is useful for exporting analysis results.
library(openxlsx)

# Load 'stopwords': 
# This library provides access to stopword lists in multiple languages, useful for filtering out common but uninformative words in text analysis.
library(stopwords)

# Load 'ggwordcloud': 
# A tool for creating aesthetically pleasing word clouds using the ggplot2 framework, allowing you to visualize word frequencies.
library(ggwordcloud)

# Load 'quanteda.textplots': 
# This package enhances quanteda's visualization capabilities, providing tools for plotting text-related data, 
# such as keyword-in-context plots and feature co-occurrence networks.
library(quanteda.textplots)

# Load 'topicmodels': 
# This package fits topic models using Latent Dirichlet Allocation (LDA).
library(topicmodels)

# Load 'tidytext': 
# This package converts text to and from tidy formats for efficient processing.
library(tidytext)

# Load 'writexl': 
# This package saves tabular data as MS Excel spreadsheets.
library(writexl)

# Load 'seededlda': 
# This package fits semi-supervised (seeded) topic models.
library(seededlda)
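
When running the notebook locally, a quick check like the following sketch can show which packages still need to be installed. The helper name `find_missing_packages()` is an illustrative assumption and not part of the notebook. The `install.packages()` call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded with library() above, plus those called via `::` later on
required <- c("tidyverse", "quanteda", "udpipe", "openxlsx", "stopwords",
              "ggwordcloud", "quanteda.textplots", "topicmodels",
              "tidytext", "writexl", "seededlda",
              "here", "readxl", "RColorBrewer")

missing_pkgs <- find_missing_packages(required)
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```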


The next code chunk loads the datasets used for the analysis in this workshop. They are CSV and Excel files containing Reddit comments and are located in the "data" folder. `read.csv()` and `readxl::read_excel()` import these files into R as data frames, i.e. structured data tables. The `here::here()` function builds each file path relative to the project root, so the paths resolve correctly regardless of the directory from which the script is run.



In [None]:
# Loading data into R

# The "euro2015" dataset is read from a CSV file containing Reddit comments on the Eurovision Song Contest 2015.
# The file is located in the "data" folder, and we use the 'here' package to construct the path dynamically.
euro2015 <- read.csv(here::here("data", "euro_2015_comments.csv"))

# The "euro2025" dataset is read from another CSV file with Reddit comments on the Eurovision Song Contest 2025,
# also located in the "data" folder.
euro2025 <- read.csv(here::here("data", "euro_2025_comments.csv"))

# The "cntrl" dataset is loaded from an Excel file named 'compared.xlsx'. 
# We use 'read_excel()' from the readxl package to read Excel files. 
# The file path is also dynamically generated using the 'here' package.
cntrl <- readxl::read_excel(here::here("data", "compared.xlsx"))


The next code chunk uses the `head()` function to display the first few rows of the `euro2025` dataset. It helps you inspect the structure and content of the data, ensuring it has been loaded correctly and providing a quick look at the initial values in each column. This is a useful first step in exploring your data.



In [None]:
head(euro2025)



The next code chunk uses the `str()` function to display the structure of the `cntrl` dataset. The `str()` function provides a compact, readable summary of the data frame, including the data types of each column and the first few values in each column. This helps in understanding the format and structure of the dataset, which is essential before conducting further analysis.



In [None]:
# Inspecting the structure of the 'cntrl' dataset

# The str() function gives a concise overview of the 'cntrl' dataset.
# It shows the number of observations (rows), variables (columns),
# the data types of each variable (e.g., integer, character, etc.), 
# and a preview of the first few values in each variable.
str(cntrl)


## Data Processing

In the next code chunk, the `dplyr::mutate()` function is used to clean and process the `euro2015`, `euro2025`, and `cntrl` datasets. For each dataset:

* A new column called `data` is added, which labels the dataset (`"euro2015"`, `"euro2025"`, or `"control"`). This is useful for tracking the source of the data when merging datasets later.

* The `created_utc` column (the creation timestamp of each post) is converted to a character data type so that it has the same type in all three datasets. Columns with mismatched types cannot be joined later on.

This preprocessing step is important for preparing the datasets for further analysis and ensuring that all relevant fields are in the correct format.


In [None]:
# Clean and process data

# For the euro2015 dataset:
# - Add a new column 'data' with the label "euro2015"
# - Convert the 'created_utc' column to character format (useful for consistency)
euro2015 <- euro2015 %>%
  dplyr::mutate(data = "euro2015") %>% 
  dplyr::mutate(created_utc = as.character(created_utc))

# For the euro2025 dataset:
# - Add a new column 'data' with the label "euro2025"
# - Convert the 'created_utc' column to character format for consistency
euro2025 <- euro2025 %>%
  dplyr::mutate(data = "euro2025") %>% 
  dplyr::mutate(created_utc = as.character(created_utc))

# For the control dataset (cntrl):
# - Add a new column 'data' with the label "control"
# - Convert the 'created_utc' column to character format
cntrl <- cntrl %>%
  dplyr::mutate(data = "control") %>% 
  dplyr::mutate(created_utc = as.character(created_utc))
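
The reason for harmonizing the column type can be illustrated with a small sketch. The two toy data frames below are hypothetical stand-ins, with the same timestamp stored once as a number and once as text.

```r
library(dplyr)

# Toy data (hypothetical): the same kind of timestamp stored as number vs. text
a <- data.frame(id = 1, created_utc = 1432400000)
b <- data.frame(id = 2, created_utc = "1432400000")

# Joining on columns of different types fails ...
result <- try(full_join(a, b, by = c("id", "created_utc")), silent = TRUE)
join_failed <- inherits(result, "try-error")
print(join_failed)

# ... but works once both columns are character
a_chr <- mutate(a, created_utc = as.character(created_utc))
joined <- full_join(a_chr, b, by = c("id", "created_utc"))
print(joined)
```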


The next code chunk uses the `dplyr::full_join()` function to combine the `euro2015`, `euro2025`, and `cntrl` datasets into a single data frame called `esc`. A full join keeps all records from both datasets and fills in missing values where there is no match. Because no `by` argument is supplied, `full_join()` matches on all columns that the two data frames share and prints a message listing them.

1. **First Join**: The `euro2015` and `euro2025` datasets are combined using `full_join()`, resulting in a new data frame `esc1` that contains all records from both datasets.

2. **Second Join**: `esc1` is then joined with the `cntrl` dataset, creating a final combined data frame called `esc`.

3. **Inspecting the Structure**: The `str()` function is used to inspect the structure of the `esc` data frame, providing an overview of its columns, data types, and some sample data.


In [None]:
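# Combine the two Eurovision datasets (joined on all shared columns)
esc1 <- dplyr::full_join(euro2015, euro2025)
# Add the control data
esc <- dplyr::full_join(esc1, cntrl)
# Inspect the structure of the combined data frame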
str(esc)
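
Since the `data` label differs between the datasets and is one of the shared columns, no row of one dataset can match a row of another, so the joins above effectively stack the tables. The following sketch illustrates this with hypothetical toy data and compares the result with `dplyr::bind_rows()`, which stacks data frames directly.

```r
library(dplyr)

# Toy data (hypothetical): the same comment appears in both datasets,
# but the `data` label differs
first  <- data.frame(author = c("u1", "u2"),
                     body = c("great song", "nul points"),
                     data = "euro2015")
second <- data.frame(author = "u1", body = "great song", data = "euro2025")

joined  <- full_join(first, second, by = c("author", "body", "data"))
stacked <- bind_rows(first, second)

nrow(joined)   # 3: no rows matched, so nothing was merged
nrow(stacked)  # 3
```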


This code chunk performs several cleaning steps on the text data, preparing it for analysis. It focuses on adding new columns, converting text to lower case, and removing unwanted characters (such as non-word characters, URLs, and stopwords). It then selects the relevant columns for the cleaned dataset.

1. **Adding a Continuous ID**: A new column `nid` is added, assigning each row a unique identifier.

2. **Converting Text to Lower Case**: The `body` column's text is converted to lowercase and saved in a new column `ltext`.

3. **Text Cleaning**: The `ctext` column is created by cleaning the `ltext` data:

  * Replacing newline characters (`\n`) with spaces.
  * Removing non-word characters such as punctuation and special symbols.
  * Removing URLs.
  * Removing stopwords using the `stopwords()` function.
  * Removing small numbers (1 to 3 digits).
  * Squishing superfluous white spaces.
  * Trimming any leading or trailing whitespace.

4. **Selecting Columns**: After cleaning, the code selects relevant columns (`nid`, `data`, `author`, `body`, `ltext`, `ctext`) to create the final cleaned dataset, `esc_clean`.

5. **Inspecting the Cleaned Data**: The `head()` function displays the first few rows of the cleaned data for inspection.


In [None]:
# Clean the data

esc %>%
  # Add continuous id for each row
  dplyr::mutate(nid = 1:nrow(.),
                # Convert body text to lower case and store in ltext
                ltext = tolower(body),
                # Clean the text
                # Replace newline characters (\n) with spaces
                ctext = stringr::str_replace_all(ltext, fixed("\n"), " "),
                # Remove non-word characters (anything not a letter or number)
                ctext = stringr::str_remove_all(ctext, "[^[:alnum:] ]"),
                # Remove URLs that start with htt and followed by alphanumeric characters
                ctext = stringr::str_remove_all(ctext, "htt[:alnum:]{1,}"),
                # Remove stopwords (whole words only) by replacing them with a space
                ctext = stringr::str_replace_all(ctext, paste0("\\b(", paste0(stopwords::stopwords(), collapse = "|"), ")\\b"), " "),
                # Remove small numbers (1-3 digits)
                ctext = stringr::str_replace_all(ctext, "\\b\\d{1,3}\\b", " "),
                # Remove superfluous white space
                ctext = stringr::str_squish(ctext),
                # Trim leading/trailing spaces
                ctext = stringr::str_trim(ctext)) %>%
  # Select only the relevant columns: nid, data, author, body, ltext, ctext
  dplyr::select(nid, data, author, body, ltext, ctext) -> esc_clean

# Inspect the first few rows of the cleaned dataset
head(esc_clean)
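
To see what the individual cleaning steps do, the following sketch applies them to a single made-up comment. It is only an illustration: it uses base R `gsub()` in place of the stringr functions and a two-word stand-in for the full stopword list, and the object names `raw`, `stops`, and `clean` are assumptions.

```r
# Hypothetical example comment and a tiny stand-in stopword list
raw <- "I LOVED the show!\nSee https://example.com - 12 points"
stops <- c("i", "the")

clean <- tolower(raw)                               # lower case
clean <- gsub("\n", " ", clean, fixed = TRUE)       # newlines to spaces
clean <- gsub("[^[:alnum:] ]", "", clean)           # drop punctuation and symbols
clean <- gsub("htt[[:alnum:]]+", "", clean)         # drop what is left of URLs

# Word boundaries on both sides ensure that only whole words are removed
stop_pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
clean <- gsub(stop_pattern, " ", clean, perl = TRUE)

clean <- gsub("\\b\\d{1,3}\\b", " ", clean, perl = TRUE)  # small numbers
clean <- trimws(gsub("\\s+", " ", clean))                 # squish and trim

print(clean)   # "loved show see points"
```

Note that the URL is removed only after punctuation has been stripped, which is why the pattern looks for a run of letters and digits starting with "htt" rather than for a complete web address.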


The next code chunk retrieves and displays the column names of the cleaned dataset, `esc_clean`. This is a quick way to check that the columns are named as expected after the cleaning and processing steps.



In [None]:
# Check the column names of the cleaned dataset
# This returns a character vector of column names for esc_clean
colnames(esc_clean)


This next code chunk uses the `head()` function to display the first few rows of the cleaned dataset, `esc_clean`. Inspecting the first few rows of the dataset helps you verify that the cleaning and processing steps were applied correctly. It allows you to take a quick look at the structure of your data and the content in each column, ensuring everything is in order before proceeding with further analysis.



In [None]:
# Inspect the first few rows of the cleaned dataset
# head() displays the first 6 rows of the dataset
head(esc_clean)


After loading, preparing, and cleaning the data, we can now turn to the actual text analytics in the next section.

# Showcase of Basic Text Analytics

This section showcases basic methods used in computational analyses of textual data, i.e. text analytics. We start with a basic frequency analysis.


## Frequency Analysis

1. **Data Preparation**: The code starts by filtering the dataset `esc_clean` to include only the Eurovision entries (2015 and 2025). This removes the control data, so that the analysis focuses on the datasets of interest.

2. **Tokenization**: The `quanteda::tokens()` function is used to break down the text in the `ctext` column into individual words, or tokens. The text is converted to character format to ensure compatibility with the tokenization process.

3. **Data Frame Transformation**: The list of tokens is then flattened into a single vector and converted into a data frame. The first column of this data frame is renamed to 'token' for clarity.

4. **Frequency Calculation**: The data is grouped by the unique tokens, and the frequency of each token is calculated using `dplyr::summarise(n())`. This results in a new data frame that contains each unique token alongside its corresponding frequency count.

5. **Sorting Results**: Finally, the resulting data frame is sorted in descending order based on token frequency, allowing for easy identification of the most frequently used words.

6. **Inspection of Results**: The `head(text1, 20)` function call displays the top 20 most frequent tokens, providing an overview of the key terms present in the Euro data.


In [None]:
## Frequency Analysis

# Prepare the data for analysis by filtering out control data.
# We are interested only in data related to the "euro2015" and "euro2025" datasets
# (the labels assigned in the 'data' column during processing).
euro <- esc_clean %>% 
  dplyr::filter(data == "euro2015" | data == "euro2025")

# Reformat the euro data for tokenization.
# Here, we will split the text into individual words (tokens).

# Tokenize the 'ctext' column of the euro data (split into words)
quanteda::tokens(as.character(euro$ctext)) %>%
  unlist() %>%                       # Flatten the list of tokens into a single vector
  as.data.frame() %>%               # Convert the vector into a data frame
  dplyr::rename(token = 1) %>%       # Rename the first column to 'token'
  dplyr::group_by(token) %>%         # Group the data by each unique token (word)
  dplyr::summarise(text1 = n()) %>%  # Count the frequency of each token
  dplyr::arrange(-text1) -> text1    # Arrange the tokens in descending order by frequency

# Inspect the top 20 most frequent tokens
head(text1, 20)
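
The counting logic can be tried out on a few made-up comments. In this sketch, splitting on spaces stands in for `quanteda::tokens()`, and `dplyr::count()` is used as a shorthand for the `group_by()`, `summarise(n())`, and `arrange()` steps. The object names are assumptions chosen for the example.

```r
library(dplyr)

# Hypothetical cleaned comments
texts <- c("song great song", "great show", "song")

# Split on spaces (a simple stand-in for quanteda::tokens())
tokens_vec <- unlist(strsplit(texts, " ", fixed = TRUE))

# Count each token and sort by frequency in descending order
freq <- data.frame(token = tokens_vec) %>%
  count(token, name = "freq", sort = TRUE)

print(freq)
```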


### Creating a barchart

The following code visualizes the frequency of the most common words in Reddit posts about the Eurovision Song Contests 2015 and 2025. Plotting these frequencies makes it easy to see which terms are most prevalent in the discussions.

1. **Data Selection**: The `head(30)` function is used to filter the `text1` data frame to include only the 30 most frequent words. This limits the visualization to the most relevant terms, making the plot more interpretable.

2. **Plot Generation**: The `ggplot()` function initializes the plotting process, where `aes()` defines the aesthetic mappings for the plot. The x-axis represents the words (tokens), and the y-axis represents their corresponding frequency counts.

3. **Bar Chart Creation**: The `geom_bar(stat = "identity")` function specifies that the plot should be a bar chart, where the height of each bar corresponds to the frequency of the word.

4. **Text Labels**: The `geom_text()` function adds text labels to the bars, indicating the exact frequency count of each word. The text is positioned slightly below the top of each bar for clarity.

5. **Coordinate Flipping**: The `coord_flip()` function flips the axes, transforming the bar chart from vertical to horizontal. This orientation often makes it easier to read the word labels, especially for longer words.

6. **Theme Application**: The `theme_bw()` function applies a clean black-and-white theme to the plot, enhancing its readability and visual appeal.

7. **Title and Axis Labels**: The `labs()` function adds a title to the plot and labels the axes. The title includes specific years of interest, providing context for the visualized data.


In [None]:
# This code visualizes the frequency information of the most common words 
# in Reddit posts about the Eurovision Song Contests in 2015 and 2025.
# It generates a bar plot displaying the top 30 most frequent words, 
# allowing for a visual comparison of their frequencies.

text1 %>%
  # Select the 30 most frequent words from the text1 data frame
  head(30) %>%
  # Generate a plot using ggplot2
  ggplot(aes(x = reorder(token, text1), y = text1, label = text1)) +
  # Define plot type as a bar chart
  geom_bar(stat = "identity") +
  # Add text labels showing the frequency counts
  geom_text(aes(y = text1 - 50), color = "white", size = 3) + 
  # Flip the coordinates to make the bars horizontal
  coord_flip() +
  # Apply a black-and-white theme to the plot
  theme_bw() +
  # Add a title and label the axes
  labs(title = "30 Most Frequent Words in Reddit Posts about \nthe Eurovision Song Contests 2015 and 2025", x = "", y = "")
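
As a small variation on the plot above, the sketch below uses `geom_col()`, which is the ggplot2 shorthand for `geom_bar(stat = "identity")`, and positions the labels with `hjust` instead of subtracting a fixed value from the counts. The toy frequency table is hypothetical, and the `hjust` value is a judgment call that may need adjusting for real data.

```r
library(ggplot2)

# Hypothetical frequency table with the same column names as `text1`
toy_freq <- data.frame(token = c("song", "vote", "show"),
                       text1 = c(30, 20, 10))

p <- ggplot(toy_freq, aes(x = reorder(token, text1), y = text1, label = text1)) +
  geom_col() +                                        # same as geom_bar(stat = "identity")
  geom_text(hjust = 1.2, color = "white", size = 3) + # nudge labels inside the bars
  coord_flip() +
  theme_bw() +
  labs(x = "", y = "")

# Print the plot object to display it
p
```

A relative adjustment such as `hjust` does not depend on the scale of the counts, whereas a fixed offset like `text1 - 50` only works well when the frequencies are much larger than 50.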


This code saves the visual output of the last ggplot created in the session as an image file. Saving plots allows participants to easily share or include the visualizations in reports and presentations.

1. **File Path**: The `here::here()` function is used to construct the file path for the saved image. This function ensures that the path is relative to the project's root directory, making the code more portable and less prone to errors when moving the project across different systems.

2. **Image Format**: The file is saved in PNG format, which is widely used and supports high-quality graphics, making it suitable for both digital and print applications.

3. **File Name and Location**: The image will be saved in the "images" subdirectory, with the specified name "freq_bar.png". If the "images" folder does not exist, you may need to create it beforehand, or the code will return an error.


In [None]:
# This code saves the currently active ggplot visualization as an image file. 
# The saved image will be in PNG format and will be stored in the "images" 
# directory of the project. The file will be named "freq_bar.png".

ggsave(here::here("images", "freq_bar.png"))
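
If the output folder might not exist yet, it can be created before saving. The sketch below uses a temporary directory as a stand-in for the project root (in the notebook this would be the path returned by `here::here()`); the variable names are assumptions.

```r
# Stand-in for the project root; in the notebook this would be here::here()
project_root <- tempdir()
image_dir <- file.path(project_root, "images")

# Create the folder only if it is not there yet
if (!dir.exists(image_dir)) {
  dir.create(image_dir, recursive = TRUE)
}

print(dir.exists(image_dir))
```

The same pattern applies to the "tables" folder used for the Excel exports later in the notebook.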


### Simple word clouds

1. **Data Preparation for Word Cloud**

* The first code chunk focuses on preparing the data required to create a word cloud. It begins by filtering the `esc_clean` dataset to include only the text data for the Eurovision years (2015 and 2025).

* The `dplyr::pull(ctext)` function extracts the clean text from the `ctext` column, and `paste0(collapse = " ")` concatenates all the extracted text into a single long string, making it suitable for analysis.

2. **Corpus and Document-Feature Matrix**

* A corpus object is created using `quanteda::corpus()`, which wraps the concatenated text into a structure that can be analyzed further.

* The `quanteda::dfm()` function generates a document-feature matrix (DFM) from the tokenized corpus. A DFM represents the frequency of words (features) across documents.

* The `quanteda::dfm_trim()` function is applied to filter out less frequent terms, retaining only those that appear at least 500 times. This step reduces noise in the data and focuses the analysis on more significant terms.


In [None]:
# This code prepares the data for generating a word cloud using text from the 
# Eurovision datasets for 2015 and 2025. It extracts clean text, creates a 
# document-feature matrix, and trims it to focus on frequently used terms.

# Extract clean text from the eurovision data for the specified years
dplyr::filter(esc_clean, data == "euro2015" | data == "euro2025") %>% 
  dplyr::pull(ctext) %>%                 # Pull the 'ctext' column containing clean text
  paste0(collapse = " ") -> euro_wc      # Concatenate all text into a single string

# Create a corpus object from the concatenated text
quanteda::corpus(c(euro_wc)) -> esc_corpus 

# Generate a document-feature matrix (DFM) from the corpus
quanteda::dfm(tokens(esc_corpus)) %>%
  # Trim the DFM to retain only terms that appear frequently (min_termfreq = 500)
  quanteda::dfm_trim(min_termfreq = 500, verbose = FALSE) -> esc_dfm
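
The effect of `dfm_trim()` is easiest to see on a tiny example. The two documents below are made up, and a threshold of 2 replaces the 500 used above so that the result can be checked by hand.

```r
library(quanteda)

# Hypothetical mini corpus with two documents
toy <- c(doc1 = "song song song show", doc2 = "song show vote")
toy_dfm <- dfm(tokens(corpus(toy)))

# Keep only features that occur at least twice across all documents
trimmed <- dfm_trim(toy_dfm, min_termfreq = 2)

featnames(toy_dfm)   # all features
featnames(trimmed)   # "vote" (frequency 1) has been dropped
```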


The next code chunk generates the word cloud using the trimmed DFM (`esc_dfm`).

The `quanteda.textplots::textplot_wordcloud()` function creates the visual representation. The `rotation = 0.25` argument sets the proportion of words that are drawn rotated by 90 degrees, here about a quarter of them. The `color` argument uses a reversed color palette from the RColorBrewer package to set the colors of the words in the cloud.


In [None]:
# This code generates a word cloud visualization based on the trimmed document-feature matrix (DFM). 
# A word cloud provides a visual representation of the most frequent terms, 
# where the size of each word indicates its frequency.

esc_dfm %>%
  quanteda.textplots::textplot_wordcloud(rotation = 0.25, 
                   color = rev(RColorBrewer::brewer.pal(10, "RdBu"))) 


**Saving wordcloud images**

This code chunk saves the word cloud visualization that has been generated from the trimmed document-feature matrix (`esc_dfm`) as a PNG image. Saving the visualization allows participants to utilize it in various contexts, such as presentations or reports.

1. **File Format and Dimensions**

* The `png()` function specifies the file format (PNG) and sets the file path and name where the image will be saved. The path is constructed using `here::here()`, which ensures the file is correctly located in the "images" directory of the project.

* The `width` and `height` parameters define the dimensions of the image in pixels. In this case, the image will be 600 pixels wide and 500 pixels tall.

2. **Word Cloud Generation**

* The word cloud visualization is created within the `png()` function's context. The same code used previously is reused here to generate the word cloud, ensuring that this specific visualization is saved to the file.

3. **Closing the Graphical Device**

* The `dev.off()` function is called to close the graphical device. This step is crucial, as it finalizes the image file and writes it to disk. If this function is not called, the image may not save correctly.


In [None]:
# This code saves the generated word cloud visualization as a PNG image file. 
# The image will be stored in the "images" directory of the project and 
# will be named "normal_wc.png". This allows for easy sharing and inclusion 
# in reports or presentations.

# Specify the output file format and dimensions for the PNG image
png(file = here::here("images", "normal_wc.png"), width = 600, height = 500)

# Generate the word cloud visualization using the trimmed document-feature matrix (DFM)
esc_dfm %>%
  quanteda.textplots::textplot_wordcloud(rotation = 0.25, 
                   color = rev(RColorBrewer::brewer.pal(10, "RdBu"))) 

# Close the graphical device to save the image
dev.off()


### Comparative wordcloud

This code chunk prepares data for creating a comparative word cloud, which will display differences in word usage between the Eurovision Song Contest (ESC) dataset and a control dataset. This comparison allows for insights into how language usage differs across these contexts.

1. **Data Extraction**

* The `dplyr::filter()` function extracts clean text from the control group within the `esc_clean` dataset. Only the rows corresponding to the control data are included.

* The `dplyr::pull(ctext)` function retrieves the relevant text from the `ctext` column, and `paste0(collapse = " ")` concatenates this text into a single string named `control_wc`.

2. **Creating a Combined Corpus**

* A combined corpus object is created using `quanteda::corpus(c(euro_wc, control_wc))`, which includes text from both the Eurovision dataset (`euro_wc`) and the control dataset (`control_wc`).

* A document variable named `type` is assigned to the corpus to differentiate between the two data types (`"ESC"` and `"Control"`).

3. **Generating the Document-Feature Matrix (DFM)**

* The `quanteda::dfm()` function generates a document-feature matrix from the tokens of the combined corpus. This matrix represents the frequency of terms in each document.

* The `quanteda::dfm_group(groups = corp_dom$type)` function groups the DFM by the document types, allowing for separate frequency counts for the ESC and control data.

* Finally, `quanteda::dfm_trim(min_termfreq = 1000, verbose = FALSE)` trims the DFM to retain only those terms that appear at least 1000 times across the documents, focusing the analysis on the most frequently used terms.


In [None]:
# This code prepares the data for generating a comparative word cloud. 
# It extracts clean text from both the control group and the Eurovision data, 
# creating a combined corpus for analysis. A document-feature matrix (DFM) 
# is then generated, allowing for a comparison of frequently used terms 
# between the two datasets.

# Extract clean text from the control group data
dplyr::filter(esc_clean, data == "control") %>% 
  dplyr::pull(ctext) %>%                 # Pull the 'ctext' column containing clean text
  paste0(collapse = " ") -> control_wc   # Concatenate all text into a single string

# Create a combined corpus from both Eurovision and control texts
corp_dom <- quanteda::corpus(c(euro_wc, control_wc))

# Assign a document variable 'type' to indicate the type of text (ESC or Control)
quanteda::docvars(corp_dom, "type") <- c("ESC", "Control")

# Generate a document-feature matrix (DFM) from the combined corpus
corp_dom <- quanteda::dfm(tokens(corp_dom)) %>%
    # Group the DFM by the document type
    quanteda::dfm_group(groups = corp_dom$type) %>%
    # Trim the DFM to retain only terms that appear frequently (min_termfreq = 1000)
  quanteda::dfm_trim(min_termfreq = 1000, verbose = FALSE)
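
The grouping step can be checked on a toy corpus. In the sketch below, the two short texts are hypothetical, and `docnames()` is used to show the order in which the groups end up in the grouped DFM.

```r
library(quanteda)

# Hypothetical mini corpus: one "ESC" text and one "Control" text
corp <- corpus(c("song vote song", "news vote"))
docvars(corp, "type") <- c("ESC", "Control")

# Build the DFM and collapse it to one row per group
grouped <- dfm_group(dfm(tokens(corp)), groups = docvars(corp, "type"))

docnames(grouped)    # the groups, in the order used by the grouped DFM
as.matrix(grouped)   # frequency of each term per group
```

Checking `docnames()` in this way is useful before plotting, because it shows which group comes first in the matrix.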


The next code chunk creates a comparative word cloud visualization based on the previously prepared document-feature matrix (`corp_dom`). The word cloud allows for a visual comparison of frequently used terms between the Eurovision dataset and the control dataset.

**Comparative Word Cloud Generation**

* The `quanteda.textplots::textplot_wordcloud()` function is used to generate the word cloud. With `comparison = TRUE`, the function draws a comparison cloud: the plot is divided into one region per group, and each word is placed in the region of the group in which it is relatively more frequent.

* The `color` argument specifies one color per group, here "purple" and "darkgray". The colors follow the order of the groups in the DFM, so `quanteda::docnames(corp_dom)` shows which color belongs to the Eurovision data and which to the control data.


In [None]:
# This code generates a comparative word cloud visualization using the 
# document-feature matrix (DFM) created from both the Eurovision and 
# control datasets. The word cloud will display terms from each dataset 
# in different colors, allowing for easy comparison of their usage.

corp_dom %>%
    # Create a word cloud that compares word frequencies between groups
    quanteda.textplots::textplot_wordcloud(comparison = TRUE, 
                                            color = c("purple", "darkgray"))


This code chunk saves the comparative word cloud visualization generated from the document-feature matrix (`corp_dom`) as a PNG image. Saving the image allows participants to use it in various contexts, such as presentations or reports.

1. **File Format and Dimensions**

* The `png()` function specifies the file format (PNG) and sets the file path and name where the image will be saved. The path is constructed using `here::here()`, which ensures the file is correctly located in the "images" directory of the project.

* The `width` and `height` parameters define the dimensions of the image in pixels. In this case, the image will be 600 pixels wide and 500 pixels tall.

2. **Comparative Word Cloud Generation**

* The comparative word cloud visualization is created within the `png()` function's context. The same code used previously is reused here to generate the word cloud, ensuring that this specific visualization is saved to the file.

3. **Closing the Graphical Device**

* The `dev.off()` function is called to close the graphical device. This step is crucial, as it finalizes the image file and writes it to disk. If this function is not called, the image may not save correctly.


In [None]:
# This code saves the generated comparative word cloud visualization 
# as a PNG image file. The image will be stored in the "images" 
# directory of the project and named "comparative_wc.png". This 
# allows for easy sharing and inclusion in reports or presentations.

# Specify the output file format and dimensions for the PNG image
png(file = here::here("images", "comparative_wc.png"), width = 600, height = 500)

# Generate the comparative word cloud visualization
corp_dom %>%
    quanteda.textplots::textplot_wordcloud(comparison = TRUE, 
                                            color = c("purple", "darkgray"))

# Close the graphical device to save the image
dev.off()


We now turn to concordancing, which is the most widely used method for inspecting textual data.

## Concordancing

Concordancing is a text analysis method that involves examining the context of specific words or phrases within a larger body of text. By creating a "concordance" or a list of occurrences along with their surrounding text, researchers can analyze patterns of usage, meaning, and collocation, providing insights into language use and discourse.

The next code snippet uses the quanteda package to perform a Keyword In Context (KWIC) analysis on the text in the `esc$body` variable, focusing on occurrences of the word "eurovision". Because `kwic()` matches case-insensitively by default, capitalized forms such as "Eurovision" are found as well. The output is converted into a clean data frame for easier inspection and analysis, showing the context of the keyword within a specified window size.


In [None]:
# Perform KWIC analysis on the text data
quanteda::kwic(x = tokens(esc$body),    # Define text(s) to analyze
                pattern = "eurovision",  # Define the pattern to search for
                window = 5) %>%         # Define the size of the context window

  # Convert the KWIC output into a data frame
  as.data.frame() %>%
  
  # Remove superfluous columns that are not needed for analysis
  dplyr::select(-to, -from, -pattern) -> kwic_esc

# Inspect the first few results of the KWIC analysis
head(kwic_esc)


The next code snippet saves the results of the Keyword In Context (KWIC) analysis stored in the `kwic_esc` data frame to an Excel file. The file is named "kwic1.xlsx" and is saved in the "tables" directory of the project, allowing for easy access and sharing of the analysis results.



In [None]:
# Save the KWIC analysis results to an Excel file
write.xlsx(kwic_esc, here::here("tables", "kwic1.xlsx"))


This code snippet performs a Keyword In Context (KWIC) analysis using a regular expression pattern to search for occurrences of words that begin with "walk" within the text data contained in the `esc$body` variable. The output is converted into a clean data frame for further inspection, showing the contexts in which these words appear.



In [None]:
# Create KWIC analysis using a regular expression pattern
quanteda::kwic(x = tokens(esc$body), 
               pattern = "^walk",       # Regex anchored with ^ so that only words starting with "walk" match
               window = 5,              # Define the size of the context window
               valuetype = "regex") %>%  # Specify that the pattern is a regular expression

  # Convert the KWIC output into a data frame
  as.data.frame() %>%
  
  # Remove superfluous columns that are not needed for analysis
  dplyr::select(-to, -from, -pattern) -> kwic_walk

# Inspect the first few results of the KWIC analysis
head(kwic_walk)
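
When `valuetype = "regex"` is used, a pattern can match anywhere inside a token unless it is anchored. The following sketch, based on a made-up sentence, compares an unanchored pattern with one anchored to the start of the token using `^`.

```r
library(quanteda)

# Hypothetical example sentence
toks <- tokens("We took a walk on the sidewalk while walking home")

# Unanchored: matches any token that contains "walk"
unanchored <- kwic(toks, pattern = "walk.*", valuetype = "regex", window = 2)

# Anchored: matches only tokens that start with "walk"
anchored <- kwic(toks, pattern = "^walk", valuetype = "regex", window = 2)

as.data.frame(unanchored)$keyword   # includes "sidewalk"
as.data.frame(anchored)$keyword     # "sidewalk" is no longer matched
```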


We now turn to identifying keywords, which is a very common task in text analytics.

## Identifying keywords

This code snippet prepares the data for identifying keywords by filtering the clean text data for the "control" group and then tokenizing it to analyze the frequency of each token (word). The results are organized into a data frame, which displays the most common tokens in the control data set for further analysis.


In [None]:
# Prepare data for identifying keywords from the control group
cntltb <- esc_clean %>% 
  dplyr::filter(data == "control")  # Filter the dataset for the control group

# Reformat control data for analysis
quanteda::tokens(as.character(cntltb$ctext)) %>%  # Tokenize the text data
  unlist() %>% 
  as.data.frame() %>% 
  # Rename the first column for clarity
  dplyr::rename(token = 1) %>% 
  dplyr::group_by(token) %>%  # Group by token (word)
  
  # Determine the frequency of each token/word
  dplyr::summarise(text2 = n()) %>%
  
  # Arrange the tokens by frequency in descending order
  dplyr::arrange(-text2) -> text2

# Inspect the first few results of the keyword analysis
head(text2)


This code snippet combines the keyword frequency tables (`text1` and `text2`) generated from previous analyses into a single data frame (`keytb`). It uses a full join to ensure all tokens from both tables are included and replaces any missing frequency values with zeros for a clearer comparison of keyword usage across the datasets.



In [None]:
# Combine the frequency tables from the two datasets
keytb <- dplyr::full_join(text1, text2, by = "token") %>%
  # Replace NA values with 0 for clarity
  tidyr::replace_na(list(text1 = 0, text2 = 0))

# Inspect the first few results of the combined data
head(keytb)
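
To see what the full join and the NA replacement do, here is a minimal sketch on two made-up frequency tables (`toy1` and `toy2` are hypothetical and only mirror the layout of `text1` and `text2`). Tokens that occur in only one table receive a frequency of 0 in the other.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy frequency tables (not the workshop data)
toy1 <- data.frame(token = c("song", "vote"), text1 = c(5, 2))
toy2 <- data.frame(token = c("song", "game"), text2 = c(1, 4))

# Keep all tokens from both tables and fill missing frequencies with 0
toy_keytb <- dplyr::full_join(toy1, toy2, by = "token") %>%
  tidyr::replace_na(list(text1 = 0, text2 = 0))

# Inspect the combined toy table
toy_keytb
```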


This code snippet loads a custom function for extracting keyness measures from an external script and then applies this function to the combined keyword frequency data (`keytb`) to calculate various statistical metrics. The results are stored in the `keys` data frame, which is subsequently inspected to view the first ten rows.



In [None]:
# Load a function that extracts keyness measures (default ordering by G2)
source("https://slcladal.github.io/rscripts/keystats.R")

# Calculate keyness statistics from the combined frequency table (keytb)
keys <- keystats(keytb)

# Inspect the first 10 rows of the keyness statistics data
head(keys, 10)
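
The `keystats` script itself is not shown in this notebook, so the following is only a sketch of the general idea behind the G2 (log-likelihood ratio) measure mentioned above, not necessarily the way `keystats.R` computes it. The helper `g2_sketch` and its arguments are hypothetical names introduced for illustration. It compares the observed frequency of one token in two corpora with the frequencies expected if the token were equally common in both.

```r
# Hypothetical helper: log-likelihood ratio (G2) for a single token
# a  = frequency of the token in the target corpus
# b  = frequency of the token in the reference corpus
# n1 = total number of tokens in the target corpus
# n2 = total number of tokens in the reference corpus
g2_sketch <- function(a, b, n1, n2) {
  # Observed counts: token vs. all other tokens, in each corpus
  observed <- c(a, b, n1 - a, n2 - b)
  # Expected counts if the token were equally frequent in both corpora
  expected <- c((a + b) * n1, (a + b) * n2,
                (n1 + n2 - a - b) * n1, (n1 + n2 - a - b) * n2) / (n1 + n2)
  # Cells with an observed count of 0 contribute nothing to the sum
  terms <- ifelse(observed == 0, 0, observed * log(observed / expected))
  2 * sum(terms)
}

# Token clearly over-represented in the target corpus
g2_sketch(30, 5, 1000, 1000)

# Token with the same relative frequency in both corpora
g2_sketch(10, 20, 100, 200)
```

Larger values indicate a stronger difference between the corpora; with one degree of freedom, values above 3.84 correspond to p < .05.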


This code snippet saves the keyness statistics obtained from the previous analysis, stored in the keys data frame, to an Excel file. The file is named "keys.xlsx" and is saved in the "tables" directory of the project for easy access and sharing of the results.



In [None]:
# Save the keyness statistics to an Excel file
write.xlsx(keys, here::here("tables", "keys.xlsx"))


We now turn to visualising networks of keywords.

## Visualising networks

Network visualization is a powerful tool in text analysis as it helps to uncover relationships and connections between different words or phrases, making it easier to identify patterns and structures within the data. By representing these connections graphically, researchers can gain insights into the prominence and contextual usage of terms in their texts.

This code snippet extracts the top 30 keywords classified as "type" from the `keys` data frame, allowing for a focused analysis of the most significant terms to be visualized in a network.


In [None]:
# Extract the top 30 keywords classified as "type" for network visualization
keywords <- keys %>%
  dplyr::filter(type == "type") %>%  # Filter for keywords of the specified type
  dplyr::pull(token) %>%              # Pull the keyword tokens into a vector
  head(30)                            # Select the top 30 keywords

# Inspect the extracted keywords
keywords


This code snippet sets up a network visualization of keywords based on co-occurrence in text data from the Eurovision events in 2015 and 2025. It first filters the dataset for the relevant years, creates a tokenized version of the text, generates a feature co-occurrence matrix (`fcmat`), and then selects the co-occurrences of the specified keywords for visualization. Finally, it uses the `textplot_network` function to create a network plot that represents the relationships between these keywords, customizing various visual attributes for clarity.



In [None]:
set.seed(100)  # Set a seed for reproducibility of random processes

# Filter the clean text data for the years 2015 and 2025, extract the text, and tokenize it
toks <- esc_clean %>%
  dplyr::filter(data == "euro2015" | data == "euro2025") %>%  # Filter for relevant datasets
  dplyr::pull(ctext) %>%                                       # Extract the clean text
  tokens()                                                     # Tokenize the text data

# Generate a feature co-occurrence matrix based on a window context
fcmat <- fcm(toks, context = "window", tri = FALSE)

# Keep only the rows and columns of the co-occurrence matrix that belong to the keywords
fcm_50 <- fcm_select(fcmat, pattern = keywords)

# Create a network plot from the selected co-occurrence matrix
fcm_50 %>%
    textplot_network(edge_color = "lightgray",                # Set edge color for clarity
                     edge_alpha = 0.5,                        # Adjust edge transparency
                     vertex_color = "gray30",                 # Set vertex color
                     vertex_labelsize = log(rowSums(fcm_50)/min(rowSums(fcm_50))) * 3) + # Scale vertex label sizes
    
  # Customize the plot background
  theme(
    plot.background = element_rect(fill = "white"),           # Set plot background to white
    panel.background = element_rect(fill = "white", colour = "white")  # Set panel background to white
    ) -> netplot  # Save the resulting plot to netplot

# Display the network plot
netplot
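
To make the co-occurrence matrix and the label scaling more concrete, here is a minimal sketch on two made-up documents (the objects `toy_toks`, `toy_fcm`, `totals`, and `sizes` are hypothetical and not part of the workshop data). It builds a symmetric co-occurrence matrix and then applies the same label-size formula as the plot call above.

```r
library(quanteda)

# Two hypothetical toy documents (not the workshop data)
toy_toks <- tokens(c("song points win", "song win"))

# Symmetric feature co-occurrence matrix within the default window of 5 tokens
toy_fcm <- as.matrix(fcm(toy_toks, context = "window", tri = FALSE))
toy_fcm

# Label sizes computed with the same formula as in the plot call above
totals <- rowSums(toy_fcm)
sizes <- log(totals / min(totals)) * 3
sizes
```

Note that the keyword with the fewest co-occurrences always receives a label size of 0 under this formula, because `log(1)` is 0, so its label may be very small or invisible. If that is not intended, one possible adjustment (an assumption, not part of the original workshop code) is to add a constant, for example `log(totals / min(totals) + 1) * 3`.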


This code snippet saves the previously created network plot (`netplot`) as a PNG image file. The image is stored in the "images" directory with the filename "netplot.png," ensuring that the visualization can be easily accessed and shared.



In [None]:
# Save the network plot as a PNG image file
ggsave(here::here("images", "netplot.png"), plot = netplot)


In the final part of the workshop, we focus on topic modeling.

## Topic modelling

Topic modeling is a statistical technique used in natural language processing to discover the underlying themes or topics within a collection of texts. By analyzing word co-occurrence patterns, it identifies groups of words that frequently appear together, indicating common subjects across the documents.

The most popular methods include Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). Topic modeling helps researchers and analysts summarize large datasets, making it easier to understand the main themes, track trends over time, and compare different texts or sources. This technique is particularly useful in fields such as social sciences, marketing, and information retrieval, where understanding content at scale is crucial.

### Data-driven Topic Modelling

Here we begin by implementing a data-driven Latent Dirichlet Allocation (LDA) topic model. This initial step allows us to uncover the underlying themes present in the corpus based solely on the distribution of words without any prior labeling. In the second step, we will utilize a supervised topic model that builds upon the insights gained from the data-driven LDA. This approach enables the researcher to generate a more meaningful and interpretable model by incorporating predefined labels or categories. By leveraging both unsupervised and supervised techniques, we aim to enhance the quality and relevance of the topics identified, ultimately providing a deeper understanding of the text data. 

This code snippet creates a text corpus by filtering the `esc_clean` dataset to exclude any entries labeled as "control". It extracts the clean text content from the remaining entries and stores it in the variable `corpus`, which can then be used for further text analysis or modeling.


In [None]:
# Create a text corpus by filtering out control data and extracting clean text
corpus <- esc_clean %>%
  dplyr::filter(data != "control") %>%   # Exclude entries labeled as "control"
  dplyr::pull(ctext)                     # Extract the clean text content


This code snippet loads a custom function from an external script to preprocess a text corpus for topic modeling. The `preptop` function cleans the corpus and converts it into a document-feature matrix (dfm), which represents the frequency of words across documents.

After converting the corpus, the code removes any columns (features) and rows (documents) that do not contain any words (i.e., have a total count of zero). Finally, it displays the first five rows and columns of the cleaned document-feature matrix for inspection.


In [None]:
# Load function that helps with loading texts for preprocessing
source("https://slcladal.github.io/rscripts/preptop.R")

# Clean the corpus and convert it to a document-feature matrix (dfm)
clean_dfm <- preptop(corpus)                           # Apply the preptop function to the corpus
clean_dfm <- clean_dfm[, colSums(abs(clean_dfm)) > 0]  # Remove columns (features) with zero counts
clean_dfm <- clean_dfm[rowSums(abs(clean_dfm)) > 0,]   # Remove rows (documents) with zero counts

# Inspect the first five rows and columns of the cleaned dfm
clean_dfm[1:5, 1:5]
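
The two filtering steps matter because `topicmodels::LDA` stops with an error if any document (row) contains no non-zero entry. The following minimal sketch applies the same logic to a small made-up count matrix (`toy` is hypothetical and stands in for a document-feature matrix); `drop = FALSE` is added only to keep the result a matrix in this base R example.

```r
# Hypothetical toy count matrix: 3 documents x 3 features
toy <- matrix(c(1, 0, 2,
                0, 0, 0,
                0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3"),
                              c("song", "vote", "stage")))

# Remove features (columns) that never occur
toy <- toy[, colSums(abs(toy)) > 0, drop = FALSE]

# Remove documents (rows) that contain no features
toy <- toy[rowSums(abs(toy)) > 0, , drop = FALSE]

# Inspect the reduced matrix
toy
```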


This code snippet generates a Latent Dirichlet Allocation (LDA) topic model using the `topicmodels` package, applied to the cleaned document-feature matrix (`clean_dfm`). The parameter `k` is set to 5, indicating the number of topics to be identified within the corpus. Users can modify this value to explore different topic configurations, such as 10 or 15 topics, and examine the consistency of the keywords associated with each topic. The `control` argument sets a random seed for reproducibility.



In [None]:
# Generate an LDA topic model with 5 topics; adjust k to explore different topic counts
tmlda <- topicmodels::LDA(clean_dfm, k = 5, control = list(seed = 1234))


This code snippet loads a custom function from an external script that is designed to tabulate the top terms for each topic identified in the LDA model. The tabtop function is applied to the tmlda object, which contains the LDA model, and it specifies that the top 10 terms for each topic should be displayed. This allows users to examine the most significant words associated with each identified topic, providing insights into the thematic structure of the corpus.



In [None]:
# Load function that tabulates the top terms for each topic in the LDA model
source("https://slcladal.github.io/rscripts/tabtop.R")

# Inspect the top 10 terms for each topic in the LDA model
tabtop(tmlda, 10)


This code snippet uses the `tidytext` package to transform the LDA model object (`tmlda`) into a tidy format, specifically focusing on the topic-term probabilities represented in the "beta" matrix. The resulting data frame, `termprobs_tmlda`, contains the probabilities of each term belonging to each topic, facilitating further analysis or visualization of the topic modeling results. The `head` function is then employed to display the first few rows of this tidy data frame for inspection.



In [None]:
# Transform the LDA model into a tidy format, extracting topic-term probabilities (beta matrix)
termprobs_tmlda <- tidytext::tidy(tmlda, matrix = "beta")

# Inspect the first few rows of the tidy data frame containing term probabilities
head(termprobs_tmlda)
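
Once the topic-term probabilities are in tidy format, the most probable terms per topic can be extracted with standard dplyr verbs. The sketch below uses a made-up table (`toy_beta`, hypothetical values) with the same three columns as `termprobs_tmlda` (`topic`, `term`, `beta`) and keeps the two most probable terms for each topic.

```r
library(dplyr)

# Hypothetical toy table in the same layout as termprobs_tmlda
toy_beta <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("song", "vote", "stage", "song", "vote", "stage"),
  beta  = c(0.6, 0.3, 0.1, 0.2, 0.1, 0.7)
)

# Keep the two most probable terms per topic
toy_top <- toy_beta %>%
  dplyr::group_by(topic) %>%
  dplyr::slice_max(beta, n = 2) %>%
  dplyr::ungroup()

# Inspect the top terms
toy_top
```

The same pipeline should work on `termprobs_tmlda` itself, with `n` set to the number of terms you want to see per topic.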


This code snippet saves the tidy data frame termprobs_tmlda, which contains the probabilities of terms across the topics identified in the LDA model, to an Excel file. The `write_xlsx` function from the `writexl` package is used to export the data, with the file being saved in the specified "tables" folder as "termprobs_tmlda.xlsx." This allows for easy access and sharing of the topic-term probabilities for further analysis or reporting.



In [None]:
# Save the tidy data frame of term probabilities to an Excel file in the tables folder
writexl::write_xlsx(termprobs_tmlda, here::here("tables", "termprobs_tmlda.xlsx"))


This code snippet utilizes the `tidytext` package to convert the LDA model object (`tmlda`) into a tidy format that focuses on document-topic probabilities, represented in the "gamma" matrix. The resulting data frame, `docprobs_tmlda`, contains the probabilities of each document belonging to each topic, allowing for an examination of how topics are distributed across the documents in the corpus. The `head` function is then used to display the first few rows of this data frame for inspection.



In [None]:
# Transform the LDA model into a tidy format, extracting document-topic probabilities (gamma matrix)
docprobs_tmlda <- tidytext::tidy(tmlda, matrix = "gamma")

# Inspect the first few rows of the tidy data frame containing document probabilities
head(docprobs_tmlda)
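
A common follow-up step is to find the dominant topic of each document, that is, the topic with the highest gamma value. The sketch below does this on a made-up table (`toy_gamma`, hypothetical values) with the same columns as `docprobs_tmlda` (`document`, `topic`, `gamma`).

```r
library(dplyr)

# Hypothetical toy table in the same layout as docprobs_tmlda
toy_gamma <- data.frame(
  document = c("text1", "text1", "text2", "text2"),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.9, 0.1, 0.25, 0.75)
)

# Keep the single most probable topic per document
toy_dominant <- toy_gamma %>%
  dplyr::group_by(document) %>%
  dplyr::slice_max(gamma, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()

# Inspect the dominant topics
toy_dominant
```

Setting `with_ties = FALSE` guarantees exactly one row per document even if two topics happen to have the same probability.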


This code snippet saves the tidy data frame `docprobs_tmlda`, which contains the probabilities of documents belonging to each topic identified in the LDA model, to an Excel file. The `write_xlsx` function from the `writexl` package is used to export the data, with the file being saved in the specified "tables" folder as "docprobs_tmlda.xlsx." This allows for easy access to the document-topic probabilities for further analysis or reporting.



In [None]:
# Save the tidy data frame of document probabilities to an Excel file in the tables folder
writexl::write_xlsx(docprobs_tmlda, here::here("tables", "docprobs_tmlda.xlsx"))


### Supervised Topic Modelling


In this code snippet, we implement a semi-supervised Latent Dirichlet Allocation (LDA) topic model using predefined topic labels. A dictionary is created, where each topic is associated with specific keywords relevant to that topic. The `seededlda::textmodel_seededlda` function is then used to fit the model to the cleaned document-feature matrix (`clean_dfm`) with the specified dictionary. By setting `residual = TRUE`, we allow the model to capture not only the predefined topics but also any additional patterns in the data that may not be covered by the keywords. Finally, the `terms` function is called to inspect the terms associated with each identified topic, providing insights into the keywords that define them.


In [None]:
# Create a dictionary for semi-supervised LDA with predefined topics
dict <- dictionary(list(Topic01 = c("norway", "italy", "france"),
                        Topic02 = c("points", "win", "song"),
                        Topic03 = c("like", "bad", "performance"),
                        Topic04 = c("fabulous", "glamorous", "fantastic")))

# Fit a semi-supervised LDA model using the cleaned document-feature matrix and the defined dictionary
tmod_slda <- seededlda::textmodel_seededlda(clean_dfm, 
                                            dict, 
                                            residual = TRUE, 
                                            min_termfreq = 2)

# Inspect the terms associated with each identified topic
terms(tmod_slda)
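
Seed words only help if they actually occur in the data. One way to check this, sketched here on two made-up documents (`toy_dfm` and `toy_dict` are hypothetical and much smaller than the workshop objects), is to count dictionary matches per document with `dfm_lookup` from quanteda.

```r
library(quanteda)

# Two hypothetical toy documents (not the workshop data)
toy_dfm <- dfm(tokens(c(d1 = "norway and italy got many points",
                        d2 = "a fabulous song and a fantastic performance")))

# A reduced, hypothetical version of the seed dictionary
toy_dict <- dictionary(list(Topic01 = c("norway", "italy", "france"),
                            Topic02 = c("points", "win", "song")))

# Count, per document, how many tokens match the seed words of each topic
seed_counts <- as.matrix(dfm_lookup(toy_dfm, dictionary = toy_dict))

# Inspect the seed word counts
seed_counts
```

A topic whose seed words are rare or absent across the corpus is unlikely to be estimated well, so it may be worth revising its keywords before fitting the model.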


This code snippet generates a data frame that links each document to the topic assigned to it by the semi-supervised LDA model. The `topics(tmod_slda)` function retrieves the most likely topic for each document, and the names of that result provide the document identifiers, which are stored in `files`. A data frame (`df`) is then created with columns for the document names and their associated topics. The `mutate_if` function converts any character columns to factors, making it easier to analyze categorical data. Finally, `head(df)` displays the first few rows of the data frame for inspection.



In [None]:
# Extract the document names from the topic assignments of the semi-supervised LDA model
files <- names(topics(tmod_slda))

# Retrieve the topics assigned to each document
topics <- topics(tmod_slda)

# Generate a data frame containing document names and assigned topics
df <- data.frame(files, topics) %>%
  # Convert character columns to factors
  dplyr::mutate_if(is.character, factor)

# Inspect the first few rows of the data frame
head(df)


This code snippet saves the data frame `df` to an Excel file named "df.xlsx" in the "tables" directory. The `write_xlsx` function from the `writexl` package is used to export the data frame, allowing for easy sharing and further analysis in Excel or other spreadsheet software.



In [None]:
# Save the data frame of documents and topics to an Excel file in the tables folder
writexl::write_xlsx(df, here::here("tables", "df.xlsx"))



# Outro

Thank you for attending this workshop on basic text analysis with R! We hope you found the content valuable and that you feel equipped to apply these techniques in your research. If you're eager to learn more, we encourage you to explore the Language Technology and Data Analysis Laboratory (LADAL) tutorials at www.ladal.edu.au, where you can access additional resources and tutorials.

To wrap up, here’s some information about your current R session, which can help you keep track of your working environment:


In [None]:
sessionInfo()




Thank you again for your participation, and we wish you the best in your future analyses!
