In [None]:
library(quanteda)
library(quanteda.textstats)
library(writexl)
library(seededlda)

In [None]:
# Load data and list column names
convos <- read.csv('data/all-conversations.csv')

# Restrict the analysis to turns that fall entirely within the MAIN section. A small number of turns overlap
# between MAIN and the pre/post sections, because different conversation groups overlapped during data collection.
convos <- convos[convos$section == 'MAIN',]

# Generate a doc_id for quanteda. Ideally this would be an existing primary key rather than a row sequence.
convos$doc_id <- seq_along(convos$text)

nrow(convos)
names(convos)

In [None]:
# Create a corpus from the dataframe we loaded, use the text field as the text
convo_corpus <- corpus(convos, text_field="text", docid_field = "doc_id")

# Tokenise the corpus using quanteda's default tokeniser, dropping punctuation and the standard English stopwords.
# Note that the standard English list assumes written, not spoken, material, so we will have to take a closer look at this.
convo_tokens <- tokens(
    convo_corpus, remove_punct=TRUE
) |> tokens_remove(stopwords("english"))

# Create a document-feature matrix (also known as a document-term or term-document matrix).
# dfm is the standard quanteda nomenclature, so I'll use it throughout.
# More specifically, this is a turn-token matrix, as the 'documents' are single turns by a speaker.
convo_turn_dfm <- dfm(convo_tokens)

# Granularity

We can count tokens at different levels of granularity as a first step towards identifying topics. A word that is used once in every conversation behaves differently from a word used 30 times in a single conversation, even if the total counts are similar.

Some examples:

- we can count tokens, regardless of where they occur
- we can count turns including a token
- we can count conversations including a token
- we can count by speaker in a conversation
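
As a minimal sketch of how these levels differ, the toy example below counts the same tokens in each of the four ways. The data frame `toy` and its `conversation` and `speaker` columns are invented for illustration and are not part of the real dataset.

```r
library(quanteda)

# Hypothetical toy data: four turns across two conversations.
toy <- data.frame(
    doc_id = c("t1", "t2", "t3", "t4"),
    text = c("tea tea tea", "tea please", "coffee please", "coffee"),
    conversation = c("A", "A", "B", "B"),
    speaker = c("s1", "s2", "s1", "s3")
)

toy_corpus <- corpus(toy, text_field = "text", docid_field = "doc_id")
toy_dfm <- dfm(tokens(toy_corpus))

# Token counts, regardless of where they occur: tea = 4, please = 2, coffee = 2
featfreq(toy_dfm)

# Turns including each token: tea = 2, please = 2, coffee = 2
docfreq(toy_dfm)

# Conversations including each token: tea = 1, please = 2, coffee = 1
toy_convo_dfm <- dfm_group(toy_dfm, groups = docvars(toy_dfm, "conversation"))
docfreq(toy_convo_dfm)

# Speaker-within-conversation units including each token
toy_speaker_dfm <- dfm_group(
    toy_dfm,
    groups = paste(docvars(toy_dfm, "conversation"), docvars(toy_dfm, "speaker"))
)
docfreq(toy_speaker_dfm)
```

Here "tea" has the highest token count but appears in only one conversation, while "please" is rarer overall yet present in both.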

In [None]:
# Let's start with tokens and turns:
frequencies <- textstat_frequency(convo_turn_dfm)
names(frequencies)

# In the output, feature is the token, frequency is the token count, and docfreq is the number of turns containing the token.
write_xlsx(frequencies, "results/token_turn_counts.xlsx")

In [None]:
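# Group the dfm by file (one file per conversation) to count tokens and conversations.
# The grouping variable is taken from the dfm's own docvars so it always lines up with the dfm's rows.
convo_dfm <- dfm_group(convo_turn_dfm, groups=docvars(convo_turn_dfm, 'X_file_id'))
# After grouping, docfreq is the number of conversations containing the token.
conversation_frequencies <- textstat_frequency(convo_dfm)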

write_xlsx(conversation_frequencies, "results/token_conversation_counts.xlsx")

In [None]:
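# Topics by turn: fit a 20-topic LDA model, treating each turn as a document.
# No random seed is set here, so the topics may differ between runs.
lda_turns <- textmodel_lda(convo_turn_dfm, k=20)
# Extract the top 20 terms for each topic, one column per topic
turn_results <- terms(lda_turns, n=20) |> as.data.frame()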

write_xlsx(turn_results, "results/turn_topic_top_words.xlsx")

turn_results

In [None]:
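# Topics by conversation: fit a 20-topic LDA model, treating each whole conversation as a document.
lda_convo <- textmodel_lda(convo_dfm, k=20)
# Extract the top 20 terms for each topic, one column per topic
convo_results <- terms(lda_convo, n=20) |> as.data.frame()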

write_xlsx(convo_results, "results/convo_topic_top_words.xlsx")

convo_results