In [1]:
library(quanteda)
library(quanteda.textstats)
library(writexl)
library(seededlda)

Package version: 3.3.1
Unicode version: 15.1
ICU version: 75.1

Parallel computing: 16 of 16 threads used.

See https://quanteda.io for tutorials and examples.

Loading required package: proxyC


Attaching package: ‘proxyC’


The following object is masked from ‘package:stats’:

    dist



Attaching package: ‘seededlda’


The following object is masked from ‘package:stats’:

    terms




In [2]:
# Load data and list column names
convos <- read.csv('data/all-conversations-sample.csv')

# Generate a doc_id for quanteda. Ideally this would be an existing primary key rather than a row number.
convos$doc_id <- seq_along(convos$text)

names(convos)

In [3]:
# Create a corpus from the dataframe we loaded, use the text field as the text
convo_corpus <- corpus(convos, text_field="text", docid_field = "doc_id")

# Tokenise the corpus with quanteda's default tokeniser, dropping punctuation and the standard English stopwords.
# Note that the standard English list assumes written, not spoken, material, so we will have to take a closer look at this.
convo_tokens <- tokens(
    convo_corpus, remove_punct=TRUE
) |> tokens_remove(stopwords("english"))

# Create a document-feature matrix (also known as a document-term or term-document matrix).
# dfm is the standard quanteda name for it, so I'll use that throughout.
# More specifically, this is a turn-token matrix, as the 'documents' are single turns by a speaker.
convo_turn_dfm <- dfm(convo_tokens)
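
The stopword comment above points out that the standard English list targets written text. As a rough sketch of one way to extend it for speech, the base-R snippet below drops a hand-picked list of fillers and backchannels from toy token vectors. The `spoken_fillers` list, the `remove_fillers` helper and the example turns are all illustrative assumptions, not derived from the data.

```r
# Hypothetical filler/backchannel list for spoken data (an assumption);
# adjust it after inspecting the frequency tables.
spoken_fillers <- c("um", "uh", "hh", "mm", "hmm", "yeah", "haha")

# Toy tokenised turns (made-up data for illustration).
toy_turns <- list(
  turn1 = c("um", "went", "school", "brisbane"),
  turn2 = c("yeah", "haha", "nice"),
  turn3 = c("uh", "worked", "electrician")
)

# Drop any token that appears in the filler list (case-insensitive).
remove_fillers <- function(tokens, fillers) {
  tokens[!tolower(tokens) %in% fillers]
}

filtered_turns <- lapply(toy_turns, remove_fillers, fillers = spoken_fillers)
print(filtered_turns)

# With quanteda, the equivalent step on the real tokens would be:
# convo_tokens <- tokens_remove(convo_tokens, spoken_fillers)
```

Keeping the filler list in its own vector makes it easy to compare results with and without it, since some of these tokens may be of interest in conversational data.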

# Granularity

Counting at different levels of granularity is a first step towards identifying topics. A word that is used once in every conversation behaves differently from a word that is used 30 times in a single conversation.

Some examples:

- we can count tokens, regardless of where they occur
- we can count the turns that include a token
- we can count the conversations that include a token
- we can count by speaker in a conversation
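
The four counting levels listed above can be sketched on a toy table with one row per token occurrence. This is a minimal base-R illustration with invented data and hypothetical helper names (`as_counts`, `count_units`); the speaker count shown is one possible reading of "by speaker", namely how many speakers in a conversation use the token.

```r
# Toy data: one row per token occurrence (made-up values for illustration).
toy <- data.frame(
  conversation = c("c1", "c1", "c1", "c1", "c2", "c2"),
  turn         = c("c1_t1", "c1_t1", "c1_t2", "c1_t3", "c2_t1", "c2_t2"),
  speaker      = c("A", "A", "B", "A", "C", "D"),
  token        = c("school", "school", "school", "work", "school", "work")
)

# Convert a one-dimensional table into a plain named vector.
as_counts <- function(tab) setNames(as.vector(tab), names(tab))

# Number of distinct units (turns, conversations, ...) in which each token occurs.
count_units <- function(unit, token) {
  pairs <- unique(data.frame(unit = unit, token = token))
  as_counts(table(pairs$token))
}

token_counts        <- as_counts(table(toy$token))               # every occurrence
turn_counts         <- count_units(toy$turn, toy$token)          # turns containing the token
conversation_counts <- count_units(toy$conversation, toy$token)  # conversations containing it
# Speaker labels are only unique within a conversation, so combine both columns.
speaker_counts      <- count_units(paste(toy$conversation, toy$speaker), toy$token)

print(rbind(tokens = token_counts, turns = turn_counts,
            conversations = conversation_counts, speakers = speaker_counts))
```

In this toy data "school" occurs more often than "work" as a token, yet both appear in the same number of conversations, which is the kind of contrast the different levels are meant to expose.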

In [4]:
# Let's start with tokens and turns:
frequencies <- textstat_frequency(convo_turn_dfm)
names(frequencies)

# In the output, feature is the token, frequency is the total token count,
# and docfreq is the number of turns that contain the token.
write_xlsx(frequencies, "results/token_turn_counts.xlsx")

In [5]:
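# Group the turn-level dfm by file (conversation), so that docfreq now counts conversations rather than turns.
# The grouping variable is read from the dfm itself so that it stays aligned with the dfm's rows.
convo_dfm <- dfm_group(convo_turn_dfm, groups = docvars(convo_turn_dfm, '.file_id'))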
conversation_frequencies <- textstat_frequency(convo_dfm)

write_xlsx(conversation_frequencies, "results/token_conversation_counts.xlsx")
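
To make the effect of grouping concrete, here is a small base-R sketch of the same idea on a toy count matrix: rows belonging to one conversation are summed, and the frequency and document-frequency columns are then recomputed. It is only an illustration under assumed data, not how quanteda implements `dfm_group()`; the matrix, the `conversation_of_turn` vector and the `summarise_counts` helper are hypothetical.

```r
# Toy turn-by-token count matrix (made-up data): rows are turns, columns are tokens.
toy_turn_matrix <- matrix(
  c(2, 0,
    1, 0,
    0, 1,
    1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("t1", "t2", "t3", "t4"), c("school", "work"))
)

# Conversation that each turn belongs to, in the same order as the matrix rows.
conversation_of_turn <- c("c1", "c1", "c1", "c2")

# Sum the rows within each conversation.
toy_convo_matrix <- rowsum(toy_turn_matrix, group = conversation_of_turn)

# frequency: total occurrences; docfreq: number of rows with a non-zero count.
summarise_counts <- function(m) {
  data.frame(
    feature   = colnames(m),
    frequency = unname(colSums(m)),
    docfreq   = unname(colSums(m > 0))
  )
}

turn_level         <- summarise_counts(toy_turn_matrix)
conversation_level <- summarise_counts(toy_convo_matrix)
print(turn_level)
print(conversation_level)
```

The total frequency of each token is unchanged by grouping; only the document frequency changes, because the unit counted as a "document" is now the conversation.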

In [6]:
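# Topics by turn: fit a 10-topic LDA model with each turn as a document,
# then list the top-ranked terms for each topic.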
lda_turns <- textmodel_lda(convo_turn_dfm, k=10)
turn_results <- terms(lda_turns, n=20) |> as.data.frame()

write_xlsx(turn_results, "results/turn_topic_top_words.xlsx")

turn_results

topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
like,oh,um,know,like,uh,yeah,yeah,like,really
just,ok,one,well,hh,know,mm,haha,just,good
know,okay,like,go,um,got,hmm,hahah,hh,little
kind,yeah,school,ah,different,come,nice,think,really,think
people,haha,year,right,kinda,australia,haha,hahaah,hahaha,bit
say,yep,actually,now,cause,always,meet,definitely,go,um
things,hahaha,cause,see,lot,work,hoping,right,wanna,can
get,yes,get,mean,sort,never,hm,hi,much,still
stuff,really,time,just,whole,something,hahaha,true,gonna,like
think,good,two,going,well,get,times,live,thought,moved
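
A table like the one above lists the highest-ranked words for each topic. As a rough illustration of that ranking idea, and not of the seededlda implementation itself, the sketch below picks the most probable words per topic from a small hand-made topic-by-word matrix. The matrix values, the word list and the `top_words` helper are invented for the example.

```r
# Hypothetical topic-by-word probabilities (each row sums to 1); values are made up.
toy_phi <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.25, 0.60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), c("yeah", "haha", "school", "work"))
)

# Return the n most probable words for each topic, one column per topic.
# Intended for n >= 2; with n = 1, apply() would return a plain vector instead.
top_words <- function(phi, n) {
  ranked <- apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
  as.data.frame(ranked)
}

toy_top <- top_words(toy_phi, n = 2)
print(toy_top)
```

Reading the columns side by side in this way makes it easy to spot topics whose top words are mostly fillers or laughter tokens.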


In [7]:
# Topics by conversation: fit the same 10-topic LDA model, this time with each conversation as a document.
lda_convo <- textmodel_lda(convo_dfm, k=10)
convo_results <- terms(lda_convo, n=20) |> as.data.frame()

write_xlsx(convo_results, "results/convo_topic_top_words.xlsx")

convo_results

topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
born,haha,worked,sort,ok,yeah,hh,uh,hh,right
town,hh,electrician,brisbane,bit,like,um,ok,haha,ok
lots,hahaha,ok,found,city,um,okay,think,okay,hhh
also,okay,chuck,non-native,floor,just,different,well,hahaha,americans
hour,good,japanese,gosh,guess,oh,mm,states,tsk,mean
high,hahah,enough,probably,bk,know,uh,probably,idioms,south
alright,nice,town,business,mmm,really,hmm,got,hahaah,coast
side,hahaah,south,quite,wing,mm,ha,able,hahahaha,sure
group,gonna,grow,early,small,go,sort,small,speakers,nothing
sorry,yes,fruit,try,guys,cause,thing,come,hahah,two
