---
title: "6. Using tidytext with textmineR"
author: "Thomas W. Jones"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{6. Using tidytext with textmineR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
```

# Using tidytext with textmineR
The [`tidytext`](https://CRAN.R-project.org/package=tidytext) package is one of the more popular natural language processing packages in R's ecosystem. It follows the conventions and syntax of the "tidyverse."

You may prefer to use `tidytext` for a couple of reasons. First, `tidytext` has its own philosophy and syntax for handling text, particularly at the early stages of an analysis. You may be more familiar or comfortable with this approach. Second, `tidytext` arguably offers more flexibility in the options for creating DTMs or TCMs. This early stage is critical to successful topic modeling.

See _[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/)_ for more details about `tidytext`.

What follows is a short script combining `tidytext` with `textmineR`. Initial data curation and DTM creation are done with `tidytext`. Topic modeling is done with `textmineR`, and the outputs are re-formatted in the flavor of `tidytext`'s "tidiers" for other topic models.
```{r}
################################################################################
# Example: Using tidytext with textmineR
################################################################################

library(tidytext)
library(textmineR)
library(dplyr)
library(tidyr)

# load documents in a data frame
docs <- textmineR::nih_sample

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>%
  select(APPLICATION_ID, ABSTRACT_TEXT) %>%
  unnest_tokens(output = word,
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"),
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, n = 2) %>%
  count(APPLICATION_ID, word) %>%
  filter(n > 1) # filter for words/bigrams used more than once per document, rather than per corpus

# filter out tokens that are just numbers
tidy_docs <- tidy_docs %>%
  filter(! stringr::str_detect(word, "^[0-9]+$"))

# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>%
  cast_sparse(APPLICATION_ID, word, n)

# create a topic model
m <- FitLdaModel(dtm = d,
                 k = 20,
                 iterations = 200,
                 burnin = 175)

# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")),
                        m$phi,
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>%
  tibble::as_tibble()

# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>%
  tibble::as_tibble()
```
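
Once the model is in tidy form, you can explore it with ordinary `dplyr` verbs, just as with `tidytext`'s built-in tidiers. Below is a sketch (assuming the `tidy_beta` tibble created above, and `dplyr` >= 1.0.0 for `slice_max`) of pulling the top terms for each topic:

```{r}
# top 5 terms per topic by probability (beta); ties may yield extra rows
top_terms <- tidy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```

The same pattern works on `tidy_gamma` to find the most prevalent topics within each document.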