---
title: "6. Using tidytext with textmineR"
author: "Thomas W. Jones"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{6. Using tidytext with textmineR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
```

# Using tidytext with textmineR
The [`tidytext`](https://CRAN.R-project.org/package=tidytext) package is one of the more popular natural language processing packages in R's ecosystem. It follows the conventions and syntax of the "tidyverse."

You may prefer to use `tidytext` for a couple of reasons. First, `tidytext` has its own philosophy and syntax for handling text, particularly at the early stages of an analysis. You may be more familiar or comfortable with this approach. Second, `tidytext` arguably offers more flexibility in the options for creating DTMs or TCMs. This early stage is critical to successful topic modeling.

See _[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/)_ for more details about `tidytext`.

What follows is a short script combining `tidytext` with `textmineR`. Initial data curation and DTM creation are done with `tidytext`. Topic modeling is done with `textmineR`, and the outputs are re-formatted in the flavor of `tidytext`'s "tidiers" for other topic models.
```{r}
################################################################################
# Example: Using tidytext with textmineR
################################################################################

library(tidytext)
library(textmineR)
library(dplyr)
library(tidyr)

# load documents in a data frame
docs <- textmineR::nih_sample

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>%
  select(APPLICATION_ID, ABSTRACT_TEXT) %>%
  unnest_tokens(output = word,
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"),
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, n = 2) %>%
  count(APPLICATION_ID, word) %>%
  filter(n > 1) # filter for words/bigrams used more than once per document, rather than per corpus

# filter out tokens that are just numbers
tidy_docs <- tidy_docs %>%
  filter(! stringr::str_detect(word, "^[0-9]+$"))

# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>%
  cast_sparse(APPLICATION_ID, word, n)

# create a topic model
m <- FitLdaModel(dtm = d,
                 k = 20,
                 iterations = 200,
                 burnin = 175)

# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")),
                        m$phi,
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>%
  tibble::as_tibble()

# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>%
  tibble::as_tibble()
```
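
Once the model is in tidy form, you can explore it with ordinary `dplyr` verbs, just as with `tidytext`'s built-in tidiers. Below is a sketch (assuming the `tidy_beta` tibble created above, and `dplyr` >= 1.0.0 for `slice_max`) of pulling the top terms for each topic:

```{r}
# top 5 terms per topic by probability (beta); ties may yield extra rows
top_terms <- tidy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```

The same pattern works on `tidy_gamma` to find the most prevalent topics within each document.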