vignettes/examples/corpus.Rmd

---
title: "Corpus manipulation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
```

This tutorial provides insights in how to create, enrich, transform, and analyze a `sento_corpus` object. A `sento_corpus` object is special because it always has a date column, and numeric metadata features.

**Preparation**
&nbsp;

```{r}
library("sentometrics")
library("quanteda")

data("usnews")
data("list_lexicons")
data("list_valence_shifters")
```

### Summarize a corpus through some statistics and plots

The `corpus_summarize()` function allows quickly investigating how your corpus looks like in terms of number of documents, number of tokens, and its metadata features. It can be done at a daily, weekly, monthly, or yearly frequency, and for all the corpus features or only a selection of them.

```{r}
corpus <- sento_corpus(usnews)

summ <- corpus_summarize(corpus, by = "month", features = c("wsj", "wapo"))
stats <- summ[["stats"]]
plots <- summ[["plots"]]
```

The summary consists of a statistics component...

```{r}
stats
```

... and a component with pregenerated graphs of the statistics.

```{r}
plots$doc_plot # monthly evolution of the number of documents
plots$feature_plot # monthly evolution of the presence of the two journal features
plots$token_plot # monthly evolution of the token statistics
```

### Apply **`quanteda`** corpus functions on a `sento_corpus` object

It is also possible to apply the many corpus manipulation functions of the **`quanteda`** package on a `sento_corpus` object. In fact, the `sento_corpus` object is built on **`quanteda`**'s `corpus` object.

```{r}
corpus <- sento_corpus(usnews)

res <- corpus_reshape(corpus, to = "sentences")
sam <- corpus_sample(corpus, 100)
seg <- corpus_segment(corpus, pattern = "stock", use_docvars = TRUE)
sub <- corpus_subset(corpus, wsj == 1)
tri <- corpus_trim(corpus, "documents", min_ntoken = 300)
trs <- corpus_trim(corpus, "sentences", min_ntoken = 40)
```

### Enrich a `sento_corpus` object with features 

Using the `add_features()` function, additional features can be added to your corpus, or generated through keywords or regex pattern matching.

```{r}
corpus <- sento_corpus(usnews[, 1:3])

kw <- list(
  E = c("economy", "economic"),
  P = c("polic.|Polic.|politi.|Politi."), # a regex pattern
  U = c("uncertainty", "uncertain")
)

corpus <- add_features(corpus, keywords = kw, do.binary = TRUE, do.regex = c(FALSE, TRUE, FALSE))
docvars(corpus, "dummyFeature") <- NULL

head(docvars(corpus), 20)
```