---
title: "Corpus manipulation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
```
This tutorial shows how to create, enrich, transform, and analyze a `sento_corpus` object. A `sento_corpus` object differs from a regular corpus in that it always has a date column and numeric metadata features.
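As a minimal sketch of the input this implies, the data frame below mirrors the layout of `usnews`: an `id`, a `date`, and a `texts` column, followed by numeric feature columns. The two documents and the `sports` feature are made up for illustration, and the call to `sento_corpus()` is guarded so the chunk also runs where **`sentometrics`** is not installed.

```{r}
# Toy input (made-up documents and feature values, for illustration only).
toy <- data.frame(
  id = c("d1", "d2"),
  date = as.Date(c("2020-01-03", "2020-02-14")),
  texts = c("The economy is growing.", "Markets remain uncertain."),
  sports = c(0, 1), # hypothetical numeric feature
  stringsAsFactors = FALSE
)

# Check the layout described above: a date column plus numeric features.
has_date <- inherits(toy$date, "Date")
feature_cols <- setdiff(names(toy), c("id", "date", "texts"))
all_numeric <- all(vapply(toy[feature_cols], is.numeric, logical(1)))

# Only build the corpus when the package is available.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  toy_corpus <- sentometrics::sento_corpus(toy)
}
```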
**Preparation**
```{r}
library("sentometrics")
library("quanteda")
data("usnews")
data("list_lexicons")
data("list_valence_shifters")
```
### Summarize a corpus through some statistics and plots
The `corpus_summarize()` function gives a quick overview of your corpus in terms of the number of documents, the number of tokens, and its metadata features. The summary can be computed at a daily, weekly, monthly, or yearly frequency, and for all corpus features or only a selection of them.
```{r}
corpus <- sento_corpus(usnews)
summ <- corpus_summarize(corpus, by = "month", features = c("wsj", "wapo"))
stats <- summ[["stats"]]
plots <- summ[["plots"]]
```
The summary consists of a statistics component...
```{r}
stats
```
... and a component with pregenerated graphs of the statistics.
```{r}
plots$doc_plot # monthly evolution of the number of documents
plots$feature_plot # monthly evolution of the presence of the two journal features
plots$token_plot # monthly evolution of the token statistics
```
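To make the monthly aggregation concrete, the base R sketch below counts documents per month for a handful of made-up dates. It only illustrates what grouping by month means and is not how `corpus_summarize()` is implemented.

```{r}
# Made-up publication dates (illustration only).
toy_dates <- as.Date(c("2020-01-03", "2020-01-17", "2020-02-14", "2020-02-20", "2020-02-28"))

# Label each document with its year-month, then count documents per label.
month_id <- format(toy_dates, "%Y-%m")
docs_per_month <- table(month_id)
docs_per_month
```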
### Apply **`quanteda`** corpus functions to a `sento_corpus` object
The many corpus manipulation functions of the **`quanteda`** package can also be applied to a `sento_corpus` object, because the `sento_corpus` object is built on **`quanteda`**'s `corpus` object.
```{r}
corpus <- sento_corpus(usnews)
res <- corpus_reshape(corpus, to = "sentences")
sam <- corpus_sample(corpus, 100)
seg <- corpus_segment(corpus, pattern = "stock", use_docvars = TRUE)
sub <- corpus_subset(corpus, wsj == 1)
tri <- corpus_trim(corpus, "documents", min_ntoken = 300)
trs <- corpus_trim(corpus, "sentences", min_ntoken = 40)
```
### Enrich a `sento_corpus` object with features
With the `add_features()` function, you can add features to your corpus, either by supplying them directly or by generating them through keyword or regex pattern matching.
```{r}
corpus <- sento_corpus(usnews[, 1:3]) # first three columns only, so no features are supplied
kw <- list(
  E = c("economy", "economic"),
  P = c("polic.|Polic.|politi.|Politi."), # a regex pattern
  U = c("uncertainty", "uncertain")
)
corpus <- add_features(corpus, keywords = kw, do.binary = TRUE, do.regex = c(FALSE, TRUE, FALSE))
docvars(corpus, "dummyFeature") <- NULL # drop the placeholder feature
head(docvars(corpus), 20)
```
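As a rough illustration of the binary keyword features, the base R sketch below flags each text with 1 when any keyword (or the regex pattern) occurs in it. The example texts and the helper `flag_feature()` are made up, and `grepl()` is only a stand-in: the exact matching rules used by `add_features()` (tokenization, case handling) may differ.

```{r}
# Made-up texts (illustration only).
toy_texts <- c(
  "The economy slowed amid policy uncertainty.",
  "Political debate dominated the news.",
  "The local team won the championship."
)

# Same keyword list as above; P holds a regex pattern, E and U plain keywords.
toy_kw <- list(
  E = c("economy", "economic"),
  P = c("polic.|Polic.|politi.|Politi."),
  U = c("uncertainty", "uncertain")
)
toy_regex <- c(FALSE, TRUE, FALSE)

# Hypothetical helper: 1 if any keyword matches the text, 0 otherwise.
flag_feature <- function(keywords, is_regex, texts) {
  hits <- vapply(
    keywords,
    function(k) grepl(k, texts, fixed = !is_regex),
    logical(length(texts))
  )
  as.numeric(rowSums(matrix(hits, nrow = length(texts))) > 0)
}

# One column per feature (E, P, U), one row per text.
toy_features <- mapply(flag_feature, toy_kw, toy_regex, MoreArgs = list(texts = toy_texts))
toy_features
```

Only the first text is flagged for all three features; the second matches the regex pattern alone, and the third matches nothing.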