---
title: 'Text mining, sentiment analysis, and visualization of GoT'
date: 'created on 4 April 2024 and updated `r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
library(tidyverse)
library(here)
# For text mining:
library(pdftools)
library(tidytext)
library(textdata)
library(ggwordcloud)
# Note - Before lab:
# Attach tidytext and textdata packages
# Run: get_sentiments(lexicon = "nrc")
# Should be prompted to install lexicon - choose yes!
# Run: get_sentiments(lexicon = "afinn")
# Should be prompted to install lexicon - choose yes!
```
**Note:** for more text analysis, you can fork & work through Casey O’Hara and Jessica Couture’s eco-data-sci workshop (available at https://github.com/oharac/text_workshop).
### Get the Game of Thrones pdf:
```{r get-document}
got_path <- here("data","got.pdf")
got_text <- pdf_text(got_path)
```
Some things to notice:
- How cool to extract text out of a PDF! Do you think it will work with any PDF?
- Each row is a page of the PDF (i.e., this is a vector of strings, one for each page; see the quick check below)
- The pdf_text() function only sees text that is "selectable"
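A quick sanity check on that structure (a minimal sketch, assuming the `got_text` object from the chunk above):
```{r inspect-structure}
length(got_text)      # number of pages in the PDF = length of the vector
nchar(got_text[1:3])  # characters of extracted text on the first three pages
```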
Example: Just want to get text from a single page (e.g. Page 9)?
```{r single-page}
got_p9 <- got_text[9]
got_p9
```
See how that compares to the text in the PDF on Page 9. What has pdftools added and where?
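One handy trick for that comparison (a small sketch): `cat()` renders the embedded `\n` characters as actual line breaks:
```{r single-page-cat}
# cat() prints the string with its "\n" characters rendered as line breaks
cat(got_p9)
```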
From Jessica and Casey's text mining workshop: “pdf_text() returns a vector of strings, one for each page of the pdf. So we can mess with it in tidyverse style, let’s turn it into a dataframe, and keep track of the pages. Then we can use stringr::str_split() to break the pages up into individual lines. Each line of the pdf is concluded with a backslash-n, so split on this. We will also add a line number in addition to the page number."
### Some wrangling:
- Split up pages into separate lines (separated by `\n`) using `stringr::str_split()`
- Unnest into regular columns using `tidyr::unnest()`
- Remove leading/trailing white space with `stringr::str_trim()`
```{r split-lines}
got_df <- data.frame(got_text) %>%
  mutate(text_full = str_split(got_text, pattern = '\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full))
# Why '\\n' instead of '\n'? In a regular expression some symbols (e.g. \, *) have
# special meaning, so a literal backslash must itself be escaped: the string '\\n'
# is the regex \n, which matches a newline. Plain '\n' is already a literal newline
# character in an R string, and a newline also matches itself as a regex, which is
# why splitting on '\n' alone works here too.
# More information: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
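# A tiny illustration of the point above (both calls split the same way):
str_split("line one\nline two", pattern = "\n")   # a literal newline matches itself
str_split("line one\nline two", pattern = "\\n")  # the regex \n also matches a newline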
```
Now each line, on each page, is its own row, with extra starting & trailing spaces removed.
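The quoted workshop text also mentions keeping track of page and line numbers, which the chunk above skips. If you want them, here is one way (a sketch; `got_df_pages` is a hypothetical name and nothing below depends on it):
```{r split-lines-pages, eval=FALSE}
got_df_pages <- tibble(page = seq_along(got_text), got_text) %>%
  mutate(text_full = str_split(got_text, pattern = '\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full)) %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%  # line number within each page
  ungroup()
```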
### Get the tokens (individual words) in tidy format
Use `tidytext::unnest_tokens()` (which pulls from the `tokenizers` package) to split columns into tokens. We are interested in *words*, so that's the token we'll use:
```{r tokenize}
got_tokens <- got_df %>%
  unnest_tokens(word, text_full)
got_tokens
# See how this differs from `got_df`
# Each word has its own row!
```
Let's count the words!
```{r count-words}
got_wc <- got_tokens %>%
  count(word) %>%
  arrange(-n)
got_wc
```
OK...so we notice that a whole bunch of things show up frequently that we might not be interested in ("a", "the", "and", etc.). These are called *stop words*. Let's remove them.
### Remove stop words:
See `?stop_words` and `View(stop_words)` to look at documentation for the stop word lexicons.
We will *remove* stop words using `dplyr::anti_join()`:
```{r stopwords}
got_stop <- got_tokens %>%
  anti_join(stop_words) %>%
  select(-got_text)
```
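A quick optional check of how many tokens the stop-word filter removed:
```{r stopword-check}
nrow(got_tokens) - nrow(got_stop)  # number of stop-word tokens dropped
```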
Now check the counts again:
```{r count-words2}
got_swc <- got_stop %>%
  count(word) %>%
  arrange(-n)
```
What if we want to get rid of all the numbers (non-text) in `got_stop`?
```{r skip-numbers}
# This filters out the numbers:
# if as.numeric() returns NA, the token is not a number, so is.na() is TRUE and we keep it;
# anything that converts to an actual number is removed.
# (as.numeric() warns about the coercion; warnings are suppressed in the setup chunk.)
got_no_numeric <- got_stop %>%
  filter(is.na(as.numeric(word)))
```
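To see that logic on a toy vector (a minimal sketch, separate from the analysis):
```{r skip-numbers-demo}
# as.numeric() returns NA for anything not parseable as a number,
# so is.na() is TRUE exactly for the tokens we keep
as.numeric(c("fire", "100", "3rd"))         # NA 100 NA
is.na(as.numeric(c("fire", "100", "3rd")))  # TRUE FALSE TRUE
```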
### A word cloud of GoT words (non-numeric)
See more: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
```{r wordcloud-prep}
# There are almost 2000 unique words
length(unique(got_no_numeric$word))

# We probably don't want to include them all in a word cloud.
# Let's filter to include only the top 100 most frequent words:
got_top100 <- got_no_numeric %>%
  count(word) %>%
  arrange(-n) %>%
  head(100)
```
```{r wordcloud}
got_cloud <- ggplot(data = got_top100, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()
got_cloud
```
That's underwhelming. Let's customize it a bit:
```{r wordcloud-pro}
ggplot(data = got_top100, aes(label = word, size = n)) +
  geom_text_wordcloud_area(aes(color = n), shape = "star") +
  scale_size_area(max_size = 12) +
  scale_color_gradientn(colors = c("darkgreen","blue","red")) +
  theme_minimal()
```
Cool! And you can facet wrap (for different reports, for example) and update other aesthetics. See more here: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
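As a rough sketch of what faceting could look like here (assuming the hypothetical page-numbered `got_df_pages` from the earlier sketch, and splitting the book into halves):
```{r wordcloud-facet, eval=FALSE}
got_halves <- got_df_pages %>%
  unnest_tokens(word, text_full) %>%
  anti_join(stop_words) %>%
  filter(is.na(as.numeric(word))) %>%
  mutate(half = if_else(page <= max(page) / 2, "first half", "second half")) %>%
  count(half, word) %>%
  group_by(half) %>%
  slice_max(n, n = 50) %>%  # top 50 words per half
  ungroup()

ggplot(got_halves, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 10) +
  facet_wrap(~half) +
  theme_minimal()
```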
### Sentiment analysis
First, check out the ‘sentiments’ lexicon. From Julia Silge and David Robinson (https://www.tidytextmining.com/sentiment.html):
“The three general-purpose lexicons are
- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney
All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon."
Let's explore the sentiment lexicons. "bing" is included with `tidytext`; for the others ("afinn", "nrc", "loughran") you'll be prompted to download them the first time you call `get_sentiments()`.
**WARNING:** These collections include very offensive words. I urge you to not look at them in class.
"afinn": Words ranked from -5 (very negative) to +5 (very positive)
```{r afinn}
get_sentiments(lexicon = "afinn")
# Note: may be prompted to download (yes)
# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>%
  filter(value %in% c(3,4,5))
# Do not look at negative words in class.
afinn_pos
```
bing: binary, "positive" or "negative"
```{r bing}
get_sentiments(lexicon = "bing")
```
nrc: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Includes bins for 8 emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and positive / negative.
**Citation for NRC lexicon**: Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
Now nrc:
```{r nrc}
get_sentiments(lexicon = "nrc")
```
Let's do sentiment analysis on the GoT text data using the afinn and nrc lexicons.
### Sentiment analysis with afinn:
First, bind words in `got_stop` to `afinn` lexicon:
```{r bind-afinn}
got_afinn <- got_stop %>%
  inner_join(get_sentiments("afinn"))
# See ?get_sentiments for details on the available lexicons
```
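Not every word gets matched; an optional check of how much of the text AFINN covers:
```{r afinn-coverage}
nrow(got_afinn) / nrow(got_stop)  # share of non-stop-word tokens with an AFINN score
```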
Let's find some counts (by sentiment ranking):
```{r count-afinn}
got_afinn_hist <- got_afinn %>%
  count(value)

# Plot them:
ggplot(data = got_afinn_hist, aes(x = value, y = n)) +
  geom_col()
```
Investigate some of the words in a bit more depth:
```{r afinn-2}
# What are these '-2' words?
got_afinn2 <- got_afinn %>%
  filter(value == -2)
```
```{r afinn-2-more}
# Check the unique -2-score words:
unique(got_afinn2$word)
# Count & plot them
got_afinn2_n <- got_afinn2 %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%   # the full list was too long to read, so keep only the top 20 words
  mutate(word = fct_reorder(factor(word), n))

ggplot(data = got_afinn2_n, aes(x = word, y = n)) +
  geom_col() +
  coord_flip()
```
It seems quite reasonable that these words land on the negative side. Maybe "fire" could also be cozy in a context like "we sat by the warm and nice fire...", but the most common interpretation is probably negative, I guess!
Or we can summarize sentiment for the whole text:
```{r summarize-afinn}
got_summary <- got_afinn %>%
  summarize(
    mean_score = mean(value),
    median_score = median(value)
  )
got_summary
```
The mean and median indicate *slightly* negative overall sentiment based on the AFINN lexicon, which matches our expectations given what we know about GoT.
### NRC lexicon for sentiment analysis
We can use the NRC lexicon to start "binning" text by the feelings it is typically associated with. As above, we'll use `inner_join()` to combine the GoT non-stopword text with the nrc lexicon:
```{r bind-nrc}
got_nrc <- got_stop %>%
  inner_join(get_sentiments("nrc"))
```
Wait, won't that exclude some of the words in our text? YES! We should check which are excluded using `anti_join()`:
```{r check-exclusions}
got_exclude <- got_stop %>%
  anti_join(get_sentiments("nrc"))
# View(got_exclude)

# Count to find the most excluded:
got_exclude_n <- got_exclude %>%
  count(word, sort = TRUE)
head(got_exclude_n)
```
**Lesson: always check which words are EXCLUDED in sentiment analysis using a pre-built lexicon!**
Now find some counts:
```{r count-nrc-totals}
got_nrc_n <- got_nrc %>%
  count(sentiment, sort = TRUE)

# And plot them:
ggplot(data = got_nrc_n, aes(x = sentiment, y = n)) +
  geom_col()
```
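A small optional tweak (a sketch): order the bars by count with `forcats::fct_reorder()`:
```{r count-nrc-ordered, eval=FALSE}
ggplot(data = got_nrc_n, aes(x = fct_reorder(sentiment, -n), y = n)) +
  geom_col() +
  labs(x = "sentiment")
```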
Or count by sentiment *and* word, then facet:
```{r count-nrc}
got_nrc_n5 <- got_nrc %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup()

got_nrc_gg <- ggplot(data = got_nrc_n5, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 4, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Word", y = "count")

# Show it
got_nrc_gg

# Save it
ggsave(filename = here("figures","got_nrc_sentiment.png"),
       plot = got_nrc_gg,
       height = 8,
       width = 5)
```
This plot reveals some contradictions in the NRC lexicon: "lord" shows up as "trust", "joy", "disgust" *and* "negative"? Let's check:
```{r nrc-lord}
conf <- get_sentiments(lexicon = "nrc") %>%
  filter(word == "lord")
# Yep, check it out:
conf
```
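One pragmatic response (a sketch, not part of the original analysis) is to treat hand-picked ambiguous words as custom stop words and redo the join; the word list here is purely illustrative:
```{r custom-stopwords, eval=FALSE}
custom_stops <- tibble(word = c("lord", "fire"))  # illustrative, not a recommendation

got_nrc_clean <- got_stop %>%
  anti_join(custom_stops, by = "word") %>%
  inner_join(get_sentiments("nrc"))
```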
## Big picture takeaway
There are serious limitations to sentiment analysis with existing lexicons, and you should **think really hard** about your findings and whether a lexicon makes sense for your study. As we saw with "lord", these lexicons can be contradictory and ambiguous. We also saw that "lord" was the most frequent word overall, yet it never showed up in the AFINN-based plots; that is likely because, given its general ambiguity, the word has no AFINN score at all, so the `inner_join()` simply dropped it. Be very aware of the limitations and omissions of the different lexicons and methods. Even so, word counts and exploration alone can be useful!