---
title: 'Text mining, sentiment analysis, and visualization of GoT'
date: 'created on 4 April 2024 and updated `r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
library(tidyverse)
library(here)
# For text mining:
library(pdftools)
library(tidytext)
library(textdata)
library(ggwordcloud)
# Note - Before lab:
# Attach tidytext and textdata packages
# Run: get_sentiments(lexicon = "nrc")
# Should be prompted to install lexicon - choose yes!
# Run: get_sentiments(lexicon = "afinn")
# Should be prompted to install lexicon - choose yes!
```
**Note:** for more text analysis, you can fork & work through Casey O’Hara and Jessica Couture’s eco-data-sci workshop (available at https://github.com/oharac/text_workshop).
### Get the Game of Thrones pdf:
```{r get-document}
got_path <- here("data","got.pdf")
got_text <- pdf_text(got_path)
```
Some things to notice:
- How cool to extract text out of a PDF! Do you think it will work with any PDF?
- Each row is a page of the PDF (i.e., this is a vector of strings, one for each page; see the quick check below)
- The pdf_text() function only sees text that is "selectable"
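A quick sanity check on that structure (a minimal sketch, assuming the `got_text` object from the chunk above):
```{r inspect-structure}
length(got_text)      # number of pages in the PDF = length of the vector
nchar(got_text[1:3])  # characters of extracted text on the first three pages
```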
Example: Just want to get text from a single page (e.g. Page 9)?
```{r single-page}
got_p9 <- got_text[9]
got_p9
```
See how that compares to the text in the PDF on Page 9. What has pdftools added and where?
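One handy trick for that comparison (a small sketch): `cat()` renders the embedded `\n` characters as actual line breaks:
```{r single-page-cat}
# cat() prints the string with its "\n" characters rendered as line breaks
cat(got_p9)
```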
From Jessica and Casey's text mining workshop: “pdf_text() returns a vector of strings, one for each page of the pdf. So we can mess with it in tidyverse style, let’s turn it into a dataframe, and keep track of the pages. Then we can use stringr::str_split() to break the pages up into individual lines. Each line of the pdf is concluded with a backslash-n, so split on this. We will also add a line number in addition to the page number."
### Some wrangling:
- Split up pages into separate lines (separated by `\n`) using `stringr::str_split()`
- Unnest into regular columns using `tidyr::unnest()`
- Remove leading/trailing white space with `stringr::str_trim()`
```{r split-lines}
got_df <- data.frame(got_text) %>%
  mutate(text_full = str_split(got_text, pattern = '\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full))
# Why '\\n' instead of '\n'? In a regular expression some symbols (e.g. \, *) have
# special meaning, so a literal backslash must itself be escaped: the string '\\n'
# is the regex \n, which matches a newline. Plain '\n' is already a literal newline
# character in an R string, and a newline also matches itself as a regex, which is
# why splitting on '\n' alone works here too.
# More information: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
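# A tiny illustration of the point above (both calls split the same way):
str_split("line one\nline two", pattern = "\n")   # a literal newline matches itself
str_split("line one\nline two", pattern = "\\n")  # the regex \n also matches a newline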
```
Now each line, on each page, is its own row, with extra starting & trailing spaces removed.
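The quoted workshop text also mentions keeping track of page and line numbers, which the chunk above skips. If you want them, here is one way (a sketch; `got_df_pages` is a hypothetical name and nothing below depends on it):
```{r split-lines-pages, eval=FALSE}
got_df_pages <- tibble(page = seq_along(got_text), got_text) %>%
  mutate(text_full = str_split(got_text, pattern = '\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full)) %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%  # line number within each page
  ungroup()
```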
### Get the tokens (individual words) in tidy format
Use `tidytext::unnest_tokens()` (which pulls from the `tokenizers` package) to split columns into tokens. We are interested in *words*, so that's the token we'll use:
```{r tokenize}
got_tokens <- got_df %>%
  unnest_tokens(word, text_full)
got_tokens
# See how this differs from `got_df`
# Each word has its own row!
```
Let's count the words!
```{r count-words}
got_wc <- got_tokens %>%
  count(word) %>%
  arrange(-n)
got_wc
```
OK...so we notice that a whole bunch of things show up frequently that we might not be interested in ("a", "the", "and", etc.). These are called *stop words*. Let's remove them.
### Remove stop words:
See `?stop_words` and `View(stop_words)` to look at documentation for the stop word lexicons.
We will *remove* stop words using `dplyr::anti_join()`:
```{r stopwords}
got_stop <- got_tokens %>%
  anti_join(stop_words) %>%
  select(-got_text)
```
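A quick optional check of how many tokens the stop-word filter removed:
```{r stopword-check}
nrow(got_tokens) - nrow(got_stop)  # number of stop-word tokens dropped
```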
Now check the counts again:
```{r count-words2}
got_swc <- got_stop %>%
  count(word) %>%
  arrange(-n)
```
What if we want to get rid of all the numbers (non-text) in `got_stop`?
```{r skip-numbers}
# This filters out the numbers:
# if as.numeric() returns NA, the token is not a number, so is.na() is TRUE and we keep it;
# anything that converts to an actual number is removed.
# (as.numeric() warns about the coercion; warnings are suppressed in the setup chunk.)
got_no_numeric <- got_stop %>%
  filter(is.na(as.numeric(word)))
```
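To see that logic on a toy vector (a minimal sketch, separate from the analysis):
```{r skip-numbers-demo}
# as.numeric() returns NA for anything not parseable as a number,
# so is.na() is TRUE exactly for the tokens we keep
as.numeric(c("fire", "100", "3rd"))         # NA 100 NA
is.na(as.numeric(c("fire", "100", "3rd")))  # TRUE FALSE TRUE
```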
### A word cloud of GoT words (non-numeric)
See more: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
```{r wordcloud-prep}
# There are almost 2000 unique words
length(unique(got_no_numeric$word))

# We probably don't want to include them all in a word cloud.
# Let's filter to include only the top 100 most frequent words:
got_top100 <- got_no_numeric %>%
  count(word) %>%
  arrange(-n) %>%
  head(100)
```
```{r wordcloud}
got_cloud <- ggplot(data = got_top100, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()
got_cloud
```
That's underwhelming. Let's customize it a bit:
```{r wordcloud-pro}
ggplot(data = got_top100, aes(label = word, size = n)) +
  geom_text_wordcloud_area(aes(color = n), shape = "star") +
  scale_size_area(max_size = 12) +
  scale_color_gradientn(colors = c("darkgreen","blue","red")) +
  theme_minimal()
```
Cool! And you can facet wrap (for different reports, for example) and update other aesthetics. See more here: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
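As a rough sketch of what faceting could look like here (assuming the hypothetical page-numbered `got_df_pages` from the earlier sketch, and splitting the book into halves):
```{r wordcloud-facet, eval=FALSE}
got_halves <- got_df_pages %>%
  unnest_tokens(word, text_full) %>%
  anti_join(stop_words) %>%
  filter(is.na(as.numeric(word))) %>%
  mutate(half = if_else(page <= max(page) / 2, "first half", "second half")) %>%
  count(half, word) %>%
  group_by(half) %>%
  slice_max(n, n = 50) %>%  # top 50 words per half
  ungroup()

ggplot(got_halves, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 10) +
  facet_wrap(~half) +
  theme_minimal()
```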
### Sentiment analysis
First, check out the ‘sentiments’ lexicon. From Julia Silge and David Robinson (https://www.tidytextmining.com/sentiment.html):
“The three general-purpose lexicons are
- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney
All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon."
Let's explore the sentiment lexicons. "bing" is included with `tidytext`; for the others ("afinn", "nrc", "loughran") you'll be prompted to download them the first time you call `get_sentiments()`.
**WARNING:** These collections include very offensive words. I urge you to not look at them in class.
"afinn": Words ranked from -5 (very negative) to +5 (very positive)
```{r afinn}
get_sentiments(lexicon = "afinn")
# Note: may be prompted to download (yes)
# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>%
  filter(value %in% c(3,4,5))
# Do not look at negative words in class.
afinn_pos
```
bing: binary, "positive" or "negative"
```{r bing}
get_sentiments(lexicon = "bing")
```
nrc: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Includes bins for 8 emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and positive / negative.
**Citation for NRC lexicon**: Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
Now nrc:
```{r nrc}
get_sentiments(lexicon = "nrc")
```
Let's do sentiment analysis on the GoT text data using the afinn and nrc lexicons.
### Sentiment analysis with afinn:
First, bind words in `got_stop` to `afinn` lexicon:
```{r bind-afinn}
got_afinn <- got_stop %>%
  inner_join(get_sentiments("afinn"))
# See ?get_sentiments for details on the available lexicons
```
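Not every word gets matched; an optional check of how much of the text AFINN covers:
```{r afinn-coverage}
nrow(got_afinn) / nrow(got_stop)  # share of non-stop-word tokens with an AFINN score
```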
Let's find some counts (by sentiment ranking):
```{r count-afinn}
got_afinn_hist <- got_afinn %>%
  count(value)

# Plot them:
ggplot(data = got_afinn_hist, aes(x = value, y = n)) +
  geom_col()
```
Investigate some of the words in a bit more depth:
```{r afinn-2}
# What are these '-2' words?
got_afinn2 <- got_afinn %>%
  filter(value == -2)
```
```{r afinn-2-more}
# Check the unique -2-score words:
unique(got_afinn2$word)
# Count & plot them
got_afinn2_n <- got_afinn2 %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%   # the full list was too long to read, so keep only the top 20 words
  mutate(word = fct_reorder(factor(word), n))

ggplot(data = got_afinn2_n, aes(x = word, y = n)) +
  geom_col() +
  coord_flip()
```
It seems quite reasonable that these words land on the negative side. Maybe "fire" could also be cozy in a context like "we sat by the warm and nice fire...", but the most common interpretation is probably negative, I guess!
Or we can summarize sentiment for the whole text:
```{r summarize-afinn}
got_summary <- got_afinn %>%
  summarize(
    mean_score = mean(value),
    median_score = median(value)
  )
got_summary
```
The mean and median indicate *slightly* negative overall sentiment based on the AFINN lexicon, which matches our expectations given what we know about GoT.
### NRC lexicon for sentiment analysis
We can use the NRC lexicon to start "binning" text by the feelings it is typically associated with. As above, we'll use `inner_join()` to combine the GoT non-stopword text with the nrc lexicon:
```{r bind-nrc}
got_nrc <- got_stop %>%
  inner_join(get_sentiments("nrc"))
```
Wait, won't that exclude some of the words in our text? YES! We should check which are excluded using `anti_join()`:
```{r check-exclusions}
got_exclude <- got_stop %>%
  anti_join(get_sentiments("nrc"))
# View(got_exclude)

# Count to find the most excluded:
got_exclude_n <- got_exclude %>%
  count(word, sort = TRUE)
head(got_exclude_n)
```
**Lesson: always check which words are EXCLUDED in sentiment analysis using a pre-built lexicon!**
Now find some counts:
```{r count-nrc-totals}
got_nrc_n <- got_nrc %>%
  count(sentiment, sort = TRUE)

# And plot them:
ggplot(data = got_nrc_n, aes(x = sentiment, y = n)) +
  geom_col()
```
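A small optional tweak (a sketch): order the bars by count with `forcats::fct_reorder()`:
```{r count-nrc-ordered, eval=FALSE}
ggplot(data = got_nrc_n, aes(x = fct_reorder(sentiment, -n), y = n)) +
  geom_col() +
  labs(x = "sentiment")
```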
Or count by sentiment *and* word, then facet:
```{r count-nrc}
got_nrc_n5 <- got_nrc %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup()

got_nrc_gg <- ggplot(data = got_nrc_n5, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 4, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Word", y = "count")

# Show it
got_nrc_gg

# Save it
ggsave(filename = here("figures","got_nrc_sentiment.png"),
       plot = got_nrc_gg,
       height = 8,
       width = 5)
```
This plot reveals some contradictions in the NRC lexicon: "lord" shows up as "trust", "joy", "disgust" *and* "negative"? Let's check:
```{r nrc-lord}
conf <- get_sentiments(lexicon = "nrc") %>%
  filter(word == "lord")
# Yep, check it out:
conf
```
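One pragmatic response (a sketch, not part of the original analysis) is to treat hand-picked ambiguous words as custom stop words and redo the join; the word list here is purely illustrative:
```{r custom-stopwords, eval=FALSE}
custom_stops <- tibble(word = c("lord", "fire"))  # illustrative, not a recommendation

got_nrc_clean <- got_stop %>%
  anti_join(custom_stops, by = "word") %>%
  inner_join(get_sentiments("nrc"))
```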
## Big picture takeaway
There are serious limitations to sentiment analysis with existing lexicons, and you should **think really hard** about your findings and whether a lexicon makes sense for your study. As we saw with "lord", these lexicons can be contradictory and ambiguous. We also saw that "lord" was the most frequent word overall, yet it never showed up in the AFINN-based plots; that is likely because, given its general ambiguity, the word has no AFINN score at all, so the `inner_join()` simply dropped it. Be very aware of the limitations and omissions of the different lexicons and methods. Even so, word counts and exploration alone can be useful!