# Text Analysis Workshop Part 2



Based on [SICSS 2020](https://compsocialscience.github.io/summer-institute/curriculum) (Summer Institute in Computational Social Science)

- [Basic Text Analysis in R](https://compsocialscience.github.io/summer-institute/2020/materials/day3-text-analysis/basic-text-analysis/rmarkdown/Basic_Text_Analysis_in_R.html)
- [Dictionary-based Text Analysis](https://compsocialscience.github.io/summer-institute/2020/materials/day3-text-analysis/dictionary-methods/rmarkdown/Dictionary-Based_Text_Analysis.html#when-should-i-use-a-dictionary-based-approach)
- [Topic Modeling](https://cbail.github.io/SICSS_Topic_Modeling.html)
- [Text Mining with R](https://www.tidytextmining.com/)

Mindy Chang


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-0"><span class="toc-item-num">0&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Install-packages" data-toc-modified-id="Install-packages-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Install packages</a></span></li><li><span><a href="#Load-packages" data-toc-modified-id="Load-packages-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Load packages</a></span></li></ul></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load the data</a></span><ul class="toc-item"><li><span><a href="#Getting-Twitter-data" data-toc-modified-id="Getting-Twitter-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Getting Twitter data</a></span></li><li><span><a href="#Look-at-the-data-format" data-toc-modified-id="Look-at-the-data-format-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Look at the data format</a></span></li></ul></li><li><span><a href="#Format-and-clean-the-text" data-toc-modified-id="Format-and-clean-the-text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Format and clean the text</a></span><ul class="toc-item"><li><span><a href="#Tweet-specific-cleaning" data-toc-modified-id="Tweet-specific-cleaning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Tweet-specific cleaning</a></span></li><li><span><a href="#Tokenize-the-data" data-toc-modified-id="Tokenize-the-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Tokenize the data</a></span></li><li><span><a href="#Convert-to-lowercase" data-toc-modified-id="Convert-to-lowercase-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Convert to lowercase</a></span></li><li><span><a href="#Remove-punctuation" data-toc-modified-id="Remove-punctuation-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Remove punctuation</a></span></li><li><span><a href="#Remove-stopwords" 
data-toc-modified-id="Remove-stopwords-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Remove stopwords</a></span></li><li><span><a href="#Remove-numbers" data-toc-modified-id="Remove-numbers-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Remove numbers</a></span></li><li><span><a href="#Remove-extra-white-spaces" data-toc-modified-id="Remove-extra-white-spaces-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Remove extra white spaces</a></span></li><li><span><a href="#Stemming" data-toc-modified-id="Stemming-2.8"><span class="toc-item-num">2.8&nbsp;&nbsp;</span>Stemming</a></span></li></ul></li><li><span><a href="#Word-counting" data-toc-modified-id="Word-counting-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Word counting</a></span><ul class="toc-item"><li><span><a href="#Create-a-word-count-dataframe" data-toc-modified-id="Create-a-word-count-dataframe-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create a word count dataframe</a></span></li><li><span><a href="#Visualize-word-frequencies" data-toc-modified-id="Visualize-word-frequencies-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Visualize word frequencies</a></span></li><li><span><a href="#WordClouds" data-toc-modified-id="WordClouds-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>WordClouds</a></span></li><li><span><a href="#Bigrams-and-n-grams" data-toc-modified-id="Bigrams-and-n-grams-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Bigrams and n-grams</a></span></li><li><span><a href="#tf-idf:-Term-Frequency-Inverse-Document-Frequency" data-toc-modified-id="tf-idf:-Term-Frequency-Inverse-Document-Frequency-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>tf-idf: Term Frequency Inverse Document Frequency</a></span></li></ul></li><li><span><a href="#Dictionary-based-text-analysis" data-toc-modified-id="Dictionary-based-text-analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dictionary-based text analysis</a></span><ul 
class="toc-item"><li><span><a href="#Select-for-keywords" data-toc-modified-id="Select-for-keywords-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Select for keywords</a></span></li><li><span><a href="#Sentiment-analysis" data-toc-modified-id="Sentiment-analysis-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Sentiment analysis</a></span><ul class="toc-item"><li><span><a href="#Load-a-dictionary" data-toc-modified-id="Load-a-dictionary-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Load a dictionary</a></span></li><li><span><a href="#Count-sentiment-words-across-tweets" data-toc-modified-id="Count-sentiment-words-across-tweets-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Count sentiment words across tweets</a></span></li><li><span><a href="#Count-sentiments-by-tweet" data-toc-modified-id="Count-sentiments-by-tweet-4.2.3"><span class="toc-item-num">4.2.3&nbsp;&nbsp;</span>Count sentiments by tweet</a></span></li><li><span><a href="#Examine-sentiment-labels" data-toc-modified-id="Examine-sentiment-labels-4.2.4"><span class="toc-item-num">4.2.4&nbsp;&nbsp;</span>Examine sentiment labels</a></span></li><li><span><a href="#[extra]-Plot-sentiments-over-time" data-toc-modified-id="[extra]-Plot-sentiments-over-time-4.2.5"><span class="toc-item-num">4.2.5&nbsp;&nbsp;</span>[extra] Plot sentiments over time</a></span></li><li><span><a href="#[extra]-Linear-model-for-favorites-based-on-sentiment-count" data-toc-modified-id="[extra]-Linear-model-for-favorites-based-on-sentiment-count-4.2.6"><span class="toc-item-num">4.2.6&nbsp;&nbsp;</span>[extra] Linear model for favorites based on sentiment count</a></span></li></ul></li><li><span><a href="#Linguistic-Inquiry-Word-Count-(LIWC)" data-toc-modified-id="Linguistic-Inquiry-Word-Count-(LIWC)-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Linguistic Inquiry Word Count (LIWC)</a></span></li></ul></li><li><span><a href="#Topic-Models" data-toc-modified-id="Topic-Models-5"><span 
class="toc-item-num">5&nbsp;&nbsp;</span>Topic Models</a></span><ul class="toc-item"><li><span><a href="#LDA-(Latent-dirichlet-allocation)" data-toc-modified-id="LDA-(Latent-dirichlet-allocation)-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>LDA (Latent dirichlet allocation)</a></span><ul class="toc-item"><li><span><a href="#Document-term-matrix" data-toc-modified-id="Document-term-matrix-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Document-term matrix</a></span></li><li><span><a href="#Create-LDA-topic-model" data-toc-modified-id="Create-LDA-topic-model-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Create LDA topic model</a></span></li><li><span><a href="#Top-terms-in-each-topic" data-toc-modified-id="Top-terms-in-each-topic-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Top terms in each topic</a></span></li><li><span><a href="#Top-documents-in-each-topic" data-toc-modified-id="Top-documents-in-each-topic-5.1.4"><span class="toc-item-num">5.1.4&nbsp;&nbsp;</span>Top documents in each topic</a></span></li></ul></li><li><span><a href="#STM-(Structural-Topic-Modeling)" data-toc-modified-id="STM-(Structural-Topic-Modeling)-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>STM (Structural Topic Modeling)</a></span><ul class="toc-item"><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Process-the-data" data-toc-modified-id="Process-the-data-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Process the data</a></span></li><li><span><a href="#Create-STM-model" data-toc-modified-id="Create-STM-model-5.2.3"><span class="toc-item-num">5.2.3&nbsp;&nbsp;</span>Create STM model</a></span></li><li><span><a href="#Top-documents-in-each-topic" data-toc-modified-id="Top-documents-in-each-topic-5.2.4"><span class="toc-item-num">5.2.4&nbsp;&nbsp;</span>Top documents in each 
topic</a></span></li><li><span><a href="#Effect-of-dataset-(date)-on-topic-distribution" data-toc-modified-id="Effect-of-dataset-(date)-on-topic-distribution-5.2.5"><span class="toc-item-num">5.2.5&nbsp;&nbsp;</span>Effect of dataset (date) on topic distribution</a></span></li></ul></li><li><span><a href="#Topic-model-notes" data-toc-modified-id="Topic-model-notes-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Topic model notes</a></span></li></ul></li><li><span><a href="#Appendix" data-toc-modified-id="Appendix-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Appendix</a></span></li></ul></div>

**Goals**
- Review the basics of cleaning the text and visualizing word counts
- Introduction to dictionary-based methods and topic models
- Text analysis is doable! Much of the code is reusable
- Learn limitations and general things to look out for

We will step through loading and cleaning a collection of tweets for text analysis. In this session, we will look at dictionary-based text analysis methods such as sentiment analysis, and then at topic modeling.

<img src="imgs/part2/textanalysis_diagrams2.001.png" alt="Sentiment analysis flow diagram" width="600"/>


___Some operational details:___

This session is a [Binder](https://mybinder.org/) instance of a [Jupyter notebook](https://jupyter.org/). This is your own copy of the notebook, where you can edit the code and run it interactively.  

The Binder session will time out after 10 minutes of inactivity. If this happens, you will need to relaunch Binder; reloading the current page will not work.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/MindyChang/text-analysis-workshop/master?filepath=index.ipynb)


## Setup 
### Install packages

We only need to install packages once - they have already been installed in this binder.
- *tidyverse*
    - *dplyr* for dataframe manipulation
    - *tidyr* for formatting into tidy data
    - *ggplot2* for plotting
    - *lubridate* for working with dates and times
- *tidytext* for getting text data into a tidy format
- *SnowballC* for getting word stems
- *stringr* for manipulating strings
- *wordcloud* for generating word clouds
- *topicmodels* for LDA topic modeling
- *stm* for structural topic modeling
- *tm* and *quanteda*, general text-mining packages that are also loaded below

In the R console:
```
install.packages("tidyverse")
install.packages("tidytext")
install.packages("SnowballC")
install.packages("stringr")
install.packages("wordcloud")
install.packages("topicmodels")
install.packages("stm")
install.packages("tm")
install.packages("quanteda")
```


### Load packages
**Running code interactively**

To run a code block, click the gray section and type *shift+enter* (or click the play button on mobile). 

- The `In [ ]` will indicate the code state. 
    - `In [*]` means it's currently running. 
    - `In [5]` means it finished running, where the `[number]` is a running count of the number of blocks you have run.    
- After running a block, the cursor will automatically advance to the next block.


In [None]:
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tidytext)
library(SnowballC)
library(stringr)
library(wordcloud)
library(topicmodels)
library(tm)
library(stm)
library(quanteda)

## Load the data


<img src="imgs/part2/textanalysis_diagrams2.002.png" alt="Sentiment analysis flow diagram" width="600"/>

### Getting Twitter data
- Retrieve data with the Twitter API

    - Limited to the last 7-9 days
    - Subject to rate limitations
    - Twitter may sample or not provide a complete set of tweets in searches
    - By default, Twitter will return the most recent tweets that fit the criteria you request

- Use an existing Twitter dataset

    - Hydrate using the tweet id to retrieve the original tweet info

For this workshop, we can choose from 4 different datasets (each with ~4000 tweets), which have already been extracted via the Twitter API using the `rtweet` package and saved as RData files.
- Trump's tweets collected between 2017-02-05 and 2018-05-18, named `trumptweets`
- A collection of tweets that contain the word "covid" on 2020-07-29 (spanning 10 minutes), named `covidtweets`
- A collection of tweets that contain the phrase "mental health" on 2020-07-29 (spanning 3 hours) named `mentalhealthtweets`
- A collection of tweets that contain the phrase "mental health" on 2020-08-19 (spanning 3 hours) named `mentalhealthtweets2`

We assign the chosen dataset to the variable `rawtweets` using the assignment operator `<-`, so that all of the following code works with any of the Twitter datasets.

In [None]:
title <- 'Mental Health' # other options include: 'Covid', 'Trump', 'Mental Health 2'

if (title == 'Trump')
{
    load(url("https://cbail.github.io/Trump_Tweets.Rdata"))
    rawtweets <- trumptweets 
    keyword <- ""
} else if (title =='Covid')
{
    load("data/covidtweets.RData")
    rawtweets <- covid_tweets
    keyword <- "covid"
} else if (title == 'Mental Health')
{
    load("data/mentalhealthtweets.RData")
    rawtweets <- mentalhealthtweets
    keyword <- "mental health"
} else if (title == 'Mental Health 2')
{
    load("data/mentalhealthtweets2.RData")
    rawtweets <- mentalhealthtweets2
    keyword <- "mental health"
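} else
{
    # stop here so the "Loaded ..." message below is not printed for an invalid choice
    stop('Pick a valid dataset from {"Trump", "Covid", "Mental Health", "Mental Health 2"}')
}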

paste("Loaded", title , "tweet dataset")

### Look at the data format
The data is formatted as a data frame, which is a standard table, where the top row contains column names and each row contains data about a single tweet.

- The `head()` function returns the first few rows of `rawtweets`.

    - You can pass a second argument, as in `head(rawtweets, n)`, to set the number of rows you want to see.

Each tweet comes with 90 variables, which are defined in the [tweet data dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

Some columns of note:

- `status_id` is the unique tweet id
- `created_at` contains the timestamp of the tweet
- `screen_name` is the twitter handle
- `text` contains the tweet

In [None]:
# preview rawtweets
head(rawtweets,5)

## Format and clean the text
We want to clean the text to keep the most meaningful parts and shape it into the tidy text format for processing.

**Token** - a unit of analysis (e.g. words, sequence of words, sentence)

**Document** - a unit of context for each token (in this case - a single tweet)

**Tidy text format** - One row per token (word in this case) with column variables that have extra context (e.g. which document the word came from)


<img src="imgs/part2/textanalysis_diagrams2.003.png" alt="Sentiment analysis flow diagram" width="600"/>


### Tweet-specific cleaning

- **Filter out retweets**

- **Replace urls and usernames**

  - Depending on your goal, sometimes unique words are useful, and sometimes they add to the noise. Here we will abstract away specific urls and twitter handles (usernames).

- **Treat the search keyword as one word for easy removal**
  - e.g. Mental health --> mentalhealth

- **Remove duplicate tweets**
  - Some twitter bots repost the same tweet from multiple accounts.

In [None]:
# regexes for parsing tweets
# url_regex matches urls, HTML entities (&amp; &lt; &gt;), and the retweet marker "RT"
# (word boundaries must be written as \\b; a single \b is a backspace character)
url_regex <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
user_regex <- "@[^\\s]+\\b"
# filter out retweets, replace urls with "url" and @usernames with "username",
# collapse the search keyword into one word, and drop duplicate tweets
rawtweets <- rawtweets %>%
    filter(is_retweet == FALSE) %>%
    mutate(text = str_replace_all(text, url_regex, "url")) %>%
    mutate(text = str_replace_all(text, user_regex, "username")) %>%
    mutate(text = str_replace_all(text, regex(keyword, ignore_case=TRUE), gsub(" ","",keyword))) %>%
    distinct(text, .keep_all = TRUE)
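
To see what these substitutions do, here is a minimal sketch on two made-up tweets. It uses base R's `gsub` with `perl = TRUE` rather than `stringr`, and the names `toy_tweets` and `cleaned` are hypothetical. Note that a word boundary has to be written as `\\b` inside an R string, because a single `\b` is a backspace character.

```r
# Toy tweets invented for illustration (not from the workshop data)
toy_tweets <- c("RT @user1: Mental health matters https://t.co/abc123 &amp; more",
                "Mental Health day!")
keyword <- "mental health"

# same idea as the tweet regexes: urls, HTML entities, and the "RT" marker
url_regex  <- "https?://\\S+|&amp;|&lt;|&gt;|\\bRT\\b"
user_regex <- "@\\S+\\b"

cleaned <- gsub(url_regex, "url", toy_tweets, perl = TRUE)       # urls -> "url"
cleaned <- gsub(user_regex, "username", cleaned, perl = TRUE)    # handles -> "username"
# collapse the keyword into a single token, ignoring case
cleaned <- gsub(keyword, gsub(" ", "", keyword), cleaned, ignore.case = TRUE)
print(cleaned)
```

After this step every url, handle, and keyword occurrence is a single predictable token, which makes them easy to remove together with the stopwords later.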



### Tokenize the data

- The `unnest_tokens()` function separates the tweets into a tidy text format using the token specified. Here we are defining a token as a single word.



In [None]:
tidy_tweets<- rawtweets %>%
    select(status_id, created_at, text) %>%
    unnest_tokens("word", text)

head(tidy_tweets, 4)
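
As a small illustration of the tidy text format, the sketch below tokenizes two made-up tweets. The data frame `toy` and the result `toy_tokens` are hypothetical names used only for this example.

```r
library(tidytext)

# Two invented tweets (not from the workshop data)
toy <- data.frame(status_id = c("1", "2"),
                  text = c("Check THIS out", "Walking helps me sleep"),
                  stringsAsFactors = FALSE)

# one row per word; the status_id column is carried along as context
toy_tokens <- unnest_tokens(toy, word, text)
print(toy_tokens)
```

Each tweet becomes several rows, one per word, and `status_id` records which tweet each word came from.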


### Convert to lowercase
This is done automatically by the `unnest_tokens` function from the `tidytext` package.

### Remove punctuation
Most punctuation is automatically removed by `unnest_tokens` from `tidytext`.

Here we also strip the curly apostrophes (’) that remain inside words.

In [None]:
tidy_tweets$word <- gsub("\\’","",tidy_tweets$word)

### Remove stopwords
Common words such as “the”, “and”, “for”, “is”, etc. are often described as “stop words”. They carry little meaning on their own, so they are usually removed before a text analysis.
- The `tidytext` package has a list of common stop words called `stop_words` that we can use.
- There are also some words specific to tweets that we would like to filter out, for example urls and usernames, "rt" for retweets, "t.co" for Twitter link shortening, and "amp", which is left over from the HTML entity `&amp;` (an encoded "&").
- There are some words that aren't caught by `stop_words` that we may want to add, like "im", "ive", "its".
- There are some words in `stop_words` that we may want to keep, like "work", "working", "worked".

In [None]:
# load stop_words from tidytext package and remove them from tidy_tweets
tweet_words_to_remove <- c("url","username","rt","t.co","amp")
other_words_to_remove <- append(c("its","im","ive","lot"), gsub(" ","",keyword))
words_to_keep <- c("work","working","worked")

#load stop_words
data("stop_words")
# add a few more stop words
custom_stop_words <- stop_words %>%
    bind_rows(tibble(word = append(tweet_words_to_remove, other_words_to_remove),
                     lexicon = "custom")) %>%
    filter(!word %in% words_to_keep)

# remove stopwords and other insignificant words from tidy_tweets
tidy_tweets <- tidy_tweets %>%
    anti_join(custom_stop_words, by = "word")

print(head(rawtweets$text,2))
head(tidy_tweets,10)

Note: We might need to go back and edit our stopwords later if they remove words that are useful in our context.

For example, in Trump tweets:
- `tidytext` automatically converted all words to lowercase, and removed "UN" 
- `tidytext` automatically removed punctuation, and "Secretary-General" was reduced to only "secretary".
- removing stopwords removed the word "working"
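
The add-then-keep logic for a custom stopword list can be sketched on a tiny example. Everything here (`toy_words`, `toy_stop`, `toy_keep`) is made up for illustration; the real `stop_words` table is much larger.

```r
library(dplyr)

# tokens from one invented tweet
toy_words <- data.frame(status_id = "1",
                        word = c("the", "work", "is", "rewarding"),
                        stringsAsFactors = FALSE)

# a miniature stopword list and one word we want to protect from removal
toy_stop <- data.frame(word = c("the", "is", "work"),
                       lexicon = "toy",
                       stringsAsFactors = FALSE)
toy_keep <- c("work")

# drop the protected word from the stopword list, then remove the rest
toy_stop_custom <- toy_stop %>% filter(!word %in% toy_keep)
kept <- toy_words %>% anti_join(toy_stop_custom, by = "word")
print(kept)
```

`anti_join` keeps only the rows of the first table that have no match in the second, so editing the stopword table is all it takes to change what gets removed.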

### Remove numbers

In [None]:
# remove numbers from tidy_tweets
# (a logical mask from grepl is used because -grep() would drop every row if nothing matched)
tidy_tweets <- tidy_tweets[!grepl("\\b\\d+\\b", tidy_tweets$word), ]
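
A note on removing rows by pattern: negative indices from `grep` behave badly when nothing matches, because indexing with an empty vector selects zero rows. A logical mask from `grepl` does not have this problem. The sketch below shows the difference on a made-up data frame (`toy_words` is a hypothetical name).

```r
# two invented words, neither of which contains a digit
toy_words <- data.frame(word = c("health", "care"), stringsAsFactors = FALSE)

no_match <- grep("\\d", toy_words$word)          # integer(0): nothing matched
dropped_all <- toy_words[-no_match, , drop = FALSE]

# the logical mask keeps every row that does not match
kept_all <- toy_words[!grepl("\\d", toy_words$word), , drop = FALSE]

print(nrow(dropped_all))
print(nrow(kept_all))
```

With `-grep` the data frame ends up empty even though no word should have been removed, while the `grepl` version leaves both rows in place.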

### Remove extra white spaces

In [None]:
# remove extra white spaces from tidy_tweets
tidy_tweets$word <- gsub("\\s+","",tidy_tweets$word)

### Stemming
We may want to reduce all words to their word stems (or roots).

For example: "unite", "united", "uniting", "unites" all reduce to "unit". 

But the words "unit" and "units" also reduce to "unit", and these groups have different meanings.

The code below shows how to do it using the `SnowballC` package, but we won't use it for now.

In [None]:
# get word stems and save as tidy_tweets_stemmed
tidy_tweets_stemmed <- tidy_tweets %>%
    mutate(word = wordStem(word, language = "en"))
head(tidy_tweets_stemmed)
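
To check the "unit" example directly, the following sketch stems the six words on their own. The vector names are hypothetical, and the language is spelled out as `"english"` here.

```r
library(SnowballC)

toy_words <- c("unite", "united", "uniting", "unites", "unit", "units")

# stem each word with the English Snowball stemmer
stems <- wordStem(toy_words, language = "english")
print(data.frame(word = toy_words, stem = stems))
```

All six words collapse to the same stem, which is why stemming can merge groups of words with different meanings.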

## Word counting 
Count the most commonly used words across tweets 

<img src="imgs/part2/textanalysis_diagrams2.004.png" alt="Sentiment analysis flow diagram" width="600"/>


### Create a word count dataframe
Here we create a dataframe with a word column `word` and a count column `n`. 

Then we sort it from largest to smallest `n` to get the most frequently used words

In [None]:
# count word frequencies and sort in descending order
top_words<-
   tidy_tweets %>%
    count(word) %>%
        arrange(desc(n))
head(top_words,5)

### Visualize word frequencies
<img src="imgs/part2/textanalysis_diagrams2.005.png" alt="Sentiment analysis flow diagram" width="600"/>

Plot bar charts of word frequencies using `ggplot` from `ggplot2` package

In [None]:
# plot the 20 most frequently used words
# function for bar plots given a dataframe that contains a "word" column and a count column (e.g. "n")
# we pass in: 
# - unit for the plot title ("words", "bigrams", "trigrams") 
# - and the name of the count column (e.g. "n")
plot_frequent_words <- function(word_counts, unit, param) {
    options(repr.plot.width=9, repr.plot.height=7)
    word_counts %>%
        ggplot(aes(x=get(param), y=reorder(word, get(param)), fill=-get(param)))+
          geom_bar(stat="identity")+
            theme_minimal()+
            theme(axis.text.x = element_text(angle = 60, hjust = 1, size=15),
                  axis.text.y = element_text(hjust = 1, size=15),
                  axis.title = element_text(size=15),
                  plot.title = element_text(hjust = 0.5, size=18))+
                ylab(unit)+
                xlab("# Occurrences")+
                ggtitle(paste("Most Frequent", unit, "in", title, "Tweets", sep=" "))+
                guides(fill="none")
}

top_words %>%
  slice(1:20) %>%
    plot_frequent_words("words", "n")

### WordClouds
Create wordclouds for qualitative insights using the `wordcloud` function from the `wordcloud` package
- `min.freq`: words with frequency below min.freq will not be plotted
- `max.words`: Maximum number of words to be plotted. Least frequent terms are dropped
- `random.order`: plot words in random order. If false, they will be plotted in decreasing frequency
- `rot.per`: proportion of words with 90 degree rotation
- `colors`: color words from least to most frequent
    - choose other color themes from [RColorBrewer](https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html)

In [None]:
# generate a wordcloud 
set.seed(1234) # for reproducibility 
wordcloud(words = top_words$word, freq = top_words$n, min.freq = 1,  
          max.words=200, random.order=FALSE, rot.per=0.3,colors=brewer.pal(8, "Dark2"))

### Bigrams and n-grams

Bigrams and n-grams refer to how text is tokenized, or the size of the unit of analysis.

- Unigrams are single words, bigrams are two-word phrases, and n-grams are n-word phrases.

**Bigrams**: 

We tokenize the text into bigrams and then filter out bigrams where either word is a stopword or contains numbers.

In [None]:
# Bigrams
# Preprocess rawtweets by tokenizing into bigrams, removing stopwords from individual words,
# and then combining back together
tidy_bigrams <-rawtweets %>%
    select(status_id,text) %>%
        unnest_tokens(output=word, input=text, token = "ngrams", n = 2) %>% 
          separate(word, c("word1", "word2"), sep = " ") %>% 
              filter(!word1 %in% custom_stop_words$word) %>%
              filter(!word2 %in% custom_stop_words$word) %>% 
              filter(!grepl("\\d+|\\’", word1))%>%
              filter(!grepl("\\d+|\\’", word2))%>%
                  unite(word, word1, word2, sep = " ")

# count bigrams and arrange by frequency
top_bigrams <- tidy_bigrams %>%
    count(word) %>%
        arrange(desc(n))

# plot top bigrams
top_bigrams %>%
  slice(1:20) %>%
    plot_frequent_words("Bigrams" , "n")

set.seed(1234) # for reproducibility 
wordcloud(words = top_bigrams$word, freq = top_bigrams$n, min.freq = 1,  
          max.words=200, random.order=FALSE, rot.per=0.3,colors=brewer.pal(8, "Dark2"),scale=c(4,.3))

**Trigrams**:

We tokenize the text into trigrams (3-word phrases) and then filter out trigrams where the first or third word is a stopword. We could also filter out trigrams where any of the 3 words is a stopword.

In [None]:
# Trigrams
# Preprocess rawtweets by tokenizing into trigrams, removing stopwords from individual words,
# and then combining back together
tidy_trigrams <-rawtweets %>%
    select(status_id,text) %>%
        unnest_tokens(output=word, input=text, token = "ngrams", n = 3) %>% 
          separate(word, c("word1", "word2", "word3"), sep = " ") %>% 
              filter(!(word1 %in% custom_stop_words$word)) %>% 
              #filter(!(word2 %in% custom_stop_words$word)) %>% 
              filter(!(word3 %in% custom_stop_words$word)) %>%
              filter(!grepl("\\d+|\\’", word1))%>%
              filter(!grepl("\\d+|\\’", word2))%>%
              filter(!grepl("\\d+|\\’", word3))%>%
                  unite(word, word1, word2, word3, sep = " ")%>%
                  filter(!grepl("NA NA NA", word)) # tweets shorter than 3 words

# count trigrams and arrange by frequency
top_trigrams <- tidy_trigrams %>%
    count(word) %>%
        arrange(desc(n))

# plot top trigrams
top_trigrams %>%
  slice(1:20) %>%
    plot_frequent_words("Trigrams","n")

# plot wordcloud
set.seed(1234) # for reproducibility 
wordcloud(words = top_trigrams$word, freq = top_trigrams$n, min.freq = 1,  
          max.words=100, random.order=FALSE, rot.per=0.3,colors=brewer.pal(8, "Dark2"),scale=c(2,.3))

### tf-idf: Term Frequency Inverse Document Frequency
A statistic for how important a word is to a document in a collection

Words that occur more frequently in one document (tweet) and less frequently in other documents should be given more importance as they are more useful for classification. This method is better suited for longer documents.

***Term frequency:***

$tf(term)=\displaystyle\frac{n_{\text{occurrences of term in document}}}{n_{\text{words in document}}}$

***Inverse document frequency:***

$idf(term)=\displaystyle\log\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$

The tf-idf value of a term in a document is the product $tf(term) \times idf(term)$.

The `bind_tf_idf` function from the `tidytext` package calculates the tf-idf value for each token (word)

In [None]:
tidy_tweets_tfidf <- tidy_tweets %>%
    count(word, status_id) %>%
    bind_tf_idf(word, status_id, n) %>%
    arrange(desc(tf_idf)) %>%
    distinct(word, .keep_all = TRUE)  # after sorting, this keeps each word's highest tf-idf row

head(tidy_tweets_tfidf,8)

#set.seed(1234) # for reproducibility     
#wordcloud(words = tidy_tweets_tfidf$word, freq = tidy_tweets_tfidf$tf_idf, min.freq = max(tidy_tweets_tfidf$tf_idf)/10,  
#          max.words=100, random.order=FALSE, rot.per=0.3,colors=brewer.pal(8, "Dark2"),scale=c(2,.5) )
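
To make the two formulas concrete, here is a minimal base R sketch that computes tf, idf, and their product by hand for three made-up documents. The object names (`docs`, `counts`, `doc_freq`) are hypothetical. R's `log()` is the natural logarithm, which to our knowledge is also what `bind_tf_idf` uses.

```r
# Three invented "documents", already tokenized
docs <- list(d1 = c("sleep", "helps", "anxiety"),
             d2 = c("anxiety", "anxiety", "support"),
             d3 = c("sleep", "support", "group", "support"))

# one row per (document, word) pair with its count and the document length
counts <- do.call(rbind, lapply(names(docs), function(d) {
    tab <- table(docs[[d]])
    data.frame(doc = d, word = names(tab), n = as.integer(tab),
               total = length(docs[[d]]), stringsAsFactors = FALSE)
}))

# tf: share of the document's words taken up by this word
counts$tf <- counts$n / counts$total

# idf: log of (number of documents / number of documents containing the word)
doc_freq <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(doc_freq[counts$word]))

# tf-idf is the product of the two
counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

Words that appear in only one document ("helps", "group") get the largest idf, while a word that appears in every document would get an idf of zero.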

## Dictionary-based text analysis
Dictionary-based techniques involve taking words that have been assigned a particular meaning or value and counting the number of occurrences in each document. This approach assumes each word has an intrinsic meaning and does not take into account the context of each word.

<img src="imgs/part2/textanalysis_diagrams2.006.png" alt="Sentiment analysis flow diagram" width="600"/>


### Select for keywords
A simple example is extracting tweets that contain a certain word or list of words. 

We can create a custom list of words and find tweets that contain any of those words

The `str_detect` function from the `stringr` package returns `TRUE` for each text that matches a specified pattern.

In [None]:
# create a list of words as our custom dictionary
custom_dictionary<-c("college","student") #c("school") 

# extract tweets that contain any words from our custom dictionary
custom_dictionary_tweets<-rawtweets[str_detect(rawtweets$text, 
                                                regex(paste(custom_dictionary, collapse="|"),
                                                      ignore_case=TRUE)),]
paste (nrow(custom_dictionary_tweets), 'tweets found')
custom_dictionary_tweets
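
One thing to look out for: joining the dictionary with `|` matches substrings, so "student" also matches "students". The sketch below uses base R's `grepl` (rather than `str_detect`) on made-up texts to compare substring matching with whole-word matching. All names are hypothetical.

```r
toy_texts <- c("College starts Monday", "Students need support", "Sleep matters")
toy_dictionary <- c("college", "student")

# "college|student": the | means "either word"
pattern <- paste(toy_dictionary, collapse = "|")
substring_match <- grepl(pattern, toy_texts, ignore.case = TRUE)

# wrapping the alternatives in \\b ... \\b requires a whole-word match
whole_word_pattern <- paste0("\\b(", pattern, ")\\b")
whole_word_match <- grepl(whole_word_pattern, toy_texts,
                          ignore.case = TRUE, perl = TRUE)

print(data.frame(text = toy_texts, substring_match, whole_word_match))
```

Whether substring matching is desirable depends on the goal: it conveniently catches plurals, but it can also pick up unrelated longer words.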

In [None]:
# get top words from tweets that match our custom dictionary
custom_top_words<-custom_dictionary_tweets %>%
    select(status_id,text) %>%
      unnest_tokens("word", text) %>%        # tokenize
        anti_join(custom_stop_words) %>%     # remove stopwords
        filter(!grepl("\\d+|\\’", word))%>%  # remove numbers and apostrophes
        mutate_at("word", list(~wordStem((.), language="en"))) %>% # get word stems
            count(word) %>%
                arrange(desc(n))

#plot unigram wordcloud
set.seed(1234) # for reproducibility 
wordcloud(words = custom_top_words$word, freq = custom_top_words$n, min.freq = 1,  
          max.words=100, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"),scale=c(4,1))

# get top bigrams from tweets that match our custom dictionary
custom_top_bigrams <-custom_dictionary_tweets %>%
    select(status_id,text) %>%
        unnest_tokens(output=word, input=text, token = "ngrams", n = 2) %>% 
          separate(word, c("word1", "word2"), sep = " ") %>% 
              filter(!word1 %in% custom_stop_words$word) %>%
              filter(!word2 %in% custom_stop_words$word) %>% 
              filter(!grepl("\\d+|\\’", word1))%>%
              filter(!grepl("\\d+|\\’", word2))%>%
                  unite(word, word1, word2, sep = " ") %>%
                    count(word) %>%
                    arrange(desc(n))

# plot bigram wordcloud
set.seed(1234) # for reproducibility 
wordcloud(words = custom_top_bigrams$word, freq = custom_top_bigrams$n, min.freq = 1,  
          max.words=100, random.order=FALSE, rot.per=0.3,colors=brewer.pal(8, "Dark2"),scale=c(2,.3))

### Sentiment analysis
One popular type of dictionary is a sentiment dictionary, which can be used to assess the valence of a given text by searching for words that describe affect or opinion. 
Sentiment dictionaries can differ substantially from one another. See [this paper](https://homepages.dcc.ufmg.br/~fabricio/download/cosn127-goncalves.pdf) for a comparison.


#### Load a dictionary

`tidytext` provides access to a few sentiment dictionaries through `get_sentiments()`
- `afinn` - sentiment words in Twitter discussions of climate change (value between -5 and 5)
- `bing` - sentiment words identified in online forums (negative vs positive)
- `nrc` - emotional valence words labeled by MTurk workers
    - Words in this dictionary are labeled with the sentiments:
    "negative","positive","trust","fear","sadness","anger", "surprise","disgust","joy","anticipation"
    - Each word can be associated with multiple sentiments

`bing` is the only dictionary available in Binder; `afinn` and `nrc` have to be downloaded separately through the `textdata` package.

In [None]:
# look at the bing dictionary
sentiment_dictionary <- "bing"
unique_sentiments <- unique(get_sentiments(sentiment_dictionary)$sentiment)
unique_sentiments
head(get_sentiments(sentiment_dictionary))

#### Count sentiment words across tweets
- We use the `inner_join` function to keep only the words that appear in the sentiment dictionary and attach each word's sentiment label

- `count(word, sentiment)` then counts the number of times each word + sentiment pair appears across all tweets

In [None]:
sentiments_by_word <- tidy_tweets %>%
  inner_join(get_sentiments(sentiment_dictionary)) %>%
    count(word, sentiment) 

head(sentiments_by_word)
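
The join-then-count pattern can be sketched with a miniature, made-up lexicon. `toy_tidy` and `toy_lexicon` are hypothetical stand-ins for `tidy_tweets` and the `bing` dictionary.

```r
library(dplyr)

# tokens from two invented tweets
toy_tidy <- data.frame(status_id = c("1", "1", "1", "2", "2"),
                       word = c("happy", "sad", "walk", "happy", "great"),
                       stringsAsFactors = FALSE)

# a three-word stand-in for a sentiment dictionary
toy_lexicon <- data.frame(word = c("happy", "great", "sad"),
                          sentiment = c("positive", "positive", "negative"),
                          stringsAsFactors = FALSE)

# words without a dictionary entry ("walk") are dropped by the inner join
toy_by_word <- toy_tidy %>%
    inner_join(toy_lexicon, by = "word") %>%
    count(word, sentiment)
print(toy_by_word)

# the same join, counted per tweet instead of per word
toy_by_tweet <- toy_tidy %>%
    inner_join(toy_lexicon, by = "word") %>%
    count(status_id, sentiment)
print(toy_by_tweet)
```

Only the grouping columns passed to `count` change between the two summaries, so the same join can be reused for word-level and tweet-level views.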


Let's plot the top 10 positive and negative words used across all tweets

In [None]:
if (nrow(distinct(get_sentiments(sentiment_dictionary),sentiment))>2){
    options(repr.plot.width=20, repr.plot.height=15)
}else{
    options(repr.plot.width=15, repr.plot.height=7)
}
sentiments_by_word %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 60, hjust = 1, size=15),
          axis.text.y = element_text(hjust = 1, size=15),
          axis.title = element_text(size=15),
          plot.title = element_text(hjust = 0.5, size=18),
          strip.text = element_text(size=18))+
          
    theme(aspect.ratio = 1.5/1)+
    geom_col(show.legend = FALSE, width = 0.5) +  
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment",
         x = NULL) +
    coord_flip()

Note: In Trump and covid tweets, "trump" happens to be a positive sentiment word, but it's being used in a different way. Context-appropriate dictionaries will give better results.

#### Count sentiments by tweet
Now let's use the `inner_join` function and count the number of sentiment words per tweet.

In [None]:
# count the number of sentiment words for each tweet
tweet_sentiments <- tidy_tweets %>%
  inner_join(get_sentiments(sentiment_dictionary)) %>%
    count(status_id, sentiment) 

head(tweet_sentiments)


#### Examine sentiment labels

Each tweet can have any number of positive or negative words. Let's see what the tweets with the highest counts for each sentiment look like.

1) Arrange a dataframe where each row has the full tweet and count of sentiment words

  - The `spread` function spreads a key-value pair across columns into multiple columns with the key as the column name.

2) List the tweets with the highest counts for each sentiment.

In [None]:
# group sentiment counts by tweet
tweet_sentiments_spread <-tweet_sentiments %>%
  spread(sentiment, n, fill=0)

# add in the original text and favorite count
tweet_sentiments_full <- merge(tweet_sentiments_spread, 
                              rawtweets[c("status_id", "created_at", "text","favorite_count")], 
                              by="status_id")
#head(tweet_sentiments_full,4)

# for each sentiment, print a tweet with the highest count for that sentiment
sentiments <- as.list(get_sentiments(sentiment_dictionary) 
                      %>% distinct(sentiment))

high_sentiments <- tweet_sentiments_full[FALSE,] 
high_sentiments$highest <- NULL 

for (s in sentiments[[1]])
{
    tmp <- tweet_sentiments_full %>% 
            arrange(desc(get(s)))
    tmp$highest <- s
    high_sentiments <- high_sentiments %>%
        bind_rows(tmp[1:2,])
}
high_sentiments <- high_sentiments %>%
    select(-one_of('status_id', 'favorite_count'))
high_sentiments
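
The long-to-wide reshaping that `spread(sentiment, n, fill=0)` performs above can be illustrated with a small base R sketch. The `toy_long` table is made up for illustration, and `xtabs` is used here as a stand-in for `spread`, not as the method the notebook uses.

```r
# hypothetical long-format counts, shaped like tweet_sentiments
toy_long <- data.frame(
  status_id = c("a", "a", "b", "c"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# one row per tweet, one column per sentiment; missing pairs become 0
toy_wide <- as.data.frame.matrix(
  xtabs(n ~ status_id + sentiment, data = toy_long)
)

# move the tweet ids from row names into a regular column
toy_wide$status_id <- rownames(toy_wide)
rownames(toy_wide) <- NULL
toy_wide
```

Tweet "b" has no negative words, so its `negative` cell is filled with 0 rather than left missing, which mirrors the role of `fill=0`.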

Let's generate wordclouds for the most commonly used words in 

- the top 50 tweets containing the most negative sentiment words and 
- the top 50 tweets containing the most positive sentiment words. 

This wordcloud will include non-sentiment words.

In [None]:
# generate word clouds for the top 50 most negative and top 50 most positive tweets
for (s in c("negative","positive"))
{
    print(s)
    top_words <- tweet_sentiments_full %>% 
        arrange(desc(get(s))) %>%              # order by sentiment count
        slice(1:50) %>%                        # take the top 50
        select(status_id,text) %>%
          unnest_tokens("word", text) %>%      # tokenize into words
            anti_join(custom_stop_words) %>%   # remove stopwords
              filter(!grepl("\\d+", word))%>%  # filter out numbers
                filter(!grepl("\\’", word))%>% # filter out apostrophes
                   count(word) %>%
                     arrange(desc(n))

    # plot wordcloud
    set.seed(1234) # for reproducibility 
    wordcloud(words = top_words$word, freq = top_words$n, min.freq = 1,  
              max.words=100, random.order=FALSE, rot.per=0.3, colors=brewer.pal(8, "Dark2"))#,scale=c(4,1))

}

#### [extra] Plot sentiments over time
This applies to the Trump dataset, which has a timespan of over one day.
A note about timestamps: base R's `as.Date` and the `lubridate` package's `ymd_hms` help with parsing and formatting timestamps.
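
As a minimal base R sketch of timestamp handling, the example below parses two made-up timestamp strings, reduces them to calendar dates, and filters by date. The strings and the UTC time zone are assumptions for illustration.

```r
# hypothetical timestamp strings in "YYYY-MM-DD HH:MM:SS" form
stamps <- c("2017-05-05 19:43:37", "2018-05-18 08:00:00")

# parse the strings into date-times (UTC is assumed here)
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# drop the time of day to get calendar dates
dates <- as.Date(parsed)

# keep only the entries from one calendar day
parsed[dates == as.Date("2018-05-18")]
```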

In [None]:
## convert timestamps to timestamp format
#rawtweets$created_at <- ymd_hms(rawtweets$created_at)

## examples for retrieving tweets at specific dates/times: 
# rawtweets[as.Date(rawtweets$created_at) == as.Date("2018-05-18"),]
# rawtweets[rawtweets$created_at == ymd_hms("2017-05-05 19:43:37"),]

In [None]:
# get date for each tweet
tweet_sentiments_full$date <- as.Date(tweet_sentiments_full$created_at)

# plot count of negative and positive sentiment words by date
options(repr.plot.width=12, repr.plot.height=10)
tweet_sentiments_full %>%
    group_by(date) %>%
    summarise_at(vars(positive,negative), list(n = sum)) %>%
    ggplot(aes(x = date)) + 
      geom_line(aes(y = positive_n, color = "Positive")) +
      geom_line(aes(y = negative_n, color = "Negative")) + 
        scale_color_manual(values = c(
            'Positive' = 'green',
            'Negative' = 'red')) +
        theme_minimal()+
        theme(axis.text.x = 
            element_text(angle = 60, hjust = 1, size=13))+
        theme(plot.title = 
            element_text(hjust = 0.5, size=18))+
          ylab("Number of Sentiment Words")+
          xlab("")+
          labs(color = "Sentiment") +
          ggtitle(paste("Positive and Negative Sentiment in", title, "Tweets", sep=" "))+
          theme(aspect.ratio=1/4)


#### [extra] Linear model for favorites based on sentiment count
This applies to the Trump dataset, which has a substantial favorite count due to being popular and spanning enough time to gather favorites.

In [None]:
model1 <-tweet_sentiments_full %>%
    lm(data=., favorite_count ~  negative + positive)
    ## for nrc only
    #lm(data=., favorite_count ~  negative + positive + disgust +sadness + trust + fear + joy + anticipation)
summary(model1)
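
To help read the model output, here is a small sketch on made-up, noise-free data in which the true relationship is known. The `toy` data frame and its coefficients (100, +5 per positive word, -3 per negative word) are invented for illustration and say nothing about the real tweets.

```r
# hypothetical sentiment counts for eight tweets
toy <- data.frame(
  positive = c(0, 1, 2, 3, 0, 1, 2, 3),
  negative = c(0, 0, 1, 1, 2, 2, 3, 3)
)

# favorites generated from a known linear rule (no noise)
toy$favorite_count <- 100 + 5 * toy$positive - 3 * toy$negative

# same formula shape as model1
fit <- lm(favorite_count ~ negative + positive, data = toy)

# each coefficient is the expected change in favorites per additional
# sentiment word, holding the other count fixed
round(coef(fit), 2)
```

With real data the estimates come with standard errors and p-values, which `summary()` reports alongside the coefficients.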

### Linguistic Inquiry Word Count (LIWC)

LIWC (Linguistic Inquiry and Word Count) is a large, commonly used, paid dictionary developed by a social psychologist and team to classify psychometric and substantive properties of a text. The dictionary was built systematically rather than by empirically labeling texts, and it has been psychometrically validated.

More details [here](http://liwc.wpengine.com/how-it-works/)


## Topic Models

A method for analyzing “bags” or groups of words together instead of counting them individually.
- Every document is a mixture of topics: Assume there are a certain number of latent or underlying themes / "topics"
- Every topic is a mixture of words: Topics consist of groups of words that tend to co-occur

In practice:
- Choosing the appropriate number of topics (k) is difficult and something of an art, and labeling the topics requires human interpretation
- Topic modeling works best on text that is not too short (>50 words) and has consistent structure; topic modeling for short text is an area of active research

<img src="imgs/part2/textanalysis_diagrams2.007.png" alt="Sentiment analysis flow diagram" width="600"/>
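
The two "mixture" statements above can be made concrete with a little matrix arithmetic. In this sketch the numbers, words, and topic names are all made up: each topic is a probability distribution over words (called beta later in this notebook), each document is a distribution over topics (called gamma later), and a document's implied word probabilities are the product of the two.

```r
# hypothetical topics: each row is a distribution over 4 words (sums to 1)
beta <- matrix(c(0.50, 0.40, 0.05, 0.05,
                 0.05, 0.05, 0.50, 0.40),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("vaccine", "mask", "vote", "election")))

# hypothetical documents: each row is a distribution over 2 topics (sums to 1)
gamma <- matrix(c(0.9, 0.1,
                  0.3, 0.7),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))

# word probabilities implied for each document: a topic-weighted mix
doc_word_probs <- gamma %*% beta
doc_word_probs

# each document's word probabilities still sum to 1
rowSums(doc_word_probs)
```

Fitting a topic model runs this logic in reverse: it starts from observed word counts and estimates the beta and gamma values that best explain them.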


### LDA (Latent dirichlet allocation)
LDA is the most common form of topic modeling. It uses an iterative algorithm to estimate the prevalence of each word in each of the k topics and the prevalence of each topic in each document. It works from raw term counts in a document-term matrix, not from TF-IDF weights.

#### Document-term matrix 
The `topicmodels` package requires the data to be formatted as a document-term matrix, where each row is a document, each column is a term, and each cell contains the count of the given term in the given document.

We can pass in any of our tidy text (`tidy_tweets`, `tidy_bigrams`, `tidy_trigrams`) to be converted into the document-term matrix format. 

In [None]:
# convert to document term matrix
tidy_tweets_DTM<-
  tidy_bigrams %>%
  count(status_id, word) %>%
  cast_dtm(status_id, word, n)

inspect(tidy_tweets_DTM[1:4,1:3])
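
For intuition about the document-term layout, the sketch below builds a tiny dense version with base R's `table`. The `toy_tokens` data is made up. `cast_dtm` produces a sparse object with the same rows-are-documents, columns-are-terms layout, so this is an illustration of the shape rather than a replacement.

```r
# hypothetical one-token-per-row data
toy_tokens <- data.frame(
  status_id = c("a", "a", "a", "b", "b"),
  word      = c("mental", "health", "health", "health", "care"),
  stringsAsFactors = FALSE
)

# rows are documents, columns are terms, cells are counts
toy_dtm <- table(document = toy_tokens$status_id, term = toy_tokens$word)
toy_dtm
```

Most cells in a real document-term matrix are zero, which is why the packages store it in a sparse format.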

#### Create LDA topic model

In [None]:
# create topic models
k_topics = 10
tweets_topic_model<-LDA(tidy_tweets_DTM, k=k_topics, control = list(seed = 321))
paste("Created a topic model with", k_topics , "topics")

#### Top terms in each topic
Each term is assigned a value beta for each topic, representing the probability of the term occurring in a given topic.

Let's look at the top terms in each topic

In [None]:
# generate dataframe with beta value for each term and topic combination
topic_terms <- tidy(tweets_topic_model, matrix = "beta")

# choose number of terms to show per topic
n_top = 10

# extract top terms in each topic
top_topic_terms <- 
  topic_terms %>%
  group_by(topic) %>%
    top_n(n_top, beta) %>%
    slice(1:n_top)%>%
  ungroup() %>%
  arrange(topic, beta)

# plot the top terms for each topic
if (k_topics>12){
    options(repr.plot.width=20, repr.plot.height=15)
}else{
    options(repr.plot.width=20, repr.plot.height=10)
}

top_topic_terms %>%
    mutate(topic = paste0("Topic ", topic),
           term = reorder_within(term, beta, topic)) %>%
    ggplot(aes(term, beta, fill = as.factor(topic))) +
        geom_col(alpha = 0.8, show.legend = FALSE) +
        facet_wrap(~ topic, scales = "free_y") +
        coord_flip() +
        scale_x_reordered() +

        theme_minimal()+
        theme(axis.text.x = element_text(angle=60, hjust = 1, size=14),
              axis.text.y = element_text(hjust = 1, size=14),
              strip.text = element_text(size=14))+
        
        labs(x = NULL, y = expression(beta),
             title = "Highest word probabilities for each topic")

#### Top documents in each topic
Each document is assigned a value gamma for each topic, representing the estimated proportion of the document's words that are generated from that topic.

Let's read some of the tweets that score high in each topic.

In [None]:
# generate a dataframe with gamma value for each document and topic combination
document_topics <- tidy(tweets_topic_model, matrix = "gamma")

# choose number of documents to show per topic
n_top = 5

# read the top documents for each topic
top_topic_documents <- document_topics %>%
    group_by(topic) %>%
    top_n(n_top, gamma) %>%
    slice(1:n_top) %>%
    ungroup() %>%
    rename("status_id" = "document") %>%
    merge(rawtweets[c("status_id", "text")], by="status_id", all.x=TRUE) %>%
    arrange(topic,desc(gamma))

top_topic_documents

## view all documents from a particular topic
#document_topics %>%
#    filter(topic==3) %>%
#    rename("status_id" = "document") %>%
#    merge(rawtweets[c("status_id", "text")], by="status_id", all.x=TRUE) %>%
#    arrange(desc(gamma))
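
The "top documents per topic" step can be sketched in base R on a made-up gamma table. The `toy_gamma` values are hypothetical; the idea is simply to sort by gamma within each topic and keep the first few rows.

```r
# hypothetical gamma values: 3 documents x 2 topics, in long format
toy_gamma <- data.frame(
  document = rep(c("a", "b", "c"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.8, 0.2, 0.4, 0.6, 0.1, 0.9),
  stringsAsFactors = FALSE
)

# sort so the highest gamma comes first within each topic
ordered <- toy_gamma[order(toy_gamma$topic, -toy_gamma$gamma), ]

# keep the top 2 documents for each topic
top_docs <- do.call(rbind, lapply(split(ordered, ordered$topic), head, n = 2))
top_docs
```

Reading the highest-gamma documents for a topic is often the quickest way to decide what label the topic deserves.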

    

### STM (Structural Topic Modeling)

Structural topic modeling is a form of topic modeling that uses metadata about documents (e.g. author name, date) to improve the assignment of words to latent topics.

#### Load the data
Combine 2 datasets - tweets that contain the phrase "mental health"
- dataset 1:  4000 tweets collected on 2020-07-30 
- dataset 2:  4000 tweets collected on 2020-08-19 

In [None]:
# combine mental health tweets and add dataset column to indicate source
load("data/mentalhealthtweets.RData")
load("data/mentalhealthtweets2.RData")
mentalhealthtweets$dataset = 1
mentalhealthtweets2$dataset = 2
rawtweets2 <- rbind(mentalhealthtweets, mentalhealthtweets2)

# pre-process the raw text    
url_regex <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
user_regex <-"@[^\\s]+\\b"
rawtweets2 <- rawtweets2 %>%
    mutate(text = str_replace_all(text, url_regex, "url")) %>%
    mutate(text = str_replace_all(text, user_regex, "username")) %>%
    mutate(text = str_replace_all(text, regex(keyword, ignore_case=TRUE), gsub(" ","",keyword))) %>%
    distinct(text,.keep_all= TRUE)
paste("loaded 2 mental health datasets into rawtweets2")
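
The sketch below shows what this kind of tweet pre-processing does to a single made-up tweet. It uses base R's `gsub` with `perl = TRUE` as a stand-in for `str_replace_all`, and the example text is hypothetical. Note that backslashes are doubled inside R string literals: `"\\b"` is the regex word boundary, whereas `"\b"` is a backspace character.

```r
# hypothetical raw tweet text
toy_text <- "RT @user_1: Mental health matters &amp; more https://t.co/abc123"

# links, HTML entities, and the standalone retweet marker "RT"
url_pattern  <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
# @-mentions, up to the last word character
user_pattern <- "@[^\\s]+\\b"

# replace each match with a placeholder token
step1 <- gsub(url_pattern, "url", toy_text, perl = TRUE)
step2 <- gsub(user_pattern, "username", step1, perl = TRUE)
step2
```

Replacing links and handles with placeholder tokens keeps thousands of one-off strings out of the vocabulary, and the placeholders can later be dropped as stopwords.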

#### Process the data
There are a few ways to get to the data format used for the `stm` package

- Create tidy text -> summarize text -> create sparse data format
- Create tidy text -> summarize text -> use quanteda package to convert to dfm (document-feature matrix)
- Use `stm` package's built-in text processing from raw data (but cannot create n-grams) 

In [None]:
# get tidy bigrams
tidy_bigrams2 <-rawtweets2 %>%
    select(dataset,status_id,text) %>%
        unnest_tokens(output=word, input=text, token = "ngrams", n = 2) %>% 
          separate(word, c("word1", "word2"), sep = " ") %>% 
              filter(!word1 %in% custom_stop_words$word) %>%
              filter(!word2 %in% custom_stop_words$word) %>% 
              filter(!grepl("\\d+|\\’", word1))%>%
              filter(!grepl("\\d+|\\’", word2))%>%
                  unite(word, word1, word2, sep = " ") 

# summarize bigrams by counts
bigrams_count <- tidy_bigrams2 %>%
                    count(dataset,status_id,word) %>%
                    arrange(desc(n))

# create sparse data format for stm
bigrams_sparse <- bigrams_count %>%
    cast_sparse(status_id, word, n)
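bigrams_covariates <- tidy_bigrams2 %>%
    distinct(status_id,dataset) %>%
    merge(rawtweets2[c("status_id", "text")], by="status_id", all.x=TRUE)
# put the covariate rows in the same order as the rows of bigrams_sparse
bigrams_covariates <- bigrams_covariates[match(rownames(bigrams_sparse), bigrams_covariates$status_id),]
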

##alternative1: use the quanteda package to convert summarized text to a document-feature (term) matrix
#bigrams_dfm <- bigrams_count %>%
#    cast_dfm(status_id, word, n)
#
#bigrams_stm_data <- convert(bigrams_dfm, to = "stm", docvars = bigrams_covariates)
#docs <- bigrams_stm_data$documents 
#vocab <- bigrams_stm_data$vocab    
#meta <- bigrams_stm_data$meta  


###alternative2: use stm's built-in text processing
## directly process raw tweets using the text processing built in to stm, but it can't create ngrams
## preprocess the text 
#custom_stopwords = c("rt","t.co","amp","ive","im","its","mh","url","username")
#custom_punctuation = c("\\’")
#processed <- textProcessor(rawtweets2$text, metadata = rawtweets2, 
#                           customstopwords = custom_stopwords,
#                           custompunctuation = custom_punctuation)
## create output files
#out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
#docs <- out$documents
#vocab <- out$vocab
#meta <- out$meta

#### Create STM model
Creating the model takes several minutes to run. We have a few pre-saved models run with k = 10, 15, or 20 topics

In [None]:
k_topics_stm = 10 #saved models: 10, 15, 20
load_file = TRUE

# load the pre-run model or run one yourself
if (load_file){
   load(paste0("models/bigram_stm_",k_topics_stm,".RData"))  
}else
{
    bigram_stm <- stm(bigrams_sparse, 
                      K = k_topics_stm, 
                      prevalence = ~ dataset,
                      data = bigrams_covariates,
                      verbose = FALSE,
                      max.em.its = 75,
                      init.type = "Spectral")
}

# plot the top terms in each topic
if (k_topics_stm>=15){
    options(repr.plot.width=20, repr.plot.height=k_topics_stm)
}else{
    options(repr.plot.width=20, repr.plot.height=10)
}
plot(bigram_stm, n=6)
#labelTopics(bigram_stm, n=6)

#### Top documents in each topic
Let's read some of the tweets that score high in each topic.

In [None]:
thoughts <- findThoughts(bigram_stm,texts=bigrams_covariates$text, topics=1:k_topics_stm, n=4)
thoughts

#### Effect of dataset (date) on topic distribution

In [None]:
predict_topics<-estimateEffect(formula = 1:k_topics_stm ~ dataset, stmobj = bigram_stm, metadata = bigrams_covariates, uncertainty = "Global")

# plot the estimated difference in topic prevalence between the two datasets
if (k_topics_stm>=15){
    options(repr.plot.width=20, repr.plot.height=k_topics_stm)
}else{
    options(repr.plot.width=20, repr.plot.height=12)
}
plot(predict_topics, covariate = "dataset", topics = 1:k_topics_stm,
 model = bigram_stm, method = "difference",
 cov.value1 = "2", cov.value2 = "1",
 xlab = "More 2020-07-30 ... More 2020-08-19",
 main = "2020-07-30 dataset vs. 2020-08-19 dataset",
 xlim = c(-.4, .4), n=4, labeltype="prob"
     #labeltype = "custom", custom.labels = c('Topic 3', 'Topic 5','Topic 9'))
)
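
As rough intuition for the difference plot, the sketch below compares average topic proportions between two groups on made-up numbers. This is only the underlying idea: `estimateEffect` fits a regression and reports uncertainty, which this sketch does not attempt. The `toy_theta` table is hypothetical.

```r
# hypothetical per-document topic proportions for two datasets
toy_theta <- data.frame(
  dataset = c(1, 1, 2, 2),
  topic1  = c(0.8, 0.6, 0.3, 0.1),
  topic2  = c(0.2, 0.4, 0.7, 0.9)
)

# average proportion of each topic within each dataset
means <- aggregate(cbind(topic1, topic2) ~ dataset, data = toy_theta, FUN = mean)

# dataset 2 minus dataset 1: positive values mean the topic is more
# prevalent in dataset 2
diff_2_minus_1 <- unlist(means[means$dataset == 2, c("topic1", "topic2")]) -
                  unlist(means[means$dataset == 1, c("topic1", "topic2")])
diff_2_minus_1
```

In this toy case topic 1 falls to the left of zero (more common in dataset 1) and topic 2 to the right (more common in dataset 2).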

### Topic model notes


- [Keyword-assisted topic models (keyATM)](https://arxiv.org/pdf/2004.05964.pdf): specify topics and associated keywords before modeling [[keyATM package](https://keyatm.github.io/keyATM/)]

- [Biterm topic models](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf): learn unordered word groupings across the entire document for short texts



## Appendix

Slang and emojis add a layer of complexity to text analysis
- [[Paper] Artificial Intelligence and Inclusion: Formerly Gang Involved Youth as Domain Experts for Analyzing Unstructured Twitter Data](https://safelab.socialwork.columbia.edu/sites/default/files/content/AI%26Inclusion.pdf)