# Text Analysis from SICSS

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-0"><span class="toc-item-num">0&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Install-packages" data-toc-modified-id="Install-packages-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Install packages</a></span></li><li><span><a href="#Load-packages" data-toc-modified-id="Load-packages-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Load packages</a></span></li></ul></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load the data</a></span><ul class="toc-item"><li><span><a href="#Look-at-the-data-format" data-toc-modified-id="Look-at-the-data-format-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Look at the data format</a></span></li><li><span><a href="#Look-at-individual-column-values" data-toc-modified-id="Look-at-individual-column-values-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Look at individual column values</a></span></li><li><span><a href="#Convert-timestamps" data-toc-modified-id="Convert-timestamps-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Convert timestamps</a></span></li></ul></li><li><span><a href="#Format-and-clean-the-text" data-toc-modified-id="Format-and-clean-the-text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Format and clean the text</a></span><ul class="toc-item"><li><span><a href="#Filter-out-retweets-and-replace-urls" data-toc-modified-id="Filter-out-retweets-and-replace-urls-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Filter out retweets and replace urls</a></span></li><li><span><a href="#Tokenize-the-data" data-toc-modified-id="Tokenize-the-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Tokenize the data</a></span></li><li><span><a href="#Convert-to-lowercase" data-toc-modified-id="Convert-to-lowercase-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Convert to 
lowercase</a></span></li><li><span><a href="#Remove-punctuation" data-toc-modified-id="Remove-punctuation-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Remove punctuation</a></span></li><li><span><a href="#Remove-stopwords" data-toc-modified-id="Remove-stopwords-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Remove stopwords</a></span></li><li><span><a href="#Remove-numbers" data-toc-modified-id="Remove-numbers-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Remove numbers</a></span></li><li><span><a href="#Remove-extra-white-spaces" data-toc-modified-id="Remove-extra-white-spaces-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Remove extra white spaces</a></span></li><li><span><a href="#Stemming" data-toc-modified-id="Stemming-2.8"><span class="toc-item-num">2.8&nbsp;&nbsp;</span>Stemming</a></span></li></ul></li><li><span><a href="#Word-counting" data-toc-modified-id="Word-counting-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Word counting</a></span><ul class="toc-item"><li><span><a href="#Visualize-word-frequencies" data-toc-modified-id="Visualize-word-frequencies-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Visualize word frequencies</a></span></li><li><span><a href="#WordClouds" data-toc-modified-id="WordClouds-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>WordClouds</a></span></li><li><span><a href="#Bigrams-and-n-grams" data-toc-modified-id="Bigrams-and-n-grams-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Bigrams and n-grams</a></span></li><li><span><a href="#tf-idf:-Term-Frequency-Inverse-Document-Frequency" data-toc-modified-id="tf-idf:-Term-Frequency-Inverse-Document-Frequency-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>tf-idf: Term Frequency Inverse Document Frequency</a></span></li></ul></li><li><span><a href="#Dictionary-based-text-analysis" data-toc-modified-id="Dictionary-based-text-analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dictionary-based text 
analysis</a></span><ul class="toc-item"><li><span><a href="#Selecting-for-a-collection-of-words" data-toc-modified-id="Selecting-for-a-collection-of-words-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Selecting for a collection of words</a></span></li><li><span><a href="#Sentiment-analysis" data-toc-modified-id="Sentiment-analysis-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Sentiment analysis</a></span></li></ul></li></ul></div>

We will step through loading and cleaning a collection of Trump's tweets from 2017-2018 for text analysis. In this session, we will look at word counting and dictionary-based text analysis methods.

<img src="imgs/textanalysis_diagrams.001.png" alt="Sentiment analysis flow diagram" width="600"/>



- Setup
- Load the data
- Format and clean the text
- Word counting
    - Word clouds
    - Bigrams and n-grams
    - tf-idf: Term Frequency Inverse Document Frequency
- Dictionary-based text analysis
    - Custom dictionary
    - Sentiment analysis
___Some technical details:___

This session runs as a [Binder](https://mybinder.org/) instance of a [Jupyter notebook](https://jupyter.org/).

To run a code block, click the gray box and press shift+enter. After running a block, the cursor will automatically advance to the next block.

## Setup 
### Install packages
We only need to install packages once - they have already been installed here.
- *tidyverse*
    - *dplyr* for dataframe manipulation
    - *tidyr* for formatting into tidy data
    - *ggplot2* for plotting
    - *lubridate* for working with dates and times
- *tidytext* for getting text data into a tidy format
- *SnowballC* for getting word stems
- *stringr* for manipulating strings
- *wordcloud* for generating word clouds

To install them yourself, run the following in the R console:
```
install.packages("tidyverse")
install.packages("tidytext")
install.packages("SnowballC")
install.packages("stringr")
install.packages("wordcloud")
```


### Load packages

In [None]:
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tidytext)
library(SnowballC)
library(stringr)
library(wordcloud)

## Load the data


<img src="imgs/textanalysis_diagrams.002.png" alt="Sentiment analysis flow diagram" width="600"/>


We will look at Trump's tweets posted between 2017-02-05 and 2018-05-18, which have already been retrieved from the Twitter API using the `rtweet` package.
The data is in the same format that an `rtweet` API call returns.

In [None]:
load(url("https://cbail.github.io/Trump_Tweets.Rdata"))

### Look at the data format
Preview the data we loaded, which is named ```trumptweets```.

Change the second argument of `head(trumptweets, n)` to set how many rows are shown.

- `created_at` contains the timestamp of the tweet

- `text` contains the tweet

In [None]:
# preview trumptweets
head(trumptweets,5)

### Look at individual column values

In [None]:
# print column names
names(trumptweets)
# preview a few columns of interest
trumptweets %>%
  select(created_at, text, favorite_count, source) %>%
  head()

We can use `[tablename]$[columnname]` to select a column and apply different operations to it. Note that `nrow` and `ncol` take the whole table rather than a single column.
Some example operations include:

| Operation    | Description           |
|--------------|-----------------------|
| `min`        | minimum               |
| `max`        | maximum               |
| `nrow`       | # rows                |
| `ncol`       | # columns             |
| `unique`     | list of unique values |
| `n_distinct` | # unique values       |
| `mean`       | mean                  |
| `median`     | median                |
| `sd`         | standard deviation    |
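
As a quick illustration of these operations, here is a minimal sketch on a small made-up table. `toy_tweets` and its values are hypothetical stand-ins, not part of the Trump tweet data.

```r
library(dplyr)

# Hypothetical toy table standing in for trumptweets
toy_tweets <- data.frame(
  favorite_count = c(10, 25, 25, 40),
  source = c("iPhone", "Android", "iPhone", "iPhone"),
  stringsAsFactors = FALSE
)

nrow(toy_tweets)                  # number of rows in the whole table
ncol(toy_tweets)                  # number of columns in the whole table
max(toy_tweets$favorite_count)    # largest value in one column
mean(toy_tweets$favorite_count)   # average of one column
unique(toy_tweets$source)         # distinct values, in order of appearance
n_distinct(toy_tweets$source)     # how many distinct values there are
```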

In [None]:
# print summary information on individual columns
print(paste('# of rows: ', nrow(trumptweets)))
min(trumptweets$created_at)
max(trumptweets$created_at)
unique(trumptweets$country)

### Convert timestamps 
Converting `created_at` to a date-time type makes it easier to select tweets by a specific date or timestamp.

In [None]:
# convert timestamps to timestamp format
trumptweets$created_at <- ymd_hms(trumptweets$created_at)

## examples: 
# trumptweets[as.Date(trumptweets$created_at) == as.Date("2018-05-18"),]
# trumptweets[trumptweets$created_at == ymd_hms("2017-05-05 19:43:37"),]
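
Once timestamps are parsed, selecting by year or by date range becomes a simple comparison. The sketch below uses a hypothetical three-row table (`toy`) rather than the real data, and compares dates with dates by converting through `as.Date` first.

```r
library(lubridate)

# Hypothetical timestamps, stored as text the way they might arrive from a file
toy <- data.frame(
  created_at = c("2017-05-05 19:43:37", "2018-01-02 08:00:00", "2018-05-18 23:59:59"),
  stringsAsFactors = FALSE
)
toy$created_at <- ymd_hms(toy$created_at)  # parsed as UTC by default

# keep only rows from 2018
tweets_2018 <- toy[year(toy$created_at) == 2018, , drop = FALSE]

# keep rows whose calendar date falls inside a closed range
tweet_dates <- as.Date(toy$created_at)
in_range <- toy[tweet_dates >= as.Date("2018-01-01") &
                tweet_dates <= as.Date("2018-03-31"), , drop = FALSE]
```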

## Format and clean the text

<img src="imgs/textanalysis_diagrams.003.png" alt="Sentiment analysis flow diagram" width="600"/>


### Filter out retweets and replace urls

In [None]:
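# regex matching URLs, escaped HTML entities, and a standalone "RT"
# (note the doubled backslashes: a single "\b" in an R string is a backspace, not a word boundary)
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
# drop retweets, then replace every match with the placeholder "url"
trumptweets <- trumptweets %>%
  filter(is_retweet == FALSE) %>%
  mutate(text = str_replace_all(text, replace_reg, "url"))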
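
To see what this kind of replacement does, here is a small sketch on a made-up string. `toy_text` is hypothetical, and the pattern is written with every backslash doubled so that R passes `\b` (word boundary) and `\s` (whitespace) through to the regex engine.

```r
library(stringr)

# URLs, escaped HTML entities, and a standalone "RT"
pattern <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

# Hypothetical tweet text
toy_text <- "RT Jobs &amp; trade are up! https://t.co/abc123"

# every match is swapped for the placeholder token "url"
cleaned <- str_replace_all(toy_text, pattern, "url")
cleaned
```

Because every match becomes the same placeholder, a single stop word ("url") is enough to drop all of them later.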


### Tokenize the data
**Tokenization** - the way you define a unit of analysis (e.g. words, sequence of words, sentence)

**Document** - a unit of context (in this case - a single tweet)

**Tidy text format** - One row per token (word in this case) with column variables that have extra context (e.g. which tweet the word came from)

In [None]:
tidy_trump_tweets<- trumptweets %>%
    select(created_at,text) %>%
    unnest_tokens("word", text)

head(tidy_trump_tweets)
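
A minimal sketch of the one-token-per-row idea, using two made-up documents (`toy` is hypothetical). Each word ends up on its own row, and the `id` column records which document it came from.

```r
library(dplyr)
library(tidytext)

# Two hypothetical documents
toy <- tibble(id = 1:2,
              text = c("Great JOBS numbers!", "Trade deals, trade wins."))

# one row per word; the other columns are carried along
toy_tokens <- toy %>%
  unnest_tokens("word", text)
toy_tokens
```

Notice in the output that the words are lowercased and the punctuation is gone.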


### Convert to lowercase
Done automatically by `tidytext`: `unnest_tokens` converts tokens to lowercase by default (`to_lower = TRUE`).

### Remove punctuation
Done automatically by `tidytext`: the default word tokenizer used by `unnest_tokens` strips punctuation.

### Remove stopwords
Common words such as “the”, “and”, “for”, “is”, etc. are often described as “stop words,” meaning that they carry little meaning on their own and are usually excluded from a quantitative text analysis. The tidytext package has a list of common stop words called `stop_words` that we can use. There are also some tokens specific to tweets that we would like to filter out, for example "url" (the placeholder we substituted for links), "rt" for retweets, "t.co" for Twitter's link shortener, and "amp", which is what remains of the HTML entity `&amp;` (an escaped ampersand) after tokenization.

Note: `tidytext` automatically converted all words to lowercase, and "UN" is one of the stopwords that was removed. This would likely be an important word to keep, but we'll let it slide for now. 

In [None]:
# load stop_words from tidytext package and remove from tidy_trump_tweets

#load stop_words
data("stop_words")
# add a few more stop words
custom_stop_words <- stop_words %>%
    bind_rows(tibble(word = c("url","rt","t.co","amp"),
                     lexicon = "custom"))

# remove stopwords and other insignificant words from tidy_trump_tweets
# (naming the join column explicitly avoids the "Joining, by" message)
tidy_trump_tweets <-
   tidy_trump_tweets %>%
      anti_join(custom_stop_words, by = "word")
head(tidy_trump_tweets)
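
The filtering step is an anti-join: keep every row of the left table that has no match in the right one. A minimal sketch with a made-up stop list follows; `toy_stop_words` and `toy_tokens` are hypothetical and much smaller than the real `stop_words`.

```r
library(dplyr)

# Hypothetical mini stop list, extended with a custom entry
toy_stop_words <- tibble(word = c("the", "is"), lexicon = "toy") %>%
  bind_rows(tibble(word = "url", lexicon = "custom"))

# Hypothetical tokens in tidy format
toy_tokens <- tibble(word = c("the", "economy", "is", "url", "booming"))

# rows whose word appears in the stop list are dropped
toy_kept <- toy_tokens %>%
  anti_join(toy_stop_words, by = "word")
toy_kept
```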

### Remove numbers

In [None]:
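# remove tokens that are numbers
# (a logical mask is used instead of -grep(), which would drop every row if nothing matched)
tidy_trump_tweets <- tidy_trump_tweets[!grepl("\\b\\d+\\b", tidy_trump_tweets$word),]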
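
One thing to keep in mind when dropping rows by pattern: negative indexing with `grep` selects zero rows when nothing matches, whereas a logical mask from `grepl` is safe. The sketch below shows the difference on a made-up table (`words` is hypothetical) that contains no numbers at all.

```r
# Hypothetical tokens, none of which is a number
words <- data.frame(word = c("jobs", "trade", "tariffs"), stringsAsFactors = FALSE)

# grep returns the positions of matches; here there are none
hits <- grep("\\b\\d+\\b", words$word)

# -integer(0) is still integer(0), so this selects no rows at all
dropped_all <- words[-hits, , drop = FALSE]

# grepl returns TRUE/FALSE per row, so negating it keeps every non-match
kept <- words[!grepl("\\b\\d+\\b", words$word), , drop = FALSE]

nrow(dropped_all)
nrow(kept)
```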

### Remove extra white spaces

In [None]:
# remove extra white spaces
tidy_trump_tweets$word <- gsub("\\s+","",tidy_trump_tweets$word)

### Stemming

In [None]:
# get word stems (mutate replaces the superseded mutate_at)
tidy_trump_tweets_stemmed <- tidy_trump_tweets %>%
      mutate(word = wordStem(word, language = "en"))
head(tidy_trump_tweets_stemmed)
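
Stemming collapses different inflections of a word onto a shared stem so they are counted together. A minimal sketch on a few made-up words:

```r
library(SnowballC)

# Hypothetical inputs: two inflections each of "run" and "tweet"
toy_words <- c("running", "runs", "tweets", "tweeting")

# wordStem works on a character vector, one stem per input word
stems <- wordStem(toy_words, language = "en")
stems
```

Stems are not always dictionary words, so it is usually worth inspecting a sample before counting on them.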

## Word counting 
Count the most commonly used words across tweets and plot them 

<img src="imgs/textanalysis_diagrams.004.png" alt="Sentiment analysis flow diagram" width="600"/>


In [None]:
# count word frequencies and sort in descending order
top_words<-
   tidy_trump_tweets %>%
    count(word) %>%
        arrange(desc(n))
head(top_words)
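
For a sense of what `count` produces, here is a small sketch on made-up tokens (`toy_words` is hypothetical). Passing `sort = TRUE` to `count` is a shorthand for counting and then arranging in descending order.

```r
library(dplyr)

# Hypothetical tokens in tidy format
toy_words <- tibble(word = c("jobs", "trade", "jobs", "wall", "jobs", "trade"))

# one row per distinct word, with its frequency in column n, most frequent first
toy_counts <- toy_words %>%
  count(word, sort = TRUE)
toy_counts
```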

### Visualize word frequencies
<img src="imgs/textanalysis_diagrams.005.png" alt="Sentiment analysis flow diagram" width="600"/>

Plot bar charts of word frequencies using `ggplot` from the `ggplot2` package.

In [None]:
# plot the 20 most frequently used words
plot_frequent_words <- function(word_counts) {
    word_counts %>%
        ggplot(aes(x=n, y=reorder(word, n), fill=-n))+
          geom_bar(stat="identity")+
            theme_minimal()+
            theme(axis.text.x = element_text(angle = 60, hjust = 1, size=15),
                  axis.text.y = element_text(hjust = 1, size=15),
                  axis.title = element_text(size=15),
                  plot.title = element_text(hjust = 0.5, size=18))+
                ylab("Word")+
                xlab("# Occurrences")+
                ggtitle("Most Frequent Words in Trump Tweets")+
                guides(fill="none")
}

top_words %>%
  slice(1:20) %>%
    plot_frequent_words()

### WordClouds
Create wordclouds for qualitative insights using the `wordcloud` function from the `wordcloud` package
- `min.freq`: words with frequency below min.freq will not be plotted
- `max.words`: Maximum number of words to be plotted. Least frequent terms are dropped
- `random.order`: plot words in random order. If false, they will be plotted in decreasing frequency
- `rot.per`: proportion of words rotated by 90 degrees
- `colors`: color words from least to most frequent
    - choose other color themes from [RColorBrewer](https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html)

In [None]:
# generate a wordcloud 
set.seed(1234) # for reproducibility 
wordcloud(words = top_words$word, freq = top_words$n, min.freq = 1,  
          max.words=200, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))

### Bigrams and n-grams
Unigrams are single words, bigrams are two-word phrases, and n-grams are n-word phrases.

We can change how the tweets are tokenized to analyze n-grams.

In [None]:
tidy_bigrams <-trumptweets %>%
    select(created_at,text) %>%
        unnest_tokens(output=word, input=text, token = "ngrams", n = 2) %>% 
          separate(word, c("word1", "word2"), sep = " ") %>% 
              filter(!word1 %in% custom_stop_words$word) %>%
              filter(!word2 %in% custom_stop_words$word) %>% 
                  unite(word,word1, word2, sep = " ")

top_bigrams <- tidy_bigrams %>%
    count(word) %>%
        arrange(desc(n))

top_bigrams %>%
  slice(1:20) %>%
    plot_frequent_words()
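
To see how bigram tokenization slides over a text, here is a minimal sketch on one made-up sentence (`toy` is hypothetical). A sentence of four words yields three overlapping bigrams.

```r
library(dplyr)
library(tidytext)

# One hypothetical document
toy <- tibble(text = "Make America great again")

# token = "ngrams" with n = 2 emits every pair of adjacent words
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
toy_bigrams
```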

### tf-idf: Term Frequency Inverse Document Frequency
A statistic for how important a word is to a document in a collection

Words that occur more frequently in one document (tweet) and less frequently in other documents should be given more importance as they are more useful for classification.

***Term frequency:***

$tf(term)=\displaystyle\frac{n_{occurrences\ of\ term\ in\ document}}{n_{words\ in\ document}}$

***Inverse document frequency:***

$idf(term)=\displaystyle \log\left(\frac{n_{documents}}{n_{documents\ containing\ term}}\right)$

***tf-idf:***

$tfidf(term)=tf(term)\times idf(term)$
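
A small worked example may help connect the formulas to `bind_tf_idf`. The sketch below uses two made-up documents (`toy_counts` is hypothetical); the `tf_idf` column is the product of `tf` and `idf`, and the logarithm is the natural log.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts: one row per (document, word) pair
toy_counts <- tibble(
  doc  = c("tweet1", "tweet1", "tweet2", "tweet2"),
  word = c("jobs", "trade", "trade", "deal"),
  n    = c(2, 1, 1, 1)
)

# arguments are: term column, document column, count column
toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, doc, n)
toy_tfidf

# "jobs" in tweet1: tf = 2/3, idf = log(2/1), so tf-idf = (2/3) * log(2)
# "trade" appears in both documents: idf = log(2/2) = 0, so tf-idf = 0
```

A word that appears in every document gets a score of zero however often it occurs, which is what makes tf-idf useful for finding words distinctive to a document.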

In [None]:
tidy_trump_tfidf <- tidy_trump_tweets %>%
    count(word, created_at) %>%
        bind_tf_idf(word, created_at, n) %>%
            arrange(desc(tf_idf))

In [None]:
set.seed(1234) # for reproducibility 
tidy_trump_tfidf_unique <- tidy_trump_tfidf %>%
    distinct(word,.keep_all = TRUE)
wordcloud(words = tidy_trump_tfidf_unique$word, freq = tidy_trump_tfidf_unique$tf_idf, min.freq = .5,  
          max.words=100, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"),scale=c(2,.5) )

## Dictionary-based text analysis

<img src="imgs/textanalysis_diagrams.006.png" alt="Sentiment analysis flow diagram" width="600"/>


### Selecting for a collection of words
We can create a custom list of words and find tweets that contain any of those words

In [None]:
custom_dictionary<-c("economy","unemployment","trade","tariffs","jobs")

In [None]:
custom_dictionary_tweets<-trumptweets[str_detect(trumptweets$text, 
                                                regex(paste(custom_dictionary, collapse="|"),
                                                      ignore_case=TRUE)),]
custom_dictionary_tweets
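
The matching step joins the dictionary words into one regular expression with `|` (meaning "or") and tests each text against it. A minimal sketch on made-up texts follows; `toy_dictionary` and `toy_text` are hypothetical.

```r
library(stringr)

# Hypothetical shortened dictionary and texts
toy_dictionary <- c("economy", "jobs")
toy_text <- c("JOBS are back", "Great rally tonight", "The economy is booming")

# "economy|jobs", matched without regard to case
toy_pattern <- regex(paste(toy_dictionary, collapse = "|"), ignore_case = TRUE)

# TRUE for every text containing at least one dictionary word
toy_matches <- str_detect(toy_text, toy_pattern)
toy_matches
```

This matches substrings, so a dictionary word such as "trade" would also match "traded". If only whole words should count, one option is to wrap the alternatives in word boundaries, for example `paste0("\\b(", paste(toy_dictionary, collapse = "|"), ")\\b")`.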

In [None]:
# plot a wordcloud for our tweets that match our custom dictionary
custom_top_words<-custom_dictionary_tweets %>%
    select(created_at,text) %>%
      unnest_tokens("word", text) %>%
        anti_join(stop_words, by = "word") %>%
            filter(!word %in% c("https", "rt", "t.co", "amp", "url")) %>%
            count(word) %>%
                arrange(desc(n))
set.seed(1234) # for reproducibility 
wordcloud(words = custom_top_words$word, freq = custom_top_words$n, min.freq = 1,  
          max.words=100, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"),scale=c(4,1))

### Sentiment analysis

One popular type of dictionary is a sentiment dictionary which can be used to assess the valence of a given text by searching for words that describe affect or opinion. 

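`tidytext` gives access to a few sentiment dictionaries through `get_sentiments()` (in recent versions of `tidytext`, `afinn` and `nrc` are downloaded on first use through the `textdata` package)
- `afinn` - sentiment words in Twitter discussions of climate change (value between -5 and 5)
- `bing` - sentiment words identified in online forums (negative vs positive)
- `nrc` - emotional valence words labeled by MTurk workers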

Let's use `nrc`. 
- Words in this dictionary are labeled with the sentiments: "negative","positive","trust","fear","sadness","anger", "surprise","disgust","joy","anticipation"
- Each word can be associated with multiple sentiments

In [None]:
# for nrc dictionary, preview the first few word-sentiment pairs
head(get_sentiments("nrc"))


In [None]:
# count the number of sentiment words for each tweet
# (a word with several sentiments contributes one row per sentiment)
trump_tweet_sentiment <- tidy_trump_tweets %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
    count(created_at, sentiment)
head(trump_tweet_sentiment)
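
The join-then-count pattern can be seen on a tiny made-up lexicon. `toy_lexicon` and `toy_tokens` below are hypothetical and stand in for the real dictionary and tokenized tweets. Words missing from the lexicon are dropped by the inner join, and a word with two sentiments is counted once under each.

```r
library(dplyr)

# Hypothetical lexicon: "great" carries two sentiments
toy_lexicon <- tibble(word = c("great", "great", "disaster"),
                      sentiment = c("positive", "joy", "negative"))

# Hypothetical tokens from two documents
toy_tokens <- tibble(id = c(1, 1, 2),
                     word = c("great", "jobs", "disaster"))

# keep only words found in the lexicon, then count per document and sentiment
toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(id, sentiment)
toy_sentiment
```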


In [None]:
# reshape to one row per tweet, with one column per sentiment
trump_sentiment_spread <-trump_tweet_sentiment %>%
  spread(sentiment, n, fill=0)

trump_sentiment_full <- merge(trump_sentiment_spread, 
                              trumptweets[c("created_at","text","favorite_count")], 
                              by="created_at")
head(trump_sentiment_full)

In [None]:
# print each distinct sentiment label in the nrc dictionary
sentiments <- get_sentiments("nrc") %>% distinct(sentiment) %>% pull(sentiment)
for (s in sentiments)
{
    print(s)
}

In [None]:
# regress favorite counts on the per-tweet sentiment word counts
model1 <-trump_sentiment_full %>%
  lm(data=., favorite_count ~ disgust + negative + joy + anticipation + positive + sadness + trust + fear)
summary(model1)