# Text Analysis from SICSS

- Prepare the data
    - Setup
    - Load the data
    - Format and clean the text
- Word counting
    - wordclouds 
    - tf-idf
- Dictionary based methods
    - Topic-based dictionary
    - Sentiment analysis 

## Prepare the data

### Setup 
#### Install packages
We only need to install packages once - they have already been installed here.
- *tidyverse*
    - *ggplot2* for plotting
    - *dplyr* for dataframe manipulation
    - *tidyr* for getting to tidy data
    - *lubridate* for working with dates and times
- *tidytext* for getting text data into a tidy format
- *SnowballC* for getting word stems
- *wordcloud* for generating word clouds

In the R console, run:
```
install.packages("tidyverse")
install.packages("tidytext")
install.packages("SnowballC")
install.packages("wordcloud")
```


#### Load packages

In [6]:
library(tidytext)
library(dplyr)
library(ggplot2)
library(SnowballC)
library(stringr)
library(lubridate)
library(wordcloud)

### Load the data
We will look at Trump's tweets collected between 2017-02-05 and 2018-05-18. 

The data is in the format returned by an API call made with the *rtweet* package.

In [7]:
load(url("https://cbail.github.io/Trump_Tweets.Rdata"))

## if the file is not available, download a local copy
# load(file = "trumptweets.Rdata")


#### Look at the data format
Preview the data we loaded, which is named ```trumptweets```.

Change the second argument of ```head(trumptweets, n)``` to set the number of rows you want to see.

- `created_at` contains the timestamp of the tweet

- `text` contains the tweet

In [8]:
# preview trumptweets
head(trumptweets,5)

status_id,created_at,user_id,screen_name,text,source,reply_to_status_id,reply_to_user_id,reply_to_screen_name,is_quote,⋯,retweet_text,place_url,place_name,place_full_name,place_type,country,country_code,geo_coords,coords_coords,bbox_coords
<chr>,<dttm>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<list>,<list>,<list>
997577906007298048,2018-05-18 20:41:21,25073877,realDonaldTrump,"Just met with UN Secretary-General António Guterres who is working hard to “Make the United Nations Great Again.” When the UN does more to solve conflicts around the world, it means the U.S. has less to do and we save money. @NikkiHaley is doing a fantastic job! https://t.co/pqUv6cyH2z",Twitter for iPhone,,,,False,⋯,,,,,,,,"NA, NA","NA, NA","NA, NA, NA, NA, NA, NA, NA, NA"
997573139663028224,2018-05-18 20:22:25,25073877,realDonaldTrump,America is a Nation that believes in the power of redemption. America is a Nation that believes in second chances - and America is a Nation that believes that the best is always yet to come! #PrisonReform https://t.co/Yk5UJUYgHN,Twitter for iPhone,,,,False,⋯,,,,,,,,"NA, NA","NA, NA","NA, NA, NA, NA, NA, NA, NA, NA"
997568208369577985,2018-05-18 20:02:49,25073877,realDonaldTrump,RT @SteveForbesCEO: .@realDonaldTrump speech on drug costs pays immediate dividends. New @Amgen drug lists at 30% less than expected. Middl…,Twitter for iPhone,,,,False,⋯,".@realDonaldTrump speech on drug costs pays immediate dividends. New @Amgen drug lists at 30% less than expected. Middlemen like Pharmacy Benefit Managers, insurers &amp; hospitals would do well by passing discounts on to patients. @SecAzar @SGottliebFDA https://t.co/mfRQ5COtev",,,,,,,"NA, NA","NA, NA","NA, NA, NA, NA, NA, NA, NA, NA"
997515759281680385,2018-05-18 16:34:24,25073877,realDonaldTrump,"We grieve for the terrible loss of life, and send our support and love to everyone affected by this horrible attack in Texas. To the students, families, teachers and personnel at Santa Fe High School – we are with you in this tragic hour, and we will be with you forever... https://t.co/LtJ0D29Hsv",Twitter for iPhone,,,,False,⋯,,,,,,,,"NA, NA","NA, NA","NA, NA, NA, NA, NA, NA, NA, NA"
997493407097524224,2018-05-18 15:05:35,25073877,realDonaldTrump,School shooting in Texas. Early reports not looking good. God bless all!,Twitter for iPhone,,,,False,⋯,,,,,,,,"NA, NA","NA, NA","NA, NA, NA, NA, NA, NA, NA, NA"


#### Look at individual column values

In [9]:
# print column names
names(trumptweets)

# preview a few columns of interest
trumptweets %>%
  select(created_at, text, favorite_count, source) %>%
    head()

created_at,text,favorite_count,source
<dttm>,<chr>,<int>,<chr>
2018-05-18 20:41:21,"Just met with UN Secretary-General António Guterres who is working hard to “Make the United Nations Great Again.” When the UN does more to solve conflicts around the world, it means the U.S. has less to do and we save money. @NikkiHaley is doing a fantastic job! https://t.co/pqUv6cyH2z",4550,Twitter for iPhone
2018-05-18 20:22:25,America is a Nation that believes in the power of redemption. America is a Nation that believes in second chances - and America is a Nation that believes that the best is always yet to come! #PrisonReform https://t.co/Yk5UJUYgHN,10450,Twitter for iPhone
2018-05-18 20:02:49,RT @SteveForbesCEO: .@realDonaldTrump speech on drug costs pays immediate dividends. New @Amgen drug lists at 30% less than expected. Middl…,0,Twitter for iPhone
2018-05-18 16:34:24,"We grieve for the terrible loss of life, and send our support and love to everyone affected by this horrible attack in Texas. To the students, families, teachers and personnel at Santa Fe High School – we are with you in this tragic hour, and we will be with you forever... https://t.co/LtJ0D29Hsv",40709,Twitter for iPhone
2018-05-18 15:05:35,School shooting in Texas. Early reports not looking good. God bless all!,66378,Twitter for iPhone
2018-05-18 13:50:11,"Reports are there was indeed at least one FBI representative implanted, for political purposes, into my campaign for president. It took place very early on, and long before the phony Russia Hoax became a “hot” Fake News story. If true - all time biggest political scandal!",55306,Twitter for iPhone


We can use ```[tablename]$[columnname]``` to select a column and perform different operations on it (```nrow``` and ```ncol``` take the whole table instead).
Some example operations include:

```min```, ```max```, 

```nrow```, ```ncol```,

```unique```, ```n_distinct```,

```mean```,```median```, ```sd```
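
As a minimal sketch of these operations, the snippet below runs them on a small made-up data frame. The name `toy` and its values are assumptions for illustration only, not part of the tweet data.

```r
library(dplyr)

# Hypothetical toy data, standing in for a few columns of the real table
toy <- data.frame(
  source = c("iPhone", "iPhone", "Android", "Web"),
  favorite_count = c(10, 20, 30, 40)
)

nrow(toy)                   # number of rows (takes the whole table)
ncol(toy)                   # number of columns (takes the whole table)
unique(toy$source)          # the distinct values in a column
n_distinct(toy$source)      # how many distinct values there are
min(toy$favorite_count)     # smallest value
max(toy$favorite_count)     # largest value
mean(toy$favorite_count)    # average
median(toy$favorite_count)  # middle value
sd(toy$favorite_count)      # standard deviation
```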

In [None]:
# print summary information on individual columns
print(paste('# of rows: ', nrow(trumptweets)))
min(trumptweets$created_at)
max(trumptweets$created_at)
unique(trumptweets$country)

#### Convert timestamps 
This will make it easier to select tweets by a specific date or timestamp

In [None]:
# parse created_at as a date-time (POSIXct) with lubridate
trumptweets$created_at <- ymd_hms(trumptweets$created_at)

## examples: 
# trumptweets[as.Date(trumptweets$created_at) == as.Date("2018-05-18"),]
# trumptweets[trumptweets$created_at == ymd_hms("2018-05-18 15:05:35"),]
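
The kind of selection this enables can be sketched on a small made-up table. The name `toy_times` and its rows are assumptions for illustration; `ymd_hms` parses to UTC unless a time zone is given.

```r
library(lubridate)

# Hypothetical toy data with timestamps stored as text
toy_times <- data.frame(
  created_at = c("2018-05-18 15:05:35", "2018-05-18 20:41:21", "2018-05-17 09:00:00"),
  text = c("first", "second", "third"),
  stringsAsFactors = FALSE
)

# Parse the text into date-time values (UTC by default)
toy_times$created_at <- ymd_hms(toy_times$created_at)

# All rows from one calendar day
same_day <- toy_times[as.Date(toy_times$created_at) == as.Date("2018-05-18"), ]

# All rows inside a time window on that day
afternoon <- toy_times[toy_times$created_at >= ymd_hms("2018-05-18 12:00:00") &
                       toy_times$created_at <  ymd_hms("2018-05-18 18:00:00"), ]
```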

### Format and clean the text


#### Filter out retweets and replace urls

In [None]:
# regex matching urls, HTML entities (&amp; &lt; &gt;), and the standalone token RT
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
trumptweets <- trumptweets %>%
  filter(is_retweet == FALSE) %>%
  mutate(text = str_replace_all(text, replace_reg, "url"))
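
To see what the replacement does, here is a small sketch on two made-up strings. It assumes the word-boundary form `\\bRT\\b` is what is intended, so that `RT` is only matched as a standalone token; `toy_text` and `pattern` are hypothetical names.

```r
library(stringr)

# Same alternatives as the cleaning regex: urls, HTML entities, standalone RT
pattern <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

# Hypothetical example strings
toy_text <- c("RT @someone: jobs &amp; trade https://t.co/abc123",
              "SMART trade policy")

# Every match is replaced by the placeholder "url";
# the RT inside SMART is left alone because it is not a whole word
cleaned <- str_replace_all(toy_text, pattern, "url")
cleaned
```

Because every match becomes the same placeholder, a single later filter on the token `url` is enough to drop all of them.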


#### Tokenize the data
**Tokenization** - the way you define a unit of analysis (e.g. words, sequence of words, sentence)

**Document** - a unit of context (in this case - a single tweet)

**Tidy text format** - One row per token (word in this case) with column variables that have extra context (e.g. which tweet the word came from)
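
A minimal sketch of how the choice of token changes the tidy table, using two made-up documents. The names `toy_tweets`, `words`, and `bigrams` are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Two hypothetical documents, each identified by an id column
toy_tweets <- tibble(
  created_at = c("tweet_1", "tweet_2"),
  text = c("jobs jobs jobs", "great trade deal")
)

# One row per word; the id column records which document each word came from
words <- toy_tweets %>%
  unnest_tokens("word", text)

# One row per pair of consecutive words
bigrams <- toy_tweets %>%
  unnest_tokens("bigram", text, token = "ngrams", n = 2)

words
bigrams
```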

In [None]:
tidy_trump_tweets<- trumptweets %>%
    select(created_at,text) %>%
    unnest_tokens("word", text)
head(tidy_trump_tweets)


#### Remove stopwords
Common words such as “the”, “and”, “but”, “for”, “is”, etc. are often described as “stop words.” They carry little meaning on their own, so they are usually excluded from a quantitative text analysis. The tidytext package has a list of common stop words that we can use.
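
As a small sketch, the removal is an anti-join against the `stop_words` table, which has a `word` column and a `lexicon` column naming the source list. The tokens in `toy_words` are made up for illustration.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Hypothetical tokens from one short sentence
toy_words <- tibble(word = c("the", "economy", "is", "and", "tariffs"))

# Keep only the rows whose word does not appear in the stop word list
kept <- toy_words %>%
  anti_join(stop_words, by = "word")
kept

# Optionally restrict the list to a single lexicon before joining
snowball_only <- stop_words %>%
  filter(lexicon == "snowball")
```

Restricting to one lexicon gives a shorter list, which can matter when a word of interest happens to be in one of the larger lists.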

In [None]:
# load stop_words from tidytext package and remove from tidy_trump_tweets

#load stop_words
data("stop_words")

# remove stopwords and other insignificant words from tidy_trump_tweets
tidy_trump_tweets <-
   tidy_trump_tweets %>%
      anti_join(stop_words, by = "word") %>%
        filter(!word %in% c("https", "rt", "t.co", "amp", "url"))
head(tidy_trump_tweets)

#### Remove punctuation
Done automatically by `unnest_tokens` in `tidytext`

#### Convert to lowercase
Done automatically by `unnest_tokens` in `tidytext` (default `to_lower = TRUE`)
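
A quick way to check both defaults is sketched below on a made-up sentence. The names `toy`, `default_tokens`, and `cased_tokens` are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row table
toy <- tibble(text = "Hello, World!")

# Default behaviour: tokens are lowercased and punctuation is dropped
default_tokens <- toy %>%
  unnest_tokens("word", text)

# Keep the original case by switching off to_lower
cased_tokens <- toy %>%
  unnest_tokens("word", text, to_lower = FALSE)

default_tokens$word
cased_tokens$word
```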

#### Remove extra white spaces

In [None]:
# remove extra white spaces
tidy_trump_tweets$word <- gsub("\\s+","",tidy_trump_tweets$word)

#### Stemming

In [None]:
# get word stems
tidy_trump_tweets_stemmed<-tidy_trump_tweets %>%
      mutate(word = wordStem(word, language = "en"))
head(tidy_trump_tweets_stemmed)
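
To get a feel for what stemming does, the sketch below applies `wordStem` to a handful of made-up words. Stems are not always dictionary words, so it is worth inspecting the output before counting on it.

```r
library(SnowballC)

# Hypothetical example words; inflected forms should collapse to a shared stem
example_words <- c("jobs", "job", "trading", "trades", "tariffs")
stems <- wordStem(example_words, language = "en")

# Show each word next to its stem
data.frame(word = example_words, stem = stems)
```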

## Word counting 
Count the most commonly used words across tweets and plot them 

In [None]:
# count word frequencies and sort in descending order
top_words<-
   tidy_trump_tweets %>%
    count(word) %>%
        arrange(desc(n))

In [None]:
# plot the top 20 most frequently used words
top_words %>%
  slice(1:20) %>%
    ggplot(aes(x=reorder(word, -n), y=n, fill=word))+
      geom_bar(stat="identity")+
        theme_minimal()+
        theme(axis.text.x = 
            element_text(angle = 60, hjust = 1, size=13))+
        theme(plot.title = 
            element_text(hjust = 0.5, size=18))+
          ylab("Frequency")+
          xlab("")+
          ggtitle("Most Frequent Words in Trump Tweets")+
          guides(fill="none")

### WordCloud

In [None]:
# generate a wordcloud 
set.seed(1234) # for reproducibility 
wordcloud(words = top_words$word, freq = top_words$n, min.freq = 1,  
          max.words=200, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))

### tf-idf: Term Frequency Inverse Document Frequency
tf-idf gives more weight to a term that occurs in fewer documents.
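
A minimal worked sketch on made-up counts may help here. As computed by `bind_tf_idf`, tf is a term's count divided by the total count in its document, idf is the natural log of (number of documents / number of documents containing the term), and tf-idf is their product. The table `toy_counts` is an assumption for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: two documents with two words each
toy_counts <- tibble(
  doc  = c("t1", "t1", "t2", "t2"),
  word = c("trade", "jobs", "trade", "tariffs"),
  n    = c(1, 1, 1, 1)
)

# Adds the columns tf, idf and tf_idf
toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))
toy_tfidf

# "trade" appears in both documents: idf = log(2 / 2) = 0, so tf_idf = 0
# "jobs" appears in one document:    idf = log(2 / 1),    so tf_idf = 0.5 * log(2)
```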

In [None]:
tidy_trump_tfidf <- tidy_trump_tweets %>%
    count(word, created_at) %>%
        bind_tf_idf(word, created_at, n) %>%
            arrange(desc(tf_idf))

In [None]:
head(tidy_trump_tfidf)

In [None]:
trumptweets[trumptweets$created_at == ymd_hms("2017-06-15 23:49:24"),]$hashtags

## Dictionary-based text analysis

### Selecting for a collection of words

In [None]:
topic_dictionary<-c("economy","unemployment","trade","tariffs","jobs")

In [None]:
topic_dictionary_tweets<-trumptweets[str_detect(trumptweets$text, paste(topic_dictionary, collapse="|")),]
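
Joining the dictionary with `|` gives a case-sensitive substring match. The sketch below shows what that implies on a few made-up strings, next to a stricter variant that matches whole words and ignores case. The variant is an assumption about what may be wanted, not part of the original analysis; `toy_text` and `bounded` are hypothetical names.

```r
library(stringr)

topic_dictionary <- c("economy", "unemployment", "trade", "tariffs", "jobs")

# Hypothetical example strings
toy_text <- c("JOBS, JOBS, JOBS!",
              "A great trade deal",
              "traders met today",
              "Nothing relevant here")

# Plain alternation: case-sensitive, matches anywhere inside a word
pattern <- paste(topic_dictionary, collapse = "|")
plain_match <- str_detect(toy_text, pattern)

# Variant: whole words only, ignoring case
bounded <- regex(paste0("\\b(", pattern, ")\\b"), ignore_case = TRUE)
bounded_match <- str_detect(toy_text, bounded)

data.frame(text = toy_text, plain_match, bounded_match)
```

The plain pattern misses the upper-case "JOBS" and picks up "traders"; the bounded variant does the opposite. Which behaviour is preferable depends on the dictionary.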

In [None]:
topic_dictionary_tweets

In [None]:
topic_top_words<-topic_dictionary_tweets %>%
    select(created_at,text) %>%
      unnest_tokens("word", text) %>%
        anti_join(stop_words, by = "word") %>%
            filter(!word %in% c("https", "rt", "t.co", "amp", "url")) %>%
            count(word) %>%
                arrange(desc(n))

In [None]:
set.seed(1234) # for reproducibility 
wordcloud(words = topic_top_words$word, freq = topic_top_words$n, min.freq = 1,  
          max.words=200, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))