# Lab 3 - Word Clouds

In this lab notebook, we will learn how to create word clouds. Word clouds give an at-a-glance impression of a body of text: we instantly see which words are used the most. By using **size as the visual channel** to represent word frequency, we rely on human visual perception to compare the relative frequencies of the words at a glance. First, we will create a simple word cloud. Later, in the practice, we'll see how to create interactive visualizations with word clouds.

We will use the ```tm``` text mining library and the ```SnowballC``` word stemming library to process text.

First, we need to read our text data from the file and then convert it into a **corpus** to process it. **Run the following code cells in the notebook.**

In [None]:
library("tm") # text mining library
library("SnowballC") # word stemmer
library("wordcloud") # wordcloud vis
library("RColorBrewer") # color palettes

In [None]:
# Read the text file 
filePath <- "HuckFinn_chapters1-3.txt"
text <- readLines(filePath)

In [None]:
# Turn text into a corpus
docs <- Corpus(VectorSource(text))
# Look at the corpus
inspect(docs)

Above, you can see the contents of the corpus. 
We need to process this text before we create our word cloud. 
We need to normalize capitalization and remove things that are not useful, such as extra white space, punctuation, and stopwords.

**Stopwords** are the most common words in a language, and they are usually filtered out before text mining (words such as an, are, as, for, the, etc.).
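
As a quick illustration of stopword removal, the sketch below applies ```removeWords``` from ```tm``` to a single sentence. The sentence is made up for illustration and is not taken from the lab text. Note that the removed words leave gaps behind, which is why white space is cleaned up as a separate step.

```r
library("tm") # provides stopwords(), removeWords(), stripWhitespace()

# A made-up sentence, used only for illustration
sentence <- "the cat sat on the mat"

# Peek at a few entries of the built-in English stopword list
print(head(stopwords("english")))

# Remove the stopwords; the removed words leave extra spaces behind
no_stop <- removeWords(sentence, stopwords("english"))

# Collapse the repeated spaces and trim both ends
cleaned <- trimws(stripWhitespace(no_stop))
print(cleaned)
```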

**Text stemming** is the process of reducing inflected or derived words to their word stem or root form, such as thinking -> think, cats -> cat, connected -> connect. Text stemming does not always produce good results, so whether to use it should be decided on a case-by-case basis.
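
The examples above can be tried directly with ```wordStem``` from ```SnowballC```. This is a minimal sketch, assuming the English Snowball stemmer. The fourth word is our own addition, included to show how a stem can end up as a non-word.

```r
library("SnowballC") # provides wordStem()

# The example words from the text above, plus one extra word of our own
words <- c("thinking", "cats", "connected", "happy")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```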

Let's see how we do these in R: 

In [None]:
# Some preprocessing first

# Convert to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation (before stemming, so attached commas and periods
# don't get in the stemmer's way)
docs <- tm_map(docs, removePunctuation)
# Text stemming
docs <- tm_map(docs, stemDocument)
# Eliminate extra white space left behind by the removals
docs <- tm_map(docs, stripWhitespace)

# Inspect again
inspect(docs)


As you can see above, the text is filtered and ready for mining. As a result of stemming, some words have been converted to non-words, while others were left unstemmed. We can visualize the text both with and without stemming to see if it makes any meaningful difference.

Before we can create the word cloud, we first need to create a **frequency table** for the words. This table holds the count of each word in the corpus. We use ```TermDocumentMatrix``` for that.
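
To see what a term-document matrix looks like before building one for the whole text, here is a minimal sketch on a two-document toy corpus. The documents and the ```toy_*``` names are made up for illustration.

```r
library("tm")

# Two tiny made-up documents
toy_docs <- Corpus(VectorSource(c("river raft river", "raft tom river")))

# Rows are terms, columns are documents, cells are counts
toy_tdm <- TermDocumentMatrix(toy_docs)
toy_m <- as.matrix(toy_tdm)
print(toy_m)

# Summing each row gives the total count of a term across all documents
toy_freq <- sort(rowSums(toy_m), decreasing = TRUE)
print(toy_freq)
```

The same two steps, ```as.matrix``` followed by ```rowSums```, are what turn the matrix into word counts for the full corpus.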

In [None]:
# Create frequencies of the words 
dtm <- TermDocumentMatrix(docs)

# Convert it to a matrix and sort by frequency (most frequent first)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
# Convert to a data frame 
d <- data.frame(word = names(v), freq=v)
# Print the data frame 
head(d, 30)

**Now we can use ```wordcloud``` to visualize these words.** Look at the comments in the code to see the options of ```wordcloud```.
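
Before running the cell, it may help to picture what the ```min.freq``` and ```max.words``` options do. The sketch below is a rough approximation in base R on a made-up frequency table. It is not the package's actual code, and the handling of ties may differ.

```r
# A made-up frequency table, already sorted by frequency
toy <- data.frame(
  word = c("river", "raft", "tom", "jim", "widow"),
  freq = c(12, 7, 5, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical settings, chosen small so the effect is visible
min_freq <- 5
max_words <- 3

# Keep only words that occur at least min_freq times ...
kept <- toy[toy$freq >= min_freq, ]
# ... and of those, at most the max_words most frequent ones
kept <- head(kept[order(-kept$freq), ], max_words)
print(kept)
```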

In [None]:
# Word cloud visualization 
# Fix the random seed so that the layout is reproducible 
set.seed(1234)
# Inputs are the words, their frequencies, the minimum frequency a word needs to be shown,
# and the maximum number of words to show. 
# random.order = FALSE plots the words in decreasing order of frequency. 
# rot.per is the proportion of words that are rotated; we don't want any, so it's zero. 
wordcloud(words = d$word, freq = d$freq, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0, 
          colors=brewer.pal(8, "Dark2"))

As you can see, there are still some words that are not very useful for analysis. We can create our own list of stopwords and remove them from the corpus. Here is how: 

In [None]:
# specify your own stopwords as a character vector, and rerun on the docs. 
docs <- tm_map(docs, removeWords, c("said", "got","went","come","dont","didnt","wouldnt","couldnt","well","get","want","say","told","know","see","must","make","like","one","thing")) 
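
Because the corpus was already stemmed earlier, a custom stopword only matches if it is written the way it appears after stemming. One way to check a candidate list is to run it through the same stemmer first. This is a sketch with made-up candidates, assuming the English Snowball stemmer.

```r
library("SnowballC") # provides wordStem()

# Made-up candidate stopwords in their everyday, unstemmed form
candidates <- c("wanted", "getting", "things")

# Stem them so they match the tokens in an already-stemmed corpus
custom_stopwords <- unique(wordStem(candidates, language = "english"))
print(custom_stopwords)
```

The resulting vector can be passed to ```removeWords``` in the same way as the hand-written list above.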

In [None]:

# Rebuild the frequency table now that the custom stopwords are removed 
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 30)

In [None]:
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
          max.words=100, random.order=FALSE, rot.per=0, 
          colors=brewer.pal(8, "Dark2"))

In the practice, we will see how to create a word cloud and remove words interactively to analyze the text. 