# Practice - Word Clouds

In this practice, we will create a Shiny app that displays a word cloud and lets us interactively remove words and change display options in order to analyze text.

We will use a preprocessed screenplay of season one, episode one of Game of Thrones. The screenplay has been edited to remove all scene descriptions; only the characters' lines are kept. We will visualize the words in the script and also the number of lines per character. First, load the data and inspect it.

In [1]:
library(tm)
library(wordcloud)
library(RColorBrewer)

filePath <- "gotS1E1processed.txt"
text <- readLines(filePath)
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
inspect(docs)

Loading required package: NLP
Loading required package: RColorBrewer


<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 330

  [1] WAYMAR ROYCE: What d'you expect? They're savages. One lot steals a goat from another lot and before you know it, they're ripping each other to pieces.                                                                                                                                                                                                                                                          
  [2] WILL: I've never seen wildlings do a thing like this. I've never seen a thing like this, not ever in my life.                                                                                                                                                                                                                                                                                                   
  [3] WAYMAR ROYCE: How close did you get?                                          

Before doing anything else, we will create two corpora: one containing only the spoken lines, and the other containing only the character names. To do that, we'll define our own text-mining transformation and use it to separate the lines from the names.

In [2]:
# Our own transformation: sub() replaces the first match of a pattern with another string.
# Here we remove the matched "pattern" by replacing it with nothing ("").

SepName <- content_transformer(function(x, pattern) sub(pattern, "", x))

# Call it twice, once to keep the lines and once to keep the names:
# ".*:" removes everything up to and including ":", leaving the spoken line;
# ":.*" removes ":" and everything after it, leaving the character name.
gotlines <- tm_map(docs, SepName, ".*:")
gotnames <- tm_map(docs, SepName, ":.*")

# let's look at lines only
inspect(gotlines)
    

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 330

  [1]  What d'you expect? They're savages. One lot steals a goat from another lot and before you know it, they're ripping each other to pieces.                                                                                                                                                                                                                                                                
  [2]  I've never seen wildlings do a thing like this. I've never seen a thing like this, not ever in my life.                                                                                                                                                                                                                                                                                                 
  [3]  How close did you get?                                                                     

In [3]:
# and this corpus is names only 
inspect(gotnames)

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 330

  [1] WAYMAR ROYCE  WILL          WAYMAR ROYCE  WILL          GARED        
  [6] ROYCE         GARED         ROYCE         WILL          ROYCE        
 [11] ROYCE         WILL          GARED         ROYCE         GARED        
 [16] JON           JON           SEPTA MORDANE SANSA         SEPTA MORDANE
 [21] NED           JON           ROBB          JON/ROBB      CASSEL       
 [26] NED           CATELYN       NED           CASSEL        NED          
 [31] CATELYN       NED           ROBB          WILL          WILL         
 [36] WILL          NED           JON           NED           JON          
 [41] NED           JON           NED           BRAN          NED          
 [46] BRAN          NED           BRAN          NED           BRAN         
 [51] NED           JON           THEON         NED           THEON        
 [56] NED           NED           ROBB          JON           
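
To see what the two patterns do, here is a small base-R sketch on made-up strings (the sample lines are illustrative and not taken from the data file). Because `.*:` is greedy, it removes everything up to the last colon in a line, so a line whose dialogue itself contains a colon would lose part of its text. The anchored pattern `^[^:]*:` is shown as one possible alternative that stops at the first colon.

```r
# Made-up sample lines in the same "NAME: text" format (not from the data file)
sample_lines <- c("NED: Winter is coming.",
                  "JON: He said: run.")

# ":.*" removes the first ":" and everything after it, leaving the name
speaker <- sub(":.*", "", sample_lines)
# "NED" "JON"

# Greedy ".*:" removes everything up to the LAST ":" in each line
greedy <- sub(".*:", "", sample_lines)
# " Winter is coming." " run."

# "^[^:]*:" stops at the FIRST ":" so the rest of the line is kept
anchored <- sub("^[^:]*:", "", sample_lines)
# " Winter is coming." " He said: run."

# trimws() drops the leading space left behind after the colon
trimws(anchored)
```

Whether this matters for the episode script depends on whether any spoken line contains a colon.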

Now we need to preprocess ```gotlines```: convert the text to lower case, remove punctuation and stopwords, and strip extra white space, just as we did in the lab notebook.
**It's your turn:**

In [None]:
gotlines <- # YOUR CODE 
gotlines <- # YOUR CODE 
gotlines <- # YOUR CODE 
gotlines <- # YOUR CODE 

inspect(gotlines)
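
As a reminder of what such a cleaning pipeline looks like, here is a minimal sketch on a tiny made-up corpus. The name `toy` and its two sentences are hypothetical, and the order of the steps is one reasonable choice rather than the only one.

```r
library(tm)

# A tiny made-up corpus, used only to illustrate the cleaning steps
toy <- Corpus(VectorSource(c("The King rides North!",
                             "Winter is coming, my lord.")))

toy <- tm_map(toy, content_transformer(tolower))       # convert to lower case
toy <- tm_map(toy, removePunctuation)                  # drop punctuation
toy <- tm_map(toy, removeWords, stopwords("english"))  # drop common English stopwords
toy <- tm_map(toy, stripWhitespace)                    # collapse repeated spaces

inspect(toy)
```

Lower-casing comes before stopword removal here because the stopword list is in lower case. Stripping white space comes last because removing words leaves gaps behind.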

Now we will compute the word frequencies and plot the word cloud. 
**It's your turn:**

In [None]:
dtm1 <- # YOUR CODE 
m1 <- # YOUR CODE 
v1 <- # YOUR CODE 
d1 <- # YOUR CODE 
head(d1, 30)

set.seed(1234)
wordcloud(words = d1$word, freq = d1$freq, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0, 
          colors=brewer.pal(8, "Dark2"))
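
The frequency table passed to `wordcloud()` has one row per word, with a `word` column and a `freq` column. The sketch below shows one common way to build such a table with `tm`, using a made-up three-document corpus. The names `toy`, `tdm`, `m`, `v`, and `d` are illustrative.

```r
library(tm)

# Made-up, already-clean text used only for illustration
toy <- Corpus(VectorSource(c("winter coming", "winter lord", "king lord winter")))

tdm <- TermDocumentMatrix(toy)                # rows = terms, columns = documents
m   <- as.matrix(tdm)                         # dense matrix of counts
v   <- sort(rowSums(m), decreasing = TRUE)    # total count per term, largest first
d   <- data.frame(word = names(v), freq = v)  # same word/freq layout used above

head(d)
```

Sorting in decreasing order means `head()` shows the most frequent words first, which is also convenient later when picking the top words for the app.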

Now, visualize **```gotnames```** as a word cloud. Since it contains only character names, no text preprocessing is needed; just compute the frequencies and plot them.

#### <span style='background:yellow'>It's your turn</span>

In [None]:
# YOUR CODE HERE
# ----------------





Which two characters have the most lines?
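
One thing to keep in mind when counting names: a term-document matrix splits text on white space, so a two-word name such as `WAYMAR ROYCE` would be counted as two separate terms. As a hedged alternative, the sketch below counts whole labels with base R's `table()`. The `speakers` vector is made up and does not reflect the real counts.

```r
# Made-up speaker labels, one per line of dialogue (not the real counts)
speakers <- c("NED", "JON", "NED", "SEPTA MORDANE", "NED", "JON")

# table() counts each distinct label, keeping multi-word names intact
counts <- sort(table(speakers), decreasing = TRUE)

# Same word/freq layout used for the word cloud data frames
d_names <- data.frame(word = names(counts), freq = as.vector(counts))
d_names
```

With the real data, the labels would first have to be extracted from the corpus as a character vector; something like `content(gotnames)` should do that, but treat it as an assumption to check.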

### Interactive Word Cloud in Shiny 

Now, we will create an app to show the word cloud. First we will create a static app with no interaction. We need a simple layout with one plot. 
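
A minimal sketch of such a static app follows. It assumes a frequency table with `word` and `freq` columns. The small `d1` defined here holds made-up values so that the block runs on its own, and the output id `cloud` is a hypothetical name.

```r
library(shiny)
library(wordcloud)
library(RColorBrewer)

# Made-up frequencies standing in for the real table (hypothetical values)
d1 <- data.frame(word = c("winter", "lord", "king", "north"),
                 freq = c(12, 9, 7, 5))

# Layout: a title and a single plot
ui <- fluidPage(
  titlePanel("Word Cloud"),
  plotOutput("cloud")
)

# Server: draw the word cloud; there are no inputs yet
server <- function(input, output) {
  output$cloud <- renderPlot({
    set.seed(1234)
    wordcloud(words = d1$word, freq = d1$freq, min.freq = 5,
              max.words = 100, random.order = FALSE, rot.per = 0,
              colors = brewer.pal(8, "Dark2"))
  })
}

# Build the app object; printing it or passing it to runApp() launches it
app <- shinyApp(ui = ui, server = server)
```

For deployment, the code would presumably be saved in the app folder (for example as `app.R`); that file layout is an assumption, not something this notebook states.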

In [None]:
# Deploy to the Shiny server
dir <- getwd()                  # current working directory
course <- "DATA-SCI-8654"       # course path on the Shiny server
folder <- "module3-wordcloud1"  # name of the folder to copy

system(sprintf("/usr/local/bin/shiny_deploy %s %s %s", course, dir, folder), 
       intern = TRUE,
       ignore.stdout = FALSE, 
       ignore.stderr = FALSE,
       wait = TRUE, 
       input = NULL)

**Now we will modify this code to get user input for the minimum frequency and the maximum number of words.**
Let's have two sliders for these:
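
One way to wire up the two sliders is sketched below. The input ids `minfreq` and `maxwords`, the slider ranges, and the made-up `d1` values are all assumptions chosen for illustration.

```r
library(shiny)
library(wordcloud)
library(RColorBrewer)

# Made-up frequencies standing in for the real table (hypothetical values)
d1 <- data.frame(word = c("winter", "lord", "king", "north"),
                 freq = c(12, 9, 7, 5))

ui <- fluidPage(
  titlePanel("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      # Hypothetical input ids; the ranges and defaults are arbitrary
      sliderInput("minfreq", "Minimum frequency:",
                  min = 1, max = 50, value = 5),
      sliderInput("maxwords", "Maximum number of words:",
                  min = 1, max = 300, value = 100)
    ),
    mainPanel(plotOutput("cloud"))
  )
)

server <- function(input, output) {
  output$cloud <- renderPlot({
    set.seed(1234)
    # Reading input$... inside renderPlot() redraws the plot when a slider moves
    wordcloud(words = d1$word, freq = d1$freq,
              min.freq = input$minfreq, max.words = input$maxwords,
              random.order = FALSE, rot.per = 0,
              colors = brewer.pal(8, "Dark2"))
  })
}

# Build the app object; printing it or passing it to runApp() launches it
app <- shinyApp(ui = ui, server = server)
```

Keeping `set.seed()` inside `renderPlot()` makes the layout repeatable, so moving a slider changes which words appear rather than reshuffling the whole cloud.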

Now, we will add check boxes to remove words from the word cloud. For this, we will display the top 20 words and let the user choose whether to include or exclude each of them from the visualization.
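
A possible sketch of the check-box logic follows. The helper `keep_words()`, the input id `shown`, and the made-up `d1` values are hypothetical; the frequency table is assumed to be sorted by decreasing frequency, so the first 20 rows are the top 20 words.

```r
library(shiny)
library(wordcloud)
library(RColorBrewer)

# Made-up frequencies standing in for the real table (hypothetical values)
d1 <- data.frame(word = c("winter", "lord", "king", "north"),
                 freq = c(12, 9, 7, 5))

# Words offered as check boxes (d1 is assumed sorted by decreasing freq)
top_words <- as.character(head(d1$word, 20))

# Hypothetical helper: drop the top words that are NOT ticked, keep everything else
keep_words <- function(d, top, selected) {
  d[!(d$word %in% setdiff(top, selected)), , drop = FALSE]
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # All boxes start ticked; unticking a word removes it from the cloud
      checkboxGroupInput("shown", "Words to include:",
                         choices = top_words, selected = top_words)
    ),
    mainPanel(plotOutput("cloud"))
  )
)

server <- function(input, output) {
  output$cloud <- renderPlot({
    d <- keep_words(d1, top_words, input$shown)
    req(nrow(d) > 0)  # draw nothing if every word has been excluded
    set.seed(1234)
    wordcloud(words = d$word, freq = d$freq, min.freq = 1,
              max.words = 100, random.order = FALSE, rot.per = 0,
              colors = brewer.pal(8, "Dark2"))
  })
}

# Build the app object; printing it or passing it to runApp() launches it
app <- shinyApp(ui = ui, server = server)
```

Filtering in a plain function keeps the include/exclude rule easy to check outside the app, for example `keep_words(d1, top_words, c("winter", "king"))` keeps only those two rows of the sample table.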

You should see an interface similar to this: 

<img src="../images/got.png">

