# Representation of Text Data $\,\, \tiny\text{Lecture}$
<img src="banner lecture.jpg" align=left />

<br>
### Apparatus
___
Load function libraries, define additional useful functions, and set defaults here.

In [1]:
options(warn=-1)

# Load some required functions
library(rJava,      verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE)
library(xlsxjars,   verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE)
library(xlsx,       verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE) # Also, ensure Java version (32-bit or 64-bit) matches R kernel
library(tm,   verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE)
library(RWeka,   verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE)
library(e1071,   verbose=FALSE, warn.conflicts=FALSE, quietly=TRUE)

# Define some useful functions
unigram = function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
unibigram = function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
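regulate_columns = function(new.t, data.t) {
  # Names of the columns that appear in both tables
  i = colnames(new.t)[colnames(new.t) %in% colnames(data.t)]
  # One-row table of zeros with the same columns as data.t
  new.tx = as.data.frame(as.list(t(rep(0, ncol(data.t)))), optional=TRUE, col.names=colnames(data.t))
  # Copy the counts for the shared columns; columns found only in new.t are dropped
  new.tx[, i] = new.t[, i]
  new.tx
}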

# Set some visualization formatting defaults
options(digits=10, scipen=100) # expose many digits, use scientific notation sparingly
options(repr.matrix.max.cols=500)
options(repr.matrix.max.rows=1000)

<br>
### Definitions
___
Some definitions:
* **Corpus:** A text dataset, often represented as a one-column table.
* **Document:** An observation in a text dataset, which can be a word, a phrase, a sentence, a paragraph, an article, a chapter, a book, or any other unit of text.

<br>
### Retrieve Data
___
Here we retrieve a corpus, represented as a one-column table, comprising seven documents.  We then manually classify each document as type A or B.

_Note: On some systems, use encoding="UTF-8" to ensure that any unusual characters in the text are interpreted correctly._  

In [2]:
data = read.xlsx("../DATASETS/DATASET Jabberwocky.xlsx", sheetIndex=1, header=FALSE, encoding="UTF-8")
data$class = factor(c("A","A","B","A","B","B","A"), c("A","B"))
dim(data)
data

X1,class
"’Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.",A
"“Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!”",A
He took his vorpal sword in hand; Long time the manxome foe he sought— So rested he by the Tumtum tree And stood awhile in thought.,B
"And, as in uffish thought he stood, The Jabberwock, with eyes of flame, Came whiffling through the tulgey wood, And burbled as it came!",A
"One, two! One, two! And through and through The vorpal blade went snicker-snack! He left it dead, and with its head He went galumphing back.",B
"“And hast thou slain the Jabberwock? Come to my arms, my beamish boy! O frabjous day! Callooh! Callay!” He chortled in his joy.",B
"’Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.",A


<br>
Here we retrieve a new, unclassified document.

In [3]:
new = data.frame(X1="The bird rested in the dead tree for a long time, and thought for a long time about business analytics.")
new

X1
"The bird rested in the dead tree for a long time, and thought for a long time about business analytics."


<br>
### Represent Corpus as a Document-Term Table 
___

#### Step 1: Prepare the corpus for transformation.

To prepare the corpus for transformation, convert it from table form to list form.

_Note: To access an element of a list, use the [[...]] notation._
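
As a small illustration of the `[[...]]` notation, the sketch below uses a plain R list in place of a corpus. The variable names and the example strings are made up for this illustration.

```r
# A plain R list standing in for a corpus in list form (illustrative values)
docs = list("first document", "second document", "third document")

single = docs[[2]]   # [[...]] returns the element itself (a character string)
sub    = docs[2]     # [...] returns a list of length 1 that contains the element

class(single)
class(sub)
```

The same distinction applies to a corpus object: `corpus[[1]]` is one document, while `corpus[1]` is a corpus containing one document.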

In [4]:
corpus = VCorpus(VectorSource(data$X1))

inspect(corpus[[1]])
inspect(corpus[[2]])
inspect(corpus[[3]])
inspect(corpus[[4]])
inspect(corpus[[5]])
inspect(corpus[[6]])
inspect(corpus[[7]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 142

’Twas brillig, and the slithy toves 
      Did gyre and gimble in the wabe: 
All mimsy were the borogoves, 
      And the mome raths outgrabe.
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 150

“Beware the Jabberwock, my son! 
      The jaws that bite, the claws that catch! 
Beware the Jubjub bird, and shun 
      The frumious Bandersnatch!” 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 147

He took his vorpal sword in hand; 
      Long time the manxome foe he sought— 
So rested he by the Tumtum tree 
      And stood awhile in thought. 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 151

And, as in uffish thought he stood, 
      The Jabberwock, with eyes of flame, 
Came whiffling through the tulgey wood, 
      And burbled as it came! 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 155

One, two! One, two! And through and through 
      The vorpal blade went snicker-snack! 
He left it dead, and with its he

<br>
#### Step 2: Transform the corpus in various ways.

Here we transform the corpus in six ways:
* Transform all unusual characters to similar standard Latin letters (necessary only on some systems).
* Transform all upper case letters to lower case letters.
* Remove all numbers.
* Remove all punctuation.
* Remove all inconsequential words, like "a", "the", "we", "in", etc.
* Collapse each run of whitespace, like spaces, new lines, etc., into a single space.

In [5]:
#corpus = tm_map(corpus, content_transformer(iconv), from="UTF-8", to="latin1") # this line is necessary only on some systems
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeWords, stopwords("en")) # uses dictionary of English stopwords 
corpus = tm_map(corpus, stripWhitespace)

inspect(corpus[[1]])
inspect(corpus[[2]])
inspect(corpus[[3]])
inspect(corpus[[4]])
inspect(corpus[[5]])
inspect(corpus[[6]])
inspect(corpus[[7]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 78

twas brillig slithy toves gyre gimble wabe mimsy borogoves mome raths outgrabe
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 90

beware jabberwock son jaws bite claws catch beware jubjub bird shun frumious bandersnatch 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 93

 took vorpal sword hand long time manxome foe sought rested tumtum tree stood awhile thought 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 84

 uffish thought stood jabberwock eyes flame came whiffling tulgey wood burbled came 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 82

one two one two vorpal blade went snickersnack left dead head went galumphing back
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 93

 hast thou slain jabberwock come arms beamish boy o frabjous day callooh callay chortled joy 
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 78

twas brillig slithy toves gyre gimble wabe mimsy borogoves mome raths outgrabe
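
To see what these transformations do to a single string, here is a minimal base-R sketch. It is not how `tm` implements them: the helper `clean_text` and the short `stop_words` vector are assumptions made for this illustration, and `stopwords("en")` contains many more words.

```r
# Illustrative stop-word list (hypothetical; much shorter than stopwords("en"))
stop_words = c("and", "the", "in", "did")

clean_text = function(x, stop_words) {
  x = tolower(x)                                    # upper case to lower case
  x = gsub("[[:digit:]]+", "", x)                   # remove numbers
  x = gsub("[[:punct:]]+", "", x)                   # remove punctuation
  pattern = paste0("\\b(", paste(stop_words, collapse="|"), ")\\b")
  x = gsub(pattern, "", x, perl=TRUE)               # remove whole stop words
  x = gsub("[[:space:]]+", " ", x)                  # collapse runs of whitespace
  x
}

cleaned = clean_text("Did gyre and gimble in the wabe: 2 toves!", stop_words)
cleaned   # " gyre gimble wabe toves"
```

A leading space can remain where a stop word was removed from the start of the text, as in several of the transformed documents shown above.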


<br>
#### Step 3: Construct the document-term table (words).

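The document-term table's columns correspond to the words in the transformed corpus.  Rows correspond to documents.  Each value is the number of times a certain word appears in a certain document.  By default, `DocumentTermMatrix` keeps only words of at least 3 characters, which is why the one-letter word "o" in the sixth document has no column.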

In [6]:
dtm = DocumentTermMatrix(corpus)
data.t = as.data.frame(as.matrix(dtm))
dim(data.t)
data.t

arms,awhile,back,bandersnatch,beamish,beware,bird,bite,blade,borogoves,boy,brillig,burbled,callay,callooh,came,catch,chortled,claws,come,day,dead,eyes,flame,foe,frabjous,frumious,galumphing,gimble,gyre,hand,hast,head,jabberwock,jaws,joy,jubjub,left,long,manxome,mimsy,mome,one,outgrabe,raths,rested,shun,slain,slithy,snickersnack,son,sought,stood,sword,thou,thought,time,took,toves,tree,tulgey,tumtum,twas,two,uffish,vorpal,wabe,went,whiffling,wood
0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0
0,0,0,1,0,2,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,1
0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,2,0,0
1,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0
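
The counting behind a document-term table can be sketched in base R as follows. This is only an illustration of the idea on three made-up, already-transformed documents. It does not reproduce `DocumentTermMatrix`, and the names `docs`, `tokens`, `vocab`, and `dt` are introduced here.

```r
# Three short, already-transformed documents (illustrative)
docs = c("beware jabberwock son beware", "vorpal sword hand", "jabberwock vorpal blade")

tokens = strsplit(docs, " ", fixed=TRUE)     # split each document into words
vocab  = sort(unique(unlist(tokens)))        # one column per distinct word

# Count each vocabulary word in each document: rows are documents, columns are words
counts = t(vapply(tokens,
                  function(w) as.integer(table(factor(w, levels=vocab))),
                  integer(length(vocab))))
colnames(counts) = vocab

dt = as.data.frame(counts)
dt
```

Here the first document has the value 2 in the `beware` column because that word appears twice in it.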


<br>
#### Step 3 (alternative): Construct the document-term table (n-grams).

The document-term table's columns correspond to the n-grams in the transformed corpus, where an n-gram is a sequence of n words that appear adjacent to each other.  For example, "arms" and "awhile" are 1-grams, or unigrams, or just words, while "arms beamish" and "awhile thought" are 2-grams, or bigrams.  Rows correspond to documents.  Each value is the number of times a certain n-gram appears in a certain document.

Here we determine the document-term table for unigrams and bigrams.

In [7]:
dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigram))
data.t = as.data.frame(as.matrix(dtm))
dim(data.t)
data.t

arms,arms beamish,awhile,awhile thought,back,bandersnatch,beamish,beamish boy,beware,beware jabberwock,beware jubjub,bird,bird shun,bite,bite claws,blade,blade went,borogoves,borogoves mome,boy,boy o,brillig,brillig slithy,burbled,burbled came,callay,callay chortled,callooh,callooh callay,came,came whiffling,catch,catch beware,chortled,chortled joy,claws,claws catch,come,come arms,day,day callooh,dead,dead head,eyes,eyes flame,flame,flame came,foe,foe sought,frabjous,frabjous day,frumious,frumious bandersnatch,galumphing,galumphing back,gimble,gimble wabe,gyre,gyre gimble,hand,hand long,hast,hast thou,head,head went,jabberwock,jabberwock come,jabberwock eyes,jabberwock son,jaws,jaws bite,joy,jubjub,jubjub bird,left,left dead,long,long time,manxome,manxome foe,mimsy,mimsy borogoves,mome,mome raths,o frabjous,one,one two,outgrabe,raths,raths outgrabe,rested,rested tumtum,shun,shun frumious,slain,slain jabberwock,slithy,slithy toves,snickersnack,snickersnack left,son,son jaws,sought,sought rested,stood,stood awhile,stood jabberwock,sword,sword hand,thou,thou slain,thought,thought stood,time,time manxome,took,took vorpal,toves,toves gyre,tree,tree stood,tulgey,tulgey wood,tumtum,tumtum tree,twas,twas brillig,two,two one,two vorpal,uffish,uffish thought,vorpal,vorpal blade,vorpal sword,wabe,wabe mimsy,went,went galumphing,went snickersnack,whiffling,whiffling tulgey,wood,wood burbled
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,2,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,1,1,0,0,0,2,1,1,0,0,0,0
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
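
How n-grams are formed from a document's words can be sketched in base R. The helper `ngrams` below is hypothetical and written only for this illustration. The notebook itself relies on `NGramTokenizer` from `RWeka`.

```r
# Return all n-grams (as space-separated strings) from a vector of words
ngrams = function(words, n) {
  if (length(words) < n) return(character(0))
  starts = seq_len(length(words) - n + 1)
  vapply(starts, function(s) paste(words[s:(s + n - 1)], collapse=" "), character(1))
}

words = strsplit("come arms beamish boy", " ", fixed=TRUE)[[1]]

unigrams = ngrams(words, 1)   # "come" "arms" "beamish" "boy"
bigrams  = ngrams(words, 2)   # "come arms" "arms beamish" "beamish boy"
c(unigrams, bigrams)          # unigrams and bigrams together
```

A document with k words yields k unigrams and k - 1 bigrams, which is why the unigram-and-bigram table has roughly twice as many columns as the word table.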


<br>
#### Step 3 (alternative): Construct the document-term table (n-grams and sparse variable removal).

The document-term table's columns correspond to the n-grams in the transformed corpus.  Rows correspond to documents.  Each value is the number of times a certain n-gram appears in a certain document.

Additionally, columns corresponding to n-grams that appear in only a few documents are removed.   

Here we determine the document-term table for unigrams and bigrams.

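_Use the `removeSparseTerms` function with the `sparse` parameter to remove sparse terms.  `sparse`=$x$, where $0 < x < 1$, means that columns corresponding to n-grams missing from less than a fraction $x$ of the documents are kept, and other columns are removed.  With 7 documents and `sparse`=0.90, an n-gram that appears in only one document is missing from 6/7 (about 0.86) of the documents, so no columns are removed here._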

In [8]:
dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigram))
dtm = removeSparseTerms(dtm, sparse=0.90)
data.t =  as.data.frame(as.matrix(dtm))
dim(data.t)
data.t

arms,arms beamish,awhile,awhile thought,back,bandersnatch,beamish,beamish boy,beware,beware jabberwock,beware jubjub,bird,bird shun,bite,bite claws,blade,blade went,borogoves,borogoves mome,boy,boy o,brillig,brillig slithy,burbled,burbled came,callay,callay chortled,callooh,callooh callay,came,came whiffling,catch,catch beware,chortled,chortled joy,claws,claws catch,come,come arms,day,day callooh,dead,dead head,eyes,eyes flame,flame,flame came,foe,foe sought,frabjous,frabjous day,frumious,frumious bandersnatch,galumphing,galumphing back,gimble,gimble wabe,gyre,gyre gimble,hand,hand long,hast,hast thou,head,head went,jabberwock,jabberwock come,jabberwock eyes,jabberwock son,jaws,jaws bite,joy,jubjub,jubjub bird,left,left dead,long,long time,manxome,manxome foe,mimsy,mimsy borogoves,mome,mome raths,o frabjous,one,one two,outgrabe,raths,raths outgrabe,rested,rested tumtum,shun,shun frumious,slain,slain jabberwock,slithy,slithy toves,snickersnack,snickersnack left,son,son jaws,sought,sought rested,stood,stood awhile,stood jabberwock,sword,sword hand,thou,thou slain,thought,thought stood,time,time manxome,took,took vorpal,toves,toves gyre,tree,tree stood,tulgey,tulgey wood,tumtum,tumtum tree,twas,twas brillig,two,two one,two vorpal,uffish,uffish thought,vorpal,vorpal blade,vorpal sword,wabe,wabe mimsy,went,went galumphing,went snickersnack,whiffling,whiffling tulgey,wood,wood burbled
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,2,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,1,1,0,0,0,2,1,1,0,0,0,0
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
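
The effect of the `sparse` threshold can be sketched on a small count matrix in base R. The helper `keep_sparse` and the matrix `m` are assumptions made for this illustration. The sketch follows the rule as described above and is not `tm`'s own implementation.

```r
# Five documents (rows) and three terms (columns), illustrative counts
m = matrix(c(1, 1, 0,
             2, 0, 0,
             1, 1, 1,
             1, 0, 0,
             1, 0, 0),
           nrow=5, byrow=TRUE,
           dimnames=list(NULL, c("common", "occasional", "rare")))

# Fraction of documents from which each term is missing
sparsity = colMeans(m == 0)   # common 0.0, occasional 0.6, rare 0.8

# Keep only the columns whose sparsity is below the threshold
keep_sparse = function(m, sparse) m[, colMeans(m == 0) < sparse, drop=FALSE]

colnames(keep_sparse(m, 0.90))   # all three terms are kept
colnames(keep_sparse(m, 0.70))   # "common" and "occasional"
colnames(keep_sparse(m, 0.50))   # "common" only
```

Lowering `sparse` removes more columns, so a small corpus like the one in this lecture needs a threshold well below 0.90 before any n-gram is dropped.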


<br>
#### Step 4: Restore classification.

Add the class column from the data in table form to the data in document-term form.

In [9]:
data.tc = data.t
data.tc$class = data$class
data.tc

arms,arms beamish,awhile,awhile thought,back,bandersnatch,beamish,beamish boy,beware,beware jabberwock,beware jubjub,bird,bird shun,bite,bite claws,blade,blade went,borogoves,borogoves mome,boy,boy o,brillig,brillig slithy,burbled,burbled came,callay,callay chortled,callooh,callooh callay,came,came whiffling,catch,catch beware,chortled,chortled joy,claws,claws catch,come,come arms,day,day callooh,dead,dead head,eyes,eyes flame,flame,flame came,foe,foe sought,frabjous,frabjous day,frumious,frumious bandersnatch,galumphing,galumphing back,gimble,gimble wabe,gyre,gyre gimble,hand,hand long,hast,hast thou,head,head went,jabberwock,jabberwock come,jabberwock eyes,jabberwock son,jaws,jaws bite,joy,jubjub,jubjub bird,left,left dead,long,long time,manxome,manxome foe,mimsy,mimsy borogoves,mome,mome raths,o frabjous,one,one two,outgrabe,raths,raths outgrabe,rested,rested tumtum,shun,shun frumious,slain,slain jabberwock,slithy,slithy toves,snickersnack,snickersnack left,son,son jaws,sought,sought rested,stood,stood awhile,stood jabberwock,sword,sword hand,thou,thou slain,thought,thought stood,time,time manxome,took,took vorpal,toves,toves gyre,tree,tree stood,tulgey,tulgey wood,tumtum,tumtum tree,twas,twas brillig,two,two one,two vorpal,uffish,uffish thought,vorpal,vorpal blade,vorpal sword,wabe,wabe mimsy,went,went galumphing,went snickersnack,whiffling,whiffling tulgey,wood,wood burbled,class
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,A
0,0,0,0,0,1,0,0,2,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,A
0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,B
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1,A
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,1,1,0,0,0,2,1,1,0,0,0,0,B
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,B
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,A


<br>
### Represent New Observation as Document-Term Table
___

#### Prepare the corpus for transformation, transform the corpus in various ways, construct the document-term table.

In [10]:
corpus = VCorpus(VectorSource(new$X1))
#corpus = tm_map(corpus, content_transformer(iconv), from="UTF-8", to="latin1") # this line is necessary only on some systems
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeWords, stopwords("en"))
corpus = tm_map(corpus, stripWhitespace)

dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigram))
new.t = as.data.frame(as.matrix(dtm))
new.t

analytics,bird,bird rested,business,business analytics,dead,dead tree,long,long time,rested,rested dead,thought,thought long,time,time business,time thought,tree,tree long
1,1,1,1,1,1,1,2,2,1,1,1,1,2,1,1,1,1


<br>
#### Regulate the columns of the new observation's document-term table so that they align with the corpus' document-term table.

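Remove columns for n-grams that do not appear in the corpus' document-term table.  Add columns with 0 values for n-grams that appear in the corpus' document-term table but not in the new observation.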

In [11]:
new.tr = regulate_columns(new.t, data.t)
new.tr

arms,arms beamish,awhile,awhile thought,back,bandersnatch,beamish,beamish boy,beware,beware jabberwock,beware jubjub,bird,bird shun,bite,bite claws,blade,blade went,borogoves,borogoves mome,boy,boy o,brillig,brillig slithy,burbled,burbled came,callay,callay chortled,callooh,callooh callay,came,came whiffling,catch,catch beware,chortled,chortled joy,claws,claws catch,come,come arms,day,day callooh,dead,dead head,eyes,eyes flame,flame,flame came,foe,foe sought,frabjous,frabjous day,frumious,frumious bandersnatch,galumphing,galumphing back,gimble,gimble wabe,gyre,gyre gimble,hand,hand long,hast,hast thou,head,head went,jabberwock,jabberwock come,jabberwock eyes,jabberwock son,jaws,jaws bite,joy,jubjub,jubjub bird,left,left dead,long,long time,manxome,manxome foe,mimsy,mimsy borogoves,mome,mome raths,o frabjous,one,one two,outgrabe,raths,raths outgrabe,rested,rested tumtum,shun,shun frumious,slain,slain jabberwock,slithy,slithy toves,snickersnack,snickersnack left,son,son jaws,sought,sought rested,stood,stood awhile,stood jabberwock,sword,sword hand,thou,thou slain,thought,thought stood,time,time manxome,took,took vorpal,toves,toves gyre,tree,tree stood,tulgey,tulgey wood,tumtum,tumtum tree,twas,twas brillig,two,two one,two vorpal,uffish,uffish thought,vorpal,vorpal blade,vorpal sword,wabe,wabe mimsy,went,went galumphing,went snickersnack,whiffling,whiffling tulgey,wood,wood burbled
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
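
The alignment step can also be sketched on two tiny tables. The function `align_columns` below is a hypothetical, more explicit variant written for this illustration. It is not the `regulate_columns` function defined in the Apparatus section, although it is intended to follow the same idea.

```r
# Align the columns of new.t with those of ref.t:
# drop columns missing from ref.t, add zero-valued columns missing from new.t
align_columns = function(new.t, ref.t) {
  out = as.data.frame(matrix(0, nrow=nrow(new.t), ncol=ncol(ref.t),
                             dimnames=list(NULL, colnames(ref.t))))
  shared = intersect(colnames(new.t), colnames(ref.t))
  if (length(shared) > 0) out[, shared] = new.t[, shared, drop=FALSE]
  out
}

# Illustrative tables: ref.t plays the role of the corpus' document-term table
ref.t = data.frame(bird=c(1, 0), "long time"=c(0, 1), sword=c(0, 1), check.names=FALSE)
new.t = data.frame(analytics=1, bird=1, "long time"=2, check.names=FALSE)

aligned = align_columns(new.t, ref.t)
aligned
```

In this example the `analytics` column is dropped because the reference table has no such column, and a `sword` column with value 0 is added.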


<br>
### Build Model & Make Predictions
___

Here we build a naive Bayes model to predict class given all other variables, using the corpus' document-term table.  Then we predict the class of the new observation, using the new observation's document-term table.

In [12]:
model = naiveBayes(class ~ ., data.tc)
result = predict(model, new.tr, type="class")

data.frame(result)

result
A
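
For reference, the decision rule behind a naive Bayes classifier can be written as below. The notation is introduced here for illustration only: $c$ is a class, $x_j$ is the value in column $j$ of the new observation's document-term table, and $p$ is the number of columns. How each $P(x_j \mid c)$ is estimated depends on the implementation. As far as the `e1071` documentation describes, `naiveBayes` models numeric columns with a Gaussian density per class.

$$\hat{c} \;=\; \underset{c \,\in\, \{A,\,B\}}{\arg\max} \;\; P(c) \prod_{j=1}^{p} P(x_j \mid c)$$

With the classification used in this lecture, the prior probabilities estimated from the corpus are $P(A) = 4/7$ and $P(B) = 3/7$.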


<br>
### Expectations
___
* Convert a corpus in one-column table form to document-term form, conceptually and using R.
* Build a model and make predictions based on text data, conceptually and using R.

<br>
### Some Useful R Functions
___
* `as.data.frame` https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/as.data.frame
* `as.matrix` https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/matrix
* `content_transformer` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/getTransformations
* `DocumentTermMatrix` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/TermDocumentMatrix
* `inspect` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/inspect
* `removeNumbers` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/getTransformations
* `removePunctuation` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/getTransformations
* `removeSparseTerms` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/removeSparseTerms
* `removeWords` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/getTransformations
* `stripWhitespace` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/getTransformations
* `tm_map` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/tm_map
* `VCorpus` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/VCorpus
* `VectorSource` https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/VectorSource


<br>
### Further Reading
___
* http://www-stat.wharton.upenn.edu/~stine/mich/index.html#textanalytics
* http://www-stat.wharton.upenn.edu/~stine/mich/DM_10.pdf
* http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
* https://www.r-bloggers.com/building-wordclouds-in-r/

$\tiny \text{Copyright (c) Berkeley Data Analytics Group, LLC}$