# Text Data

![](banner_text.jpg)

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Introduction

## Terms

* **Natural Language Processing (NLP):** The computational analysis of human language. Predictive analytics applied to text data is a specialization of NLP.
* **Text Mining:** Data analytics applied to text data.
* **Corpus:** A text dataset, often represented as a one-column table.
* **Document:** An observation in a text dataset, which can be a word, a phrase, a sentence, a paragraph, an article, a chapter, a book, or any other unit of text.

## Classification of Text

### Data

Here we retrieve a corpus, represented as a one-column table, comprising 11 documents, each classified as A or B based on author.

In [2]:
data = rbind(read.csv("Jabberwocky.csv", header=FALSE, stringsAsFactors=FALSE)[1:5,,drop=FALSE],
             read.csv("Raven.csv", header=FALSE, stringsAsFactors=FALSE)[1:6,,drop=FALSE])

data$class = factor(c(rep("A", 5), rep("B", 6))) # 5 Jabberwocky documents (A), 6 Raven documents (B)


size(data)
data

observations,variables
11,2


V1,class
"Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.",A
"""Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!""",A
He took his vorpal sword in hand; Long time the manxome foe he sought - So rested he by the Tumtum tree And stood awhile in thought.,A
"And, as in uffish thought he stood, The Jabberwock, with eyes of flame, Came whiffling through the tulgey wood, And burbled as it came!",A
"One, two! One, two! And through and through The vorpal blade went snicker-snack! He left it dead, and with its head He went galumphing back.",A
"Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore-",B
"While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door.",B
"'Tis some visitor,"" I muttered, ""tapping at my chamber door- Only this and nothing more.""",B
"Ah, distinctly I remember it was in the bleak December; And each separate dying ember wrought its ghost upon the floor.",B
Eagerly I wished the morrow; -vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore-,B


Here we retrieve a new, unclassified document.

In [3]:
new = data.frame(V1="One bird sought a dead tree at midnight and thought about nothing more than business analytics.")
new

V1
One bird sought a dead tree at midnight and thought about nothing more than business analytics.


### Data as Document-Term Matrix 

#### Simplify the Text

Here we transform the corpus in 8 ways:
* Transform all unusual characters to similar standard Latin letters (necessary only on some systems).
* Transform all upper case letters to lower case letters.
* Remove all numbers.
* Remove all punctuation.
* Remove special characters.
* Remove all inconsequential words, like "a", "the", "we", "in", etc.
* Reduce all words to their roots (this is known as stemming).
* Collapse runs of whitespace (spaces, new lines, etc.) into single spaces.
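
As a rough illustration of what several of these steps do to a single string, here is a minimal base-R sketch. It is not the `tm` pipeline used in this notebook: the stop-word list is a tiny hand-picked stand-in, stemming is omitted, and the function name `simplify_text` is hypothetical.

```r
# Hypothetical helper: approximate some of the simplification steps with base R only.
simplify_text = function(x, stop_words=c("a", "the", "we", "in", "and")) {
  x = tolower(x)                                   # upper case to lower case
  x = gsub("[[:digit:]]+", "", x)                  # remove numbers
  x = gsub("[[:punct:]]+", "", x)                  # remove punctuation
  words = strsplit(x, "[[:space:]]+")[[1]]         # split on runs of whitespace
  words = words[!(words %in% c(stop_words, ""))]   # drop stop words and empty strings
  paste(words, collapse=" ")                       # rejoin with single spaces
}

simplified = simplify_text("He took his vorpal sword in hand; Long time the manxome foe he sought")
simplified
```

The result keeps words such as "he" and "his" that a full English stop-word dictionary would likely remove, which is one reason to prefer a maintained dictionary over a hand-written list.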

Transform all upper case letters to lower case letters:

In [4]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))

as.data.frame.content(corpus)

V1
"twas brillig, and the slithy toves did gyre and gimble in the wabe: all mimsy were the borogoves, and the mome raths outgrabe."
"""beware the jabberwock, my son! the jaws that bite, the claws that catch! beware the jubjub bird, and shun the frumious bandersnatch!"""
he took his vorpal sword in hand; long time the manxome foe he sought - so rested he by the tumtum tree and stood awhile in thought.
"and, as in uffish thought he stood, the jabberwock, with eyes of flame, came whiffling through the tulgey wood, and burbled as it came!"
"one, two! one, two! and through and through the vorpal blade went snicker-snack! he left it dead, and with its head he went galumphing back."
"once upon a midnight dreary, while i pondered, weak and weary, over many a quaint and curious volume of forgotten lore-"
"while i nodded, nearly napping, suddenly there came a tapping, as of some one gently rapping, rapping at my chamber door."
"'tis some visitor,"" i muttered, ""tapping at my chamber door- only this and nothing more."""
"ah, distinctly i remember it was in the bleak december; and each separate dying ember wrought its ghost upon the floor."
eagerly i wished the morrow; -vainly i had sought to borrow from my books surcease of sorrow - sorrow for the lost lenore-


Further, remove all numbers:

In [5]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)

as.data.frame.content(corpus)

V1
"twas brillig, and the slithy toves did gyre and gimble in the wabe: all mimsy were the borogoves, and the mome raths outgrabe."
"""beware the jabberwock, my son! the jaws that bite, the claws that catch! beware the jubjub bird, and shun the frumious bandersnatch!"""
he took his vorpal sword in hand; long time the manxome foe he sought - so rested he by the tumtum tree and stood awhile in thought.
"and, as in uffish thought he stood, the jabberwock, with eyes of flame, came whiffling through the tulgey wood, and burbled as it came!"
"one, two! one, two! and through and through the vorpal blade went snicker-snack! he left it dead, and with its head he went galumphing back."
"once upon a midnight dreary, while i pondered, weak and weary, over many a quaint and curious volume of forgotten lore-"
"while i nodded, nearly napping, suddenly there came a tapping, as of some one gently rapping, rapping at my chamber door."
"'tis some visitor,"" i muttered, ""tapping at my chamber door- only this and nothing more."""
"ah, distinctly i remember it was in the bleak december; and each separate dying ember wrought its ghost upon the floor."
eagerly i wished the morrow; -vainly i had sought to borrow from my books surcease of sorrow - sorrow for the lost lenore-


Further, remove all punctuation:

In [6]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)

as.data.frame.content(corpus)

V1
twas brillig and the slithy toves did gyre and gimble in the wabe all mimsy were the borogoves and the mome raths outgrabe
beware the jabberwock my son the jaws that bite the claws that catch beware the jubjub bird and shun the frumious bandersnatch
he took his vorpal sword in hand long time the manxome foe he sought so rested he by the tumtum tree and stood awhile in thought
and as in uffish thought he stood the jabberwock with eyes of flame came whiffling through the tulgey wood and burbled as it came
one two one two and through and through the vorpal blade went snickersnack he left it dead and with its head he went galumphing back
once upon a midnight dreary while i pondered weak and weary over many a quaint and curious volume of forgotten lore
while i nodded nearly napping suddenly there came a tapping as of some one gently rapping rapping at my chamber door
tis some visitor i muttered tapping at my chamber door only this and nothing more
ah distinctly i remember it was in the bleak december and each separate dying ember wrought its ghost upon the floor
eagerly i wished the morrow vainly i had sought to borrow from my books surcease of sorrow sorrow for the lost lenore


Further, remove special characters:

In [7]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeSpecialChars, chars="’“”—")

as.data.frame.content(corpus)

V1
twas brillig and the slithy toves did gyre and gimble in the wabe all mimsy were the borogoves and the mome raths outgrabe
beware the jabberwock my son the jaws that bite the claws that catch beware the jubjub bird and shun the frumious bandersnatch
he took his vorpal sword in hand long time the manxome foe he sought so rested he by the tumtum tree and stood awhile in thought
and as in uffish thought he stood the jabberwock with eyes of flame came whiffling through the tulgey wood and burbled as it came
one two one two and through and through the vorpal blade went snickersnack he left it dead and with its head he went galumphing back
once upon a midnight dreary while i pondered weak and weary over many a quaint and curious volume of forgotten lore
while i nodded nearly napping suddenly there came a tapping as of some one gently rapping rapping at my chamber door
tis some visitor i muttered tapping at my chamber door only this and nothing more
ah distinctly i remember it was in the bleak december and each separate dying ember wrought its ghost upon the floor
eagerly i wished the morrow vainly i had sought to borrow from my books surcease of sorrow sorrow for the lost lenore


Further, remove all inconsequential words, like "a", "the", "we", "in", etc.:

In [8]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeSpecialChars, chars="’“”—")
corpus = tm_map(corpus, removeWords, stopwords("english")) # uses dictionary of English stopwords

as.data.frame.content(corpus)

V1
twas brillig slithy toves gyre gimble wabe mimsy borogoves mome raths outgrabe
beware jabberwock son jaws bite claws catch beware jubjub bird shun frumious bandersnatch
took vorpal sword hand long time manxome foe sought rested tumtum tree stood awhile thought
uffish thought stood jabberwock eyes flame came whiffling tulgey wood burbled came
one two one two vorpal blade went snickersnack left dead head went galumphing back
upon midnight dreary pondered weak weary many quaint curious volume forgotten lore
nodded nearly napping suddenly came tapping one gently rapping rapping chamber door
tis visitor muttered tapping chamber door nothing
ah distinctly remember bleak december separate dying ember wrought ghost upon floor
eagerly wished morrow vainly sought borrow books surcease sorrow sorrow lost lenore


Further, reduce all words to their roots (this is known as stemming):

In [9]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeSpecialChars, chars="’“”—")
corpus = tm_map(corpus, removeWords, stopwords("english")) # uses dictionary of English stopwords
corpus = tm_map(corpus, stemDocument, "english")

as.data.frame.content(corpus)

V1
twas brillig slithi tove gyre gimbl wabe mimsi borogov mome rath outgrab
bewar jabberwock son jaw bite claw catch bewar jubjub bird shun frumious bandersnatch
took vorpal sword hand long time manxom foe sought rest tumtum tree stood awhil thought
uffish thought stood jabberwock eye flame came whiffl tulgey wood burbl came
one two one two vorpal blade went snickersnack left dead head went galumph back
upon midnight dreari ponder weak weari mani quaint curious volum forgotten lore
nod near nap sudden came tap one gentl rap rap chamber door
tis visitor mutter tap chamber door noth
ah distinct rememb bleak decemb separ die ember wrought ghost upon floor
eager wish morrow vain sought borrow book surceas sorrow sorrow lost lenor


Further, collapse runs of whitespace (repeated spaces, new lines, etc.) into single spaces:

In [10]:
corpus = VCorpus(VectorSource(data$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeSpecialChars, chars="’“”—")
corpus = tm_map(corpus, removeWords, stopwords("english")) # uses dictionary of English stopwords
corpus = tm_map(corpus, stemDocument, "english")
corpus = tm_map(corpus, stripWhitespace)

as.data.frame.content(corpus)

V1
twas brillig slithi tove gyre gimbl wabe mimsi borogov mome rath outgrab
bewar jabberwock son jaw bite claw catch bewar jubjub bird shun frumious bandersnatch
took vorpal sword hand long time manxom foe sought rest tumtum tree stood awhil thought
uffish thought stood jabberwock eye flame came whiffl tulgey wood burbl came
one two one two vorpal blade went snickersnack left dead head went galumph back
upon midnight dreari ponder weak weari mani quaint curious volum forgotten lore
nod near nap sudden came tap one gentl rap rap chamber door
tis visitor mutter tap chamber door noth
ah distinct rememb bleak decemb separ die ember wrought ghost upon floor
eager wish morrow vain sought borrow book surceas sorrow sorrow lost lenor


#### Document-Term Matrix: Bag of Words

Represent the simplified dataset as a document-term matrix ("bag of words").

The document-term matrix's variables (columns) correspond to the words in the transformed corpus.  Observations (rows) correspond to documents.  Each value is the number of times a certain word appears in a certain document.

In [11]:
dtm = DocumentTermMatrix(corpus)
data.t = as.data.frame(as.matrix(dtm))
size(data.t)
data.t

observations,variables
11,109


angel,awhil,back,bandersnatch,bewar,bird,bite,blade,bleak,book,borogov,borrow,brillig,burbl,came,catch,chamber,claw,curious,dead,decemb,die,distinct,door,dreari,eager,ember,evermor,eye,flame,floor,foe,forgotten,frumious,galumph,gentl,ghost,...,rest,separ,shun,slithi,snickersnack,son,sorrow,sought,stood,sudden,surceas,sword,tap,thought,time,tis,took,tove,tree,tulgey,tumtum,twas,two,uffish,upon,vain,visitor,volum,vorpal,wabe,weak,weari,went,whiffl,wish,wood,wrought
0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,1,2,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,1,0,0,1,0,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0
0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,2,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
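
To make the counting concrete, the following is a small base-R sketch that builds a bag-of-words table for a two-document toy corpus. The toy documents and the variable names are illustrative assumptions, and the sketch skips details of `DocumentTermMatrix`, such as its default minimum word length.

```r
# Toy corpus of already-simplified documents (illustrative only).
docs = c("one two one two vorpal blade", "took vorpal sword hand")

tokens = strsplit(docs, " ")              # one character vector of words per document
vocab = sort(unique(unlist(tokens)))      # columns: every distinct word in the corpus

# Count how often each vocabulary word occurs in each document.
counts = t(sapply(tokens, function(w) as.integer(table(factor(w, levels=vocab)))))
colnames(counts) = vocab

toy.dtm = as.data.frame(counts)           # rows are documents, columns are words
toy.dtm
```

Each row sums to the number of words in that document, since every word occurrence is counted exactly once.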


#### Document-Term Matrix: n-Gram Representation

Alternatively, represent as a document-term matrix of n-grams.

The document-term matrix's variables (columns) correspond to the n-grams in the transformed corpus, where an n-gram is a sequence of n words that appear next to each other, in order.  For example, "awhil" and "bandersnatch" are 1-grams (unigrams, or just words), while "awhil thought" and "bewar jabberwock" are 2-grams (bigrams).  Observations (rows) correspond to documents.  Each value is the number of times a certain n-gram appears in a certain document.

Here we determine the document-term matrix for unigrams and bigrams.

In [12]:
dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigrams))
data.t = as.data.frame(as.matrix(dtm))

size(data.t)
data.t

observations,variables
11,225


ah distinct,angel,angel name,awhil,awhil thought,back,bandersnatch,bewar,bewar jabberwock,bewar jubjub,bird,bird shun,bite,bite claw,blade,blade went,bleak,bleak decemb,book,book surceas,borogov,borogov mome,borrow,borrow book,brillig,brillig slithi,burbl,burbl came,came,came tap,came whiffl,catch,catch bewar,chamber,chamber door,claw,claw catch,...,tumtum tree,twas,twas brillig,two,two one,two vorpal,uffish,uffish thought,upon,upon floor,upon midnight,vain,vain sought,visitor,visitor mutter,volum,volum forgotten,vorpal,vorpal blade,vorpal sword,wabe,wabe mimsi,weak,weak weari,weari,weari mani,went,went galumph,went snickersnack,whiffl,whiffl tulgey,wish,wish morrow,wood,wood burbl,wrought,wrought ghost
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,2,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
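
The `unibigrams` tokenizer comes from setup.R and is not shown here. As a sketch of the idea only, the hypothetical base-R function below lists the unigrams and bigrams of one simplified document by pairing each word with its successor.

```r
# Hypothetical helper (not the setup.R implementation): unigrams and bigrams of one document.
uni_bigrams_sketch = function(doc) {
  words = strsplit(doc, " ")[[1]]
  n = length(words)
  # Pair word i with word i+1; a one-word document has no bigrams.
  bigrams = if (n >= 2) paste(words[-n], words[-1]) else character(0)
  c(words, bigrams)
}

grams = uni_bigrams_sketch("took vorpal sword hand")
grams
```

A document of n words yields n unigrams and n-1 bigrams, which is why the n-gram matrix has roughly twice as many columns as the bag-of-words matrix.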


#### Document-Term Matrix: Pruned n-Gram Representation

Alternatively, represent as a document-term matrix of n-grams with sparse variables removed.

The document-term matrix's variables (columns) correspond to the n-grams in the transformed corpus.  Observations (rows) correspond to documents.  Each value is the number of times a certain n-gram appears in a certain document.

Additionally, variables corresponding to n-grams that appear in only a few documents are removed.   

Here we determine the document-term matrix for unigrams and bigrams.

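_Use the `removeSparseTerms` function with the `sparse` parameter to remove sparse variables.  `sparse`=$x$ means that columns for n-grams missing from less than a fraction $x$ of the documents are kept; all other columns are removed.  Here, with 11 documents and `sparse`=0.90, an n-gram is kept only if it appears in at least 2 documents._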

In [13]:
dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigrams))
dtm = removeSparseTerms(dtm, sparse=0.90)
data.t =  as.data.frame(as.matrix(dtm))

size(data.t)
data.t

observations,variables
11,13


came,chamber,chamber door,door,jabberwock,lenor,one,sought,stood,tap,thought,upon,vorpal
0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,1,1,0,1,0,1
2,0,0,0,1,0,0,0,1,0,1,0,0
0,0,0,0,0,0,2,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,1,0
1,1,1,1,0,0,1,0,0,1,0,0,0
0,1,1,1,0,0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0,0,1,0
0,0,0,0,0,1,0,1,0,0,0,0,0
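
The pruning rule can be imitated on a small matrix in base R. This is a sketch under the assumption that a term is kept when its sparsity (the fraction of documents in which it is absent) is strictly below the threshold; the toy matrix and the 0.70 threshold are made up for illustration.

```r
# Toy document-term counts: 5 documents (rows), 3 terms (columns), filled column by column.
toy = matrix(c(1,0,0,0,0,    # "rare":   present in 1 of 5 documents
               1,1,0,0,0,    # "some":   present in 2 of 5 documents
               1,1,1,1,0),   # "common": present in 4 of 5 documents
             nrow=5, dimnames=list(NULL, c("rare", "some", "common")))

sparsity = colMeans(toy == 0)                 # fraction of documents missing each term
kept = toy[, sparsity < 0.70, drop=FALSE]     # keep terms whose sparsity is below the threshold

sparsity
colnames(kept)
```

Raising the threshold toward 1 keeps more columns; lowering it keeps only terms that occur in most documents.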


#### Restore Classification

Add the class variable from the dataset in the original table form to the data in document-term matrix form.

In [14]:
data.tc = data.t
data.tc$class = data$class

size(data.tc)
data.tc

observations,variables
11,14


came,chamber,chamber door,door,jabberwock,lenor,one,sought,stood,tap,thought,upon,vorpal,class
0,0,0,0,0,0,0,0,0,0,0,0,0,A
0,0,0,0,1,0,0,0,0,0,0,0,0,A
0,0,0,0,0,0,0,1,1,0,1,0,1,A
2,0,0,0,1,0,0,0,1,0,1,0,0,A
0,0,0,0,0,0,2,0,0,0,0,0,1,A
0,0,0,0,0,0,0,0,0,0,0,1,0,B
1,1,1,1,0,0,1,0,0,1,0,0,0,B
0,1,1,1,0,0,0,0,0,1,0,0,0,B
0,0,0,0,0,0,0,0,0,0,0,1,0,B
0,0,0,0,0,1,0,1,0,0,0,0,0,B


### Model

Here we build a naive Bayes model to predict class given all other variables, using the corpus' document-term matrix.

In [15]:
model = naiveBayes(class ~ ., plain_var_names(data.tc)) # the plain_var_names function adjusts column names that include space characters
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        A         B 
0.4545455 0.5454545 

Conditional probabilities:
   came
Y        [,1]      [,2]
  A 0.4000000 0.8944272
  B 0.1666667 0.4082483

   chamber
Y        [,1]      [,2]
  A 0.0000000 0.0000000
  B 0.3333333 0.5163978

   chamber_door
Y        [,1]      [,2]
  A 0.0000000 0.0000000
  B 0.3333333 0.5163978

   door
Y        [,1]      [,2]
  A 0.0000000 0.0000000
  B 0.3333333 0.5163978

   jabberwock
Y   [,1]      [,2]
  A  0.4 0.5477226
  B  0.0 0.0000000

   lenor
Y        [,1]      [,2]
  A 0.0000000 0.0000000
  B 0.3333333 0.5163978

   one
Y        [,1]      [,2]
  A 0.4000000 0.8944272
  B 0.1666667 0.4082483

   sought
Y        [,1]      [,2]
  A 0.2000000 0.4472136
  B 0.1666667 0.4082483

   stood
Y   [,1]      [,2]
  A  0.4 0.5477226
  B  0.0 0.0000000

   tap
Y        [,1]      [,2]
  A 0.0000000 0.0000000
  B 0.3333333 0.516397
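
Because the predictors here are numeric counts, each conditional-probability table appears to report, per class, the mean in column `[,1]` and the standard deviation in column `[,2]`. The short check below recomputes both for the variable `came` in class A, using the five class-A counts shown in the table in the Restore Classification section.

```r
# Counts of "came" in the 5 class-A documents, copied from the document-term matrix.
came.A = c(0, 0, 0, 2, 0)

mean(came.A)   # compare with column [,1] of the "came" table, row A
sd(came.A)     # compare with column [,2] of the "came" table, row A
```

A variable with standard deviation 0 in one class, such as `chamber` in class A, never varies among that class's training documents.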

### Prediction

#### Simplify the Text

Simplify the new observation's text.

In [16]:
corpus = VCorpus(VectorSource(new$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation, ucp=TRUE)
corpus = tm_map(corpus, removeSpecialChars, chars="’“”—")
corpus = tm_map(corpus, removeWords, stopwords("english")) # uses dictionary of English stopwords
corpus = tm_map(corpus, stemDocument, "english")
corpus = tm_map(corpus, stripWhitespace)

as.data.frame.content(corpus)

V1
one bird sought dead tree midnight thought noth busi analyt


#### Document-Term Matrix 

Represent the new observation as a document-term matrix.

In [17]:
dtm = DocumentTermMatrix(corpus, control=list(tokenize=unibigrams))
new.t = as.data.frame(as.matrix(dtm))

size(new.t)
new.t

observations,variables
1,19


analyt,bird,bird sought,busi,busi analyt,dead,dead tree,midnight,midnight thought,noth,noth busi,one,one bird,sought,sought dead,thought,thought noth,tree,tree midnight
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


#### Make Columns Agree 

Add and/or remove columns of the new observation's document-term matrix so that they align with the corpus' document-term matrix.

Remove columns as necessary.  Add columns with 0 values as necessary.

In [18]:
new.tr = make_columns_agree(new.t, data.t)

size(new.tr)
new.tr

observations,variables
1,13


came,chamber,chamber door,door,jabberwock,lenor,one,sought,stood,tap,thought,upon,vorpal
0,0,0,0,0,0,1,1,0,0,1,0,0
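
The `make_columns_agree` function comes from setup.R and is not shown here. One way such an alignment could be written in base R is sketched below; the function name `align_columns_sketch` and the toy tables are hypothetical.

```r
# Hypothetical sketch: reshape `new` so that its columns match those of `reference`.
align_columns_sketch = function(new, reference) {
  absent = setdiff(names(reference), names(new))
  for (name in absent) new[[name]] = 0       # add columns the new table lacks, filled with 0
  new[, names(reference), drop=FALSE]        # drop extra columns and adopt the reference order
}

reference = data.frame(came=0, one=2, sought=1, check.names=FALSE)
new.toy = data.frame(bird=1, one=1, sought=1, check.names=FALSE)

aligned = align_columns_sketch(new.toy, reference)
aligned
```

Dropping columns such as `bird` means that words never seen during training cannot influence the prediction.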


#### Apply the Model

Here we predict the class of the new observation, using the model and the new observation's document-term matrix.

In [19]:
prob = predict(model, plain_var_names(new.tr), type="raw")
class.predicted = as.class(prob, "A", cutoff=0.5)

data.frame(new, class.predicted)

V1,class.predicted
One bird sought a dead tree at midnight and thought about nothing more than business analytics.,A
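
The `as.class` function is a helper that is not defined in this document. The idea of applying a cutoff to class probabilities can be sketched in base R as follows; the probabilities are invented for illustration, and treating a probability exactly equal to the cutoff as class A is an assumption.

```r
# Illustrative class probabilities for 2 observations (not produced by the model above).
prob.toy = matrix(c(0.8, 0.2,
                    0.3, 0.7),
                  nrow=2, byrow=TRUE, dimnames=list(NULL, c("A", "B")))

cutoff = 0.5

# Predict "A" when its probability reaches the cutoff, otherwise "B".
class.toy = factor(ifelse(prob.toy[, "A"] >= cutoff, "A", "B"), levels=c("A", "B"))
class.toy
```

Raising the cutoff makes the prediction of class A more conservative, which matters when the two kinds of error have different costs.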


## Code

### Useful Functions

In [20]:
# removeSpecialChars       # from setup.R
# as.data.frame.content    # from setup.R
# unibigrams               # from setup.R
# plain_var_names          # from setup.R
# make_columns_agree       # from setup.R

# help(tm_map)             # from tm library
# help(VCorpus)            # from tm library
# help(VectorSource)       # from tm library
# help(DocumentTermMatrix) # from tm library

## Expectations

Know how to:
* Convert a corpus in one-column table form to document-term matrix form, conceptually and using R.
* Build a model and make predictions based on text data, conceptually and using R.

## Further Reading

* http://www-stat.wharton.upenn.edu/~stine/mich/index.html#textanalytics
* http://www-stat.wharton.upenn.edu/~stine/mich/DM_10.pdf
* http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
* https://www.r-bloggers.com/building-wordclouds-in-r/

<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised July 17, 2020
</span>
</p>