# Intelligent Systems. Hands-on 2
# Introduction

We used the base document [here](https://rpubs.com/rgcmme/IS-HO2) and added our analysis at the end. We have also marked important changes to the original document in <font color='red'> **red** </font>.

The goal of this document is to show how to perform different annotations (word, sentence, part-of-speech, and Penn Treebank parse) over text documents using the **openNLP** (natural language processing) and the **tm** (text mining) packages in R.

# Preparation
## Check working directory
Check the working directory with ```getwd```. If it is not the one where your data are located, change it with ```setwd```.

In [41]:
# getwd()

In [5]:
# setwd("./HO2")

## Load libraries
Now we load the required libraries. Only a couple of things to mention:

- Calling the ```annotate``` function used with openNLP may require explicitly including the package name (i.e., ```NLP::annotate```) due to a name clash with ggplot2.
- The memory allocated to Java may need to be increased to avoid out-of-memory problems.
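As a minimal sketch of the memory point, the JVM heap size can be requested through the ```java.parameters``` option. The ```-Xmx4g``` value is only an illustrative assumption, and the option has to be set before rJava (or any package that starts the JVM) is loaded.

```r
# Request a larger Java heap; "-Xmx4g" is an illustrative value, not a recommendation.
# This must run before library(rJava) / library(openNLP), because the JVM reads
# the option only once, when it starts.
options(java.parameters = "-Xmx4g")

# Confirm the option is set for this session.
print(getOption("java.parameters"))
```

If the JVM is already running, the session has to be restarted for a new value to take effect.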

In [7]:
# Needed for OutOfMemoryError: Java heap space 
library(rJava)
#.jinit(parameters="-Xmx8g")
# If there are more memory problems, invoke gc() after the POS tagging

# The openNLPmodels.en library is not in CRAN; it has to be installed from another repository
#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at")

library(NLP)
library(openNLP) 
library(openNLPmodels.en)
library(tm)
library(koRpus)
library(koRpus.lang.en)
library(gdata)
library(plyr)

# Auxiliary functions
## <font color='red'> getAnnotationsFromDocument </font> & <font color='red'> getAnnotationsFromDocument_Korp </font>
```getAnnotationsFromDocument``` returns annotations for the text document: word, sentence, part-of-speech, and Penn Treebank parse annotations.

As an alternative, the koRpus package uses TreeTagger for POS tagging.

We are going to use ```getAnnotationsFromDocument_Korp```, which runs koRpus with TreeTagger and the BNC tagset, detailed [here](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and [here](http://www.natcorp.ox.ac.uk/docs/c5spec.html).

In [9]:
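getAnnotationsFromDocument_Korp = function(doc){
    # NOTE: the `doc` argument is not used. The input file is hard-coded to the
    # first positive review, so this function always annotates that document.
    tagged.text <- treetag("./txt_sentoken/pos/cv000_29590.txt",  treetagger="manual",
      lang="en",
      TT.options=list(
        path="C:/TreeTagger",        # local TreeTagger installation directory
        preset="en",
        params = "english-bcn.par"   # TreeTagger parameter file (BNC tagset)
      ),
      doc_id="sample"
    )
#  parse_annotator <- Parse_Annotator()
#  y3 <- annotate(x, parse_annotator, y2)
  return(tagged.text)  
} 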

In [10]:
getAnnotationsFromDocument = function(doc){
  x=as.String(doc)
  sent_token_annotator <- Maxent_Sent_Token_Annotator()
  word_token_annotator <- Maxent_Word_Token_Annotator()
  pos_tag_annotator <- Maxent_POS_Tag_Annotator()
  y1 <- annotate(x, list(sent_token_annotator, word_token_annotator))
  y2 <- annotate(x, pos_tag_annotator, y1)
  parse_annotator <- Parse_Annotator()
  y3 <- annotate(x, parse_annotator, y2)
  return(y3)  
} 

## getAnnotatedMergedDocument
```getAnnotatedMergedDocument``` returns the text document merged with the annotations.

In [11]:
getAnnotatedMergedDocument = function(doc,annotations){
  x=as.String(doc)
  y2w <- subset(annotations, type == "word")
  tags <- sapply(y2w$features, '[[', "POS")
  r1 <- sprintf("%s/%s", x[y2w], tags)
  r2 <- paste(r1, collapse = " ")
  return(r2)  
} 
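The merging step can be illustrated without openNLP. The sketch below uses two hand-written toy vectors (tokens and tags taken from the first sentence of the review) in place of the real annotation object, and applies the same ```sprintf``` and ```paste``` calls.

```r
# Toy stand-ins for x[y2w] (the word tokens) and the extracted POS tags.
tokens <- c("films", "adapted", "from", "comic", "books")
tags   <- c("NNS", "VBD", "IN", "JJ", "NNS")

# Pair each token with its tag, then join everything into one string.
pairs  <- sprintf("%s/%s", tokens, tags)
merged <- paste(pairs, collapse = " ")

print(merged)
# [1] "films/NNS adapted/VBD from/IN comic/JJ books/NNS"
```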

## getAnnotatedPlainTextDocument
```getAnnotatedPlainTextDocument``` returns the text document along with its annotations in an ```AnnotatedPlainTextDocument```.

In [12]:
getAnnotatedPlainTextDocument = function(doc,annotations){
  x=as.String(doc)
  a = AnnotatedPlainTextDocument(x,annotations)
  return(a)  
} 

# Load corpus
We are going to use the **[Movie review data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)** version 2.0, created by Bo Pang and Lillian Lee.

Once unzipped, the data splits the different documents into positive and negative opinions. In this script we are going to use the positive opinions located in ```./txt_sentoken/pos```.

In [13]:
source.pos = DirSource("./txt_sentoken/pos", encoding = "UTF-8")
corpus = Corpus(source.pos)

# Inspect corpus
Let’s take a look at the document in the first entry.

In [14]:
inspect(corpus[[1]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 4226

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
g

# Annotate corpus - openNLP - Penn treebank
We apply the ```getAnnotationsFromDocument``` function to documents in the corpus using ```lapply```.

This step may take a long time depending on the size of the corpus and on the annotations that we want to identify.

For that reason, we annotate only the first document here (```corpus[1]```); the later analysis focuses on its first two sentences, tagged with both methods.

In [15]:
annotations = lapply(corpus[1], getAnnotationsFromDocument)

The first annotations are sentence annotations. They indicate where each sentence starts and ends. In ```constituents``` we can access the tokens in the sentence (and check how many tokens it has). In ```parse``` we can access the parse tree.

In [16]:
head(annotations[[1]])

 id type     start end  features
  1 sentence     1  265 constituents=<<integer,54>>, parse=<<character,1>>
  2 sentence   268  439 constituents=<<integer,36>>, parse=<<character,1>>
  3 sentence   442  591 constituents=<<integer,27>>, parse=<<character,1>>
  4 sentence   594  797 constituents=<<integer,44>>, parse=<<character,1>>
  5 sentence   800  939 constituents=<<integer,28>>, parse=<<character,1>>
  6 sentence   942 1299 constituents=<<integer,70>>, parse=<<character,1>>
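The ```start``` and ```end``` columns are character offsets into the document text. As a small base-R sketch of how such offsets map back to text, the example below uses a made-up two-sentence string and hand-computed offsets (both are assumptions for illustration, not output of openNLP).

```r
# A made-up document with two sentences, tokenized with spaces like the corpus.
text <- "first sentence . second one ."

# Hand-computed sentence spans, in the same start/end form as the annotations.
spans <- data.frame(
  type  = c("sentence", "sentence"),
  start = c(1, 18),
  end   = c(16, 29),
  stringsAsFactors = FALSE
)

# substring() is vectorized, so it extracts every span in one call.
sentences <- substring(text, spans$start, spans$end)
print(sentences)
# [1] "first sentence ." "second one ."
```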

Word annotations are also defined. They indicate where each word starts, where it ends, and its part-of-speech tag.

In [17]:
tail(annotations[[1]])

 id  type start end  features
 844 word  4189 4197 POS=NN
 845 word  4199 4199 POS=,
 846 word  4201 4208 POS=NN
 847 word  4210 4212 POS=CC
 848 word  4214 4217 POS=NN
 849 word  4219 4225 POS=NN

We can create ```AnnotatedPlainTextDocument``` objects that attach the annotations to the document. We store the annotated corpus in a separate variable because this step discards the corpus metadata.

In [18]:
# annotations only covers corpus[1], so we pair it with the same subset
corpus.tagged = Map(getAnnotatedPlainTextDocument, corpus[1], annotations)
corpus.tagged[[1]] 

<<AnnotatedPlainTextDocument>>
Metadata:  0
Annotations:  length: 849
Content:  chars: 4226

We can also store all the annotations inline with the text. Again, we keep the result in a separate variable because the corpus metadata is discarded.

In [19]:
corpus.taggedText = Map(getAnnotatedMergedDocument, corpus[1], annotations)
corpus.taggedText[[1]] 

# Annotate corpus - koRpus - BNC Basic (C5) 
We apply the ```getAnnotationsFromDocument_Korp``` function to the first document in the corpus using ```lapply```.

This step may take a long time depending on the size of the corpus and on the annotations that we want to identify.

In [21]:
annotations2 = lapply(corpus[1], getAnnotationsFromDocument_Korp)

"Invalid tag(s) found: NN2, PRP, AJ0, VHB, PNI, PRF, NN1, PUN, CJS, PNP, VBB, PUL, PUR, CJC, AT0, EX0, AV0, VVB, NP0, PNQ, CRD, NN0, TO0, VVI, VM0, VBI, PUQ, DT0, CJT, VDB, XX0, DPS, AVP, AJC, AVQ, ORD, VDD, ZZ0
  This is probably due to a missing tag in kRp.POS.tags() and
  needs to be fixed. It would be nice if you could forward the
"

koRpus marks the end of a sentence with the ```SENT``` tag.

We extract the first sentence:

In [22]:
annotations2[[1]][annotations2[[1]][,"idx"]<annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][1,"idx"]]

doc_id,token,tag,lemma,lttr,wclass,desc,stop,stem,idx,sntc
sample,films,NN2,film,5,unknown,,,,1,1
sample,adapted,VVN,adapt,7,verb,,,,2,1
sample,from,PRP,from,4,unknown,,,,3,1
sample,comic,AJ0,comic,5,unknown,,,,4,1
sample,books,NN2,book,5,unknown,,,,5,1
sample,have,VHB,have,4,unknown,,,,6,1
sample,had,VHN,have,3,verb,,,,7,1
sample,plenty,PNI,plenty,6,unknown,,,,8,1
sample,of,PRF,of,2,unknown,,,,9,1
sample,success,NN1,success,7,unknown,,,,10,1
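The indexing expression above is dense, so here is the same idea on a toy data frame: find the ```idx``` of each ```SENT``` row, then keep the rows before the first one (or between two consecutive ones). The data frame is a hand-written assumption that only mimics the ```token```, ```tag``` and ```idx``` columns of the koRpus output.

```r
# Toy tagged tokens with the same column names as the koRpus output.
tagged <- data.frame(
  token = c("films", "have", "success", ".", "for", "starters", "."),
  tag   = c("NN2", "VHB", "NN1", "SENT", "PRP", "NN2", "SENT"),
  idx   = 1:7,
  stringsAsFactors = FALSE
)

# Positions of the sentence-final tokens.
sent_ends <- tagged$idx[tagged$tag == "SENT"]

# First sentence: everything before the first SENT row.
first_sentence <- tagged[tagged$idx < sent_ends[1], ]

# Second sentence: everything strictly between the first and second SENT rows.
second_sentence <- tagged[tagged$idx > sent_ends[1] & tagged$idx < sent_ends[2], ]

print(first_sentence$token)
print(second_sentence$token)
```

Storing the ```SENT``` positions in a variable avoids repeating the lookup inside every condition.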


# Access annotated documents
There are functions for accessing parts of an ```AnnotatedPlainTextDocument```.

In [23]:
doc = corpus.tagged[[1]] 
doc

<<AnnotatedPlainTextDocument>>
Metadata:  0
Annotations:  length: 849
Content:  chars: 4226

To access the text representation of the document:

In [24]:
as.character(doc)

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as 

To access its words:

In [25]:
head(words(doc))

To access its sentences:

In [26]:
head(sents(doc),3)

To access its tagged words:

In [27]:
head(tagged_words(doc))

films/NNS
adapted/VBD
from/IN
comic/JJ
books/NNS
have/VBP
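Once the tags are available as a character vector, base R can summarize them. The sketch below counts tag frequencies for the six tags shown above; the vector is typed in by hand as an assumption, rather than extracted from ```doc```.

```r
# The six Penn Treebank tags printed above, entered by hand.
tags <- c("NNS", "VBD", "IN", "JJ", "NNS", "VBP")

# Count how often each tag occurs and sort the most frequent first.
tag_counts <- sort(table(tags), decreasing = TRUE)
print(tag_counts)
```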

To access its tagged sentences:

In [28]:
head(tagged_sents(doc),3)

[[1]]
films/NNS
adapted/VBD
from/IN
comic/JJ
books/NNS
have/VBP
had/VBN
plenty/NN
of/IN
success/NN
,/,
whether/IN
they/PRP
're/VBP
about/IN
superheroes/NNS
(/-LRB-
batman/NN
,/,
superman/NN
,/,
spawn/NN
)/-RRB-
,/,
or/CC
geared/VBN
toward/IN
kids/NNS
(/-LRB-
casper/NN
)/-RRB-
or/CC
the/DT
arthouse/NN
crowd/NN
(/-LRB-
ghost/NN
world/NN
)/-RRB-
,/,
but/CC
there/EX
's/VBZ
never/RB
really/RB
been/VBN
a/DT
comic/JJ
book/NN
like/IN
from/IN
hell/NN
before/IN
./.

[[2]]
for/IN
starters/NNS
,/,
it/PRP
was/VBD
created/VBN
by/IN
alan/NN
moore/NN
(/-LRB-
and/CC
eddie/JJ
campbell/NN
)/-RRB-
,/,
who/WP
brought/VBD
the/DT
medium/NN
to/TO
a/DT
whole/JJ
new/JJ
level/NN
in/IN
the/DT
mid/JJ
'80s/NNS
with/IN
a/DT
12-part/JJ
series/NN
called/VBN
the/DT
watchmen/NNS
./.

[[3]]
to/TO
say/VB
moore/NN
and/CC
campbell/NN
thoroughly/RB
researched/VBD
the/DT
subject/NN
of/IN
jack/NN
the/DT
ripper/NN
would/MD
be/VB
like/IN
saying/VBG
michael/NN
jackson/NN
is/VBZ
starting/VBG
to/TO
look/VB
a/DT
little/JJ
odd/JJ
./.

To access the parse trees of its sentences:

In [29]:
head(parsed_sents(doc),3)

[[1]]
(TOP
  (S
    (S
      (NP
        (NP (NNS films))
        (VP (VBN adapted) (PP (IN from) (NP (JJ comic) (NNS books)))))
      (VP
        (VP
          (VBP have)
          (VP
            (VBN had)
            (NP (NP (NN plenty)) (PP (IN of) (NP (NN success))))
            (, ,)
            (SBAR
              (IN whether)
              (S
                (NP (PRP they))
                (VP
                  (VBP 're)
                  (PP
                    (IN about)
                    (NP
                      (NP (NNS superheroes))
                      (PRN
                        (-LRB- -LRB-)
                        (NP
                          (NP (NN batman))
                          (, ,)
                          (NP (NN superman))
                          (, ,)
                          (NP (NN spawn)))
                        (-RRB- -RRB-)))))))))
        (, ,)
        (CC or)
        (VP
          (VBN geared)
          (PP
            (IN toward)
        

# Analysis and comparison: koRpus vs. openNLP

We are going to analyze the first two sentences with [The BNC Basic (C5) Tagset](http://www.natcorp.ox.ac.uk/docs/c5spec.html) and Penn Treebank.

First we extract the sentences. We then use Excel to join both analyses, and finally we reload the resulting file in order to analyze it.
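For reference, a join of this kind can also be sketched in R with ```merge```. The two data frames below are hand-written toy inputs (three tokens with tags copied from the outputs in this document), and the sketch assumes both taggers produced the same tokens at the same positions, which does not always hold on the real data.

```r
# Toy per-tagger results; idx is the token position within the sentence.
penn <- data.frame(
  idx       = 1:3,
  token     = c("films", "adapted", "from"),
  Penn.code = c("NNS", "VBD", "IN"),
  stringsAsFactors = FALSE
)
bnc <- data.frame(
  idx      = 1:3,
  token    = c("films", "adapted", "from"),
  BNC.code = c("NN2", "VVN", "PRP"),
  stringsAsFactors = FALSE
)

# Join on position and token so that mismatched rows are dropped, not misaligned.
side_by_side <- merge(penn, bnc, by = c("idx", "token"))
print(side_by_side)
```

Rows that fail to match are silently dropped by the default inner join, so comparing ```nrow(side_by_side)``` with the input sizes is a quick check for tokenization differences.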

# Sentence 1
## openNLP

In [30]:
head(tagged_sents(doc),1)

[[1]]
films/NNS
adapted/VBD
from/IN
comic/JJ
books/NNS
have/VBP
had/VBN
plenty/NN
of/IN
success/NN
,/,
whether/IN
they/PRP
're/VBP
about/IN
superheroes/NNS
(/-LRB-
batman/NN
,/,
superman/NN
,/,
spawn/NN
)/-RRB-
,/,
or/CC
geared/VBN
toward/IN
kids/NNS
(/-LRB-
casper/NN
)/-RRB-
or/CC
the/DT
arthouse/NN
crowd/NN
(/-LRB-
ghost/NN
world/NN
)/-RRB-
,/,
but/CC
there/EX
's/VBZ
never/RB
really/RB
been/VBN
a/DT
comic/JJ
book/NN
like/IN
from/IN
hell/NN
before/IN
./.


## koRpus
 
We write the sentence to a CSV file so that we can analyze it in Excel.

In [31]:
write.csv(annotations2[[1]][annotations2[[1]][,"idx"]<annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][1,"idx"]],file = "MyData.csv")

In [32]:
annotations2[[1]][annotations2[[1]][,"idx"]<annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][1,"idx"]]

doc_id,token,tag,lemma,lttr,wclass,desc,stop,stem,idx,sntc
sample,films,NN2,film,5,unknown,,,,1,1
sample,adapted,VVN,adapt,7,verb,,,,2,1
sample,from,PRP,from,4,unknown,,,,3,1
sample,comic,AJ0,comic,5,unknown,,,,4,1
sample,books,NN2,book,5,unknown,,,,5,1
sample,have,VHB,have,4,unknown,,,,6,1
sample,had,VHN,have,3,verb,,,,7,1
sample,plenty,PNI,plenty,6,unknown,,,,8,1
sample,of,PRF,of,2,unknown,,,,9,1
sample,success,NN1,success,7,unknown,,,,10,1


We can see that both POS taggers have identified the same sentence.

# Sentence 2
## openNLP

In [33]:
tagged_sents(doc)[2]

[[1]]
for/IN
starters/NNS
,/,
it/PRP
was/VBD
created/VBN
by/IN
alan/NN
moore/NN
(/-LRB-
and/CC
eddie/JJ
campbell/NN
)/-RRB-
,/,
who/WP
brought/VBD
the/DT
medium/NN
to/TO
a/DT
whole/JJ
new/JJ
level/NN
in/IN
the/DT
mid/JJ
'80s/NNS
with/IN
a/DT
12-part/JJ
series/NN
called/VBN
the/DT
watchmen/NNS
./.


## koRpus

Again, we write the sentence to a file.

In [34]:
write.csv(annotations2[[1]][(annotations2[[1]][,"idx"]>annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][1,"idx"]) 
                  & (annotations2[[1]][,"idx"]<annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][2,"idx"])], file = "MyData2.csv")

In [35]:
annotations2[[1]][(annotations2[[1]][,"idx"]>annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][1,"idx"]) 
                  & (annotations2[[1]][,"idx"]<annotations2[[1]][annotations2[[1]][,"tag"]=="SENT",][2,"idx"])]

Unnamed: 0,doc_id,token,tag,lemma,lttr,wclass,desc,stop,stem,idx,sntc
55,sample,for,PRP,for,3,unknown,,,,55,2
56,sample,starters,NN2,starter,8,unknown,,,,56,2
57,sample,",",PUN,",",1,unknown,,,,57,2
58,sample,it,PNP,it,2,unknown,,,,58,2
59,sample,was,VBD,be,3,verb,,,,59,2
60,sample,created,VVN,create|created,7,verb,,,,60,2
61,sample,by,PRP,by,2,unknown,,,,61,2
62,sample,alan,AJ0,alan,4,unknown,,,,62,2
63,sample,moore,NP0,moore,5,unknown,,,,63,2
64,sample,(,PUL,(,1,unknown,,,,64,2


Again, both POS taggers seem to have identified the same sentence. Now we read the file containing our manual analysis.

In [36]:
analyze_data <- read.xls ("Sentences.xlsx", sheet = 3, header = TRUE)
analyze_data

Sentence,Penn.code,Penn.meaning,BCN.code,BCN.Meaning,Penn.correct,BCN.Correct
films,NNS,"Noun, plural",NN2,"Plural common noun (e.g. pencils, geese, times, revelations)",Yes,Yes
adapted,VBD,"Verb, past tense",VVN,"The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)",Yes,Yes
from,IN,Preposition or subordinating conjunction,PRP,"Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)",Yes,Yes
comic,JJ,Adjective,AJ0,"Adjective (general or positive) (e.g. good, old, beautiful)",Yes,Yes
books,NNS,"Noun, plural",NN2,"Plural common noun (e.g. pencils, geese, times, revelations)",Yes,Yes
have,VBP,"Verb, non-3rd person singular present",VHB,"The finite base form of the verb HAVE: have, 've",Yes,Yes
had,VBN,"Verb, past participle",VHN,The past participle form of the verb HAVE: had,Yes,Yes
plenty,NN,"Noun, singular or mass",PNI,"Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) [N.B. This tag applies to words which always function as [heads of] noun phrases. Words like some and these, which can also occur before a noun head in an article-like function, are tagged as determiners (see DT0 and AT0 above).]",Yes,Yes
of,IN,Preposition or subordinating conjunction,PRF,"The preposition of. Because of its frequency and its almost exclusively postnominal function, of is assigned a special tag of its own.",Yes,Yes
success,NN,"Noun, singular or mass",NN1,"Singular common noun (e.g. pencil, goose, time, revelation)",Yes,Yes


## Accuracy Calculation

We were asked to calculate the accuracy of each tagger:

\begin{align}
\text{accuracy} & = \frac{TP + TN}{\text{Total population}}
\end{align}

However, each token is simply tagged correctly or incorrectly, so there are no true negatives to count. We therefore use the proportion of correctly tagged tokens:

\begin{align}
\text{proportion correct} & = \frac{TP}{\text{Total population}}
\end{align}
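As a small worked example of this ratio, the sketch below uses a made-up vector of ten Yes/No judgements (an assumption for illustration, not our real data) in the same form as the ```Penn.correct``` and ```BCN.Correct``` columns.

```r
# Ten made-up correctness judgements: 8 correct tags, 2 incorrect ones.
correct <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes")

# TP is the number of "Yes" entries; the total population is the vector length.
tp    <- sum(correct == "Yes")
total <- length(correct)
proportion_correct <- tp / total

print(proportion_correct)
# [1] 0.8
```

```mean(correct == "Yes")``` gives the same value in a single call.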

## Penn Treebank

In [37]:
sum(analyze_data[,"Penn.correct"]=="Yes")/length(analyze_data[,"Penn.correct"])

If we look at the errors, we obtain the following:

In [38]:
analyze_data[which(analyze_data[,"Penn.correct"]=="No"),]

Unnamed: 0,Sentence,Penn.code,Penn.meaning,BCN.code,BCN.Meaning,Penn.correct,BCN.Correct
53,before,IN,Preposition or subordinating conjunction,AV0,"General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest. [Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative.This is because of the relative rarity of comparative and superlative adverbs.]",No,Yes
65,eddie,JJ,Adjective,NN1,"Singular common noun (e.g. pencil, goose, time, revelation)",No,Yes


For the **first one**, the sub-sentence is

*<center> but there 's never really been a comic book like from hell before </center>*

Here the tagger classifies *before* as a preposition when it is an adverb.

The **second** sub-sentence is

*<center> ( and eddie campbell ) </center>*

Here the tagger treats the first part of a compound name (*eddie*) as an adjective modifying a noun.

## BNC

In [39]:
sum(analyze_data[,"BCN.Correct"]=="Yes")/length(analyze_data[,"BCN.Correct"])

If we look at the errors, we obtain the following:

In [40]:
analyze_data[which(analyze_data[,"BCN.Correct"]=="No"),]

Unnamed: 0,Sentence,Penn.code,Penn.meaning,BCN.code,BCN.Meaning,Penn.correct,BCN.Correct
50,like,IN,Preposition or subordinating conjunction,VVB,"The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive]",Yes,No
61,alan,NN,"Noun, singular or mass",AJ0,"Adjective (general or positive) (e.g. good, old, beautiful)",Yes,No
81,-,-,-,POS,"The possessive or genitive marker 's or ' (e.g. for 'Peter's or somebody else's', the sequence of tags is: NP0 POS CJC PNI AV0 POS)",Yes,No
82,'80s,NNS,"Noun, plural",CRD,"Cardinal number (e.g. one, 3, fifty-five, 3609)",Yes,No


For the **first one**, the sub-sentence is

*<center> but there 's never really been a comic book like from hell before </center>*

Here the tagger classifies *like* as a verb when it is working as a preposition in the sentence.

The **second** sub-sentence is

*<center> it was created by alan moore  </center>*

Here the tagger treats the first part of a compound name (*alan*) as an adjective modifying a noun, as in the previous case with Penn Treebank.

The **third** sub-sentence is

*<center> who brought the medium to a whole new level in the mid - '80s with a 12-part series called the watchmen </center>*

Here the tagger identifies '80s as a number, whereas we read it as a noun in this sentence (as the other tagger does). The extra POS row just before it appears to come from koRpus splitting off the leading apostrophe.

# Conclusion

The BNC tagset is far more detailed in its annotation than Penn Treebank, but both taggers perform similarly on the sentences we selected. Both also seem to have trouble identifying compound names.

Finally, the BNC tagging labelled '80s as a number rather than a noun. It seems that openNLP does not split a word that starts with ', whereas koRpus does.