# Intelligent Systems. Hands-on 3
# Introduction

We have used the base document [here](http://rpubs.com/rgcmme/IS-HO4) and we have added our analysis at the end of the document.

The goal of this document is to show a sample script for pattern-based entity recognition over text documents using a gazetteer. It mainly uses the **openNLP** (natural language processing), **tm** (text mining) and **SPARQL** packages in R.

# Preparation
## Check working directory
Check the working directory with ```getwd```. If it is not the one where your data are located, change it with ```setwd```.

In [1]:
# getwd()

In [2]:
# setwd("./HO3")

## Load libraries
Now we load the required libraries. Only a couple of things to mention:

- Calling the ```annotate``` function of the NLP package may require qualifying it with the package name (i.e., ```NLP::annotate```) because of a name clash with ggplot2
- The memory allocated to Java may need to be increased to avoid out-of-memory problems
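
One common way to raise the Java heap limit is to set the ```java.parameters``` option before rJava starts the JVM. The following is a minimal sketch; the 4 GB value mirrors the commented-out ```.jinit``` call in the cell below and is an assumption about the memory available on your machine.

```r
# Must run before library(rJava) (or any package that loads it),
# because the JVM reads its heap settings only once, at start-up.
options(java.parameters = "-Xmx4g")

# Check the value that the JVM will receive.
heap_setting <- getOption("java.parameters")
print(heap_setting)
```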

In [14]:
# Needed for OutOfMemoryError: Java heap space 
library(rJava)
# .jinit(parameters="-Xmx4g")
# If there are more memory problems, invoke gc() after the POS tagging

library(NLP)
library(openNLP) 
library(openNLPmodels.en)
library(tm)
library(stringr)
library(SPARQL)
library(parallel)

# Auxiliary functions
## getAnnotationsFromDocument
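```getAnnotationsFromDocument``` returns annotations for the text document: sentence, word, and part-of-speech annotations. The Penn Treebank parse annotation is commented out in the code below; uncomment those two lines (and return ```y3```) to add it.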

As an alternative, the koRpus package uses TreeTagger for POS tagging.

In [15]:
getAnnotationsFromDocument = function(doc){
  x=as.String(doc)
  sent_token_annotator <- Maxent_Sent_Token_Annotator()
  word_token_annotator <- Maxent_Word_Token_Annotator()
  pos_tag_annotator <- Maxent_POS_Tag_Annotator()
  y1 <- annotate(x, list(sent_token_annotator, word_token_annotator))
  y2 <- annotate(x, pos_tag_annotator, y1)
#  parse_annotator <- Parse_Annotator()
#  y3 <- annotate(x, parse_annotator, y2)
  return(y2)  
} 

## getAnnotatedMergedDocument
```getAnnotatedMergedDocument``` returns the text document merged with the annotations.

In [16]:
getAnnotatedMergedDocument = function(doc,annotations){
  x=as.String(doc)
  y2w <- subset(annotations, type == "word")
  tags <- sapply(y2w$features, '[[', "POS")
  r1 <- sprintf("%s/%s", x[y2w], tags)
  r2 <- paste(r1, collapse = " ")
  return(r2)  
} 

## getAnnotatedPlainTextDocument
```getAnnotatedPlainTextDocument``` returns the text document along with its annotations in an ```AnnotatedPlainTextDocument```.

In [17]:
getAnnotatedPlainTextDocument = function(doc,annotations){
  x=as.String(doc)
  a = AnnotatedPlainTextDocument(x,annotations)
  return(a)  
} 

## detectPatternOnDocument --> <font color='red'> Modified to include all matches</font> 
```detectPatternOnDocument``` returns the pattern detected on an ```AnnotatedPlainTextDocument```.

We have modified this function to return all matches of a name in a document, not only the first one. The original version is kept below as ```detectPatternOnDocument_old```.
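
The difference between "first match" and "all matches" can be illustrated on a toy string. This sketch uses base R (```regexpr``` and ```gregexpr```) as a stand-in for ```str_match``` and ```str_match_all```, so it runs without stringr; the sample text is made up for illustration.

```r
# Toy review text; the name appears twice.
text <- " the film stars al pacino and later al pacino returns with brad pitt . "
pattern <- " al pacino "

# First match only (analogous to str_match).
first_match <- regmatches(text, regexpr(pattern, text))

# Every non-overlapping match (analogous to str_match_all).
all_matches <- regmatches(text, gregexpr(pattern, text))[[1]]

print(length(first_match))
print(length(all_matches))
```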

In [18]:
detectPatternOnDocument_old <- function(doc, pattern) {
  x=as.String(doc)
  res=str_match(x,pattern)
  
  if (length(res)==1){
    return (res)
  } else {
    if (all(is.na(res[,2:length(res)])))
      return (NA)
    else {
      ret=list()
      for (i in 2:length(res)){
        ret = paste(ret,res[i])
      }
      return(ret)
    }
  }
}

In [19]:
detectPatternOnDocument <- function(doc, pattern) {
  x=as.String(doc)
  # One row per match; column 1 is the full match, further columns are capture groups
  res=str_match_all(x,pattern)[[1]]
  if (nrow(res)==0){
    return (NA)
  }
  # Use the capture groups if the pattern has any; otherwise use the full match
  if (ncol(res)>1){
    groups=res[,-1,drop=FALSE]
  } else {
    groups=res
  }
  if (all(is.na(groups))){
    return (NA)
  }
  # Join the groups of each match with spaces, and the matches with " , "
  perMatch=apply(groups, 1, function(g) paste(g[!is.na(g)], collapse = " "))
  return (paste(perMatch, collapse = " , "))
}

## detectPatternsInCorpus --> <font color='red'> Modified to call the new function </font> 
```detectPatternsInCorpus``` returns a data frame with all the patterns detected in a corpus.

In [20]:
detectPatternsInCorpus = function(corpus, patterns, type){
  vallEntities <- data.frame(matrix(NA, ncol = length(patterns)+1, 
                                    nrow = length(corpus)))
  names(vallEntities) <- c("File",patterns)
  # type == "old" keeps only the first match; any other value keeps all matches
  detect = if (type == "old") detectPatternOnDocument_old else detectPatternOnDocument
  for (i in 1:length(patterns)) {
    vallEntities[,i+1]=unlist(lapply(corpus, detect, pattern=patterns[i]))
  }
  for (i in 1:length(corpus)) {
    vallEntities$File[i]=meta(corpus[[i]])$id
  }
  return (vallEntities)  
}

## countMatchesPerColumn
```countMatchesPerColumn``` returns the number of matches per pattern/column.

It counts, for each pattern (column), the number of files (rows) with a non-NA value.

In [21]:
countMatchesPerColumn = function (df) {
  entityCountPerPattern <- data.frame(matrix(NA, ncol = 2, 
                                             nrow = length(names(df))-1))
  names(entityCountPerPattern) <- c("Entity","Count")
  
  for (i in 2:length(names(df))) {
    entityCountPerPattern$Entity[i-1] = names(df)[i]
    entityCountPerPattern$Count[i-1] = nrow(subset(df, !is.na(df[i])))
    }
  return (entityCountPerPattern)
  }

## countMatchesPerRow
```countMatchesPerRow``` returns the number of entities per file/row.

It counts, for each file (row), the number of patterns (columns) with a non-NA value. Files without any match are omitted from the result.
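
The per-pattern and per-file counts can be checked on a toy data frame with the same layout (a ```File``` column followed by one column per pattern). This is only a sketch with made-up data; it uses ```colSums``` and ```rowSums``` instead of the loops in the functions of this section.

```r
# Toy result table: NA means "pattern not found in this file".
matches <- data.frame(File = c("a.txt", "b.txt", "c.txt"),
                      stringsAsFactors = FALSE)
matches[[" al pacino "]] <- c(" al pacino ", NA, " al pacino ")
matches[[" brad pitt "]] <- c(NA, NA, " brad pitt ")

found <- !is.na(matches[-1])      # logical matrix without the File column

per_pattern <- colSums(found)     # files in which each pattern was found
per_file <- rowSums(found)        # patterns found in each file

print(per_pattern)
print(per_file)
```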

In [22]:
countMatchesPerRow = function (df) {
  entityCountPerFile <- data.frame(matrix(NA, ncol = 2, nrow = nrow(df)))
  names(entityCountPerFile) <- c("File","Count")
  
  for (i in 1:nrow(df)) {
    entityCountPerFile$File[i] = df$File[i]
    entityCountPerFile$Count[i] = length(Filter(Negate(is.na),df[i,2:length(df[i,])]))
    }
  return (entityCountPerFile[entityCountPerFile[2]!=0,])
  }

## mergeAllMatchesInLists
```mergeAllMatchesInLists``` returns a data frame with all the files and their matches in a single list per file.
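
As an illustration of the idea, the following sketch collapses each row of a toy match table into one list per file, dropping the NA cells. The data are made up, and the code is a compact equivalent for illustration rather than the function defined in the next cell.

```r
# Toy result table: NA means "pattern not found in this file".
matches <- data.frame(File = c("a.txt", "b.txt", "c.txt"),
                      stringsAsFactors = FALSE)
matches[[" al pacino "]] <- c(" al pacino ", NA, " al pacino ")
matches[[" brad pitt "]] <- c(NA, NA, " brad pitt ")

# For each row, keep the non-NA cells (without column names) as a list.
matches_per_file <- lapply(seq_len(nrow(matches)), function(i) {
  row <- unlist(matches[i, -1])
  as.list(unname(row[!is.na(row)]))
})

print(lengths(matches_per_file))
```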

In [23]:
mergeAllMatchesInLists = function (df) {
  matchesPerFile = rep(list(list()), nrow(df))

  for (i in 1:nrow(df)) {    
    matches=as.list(unname(unlist(Filter(Negate(is.na),df[i,2:length(df[i,])]))))
    matchesPerFile[[i]]=append(matchesPerFile[[i]],matches)
  }
  
  files = df[,1]
  matches = matchesPerFile
  
  allMatches<- data.frame(matrix(NA, ncol = 2, nrow = nrow(df)))
  names(allMatches) <- c("Files","Matches")
  
  allMatches$Files=files
  allMatches$Matches=matches
  
  return (allMatches)
}

## mergeGoldStandardInLists
```mergeGoldStandardInLists``` returns a data frame with all the files and the gold standard matches in a single list per file.

In [24]:
mergeGoldStandardInLists = function (df) {
  matchesPerFile = rep(list(list()), nrow(df))
  
  for (i in 1:nrow(df)) {    
    matches=as.list(unlist(Filter(Negate(is.na),df[i,2:length(df)])))
    matchesPerFile[[i]]=append(matchesPerFile[[i]],matches)
  }
  
  files = df[,1]
  matches = matchesPerFile
  
  allMatches<- data.frame(matrix(NA, ncol = 2, nrow = nrow(df)))
  names(allMatches) <- c("Files","Matches")
  
  allMatches$Files=files
  allMatches$Matches=matches
  
  return (allMatches)
}

## calculateMetrics
```calculateMetrics``` calculates precision, recall and F-measure against a gold standard. Files with no gold standard entries are skipped.
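
A small worked example may help to read the three metrics. The detected and gold standard entities below are hypothetical values for a single file, and ```f_measure``` is a helper name introduced only for this sketch.

```r
# Hypothetical detected and gold-standard entities for one file.
detected <- c("tim burton", "gem")
gold <- c("tim burton", "johnny depp", "alan moore", "heather graham")

num_correct <- length(intersect(detected, gold))
precision <- num_correct / length(detected)   # correct / all answers given
recall <- num_correct / length(gold)          # correct / all possible answers

# General F-beta; beta = 1 weighs precision and recall equally.
f_measure <- function(p, r, beta = 1) {
  ((1 + beta^2) * p * r) / (beta^2 * p + r)
}
f1 <- f_measure(precision, recall)

print(c(precision = precision, recall = recall, f1 = f1))
```

Here one of the two detected names is correct (precision 0.5) and one of the four gold standard names is found (recall 0.25).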

In [25]:
calculateMetrics = function (matches, matches.gs) {
  
  metrics<- data.frame(matrix(NA, ncol = 3, nrow = 1))
  names(metrics) <- c("Precision","Recall","Fmeasure")
  
  numCorrect = 0
  allAnswers = 0
  possibleAnswers = 0
  
  for (i in 1:nrow(matches)) {    
    if (length(matches.gs$Matches[[i]])!=0) {
      l = str_trim(unlist(matches[i,2]))
      l.gs = unname(unlist(matches.gs[i,2]))
      intersection = intersect(l, l.gs)
      numCorrect = numCorrect + length(intersection)
      allAnswers = allAnswers + length (l)
      possibleAnswers = possibleAnswers + length(l.gs)    
    }
  }
  
  metrics$Precision = numCorrect / allAnswers
  metrics$Recall = numCorrect / possibleAnswers
  
  beta = 1
  # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
  metrics$Fmeasure= ((beta^2+1) * metrics$Precision * metrics$Recall) / 
    ((beta^2*metrics$Precision) + metrics$Recall)
  
  return(metrics)
}

# Load corpus
We are going to use the **[Movie review data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)** version 2.0, created by Bo Pang and Lillian Lee.

Once unzipped, the data splits the different documents into positive and negative opinions. In this script we are going to use the positive opinions located in ```./txt_sentoken/pos```.

We load all the positive reviews, but we are only going to annotate the first 500.

In [26]:
source.pos = DirSource("./HO2/txt_sentoken/pos", encoding = "UTF-8")
corpus = Corpus(source.pos)

# Inspect corpus
Let’s take a look at the document in the first entry.

In [27]:
inspect(corpus[[1]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 4226

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
g

# Annotate corpus
We just apply the ```getAnnotationsFromDocument``` function to every document in the corpus using ```lapply```.

This step may take a long time, depending on the size of the corpus and on the annotations that we want to identify.

In [28]:
annotations = lapply(corpus[1:500], getAnnotationsFromDocument)

We can create ```AnnotatedPlainTextDocuments``` that attach the annotations to the document and store the annotated corpus in another variable (since we destroy the corpus metadata).

In [29]:
# Only the first 500 documents were annotated, so we pair them with the same subset
corpus.tagged = Map(getAnnotatedPlainTextDocument, corpus[1:500], annotations)

In [30]:
corpus.taggedText = Map(getAnnotatedMergedDocument, corpus[1:500], annotations)

# Get actor names from DBpedia
We define a query to obtain (some) actor names in DBpedia.

In [31]:
prefixT <- c("skos","http://www.w3.org/2004/02/skos/core#")

sparql_prefixT <- "
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"

qT <- paste(sparql_prefixT,"
SELECT DISTINCT ?label where {
  ?actor a <http://dbpedia.org/class/yago/Actor109765278> .
  ?actor rdfs:label ?label .
} 
LIMIT 10000
OFFSET 0
")

Let’s evaluate the query against the SPARQL endpoint.

In [32]:
endpointT <- "http://dbpedia.org/sparql"
optionsT=""

actors <- SPARQL(endpointT,qT,ns=prefixT,extra=optionsT)$results
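
The query above returns one label per language for each actor. If only English names are needed (an assumption that seems reasonable for an English-language corpus), a language filter could be added. The sketch below only builds the query string; the names ```sparql_prefix_en``` and ```q_en``` are introduced here for illustration and are not used elsewhere in this document.

```r
# Same query shape as qT, restricted to English labels.
sparql_prefix_en <- "
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"

q_en <- paste(sparql_prefix_en, "
SELECT DISTINCT ?label WHERE {
  ?actor a <http://dbpedia.org/class/yago/Actor109765278> .
  ?actor rdfs:label ?label .
  FILTER (lang(?label) = 'en')
}
LIMIT 10000
OFFSET 0
")

cat(q_en)
```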

And take a look at the output of the query.

In [33]:
length(actors)

In [34]:
actors[1:30]

label,label.1,label.2,label.3,label.4,label.5,label.6,label.7,label.8,label.9,...,label.20,label.21,label.22,label.23,label.24,label.25,label.26,label.27,label.28,label.29
"""Megan Lawrence""@it","""Megan Lawrence""@en","""Barry James""@it","""Barry James""@en","""Al Pacino""@en","""آل باتشينو""@ar","""Al Pacino""@de","""Al Pacino""@es","""Al Pacino""@fr","""Al Pacino""@it",...,"""Alan Rickman""@fr","""Alan Rickman""@it","""アラン・リックマン""@ja","""Alan Rickman""@nl","""Alan Rickman""@pl","""Alan Rickman""@pt","""Рикман, Алан""@ru","""艾倫·瑞克曼""@zh","""Albert Finney""@en","""ألبرت فيني""@ar"


# Clean the query result
We need to clean the output of the query. We need to:

- Keep only the text inside the quotes
- Remove parenthesized suffixes
- Remove duplicates
- Replace “.” with a space, since it is a special character in regular expressions
- Convert all letters to lowercase
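
The steps above can be traced on a handful of labels. The first three labels follow the format of the query output; the last two are hypothetical and only serve to exercise the parentheses and "." steps. The sketch uses base R on a character vector instead of ```mclapply```.

```r
labels <- c('"Al Pacino"@en', '"Al Pacino"@de', '"Barry James"@it',
            '"John Smith (actor)"@en', '"Samuel L. Jackson"@en')

# 1. Keep the text between the quotes (second piece after splitting on ").
inside <- vapply(strsplit(labels, '"'), function(p) p[2], character(1))

# 2. Drop a parenthesized suffix such as " (actor)".
no_parens <- vapply(strsplit(inside, " \\("), function(p) p[1], character(1))

# 3. Remove duplicates, 4. replace "." with a space, 5. lowercase.
cleaned <- tolower(gsub("\\.", " ", unique(no_parens)))

print(cleaned)
```

Note that replacing "." with a space leaves two consecutive spaces in names with initials.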

In [35]:
actors.2 <- mclapply(actors, function(x) strsplit(x,'"')[[1]][2])
actors.3 <- mclapply(actors.2, function(x) strsplit(x,' \\(')[[1]][1])
actor.names <- unique(actors.3)
actor.names <- mclapply(actor.names, gsub, pattern="\\.", replacement=" ")
actor.names <- mclapply(actor.names, tolower)
length(actor.names)

In [36]:
head(actor.names,10)

# Write gazetteer to a file
Now we write the gazetteer to a file.

In [37]:
write.table(unlist(actor.names), file = "gazetteer.txt", row.names = F, col.names = F, na="", sep=";")

# Detect patterns
We add a space on each side of every name, so that only full words match.

And we detect the patterns in the corpus.
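
The effect of the padding can be seen on a toy sentence. The text below is made up, and ```grepl``` with ```fixed = TRUE``` is used here only to keep the sketch in base R.

```r
# Lowercased, space-tokenized text, as in the movie review corpus.
text <- " the gemstone heist stars brad pitt . "

unpadded_hit <- grepl("gem", text, fixed = TRUE)          # matches inside "gemstone"
padded_hit <- grepl(" gem ", text, fixed = TRUE)          # no standalone word "gem"
name_hit <- grepl(" brad pitt ", text, fixed = TRUE)      # full-word match

print(c(unpadded_hit, padded_hit, name_hit))
```

A limitation of this padding is that a name placed directly before or after a line break is not surrounded by spaces and would not be matched.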

In [39]:
pattern.an <- mclapply(actor.names, function(x) return(paste(" ",x," ",sep = "")))
pattern.an=unlist(pattern.an)

# There is some actor named "you" that is spoiling our results; we remove it
pattern.an = pattern.an[grep("^ you $", pattern.an, invert = TRUE)]

matches.an = detectPatternsInCorpus(corpus, pattern.an, "old")

Let’s see how many patterns we have found per file.

In [40]:
countMatchesPerRow(matches.an) 

Unnamed: 0,File,Count
1,cv000_29590.txt,1
3,cv002_15918.txt,1
4,cv003_11664.txt,1
5,cv004_11636.txt,1
6,cv005_29443.txt,1
7,cv006_15448.txt,1
8,cv007_4968.txt,1
10,cv009_29592.txt,1
11,cv010_29198.txt,1
15,cv014_13924.txt,1


Let’s see which patterns we have found.

In [41]:
countColum = countMatchesPerColumn(matches.an) 
countColum[countColum$Count != 0,]

Unnamed: 0,Entity,Count
3,al pacino,6
8,alan rickman,4
13,albert finney,1
18,alex cox,1
25,andie macdowell,3
34,antonio banderas,5
43,ashley judd,6
53,ava gardner,1
69,blake edwards,1
75,brad pitt,11


Now we write the results to a file.

In [42]:
write.table(matches.an, file = "allEntitiesGazetteer.csv", row.names = F, na="", sep=";")

# Evaluate using gold standard
Let’s put all matches in a list for comparison with a gold standard.

In [43]:
allMatches = mergeAllMatchesInLists(matches.an)
head(allMatches,10)

Files,Matches
cv000_29590.txt,tim burton
cv001_18431.txt,
cv002_15918.txt,meg ryan
cv003_11664.txt,paul newman
cv004_11636.txt,bruce lee
cv005_29443.txt,eriq ebouaney
cv006_15448.txt,jennifer lien
cv007_4968.txt,woody allen
cv008_29435.txt,
cv009_29592.txt,gem


Now we load the gold standard and put all gold standard matches in a list for comparison.

In [44]:
goldStandard = read.table(file = "goldStandard.csv", quote = "", na.strings=c(""), colClasses="character", sep=";")

allMatchesGold = mergeGoldStandardInLists(goldStandard)
head(allMatchesGold,10)

Files,Matches
cv000_29590.txt,"alan moore , eddie campbell , moore , campbell , jack , michael jackson , albert , allen hughes , peter godley , robbie coltrane , frederick abberline, johnny depp , abberline , mary kelly , heather graham , terry hayes , rafael yglesias , steve guttenberg , tim burton , marilyn manson , peter deming , martin childs , depp , ians holm , joe gould , richardson , graham"
cv001_18431.txt,"matthew broderick , reese witherspoon , george washington carver, tracy flick , paul , max fischer , bill murray , broderick , witherspoon , jessica campbell , tammy , rooney , campbell , alexander payne , tracy , m"
cv002_15918.txt,"ryan , hanks , tom hanks , joe fox , meg ryan , kathleen kelley, fox , kelley"
cv003_11664.txt,"john williams , steven spielberg, spielberg , williams , martin brody , roy scheider , larry vaughn , murray hamilton , brody , matt hooper , richard dreyfuss, hooper , vaughn , quint , robert shaw , hitchcock , scheider , dreyfuss , shaw , robert redford , paul newman , duddy kravitz , ahab"
cv004_11636.txt,"herb , jackie chan , barry sanders , sanders , jackie , chan , bruce lee , tim allen , lawrence kazdan, john williams , spielberg , george lucas"
cv005_29443.txt,"raoul peck , lumumba , patrice lumumba , eriq ebouaney , helmer peck , peck , pascal bonitzer , patrice , joseph kasa vubu, maka kotto , moise tschombe , pascal nzonzi"
cv006_15448.txt,"tony kaye , edward norton , norton , derek vinyard , danny , edward furlong , beverly dangelo, davin , jennifer lien , derek , kaye , avery brooks , furlong , dangelo , lien"
cv007_4968.txt,"betsy , molly ringwald , alan alda , ringwald , alda , dylan walsh , walsh , madeline kahn , ally sheedy , sheedy , anthony lapaglia, lapaglia , stevie dee , robert de niro , alec baldwin , de niro , joe pesci , catherine ohara , woody allen"
cv008_29435.txt,"lumumba , janssens , rudi delhem , moise tshombe , pascal nzonzi , mobutu , joseph kasa vubu, maka kotto , peck , bonitzer , ebouaney"
cv009_29592.txt,"schwartznager, stallone , van damme , rongguang yu , wong fei-hong, jackie chan , fei-hong , sze-man tsang, wong kei-ying, yen chi dan , yuen wo ping , fox"


Finally, we calculate the metrics.

In [45]:
metrics = calculateMetrics(allMatches, allMatchesGold)
metrics

Precision,Recall,Fmeasure
0.7233704,0.03407474,0.06508368
