# More NLP
A look at some low-level natural language processing packages.

First, get the packages installed. Before doing anything, see if Java works on your system.


In [2]:
#install.packages("RCurl")
#install.packages("rJava")
library(rJava)

If library(rJava) fails to load, the Java installation needs to be fixed before going any further.
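As a rough first check, base R can look for a java executable on the PATH. This is only a sketch: finding java this way does not guarantee that rJava will load (it may still need a full JDK, or R may need to be reconfigured with R CMD javareconf), but a missing executable is a strong hint about what is wrong. The variable names are made up for illustration.

```r
# Look for a java executable on the PATH (base R only, no packages needed).
java_path <- Sys.which("java")

# Sys.which returns "" when nothing is found.
java_found <- nzchar(java_path)

if (java_found) {
  message("java found at: ", java_path)
} else {
  message("No java executable on the PATH; rJava is unlikely to load.")
}
```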

Next, install some packages. (This is only needed if they are not already installed.)

install.packages can take a vector of package names, as shown below.

In [2]:
#install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
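One way to skip packages that are already there is to check first. The helper below is a hypothetical sketch (missing_packages is not part of any library); it relies on system.file returning an empty string for a package that is not installed.

```r
# Return the names in pkgs that are not installed in the current library paths.
# missing_packages is a made-up helper name used only for this illustration.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),  # "" means not installed
    logical(1)
  )
  pkgs[!installed]
}

wanted <- c("NLP", "openNLP", "RWeka", "qdap")
to_install <- missing_packages(wanted)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out, in keeping with the cell above, so that running the sketch never triggers a download by accident.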

In [3]:
library(NLP)
library(openNLP)
library(RWeka)

This gives us some functions to play with. First, though, a regular R function: readLines reads the lines of a file into a character vector. We will use it to read in a short bio. (You can find it on Canvas under NLP.)

In [4]:
bio <- readLines("data/anb-jarena-lee.txt")

In [5]:
bio

Put everything together into one string. The paste function has a collapse argument that joins the elements of a vector into a single string. Here the lines are joined with a space character.
Just to be safe, convert the result to a String object (as.String comes from the NLP package) for the rest of the routines...

In [5]:
bio <- as.String(paste(bio,collapse=" "))
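To see what collapse does on its own, here is a small base R sketch. The two lines are shortened stand-ins for the bio file, not its actual contents.

```r
# Two short stand-in lines, as readLines might return them.
lines <- c("In 1804, Jarena Lee moved to Philadelphia.",
           "She was baptized in 1807.")

# collapse = " " joins the elements into a single string, separated by a space.
one_string <- paste(lines, collapse = " ")

length(lines)       # 2 elements before
length(one_string)  # 1 element after
```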

In [6]:
bio

In 1804, after several months of profound spiritual anxiety, Jarena Lee moved from New Jersey to Philadelphia. There she labored as a domestic and worshiped among white congregations of Roman Catholics and mixed congregations of Methodists. On hearing an inspired sermon by the Reverend Richard Allen, founder of the Bethel African Methodist Episcopal Church, Lee joined the Methodists. She was baptized in 1807. Prior to her baptism, she experienced the various physical and emotional stages of conversion: terrifying visions of demons and eternal perdition; extreme feelings of ecstasy and depression; protracted periods of meditation, fasting, and prayer; ennui and fever; energy and vigor. In 1811 she married Joseph Lee, who pastored an African-American church in Snow Hill, New Jersey. They had six children, four of whom died in infancy.

Next come the new parts...

Some library functions called 'annotators' will be used to find sentences and words in the text. These will be used to identify things in the text later. The results will be numbers indicating where in the string words and sentences begin and end.

In [12]:
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))

The resulting annotations show where the words and sentences are in the bio.

In [9]:
bio_annotations

 id  type     start end features
   1 sentence     1 110 constituents=<<integer,20>>
   2 sentence   112 240 constituents=<<integer,20>>
   3 sentence   242 386 constituents=<<integer,25>>
   4 sentence   388 412 constituents=<<integer,6>>
   5 sentence   414 693 constituents=<<integer,49>>
   6 sentence   695 791 constituents=<<integer,19>>
   7 sentence   793 844 constituents=<<integer,12>>
   8 word         1   2 
   9 word         4   7 
  10 word         8   8 
  11 word        10  14 
  12 word        16  22 
  13 word        24  29 
  14 word        31  32 
  15 word        34  41 
  16 word        43  51 
  17 word        53  59 
  18 word        60  60 
  19 word        62  67 
  20 word        69  71 
  21 word        73  77 
  22 word        79  82 
  23 word        84  86 
  24 word        88  93 
  25 word        95  96 
  26 word        98 109 
  27 word       110 110 
  28 word       112 116 
  29 word       118 120 
  30 word       122 128 
  31 word       130 131 
  32
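The start and end columns are character positions in the string. A quick way to convince yourself is to cut the first sentence of the bio with base R's substring, using a few of the word offsets from the table above. This is only a sketch of what the offsets mean, not how openNLP stores or uses them.

```r
# First sentence of the bio, copied from the text above.
sentence <- "In 1804, after several months of profound spiritual anxiety, Jarena Lee moved from New Jersey to Philadelphia."

# start/end pairs for annotations 8, 9, 10 and 11 in the table above.
starts <- c(1, 4, 8, 10)
ends   <- c(2, 7, 8, 14)

# substring is vectorised over the start and end positions.
tokens <- substring(sentence, starts, ends)
tokens  # "In" "1804" "," "after"

# The whole sentence spans characters 1 to 110, matching annotation 1.
nchar(sentence)
```

Note that the comma gets its own annotation: the word tokenizer treats punctuation as separate tokens.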

In [14]:
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
class(bio_doc)

There is a handy way to get the sentences and words:

In [11]:
sents(bio_doc)

In [12]:
words(bio_doc)

Let's use the fun magrittr library and pipe these into a few other functions...
head gives us the first n elements of a list/vector, which is handy.

In [13]:
library(magrittr)

In [14]:
head(c(1:10),2)

In [15]:
words(bio_doc) %>% head(2)
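A pipe just passes the value on its left as the first argument of the function on its right. The sketch below shows this with R's built-in |> pipe instead of magrittr's %>%, so it runs without any packages; it assumes R 4.1 or later. For a simple call like this one the two pipes behave the same way.

```r
letters_vec <- c("a", "b", "c", "d")

# Ordinary call: the vector is the first argument.
direct <- head(letters_vec, 2)

# Piped call: the vector on the left becomes the first argument of head.
piped <- letters_vec |> head(2)

identical(direct, piped)
```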

Try something fun: make a vocab list of the document...

First, get the words, and make sure they are a plain vector (needed for unique and sort).

Then, sort the results so it looks like a dictionary.

In [16]:
wv <- unlist(words(bio_doc))

In [17]:
#sort(unique(wv))
unique(wv) %>% sort() -> rachel
head(rachel)
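The same two steps on a tiny made-up word list make it easier to see what is going on. The words here are invented for the example and are all lowercase; with mixed case or punctuation tokens, as in the real bio, the sort order can depend on the locale.

```r
# A toy word vector with one repeated word.
toy_words <- c("the", "cat", "sat", "on", "the", "mat")

# unique drops the repeat, sort puts the rest in dictionary order.
vocab <- sort(unique(toy_words))
vocab  # "cat" "mat" "on" "sat" "the"
```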

All of this works, but none of it is beyond what we could do with regular R.

Create a different kind of annotator, one that recognizes names. (There is one in the library.)

First, the tools need a language model to work with. One can be installed for English. (The install call needs more arguments since the models are not kept on CRAN.)

In [18]:
#install.packages("openNLPmodels.en",
#                 repos = "http://datacube.wu.ac.at/",
#                 type = "source")

In [19]:
person_ann <- Maxent_Entity_Annotator(kind = "person")

Create a sequence of things to recognize. This is sometimes called a pipeline in the literature. Of course, I'll call it bob.

In [20]:
bob <- list(sent_ann,word_ann,person_ann)

In [21]:
bob[3]

[[1]]
An annotator inheriting from classes
  Simple_Entity_Annotator Annotator
with description
  Computes entity annotations using the Apache OpenNLP Maxent name
  finder employing the default model for language 'en' and kind
  'person'.


Now perform the annotations on the data.

In [22]:
bob_annotate <- annotate(bio,bob)
name_doc <- AnnotatedPlainTextDocument(bio, bob_annotate)

In [23]:
str(name_doc)

List of 3
 $ content    : 'String' chr "In 1804, after several months of profound spiritual anxiety, Jarena Lee moved from New Jersey to Philadelphia. "| __truncated__
 $ meta       : list()
 $ annotations:List of 1
  ..$ :List of 162
  .. ..$ :Classes 'Annotation', 'Span'  hidden list of 5
  .. .. ..$ id      : int 1
  .. .. ..$ type    : chr "sentence"
  .. .. ..$ start   : int 1
  .. .. ..$ end     : int 110
  .. .. ..$ features:List of 1
  .. .. .. ..$ :List of 1
  .. .. .. .. ..$ constituents: int [1:20] 8 9 10 11 12 13 14 15 16 17 ...
  .. .. ..- attr(*, "meta")= list()
  .. ..$ :Classes 'Annotation', 'Span'  hidden list of 5
  .. .. ..$ id      : int 2
  .. .. ..$ type    : chr "sentence"
  .. .. ..$ start   : int 112
  .. .. ..$ end     : int 240
  .. .. ..$ features:List of 1
  .. .. .. ..$ :List of 1
  .. .. .. .. ..$ constituents: int [1:20] 28 29 30 31 32 33 34 35 36 37 ...
  .. .. ..- attr(*, "meta")= list()
  .. ..$ :Classes 'Annotation', 'Span'  hidden list of 5
  .. .. 

Next, create a function to get the names from the annotated result. (I don't know why this isn't in the library.)

In [24]:
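# Pull the text of the entity annotations out of an annotated document.
# If kind is given (e.g. "person"), only entities of that kind are returned.
entities <- function(doc, kind) {
  s <- doc$content                 # the full text, as a String
  a <- annotations(doc)[[1]]       # the first (here, only) set of annotations
  if(hasArg(kind)) {
    # look up the kind feature of each annotation
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]                # subset the text by the matching spans
  } else {
    s[a[a$type == "entity"]]       # no kind given: return every entity
  }
}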

This function extracts one kind of entity annotation. The second argument names the kind; it is a string, and should be placed in quotes. If it is left out, every entity is returned. Using this function, find the names in the document...

In [25]:
entities(name_doc,"person")

With this new understanding, let's try finding the names on the SJSU web site...

In [26]:
library(rvest)

Loading required package: xml2


In [27]:
web_page <- read_html("http://www.sjsu.edu")

In [28]:
web_text <- html_text(web_page)

The web_text has a lot of newline characters, tabs, and many extra spaces... I'll try to get rid of the extra newlines, tabs, etc.

In [29]:
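# replace newlines and tabs with a space so neighboring words do not run together
web_text <- gsub("[\n\t]+", ' ', web_text)
# collapse any run of spaces into a single space
web_text <- gsub(" +", ' ', web_text)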
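One thing to watch for in this kind of clean-up: replacing a newline with an empty string fuses the words on either side of it, which can confuse the word and name annotators. The base R sketch below shows the difference on a made-up string (not the actual page text), using the POSIX class [[:space:]] to catch newlines, tabs and spaces in one pass.

```r
# A made-up snippet with a newline, two tabs and a run of spaces.
raw <- "San Jose\nState\t\tUniversity   Home"

# Deleting the newline outright glues "Jose" and "State" together.
fused <- gsub("\n", "", raw)

# Replacing any run of whitespace with one space keeps the words apart.
spaced <- gsub("[[:space:]]+", " ", raw)

fused
spaced  # "San Jose State University Home"
```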

In [30]:
sjsu_annotate <- annotate(web_text,bob)

In [31]:
sjsu_doc <- AnnotatedPlainTextDocument(web_text, sjsu_annotate)

In [32]:
entities(sjsu_doc,"person")

In [33]:
web_text

Just for fun, let's try to look for any locations on the SJSU web site.

In [34]:
location_ann <- Maxent_Entity_Annotator(kind = "location")

In [35]:
jack <- list(sent_ann, word_ann,location_ann)

In [36]:
loc_annotate <- annotate(web_text,jack)

In [37]:
loc_doc <- AnnotatedPlainTextDocument(web_text, loc_annotate)

In [38]:
unique(entities(loc_doc,"location"))

In [39]:
pp_raw <- readLines("data/Pride_and_Prejudice_short.txt")

In [40]:
pp_str <- as.String(paste(pp_raw,collapse=" "))

In [41]:
pp_ann <- annotate(pp_str,jack)

In [42]:
pp_doc <- AnnotatedPlainTextDocument(pp_str,pp_ann)

In [43]:
#sort(unique(entities(pp_doc,"location")))
entities(pp_doc,"location") %>% unique() %>% sort()

In [44]:
org_ann <- Maxent_Entity_Annotator(kind = "organization")

In [45]:
joel <- list(sent_ann, word_ann,org_ann)

In [46]:
joel_ann <- annotate(web_text,joel)

In [47]:
joel_doc <- AnnotatedPlainTextDocument(web_text,joel_ann)

In [48]:
entities(joel_doc,"organization")