# Comparison of Natural Language Understanding Services and Frameworks

This notebook implements code from four different commercial NLP services in a typical workflow. Each script should be run as a stand-alone implementation.

__Open source references__

* [Python: spaCy](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)
* [R: TextMining(tm)](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/)
* [R: OpenNLP](https://rpubs.com/lmullen/nlp-chapter)

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 

### Summary of all open frameworks and commercial services


In [2]:
Image(url="images/Cloud_and_Open.png", width=700)

### Detailed explanation of differences



### [R](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)

This example code is taken from this [blog post](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/); further code that uses additional libraries comes from [here](https://rpubs.com/lmullen/nlp-chapter).

### [Python: spaCy](https://spacy.io/usage/spacy-101)

[spaCy](https://spacy.io/usage/processing-pipelines) states clearly in its documentation that it is built for both general-purpose and customized pipelines. This example code is taken from this [blog post](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/).

`$ conda install spacy`

In [1]:
import requests

r = requests.get('https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/04/04080929/Tripadvisor_hotelreviews_Shivambansal.txt')

In [2]:
r.text[0:100]

'Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice'

In [None]:
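# load the spaCy English model
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs a package name such as 'en_core_web_sm'
import spacy
nlp = spacy.load('en')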

document = r.text
document = nlp(document)

In [7]:
# last ten attributes of the parsed document object
dir(document)[-10:]

['text_with_ws',
 'to_array',
 'to_bytes',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [9]:
# tokenization
document[0]
document[len(document)-5]
list(document.sents)[:5]

[Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
 Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).,
 Overall, it was a good experience and the staff was quite friendly. ,
 what a surprise What a surprise the Sheraton was after reading some of the reviews.]

In [10]:
# part-of-speech
all_tags = {w.pos: w.pos_ for w in document}

In [14]:
all_tags

{82: 'ADJ',
 83: 'ADP',
 84: 'ADV',
 87: 'CCONJ',
 88: 'DET',
 89: 'INTJ',
 90: 'NOUN',
 91: 'NUM',
 92: 'PART',
 93: 'PRON',
 94: 'PROPN',
 95: 'PUNCT',
 97: 'SYM',
 98: 'VERB',
 99: 'X',
 101: 'SPACE'}

In [15]:
# fine-grained tags for the first sentence of our document
for word in list(document.sents)[0]:
    print(word, word.tag_)

Nice JJ
place NN
Better NNP
than IN
some DT
reviews NNS
give VBP
it PRP
credit NN
for IN
. .


In [16]:
# define some parameters
# NOTE: 'PROP' matches none of the coarse POS tags listed in all_tags above
# (proper nouns are 'PROPN'), so this filter removes nothing. It is kept as in
# the source blog so that the output below still applies.
noisy_pos_tags = ['PROP']
min_token_length = 2

# Function to check whether the token is noise or not
def isNoise(token):
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.is_stop:
        is_noise = True
    elif len(token.string) <= min_token_length:
        is_noise = True
    return is_noise

def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()


# top unigrams used in the reviews
from collections import Counter
cleaned_list = [cleanup(word.string) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)

[('hotel', 685),
 ('room', 653),
 ('great', 300),
 ('sheraton', 286),
 ('location', 272)]
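
The list above still contains `sheraton`, a proper noun. A likely reason is that `noisy_pos_tags` holds `'PROP'`, while the tag table printed earlier lists proper nouns as `PROPN`. The sketch below illustrates the difference on a hand-made token list. `Tok` is a hypothetical stand-in for a spaCy token (it is not spaCy's API), and the tags and stop-word flags are assigned by hand for illustration only.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token: text, coarse POS tag, stop-word flag
Tok = namedtuple("Tok", "text pos_ is_stop")

def is_noise(tok, noisy_pos_tags, min_token_length=2):
    """Same three checks as isNoise above, with the tag list passed in."""
    return (tok.pos_ in noisy_pos_tags
            or tok.is_stop
            or len(tok.text) <= min_token_length)

def top_unigrams(tokens, noisy_pos_tags):
    kept = [t.text.lower() for t in tokens if not is_noise(t, noisy_pos_tags)]
    return Counter(kept).most_common(5)

# Hand-tagged toy sentence: "The Sheraton hotel is great"
tokens = [
    Tok("The", "DET", True),
    Tok("Sheraton", "PROPN", False),
    Tok("hotel", "NOUN", False),
    Tok("is", "VERB", True),
    Tok("great", "ADJ", False),
]

with_prop = top_unigrams(tokens, {"PROP"})    # tag never matches, nothing filtered
with_propn = top_unigrams(tokens, {"PROPN"})  # proper nouns are dropped

print(with_prop)
print(with_propn)
```

With `'PROP'` the proper noun survives the filter; with `'PROPN'` it is removed. Whether proper nouns should count as noise depends on the analysis, since hotel and place names may be exactly what a reader of the reviews wants to see.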

In [18]:
# entities
labels = {w.label_ for w in document.ents}
for label in labels:
    entities = [cleanup(e.string, lower=False) for e in document.ents if label == e.label_]
    entities = list(set(entities))
    # label[:5] shortens long labels in the output (e.g. PRODUCT prints as PRODU)
    print(label[:5], entities[:5])

EVENT ['the Hynes Convention centre', 'DIRTY Room / RUDE Staff My', 'the Body Shopy', 'New Year', 'the Olympic Trials']
LAW ['#1', 'Room 2916', 'the Duck Tour - it', 'the USS Constitution', 'the Sheraton Boston']
ORG ['', 'SHERATON', 'the Wrentham', 'Good Hotel', 'Whats Good']
GPE ['the United States', 'Pizza', 'Starbucks', 'Wrentham Village -', 'Hotel']
PRODU ['3.30pm', 'Radisson', 'Centre', '225.00', 'Suite']
CARDI ['', '10,000', 'about 1000', '170', '9AM']
LOC ['Fenway Park', 'the Back Bay', '', 'Charles River', 'the South End']
MONEY ['about $40', '$109', '10 dollars', '99', '20/hr).I']
QUANT ['10 feet', 'a ton', '27 inch', 'the airline miles', 'two feet']
WORK_ ['The Room', 'the Back Bay', 'Wonderful Location The', 'Beautiful and the', 'a Charles River']
TIME ['about 5 nights', 'the night', 'Later in the afternoon', 'early evening', '45 seconds']
NORP ['American', 'Americans', 'stayThese', 'Brit', 'Priceline']
PERCE ['20% tip', '100%', '9pm)', '50% off', 'about 20mins,']
DATE ['th
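
The entity cell above scans `document.ents` once per label and lets empty strings through (note the `''` entries in the output). As a rough sketch of an alternative, the entities can be grouped in a single pass with a `defaultdict`. The `(text, label)` pairs below are a small hand-copied sample of the output above with full label names, plus one deliberate duplicate; they stand in for `(e.text, e.label_)` and are not produced by running spaCy here.

```python
from collections import defaultdict

# Sample (entity text, label) pairs; the duplicate and the empty string are
# included on purpose to show how they are handled.
pairs = [
    ("Fenway Park", "LOC"),
    ("Charles River", "LOC"),
    ("Fenway Park", "LOC"),
    ("", "ORG"),
    ("about $40", "MONEY"),
    ("10 feet", "QUANTITY"),
    ("100%", "PERCENT"),
]

by_label = defaultdict(set)
for text, label in pairs:
    text = text.strip()
    if text:                      # skip empty entity strings
        by_label[label].add(text)

for label in sorted(by_label):
    print(label, sorted(by_label[label]))
```

Using a set per label removes duplicates as they arrive, and sorting at print time gives a stable order, which the `list(set(...))` call in the original cell does not guarantee.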

In [21]:
# dependency parsing
# extract all review sentences that contain the term "hotel"
hotel = [sent for sent in document.sents if 'hotel' in sent.string.lower()]

# print each word of one such sentence with its direct dependents (children)
sentence = hotel[2]
for word in sentence:
    print(word, ': ', str(list(word.children)))

A :  []
cab :  [A, from]
from :  [airport, to]
the :  []
airport :  [the]
to :  [hotel]
the :  []
hotel :  [the]
can :  []
be :  [cab, can, cheaper, .]
cheaper :  [than]
than :  [shuttles]
the :  []
shuttles :  [the, depending]
depending :  [time]
what :  []
time :  [what, of]
of :  [day]
the :  []
day :  [the, go]
you :  []
go :  [you]
. :  []
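
Each line of the output above lists only a word's direct children. Following those links recursively recovers whole phrases. The sketch below does this in plain Python for the first eight tokens, with the words and child positions transcribed by hand from the printed tree; the helper names `subtree` and `phrase` are illustrative. spaCy tokens also expose a `subtree` attribute that serves the same purpose on a real parse.

```python
# Words and child positions transcribed from the first eight lines of the
# printed tree (indices are token positions within the sentence).
words = ["A", "cab", "from", "the", "airport", "to", "the", "hotel"]
children = {0: [], 1: [0, 2], 2: [4, 5], 3: [], 4: [3], 5: [7], 6: [], 7: [6]}

def subtree(index):
    """Return the sorted positions of a token and all of its descendants."""
    found = [index]
    for child in children[index]:
        found.extend(subtree(child))
    return sorted(found)

def phrase(index):
    """Join the words of a token's subtree in sentence order."""
    return " ".join(words[i] for i in subtree(index))

cab_phrase = phrase(1)    # subtree rooted at "cab"
from_phrase = phrase(2)   # subtree rooted at "from"
print(cab_phrase)
print(from_phrase)
```

The subtree rooted at `cab` spells out the full subject of the sentence, "A cab from the airport to the hotel", which is the kind of unit a downstream step (for example, opinion extraction per aspect) would typically work with.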


In [34]:
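# check all adjectives used with a word
# NOTE: as written, this collects `ptag` children of every word in each
# sentence containing `token`, and the loop over characters repeats each
# word's children once per character of word.string, so the counts below
# are inflated.
def pos_words(sentence, token, ptag):
    sentences = [sent for sent in sentence.sents if token in sent.string]
    pwrds = []
    for sent in sentences:
        for word in sent:
            for character in word.string:
                pwrds.extend([child.string.strip() for child in word.children if child.pos_ == ptag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', 'ADJ')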

[('great', 368),
 ('other', 266),
 ('my', 247),
 ('our', 243),
 ('nice', 228),
 ('good', 223),
 ('that', 181),
 ('many', 155),
 ('its', 145),
 ('which', 142)]
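
The `pos_words` cell above iterates over every character of every word in the matching sentences, so each adjective is counted once per character of its parent word, and adjectives attached to words other than `hotel` are counted too. A minimal sketch of what the cell's comment describes, counting only adjectives that are children of the target word, could look like the following. `Tok` and `adjectives_for` are hypothetical names, and the two toy sentences are parsed by hand, so the numbers are not comparable to the output above.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token: text, coarse POS tag, children."""
    text: str
    pos_: str
    children: list = field(default_factory=list)

def adjectives_for(sentences, target, ptag="ADJ"):
    """Count children tagged `ptag` of tokens whose lowercased text equals `target`."""
    counts = Counter()
    for sent in sentences:
        for tok in sent:
            if tok.text.lower() == target:
                counts.update(c.text.lower() for c in tok.children if c.pos_ == ptag)
    return counts.most_common(10)

# Hand-built parses of "The great hotel" and "A great clean hotel"
the, great1 = Tok("The", "DET"), Tok("great", "ADJ")
hotel1 = Tok("hotel", "NOUN", [the, great1])
a, great2, clean = Tok("A", "DET"), Tok("great", "ADJ"), Tok("clean", "ADJ")
hotel2 = Tok("hotel", "NOUN", [a, great2, clean])

sentences = [[the, great1, hotel1], [a, great2, clean, hotel2]]
result = adjectives_for(sentences, "hotel")
print(result)
```

On a real parse the same idea would compare `word.text.lower()` with the target and read `word.children`; determiners such as `The` and `A` are skipped because only children tagged `ADJ` are counted.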

### [Scala: Epic](http://www.scalanlp.org/documentation/)

END OF DOCUMENT