# Comparison of Natural Language Understanding Services and Frameworks

This document compares the leading cloud service providers and programming frameworks for Natural Language Understanding (NLU).  A summary table of specific characteristics gives an overview of how the tools differ.  Additional comparison tables show the commercial cloud service providers in depth.  The document concludes with code samples using the different tools.


__Executive Summary__

Open source programming frameworks compare favorably with, and often outperform, commercial cloud providers in both features and performance.  Python's [spaCy](https://spacy.io/usage/facts-figures#section-benchmarks) appears to be designed for production use and offers a large number of features.  Among open source software, only Stanford's CoreNLP has similar qualities, and commercial services do not appear to offer as much.  Building a foundational layer with a library such as spaCy provides a strong, general starting point for more specific solutions later.  In particular, [R has over a hundred libraries](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) that provide highly specialized functionality.

Commercial services do simplify pricing: costs are charged per NLU item rather than per unit of processing time, which must be estimated for self-managed methods.  However, using a commercial service does not remove the need for additional programming to customize general and specific solutions.  In that case, the commercial service becomes an extra layer of complexity between processes in a pipeline.
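The pricing difference can be made concrete with a small calculation. The sketch below is illustrative only: the function names and every figure in it (item count, price per item, seconds per item, hourly rate) are hypothetical and are not taken from any provider's price list.

```python
def per_item_cost(n_items: int, price_per_item: float) -> float:
    """Cost when a service bills per NLU item (for example, per text record)."""
    return n_items * price_per_item


def self_hosted_cost(n_items: int, seconds_per_item: float, hourly_rate: float) -> float:
    """Cost when you pay for compute time and must estimate processing time yourself."""
    hours = n_items * seconds_per_item / 3600
    return hours * hourly_rate


# Hypothetical figures, chosen only to show the two calculations side by side
n = 100_000
print("per-item billing:", per_item_cost(n, price_per_item=0.001))
print("compute-time billing:", self_hosted_cost(n, seconds_per_item=0.05, hourly_rate=0.50))
```

The per-item figure depends only on volume, while the compute-time figure also needs a measured throughput, which is the extra estimation step mentioned above.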

__Open source references__

* [Python: spaCy](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)
* [R: TextMining(tm)](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/)
* [R: OpenNLP](https://rpubs.com/lmullen/nlp-chapter)

__Commercial references__

* [kontikilabs: very thorough with accompanying code](https://medium.com/kontikilabs/comparing-machine-learning-ml-services-from-various-cloud-ml-service-providers-63c8a2626cb6)
* [Google vs Watson](http://fredrikstenbeck.com/google-natural-language-vs-watson-natural-language-understanding/)
* [Watson internals](https://www.quora.com/What-do-AI-ML-and-NLP-researchers-think-of-IBM%E2%80%99s-Watson-Does-it-have-the-potential-to-make-a-huge-impact)
* [Google: categories](https://cloud.google.com/natural-language/docs/categories)
* [Watson: categories](https://console.bluemix.net/docs/services/natural-language-understanding/categories.html#categories-hierarchy)

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 

### Summary of all open frameworks and commercial services


In [2]:
Image(url= "./images/Cloud_and_Open.png", width=700)

### Summary of open frameworks



In [16]:
Image(url= "./images/Open.png", width=500)

### Summary of commercial cloud services

Commercial features are fairly consistent across services, except for syntax and part-of-speech (POS) analysis, which I consider a must-have.  Google has the better syntax and POS support.  Watson has useful hierarchical categories.

In [13]:
Image(url= "./images/CloudML_Features.png", width=500)

Performance may be less important because we use the service as a daily batch and are not paying for processing time.

In [14]:
Image(url= "./images/CloudML_Performance.png", width=500)

Costs are also consistent across services.

In [15]:
Image(url= "./images/CloudML_Cost.png", width=500)

### Detailed explanation of differences



### [R](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)

This example code is taken from this [blog post](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/), with code using additional libraries from [here](https://rpubs.com/lmullen/nlp-chapter).

### [Python: spaCy](https://spacy.io/usage/spacy-101)

[SpaCy](https://spacy.io/usage/processing-pipelines) is clear in its documentation that it is built for general and customized pipelines.  This example code is taken from the [blog](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/).

`$ conda install spacy`
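The idea of a pipeline built from interchangeable components can be illustrated without spaCy installed. The sketch below is a plain-Python stand-in and is not spaCy's API: `lowercase`, `drop_short`, and `run_pipeline` are hypothetical names, and whitespace splitting is a crude substitute for a real tokenizer.

```python
from functools import reduce


def lowercase(tokens):
    """Component: lower-case every token."""
    return [t.lower() for t in tokens]


def drop_short(tokens, min_length=3):
    """Component: discard tokens shorter than min_length characters."""
    return [t for t in tokens if len(t) >= min_length]


def run_pipeline(text, components):
    """Split on whitespace, then pass the token list through each component in order."""
    return reduce(lambda tokens, component: component(tokens), components, text.split())


print(run_pipeline("Overall, the rooms were a bit small", [lowercase, drop_short]))
```

Because each component takes and returns a token list, steps can be added, removed, or reordered without touching the others, which is the property a customized pipeline relies on.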

In [1]:
import requests

r = requests.get('https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/04/04080929/Tripadvisor_hotelreviews_Shivambansal.txt')

In [2]:
r.text[0:100]

'Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice'

In [None]:
# load spaCy's English model and parse the reviews
import spacy 
nlp = spacy.load('en')

document = r.text
document = nlp(document)

In [7]:
# last ten attributes of the parsed document object
dir(document)[-10:]

['text_with_ws',
 'to_array',
 'to_bytes',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [9]:
# tokenization
document[0]
document[len(document)-5]
list(document.sents)[:5]

[Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
 Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).,
 Overall, it was a good experience and the staff was quite friendly. ,
 what a surprise What a surprise the Sheraton was after reading some of the reviews.]

In [10]:
# part-of-speech
all_tags = {w.pos: w.pos_ for w in document}

In [14]:
all_tags

{82: 'ADJ',
 83: 'ADP',
 84: 'ADV',
 87: 'CCONJ',
 88: 'DET',
 89: 'INTJ',
 90: 'NOUN',
 91: 'NUM',
 92: 'PART',
 93: 'PRON',
 94: 'PROPN',
 95: 'PUNCT',
 97: 'SYM',
 98: 'VERB',
 99: 'X',
 101: 'SPACE'}

In [15]:
# all tags of first sentence of our document 
for word in list(document.sents)[0]:  
    print( word, word.tag_)

Nice JJ
place NN
Better NNP
than IN
some DT
reviews NNS
give VBP
it PRP
credit NN
for IN
. .


In [16]:
# define some parameters
# NOTE: spaCy's coarse tag for proper nouns is 'PROPN' (see all_tags above); 'PROP' matches no token
noisy_pos_tags = ['PROP']
min_token_length = 2

# Function to check if the token is noise or not
def isNoise(token):
    if token.pos_ in noisy_pos_tags:
        return True
    if token.is_stop:
        return True
    if len(token.string) <= min_token_length:
        return True
    return False

def cleanup(token, lower = True):
    if lower:
        token = token.lower()
    return token.strip()


# top unigrams used in the reviews 
from collections import Counter
cleaned_list = [cleanup(word.string) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)

[('hotel', 685),
 ('room', 653),
 ('great', 300),
 ('sheraton', 286),
 ('location', 272)]

In [18]:
# entities
labels = set([w.label_ for w in document.ents]) 
for label in labels: 
    entities = [cleanup(e.string, lower=False) for e in document.ents if label==e.label_] 
    entities = list(set(entities)) 
    print( label[:5],entities[:5] )   

EVENT ['the Hynes Convention centre', 'DIRTY Room / RUDE Staff My', 'the Body Shopy', 'New Year', 'the Olympic Trials']
LAW ['#1', 'Room 2916', 'the Duck Tour - it', 'the USS Constitution', 'the Sheraton Boston']
ORG ['', 'SHERATON', 'the Wrentham', 'Good Hotel', 'Whats Good']
GPE ['the United States', 'Pizza', 'Starbucks', 'Wrentham Village -', 'Hotel']
PRODU ['3.30pm', 'Radisson', 'Centre', '225.00', 'Suite']
CARDI ['', '10,000', 'about 1000', '170', '9AM']
LOC ['Fenway Park', 'the Back Bay', '', 'Charles River', 'the South End']
MONEY ['about $40', '$109', '10 dollars', '99', '20/hr).I']
QUANT ['10 feet', 'a ton', '27 inch', 'the airline miles', 'two feet']
WORK_ ['The Room', 'the Back Bay', 'Wonderful Location The', 'Beautiful and the', 'a Charles River']
TIME ['about 5 nights', 'the night', 'Later in the afternoon', 'early evening', '45 seconds']
NORP ['American', 'Americans', 'stayThese', 'Brit', 'Priceline']
PERCE ['20% tip', '100%', '9pm)', '50% off', 'about 20mins,']
DATE ['th

In [21]:
# dependency parsing
# extract all review sentences that contain the term 'hotel'
hotel = [sent for sent in document.sents if 'hotel' in sent.string.lower()]

# create dependency tree
sentence = hotel[2] 
for word in sentence:
    print( word, ': ', str(list(word.children)) )

A :  []
cab :  [A, from]
from :  [airport, to]
the :  []
airport :  [the]
to :  [hotel]
the :  []
hotel :  [the]
can :  []
be :  [cab, can, cheaper, .]
cheaper :  [than]
than :  [shuttles]
the :  []
shuttles :  [the, depending]
depending :  [time]
what :  []
time :  [what, of]
of :  [day]
the :  []
day :  [the, go]
you :  []
go :  [you]
. :  []


In [34]:
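# count words with a given POS tag that depend on tokens in sentences mentioning a word
def pos_words (sentence, token, ptag):
    sentences = [sent for sent in sentence.sents if token in sent.string]
    pwrds = []
    for sent in sentences:
        for word in sent:
            # NOTE: this loop runs once per character of `word`, so each child is
            # counted len(word.string) times, and children of every word in the
            # sentence are included, not only those of `token`
            for character in word.string:
                   pwrds.extend([child.string.strip() for child in word.children if child.pos_ == ptag] )
    return Counter(pwrds).most_common(10)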

pos_words(document, 'hotel', 'ADJ')

[('great', 368),
 ('other', 266),
 ('my', 247),
 ('our', 243),
 ('nice', 228),
 ('good', 223),
 ('that', 181),
 ('many', 155),
 ('its', 145),
 ('which', 142)]
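A variant that counts only the dependents of tokens matching the target word might look like the sketch below. To stay runnable without a spaCy model, it uses a hypothetical `Token` stand-in that carries just the three attributes the logic needs (`text`, `pos_`, `children`); `modifiers_of` and the hand-built sentences are assumptions for illustration, not output from the review corpus.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token: text, coarse POS tag, dependents."""
    text: str
    pos_: str
    children: List["Token"] = field(default_factory=list)


def modifiers_of(sentences, target, ptag):
    """Count children tagged `ptag` of every token whose text contains `target`."""
    found = []
    for sent in sentences:
        for word in sent:
            if target in word.text.lower():
                found.extend(c.text.lower() for c in word.children if c.pos_ == ptag)
    return Counter(found).most_common(10)


# Two hand-built sentences: "great hotel" and "the nice hotel"
great = Token("great", "ADJ")
hotel_a = Token("hotel", "NOUN", [great])
the = Token("the", "DET")
nice = Token("nice", "ADJ")
hotel_b = Token("hotel", "NOUN", [the, nice])
sentences = [[great, hotel_a], [the, nice, hotel_b]]

print(modifiers_of(sentences, "hotel", "ADJ"))
```

With real spaCy objects the same loop would iterate over `document.sents` and read the token's own text attribute in place of `Token.text`.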

### [Scala: Epic](http://www.scalanlp.org/documentation/)

### [IBM Watson](https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/)

In [None]:
import json
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions
import time
from datetime import timedelta
import sys
import os
import argparse


# Read the API credentials from environment variables set locally
NLP_USER_WATSON = os.environ.get("NLP_USER_WATSON")
NLP_PASS_WATSON = os.environ.get("NLP_PASS_WATSON")
NLP_VER_WATSON = os.environ.get("NLP_VER_WATSON")


# Append all console output to a text file
sys.stdout = open('nlp_api_output.txt', 'a')

start_time = time.monotonic()


def input_file(text_file_path):
    global text
    if os.path.isfile(text_file_path):
        with open(text_file_path, 'r') as text_file:
            text = text_file.read()
    else:
        print("File doesn't exist in the directory!")


def analyze_text():
  #Initialize NaturalLanguageUnderstanding function using the API credentials
  natural_language_understanding = NaturalLanguageUnderstandingV1(
    username = NLP_USER_WATSON,
    password = NLP_PASS_WATSON,
    version = NLP_VER_WATSON)

  response = natural_language_understanding.analyze(
    text = text,
    features = Features(
      entities = EntitiesOptions(
        emotion = True,
        sentiment = True),
      keywords = KeywordsOptions(
        emotion = True,
        sentiment = True)))

  print(json.dumps(response, indent = 2)) #json output after textual analysis


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description = __doc__,
        formatter_class = argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'text_file_path',
        help = 'The complete file path of the text file you want to analyze.') 
    args = parser.parse_args()

    input_file(args.text_file_path)
    analyze_text()


end_time = time.monotonic()
print("Execution_Time:", timedelta(seconds = end_time - start_time))
print('\n')
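The script above reads its credentials with `os.environ.get`, which returns `None` for an unset variable, so a missing credential only surfaces later as an authentication error. One way to fail early is sketched below; `require_env` is a hypothetical helper, and the values in `fake_env` are placeholders rather than real credentials or a real version string.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each name, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {n: environ[n] for n in names}


# Demonstration with an explicit mapping instead of the real environment
fake_env = {
    "NLP_USER_WATSON": "user",
    "NLP_PASS_WATSON": "secret",
    "NLP_VER_WATSON": "YYYY-MM-DD",
}
creds = require_env(["NLP_USER_WATSON", "NLP_PASS_WATSON", "NLP_VER_WATSON"], fake_env)
print(sorted(creds))
```

The same helper would apply to the other services' scripts, since each of them takes its keys from environment variables.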

### [Google Cloud Natural Language](https://cloud.google.com/natural-language/)

In [None]:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import os
import time
from datetime import timedelta
import sys
import argparse


# The client library locates the service-account key through the
# GOOGLE_APPLICATION_CREDENTIALS environment variable; this call only looks it up.
os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")

# Write all console output to a text file (overwritten on each run)
sys.stdout = open('nlp_api_content_output.txt', 'w')

start_time = time.monotonic()


def input_file(text_file_path):
    global text
    if os.path.isfile(text_file_path):
        with open(text_file_path, 'r') as text_file:
            text = text_file.read()
    else:
        print("File doesn't exist in the directory!")


def sentiment_text():
    """Detects sentiment in the text."""
    client = language.LanguageServiceClient()
    # Instantiates a plain text document.
    document = types.Document(
        content = text,
        type = enums.Document.Type.PLAIN_TEXT)

    # Detects sentiment in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    sentiment = client.analyze_sentiment(document).document_sentiment

    print('Sentiment: {}, {}'.format(sentiment.score, sentiment.magnitude))
    print('\n')


def entities_text():
    """Detects entities in the text."""
    client = language.LanguageServiceClient()

    # Instantiates a plain text document.
    document = types.Document(
        content = text,
        type = enums.Document.Type.PLAIN_TEXT)

    # Detects entities in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    entities = client.analyze_entities(document).entities

    # entity types from enums.Entity.Type
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

    for entity in entities:
        print('=' * 20)
        print(u'{:<16}: {}'.format('name', entity.name))
        print(u'{:<16}: {}'.format('type', entity_type[entity.type]))
        print(u'{:<16}: {}'.format('metadata', entity.metadata))
        print(u'{:<16}: {}'.format('salience', entity.salience))
        print(u'{:<16}: {}'.format('wikipedia_url',
              entity.metadata.get('wikipedia_url', '-')))
    print('\n')


def syntax_text():
    """Detects syntax in the text."""
    client = language.LanguageServiceClient()

    # Instantiates a plain text document.
    document = types.Document(
        content = text,
        type = enums.Document.Type.PLAIN_TEXT)

    # Detects syntax in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    tokens = client.analyze_syntax(document).tokens

    # part-of-speech tags from enums.PartOfSpeech.Tag
    pos_tag = ('UNKNOWN', 'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM',
               'PRON', 'PRT', 'PUNCT', 'VERB', 'X', 'AFFIX')

    for token in tokens:
        print(u'{}: {}'.format(pos_tag[token.part_of_speech.tag],
                               token.text.content))
    print('\n')


def entity_sentiment_text():
    """Detects entity sentiment in the provided text."""
    client = language.LanguageServiceClient()

    document = types.Document(
        content = text.encode('utf-8'),
        type = enums.Document.Type.PLAIN_TEXT)

    # Detect and send native Python encoding to receive correct word offsets.
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16

    result = client.analyze_entity_sentiment(document, encoding)

    for entity in result.entities:
        print('Mentions: ')
        print(u'Name: "{}"'.format(entity.name))
        for mention in entity.mentions:
            print(u'  Begin Offset : {}'.format(mention.text.begin_offset))
            print(u'  Content : {}'.format(mention.text.content))
            print(u'  Magnitude : {}'.format(mention.sentiment.magnitude))
            print(u'  Sentiment : {}'.format(mention.sentiment.score))
            print(u'  Type : {}'.format(mention.type))
        print(u'Salience: {}'.format(entity.salience))
        print(u'Sentiment: {}\n'.format(entity.sentiment))
    print('\n')


def classify_text():
    """Classifies content categories of the provided text."""
    client = language.LanguageServiceClient()

    document = types.Document(
        content = text.encode('utf-8'),
        type = enums.Document.Type.PLAIN_TEXT)

    categories = client.classify_text(document).categories

    for category in categories:
        print(u'=' * 20)
        print(u'{:<16}: {}'.format('name', category.name))
        print(u'{:<16}: {}'.format('confidence', category.confidence))
    print('\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description = __doc__,
        formatter_class = argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'text_file_path',
        help = 'The complete file path of the text file you want to analyze.')
    args = parser.parse_args()

    input_file(args.text_file_path)
    sentiment_text()
    entities_text()
    syntax_text()
    entity_sentiment_text()
    classify_text()


end_time = time.monotonic()
print("Execution_Time:", timedelta(seconds = end_time - start_time))
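The script above turns numeric codes into names with `entity_type[entity.type]` and `pos_tag[token.part_of_speech.tag]`. Indexing a hard-coded tuple raises `IndexError` if the service ever returns a code beyond the end of the tuple. A defensive lookup could be sketched as follows; `label_for` is a hypothetical helper, and the tuple is copied from the script.

```python
def label_for(code, names):
    """Map an integer enum code to its name, falling back instead of raising IndexError."""
    if 0 <= code < len(names):
        return names[code]
    return "UNRECOGNIZED({})".format(code)


ENTITY_TYPES = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

print(label_for(2, ENTITY_TYPES))   # LOCATION
print(label_for(12, ENTITY_TYPES))  # UNRECOGNIZED(12)
```

The explicit lower bound also prevents a negative code from silently selecting an entry from the end of the tuple.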

### [Amazon Comprehend](https://aws.amazon.com/documentation/comprehend/)

In [None]:
import boto3
import time
from datetime import timedelta
import sys
import os
import argparse


# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment by itself;
# the calls below only look the values up and discard them.
os.environ.get("AWS_ACCESS_KEY_ID")
os.environ.get("AWS_SECRET_ACCESS_KEY")
os.environ.get("AWS_REGION")


# Append all console output to a text file
sys.stdout = open('output.txt','a')

start_time = time.monotonic()


def input_file(text_file_path):
    global text
    if os.path.isfile(text_file_path):
        with open(text_file_path, 'r') as text_file:
            text = text_file.read()
    else:
        print("File doesn't exist in the directory!")


def dominant_language_text():
    #Initialize amazon_comprehend client function
    client_comprehend = boto3.client(
        'comprehend',
    )
    dominant_language_response = client_comprehend.detect_dominant_language(
        Text = text
    )
    #Print the Dominant Language (the entry with the highest Score)
    print("Language:", max(dominant_language_response['Languages'], key = lambda k: k['Score'])['LanguageCode'])


def entities_text():
    #Initialize amazon_comprehend client function
    client_comprehend = boto3.client(
        'comprehend',
    )
    response_entities = client_comprehend.detect_entities(
            Text = text,
            LanguageCode = 'en'
    )
    entities = list(set([obj['Type'] for obj in response_entities['Entities']]))
    #Print the Entities
    print("Entities:",entities)


def key_phrases_text():
    #Initialize amazon_comprehend client function
    client_comprehend = boto3.client(
        'comprehend',
    )
    response_key_phrases = client_comprehend.detect_key_phrases(
        Text = text,
        LanguageCode = 'en'
    )
    key_phrases = list(set([obj['Text'] for obj in response_key_phrases['KeyPhrases']]))
    #Print the Key Phrases
    print("Key Phrases:", key_phrases)


def sentiment_text():
    #Initialize amazon_comprehend client function
    client_comprehend = boto3.client(
        'comprehend',
    )
    response_sentiment = client_comprehend.detect_sentiment(
        Text = text,
        LanguageCode = 'en'
    )
    sentiment = response_sentiment['Sentiment']
    #Print the Sentiment
    print("Sentiment Analysis:" , sentiment)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description = __doc__,
        formatter_class = argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'text_file_path',
        help = 'The complete file path of the text file you want to analyze.')
    args = parser.parse_args()
    input_file(args.text_file_path)
    dominant_language_text()
    entities_text()
    key_phrases_text()
    sentiment_text()


end_time = time.monotonic()
print("Execution_Time:", timedelta(seconds = end_time - start_time))
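Each entry in the `Languages` list of a `detect_dominant_language` response carries a confidence `Score` alongside its `LanguageCode`, so the dominant language is the entry with the highest score. The selection can be sketched on its own without calling AWS; `dominant_language` is a hypothetical helper and `sample` is a made-up response with the same shape.

```python
def dominant_language(response):
    """Return the LanguageCode with the highest Score, or None if the list is empty."""
    languages = response.get("Languages", [])
    if not languages:
        return None
    return max(languages, key=lambda lang: lang["Score"])["LanguageCode"]


# Made-up response for illustration; the scores are not from a real call
sample = {"Languages": [
    {"LanguageCode": "de", "Score": 0.02},
    {"LanguageCode": "en", "Score": 0.97},
]}
print(dominant_language(sample))  # en
```

Note that sorting the same list by `LanguageCode` would return `de` here, since that orders entries alphabetically rather than by confidence.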

### [Microsoft Azure Text Analytics](https://azure.microsoft.com/en-us/resources/videos/learn-how-to-create-text-analytics-solutions-with-azure-machine-learning-templates/)

In [None]:
import requests
import os
import sys
import json
import time
from datetime import timedelta
import argparse


# Read the subscription key from an environment variable set locally
Ocp_Apim_Subscription_Key = os.environ.get("KEY_NLP")


# Append all console output to a text file
sys.stdout = open('nlp_api_output.txt', 'a')

start_time = time.monotonic()


def input_file(text_file_path):
    global text
    if os.path.isfile(text_file_path):
        with open(text_file_path, 'r') as text_file:
            text = text_file.read()
    else:
        print("File doesn't exist in the directory!")


def analyze_text():
    headers = {
        # NOTE: Replace the "Ocp-Apim-Subscription-Key" value with a valid subscription key.
        'Ocp-Apim-Subscription-Key': Ocp_Apim_Subscription_Key,
    }

    urls = ['https://eastus2.api.cognitive.microsoft.com/text/analytics/v2.0/languages',
            'https://eastus2.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment',
            'https://eastus2.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases']

    documents = { 'documents': [
        { 'id': '1', 'language': 'en', 'text': text }]}

    try:
        # NOTE: You must use the same location in your REST call as you used to obtain your subscription keys.
        #   For example, if you obtained your subscription keys from westus, replace "eastus2" in the
        #   URLs above with "westus".
        for url in urls:
            response = requests.post(url = url,
                                 headers = headers,
                                 data = (json.dumps(documents)).encode('utf-8'))
            data = response.json()
            print(data)
        print('\n')
    except Exception as e:
        print('Error: ', e)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description = __doc__,
        formatter_class = argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'text_file_path',
        help = 'The complete file path of the text file you want to analyze.')
    args = parser.parse_args()

    input_file(args.text_file_path)
    analyze_text()


end_time = time.monotonic()
print("Execution_Time:", timedelta(seconds = end_time - start_time))
print('\n')
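The script above sends the whole file as a single document with id `'1'`. When the input is a set of separate reviews, the same `documents` structure can carry one entry per review. The sketch below only builds the request bodies and sends nothing; `build_payload` and `chunked` are hypothetical helpers, and the batch size of 2 is arbitrary because the source does not state the service's per-request limits.

```python
import json


def build_payload(texts, language="en"):
    """Wrap texts in the {'documents': [...]} structure, with string ids counting from 1."""
    return {"documents": [
        {"id": str(i), "language": language, "text": t}
        for i, t in enumerate(texts, start=1)
    ]}


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


reviews = ["Nice place.", "Rooms were a bit small.", "Staff was friendly."]

# Arbitrary batch size; ids restart at 1 in each batch because each batch is its own request
for batch in chunked(reviews, 2):
    body = json.dumps(build_payload(batch)).encode("utf-8")
    print(len(batch), "documents,", len(body), "bytes")
```

Each `body` could then be passed as the `data` argument of `requests.post`, as in the script.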

END OF DOCUMENT