<div><img src="https://www.ibm.com/blogs/bluemix/wp-content/uploads/2017/02/NLU.png" width="270" height="270" align="right">

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/IBM_logo.svg/640px-IBM_logo.svg.png" width="90" height="90" align="right" style="margin:0px 25px"></div>

# Retrieve DataFrames to visualize data from IBM Watson Natural Language Understanding

In this R notebook, you'll use IBM Watson Natural Language Understanding (NLU) to analyze keywords from the websites of the Fortune 100 companies. You'll convert the results of NLU into a set of R DataFrames that you can easily analyze in a notebook. Then you'll create a visual representation of the keywords. 

Data scientists use NLU to uncover insights about sentiment and emotion from structured and unstructured data and to analyze text to extract metadata, such as concepts, entities, keywords, categories, relations, and semantic roles. NLU returns both the overall sentiment and emotion for the whole document and the specific sentiment and emotion for each of the keywords in the text, for deeper analysis.

To start with the tutorial, see [Visualize a customer base with Watson NLU](#visualize). To read a little more about the technology and some of the other fun things you can do, see [Getting personal about Watson Natural Language Understanding](#about) and [Using R & Watson NLU](#using_r).

This notebook runs on R with Spark 2.0.

## Table of contents

1.  [Getting personal about Watson Natural Language Understanding](#about)<br>
    1.1  [What data do we get from Watson NLU?](#what_data)

2.  [Using R & Watson NLU](#using_r)<br>
    2.1  [Using the `httr` and `jsonlite` packages](#httr)<br>
    2.2  [Functional access to Watson NLU](#functiondetails) <br>
    2.3  [Understand the `watsonNLUtoDF()` function](#watsonnlutodf)<br>

3.  [Visualize a customer base with Watson NLU](#visualize)<br>
    3.1  [Load the customer data](#visualize1) <br>
    3.2  [Shape the data for the NLU function](#visualize2)<br>
    3.3  [Send customersDF to Watson](#visualize3)<br>
    3.4  [Concatenate the concepts and keywords extracted from Watson](#visualize4)<br>
    3.5  [Format the Text for a Word Cloud](#visualize5)<br>
    3.6  [Visualize with the Brunel library](#visualize6)<br>

[Summary and next steps](#summary)

<a id='about'></a>
## 1. Getting personal about Watson Natural Language Understanding 

The first time I tested Watson NLU a huge grin spread across my face. For me it was another one of those moments where technology inspired awe - sort of like when I played my first PC game as a kid in 1990 or when I saw the iPhone in 2007.  So what exactly is Watson NLU? 

> *With a sophisticated suite of natural language processing capabilities, NLU can analyze text and extract meta-data from unstructured content such as concepts, entities, keywords, categories, sentiment, emotion, relations, semantic roles. You can also customize the text analysis with NLU for linguistic nuances specific to your domain or industry (such as entities and relations) with custom models developed using Watson Knowledge Studio. With customization, you can further improve the accuracy of meta-data extraction. Whether it is social media monitoring, content recommendation, or advertising optimization, NLU can be easily put to use for extracting the hard to find insights from unstructured content.*  

> Source: __[IBM Bluemix Blog](https://www.ibm.com/blogs/bluemix/2017/02/hello-nlu/)__

In other words, the NLU service allows you to send unstructured data to Watson and have it return a rich set of structured data.  For example, you can send Watson a block of text and it will understand the information contained therein. Alternatively, you can send a URL and extract all the information from it.

Ok, sounds kinda cool, but what's the 'wow' factor?  Read on.

<a id='what_data'></a>
### 1.1 What data do we get from Watson NLU?

The easiest way to get a sense of what NLU does is to __[try the demo](https://natural-language-understanding-demo.mybluemix.net/)__.  So, go there and come back after you've played around with submitting different URLs and text to the service.  

Back?  Are you impressed yet?  You must be!  As you saw, we get detailed information about the concepts, sentiment, and categorization of the data we send to Watson.  

<a id='using_r'></a>
## 2. Using R and Watson NLU

I wanted to work with the data from NLU in R but there didn't seem to be many resources available online.  Rather than work directly with the JSON returned by the service, I decided to write a function - `watsonNLUtoDF()` - that converts NLU results from JSON into a list of R DataFrames.  To do so, I had to use two excellent R packages.

<a id='httr'></a>
### 2.1 The `httr` and `jsonlite` packages

The `httr` package is one of the many R packages written by Hadley Wickham. It provides access to `curl` functionality from inside R, and its simplicity is classic Hadley. For example, running `POST()` or `GET()` against a URL takes a single call. For Watson NLU, the `POST()` must include authentication and a JSON-structured body:

> `> POST(URL, authenticate(username, password), body = toJSON(list(features)))`

You'll notice the `toJSON()` function in there.  That comes from the `jsonlite` package, which offers excellent support for converting R objects to and from JSON.  As you can see this came in handy when I needed to send JSON to Watson NLU.  It is also essential in parsing the response from Watson into R DataFrames. 
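
To make the JSON conversion concrete, here is a minimal sketch that builds a request body and round-trips it through `jsonlite` without contacting the service. The URL and the choice of features are placeholders for illustration, not a recommended configuration.

```r
library(jsonlite)

# Hypothetical payload: one URL and two requested features.
# A NULL inside a list is written as an empty JSON object ({}).
payload <- list(url = "www.example.com",
                features = list(concepts = NULL, keywords = NULL),
                return_analyzed_text = TRUE)

# auto_unbox = TRUE writes length-one vectors as scalars instead of arrays
json <- toJSON(payload, auto_unbox = TRUE)
print(json)

# fromJSON() reverses the conversion and returns an R list
roundTrip <- fromJSON(json)
print(roundTrip$url)
print(names(roundTrip$features))
```

Without `auto_unbox = TRUE`, `jsonlite` would wrap each single value in a JSON array, which is why the function in this notebook sets it.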

These two packages let me access Watson NLU from R, but I still needed a way to access the API in a more programmatic, scalable fashion. That's why I wrote the function.

<a id='functiondetails'></a>
### 2.2 Functional access to Watson NLU

You need to define the function to access NLU and understand its usage and arguments. You can skip ahead and look at the [function documentation](#watsonnlutodf). But before you can run the function, you need to get NLU credentials and import packages.

Sign up for NLU and add your NLU credentials:
1. Create a service for [Natural Language Understanding (NLU)](https://www.ibm.com/watson/developercloud/natural-language-understanding.html). 
1. Insert the username and password values for your NLU service in the following cell. 
1. Run the cell.

In [1]:
# The code was removed by DSX for sharing.
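
Because the credentials cell was stripped for sharing, here is a minimal sketch of what it might contain. It assumes that you keep the credentials in environment variables; the names `NLU_USERNAME` and `NLU_PASSWORD` are hypothetical, and the placeholder strings are only fallbacks. The later cells expect the variables `username` and `password` to exist.

```r
# Assumption: credentials are stored in environment variables with these
# (hypothetical) names. Replace the fallback strings with your own values
# if you prefer to paste them in directly.
username <- Sys.getenv("NLU_USERNAME", unset = "YOUR_NLU_USERNAME")
password <- Sys.getenv("NLU_PASSWORD", unset = "YOUR_NLU_PASSWORD")
```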

Install and import the necessary packages:

In [2]:
# This notebook uses version 2.3 of Brunel. If the version changes in the future, the visualization may not work and you will need 
# to update the version number in the code. 
 
install.packages('tm')
install.packages("devtools")
devtools::install_github("Brunel-Visualization/Brunel", subdir="R", ref="v2.3", force=TRUE)
library(brunel)
library(tm)
library(httr)
library(jsonlite)


Define the function to retrieve DataFrames from the Watson API:

In [3]:
watsonNLUtoDF <- function(data, username, password, verbose = F, language = 'en') {
  
  ## Url for Watson NLU service on Bluemix used to POST (send) content to the service to have it analyzed.  
  ## For more details: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#post-analyze 
  base_url <- "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27"
  
    ## Initialize Empty DataFrames
  conceptsDF <- data.frame()
  keywordsDF <- data.frame()
  sentimentDF <- data.frame()
  categoriesDF <- data.frame()
  analyzedTextDF <- data.frame()
  
  ## Loop over each id, identify the type and send the value to Watson
  for (i in seq_len(nrow(data))){
    try({
      
      id <- data$id[i]
      value <- data$value[i]
      
      ## Define the JSON payload for NLU
      body <- list(api_endpoint = value, 
                   features = list(
                     categories = {},
                     concepts = {},
                     keywords = {},
                     sentiment = {}),
                   language = language,
                   return_analyzed_text = TRUE)
      
      ## Provide the correct type for each id
      names(body)[1] <- data$type[i]
      
      if(verbose == T){
      print(paste("Sending", data$type[i], "for", id, "to Watson NLU..."))
      }
      
      ## Hit the API and return JSON
      watsonResponse <- POST(base_url,
                             content_type_json(),
                             authenticate(username, password, type = "basic"),
                             body = toJSON(body, auto_unbox = T)) 

      ## Parse the JSON response once, then split it into DataFrames
      parsed <- fromJSON(toJSON(content(watsonResponse), pretty = T), flatten = T)

      concepts <- data.frame(id = id, parsed$concepts, stringsAsFactors = F)
      keywords <- data.frame(id = id, parsed$keywords, stringsAsFactors = F)
      sentiment <- data.frame(id = id, parsed$sentiment, stringsAsFactors = F)
      categories <- data.frame(id = id, parsed$categories, stringsAsFactors = F)
      analyzedText <- data.frame(id = id, parsed$analyzed_text, stringsAsFactors = F)
      
      
      ## Append results to output DataFrames
      conceptsDF <- rbind(conceptsDF, concepts)
      keywordsDF <- rbind(keywordsDF, keywords)
      sentimentDF <- rbind(sentimentDF, sentiment)
      categoriesDF <- rbind(categoriesDF, categories)
      analyzedTextDF <- rbind(analyzedTextDF, analyzedText)
      
      if(verbose == T) {
      print(paste("Iteration", i, "of", nrow(data), "complete."))
      }
    })
  }
  resultsList <- list(conceptsDF, keywordsDF, sentimentDF, categoriesDF, analyzedTextDF, watsonResponse)
  names(resultsList) <- c("conceptsDF", "keywordsDF", "sentimentDF", "categoriesDF", "analyzedTextDF", "response")
  return(resultsList)
}
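
Two details of the function are easy to miss, so here is a small offline sketch of each. It uses toy values only: the first part shows how renaming the first list element switches the payload between `url` and `text`, and the second shows how wrapping the loop body in `try()` lets the loop continue after one request fails.

```r
# 1. Renaming the first element of the payload.
#    "api_endpoint" is only a placeholder name that gets overwritten.
body <- list(api_endpoint = "www.example.com", features = list())
names(body)[1] <- "url"
print(names(body))

# 2. try() keeps the loop going when one iteration throws an error.
completed <- c()
for (i in 1:3) {
  try({
    if (i == 2) stop("simulated request failure")
    completed <- c(completed, i)
  }, silent = TRUE)
}
print(completed)
```

A consequence of the `try()` wrapper is that a failed row is silently missing from the output DataFrames, so it can be worth comparing the returned ids against the input ids.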

<a id='watsonnlutodf'></a>
### 2.3 Understand the `watsonNLUtoDF()` function

#### Usage
 > `watsonNLUtoDF(data, username, password, verbose = F, language = 'en')`

#### Arguments
 > - **data:** *DataFrame* with 3 columns: 
     - `id` *(string)* 
     - `type` *(string)*, must be one of `url` or `text` - think of this as which API endpoint you want to submit to. 
     - `value` *(string)*, contains the data you want to send.
     
 > * **username:** *string*.  User name for the NLU service on Bluemix.
 
 > * **password:** *string*.  Password for the NLU service on Bluemix.
 
 > * **verbose**: *boolean*.  Print messages showing progress. Defaults to `F`.
 
 > * **language**: *string*.  Default is English. [See the API docs](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/#supported-languages) for available options.
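
The requirements on the `data` argument can be checked before any request is sent. The following is a sketch only: `checkNLUInput()` is a hypothetical helper that is not part of the notebook, and it tests just the column names and the allowed `type` values described above.

```r
# Hypothetical helper: stop early if the input DataFrame is not shaped
# the way watsonNLUtoDF() expects.
checkNLUInput <- function(data) {
  required <- c("id", "type", "value")
  missingCols <- setdiff(required, names(data))
  if (length(missingCols) > 0) {
    stop("Missing column(s): ", paste(missingCols, collapse = ", "))
  }
  badType <- !(data$type %in% c("url", "text"))
  if (any(badType)) {
    stop("Rows with a type other than 'url' or 'text': ",
         paste(which(badType), collapse = ", "))
  }
  invisible(TRUE)
}

# Usage on a toy DataFrame
exampleDF <- data.frame(id = "Example", type = "url", value = "www.example.com",
                        stringsAsFactors = FALSE)
checkNLUInput(exampleDF)
```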
 
#### Value

 > A *list* of five DataFrames extracted from the service, plus the raw `httr` response from the last request:
     1. conceptsDF
     2. keywordsDF
     3. sentimentDF
     4. categoriesDF
     5. analyzedTextDF
     6. response
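
Because the return value is a named list, how you index it matters. The sketch below uses a small mock list (not real service output) to show the difference between extracting a DataFrame by name and taking a one-element sub-list by position.

```r
# Mock results with the same element names the function uses
mockResults <- list(
  conceptsDF   = data.frame(id = "a", text = "Football", stringsAsFactors = FALSE),
  categoriesDF = data.frame(id = "a", label = "/sports/football", stringsAsFactors = FALSE)
)

byName   <- mockResults$categoriesDF       # the DataFrame itself
byDouble <- mockResults[["categoriesDF"]]  # the same DataFrame
bySingle <- mockResults[2]                 # a list of length one that contains the DataFrame

print(class(byName))
print(class(bySingle))
```

Extracting by name is usually the safer choice, since it keeps working if the order of the list elements ever changes.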
  
#### Example:
Create a DataFrame with `id`, `type`, and `value` columns that contains a row with a URL and a row with text:

In [4]:
df <- data.frame(id = c("Seattle Seahawks", "Seattle Sounders"), 
                 type = c("url", "text"),
                 value = c("www.seahawks.com", 
                           "From Wikipedia.org: Seattle Sounders FC is an American professional soccer club based 
                           in Seattle, Washington. The Sounders compete as a member of the Western Conference of 
                           Major League Soccer (MLS) and are the league's current defending champions, having won
                           the 2016 MLS Cup. The club was established on November 13, 2007, and began play in 2009
                           as an MLS expansion team. The Sounders are the third Seattle soccer club to share the 
                           Sounders name being part of a legacy which traces back to the original team of the NASL
                           in 1974."),
                 stringsAsFactors = F)

head(df)

id,type,value
Seattle Seahawks,url,www.seahawks.com
Seattle Sounders,text,"From Wikipedia.org: Seattle Sounders FC is an American professional soccer club based in Seattle, Washington. The Sounders compete as a member of the Western Conference of Major League Soccer (MLS) and are the league's current defending champions, having won  the 2016 MLS Cup. The club was established on November 13, 2007, and began play in 2009  as an MLS expansion team. The Sounders are the third Seattle soccer club to share the Sounders name being part of a legacy which traces back to the original team of the NASL  in 1974."


Run the `watsonNLUtoDF` function on the DataFrame, then display the returned concepts and categories DataFrames:

In [5]:
## Send properly formatted DataFrame with credentials to Watson
responseList <- watsonNLUtoDF(df, username, password, verbose = T)

head(responseList$conceptsDF)
head(responseList$categoriesDF)

[1] "Sending url for Seattle Seahawks to Watson NLU..."
[1] "Iteration 1 of 2 complete."
[1] "Sending text for Seattle Sounders to Watson NLU..."
[1] "Iteration 2 of 2 complete."


id,text,relevance,dbpedia_resource
Seattle Seahawks,National Football League,0.9576,http://dbpedia.org/resource/National_Football_League
Seattle Seahawks,Kansas City Chiefs,0.545,http://dbpedia.org/resource/Kansas_City_Chiefs
Seattle Seahawks,Seattle Seahawks,0.5177,http://dbpedia.org/resource/Seattle_Seahawks
Seattle Seahawks,Pete Carroll,0.4893,http://dbpedia.org/resource/Pete_Carroll
Seattle Seahawks,National Football League exhibition season,0.3404,http://dbpedia.org/resource/National_Football_League_exhibition_season
Seattle Seahawks,Season ticket,0.3399,http://dbpedia.org/resource/Season_ticket


id,score,label
Seattle Seahawks,0.9999,/sports/football
Seattle Seahawks,0.3846,/business and industrial/business news
Seattle Seahawks,0.2957,/technology and computing/internet technology/social network
Seattle Sounders,0.7794,/sports/soccer
Seattle Sounders,0.3538,/sports/gymnastics
Seattle Sounders,0.2107,/real estate


The first returned DataFrame shows which concepts are most relevant. The second returned DataFrame shows which categories best describe each input. Watson is very confident that the Seahawks site is about football! 
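
Once the results are in DataFrames, ordinary R tools apply. As a sketch, the following recreates the categories output shown above by hand and keeps only the highest-scoring category for each `id`. The base R approach here is one option among several.

```r
# Categories output from the example above, typed in by hand
categoriesDF <- data.frame(
  id = rep(c("Seattle Seahawks", "Seattle Sounders"), each = 3),
  score = c(0.9999, 0.3846, 0.2957, 0.7794, 0.3538, 0.2107),
  label = c("/sports/football",
            "/business and industrial/business news",
            "/technology and computing/internet technology/social network",
            "/sports/soccer",
            "/sports/gymnastics",
            "/real estate"),
  stringsAsFactors = FALSE
)

# Split by id, keep the row with the highest score, and recombine
topCategory <- do.call(rbind, lapply(split(categoriesDF, categoriesDF$id),
                                     function(d) d[which.max(d$score), ]))
print(topCategory)
```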

<a id='visualize'></a>

## 3. Visualize a customer base with Watson NLU

At this point, you should have a good idea of what the NLU service provides and how to access it in R. Now have a little fun and visualize some output.

The data science goal of this notebook is understanding market segmentation across a customer base.  **Can you use an AI engine to better understand and classify both existing and potential customers?  Watson is well suited for this task!**  Use the `watsonNLUtoDF` function to send Watson a DataFrame full of company names and a mix of their URLs and text descriptions. Then  build a word cloud with the concepts and keywords that are returned.

<a id='visualize1'></a>
### 3.1  Load the customer data
Download the data set from the DSX community and load it into a DataFrame.

To load the data:
1. Go to the [Fortune 100 companies data set](https://apsportal.ibm.com/exchange/public/entry/view/4d26cd0dd964734bc23c6475a8dc454b) on the DSX Community. 
2. Click the download icon and save the data set as a .csv file to your computer.  
3. Load the `fortune100.csv` file into your notebook. Click the **Find and Add Data** icon on the notebook action bar. Drop the file into the box or browse to select the file. The file is loaded to your object storage and appears in the Data Assets section of the project. For more information, see <a href="https://datascience.ibm.com/docs/content/analyze-data/load-and-access-data.html" target="_blank" rel="noopener noreferrer">Load and access data</a>.
4. To load the data from the `fortune100.csv` file into an R DataFrame, click in the next code cell and select **Insert to code > Insert R DataFrame** under the file name.
5. Rename the two instances of `df.data.x` to `customersDF` in the last two lines.
6. Run the cell.

In [6]:
# Load data using Insert to code > Insert R DataFrame



id,type,value
Walmart,url,http://www.walmart.com
Exxon Mobil,url,http://www.exxonmobil.com
Apple,url,http://www.apple.com
Berkshire Hathaway,url,http://www.berkshirehathaway.com
McKesson,url,http://www.mckesson.com
UnitedHealth Group,url,http://www.unitedhealthgroup.com


<a id='visualize2'></a>
### 3.2 Shape the data for the NLU function

The DataFrame already has correctly named columns of `id`, `type`, and `value`. Make sure that all the column types are strings:

1. Copy the value of the `file` argument from the `read.csv` function in the previous cell and replace the text `YOUR_VALUE`. For example: <br>
    `file = getObjectStorageFileWithCredentials_xxxxxxxxx("<your project>", "fortune100.csv")`
1. Run the cell.

In [7]:
# Copy and paste the read.csv command from the cell above, adding 'stringsAsFactors = F' as an additional parameter 
# to ensure that column types are returned as strings
#   ex: customersDF <-  read.csv(file = getObjectStorageFileWithCredentials_xxxxxxxxx("<your project>", "fortune100.csv"), stringsAsFactors = F)

customersDF <- read.csv(file = YOUR_VALUE, stringsAsFactors = F)

str(customersDF)

'data.frame':	100 obs. of  3 variables:
 $ id   : chr  "Walmart" "Exxon Mobil" "Apple" "Berkshire Hathaway" ...
 $ type : chr  "url" "url" "url" "url" ...
 $ value: chr  "http://www.walmart.com" "http://www.exxonmobil.com" "http://www.apple.com" "http://www.berkshirehathaway.com" ...


Looks good!  
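
If a DataFrame has already been loaded with factor columns, you can also convert it in place instead of reading the file again. This is a sketch on a two-row toy DataFrame, not the notebook's data set.

```r
# Toy DataFrame deliberately created with factor columns
exampleDF <- data.frame(id = c("Walmart", "Apple"),
                        type = c("url", "url"),
                        value = c("http://www.walmart.com", "http://www.apple.com"),
                        stringsAsFactors = TRUE)
print(sapply(exampleDF, class))

# Convert every column to character while keeping the DataFrame structure
exampleDF[] <- lapply(exampleDF, as.character)
print(sapply(exampleDF, class))
```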

<a id='visualize3'></a>
### 3.3 Send `customersDF`  to Watson 
Run the `watsonNLUtoDF` function to send the customersDF DataFrame to Watson NLU:

In [8]:
responseList <- watsonNLUtoDF(customersDF, username, password)

<a id='visualize4'></a>
### 3.4 Concatenate the concepts and keywords extracted from Watson 
Now concatenate the text of the resulting concepts and keywords into a single `customerDocument` string:

In [9]:
customerDocument <- paste(responseList$conceptsDF$text, responseList$keywordsDF$text, sep = " ", collapse = " ")
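
One thing to be aware of: when `paste()` receives two vectors of different lengths, it recycles the shorter one, so some terms can be counted more than once. The toy vectors below illustrate the effect and show an alternative that joins the vectors with `c()` first. Whether the difference matters for your word cloud is a judgment call.

```r
# Toy vectors of unequal length
concepts <- c("Retail", "Energy")
keywords <- c("stores", "fuel", "phones")

# Element-wise paste recycles 'concepts', so "Retail" appears twice
recycled <- paste(concepts, keywords, sep = " ", collapse = " ")
print(recycled)

# Combining the vectors first keeps each term exactly once
combined <- paste(c(concepts, keywords), collapse = " ")
print(combined)
```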

<a id='visualize5'></a>
### 3.5 Format the Text for a Word Cloud
Use functions from the R tm package to format the text so that you can create a word cloud:

- Remove punctuation, numbers, and extra whitespace
- Remove very short words (fewer than 4 characters) and very long words (more than 20 characters)
- Calculate the frequency of each word
- Drop words that appear fewer than 8 times

In [10]:
## Create Corpus from concatenated text
customerCorpus <- Corpus(VectorSource(customerDocument))

## Scrub text
customerCorpus <- tm_map(customerCorpus, removePunctuation)
customerCorpus <- tm_map(customerCorpus, removeNumbers)
customerCorpus <- tm_map(customerCorpus, stripWhitespace)

## Remove words less than 4 characters or greater than 20
customerTDM <- TermDocumentMatrix(customerCorpus, control = list(wordLengths = c(4, 20)))

## Calculate word frequencies
wordFreqDF <- data.frame(word = row.names(as.matrix(customerTDM)), freq = as.vector(customerTDM))

## Drop words with a count less than 8
wordFreqDF <- subset(wordFreqDF, freq >= 8)
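
To see what these steps do without the `tm` machinery, here is a rough base R sketch of the same idea on a toy sentence. It is an approximation for illustration, not a drop-in replacement: it lowercases explicitly (the notebook output is lowercase too) and uses a threshold of 2 instead of 8 because the toy text is so short.

```r
toyDocument <- "Cloud services, cloud storage and 24 retail stores; retail banking."

# Scrub punctuation, digits, and repeated whitespace
cleaned <- gsub("[[:punct:]]", "", toyDocument)
cleaned <- gsub("[[:digit:]]", "", cleaned)
cleaned <- gsub("\\s+", " ", cleaned)

# Split into lowercase words and keep those with 4 to 20 characters
words <- tolower(strsplit(trimws(cleaned), " ")[[1]])
words <- words[nchar(words) >= 4 & nchar(words) <= 20]

# Count the words and keep the frequent ones
counts <- table(words)
toyFreqDF <- data.frame(word = names(counts), freq = as.vector(counts),
                        stringsAsFactors = FALSE)
frequentWords <- subset(toyFreqDF, freq >= 2)
print(frequentWords)
```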

Look at the resulting `wordFreqDF` DataFrame:

In [11]:
wordFreqDF

Unnamed: 0,word,freq
6,access,9
56,airlines,21
57,airport,8
67,amazon,22
69,america,16
71,american,23
80,annual,8
82,annuity,9
90,apple,19
92,appliance,13
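
The table is listed alphabetically, so the most frequent words are not obvious at a glance. As a sketch, the following retypes the ten rows shown above and sorts them by frequency. With the full `wordFreqDF` you would apply the same `order()` call directly.

```r
# The ten rows displayed above, typed in by hand
wordFreqSample <- data.frame(
  word = c("access", "airlines", "airport", "amazon", "america",
           "american", "annual", "annuity", "apple", "appliance"),
  freq = c(9, 21, 8, 22, 16, 23, 8, 9, 19, 13),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent
sortedSample <- wordFreqSample[order(-wordFreqSample$freq), ]
print(head(sortedSample, 3))
```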


<a id='visualize6'></a>
### 3.6 Visualize with the Brunel library
Finally, use the Brunel library to render the word cloud and see themes among the Fortune 100 companies.

In [12]:
brunel (" data('wordFreqDF') cloud color(word) size(freq) label(word) mean(freq) legends(none)",
        width = 800, height = 600, online_js = TRUE)

<a id='summary'></a>
## Summary and next steps

Congratulations! In this notebook you learned about the Watson Natural Language Understanding API and how to access it in a programmatic way by using the R programming language.  

Try substituting your own client list for the Fortune 100 data and creating a word cloud.

### Author

**Rafi Kurlansik** is an Open Source Solutions Engineer specializing in big data technologies, such as Hadoop and Spark. He's responsible for developing and delivering demonstrations of IBM tech to both enterprise clients and the larger analytics community. Kurlansik has hands-on experience with machine learning, natural language processing, data visualization, and dashboard development. If you're wondering where he comes down on the biggest data science debate of our day, Rafi is, in his own words, "an avid R fan, especially RStudio!" 

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.