# Classification and Attribution of Data

## 1. Setup
To prepare your environment, you need to install some packages and enter credentials for the Watson services.

### 1.1 Install the necessary packages

You need the latest versions of these packages:<br>
Watson Developer Cloud: a client library for Watson services.<br>
NLTK: a leading platform for building Python programs that work with human language data.<br>
python-keystoneclient: a client for the OpenStack Identity API.<br>
python-swiftclient: a Python client for the Swift API.<br><br>
**Install the Watson Developer Cloud package:**

In [1]:
!pip install --upgrade watson-developer-cloud

Requirement already up-to-date: watson-developer-cloud in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages
Requirement already up-to-date: pyOpenSSL>=16.2.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages (from watson-developer-cloud)
Requirement already up-to-date: requests<3.0,>=2.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages (from watson-developer-cloud)
Requirement already up-to-date: pysolr<4.0,>=3.3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages (from watson-developer-cloud)
Requirement already up-to-date: cryptography>=1.9 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages (from pyOpenSSL>=16.2.0->watson-developer-cloud)
Requirement already u

**Install NLTK:**

In [2]:
!pip install --upgrade nltk

Requirement already up-to-date: nltk in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s973-7d640fb4db0d6f-c5c16a29391b/.local/lib/python2.7/site-packages
Requirement already up-to-date: six in /usr/local/src/bluemix_jupyter_bundle.v53/notebook/lib/python2.7/site-packages (from nltk)


**Install the IBM Bluemix Object Storage client:**

In [3]:
!pip install python-swiftclient



**<font color=blue>Now restart the kernel by choosing Kernel > Restart.</font>**

### 1.2 Import packages and libraries

Import the packages and libraries that you'll use:

In [4]:
import json
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
import watson_developer_cloud.natural_language_understanding.features.v1 \
  as Features
    
import swiftclient
import re
import nltk
from nltk import word_tokenize,sent_tokenize,ne_chunk

## 2. Configuration

Set the configurable items for the notebook in the cells below.

### 2.1 Add your service credentials from Bluemix for the Watson services

You must create a Watson Natural Language Understanding (NLU) service on Bluemix.
Insert the username and password values for your NLU service in the following cell. Do not change the value of the version field.

Run the cell.

In [5]:
# @hidden_cell
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2017-02-27',
    username="",
    password="")

### 2.2 Add your service credentials for Object Storage

You must create an Object Storage service on Bluemix.
To access data in a file in Object Storage, you need the Object Storage authentication credentials.
Insert the Object Storage authentication credentials as <i><b>credentials_1</b></i> in the following cell after
removing the current contents of the cell.


In [6]:
# @hidden_cell
credentials_1 = {
  'auth_url':'',
  'project':'',
  'project_id':'',
  'region':'',
  'user_id':'',
  'domain_id':'',
  'domain_name':'',
  'username':'',
  'password':'',
  'container':'',
  'tenantId':'',
  'filename':''
}
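
Before moving on, it can help to confirm that the fields needed to open the Object Storage connection are filled in. The following is a minimal sketch and not part of the original notebook: `missing_credentials` and `REQUIRED_KEYS` are hypothetical names, the key list is taken from the fields that the connection cell in section 4.1 reads, and the example values are placeholders.

```python
# Hypothetical helper: report which required credential fields are still empty.
# The required keys are the ones the Object Storage connection cell reads.
REQUIRED_KEYS = ['auth_url', 'project_id', 'user_id', 'region', 'password', 'container']

def missing_credentials(credentials, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in order."""
    return [key for key in required if not credentials.get(key)]

# Example with placeholder values (not real credentials).
example_credentials = {
    'auth_url': 'https://example.invalid',
    'project_id': 'my-project-id',
    'user_id': '',
    'region': 'my-region',
    'password': 'secret',
}
print(missing_credentials(example_credentials))
```

An empty list means every field the connection cell relies on has a value.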

### 2.3 Global Variables

Define the global variables used by the rest of the notebook.


In [7]:
# Specify file names for sample text and configuration files
sampleTextFileName = "sample_text.txt"
sampleConfigFileName = "sample_config.txt"


### 2.4 Configure and download required NLTK packages

Download the 'punkt' and 'averaged_perceptron_tagger' NLTK packages, which are used for tokenization and POS tagging respectively.

In [8]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /gpfs/fs01/user/s973
[nltk_data]     -7d640fb4db0d6f-c5c16a29391b/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /gpfs/fs01/user/s973-7d640fb4db0d6f-
[nltk_data]     c5c16a29391b/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 3. Classification

Write the classification-related utility functions in a modular form.

### 3.1 Watson NLU Classification

In [9]:
def analyze_using_NLU(analysistext):
    """ Call Watson Natural Language Understanding service to obtain analysis results.
    """
    # Request the top two entities and keywords, each with emotion and sentiment scores
    response = natural_language_understanding.analyze(
        text=analysistext,
        features=[
            Features.Entities(emotion=True, sentiment=True, limit=2),
            Features.Keywords(emotion=True, sentiment=True, limit=2)
        ])
    return response
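
The call returns a dictionary with `keywords` and `entities` lists, as the output in section 5 shows. As a rough sketch of how those lists can be read, the snippet below works on a small hand-written dictionary that mimics that shape. `summarize_response` is a hypothetical helper, and the sample values are abbreviated from the output later in this notebook rather than taken from a new API call.

```python
def summarize_response(response):
    """Return (kind, text, relevance) tuples sorted by descending relevance."""
    rows = []
    for kind in ('keywords', 'entities'):
        for item in response.get(kind, []):
            rows.append((kind, item.get('text'), item.get('relevance', 0.0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hand-written sample in the shape of an NLU response (values abbreviated).
sample_response = {
    'keywords': [
        {'text': 'Stephen Hawking', 'relevance': 0.96866},
        {'text': 'amyotrophic lateral sclerosis', 'relevance': 0.884134},
    ],
    'entities': [
        {'type': 'Person', 'text': 'Stephen Hawking',
         'relevance': 0.846941, 'count': 5},
    ],
}

for kind, text, relevance in summarize_response(sample_response):
    print('{0:<9} {1:<32} {2:.3f}'.format(kind, text, relevance))
```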

### 3.2 Augmented Classification

Custom classification utility functions for augmenting the results of the Watson NLU API call.

In [10]:
def split_sentences(text):
    """ Split text into sentences.
    """
    sentence_delimiters = re.compile(u'[\\[\\]\n.!?]')
    sentences = sentence_delimiters.split(text)
    return sentences

def split_into_tokens(text):
    """ Split text into tokens.
    """
    tokens = nltk.word_tokenize(text)
    return tokens
    
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    POSofText = nltk.tag.pos_tag(text)
    return POSofText

def keyword_tagging(tag,tagtext,text):
    """ Tag the text matching keywords.
    """
    # Case-insensitive search; return the match as it appears in the text
    start = text.lower().find(tagtext.lower())
    if (start != -1):
        return text[start:start+len(tagtext)]
    else:
        return 'UNKNOWN'
    
def regex_tagging(tag,regex,text):
    """ Tag the text matching REGEX.
    """
    p = re.compile(regex, re.IGNORECASE)
    # findall already returns a (possibly empty) list of matches
    return p.findall(text)

def chunk_tagging(tag,chunk,text):
    """ Tag the text using chunking.
    """
    parsed_cp = nltk.RegexpParser(chunk)
    pos_cp = parsed_cp.parse(text)
    chunk_list=[]
    for root in pos_cp:
        if isinstance(root, nltk.tree.Tree):               
            if root.label() == tag:
                chunk_word = ''
                for child_root in root:
                    chunk_word = chunk_word +' '+ child_root[0]
                chunk_list.append(chunk_word)
    return chunk_list
    
def augument_NLUResponse(responsejson,updateType,text,tag):
    """ Update the NLU response JSON with augmented classifications.
    """
    if(updateType == 'keyword'):
        if not any(d.get('text', None) == text for d in responsejson['keywords']):
            responsejson['keywords'].append({"text":text,"relevance":0.5})
    else:
        if not any(d.get('text', None) == text for d in responsejson['entities']):
            responsejson['entities'].append({"type":tag,"text":text,"relevance":0.5,"count":1})        
    

def classify_text(text, config):
    """ Perform augmented classification of the text.
    """
    
    response = analyze_using_NLU(text)
    responsejson = response
    
    sentenceList = split_sentences(text)
    
    tokens = split_into_tokens(text)
    
    postags = POS_tagging(tokens)
    
    configjson = json.loads(config)
    for stages in configjson['configuration']['classification']['stages']:
        print('Stage - Performing ' + stages['name']+':')
        for steps in stages['steps']:
            print('    Step - ' + steps['type']+':')
            if (steps['type'] == 'keywords'):
                for keyword in steps['keywords']:
                    for word in sentenceList:
                        wordtag = keyword_tagging(keyword['tag'],keyword['text'],word)
                        if(wordtag != 'UNKNOWN'):
                            print('      '+keyword['tag']+':'+wordtag)
                            augument_NLUResponse(responsejson,'entities',wordtag,keyword['tag'])
            elif(steps['type'] == 'd_regex'):
                for regex in steps['d_regex']:
                    for word in sentenceList:
                        regextags = regex_tagging(regex['tag'],regex['pattern'],word)
                        if (len(regextags)>0):
                            for words in regextags:
                                print('      '+regex['tag']+':'+words)
                                augument_NLUResponse(responsejson,'entities',words,regex['tag'])
            elif(steps['type'] == 'chunking'):
                for chunk in steps['chunk']:
                    chunktags = chunk_tagging(chunk['tag'],chunk['pattern'],postags)
                    if (len(chunktags)>0):
                        for words in chunktags:
                            print('      '+chunk['tag']+':'+words)
                            augument_NLUResponse(responsejson,'entities',words,chunk['tag'])
            else:
                print('UNKNOWN STEP')
    
    return responsejson

def replace_unicode_strings(response):
    """ Convert dict with unicode strings to strings.
    Relies on Python 2 names (dict.iteritems, unicode), matching the notebook's Python 2.7 kernel.
    """
    if isinstance(response, dict):
        return {replace_unicode_strings(key): replace_unicode_strings(value) for key, value in response.iteritems()}
    elif isinstance(response, list):
        return [replace_unicode_strings(element) for element in response]
    elif isinstance(response, unicode):
        return response.encode('utf-8')
    else:
        return response
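
To see how the keyword and regex taggers behave without calling the NLU service, here is a small stand-alone sketch. It re-implements the matching logic in compact form rather than importing the functions above. `find_keyword`, `find_regex`, and `add_entity` are hypothetical names, and the sample sentence is made up for illustration. The date and email patterns are the ones shown in the sample configuration in section 5.

```python
import re

def find_keyword(keyword, sentence):
    """Return the keyword as it appears in the sentence, or None if absent."""
    start = sentence.lower().find(keyword.lower())
    if start == -1:
        return None
    return sentence[start:start + len(keyword)]

def find_regex(pattern, sentence):
    """Return all case-insensitive matches of pattern in the sentence."""
    return re.findall(pattern, sentence, re.IGNORECASE)

def add_entity(response, tag, text):
    """Append an entity unless one with the same text is already present."""
    if not any(entity.get('text') == text for entity in response['entities']):
        response['entities'].append(
            {'type': tag, 'text': text, 'relevance': 0.5, 'count': 1})

# Made-up sample sentence and an empty response skeleton.
demo_sentence = 'He had a passion for science; write to jane.doe@example.com by 01/08/1942'
demo_response = {'keywords': [], 'entities': []}

found = find_keyword('Science', demo_sentence)
if found is not None:
    add_entity(demo_response, 'Passion', found)
for match in find_regex(r'(\d+/\d+/\d+)', demo_sentence):
    add_entity(demo_response, 'Date', match)
for match in find_regex(r'\b[\w.-]+?@\w+?\.\w+?\b', demo_sentence):
    add_entity(demo_response, 'Email', match)
# A repeated match is skipped because the same text is already recorded.
add_entity(demo_response, 'Passion', 'science')

for entity in demo_response['entities']:
    print(entity['type'] + ':' + entity['text'])
```

The keyword match keeps the casing found in the text, not the casing in the configuration, so a configured keyword of `Science` is reported as `science` here.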


## 4. Persistence and Storage

### 4.1 Configure Object Storage Client

In [11]:
auth_url = credentials_1['auth_url']+"/v3"
container = credentials_1["container"]

IBM_Objectstorage_Connection = swiftclient.Connection(
    key=credentials_1['password'], authurl=auth_url, auth_version='3', os_options={
        "project_id": credentials_1['project_id'], "user_id": credentials_1['user_id'], "region_name": credentials_1['region']})

def create_container(container_name):
    """ Create a container on Object Storage.
    """
    x = IBM_Objectstorage_Connection.put_container(container_name)
    return x

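def put_object(container_name, fname, contents, content_type):
    """ Write contents to Object Storage.
    """
    # Pass content_type by keyword: the fourth positional parameter of
    # swiftclient's Connection.put_object is content_length, not content_type.
    x = IBM_Objectstorage_Connection.put_object(
        container_name,
        fname,
        contents,
        content_type=content_type)
    return x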

def get_object(container_name, fname):
    """ Retrieve contents from Object Storage.
    """
    Object_Store_file_details = IBM_Objectstorage_Connection.get_object(
        container_name, fname)
    return Object_Store_file_details[1]
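
A swiftclient connection's `get_object` returns a `(headers, contents)` tuple, which is why the helper above returns element `[1]`. The sketch below illustrates that round trip with a hypothetical in-memory stand-in, `FakeConnection`, so that it runs without Object Storage credentials. It is not the swiftclient implementation and only mimics the two calls used in this notebook; the container and file names are made up.

```python
class FakeConnection(object):
    """Hypothetical in-memory stand-in that mimics two swiftclient calls."""

    def __init__(self):
        self._objects = {}

    def put_object(self, container, obj, contents, content_type=None):
        # Store the payload and its content type under (container, name).
        self._objects[(container, obj)] = (content_type, contents)

    def get_object(self, container, obj):
        # Return a (headers, contents) tuple, as the real call does.
        content_type, contents = self._objects[(container, obj)]
        return {'content-type': content_type}, contents

connection = FakeConnection()
connection.put_object('demo-container', 'note.txt', 'hello',
                      content_type='text/plain')

headers, contents = connection.get_object('demo-container', 'note.txt')
print(headers['content-type'])
print(contents)
```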

## 5. Classify text
Read the data file for classification from Object Storage.<br>
Read the configuration file for augmented classification from Object Storage.<br>
Persist the classification results as a text file in Object Storage.

In [12]:
# Load the text from Object Storage
text = get_object(container, sampleTextFileName)

# Load the json configuration from Object Storage
config = get_object(container, sampleConfigFileName)

# Print the json configuration
print("## Using the configuration ##")
print(config)

## Using the configuration ##
{
  "configuration": {
    "classification": {
      "stages": [
        {
          "name": "Base Tagging",
          "steps": [
            {
              "type": "keywords",
              "keywords": [
                {
                  "tag": "Passion",
                  "text": "Science"
                },
                {
                  "tag": "Subjects",
                  "text": "cosmology"
                }
              ]
            },
            {
              "type": "d_regex",
              "d_regex": [
                {
                  "tag": "Date",
                  "pattern": "(\\d+/\\d+/\\d+)"
                }
              ]
            },
            {
              "type": "d_regex",
              "d_regex": [
                {
                  "tag": "Email",
                  "pattern": "\\b[\\w.-]+?@\\w+?\\.\\w+?\\b"
                }
              ]
            },
            {
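
Each step in the configuration names a `type`, and `classify_text` reads the step's items from a matching key: `keywords` for keyword steps, `d_regex` for regex steps, and `chunk` for `chunking` steps. A small check before classification can catch a mistyped step early. This is a sketch only: `validate_config` and `PAYLOAD_KEYS` are hypothetical names, and the sample configuration is a shortened version of the one printed above with one deliberately wrong step added.

```python
import json

# Step type -> key that classify_text reads the step's items from.
PAYLOAD_KEYS = {'keywords': 'keywords', 'd_regex': 'd_regex', 'chunking': 'chunk'}

def validate_config(config_text):
    """Return a list of problems found in the configuration (empty if none)."""
    problems = []
    config = json.loads(config_text)
    for stage in config['configuration']['classification']['stages']:
        for index, step in enumerate(stage['steps']):
            where = '{0} / step {1}'.format(stage['name'], index)
            payload_key = PAYLOAD_KEYS.get(step.get('type'))
            if payload_key is None:
                problems.append(
                    '{0}: unknown type {1}'.format(where, step.get('type')))
            elif payload_key not in step:
                problems.append(
                    '{0}: missing key {1}'.format(where, payload_key))
    return problems

sample_config = json.dumps({
    'configuration': {'classification': {'stages': [{
        'name': 'Base Tagging',
        'steps': [
            {'type': 'keywords',
             'keywords': [{'tag': 'Passion', 'text': 'Science'}]},
            {'type': 'd_regex',
             'd_regex': [{'tag': 'Date', 'pattern': r'(\d+/\d+/\d+)'}]},
            {'type': 'regex', 'd_regex': []},  # deliberately mistyped type
        ],
    }]}}
})

print(validate_config(sample_config))
```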
  

In [13]:
# Classify the text
response = classify_text(text, config)

Stage - Performing Base Tagging:
    Step - keywords:
      Passion:science
      Passion:science
      Subjects:cosmology
      Subjects:cosmology
    Step - d_regex:
      Date:01/08/1942
    Step - d_regex:
    Step - d_regex:
      PhoneNumber:1112223333
    Step - chunking:
      NP: an early age
      NP: a passion
      NP: science
      NP: the sky
      NP: age
      NP: cosmology
      NP: amyotrophic lateral sclerosis
      NP: illness
      NP: work
      NP: cosmology
      NP: science
      NP: everyone
      NP: phone
      NP: email
      NP: yahoo.com
      NAME: Stephen Hawking
      NAME: Oxford
      NAME: England
      NAME: Hawking
      NAME: University
      NAME: Cambridge
      NAME: Stephen Hawking
      NAME: @
Stage - Performing Domain Tagging:
    Step - d_regex:
      Year:1942
      Year:1112
      Year:2233


In [22]:
# replace unicode strings and convert dict to str for storage
response = str(replace_unicode_strings(response))
print("~~ Text Classification ~~")
print(response)

~~ Text Classification ~~
{'keywords': [{'relevance': 0.96866, 'text': 'Stephen Hawking', 'sentiment': {'score': 0.0}, 'emotion': {'anger': 0.075166, 'joy': 0.045775, 'sadness': 0.175093, 'fear': 0.117016, 'disgust': 0.047847}}, {'relevance': 0.884134, 'text': 'amyotrophic lateral sclerosis', 'sentiment': {'score': -0.253741}, 'emotion': {'anger': 0.102951, 'joy': 0.008111, 'sadness': 0.377757, 'fear': 0.137669, 'disgust': 0.044538}}], 'entities': [{'emotion': {'anger': 0.024758, 'joy': 0.222444, 'sadness': 0.434388, 'fear': 0.153419, 'disgust': 0.050352}, 'count': 5, 'sentiment': {'score': 0.201512}, 'text': 'Stephen Hawking', 'disambiguation': {'subtype': ['Academic', 'Astronomer', 'AwardNominee', 'AwardWinner', 'BoardMember', 'Scientist', 'FilmActor', 'FilmWriter', 'TVActor'], 'name': 'Stephen Hawking', 'dbpedia_resource': 'http://dbpedia.org/resource/Stephen_Hawking'}, 'relevance': 0.846941, 'type': 'Person'}, {'emotion': {'anger': 0.084431, 'joy': 0.271569, 'sadness': 0.195576, 'f
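
Note that `str()` on a dictionary produces Python's literal form with single quotes, which `json.loads` cannot read back. If the stored file needs to be valid JSON, `json.dumps` is one alternative. The sketch below contrasts the two on a small made-up dictionary; it illustrates the difference and is not a change to the cell above.

```python
import json

# Made-up stand-in for the classification response.
sample = {'keywords': [{'text': 'cosmology', 'relevance': 0.5}], 'entities': []}

as_repr = str(sample)                          # Python literal syntax, single quotes
as_json = json.dumps(sample, sort_keys=True)   # valid JSON text

# The JSON form survives a round trip; the repr form does not parse as JSON.
assert json.loads(as_json) == sample
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False

print(as_json)
print(repr_is_json)
```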

In [15]:
# Store the classification response in Object Storage
put_object(container, "sample_text_classification.txt", response, "text")

# Retrieve classification response from Object Storage
get_object(container, "sample_text_classification.txt")

"{'keywords': [{'relevance': 0.96866, 'text': 'Stephen Hawking', 'sentiment': {'score': 0.0}, 'emotion': {'anger': 0.075166, 'joy': 0.045775, 'sadness': 0.175093, 'fear': 0.117016, 'disgust': 0.047847}}, {'relevance': 0.884134, 'text': 'amyotrophic lateral sclerosis', 'sentiment': {'score': -0.253741}, 'emotion': {'anger': 0.102951, 'joy': 0.008111, 'sadness': 0.377757, 'fear': 0.137669, 'disgust': 0.044538}}], 'entities': [{'emotion': {'anger': 0.024758, 'joy': 0.222444, 'sadness': 0.434388, 'fear': 0.153419, 'disgust': 0.050352}, 'count': 5, 'sentiment': {'score': 0.201512}, 'text': 'Stephen Hawking', 'disambiguation': {'subtype': ['Academic', 'Astronomer', 'AwardNominee', 'AwardWinner', 'BoardMember', 'Scientist', 'FilmActor', 'FilmWriter', 'TVActor'], 'name': 'Stephen Hawking', 'dbpedia_resource': 'http://dbpedia.org/resource/Stephen_Hawking'}, 'relevance': 0.846941, 'type': 'Person'}, {'emotion': {'anger': 0.084431, 'joy': 0.271569, 'sadness': 0.195576, 'fear': 0.217244, 'disgust'