# Classification and Attribution of data

## 1. Setup
To prepare your environment, you need to install some packages and enter credentials for the Watson services.

### 1.1 Install the necessary packages

You need these packages:<br>
Watson Developer Cloud: a client library for Watson services (pinned to version 1.5 below).<br>
NLTK: a leading platform for building Python programs that work with human language data.<br>
ibm-cos-sdk: a Python client for IBM Cloud Object Storage.<br><br>
**Install the Watson Developer Cloud package:**

In [1]:
!pip install watson-developer-cloud==1.5

Requirement already up-to-date: watson-developer-cloud in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: pyOpenSSL>=16.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: autobahn>=0.10.9 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: Twisted>=13.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: python-dateutil>=2.5.3 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: requests<3.0,>=2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: service-identity>=17.0.0 in /opt/con

**Install NLTK:**

In [2]:
!pip install --upgrade nltk

Requirement already up-to-date: nltk in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from nltk)


**Install the IBM Cloud Object Storage client:**

In [3]:
!pip install ibm-cos-sdk

Requirement not upgraded as not directly required: ibm-cos-sdk in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: ibm-cos-sdk-core==2.*,>=2.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement not upgraded as not directly required: ibm-cos-sdk-s3transfer==2.*,>=2.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement not upgraded as not directly required: jmespath<1.0.0,>=0.7.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement not upgraded as not directly required: python-dateutil<3.0.0,>=2.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement not upgraded as not directly required: docutils>=0.10 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Re

**<font color=blue>Now restart the kernel by choosing Kernel > Restart.</font>**

### 1.2 Import packages and libraries

Import the packages and libraries that you'll use:

In [4]:
import json
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions
    
import ibm_boto3
from botocore.client import Config

import re
import nltk
from nltk import word_tokenize,sent_tokenize,ne_chunk

## 2. Configuration

Set the configurable items for the notebook in the cells below.

### 2.1 Add your service credentials from IBM Cloud for the Watson services

You must create a Watson Natural Language Understanding (NLU) service on IBM Cloud.
Insert the username and password values for your NLU service in the following cell. Do not change the value of the version field.

Run the cell.

In [5]:
# @hidden_cell
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2017-02-27',
    username="",
    password="")

### 2.2 Add your service credentials for Object Storage

You must create an Object Storage service on IBM Cloud.
To access data in a file in Object Storage, you need the Object Storage authentication credentials.
Insert the Object Storage authentication credentials as <i><b>credentials_1</b></i> in the following cell after
removing the current contents of the cell.


In [6]:
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IBM_API_KEY_ID': '',
    'IAM_SERVICE_ID': '',
    'ENDPOINT': '',
    'IBM_AUTH_ENDPOINT': '',
    'BUCKET': '',
    'FILE': ''
}
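
Because the cell above holds secrets, one option is to read them from environment variables instead of pasting them into the notebook. The sketch below is an assumption, not part of the original notebook: the helper `load_cos_credentials` and the `COS_` variable-name prefix are hypothetical, and only the dictionary keys come from the cell above.

```python
import os

# Keys expected in credentials_1, taken from the cell above.
CREDENTIAL_KEYS = [
    'IBM_API_KEY_ID', 'IAM_SERVICE_ID', 'ENDPOINT',
    'IBM_AUTH_ENDPOINT', 'BUCKET', 'FILE',
]

def load_cos_credentials(env=None, prefix='COS_'):
    """Build a credentials dict from environment variables (hypothetical helper).

    Each key is read from <prefix><KEY>, for example COS_BUCKET. If any
    variable is missing or empty, a KeyError lists all missing names at once.
    """
    env = os.environ if env is None else env
    missing = [prefix + key for key in CREDENTIAL_KEYS if not env.get(prefix + key)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[prefix + key] for key in CREDENTIAL_KEYS}
```

With the variables exported before the kernel starts, the cell would reduce to `credentials_1 = load_cos_credentials()`.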

### 2.3 Global Variables

Add global variables.


In [7]:
# Specify file names for sample text and configuration files
sampleTextFileName = "sample_text.txt"
sampleConfigFileName = "sample_config.txt"

### 2.4 Configure and download required NLTK packages

Download the 'punkt' and 'averaged_perceptron_tagger' NLTK packages, which are used for tokenization and POS tagging.

In [8]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 3. Classification

Define the classification utility functions as small, reusable pieces.

### 3.1 Watson NLU Classification

In [9]:
def analyze_using_NLU(analysistext):
    """ Call Watson Natural Language Understanding service to obtain analysis results.
    """
    response = natural_language_understanding.analyze( 
        text=analysistext,
        features=Features(entities=EntitiesOptions(), 
                          keywords=KeywordsOptions()))
    return response
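
The later cells treat the NLU response as a dictionary with `entities` and `keywords` lists. As a minimal sketch under that assumption, the hypothetical helper `summarize_response` below lists entities by relevance. The mock response stands in for a real service call, and its values are copied from the sample output later in this notebook.

```python
def summarize_response(response, min_relevance=0.0):
    """Return (type, text, relevance) tuples for entities, most relevant first.

    Hypothetical helper: assumes `response` is a dict with an 'entities' list
    whose items carry 'type', 'text' and 'relevance'.
    """
    rows = [
        (entity.get('type'), entity.get('text'), entity.get('relevance', 0.0))
        for entity in response.get('entities', [])
        if entity.get('relevance', 0.0) >= min_relevance
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Illustrative stand-in for a real NLU response.
mock_response = {
    'entities': [
        {'type': 'Organization', 'text': 'University of Cambridge', 'relevance': 0.202166},
        {'type': 'Person', 'text': 'Stephen Hawking', 'relevance': 0.846941},
    ],
    'keywords': [],
}

for row in summarize_response(mock_response, min_relevance=0.1):
    print(row)
```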

### 3.2 Augmented Classification

Custom classification utility functions for augmenting the results of the Watson NLU API call.

In [10]:
def split_sentences(text):
    """ Split text into sentences.
    """
    sentence_delimiters = re.compile(u'[\\[\\]\n.!?]')
    sentences = sentence_delimiters.split(text)
    return sentences

def split_into_tokens(text):
    """ Split text into tokens.
    """
    tokens = nltk.word_tokenize(text)
    return tokens
    
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    POSofText = nltk.tag.pos_tag(text)
    return POSofText

def keyword_tagging(tag,tagtext,text):
    """ Tag the text matching keywords.
    """
    # Case-insensitive search; return the matched span with its original casing
    start = text.lower().find(tagtext.lower())
    if start != -1:
        return text[start:start+len(tagtext)]
    else:
        return 'UNKNOWN'
    
def regex_tagging(tag,regex,text):
    """ Tag the text matching REGEX.
    """    
    p = re.compile(regex, re.IGNORECASE)
    matchtext = p.findall(text)
    regex_list=[]    
    if (len(matchtext)>0):
        for regword in matchtext:
            regex_list.append(regword)
    return regex_list

def chunk_tagging(tag,chunk,text):
    """ Tag the text using chunking.
    """
    parsed_cp = nltk.RegexpParser(chunk)
    pos_cp = parsed_cp.parse(text)
    chunk_list=[]
    for root in pos_cp:
        if isinstance(root, nltk.tree.Tree):               
            if root.label() == tag:
                chunk_word = ''
                for child_root in root:
                    chunk_word = chunk_word +' '+ child_root[0]
                chunk_list.append(chunk_word)
    return chunk_list
    
def augument_NLUResponse(responsejson,updateType,text,tag):
    """ Update the NLU response JSON with augmented classifications.
    """
    if(updateType == 'keyword'):
        if not any(d.get('text', None) == text for d in responsejson['keywords']):
            responsejson['keywords'].append({"text":text,"relevance":0.5})
    else:
        if not any(d.get('text', None) == text for d in responsejson['entities']):
            responsejson['entities'].append({"type":tag,"text":text,"relevance":0.5,"count":1})        
    

def classify_text(text, config):
    """ Perform augmented classification of the text.
    """
    
    response = analyze_using_NLU(text)
    responsejson = response
    
    sentenceList = split_sentences(text)
    
    tokens = split_into_tokens(text)
    
    postags = POS_tagging(tokens)
    
    configjson = json.loads(config)
    for stages in configjson['configuration']['classification']['stages']:
        print('Stage - Performing ' + stages['name']+':')
        for steps in stages['steps']:
            print('    Step - ' + steps['type']+':')
            if (steps['type'] == 'keywords'):
                for keyword in steps['keywords']:
                    for word in sentenceList:
                        wordtag = keyword_tagging(keyword['tag'],keyword['text'],word)
                        if(wordtag != 'UNKNOWN'):
                            print('      '+keyword['tag']+':'+wordtag)
                            augument_NLUResponse(responsejson,'entities',wordtag,keyword['tag'])
            elif(steps['type'] == 'd_regex'):
                for regex in steps['d_regex']:
                    for word in sentenceList:
                        regextags = regex_tagging(regex['tag'],regex['pattern'],word)
                        if (len(regextags)>0):
                            for words in regextags:
                                print('      '+regex['tag']+':'+words)
                                augument_NLUResponse(responsejson,'entities',words,regex['tag'])
            elif(steps['type'] == 'chunking'):
                for chunk in steps['chunk']:
                    chunktags = chunk_tagging(chunk['tag'],chunk['pattern'],postags)
                    if (len(chunktags)>0):
                        for words in chunktags:
                            print('      '+chunk['tag']+':'+words)
                            augument_NLUResponse(responsejson,'entities',words,chunk['tag'])
            else:
                print('UNKNOWN STEP')
    
    return responsejson

def replace_unicode_strings(response):
    """ Recursively decode UTF-8 byte strings in a dict or list to str (Python 3).
    """
    if isinstance(response, dict):
        # dict.iteritems() exists only in Python 2; items() works in Python 3
        return {replace_unicode_strings(key): replace_unicode_strings(value) for key, value in response.items()}
    elif isinstance(response, list):
        return [replace_unicode_strings(element) for element in response]
    elif isinstance(response, bytes):
        # Python 3 str is already Unicode, so only bytes need converting
        return response.decode('utf-8')
    else:
        return response
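
To see how a `d_regex` step is applied sentence by sentence, the sketch below runs a reduced configuration over a short text using only the standard library. It reuses the delimiter set from `split_sentences` and the Date and Email patterns from the sample configuration. The helper `run_regex_steps`, its `split` switch, and the sample text are assumptions made for illustration.

```python
import json
import re

# Same delimiter set as split_sentences above: [ ] newline . ! ?
SENTENCE_DELIMITERS = re.compile(r'[\[\]\n.!?]')

# A reduced configuration in the same shape as the sample configuration file.
config = json.loads(r'''
{"configuration": {"classification": {"stages": [
  {"name": "Base Tagging", "steps": [
    {"type": "d_regex", "d_regex": [
      {"tag": "Date", "pattern": "(\\d+/\\d+/\\d+)"},
      {"tag": "Email", "pattern": "\\b[\\w.-]+?@\\w+?\\.\\w+?\\b"}
    ]}
  ]}
]}}}
''')

def run_regex_steps(text, config, split=True):
    """Collect (tag, match) pairs for every d_regex step in the configuration."""
    pieces = SENTENCE_DELIMITERS.split(text) if split else [text]
    found = []
    for stage in config['configuration']['classification']['stages']:
        for step in stage['steps']:
            if step['type'] != 'd_regex':
                continue
            for rule in step['d_regex']:
                pattern = re.compile(rule['pattern'], re.IGNORECASE)
                for piece in pieces:
                    found.extend((rule['tag'], match) for match in pattern.findall(piece))
    return found

sample = "Born on 01/08/1942. Write to someone@example.com today."  # invented text
print(run_regex_steps(sample, config))               # sentence by sentence
print(run_regex_steps(sample, config, split=False))  # whole text at once
```

With splitting enabled, only the date is found: the period inside the address is treated as a sentence delimiter, so no single piece contains a complete email address. Patterns that can contain `.`, `!`, `?`, or brackets may therefore need to run against the unsplit text.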


## 4. Persistence and Storage

### 4.1 Configure Object Storage Client

In [11]:
cos = ibm_boto3.client('s3',
                    ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],
                    ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],
                    ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],
                    config=Config(signature_version='oauth'),
                    endpoint_url=credentials_1['ENDPOINT'])

def get_file(filename):
    '''Retrieve file from Cloud Object Storage'''
    fileobject = cos.get_object(Bucket=credentials_1['BUCKET'], Key=filename)['Body']
    return fileobject

def load_string(fileobject):
    '''Load the file contents into a Python string'''
    text = fileobject.read()
    return text.decode('utf-8')

def put_file(filename, filecontents):
    '''Write file to Cloud Object Storage'''
    resp = cos.put_object(Bucket=credentials_1['BUCKET'], Key=filename, Body=filecontents)
    return resp
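
`load_string` only needs an object with a `read()` method that returns bytes, so the load and store logic can be exercised without credentials. The sketch below assumes exactly that: `FakeBucket` is a hypothetical in-memory stand-in, not part of `ibm_boto3`, and `load_string` is repeated so the block runs on its own.

```python
import io
import json

def load_string(fileobject):
    '''Load the file contents into a Python string (same logic as above).'''
    return fileobject.read().decode('utf-8')

class FakeBucket:
    """Hypothetical in-memory stand-in for the bucket, for offline testing only."""

    def __init__(self):
        self._objects = {}

    def put_file(self, filename, filecontents):
        # Store bytes, as an object store would.
        if isinstance(filecontents, str):
            filecontents = filecontents.encode('utf-8')
        self._objects[filename] = filecontents

    def get_file(self, filename):
        # Return a file-like object exposing read(), like the 'Body' stream.
        return io.BytesIO(self._objects[filename])

bucket = FakeBucket()
bucket.put_file('result.json', json.dumps({'entities': [], 'keywords': []}))
restored = json.loads(load_string(bucket.get_file('result.json')))
print(restored)
```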

## 5. Classify text
Read the data file for classification from Object Storage.<br>
Read the configuration file for augmented classification from Object Storage.<br>
Persist the classification results as a JSON file in Object Storage.

In [12]:
# Load the text from Object Storage
text = load_string(get_file(sampleTextFileName))

# Load the json configuration from Object Storage
config = load_string(get_file(sampleConfigFileName))

# Print the json configuration
print("## Using the configuration ##")
print(config)

## Using the configuration ##
{
  "configuration": {
    "classification": {
      "stages": [
        {
          "name": "Base Tagging",
          "steps": [
            {
              "type": "keywords",
              "keywords": [
                {
                  "tag": "Passion",
                  "text": "Science"
                },
                {
                  "tag": "Subjects",
                  "text": "cosmology"
                }
              ]
            },
            {
              "type": "d_regex",
              "d_regex": [
                {
                  "tag": "Date",
                  "pattern": "(\\d+/\\d+/\\d+)"
                }
              ]
            },
            {
              "type": "d_regex",
              "d_regex": [
                {
                  "tag": "Email",
                  "pattern": "\\b[\\w.-]+?@\\w+?\\.\\w+?\\b"
                }
              ]
            },
            {
              "type": "d_regex",
        

In [13]:
# Classify the text
response = classify_text(text, config)

Stage - Performing Base Tagging:
    Step - keywords:
      Passion:science
      Passion:science
      Subjects:cosmology
      Subjects:cosmology
    Step - d_regex:
      Date:01/08/1942
    Step - d_regex:
    Step - d_regex:
      PhoneNumber:1112223333
    Step - chunking:
      NP: an early age
      NP: a passion
      NP: science
      NP: the sky
      NP: age
      NP: cosmology
      NP: amyotrophic lateral sclerosis
      NP: illness
      NP: work
      NP: cosmology
      NP: science
      NP: everyone
      NP: phone
      NP: email
      NP: yahoo.com
      NAME: Stephen Hawking
      NAME: Oxford
      NAME: England
      NAME: Hawking
      NAME: University
      NAME: Cambridge
      NAME: Stephen Hawking
      NAME: @
Stage - Performing Domain Tagging:
    Step - d_regex:
      Year:1942
      Year:1112
      Year:2233
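
The log prints a line for every match, which is why `Passion:science` appears twice, while `augument_NLUResponse` appends an entity only when no existing entry has the same text. The sketch below mirrors that duplicate check in isolation. The function name `add_entity_once` and its boolean return value are illustrative and not part of the notebook.

```python
def add_entity_once(response, tag, text, relevance=0.5):
    """Append an entity unless one with the same text already exists.

    Mirrors the duplicate check in augument_NLUResponse; returns True when
    an entry was added.
    """
    entities = response.setdefault('entities', [])
    if any(entity.get('text') == text for entity in entities):
        return False
    entities.append({'type': tag, 'text': text, 'relevance': relevance, 'count': 1})
    return True

response = {'entities': [], 'keywords': []}
matches = [('Passion', 'science'), ('Passion', 'science'), ('Subjects', 'cosmology')]
added = [add_entity_once(response, tag, text) for tag, text in matches]
print(added)
print([entity['text'] for entity in response['entities']])
```

The check compares text only, so the first tag recorded for a given text wins and a later match with a different tag but identical text is skipped.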


In [14]:
print("~~ Text Classification ~~")

# Store the classification response in Object Storage
put_file("sample_text_classification.txt", json.dumps(response))

# Retrieve classification response from Object Storage
json.loads(load_string(get_file("sample_text_classification.txt")))

~~ Text Classification ~~


{'entities': [{'count': 5,
   'disambiguation': {'dbpedia_resource': 'http://dbpedia.org/resource/Stephen_Hawking',
    'name': 'Stephen Hawking',
    'subtype': ['Academic',
     'Astronomer',
     'AwardNominee',
     'AwardWinner',
     'BoardMember',
     'Scientist',
     'FilmActor',
     'FilmWriter',
     'TVActor']},
   'relevance': 0.846941,
   'text': 'Stephen Hawking',
   'type': 'Person'},
  {'count': 1,
   'disambiguation': {'dbpedia_resource': 'http://dbpedia.org/resource/University_of_Cambridge',
    'name': 'University of Cambridge',
    'subtype': ['Location',
     'CollegeUniversity',
     'ProcessorManufacturer',
     'University']},
   'relevance': 0.202166,
   'text': 'University of Cambridge',
   'type': 'Organization'},
  {'count': 1,
   'disambiguation': {'dbpedia_resource': 'http://dbpedia.org/resource/Oxford',
    'name': 'Oxford',
    'subtype': ['AdministrativeDivision', 'PlaceWithNeighborhoods', 'City']},
   'relevance': 0.200592,
   'text': 'Oxford',
   '