# Classification and Attribution of Data

## 1. Setup
To prepare your environment, you need to install some packages and enter credentials for the Watson services.

### 1.1 Install the necessary packages

You need the latest versions of these packages:<br>
Watson Developer Cloud: a client library for Watson services.<br>
NLTK: a leading platform for building Python programs that work with human language data.<br>
python-keystoneclient: a client for the OpenStack Identity API.<br>
python-swiftclient: a Python client for the Swift API.<br><br>
** Install the Watson Developer Cloud package: **

In [None]:
!pip install --upgrade watson-developer-cloud

** Install NLTK: **

In [None]:
!pip install --upgrade nltk

** Install IBM Bluemix Object Storage Client: **

In [None]:
!pip install python-keystoneclient

In [None]:
!pip install python-swiftclient

** <font color=blue>Now restart the kernel by choosing Kernel > Restart. </font> **

### 1.2 Import packages and libraries

Import the packages and libraries that you'll use:

In [None]:
import json
import sys
import time
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
import watson_developer_cloud.natural_language_understanding.features.v1 \
  as Features
    
import swiftclient
from keystoneclient import client
    
import operator
from functools import reduce
from io import StringIO
import numpy as np
from os.path import join, dirname
import requests
import re
import pandas as pd
import nltk
from nltk import word_tokenize,sent_tokenize,ne_chunk
from nltk.corpus import stopwords

## 2. Configuration

Set the configurable items of the notebook in the cells below.

### 2.1 Add your service credentials from Bluemix for the Watson services

You must create a Watson Natural Language Understanding (NLU) service on Bluemix.
Insert the username and password values for your NLU service in the following cell. Do not change the value of the version field.

Run the cell.

In [None]:
# @hidden_cell
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2017-02-27',
    username='',
    password='')

### 2.2 Add your service credentials for Object Storage

You must create an Object Storage service on Bluemix.
To access data in a file in Object Storage, you need the Object Storage authentication credentials.
Insert the Object Storage authentication credentials in the following cell.


In [None]:
# @hidden_cell
credentials_1 = {
  'auth_url':'',
  'project':'',
  'project_id':'',
  'region':'',
  'user_id':'',
  'domain_id':'',
  'domain_name':'',
  'username':'',
  'password':"""""",
  'container':'',
  'tenantId':'',
  'filename':''
}

### 2.3 Global Variables

Define the global variables used in the rest of the notebook.


In [None]:
# Specify file names for sample text and configuration files
sampleTextFileName = "sample_text.txt"
sampleConfigFileName = "sample_config.txt"
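
The notebook does not show the contents of `sample_config.txt`. Judging only from the keys that `classify_text` reads later on, the file might look roughly like the structure below. This is a sketch under that assumption: the key names come from the notebook's code, while the stage name, tags, keyword, and patterns are made-up illustrations.

```python
import json

# Hypothetical contents for sample_config.txt. Key names mirror the lookups in
# classify_text; every name, tag, keyword, and pattern value is illustrative.
sample_config = {
    "configuration": {
        "classification": {
            "stages": [
                {
                    "name": "Base Tagging",
                    "steps": [
                        {
                            "type": "keywords",
                            "keywords": [{"tag": "VEHICLE", "text": "truck"}],
                        },
                        {
                            "type": "d_regex",
                            "d_regex": [
                                {"tag": "DATE", "pattern": r"\d{4}-\d{2}-\d{2}"}
                            ],
                        },
                        {
                            "type": "chunking",
                            # The tag must equal the grammar label, because
                            # chunk_tagging compares the subtree label to it.
                            "chunk": [
                                {"tag": "NP", "pattern": "NP: {<DT>?<JJ>*<NN>}"}
                            ],
                        },
                    ],
                }
            ]
        }
    }
}

# Serialize to the text that would be uploaded, then parse it back.
config_text = json.dumps(sample_config, indent=2)
parsed = json.loads(config_text)
stages = parsed["configuration"]["classification"]["stages"]
step_types = [step["type"] for stage in stages for step in stage["steps"]]
print(step_types)
```

Note that each step keeps its entries under a key that depends on the step type: `keywords` for keyword steps, `d_regex` for regex steps, and `chunk` for chunking steps.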


### 2.4 Configure and download required NLTK packages

Download the 'punkt' and 'averaged_perceptron_tagger' NLTK packages, which are needed for tokenization and POS tagging.

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()


## 3. Classification

Write the classification-related utility functions in a modular form.

### 3.1 Watson NLU Classification

In [None]:
def analyze_using_NLU(analysistext):
    response = natural_language_understanding.analyze( 
        text=analysistext,features=[ Features.Entities(
                                        emotion=True,
                                        sentiment=True,
                                        limit=2
                                     ),
                                     Features.Keywords(
                                        emotion=True,
                                        sentiment=True,
                                        limit=2
                                     )
                                   ] )

    return response
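
To show how the result might be consumed, here is a minimal sketch that ranks entities and keywords by relevance. It assumes the response is a dictionary with `entities` and `keywords` lists carrying the fields that the notebook's augmentation code writes (`type`, `text`, `relevance`, `count`). The sample data is hand-written, a real response will likely carry more fields, and `summarize_response` is a hypothetical helper.

```python
# Hand-written stand-in for an NLU response; all values are made up.
sample_response = {
    "entities": [
        {"type": "Person", "text": "Jane Doe", "relevance": 0.95, "count": 2},
        {"type": "Location", "text": "Austin", "relevance": 0.41, "count": 1},
    ],
    "keywords": [
        {"text": "delivery truck", "relevance": 0.88},
    ],
}


def summarize_response(response):
    """Return (kind, label, text) tuples sorted by descending relevance."""
    rows = []
    for entity in response.get("entities", []):
        rows.append((entity.get("relevance", 0.0), "entity",
                     entity.get("type", ""), entity["text"]))
    for keyword in response.get("keywords", []):
        rows.append((keyword.get("relevance", 0.0), "keyword",
                     "", keyword["text"]))
    # Highest relevance first.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(kind, label, text) for _, kind, label, text in rows]


summary = summarize_response(sample_response)
for row in summary:
    print(row)
```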

### 3.2 Augmented Classification

Custom classification utility functions for augmenting the results of the Watson NLU API call.
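
As a rough illustration of the keyword and regex matching that the next cell performs, the sketch below uses only the standard library. `find_keyword` and `find_pattern` are simplified, hypothetical stand-ins rather than the notebook's own functions, and the sentences and pattern are invented for the example.

```python
import re


def find_keyword(keyword, sentence):
    """Return the match as it is cased in the sentence, or None if absent."""
    start = sentence.lower().find(keyword.lower())
    if start == -1:
        return None
    return sentence[start:start + len(keyword)]


def find_pattern(pattern, sentence):
    """Return every case-insensitive regex match in the sentence."""
    return re.findall(pattern, sentence, flags=re.IGNORECASE)


sentences = ["The Truck left on 2017-03-01", "No match here"]

# Keyword search ignores case but keeps the original casing of the hit.
keyword_hits = [find_keyword("truck", s) for s in sentences]

# The regex search returns a (possibly empty) list per sentence.
date_hits = [find_pattern(r"\d{4}-\d{2}-\d{2}", s) for s in sentences]

print(keyword_hits)
print(date_hits)
```

One caveat with `re.findall`: if the pattern contains capturing groups, it returns the groups (as strings or tuples) instead of the whole match, so patterns intended for this kind of tagging are easiest to handle when they use non-capturing groups.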

In [None]:
### Splitting the text into sentences 
def split_sentences(text):
    """
    Utility function to return a list of sentences.
    @param text The text that must be split into sentences.
    """
    sentence_delimiters = re.compile(u'[\\[\\]\n.!?]')
    sentences = sentence_delimiters.split(text)
    return sentences

### Splitting the text into tokens 
def split_into_tokens(text):

    tokens = nltk.word_tokenize(text)
    return tokens
    
### Part of speech tagging of the text 
def POS_tagging(text):

    POSofText = nltk.tag.pos_tag(text)
    return POSofText

### Tagging of the text matching keywords
def keyword_tagging(tag,tagtext,text):

    # Case-insensitive search; return the match as it is cased in the text
    start = text.lower().find(tagtext.lower())
    if start != -1:
        return text[start:start+len(tagtext)]
    else:
        return 'UNKNOWN'
    

### Tagging of the text matching REGEX
def regex_tagging(tag,regex,text):
    
    p = re.compile(regex, re.IGNORECASE)
    matchtext = p.findall(text)
    regex_list=[]    
    if (len(matchtext)>0):
        for regword in matchtext:
            regex_list.append(regword)
    return regex_list

### Tagging of the text using chunking
def chunk_tagging(tag,chunk,text):
    parsed_cp = nltk.RegexpParser(chunk)
    pos_cp = parsed_cp.parse(text)
    chunk_list=[]
    for root in pos_cp:
        if isinstance(root, nltk.tree.Tree):               
            if root.label() == tag:
                # Join the words of the chunk with single spaces (no leading space)
                chunk_word = ' '.join(child_root[0] for child_root in root)
                chunk_list.append(chunk_word)
    return chunk_list

### Update the NLU response JSON with augmented classifications
def augument_NLUResponse(responsejson,updateType,text,tag):

    if(updateType == 'keyword'):
        if not any(d.get('text', None) == text for d in responsejson['keywords']):
            responsejson['keywords'].append({"text":text,"relevance":0.5})
    else:
        if not any(d.get('text', None) == text for d in responsejson['entities']):
            responsejson['entities'].append({"type":tag,"text":text,"relevance":0.5,"count":1})        
    

### Perform augmented classification of the text
def classify_text(text):

    ### Classification of the text using Watson NLU
    response = analyze_using_NLU(text)
    responsejson = response
    
    ### Start performing augmented classification steps
    
    ### Split sentences    
    sentenceList = split_sentences(text)
    for sentence in sentenceList:
        print("Sentence:", sentence)
    
    ### Split into tokens    
    tokens = split_into_tokens(text)
    
    ### Perform POS tagging of tokens    
    postags = POS_tagging(tokens)
    
    ### Lookup the configuration file to perform tagging steps    
    configjson = json.loads(config)
    for stages in configjson['configuration']['classification']['stages']:
        print('## Performing ' + stages['name']+' ##')
        for steps in stages['steps']:
            print('-- Performing Step ' + steps['type']+' --')
            if (steps['type'] == 'keywords'):
                for keyword in steps['keywords']:
                    for word in sentenceList:
                        wordtag = keyword_tagging(keyword['tag'],keyword['text'],word)
                        if(wordtag != 'UNKNOWN'):
                            print('** '+keyword['tag']+':'+wordtag)
                            augument_NLUResponse(responsejson,'entities',wordtag,keyword['tag'])
            elif(steps['type'] == 'd_regex'):
                for regex in steps['d_regex']:
                    for word in sentenceList:
                        regextags = regex_tagging(regex['tag'],regex['pattern'],word)
                        if (len(regextags)>0):
                            for words in regextags:
                                print('** '+regex['tag']+':'+words)
                                augument_NLUResponse(responsejson,'entities',words,regex['tag'])
            elif(steps['type'] == 'chunking'):
                for chunk in steps['chunk']:
                    chunktags = chunk_tagging(chunk['tag'],chunk['pattern'],postags)
                    if (len(chunktags)>0):
                        for words in chunktags:
                            print('** '+chunk['tag']+':'+words)
                            augument_NLUResponse(responsejson,'entities',words,chunk['tag'])
            else:
                print('UNKNOWN STEP')
    
    return responsejson
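
The chunking step is the least obvious of the three, so here is a small sketch of what a chunk grammar does. It assumes a noun-phrase grammar (`NP: {<DT>?<JJ>*<NN>}`) chosen purely for illustration, and the tokens are tagged by hand so that no NLTK data download is needed.

```python
import nltk

# Hand-tagged (word, POS) pairs; in the notebook these come from POS_tagging.
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

# Illustrative grammar: optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
tree = nltk.RegexpParser(grammar).parse(tagged)

# Top-level children are either plain (word, tag) tuples or labelled subtrees;
# keep only the subtrees whose label matches the tag of interest.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree
    if isinstance(subtree, nltk.tree.Tree) and subtree.label() == "NP"
]
print(noun_phrases)
```

Because only subtrees whose label equals the requested tag are kept, a configuration entry's tag has to match the label used inside its grammar pattern for any chunk to be returned.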


## 4. Persistence and Storage

### 4.1 Configure Object Storage Client

In [None]:
auth_url = credentials_1['auth_url']+"/v3"
container = credentials_1["container"]

IBM_Objectstorage_Connection = swiftclient.Connection(
    key=credentials_1['password'], authurl=auth_url, auth_version='3', os_options={
        "project_id": credentials_1['project_id'], "user_id": credentials_1['user_id'], "region_name": credentials_1['region']})

def create_container(container_name):
    x = IBM_Objectstorage_Connection.put_container(container_name)
    return x

def put_object(container_name, fname, contents, content_type):
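    # Pass content_type by keyword: the fourth positional parameter of
    # swiftclient's Connection.put_object is content_length, not content_type.
    x = IBM_Objectstorage_Connection.put_object(
        container_name,
        fname,
        contents,
        content_type=content_type)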
    return x

def get_object(container_name, fname):
    Object_Store_file_details = IBM_Objectstorage_Connection.get_object(
        container_name, fname)
    return Object_Store_file_details[1]
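
For trying out the rest of the notebook without a Bluemix account, an in-memory stand-in for the storage connection can be handy. The class below is a hypothetical test double, not part of swiftclient. It only mimics the three calls used above and, like the real `get_object`, returns a `(headers, body)` pair with the body as bytes.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for the Object Storage connection (testing only)."""

    def __init__(self):
        # Maps container name -> {object name: (content_type, bytes)}.
        self._containers = {}

    def put_container(self, container_name):
        self._containers.setdefault(container_name, {})

    def put_object(self, container_name, fname, contents, content_type=None):
        # Store text as UTF-8 bytes so reads behave like a remote store.
        if isinstance(contents, str):
            contents = contents.encode("utf-8")
        objects = self._containers.setdefault(container_name, {})
        objects[fname] = (content_type, contents)

    def get_object(self, container_name, fname):
        content_type, body = self._containers[container_name][fname]
        return {"content-type": content_type}, body


store = InMemoryObjectStore()
store.put_container("demo")
store.put_object("demo", "sample_text.txt", "Hello world.", content_type="text/plain")

headers, body = store.get_object("demo", "sample_text.txt")
print(headers["content-type"], body.decode("utf-8"))
```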

## 5. Classify text
Read the data file for classification from Object Storage.<br>
Read the configuration file for augmented classification from Object Storage.<br>
Persist the classification results as a JSON file in Object Storage.

In [None]:
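# Object Storage returns bytes, so decode before any text processing
text = get_object(container, sampleTextFileName).decode('utf-8')
config = get_object(container, sampleConfigFileName).decode('utf-8')

# Serialize with json.dumps so that the stored file is valid JSON
response = json.dumps(classify_text(text))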

put_object(container, "sample_text_classification.txt", response, "text")
get_object(container, "sample_text_classification.txt")
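
When the results are meant to be stored as JSON, the choice of serializer matters: `str()` on a dictionary yields Python's repr, which uses single quotes and is generally not parseable as JSON, whereas `json.dumps` produces text that round-trips. A short sketch with made-up result data:

```python
import json

# Made-up classification result with the same shape as the augmented response.
result = {"keywords": [{"text": "truck", "relevance": 0.5}], "entities": []}

as_repr = str(result)         # Python repr, single-quoted keys and strings
as_json = json.dumps(result)  # valid JSON text

# Check whether the repr form can be parsed back as JSON.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except ValueError:
    repr_is_valid_json = False

restored = json.loads(as_json)
print(repr_is_valid_json, restored == result)
```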