# Gathering Data
The first step is to find a data set to analyze and store it where the rest of the flow can reach it. We recommend a database that stores the data as JSON, but any storage works as long as it is accessible throughout the flow: a JSON file, a SQL database, a NoSQL database, a CSV file, and so on. This starter kit stores its data in a Cloudant NoSQL database (available on Bluemix), which is the simplest way to get started and requires the fewest changes to the scripts. We provide a few scripts that help push data to and pull data from this database.

The minimum data needed is the review/feedback text and a way to link it to the product it refers to (most likely a key or ID that marks a review as belonging to a product).

Take the dataset and push it to your database/file in a format that is easiest to index and use further in the flow. For reference, here is an example of a review that we used: 

TODO: ADD JSON STRING HERE

Key things to note: storing the data this way makes it easy to retrieve the text of a review and find out which product it refers to, so that the rest of the flow can be executed easily.
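
As a purely illustrative sketch (not the record format used in this kit), a minimal review document could look like the following. The field names `product_id` and `review_text` and the sample values are hypothetical; any names work as long as the later scripts read the same ones.

```python
import json

# Hypothetical record: field names are illustrative, not the kit's schema.
review = {
    "product_id": "B00EXAMPLE",  # key linking the review to a product
    "review_text": "I love my phone because of its great screen.",
}

# Serialize for storage in a JSON file or a document database.
stored = json.dumps(review)

# Later steps only need these two pieces of information back.
loaded = json.loads(stored)
product_id = loaded["product_id"]
text = loaded["review_text"]
```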

# Import Data CSV into Watson Knowledge Studio
The data must be converted into a form that can be used to train the entity extraction models in Watson Knowledge Studio. To do this, run the db2file.py script below. Edit the database connection settings so that they point to the database where your data is stored, or rewrite that portion so that it reads from wherever the data is located (if it is not in a Cloudant database).

The important part is the output. The CSV that is ingested by Watson Knowledge Studio has the following format:

"key","value"

"TITLE OF DOCUMENT","TEXT OF DOCUMENT"

...

The CSV created by the script must now be imported into Watson Knowledge Studio (WKS). Each row in the CSV is treated as a separate document in WKS. The data doesn't have to be in CSV format to be uploaded into WKS; documents can also be uploaded manually, but we provide the conversion script to make it easier.
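
A minimal sketch of producing that row layout with Python's standard `csv` module is shown here, for cases where the data does not live in Cloudant. The document titles and texts are made up, and, like db2file.py, the sketch writes no header row; whether your WKS import expects one is an assumption to check.

```python
import csv
import io

# Hypothetical documents: (title, text) pairs standing in for database rows.
documents = [
    ("review-001", "I love my phone because of its great screen."),
    ("review-002", 'The case says "durable", but it cracked in a week.'),
]

# An in-memory buffer keeps the sketch self-contained; swap it for
# open("wks_input.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)  # quote every field
for title, text in documents:
    writer.writerow([title, text])

csv_text = buffer.getvalue()
```

Quoting every field lets commas and quotation marks inside review text survive the round trip.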

The documents then need to be annotated with entities and relationships. Coreference annotation is also done at this stage. A strict annotation guideline is very helpful when doing this.

For guidelines and tips on Watson Knowledge Studio, see /notebooks/WKS.md

After the annotations are done and the model is trained, it needs to be exported into Alchemy Language using an API key. To do this, first get an Alchemy API key from Bluemix and then see the WKS documentation above for how that is done.

In [None]:
#db2file.py

import cloudant
import csv

SERVER = ''       # Replace with your server URL
DATABASE = ''     # Replace with the name of the database
USERNAME = ''     # Replace with the username from your Cloudant credentials
PASSWORD = ''     # Replace with the password from your Cloudant credentials
DESIGN = ''       # Replace with the name of the design document that contains
                  # the view. This should be of the form '_design/XXXX'
VIEW = ''         # Replace with the view from your database to poll. This
                  # should take the form of view_file/view and should return
                  # the text to classify as the value field and what you
                  # would like to call it as the key
DESTINATION = ''  # Replace with the name of the output file (must be *.csv)

server = cloudant.client.Cloudant(USERNAME, PASSWORD, url=SERVER)
server.connect()
db = server[DATABASE]
query = db.get_view_result(DESIGN, VIEW)
file = open(DESTINATION, 'wb')
writer = csv.writer(file)


for q in query:
    print q[0]
    if 'key' in q[0] and q[0]['key'] is not None:
        title = q[0]['key']
    else:
        title = "No Title"
    if 'value' in q[0] and q[0]['value'] is not None:
        text = q[0]['value']
    else:
        text = "No Text"
    writer.writerow([title, text])

file.close()


# Update Alchemy API key and Model ID

Once a WKS model is trained and linked to the Alchemy API key, that information needs to be set in the utils/token_replacement.py script. Update the two variables at the top (apikey and modelId) and the script is ready to use in your application.

In [None]:
# token_replacement.py

import ast
import re
import nltk
from watson_developer_cloud import alchemy_language_v1 as alchemy

apikey = ''  # Replace with your Alchemy Language API key
modelId = ''  # Replace with the model-id from Watson Knowledge Studio
alchemyapi = alchemy.AlchemyLanguageV1(api_key=apikey)


def get_relations(review):
    split = {}
    if len(review) > 5024:
        mid = find_middle(review)
        while mid >= 5024:
            mid = find_middle(review[:mid])
        half = review[mid:]
        review = review[:mid]
        split = get_relations(half)
    f = alchemyapi.typed_relations(text=review, model=modelId)
    response = f.content
    response = ast.literal_eval(response)

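    # Report API errors, except when the text was rejected as non-English.
    # (This was a while loop that never re-sent the request, so it could
    # print forever.)
    if response.get('status') == 'ERROR':
        if response.get('language', 'english') == 'english':
            print(response)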

    if split != {}:
        if 'typedRelations' in response and 'typedRelations' in split:
            response['typedRelations'] = response['typedRelations'] + \
                split['typedRelations']
            response['text'] = response['text'] + split['text']
        elif 'typedRelations' in split and 'typedRelations' not in response:
            response['typedRelations'] = split['typedRelations']
    return response


def get_entities(review):
    split = {}
    if len(review) > 5024:
        mid = find_middle(review)
        while mid >= 5024:
            mid = find_middle(review[:mid])
        # Take the second half before truncating, otherwise it is always empty
        half = review[mid:]
        review = review[:mid]
        split = get_entities(half)
    f = alchemyapi.entities(text=review, model=modelId, sentiment=True)
    response = f.content
    response = ast.literal_eval(response)
    if split != {}:
        if 'entities' in split and 'entities' in response:
            response['entities'] = response['entities'] + split['entities']
            response['text'] = response['text'] + split['text']
        elif 'entities' in split and 'entities' not in response:
            response['entities'] = split['entities']
    return response


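def token_replacement_entities(review):
    processed = get_entities(review)
    # Fall back to the original text if the API reported a problem
    if 'statusInfo' in processed:
        return review
    # Default to the input so that text is defined even without entities
    text = processed.get('text', review)
    for i in processed.get('entities', []):
        classification = "<" + i['type'] + ">"
        token = re.escape(i['text'])
        # re.escape may also escape spaces; undo that (the result was
        # previously discarded)
        token = re.sub(r'\\ ', ' ', token)
        # Replace only the first whole-word occurrence of the entity
        text = re.sub(r"\b%s\b" % token, classification, text, count=1)
    return text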


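def find_middle(text):
    # Character spans of the pieces between periods, i.e. rough sentences
    generator = nltk.tokenize.util.regexp_span_tokenize(text, r'\.')
    sequences = list(generator)
    # Integer division so the result can be used as a list index
    mid_sentence = len(sequences) // 2
    # End offset of the middle sentence, plus one to step past the period
    middle_char = int(sequences[mid_sentence][1]) + 1
    return middle_char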


# Design Natural Language Classifier
After the Watson Knowledge Studio language model is finished, a Natural Language Classifier (NLC) needs to be created. Go to Bluemix and create an instance of the NLC service. Then design the classes and layers. The final classes are the endpoints of the system: each one groups sentences together for a final processing step, so they should be groups of sentences that you are interested in. For example, with Amazon reviews it is interesting to know what features a product has, so sentences related to features are grouped into a class. Sentences related to customer service or pricing would be other classes, since those are also points of interest in the dataset.

If the data classes are hierarchical (one class is a subclass of another), a layered approach may be useful. We used 2 layers of classifiers for our application, where the output of one layer is fed into the next. We found that a layered approach gives better results than using multiple classes in a single layer, because it eliminates miscellaneous sentences at each layer and passes more specific data to the next one. Our sample training sets are in /src/Training/"new training set"/, and a description of our layered architecture and classes can be found in /notebooks/NLC_description.docx. You can define your own architecture, and you can experiment with a single layer if the classes are distinct.

Note: We found it helpful to always include an "Other" category in the NLC to reject sentences that aren't useful to your application. Basically this acts as a junk bin.

The src/utils/classify.py script needs to be adapted to your final classification structure. It takes an input sentence, runs it through the whole classifier tree, and returns the final class. It also has a getClasses function that should return all the endpoint classes in your system for use in validation.
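
The routing idea behind a layered classifier can be sketched without any service calls. In this sketch the two `tier` functions are hypothetical keyword rules standing in for trained NLC instances, and the `NEXT_LAYER` table is an assumed way to describe which class is handed to which next-layer classifier; neither is part of the kit.

```python
# Hypothetical stand-ins for trained classifiers: sentence in, class out.
def tier1(sentence):
    if "screen" in sentence or "battery" in sentence:
        return "Product"
    return "Other"


def tier2(sentence):
    return "Feature" if "screen" in sentence else "Issue"


# Routing table: tier-1 classes that are passed on to another classifier.
NEXT_LAYER = {"Product": tier2}


def classify_layered(sentence):
    """Run a sentence down the tree and return the final (endpoint) class."""
    label = tier1(sentence)
    # Keep descending while the current class has a classifier below it.
    while label in NEXT_LAYER:
        label = NEXT_LAYER[label](sentence)
    return label
```

Sentences that land in "Other" at the first layer never reach the second one, which is how each layer filters out miscellaneous sentences.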

In [None]:
# classify.py

import json
from watson_developer_cloud import NaturalLanguageClassifierV1

CLF_USERNAME = ''  # Replace with the username from your credentials for NLC
CLF_PASSWORD = ''  # Replace with the password from your credentials for NLC
CLASSIFIER_JSON = '../../data/classifier_ids.json'  # Location of classifiers that is autogenerated when you train the NLC

# Retrieve Classifier ID's
with open(CLASSIFIER_JSON) as classifier_ids:
    classifierTree = json.load(classifier_ids)

nlc = NaturalLanguageClassifierV1(username=CLF_USERNAME, password=CLF_PASSWORD)

# TODO Make general, do not bind to our NLC classes

def classify(review):
    # classifierTree holds all of the layers and they are referenced by tierX where X is the layer number
    resp = nlc.classify(classifierTree['tier1'], review)
    classification = resp["top_class"]
    # If an output of tier1 is to be fed into tier2, this is how it is done
    if classification == "Product":
        resp = nlc.classify(classifierTree['tier2'], review)
        classification = resp["top_class"]
        # This is an example of feeding another output into tier3; it needs a
        # 'tier3' entry in the classifier JSON file
        if classification == "Feature":
            resp = nlc.classify(classifierTree['tier3'], review)
            classification = resp["top_class"]
    return classification

# Returns the final classifications possible from the classifier architecture
def getClasses():
    return ["Comparison", "Sentiment", "Customer Service", "Other", "Price",
            "Issue", "Enhancement", "Feature"]


# Make training and testing sets
To train and validate your Natural Language Classifier, a training set and a testing set must be created. Run the script below to create them as CSV files that can be ingested by the NLC later in the flow. The script splits reviews into sentences, which you then classify by hand according to your NLC design.

The format for hand classifying looks like:

Sentence1, Class

Sentence2, Class

...
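
The sentence splitting and random split described above can be sketched in a few lines. This is a simplified illustration rather than the kit's script: the regular-expression splitter is a naive stand-in for nltk's `sent_tokenize`, and the function names, default 70% fraction and fixed seed are assumptions.

```python
import random
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def train_test_split(reviews, train_fraction=0.7, seed=0):
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for review in reviews:
        # A whole review goes to one side, so its sentences never end up
        # in both sets.
        target = train if rng.random() < train_fraction else test
        target.extend(split_sentences(review))
    return train, test
```

Each resulting sentence then becomes one row, with the class column filled in by hand.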

If a layered architecture is being used for the NLC, the process is a little more complex, because a training set must be created for each of the layers. A simple way to approach a layered architecture is as follows:

1. Run the script below to create a training set.
2. Classify the sentences for the 1st layer only and save the CSV.
3. Reclassify every sentence whose class is fed into a second layer, using the second layer's classes, and save the result into another CSV.
4. Continue until all layers have their own training set.

A testing set for a layered architecture is much simpler. Just classify the sentences using the final output classes of the whole classifier tree.
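
Step 3 of the layered procedure above can be partly automated: the rows that continue to the second layer can be copied out of the first-layer CSV so that only their class column needs to be filled in again. This is a sketch under assumptions; the sample rows and the `FED_TO_TIER2` set are made up, and in-memory buffers stand in for the real files.

```python
import csv
import io

# Hypothetical first-layer training set: sentence, tier-1 class.
tier1_csv = io.StringIO(
    "The screen is gorgeous,Product\n"
    "Shipping took three weeks,Other\n"
    "Battery drains too fast,Product\n"
)

FED_TO_TIER2 = {"Product"}  # assumed: classes whose sentences go to layer 2

tier2_todo = io.StringIO()
writer = csv.writer(tier2_todo)
for sentence, label in csv.reader(tier1_csv):
    if label.strip() in FED_TO_TIER2:
        # Leave the class empty; it is filled in by hand with layer-2 classes.
        writer.writerow([sentence, ""])

rows_to_relabel = list(csv.reader(io.StringIO(tier2_todo.getvalue())))
```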

In [None]:
#training_testing.py

'''
This script creates a training and testing split out of your data for you. It
can either be run interactively by running the script without arguments, or run
automatically with command line arguments.
The flags are:
    -l The location of the data, a directory of .txt or .json files or a .csv
        file
    -r Percentage of data to split into training
    -e Percentage of data to split into testing
    -j Field in .json that contains the text data. Only necessary if loading
        from a .json file.
'''

import os
import numpy as np
import ast
import re
import csv
import sys
import getopt
import nltk


def csv_handler(file, training, testing):
    ftest = open("testing_set.csv", "wb")
    ftrain = open("training_set.csv", "wb")
    wtest = csv.writer(ftest)
    wtrain = csv.writer(ftrain)

    rand = np.random.rand

    # Read from the file that was passed in, not from a global
    reader = csv.reader(file)

    for row in reader:
        if not row:
            continue  # skip blank lines
        # Each row holds one text in its first column; the whole text goes
        # to either the training set or the testing set
        writer = wtrain if rand() < training else wtest
        for sentence in nltk.tokenize.sent_tokenize(row[0]):
            writer.writerow([sentence])

    ftest.close()
    ftrain.close()

def txt_handler(file, writer):
    text = file.read()
    sentences = nltk.tokenize.sent_tokenize(text)
    for sentence in sentences:
        # Write one sentence per row (not the whole list each time)
        writer.writerow([sentence])


def json_handler(file, writer, json_field):
    raw_text = file.read()

    try:
        processed_text = ast.literal_eval(raw_text)
        text = processed_text[json_field]
        sentences = nltk.tokenize.sent_tokenize(text)
        for sentence in sentences:
            writer.writerow([sentence])
    except (ValueError, SyntaxError, KeyError, TypeError):
        print "ERROR: Something wrong with .json file: " + file.name


flags = {}

if len(sys.argv) > 1:
    args = sys.argv[1:]
    # getopt returns (option pairs, leftover arguments)
    opts, _ = getopt.getopt(args, 'l:e:r:j:')
    for pair in opts:
        flags[pair[0]] = pair[1]

# Without command line arguments the script runs interactively
interactive = len(sys.argv) == 1


def get_split():
    '''Returns the training and testing percentages as fractions.'''
    training = 0
    testing = 0
    if interactive:
        while training + testing != 100:
            print "What percentage would you like to use for training?" + \
                " (We recommend 70%)"
            training = float(raw_input("Training (0-100): "))
            print "What percentage would you like to use for testing?" + \
                " (We recommend 30%)"
            testing = float(raw_input("Testing (0-100): "))
            if training + testing != 100:
                print "ERROR: Training and testing sets must equal 100%"
    else:
        if '-r' in flags and '-e' in flags:
            training = float(flags['-r'])
            testing = float(flags['-e'])
        else:
            print "ERROR: No training or testing split." + \
                " Did you use the -r and -e flags to mark them?"
            sys.exit(1)
    return training / 100, testing / 100


if interactive:
    print "Input the full path to your data"
    print "The data can be in the format of a .csv file (with one column" + \
        " and one text per line), or a directory of .json or .txt files."
    print "NOTE: If you are using a directory, please make sure your data" + \
        " is the only thing in the directory"
    location = raw_input("Data Location: ")
else:
    if '-l' in flags:
        location = flags['-l']
    else:
        print "ERROR: No file location. Did you use the -l" +\
            " flag to mark a file location?"
        sys.exit(1)

if location.endswith(".csv"):
    try:
        f = open(location, 'rb')
        training, testing = get_split()
        csv_handler(f, training, testing)
        f.close()

    except IOError:
        print "ERROR: File not found"

else:
    json_field = flags.get('-j', "")
    try:
        files = os.listdir(location)
        rand = np.random.rand
        training, testing = get_split()

        ftest = open("testing_set.csv", "wb")
        ftrain = open("training_set.csv", "wb")
        wtest = csv.writer(ftest)
        wtrain = csv.writer(ftrain)

        for entry in files:
            # A whole file goes to either the training or the testing set
            writer = wtrain if rand() < training else wtest
            if entry.endswith(".txt"):
                f = open(os.path.join(location, entry), 'rb')
                txt_handler(f, writer)
                f.close()
            elif entry.endswith(".json"):
                if json_field == "":
                    if interactive:
                        print "What key in the .json contains your text data?"
                        json_field = raw_input("Json Key: ")
                    else:
                        print "Please use the -j flag to give the key of" + \
                            " the .json that contains the text data."
                        sys.exit(1)
                f = open(os.path.join(location, entry), 'rb')
                json_handler(f, writer, json_field)
                f.close()
        ftest.close()
        ftrain.close()
    except OSError:
        print "ERROR: Directory not found"


# Perform Entity Extraction and Replacement
Next, take the training and testing sets and replace any entities found in the sentences with a tag naming the entity type. This generalizes a sentence like so:

I love my phone because of its great screen --> I love my Product because of its Descriptor Feature

This generalizes the NLC during training so that it works across many products. To do this, run the script below (src/Training/Auto_Token_Replacement.py) after setting the read and write files to the input file and the output file for the entity-replaced data. For a single-layer NLC, first run it on the training set and then on the testing set, and give the output files easily identified names that show the data has been replaced (something simple like "training_replaced.csv").
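
The replacement step itself is a whole-word substitution, which can be sketched without calling the Alchemy service. The entity list below is hypothetical, shaped like the 'text'/'type' pairs that token_replacement.py reads from the API response; like that script, the sketch wraps each type name in angle brackets.

```python
import re


def replace_entities(sentence, entities):
    """Replace the first whole-word occurrence of each entity with <Type>."""
    for entity in entities:
        pattern = r"\b%s\b" % re.escape(entity["text"])
        tag = "<%s>" % entity["type"]
        sentence = re.sub(pattern, tag, sentence, count=1)
    return sentence


# Hypothetical entities for the example sentence above.
entities = [
    {"text": "phone", "type": "Product"},
    {"text": "great", "type": "Descriptor"},
    {"text": "screen", "type": "Feature"},
]
generalized = replace_entities(
    "I love my phone because of its great screen", entities
)
# generalized == "I love my <Product> because of its <Descriptor> <Feature>"
```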

In [None]:
# Auto_Token_Replacement.py

'''
Takes a csv file with one record per line, uses the Alchemy Language API to
find keywords and then replaces the keywords of the record with the name of
the class of the keywords. The end result is a generalized sentence.
Must change location of input .csv and output .csv
'''
import csv
import sys
import os

sys.path.insert(0, os.path.abspath('..'))
from utils import token_replacement as t

read = open('', 'rb')      # Replace with location of .csv file to classify
write = open('', 'wb')     # Replace with output file location

reader = csv.reader(read)
writer = csv.writer(write)

for row in reader:
    token = t.token_replacement_entities(row[0])
    # Keep whatever label columns follow the sentence; a two-column
    # "sentence, class" row has no row[2]
    writer.writerow([token] + row[1:])

read.close()
write.close()


# Train Natural Language Classifier
The Natural Language Classifier (NLC) needs to be trained on this generalized/tagged data. The script below trains one classifier per tier (two tiers as written; add or remove tiers to match your architecture). Add your credentials to the script and run it to train the NLC. The script also stores your classifier IDs in a JSON file.

In [None]:
# trainNLC.py

import json
from watson_developer_cloud import NaturalLanguageClassifierV1

tier1CSV = ''  # Replace with location of layer 1 training data
tier2CSV = ''  # Replace with another location of layer 2 training data
USERNAME = ''  # Replace with username of NLC credentials
PASSWORD = ''  # Replace with password of NLC credentials
JSON_TARGET = '../../data/classifier_ids.json'  # Location to keep classifiers

# The architecture of your classifier tree is used here. This is 2 tiers but add/remove any tiers to make
# it work with your architecture
classifierTree = {
    'tier1': '',
    'tier2': ''
}

# Initialize classifier
nlc = NaturalLanguageClassifierV1(username=USERNAME, password=PASSWORD)

# Train tier 1 classifier
print("############# TIER 1 CLASSIFIER ##############")
with open(tier1CSV, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name='tier1',
        language='en'
      )
print(json.dumps(classifier, indent=2))
classifierTree['tier1'] = classifier['classifier_id']

# Train tier 2 classifier
print("############# TIER 2 CLASSIFIER ##############")
with open(tier2CSV, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name='tier2',
        language='en'
      )
print(json.dumps(classifier, indent=2))
classifierTree['tier2'] = classifier['classifier_id']


# Write the tiers with classifier id's to file for use later
with open(JSON_TARGET, 'w') as outfile:
    json.dump(classifierTree, outfile)

print("############# FULL CLASSIFIER TREE ##############")
print(json.dumps(classifierTree))


# Validate Results
Now that all the models are trained, they can be validated to make sure their performance is good enough to use. Run the src/Training/ValidateModels.py script to see how the trained models performed. It reports your classification accuracy (the number of times the class was right divided by the number of tests) and a confusion matrix that shows HOW sentences are being misclassified, so that you can improve results even more.

An accuracy of around 60% on the testing set is usually acceptable, but this depends heavily on the application and the amount of data that is available.
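
The two reported measures can be illustrated on a toy example. This sketch is independent of the validation script: the `evaluate` helper and the label lists are made up, and the confusion matrix is stored as counts keyed by (predicted, actual) pairs.

```python
from collections import Counter


def evaluate(predicted, actual):
    """Return (accuracy, confusion); confusion[(predicted, actual)] is a count."""
    confusion = Counter(zip(predicted, actual))
    correct = sum(n for (p, a), n in confusion.items() if p == a)
    return correct / float(len(actual)), confusion


# Hypothetical classifier outputs and hand labels for four test sentences.
predicted = ["Feature", "Price", "Other", "Feature"]
actual = ["Feature", "Price", "Feature", "Issue"]
accuracy, confusion = evaluate(predicted, actual)
# accuracy == 0.5; the off-diagonal entries show which classes get confused,
# for example confusion[("Other", "Feature")] == 1
```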

In [None]:
# ValidateModels.py

import csv
import sys
import os

sys.path.insert(0, os.path.abspath('..'))
from utils import token_replacement as tr
from utils import classify as clf

TEST_SET_FILE = '' # Insert the name of your test set that has gone through entity replacement
verbose = False

read = open(TEST_SET_FILE, 'rb')
reader = csv.reader(read)

# confusion_matrix[PREDICTED][ACTUAL]
confusion_matrix = {}
classes = clf.getClasses()
for cl1 in classes:
    confusion_matrix[cl1] = {}
    for cl2 in classes:
        confusion_matrix[cl1][cl2] = 0

num_correct = 0
total = 0
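for row in reader:
    sentence = row[0]
    # Strip stray whitespace so "Sentence, Class" rows match the class names
    correct_class = row[1].strip()
    total += 1
    correct = False
    # token_replacement_entities is the function defined in token_replacement.py
    token = tr.token_replacement_entities(sentence)
    classification = clf.classify(token)
    confusion_matrix[classification][correct_class] += 1
    if classification == correct_class:
        num_correct += 1
        correct = True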
    if(verbose):
        print(sentence + " --> " + token)
        print("EXPECTED: " + correct_class + " / ACTUAL: " + classification)
        if(correct):
            print("CORRECT!")
        else:
            print("WRONG!")
        print("-" * 30)

print("CORRECT: " + str(num_correct) + "/" + str(total) + " (" +
      str(num_correct/float(total)*100) + "%)")

row = ''
for i in range(len(classes)+1):
    row += '{data[' + str(i) + ']:>20} |'
headers = [""] + classes
print(row.format(data=headers))

for cl1 in classes:
    row_data = []
    row_data.append(cl1)
    for cl2 in classes:
        row_data.append(confusion_matrix[cl1][cl2])
    print(row.format(data=row_data))

read.close()


# Troubleshooting
If the results from validation are not good, there are a few steps that can help boost those results to an acceptable level.

1. Increase the size of the training set for the NLC. This generally improves performance, but with diminishing returns as the size grows.
2. Balance the training/testing sets. Each class needs a comparable level of representation in the training set so that the classifiers can learn about every class and do not overclassify for the larger classes.
3. Check the confusion matrix to see which classes are misclassified the most and focus on giving them more data. If some classes are overclassified, see tip #2.
4. Skip entity replacement. Entity replacement is a way to generalize the data going into the NLC, which can reduce the accuracy of the system if the dataset is already specific to a tight domain. If that is the case for your application, removing this step can increase accuracy.
5. Check WKS documentation for troubleshooting steps for WKS.
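
For tip #2, a quick way to see how balanced a training set is would be to count the rows per class. This is a sketch under assumptions: the class is taken to be the last column of each row, the sample rows are made up, and an in-memory buffer stands in for the training CSV.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file):
    """Count training rows per class; the class is assumed to be the last column."""
    return Counter(row[-1].strip() for row in csv.reader(csv_file) if row)


# Hypothetical training rows: sentence, class.
sample = io.StringIO(
    "The <Feature> is great,Feature\n"
    "Too expensive,Price\n"
    "Love the <Feature>,Feature\n"
)
counts = class_counts(sample)

# Ratio between the largest and smallest class; values far above 1
# suggest the set is unbalanced.
imbalance_ratio = max(counts.values()) / float(min(counts.values()))
```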