# Gathering Data
The first step is to find a data set to analyze and store it where the rest of the flow can reach it. We recommend a database that holds the data as JSON, but any storage works as long as it is accessible throughout the flow: a JSON file, a SQL database, a NoSQL database, a CSV file, and so on. This starter kit stores its data in a Cloudant NoSQL database (available on Bluemix), which is the simplest way to get started and requires the fewest changes to the scripts. We provide a few scripts that push to and pull from this database to make this even easier.

The minimum data required is the review or feedback text and a way to link it to the product it targets (most likely a key or ID that marks a review as belonging to a product).
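
As a minimal sketch of such a record (Python 3), the snippet below stores a few reviews in a JSON file and retrieves the text for one product. The field names `product_id` and `review_text`, the helper names, and the sample reviews are all hypothetical; they are not the record layout used by this starter kit.

```python
import json
import os
import tempfile

# Hypothetical records: the field names are illustrative assumptions.
reviews = [
    {"product_id": "P-001", "review_text": "Battery life is great."},
    {"product_id": "P-002", "review_text": "The strap broke after a week."},
    {"product_id": "P-001", "review_text": "Screen scratches easily."},
]


def save_reviews(records, path):
    """Write the list of review records to a JSON file."""
    with open(path, "w") as handle:
        json.dump(records, handle, indent=2)


def load_reviews_for_product(path, product_id):
    """Return the review texts linked to one product."""
    with open(path) as handle:
        records = json.load(handle)
    return [r["review_text"] for r in records if r["product_id"] == product_id]


path = os.path.join(tempfile.mkdtemp(), "reviews.json")
save_reviews(reviews, path)
texts = load_reviews_for_product(path, "P-001")
print(texts)
```

Any storage that keeps the text and the product key together would serve the same purpose; a JSON file is only the simplest case.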

Push the dataset to your database or file in whatever format is easiest to index and use later in the flow. For reference, here is an example of a review that we used:

TODO: ADD JSON STRING HERE

Key things to note: storing the data this way makes it easy to retrieve the text of a review and find out which product it refers to, so that the rest of the flow can be executed easily.

# Converting Data into CSV
The data must be converted into a form that can be used to train the models for entity extraction. To do this, run the script below. Edit the database connection settings so that they point to the database where your data is stored, or rewrite that portion so that it reads from wherever the data is located (if not in a Cloudant database).

The important part is the output. The CSV that Watson Knowledge Studio ingests has the following format:

"key","value"
"TITLE OF DOCUMENT","TEXT OF DOCUMENT"
...
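
The layout above can be produced with Python's standard `csv` module. The following is a sketch under assumptions (Python 3): it quotes every field to match the layout shown, and it treats the `"key","value"` line as a header row, which is an interpretation rather than something stated here. The function name `rows_to_wks_csv` and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_wks_csv(rows, include_header=True):
    """Render (title, text) pairs as fully quoted CSV text.

    include_header is an assumption: it writes "key","value" as the first line.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    if include_header:
        writer.writerow(["key", "value"])
    for title, text in rows:
        writer.writerow([title, text])
    return buffer.getvalue()


sample = [
    ("review-1", 'Great "value" for the price'),
    ("review-2", "Arrived late, works fine"),
]
csv_text = rows_to_wks_csv(sample)
print(csv_text)
```

Letting the `csv` module handle quoting matters because review text often contains commas and quotation marks, which would otherwise split one document across several columns.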

# Import Data CSV into Watson Knowledge Studio
The CSV created by the script must now be imported into Watson Knowledge Studio (WKS). Each row in the CSV is treated as a separate document in WKS. The data doesn't have to be in CSV format: documents can also be uploaded manually, but we provide the conversion tool to make the import easier.
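
Because each row becomes its own document, it can help to check the CSV before uploading it. The sketch below (Python 3) keeps only rows that have exactly two fields and real text, and drops rows carrying the "No Text" placeholder that db2file.py writes when a value is missing. The function name `usable_documents` and the sample data are hypothetical, and skipping such rows is a suggestion, not a WKS requirement.

```python
import csv
import io

PLACEHOLDER_TEXT = "No Text"  # written by db2file.py when a row has no value


def usable_documents(csv_text):
    """Return the (title, text) rows that contain real text."""
    documents = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip malformed rows
        title, text = row
        if text.strip() and text != PLACEHOLDER_TEXT:
            documents.append((title, text))
    return documents


sample_csv = 'r1,"Works well, would buy again"\nr2,No Text\nr3,\n'
docs = usable_documents(sample_csv)
print(len(docs), "document(s) ready for upload")
```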

The documents then need to be annotated with entities and relationships. Coreference annotation is also done at this stage. A strict annotation guideline is very helpful here.

For guidelines and tips on Watson Knowledge Studio, reference /notebooks/WKS.md

After the annotations are done and the model is trained, it needs to be exported to Alchemy Language using an API key. To do this, first get an Alchemy API key from Bluemix, then follow the WKS documentation referenced above.

In [None]:
#db2file.py

import cloudant
import csv

# Fill in the values below before running the script.
SERVER = ''       # Replace with your server URL
DATABASE = ''     # Replace with the name of the database
USERNAME = ''     # Replace with the username from your Cloudant credentials
PASSWORD = ''     # Replace with the password from your Cloudant credentials
# Replace with the name of the design document that contains the view.
# This should be of the form '_design/XXXX'.
DESIGN = ''
# Replace with the view from your database to poll. This should take the form
# of view_file/view and should return the text to classify as the value field
# and what you would like to call it as the key.
VIEW = ''
DESTINATION = ''  # Replace with the name of the output file (must be *.csv)

server = cloudant.client.Cloudant(USERNAME, PASSWORD, url=SERVER)
server.connect()
db = server[DATABASE]
query = db.get_view_result(DESIGN, VIEW)
file = open(DESTINATION, 'wb')
writer = csv.writer(file)


for q in query:
    print q[0]
    if 'key' in q[0] and q[0]['key'] is not None:
        title = q[0]['key']
    else:
        title = "No Title"
    if 'value' in q[0] and q[0]['value'] is not None:
        text = q[0]['value']
    else:
        text = "No Text"
    writer.writerow([title, text])

file.close()
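
As an alternative to editing the constants in the script, the connection settings could be read from environment variables so that credentials stay out of the source. This is only a sketch (Python 3): the variable names and the helper `load_settings` are hypothetical and not part of the starter kit.

```python
import os

# Hypothetical variable names; choose whatever fits your deployment.
REQUIRED = ["CLOUDANT_URL", "CLOUDANT_DB", "CLOUDANT_USERNAME", "CLOUDANT_PASSWORD"]


def load_settings(environ=os.environ):
    """Collect the connection settings, failing early if any are missing."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}


# Example with an explicit dictionary standing in for the real environment.
settings = load_settings({
    "CLOUDANT_URL": "https://example.invalid",
    "CLOUDANT_DB": "reviews",
    "CLOUDANT_USERNAME": "user",
    "CLOUDANT_PASSWORD": "secret",
})
print(sorted(settings))
```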


# Make training and testing sets
In order to train and validate your Natural Language Classifier, a training and testing set must be created.
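
The idea behind the split can be sketched in a few lines (Python 3). Each row is sent to the training set with a probability equal to the training percentage, so the resulting proportions are approximate rather than exact. This sketch uses the standard `random` module with an optional seed instead of NumPy; the function name `split_rows` and the sample rows are hypothetical.

```python
import random


def split_rows(rows, training_percent, seed=None):
    """Assign each row to training with probability training_percent / 100."""
    rng = random.Random(seed)
    threshold = training_percent / 100.0
    training, testing = [], []
    for row in rows:
        if rng.random() < threshold:
            training.append(row)
        else:
            testing.append(row)
    return training, testing


rows = ["review %d" % i for i in range(1000)]
train, test = split_rows(rows, 70, seed=42)
print(len(train), len(test))
```

Passing a fixed seed makes the split reproducible, which is useful when comparing classifiers trained at different times.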

In [None]:
#training_testing.py

'''
This script creates a training and testing split out of your data for you. It
can either be run interactively by running the script without arguments, or
run automatically with command line arguments.
The flags are:
    -l The location of the data, a directory of .txt or .json files or a .csv
        file
    -r Percentage of data to split into training
    -e Percentage of data to split into testing
    -j Field in .json that contains the text data. Only necessary if loading
        from a .json file.
'''

import os
import numpy as np
import ast
import re
import csv
import sys
import getopt


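def csv_handler(file, training, testing):
    # Randomly assign each row of the input CSV to the training or testing set
    ftest = open("testing_set.csv", "wb")
    ftrain = open("training_set.csv", "wb")
    wtest = csv.writer(ftest)
    wtrain = csv.writer(ftrain)

    rand = np.random.rand

    # Read from the file object that was passed in, not from a global
    reader = csv.reader(file)

    for row in reader:
        # row is already a list of fields, so write it as-is
        if rand() < training:
            wtrain.writerow(row)
        else:
            wtest.writerow(row)

    ftest.close()
    ftrain.close()


def txt_handler(file, writer):
    # Write the whole text file as a single-column row
    text = file.read()
    writer.writerow([text])


def json_handler(file, writer, json_field):
    # Write the value stored under json_field as a single-column row
    raw_text = file.read()

    try:
        processed_text = ast.literal_eval(raw_text)
        text = processed_text[json_field]
        writer.writerow([text])
    except (ValueError, SyntaxError, KeyError, TypeError):
        print "ERROR: Something wrong with .json file: " + file.name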


flags = {}

if len(sys.argv) > 1:
    args = sys.argv[1:]
    # getopt returns a tuple: (list of option pairs, remaining arguments)
    opts, _ = getopt.getopt(args, 'l:e:r:j:')
    for pair in opts:
        flags[pair[0]] = pair[1]


if len(sys.argv) == 1:
    # No command line arguments: ask for the location interactively
    print "Input the full path to your data"
    print "The data can be in the format of a .csv file (with one column" + \
        " and one text per line), or a directory of .json or .txt files."
    print "NOTE: If you are using a directory, please make sure your data" + \
        " is the only thing in the directory"
    location = raw_input("Data Location: ")
else:
    if '-l' in flags:
        location = flags['-l']
    else:
        print "ERROR: No file location. Did you use the -l" +\
            " flag to mark a file location?"
        sys.exit(1)

if re.search(r"\.csv$", location):
    try:
        f = open(location, 'rb')
        training = 0
        testing = 0
        if len(sys.argv) == 1:
            # Interactive mode: keep asking until the split adds up to 100
            while training + testing != 100:
                print "What percentage would you like to use for training?" + \
                    " (We recommend 70%)"
                training = float(raw_input("Training (0-100): "))
                print "What percentage would you like to use for testing?" + \
                    " (We recommend 30%)"
                testing = float(raw_input("Testing (0-100): "))
                if training + testing != 100:
                    print "ERROR: Training and testing sets must equal 100%"
        else:
            if '-r' in flags and '-e' in flags:
                training = flags['-r']
                testing = flags['-e']
            else:
                print "ERROR: No training or testing split." + \
                    " Did you use the -r and -e flags to mark them?"
                sys.exit(1)

        # Convert the percentages to fractions between 0 and 1
        training = float(training)/100
        testing = float(testing)/100

        csv_handler(f, training, testing)
        f.close()

    except (IOError):
        print "ERROR: File not found"

else:
    json_field = ""
    try:
        files = os.listdir(location)
        rand = np.random.rand
        total_docs = len(files)
        training = 0
        testing = 0
        if len(sys.argv) == 1:
            # Interactive mode: keep asking until the split adds up to 100
            while training + testing != 100:
                print "What percentage would you like to use for training?" + \
                    " (We recommend 70%)"
                training = float(raw_input("Training (0-100): "))
                print "What percentage would you like to use for testing?" + \
                    " (We recommend 30%)"
                testing = float(raw_input("Testing (0-100): "))
                if training + testing != 100:
                    print "ERROR: Training and testing sets must equal 100%"
        else:
            if '-r' in flags and '-e' in flags:
                training = flags['-r']
                testing = flags['-e']
            else:
                print "ERROR: No training or testing split." + \
                    " Did you use the -r and -e flags to mark them?"
                sys.exit(1)

        # Convert the percentages to fractions between 0 and 1
        training = float(training)/100
        testing = float(testing)/100

        ftest = open("testing_set.csv", "wb")
        ftrain = open("training_set.csv", "wb")
        wtest = csv.writer(ftest)
        wtrain = csv.writer(ftrain)

        for entry in files:
            if re.search(r"\.txt$", entry):
                f = open(location + '/' + entry, 'rb')

                if rand() < training:
                    txt_handler(f, wtrain)
                else:
                    txt_handler(f, wtest)
                f.close()
            if re.search(r"\.json$", entry):
                if len(sys.argv) == 1:
                    if json_field == "":
                        print "What key in the .json contains your text data?"
                        json_field = raw_input("Json Key: ")
                else:
                    if '-j' in flags:
                        json_field = flags['-j']
                    if json_field == "":
                        print "Please use the -j flag to give the key of" + \
                            " the .json that contains the text data."

                f = open(location + '/' + entry, 'rb')

                if rand() < training:
                    json_handler(f, wtrain, json_field)
                else:
                    json_handler(f, wtest, json_field)
                f.close()
        ftest.close()
        ftrain.close()
    except OSError:
        print "ERROR: Directory not found"


# Train Natural Language Classifier
The Natural Language Classifier (NLC) needs to be trained on this generalized/tagged data. This is a three-tier NLC that will provide better classification results than a single layer. Add your credentials to the script below and run it to train the NLC. The script also stores your classifier IDs in a JSON file.

In [None]:
import json
from watson_developer_cloud import NaturalLanguageClassifierV1

tier1CSV = 'training_set1.csv'
tier2CSV = 'training_set2.csv'
tier3CSV = 'training_set3.csv'
USERNAME = ''  # Replace with the username from your NLC credentials
PASSWORD = ''  # Replace with the password from your NLC credentials
JSON_TARGET = '../../data/classifier_ids.json'

classifierTree = {
    'tier1':'',
    'tier2':'',
    'tier3':''
}

# Initialize classifier
nlc = NaturalLanguageClassifierV1(username = USERNAME, password = PASSWORD)

# Train tier 1 classifier
print("############# TIER 1 CLASSIFIER ##############")
with open(tier1CSV, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name='tier1',
        language='en'
    )
print(json.dumps(classifier, indent=2))
classifierTree['tier1'] = classifier['classifier_id']

# Train tier 2 classifier
print("############# TIER 2 CLASSIFIER ##############")
with open(tier2CSV, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name='tier2',
        language='en'
    )
print(json.dumps(classifier, indent=2))
classifierTree['tier2'] = classifier['classifier_id']

# Train tier 3 classifier
print("############# TIER 3 CLASSIFIER ##############")
with open(tier3CSV, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name='tier3',
        language='en'
    )
print(json.dumps(classifier, indent=2))
classifierTree['tier3'] = classifier['classifier_id']

# Write the tiers with classifier id's to file for use later
with open(JSON_TARGET, 'w') as outfile:
    json.dump(classifierTree, outfile)

print("############# FULL CLASSIFIER TREE ##############")
print(json.dumps(classifierTree))
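
Later steps need the classifier IDs that the script writes to `JSON_TARGET`. The sketch below (Python 3) reads such a file back and checks that every tier has an ID. The helper name `load_classifier_ids` is hypothetical, and the file it reads here is a stand-in with made-up IDs so that the example runs without a trained classifier.

```python
import json
import os
import tempfile


def load_classifier_ids(path):
    """Read the tier -> classifier_id mapping and check that no tier is empty."""
    with open(path) as handle:
        tree = json.load(handle)
    empty = [tier for tier in ("tier1", "tier2", "tier3") if not tree.get(tier)]
    if empty:
        raise ValueError("No classifier_id stored for: " + ", ".join(empty))
    return tree


# Stand-in file with made-up IDs, mirroring the structure of classifierTree.
path = os.path.join(tempfile.mkdtemp(), "classifier_ids.json")
with open(path, "w") as handle:
    json.dump({"tier1": "id-a", "tier2": "id-b", "tier3": "id-c"}, handle)

ids = load_classifier_ids(path)
print(ids["tier1"])
```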