# 1. Gathering data
The first step is to find a data set to analyze and store it where the rest of the flow can reach it. We recommend using a database and storing the data in JSON format, but any storage works as long as it can be accessed throughout the flow: a JSON file, an SQL database, a NoSQL database, a CSV file, and so on. This starter kit stores the data in a Cloudant NoSQL database (available on Bluemix, http://bluemix.net/), which is the simplest way to get started and requires the fewest changes to the scripts. We provide a few scripts that push to and pull from this database to make it even easier.

The minimum data needed is some review/feedback text and a way to link it to the product it refers to (most likely a key/id that marks a review as belonging to a product).

Take the dataset and push it to your database/file in a format that is easiest to index and use further in the flow. For reference, here is an example of a review that we used: 

{
 "reviewerID": "AO94DHGC771SJ", 
 "asin": "0528881469", 
 "reviewerName": "amazdnu", 
 "helpful": [0, 0], 
 "reviewText": "We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the \"trucker\" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that's just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!", 
 "overall": 5.0, 
 "summary": "Gotta have GPS!", 
 "unixReviewTime": 1370131200, 
 "reviewTime": "06 2, 2013"
}

The data was downloaded from http://jmcauley.ucsd.edu/data/amazon/ and the Electronics section was the one used for the demo.

Key things to note: Storing the data this way makes it easy to retrieve the text of a review and identify the product it refers to, which is all the rest of the flow needs in order to run.
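
As a rough illustration of the minimum data requirement, the sketch below reads records stored one JSON object per line and keeps only those that carry both a product key and review text. The sample records, the `REQUIRED_FIELDS` tuple, and the `extract_reviews` helper are hypothetical and are not part of the starter kit; the field names `asin` and `reviewText` come from the example review above.

```python
import json

# Two made-up records in one-JSON-object-per-line form (hypothetical data).
SAMPLE_LINES = [
    '{"asin": "0528881469", "reviewText": "Worked great when it worked!", "overall": 5.0}',
    '{"asin": "0528881469", "summary": "No review text in this record"}',
]

# Product key and review text: the minimum the rest of the flow needs.
REQUIRED_FIELDS = ("asin", "reviewText")


def extract_reviews(lines):
    """Return (product_id, text) pairs, skipping records that lack a required field."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if all(record.get(field) for field in REQUIRED_FIELDS):
            pairs.append((record["asin"], record["reviewText"]))
    return pairs


reviews = extract_reviews(SAMPLE_LINES)
print(reviews)
```

Records that fail this check could be logged or dropped before upload, depending on how strict the rest of your flow needs to be.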


# 2. Importing data into Cloudant

To preprocess the data, the reviews used in this example are uploaded to a Cloudant database. Assuming you have followed the instructions for setting up a Cloudant service under your Bluemix account and have added its credentials to your local .env file, the script below uploads your data to your Cloudant database.

In [None]:
import os
import couchdbkit
import configparser
import logging
import json
import cloudanthelper as ch

logger = logging.getLogger()
logger.setLevel(logging.INFO)

#getting current directory
curdir = os.getcwd()
logger.debug(curdir)

#loading credentials from .env file
credFilePath = os.path.join(curdir,'..','.env')
config = configparser.ConfigParser()
config.read(credFilePath)
logger.debug(config.sections())

#please provide the name of the file that contains your data 
#(the file should be placed under the 'resources' folder)
DATA_FILE_NAME='data_small.json'

def parse(path):
    # each line of the file holds one review as a JSON object
    data = []
    with open(path, 'r') as reviews:
        for review in reviews:
            data.append(json.loads(review))
    return data

# Connecting to cloudant
server = couchdbkit.Server(config['CLOUDANT']['CLOUDANT_URL'])
db = server.get_or_create_db(config['CLOUDANT']['CLOUDANT_DB'])

# Uploading data to cloudant
data_file_path = os.path.join(curdir,'..','resources',DATA_FILE_NAME)
docs = parse(data_file_path)
logger.debug("uploading(" + str(len(docs)) + ")...")

# upload in batches of 1000 documents (range works on Python 2 and 3)
for start in range(0, len(docs), 1000):
    db.bulk_save(docs[start : start + 1000])

# Creating document to track status of reviews
client = ch.getConnection()
db = client[config['CLOUDANT']['CLOUDANT_DB']]
ch.create_tracker(db)

logger.info("Cloudant upload finished.")
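
The script above reads its Cloudant settings from the `[CLOUDANT]` section of the .env file through `configparser`. The sketch below shows one way to check that the keys it reads are present before attempting an upload. The URL and database name are placeholders, the `required` mapping and `missing` list are hypothetical helpers, and the snippet assumes Python 3 (it uses `read_string` so that it runs without a file on disk).

```python
import configparser

# Placeholder values only; the section and key names match the ones the script reads.
EXAMPLE_ENV = """
[CLOUDANT]
CLOUDANT_URL = https://user:password@example.cloudant.com
CLOUDANT_DB = reviews
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_ENV)

# Keys the upload script looks up, grouped by section.
required = {"CLOUDANT": ["CLOUDANT_URL", "CLOUDANT_DB"]}

# Collect every "SECTION.KEY" that is absent from the configuration.
missing = [
    section + "." + key
    for section, keys in required.items()
    for key in keys
    if not config.has_option(section, key)
]

if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All Cloudant settings found.")
```

The same check could be extended with the other sections the later steps read (`WKS`, `ALCHEMY`, `NLC`).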


# 3. Creating .csv file for use with Watson Knowledge Studio
The data must be converted into a form that can be used to train the entity extraction models in Watson Knowledge Studio. To do this, run the script below.

NOTE: If you stored your data somewhere other than Cloudant during step 2, edit the database connections so that they point to wherever your data is stored, or rewrite that portion of the script accordingly.

The important part is the output. The .csv file ingested by Watson Knowledge Studio has the following format:

"key","value"

"TITLE OF DOCUMENT","TEXT OF DOCUMENT"

...

The .csv created at the end of the script must now be imported into Watson Knowledge Studio (WKS). Each row in the .csv is treated as a separate document and organized in WKS. The data does not have to be in .csv format to be uploaded into WKS; documents can also be uploaded manually, but we provide a tool that converts them to .csv to make this easier.
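
To make the two-column layout concrete, here is a minimal sketch that writes title/text pairs with every field quoted, as in the format shown above. The document titles and texts are made up, and writing to an in-memory buffer is only for illustration; whether WKS expects a literal header row is not covered here, so none is written.

```python
import csv
import io

# Hypothetical documents as (title, text) pairs.
documents = [
    ("review-0001", "Love the bigger screen, the ease of use."),
    ("review-0002", 'It has the "trucker" option.'),
]

# QUOTE_ALL wraps every field in double quotes and doubles any embedded quote.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
for title, text in documents:
    writer.writerow([title, text])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original pairs, one row per document.
rows = [tuple(row) for row in csv.reader(io.StringIO(csv_text))]
```

Quoting every field keeps commas and quotation marks inside review text from being mistaken for column separators.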

The documents then need to be annotated with entities and relationships. Coreference is also done here. A strict guideline is very helpful when doing this.

For guidelines and tips on Watson Knowledge Studio, reference /notebooks/WKS.md

After the annotations are done and the model is trained, the model needs to be exported to Alchemy Language using an API key. To do this, first get an Alchemy API key from Bluemix, then follow the WKS documentation referenced above.

In [None]:
import csv
import os
import logging
import configparser
import cloudanthelper as ch

logger = logging.getLogger()
logger.setLevel(logging.ERROR)

#getting current directory
curdir = os.getcwd()
logger.debug(curdir)

#loading credentials from .env file
credFilePath = os.path.join(curdir,'..','.env')
config = configparser.ConfigParser()
config.read(credFilePath)
logger.debug(config.sections())

#please provide the name for the .csv that will be uploaded 
#to WKS (the file will be written to the 'data/output/' folder)
WKS_INPUT_FILE='wks_input.csv'

OUTPUT_FILE = os.path.join(curdir,'..','data','output',WKS_INPUT_FILE)

#Initializing Cloudant client
client = ch.getConnection()
db = client[config['CLOUDANT']['CLOUDANT_DB']]

# Process results from the query and write to a file
try:
    file = open(OUTPUT_FILE, 'wb')
    writer = csv.writer(file)
except IOError:
    logger.error('Error when opening file for writing.')
    # stop here: the loop below cannot run without a writer
    raise

for doc in db:
    if 'title' in doc:
        writer.writerow([doc['title'], doc['reviewText']])

file.close()
client.disconnect()

# 4. Creating ground truth file for NLC

A ground truth file (ground_truth.csv) is already provided for this example under the 'data' folder.

If you are working on a different use case, create your own NLC ground truth data for training a new classifier by following the instructions and best practices available in the service tutorial (link available in the README).

Once your ground_truth.csv file is created, save it to the 'data' folder so that the next steps can use it.

# 5. Using the WKS model to replace entities by their semantic types

Once a WKS model is trained (by going through the steps in the WKS notebook) and hooked up to the Alchemy API key, this information needs to be updated in your local .env file.

This step lets Alchemy replace the instances mentioned in the sentences with their semantic types (color, product, model). These semantic types were defined when the WKS model was trained and are usually specific to a given domain.
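
The replacement itself can be sketched without calling the service. The example below assumes a hand-written entity list in which each item has a `text` and a `type` field; the list, the sentence, and the `replace_entities` helper are hypothetical, and a real entity list would come from the Alchemy response.

```python
import re

# Hypothetical entities; each item carries the matched text and its semantic type.
sample_entities = [
    {"text": "GPS", "type": "product"},
    {"text": "black", "type": "color"},
]


def replace_entities(sentence, entities):
    """Replace the first whole-word match of each entity with <type>."""
    for entity in entities:
        # re.escape keeps characters such as '+' or '.' in the entity text literal.
        pattern = r"\b" + re.escape(entity["text"]) + r"\b"
        tag = "<" + entity["type"] + ">"
        # A function replacement inserts the tag verbatim, with no backslash handling.
        sentence = re.sub(pattern, lambda _match: tag, sentence, count=1)
    return sentence


result = replace_entities("The black GPS froze twice.", sample_entities)
print(result)  # The <color> <product> froze twice.
```

Only the first occurrence of each entity is replaced here (`count=1`), so a sentence that mentions the same entity twice keeps its second mention unchanged.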

In [None]:
import re
import os
import logging
import configparser
import csv
import utils
import nltk
from watson_developer_cloud import alchemy_language_v1 as alchemy

logger = logging.getLogger()
logger.setLevel(logging.ERROR)

#getting current directory
curdir = os.getcwd()
logger.debug(curdir)

#loading credentials from .env file
credFilePath = os.path.join(curdir,'..','.env')
config = configparser.ConfigParser()
config.read(credFilePath)
logger.debug(config.sections())

model_id = config['WKS']['WKS_MODEL_ID']
alchemy_api = alchemy.AlchemyLanguageV1(api_key = 
                    config['ALCHEMY']['ALCHEMY_API_KEY'])

#please provide the path to the ground truth file you created on
#step 4. The path provided by default points to the available ground
#truth file
INPUT_FILE = os.path.join(curdir,'..','data','ground_truth.csv')
OUTPUT_FILE = os.path.join(curdir,'..','data',
                           'output','ground_truth_replaced.csv')
MAX_CHAR_REVIEW = 5024

def get_entities(review):
    """
    Get entities from Alchemy service.
    Input: text which contains entities.
    Output: json object with response from the service.
    """
    logger.debug(review)
    response = ''
    try:
        response = alchemy_api.entities(text=str(review), model=model_id, sentiment=True)
    except:
        logger.error("Error when getting entities.")
    logger.debug("Result from entities call: "+str(response))
    return response
    
    
def token_replacement_entities(review_text):
    """
    Replaces the identified tokens by their
        semantic types.
    Input: text to replace identified tokens.
    Output: sentences with tokens replaced by their
            semantic types.
    """
    processed = get_entities(review_text)
    if 'entities' in processed:
        if len(processed['entities']) == 0:
            return review_text
        else:
            entities = processed['entities']
            for i in entities:
                token = i['text']
                classification = "<" + i['type'] + ">"
                token = re.escape(token)
                # keep the result: re.sub returns a new string
                token = re.sub(r'\\ ', ' ', token)
                review_text = re.sub(r"\b%s\b" % token, classification, review_text, count=1)
            return review_text
    else:
        return review_text
    
#Opens a file to write results from token replacement
try:
    write = open(OUTPUT_FILE, 'wb')
    writer = csv.writer(write)
except IOError:
    logger.error('Error when opening file for writing.')
    raise

#Opens a file to read reviews
try:
    read = open(INPUT_FILE, 'rb')
    reader = csv.reader(read)
except IOError:
    logger.error('Error when opening file for reading.')
    raise

line_number = 0
for row in reader:
    line_number += 1
    logger.debug(row[0])

    try:
        sentences = nltk.tokenize.sent_tokenize(row[0])
        for sentence in sentences:
            replaced_sentence = token_replacement_entities(sentence)
            writer.writerow([replaced_sentence, row[1], row[2]])
    except:
        logger.error("Could not get entities for sentence in line number "\
                     +str(line_number))


read.close()
write.close()


# 6. Training Natural Language Classifier

The Natural Language Classifier (NLC) should be trained on the generalized/tagged data generated in the previous step.

This step assumes that you have provided your NLC credentials in your local .env file.

IMPORTANT: Update the local .env file to contain the following line in the [NLC] section:
    NLC_CLASSIFIER = YOUR_CLASSIFIER_ID


In [None]:
import json
import os
import logging
import configparser
import csv
from watson_developer_cloud import NaturalLanguageClassifierV1

logger = logging.getLogger()
logger.setLevel(logging.INFO)

#getting current directory
curdir = os.getcwd()
logger.debug(curdir)

#loading credentials from .env file
credFilePath = os.path.join(curdir,'..','.env')
config = configparser.ConfigParser()
config.read(credFilePath)
logger.debug(config.sections())

NLC_USERNAME = config['NLC']['NLC_USERNAME']
NLC_PASSWORD = config['NLC']['NLC_PASSWORD']

#please provide the path to the ground truth file generated
#on the previous step. The path provided by default points 
#to the available ground truth file example.
TRAINING_DATA = os.path.join(curdir,'..','data','output',
                           'ground_truth_replaced.csv')
CLASSIFIER_NAME = 'voc_classifier'
#initializing classifier
nlc = NaturalLanguageClassifierV1(username=NLC_USERNAME, 
                                  password=NLC_PASSWORD)

#training classifier
logger.debug('Classifier training initialized...')
with open(TRAINING_DATA, 'rb') as training_data:
    classifier = nlc.create(
        training_data=training_data,
        name=CLASSIFIER_NAME,
        language='en'
      )

classifiers = nlc.list()
if 'classifiers' in classifiers:
    for classifier in classifiers['classifiers']:
        if classifier['name'] == CLASSIFIER_NAME:
            logger.info('Your new classifier id is '
                        + str(classifier['classifier_id'])
                        + '. Please update your .env file with this information.')
logger.debug('Classifier training finished.')


# Troubleshooting

Some things to keep in mind while evaluating the results of the previous steps in your application:

1. Increase the size of the training set for the NLC. This generally improves performance, but with diminishing returns as the set grows.
2. Balance the training/testing sets. Each class needs a comparable level of representation in the training set so that the classifier can learn every class and does not over-predict the larger ones.
3. Skip entity replacement. Entity replacement is a way to generalize the data going into the NLC. If the dataset is already specific to a narrow domain, this generalization can reduce the accuracy of the system, and removing the step can increase it.
4. Check the WKS documentation and notebook for WKS troubleshooting steps.
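
For the second point, a quick way to spot imbalance is to count examples per class in the ground truth file. The sketch below assumes the text is in the first column and a class label in the second; the sample rows, the class names, and the `class_counts` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical ground truth rows: text in the first column, class label in the second.
SAMPLE_GROUND_TRUTH = """\
The <product> froze after a week,issue
Screen is bigger and easy to use,praise
Shipping was faster than expected,praise
Seller replied within minutes,praise
"""


def class_counts(csv_text):
    """Count how many training examples each class label has."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            counts[row[1]] += 1
    return counts


counts = class_counts(SAMPLE_GROUND_TRUTH)

# Ratio between the largest and smallest class; values far above 1 suggest imbalance.
ratio = max(counts.values()) / min(counts.values())
print(dict(counts), ratio)
```

What ratio counts as acceptable depends on the use case; the number is only a prompt to look at the under-represented classes.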

# Summary

At this point, you should have been able to:

1. Upload reviews to a Cloudant instance (or to any other persistence technique you have chosen to use);
2. Have a trained linguistic model created using Watson Knowledge Studio;
3. Have a trained Natural Language Classifier.