# Gathering Data
The first step is to find a data set to analyze and store it where the rest of the flow can read it. We recommend a database that stores the data as JSON, but any storage works as long as it is accessible throughout the flow: a JSON file, a SQL database, a NoSQL database, a CSV file, and so on. This starter kit stores its data in a Cloudant NoSQL database (available on Bluemix), which is the simplest way to get started and requires the fewest changes to the scripts. We provide a few scripts that push to and pull from this database to make this even easier.

The minimum data needed is the review/feedback text and a way to link each review to the product it targets (most likely a key or ID that marks a review as belonging to a product).

Push the dataset to your database or file in whatever format is easiest to index and use later in the flow. For reference, here is an example of a review that we used:

TODO: ADD JSON STRING HERE

Key things to note: storing the data this way makes it easy to retrieve the text of a review and to find out which product it refers to, so that the rest of the flow can be executed easily.
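
As a rough illustration of the minimum record shape described above, the sketch below stores a few reviews in a local JSON file and looks them up by product. The field names `product_id` and `review_text`, the file name `reviews.json`, the sample reviews, and both helper functions are assumptions made for illustration; they are not the schema or the example review used by this starter kit.

```python
import json
import os
import tempfile

# Hypothetical records: field names and contents are illustrative only.
reviews = [
    {"product_id": "P100", "review_text": "Battery life is great."},
    {"product_id": "P100", "review_text": "The strap broke after a week."},
    {"product_id": "P200", "review_text": "Arrived late but works fine."},
]


def save_reviews(records, path):
    """Write the list of review records to a JSON file."""
    with open(path, "w") as handle:
        json.dump(records, handle, indent=2)


def load_reviews_for_product(path, product_id):
    """Return the review texts linked to one product ID."""
    with open(path) as handle:
        records = json.load(handle)
    return [r["review_text"] for r in records if r["product_id"] == product_id]


# Use a temporary directory so the demo does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "reviews.json")
save_reviews(reviews, path)
p100_texts = load_reviews_for_product(path, "P100")
print(p100_texts)
```

The same two pieces of information (the text and the product link) are all that later steps rely on, whichever storage backend is used.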

# Converting Data into CSV
The data must be converted into a form that can be used to train the entity extraction models. To do this, run the script below. Edit the database connection settings so that they point to the database where your data is stored, or rewrite that portion so that it reads from wherever the data is located (if not in a Cloudant database).

The important part is the output. The CSV that Watson Knowledge Studio ingests has the following format:

"key","value"
"TITLE OF DOCUMENT","TEXT OF DOCUMENT"
...
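
To make the target layout concrete, here is a small sketch that writes two made-up documents in this shape using only the standard library. It assumes every field should be double-quoted, as in the sample above (hence `csv.QUOTE_ALL`); the titles, the text, and the `rows_to_csv` helper are hypothetical. Whether WKS expects a header row is not stated here, so none is written.

```python
import csv
import io


def rows_to_csv(rows):
    """Return CSV text with one fully quoted (title, text) row per document."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for title, text in rows:
        writer.writerow([title, text])
    return buffer.getvalue()


# Hypothetical documents; the second field deliberately contains quotes and a comma.
sample_rows = [
    ("review-001", 'The "zoom" lens is sharp, but heavy.'),
    ("review-002", "Works as described."),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Letting the `csv` module handle quoting means commas and quotation marks inside review text do not split a document across columns.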

# Import Data CSV into Watson Knowledge Studio
The CSV created by the script must now be imported into Watson Knowledge Studio (WKS). WKS treats each row in the CSV as a separate document. The data does not have to be in CSV format to be uploaded into WKS; documents can also be uploaded manually, but we provide the conversion script to make this easier.

The documents then need to be annotated with entities and relationships. Coreference annotation is also done at this stage. Following a strict annotation guideline is very helpful here.

For guidelines and tips on Watson Knowledge Studio, see /notebooks/WKS.md

After the annotations are done and the model is trained, the model needs to be exported to Alchemy Language using an API key. To do this, first get an Alchemy API key from Bluemix, then follow the WKS documentation referenced above.

In [None]:
import cloudant
import csv

# Replace with your server URL
SERVER = ''
# Replace with the name of the database
DATABASE = ''
# Replace with the username from your Cloudant credentials
USERNAME = ''
# Replace with the password from your Cloudant credentials
PASSWORD = ''
# Replace with the name of the design document that contains the view.
# This should be of the form '_design/XXXX'
DESIGN = ''
# Replace with the view from your database to poll. This should take the
# form of view_file/view and should return the text to classify as the
# value field and what you would like to call it as the key
VIEW = ''
# Replace with the correct name for the output file (NOTE: must be *.csv)
DESTINATION = ''

server = cloudant.client.Cloudant(USERNAME, PASSWORD, url=SERVER)
server.connect()
db = server[DATABASE]
query = db.get_view_result(DESIGN, VIEW)
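# 'wb' is the mode Python 2's csv module expects; on Python 3 use
# open(DESTINATION, 'w', newline='') instead.
file = open(DESTINATION, 'wb')
writer = csv.writer(file)

for q in query:
    print(q[0])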
    if 'key' in q[0] and q[0]['key'] is not None:
        title = q[0]['key']
    else:
        title = "No Title"
    if 'value' in q[0] and q[0]['value'] is not None:
        text = q[0]['value']
    else:
        text = "No Text"
    writer.writerow([title, text])

file.close()


# Replace entities with Tags in Training Sets
You will need to run your training set CSV through the script below, which replaces any entities found by entity extraction in Alchemy Language with a representative tag in order to generalize the classifier in the coming steps. Set the input and output CSV filenames and run it.

In [None]:
import token_replacement as t
import csv

# Replace both filenames with your own input and output CSV paths
read = open('ground_truth_layer1.csv', 'rb')
write = open('ground_truth_layer1_replace.csv', 'wb')

reader = csv.reader(read)
writer = csv.writer(write)

for row in reader:
    token = t.token_replacement(row[0])
    writer.writerow([token,row[1]])

read.close()
write.close()
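
The `token_replacement` module ships with the starter kit and its internals are not shown here. As a rough sketch of the idea only, the function below swaps each entity mention for a tag built from its type. The `replace_entities` name, the `(mention, type)` input, the `<TYPE>` tag format, and the sample sentence are all assumptions for illustration; in the real flow the entities would come from Alchemy Language entity extraction rather than a hard-coded list.

```python
import re


def replace_entities(text, entities):
    """Replace each entity mention in text with an uppercase <TYPE> tag.

    entities is a list of (mention, entity_type) pairs. Longer mentions are
    replaced first so a short mention does not clobber part of a longer one.
    """
    ordered = sorted(entities, key=lambda pair: len(pair[0]), reverse=True)
    for mention, entity_type in ordered:
        # Match the mention as a whole word or phrase, ignoring case.
        pattern = r"\b" + re.escape(mention) + r"\b"
        tag = "<" + entity_type.upper() + ">"
        text = re.sub(pattern, lambda match: tag, text, flags=re.IGNORECASE)
    return text


# Hypothetical sentence and entity list, standing in for extraction output.
sentence = "My Galaxy S7 stopped charging after the Android update."
found = [("Galaxy S7", "phone"), ("Android", "os")]
tagged = replace_entities(sentence, found)
print(tagged)
```

Replacing specific product names with a shared tag is what lets the classifier learn from the sentence pattern rather than from individual entity names.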

# Train Natural Language Classifier
The Natural Language Classifier (NLC) needs to be trained on this generalized, tagged data. This is a three-tier NLC, which is intended to provide better classification results than a single layer. Add your credentials to the script below and run it to train the NLC. The script also stores your classifier IDs in a JSON file.

In [None]:
import json
from watson_developer_cloud import NaturalLanguageClassifierV1

tier1CSV = 'training_set1.csv'
tier2CSV = 'training_set2.csv'
tier3CSV = 'training_set3.csv'
USERNAME = ''    # Replace with the username from your NLC service credentials
PASSWORD = ''    # Replace with the password from your NLC service credentials
JSON_TARGET = '../../data/classifier_ids.json'

classifierTree = {
    'tier1':'',
    'tier2':'',
    'tier3':''
}

# Initialize classifier
nlc = NaturalLanguageClassifierV1(username = USERNAME, password = PASSWORD)

# Train one classifier per tier and record its classifier_id
tierCSVs = [('tier1', tier1CSV), ('tier2', tier2CSV), ('tier3', tier3CSV)]
for index, (tier, csvPath) in enumerate(tierCSVs, start=1):
    print("############# TIER %d CLASSIFIER ##############" % index)
    with open(csvPath, 'rb') as training_data:
        classifier = nlc.create(
            training_data=training_data,
            name=tier,
            language='en'
        )
    print(json.dumps(classifier, indent=2))
    classifierTree[tier] = classifier['classifier_id']

# Write the tiers with their classifier IDs to a file for later use
with open(JSON_TARGET, 'w') as outfile:
    json.dump(classifierTree, outfile)

print("############# FULL CLASSIFIER TREE ##############")
print(json.dumps(classifierTree))
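
Later steps need these IDs, so a small loader that fails early on a missing tier can save debugging time. The sketch below is one possible way to read the file back; the `load_classifier_ids` helper and `REQUIRED_TIERS` are hypothetical names, and the demo writes placeholder IDs to a temporary file rather than touching `../../data/classifier_ids.json`.

```python
import json
import os
import tempfile

REQUIRED_TIERS = ('tier1', 'tier2', 'tier3')


def load_classifier_ids(path):
    """Load the tier -> classifier_id mapping and check every tier is filled in."""
    with open(path) as infile:
        tree = json.load(infile)
    missing = [tier for tier in REQUIRED_TIERS if not tree.get(tier)]
    if missing:
        raise ValueError("No classifier_id stored for: " + ", ".join(missing))
    return tree


# Demo with placeholder IDs (these are not real classifier IDs).
demo_path = os.path.join(tempfile.mkdtemp(), 'classifier_ids.json')
with open(demo_path, 'w') as outfile:
    json.dump({'tier1': 'id-1', 'tier2': 'id-2', 'tier3': 'id-3'}, outfile)

ids = load_classifier_ids(demo_path)
print(ids['tier1'])
```

An empty string for any tier, which is what `classifierTree` starts with, is treated as missing, so a training run that stopped partway is caught here instead of at classification time.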