## Active Learning
#### First step in sampling from WDS

This notebook demonstrates a single step of active learning on Watson Discovery Service (WDS) output.
Completing the full active learning cycle would require more work.

This currently uses *uncertainty sampling* to select passages to annotate, but this could be revisited.

In [None]:
# necessary imports
import json
import pandas as pd
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import BasicAuthenticator

from toal.stores import BasicStore
from toal.stores.loaders import load_from_watson_discover, load_w_annotation_units_from_watson_discover
from toal.learners import MulticlassWatsonDiscoveryLearner
from toal.samplers import MulticlassUncertaintySampler, MulticlassDensitySampler, MulticlassClusterSampler, MulticlassComplexSampler

# just to make results easier to display in notebook
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 2000

In [None]:
username = "add here"
password = "add here"
url = "add here"
version = "add here"
collection_id = "add here"
environment_id = "add here"

In [None]:
# connect to WDS
authenticator = BasicAuthenticator(username, password)
wds = DiscoveryV1(
    version=version,
    authenticator=authenticator)

wds.set_service_url(url)


We have a trained (WKS) model, and we can query a (WDS) collection to get (WKS) predictions on that unlabeled collection data.

You can filter the response by adding an entity or relation type to the query, or set the query to `None` to run active learning over everything.

In [None]:
# 1) filter by entity type
# toi = "SELL_ACQ"  # 'type' of interest
# query = "enriched_text.entities.type:" + toi
# 2) filter by relation type
toi = "transacted"  # type of interest, e.g. is_acquiring, transacted, is_promoted
query = "enriched_text.relations.type:" + toi
# 3) use None to avoid filtering by entity or relation type
# toi = None
# query = None

# query(self, environment_id, collection_id, filter=None, query=None, natural_language_query=None, passages=None, aggregation=None, 
#       count=None, return_=None, offset=None, sort=None, highlight=None, passages_fields=None, passages_count=None, 
#       passages_characters=None, deduplicate=None, deduplicate_field=None, similar=None, similar_document_ids=None, 
#       similar_fields=None, bias=None, spelling_suggestions=None, x_watson_logging_opt_out=None, **kwargs)
q_response = wds.query(environment_id, collection_id, query=query, count=1000).get_result()

print("Query matched: %s" % q_response['matching_results'])

We would like to use active learning to find the most promising data to annotate in order to improve the machine learning model.  We start with *uncertainty* sampling using the *least confidence* strategy.  This is a simple but effective baseline: it selects the instances whose top predicted label has the lowest confidence, since annotating those is expected to be the most informative for the model.
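
To make the least confidence idea concrete, here is a minimal sketch in plain Python. It is not the `toal` implementation; the function names and the toy probabilities are assumptions made for illustration only. Each instance is scored as one minus the probability of its top label, and the highest-scoring instances are selected.

```python
def least_confidence(probs):
    """Least-confidence score: 1 minus the probability of the top label."""
    return 1.0 - max(probs)


def select_least_confident(all_probs, batch_size):
    """Return the indices of the `batch_size` most uncertain instances."""
    scores = [least_confidence(p) for p in all_probs]
    # Sort indices by score, most uncertain first (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical class probabilities for four instances (illustrative only).
all_probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],  # close to uniform, so very uncertain
]
selected = select_least_confident(all_probs, batch_size=2)
print(selected)
```

In this toy data the near-uniform instance ranks first and the confident one ranks last.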

Optionally, we can combine clustering with active learning so that the sampler picks a greater variety of uncertain examples.  However, this may also sample from some high-confidence clusters.
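
One possible way to combine clustering with uncertainty is sketched below. This is an assumption about the general technique, not a description of how `MulticlassClusterSampler` works internally. It assumes cluster assignments are already available (for example from a clustering step over instance features) and takes the most uncertain instance from each cluster in turn. The function name and data are hypothetical.

```python
from collections import defaultdict


def sample_across_clusters(scores, cluster_ids, batch_size):
    """Pick instances round-robin across clusters, most uncertain first."""
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # One queue per cluster, ordered from most to least uncertain.
    queues = [
        sorted(members, key=lambda i: scores[i], reverse=True)
        for _, members in sorted(by_cluster.items())
    ]

    selected = []
    while len(selected) < batch_size and any(queues):
        for queue in queues:
            if queue and len(selected) < batch_size:
                selected.append(queue.pop(0))
    return selected


# Hypothetical uncertainty scores and cluster assignments.
scores = [0.66, 0.60, 0.58, 0.10, 0.05]
cluster_ids = [0, 0, 0, 1, 1]
picked = sample_across_clusters(scores, cluster_ids, batch_size=3)
print(picked)
```

With this toy input, instance 3 is picked even though its score is low, because it is the most uncertain member of its cluster. That is the trade-off noted above: more variety, at the cost of sometimes sampling from high-confidence clusters.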

In [None]:
# annotation batch size
batch_size = min(50, len(q_response['results']))
enrichment = 'relations'  # entities or relations

# use active learning toolkit
store = BasicStore()
store.append_data( *load_from_watson_discover(q_response, enrichment=enrichment, filter_type=toi))
#store.append_data( *load_w_annotation_units_from_watson_discover(q_response, enrichment=enrichment, filter_type=toi))

learner = MulticlassWatsonDiscoveryLearner(q_response, enrichment=enrichment, filter_type=toi)

# basic (but still effective) sampler
sampler = MulticlassUncertaintySampler(learner, strategy='lc')
# sampler with clustering
# sampler = MulticlassClusterSampler(learner)
sampled_df = sampler.choose_instances(store, batch_size=20)
sampled_df

The samplers above treat each machine learning instance (e.g., entity or relation) separately and try to choose the best ones for annotation.  However, annotation is done over sentences or passages.  The active learning sampling below tries to choose the best sentences/passages (each possibly comprising multiple entities/relations) for annotation.
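
As a rough illustration of aggregating instance scores into sentence-level scores, the sketch below groups hypothetical per-instance uncertainties by sentence and ranks the sentences under three simple aggregations. The aggregation names here (`sum`, `mean`, `max`) are illustrative assumptions and are not guaranteed to match the semantics of the `selection` options in `MulticlassComplexSampler`.

```python
from collections import defaultdict
from statistics import mean

# Illustrative aggregations only; not the toal `selection` options.
AGGREGATORS = {"sum": sum, "mean": mean, "max": max}


def rank_units(instances, how="sum"):
    """Rank annotation units (e.g. sentences) by aggregated uncertainty."""
    grouped = defaultdict(list)
    for unit_id, uncertainty in instances:
        grouped[unit_id].append(uncertainty)
    aggregate = AGGREGATORS[how]
    unit_scores = {unit: aggregate(vals) for unit, vals in grouped.items()}
    # Highest aggregated uncertainty first.
    return sorted(unit_scores, key=unit_scores.get, reverse=True)


# Hypothetical (sentence id, instance uncertainty) pairs.
instances = [
    ("s1", 0.60), ("s1", 0.10),
    ("s2", 0.45),
    ("s3", 0.30), ("s3", 0.30), ("s3", 0.20),
]
for how in AGGREGATORS:
    print(how, rank_units(instances, how))
```

The three aggregations produce three different rankings on this data: summing favors sentences with many moderately uncertain instances, averaging favors sentences that are uncertain throughout, and taking the maximum favors sentences with a single very uncertain instance.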

In [None]:
# aggregate predictions across a sentence
store = BasicStore()
store.append_data( *load_w_annotation_units_from_watson_discover(q_response, enrichment=enrichment, filter_type=toi))
sampler = MulticlassComplexSampler(learner)
sampled_df = sampler.choose_instances(store, batch_size=batch_size, selection='least_worst')  # selection options: sum, mean, best, least_worst
sampled_df['text']