# Using the Debater services for analysing and finding insights in the Austin Survey dataset 
In this tutorial we will use a community survey conducted in the city of Austin in the years 2016 and 2017 (https://data.world/cityofaustin/mf9f-kvkk). In this survey, the citizens of Austin were asked "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?". We will analyse their open-ended answers in a few different ways.

This tutorial will demonstrate how to use the *Argument Quality* service, the *Key Point Analysis (KPA)* service, the *Term Wikifier* service and the *Term Relater* service. It will also demonstrate how they can be combined into a powerful text analysis mechanism.

## 1. Run Key Point Analysis (KPA) on 1000 randomly selected sentences from the 2016 survey

### 1.1 Read a random sample of 1000 sentences from the 2016 comments
We will first read the attached csv file into the 'sentences' variable. The dataset_austin_sentences.csv file holds the Austin survey dataset after sentence splitting. Each row in the csv is one sentence, and each sentence has the following attributes: ['id', 'text', 'district', 'year']

In [1]:
import csv
import random


with open('./dataset_austin_sentences.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    sentences = list(reader)

Let's have a look at the sentences at hand:

In [2]:
print('There are %d sentences in the dataset' % len(sentences))
print('Each sentence is a dictionary with the following keys: %s' % str(sentences[0].keys()))

There are 6274 sentences in the dataset
Each sentence is a dictionary with the following keys: dict_keys(['id', 'text', 'district', 'year'])
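
As a side note on the reading step above, csv.DictReader performs no type conversion, so every value (including 'year') is a string. The sketch below illustrates this on a tiny in-memory stand-in for the csv file; the rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Toy stand-in for dataset_austin_sentences.csv (these rows are invented).
toy_csv = io.StringIO(
    "id,text,district,year\n"
    "1,Fix the traffic.,5,2016\n"
    "2,Lower property taxes.,7,2017\n"
)
toy_sentences = list(csv.DictReader(toy_csv))

# Every value is a str, including the year.
print(type(toy_sentences[0]['year']))

# Filtering must therefore compare against the string '2016';
# comparing against the integer 2016 would silently match nothing.
toy_2016 = [s for s in toy_sentences if s['year'] == '2016']
print(len(toy_2016))
```

This is why later cells filter with `sentence['year'] == '2016'` rather than with an integer.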


Let's select only the sentences from the 2016 survey and randomly sample 1000 of them. The KPA service is able to run over hundreds of thousands of sentences. However, since the computation is heavy in resources (particularly GPUs), the trial version is limited to 1000 sentences. Calling random.seed(0) before sampling is important, since we have already prepared a hot cache over exactly these sentences for a quicker KPA run.

In [3]:
sentences_2016 = [sentence for sentence in sentences if sentence['year'] == '2016']
print('There are %d sentences in the 2016 survey' % len(sentences_2016))
random.seed(0)
random_sample_sentences_2016 = random.sample(sentences_2016, 1000)

There are 3005 sentences in the 2016 survey
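
The reason the seed matters can be checked in isolation: with the same seed and the same input list, random.sample returns the same sample, which is what lets a cache prepared in advance match the sentences selected here. The sketch below uses a list of integers as a stand-in for the 2016 sentences.

```python
import random

# Stand-in for the 3005 sentences of the 2016 survey.
population = list(range(3005))

random.seed(0)
first_sample = random.sample(population, 1000)

random.seed(0)
second_sample = random.sample(population, 1000)

# Same seed and same population give an identical sample, in the same order.
print(first_sample == second_sample)

# random.sample draws without replacement, so there are no duplicates.
print(len(set(first_sample)))
```

Note that the sample also depends on the order of the input list, so the csv must be read in the same order for the cached sample to be reproduced.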


### 1.2 Run KPA on the random sample

Full documentation of the KPA service can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoints_pydoc.html).<br/>
Let's initialize a few needed parameters. The DebaterApi object supplies the clients for the various Debater services. The clients print information using the logger, so a suitable verbosity level is needed. The api_key must be set; it can be retrieved from the early-access-program site. The KPA service stores the data (and a cache) in a domain. A user can create several domains, one for each dataset. We will run all KPA jobs in the same domain, named 'austin_demo'.

In [11]:
from debater_python_api.api.debater_api import DebaterApi
from austin_utils import init_logger
import os

init_logger()
api_key = os.environ['API_KEY']
debater_api = DebaterApi(apikey=api_key)
keypoints_client = debater_api.get_keypoints_client()
domain = 'austin_demo'
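
The cell above reads the key with os.environ['API_KEY'], which raises a bare KeyError when the variable is not set. A minimal sketch of a friendlier lookup follows; the helper name read_api_key is hypothetical and not part of the Debater SDK.

```python
import os


def read_api_key(var_name='API_KEY'):
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for this tutorial; not part of debater_python_api.
    """
    api_key = os.environ.get(var_name)
    if not api_key:
        raise RuntimeError(
            'Environment variable %s is not set. Export your early-access-program '
            'API key before running this notebook.' % var_name
        )
    return api_key
```

With this helper, the assignment in the cell could be written as `api_key = read_api_key()`.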


Exercise 1:<br/>
Let's define a method named "run_kpa". The method receives a list of sentences (each sentence is a dictionary with the keys 'id' and 'text') and runs KPA on these sentences. In order to run KPA, we need to:<br/>
1. Upload the comments into a domain using the "keypoints_client.upload_comments(domain, comments_ids, comments_texts, dont_split=True)" method. This method receives the domain, a list of comments_ids and a list of comments_texts. When uploading comments into a domain, the KPA service splits the comments into sentences and runs a minor cleansing on the sentences. Since we have already split the comments into sentences ourselves and we want the KPA service to use them as is, we will also set the parameter "dont_split" to True.<br/>
2. Wait until all comments in the domain are processed, using the "keypoints_client.wait_till_all_comments_are_processed(domain)" method.<br/>
3. Start a KPA job using the "keypoints_client.start_kp_analysis_job(domain, comments_ids, run_params)" method. This method receives the domain, a list of comments_ids and a "run_params" argument. The run_params argument is a dictionary with various parameters for customizing the job. One of the parameters we can set is 'n_top_kps', which tells the system how many key points are required. We will set it to 20, so we will use run_params={'n_top_kps': 20}. The job runs asynchronously and a future is returned.<br/>
4. Use the returned future to wait until the results are available, using the "future.get_result" method.

In [5]:
def run_kpa(sentences):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.upload_comments(domain=domain,
                                     comments_ids=sentences_ids,
                                     comments_texts=sentences_texts,
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids,
                                                    run_params={'n_top_kps': 20})

    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result

We will now use the method you implemented to run KPA over the random sample and print the result.

In [6]:
from austin_utils import print_results

kpa_result_random_1000_2016 = run_kpa(random_sample_sentences_2016)
print_results(kpa_result_random_1000_2016, n_sentences_per_kp=2, title='Random sample 2016')

2021-04-28 10:57:21,912 [INFO] keypoints_client.py 120: uploading 1000 comments in batches
2021-04-28 10:57:23,026 [INFO] keypoints_client.py 135: uploaded 1000 comments, out of 1000
2021-04-28 10:57:23,027 [INFO] keypoints_client.py 75: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2021-04-28 10:57:23,751 [INFO] keypoints_client.py 139: comments status: {'processed_comments': 0, 'pending_comments': 1000, 'processed_sentences': 0}
2021-04-28 10:57:33,756 [INFO] keypoints_client.py 75: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2021-04-28 10:57:34,414 [INFO] keypoints_client.py 139: comments status: {'processed_comments': 0, 'pending_comments': 1000, 'processed_sentences': 623}
2021-04-28 10:57:44,420 [INFO] keypoints_client.py 75: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2021-04-28 10:57:45,009 [INFO] keypoints_client.py 139: comments status: {'

2021-04-28 10:59:33,312 [INFO] keypoints_client.py 45: job_id 6089157a267f5852df0496a2 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 20, 'total_batches': 20, 'batch_size': 2000}}
2021-04-28 10:59:38,318 [INFO] keypoints_client.py 75: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2021-04-28 10:59:39,813 [INFO] keypoints_client.py 48: job_id 6089157a267f5852df0496a2 is done, returning result: {'keypoint_matchings': [{'keypoint': 'none', 'matching': [{'domain': 'austin_demo', 'comment_id': '16', 'sentence_id': 0, 'sents_in_comment': 1, 'span_start': 0, 'span_end': 124, 'num_tokens': 22, 'argument_quality': 0.4642207622528076, 'sentence_text': '2) Now that it is required by law for EVERYONE to have health insurance, you mist eliminate the "Travis County Health" tax!!', 'score': 0}, {'domain': 'austin_demo', 'comment_id': '37', 'sentence_id': 0, 'sents_in_comment': 1, 'span_start': 0, 'span_end': 50, 'num_tokens':

Random sample 2016 coverage: 28.36
Random sample 2016 key points:
39 - We need better mass transit!
	- Need more bus routes!
	- NEED BETTER PUBLIC TRANSPORTATION
34 - Affordable housing, traffic, cleaner streets/roads.
	- TRAFFIC AND AFFORDABLE HOUSING ARE THE BIGGEST PROBLEM TO LIVING HERE.
	- Also, the traffic in Austin is ridiculous and the lack of public transportation needs
	  improvements.
24 - Affordable housing is very important.
	- Affordable housing is crucial, & keeping seniors in their homes is part of that challenge!
	- Affordable housing MUST become a reality/ahora!
21 - Homeowners taxes are a problem.
	- Cost of living here is to high & tax for my home is to high.
	- High property taxes seem to be one of the two reasons why people I know are leaving
	  Austin (traffic being the other reason).
19 - Water costs too much.
	- The cost or water is excessive and way too high.
	- Also, the cost of my water bill is insanely high and I am about to protest it!
17 - Provide public 
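
The coverage figure printed above can be read as the share of sentences that were matched to some key point. The sketch below is one plausible way to compute it from the result structure visible in the log (a 'keypoint_matchings' list whose entries hold a 'keypoint' and a 'matching' list, with unmatched sentences grouped under the key point 'none'). It assumes each sentence appears under exactly one entry; the actual print_results helper in austin_utils may compute it differently.

```python
def coverage_percent(kpa_result):
    """Percentage of sentences matched to a key point other than 'none'.

    Assumes every sentence is listed under exactly one entry of
    'keypoint_matchings' (an assumption, not confirmed by the service docs).
    """
    matchings = kpa_result['keypoint_matchings']
    total = sum(len(entry['matching']) for entry in matchings)
    matched = sum(len(entry['matching']) for entry in matchings
                  if entry['keypoint'] != 'none')
    return 100.0 * matched / total if total else 0.0


# Toy result with invented sentences: 2 matched, 2 unmatched.
toy_result = {'keypoint_matchings': [
    {'keypoint': 'none', 'matching': [{'sentence_text': 'Hello.'},
                                      {'sentence_text': 'Thanks.'}]},
    {'keypoint': 'We need better mass transit!',
     'matching': [{'sentence_text': 'Need more bus routes!'}]},
    {'keypoint': 'Water costs too much.',
     'matching': [{'sentence_text': 'My water bill is too high.'}]},
]}
print(coverage_percent(toy_result))
```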

## 2. Run KPA on the 1000 top quality sentences from the 2016 survey
The Austin Survey dataset is noisy, and the answers and sentences vary in quality. Selecting the sentences randomly may lead to running over many sentences that are not informative. Instead, we will now select the more argumentative and informative sentences using the argument-quality service. We will calculate an argument-quality score for each sentence and select the 1000 sentences with the highest scores.

In [7]:
arg_quality_client = debater_api.get_argument_quality_client()
sentences_topic = [{'sentence': sentence['text'], 'topic': 'Austin'} for sentence in sentences_2016]
arg_quality_scores = arg_quality_client.run(sentences_topic)
sentences_2016_and_scores = zip(sentences_2016, arg_quality_scores)
sentences_2016_and_scores_sorted = sorted(sentences_2016_and_scores, key=lambda x: x[1], reverse=True)
sentences_2016_sorted = [sentence for sentence, _ in sentences_2016_and_scores_sorted]
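
The ranking step in this cell can be tried without the service. The sketch below pairs toy sentences with made-up scores and sorts by score only; the explicit key matters because sorting the (sentence, score) tuples directly would try to compare dictionaries and fail. heapq.nlargest is shown as an alternative when only the top k items are needed.

```python
import heapq

# Toy sentences and invented scores standing in for the service output.
toy_sentences = [{'id': '1', 'text': 'Traffic.'},
                 {'id': '2', 'text': 'Traffic congestion is getting worse every year.'},
                 {'id': '3', 'text': 'Please add more bus routes downtown.'}]
toy_scores = [0.31, 0.87, 0.55]

# Full ranking, highest score first (same pattern as the cell above).
ranked = sorted(zip(toy_sentences, toy_scores), key=lambda pair: pair[1], reverse=True)
ranked_sentences = [sentence for sentence, _ in ranked]
print([sentence['id'] for sentence in ranked_sentences])

# Only the top k, without sorting the whole list.
top_2 = [sentence for sentence, _ in
         heapq.nlargest(2, zip(toy_sentences, toy_scores), key=lambda pair: pair[1])]
print([sentence['id'] for sentence in top_2])
```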


Let's examine the top 10 and bottom 10 sentences and check whether the service is able to detect the higher quality sentences.

In [8]:
from austin_utils import split_sentences_to_lines
k = 10
top_sentences = sentences_2016_sorted[:k]
top_sentences = [sentence['text'] for sentence in top_sentences]
print('Top %d sentences: ' % k)
print('\n'.join(split_sentences_to_lines(top_sentences, 1)))

bottom_sentences = sentences_2016_sorted[-k:]
bottom_sentences = [sentence['text'] for sentence in bottom_sentences]
print('\n\nBottom %d sentences: ' % k)
print('\n'.join(split_sentences_to_lines(bottom_sentences, 1)))


We will now run the run_kpa method over the top 1000 quality sentences.

In [9]:
sentences_2016_top_1000_aq = sentences_2016_sorted[:1000]
kpa_result_top_aq_1000_2016 = run_kpa(sentences_2016_top_1000_aq)
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016')


Exercise 2:<br/>
We have reached a nice coverage of ???. In order to increase the coverage a little more, we will add another parameter to run_params, called mapping_threshold. We will reimplement the run_kpa method (please copy and paste the previous one and modify it), but this time the method will also receive a threshold parameter, which we will pass in run_params in the following way: run_params={'n_top_kps': 20, 'mapping_threshold': threshold}<br/>The mapping_threshold decides whether a sentence matches (supports) a key point. Therefore, reducing the threshold from its default value of 0.99 makes more sentences match key points and increases the coverage, at the risk of reducing the precision.<br/>In addition, the method will now also return the job_id stored in the future (using the future.get_job_id() method). We will need this job_id in the next exercise.

In [None]:
def run_kpa(sentences, threshold):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.upload_comments(domain=domain,
                                     comments_ids=sentences_ids,
                                     comments_texts=sentences_texts,
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids,
                                                    run_params={'n_top_kps': 20, 
                                                                'mapping_threshold': threshold})

    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result, future.get_job_id()
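
To get a feel for the coverage and precision trade-off described in this exercise, the sketch below counts how many toy match scores pass a given threshold. The scores are invented, and whether the service uses a strict or non-strict comparison is an assumption here.

```python
# Invented match scores between six sentences and their best key point.
toy_match_scores = [0.999, 0.993, 0.97, 0.96, 0.90, 0.42]


def matched_fraction(scores, threshold):
    """Fraction of scores above the threshold (strict comparison assumed)."""
    return sum(score > threshold for score in scores) / len(scores)


# Lowering the threshold from 0.99 to 0.95 lets more sentences match,
# but the newly admitted matches are the less confident ones.
print(matched_fraction(toy_match_scores, 0.99))
print(matched_fraction(toy_match_scores, 0.95))
```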

Let's now run again over the top 1000 quality sentences, this time with a threshold of 0.95.

In [None]:
# Reuse the top 1000 argument-quality sentences selected earlier.
kpa_result_top_aq_1000_2016, kpa_top_aq_1000_2016_job_id = run_kpa(sentences_2016_top_1000_aq, 0.95)
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016')

The coverage was indeed increased to ???. Let's examine the top 5 and bottom 5 sentences matched to the first KP and make sure we didn't sacrifice too much precision.

In [None]:
from austin_utils import print_top_and_bottom_matches_for_kp


print_top_and_bottom_matches_for_kp(kpa_result_top_aq_1000_2016, 'Traffic congestion needs major improvement', 5, 5)
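
The helper used above comes from austin_utils and its implementation is not shown in this tutorial. A minimal sketch of the idea, assuming the result structure seen in the earlier log output (each matching carries a 'score' and a 'sentence_text'), could look as follows; the function name is hypothetical.

```python
def top_and_bottom_matches(kpa_result, keypoint, n_top, n_bottom):
    """Return the n_top highest and n_bottom lowest scored matches of a key point.

    Hypothetical re-implementation for illustration; the real austin_utils
    helper may differ.
    """
    for kp_matching in kpa_result['keypoint_matchings']:
        if kp_matching['keypoint'] == keypoint:
            ranked = sorted(kp_matching['matching'],
                            key=lambda match: match['score'], reverse=True)
            top = ranked[:n_top]
            # ranked[-0:] would return the whole list, so handle n_bottom == 0.
            bottom = ranked[-n_bottom:] if n_bottom > 0 else []
            return top, bottom
    return [], []


# Toy result with invented sentences and scores.
toy_kp = 'Traffic congestion needs major improvement'
toy_result = {'keypoint_matchings': [{'keypoint': toy_kp, 'matching': [
    {'sentence_text': 'Traffic is awful.', 'score': 0.999},
    {'sentence_text': 'Roads are crowded at rush hour.', 'score': 0.96},
    {'sentence_text': 'Fix the congestion on the highway.', 'score': 0.98},
    {'sentence_text': 'Parking downtown is hard.', 'score': 0.955},
]}]}
top, bottom = top_and_bottom_matches(toy_result, toy_kp, 1, 1)
print(top[0]['sentence_text'], '|', bottom[0]['sentence_text'])
```

The lowest scored matches are the ones to inspect when judging whether the lower threshold admitted sentences that do not really support the key point.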

## 3. Run KPA over the 2017 survey using the key points from the 2016 survey
It is very useful to be able to compare different subsets of the data (for example, different years or different districts). We will now demonstrate how easy it is to compare the 2017 data to the 2016 data. A similar comparison can be done between districts or other subsets.

Let's first filter the 2017 sentences and take the top 1000 quality sentences, as done for the 2016 sentences.

In [None]:
sentences_2017 = [sentence for sentence in sentences if sentence['year'] == '2017']
sentences_topic = [{'sentence': sentence['text'], 'topic': 'Austin'} for sentence in sentences_2017]
arg_quality_scores = arg_quality_client.run(sentences_topic)
sentences_2017_and_scores = zip(sentences_2017, arg_quality_scores)
sentences_2017_and_scores_sorted = sorted(sentences_2017_and_scores, key=lambda x: x[1], reverse=True)
sentences_2017_sorted = [sentence for sentence, _ in sentences_2017_and_scores_sorted]
# Keep the 1000 highest-scoring 2017 sentences.
sentences_2017_top_1000_aq = sentences_2017_sorted[:1000]

Exercise 3:<br/>
In order to compare the 2017 sentences to the 2016 sentences, we want to map the 2017 sentences to the same key points that were extracted from the 2016 sentences (otherwise, different key points could be automatically extracted from the 2017 sentences and it would be hard to compare the two).
To this end, we will reimplement the run_kpa method (please copy and paste the previous one and modify it). This time it will receive a new "key_points_by_job_id" parameter, which is passed to the key_points_by_job_id parameter of the "keypoints_client.start_kp_analysis_job" method. When it is None, key points are extracted automatically. When it is set to the job_id of a previous job, the key points extracted in that job are used instead.

In [None]:
def run_kpa(sentences, threshold, key_points_by_job_id=None):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.upload_comments(domain=domain,
                                     comments_ids=sentences_ids,
                                     comments_texts=sentences_texts,
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids,
                                                    run_params={'n_top_kps': 20, 
                                                                'mapping_threshold': threshold},
                                                    key_points_by_job_id=key_points_by_job_id)

    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    return kpa_result, future.get_job_id()

Let's use the new run_kpa and provide it with the top 1000 quality sentences from 2017 and the job_id of the run over the top 1000 quality sentences from 2016.

In [None]:
kpa_result_top_aq_1000_2017, _ = run_kpa(sentences_2017_top_1000_aq, 0.95, kpa_top_aq_1000_2016_job_id)
print_results(kpa_result_top_aq_1000_2017, n_sentences_per_kp=2, title='Top aq 2017, using 2016 key points')

Since both jobs have the same key points, we can now easily compare the two results.

In [None]:
from austin_utils import compare_results

compare_results(kpa_result_top_aq_1000_2016, '2016', kpa_result_top_aq_1000_2017, '2017')
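
The compare_results helper is part of austin_utils and is not shown here. As a rough sketch of what such a comparison involves, the code below counts the sentences matched to each key point in two results and lists the difference. It assumes the result structure seen in the earlier log output; the function names are hypothetical.

```python
def keypoint_counts(kpa_result):
    """Map each key point (except 'none') to its number of matched sentences."""
    return {entry['keypoint']: len(entry['matching'])
            for entry in kpa_result['keypoint_matchings']
            if entry['keypoint'] != 'none'}


def compare_counts(result_a, result_b):
    """Return rows of (key point, count in a, count in b, change from a to b)."""
    counts_a = keypoint_counts(result_a)
    counts_b = keypoint_counts(result_b)
    # Keep the key point order of the first result, then any extras.
    keypoints = list(counts_a) + [kp for kp in counts_b if kp not in counts_a]
    return [(kp, counts_a.get(kp, 0), counts_b.get(kp, 0),
             counts_b.get(kp, 0) - counts_a.get(kp, 0)) for kp in keypoints]


# Toy results with invented counts.
toy_2016 = {'keypoint_matchings': [
    {'keypoint': 'none', 'matching': [{}] * 5},
    {'keypoint': 'We need better mass transit!', 'matching': [{}] * 3},
    {'keypoint': 'Water costs too much.', 'matching': [{}] * 2},
]}
toy_2017 = {'keypoint_matchings': [
    {'keypoint': 'We need better mass transit!', 'matching': [{}] * 4},
    {'keypoint': 'Water costs too much.', 'matching': [{}] * 1},
]}
for row in compare_counts(toy_2016, toy_2017):
    print('%-32s 2016: %d  2017: %d  change: %+d' % row)
```

Raw counts are comparable here because both runs use 1000 sentences; with samples of different sizes, percentages would be the safer basis.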

## 4. Expand on the traffic problem in Austin using the Term-Wikifier and Term-Relater services
As we've seen in the 2016 results, the traffic problem in Austin is significant. In this section we will use the Term-Wikifier and Term-Relater services to select a subset of the sentences that mention or are related to traffic, and run KPA over them. This will help us create many key points specific to the traffic problem and expose relevant complaints and suggestions on the topic.

Exercise 4:<br/>
Let's use the Term-Wikifier service to create a dictionary from sentences to their mentions. The method receives sentences_texts (a list of the sentence texts as strings) and runs the Term Wikifier over them. The Term Wikifier returns a list of mentions lists, one for each sentence. Each mention is a dictionary. We will extract the mention title this way: mention['concept']['title'].

In [None]:
def get_sentence_to_mentions(sentences_texts):
    term_wikifier_client = debater_api.get_term_wikifier_client()
    mentions_list_list = term_wikifier_client.run(sentences_texts)
    sentence_to_mentions = {}
    for sentence_text, mentions_list in zip(sentences_texts, mentions_list_list):
        sentence_to_mentions[sentence_text] = set([mention['concept']['title'] for mention in mentions_list])
    return sentence_to_mentions
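
To see the shape of the dictionary this method builds, the sketch below applies the same logic to hand-written mentions instead of real service output. Only the mention['concept']['title'] field is taken from the description above; the titles themselves are invented.

```python
toy_texts = ['Traffic on I-35 is terrible.', 'We need more buses.']

# Invented stand-in for the Term Wikifier output: one mentions list per sentence.
toy_mentions_lists = [
    [{'concept': {'title': 'Traffic'}}, {'concept': {'title': 'Interstate 35'}}],
    [{'concept': {'title': 'Bus'}}],
]

toy_sentence_to_mentions = {
    text: {mention['concept']['title'] for mention in mentions}
    for text, mentions in zip(toy_texts, toy_mentions_lists)
}
print(toy_sentence_to_mentions)
```

Because the dictionary is keyed by sentence text, two sentences with identical text share a single entry.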

Let's collect the texts of the 2016 sentences and get their mentions.

In [None]:
sentences_2016_texts = [sentence['text'] for sentence in sentences_2016]
sentence_to_mentions = get_sentence_to_mentions(sentences_2016_texts)

Since we're interested in the "traffic" concept, we will now take all mentions and find the ones that are related to that concept. Then we will select all sentences that have at least one mention related to the "traffic" concept.

In [None]:
all_mentions = [mention for sentence in sentence_to_mentions 
                   for mention in sentence_to_mentions[sentence]]
all_mentions = sorted(list(set(all_mentions)))

Exercise 5:<br/>
Implement a method that receives a concept, a threshold and all_mentions. It uses the Term-Relater service to calculate the relatedness between each mention and the concept, and returns all mentions whose relatedness score is above the given threshold.

In [None]:
def get_related_mentions(concept, threshold, all_mentions):
    term_relater_client = debater_api.get_term_relater_client()
    concept_mention_pairs = [[concept, mention] for mention in all_mentions]
    scores = term_relater_client.run(concept_mention_pairs)
    return [mention for mention, score in zip(all_mentions, scores) if score > threshold]

In [None]:
matched_mentions = get_related_mentions('traffic', 0.5, all_mentions)
print(matched_mentions)
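
The filtering step inside get_related_mentions can be checked offline by substituting made-up relatedness scores for the service call. The mentions and scores below are invented for illustration only.

```python
def filter_by_score(mentions, scores, threshold):
    """Keep the mentions whose score is strictly above the threshold."""
    return [mention for mention, score in zip(mentions, scores) if score > threshold]


toy_mentions = ['Bus', 'Library', 'Road', 'Traffic congestion']
toy_scores = [0.62, 0.08, 0.71, 0.93]  # invented relatedness to 'traffic'

print(filter_by_score(toy_mentions, toy_scores, 0.5))
```

Raising the threshold narrows the selection to mentions that are more tightly related to the concept, while lowering it widens the subset of sentences sent to KPA.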

We will now select the sentences that have mentions related to the "traffic" concept and run KPA over them. We will need to switch back from sentence texts to sentence dictionaries, since our run_kpa method needs the sentence dictionaries.

In [None]:
matched_sentences_texts = [sentence for sentence in sentences_2016_texts 
                     if len(sentence_to_mentions[sentence].intersection(matched_mentions)) > 0]
print('Running over %d sentences' % len(matched_sentences_texts))
matched_sentences = [sentence for sentence in sentences_2016 if sentence['text'] in matched_sentences_texts]
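
The selection above can be expressed compactly with set operations. The sketch below runs the same idea on toy data and uses sets for the membership tests, which avoids scanning a list once per sentence; all names and values are invented for illustration.

```python
toy_sentences = [{'id': '1', 'text': 'Traffic on I-35 is terrible.'},
                 {'id': '2', 'text': 'The library hours are too short.'},
                 {'id': '3', 'text': 'We need more buses.'}]
toy_sentence_to_mentions = {
    'Traffic on I-35 is terrible.': {'Traffic', 'Interstate 35'},
    'The library hours are too short.': {'Library'},
    'We need more buses.': {'Bus'},
}
toy_matched_mentions = {'Traffic', 'Bus'}

# A sentence is selected when its mentions overlap the matched mentions.
toy_selected = [sentence for sentence in toy_sentences
                if not toy_sentence_to_mentions[sentence['text']].isdisjoint(toy_matched_mentions)]
print([sentence['id'] for sentence in toy_selected])
```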

Finally, let's run KPA over these sentences and examine the "traffic" related key points.

In [None]:
kpa_result_traffic_2016, _ = run_kpa(matched_sentences, None)
print_results(kpa_result_traffic_2016, n_sentences_per_kp=2, title='Traffic KPA 2016')