# Using the *Key Point Analysis* service to analyze and find insights in survey data

#### **Important Notice**: This tutorial describes the legacy SDK (debater-python-api versions up to 4.3.2). The tutorial for the updated SDK, starting from debater-python-api version 5.0.0, is available [here](new_sdk/kpa_quick_start_tutorial-with_results.ipynb).

When you have a large collection of texts representing people’s opinions (such as product reviews, survey answers or social media posts), it is difficult to understand the key issues that come up in the data. Going over thousands of comments manually is prohibitively expensive. Existing automated approaches are often limited to identifying recurring phrases or concepts and the overall sentiment toward them, but do not provide detailed or actionable insights.

In this tutorial you will gain hands-on experience in using *Key Point Analysis* (KPA) for analyzing and deriving insights from open-ended answers.  

The data we will use is a [community survey conducted in the city of Austin](https://data.austintexas.gov/dataset/Community-Survey/s2py-ceb7). In this survey, the citizens of Austin were asked "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?". 

## 1. Run *Key Point Analysis* (data from 2016)

Let's first import all the packages required for this tutorial and initialize the *Key Point Analysis* client. The client prints information using the logger, so a suitable verbosity level should be set. The client object is configured with an API key, which should be retrieved from the [Project Debater Early Access Program](https://early-access-program.debater.res.ibm.com/) site. In the code below it is read from the environment variable *DEBATER_API_KEY* (you may also modify the code and place the API key in it directly).

In [None]:
from debater_python_api.api.clients.keypoints_client import KpAnalysisClient, KpAnalysisTaskFuture
from debater_python_api.api.clients.key_point_analysis.KpAnalysisUtils import KpAnalysisUtils
import os
import csv
import random

KpAnalysisUtils.init_logger()
api_key = os.environ['DEBATER_API_KEY']
host = 'https://keypoint-matching-backend.debater.res.ibm.com'
keypoints_client = KpAnalysisClient(api_key, host)
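
If the environment variable is not set, `os.environ['DEBATER_API_KEY']` fails with a bare `KeyError`. The following is a minimal sketch of a friendlier lookup. The helper name `read_api_key` is hypothetical and is not part of the SDK.

```python
import os


def read_api_key(env=None, var_name='DEBATER_API_KEY'):
    """Return the API key stored in `var_name`, or raise a readable error.

    Hypothetical helper for this tutorial, not part of debater_python_api.
    """
    env = os.environ if env is None else env
    api_key = env.get(var_name, '').strip()  # ignore accidental whitespace
    if not api_key:
        raise RuntimeError(f'Please set the {var_name} environment variable')
    return api_key
```

The returned value can then be passed to `KpAnalysisClient(api_key, host)` exactly as in the cell above.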

### 1.1 Read the data and run *key point analysis* over it
Let's read the data from the *dataset_austin.csv* file, which holds the Austin survey dataset, and print the first comment.

In [None]:
with open('./dataset_austin.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    comments = [dict(d) for d in reader]

print(f'There are {len(comments)} comments in the dataset')
print(comments[0])
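
The rest of the tutorial relies on the 'id', 'text' and 'year' columns. As a hedged sketch, the reading step could check for them up front. The function name `read_comments` and the in-memory example row are made up for illustration and are not taken from the Austin dataset.

```python
import csv
import io

REQUIRED_COLUMNS = ('id', 'text', 'year')  # columns this tutorial relies on


def read_comments(csv_file):
    """Read comments from an open CSV file and check the expected columns."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return [dict(row) for row in reader]


# Tiny in-memory example (a made-up row, not real survey data).
example_csv = io.StringIO('id,text,year\n1,More bike lanes please,2016\n')
example_comments = read_comments(example_csv)
```

With the real file, the same function could be called as `read_comments(open('./dataset_austin.csv'))` inside a `with` block.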

Each comment is a dictionary with a unique identifier 'id', a 'text' and a 'year'. We will first remove all comments with text longer than 1000 characters, since this is a system limit. Then we will filter the comments and keep only the ones from 2016.

The *Key Point Analysis* service is able to run over hundreds of thousands of sentences. However, since the computation is resource-intensive (particularly in GPUs), the trial version is limited to 1000 comments. You may request to increase this limit if needed. Since we want the tutorial to be relatively fast and lightweight, we will only run on a sample of 400 comments. Note that running over a larger set improves both the quality and the coverage of the results.

In [None]:
comments = [c for c in comments if len(c['text'])<=1000]
comments_2016 = [c for c in comments if c['year'] == '2016']
sample_size = 400
random.seed(0)
comments_2016_sample = random.sample(comments_2016, sample_size)
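
Note that `random.sample` raises a `ValueError` when the requested sample is larger than the population. A minimal sketch of a reusable variant that caps the sample size is shown here. The helper name `filter_and_sample` and the toy comments are illustrative assumptions.

```python
import random


def filter_and_sample(comments, year, sample_size, max_chars=1000, seed=0):
    """Keep comments of `year` within the length limit, then sample them."""
    eligible = [c for c in comments
                if len(c['text']) <= max_chars and c['year'] == year]
    rng = random.Random(seed)  # local generator, leaves the global state alone
    # Cap the sample size so small datasets do not raise ValueError.
    return rng.sample(eligible, min(sample_size, len(eligible)))


# Made-up comments for illustration only.
toy_comments = [
    {'id': '1', 'text': 'Fix the roads', 'year': '2016'},
    {'id': '2', 'text': 'x' * 1500, 'year': '2016'},  # too long, dropped
    {'id': '3', 'text': 'More parks', 'year': '2017'},
    {'id': '4', 'text': 'Lower taxes', 'year': '2016'},
]
toy_sample = filter_and_sample(toy_comments, '2016', sample_size=400)
```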

*Key point analysis* is a novel and promising approach for summarization, with an important quantitative angle. This service summarizes a collection of comments on a given topic as a small set of key points. The salience of each key point is given by the number of its matching sentences in the given comments.

In order to run *Key Point Analysis*, do the following steps:

### 1.2 Create a domain
The *Key Point Analysis* service stores the data (and cached-results) in a *domain*. A user can create several domains, one for each dataset. Domains are only accessible to the user who created them.

Create a domain using the **keypoints_client.create_domain(domain=domain, domain_params={})** method. Several parameters can be passed in the domain_params dictionary when creating a domain, as described in the documentation. Leaving it empty gives us a good default behaviour. You can also use **KpAnalysisUtils.create_domain_ignore_exists(client=keypoints_client, domain=domain, domain_params={})** if you don't want an exception to be thrown when the domain already exists (note that in such a case the domain_params are not updated and remain as they were before). In this tutorial we will first delete the domain if it exists, since we want to start with an empty domain.

Full documentation of the supported *domain_params* and how they affect the domain can be found [here](kpa_parameters.pdf).

In [None]:
domain = 'austin_demo'
KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain=domain)
keypoints_client.create_domain(domain=domain, domain_params={})

A few domain-related points:
* We can always delete a domain we no longer need using: **KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain=domain)**
* Keep in mind that a domain has a state. It stores all the comments that have been uploaded into it and a cache with all the calculations performed over this data.
* If we want to restart and run over the domain from scratch (no comments and no cache), we can delete the domain and then re-create it, or simply use a different domain. Keep in mind that the cache is also cleared, so subsequent runs will take longer.

### 1.3 Upload comments into the domain
Upload the comments into the domain using the **keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)** method. This method receives the domain, a list of comments_ids and a list of comments_texts. When uploading comments into a domain, the *Key Point Analysis* service splits the comments into sentences and runs a minor cleansing on the sentences. If you have domain-specific knowledge and want to split the comments into sentences yourself, you can upload comments that are already split into sentences and set the *dont_split* parameter to True (in the domain_params when creating the domain). *Key Point Analysis* will then use the provided sentences as is.

Note that:
* Comments_ids must be unique
* The number of comments_ids must match the number of comments_texts
* Comments_texts must not be longer than 1000 characters
* Uploading the same comment several times is not a problem: a comment with the same domain and comment_id is only uploaded once. The comment_text is ignored in this check, so if it is different, the stored comment is NOT updated.
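
The first three constraints above can be checked locally before uploading. The following is a minimal sketch. The helper name `validate_comments` is hypothetical and the checks only mirror the rules listed in this section.

```python
def validate_comments(comments_ids, comments_texts, max_chars=1000):
    """Return a list of problems found; an empty list means the input looks fine."""
    problems = []
    if len(comments_ids) != len(comments_texts):
        problems.append('number of ids differs from number of texts')
    if len(set(comments_ids)) != len(comments_ids):
        problems.append('comment ids are not unique')
    # Collect the ids of texts that exceed the length limit.
    too_long = [i for i, t in zip(comments_ids, comments_texts) if len(t) > max_chars]
    if too_long:
        problems.append(f'texts longer than {max_chars} characters: {too_long}')
    return problems
```

Calling `validate_comments(comments_ids, comments_texts)` just before the upload and stopping when the returned list is not empty avoids a failed request.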

In [None]:
comments_texts = [comment['text'] for comment in comments_2016_sample]
comments_ids = [comment['id'] for comment in comments_2016_sample]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)

### 1.4 Wait for the comments to be processed
Comments that are uploaded to the domain need to be processed. This takes some time and runs asynchronously. We can't run an analysis before this phase finishes, so we need to wait until all comments in the domain are processed using the **keypoints_client.wait_till_all_comments_are_processed(domain=domain)** method.

In [None]:
keypoints_client.wait_till_all_comments_are_processed(domain=domain)
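
Waiting for asynchronous work is usually done by polling a status check with a timeout. The sketch below shows that generic pattern only. It is not how the SDK implements waiting, and `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(is_done, polling_secs=10, timeout_secs=600):
    """Call `is_done()` repeatedly until it returns True or the timeout passes.

    Returns True when the check succeeded and False on timeout.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(polling_secs)  # wait before checking the status again
```

Returning `False` on timeout, rather than blocking forever, lets the caller decide whether to retry or give up.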

### 1.5 Start a Key Point Analysis job
Start a *Key Point Analysis* job using the **future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)** method. This method receives the domain and *run_params*, a dictionary with various parameters for customizing the job. Leaving it empty gives us a good default behaviour. The job runs asynchronously, so the method returns a future object.

A few additional options when running an analysis job:
* The analysis is performed over all comments in the domain. If we need to run over a subset of the comments (e.g. to split the data by GEOs, user types, timeframes, etc.), we can pass a list of comment ids to the comments_ids parameter and the analysis will use only the provided comments.
* By default, key points are extracted automatically. When we want to provide key points and match all sentences to them, we can do so by passing them to the keypoints parameter: **run_params['keypoints'] = [...]**. This enables a mode of work named human-in-the-loop: we first extract key points automatically, then manually edit them (refine imperfect key points, remove duplicates and add missing ones), and then run again, this time providing the edited key points as a given set of key points.
* It is also possible to provide key points and let KPA add additional missing key points. To do so, pass the key points to the keypoint_candidates parameter: **run_params['keypoint_candidates'] = [...]** (see section 3 for a detailed example).
* Full documentation of the supported *domain_params* and *run_params* and how they affect the analysis can be found [here](kpa_parameters.pdf).

In [None]:
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params={})

### 1.6 Wait for the Key Point Analysis job to finish
Use the returned future and wait until results are available using the **kpa_result = future.get_result()** method. The method waits for the job to finish and eventually returns the result. The result is a dictionary containing the key points (sorted in descending order of the number of matched sentences) and, for each key point, a list of matched sentences (sorted in descending order of their match score). An additional 'none' key point is added, which holds all the sentences that don't match any key point.

In [None]:
kpa_result_2016 = future.get_result(high_verbosity=True, polling_timout_secs=30)

Let's print the results:

In [None]:
KpAnalysisUtils.print_result(kpa_result_2016, n_sentences_per_kp=2, title='2016 Random sample')
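
To work with the result programmatically, it can help to reduce it to per-key-point counts. The sketch below assumes that the result dictionary has a `keypoint_matchings` list whose entries hold a `keypoint` string and a `matching` list. These field names are an assumption about the layout and should be verified against the actual result (for example with `kpa_result_2016.keys()`). The toy result is made up.

```python
def key_point_counts(result):
    """Return (key point, number of matched sentences) pairs, largest first."""
    counts = [(kp['keypoint'], len(kp['matching']))
              for kp in result['keypoint_matchings']]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up result in the assumed layout (not real output of the service).
toy_result = {'keypoint_matchings': [
    {'keypoint': 'none', 'matching': [{'sentence_text': 'Hello'}]},
    {'keypoint': 'Improve traffic', 'matching': [
        {'sentence_text': 'Traffic is terrible'},
        {'sentence_text': 'Fix the traffic on I-35'},
        {'sentence_text': 'Too many cars downtown'}]},
    {'keypoint': 'Lower taxes', 'matching': [
        {'sentence_text': 'Taxes are too high'},
        {'sentence_text': 'Please lower property taxes'}]},
]}
toy_counts = key_point_counts(toy_result)
```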

We can also save the results to a file. This creates two files: one with the key points and all their matched sentences, and a summary file with only the key points and their salience.

In [None]:
KpAnalysisUtils.write_result_to_csv(kpa_result_2016, 'austin_survey_2016_kpa_results.csv')

It is always possible to cancel a pending/running job in the following way:
* **keypoints_client.cancel_kp_extraction_job(\<Job Id\>)**

The job ID can be found in any of the following ways:
1. It is printed when a job is started
2. From the future object: **future.get_job_id()**
3. From the user report: **keypoints_client.get_full_report()** (see below)

It is also possible to stop all jobs in a domain, or even all jobs in all domains (this might be simpler since the job_id is not needed):
* **keypoints_client.cancel_all_extraction_jobs_for_domain(domain)**
* **keypoints_client.cancel_all_extraction_jobs_all_domains()**

Please cancel long jobs if the results are no longer needed.

### 1.7 Modify the run_params and increase coverage
Each domain has a cache that stores all intermediate results calculated during the analysis. Therefore, after modifying the run_params, another analysis runs much faster, since all overlapping intermediate results are retrieved from the cache.

Let's run again, but now change the **mapping_policy**. The **mapping_policy** is used when mapping all sentences to the final key points, and its default value is **NORMAL**. Changing it to **STRICT** causes only sentence and key point pairs with very high matching confidence to be considered matched, increasing precision but potentially decreasing coverage. We will change it to **LOOSE**, which also matches sentences and key points with lower confidence, and is therefore expected to increase coverage at the cost of precision. We will also increase the number of required key points to 100 using the **n_top_kps** parameter.

In [None]:
run_params = {'mapping_policy':'LOOSE', 'n_top_kps': 100}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_result_2016 = future.get_result(high_verbosity=True, polling_timout_secs=30)
KpAnalysisUtils.write_result_to_csv(kpa_result_2016, 'austin_survey_2016_kpa_results.csv')
KpAnalysisUtils.print_result(kpa_result_2016, n_sentences_per_kp=2, title='Random sample')

By changing the mapping policy to **LOOSE** and increasing the number of key points, the coverage was increased from 44% to 68%.
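
A coverage figure like the ones above can be recomputed from a result. The sketch below assumes that coverage means the fraction of sentences matched to at least one key point other than 'none', and that each match carries `comment_id` and `sentence_id` fields inside a `keypoint_matchings` list. Both the definition and the field names are assumptions to verify against the actual result. The toy result is made up.

```python
def coverage(result, none_label='none'):
    """Fraction of distinct sentences matched to a real key point."""
    matched, total = set(), set()
    for kp in result['keypoint_matchings']:
        for match in kp['matching']:
            # Identify a sentence by its comment and its position in the comment.
            sentence_key = (match['comment_id'], match['sentence_id'])
            total.add(sentence_key)
            if kp['keypoint'] != none_label:
                matched.add(sentence_key)
    return len(matched) / len(total) if total else 0.0


# Made-up result in the assumed layout: three matched sentences, one unmatched.
toy_result = {'keypoint_matchings': [
    {'keypoint': 'Improve traffic', 'matching': [
        {'comment_id': 'c1', 'sentence_id': 0},
        {'comment_id': 'c2', 'sentence_id': 0}]},
    {'keypoint': 'Lower taxes', 'matching': [
        {'comment_id': 'c2', 'sentence_id': 1}]},
    {'keypoint': 'none', 'matching': [
        {'comment_id': 'c3', 'sentence_id': 0}]},
]}
toy_coverage = coverage(toy_result)
```

Using a set of sentence identifiers keeps the figure correct even when one sentence is mapped to several key points.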

### 1.8 User Report
A user report gives us all the information we need when we want to see which domains we have (and perhaps delete old ones that are no longer needed) or review past and present analysis jobs. We can also take a job_id from the report and fetch its result via **KpAnalysisTaskFuture(keypoints_client, \<job_id\>).get_result()**.

In [None]:
report = keypoints_client.get_full_report()
KpAnalysisUtils.print_report(report)

## 2. Mapping sentences to multiple key points, and creating a Key Points Graph
By default, each sentence is mapped to one key point at most (the key point with the highest match-score, that follows the **mapping_policy**). We can run again and ask KPA to map each sentence to all key points that are matched according to the **mapping_policy**, by adding the **sentence_to_multiple_kps** parameter.

In [None]:
run_params = {'sentence_to_multiple_kps': True, 'n_top_kps': 100}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_2016_job_id = future.get_job_id() # saving the job_id for a following section
kpa_result_2016 = future.get_result(high_verbosity=True, polling_timout_secs=30)

In [None]:
KpAnalysisUtils.print_result(kpa_result_2016, n_sentences_per_kp=2, title='Random sample')
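
To make the difference between the two mapping modes concrete, here is a toy sketch. It treats the **mapping_policy** as a single confidence threshold, which is a simplification. The scores and the threshold value are made up and are not the values used by the service.

```python
def map_sentence(scores, threshold, to_multiple_kps=False):
    """Map one sentence to key points given its per-key-point match scores."""
    passing = {kp: s for kp, s in scores.items() if s >= threshold}
    if not passing:
        return []  # the sentence would end up under the 'none' key point
    if to_multiple_kps:
        # All key points that pass the threshold, best match first.
        return sorted(passing, key=passing.get, reverse=True)
    # Default behaviour: only the single best-matching key point.
    return [max(passing, key=passing.get)]


# Made-up match scores for a single sentence.
toy_scores = {'Improve traffic': 0.9,
              'Expand public transit': 0.8,
              'Lower taxes': 0.1}
single = map_sentence(toy_scores, threshold=0.5)
multiple = map_sentence(toy_scores, threshold=0.5, to_multiple_kps=True)
```

In this picture a stricter policy corresponds to a higher threshold and a looser policy to a lower one.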

Now that sentences are mapped to multiple key points, it is possible to create a *key points graph*. We do so by first saving the results as before, then translating the results file into a graph-data json file, and finally loading this json file into our demo graph visualization, available at: [key points graph demo](https://keypoint-matching-ui.ris2-debater-event.us-east.containers.appdomain.cloud/)

In [None]:
KpAnalysisUtils.write_result_to_csv(kpa_result_2016, 'austin_survey_2016_multiple_kpa_results.csv')
KpAnalysisUtils.generate_graphs_and_textual_summary('austin_survey_2016_multiple_kpa_results.csv')

**generate_graphs_and_textual_summary** creates 4 files:
* **austin_survey_2016_multiple_kpa_results_graph_data.json**: a graph_data file that can be loaded to: [key points graph demo](https://keypoint-matching-ui.ris2-debater-event.us-east.containers.appdomain.cloud/). It presents the relations between the key points as a graph of key points.
* **austin_survey_2016_multiple_kpa_results_hierarchical_graph_data.json**: another graph_data file that can be loaded to the graph-demo-site. This graph is simplified, which makes it more convenient for extracting insights.
* **austin_survey_2016_multiple_kpa_results_hierarchical.txt**: This textual file shows the simplified graph (from the previous bullet) as a list of hierarchical bullets.
* **austin_survey_2016_multiple_kpa_results_hierarchical.docx**: This Microsoft Word document shows the textual bullets (from the previous bullet) as a user-friendly report.
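
One plausible way to relate key points once sentences are mapped to several of them is to measure how much their matched sentences overlap. The sketch below illustrates that idea only. It is an assumption about how such a graph could be derived and does not describe what **generate_graphs_and_textual_summary** does internally. The function name, the `min_score` value and the data are made up.

```python
from itertools import permutations


def overlap_edges(kp_to_sentences, min_score=0.5):
    """Return directed edges (source, target, score) between key points.

    The score of source -> target is the fraction of the source's sentences
    that are also matched to the target.
    """
    edges = []
    for source, target in permutations(kp_to_sentences, 2):
        source_sents = kp_to_sentences[source]
        if not source_sents:
            continue  # a key point without sentences has no outgoing edges
        score = len(source_sents & kp_to_sentences[target]) / len(source_sents)
        if score >= min_score:
            edges.append((source, target, score))
    return edges


# Made-up mapping from key points to the ids of their matched sentences.
toy_mapping = {'Fix traffic on I-35': {1, 2},
               'Improve traffic': {1, 2, 3, 4, 5},
               'Lower taxes': {6}}
toy_edges = overlap_edges(toy_mapping)
```

Here the specific key point points to the more general one, which is the kind of relation a hierarchical view can build on.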

## 3. Run *Key Point Analysis* incrementally
### 3.1 Run *Key Point Analysis* incrementally on new data (data from 2016 + 2017)
A year has passed and we have collected additional data (from 2017). We can now upload the 2017 data to the same domain (austin_demo) and have both the 2016 and the 2017 data in one domain.

In [None]:
comments_2017 = [c for c in comments if c['year'] == '2017']
random.seed(0)
comments_2017_sample = random.sample(comments_2017, sample_size)

domain = 'austin_demo'
comments_texts = [comment['text'] for comment in comments_2017_sample]
comments_ids = [comment['id'] for comment in comments_2017_sample]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)
keypoints_client.wait_till_all_comments_are_processed(domain=domain)

We can now run a new analysis over all the data in the domain, as we did before, and automatically extract new key points. We can assume that some will be identical to the key points extracted on the 2016 data, some will be similar and some key points will be new.

A better option is to run a new analysis but provide the key points from the 2016 analysis, and let *Key Point Analysis* add new key points from the 2017 data if there are any. One benefit of this approach is that the new result will mostly use the 2016 key points, so we will be able to compare the two results and see what changed, what improved and what did not. Another major benefit of this approach is run-time: the 2016 data was already analyzed with these key points, and since we have a cache in place, much of the computation can be avoided. The 2016 key points can be provided via the **run_params['keypoint_candidates'] = [...]** parameter, passing a list of strings, or we can use **run_params['keypoint_candidates_by_job_id'] = \<job_id\>** and provide the job_id of an analysis job. KPA will then take the key points from the job's result automatically. We will use the latter parameter and provide the *kpa_2016_job_id* we saved earlier.

In [None]:
run_params = {'sentence_to_multiple_kps': True,
              'keypoint_candidates_by_job_id': kpa_2016_job_id, 'n_top_kps': 100}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_result_2016_2017 = future.get_result(high_verbosity=True, polling_timout_secs=30)

In [None]:
KpAnalysisUtils.write_result_to_csv(kpa_result_2016_2017, 'austin_survey_2016_2017_kpa_results.csv')
KpAnalysisUtils.compare_results(kpa_result_2016, kpa_result_2016_2017, '2016', '2016 + 2017')
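
The idea behind such a comparison can be sketched with plain per-key-point counts. This is an illustration only and not what **compare_results** does internally. The function name `compare_counts` and all numbers are made up.

```python
def compare_counts(counts_before, counts_after):
    """Return (key point, before, after, change) rows, largest change first."""
    rows = []
    for kp in sorted(set(counts_before) | set(counts_after)):
        before = counts_before.get(kp, 0)  # 0 when the key point is new
        after = counts_after.get(kp, 0)
        rows.append((kp, before, after, after - before))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Made-up counts of matched sentences per key point.
toy_2016 = {'Improve traffic': 10, 'Lower taxes': 5}
toy_2016_2017 = {'Improve traffic': 18, 'Lower taxes': 4,
                 'More affordable housing': 6}
toy_rows = compare_counts(toy_2016, toy_2016_2017)
```

Because the second run reuses the key points of the first, most rows refer to the same key point in both results, which is what makes the change column meaningful.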

### 3.2 Run *Key Point Analysis* incrementally on new data (2017 independently)
Using the **comments_ids** parameter of the **start_kp_analysis_job** method, we can run over a subset of the comments in the domain. Let's do that and run an analysis over the 2017 comments independently. We will provide the key points from 2016, since we want to be able to compare the two results:

In [None]:
comments_ids = [comment['id'] for comment in comments_2017_sample]
run_params = {'sentence_to_multiple_kps': True,
              'keypoint_candidates_by_job_id': kpa_2016_job_id, 'n_top_kps': 100}
future = keypoints_client.start_kp_analysis_job(comments_ids=comments_ids, domain=domain, run_params=run_params)
kpa_result_2017 = future.get_result(high_verbosity=True, polling_timout_secs=30)

KpAnalysisUtils.write_result_to_csv(kpa_result_2017, 'austin_survey_2017_kpa_results.csv')

In [None]:
KpAnalysisUtils.compare_results(kpa_result_2016, kpa_result_2017, '2016', '2017')

Running over subsets of the data in the domain enables us to compare the results between them (subsets can be data from different GEOs, different organizations, different users (e.g. promoters/detractors), etc.).
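
When the subsets are defined by a field of the comments, the id lists can be prepared in one pass. The following is a minimal sketch. The helper name `ids_by_group` and the toy comments are illustrative.

```python
from collections import defaultdict


def ids_by_group(comments, field):
    """Group comment ids by the value of `field` (e.g. 'year')."""
    groups = defaultdict(list)
    for comment in comments:
        groups[comment[field]].append(comment['id'])
    return dict(groups)


# Made-up comments for illustration only.
toy_comments = [{'id': '1', 'text': 'Fix the roads', 'year': '2016'},
                {'id': '2', 'text': 'More parks', 'year': '2017'},
                {'id': '3', 'text': 'Lower taxes', 'year': '2016'}]
toy_groups = ids_by_group(toy_comments, 'year')
```

Each list in the returned dictionary can then be passed as the **comments_ids** argument of **start_kp_analysis_job**, as in the cell above.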

## 4. Run *Key Point Analysis* on each stance separately
In many use-cases (surveys, customer feedback, etc.) the comments have a positive and/or negative stance, and it is useful to create a KPA analysis for each stance separately. Most stance detection models don't perform well on survey data (or on customer feedback, etc.), since the comments tend to contain many "suggestions". Suggestions tend to appear positive to the model, while the user is actually suggesting to improve something that needs improvement.
To that end, we trained a stance model that handles suggestions well and labels each sentence as 'Positive', 'Negative', 'Neutral' or 'Suggestion'. We usually treat suggestions as negatives and run two separate analyses: the first over the 'Positive' sentences and the second over the 'Negative' and 'Suggestion' sentences.

This has the following advantages:
* Creates a separate positive/negative summary that shows clearly what works well and what needs to be improved.
* Filters-out neutral sentences that usually don't contain valuable information.
* Helps the matching model avoid stance mistakes (matching a positive sentence to a negative key point and vice-versa).

Let's run again over the Austin survey dataset, but this time create two separate KPA analyses (positive and negative). We will first need to create a new domain and add the domain_param **do_stance_analysis**.

In [None]:
domain = 'austin_demo_two_stances'
domain_params = {'do_stance_analysis': True}
KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain=domain)
keypoints_client.create_domain(domain=domain, domain_params=domain_params)

Let's upload all 2016 comments to the new domain and wait for them to be processed. This time the sentences' stance is also calculated.

In [None]:
comments_texts = [comment['text'] for comment in comments_2016]
comments_ids = [comment['id'] for comment in comments_2016]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)
keypoints_client.wait_till_all_comments_are_processed(domain=domain)

We can download the processed sentences and save them into a csv if we want to examine the processed data.

In [None]:
sentences = keypoints_client.get_sentences_for_domain(domain=domain)
KpAnalysisUtils.write_sentences_to_csv(sentences, f'{domain}_sentences.csv')

And now, run two analyses, one over the positive sentences and one over the negative + suggestions.

In [None]:
run_params = {'sentence_to_multiple_kps': True, "n_top_kps":100}
run_params['stances_to_run'] = ['pos']
run_params['stances_threshold'] = 0.5
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_pos_result = future.get_result(high_verbosity=True, polling_timout_secs=30)
KpAnalysisUtils.print_result(kpa_pos_result, n_sentences_per_kp=2, title='Random sample positives')

As in many surveys, most comments are negative or suggestions, so the positive analysis is relatively limited. Let's see how the negative analysis goes.

In [None]:
run_params['stances_to_run'] = ['neg', 'sug']
run_params['stances_threshold'] = 0.5
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params, comments_ids=comments_ids)
kpa_neg_result = future.get_result(high_verbosity=True, polling_timout_secs=30)

Let's print the results:

In [None]:
KpAnalysisUtils.print_result(kpa_neg_result, n_sentences_per_kp=2, title='Random sample negatives')

Reaching a nice 67% coverage, most of the sentences are matched to the 100 automatically extracted key points.

We can increase the stances_threshold when we want to run over fewer sentences with a stronger stance. This is useful when we have a large dataset with many less-relevant sentences and we want to filter them out.
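
The effect of the threshold can be illustrated with a toy filter. The sketch assumes that every sentence carries a confidence per stance and that a sentence is kept when its most confident stance is requested and reaches the threshold. This is only one plausible reading of **stances_to_run** and **stances_threshold**. The `stance_scores` field and all numbers are made up.

```python
def select_by_stance(sentences, stances, threshold):
    """Keep sentences whose most confident stance is in `stances`."""
    selected = []
    for sentence in sentences:
        # Pick the stance with the highest confidence for this sentence.
        stance, confidence = max(sentence['stance_scores'].items(),
                                 key=lambda item: item[1])
        if stance in stances and confidence >= threshold:
            selected.append(sentence['text'])
    return selected


# Made-up sentences with hypothetical stance confidences.
toy_sentences = [
    {'text': 'The parks are great',
     'stance_scores': {'pos': 0.9, 'neg': 0.05, 'sug': 0.05}},
    {'text': 'Please add more bus routes',
     'stance_scores': {'pos': 0.2, 'neg': 0.1, 'sug': 0.7}},
    {'text': 'Traffic is terrible',
     'stance_scores': {'pos': 0.05, 'neg': 0.55, 'sug': 0.4}},
]
negatives = select_by_stance(toy_sentences, ['neg', 'sug'], threshold=0.5)
strong_negatives = select_by_stance(toy_sentences, ['neg', 'sug'], threshold=0.6)
```

Raising the threshold from 0.5 to 0.6 drops the sentence with the weaker stance, so fewer but clearer sentences take part in the analysis.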

We can mark the stance in the results:

In [None]:
kpa_pos_result = KpAnalysisUtils.set_stance_to_result(kpa_pos_result, 'pos')
kpa_neg_result = KpAnalysisUtils.set_stance_to_result(kpa_neg_result, 'neg')

We can then save the results (both pos/neg separately and merged) and create the key points graph data files as we did before:

In [None]:
pos_result_file = 'austin_survey_2016_pro_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_pos_result, pos_result_file)
KpAnalysisUtils.generate_graphs_and_textual_summary(pos_result_file)

neg_result_file = 'austin_survey_2016_neg_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_neg_result, neg_result_file)
KpAnalysisUtils.generate_graphs_and_textual_summary(neg_result_file)

kpa_merged_result = KpAnalysisUtils.merge_two_results(kpa_pos_result, kpa_neg_result)
merged_result_file = 'austin_survey_2016_merged_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_merged_result, merged_result_file)
KpAnalysisUtils.generate_graphs_and_textual_summary(merged_result_file)

We can also use the incremental approach when running on each stance separately. To do so, we need to provide the job_id of the 2016 positive analysis when running on the positive sentences of 2016 + 2017, and the job_id of the 2016 negative analysis when running on the negative sentences of 2016 + 2017. For simplicity, we didn't combine these features in this tutorial.
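
As a hedged sketch of that combination, the run_params for each stance could be assembled as follows. Only parameter names that already appear in this tutorial are used. The helper name `stance_run_params` and the job ids are placeholders, and the combination was not run as part of this tutorial.

```python
def stance_run_params(stances, previous_job_id=None, threshold=0.5, n_top_kps=100):
    """Build run_params for one stance, optionally reusing earlier key points."""
    run_params = {'sentence_to_multiple_kps': True,
                  'n_top_kps': n_top_kps,
                  'stances_to_run': list(stances),
                  'stances_threshold': threshold}
    if previous_job_id is not None:
        # Reuse the key points of an earlier analysis of the same stance.
        run_params['keypoint_candidates_by_job_id'] = previous_job_id
    return run_params


# Placeholder job ids; real ones come from future.get_job_id().
pos_params = stance_run_params(['pos'], previous_job_id='<pos_2016_job_id>')
neg_params = stance_run_params(['neg', 'sug'], previous_job_id='<neg_2016_job_id>')
```

Each dictionary would then be passed to **start_kp_analysis_job** for the domain that holds both years of data.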

## 5. Cleanup
If you finished the tutorial and no longer need the domains and the results, cleaning up is always advised:

In [None]:
KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain='austin_demo')
KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain='austin_demo_two_stances')

## 6. Conclusion
In this tutorial, we showed how to use the *Key Point Analysis* service, and how it provides detailed insights over survey data right out of the box - significantly reducing the effort required by a data scientist to analyze the data. We also demonstrated key *key point analysis* features such as how to modify the analysis parameters and increase coverage, how to use the stance-model and create per-stance results, how to create *key points graph* and further improve the quality and the clarity of the results, and how to incrementally add new data.

Feel free to contact us for questions or assistance: *yoavka@il.ibm.com*