# Using the *Key Point Analysis* service to analyze and find insights in survey data
When you have a large collection of texts representing people’s opinions (such as product reviews, survey answers or social media posts), it is difficult to understand the key issues that come up in the data. Going over thousands of comments is prohibitively expensive. Existing automated approaches are often limited to identifying recurring phrases or concepts and the overall sentiment toward them, but do not provide detailed or actionable insights.

In this tutorial you will gain hands-on experience in using *Key Point Analysis* (KPA) for analyzing and deriving insights from open-ended answers.  

The data we will use is a community survey conducted in the city of Austin (https://data.world/cityofaustin/mf9f-kvkk). In this survey, the citizens of Austin were asked "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?". 

## 1. Run *Key Point Analysis* (data from 2016)

### 1.1 Read the data and run *key point analysis*  over it
Let's read the data from the *dataset_austin.csv* file, which holds the Austin survey dataset, and print the first comment.

In [1]:
from debater_python_api.api.clients.keypoints_client import KpAnalysisClient, KpAnalysisUtils, KpAnalysisTaskFuture
import os
import csv
import random


with open('./dataset_austin.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    comments = [dict(d) for d in reader]

print(f'There are {len(comments)} comments in the dataset')
print(comments[0])

There are 3187 comments in the dataset
{'id': '1', 'year': '2016', 'text': "Dissatisfied traffic and with traffic, timing of street lights.  EXTREMELY dissatisfied with cit govt. interfering in local businesses (Uber/Lyft, income property owners).  Also, extremely dissatisfied with all the free handouts to people who are perfectly capable of earning their own money.  I'm very dissatisfied with the liberal leaning local politicians."}


Each comment is a dictionary with a unique 'id', a 'text' and a 'year'. We will first remove all comments whose text is longer than 1000 characters, since this is a system limit. Then we will keep only the comments from 2016.

The *Key Point Analysis* service can run over hundreds of thousands of sentences. However, since the computation is resource-heavy (particularly in GPUs), the trial version is limited to 1000 comments. You may request an increase of this limit if needed. Since we want the tutorial to be relatively fast and lightweight, we will only run on a sample of 400 comments. Note that running over a larger set improves both the quality and coverage of the results.

In [2]:
comments = [c for c in comments if len(c['text'])<=1000]
comments_2016 = [c for c in comments if c['year'] == '2016']
sample_size = 400
random.seed(0)
comments_2016_sample = random.sample(comments_2016, sample_size)
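The filtering and sampling steps above can be wrapped in a small helper. This is only a sketch: `sample_comments` is a hypothetical name (it is not part of the tutorial or the client library), and it assumes comments are dictionaries with 'text' and 'year' keys as printed above. It also caps the sample size, since `random.sample` raises a `ValueError` when asked for more items than are available.

```python
import random


def sample_comments(comments, year, sample_size, max_chars=1000, seed=0):
    """Filter comments by text length and year, then draw a reproducible sample.

    Hypothetical helper for illustration; not part of the KPA client.
    """
    eligible = [c for c in comments
                if len(c['text']) <= max_chars and c['year'] == year]
    # Cap the sample size so random.sample never raises ValueError.
    k = min(sample_size, len(eligible))
    rng = random.Random(seed)  # local generator, leaves the global state alone
    return rng.sample(eligible, k)


# Toy data: ids 1, 3, 5 are from 2016; id 5 is longer than 1000 characters.
toy = [{'id': str(i),
        'year': '2016' if i % 2 else '2017',
        'text': 'x' * (i * 300)} for i in range(1, 7)]
sample = sample_comments(toy, year='2016', sample_size=400)
print(sorted(c['id'] for c in sample))
```

Using a local `random.Random(seed)` keeps the sample reproducible without depending on whatever other code did to the global random state.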

*Key point analysis* is a novel and promising approach for summarization, with an important quantitative angle. This service summarizes a collection of comments on a given topic as a small set of key points. The salience of each key point is given by the number of its matching sentences in the given comments.

Before running the *Key Point Analysis* service we first need to initialize our client. The client prints information using the logger, so a suitable verbosity level should be set. The client object is configured with an API key, which should be retrieved from the [Project Debater Early Access Program](https://early-access-program.debater.res.ibm.com/) site. In the code below it is passed via the environment variable *DEBATER_API_KEY* (you may also modify the code and place the API key directly).
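If the environment variable is missing, `os.environ['DEBATER_API_KEY']` fails with a bare `KeyError`. A minimal sketch of a friendlier lookup is shown here; `read_api_key` is a hypothetical helper, not part of the client library.

```python
import os


def read_api_key(var_name='DEBATER_API_KEY'):
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; only the variable name comes
    from the tutorial.
    """
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError(
            f'Environment variable {var_name} is not set; '
            'set it to your API key before creating the client.')
    return key
```

The returned value can then be passed to the client in place of the direct `os.environ[...]` lookup.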

The *Key Point Analysis* service stores the data (and cached-results) in a *domain*. A user can create several domains, one for each dataset. Domains are only accessible to the user who created them.

Full documentation of the *Key Point Analysis* service can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoints_pydoc.html).

In [3]:
KpAnalysisUtils.init_logger()
api_key = os.environ['DEBATER_API_KEY']
host = 'https://keypoint-matching-backend.debater.res.ibm.com'
keypoints_client = KpAnalysisClient(api_key, host)

In order to run *Key Point Analysis*, do the following steps:

### 1.2 Create a domain
Create a domain using the **keypoints_client.create_domain(domain=domain, domain_params={})** method. Several parameters can be passed in the domain_params dictionary when creating a domain, as described in the documentation. Leaving it empty gives a good default behaviour. You can also use **KpAnalysisUtils.create_domain_ignore_exists(client=keypoints_client, domain=domain, domain_params={})** if you don't want an exception to be thrown when the domain already exists. Note that in this case the domain_params are not updated and remain as they were before.

Full documentation of the supported *domain_params* and how they affect the domain can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoint_parameters_users.pdf).

In [4]:
domain = 'austin_demo'
KpAnalysisUtils.create_domain_ignore_exists(client=keypoints_client, domain=domain, domain_params={})

2022-09-22 11:53:54,145 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/domains
2022-09-22 11:53:55,356 [INFO] keypoints_client.py 475: created domain: austin_demo with domain_params: {}
2022-09-22 11:53:55,357 [INFO] keypoints_client.py 270: domain: austin_demo was created


A few domain-related points:
* We can always delete a domain we no longer need using: **KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain=domain)**
* Keep in mind that a domain has a state. It stores all comments that have been uploaded into it and a cache with all calculations performed over this data.
* If we want to restart and run over the domain from scratch (no comments and no cache), we can delete the domain and then re-create it, or simply use a different domain. Keep in mind that the cache is also cleared, so subsequent runs will take longer.

### 1.3 Upload comments into the domain
Upload the comments into the domain using the **keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)** method. This method receives the domain, a list of comments_ids and a list of comments_texts. When uploading comments into a domain, the *Key Point Analysis* service splits the comments into sentences and runs a minor cleansing on the sentences. If you have domain-specific knowledge and want to split the comments into sentences yourself, you can upload comments that are already split into sentences and set the *dont_split* parameter to True (in the domain_params when creating the domain). *Key Point Analysis* will then use the provided sentences as is.

Note that:
* Comments_ids must be unique
* The number of comments_ids must match the number of comments_texts
* Comments_texts must not be longer than 1000 characters
* Uploading the same comment several times (same domain and comment_id; the comment_text is ignored) is not a problem: the comment is only uploaded once. If the comment_text is different, it is NOT updated.
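The constraints listed above can be checked locally before calling the upload method. The following is a minimal sketch; `validate_comments` is a hypothetical helper (not part of the client library), and the 1000-character limit is the one stated in this tutorial.

```python
from collections import Counter

MAX_COMMENT_CHARS = 1000  # limit stated in this tutorial


def validate_comments(comments_ids, comments_texts, max_chars=MAX_COMMENT_CHARS):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical pre-upload check for illustration only.
    """
    problems = []
    if len(comments_ids) != len(comments_texts):
        problems.append(
            f'{len(comments_ids)} ids but {len(comments_texts)} texts')
    duplicates = sorted(i for i, n in Counter(comments_ids).items() if n > 1)
    if duplicates:
        problems.append(f'duplicate ids: {duplicates}')
    too_long = [i for i, t in zip(comments_ids, comments_texts)
                if len(t) > max_chars]
    if too_long:
        problems.append(
            f'texts longer than {max_chars} characters: {too_long}')
    return problems


print(validate_comments(['1', '2'], ['short text', 'another text']))
print(validate_comments(['1', '1'], ['short text', 'y' * 1001]))
```

Running such a check first makes it easier to tell a data problem apart from a service-side error.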

In [5]:
comments_texts = [comment['text'] for comment in comments_2016_sample]
comments_ids = [comment['id'] for comment in comments_2016_sample]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)

2022-09-22 11:53:55,373 [INFO] keypoints_client.py 497: uploading 400 comments in batches
2022-09-22 11:53:55,377 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:53:56,297 [INFO] keypoints_client.py 511: uploaded 400 comments, out of 400


### 1.4 Wait for the comments to be processed
Comments uploaded to the domain are processed asynchronously, which takes some time. We can't run an analysis before this phase finishes, so we wait until all comments in the domain are processed using the **keypoints_client.wait_till_all_comments_are_processed(domain=domain)** method.
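Conceptually, waiting for processing is a polling loop over the domain's comment status. The sketch below illustrates the idea with a fake status source; it is not the client's actual implementation. `wait_until_processed` and `get_status` are hypothetical names, and the status keys are taken from the log output shown in this tutorial.

```python
import time


def wait_until_processed(get_status, poll_secs=10, max_polls=60,
                         sleep=time.sleep):
    """Poll get_status() until no comments are pending, then return the status.

    Illustrative sketch only; the real client has its own waiting method.
    """
    for _ in range(max_polls):
        status = get_status()
        if status['pending_comments'] == 0:
            return status
        sleep(poll_secs)
    raise TimeoutError(f'comments still pending after {max_polls} polls')


# Fake status source that mimics two consecutive status reports.
statuses = iter([
    {'processed_comments': 0, 'processed_sentences': 0, 'pending_comments': 400},
    {'processed_comments': 400, 'processed_sentences': 682, 'pending_comments': 0},
])
final_status = wait_until_processed(lambda: next(statuses),
                                    sleep=lambda secs: None)  # no real waiting
print(final_status)
```

Injecting `sleep` as a parameter keeps the example fast to run and easy to test.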

In [6]:
keypoints_client.wait_till_all_comments_are_processed(domain=domain)

2022-09-22 11:53:56,314 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:53:56,908 [INFO] keypoints_client.py 523: domain: austin_demo, comments status: {'processed_comments': 0, 'processed_sentences': 0, 'pending_comments': 400}
2022-09-22 11:54:06,914 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:54:07,521 [INFO] keypoints_client.py 523: domain: austin_demo, comments status: {'processed_comments': 400, 'processed_sentences': 682, 'pending_comments': 0}


### 1.5 Start a Key Point Analysis job
Start a *Key Point Analysis* job using the **future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)** method. This method receives the domain and a *run_params* dictionary with various parameters for customizing the job. Leaving it empty gives a good default behaviour. The job runs asynchronously, so the method returns a future object.

A few additional options when running an analysis job:
* The analysis is performed over all comments in the domain. If we need to run over a subset of the comments (for example, splitting the data by geography, user type or time frame), we can pass a list of comments_ids to the comments_ids parameter and the analysis will use only the provided comments.
* By default, key points are extracted automatically. When we want to provide key points and match all sentences to these key points, we can do so by passing them to the keypoints parameter: **run_params['keypoints'] = [...]**. This enables a mode of work named human-in-the-loop, where we first automatically extract key points, then manually edit them (refine imperfect key points, remove duplicates and add missing ones) and then run again, this time providing the edited key points as a given set.
* It is also possible to provide key points and let KPA add additional missing key points. To do so, pass the key points to the keypoint_candidates parameter: **run_params['keypoint_candidates'] = [...]** (see section 4 for a detailed example).
* Full documentation of the supported *domain_params* and *run_params* and how they affect the analysis can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoint_parameters_users.pdf).
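The manual editing step of the human-in-the-loop mode described above can be sketched as a small list transformation. `edit_keypoints` is a hypothetical helper, not part of the client library; only the `keypoints` run parameter comes from the tutorial.

```python
def edit_keypoints(extracted, rename=None, remove=(), add=()):
    """Apply manual edits to automatically extracted key points.

    Renames refined key points, drops removed ones and case-insensitive
    duplicates, and appends missing ones. Illustrative sketch only.
    """
    rename = rename or {}
    removed = {kp.casefold() for kp in remove}
    edited, seen = [], set()
    for kp in list(extracted) + list(add):
        kp = rename.get(kp, kp).strip()
        key = kp.casefold()
        if not kp or key in removed or key in seen:
            continue
        seen.add(key)
        edited.append(kp)
    return edited


extracted = ['Improvement on traffic problem.',
             'COST OF UTILITIES IS VERY HIGH',
             'Cost of utilities is very high',
             'The highways need a major overhaul.']
keypoints = edit_keypoints(
    extracted,
    rename={'Improvement on traffic problem.': 'Traffic needs to improve.'},
    remove=['The highways need a major overhaul.'],
    add=['Make the city more bike friendly.'])
run_params = {'keypoints': keypoints}
print(run_params)
```

The resulting `run_params` dictionary would then be passed to the analysis job so that all sentences are matched against the edited key points.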

In [7]:
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params={})

2022-09-22 11:54:07,535 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:54:08,197 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo, run_params: {}, job_id: 632c22b0116742ef7549706b


### 1.6 Wait for the Key Point Analysis job to finish
Use the returned future and wait until results are available using the **kpa_result = future.get_result()** method. The method waits for the job to finish and eventually returns the result. The result is a dictionary containing the key points (sorted in descending order by number of matched sentences), and for each key point a list of matched sentences (sorted in descending order by match score). An additional 'none' key point is added, which holds all the sentences that don't match any key point.
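To make the shape of such a result concrete, here is a sketch that walks a toy result and derives per-key-point counts and a coverage figure. The key names (`keypoint_matchings`, `keypoint`, `matching`, `sentence_text`, `score`) are assumptions made for illustration and should be checked against the service documentation. Coverage is taken here to be the share of sentences matched to any key point other than 'none', which may differ from how `print_results` computes it.

```python
def summarize_result(kpa_result):
    """Return (rows, coverage) for a result-shaped dictionary.

    rows is a list of (key point, number of sentences, sentence texts).
    The dictionary layout is an assumption made for this sketch.
    """
    rows = []
    for entry in kpa_result['keypoint_matchings']:
        sentences = sorted(entry['matching'],
                           key=lambda m: m['score'], reverse=True)
        rows.append((entry['keypoint'], len(sentences),
                     [m['sentence_text'] for m in sentences]))
    matched = sum(n for kp, n, _ in rows if kp != 'none')
    total = sum(n for _, n, _ in rows)
    coverage = 100.0 * matched / total if total else 0.0
    return rows, coverage


toy_result = {'keypoint_matchings': [
    {'keypoint': 'Improvement on traffic problem.', 'matching': [
        {'sentence_text': 'Please improve the roadways.', 'score': 0.96},
        {'sentence_text': 'Fix the traffic problems.', 'score': 0.99}]},
    {'keypoint': 'Improve affordable housing/living.', 'matching': [
        {'sentence_text': 'BUILD MORE AFFORDABLE HOUSING', 'score': 0.98}]},
    {'keypoint': 'none', 'matching': [
        {'sentence_text': 'Thank you.', 'score': 0.0}]},
]}
rows, coverage = summarize_result(toy_result)
for keypoint, n_sentences, texts in rows:
    print(n_sentences, '-', keypoint, texts[:1])
print(f'coverage: {coverage:.2f}')
```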

In [8]:
kpa_result_2016 = future.get_result(high_verbosity=True, polling_timout_secs=30)

2022-09-22 11:54:08,211 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:54:08,770 [INFO] keypoints_client.py 760: job_id 632c22b0116742ef7549706b is pending
2022-09-22 11:54:38,778 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:54:39,382 [INFO] keypoints_client.py 764: job_id 632c22b0116742ef7549706b is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 0, 'total_batches': 20, 'batch_size': 2000}}


Stage 1/2: |--------------------------------------------------| 0.0% Complete



2022-09-22 11:55:09,392 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:55:09,983 [INFO] keypoints_client.py 764: job_id 632c22b0116742ef7549706b is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 5, 'total_batches': 20, 'batch_size': 2000}}


Stage 1/2: |████████████--------------------------------------| 25.0% Complete



2022-09-22 11:55:39,991 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:55:40,586 [INFO] keypoints_client.py 764: job_id 632c22b0116742ef7549706b is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 18, 'total_batches': 20, 'batch_size': 2000}}


Stage 1/2: |█████████████████████████████████████████████-----| 90.0% Complete



2022-09-22 11:56:10,590 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:11,997 [INFO] keypoints_client.py 767: job_id 632c22b0116742ef7549706b is done, returning result


Let's print the results:

In [9]:
from austin_utils import print_results
print_results(kpa_result_2016, n_sentences_per_kp=2, title='2016 Random sample')

2016 Random sample coverage: 42.57
2016 Random sample key points:
84 - Improvement on traffic problem.
	- Fix the traffic problems, forget bikes and add more lanes.
	- FIX THE TRAFFIC SITUATION MOPAC.
60 - Improve affordable housing/living.
	- BUILD MORE AFFORDABLY HOUSING
	- AFFORDABLE HOUSING FOR LOW INCOME & TRAFFIC
50 - Develop public transportation network.
	- DEVELOP REALISTIC PLAN FOR PUBLIC TRANSPORTATION.
	- affordable housing in key and public transportation to reduce the number of cars on the
	  roads
26 - COST OF UTILITIES IS VERY HIGH
	- Utilities, particularly water, is too high.
	- Rest is too expensive, cost of living too high.
15 - City wide planning pertaining to infrastructure.
	- Plan on more roads that run east and west of the city, between ih35 and mopac
	- Traffic all facets-planning, congestion, enforcement
12 - The highways need a major overhaul.
	- Please improve the roadways.
	- Fix the darn roads!
9 - DEVELOPMENT SHOULD NOT DISPLACE LOW INCOME FAMILIES.
	- N

We can also save the results to file. This creates two files: one with the key points and all matched sentences, and a summary file with only the key points and their salience.

In [10]:
KpAnalysisUtils.write_result_to_csv(kpa_result_2016, 'austin_survey_2016_kpa_results.csv')

2022-09-22 11:56:12,024 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_kpa_results_kps_summary.csv
2022-09-22 11:56:12,027 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_kpa_results.csv


It is always possible to cancel a pending/running job in the following way:
* **keypoints_client.cancel_kp_extraction_job(\<Job Id\>)**

The job ID can be found in any of the following ways:
1. It's printed when a job is started 
2. From the future object: **future.get_job_id()**
3. From the user report: **keypoints_client.get_full_report()** (see below)

It is also possible to stop all jobs in a domain, or even all jobs in all domains (this might be simpler since the job_id is not needed):
* **keypoints_client.cancel_all_extraction_jobs_for_domain(domain)**
* **keypoints_client.cancel_all_extraction_jobs_all_domains()**

Please cancel long jobs if the results are no longer needed.

### 1.7 Modify the run_params and increase coverage
Each domain has a cache that stores all intermediate results calculated during the analysis. Therefore, after modifying the run_params, another analysis runs much faster because all overlapping intermediate results are retrieved from the cache.

Let's run again, but now reduce the **clustering_threshold** and **mapping_threshold**. The **clustering_threshold** is used for the key points selection (choose higher values for more fine-grained key points, and lower for more distinct key points). The **mapping_threshold** is used when mapping all sentences to the final key points (a lower threshold leads to a higher coverage with the risk of a lower precision).

In [11]:
run_params = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_result_2016 = future.get_result(high_verbosity=True, polling_timout_secs=30)
KpAnalysisUtils.write_result_to_csv(kpa_result_2016, 'austin_survey_2016_kpa_results.csv')
print_results(kpa_result_2016, n_sentences_per_kp=2, title='Random sample')

2022-09-22 11:56:12,033 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:12,621 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95}, job_id: 632c232c116742ef7549706e
2022-09-22 11:56:12,624 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:13,164 [INFO] keypoints_client.py 760: job_id 632c232c116742ef7549706e is pending
2022-09-22 11:56:43,170 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:44,523 [INFO] keypoints_client.py 767: job_id 632c232c116742ef7549706e is done, returning result
2022-09-22 11:56:44,528 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_kpa_results_kps_summary.

Random sample coverage: 46.86
Random sample key points:
96 - Improvement on traffic problem.
	- Fix the traffic problems, forget bikes and add more lanes.
	- FIX THE TRAFFIC SITUATION MOPAC.
64 - Improve affordable housing/living.
	- BUILD MORE AFFORDABLY HOUSING
	- AFFORDABLE HOUSING FOR LOW INCOME & TRAFFIC
52 - Develop public transportation network.
	- DEVELOP REALISTIC PLAN FOR PUBLIC TRANSPORTATION.
	- affordable housing in key and public transportation to reduce the number of cars on the
	  roads
35 - COST OF UTILITIES IS VERY HIGH
	- Utilities, particularly water, is too high.
	- Rest is too expensive, cost of living too high.
15 - TO HAVE BETTER PLANNING FOR CITY GROWTH.
	- Plan out the growth for Austin as the city grows.
	- Should have been better prepared for the city growth like Houston, San Antonio, Dallas.
11 - Streamline the residential permitting process.
	- Have permits for new commercial build outs be processed faster.
	- FIX PERMITS AND ZONING REQUIREMENTS.
10 - DEVE

By reducing the thresholds, the coverage increased from 42.57% to 46.86%.
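The effect of the mapping threshold on coverage can be illustrated with a toy calculation. This sketch assumes each sentence has a single best match score and counts as covered when that score is at or above the threshold. The scores are made up, whether the service compares inclusively is an assumption, and 0.99 is just an illustrative higher value, not necessarily the service default.

```python
def coverage_at(best_scores, mapping_threshold):
    """Percentage of sentences whose best match score reaches the threshold.

    Toy illustration; not the service's actual computation.
    """
    matched = sum(1 for score in best_scores if score >= mapping_threshold)
    return 100.0 * matched / len(best_scores)


# Made-up best match scores for six sentences.
best_scores = [0.999, 0.995, 0.97, 0.96, 0.90, 0.40]
for threshold in (0.99, 0.95):
    print(f'threshold {threshold}: '
          f'coverage {coverage_at(best_scores, threshold):.2f}%')
```

Lowering the threshold admits the two sentences scored 0.97 and 0.96, which raises coverage but also accepts weaker matches. This is the precision risk mentioned in section 1.7.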

### 1.8 User Report
To see which domains we have (and perhaps delete old ones that are no longer needed), or to review past and present analysis jobs (and perhaps take a job_id and fetch its result via **KpAnalysisTaskFuture(keypoints_client, \<job_id\>).get_result()**), we can get a report with all the needed information:

In [None]:
report = keypoints_client.get_full_report()
KpAnalysisUtils.print_report(report)

## 2. Mapping sentences to multiple key points, and creating a key points graph
By default, each sentence is mapped to one key point at most (the key point with the highest match-score, above the **mapping_threshold**). We can run again and ask KPA to map each sentence to all key points with a match-score above the **mapping_threshold**, by adding the **sentence_to_multiple_kps** parameter.
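The difference between the two mapping modes can be sketched on a single sentence with made-up match scores. `map_sentence` is a hypothetical function written for illustration; it mirrors the description above rather than the service's internal code.

```python
def map_sentence(scores, mapping_threshold, multiple=False):
    """Return the key points a sentence is mapped to, best match first.

    scores maps key point text to the sentence's match score.
    Illustrative sketch of the two mapping modes described above.
    """
    above = {kp: s for kp, s in scores.items() if s >= mapping_threshold}
    if not above:
        return []  # the sentence would end up under the 'none' key point
    if multiple:
        return sorted(above, key=above.get, reverse=True)
    return [max(above, key=above.get)]


scores = {'Improvement on traffic problem.': 0.98,
          'Develop public transportation network.': 0.96,
          'COST OF UTILITIES IS VERY HIGH': 0.20}
print(map_sentence(scores, 0.95))                 # single best key point
print(map_sentence(scores, 0.95, multiple=True))  # all key points above threshold
```

With multiple mapping, the same sentence is counted under two key points. This is consistent with the per-key-point counts increasing in the results that follow.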

In [13]:
run_params = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 
              'sentence_to_multiple_kps': True}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_analysis_2016_job_id = future.get_job_id() # saving the job_id for a following section
kpa_result_multiple_kps = future.get_result(high_verbosity=True, polling_timout_secs=30)

2022-09-22 11:56:45,673 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:46,317 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True}, job_id: 632c234e116742ef75497071
2022-09-22 11:56:46,320 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:56:46,863 [INFO] keypoints_client.py 760: job_id 632c234e116742ef75497071 is pending
2022-09-22 11:57:16,871 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:57:18,309 [INFO] keypoints_client.py 767: job_id 632c234e116742ef75497071 is done, returning result


In [14]:
print_results(kpa_result_multiple_kps, n_sentences_per_kp=2, title='Random sample')

Random sample coverage: 54.43
Random sample key points:
115 - Improvement on traffic problem.
	- Fix the traffic problems, forget bikes and add more lanes.
	- FIX THE TRAFFIC SITUATION MOPAC.
92 - Work on making Austin affordable again.
	- Should be able to purchase a 3bd 2ba for under 300K in Austin.
	- Make Austin affordable again for hard working families
72 - Develop public transportation network.
	- DEVELOP REALISTIC PLAN FOR PUBLIC TRANSPORTATION.
	- affordable housing in key and public transportation to reduce the number of cars on the
	  roads
33 - Utilities, particularly water, is too high.
	- Traffic and high utility bills are a problem for me
	- COST OF UTILITIES IS VERY HIGH
29 - PLEASE MAKE THIS CITY MORE BIKE FRIENDLY.
	- AND PLEASE GIVE US SAFE BIKE LANES THAT ARE PHYSICALLY SEPARATED FROM CARS - THIS COULD
	  BE A TOTAL BIKING CITY YEAR ROUND.
	- Contributing to the traffic congestion going into the city.
23 - make every effort to eradicate poverty
	- PROTECT EAST AUSTI

Now that sentences are mapped to multiple key points, it is possible to create a *key points graph*. First save the results as before, then translate the results file into a graph-data JSON file, and finally load this JSON file into our demo graph visualization, available at: [key points graph demo](https://keypoint-matching-ui.ris2-debater-event.us-east.containers.appdomain.cloud/)

In [15]:
KpAnalysisUtils.write_result_to_csv(kpa_result_multiple_kps, 'austin_survey_2016_multiple_kpa_results.csv')
KpAnalysisUtils.create_graph_data_file_for_ui('austin_survey_2016_multiple_kpa_results.csv')

2022-09-22 11:57:18,325 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_multiple_kpa_results_kps_summary.csv
2022-09-22 11:57:18,328 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_multiple_kpa_results.csv
2022-09-22 11:57:18,333 [INFO] keypoints_client.py 355: Creating key points graph data-file for results file: austin_survey_2016_multiple_kpa_results.csv
2022-09-22 11:57:18,334 [INFO] keypoints_client.py 330: reading file: austin_survey_2016_multiple_kpa_results.csv
2022-09-22 11:57:18,352 [INFO] keypoints_client.py 388: saving graph in file: austin_survey_2016_multiple_kpa_results_graph_data.json
2022-09-22 11:57:18,352 [INFO] keypoints_client.py 389: saving graph in file: austin_survey_2016_multiple_kpa_results_graph_data.json


You can now go to the [key points graph demo](https://keypoint-matching-ui.ris2-debater-event.us-east.containers.appdomain.cloud/) and load the graph's data file **austin_survey_2016_multiple_kpa_results_graph_data.json** into the UI.

## 3. Run *Key Point Analysis* incrementally
### 3.1 Run *Key Point Analysis* incrementally on new data (data from 2016 + 2017)
A year has passed, and we have collected additional data (data from 2017). We can now upload the 2017 data to the same domain (austin_demo) and have both 2016 and 2017 data in one domain.

In [16]:
comments_2017 = [c for c in comments if c['year'] == '2017']
comments_2017_sample = random.sample(comments_2017, sample_size)

domain = 'austin_demo'
comments_texts = [comment['text'] for comment in comments_2017_sample]
comments_ids = [comment['id'] for comment in comments_2017_sample]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)
keypoints_client.wait_till_all_comments_are_processed(domain=domain)

2022-09-22 11:57:18,358 [INFO] keypoints_client.py 497: uploading 400 comments in batches
2022-09-22 11:57:18,358 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:57:19,233 [INFO] keypoints_client.py 511: uploaded 400 comments, out of 400
2022-09-22 11:57:19,234 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:57:19,862 [INFO] keypoints_client.py 523: domain: austin_demo, comments status: {'processed_comments': 400, 'processed_sentences': 682, 'pending_comments': 400}
2022-09-22 11:57:29,866 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 11:57:30,508 [INFO] keypoints_client.py 523: domain: austin_demo, comments status: {'processed_comments': 800, 'processed_sentences': 1440, 'pending_comments': 0}


We can now run a new analysis over all the data in the domain, as we did before, and automatically extract new key points. We can assume that some will be identical to the key points extracted on the 2016 data, some will be similar and some key points will be new.

A better option is to run a new analysis but provide the key points from the 2016 analysis and let *Key Point Analysis* add new key points from the 2017 data, if there are any. One benefit of this approach is that the new result will mostly use the 2016 key points, so we will be able to compare the two results and see what changed, what improved and what did not. Another major benefit of this approach is run-time: the 2016 data was already analyzed with these key points, and since we have a cache in place, much of the computation can be avoided. The 2016 key points can be provided via the **run_params['keypoint_candidates'] = [...]** parameter, passing a list of strings, or we can use **run_params['keypoint_candidates_by_job_id'] = \<job_id\>** and provide the job_id of an analysis job. KPA will then take the key points from that job's result automatically. We will use this parameter and provide the *kpa_analysis_2016_job_id* we saved in advance.
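The two ways of supplying candidate key points can be captured in a small helper that builds the run_params dictionary. `build_incremental_run_params` is a hypothetical name. Requiring exactly one of the two options is a conservative choice made for this sketch, not a documented restriction of the service.

```python
def build_incremental_run_params(base_params, candidates=None,
                                 candidates_job_id=None):
    """Return a copy of base_params extended with candidate key points.

    Exactly one of candidates (list of strings) or candidates_job_id
    must be given. Hypothetical helper for illustration only.
    """
    if (candidates is None) == (candidates_job_id is None):
        raise ValueError(
            'provide exactly one of candidates or candidates_job_id')
    run_params = dict(base_params)  # do not mutate the caller's dictionary
    if candidates is not None:
        run_params['keypoint_candidates'] = list(candidates)
    else:
        run_params['keypoint_candidates_by_job_id'] = candidates_job_id
    return run_params


base = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95,
        'sentence_to_multiple_kps': True}
by_list = build_incremental_run_params(
    base, candidates=['Improvement on traffic problem.'])
by_job = build_incremental_run_params(
    base, candidates_job_id='<job_id>')  # placeholder, not a real job id
print(by_list)
print(by_job)
```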

In [17]:
run_params = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True,
              'keypoint_candidates_by_job_id': kpa_analysis_2016_job_id}
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_result_2016_2017 = future.get_result(high_verbosity=True, polling_timout_secs=30)

2022-09-22 11:57:30,524 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:57:31,279 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True, 'keypoint_candidates_by_job_id': '632c234e116742ef75497071'}, job_id: 632c237b116742ef75497075
2022-09-22 11:57:31,282 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:57:31,813 [INFO] keypoints_client.py 760: job_id 632c237b116742ef75497075 is pending
2022-09-22 11:58:01,816 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:58:02,418 [INFO] keypoints_client.py 764: job_id 632c237b116742ef75497075 is running, progress: {'total_stages': 3, 'stage_0':

Stage 1/3: |--------------------------------------------------| 0.0% Complete



2022-09-22 11:59:03,096 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:59:03,689 [INFO] keypoints_client.py 764: job_id 632c237b116742ef75497075 is running, progress: {'total_stages': 3, 'stage_0': {'inferred_batches': 1, 'total_batches': 1, 'batch_size': 2000}, 'stage_1': {'inferred_batches': 0, 'total_batches': 28, 'batch_size': 2000}}


Stage 1/3: |--------------------------------------------------| 0.0% Complete



2022-09-22 11:59:33,694 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 11:59:34,303 [INFO] keypoints_client.py 764: job_id 632c237b116742ef75497075 is running, progress: {'total_stages': 3, 'stage_0': {'inferred_batches': 1, 'total_batches': 1, 'batch_size': 2000}, 'stage_1': {'inferred_batches': 4, 'total_batches': 28, 'batch_size': 2000}}


Stage 1/3: |███████-------------------------------------------| 14.3% Complete



2022-09-22 12:00:04,296 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:00:04,925 [INFO] keypoints_client.py 764: job_id 632c237b116742ef75497075 is running, progress: {'total_stages': 3, 'stage_0': {'inferred_batches': 1, 'total_batches': 1, 'batch_size': 2000}, 'stage_1': {'inferred_batches': 19, 'total_batches': 28, 'batch_size': 2000}}


Stage 1/3: |█████████████████████████████████-----------------| 67.9% Complete



2022-09-22 12:00:34,926 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:00:35,551 [INFO] keypoints_client.py 764: job_id 632c237b116742ef75497075 is running, progress: {'total_stages': 3, 'stage_0': {'inferred_batches': 1, 'total_batches': 1, 'batch_size': 2000}, 'stage_1': {'inferred_batches': 24, 'total_batches': 28, 'batch_size': 2000}}


Stage 1/3: |██████████████████████████████████████████--------| 85.7% Complete



2022-09-22 12:01:05,555 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:01:07,304 [INFO] keypoints_client.py 767: job_id 632c237b116742ef75497075 is done, returning result


In [18]:
KpAnalysisUtils.write_result_to_csv(kpa_result_2016_2017, 'austin_survey_2016_2017_kpa_results.csv')
from austin_utils import compare_results
comparison_df = compare_results(kpa_result_2016, '2016', kpa_result_2016_2017, '2016 + 2017')
display(comparison_df)

2022-09-22 12:01:07,317 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_2017_kpa_results_kps_summary.csv
2022-09-22 12:01:07,325 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_2017_kpa_results.csv


| | key point | 2016_n_sents | 2016_percent | 2016 + 2017_n_sents | 2016 + 2017_percent | change_n_sents | change_percent |
|---|---|---|---|---|---|---|---|
| 0 | Improvement on traffic problem. | 96 | 14.70% | 210 | 11.76% | 114 | -2.94% |
| 1 | Develop public transportation network. | 52 | 7.96% | 112 | 6.27% | 60 | -1.69% |
| 2 | TO HAVE BETTER PLANNING FOR CITY GROWTH. | 15 | 2.30% | 30 | 1.68% | 15 | -0.62% |
| 3 | Streamline the residential permitting process. | 11 | 1.68% | 19 | 1.06% | 8 | -0.62% |
| 4 | PLEASE MAKE THIS CITY MORE BIKE FRIENDLY. | 9 | 1.38% | 64 | 3.58% | 55 | 2.21% |
| 5 | Stop the gentrification. | 8 | 1.23% | 22 | 1.23% | 14 | 0.01% |
| 6 | Improve affordable housing/living. | 64 | 9.80% | --- | --- | --- | --- |
| 7 | COST OF UTILITIES IS VERY HIGH | 35 | 5.36% | --- | --- | --- | --- |
| 8 | DEVELOPMENT SHOULD NOT DISPLACE LOW INCOME FAM... | 10 | 1.53% | --- | --- | --- | --- |
| 9 | Attract a most diverse population to Austin | 6 | 0.92% | --- | --- | --- | --- |


### 3.2 Run *Key Point Analysis* incrementally on new data (2017 independently)
Using the **comments_ids** parameter of the **start_kp_analysis_job** method, we can run over a subset of the comments in the domain. Let's do that and run an analysis over the 2017 comments independently. We will provide the key points from 2016, since we want to be able to compare the two years:

In [19]:
comments_ids = [comment['id'] for comment in comments_2017_sample]
run_params = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True,
              'keypoint_candidates_by_job_id': kpa_analysis_2016_job_id}
future = keypoints_client.start_kp_analysis_job(comments_ids=comments_ids, domain=domain, run_params=run_params)
kpa_result_2017 = future.get_result(high_verbosity=True, polling_timout_secs=30)

KpAnalysisUtils.write_result_to_csv(kpa_result_2017, 'austin_survey_2017_kpa_results.csv')
comparison_df = compare_results(kpa_result_2016, '2016', kpa_result_2017, '2017')
display(comparison_df)

2022-09-22 12:01:07,353 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:01:08,490 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True, 'keypoint_candidates_by_job_id': '632c234e116742ef75497071'}, job_id: 632c2454116742ef7549707a
2022-09-22 12:01:08,493 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:01:09,093 [INFO] keypoints_client.py 760: job_id 632c2454116742ef7549707a is pending
2022-09-22 12:01:39,101 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:01:39,723 [INFO] keypoints_client.py 764: job_id 632c2454116742ef7549707a is running, progress: {'total_stages': 3, 'stage_0':

Stage 1/3: |--------------------------------------------------| 0.0% Complete



2022-09-22 12:02:09,726 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:02:11,195 [INFO] keypoints_client.py 767: job_id 632c2454116742ef7549707a is done, returning result
2022-09-22 12:02:11,202 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2017_kpa_results_kps_summary.csv
2022-09-22 12:02:11,210 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2017_kpa_results.csv


| | key point | 2016_n_sents | 2016_percent | 2017_n_sents | 2017_percent | change_n_sents | change_percent |
|---|---|---|---|---|---|---|---|
| 0 | Improvement on traffic problem. | 96 | 14.70% | 95 | 11.28% | -1 | -3.42% |
| 1 | Develop public transportation network. | 52 | 7.96% | 40 | 4.75% | -12 | -3.21% |
| 2 | TO HAVE BETTER PLANNING FOR CITY GROWTH. | 15 | 2.30% | 12 | 1.43% | -3 | -0.87% |
| 3 | PLEASE MAKE THIS CITY MORE BIKE FRIENDLY. | 9 | 1.38% | 35 | 4.16% | 26 | 2.78% |
| 4 | Improve affordable housing/living. | 64 | 9.80% | --- | --- | --- | --- |
| 5 | COST OF UTILITIES IS VERY HIGH | 35 | 5.36% | --- | --- | --- | --- |
| 6 | Streamline the residential permitting process. | 11 | 1.68% | --- | --- | --- | --- |
| 7 | DEVELOPMENT SHOULD NOT DISPLACE LOW INCOME FAM... | 10 | 1.53% | --- | --- | --- | --- |
| 8 | Stop the gentrification. | 8 | 1.23% | --- | --- | --- | --- |
| 9 | Attract a most diverse population to Austin | 6 | 0.92% | --- | --- | --- | --- |


Running over subsets of the data in the domain enables us to compare results between them. Subsets can be data from different GEOs, different organizations, different users (e.g. promoters/detractors), etc.
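
The comparison tables above report, for each key point, its sentence count and its share of sentences in both analyses, along with the change between them. Note that a count can grow while the share shrinks, because the two analyses run over different numbers of sentences. The sketch below shows one way such a comparison could be computed. It is not the implementation of `compare_results` from `austin_utils`; the function name, the row layout, and the counts and totals are all made up for illustration.

```python
def compare_key_point_counts(counts_a, total_a, counts_b, total_b):
    """Compare per-key-point sentence counts of two analyses (a sketch).

    counts_a / counts_b map key point text -> number of matched sentences.
    total_a / total_b are the numbers of sentences each analysis ran over.
    """
    rows = []
    for key_point, n_a in counts_a.items():
        percent_a = 100.0 * n_a / total_a
        row = {'key point': key_point, 'a_n_sents': n_a,
               'a_percent': round(percent_a, 2)}
        n_b = counts_b.get(key_point)
        if n_b is None:
            # No entry in the second analysis: mirrors the '---' cells.
            row.update(b_n_sents=None, b_percent=None,
                       change_n_sents=None, change_percent=None)
        else:
            percent_b = 100.0 * n_b / total_b
            # The change is taken on the unrounded shares, then rounded.
            row.update(b_n_sents=n_b, b_percent=round(percent_b, 2),
                       change_n_sents=n_b - n_a,
                       change_percent=round(percent_b - percent_a, 2))
        rows.append(row)
    return rows


# Made-up counts and totals, for illustration only.
rows = compare_key_point_counts({'key point A': 10, 'key point B': 5}, 100,
                                {'key point A': 30}, 400)
for row in rows:
    print(row)
```

Here 'key point A' gains 20 sentences, yet its share drops from 10% to 7.5% because the second analysis covers four times as many sentences.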

## 4. Run *Key Point Analysis* on each stance separately
In many use cases (surveys, customer feedback, etc.) the comments have a positive and/or negative stance, and it is useful to run a separate KPA analysis on each stance. Most stance detection models don't perform well on survey data (or on customer feedback and similar data), since the comments tend to contain many suggestions, and a suggestion tends to appear positive to the model even though the user is proposing to improve something that needs improvement.
To that end we trained a stance model that handles suggestions well and labels each sentence as 'Positive', 'Negative', 'Neutral' or 'Suggestion'. We usually treat suggestions as negatives and run two separate analyses: the first over 'Positive' sentences and the second over 'Negative' and 'Suggestion' sentences.

This has the following advantages:
* Creates a separate positive/negative summary that shows clearly what works well and what needs to be improved.
* Filters out neutral sentences that usually don't contain valuable information.
* Helps the matching model avoid stance mistakes (matching a positive sentence to a negative key point and vice-versa).
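
The routing described above (suggestions go with the negatives, neutral sentences are dropped) can be sketched in a few lines of plain Python. This is only an illustration of the idea: the service does the labeling and routing itself, and the function name, the bucket names and the hand-assigned labels below are assumptions.

```python
from collections import defaultdict

# Assumed routing, following the text: suggestions are treated as negatives,
# and neutral sentences are left out of both analyses.
LABEL_TO_ANALYSIS = {
    'Positive': 'positive',
    'Negative': 'negative',
    'Suggestion': 'negative',
}


def split_by_stance(labeled_sentences):
    """Group (text, label) pairs into the two per-stance analyses."""
    buckets = defaultdict(list)
    for text, label in labeled_sentences:
        analysis = LABEL_TO_ANALYSIS.get(label)
        if analysis is not None:  # 'Neutral' has no bucket and is skipped
            buckets[analysis].append(text)
    return dict(buckets)


# Labels assigned by hand for illustration, not produced by the stance model.
labeled = [
    ('City services are outstanding!', 'Positive'),
    ('Property taxes are too high.', 'Negative'),
    ('Consider better developed bike lanes.', 'Suggestion'),
    ('I have lived here for ten years.', 'Neutral'),
]
buckets = split_by_stance(labeled)
print(buckets)
```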

Let's run again over the Austin survey dataset, but this time create two separate KPA analyses (positive and negative). We first need to create a new domain with the domain_param **do_stance_analysis**.

In [20]:
domain = 'austin_demo_two_stances'
domain_params = {'do_stance_analysis': True}
KpAnalysisUtils.create_domain_ignore_exists(client=keypoints_client, domain=domain, domain_params=domain_params)

2022-09-22 12:02:11,232 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/domains
2022-09-22 12:02:12,147 [INFO] keypoints_client.py 475: created domain: austin_demo_two_stances with domain_params: {'do_stance_analysis': True}
2022-09-22 12:02:12,149 [INFO] keypoints_client.py 270: domain: austin_demo_two_stances was created


Let's upload all 2016 comments to the new domain and wait for them to be processed. This time the sentences' stance is also calculated.

In [21]:
comments_texts = [comment['text'] for comment in comments_2016]
comments_ids = [comment['id'] for comment in comments_2016]
keypoints_client.upload_comments(domain=domain, comments_ids=comments_ids, comments_texts=comments_texts)
keypoints_client.wait_till_all_comments_are_processed(domain=domain)

2022-09-22 12:02:12,166 [INFO] keypoints_client.py 497: uploading 1588 comments in batches
2022-09-22 12:02:12,169 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 12:02:13,378 [INFO] keypoints_client.py 511: uploaded 1588 comments, out of 1588
2022-09-22 12:02:13,381 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 12:02:14,104 [INFO] keypoints_client.py 523: domain: austin_demo_two_stances, comments status: {'processed_comments': 0, 'processed_sentences': 0, 'pending_comments': 1588}
2022-09-22 12:02:24,111 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-09-22 12:02:24,710 [INFO] keypoints_client.py 523: domain: austin_demo_two_stances, comments status: {'processed_comments': 1588, 'processed_sentences': 2708, 'pending_comments': 0}


We can download the processed sentences and save them into a csv if we want to examine the processed data.

In [22]:
sentences = keypoints_client.get_sentences_for_domain(domain=domain)
KpAnalysisUtils.write_sentences_to_csv(sentences, f'{domain}_sentences.csv')

2022-09-22 12:02:24,723 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/data
2022-09-22 12:02:26,552 [INFO] keypoints_client.py 710: returning 2708 sentences for domain austin_demo_two_stances


And now, run two analyses, one over the positive sentences and one over the negative + suggestions.

In [23]:
run_params = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 
              'sentence_to_multiple_kps': True}
run_params['stances_to_run'] = ['pos']
run_params['stances_threshold'] = 0.5
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params)
kpa_pos_result = future.get_result(high_verbosity=True, polling_timout_secs=30)
print_results(kpa_pos_result, n_sentences_per_kp=2, title='Random sample positives')

2022-09-22 12:02:27,285 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:02:28,055 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo_two_stances, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True, 'stances_to_run': ['pos'], 'stances_threshold': 0.5}, job_id: 632c24a4116742ef7549707f
2022-09-22 12:02:28,057 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:02:28,661 [INFO] keypoints_client.py 760: job_id 632c24a4116742ef7549707f is pending
2022-09-22 12:02:58,674 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:02:59,748 [INFO] keypoints_client.py 767: job_id 632c24a4116742ef7549707f is done, returning result


Random sample positives coverage: 3.16
Random sample positives key points:
3 - RESIDENTIAL SERVICES ARE EXCELLENT!
	- City services (water, streets, electric) are outstanding!!
	- I was extremely impressed with the response time & professionalism of the workers!


As in many surveys, most comments are negative or suggestions, so the positive analysis is relatively limited. Let's see how the negative analysis goes.

In [24]:
run_params['stances_to_run'] = ['neg', 'sug']
run_params['stances_threshold'] = 0.5
future = keypoints_client.start_kp_analysis_job(domain=domain, run_params=run_params, comments_ids=comments_ids)
kpa_neg_result = future.get_result(high_verbosity=True, polling_timout_secs=30)

2022-09-22 12:02:59,764 [INFO] keypoints_client.py 424: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:03:01,482 [INFO] keypoints_client.py 579: started a kp analysis job - domain: austin_demo_two_stances, run_params: {'clustering_threshold': 0.95, 'mapping_threshold': 0.95, 'sentence_to_multiple_kps': True, 'stances_to_run': ['neg', 'sug'], 'stances_threshold': 0.5}, job_id: 632c24c5116742ef75497082
2022-09-22 12:03:01,485 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:03:02,080 [INFO] keypoints_client.py 760: job_id 632c24c5116742ef75497082 is pending
2022-09-22 12:03:32,089 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:03:32,873 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2,

Stage 1/2: |--------------------------------------------------| 0.0% Complete



2022-09-22 12:04:02,885 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:04:03,574 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 0, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |--------------------------------------------------| 0.0% Complete



2022-09-22 12:04:33,584 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:04:34,228 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 0, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |--------------------------------------------------| 0.0% Complete



2022-09-22 12:05:04,236 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:05:04,882 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 7, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |███-----------------------------------------------| 7.4% Complete



2022-09-22 12:05:34,891 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:05:35,573 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 20, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |██████████----------------------------------------| 21.3% Complete



2022-09-22 12:06:05,583 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:06:06,210 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 20, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |██████████----------------------------------------| 21.3% Complete



2022-09-22 12:06:36,222 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:06:36,885 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 22, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |███████████---------------------------------------| 23.4% Complete



2022-09-22 12:07:06,888 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:07:07,541 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 28, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |██████████████------------------------------------| 29.8% Complete



2022-09-22 12:07:37,549 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:07:38,211 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 40, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |█████████████████████-----------------------------| 42.6% Complete



2022-09-22 12:08:08,216 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:08:08,923 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 40, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |█████████████████████-----------------------------| 42.6% Complete



2022-09-22 12:08:38,925 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:08:39,541 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 42, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |██████████████████████----------------------------| 44.7% Complete



2022-09-22 12:09:09,547 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:09:10,276 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 56, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |█████████████████████████████---------------------| 59.6% Complete



2022-09-22 12:09:40,283 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:09:40,913 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 60, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |███████████████████████████████-------------------| 63.8% Complete



2022-09-22 12:10:10,920 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:10:11,596 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 63, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |█████████████████████████████████-----------------| 67.0% Complete



2022-09-22 12:10:41,602 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:10:42,273 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 68, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |████████████████████████████████████--------------| 72.3% Complete



2022-09-22 12:11:12,280 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:11:12,906 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 80, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |██████████████████████████████████████████--------| 85.1% Complete



2022-09-22 12:11:42,910 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:11:43,571 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 83, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |████████████████████████████████████████████------| 88.3% Complete



2022-09-22 12:12:13,578 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:12:14,240 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 93, 'total_batches': 94, 'batch_size': 2000}}


Stage 1/2: |█████████████████████████████████████████████████-| 98.9% Complete



2022-09-22 12:12:44,245 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:12:44,864 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 94, 'total_batches': 94, 'batch_size': 2000}, 'stage_2': {'inferred_batches': 0, 'total_batches': 8, 'batch_size': 2000}}


Stage 2/2: |--------------------------------------------------| 0.0% Complete



2022-09-22 12:13:14,870 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:13:15,479 [INFO] keypoints_client.py 764: job_id 632c24c5116742ef75497082 is running, progress: {'total_stages': 2, 'stage_1': {'inferred_batches': 94, 'total_batches': 94, 'batch_size': 2000}, 'stage_2': {'inferred_batches': 4, 'total_batches': 8, 'batch_size': 2000}}


Stage 2/2: |█████████████████████████-------------------------| 50.0% Complete



2022-09-22 12:13:45,486 [INFO] keypoints_client.py 424: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-09-22 12:13:47,575 [INFO] keypoints_client.py 767: job_id 632c24c5116742ef75497082 is done, returning result


Let's print the results:

In [25]:
print_results(kpa_neg_result, n_sentences_per_kp=2, title='Random sample negatives')

Random sample negatives coverage: 70.91
Random sample negatives key points:
541 - Address transportation problems NOW.
	- PLEASE FIX THE TRAFFIC PROBLEMS, TOLL ROADS ARE NOT THE ANSWER
	- PLEASE WORK ON TRAFFIC.
294 - Work on making Austin affordable again.
	- The lack of affordable housing in Austin has pushed many of the original Austinites out
	  of the city.
	- PLEASE KEEP THE COST OF LIVING IN AUSTIN AFFORDABLE SO WE CAN STAY
140 - Property taxes are outrageous.
	- PROPERTY TAX OUTRAGEOUS!
	- PROPERTY TAXES ARE TOO HIGH.
138 - Spend our tax dollars wisely!!!
	- I can no longer afford to live in Austin due to the high property taxes-look for, find
	  and eliminate tax dollar waste!
	- Cost, waste,, us taxpayers a lot.
129 - Better pedestrian and biking lifestyle options.
	- CONSIDER BETTER DEVELOPED BIKE LANES THROUGHOUT THE CITY.
	- PUBLIC MASS TRANSIT,HOV,BIKE LANES AND WALKABLE NEIGHBORHOODS.
129 - SMARTER TRAFFIC MANAGEMENT.
	- I would like to gave seen the money I paid go towa

With 70.9% coverage, most of the sentences are matched to the 20 automatically extracted key points.
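
For readers who want to recompute a coverage figure themselves, here is a minimal sketch. It assumes that the result is a dict with a `keypoint_matchings` list, that each entry holds a `keypoint` and a `matching` list whose items carry `comment_id` and `sentence_id`, and that unmatched sentences are listed under a key point named `'none'`. These layout details are assumptions and should be checked against an actual result before relying on them.

```python
def coverage_percent(kpa_result, unmatched_label='none'):
    """Percentage of sentences matched to at least one key point (a sketch)."""
    matched, all_sentences = set(), set()
    for kp_matching in kpa_result['keypoint_matchings']:
        for match in kp_matching['matching']:
            # With sentence_to_multiple_kps a sentence may appear under several
            # key points, so sentences are de-duplicated by their ids.
            sentence_key = (match['comment_id'], match['sentence_id'])
            all_sentences.add(sentence_key)
            if kp_matching['keypoint'] != unmatched_label:
                matched.add(sentence_key)
    if not all_sentences:
        return 0.0
    return 100.0 * len(matched) / len(all_sentences)


# Toy result in the assumed layout: 3 of the 4 distinct sentences are matched.
toy_result = {'keypoint_matchings': [
    {'keypoint': 'key point A', 'matching': [
        {'comment_id': '1', 'sentence_id': 0},
        {'comment_id': '2', 'sentence_id': 0}]},
    {'keypoint': 'key point B', 'matching': [
        {'comment_id': '2', 'sentence_id': 0},
        {'comment_id': '3', 'sentence_id': 0}]},
    {'keypoint': 'none', 'matching': [
        {'comment_id': '4', 'sentence_id': 0}]},
]}
toy_coverage = coverage_percent(toy_result)
print(toy_coverage)
```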

We can increase the stances_threshold when we want to run over fewer sentences, each with a stronger stance. This is useful when we have a large dataset with many less-relevant sentences and we want to filter them out.
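
To get a feel for the effect of the threshold, the sketch below filters a handful of sentences at two threshold values. It assumes that a sentence is kept when its predicted stance is one of `stances_to_run` and its stance confidence is at least `stances_threshold`; the exact rule the service applies is not shown in this tutorial, and the record layout and confidence values are made up.

```python
def select_sentences(scored_sentences, stances_to_run, stances_threshold):
    """Keep sentences with a requested stance and enough confidence (a sketch)."""
    return [s['text'] for s in scored_sentences
            if s['stance'] in stances_to_run
            and s['confidence'] >= stances_threshold]


# Hypothetical stance predictions; the confidences are invented.
scored = [
    {'text': 'Property taxes are outrageous.', 'stance': 'neg', 'confidence': 0.97},
    {'text': 'Please work on traffic.', 'stance': 'sug', 'confidence': 0.81},
    {'text': 'The parks could be a bit cleaner.', 'stance': 'sug', 'confidence': 0.55},
    {'text': 'City services are outstanding!', 'stance': 'pos', 'confidence': 0.93},
]

lenient = select_sentences(scored, ['neg', 'sug'], 0.5)
strict = select_sentences(scored, ['neg', 'sug'], 0.9)
print(len(lenient), len(strict))
```

Raising the threshold from 0.5 to 0.9 leaves only the sentence with the strongest negative stance in this toy example.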

We can mark the stance in the results:

In [26]:
kpa_pos_result = KpAnalysisUtils.set_stance_to_result(kpa_pos_result, 'pos')
kpa_neg_result = KpAnalysisUtils.set_stance_to_result(kpa_neg_result, 'neg')

And save the results (positive and negative separately, and merged) and create the key points graph data files as we did before:

In [27]:
pos_result_file = 'austin_survey_2016_pro_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_pos_result, pos_result_file)
KpAnalysisUtils.create_graph_data_file_for_ui(pos_result_file)

neg_result_file = 'austin_survey_2016_neg_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_neg_result, neg_result_file)
KpAnalysisUtils.create_graph_data_file_for_ui(neg_result_file)

kpa_merged_result = KpAnalysisUtils.merge_two_results(kpa_pos_result, kpa_neg_result)
merged_result_file = 'austin_survey_2016_merged_kpa_results.csv'
KpAnalysisUtils.write_result_to_csv(kpa_merged_result, merged_result_file)
KpAnalysisUtils.create_graph_data_file_for_ui(merged_result_file)

2022-09-22 12:13:47,590 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_pro_kpa_results_kps_summary.csv
2022-09-22 12:13:47,593 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_pro_kpa_results.csv
2022-09-22 12:13:47,596 [INFO] keypoints_client.py 355: Creating key points graph data-file for results file: austin_survey_2016_pro_kpa_results.csv
2022-09-22 12:13:47,596 [INFO] keypoints_client.py 330: reading file: austin_survey_2016_pro_kpa_results.csv
2022-09-22 12:13:47,602 [INFO] keypoints_client.py 388: saving graph in file: austin_survey_2016_pro_kpa_results_graph_data.json
2022-09-22 12:13:47,602 [INFO] keypoints_client.py 389: saving graph in file: austin_survey_2016_pro_kpa_results_graph_data.json
2022-09-22 12:13:47,633 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_neg_kpa_results_kps_summary.csv
2022-09-22 12:13:47,639 [INFO] keypoints_client.py 115: Writing dataframe to: austin_survey_2016_neg_kpa_resu

We can also use the incremental approach when running on each stance separately. To do so, we need to provide the job_id of the 2016 positive analysis when running on the positive sentences of 2016 + 2017, and the job_id of the 2016 negative analysis when running on the negative sentences of 2016 + 2017. For simplicity, we didn't combine these features in this tutorial.
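
As a rough sketch of what that combination could look like, the helper below builds one `run_params` dict per stance from the parameters already used in this tutorial (`stances_to_run`, `stances_threshold` and `keypoint_candidates_by_job_id`). The helper name is hypothetical and the job ids are placeholders to be replaced with the ids of your own 2016 per-stance jobs; the combination was not run in this tutorial.

```python
BASE_RUN_PARAMS = {'clustering_threshold': 0.95, 'mapping_threshold': 0.95,
                   'sentence_to_multiple_kps': True}


def incremental_stance_run_params(stances, previous_job_id, stances_threshold=0.5):
    """Build run_params for one stance, seeded with an earlier job's key points."""
    run_params = dict(BASE_RUN_PARAMS)  # copy, so the base dict stays untouched
    run_params['stances_to_run'] = list(stances)
    run_params['stances_threshold'] = stances_threshold
    run_params['keypoint_candidates_by_job_id'] = previous_job_id
    return run_params


# Placeholder job ids: substitute the ids of the 2016 per-stance analyses.
pos_params = incremental_stance_run_params(['pos'], '<2016 positive job_id>')
neg_params = incremental_stance_run_params(['neg', 'sug'], '<2016 negative job_id>')
print(pos_params)
print(neg_params)
```

Each dict would then be passed as `run_params` to **start_kp_analysis_job** on a domain that holds both the 2016 and the 2017 comments, in the same way as in the cells above.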

## 5. Cleanup
If you finished the tutorial and no longer need the domains and the results, or want to run the tutorial again from scratch, delete the domains:

In [28]:
delete_domains = False
if delete_domains:
    KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain='austin_demo')
    KpAnalysisUtils.delete_domain_ignore_doesnt_exist(client=keypoints_client, domain='austin_demo_two_stances')

## 6. Conclusion
In this tutorial, we showed how to use the *Key Point Analysis* service and how it provides detailed insights over survey data right out of the box, significantly reducing the effort required of a data scientist to analyze the data. We also demonstrated key features of *Key Point Analysis*: how to modify the analysis parameters and increase coverage, how to use the stance model and create per-stance results, how to create a *key points graph* and further improve the quality and clarity of the results, and how to incrementally add new data.

Feel free to contact us for questions or assistance: *yoavka@il.ibm.com*