# Analysing Medical Transcripts with Project Debater 


### Introduction to Project Debater  

Project Debater is the first AI system that can debate humans on complex topics. Project Debater digests massive texts, constructs a well-structured speech on a given topic, delivers it with clarity and purpose, and rebuts its opponent. Eventually, Project Debater will help people reason by providing compelling, evidence-based arguments and limiting the influence of emotion, bias, or ambiguity. 


- In this notebook you will learn how to use Project Debater to analyse and derive insights from medical transcripts.


**For prerequisites please refer to this [GitHub Repository](https://github.com/IBM/Analysing-Medical-Transcipts-using-Project-Debater)**

**Please also make sure to use the helper-function script [austin_utils.py](https://github.ibm.com/TechnologyGarageUKI/Project-Debater/blob/master/Code/austin_utils.py)**

### Data

**The data that you will explore in this notebook contains sample medical transcriptions for various medical specialities.**

You can download this data directly [using this link](https://www.kaggle.com/tboyle10/medicaltranscriptions) 

**Let's start by installing the SDK, importing the required Python packages and loading our data into the notebook.**

In [None]:
print('Set Api-Key:')
api_key = ''

print('Install Early-Access-Program SDK:')
!wget -P . https://early-access-program.debater.res.ibm.com/sdk/python_api.tar.gz
!tar -xvf python_api.tar.gz
!cd python_api ; pip install .
!rm -f python_api.tar.gz*

print('Retrieve dataset and additional code from the GitHub repo: https://github.com/IBM/Analysing-Medical-Transcipts-using-Project-Debater :') 
!rm -f mtsamples_descriptions_clean*
!rm -f austin_utils*


!wget -P . https://raw.githubusercontent.com/IBM/Analysing-Medical-Transcipts-using-Project-Debater/main/Data/mtsamples_descriptions_clean.csv
!wget -P . https://raw.githubusercontent.com/IBM/Analysing-Medical-Transcipts-using-Project-Debater/main/Data/austin_utils.py

In [None]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import csv
import plotly.express as px
import urllib.request



In [None]:
    
with open('./mtsamples_descriptions_clean.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    sentences = list(reader)

In [None]:
print('There are %d sentences in the dataset' % len(sentences))
print('Each sentence is a dictionary with the following keys: %s' % str(sentences[0].keys()))


In [None]:
viz_sentences = pd.DataFrame(sentences)
viz_sentences.head()

**Initialise Debater client**

The Key Point Analysis service stores the data (and results cache) in a domain. A user can create several domains, one for each dataset. Domains are only accessible to the user who created them.

Full documentation of the Key Point Analysis service can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoints_pydoc.html)

In [None]:
# Initialise client
from debater_python_api.api.debater_api import DebaterApi
from austin_utils import init_logger
# import os

init_logger()
api_key = ''  # Paste your API key here; note that this overrides the value set in the first cell
debater_api = DebaterApi(apikey=api_key)
keypoints_client = debater_api.get_keypoints_client()

domain = 'medical_demo'


**Select top 1000 sentences from data using _Argument Quality_ service**

In [None]:
from austin_utils import print_top_and_bottom_k_sentences

def get_top_quality_sentences(sentences, top_k, topic):    
    arg_quality_client = debater_api.get_argument_quality_client()
    sentences_topic = [{'sentence': sentence['text'], 'topic': topic} for sentence in sentences]
    arg_quality_scores = arg_quality_client.run(sentences_topic)
    sentences_and_scores = zip(sentences, arg_quality_scores)
    sentences_and_scores_sorted = sorted(sentences_and_scores, key=lambda x: x[1], reverse=True)
    sentences_sorted = [sentence for sentence, _ in sentences_and_scores_sorted]
    print_top_and_bottom_k_sentences(sentences_sorted, 10)
    return sentences_sorted[:top_k]

sentences_top_1000_aq = get_top_quality_sentences(sentences, 1000, 
                            'The patient is a 30-year-old who was admitted with symptoms including obstructions, failures and pain that started four days ago.')
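
The ranking step inside `get_top_quality_sentences` can be tried offline, without calling the service. The sketch below is a minimal illustration that assumes the Argument Quality client returns one numeric score per input sentence, which is how the function above uses its result. The sentences, the scores and the helper name `top_k_by_score` are made up for the example.

```python
# Toy stand-ins: in the notebook the scores come from arg_quality_client.run(...)
toy_sentences = [
    {'id': '0', 'text': 'Chest pain radiating to the left arm.'},
    {'id': '1', 'text': 'Follow up.'},
    {'id': '2', 'text': 'Fever and cough for three days.'},
]
toy_scores = [0.81, 0.12, 0.64]  # made-up values, one per sentence


def top_k_by_score(sentences, scores, top_k):
    """Return the top_k sentences ordered by descending score (hypothetical helper)."""
    paired = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in paired[:top_k]]


top_two = top_k_by_score(toy_sentences, toy_scores, 2)
print([sentence['id'] for sentence in top_two])
```

Sorting on the score alone (`pair[1]`) matters here: the sentences are dictionaries, so sorting the raw pairs would fail whenever two scores tie.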

In [None]:
def run_kpa(sentences, run_params):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.delete_domain_cannot_be_undone(domain) # Clear domain in case it existed already

    keypoints_client.upload_comments(domain=domain, 
                                     comments_ids=sentences_ids, 
                                     comments_texts=sentences_texts, 
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain=domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, 
                                                    comments_ids=sentences_ids, 
                                                    run_params=run_params)

    kpa_result = future.get_result(high_verbosity=False, 
                                   polling_timout_secs=5)
    
    return kpa_result, future.get_job_id()

* **mapping_threshold** (float in [0.0, 1.0], 0.99 by default): The matching threshold. Scores above it are considered a match. A higher threshold gives higher precision and lower coverage.
* **n_top_kps** (integer, default set by an internal algorithm): The number of key points to generate. A lower value makes the job finish faster. All sentences are re-mapped to these key points.
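
To get a feel for the precision and coverage trade-off behind `mapping_threshold`, the following sketch applies different thresholds to a handful of made-up match scores. It only illustrates the rule "scores above the threshold are considered a match"; the scores and the `coverage` helper are hypothetical, and the service's internal matching logic is not reproduced here.

```python
# Hypothetical best-match scores for six sentences (not real service output)
toy_match_scores = [0.999, 0.97, 0.96, 0.91, 0.80, 0.42]


def coverage(scores, mapping_threshold):
    """Fraction of sentences whose score is above the threshold."""
    matched = [score for score in scores if score > mapping_threshold]
    return len(matched) / len(scores)


# Lowering the threshold maps more sentences to key points
for threshold in (0.99, 0.95, 0.50):
    print(threshold, round(coverage(toy_match_scores, threshold), 2))
```

With these toy values, only one sentence in six passes the default of 0.99, while half of them pass 0.95, the value used in the next cell.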

In [None]:
from austin_utils import print_results

kpa_result, _ = run_kpa(sentences_top_1000_aq, {'n_top_kps': 20,
                                                'mapping_threshold': 0.95})
# print_results(kpa_result, n_sentences_per_kp=2, title='Top 1000 sample')

**Explore results**

In [None]:
from austin_utils import print_results_in_a_table
print_results_in_a_table(kpa_result, n_sentences_per_kp=5, title='Top 1000 sample')

**Export results to dataframe**

In [None]:
def result_to_df(result):  
    matchings_rows = []
    for keypoint_matching in result['keypoint_matchings']:
        kp = keypoint_matching['keypoint']
        for match in keypoint_matching['matching']:
            match_row = [kp, match["sentence_text"], match["score"], match["comment_id"], match["sentence_id"],
                            match["sents_in_comment"], match["span_start"], match["span_end"], match["num_tokens"],
                            match["argument_quality"]]

            matchings_rows.append(match_row)

    cols = ["kp", "sentence_text", "match_score", 'comment_id', 'sentence_id', 'sents_in_comment', 'span_start',
            'span_end', 'num_tokens', 'argument_quality']
    match_df = pd.DataFrame(matchings_rows, columns=cols)
    
    return match_df

df_results = result_to_df(kpa_result)
df_results.tail()

In [None]:
df_sentences = pd.DataFrame(sentences)
df_sentences

**Merge results to original dataset**

In [None]:
#df_sentences = pd.read_csv(data + '/mtsamples_descriptions_clean.csv')
#df_results['comment_id'] = df_results['comment_id'].astype(int)

df_merge = df_results.merge(df_sentences[['id', 'id_description', 'medical_specialty_new']], left_on='comment_id', right_on='id', validate = 'one_to_one')
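
As a small illustration of what `validate = 'one_to_one'` does in the merge above, the sketch below uses two toy dataframes (all values are made up). It assumes that the key columns hold the same type on both sides, here strings; if `comment_id` and `id` had different types, a cast such as the commented-out `astype` line would be needed first.

```python
import pandas as pd

# Toy stand-ins for df_results and df_sentences
toy_results = pd.DataFrame({'kp': ['Fever', 'Chest pain'], 'comment_id': ['1', '2']})
toy_sentences = pd.DataFrame({'id': ['1', '2', '3'],
                              'medical_specialty_new': ['Pediatrics', 'Cardiology', 'Surgery']})

# Inner join (the default): sentence '3' has no key point and is dropped
toy_merge = toy_results.merge(toy_sentences, left_on='comment_id', right_on='id',
                              validate='one_to_one')
print(len(toy_merge))

# A repeated key on either side makes validate='one_to_one' raise MergeError
duplicated = pd.concat([toy_results, toy_results.iloc[[0]]], ignore_index=True)
try:
    duplicated.merge(toy_sentences, left_on='comment_id', right_on='id',
                     validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

The check is a cheap safeguard: if a sentence were ever matched to more than one key point, the merge would fail loudly rather than silently duplicate rows.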

**Compare results to distribution of medical specialties**

In [None]:
# let's have a preliminary idea of how big each cluster is
df_merge['kp'].value_counts()

In [None]:
plt.figure(figsize=(10, 8))

df_merge['kp'].value_counts().plot(kind = 'barh', color = '#ff00bf')

In [None]:
for kp in df_merge['kp'].value_counts().index:
    df_merge[df_merge['kp'] == kp]['medical_specialty_new'].value_counts(normalize=True).plot(kind = 'bar')
    plt.title('KP: ' + kp)
    plt.show()

In [None]:
df_merge['medical_specialty_new'].value_counts(normalize=True).plot(kind = 'bar')
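
The per-key-point bar charts can also be summarised in one table. The sketch below is one possible way to do this with `pd.crosstab`, shown on a made-up dataframe that only borrows the column names `kp` and `medical_specialty_new` from `df_merge`. Each row gives the share of specialties within a key point, which can be read against the overall distribution.

```python
import pandas as pd

# Toy stand-in for df_merge (values are invented)
toy = pd.DataFrame({
    'kp': ['A', 'A', 'A', 'B'],
    'medical_specialty_new': ['Cardiology', 'Cardiology', 'Surgery', 'Surgery'],
})

# Rows: key points; columns: specialties; each row sums to 1
share_by_kp = pd.crosstab(toy['kp'], toy['medical_specialty_new'], normalize='index')

# Overall share of each specialty, for comparison
overall = toy['medical_specialty_new'].value_counts(normalize=True)

print(share_by_kp)
print(overall)
```

A key point whose row differs strongly from the overall shares is dominated by particular specialties; a row close to the overall shares suggests a theme that cuts across them.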

### Term Wikifier

This service identifies the Wikipedia articles that are referenced by words, phrases or ideas in a sentence. These references are called mentions. For each mention, the service returns several pieces of information, known together as the respective annotation.


In [None]:
def get_sentence_to_mentions(sentences_texts):
    term_wikifier_client = debater_api.get_term_wikifier_client()
    mentions_list = term_wikifier_client.run(sentences_texts)
    sentence_to_mentions = {}
    for sentence_text, mentions in zip(sentences_texts,    
                                       mentions_list):
        sentence_to_mentions[sentence_text] = set([mention['concept']['title'] for mention in mentions])
    
    return sentence_to_mentions

In [None]:
# Count Wikipedia terms in each key point
from collections import Counter
terms = {}
for kp in set(df_merge['kp'].values):
    sentence_to_mentions = get_sentence_to_mentions(df_merge['sentence_text'][df_merge['kp']==kp].values) # Extract Wikipedia terms
    all_mentions = [mention for sentence in sentence_to_mentions for mention in sentence_to_mentions[sentence]] # Put terms in list
    term_count = dict(Counter(all_mentions)) # Count terms and put in dictionary
    term_count.pop('History', None) # Drop the generic 'History' term if present
   
    terms[kp] = term_count
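
The counting logic can be checked without calling the Term Wikifier. The sketch below uses a hypothetical response that only mimics the one field read in this notebook, `mention['concept']['title']`; real annotations carry more information. The helper name `count_terms` is invented for the example. Unlike the cell above, which keys mentions by sentence text so that identical sentences collapse into one entry, this sketch counts every list entry.

```python
from collections import Counter

# Hypothetical wikifier output: one list of mentions per sentence
toy_mentions_list = [
    [{'concept': {'title': 'Fever'}}, {'concept': {'title': 'History'}}],
    [{'concept': {'title': 'Fever'}}, {'concept': {'title': 'Otitis media'}}],
    [{'concept': {'title': 'Fever'}}, {'concept': {'title': 'Fever'}}],
]


def count_terms(mentions_list, ignored=('History',)):
    """Count in how many sentences each Wikipedia title appears."""
    counts = Counter()
    for mentions in mentions_list:
        # A set ensures a title is counted once per sentence
        titles = {mention['concept']['title'] for mention in mentions}
        counts.update(titles - set(ignored))
    return counts


toy_counts = count_terms(toy_mentions_list)
print(toy_counts.most_common(2))
```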

In [None]:
# Check that it works
pd.DataFrame(list(terms[' Fever, otitis media, and possible sepsis.'].items()),columns = ['Term','Count']).sort_values(by = 'Count', ascending=False).head(10)

In [None]:
# Visualise
for kp in df_merge['kp'].value_counts().index:
    
    _df_viz = pd.DataFrame(list(terms[kp].items()),columns = ['Term','Count']).sort_values(by = 'Count', ascending=True)
    
    fig = px.bar(x = _df_viz['Count'].tail(10),
            y = _df_viz['Term'].tail(10),
            color=_df_viz['Term'].tail(10),
            color_discrete_sequence=px.colors.sequential.GnBu_r,
            orientation = 'h',
            title = 'Cluster:' + kp
            )

    fig.layout.update(showlegend = False, template = 'ggplot2', width = 700, height = 500,
                yaxis = dict(title_text = 'Top 10 Wikipedia Terms',showline = True, showticklabels = True, color = 'black'),
                xaxis = dict(title_text = 'Number of Mentions')
                )

    fig.show()