## Topic Models
Dynamic topic models can be used to visualize the topics in a collection of documents.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

Inspired by this notebook: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
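
To confirm that the runtime actually exposes a GPU after changing the setting, a quick check can help. The sketch below assumes an NVIDIA runtime where the `nvidia-smi` tool is on the `PATH`; `gpu_visible` is a hypothetical helper name, not part of Colab or BERTopic.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: treat as "no GPU visible"
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: ..."
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the hardware accelerator setting has probably not been applied to the current session.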

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [1]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, restart the notebook.

From the Menu:

Runtime → Restart Runtime
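
After the restart, it can be worth confirming that the package is still installed before continuing. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None if the install did not succeed
print("bertopic:", installed_version("bertopic"))
```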

# **Data**

In [12]:
%cd ../..

/content/drive/MyDrive/projects/medical_txt_parser


In [15]:
import glob
import pandas as pd
import os
from pprint import pprint

import matplotlib.pyplot as plt

from src.utils.parse_data import parse_ast, parse_concept, parse_relation

In [16]:
train_data_path = "data/train"
val_data_path = "data/val"
ast_folder_name = "ast"
concept_folder_name = "concept"
rel_folder_name = "rel"
txt_folder_name = "txt"

In [18]:
from tqdm import tqdm

text_files = glob.glob(os.path.join(train_data_path, txt_folder_name, "*.txt"))
rows = []
for file in tqdm(text_files):
    with open(file, "r") as f:
        text = f.read()
    # File stem, e.g. "record-142"; the matching annotations live in <stem>.con
    filename = os.path.splitext(os.path.basename(file))[0]
    concept = parse_concept(os.path.join(train_data_path, concept_folder_name, filename + ".con"))
    rows.append({"text": text, "filename": filename, "concept": concept})
# Build the frame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows)
df.head()

100%|██████████| 170/170 [00:21<00:00,  7.92it/s]


| | text | filename | concept |
|---|---|---|---|
| 0 | Admission Date :\n2017-08-14\nDischarge Date :... | record-142 | {'concept_text': ['cyanotic', 'a more pervasiv... |
| 1 | Admission Date :\n2014-10-21\nDischarge Date :... | record-54 | {'concept_text': ['intraparenchymal bleed', 'i... |
| 2 | Admission Date :\n2017-06-13\nDischarge Date :... | record-105 | {'concept_text': ['left basilar atelectasis', ... |
| 3 | Admission Date :\n2015-10-05\nDischarge Date :... | record-106 | {'concept_text': ['vomiting', 'asa', 'ck-mb', ... |
| 4 | Admission Date :\n2015-06-05\nDischarge Date :... | record-107 | {'concept_text': ['his respiratory distress', ... |


In [19]:
frames = []
for _, row in df.iterrows():
    # One row per annotated concept in this record
    tmp = pd.DataFrame(row["concept"])
    tmp.insert(0, "filename", row["filename"])
    frames.append(tmp)
# Concatenate once; DataFrame.append was removed in pandas 2.0
concept_df = pd.concat(frames, ignore_index=True)
concept_df.head()

| | filename | concept_text | start_line | start_word_number | end_line | end_word_number | concept_type |
|---|---|---|---|---|---|---|---|
| 0 | record-142 | cyanotic | 26 | 5 | 26 | 5 | problem |
| 1 | record-142 | a more pervasive process | 169 | 24 | 169 | 27 | problem |
| 2 | record-142 | old twi v4-6 | 106 | 0 | 106 | 2 | problem |
| 3 | record-142 | his new , severe global deficit | 169 | 10 | 169 | 15 | problem |
| 4 | record-142 | anoxic encephalopathy | 169 | 30 | 169 | 31 | problem |


In [21]:
texts = concept_df.concept_text.values.tolist()
texts

['cyanotic',
 'a more pervasive process',
 'old twi v4-6',
 'his new , severe global deficit',
 'anoxic encephalopathy',
 'rate',
 'ekg',
 'anemia of chronic disease',
 'sepsis',
 'the left ventricular inflow pattern',
 'impaired relaxation',
 'vancomycin',
 'mrsa',
 'pseudomonas',
 'mild ( 1+ ) aortic regurgitation',
 'jevity',
 'dusky',
 'long qt interval',
 'st depressions v4 , 5',
 'hr',
 'the ventilator',
 'pseudomonas aeruginosa',
 'bp',
 'qtc',
 'lansoprazole',
 'ac',
 'aortic valve stenosis',
 'hyponatremia',
 'asa',
 'mild st elevation in v3',
 'cxr',
 'vancomycin',
 'vt',
 'abg',
 'subsequent tracheostomy',
 'respiratory failure',
 'pulmonary edema',
 'lasix',
 'peep',
 'fio2',
 'old twi v4-6',
 'metoprolol',
 'failure to wean',
 'mildly dilated',
 'rr',
 's/p right pneumonectomy',
 'ac',
 'fio2',
 'zosyn',
 'hr',
 'effusion',
 'heparin',
 'blunting of left cpa',
 'bp',
 'fio2',
 'albuterol inhaler',
 'non-small cell lung cancer',
 'flagyl',
 'fi02',
 'new lower left lateral 

# **Dynamic Topic Modeling**


## Basic Topic Model
To perform Dynamic Topic Modeling with BERTopic, we first need to create a basic topic model using all texts. The temporal aspect is ignored for now, as we are only interested in the topics present in those texts.

In [22]:
from bertopic import BERTopic
topic_model = BERTopic(min_topic_size=35, verbose=True)
topics, _ = topic_model.fit_transform(texts)

Batches:   0%|          | 0/517 [00:00<?, ?it/s]

2022-01-13 23:16:41,493 - BERTopic - Transformed documents to Embeddings
2022-01-13 23:17:13,460 - BERTopic - Reduced dimensionality with UMAP
2022-01-13 23:17:17,007 - BERTopic - Clustered UMAP embeddings with HDBSCAN


We can then extract the most frequent topics:

In [23]:
freq = topic_model.get_topic_info(); freq.head(10)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4782 | -1_cultures_weakness_copd_pulmonary |
| 1 | 0 | 236 | 0_scan_ct_mri_contrast |
| 2 | 1 | 229 | 1_bowel_diarrhea_stools_urinary |
| 3 | 2 | 226 | 2_dvt_prophylaxis_lymphs_lvh |
| 4 | 3 | 172 | 3_carotid_stenosis_angioplasty_angiogram |
| 5 | 4 | 167 | 4_fluid_volume_fluids_saline |
| 6 | 5 | 150 | 5_reflexes_sleep_agitated_agitation |
| 7 | 6 | 149 | 6_pain_trigeminal_control_analgesia |
| 8 | 7 | 149 | 7_aortic_regurgitation_valve_mitral |
| 9 | 8 | 146 | 8_neuts_bs_nrb_fbs |
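
The first row of the table (topic -1, the label BERTopic uses for outliers) holds 4,782 documents, far more than any other topic. A minimal sketch for measuring that share from the `topics` list returned by `fit_transform` follows; `outlier_share` and `example_topics` are illustrative names, not part of BERTopic, and the toy list stands in for the real assignments.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Toy assignments standing in for the `topics` list from fit_transform
example_topics = [-1, 0, 0, 1, -1, 2, -1, -1]
print(f"{outlier_share(example_topics):.0%} of documents are outliers")
```

A high share may suggest that `min_topic_size` is too strict for short texts such as these concept phrases, though that is a judgment call rather than a rule.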


-1 refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [27]:
topic_nr = freq.iloc[0]["Topic"]  # Row 0 is the outlier topic (-1); use freq.iloc[1] to get topic 0, the largest non-outlier topic
topic_model.get_topic(topic_nr)   # Returns the topic's top words with their c-TF-IDF scores

[('cultures', 0.007831482072993435),
 ('weakness', 0.007452124208368343),
 ('copd', 0.00725176907345469),
 ('pulmonary', 0.007158068977883096),
 ('physical', 0.007065120259498133),
 ('ekg', 0.006936450306212081),
 ('pericardial', 0.0067155095249536775),
 ('hgb', 0.006601568572832218),
 ('mch', 0.006601568572832218),
 ('extremity', 0.0064627940838306265)]
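
The output is a list of `(word, score)` pairs sorted by score. In the topic table, the `Name` column appears to be the topic number joined with the top four words. The sketch below rebuilds such a label from the pairs; `topic_label` is a hypothetical helper, and the naming rule is inferred from the outputs shown here rather than taken from BERTopic's source.

```python
def topic_label(topic_id, words, n_words=4):
    """Join the topic id and its top `n_words` words with underscores."""
    top = [word for word, _score in words[:n_words]]
    return "_".join([str(topic_id)] + top)


# First five (word, score) pairs copied from the get_topic output above
words = [
    ("cultures", 0.007831482072993435),
    ("weakness", 0.007452124208368343),
    ("copd", 0.00725176907345469),
    ("pulmonary", 0.007158068977883096),
    ("physical", 0.007065120259498133),
]
print(topic_label(-1, words))  # -1_cultures_weakness_copd_pulmonary
```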

We can visualize the basic topics that were created with the Intertopic Distance Map.

In [25]:
fig = topic_model.visualize_topics(); fig