### Clinical Text Representation

`What is clinical text?`
* Clinical text consists of the words and sentences in clinical narratives
* Examples of narratives include case notes, family histories, and laboratory reports
* It can be in electronic or handwritten format

`What is different about clinical text?`
* Domain nuances
* Notations and abbreviations 
* Making sense of this text is important when designing pipelines and deciding how data pre-processing is done.

`Types of text representation`

There are different techniques for text representation. Which one to apply depends on:
* Practicality
* Team experience 
* Availability
* Cost of computation

Techniques used:
* One-hot encoding: based on binary vectors
* Count vectors: based on frequency, such as BoW and TF-IDF
* Embedding: encodes context for words and sentences 
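
To make the first two families concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The three toy sentences are hypothetical and are not taken from the guidelines, and the binary matrix is a document-level presence/absence vector rather than a per-word one-hot vector (the per-word version is the identity matrix built at the end).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy sentences, used only for illustration
docs = [
    "doxycycline 100 mg orally",
    "azithromycin 1 g orally",
    "doxycycline doxycycline alternative",
]

# Binary vectors: 1 if a vocabulary word occurs in the document, else 0
binary_vec = CountVectorizer(binary=True)
binary_matrix = binary_vec.fit_transform(docs).toarray()

# Bag of words: how many times each vocabulary word occurs
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs).toarray()

# TF-IDF: counts re-weighted so that words found in many documents count less
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs).toarray()

# Per-word one-hot vectors: row i is the vector for the word in column i
one_hot = np.eye(len(count_vec.vocabulary_), dtype=int)

col = count_vec.vocabulary_["doxycycline"]
print("binary:", binary_matrix[2, col], "count:", count_matrix[2, col])
print("one-hot for doxycycline:", one_hot[col])
```

None of these vectors encode context, which is the gap that embeddings such as fastText are meant to fill.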

This notebook explores training custom fastText models on clinical data.

Sample data from `STI Treatment Guidelines`

In [1]:
import nltk
# nltk.download('punkt')
from pprint import pprint
from gensim.models import FastText
import fasttext
from nltk import word_tokenize
import tika
from tika import parser
import string
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [2]:
raw_text = parser.from_file('data/STI-Guidelines-2021.pdf')

2023-01-26 09:52:03,943 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


In [3]:
# tika parser returns a dictionary
print(type(raw_text))

# Returned dictionary contains metadata and content 
# print(raw_text.items())

<class 'dict'>


In [4]:
# Here I pre-process the data: split the content into lines, drop empty lines,
# and lowercase the rest. Punctuation is stripped in the next cell.
rawtext_list = raw_text['content'].splitlines()
rawtext_list = [item.lower() for item in rawtext_list if item.strip()]

print(len(rawtext_list))

# sample sentence 
print(rawtext_list[5000])

23392
follow-up titer should not be repeated until approximately 


In [5]:
# Join list of sentences into a string 
sti_str = ' '.join(rawtext_list)
for c in string.punctuation:
    # remove every occurrence of this punctuation character
    sti_str = sti_str.replace(c, "")
len(sti_str)

1294718
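
As a side note, the same stripping can be done in a single pass with `str.translate`. This is a sketch of an equivalent alternative, not what the cell above runs; the helper name `strip_punctuation` and the sample string are hypothetical. It also shows two side effects that surface later in the similarity outputs: removing `/` glues `times/day` into `timesday`, and non-ASCII symbols such as `≤` are kept because `string.punctuation` only covers ASCII.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation from text in one pass."""
    return text.translate(_PUNCT_TABLE)


# Hypothetical sample, written to resemble dosing text
sample = "follow-up: 100 mg orally 2 times/day (≤35 years)"
cleaned = strip_punctuation(sample)
print(cleaned)  # followup 100 mg orally 2 timesday ≤35 years
```

Replacing punctuation with a space instead of an empty string would keep `times` and `day` as separate tokens, at the cost of splitting hyphenated terms.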

In [6]:
# I can view the output of the STI string by using pprint
pprint(sti_str)

('sexually transmitted infections treatment guidelines 2021 morbidity and '
 'mortality weekly report recommendations and reports  vol 70  no 4 july 23 '
 '2021  sexually transmitted infections treatment  guidelines 2021 us '
 'department of health and human services centers for disease control and '
 'prevention the mmwr series of publications is published by the center for '
 'surveillance epidemiology and laboratory services centers for disease '
 'control and prevention cdc  us department of health and human services '
 'atlanta ga 303294027 suggested citation author names first three then et al '
 'if more than six title mmwr recomm rep 202170no rrinclusive page numbers '
 'centers for disease control and prevention rochelle p walensky md mph '
 'director debra houry md mph acting principal deputy director daniel b '
 'jernigan md mph acting deputy director for public health science and '
 'surveillance rebecca bunnell phd med director office of science jennifer '
 'layden md phd 

In [8]:
# Now I can write the text to a file so that I can use it later
# to train my model. The path matches the one passed to train_unsupervised.
with open('data/sti2021.txt', 'w', encoding="utf-8") as text_file:
    text_file.write(sti_str)

In [12]:
# Now I will train 3 unsupervised fastText models (10, 20 and 50 epochs)
sti_model_10 = fasttext.train_unsupervised('data/sti2021.txt', epoch=10, dim=300)

# save the trained model to the path it is loaded from later
sti_model_10.save_model('data/sti2021_10.bin')

In [10]:
sti_model_20 = fasttext.train_unsupervised('data/sti2021.txt', epoch=20, dim=300)

# save the trained model to the path it is loaded from later
sti_model_20.save_model('data/sti2021_20.bin')

In [11]:
sti_model_50 = fasttext.train_unsupervised('data/sti2021.txt', epoch=50, dim=300)

# save the trained model to the path it is loaded from later
sti_model_50.save_model('data/sti2021_50.bin')

In [13]:
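# Load the models using gensim's FastText.
# Note: load_fasttext_format is deprecated in later gensim 3.x releases in favour of
# gensim.models.fasttext.load_facebook_model / load_facebook_vectors.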
ft_trained_model_10 = FastText.load_fasttext_format('data/sti2021_10.bin')
ft_trained_model_20 = FastText.load_fasttext_format('data/sti2021_20.bin')
ft_trained_model_50 = FastText.load_fasttext_format('data/sti2021_50.bin')



In [14]:
# Now I will use the models to find the 5 words most similar to a clinical term,
# together with their cosine similarity scores
print("For 10 epochs model{}\n" .format(ft_trained_model_10.wv.most_similar(["Chlamydia"], topn=5)))
print("For 20 epochs model{}\n" .format(ft_trained_model_20.wv.most_similar(["Chlamydia"], topn=5)))
print("For 50 epochs model{}\n" .format(ft_trained_model_50.wv.most_similar(["Chlamydia"], topn=5)))

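# For "Chlamydia", all three models return the same top four neighbours
# (chlamydia, chlamydial, gonorrhea, trachomatis), although the scores drop as epochs increase.
# The query is capitalised while the corpus was lowercased, so the closest match is the lowercase word itself.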

For 10 epochs model[('chlamydia', 0.9944021105766296), ('chlamydial', 0.8954966068267822), ('gonorrhea', 0.810062050819397), ('trachomatis', 0.7779380083084106), ('149', 0.7044327259063721)]

For 20 epochs model[('chlamydia', 0.9932137131690979), ('chlamydial', 0.8480893969535828), ('gonorrhea', 0.7431206107139587), ('trachomatis', 0.633487343788147), ('gonorrhoea', 0.563438355922699)]

For 50 epochs model[('chlamydia', 0.9863393902778625), ('chlamydial', 0.6639330387115479), ('gonorrhea', 0.5729731321334839), ('trachomatis', 0.4673398435115814), ('≤35', 0.3125862777233124)]
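
The scores returned by `most_similar` are cosine similarities between word vectors, not probabilities or confidence levels. A minimal NumPy sketch of the calculation, using made-up vectors rather than the trained embeddings:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional vectors (the real models use dim=300)
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])   # same direction as v1
v3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated in the embedding space.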



In [15]:
# For doxycycline, all three models return tetracycline as the nearest neighbour. This makes sense
# because both belong to the same class of antibiotics and can be substituted in practice.
print("For 10 epochs model{}\n" .format(ft_trained_model_10.wv.most_similar(["doxycycline"], topn=5)))
print("For 20 epochs model{}\n" .format(ft_trained_model_20.wv.most_similar(["doxycycline"], topn=5)))
print("For 50 epochs model{}\n" .format(ft_trained_model_50.wv.most_similar(["doxycycline"], topn=5)))

For 10 epochs model[('tetracycline', 0.881211519241333), ('mg', 0.8328772783279419), ('500', 0.7964303493499756), ('800', 0.7955899238586426), ('1g', 0.7763998508453369)]

For 20 epochs model[('tetracycline', 0.7427312135696411), ('azithromycin', 0.6004793047904968), ('100', 0.5848250985145569), ('timesday', 0.5741227269172668), ('cefoxitin', 0.5729455947875977)]

For 50 epochs model[('tetracycline', 0.5282465815544128), ('100', 0.4659227132797241), ('azithromycin', 0.4103972613811493), ('efficacious', 0.3827889859676361), ('orally', 0.3694431483745575)]



In [16]:
# Here I misspelled doxycycline as "doxycline". All 3 models still return doxycycline and tetracycline
# as the most similar words, because fastText builds a vector for an unseen word from its character n-grams.
print("For 10 epochs model{}\n" .format(ft_trained_model_10.wv.most_similar(["doxycline"], topn=5)))
print("For 20 epochs model{}\n" .format(ft_trained_model_20.wv.most_similar(["doxycline"], topn=5)))
print("For 50 epochs model{}\n" .format(ft_trained_model_50.wv.most_similar(["doxycline"], topn=5)))

For 10 epochs model[('doxycycline', 0.9831497669219971), ('tetracycline', 0.8936043977737427), ('mg', 0.8175896406173706), ('500', 0.7822662591934204), ('800', 0.7731585502624512)]

For 20 epochs model[('doxycycline', 0.9771069884300232), ('tetracycline', 0.7636500000953674), ('decline', 0.6026915907859802), ('cefoxitin', 0.5819800496101379), ('azithromycin', 0.5721902847290039)]

For 50 epochs model[('doxycycline', 0.9644572138786316), ('tetracycline', 0.6112635135650635), ('decline', 0.5081950426101685), ('100', 0.45984116196632385), ('efficacious', 0.41410011053085327)]
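
To see why the misspelling still lands next to the correct word, the sketch below lists the character n-grams that fastText-style models work with. It assumes the default n-gram lengths of 3 to 6 and the `<` and `>` word-boundary markers; the function name `char_ngrams` is hypothetical, and the real library additionally hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> set:
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


misspelled = char_ngrams("doxycline")
correct = char_ngrams("doxycycline")
shared = misspelled & correct

print(f"{len(shared)} of {len(misspelled)} n-grams of the misspelling occur in the correct word")
print(sorted(char_ngrams("decline") & misspelled))
```

The second print shows the n-grams that "doxycline" shares with "decline", which is a plausible reason for "decline" appearing among the neighbours in the 20 and 50 epoch models.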



In [20]:
# I will also use the models to identify which word does not belong in a group
print("For 10 epochs model {}\n" .format(ft_trained_model_10.wv.doesnt_match(["doxycycline", "tetracycline", "azithromycin"])))
print("For 20 epochs model {}\n" .format(ft_trained_model_20.wv.doesnt_match(["doxycycline", "tetracycline", "azithromycin"])))
print("For 50 epochs model {}\n" .format(ft_trained_model_50.wv.doesnt_match(["doxycycline", "tetracycline", "azithromycin"])))

# All three models pick azithromycin as the odd one out.
# This is correct: although all 3 medications are antibiotics, azithromycin is a macrolide,
# whereas doxycycline and tetracycline are both tetracyclines.

For 10 epochs model azithromycin

For 20 epochs model azithromycin

For 50 epochs model azithromycin
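
Roughly speaking, an odd-one-out query of this kind can be answered by averaging the unit-length word vectors and returning the word that is least similar to that average. The sketch below illustrates the idea with made-up 3-dimensional vectors; it is an approximation for intuition, not gensim's exact code, and the function name `odd_one_out` is hypothetical.

```python
import numpy as np


def odd_one_out(vectors: dict) -> str:
    """Return the word whose unit vector is least similar to the group mean."""
    words = list(vectors)
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit.mean(axis=0)
    scores = unit @ centre  # larger score = closer to the rest of the group
    return words[int(np.argmin(scores))]


# Hypothetical toy vectors: the first two point in a similar direction
toy_vectors = {
    "doxycycline": np.array([0.9, 0.1, 0.0]),
    "tetracycline": np.array([0.8, 0.2, 0.1]),
    "azithromycin": np.array([0.1, 0.9, 0.2]),
}

print(odd_one_out(toy_vectors))
```

With only three words the result depends heavily on which two vectors happen to be closest, so it is worth checking the pairwise similarities as well.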



In [21]:
# fastText models can also be used to compute the cosine similarity between two words
print("For 10 epochs model {}\n" .format(ft_trained_model_10.wv.similarity(w1='drowsiness', w2='headache')))
print("For 20 epochs model {}\n" .format(ft_trained_model_20.wv.similarity(w1='drowsiness', w2='headache')))
print("For 50 epochs model {}\n" .format(ft_trained_model_50.wv.similarity(w1='drowsiness', w2='headache')))

# Headache and drowsiness are both common symptoms. The models give them
# moderate similarity scores (about 0.34 to 0.49), which fall as epochs increase.

For 10 epochs model 0.4946211874485016

For 20 epochs model 0.4388333559036255

For 50 epochs model 0.33636391162872314



In [22]:
# When discharge and itching were compared, the models detected similarity.
# Although the spellings are different, the words are contextually similar:
# STIs may cause both itching and discharge.
print("For 10 epochs model {}\n" .format(ft_trained_model_10.wv.similarity(w1='discharge', w2='itching')))
print("For 20 epochs model {}\n" .format(ft_trained_model_20.wv.similarity(w1='discharge', w2='itching')))
print("For 50 epochs model {}\n" .format(ft_trained_model_50.wv.similarity(w1='discharge', w2='itching')))

For 10 epochs model 0.7411637306213379

For 20 epochs model 0.6067762970924377

For 50 epochs model 0.4368440508842468

