# Implementing Custom Word Embeddings for the Healthcare Domain

An implementation of the word2vec model proposed in the research paper 'Efficient Estimation of Word Representations in Vector Space' by T. Mikolov et al.

1.   Focus on the **Continuous Skip-gram** model architecture.
2.   Motivation: build custom embeddings that capture complex medical terminology.
3.   The approach can be scaled later by training on a larger corpus and using higher-dimensional word embeddings.



## Data preprocessing

In [1]:
# get raw data
corpus = "The patient was diagnosed with hypertension and prescribed an antihypertensive medication to help lower blood pressure. Regular monitoring of blood pressure levels was recommended to assess the effectiveness of the treatment. The doctor also advised lifestyle modifications, including a healthier diet, regular exercise, and reducing stress. Hypertension, if left untreated, can lead to serious complications such as heart disease, stroke, and kidney failure. The healthcare provider emphasized the importance of adherence to the prescribed medication and follow-up visits for ongoing management."



In [2]:
# tokenise text
import numpy as np
import re

def tokenise(text):
  text = re.sub(r"[^\w\s]", "", text.lower())
  tokens = text.split()
  return tokens

tokens = tokenise(corpus)
print(len(tokens))
tokens

81


['the',
 'patient',
 'was',
 'diagnosed',
 'with',
 'hypertension',
 'and',
 'prescribed',
 'an',
 'antihypertensive',
 'medication',
 'to',
 'help',
 'lower',
 'blood',
 'pressure',
 'regular',
 'monitoring',
 'of',
 'blood',
 'pressure',
 'levels',
 'was',
 'recommended',
 'to',
 'assess',
 'the',
 'effectiveness',
 'of',
 'the',
 'treatment',
 'the',
 'doctor',
 'also',
 'advised',
 'lifestyle',
 'modifications',
 'including',
 'a',
 'healthier',
 'diet',
 'regular',
 'exercise',
 'and',
 'reducing',
 'stress',
 'hypertension',
 'if',
 'left',
 'untreated',
 'can',
 'lead',
 'to',
 'serious',
 'complications',
 'such',
 'as',
 'heart',
 'disease',
 'stroke',
 'and',
 'kidney',
 'failure',
 'the',
 'healthcare',
 'provider',
 'emphasized',
 'the',
 'importance',
 'of',
 'adherence',
 'to',
 'the',
 'prescribed',
 'medication',
 'and',
 'followup',
 'visits',
 'for',
 'ongoing',
 'management']
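Note that the tokeniser above strips all punctuation, so "follow-up" in the corpus becomes the single token `followup`. If hyphenated terms should stay intact (which may matter for medical vocabulary), one possible alternative is sketched below. The function name `tokenise_keep_hyphens` is hypothetical and not part of the original notebook.

```python
import re

def tokenise_keep_hyphens(text):
    # Assumption: a token is a run of word characters, optionally
    # joined to further runs by single internal hyphens.
    return re.findall(r"\w+(?:-\w+)*", text.lower())

sample_tokens = tokenise_keep_hyphens("Follow-up visits for ongoing management.")
print(sample_tokens)
# ['follow-up', 'visits', 'for', 'ongoing', 'management']
```

Using this variant would change the token list and vocabulary, so the counts shown in this notebook apply only to the original `tokenise`.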

In [3]:
# build vocab
from collections import Counter
word_count = Counter(tokens)
vocab = dict(word_count)  # word -> frequency, in order of first appearance
print(f'Vocab size: {len(vocab)}')

Vocab size: 60


In [4]:
print(vocab)

{'the': 7, 'patient': 1, 'was': 2, 'diagnosed': 1, 'with': 1, 'hypertension': 2, 'and': 4, 'prescribed': 2, 'an': 1, 'antihypertensive': 1, 'medication': 2, 'to': 4, 'help': 1, 'lower': 1, 'blood': 2, 'pressure': 2, 'regular': 2, 'monitoring': 1, 'of': 3, 'levels': 1, 'recommended': 1, 'assess': 1, 'effectiveness': 1, 'treatment': 1, 'doctor': 1, 'also': 1, 'advised': 1, 'lifestyle': 1, 'modifications': 1, 'including': 1, 'a': 1, 'healthier': 1, 'diet': 1, 'exercise': 1, 'reducing': 1, 'stress': 1, 'if': 1, 'left': 1, 'untreated': 1, 'can': 1, 'lead': 1, 'serious': 1, 'complications': 1, 'such': 1, 'as': 1, 'heart': 1, 'disease': 1, 'stroke': 1, 'kidney': 1, 'failure': 1, 'healthcare': 1, 'provider': 1, 'emphasized': 1, 'importance': 1, 'adherence': 1, 'followup': 1, 'visits': 1, 'for': 1, 'ongoing': 1, 'management': 1}


In [5]:
# mappings: word -> index and index -> word
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}

In [6]:
word2idx['a']

30

In [7]:
idx2word[30]

'a'

In [8]:
len(word2idx)

60

Generate training data for the Skip-gram model <br>
Task: predict the words that fall within a window (of size = window size) before and after the current word.
*   Prepare the training examples as pairs of **(input token, context token)**<br>
For example, in "The patient was diagnosed with hypertension", the input token "diagnosed" yields: <br>
("diagnosed", "patient")<br>
("diagnosed", "was")<br>
("diagnosed", "with")<br>
("diagnosed", "hypertension")<br>
for window size = 2 (2 words before and 2 words after).
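The pairing rule described above can be sketched in a few lines. This is a minimal illustration on the lowercased example sentence, not the notebook's actual training-data function; `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window_size):
    """Return (input token, context token) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window so it never runs past either end of the sentence
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # the input token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

example = "the patient was diagnosed with hypertension".split()
pairs = skipgram_pairs(example, window_size=2)
diagnosed_pairs = [p for p in pairs if p[0] == "diagnosed"]
print(diagnosed_pairs)
# [('diagnosed', 'patient'), ('diagnosed', 'was'), ('diagnosed', 'with'), ('diagnosed', 'hypertension')]
```

Tokens near the edges of the sentence get fewer pairs because the window is clipped there.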

In [10]:
def concat(*iterables):
  for iterable in iterables:
    yield from iterable
array = concat(range(0, 2), range(2, 4))
for a in array:
  print(a)

0
1
2
3
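The `concat` generator above behaves like `itertools.chain` from the standard library. A quick comparison, repeating the definition so the snippet runs on its own:

```python
from itertools import chain

def concat(*iterables):
    # Yield every item of each iterable, in the order given
    for iterable in iterables:
        yield from iterable

custom = list(concat(range(0, 2), range(2, 4)))
builtin = list(chain(range(0, 2), range(2, 4)))
print(custom == builtin)
# True
```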


In [11]:
def one_hot_encode(idx, size):
  # vector of zeros with a single 1 at position idx
  one_hot = np.zeros(size)
  one_hot[idx] = 1
  return one_hot

def generate_training_data(tokens, window_size, word2idx):
  # use the mapping passed in rather than the global vocab
  vocab_size = len(word2idx)
  X = []
  y = []
  for i in range(len(tokens)):
    # clip the context window at the start and end of the corpus
    context_start = max(0, i - window_size)
    context_end = min(len(tokens), i + window_size + 1)
    # neighbours on both sides, excluding the current word itself
    context = tokens[context_start:i] + tokens[i+1:context_end]
    for context_word in context:
      X.append(one_hot_encode(word2idx[tokens[i]], vocab_size))
      y.append(one_hot_encode(word2idx[context_word], vocab_size))
  return np.array(X), np.array(y)

X, y = generate_training_data(tokens, 2, word2idx)

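X and y are built as two separate arrays rather than as a list of pairs as in the example: row k of X is the one-hot input token and row k of y is the one-hot context token paired with it. Keeping them as aligned matrices makes the later matrix operations simpler.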
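To illustrate why one-hot rows are convenient for matrix operations: multiplying a one-hot row by a weight matrix simply selects one row of that matrix. The sketch below uses a randomly initialised matrix `W` and an embedding size of 10; both are illustrative assumptions and not values taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 60, 10  # embedding_dim is an assumed, illustrative value
W = rng.normal(size=(vocab_size, embedding_dim))  # hypothetical embedding matrix

# Two one-hot rows, for word indices 0 and 30
X_demo = np.zeros((2, vocab_size))
X_demo[0, 0] = 1
X_demo[1, 30] = 1

hidden = X_demo @ W  # one matrix product handles the whole batch
print(hidden.shape)
# (2, 10)
print(np.allclose(hidden[1], W[30]))
# True
```

Each row of `hidden` is the row of `W` at the corresponding word index, so a whole batch of inputs can be looked up with a single product.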

In [12]:
X.shape

(318, 60)

In [13]:
y.shape

(318, 60)

In [14]:
X

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [15]:
y

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])