# Creating Custom Word2Vec Embeddings

---

This notebook builds custom word embeddings from a list of documents (strings). Each document is split into sentences, each sentence is split into a sequence of words, and the resulting token lists are passed to a gensim Word2Vec model.
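
As a rough illustration of that flow, the sketch below runs a single toy document through the same two splitting steps. The function name `document_to_token_lists` and the toy text are hypothetical and exist only for this example; the notebook's own helpers appear further down.

```python
import re

def document_to_token_lists(document):
    """Split one document into sentences, then each sentence into lowercase words."""
    # Split on a period followed by a space, mirroring the rule used in this notebook.
    sentences = [s.strip() for s in re.split(r"\. ", document) if s.strip()]
    # Whitespace tokenization; punctuation attached to a word is left as-is.
    return [s.lower().split() for s in sentences]

# Toy input, made up for illustration only.
toy_document = "Patient presents with cough. She was started on prednisone"
token_lists = document_to_token_lists(toy_document)
print(token_lists)
```

Each inner list is one sentence, which is the input shape a gensim Word2Vec model expects: an iterable of token lists.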

In [1]:
%reset -f

In [2]:
import nltk
import pandas as pd
import numpy as np
from section_parse import run
import json
import re
import random
from gensim.models import Word2Vec

### Extracting all HPI Sections

In [3]:
title = "HISTORY OF PRESENT ILLNESS:"
#title = "DISCHARGE MEDICATIONS:"
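# `run` comes from the local section_parse module. The variable name is kept from the
# medication-section variant above; with this title it holds HPI sections.
medication_sections = run(title)
# Drop notes in which the section was not found.
medication_sections = [i for i in medication_sections if i != "NOT FOUND"]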

## Saving Unannotated Sentences

In [4]:
def clean_text(text):
    bad_chars = [":","*"]
    space_chars = ["[","]","(",")","\n"]
    for c in bad_chars:
        text = text.replace(c,"")
    for c in space_chars:
        text = text.replace(c," ")
    return text

def sections_to_sentances(sections):
    sentances = []
    for section in sections:
        section = clean_text(section)
        # Split on a period followed by a space (raw string avoids an invalid-escape warning).
        sentances += [i.lstrip() for i in re.split(r"\. ",section) if len(i)>0]
    return sentances

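def get_sentances(sections,start,samples,min_len=3):
    # Slice the sections passed in (not the global list) using the `samples` argument.
    sections = sections[start:start+samples]
    seqs = sections_to_sentances(sections)
    # Drop fragments of min_len characters or fewer.
    seqs = [i for i in seqs if len(i)>min_len]
    return seqs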

def create_and_save_data(sections,start,n,foldername="nlp_data"):
    seqs = get_sentances(sections,start,n)
    file = f"{foldername}/sentances_{start}-{start+n}.txt"
    with open(file,"w") as f:
        # writelines adds no separators, so write one sentence per line explicitly.
        f.writelines(seq + "\n" for seq in seqs)
    print("~~~File Saved Successfully")
    return seqs

These functions can fetch the data in batches, but here we process the entire dataset at once.

In [5]:
start = 0
n = len(medication_sections)
seqs = get_sentances(medication_sections,start,n)

print("Total sentences:",len(seqs))
print("Sentence examples:\n")
for i in range(10):
    print(f"{i} - ",seqs[i],"\n")

Total sentences: 643084
Sentence examples:

0 -  HISTORY OF PRESENT ILLNESS  This is an 81-year-old female with a history of emphysema  not on home O2 , who presents with three days of shortness of breath thought by her primary care doctor to be a COPD flare 

1 -  Two days prior to admission, she was started on a prednisone taper and one day prior to admission she required oxygen at home in order to maintain oxygen saturation greater than 90% 

2 -  She has also been on levofloxacin and nebulizers, and was not getting better, and presented to the  Hospital1 18  Emergency Room 

3 -  In the  Hospital3   Emergency Room, her oxygen saturation was 100% on CPAP 

4 -  She was not able to be weaned off of this despite nebulizer treatment and Solu-Medrol 125 mg IV x2 

5 -  Review of systems is negative for the following  Fevers, chills, nausea, vomiting, night sweats, change in weight, gastrointestinal complaints, neurologic changes, rashes, palpitations, orthopnea 

6 -  Is positive for th
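
If the full dataset were too large to handle in one pass, the `start` and `n` arguments could instead be stepped through in batches. The sketch below shows one way to generate those offsets; `iter_batches` is a hypothetical helper that is not part of this notebook, and the toy list stands in for the real sections.

```python
def iter_batches(items, batch_size):
    """Yield (start, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

# Toy stand-in for the list of extracted sections.
toy_sections = ["section a", "section b", "section c", "section d", "section e"]
batches = list(iter_batches(toy_sections, batch_size=2))
for start, batch in batches:
    print(start, len(batch))
```

Each `start` value could then be passed to `create_and_save_data` together with the batch size, producing one output file per batch.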

In [7]:
# The downstream LSTMs use lowercase text only, so train the embeddings on lowercased sentences.
seqs = [seq.lower() for seq in seqs]
seqs[0]

'history of present illness  this is an 81-year-old female with a history of emphysema  not on home o2 , who presents with three days of shortness of breath thought by her primary care doctor to be a copd flare'

In [22]:
def sentance_to_seqs(sentance):
    seq = re.split(" ",sentance)
    return [word for word in seq if len(word)>0]

In [23]:
cleaned_seqs = [sentance_to_seqs(sentance) for sentance in seqs]
cleaned_seqs[0]

['history',
 'of',
 'present',
 'illness',
 'this',
 'is',
 'an',
 '81-year-old',
 'female',
 'with',
 'a',
 'history',
 'of',
 'emphysema',
 'not',
 'on',
 'home',
 'o2',
 ',',
 'who',
 'presents',
 'with',
 'three',
 'days',
 'of',
 'shortness',
 'of',
 'breath',
 'thought',
 'by',
 'her',
 'primary',
 'care',
 'doctor',
 'to',
 'be',
 'a',
 'copd',
 'flare']

In [24]:
# gensim >= 4.0 uses `vector_size`; older releases named this parameter `size`.
model = Word2Vec(cleaned_seqs,vector_size=100,window=5,min_count=1)
model.save("./nlp_data/word2vec.model")

The embeddings are now ready to be loaded and used as the weight matrix for the embedding layer of an LSTM (some reformatting is required).
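
One possible shape for that reformatting step is sketched below: it builds a word-to-index lookup and a matching list of rows, reserving row 0 for padding. The helper `build_embedding_matrix`, the padding convention, and the toy vectors are all assumptions made for illustration, not something this notebook defines.

```python
def build_embedding_matrix(vectors, vocabulary, dim):
    """Return (word_to_index, matrix), with row 0 reserved for padding.

    `vectors` is anything supporting `word in vectors` and `vectors[word]`,
    such as a plain dict or (assumed here) a gensim KeyedVectors object.
    """
    word_to_index = {}
    matrix = [[0.0] * dim]  # row 0: all-zero padding vector
    for word in vocabulary:
        # Skip words without a vector and words already added.
        if word in vectors and word not in word_to_index:
            word_to_index[word] = len(matrix)
            matrix.append([float(x) for x in vectors[word]])
    return word_to_index, matrix

# Toy vectors, made up for illustration only.
toy_vectors = {"copd": [0.1, 0.2, 0.3], "flare": [0.4, 0.5, 0.6]}
word_to_index, matrix = build_embedding_matrix(
    toy_vectors, ["copd", "flare", "unseen"], dim=3
)
print(word_to_index)
print(len(matrix), len(matrix[0]))
```

With the model saved above, one would presumably load it with `Word2Vec.load("./nlp_data/word2vec.model")` and pass `model.wv` as `vectors`, with `model.wv.index_to_key` as the vocabulary in gensim 4.x. The resulting rows can then be converted to whatever array type the LSTM framework expects.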

---